CN102056026A - Audio/video synchronization detection method and system, and voice detection method and system - Google Patents

Info

Publication number
CN102056026A
Authority
CN
China
Prior art keywords
audio
video
short
time
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2009102374145A
Other languages
Chinese (zh)
Other versions
CN102056026B (en
Inventor
陈欣伟
方力
沈亮
高屹
常静
侯优优
阮征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Group Design Institute Co Ltd
Original Assignee
China Mobile Group Design Institute Co Ltd
Application filed by China Mobile Group Design Institute Co Ltd filed Critical China Mobile Group Design Institute Co Ltd
Priority to CN2009102374145A priority Critical patent/CN102056026B/en
Publication of CN102056026A publication Critical patent/CN102056026A/en
Application granted granted Critical
Publication of CN102056026B publication Critical patent/CN102056026B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses an audio/video synchronization detection method, an audio/video synchronization detection system, a voice detection method and a voice detection system. The audio/video synchronization detection method comprises the following steps: determining, in an audio/video file played at a target terminal, the starting play time of an audio segment matched with audio reference data and the starting play time of a video frame matched with video reference data; determining, according to these two starting play times, the audio/video play time difference when the audio/video file is played at the target terminal; acquiring the audio/video play time difference when the audio/video file is played at a source terminal; and determining, according to the audio/video play time differences at the source terminal and at the target terminal, the audio/video synchronization condition when the audio/video file is played at the target terminal. The invention can improve the accuracy of audio/video synchronization detection.

Description

Audio/video synchronization detection method and system, and speech detection method and system
Technical field
The present invention relates to audio/video detection technology in the communications field, and in particular to an audio/video synchronization detection method and system, and to a speech detection method and system.
Background art
In mobile communication video services, audio and video do not carry time information during encoding, so obtaining audio/video synchronization information is quite difficult.
If time information is added in advance to the audio data packets and the video data packets after audio/video encoding, then, once the encoded audio/video file has reached the receiving end over the network, the receiving end can parse the received file to extract the time information carried in the audio and video data packets, and then judge the audio/video synchronization condition from the parsed time information.
However, this audio/video synchronization detection method has the following problems:
(1) Although the audio and the video each carry time information after packetization, the time information of the two packetized streams has no mutual correspondence; moreover, the frame lengths and packet sizes of audio and video differ. The relative delay between audio and video therefore cannot be determined accurately;
(2) A synchronization detection result based on the time information carried in the audio and video packet headers can only reflect the network transmission delay. In actual playback, the audio/video player at the receiving end has a buffer, and the player uses this buffer to adjust the synchronization of the decoded audio and video streams. A result based on packet-header time information therefore cannot reflect the effect of the player's synchronization adjustment on audio/video synchronization; that is, the results obtained in this way are inaccurate.
Summary of the invention
Embodiments of the present invention provide an audio/video synchronization detection method and system, to solve the problem that existing audio/video synchronization detection has low accuracy.
The technical solution provided by the embodiments of the present invention includes:
An audio/video synchronization detection method, comprising the following steps:
determining, in an audio/video file played at the destination end, the starting play time of the audio segment matching the audio reference data and the starting play time of the video frame matching the video reference data;
determining, according to the starting play time of the audio segment matching the audio reference data and the starting play time of the video frame matching the video reference data, the audio/video play time difference of the audio/video file when played at the destination end;
acquiring the audio/video play time difference of the audio/video file when played at the source end, and determining, according to the audio/video play time differences of the audio/video file when played at the source end and at the destination end, the audio/video synchronization condition of the audio/video file when played at the destination end.
An audio/video synchronization detection system, comprising:
an audio identification module, configured to determine, in an audio/video file played at the destination end, the starting play time of the audio segment matching the audio reference data;
a video identification module, configured to determine, in the audio/video file played at the destination end, the starting play time of the video frame matching the video reference data;
a time difference determination module, configured to determine the audio/video play time difference of the audio/video file when played at the destination end, according to the starting play time of the matching audio segment determined by the audio identification module and the starting play time of the matching video frame determined by the video identification module;
a synchronization detection module, configured to acquire the audio/video play time difference of the audio/video file when played at the source end, and to determine the audio/video synchronization condition of the audio/video file when played at the destination end according to the acquired play time difference and the play time difference determined by the time difference determination module.
In the above embodiments of the present invention, for the audio/video file played at the destination end, the starting play time of the audio segment matching the audio reference data and the starting play time of the video frame matching the video reference data are determined, which gives the audio/video play time difference at the destination end. This difference is then compared with the audio/video play time difference of the same file when played at the source end, thereby determining the audio/video synchronization condition of the file when played at the destination end. Compared with the prior art, the audio/video synchronization detection of the embodiments does not rely on time information in the audio and video data packets; it is performed on the audio/video actually played at the destination end, and it therefore also takes into account the synchronization adjustment made during audio/video decoding at the destination end. The resulting detection result is consequently more accurate. The method is particularly suitable for detecting the synchronization condition of audio and video after network transmission.
Embodiments of the present invention also provide a speech detection method and system, to solve the problem that speech detection in the prior art has low accuracy.
The technical solution provided by the embodiments of the present invention includes:
A speech detection method, comprising the following steps:
searching the audio under test according to the short-time average magnitude of the speech signal; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, searching forward (toward earlier times) from the current moment; and when, after that moment, an audio signal whose short-time average magnitude first drops below the first amplitude threshold is found, searching backward (toward later times) from the current moment;
when the forward or backward search reaches an audio signal whose short-time average magnitude has dropped to a second amplitude threshold, continuing the search in the same direction according to the short-time average zero-crossing rate, the second amplitude threshold being smaller than the first amplitude threshold;
when the forward search reaches an audio signal whose short-time average zero-crossing rate has dropped below a zero-crossing rate threshold, taking the current moment as the start point of the speech segment; and when the backward search reaches an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing rate threshold, taking the current moment as the end point of the speech segment.
A speech detection system, comprising:
a first search module, configured to search the audio under test according to the short-time average magnitude of the speech signal; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, to search forward (toward earlier times) from the current moment; and when, after that moment, an audio signal whose short-time average magnitude first drops below the first amplitude threshold is found, to search backward (toward later times) from the current moment;
a second search module, configured to continue the search in the same direction according to the short-time average zero-crossing rate when the forward or backward search of the first search module reaches an audio signal whose short-time average magnitude has dropped to a second amplitude threshold, the second amplitude threshold being smaller than the first amplitude threshold;
a speech segment determination module, configured to take the current moment as the start point of the speech segment when the forward search of the second search module reaches an audio signal whose short-time average zero-crossing rate has dropped below a zero-crossing rate threshold, and to take the current moment as the end point of the speech segment when the backward search reaches an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing rate threshold.
In the above embodiments of the present invention, the speech detection process exploits two characteristics: the short-time average energy identifies speech segments more effectively when the background noise is low, and the short-time average zero-crossing rate identifies them more effectively when the background noise is high. The short-time average magnitude and the short-time average zero-crossing rate of the speech signal are therefore considered together: on top of detection based on the short-time average magnitude, the short-time average zero-crossing rate of the speech signal is also examined, and endpoint detection uses both the amplitude and the zero-crossing rate features, which makes the detected speech segment endpoints more accurate.
Brief description of the drawings
Fig. 1 is a schematic flowchart of audio/video synchronization detection in an embodiment of the present invention;
Fig. 2 is a schematic flowchart of audio/video synchronization detection for IP network video telephony in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the dynamic path search in the speech recognition process in an embodiment of the present invention;
Fig. 4 is a schematic diagram of the audio/video synchronization scoring model in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the audio/video synchronization detection system in an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the speech detection system in an embodiment of the present invention.
Detailed description of the embodiments
To address the above problems of the prior art, embodiments of the present invention provide an audio/video synchronization detection method and system that perform audio/video synchronization detection by pattern recognition. That is, at the sending end and at the receiving end respectively, pattern recognition is performed between the played audio/video file and the reference data for that audio/video; the starting play times of the audio frame and the video frame that match the audio reference data and the video reference data are recorded, giving the audio/video play time difference at the sending end and at the receiving end; the two play time differences are then compared to calculate the delay difference, which yields the audio/video synchronization condition when the audio/video file is played at the receiving end.
In the embodiments of the present invention, audio reference data and video reference data are prepared before audio/video synchronization detection is performed. They are used during synchronization detection to locate the audio reference points and video reference points in the audio/video file, so that the audio/video synchronization parameters can be determined from those reference points. The audio reference data may be audio waveform data and the video reference data may be video image data; both may be stored in a feature library in advance.
Fig. 1 is a schematic flowchart of audio/video synchronization detection in an embodiment of the present invention. This flow can be used to assess the influence of network transmission on audio/video synchronization, and also to assess the influence of different playback ends on audio/video synchronization. If it is used to assess the influence of network transmission, the source end in this flow is the sending end of the audio/video file, and the destination end is the receiving end that the file reaches after network transmission. If it is used to assess the influence of different playback ends, the source end may be a playback end with good audio/video synchronization quality, and the destination end is the playback end whose audio/video synchronization quality is to be evaluated. The flow comprises the following steps:
Step 101: use an audio pattern recognition method to find, in the audio/video file played at the destination end, the audio segment matching the audio reference data, and record the starting play time of this audio segment;
Step 102: use a video pattern recognition method to find, in the audio/video file played at the destination end, the video frame matching the video reference data, and record the starting play time of this video frame;
Step 103: determine the audio/video play time difference from the recorded starting play time of the audio segment matching the audio reference data and the recorded starting play time of the video frame matching the video reference data;
Step 104: determine the audio/video synchronization condition of the file as played at the destination end, according to the play time difference determined in step 103 and the play time difference, when the file is played at the source end, between the audio segment matching the audio reference data and the video frame matching the video reference data. The synchronization condition is, for example, the amount or degree by which the audio/video synchronization delay at the destination end has changed compared with that at the source end (such as how the length of time by which the audio leads or lags the video at the destination end has changed compared with the source end). The synchronization condition can further be mapped to a corresponding audio/video synchronization quality grade.
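Steps 103 and 104 reduce to two subtractions. The following is a minimal sketch under stated assumptions: the function names are hypothetical, times are in seconds, and the sign convention (audio start minus video start, so a positive value means the audio starts later than the video) is an assumption, since the text does not fix the order of subtraction.

```python
def av_play_time_diff(audio_start: float, video_start: float) -> float:
    """Step 103: audio/video play time difference at one end.

    Assumed convention: audio start minus video start, in seconds.
    """
    return audio_start - video_start


def sync_change(dest_diff: float, source_diff: float = 0.0) -> float:
    """Step 104: change of the play time difference at the destination end
    relative to the source end. A source difference of 0 means the test
    file was authored with audio and video starting together."""
    return dest_diff - source_diff


# Illustrative values only (not measurements from the source).
dest_diff = av_play_time_diff(audio_start=12.480, video_start=12.400)
change = sync_change(dest_diff, source_diff=0.020)
print(f"destination difference: {dest_diff:.3f} s, change vs. source: {change:.3f} s")
```

Either the system clock of the destination end or a time relative to the start of playback can be passed in, as long as both arguments use the same time base.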
In steps 101 and 102 of the above flow, the recorded time may be the current system time of the destination end, or a time relative to the playback start point of the audio/video file. Steps 101 and 102 have no strict ordering requirement: the two steps may be swapped or executed in parallel.
Usually, audio reference data and video reference data correspond one to one, and to make synchronization detection more accurate there are generally multiple pairs. With multiple pairs of audio reference data and video reference data, the play time differences determined in step 103 of the flow shown in Fig. 1 also correspond one to one with the pairs. That is, for one item of audio reference data, the starting play time of the matching audio segment is determined; for the video reference data paired with it, the starting play time of the matching video frame is determined; the difference between the two times is the audio/video play time difference for that pair. In the same way, step 104 can obtain, for each pair, the play time difference when the file is played at the source end between the audio segment matching the audio reference data and the video frame matching the video reference data.
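The per-pair bookkeeping described above can be sketched as follows. This is an illustration only: the dictionary keys, the function names, and the final averaging step are assumptions (the text says only that each pair yields its own play time difference, not how the pairs are combined).

```python
from statistics import mean


def pairwise_diffs(audio_starts, video_starts):
    """Play time difference (audio minus video, assumed convention) for every
    reference pair that was found in both the audio and the video."""
    return {key: audio_starts[key] - video_starts[key]
            for key in audio_starts if key in video_starts}


def pairwise_change(dest_diffs, source_diffs):
    """Change of each pair's difference at the destination end relative to
    the source end; a pair missing from source_diffs is treated as 0."""
    return {key: dest_diffs[key] - source_diffs.get(key, 0.0)
            for key in dest_diffs}


# Made-up starting play times in seconds, keyed by reference pair.
audio_starts = {"1": 1.00, "2": 3.05, "3": 5.10}
video_starts = {"1": 0.95, "2": 2.95, "3": 5.00}

dest_diffs = pairwise_diffs(audio_starts, video_starts)
changes = pairwise_change(dest_diffs, source_diffs={})
# Averaging across pairs is an assumption made for this sketch.
average_change = mean(list(changes.values()))
```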
The audio/video time difference of the audio/video file used for synchronization detection can be obtained at the sending end in advance in the above manner. Each subsequent synchronization detection that uses this file can then directly compare this previously measured sending-end time difference with the receiving-end time difference, thereby determining the audio/video synchronization condition of the file after transmission.
In general, to detect the audio/video synchronization condition accurately, the audio reference data and video reference data used for detection should have distinctive features that are easy to recognize and to pattern-match, and the audio/video file used for detection must contain the audio segments matching the audio reference data and the video frames matching the video reference data. Preferably, in the audio/video file used for detection, the starting play time of the audio segment matching an item of audio reference data and the starting play time of the video frame matching the corresponding video reference data are identical at the sampling-point level, i.e. the audio/video time difference is 0. In this case, because the audio/video play time difference of the file when played at the source end is 0, step 104 of the flow shown in Fig. 1 can determine the synchronization condition of the file as played at the destination end directly from the play time difference determined in step 103.
Take audio/video synchronization detection for IP network video telephony as an example. The audio/video file used for synchronization detection contains, on the audio side, the pronunciations of the digits 1, 2, 3, 4 and 5, and, on the video side, pictures of 5 different hand gestures shown against a solid-color background; during playback, each time the pronunciation of a digit occurs, the corresponding gesture is shown on screen. The audio reference data are the audio waveform data of the pronunciation of each of the digits 1, 2, 3, 4 and 5, stored in an audio feature library. The video reference data are the video image data of each of the 5 hand gestures against the solid-color background, stored in a video feature library. When this audio/video file is played at the sending end, the synchronization time difference between each digit pronunciation and the corresponding gesture picture is known. During network transmission, the audio and the video of the file are transmitted separately, forming a WAV audio file and an AVI video file at the receiving end. The process of detecting the audio/video synchronization condition of this file at the receiving end, shown in Fig. 2, may comprise the following steps:
Obtain the WAV audio file from the audio/video file received at the receiving end (step 201). Determine the endpoints of each speech segment from the audio signal, so as to find the speech segments (step 202). Using an audio pattern recognition method, compare each speech segment with the speech data of each digit pronunciation in the audio feature library, and identify among the speech segments those that are pronunciations of the digits 1, 2, 3, 4 and 5 (step 203). Record the start and end play times of these speech segments, so that the receiving end records the times of at least 5 audio segments (correspondingly more if digit pronunciations are repeated in the WAV audio file) (step 204);
Obtain the AVI video file from the audio/video file received at the receiving end (step 205). Extract every frame image from the AVI video file (step 206). Using a video pattern recognition method, compare each video frame image with the image data of each gesture in the video feature library, and identify the video frames showing each gesture; usually only the first identified frame for each gesture is taken (step 207). Record the starting play times of these video frames, so that the receiving end records the times of at least 5 video frames (correspondingly more if gesture pictures are repeated in the AVI video file) (step 208);
Take the difference between the recorded starting play time of the pronunciation of digit 1 and the recorded starting play time of the video frame of the gesture corresponding to digit 1, to obtain the audio/video play time difference for digit 1 (all recorded times are referenced to the system time of the receiving end); likewise obtain the audio/video play time differences for the other digits (step 209);
Compare the audio/video play time differences obtained in step 209 with the known play time differences of this file at the sending end, and determine the audio/video delay of the file at the receiving end relative to the sending end (step 210);
According to the result of step 210, determine the corresponding audio/video synchronization quality grade or MOS score (step 211).
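Step 211 maps the measured delay to a quality grade. The sketch below shows one way such a mapping could be written. The threshold values, the five-level scale, and the function name are all hypothetical placeholders: this passage does not give the numeric boundaries of the scoring model, so they would have to be taken from the actual model (Fig. 4) or from subjective tests.

```python
# Hypothetical upper bounds in milliseconds, paired with a grade from 5 (best)
# down to 2; anything above the last bound gets grade 1. These numbers are
# placeholders for illustration, not values from the source.
GRADE_BOUNDS_MS = [(40.0, 5), (80.0, 4), (120.0, 3), (200.0, 2)]


def sync_grade(delay_change_ms: float) -> int:
    """Map the change in audio/video delay (step 210) to a quality grade.

    Only the magnitude is used here; treating audio lead and audio lag
    symmetrically is a simplifying assumption of this sketch.
    """
    magnitude = abs(delay_change_ms)
    for upper_bound, grade in GRADE_BOUNDS_MS:
        if magnitude <= upper_bound:
            return grade
    return 1


grades = {digit: sync_grade(delay) for digit, delay in
          {"1": 30.0, "2": -100.0, "3": 500.0}.items()}
```

Keeping the bounds in a table rather than in nested conditionals makes it easy to swap in the boundaries of whatever scoring model is actually used.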
Regarding the choice of audio reference data in the embodiments of the present invention: human perception is relatively sensitive to a mismatch between the picture content and the start point (silence to sound) or end point (sound to silence) of the audio. Preferably, therefore, the audio references are chosen to be speech segments (such as the speech segments of the pronunciations of digits 1 to 5). Consequently, to determine the audio segment matching the audio reference data, the endpoint positions of each speech segment in the audio waveform of the audio/video file are detected first, and audio pattern recognition is then performed between the detected speech segments and the audio reference data.
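This passage does not name the pattern recognition algorithm used to compare a detected speech segment with the reference data. Since the caption of Fig. 3 mentions a dynamic path search, dynamic time warping (DTW) is one plausible choice, and the sketch below uses it purely as an assumption. The feature sequences are plain lists of numbers (for example per-frame magnitudes), and the function names and labels are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D feature
    sequences, using absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def best_match(segment, references):
    """Return the label of the reference sequence closest to the segment."""
    return min(references, key=lambda label: dtw_distance(segment, references[label]))


# Toy feature sequences for illustration only.
references = {"one": [1, 2, 2, 3], "two": [5, 5, 5]}
label = best_match([1, 2, 3], references)
```

DTW tolerates differences in speaking rate between the played segment and the stored reference, which is why it is a common fit for isolated-word matching; a real system would compare richer per-frame features than single numbers.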
To detect the speech segments in the audio file, the embodiments of the present invention may use a conventional speech-segment waveform detection method based on short-time energy or short-time average magnitude. Such a conventional method is in essence a single-threshold detection method. To obtain an endpoint detection method that is more adaptable than the conventional method and extracts audio timing information more accurately, the embodiments of the present invention also improve on the conventional speech detection method and use the improved method for speech detection. The improved method exploits two characteristics: the short-time average energy identifies speech segments more effectively when the background noise is low, and the short-time average zero-crossing rate identifies them more effectively when the background noise is high. It therefore considers the short-time average magnitude and the short-time average zero-crossing rate of the speech signal together: on top of detection based on the short-time average magnitude, it also examines the short-time average zero-crossing rate, and performs endpoint detection using both the amplitude and the zero-crossing rate features.
These decisions are possible because the short-time parameters of different kinds of speech have different probability density functions, and because adjacent frames of speech should have consistent speech characteristics, i.e. the signal does not jump abruptly among voiced, unvoiced and silent states. Usually, voiced speech has the largest short-time average magnitude and silence the smallest; unvoiced speech has the largest short-time average zero-crossing rate, silence an intermediate one, and voiced speech the smallest.
In the speech detection method used by the embodiments of the present invention, two amplitude threshold parameters MH and ML (MH > ML) and a short-time zero-crossing rate threshold Z0 are first determined from empirical values. MH should be set relatively high, so that when the short-time average magnitude M of a frame of the speech signal exceeds MH, it is almost certain that the frame is not silence and quite likely that it is voiced. When M decreases from a large value to ML, the short-time average zero-crossing rate is used to continue the decision; when the short-time average zero-crossing rate of the speech signal falls below the threshold Z0, that point can be determined to be an endpoint (start point or end point) of the speech segment.
Statistical analysis of the short-time average magnitude and the short-time average zero-crossing rate can be carried out on a large number of speech samples, and the amplitude thresholds MH and ML determined in combination with the short-time average magnitude of actual samples. The process of determining the amplitude threshold MH from speech samples is:
Window the data in each speech sample and divide it into frames. Based on human physiological characteristics and the results of extensive statistics, the window length is generally set to 20 ms and the step to half the window length, so the total number of frames = total number of sampling points / step (with the step expressed in sampling points);
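The framing step can be sketched as below, assuming the samples are already available as a list of numbers and that the sample rate is known; the function name and the choice to drop a trailing partial frame are assumptions of this sketch.

```python
def frame_signal(samples, sample_rate, window_ms=20.0):
    """Split samples into overlapping frames: window of window_ms
    milliseconds, step of half a window (both converted to sample counts)."""
    window = int(sample_rate * window_ms / 1000)
    step = max(1, window // 2)
    # Only full frames are kept; the trailing partial frame is dropped.
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, step)]


# 0.1 s of dummy samples at 8 kHz: window = 160 samples, step = 80 samples.
frames = frame_signal(list(range(800)), sample_rate=8000)
```

Because only full frames are kept, the count comes out slightly below the "total sampling points / step" estimate in the text (9 frames rather than 10 in this example); for long recordings the two agree to within one frame.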
Compute the short-time average magnitude of each frame according to the following formula:
$$M_m = \sum_{n=m}^{N+m-1} \left| s_w(n-m) \right|$$
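In code, the short-time average magnitude of one frame is the sum of the absolute sample values in that frame. The sketch below reads N as the frame length and s_w as the windowed signal, which the text does not define explicitly, and assumes a rectangular window (no tapering); the function name is hypothetical.

```python
def short_time_magnitude(frame):
    """Sum of absolute sample values over one frame (rectangular window
    assumed, so the windowed samples equal the raw samples)."""
    return sum(abs(sample) for sample in frame)


magnitudes = [short_time_magnitude(f) for f in ([1, -2, 3], [0, 0, 0], [-4, 4, -4])]
```

Using the absolute value rather than the square keeps loud frames from dominating the statistic as strongly as they do with short-time energy.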
Compute the short-time average zero-crossing rate of each frame according to the following formula:
$$Z_m = \frac{1}{2} \left\{ \sum_{n=m}^{N+m-1} \left| \operatorname{sgn}[s_w(n)] - \operatorname{sgn}[s_w(n-1)] \right| \right\}$$
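The zero-crossing count of one frame can be sketched as follows. Two details are assumptions of this sketch rather than statements from the text: sgn(0) is taken as +1, and the sum starts at the second sample of the frame, so the sample just before the frame is not consulted.

```python
def sign(value):
    """sgn with sgn(0) taken as +1 (an assumed convention)."""
    return 1 if value >= 0 else -1


def short_time_zcr(frame):
    """Half the summed absolute sign differences between neighbouring
    samples, i.e. the number of sign changes inside the frame."""
    return 0.5 * sum(abs(sign(frame[n]) - sign(frame[n - 1]))
                     for n in range(1, len(frame)))


zcrs = [short_time_zcr(f) for f in ([1, -1, 1, -1], [1, 2, 3], [-1, -2, 3])]
```

Each sign change contributes |1 - (-1)| = 2 to the sum, so the factor 1/2 turns the sum into a plain count of crossings.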
Traverse all speech frames in each speech sample and analyze them statistically, to obtain the distributions of the short-time average magnitude and the short-time average zero-crossing rate of the speech samples;
According to these distributions, and based on the short-time average magnitude of the silent periods, set the threshold MH. This threshold is set relatively high, to ensure that the parts of each speech sample whose short-time average magnitude exceeds MH are speech. Then take three times the short-time average zero-crossing rate of the silent periods as the zero-crossing rate threshold Z0 for speech segments.
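A rough sketch of deriving the thresholds from silent-period statistics follows. Only the rule "Z0 is three times the silent-period zero-crossing rate" comes from the text. The multipliers used for MH and ML (mh_factor and ml_factor), the use of a simple mean, and the function name are hypothetical; the text says only that MH is set relatively high based on the silent-period magnitude and that MH > ML.

```python
from statistics import mean


def estimate_thresholds(silence_magnitudes, silence_zcrs,
                        mh_factor=10.0, ml_factor=3.0):
    """Derive MH, ML and Z0 from per-frame statistics of silent periods.

    mh_factor and ml_factor are placeholder multipliers chosen for this
    sketch; in practice they would be tuned on real speech samples.
    """
    base_magnitude = mean(silence_magnitudes)
    return {
        "MH": mh_factor * base_magnitude,
        "ML": ml_factor * base_magnitude,
        "Z0": 3.0 * mean(silence_zcrs),  # rule stated in the text
    }


thresholds = estimate_thresholds(silence_magnitudes=[2, 2], silence_zcrs=[4, 6])
```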
With the amplitude thresholds MH and ML and the short-time average zero-crossing rate threshold Z0 determined, the speech detection process of the embodiments of the present invention is:
Determine two time points A1 and A2 in the audio signal under test according to MH: the moment at which the short-time average magnitude M of the speech signal exceeds MH is denoted A1, and the moment after A1 at which M first drops to MH is denoted A2. The signal between A1 and A2 can basically be determined to be speech;
Continue searching in the speech signal before A1 and after A2. Searching forward (toward earlier times) from A1, the moment at which M has decreased to ML is denoted B1; likewise, searching backward (toward later times) from A2, the moment at which M has decreased to ML is denoted B2. The signal between B1 and B2 can still be determined to be speech;
Continue searching forward from B1 and backward from B2. Searching forward from B1, as long as the short-time zero-crossing rate Z of the speech signal stays above Z0, the signal is still considered part of the speech segment; when Z suddenly drops below Z0, the current moment is denoted C1 and taken as the start point of the speech segment. Likewise, searching backward from B2, as long as Z stays above Z0, the signal is still considered part of the speech segment; when Z suddenly drops below Z0, the current moment is denoted C2 and taken as the end point of the speech segment;
Proceeding in the same way, detect all speech segments in the audio file together with their start points and end points.
The reason for adopting this algorithm is as follows: the signal before B1 and after B2 may be an unvoiced consonant segment whose energy is quite weak, so the short-time average magnitude alone cannot distinguish it from silence. Its short-time average zero-crossing rate, however, is clearly higher than that of silence, so this parameter can be used to locate the boundary between the two, that is, the true start and end points of the speech.
This algorithm is suitable not only for the voice segment detection process of the embodiment of the invention, but also for other application scenarios in which voice segments must be detected in an audio signal.
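The three-stage search (A1/A2 by MH, B1/B2 by ML, C1/C2 by Z0) can be sketched as below. This is a hedged illustration working on per-frame values: `detect_voice_segments` and its arguments are hypothetical names, the inputs are sequences of short-time magnitudes `M` and zero-crossing rates `Z`, and the rule that a new segment may not extend back into the previous one is an added assumption.

```python
def detect_voice_segments(M, Z, MH, ML, Z0):
    """Return (start_frame, end_frame) pairs, both inclusive."""
    segments = []
    n = len(M)
    i = 0
    while i < n:
        if M[i] <= MH:
            i += 1
            continue
        floor = segments[-1][1] + 1 if segments else 0  # assumption: no overlap
        # Stage 1: A1..A2, frames whose magnitude stays above MH.
        a1 = a2 = i
        while a2 + 1 < n and M[a2 + 1] > MH:
            a2 += 1
        # Stage 2: widen to B1..B2 while the magnitude stays above ML.
        b1, b2 = a1, a2
        while b1 > floor and M[b1 - 1] > ML:
            b1 -= 1
        while b2 + 1 < n and M[b2 + 1] > ML:
            b2 += 1
        # Stage 3: widen to C1..C2 while the zero-crossing rate stays above Z0.
        c1, c2 = b1, b2
        while c1 > floor and Z[c1 - 1] > Z0:
            c1 -= 1
        while c2 + 1 < n and Z[c2 + 1] > Z0:
            c2 += 1
        segments.append((c1, c2))
        i = c2 + 1
    return segments
```

A frame index converts to a play time by multiplying it by the step size in samples and dividing by the sample rate.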
After the time information of the voice segments has been obtained, pattern recognition must be performed on the voice segments to determine the one that matches the audio reference data. The embodiment of the invention uses linear prediction cepstral coefficients (LPCC) for audio pattern recognition.
Obtaining the LPCC feature parameters consists of four main steps: preprocessing, autocorrelation calculation, solving the linear prediction coefficient (LPC) normal equations with the Durbin algorithm, and the LPCC recursion. In preprocessing, pre-emphasis boosts the high frequencies by passing the speech signal through a first-order FIR filter, which compensates for the attenuation of the high-frequency spectrum caused by glottal excitation and mouth and nose radiation; for windowing, the algorithm uses a Hamming window as the window function.
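The four steps can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions: the pre-emphasis coefficient 0.97, the orders passed by the caller, and all function names are hypothetical, and the predictor convention is s(n) ≈ Σ a_k s(n−k).

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order FIR filter y(n) = x(n) - alpha * x(n-1); alpha is an assumption."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def autocorrelation(frame, order):
    """r(0..order) of one windowed frame."""
    n = len(frame)
    return np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns predictor coefficients a_1..a_p."""
    a = np.zeros(order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:
            break
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        new_a = a.copy()
        new_a[i] = k
        new_a[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]

def lpc_to_lpcc(a, n_ceps):
    """Recursion c_n = a_n + sum_{k} (k/n) c_k a_{n-k}, with a_n = 0 for n > p."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

def lpcc_of_frame(frame, lpc_order, n_ceps):
    """Hamming-window one (already pre-emphasized) frame and return its LPCC vector."""
    windowed = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    r = autocorrelation(windowed, lpc_order)
    return lpc_to_lpcc(levinson_durbin(r, lpc_order), n_ceps)
```

Applying `lpcc_of_frame` to every frame of a voice segment yields the sequence of LPCC feature vectors used in the matching stage.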
Once the LPCC feature parameters have been extracted from each frame, the speech signal has been converted into a set of LPCC feature vectors. Speech recognition consists of pattern matching this set of features against the speech feature vectors of the reference audio data, in order to find the pattern with the shortest distance.
Speech recognition by pattern matching is usually divided into two stages: a training stage and a recognition stage. A standard template is formed in the training stage; in the recognition stage, the similarity between the feature vectors of the speech to be recognized (after transmission attenuation) and the standard template vectors is calculated. In the embodiment of the invention, the standard template formed in the training stage is the feature vector of the audio reference data.
However, because of factors such as attenuation and packet loss during transmission of the audio file, the original speech sequence and the transmitted speech sequence may differ in length. To solve this problem, the embodiment of the invention performs pattern recognition with a recognition algorithm based on dynamic time warping (DTW).
In the DTW method provided by the embodiment of the invention, the distance matrix between the input pattern (the audio signal feature vectors of each voice segment to be recognized) and the reference pattern (the feature vectors of the audio reference data) is calculated first. An optimal path, the one with the minimum accumulated distance, is then found in the distance matrix; this path is the non-linear relationship between the time scales of the two patterns. The principle of the algorithm is as follows:
Suppose the input pattern to be recognized and the reference pattern are denoted T and R respectively. To compare their similarity, the distortion D[T, R] between them can be calculated; the smaller the distortion, the higher the similarity. This distortion is accumulated from the distortion between corresponding frames of T and R. Let N and M be the total numbers of frames in T and R, let n and m be arbitrary frame numbers in T and R, and let D[T(n), R(m)] denote the distortion between these two feature vectors. Then:
When N = M (the T pattern and the R pattern have the same number of frames), frame T(1) is matched directly with R(1), T(2) with R(2), ..., T(m) with R(m); the distortions D[T(1), R(1)], D[T(2), R(2)], ..., D[T(m), R(m)] are calculated and summed to give the total distortion;
When N ≠ M (the T pattern and the R pattern have different numbers of frames), a path search is carried out by dynamic programming. Specifically, the frame numbers of T (n = 1 to N) are marked on the horizontal axis of a two-dimensional rectangular coordinate system, and the frame numbers of R (m = 1 to M) on the vertical axis, as shown in Fig. 3. Each intersection (n, m) in the resulting grid represents the pairing of a frame in T with a frame in R. The path search then reduces to finding a path through a number of intersections in this grid; the intersections on the path are the frame numbers in T and R between which distortion is calculated.
The path cannot be chosen arbitrarily. The speaking rate may vary, but the order of the parts cannot change, so the selected path must start from the lower-left corner and end at the upper-right corner. In addition, to prevent an undirected search, paths that lean too far toward the n axis or the m axis can be discarded, because the compression and expansion of real speech are always limited. The maximum and minimum of the average slope at each point on the path can therefore be bounded; usually the maximum slope is set to 2 and the minimum slope to 1/2.
The path cost function defined in this embodiment is d[(n_i, m_i)], meaning the accumulated frame distortion from the starting point (n_0, m_0) to the current point (n_i, m_i). It is calculated as follows:
$$d[(n_i, m_i)] = D[T(n_i), R(m_i)] + d[(n_{i-1}, m_{i-1})]$$
$$d[(n_{i-1}, m_{i-1})] = \min\{\, d[(n_i - 1, m_i)],\ d[(n_i - 1, m_i - 1)],\ d[(n_i - 1, m_i - 2)] \,\}$$
With the above formulas, the required D[T(n_i), R(m_i)] value can be obtained. The path cost function defined above is only an example; other path cost algorithms are not excluded.
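The recurrence above can be sketched as a small dynamic program. This is an illustration only: the Euclidean norm as the frame distortion D, the function names, and the choice of the minimum-distance segment as the match are assumptions. The sketch allows the three predecessors (n−1, m), (n−1, m−1) and (n−1, m−2), which bounds the slope at 2 but does not enforce the 1/2 lower bound.

```python
import numpy as np

def dtw_distance(T, R):
    """Accumulated distortion of the best path from (1, 1) to (N, M)."""
    T = np.asarray(T, dtype=float)
    R = np.asarray(R, dtype=float)
    N, M = len(T), len(R)
    # Frame-to-frame distortion matrix D[T(n), R(m)] (Euclidean, an assumption).
    local = np.linalg.norm(T[:, None, :] - R[None, :, :], axis=2)
    d = np.full((N, M), np.inf)
    d[0, 0] = local[0, 0]  # the path must start in the lower-left corner
    for n in range(1, N):
        for m in range(M):
            best = d[n - 1, m]
            if m >= 1:
                best = min(best, d[n - 1, m - 1])
            if m >= 2:
                best = min(best, d[n - 1, m - 2])
            d[n, m] = local[n, m] + best
    return d[N - 1, M - 1]  # inf if the upper-right corner is unreachable

def best_match(segments, reference):
    """Index of the voice segment whose feature sequence is closest to the reference."""
    distances = [dtw_distance(seg, reference) for seg in segments]
    return int(np.argmin(distances))
```

Each entry of `segments` is the sequence of feature vectors of one detected voice segment; sequences may have different lengths.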
The video pattern recognition method used in the embodiment of the invention is an image recognition method: each played video frame is captured, and each captured frame image is compared with the video frame images in a feature database to find the video frame that matches one in the database. This image recognition process has two main stages: video capture and image recognition.
Video capture can be implemented with the AVIFile library of the Windows operating system, as follows:
First, the AVIFile library is initialized; the AVI file to be checked for synchronization is then opened and its file interface address obtained. If the file opens successfully (that is, the video format meets the requirements), the required AVI file information is obtained through the file interface address. This information may include: the maximum data rate of the file (bytes per second), the number of streams, the height (pixels), the width (pixels), the sample rate (samples per second), the file length (frames), the file type, and so on. The interface address of an AVI stream can be obtained from the file interface address, and the AVI stream information is obtained through the stream interface address. Because the audio and video streams are processed separately, the stream information obtained here concerns the video stream only. It may include: the stream type, the frame rate (fps), the start frame, the end frame, the image quality value, a description of the stream type, and so on;
Next, the obtained video stream information is processed: the corresponding decoding function is called to obtain the address of the decompressed data and the memory address of each frame's data (used for saving the frame as a BMP file). At this point the required image data has been obtained;
Finally, the file header for this image data is written and the data is saved as the required BMP file. The BMP file is named after the frame number in the AVI video stream. The frame time is obtained by multiplying the current frame number by the frame interval, and the frame interval can be found in the structure dedicated to holding AVI file information. For example, if the playback rate is 15 fps, the frame interval is 1/15 s, about 66,666 µs, so the play time of each frame relative to the start frame is easy to obtain.
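The frame-time arithmetic can be checked with a few lines of Python. This is only a worked example with hypothetical helper names; it does not call the AVIFile library, and it assumes a constant frame rate. At 15 fps the interval is 1/15 s, about 66,666 microseconds.

```python
def frame_interval_us(fps):
    """Interval between consecutive frames, in microseconds."""
    return 1_000_000 / fps

def frame_time_us(frame_number, fps, start_frame=0):
    """Play time of a frame relative to the start frame, in microseconds."""
    return (frame_number - start_frame) * frame_interval_us(fps)
```

For instance, frame 30 of a 15 fps stream lies two seconds after the start frame.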
After the BMP pictures have been extracted from the AVI file (the saved BMP files are known to be 24-bit RGB bitmaps), the next task is to perform image recognition on them. The image recognition process may be: convert the 24-bit RGB color bitmap into an 8-bit binary image to highlight the features of the target object, use pixel counting and a contour tracking algorithm to obtain the area and perimeter of the target object in the image under test, and compare them with those of the images in the feature database. It can be divided into the following steps:
Step 1: convert the target image (the captured image) to grayscale and obtain the corresponding gray-value distribution;
Step 2: apply an iterative computation to the gray-value distribution to calculate a threshold;
Step 3: binarize the image according to the threshold (convert it to a black-and-white picture in which white is the background and black is the target object);
Step 4: count the pixels of the binarized image to calculate the area of the target object (number of pixels);
Step 5: carry out further image processing to trace the contour of the target object;
Step 6: count the contour pixels to calculate the perimeter of the target object;
Step 7: compare the obtained area and perimeter with the information of the corresponding image stored in the feature database to judge whether this image is the required target image; if so, record its play time.
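Steps 1 to 7 can be sketched with NumPy as below. This is a hedged illustration: the luma weights, the stopping tolerance `eps`, the relative tolerance `tol`, and all function names are assumptions, and the perimeter is approximated by counting object pixels that have a background 4-neighbour rather than by the contour tracking algorithm named in the text.

```python
import numpy as np

def to_gray(rgb):
    """Step 1: weighted grayscale of an (H, W, 3) RGB array (common luma weights)."""
    return np.asarray(rgb, dtype=float) @ np.array([0.299, 0.587, 0.114])

def iterative_threshold(gray, eps=0.5):
    """Step 2: repeat t = mean(mean(dark), mean(bright)) until it settles."""
    t = gray.mean()
    while True:
        dark, bright = gray[gray <= t], gray[gray > t]
        if dark.size == 0 or bright.size == 0:
            return t
        new_t = 0.5 * (dark.mean() + bright.mean())
        if abs(new_t - t) < eps:
            return new_t
        t = new_t

def binarize(gray, t):
    """Step 3: True marks the dark target object, False the white background."""
    return gray <= t

def area(mask):
    """Step 4: number of object pixels."""
    return int(mask.sum())

def perimeter(mask):
    """Steps 5-6: count object pixels that touch the background (4-neighbourhood)."""
    p = np.pad(mask, 1, constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return int((mask & ~interior).sum())

def matches(a, p, ref_area, ref_perimeter, tol=0.05):
    """Step 7: compare with stored features within a relative tolerance (assumed)."""
    return abs(a - ref_area) <= tol * ref_area and abs(p - ref_perimeter) <= tol * ref_perimeter
```

When `matches` returns true for a captured frame, the play time of that frame is recorded as the video start time.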
In the embodiment of the invention, when the audio/video synchronization condition is evaluated, the degree to which the audio leads or lags the video can be mapped to a corresponding audio/video synchronization grade and a corresponding MOS score.
The MOS score for audio/video synchronization in the embodiment of the invention follows the scoring algorithm in the ITU-R BT.1359 standard and imitates its piecewise calculation method: thresholds for 4 audio/video synchronization quality grades are set according to people's subjective perception of audio/video synchronization. The scoring model may be as shown in Fig. 4, where the horizontal axis is the time by which the audio lags the video and the vertical axis is the score. The points A, B, C, A', B', C' represent the three grade thresholds, which divide the score into 4 grades. Each synchronization quality grade corresponds to a MOS score; the maximum score is 4.0, the minimum score is 1.0, and the floating range is 0.3. The synchronization grades, their thresholds, and the corresponding MOS scores may be as shown in Table 1:
Table 1
Figure B2009102374145D0000141
To evaluate the audio/video synchronization quality more accurately and objectively, the embodiment of the invention sets up a plurality of monitoring points to detect the audio/video synchronization condition and evaluate the synchronization quality. When the synchronization quality is evaluated, the synchronization MOS scores of these monitoring points are added together to obtain the overall synchronization MOS score. The overall synchronization MOS score can serve as an important indicator and be weighted together with the audio MOS and video MOS scores to obtain the MOS score of the overall video service quality.
Based on the same technical concept as the audio/video synchronization detection in the embodiment of the invention, the embodiment of the invention also provides an audio/video synchronization detection system. As shown in Fig. 5, this system comprises an audio identification module 501, a video identification module 502, a time difference determination module 503, and a synchronous detection module 504, wherein:
The audio identification module 501 can determine, by audio pattern recognition, the start play time of the audio segment that matches the audio reference data in the audio/video file played at the destination end;
The video identification module 502 can determine, by video pattern recognition, the start play time of the video frame that matches the video reference data in the audio/video file played at the destination end;
The time difference determination module 503 is used to determine the audio/video play time difference of the audio/video file when played at the destination end, according to the start play time of the audio segment matching the audio reference data determined by the audio identification module 501 and the start play time of the video frame matching the video reference data determined by the video identification module 502;
The synchronous detection module 504 is used to obtain the audio/video play time difference of the audio/video file when played at the source end and, according to the obtained difference and the audio/video play time difference determined by the time difference determination module 503, to determine the audio/video synchronization condition of this audio/video file when played at the destination end.
The specific implementation of each function of the above modules is similar to the corresponding process in the audio/video synchronization detection flow described above and is not repeated here.
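The computation performed by modules 503 and 504 amounts to two subtractions, sketched below. The function names and the sign convention (a positive value means the audio lags the video) are assumptions made for the illustration; times are in milliseconds.

```python
def av_time_difference(audio_start_ms, video_start_ms):
    """Audio/video play time difference at one end (positive: audio lags video)."""
    return audio_start_ms - video_start_ms

def sync_drift(src_audio_ms, src_video_ms, dst_audio_ms, dst_video_ms):
    """Change in the audio/video offset between the source end and the destination end."""
    source_diff = av_time_difference(src_audio_ms, src_video_ms)
    destination_diff = av_time_difference(dst_audio_ms, dst_video_ms)
    return destination_diff - source_diff
```

If the matched audio segment and video frame start together at the source end but the audio starts 150 ms after the video at the destination end, the drift is 150 ms; this value is what would be mapped to a synchronization grade.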
Based on the same technical concept as the speech detection in the embodiment of the invention, the embodiment of the invention also provides a speech detection system. As shown in Fig. 6, this system comprises a first search module 601, a second search module 602, and a voice segment determination module 603, wherein:
The first search module 601 receives the input audio signal to be tested and searches the audio signal in the audio to be tested according to the short-time average magnitude of the speech signal; when an audio signal whose short-time average magnitude exceeds the amplitude threshold MH is found, it searches the audio signal forward from the current moment, and when, after that moment, an audio signal whose short-time average magnitude first drops below the amplitude threshold MH is found, it searches the audio signal backward from the current moment;
The second search module 602 is used, when the first search module 601 finds in the forward or backward search an audio signal whose short-time average magnitude has dropped to the amplitude threshold ML, to continue searching the audio signal in the same direction according to the short-time average zero-crossing rate;
The voice segment determination module 603 is used, when the second search module 602 finds in the forward search an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing-rate threshold Z0, to take the current moment as the start point of the voice segment, and, when it finds in the backward search an audio signal whose short-time average zero-crossing rate has dropped below Z0, to take the current moment as the end point of the voice segment.
This system may also comprise a threshold setting module 604, used to determine the amplitude threshold MH, the amplitude threshold ML, and the zero-crossing-rate threshold Z0 according to the short-time average magnitude distribution and the short-time average zero-crossing rate distribution of the audio signals in speech sample data, wherein an audio signal whose short-time average magnitude is above the amplitude threshold MH is a speech signal and, among audio signals whose short-time average magnitude is below the amplitude threshold ML, an audio signal whose short-time average zero-crossing rate is lower than the zero-crossing-rate threshold Z0 is not a speech signal.
The specific implementation of each function of the above modules is similar to the corresponding process in the speech detection flow described above and is not repeated here.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (22)

1. An audio/video synchronization detection method, characterized in that it comprises the following steps:
determining, in the audio/video file played at a destination end, the start play time of the audio segment that matches audio reference data and the start play time of the video frame that matches video reference data;
determining the audio/video play time difference of the audio/video file when played at the destination end, according to the start play time of the audio segment matching the audio reference data and the start play time of the video frame matching the video reference data;
obtaining the audio/video play time difference of the audio/video file when played at a source end, and determining the audio/video synchronization condition of the audio/video file when played at the destination end according to the audio/video play time differences of the audio/video file when played at the source end and at the destination end.
2. the method for claim 1 is characterized in that, it is poor to obtain the audio frequency and video reproduction time of described audio-video document when the source end is play, and comprising:
Determine in the audio-video document that the source end play, with the initial reproduction time of the audio section of described audioref Data Matching, and with the initial reproduction time of the frame of video of described video reference Data Matching;
According to the initial reproduction time of the audio section of described and audioref Data Matching, and the initial reproduction time of the frame of video of described and video reference Data Matching, it is poor to determine the audio frequency and video reproduction time of described audio-video document when the source end is play.
3. The method of claim 1 or 2, characterized in that the audio reference data is speech data;
the process of determining the start play time of the audio segment that matches the audio reference data comprises:
detecting the voice segments contained in the played audio/video file and their start and end play times;
determining the voice segment that matches the audio reference data by performing speech recognition processing on the detected voice segments and the audio reference data.
4. The method of claim 3, characterized in that the process of determining the voice segments contained in the played audio/video file and their start and end play times comprises:
searching the audio signal in the played audio/video file according to the short-time average magnitude of the speech signal; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, searching the audio signal forward from the current moment; and when, after that moment, an audio signal whose short-time average magnitude first drops below the first amplitude threshold is found, searching the audio signal backward from the current moment;
when an audio signal whose short-time average magnitude has dropped to a second amplitude threshold is found in the forward or backward search, continuing to search the audio signal in the same direction according to the short-time average zero-crossing rate, the second amplitude threshold being smaller than the first amplitude threshold;
when an audio signal whose short-time average zero-crossing rate has dropped below a zero-crossing-rate threshold is found in the forward search, taking the current moment as the start point of the voice segment; and when an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing-rate threshold is found in the backward search, taking the current moment as the end point of the voice segment.
5. The method of claim 4, characterized in that the first amplitude threshold, the second amplitude threshold, and the zero-crossing-rate threshold are determined according to the short-time average magnitude distribution and the short-time average zero-crossing rate distribution of the audio signals in speech sample data, wherein an audio signal whose short-time average magnitude is above the first amplitude threshold is a speech signal and, among audio signals whose short-time average magnitude is below the second amplitude threshold, an audio signal whose short-time average zero-crossing rate is lower than the zero-crossing-rate threshold is not a speech signal.
6. The method of claim 3, characterized in that the process of determining the voice segment that matches the audio reference data comprises:
determining the similarity between each voice segment and the speech reference data by calculating the spatial distance between them, according to the feature vectors of the audio signal of each voice segment and the feature vectors of the speech reference data;
taking, according to the determined similarities, the voice segment most similar to the speech reference data as the voice segment that matches the audio reference data.
7. The method of claim 6, characterized in that, when the number of audio frames of the voice segment and the number of audio frames of the audio reference data are unequal, the process of calculating the distance between the voice segment and the speech reference data is specifically:
mapping the frame number of each audio frame of the voice segment onto the horizontal axis of a two-dimensional rectangular coordinate system and the frame number of each audio frame of the audio reference data onto the vertical axis of this coordinate system, and determining a path in the direction from the lower-left corner to the upper-right corner of the coordinate system; determining, according to the coordinate points through which the path passes, the frame number of the audio reference data that corresponds to each frame number in the voice segment;
calculating, according to the determined correspondence between frame numbers and using the feature vectors of the audio signals, the distortion between each pair of corresponding audio frames, and determining the spatial distance between the voice segment and the audio reference data according to the calculated distortions.
8. The method of claim 7, characterized in that the slope of the path determined in the direction from the lower-left corner to the upper-right corner of the coordinate system, at each intersection of the frame numbers marked on the vertical and horizontal axes, does not exceed a first slope threshold and is not less than a second slope threshold, the first slope threshold being greater than the second slope threshold.
9. The method of claim 1 or 2, characterized in that the process of determining the start play time of the video frame that matches the video reference data comprises:
extracting the video frames contained in the played audio/video file;
determining the video frame that matches the video reference data, and its start play time, by performing image recognition processing on the extracted video frames and the video reference data.
10. The method of claim 1, characterized in that determining the audio/video synchronization condition of the audio/video file comprises:
determining the change in audio/video delay produced when the audio/video file is played at the destination end relative to when it is played at the source end;
determining the corresponding audio/video synchronization quality grade or score according to the determined change in audio/video delay.
11. An audio/video synchronization detection system, characterized in that it comprises:
an audio identification module, used to determine, in the audio/video file played at a destination end, the start play time of the audio segment that matches audio reference data;
a video identification module, used to determine, in the audio/video file played at the destination end, the start play time of the video frame that matches video reference data;
a time difference determination module, used to determine the audio/video play time difference of the audio/video file when played at the destination end, according to the start play time of the audio segment matching the audio reference data determined by the audio identification module and the start play time of the video frame matching the video reference data determined by the video identification module;
a synchronous detection module, used to obtain the audio/video play time difference of the audio/video file when played at a source end and, according to the obtained difference and the audio/video play time difference determined by the time difference determination module, to determine the audio/video synchronization condition of the audio/video file when played at the destination end.
12. The system of claim 11, characterized in that, when obtaining the audio/video play time difference of the audio/video file when played at the source end, the synchronous detection module determines, in the audio/video file played at the source end, the start play time of the audio segment that matches the audio reference data and the start play time of the video frame that matches the video reference data; it then determines the audio/video play time difference of the audio/video file when played at the source end according to these two start play times in the audio/video file played at the source end.
13. The system of claim 12, characterized in that the audio reference data is speech data;
the process by which the audio identification module or the synchronous detection module determines the start play time of the audio segment that matches the audio reference data comprises:
detecting the voice segments contained in the played audio/video file and their start and end play times;
determining the voice segment that matches the audio reference data by performing speech recognition processing on the detected voice segments and the audio reference data.
14. The system of claim 13, characterized in that the process by which the audio identification module or the synchronous detection module determines the voice segments contained in the played audio/video file and their start and end play times comprises:
searching the audio signal in the played audio/video file according to the short-time average magnitude of the speech signal; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, searching the audio signal forward from the current moment; and when, after that moment, an audio signal whose short-time average magnitude first drops below the first amplitude threshold is found, searching the audio signal backward from the current moment;
when an audio signal whose short-time average magnitude has dropped to a second amplitude threshold is found in the forward or backward search, continuing to search the audio signal in the same direction according to the short-time average zero-crossing rate, the second amplitude threshold being smaller than the first amplitude threshold;
when an audio signal whose short-time average zero-crossing rate has dropped below a zero-crossing-rate threshold is found in the forward search, taking the current moment as the start point of the voice segment; and when an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing-rate threshold is found in the backward search, taking the current moment as the end point of the voice segment.
15. system as claimed in claim 13 is characterized in that, described audio identification module is determined and the process of the voice segments of described audioref Data Matching, being comprised:
According to the characteristic vector of each voice segments audio signal, and the characteristic vector of described phonetic reference data, by the definite similarity to each other of the space length that calculates each voice segments and described phonetic reference data;
According to the similarity of determining, get wherein and the most similar voice segments of described phonetic reference data, as with the voice segments of described audioref Data Matching.
16. The system as claimed in claim 15, characterized in that, when the number of audio frames in the voice segment is not equal to the number of audio frames in the audio reference data, the process by which the audio identification module calculates the distance between the voice segment and the voice reference data is specifically:
mapping the frame numbers of the audio frames of the voice segment onto the horizontal axis of a two-dimensional rectangular coordinate system, and mapping the frame numbers of the audio frames of the audio reference data onto the vertical axis of that coordinate system; determining a path running from the lower-left corner of the coordinate system toward the upper-right corner; and determining, according to the coordinate points through which the path passes, the frame number of the audio reference data corresponding to each frame number in the voice segment;
according to the determined frame-number correspondence, using the feature vectors of the audio signal to calculate the distortion between each pair of corresponding audio frames, and determining, according to the calculated distortions, the spatial distance between the voice segment and the audio reference data.
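The path construction in claim 16 resembles dynamic time warping (DTW), although the claim does not use that name. The sketch below is a standard DTW recurrence offered under that assumption: the per-frame distortion is taken to be the Euclidean distance, and the function names are hypothetical.

```python
import math


def frame_distortion(a, b):
    """Distortion between two frames, assumed here to be Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(segment, reference):
    """Return (total distortion, path) aligning segment frames to reference frames.

    The path is a list of (segment_frame, reference_frame) index pairs running
    from the lower-left corner (0, 0) to the upper-right corner of the grid.
    """
    n, m = len(segment), len(reference)
    if n == 0 or m == 0:
        raise ValueError("both inputs need at least one frame")
    inf = float("inf")
    # cost[i][j] = minimum accumulated distortion aligning the first i
    # segment frames with the first j reference frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distortion(segment[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the upper-right corner to recover the frame correspondence.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

The accumulated distortion along the path serves as the distance between two sequences of unequal length; dividing it by the path length would give a length-normalized variant.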
17. The system as claimed in claim 12, characterized in that the process by which the video identification module determines the starting play time of the video frame matching the video reference data comprises:
extracting the video frames contained in the played audio/video file;
performing image recognition processing on the extracted video frames and the video reference data, and thereby determining the video frame matching the video reference data and its starting play time.
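As a rough illustration of claim 17, the sketch below stands in for the unspecified "image recognition processing" with a mean absolute pixel difference between grayscale frames. The comparison measure, the `max_diff` tolerance, the constant frame rate and the function names are all assumptions made for the example.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two same-sized grayscale frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    if count == 0:
        raise ValueError("frames must contain at least one pixel")
    return total / count


def find_matching_frame(frames, reference, fps, max_diff=10.0):
    """Return (frame_index, start_time_seconds) of the best match, or None.

    Assumes a constant frame rate, so start time = frame index / fps.
    """
    best_index, best_diff = None, None
    for index, frame in enumerate(frames):
        diff = mean_abs_diff(frame, reference)
        if best_diff is None or diff < best_diff:
            best_index, best_diff = index, diff
    if best_diff is None or best_diff > max_diff:
        return None  # no frame is close enough to the reference
    return best_index, best_index / fps
```

A production system would more likely compare robust image features than raw pixels; the raw difference is used here only to keep the example self-contained.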
18. The system as claimed in claim 11, characterized in that the process by which the synchronization detection module determines the audio/video synchronization condition of the audio/video file comprises:
determining the change in audio/video delay produced when the audio/video file is played at the target terminal relative to when it is played at the source terminal;
according to the determined change in audio/video delay, determining a corresponding audio/video synchronization quality grade or score.
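The two steps of claim 18 amount to simple arithmetic followed by a lookup. In the sketch below, the audio/video play time difference is taken as audio start time minus video start time, and the grade bands are purely hypothetical values chosen for illustration; the claim does not specify any thresholds or grade names.

```python
def av_time_difference(audio_start, video_start):
    """Audio/video play time difference in seconds (audio minus video)."""
    return audio_start - video_start


def delay_change(source_diff, target_diff):
    """Change in audio/video delay at the target relative to the source."""
    return target_diff - source_diff


# Hypothetical bands: (maximum absolute change in seconds, grade).
GRADE_BANDS = [(0.04, "excellent"), (0.08, "good"), (0.16, "fair")]


def sync_grade(change, bands=GRADE_BANDS, worst="poor"):
    """Map a delay change to a quality grade using the given bands."""
    magnitude = abs(change)
    for limit, grade in bands:
        if magnitude <= limit:
            return grade
    return worst
```

For example, if audio and video start together at the source but the audio starts 0.1 s after the video at the target, the change is about 0.1 s, which falls in the "fair" band of this illustrative table.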
19. A voice detection method, characterized by comprising the following steps:
searching the audio under test for audio signals according to the short-time average magnitude of the voice signal; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, searching the audio signal forward from the current moment; and when, after that moment, an audio signal whose short-time average magnitude drops below the first amplitude threshold is first found, searching the audio signal backward from the current moment;
when the forward or backward search reaches an audio signal whose short-time average magnitude has dropped to a second amplitude threshold, continuing to search the audio signal in the same search direction according to the short-time average zero-crossing rate, the second amplitude threshold being less than the first amplitude threshold;
when the forward search reaches an audio signal whose short-time average zero-crossing rate has dropped below a zero-crossing rate threshold, taking the current moment as the starting point of the voice segment; and when the backward search reaches an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing rate threshold, taking the current moment as the end point of the voice segment.
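The three-stage search of claim 19 can be sketched on per-frame features as follows. This is one possible reading, not the patent's implementation: "forward" is taken to mean toward earlier frames and "backward" toward later frames (because the forward search yields the segment start), frames do not overlap, and the returned boundaries are the outermost frames that still satisfy the conditions. All names are hypothetical.

```python
def short_time_features(samples, frame_len):
    """Per-frame short-time average magnitude and zero-crossing rate."""
    if frame_len < 2:
        raise ValueError("frame_len must be at least 2")
    magnitudes, zcrs = [], []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        magnitudes.append(sum(abs(x) for x in frame) / frame_len)
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        zcrs.append(crossings / (frame_len - 1))
    return magnitudes, zcrs


def find_voice_segment(magnitudes, zcrs, high, low, zcr_threshold):
    """Return (start_frame, end_frame) of the first voice segment, or None."""
    n = len(magnitudes)
    # Stage 1: first frame above the first (high) amplitude threshold, then
    # the last consecutive frame before the magnitude drops below it.
    first = next((i for i in range(n) if magnitudes[i] > high), None)
    if first is None:
        return None
    last = first
    while last + 1 < n and magnitudes[last + 1] >= high:
        last += 1
    # Stage 2: extend both ways until the magnitude falls to the second (low)
    # amplitude threshold.
    start, end = first, last
    while start > 0 and magnitudes[start - 1] > low:
        start -= 1
    while end + 1 < n and magnitudes[end + 1] > low:
        end += 1
    # Stage 3: keep going in the same directions while the zero-crossing rate
    # stays at or above its threshold.
    while start > 0 and zcrs[start - 1] >= zcr_threshold:
        start -= 1
    while end + 1 < n and zcrs[end + 1] >= zcr_threshold:
        end += 1
    return start, end
```

The zero-crossing stage is what lets the boundaries include low-energy but noisy-sounding frames, such as unvoiced consonants at the edges of a word, that an amplitude test alone would cut off.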
20. The method as claimed in claim 19, characterized in that the first amplitude threshold, the second amplitude threshold and the zero-crossing rate threshold are determined according to the short-time average magnitude distribution and the short-time average zero-crossing rate distribution of the audio signals in voice sample data, wherein an audio signal whose short-time average zero-crossing rate is above the first amplitude threshold is a voice signal, and, among the signals whose short-time average magnitude is below the second amplitude threshold, an audio signal whose short-time average zero-crossing rate is lower than the zero-crossing rate threshold is not a voice signal.
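Claim 20 states only that the three thresholds are derived from the feature distributions of voice sample data; it gives no formula. One simple way to do this, shown below purely as an assumption, is to pick percentiles of the per-frame magnitudes and zero-crossing rates measured on known speech frames. The percentile choices and function names are hypothetical, and the sketch reads the first condition of the claim as applying to the short-time average magnitude.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in 0..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def thresholds_from_samples(speech_magnitudes, speech_zcrs,
                            high_pct=50, low_pct=10, zcr_pct=10):
    """Return (first_threshold, second_threshold, zcr_threshold).

    Hypothetical rule: most speech frames lie above the second (low)
    threshold, and frames above the first (high) one are treated as speech.
    """
    high = percentile(speech_magnitudes, high_pct)
    low = percentile(speech_magnitudes, low_pct)
    zcr_threshold = percentile(speech_zcrs, zcr_pct)
    if not low < high:
        raise ValueError("second amplitude threshold must be below the first")
    return high, low, zcr_threshold
```

In practice the percentiles would be tuned against labelled recordings so that the resulting thresholds separate speech from background noise on the target material.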
21. A voice detection system, characterized by comprising:
a first search module, configured to search the audio under test for audio signals according to the short-time average magnitude of the voice signal; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, to search the audio signal forward from the current moment; and when, after that moment, an audio signal whose short-time average magnitude drops below the first amplitude threshold is first found, to search the audio signal backward from the current moment;
a second search module, configured to, when the forward or backward search of the first search module reaches an audio signal whose short-time average magnitude has dropped to a second amplitude threshold, continue to search the audio signal in the same search direction according to the short-time average zero-crossing rate, the second amplitude threshold being less than the first amplitude threshold;
a voice segment determination module, configured to, when the forward search of the second search module reaches an audio signal whose short-time average zero-crossing rate has dropped below a zero-crossing rate threshold, take the current moment as the starting point of the voice segment, and, when the backward search reaches an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing rate threshold, take the current moment as the end point of the voice segment.
22. The system as claimed in claim 21, characterized by further comprising:
a threshold setting module, configured to determine the first amplitude threshold, the second amplitude threshold and the zero-crossing rate threshold according to the short-time average magnitude distribution and the short-time average zero-crossing rate distribution of the audio signals in voice sample data, wherein an audio signal whose short-time average zero-crossing rate is above the first amplitude threshold is a voice signal, and, among the signals whose short-time average magnitude is below the second amplitude threshold, an audio signal whose short-time average zero-crossing rate is lower than the zero-crossing rate threshold is not a voice signal.
CN2009102374145A 2009-11-06 2009-11-06 Audio/video synchronization detection method and system, and voice detection method and system Active CN102056026B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102374145A CN102056026B (en) 2009-11-06 2009-11-06 Audio/video synchronization detection method and system, and voice detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102374145A CN102056026B (en) 2009-11-06 2009-11-06 Audio/video synchronization detection method and system, and voice detection method and system

Publications (2)

Publication Number Publication Date
CN102056026A true CN102056026A (en) 2011-05-11
CN102056026B CN102056026B (en) 2013-04-03

Family

ID=43959877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102374145A Active CN102056026B (en) 2009-11-06 2009-11-06 Audio/video synchronization detection method and system, and voice detection method and system

Country Status (1)

Country Link
CN (1) CN102056026B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6744922B1 (en) * 1999-01-29 2004-06-01 Sony Corporation Signal processing method and video/voice processing device
EP1081960B1 (en) * 1999-01-29 2007-12-19 Sony Corporation Signal processing method and video/voice processing device
CN101159834B (en) * 2007-10-25 2012-01-11 中国科学院计算技术研究所 Method and system for detecting repeatable video and audio program fragment
CN101494049B (en) * 2009-03-11 2011-07-27 北京邮电大学 Method for extracting audio characteristic parameter of audio monitoring system

Cited By (126)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11928604B2 (en) 2005-09-08 2024-03-12 Apple Inc. Method and apparatus for building an intelligent automated assistant
US11671920B2 (en) 2007-04-03 2023-06-06 Apple Inc. Method and system for operating a multifunction portable electronic device using voice-activation
US11348582B2 (en) 2008-10-02 2022-05-31 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US11423886B2 (en) 2010-01-18 2022-08-23 Apple Inc. Task flow identification based on user intent
US11120372B2 (en) 2011-06-03 2021-09-14 Apple Inc. Performing actions associated with task items that represent tasks to perform
US11269678B2 (en) 2012-05-15 2022-03-08 Apple Inc. Systems and methods for integrating third party services with a digital assistant
US11321116B2 (en) 2012-05-15 2022-05-03 Apple Inc. Systems and methods for integrating third party services with a digital assistant
CN103051921B (en) * 2013-01-05 2014-12-24 北京中科大洋科技发展股份有限公司 Method for precisely detecting video and audio synchronous errors of video and audio processing system
CN103051921A (en) * 2013-01-05 2013-04-17 北京中科大洋科技发展股份有限公司 Method for precisely detecting video and audio synchronous errors of video and audio processing system
US11636869B2 (en) 2013-02-07 2023-04-25 Apple Inc. Voice trigger for a digital assistant
US10978090B2 (en) 2013-02-07 2021-04-13 Apple Inc. Voice trigger for a digital assistant
US11388291B2 (en) 2013-03-14 2022-07-12 Apple Inc. System and method for processing voicemail
US11798547B2 (en) 2013-03-15 2023-10-24 Apple Inc. Voice activated device for use with a voice-based digital assistant
US11727219B2 (en) 2013-06-09 2023-08-15 Apple Inc. System and method for inferring user intent from speech inputs
CN103974143B (en) * 2014-05-20 2017-11-07 北京速能数码网络技术有限公司 A kind of method and apparatus for generating media data
CN103974143A (en) * 2014-05-20 2014-08-06 北京速能数码网络技术有限公司 Method and device for generating media data
US10878809B2 (en) 2014-05-30 2020-12-29 Apple Inc. Multi-command single utterance input method
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US11133008B2 (en) 2014-05-30 2021-09-28 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11810562B2 (en) 2014-05-30 2023-11-07 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US11670289B2 (en) 2014-05-30 2023-06-06 Apple Inc. Multi-command single utterance input method
US11699448B2 (en) 2014-05-30 2023-07-11 Apple Inc. Intelligent assistant for home automation
CN106415719A (en) * 2014-06-19 2017-02-15 苹果公司 Robust end-pointing of speech signals using speaker recognition
CN106415719B (en) * 2014-06-19 2019-10-18 苹果公司 It is indicated using the steady endpoint of the voice signal of speaker identification
US11516537B2 (en) 2014-06-30 2022-11-29 Apple Inc. Intelligent automated assistant for TV user interactions
CN104538041A (en) * 2014-12-11 2015-04-22 深圳市智美达科技有限公司 Method and system for detecting abnormal sounds
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US11842734B2 (en) 2015-03-08 2023-12-12 Apple Inc. Virtual assistant activation
US10930282B2 (en) 2015-03-08 2021-02-23 Apple Inc. Competing devices responding to voice triggers
CN104796578A (en) * 2015-04-29 2015-07-22 成都陌云科技有限公司 Multi-screen synchronization method based on program audio features
CN104796578B (en) * 2015-04-29 2018-03-13 成都陌云科技有限公司 A kind of multi-screen synchronous method based on broadcast sounds feature
US11468282B2 (en) 2015-05-15 2022-10-11 Apple Inc. Virtual assistant in a communication session
US11070949B2 (en) 2015-05-27 2021-07-20 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display
CN107810529A (en) * 2015-06-29 2018-03-16 亚马逊技术公司 Language model sound end determines
US11010127B2 (en) 2015-06-29 2021-05-18 Apple Inc. Virtual assistant for media playback
CN107810529B (en) * 2015-06-29 2021-10-08 亚马逊技术公司 Language model speech endpoint determination
US11947873B2 (en) 2015-06-29 2024-04-02 Apple Inc. Virtual assistant for media playback
CN104993901B (en) * 2015-07-09 2017-08-29 广东威创视讯科技股份有限公司 Distributed system method of data synchronization and device
CN104993901A (en) * 2015-07-09 2015-10-21 广东威创视讯科技股份有限公司 Data synchronization method and device of distributed system
CN106470339B (en) * 2015-08-17 2018-09-14 南宁富桂精密工业有限公司 Terminal device and audio video synchronization detection method
CN106470339A (en) * 2015-08-17 2017-03-01 南宁富桂精密工业有限公司 Terminal unit and audio video synchronization detection method
US11809483B2 (en) 2015-09-08 2023-11-07 Apple Inc. Intelligent automated assistant for media search and playback
US11853536B2 (en) 2015-09-08 2023-12-26 Apple Inc. Intelligent automated assistant in a media environment
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US11550542B2 (en) 2015-09-08 2023-01-10 Apple Inc. Zero latency digital assistant
US11126400B2 (en) 2015-09-08 2021-09-21 Apple Inc. Zero latency digital assistant
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US11886805B2 (en) 2015-11-09 2024-01-30 Apple Inc. Unconventional virtual assistant interactions
CN105898498A (en) * 2015-12-15 2016-08-24 乐视网信息技术(北京)股份有限公司 Video synchronization method and system
US11853647B2 (en) 2015-12-23 2023-12-26 Apple Inc. Proactive assistance based on dialog communication between devices
CN105608935A (en) * 2015-12-29 2016-05-25 北京奇艺世纪科技有限公司 Detection method and device of audio and video synchronization
CN105609118B (en) * 2015-12-30 2020-02-07 生迪智慧科技有限公司 Voice detection method and device
CN105609118A (en) * 2015-12-30 2016-05-25 生迪智慧科技有限公司 Speech detection method and device
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11657820B2 (en) 2016-06-10 2023-05-23 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US11749275B2 (en) 2016-06-11 2023-09-05 Apple Inc. Application integration with a digital assistant
US11809783B2 (en) 2016-06-11 2023-11-07 Apple Inc. Intelligent device arbitration and control
CN106157952A (en) * 2016-08-30 2016-11-23 北京小米移动软件有限公司 Sound identification method and device
US11656884B2 (en) 2017-01-09 2023-05-23 Apple Inc. Application integration with a digital assistant
CN108632557B (en) * 2017-03-20 2021-06-08 中兴通讯股份有限公司 Audio and video synchronization method and terminal
CN108632557A (en) * 2017-03-20 2018-10-09 中兴通讯股份有限公司 A kind of method and terminal of audio-visual synchronization
US10600432B1 (en) * 2017-03-28 2020-03-24 Amazon Technologies, Inc. Methods for voice enhancement
CN108882019B (en) * 2017-05-09 2021-12-10 腾讯科技(深圳)有限公司 Video playing test method, electronic equipment and system
CN108882019A (en) * 2017-05-09 2018-11-23 腾讯科技(深圳)有限公司 Video playing test method, electronic equipment and system
US10741181B2 (en) 2017-05-09 2020-08-11 Apple Inc. User interface for correcting recognition errors
US11599331B2 (en) 2017-05-11 2023-03-07 Apple Inc. Maintaining privacy of personal information
US11580990B2 (en) 2017-05-12 2023-02-14 Apple Inc. User-specific acoustic models
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US11380310B2 (en) 2017-05-12 2022-07-05 Apple Inc. Low-latency intelligent automated assistant
US11532306B2 (en) 2017-05-16 2022-12-20 Apple Inc. Detecting a trigger of a digital assistant
US11675829B2 (en) 2017-05-16 2023-06-13 Apple Inc. Intelligent automated assistant for media exploration
CN109039994B (en) * 2017-06-08 2020-12-08 中国移动通信集团甘肃有限公司 Method and equipment for calculating asynchronous time difference between audio and video
CN109039994A (en) * 2017-06-08 2018-12-18 中国移动通信集团甘肃有限公司 A kind of method and apparatus calculating the audio and video asynchronous time difference
CN107920245A (en) * 2017-11-22 2018-04-17 北京奇艺世纪科技有限公司 A kind of method and apparatus for detecting video playing and starting the time
CN109859744B (en) * 2017-11-29 2021-01-19 宁波方太厨具有限公司 Voice endpoint detection method applied to range hood
CN109859744A (en) * 2017-11-29 2019-06-07 宁波方太厨具有限公司 A kind of sound end detecting method applied in range hood
US11710482B2 (en) 2018-03-26 2023-07-25 Apple Inc. Natural assistant interaction
US11169616B2 (en) 2018-05-07 2021-11-09 Apple Inc. Raise to speak
US11854539B2 (en) 2018-05-07 2023-12-26 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11900923B2 (en) 2018-05-07 2024-02-13 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US11487364B2 (en) 2018-05-07 2022-11-01 Apple Inc. Raise to speak
CN108769559A (en) * 2018-05-25 2018-11-06 数据堂(北京)科技股份有限公司 The synchronous method and device of multimedia file
CN108769559B (en) * 2018-05-25 2020-12-01 数据堂(北京)科技股份有限公司 Multimedia file synchronization method and device
US10720160B2 (en) 2018-06-01 2020-07-21 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11360577B2 (en) 2018-06-01 2022-06-14 Apple Inc. Attention aware virtual assistant dismissal
US10984798B2 (en) 2018-06-01 2021-04-20 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11009970B2 (en) 2018-06-01 2021-05-18 Apple Inc. Attention aware virtual assistant dismissal
US11431642B2 (en) 2018-06-01 2022-08-30 Apple Inc. Variable latency device coordination
US11475898B2 (en) 2018-10-26 2022-10-18 Apple Inc. Low-latency multi-speaker speech recognition
CN109472487A (en) * 2018-11-02 2019-03-15 深圳壹账通智能科技有限公司 Video quality detecting method, device, computer equipment and storage medium
US11638059B2 (en) 2019-01-04 2023-04-25 Apple Inc. Content playback on multiple devices
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11475884B2 (en) 2019-05-06 2022-10-18 Apple Inc. Reducing digital assistant latency when a language is incorrectly determined
US11217251B2 (en) 2019-05-06 2022-01-04 Apple Inc. Spoken notifications
US11705130B2 (en) 2019-05-06 2023-07-18 Apple Inc. Spoken notifications
US11423908B2 (en) 2019-05-06 2022-08-23 Apple Inc. Interpreting spoken requests
US11888791B2 (en) 2019-05-21 2024-01-30 Apple Inc. Providing message response suggestions
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
US11496600B2 (en) 2019-05-31 2022-11-08 Apple Inc. Remote execution of machine-learned models
US11237797B2 (en) 2019-05-31 2022-02-01 Apple Inc. User activity shortcut suggestions
US11657813B2 (en) 2019-05-31 2023-05-23 Apple Inc. Voice identification in digital assistant systems
US11360739B2 (en) 2019-05-31 2022-06-14 Apple Inc. User activity shortcut suggestions
US11289073B2 (en) 2019-05-31 2022-03-29 Apple Inc. Device text to speech
US11360641B2 (en) 2019-06-01 2022-06-14 Apple Inc. Increasing the relevance of new available information
CN110267083A (en) * 2019-06-18 2019-09-20 广州虎牙科技有限公司 Detection method, device, equipment and the storage medium of audio-visual synchronization
CN112447185A (en) * 2019-08-30 2021-03-05 广州虎牙科技有限公司 Audio synchronization error testing method and device, server and readable storage medium
CN112447185B (en) * 2019-08-30 2024-02-09 广州虎牙科技有限公司 Audio synchronization error testing method and device, server and readable storage medium
CN110585702B (en) * 2019-09-17 2023-09-19 腾讯科技(深圳)有限公司 Sound and picture synchronous data processing method, device, equipment and medium
CN110585702A (en) * 2019-09-17 2019-12-20 腾讯科技(深圳)有限公司 Sound and picture synchronous data processing method, device, equipment and medium
CN110503982A (en) * 2019-09-17 2019-11-26 腾讯科技(深圳)有限公司 A kind of method and relevant apparatus of voice quality detection
CN110503982B (en) * 2019-09-17 2024-03-22 腾讯科技(深圳)有限公司 Voice quality detection method and related device
US11488406B2 (en) 2019-09-25 2022-11-01 Apple Inc. Text detection using global geometry estimators
CN112653916B (en) * 2019-10-10 2023-08-29 腾讯科技(深圳)有限公司 Method and equipment for synchronously optimizing audio and video
CN112653916A (en) * 2019-10-10 2021-04-13 腾讯科技(深圳)有限公司 Method and device for audio and video synchronization optimization
CN111093108A (en) * 2019-12-18 2020-05-01 广州酷狗计算机科技有限公司 Sound and picture synchronization judgment method and device, terminal and computer readable storage medium
CN111093108B (en) * 2019-12-18 2021-12-03 广州酷狗计算机科技有限公司 Sound and picture synchronization judgment method and device, terminal and computer readable storage medium
CN113555132A (en) * 2020-04-24 2021-10-26 华为技术有限公司 Multi-source data processing method, electronic device and computer-readable storage medium
US11765209B2 (en) 2020-05-11 2023-09-19 Apple Inc. Digital assistant hardware abstraction
US11924254B2 (en) 2020-05-11 2024-03-05 Apple Inc. Digital assistant hardware abstraction
CN112039612A (en) * 2020-09-01 2020-12-04 广州市百果园信息技术有限公司 Time delay measuring method, device, equipment, system and storage medium
CN112351273A (en) * 2020-11-04 2021-02-09 新华三大数据技术有限公司 Video playing quality detection method and device
CN112351273B (en) * 2020-11-04 2022-03-01 新华三大数据技术有限公司 Video playing quality detection method and device
CN113744368A (en) * 2021-08-12 2021-12-03 北京百度网讯科技有限公司 Animation synthesis method and device, electronic equipment and storage medium
CN114999453A (en) * 2022-05-25 2022-09-02 中南大学湘雅二医院 Preoperative visit system based on voice recognition and corresponding voice recognition method

Also Published As

Publication number Publication date
CN102056026B (en) 2013-04-03

Similar Documents

Publication Publication Date Title
CN102056026B (en) Audio/video synchronization detection method and system, and voice detection method and system
US11631404B2 (en) Robust audio identification with interference cancellation
CN105405439B (en) Speech playing method and device
CN108900725B (en) Voiceprint recognition method and device, terminal equipment and storage medium
US9558744B2 (en) Audio processing apparatus and audio processing method
US20060053009A1 (en) Distributed speech recognition system and method
CN107799126A (en) Sound end detecting method and device based on Supervised machine learning
CN103377651B (en) The automatic synthesizer of voice and method
US20220059075A1 (en) Word replacement in transcriptions
CN100356446C (en) Noise reduction and audio-visual speech activity detection
CN106372653A (en) Stack type automatic coder-based advertisement identification method
CN102714034A (en) Signal processing method, device and system
CN103050116A (en) Voice command identification method and system
CN110223678A (en) Audio recognition method and system
KR101022519B1 (en) System and method for voice activity detection using vowel characteristic, and method for measuring sound spectral similarity used thereto
CN111009261B (en) Arrival reminding method, device, terminal and storage medium
CN107274892A (en) Method for distinguishing speek person and device
US11488604B2 (en) Transcription of audio
JP3798530B2 (en) Speech recognition apparatus and speech recognition method
JP2001520764A (en) Speech analysis system
Eyben et al. Audiovisual vocal outburst classification in noisy acoustic conditions
CN109065024B (en) Abnormal voice data detection method and device
Rodríguez et al. Speech/speaker recognition using a HMM/GMM hybrid model
CN110265062A (en) Collection method and device after intelligence based on mood detection is borrowed
CN113160796B (en) Language identification method, device and equipment for broadcast audio and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant