CN104598541A - Identification method and device for multimedia file - Google Patents

Identification method and device for multimedia file

Info

Publication number
CN104598541A
CN104598541A
Authority
CN
China
Prior art keywords
data
multimedia file
audio
voice data
matching result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410849018.9A
Other languages
Chinese (zh)
Inventor
王晓萌
谭傅伦
许泽军
王英杰
袁斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LeTV Information Technology Beijing Co Ltd
Original Assignee
LeTV Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LeTV Information Technology Beijing Co Ltd filed Critical LeTV Information Technology Beijing Co Ltd
Priority to CN201410849018.9A priority Critical patent/CN104598541A/en
Publication of CN104598541A publication Critical patent/CN104598541A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval of audio data
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 Retrieval using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/43 Querying
    • G06F 16/432 Query formulation
    • G06F 16/433 Query formulation using audio data

Abstract

The invention discloses an identification method and device for a multimedia file. The identification method comprises the following steps: obtaining mixed audio data corresponding to a target multimedia file, wherein the mixed audio data comprises the audio data of the target multimedia file and audio watermark data; extracting the audio watermark data from the mixed audio data; matching the audio watermark data against a preset audio watermark sample to obtain a first matching result; determining, within a preset feature sample, the feature sample part corresponding to the first matching result; extracting the feature information of the audio data of the target multimedia file from the mixed audio data; matching the feature information against the feature sample part to obtain a second matching result; and identifying the target multimedia file according to the second matching result. The identification method and device can be used to improve the granularity of audio identification.

Description

Identification method and device for multimedia file
Technical field
The present invention relates to the field of multimedia file identification technology, and in particular to an identification method and device for a multimedia file.
Background art
Current video search typically relies on keyword search. This requires not only that the user knows relevant information about the video, but also that the search service provides and promptly maintains a database mapping keywords one-to-one to videos. In practice, however, users often run into the following predicament: they happen upon an interesting video on the street or on television, but know little or nothing about it, let alone enough to search for it by keyword.
Driven by this practical demand, audio-based video identification has emerged: a video is identified by its own audio. Audio-based video identification mainly comprises two techniques: video identification based on audio watermarks and video identification based on audio fingerprints.
Among the audio-watermark-based techniques, the most common is video identification based on voiceprint codes. Its principle is as follows: exploiting the insensitivity of the human ear to high-frequency sound, a voiceprint code carrying specific information is added to the high-frequency band of the audio data. After an identification terminal obtains an audio file carrying such a voiceprint code, it extracts the code and matches it against voiceprint code samples in a database, thereby identifying the video by its audio. The advantage of this technique is its speed: identification generally takes milliseconds.
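The voiceprint-code scheme described above can be illustrated with a minimal sketch. The carrier frequencies, amplitude, and detection threshold below are illustrative assumptions, not values from the patent: a bit pattern is embedded as faint high-frequency tones that the ear largely ignores, and recovered from the FFT magnitude of the recorded mix.

```python
import numpy as np

FS = 44100
# Hypothetical high-band carrier frequencies, one per code bit.
CODE_FREQS = [18000, 18500, 19000, 19500]

def embed_code(audio, bits, amp=0.01):
    """Add a faint high-frequency tone for each 1-bit of the voiceprint code."""
    t = np.arange(len(audio)) / FS
    marked = audio.astype(float).copy()
    for f, b in zip(CODE_FREQS, bits):
        if b:
            marked += amp * np.sin(2 * np.pi * f * t)
    return marked

def extract_code(mixed):
    """Recover the code bits by checking FFT magnitude at the carrier bins."""
    spec = np.abs(np.fft.rfft(mixed))
    freqs = np.fft.rfftfreq(len(mixed), 1 / FS)
    threshold = spec.mean() * 10  # crude, illustrative detector
    bits = []
    for f in CODE_FREQS:
        idx = int(np.argmin(np.abs(freqs - f)))
        bits.append(1 if spec[idx] > threshold else 0)
    return bits
```

The unmarked program audio stays below the detector's threshold at the carrier bins, so the same extractor reports all zeros for it.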
However, this technique distinguishes videos solely by their voiceprint code data, so videos carrying identical voiceprint codes cannot be told apart. For example, when all episodes of a television series carry the same voiceprint code, individual episodes cannot be distinguished: identifying an episode can only establish that it belongs to a certain series, not which episode of the series it is. Likewise, when a film carries a single voiceprint code, clips within the film cannot be distinguished: identifying a clip can only establish that it belongs to a certain film, not which part of the film it is. The identification range of voiceprint-code-based video identification is therefore limited, and its identification granularity is low.
No effective solution has yet been proposed for the problem of low video identification granularity in the prior art.
Summary of the invention
A primary object of the present invention is to provide an identification method and device for a multimedia file, so as to solve the problem of low video identification granularity in the prior art.
According to one aspect of the present invention, an identification method for a multimedia file is provided.
The identification method for a multimedia file according to the present invention comprises: obtaining mixed audio data corresponding to a target multimedia file, wherein the mixed audio data comprises the audio data of the target multimedia file and audio watermark data; extracting the audio watermark data from the mixed audio data; matching the audio watermark data against a preset audio watermark sample to obtain a first matching result; determining, within a preset feature sample, the feature sample part corresponding to the first matching result; extracting the feature information of the audio data of the target multimedia file from the mixed audio data; matching the feature information against the feature sample part to obtain a second matching result; and identifying the target multimedia file according to the second matching result.
Further, the mixed audio data also comprises user voice data, and the method further comprises: extracting the user voice data from the mixed audio data; matching the user voice data against a preset voice sample to obtain a third matching result; and selecting one target multimedia file, according to the third matching result, from the target multimedia files identified according to the second matching result.
Further, extracting the audio watermark data from the mixed audio data comprises: extracting the audio data of the high-frequency part of the mixed audio data. Extracting the feature information of the audio data of the target multimedia file comprises: extracting feature information from the audio data of the low-frequency part of the mixed audio data. Extracting the user voice data comprises: extracting the audio data of the low-frequency part of the mixed audio data, and removing from it the audio data of the target multimedia file to obtain the user voice data.
Further, extracting the feature information of the audio data of the target multimedia file comprises: extracting the left-channel data and right-channel data of the low-frequency part of the mixed audio data; merging the left-channel data and right-channel data using the formula s = a*l + b*r, where a + b = 1, s is the merged stereo data of the low-frequency part, l is the left-channel data, r is the right-channel data, and a and b are preset parameters; and extracting the time-frequency feature data of the merged data to obtain the fingerprint information of the target multimedia file, wherein the fingerprint information forms the feature information of the audio data of the target multimedia file.
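The merging step follows directly from the formula s = a*l + b*r with a + b = 1; the helper name below is hypothetical.

```python
import numpy as np

def merge_channels(l, r, a=0.5):
    """Merge left/right channel data into one signal: s = a*l + b*r, a + b = 1."""
    b = 1.0 - a
    return a * np.asarray(l, dtype=float) + b * np.asarray(r, dtype=float)
```

With a = 0.5 this is the ordinary stereo-to-mono downmix; other preset weights give the N variant merges described next.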
Further, if the target multimedia file is a sub-file of a second multimedia file, the first matching result is the identification information of the second multimedia file, the second matching result is the identification information of the target multimedia file, the feature sample is at least one multimedia record stored in a preset feature database, and each multimedia record comprises the fingerprint information of a multimedia file together with the identification information of the multimedia file corresponding to that fingerprint information, then: determining the feature sample part corresponding to the first matching result comprises locating, in the feature database, one or more multimedia records corresponding to the identification information of the second multimedia file; and matching the feature information against the feature sample part to obtain the second matching result comprises matching the fingerprint information of the target multimedia file against the located multimedia records to determine the identification information of the target multimedia file.
Further, the merged stereo data of the low-frequency part consists of N merged signals, where the i-th signal is s_i = a_i*l + b_i*r with a_i + b_i = 1, i = 1, 2, 3, ..., N. Matching the fingerprint information of the target multimedia file against the located multimedia records to determine its identification information then comprises: matching the time-frequency feature data of each merged signal against the located multimedia records to obtain a plurality of matching rates; and determining the identification information of the target multimedia file from the multimedia record corresponding to the maximum of these matching rates.
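Given the matching rates of each of the N merged signals against each candidate record, the selection rule above reduces to an argmax; the names here are illustrative.

```python
def select_record(rates):
    """rates: {record_id: [matching rate for each of the N merged signals]}.
    Return the record whose best per-signal rate is the overall maximum."""
    best_id, best_rate = None, float("-inf")
    for record_id, signal_rates in rates.items():
        top = max(signal_rates)
        if top > best_rate:
            best_id, best_rate = record_id, top
    return best_id
```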
According to another aspect of the present invention, an identification device for a multimedia file is provided.
The identification device for a multimedia file according to the present invention comprises: an acquisition module for obtaining mixed audio data corresponding to a target multimedia file, wherein the mixed audio data comprises the audio data of the target multimedia file and audio watermark data; a first extraction module for extracting the audio watermark data from the mixed audio data; a first matching module for matching the audio watermark data against a preset audio watermark sample to obtain a first matching result; a determination module for determining, within a preset feature sample, the feature sample part corresponding to the first matching result; a second extraction module for extracting the feature information of the audio data of the target multimedia file from the mixed audio data; a second matching module for matching the feature information against the feature sample part to obtain a second matching result; and an identification module for identifying the target multimedia file according to the second matching result.
Further, the mixed audio data also comprises user voice data, and the device further comprises: a third extraction module for extracting the user voice data from the mixed audio data; a third matching module for matching the user voice data against a preset voice sample to obtain a third matching result; and a verification module for selecting one target multimedia file, according to the third matching result, from the target multimedia files identified according to the second matching result.
Further, when extracting the audio watermark data, the first extraction module extracts the audio data of the high-frequency part of the mixed audio data; when extracting the feature information, the second extraction module extracts feature information from the audio data of the low-frequency part of the mixed audio data; and when extracting the user voice data, the third extraction module extracts the audio data of the low-frequency part of the mixed audio data and removes from it the audio data of the target multimedia file to obtain the user voice data.
Further, the second extraction module comprises: a left/right channel extraction module for extracting the left-channel data and right-channel data of the low-frequency part of the mixed audio data; a merging module that merges the left-channel and right-channel data using the formula s = a*l + b*r, where a + b = 1, s is the merged stereo data of the low-frequency part, l is the left-channel data, r is the right-channel data, and a and b are preset parameters; and a fingerprint extraction module that extracts the time-frequency feature data of the merged data to obtain the fingerprint information of the target multimedia file, wherein the fingerprint information forms the feature information of the audio data of the target multimedia file.
Further, if the target multimedia file is a sub-file of a second multimedia file, the first matching result is the identification information of the second multimedia file, the second matching result is the identification information of the target multimedia file, the feature sample is at least one multimedia record stored in a preset feature database, and each multimedia record comprises the fingerprint information of a multimedia file together with the identification information of the corresponding multimedia file, then: when determining the feature sample part, the determination module locates, in the feature database, one or more multimedia records corresponding to the identification information of the second multimedia file; and when obtaining the second matching result, the second matching module matches the fingerprint information of the target multimedia file against the located multimedia records to determine the identification information of the target multimedia file.
Further, the merged stereo data of the low-frequency part consists of N merged signals, where the i-th signal is s_i = a_i*l + b_i*r with a_i + b_i = 1, i = 1, 2, 3, ..., N. The second matching module then comprises: a matching-rate determination module for matching the time-frequency feature data of each merged signal against the located multimedia records to obtain a plurality of matching rates; and an identification-information determination module for determining the identification information of the target multimedia file from the multimedia record corresponding to the maximum of these matching rates.
With the present invention, when identifying a multimedia file, the audio watermark data of the target multimedia file is first matched against a preset audio watermark sample to obtain a first matching result; the preset feature sample is then filtered according to the first matching result to obtain the corresponding feature sample part, achieving a preliminary identification of the target multimedia file and narrowing the identification scope. On this basis, the feature information of the audio data of the target multimedia file is matched against the feature sample part to obtain a second matching result; that is, further identification is performed within the narrowed scope. The target multimedia file is finally identified according to the second matching result. When a video is identified by this method, the problem of low video identification granularity in the prior art is solved.
The above description is merely an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented according to the contents of this specification, and that the above and other objects, features and advantages of the present invention may become more apparent, specific embodiments of the present invention are set forth below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be construed as limiting the invention. Throughout the drawings, identical parts are denoted by identical reference symbols. In the drawings:
Fig. 1 is a flowchart of the method according to Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method according to Embodiment 2 of the present invention;
Fig. 3 is a flowchart of the method according to Embodiment 3 of the present invention;
Fig. 4 is a flowchart of the method according to Embodiment 4 of the present invention;
Fig. 5 is a flowchart of the method according to Embodiment 5 of the present invention;
Fig. 6 is a flowchart of the method according to Embodiment 6 of the present invention;
Fig. 7 is a block diagram of the terminal recording module according to Embodiment 7 of the present invention;
Fig. 8 is a block diagram of the terminal audio identification module according to Embodiment 7 of the present invention;
Fig. 9 is a schematic diagram of the server and database according to Embodiment 7 of the present invention;
Fig. 10 is a flowchart of the method according to Embodiment 7 of the present invention;
Fig. 11 is a block diagram of the device according to Embodiment 8 of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features within them may be combined with one another.
Embodiments of the present invention provide a multimedia file identification method. In this method, the audio watermark data of the target multimedia file is first matched against a preset audio watermark sample to obtain a first matching result; the feature sample part corresponding to the first matching result is then determined within a preset feature sample; the feature information of the audio data of the target multimedia file is next matched against that feature sample part to obtain a second matching result; and finally the target multimedia file is identified according to the second matching result.
It can be seen that the method first exploits the high processing speed of watermark-based video identification: identification based on the audio watermark data of the target multimedia file quickly yields a preliminary result. It then exploits the ability of fingerprint-based video identification to identify audio data from any source: on the basis of the preliminary result, further identification is performed using the feature information of the audio data of the target multimedia file. The method thus effectively combines watermark-based and fingerprint-based video identification and improves identification granularity.
The various embodiments provided by the present invention are described in detail below.
Embodiment 1
Embodiment 1 provides an implementation of the identification method for a multimedia file. In the method of this embodiment, audio watermark data is added in advance to the audio data of the target multimedia file, and a preset audio watermark sample and a preset feature sample are used when identifying the target multimedia file. Specifically, as shown in Fig. 1, the method comprises the following steps S102 to S114.
Step S102: obtain the mixed audio data corresponding to the target multimedia file, wherein the mixed audio data comprises the audio data of the target multimedia file and audio watermark data.
The target multimedia file may be a video (or an audio file). While the video (or audio) is playing, the identification device starts a recording device to record the sound, thereby obtaining the mixed audio data of the video (or audio). Since audio watermark data was added to the video (or audio) in advance, the recorded mixed audio data contains both the audio data of the target multimedia file itself and the added audio watermark data.
The identification device may be an intelligent mobile communication terminal, such as a mobile phone or a tablet; it may be a computer; or it may be embedded, as an independent identification unit, in any apparatus that needs to identify multimedia files.
Step S104: extract the audio watermark data from the mixed audio data.
The audio watermark data is extracted from the mixed audio data according to its characteristics. For example, when the audio watermark data is voiceprint code data, that is, specific information added to the high-frequency band of the audio data, the voiceprint code data can be obtained by extracting the high-frequency part of the mixed audio data.
Step S106: match the audio watermark data against the preset audio watermark sample to obtain the first matching result.
The preset audio watermark sample may be stored in an audio watermark database local to the identification device, in which case the identification device itself matches the audio watermark data against the preset sample. Alternatively, it may be stored remotely, in the audio watermark database of an audio watermark identification server; the identification device then interacts with the server, transmitting the audio watermark data to it, and the server performs the matching. Whichever side performs this matching step, the first matching result is obtained.
The first matching result may designate a multimedia file group composed of a plurality of multimedia files; that is, this step determines that the target multimedia file belongs to that group.
For example, suppose identical audio watermark data is added in advance to all films starring the same lead actor, and the preset audio watermark sample consists of multiple audio watermark records stored in the audio watermark database, each record comprising audio watermark data and the lead actor's name corresponding to it. If the target multimedia file is a film starring a certain actor, then matching its audio watermark data against the preset sample will locate one audio watermark record in the database and yield the lead actor's name; that is, the audio watermark data identifies the target multimedia file as a film starring a certain actor A.
Step S108: determine, within the preset feature sample, the feature sample part corresponding to the first matching result.
The preset feature sample may be stored in an audio fingerprint database local to the identification device, in which case the device determines the feature sample part locally. Alternatively, it may be stored remotely, in the audio fingerprint database of an audio fingerprint identification server; the identification device then transmits the first matching result to the server, which determines the feature sample part. Whichever side performs this determination step, a subset is filtered out of the preset feature sample, and this subset is the feature sample part corresponding to the first matching result.
Through this step, when audio fingerprint matching is performed, the feature information of the audio data of the target multimedia file need not be matched against the entire preset feature sample, but only against the feature sample part corresponding to the first matching result.
For example, the preset feature sample consists of multiple multimedia records stored in the audio fingerprint database, each record comprising the fingerprint information of a multimedia file, the corresponding film title, and the lead actor's name. If the first matching result obtained in step S106 is that the target multimedia file is a film starring actor A, then in this step one or more multimedia records corresponding to actor A can be located in the audio fingerprint database according to the first matching result.
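The narrowing in this example amounts to a simple filter over the fingerprint database; the record layout and names below are hypothetical illustrations, not the patent's schema.

```python
# Hypothetical audio fingerprint database: each multimedia record holds
# fingerprint information, the film title, and the lead actor's name.
FINGERPRINT_DB = [
    {"fingerprint": (11, 42, 7), "title": "Film 1", "actor": "A"},
    {"fingerprint": (3, 99, 21), "title": "Film 2", "actor": "A"},
    {"fingerprint": (8, 15, 16), "title": "Film 3", "actor": "B"},
]

def feature_sample_part(first_matching_result):
    """Locate the multimedia records matching the first (watermark) result,
    so that fingerprint matching only runs over this subset."""
    return [rec for rec in FINGERPRINT_DB
            if rec["actor"] == first_matching_result]
```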
Step S110: extract the feature information of the audio data of the target multimedia file from the mixed audio data.
For example, when the audio watermark data is voiceprint code data, that is, specific information added to the high-frequency band of the audio data, the audio data of the target multimedia file can be obtained by extracting the low-frequency part of the mixed audio data.
The feature information of the audio data of the target multimedia file is then obtained by extracting feature information from the data of this low-frequency part. Any prior-art audio feature extraction method may be used: for example, time-domain feature data may be extracted, such as the amplitude of an audio segment, or time-frequency feature data may be extracted.
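One common way to obtain time-frequency feature data is to take short windowed FFT frames and keep the highest-energy frequency bins of each frame. The frame size, hop, and peak count below are arbitrary illustrative choices, not parameters from the patent.

```python
import numpy as np

def time_frequency_features(signal, frame=256, hop=128, n_peaks=3):
    """Per-frame time-frequency features: the indices of the n_peaks
    strongest FFT magnitude bins in each Hann-windowed frame."""
    window = np.hanning(frame)
    features = []
    for start in range(0, len(signal) - frame + 1, hop):
        mag = np.abs(np.fft.rfft(signal[start:start + frame] * window))
        top = np.argsort(mag)[-n_peaks:]
        features.append(tuple(sorted(int(b) for b in top)))
    return features
```

The resulting list of peak-bin tuples serves as a crude fingerprint of the audio segment.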
Step S112: match the feature information against the feature sample part to obtain the second matching result.
When the preset feature sample is stored in an audio fingerprint database local to the identification device, the device matches the feature information against the feature sample part locally; when it is stored remotely, in the audio fingerprint database of an audio fingerprint identification server, the server performs the matching. Whichever side performs this matching step, a result matching the feature information is obtained.
For example, if the feature sample part consists of one or more multimedia records corresponding to actor A in the audio fingerprint database, the feature information is matched against the fingerprint information of each of these records to find the fingerprint information that matches it successfully; the film title in the multimedia record containing that fingerprint information is then the title of the target multimedia file.
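Matching against the narrowed record set can be sketched as scoring each candidate by the fraction of query fingerprint features it shares and returning the best-scoring record's title. The scoring function is an illustrative assumption, not the patent's exact matcher.

```python
def matching_rate(query_features, record_features):
    """Fraction of the query's fingerprint features found in the record."""
    if not query_features:
        return 0.0
    record_set = set(record_features)
    return sum(1 for f in query_features if f in record_set) / len(query_features)

def identify_title(query_features, candidates):
    """candidates: list of (title, features). Return the best-matching title."""
    return max(candidates,
               key=lambda c: matching_rate(query_features, c[1]))[0]
```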
Step S114: identify the target multimedia file according to the second matching result.
With the identification method provided by this embodiment, identification is first performed based on the audio watermark data of the target multimedia file, exploiting the high processing speed of watermark-based video identification to obtain a preliminary result quickly. The granularity of this result may be low, for instance only the group to which the multimedia file belongs, but that scope is obtained quickly; that is, the method first rapidly determines a scope within which the target multimedia file lies. Within this scope, further identification is performed based on the feature information of the audio data of the target multimedia file, exploiting the ability of fingerprint-based video identification to identify audio data from any source, so that the target multimedia file itself can be recognized. The method thus effectively combines watermark-based and fingerprint-based video identification: compared with using watermark-based identification alone, it improves identification granularity and widens the range of application; compared with using fingerprint-based identification alone, it shortens identification time.
Embodiment 2
Embodiment 2 provides another implementation of the identification method for a multimedia file; it is a preferred embodiment built on the basis of Embodiment 1. In the method of this embodiment, the target multimedia file is a target video; voiceprint code data is added in advance to the audio data of the target video; the user's voice data is captured together with the audio data of the target video; and, on the basis of Embodiment 1 having identified the target multimedia file, this method further verifies the accuracy of the identification result according to the user's voice data. A preset voiceprint code sample and a preset feature sample are used when identifying the target video, and a preset voice sample is used when verifying the identification result against the user's voice data. Specifically, as shown in Fig. 2, the method comprises the following steps S202 to S212.
Step S202: obtain the mixed audio data corresponding to the target video, where the mixed audio data comprises the audio data of the target video, the voiceprint code data and the user voice data.
While watching a video, the user may be familiar with some detail of the video to be identified — a scene, an actor, even an object appearing in it — and may speak that detail aloud during playback. For example, during playback of the target video the identification device starts a recording device and records all the sound in the current environment, that is, the mixed audio data corresponding to the target video. This mixed audio data comprises the audio data of the target video itself, the added audio watermark data, and the voice data spoken by the user.
Step S204: extract the audio watermark data from the mixed audio data and match it against the preset audio watermark sample to obtain a first matching result.
This step is identical to steps S104 and S106 in embodiment one and is not repeated here.
Step S206: determine the feature-sample portion corresponding to the first matching result within the preset feature samples, extract the characteristic information of the target video's audio data from the mixed audio data, match the characteristic information against that feature-sample portion to obtain a second matching result, and identify the target video according to the second matching result.
This step is identical to steps S108 to S114 in embodiment one and is not repeated here.
Step S208: extract the user voice data from the mixed audio data.
Given the characteristics of the voiceprint code data, extracting the low-frequency portion of the mixed audio data removes the voiceprint code from it; removing the audio data of the target video from what remains — that is, removing the target video's audio from the low-frequency data — then yields the user voice data.
Specifically, removing the target video's audio from the low-frequency data requires the target video's audio data itself. Since the target video was recognized in step S206, the audio data corresponding to that recognized video is used here to extract the user voice data. For example, the second matching result contains the URL of the target video; the target video's audio data can be fetched via this URL and subtracted from the low-frequency data, leaving the user voice data.
Step S210: match the user voice data against the preset speech samples to obtain a third matching result.
The preset speech samples may be stored in a speech database local to the identification device, in which case the device itself matches the user voice data against them; or in a speech database at the device's far end, on a speech recognition server, in which case the device interacts with the server, transmits the user voice data to it, and the server performs the matching. Whichever side performs the matching, a third matching result is obtained.
For example, the preset speech samples are voice records stored in the speech database, each consisting of voice characteristic information and a keyword corresponding to it. To match the user voice data against the preset speech samples, the voice characteristic information of the user's speech is first extracted from the user voice data and then matched against the voice characteristic information in the speech database; this locates one or more voice records in the database and thereby yields the keywords corresponding to the user voice data — for instance, the keywords "spot for photography" and "Hainan".
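As an illustrative sketch of this lookup (the record format, the feature tokens and the overlap threshold are assumptions for the sake of the example, not details fixed by this embodiment), matching the user's extracted voice features against the voice records might look like:

```python
# Hypothetical voice records: (feature token set, keyword).
voice_db = [
    ({"f1", "f2", "f3"}, "spot for photography"),
    ({"f4", "f5", "f6"}, "Hainan"),
    ({"f7", "f8", "f9"}, "Beijing"),
]

def match_keywords(user_features, db, min_overlap=2):
    """Return the keyword of every voice record whose feature set
    overlaps the user's extracted features by at least min_overlap."""
    return [kw for feats, kw in db if len(user_features & feats) >= min_overlap]

# Features extracted from the user's speech (illustrative values).
print(match_keywords({"f1", "f2", "f4", "f5"}, voice_db))
# -> ['spot for photography', 'Hainan']
```

Several records may match at once, so the step can yield more than one keyword, as in the example above.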
Step S212: identify the target multimedia file from the third matching result together with the second matching result. The second matching result may have recognized several candidate target videos; the third matching result supplies the keywords spoken by the user, which can narrow the candidates down further.
For instance, suppose the second matching result identifies the target video as either film B or film C, and the keywords recognized from the third matching result are "spot for photography" and "Hainan". If film B was shot in Hainan and film C in Beijing, this step selects between film B and film C by checking whether the shooting location is Hainan.
As another example, suppose the second matching result identifies the target video as episode 7, 8 or 10 of a television series, the keywords from the third matching result are "actor" and "Liu Ruoying", and Liu Ruoying does not appear in episodes 7 and 8; then the third matching result determines that the target video is episode 10 of the series.
In short, the recognized keywords can further narrow the identification of the target video and verify the accuracy of the result, so that a high-accuracy result can be returned to the user.
The multimedia file identification method of this preferred embodiment, on top of the technical effect of embodiment one, uses the user voice data to improve the accuracy of identifying the target multimedia file.
Embodiment three
Embodiment three provides another preferred embodiment of the multimedia file identification method on the basis of embodiment one. In this embodiment, audio watermark data is added to the audio data of the target multimedia file in advance, and identifying the file uses the preset audio watermark sample and the preset feature samples. Specifically, as shown in Figure 3, the method comprises the following steps S302 to S320.
Step S302: obtain the mixed audio data corresponding to the target multimedia file, where the mixed audio data comprises the audio data and the audio watermark data of the target multimedia file.
Step S304: extract the audio watermark data from the mixed audio data.
Step S306: match the audio watermark data against the preset audio watermark sample to obtain a first matching result.
Steps S302 to S306 correspond one-to-one to steps S102 to S106 in embodiment one and are not repeated here.
Step S308: locate the multimedia file records corresponding to the first matching result in the preset feature database.
The preset feature database stores at least one multimedia record; the stored records constitute the preset feature samples. Each record comprises the fingerprint information of a multimedia file and the identification information of the file corresponding to that fingerprint information, where the fingerprint information of each record consists of multiple fingerprint values computed from the time-frequency characteristic data of the file's audio.
For example, the same audio watermark data is added in advance to every programme broadcast by a certain television channel, and the preset audio watermark sample consists of audio watermark records stored in an audio watermark database, each made up of audio watermark data and the corresponding television channel name. If the target multimedia file is a programme broadcast by some channel, matching its audio watermark data against the preset sample necessarily locates one audio watermark record in the database, yielding the programme's channel — that is, the audio watermark data identifies the programme as one broadcast by channel A.
The identification information corresponding to the fingerprint information in each multimedia record may be the file's television channel name and programme title. Once the first matching result has determined that the target multimedia file is a programme broadcast by channel A, this step can locate in the feature database the multimedia records corresponding to channel A — the programmes broadcast by that channel.
Step S310: extract the left-channel data and right-channel data of the low-frequency portion of the mixed audio data.
Extracting the low-frequency portion of the mixed audio data yields the audio data of the target multimedia file contained in it; this audio data consists of two parts, the left-channel data and the right-channel data.
Step S312: merge the left-channel data and right-channel data to obtain N stereo data streams of the low-frequency portion.
Specifically, the merge uses the following formula:

s_i = a_i * l + b_i * r

where a_i + b_i = 1 and i = 1, 2, 3, …, N; l and r are the left-channel and right-channel data; s_1 is the first stereo data stream, s_N the N-th and s_i the i-th; a_i and b_i are preset weight parameters, and adjusting their sizes regulates the proportions of the left- and right-channel data in the stereo data.
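A minimal sketch of this merge (the sample values and weight pairs are illustrative; each weight pair sums to 1 as the formula requires):

```python
import numpy as np

def merge_channels(left, right, weights):
    """weights: list of (a_i, b_i) pairs with a_i + b_i == 1.
    Returns one merged stream s_i = a_i*l + b_i*r per weight pair."""
    return [a * left + b * right for a, b in weights]

left = np.array([1.0, 0.0, 0.5])     # left-channel samples
right = np.array([0.0, 1.0, 0.5])    # right-channel samples
weights = [(0.5, 0.5), (0.8, 0.2)]   # N = 2 weight groups

s1, s2 = merge_channels(left, right, weights)
# s1 = 0.5*l + 0.5*r = [0.5 0.5 0.5]
# s2 = 0.8*l + 0.2*r = [0.8 0.2 0.5]
```

Using several weight pairs yields several merged streams from the same clip, which is exactly the multi-group variant described at the end of this embodiment.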
Step S314: compute the time-frequency characteristic data of each stereo data stream to obtain the multiple fingerprint values of each stream.
For each stereo data stream, its multiple fingerprint values constitute its own fingerprint information, and the fingerprint information of the N streams together constitutes the fingerprint information of the target multimedia file.
Specifically, for a given stereo data stream, computing its time-frequency characteristic data to obtain its fingerprint values comprises the following steps S3142 to S3148:
Step S3142: apply a short-time Fourier transform to the stereo data to obtain its time-frequency distribution map;
Step S3144: obtain the energy maximum points in the time-frequency distribution map;
Step S3146: from two maximum points at different times, A[ta, fa, Va] and B[tb, fb, Vb], build one fingerprint value fp[ta, fa, fb, tb-ta] and convert it to a hash code fp[hashData, ta], where ta, fa and Va are the time, frequency and energy of maximum point A, tb, fb and Vb are the time, frequency and energy of maximum point B, ta < tb, and A and B are any two adjacent energy maximum points in the time-frequency distribution map;
Step S3148: combine all the fingerprint values so built in time order to obtain the multiple fingerprint values of this stereo data stream.
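The peak-pairing of steps S3146 and S3148 can be sketched as follows, assuming the STFT of step S3142 has already been reduced to a time-ordered list of energy-maximum points (t, f, V); the bit-packing used for hashData is one illustrative encoding (valid while fb and tb-ta stay below 1024), not the encoding fixed by this embodiment:

```python
def build_fingerprints(peaks):
    """peaks: time-ordered (t, f, V) energy maxima from the STFT.
    Pair each point with the next adjacent one and emit (hashData, ta)
    codes, packing (fa, fb, tb - ta) into a single integer."""
    codes = []
    for (ta, fa, _va), (tb, fb, _vb) in zip(peaks, peaks[1:]):
        hash_data = (fa << 20) | (fb << 10) | (tb - ta)  # illustrative packing
        codes.append((hash_data, ta))
    return codes

# Three illustrative maxima (time frame, frequency bin, energy):
peaks = [(1, 300, 0.9), (3, 512, 0.8), (7, 420, 0.7)]
print(build_fingerprints(peaks))   # two codes, one per adjacent pair
```

Keeping ta alongside the hash preserves the anchor time, which the matching step can use to align query fingerprints with stored ones.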
Correspondingly, when the fingerprint values in each multimedia record of the feature database are computed from the time-frequency characteristic data of the file's audio, the file's stereo data is preferably used as the audio data and the fingerprint values are preferably computed by the time-frequency method above, so that the characteristic information of the target multimedia file is consistent with the characteristic information in the feature database, improving the matching accuracy.
Step S316: match the multiple fingerprint values of each stereo data stream against each of the located multimedia records, obtaining the matching rate corresponding to each stream.
For example, the fingerprint information fp(hashdata, t) of a first stereo data stream comprises the fingerprint values [(10001,1), (10002,1), (20001,2), (30001,3), …];
the fingerprint information fp(hashdata, t) of a second stereo data stream comprises the fingerprint values [(10002,11), (10004,11), (30001,14), (30005,16), …];
the characteristic information in the first located multimedia record is [(10003,10), (10002,20), (20001,21), (30001,31), …];
the characteristic information in the second located multimedia record is [(10002,11), (10004,11), (30001,14), (30005,16), …].
The matching rates of the first stereo data stream then comprise a match count of 3 against the first multimedia record and of 2 against the second; those of the second stream comprise a match count of 2 against the first record and of 4 against the second.
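One way to reproduce the counts in this example is to compare hash values only, counting a match whenever a query fingerprint's hash appears in the candidate record regardless of its timestamp (an assumption consistent with the figures above, though the embodiment does not spell out the comparison rule):

```python
def match_count(query, record):
    """Count query fingerprint values whose hash occurs in the record."""
    record_hashes = {h for h, _t in record}
    return sum(1 for h, _t in query if h in record_hashes)

stereo1 = [(10001, 1), (10002, 1), (20001, 2), (30001, 3)]
stereo2 = [(10002, 11), (10004, 11), (30001, 14), (30005, 16)]
record1 = [(10003, 10), (10002, 20), (20001, 21), (30001, 31)]
record2 = [(10002, 11), (10004, 11), (30001, 14), (30005, 16)]

print(match_count(stereo1, record1), match_count(stereo1, record2))  # 3 2
print(match_count(stereo2, record1), match_count(stereo2, record2))  # 2 4
```

The maximum count (here 4, for the second stream against the second record) is what step S318 uses to select the winning record.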
Step S318: determine the identification information corresponding to the target multimedia file from the multimedia record corresponding to the maximum among the matching rates.
For example, the maximum matching rate is the second stream's match count of 4 against the second multimedia record, so the identification information determined in this step for the target multimedia file is the identification information in that second record.
Step S320: identify the target multimedia file according to its corresponding identification information.
For example, the two multimedia records above correspond to two programmes broadcast by channel A, and the programme title in the identification information of the second record is "Seeing"; the recognized target multimedia file is then the programme "Seeing" broadcast by channel A.
With the identification method of this embodiment, on top of the technical effect of embodiment one, the audio data of the target multimedia file obtained during identification is stereo data merged from left-channel and right-channel data, and the preset feature samples are likewise features of stereo data, so the source data type of the target file's characteristic information matches that of the feature samples — both are stereo data — which improves the identification accuracy. Moreover, merging the left and right channels into stereo data with weight parameters a and b allows the proportions of the two channels in the stereo data to be adjusted as needed.
Further, when building the characteristic information of the target multimedia file, setting multiple groups of weight parameters transforms its left- and right-channel data into multiple groups of stereo data; computing the fingerprint values of each group gives characteristic information containing multiple groups of fingerprint values. During identification, each group of fingerprint values is matched separately against the located multimedia file records, and the target file is identified from the record with the maximum matching rate, further increasing the accuracy.
Embodiment four
Embodiment four provides an embodiment of a multimedia file search method; as shown in Figure 4, the method comprises the following steps S402 to S406.
Step S402: receive a search request, where the search request comprises the mixed audio data of the target multimedia file to be searched for.
Step S404: identify the target multimedia file according to the search request.
Step S406: search for the target multimedia file according to the identification result.
In this embodiment, searching for the target multimedia file first requires recognizing it, then searching for it further according to the identification information of the recognized file. Any of the embodiments above may be used for the identification.
Embodiment five
Embodiment five provides an embodiment of a multimedia file search method whose executing body may be any terminal; as shown in Figure 5, the method comprises the following steps S502 to S512.
Step S502: obtain the mixed audio data corresponding to the target multimedia file, where the mixed audio data comprises the audio data and the audio watermark data of the target multimedia file.
Step S504: extract the audio watermark data from the mixed audio data of the target multimedia file.
Step S506: send the audio watermark data to an audio watermark identification server to obtain a first matching result, where the first matching result is the result the server obtains by matching the audio watermark data against the preset audio watermark sample.
In this embodiment, the audio watermark identification server is provided with an audio watermark database storing the preset audio watermark sample: the database holds audio watermark records, each comprising audio watermark information and the multimedia file identification information corresponding to it.
After the terminal sends the audio watermark data of the target multimedia file to the server, the server locates in the database the audio watermark record matching that data, thereby obtaining the first matching result — that is, the multimedia file identification information corresponding to the target multimedia file is read from that record.
The identification information at this stage identifies the target multimedia file at a relatively coarse granularity and cannot determine the file uniquely. For example, it may recognize that the target file belongs to a certain television series without determining which episode, or that it belongs to the programmes of a certain television channel without determining which programme.
Step S508: extract the characteristic information of the audio data of the target multimedia file from its mixed audio data.
Step S510: send the characteristic information of the target file's audio data together with the first matching result to an audio fingerprint identification server to obtain a second matching result, where the second matching result is the result the server obtains, after determining within the preset feature samples the feature-sample portion corresponding to the first matching result, by matching the characteristic information against that portion.
In this embodiment, the audio fingerprint identification server is provided with an audio fingerprint database storing the preset feature samples: the database holds multimedia records, each comprising the fingerprint information of a multimedia file and the identification information of the file corresponding to that fingerprint information.
The identification information at this stage identifies the target multimedia file at a relatively fine granularity: its content determines the file uniquely. It may include the multimedia file identification information of the audio watermark database above and, in addition, finer descriptive information about the file, such as its storage location and its title.
After the terminal sends the characteristic information of the audio data and the first matching result to the audio fingerprint identification server, the server first locates in the audio fingerprint database the one or more multimedia records corresponding to the first matching result, then matches the characteristic information against the located records, obtaining the second matching result and uniquely recognizing the target multimedia file.
Step S512: send the second matching result to a multimedia management server to obtain the target multimedia file, where the target multimedia file is the file the management server fetches according to the second matching result.
For example, the URL of the target multimedia file can be obtained from the second matching result; the terminal sends the second matching result to the multimedia management server, which fetches the target file via that URL and returns its related data to the terminal. The related data may be the file's streaming media data, which the terminal receives and plays directly, or the file's download address, from which the terminal downloads the file from the corresponding server and plays it.
In a preferred embodiment of the invention, the mixed audio data of the target multimedia file also comprises user voice data, and before step S512 the method further comprises the following steps:
Step S514: extract the user voice data from the mixed audio data.
Step S516: send the user voice data to a speech recognition server to obtain a third matching result, where the third matching result is the result the server obtains by matching the user voice data against the preset speech samples.
In this preferred embodiment, the speech recognition server is provided with a speech database storing the preset speech samples: the database holds voice records, each made up of voice characteristic information and the keyword corresponding to it.
After the terminal sends the user voice data to the speech recognition server, the server first extracts the voice characteristic information of the user's speech from the user voice data, then matches it against the voice characteristic information in the speech database, locating one or more voice records and thereby obtaining the keywords corresponding to the user voice data.
Step S518: identify the target multimedia file from the second matching result, and verify against the third matching result whether the identification result is correct; when the result is correct, perform step S512.
That is, after the second matching result is obtained in step S510 and the target multimedia file is identified from it, this step verifies the accuracy of the identification against the third matching result, and only when the result is accurate is the second matching result sent to the multimedia management server.
For example, the target multimedia file is a film Q: the second matching result recognizes the film as Q, and the descriptive information of Q obtained with it includes the fact that its lead actor is Wang XX; the keywords obtained from the third matching result are "lead actor" and "Wang XX", so the third matching result verifies the identification as correct, and the second matching result is then sent to the multimedia management server to obtain film Q.
In another preferred embodiment of the invention, the extraction in step S508 of the characteristic information of the target file's audio data may use the feature extraction described in embodiment three above, which is not repeated here.
Embodiment six
Embodiment six provides another embodiment of a multimedia file search method. In this embodiment, voiceprint code data is added to the audio data of the target video in advance; the user's speech data is captured while the audio data of a clip of the target video is obtained; identifying the target video uses the preset audio watermark database and the preset audio fingerprint database; and identifying the user's speech uses the preset speech database. Specifically, as shown in Figure 6, the method comprises the following steps S602 to S608.
Step S602: start the recording module and obtain the mixed audio data of a clip of the target video, where the mixed audio data comprises the audio data of the target video and the user's speech data.
Once started, the recording module records the sound in the current environment in real time to obtain the mixed audio data; if the user speaks while the target video is playing, the recorded sound comprises the audio data of the clip of the target video, the user's speech data and some background sound from the environment.
Whenever the recording duration reaches T2, the audio of length T2 is packaged as one chunk; each packaged chunk contains the audio data of one video clip together with the user's speech and the background sound.
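The chunking described above can be sketched as follows (the sample rate and T2 values are illustrative; real code would consume PCM samples from the recorder as they arrive):

```python
def chunk_samples(samples, sample_rate, t2_seconds):
    """Package the recorded samples into consecutive chunks of length
    T2; a trailing partial chunk is held back until it fills up."""
    n = sample_rate * t2_seconds
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

samples = list(range(10))                                     # 10 recorded samples
chunks = chunk_samples(samples, sample_rate=2, t2_seconds=2)  # 4 samples per chunk
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```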
Step S604: preprocess the audio file of the mixed audio data.
This specifically comprises the following steps:
1. Audio format conversion.
Call third-party software (e.g. ffmpeg) to convert audio files of different formats uniformly into PCM-encoded audio data of time length T2.
2. Extract the audio data of the high-frequency portion.
Use a high-pass filter (whose band is consistent with the band occupied by the voiceprint code, assumed to be H1 Hz to H2 Hz) to obtain audio data Music1 of time length T2 and frequency range H1 Hz to H2 Hz.
3. Extract the audio data of the low-frequency portion.
Use a low-pass filter to obtain audio data Music2 of time length T2 and frequency range L1 Hz to L2 Hz.
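A minimal sketch of the two band extractions, using frequency-domain masking as one possible filter implementation; the concrete cutoffs stand in for the unspecified H1/H2 and L1/L2 bounds:

```python
import numpy as np

def split_bands(pcm, rate, low_cut=4000.0, high_band=(16000.0, 20000.0)):
    """Zero the FFT bins outside each band and transform back, yielding
    (Music1, Music2): the high band carrying the voiceprint code and
    the low band carrying programme audio and speech."""
    spectrum = np.fft.rfft(pcm)
    freqs = np.fft.rfftfreq(len(pcm), d=1.0 / rate)
    hi = np.where((freqs >= high_band[0]) & (freqs <= high_band[1]), spectrum, 0)
    lo = np.where(freqs <= low_cut, spectrum, 0)
    return np.fft.irfft(hi, n=len(pcm)), np.fft.irfft(lo, n=len(pcm))

rate = 44100
t = np.arange(rate) / rate                      # one second of samples
pcm = np.sin(2 * np.pi * 440 * t) + 0.1 * np.sin(2 * np.pi * 18000 * t)
music1, music2 = split_bands(pcm, rate)
# music1 retains only the 18 kHz component, music2 only the 440 Hz one.
```

A streaming implementation would more likely use time-domain IIR filters (e.g. Butterworth high-pass/low-pass), but the band split itself is the same.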
Step S606: perform identification on the preprocessed audio file of the mixed audio data.
Specifically, the content to be identified comprises the clip of the target video and the keywords corresponding to the user voice data, through the following steps:
1. Receive the preprocessed audio file of the mixed audio data, that is, the two audio fragments Music1 and Music2.
2. Splice the received low-frequency audio data together as preparation data for speech extraction.
Each time a low-frequency audio chunk Music2 is received, the chunks are spliced in chronological order into audio data Music3 of time length N*T2, where N is the current total number of clips.
3. Use the voiceprint code to lock onto the target video.
Identify the voiceprint code information carried in the high-frequency audio data Music1 to obtain recognition result Result1.
For example, Result1 is the ID of the target video in the audio fingerprint database (the TrackID uniquely identifying the video fingerprint):
Result1: {TrackID: "……"}.
4. Precisely locate the clip of the target video.
Extract the fingerprint information of Music2 and match it against the fingerprint information in the audio fingerprint database pointed to by Result1, obtaining matching result Result2.
Result2 comprises the index information TrackID of the target video in the audio fingerprint database, its storage location information URL, and the time range timeStart to timeStop of the clip within the target video:
Result2: {TrackID: "……", URL: "http://……", timeStart: "……", timeStop: "……"};
5. Extract the audio data of the clip of the target video, that is, the clip's original sound.
Read Result2 and locate the video file vedio from the storage location information URL; extract the audio data music of the video file vedio; then, according to the time information in Result2, extract the audio data music_clip of the specific period (timeStart to timeStop). This music_clip is the original sound of the clip.
6. Extract the user voice data.
The spliced audio data Music3 is in fact composed of three parts:

Music3 = a1 * (video clip audio data) + a2 * (user voice data) + a3 * (background sound data), where a1, a2 and a3 are constants.

Assuming the recording conditions are good enough, i.e. a3 = 0, then:

user voice data = b1 * Music3 − b2 * (video clip audio data), where b1 and b2 are constants.

The user voice data can therefore be extracted as:

word = b1 * Music3 − b2 * music_clip.
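Under the ideal-recording assumption above (a3 = 0, and taking b1 = b2 = 1 for simplicity), the extraction reduces to an aligned subtraction; the sample arrays are illustrative:

```python
import numpy as np

def extract_voice(music3, music_clip, b1=1.0, b2=1.0):
    """word = b1*Music3 - b2*music_clip (both aligned to the clip)."""
    return b1 * music3 - b2 * music_clip

music_clip = np.array([0.2, -0.1, 0.4, 0.0])   # clip's original sound
voice      = np.array([0.0,  0.3, 0.0, 0.5])   # what the user said
music3     = music_clip + voice                 # ideal recording (a3 = 0)

word = extract_voice(music3, music_clip)
# word recovers the user's voice: [0.  0.3 0.  0.5]
```

In practice the two signals would first need time alignment and gain estimation (choosing b1 and b2), which this sketch leaves out.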
7. Resolve the voice instruction from the user voice data.
Match the user voice data word against the voice instructions in the speech database to obtain the voice instruction Command closest to word:
Command: {index: {"music, title, ……"}}
Step S608: return the retrieval result according to the recognition results.
For example, the index information in Command comprises a song title and a singer's name. Via Result2, all the information describing this video clip can be obtained, including the objects appearing in the clip, the scene information, the title of the background music and the singer of that background music. This step judges whether the content corresponding to the index information in Command appears in the descriptive information obtained via Result2; if so, the video file vedio can be found through the URL in Result2, and its link address or the video file itself is returned as the retrieval result.
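A sketch of this final check (the Result2 fields follow the example above; the description structure, field names and sample values are assumptions for illustration):

```python
def retrieve(result2, description, command_keywords):
    """Return the clip's URL only if every keyword resolved from the
    user's voice appears in the clip's descriptive information."""
    text = " ".join(description.values())
    if all(kw in text for kw in command_keywords):
        return result2["URL"]
    return None

result2 = {"TrackID": "t-001", "URL": "http://example.invalid/vedio",
           "timeStart": "00:12:00", "timeStop": "00:12:30"}
description = {"music": "Hou Lai", "singer": "Liu Ruoying", "scene": "beach"}

print(retrieve(result2, description, ["Hou Lai", "Liu Ruoying"]))  # the URL
print(retrieve(result2, description, ["Beijing"]))                 # None
```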
Embodiment seven
Embodiment seven provides another embodiment of the searching method for a multimedia file; in this embodiment, the search system implementing this searching method is described.
Specifically, the system implementing the method consists of two parts, a terminal on one side and servers with databases on the other, which are described in turn below.
First, the terminal comprises a recording module, an audio preprocessing module and an audio identification module, wherein:
Recording module: used to acquire sound information. The input sound information consists of two parts: (1) the audio data of the video (including the voiceprint code); (2) the user's speech data. The user may input voice information at any moment during the recording process.
Audio preprocessing module: as shown in Figure 7, the audio format conversion unit converts the audio data acquired by the recording module, and the high-frequency extraction unit and the low-frequency extraction unit then each extract their audio component, preparing the data for the subsequent video identification and speech recognition.
Audio identification module: as shown in Figure 8, this module receives the preprocessed audio data, comprising high-frequency sound data and low-frequency sound data; it outputs the audio fingerprint retrieval result, i.e. the retrieval result of the target video, and also outputs the user's speech data. Its units are described below:
Voiceprint code recognition unit: interacts with the voiceprint code recognition server, uploads the high-frequency audio information to that server, and receives the voiceprint code recognition result it returns.
Fingerprint recognition unit: receives the voiceprint code recognition result, uploads it together with the low-frequency sound data to the audio fingerprint recognition server, and receives the audio fingerprint recognition result returned by that server.
Audio splicing unit: splices the fragments of low-frequency sound data into a complete whole, preparing data for user speech extraction.
Speech recognition unit: on the one hand, receives the audio fingerprint recognition result, uploads it to the video management server, and receives the audio fragment of the target video sent back by the video management server; on the other hand, obtains the original soundtrack audio data identified by the audio fingerprint recognition result and, using this original soundtrack together with the low-frequency sound data sent by the audio splicing unit, extracts the user's speech data; it then uploads the speech data to the speech recognition server and receives the speech recognition result returned by it.
The audio identification module may further comprise a recognition result verification unit, which judges from the user's speech data whether the recognition result of the fingerprint recognition unit is correct and, when it is, sends the audio fingerprint recognition result to the video management server as the search intention and receives the result returned by the video management server.
The terminal further comprises a display module, which presents to the user the result that the recognition result verification unit receives from the video management server and which, according to the type of the returned information, invokes the appropriate multimedia file resources to present the result to the user.
The terminal side of the system has been described above; with reference to Figure 9, the servers and databases of the system are described below.
1. Speech recognition server and speech database.
The speech recognition server receives the speech data sent by the terminal, performs recognition on it, and returns the voice instruction in the speech database that corresponds to the speech data.
Preset voice instructions are stored in the speech database:
Command:{ "keyword 1", "keyword 2", "keyword 3", … }
An instruction can be described with keywords, which may include the video type (e.g. TV series, film, news) and information about the video content (e.g. actors, objects, locations).
2. Audio watermark recognition server and audio watermark database.
This server receives the high-frequency audio data sent by the terminal, parses out the voiceprint code carried in it, matches the parsed voiceprint code against the voiceprint code data in the database to obtain the match with the highest matching degree, and returns the matching result to the terminal.
The data stored in the audio watermark database may adopt the following structure:
{ "ID", "url", "voiceprint code", "TrackID" }
"ID" is the unique identifier of the voiceprint code in the voiceprint code database. "url" is the storage location of the video file corresponding to the voiceprint code. The "voiceprint code" is a binary sequence such as "01010101…"; each video file corresponds to a unique voiceprint code, which in use is embedded in the high-frequency audio data. "TrackID" is the identifier, in the audio fingerprint database, of the fingerprint information corresponding to the video data of this voiceprint code.
3. Audio fingerprint recognition server and audio fingerprint database.
The fingerprint data stored in the fingerprint database may adopt the following structure:
TrackID:{}, fp:{}, "keyword 1":{}, "keyword 2":{}, …
TrackID is the unique identifier of this fingerprint information in the audio fingerprint database. fp is the audio fingerprint data corresponding to the video file, structured as "Hash1", "time1", "Hash2", "time2", "Hash3", "time3", …. Each "keyword N" corresponds one-to-one with a keyword on the speech server and is structured as { "content 1", "time1", "content 2", "time2", "content 3", "time3", … } (for example, if the keyword is "actor", its content identifies the names of the actors appearing in the video at the different playing moments, stored in "content").
The functions of the audio fingerprint recognition server are as follows:
It receives the low-frequency audio data sent by the terminal's video identification module and the recognition result sent by the audio watermark recognition server; extracts the fingerprint information of the target audio from the audio data; extracts the TrackID from the recognition result sent by the audio watermark recognition server and uses it to determine the fingerprint retrieval scope; matches the target fingerprint, within the determined retrieval scope, against the fingerprint information pointed to by the TrackID; obtains the time information of the currently playing video; and returns the matched audio fragment to the video identification module.
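A minimal sketch of this two-stage lookup (narrowing the retrieval scope to the TrackID from the watermark result, then matching hash/time pairs within it) might look like the following; the database contents and hash names are invented for illustration.

```python
# Hypothetical fingerprint database: TrackID -> list of (hash, time) pairs,
# mirroring the fp structure "Hash1","time1","Hash2","time2",...
fingerprint_db = {
    "T001": [("h_aa", 0.0), ("h_bb", 5.0), ("h_cc", 10.0)],
    "T002": [("h_dd", 0.0), ("h_ee", 5.0)],
}

def locate_segment(track_id, query_hashes):
    """Restrict the search to the fingerprints of track_id (supplied by
    the watermark result) and return the time span of matching hashes."""
    times = [t for h, t in fingerprint_db[track_id] if h in query_hashes]
    if not times:
        return None
    return {"TrackID": track_id, "timeStart": min(times), "timeStop": max(times)}

result2 = locate_segment("T001", {"h_bb", "h_cc"})
```

The scope narrowing is what shortens identification time relative to fingerprint-only matching: only the fingerprints of one track are compared, instead of the whole database.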
4. Video management server and video database.
The video database is used to store video files.
The functions of the video management server are as follows:
(1) It receives the recognition result sent by the audio fingerprint recognition unit of the terminal's video identification module; extracts from the result the video index information (URL) and the time-slice information and retrieves the corresponding video fragment from the database; extracts the audio data of this video fragment; and returns this audio data to the speech recognition unit of the terminal's video identification module.
(2) It receives the search intention sent by the terminal's recognition result verification unit and returns the search result to the terminal.
Using the above terminal and servers, and with reference to Figure 10, the process implementing the searching method of this embodiment is described as follows:
Step one:
Open the terminal's recording module and acquire the audio data of the target video and the user's voice information. Once the recording module is opened, the terminal records the sound in the current environment through its recording device, and stops recording only when it receives the video retrieval result sent by the server or when the total recording duration exceeds a preset time T1.
Whenever the recording duration reaches T2 (T2 << T1), the audio of length T2 is encapsulated as one packet, containing both the audio data of the video and the user's voice instruction, and this packet is uploaded to the audio data preprocessing module.
Step two:
This step prepares data for the subsequent video identification. The implementation process is as follows:
1. Receive the audio file of time length T2;
2. Audio format conversion. Call third-party software (e.g. ffmpeg) to convert audio files of different formats into uniform PCM-encoded audio data of time length T2.
3. Extract the high-frequency audio data. Using a high-pass filter (whose passband is consistent with the frequency range occupied by the voiceprint code, assumed to be H1~H2 Hz), obtain the audio data Music1 of time length T2 and frequency range H1~H2 Hz.
4. Extract the low-frequency audio data. Using a low-pass filter, obtain the audio data Music2 of time length T2 and frequency range L1~L2 Hz.
5. Upload Music1 and Music2 to the video identification module.
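Steps 3 and 4 above can be sketched with an ideal FFT-based band split, a stand-in for whatever high-pass and low-pass filters an implementation would actually use; the sample rate, the band edges H1, H2, L2, and the test signal are all assumed values.

```python
import numpy as np

def split_bands(pcm, fs, h1, h2, l2):
    """Split one T2-length PCM frame into Music1 (the H1-H2 Hz band that
    carries the voiceprint code) and Music2 (the band below L2 Hz) by
    zeroing FFT bins outside each band (an ideal brick-wall filter)."""
    spec = np.fft.rfft(pcm)
    freqs = np.fft.rfftfreq(len(pcm), d=1.0 / fs)
    hi = np.where((freqs >= h1) & (freqs <= h2), spec, 0)
    lo = np.where(freqs <= l2, spec, 0)
    return np.fft.irfft(hi, len(pcm)), np.fft.irfft(lo, len(pcm))

fs = 16000
t = np.arange(0, 0.1, 1.0 / fs)  # a 0.1 s frame standing in for T2
# Test signal: a 440 Hz "program" tone plus a 6000 Hz "watermark" tone.
pcm = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 6000 * t)
music1, music2 = split_bands(pcm, fs, h1=5500, h2=6500, l2=4000)
```

A real pipeline would use a proper filter design (e.g. Butterworth) on streaming PCM; the brick-wall split just makes the high/low separation of this step easy to see.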
Step three:
Step three is divided into two stages: stage one, extracting the voice information, and stage two, speech recognition. Stage one is carried out on the terminal and requires the video management server to provide the original soundtrack fragment of the target video; stage two is carried out on the speech recognition server and requires the terminal to provide the user voice data. In this step, the target video is retrieved according to the acquired audio data, yielding the retrieval result of the target video and the user's voice information, from which the user's voice instruction is then recognized. Specifically, the two stages are as follows:
Stage one
1. The video identification module receives the audio fragments (comprising Music1 and Music2) from the audio preprocessing module.
2. Splice the received low-frequency audio data, preparing data for voice extraction.
Each time a low-frequency audio fragment Music2 is received, it is appended in chronological order to form the low-frequency audio data Music3 of time length N*T2 (N is the current total number of audio fragments), until the video retrieval result is received.
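The splicing rule, appending each incoming Music2 fragment in chronological order until the retrieval result arrives, can be sketched as follows; the fragment values are illustrative.

```python
class AudioSplicer:
    """Accumulates successive T2-length low-frequency fragments (Music2),
    in arrival order, into one buffer (Music3) until retrieval finishes."""
    def __init__(self):
        self._chunks = []

    def feed(self, music2_chunk):
        self._chunks.append(list(music2_chunk))

    def music3(self):
        # Chronological concatenation: length is N * T2 for N fragments.
        return [sample for chunk in self._chunks for sample in chunk]

splicer = AudioSplicer()
for chunk in ([0.1, 0.2], [0.3, 0.4], [0.5, 0.6]):  # three T2 fragments, N = 3
    splicer.feed(chunk)
music3 = splicer.music3()
```

Splicing matters because the user may speak across fragment boundaries; only the full Music3 buffer guarantees the speech is intact for the extraction in step 7.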
3. Lock onto the target video using the voiceprint code.
The high-frequency audio data Music1 is uploaded to the voiceprint code recognition server, which extracts and recognizes the voiceprint code information carried in Music1 and returns the recognition result Result1 to the terminal's video identification module. Result1 is the ID of the target video in the video fingerprint database (the TrackID uniquely identifying the video fingerprint).
Result1:{TrackID:“……”};
4. Precisely locate the playing fragment of the currently playing video.
The low-frequency audio data Music2 is uploaded together with Result1 to the audio fingerprint recognition server. The fingerprint recognition server extracts the target video fingerprint information from Music2, matches the target video fingerprint against the fingerprints pointed to by Result1, and returns the matching result Result2 to the terminal's video identification module.
Result2 contains the index information TrackID of the retrieval result in the fingerprint database, the index information URL in the video database, and the time range timeStart to timeStop.
Result2:{TrackID:“……”,URL:“http://……”,timeStart:“……”,timeStop:“……”};
5. The video identification module stops receiving audio fragments and at the same time sends a stop-recording message to the recording module.
6. Extract the original soundtrack audio data of the target video.
Read Result2 and upload it to the video management server; the video management server finds the video file vedio according to the index information, extracts the audio data music of the video, extracts the audio data music_clip of the specific period according to the time information in Result2, and returns music_clip to the terminal's video identification module.
7. Extract the user voice information.
The spliced audio data Music3 obtained above is composed of the following three parts:
Music3 = a1 * original soundtrack audio + a2 * user speech + a3 * ambient noise (a1, a2, a3 are constants)
Assuming the recording conditions are good enough (i.e. a3 ≈ 0), the user speech can be obtained simply as:
user speech = b1 * Music3 - b2 * original soundtrack audio (b1, b2 are constants)
where the original soundtrack audio is the music_clip obtained in step 6:
user speech word = b1 * Music3 - b2 * music_clip;
Stage two
8. Parse the user instruction from the user speech.
word is uploaded to the speech recognition server and matched against the voice instructions in the speech database. The voice instruction Command closest to word is returned to the recognition result verification unit.
Command:{index:{“music,title,belowing……”}}
9. Return the retrieval result according to the user instruction.
It is judged whether the content corresponding to the index information in Command is described in the information about this video segment obtained from Result2. If so, Result2 is uploaded to the video management server, which matches the video file by the URL in Result2 and returns the video file.
In this embodiment, the user voice data is recognized by the speech recognition unit, and the target video is further identified through the recognition result, so that the search system, combining the user's speech data, can return a highly accurate retrieval result to the user.
The above describes the recognition method and searching method for a multimedia file provided by embodiments of the present invention; a recognition device for a multimedia file provided by an embodiment of the present invention is described below.
Embodiment eight
Embodiment eight provides an embodiment of a recognition device for a multimedia file. As shown in Figure 11, the device comprises an acquisition module 810, a first extraction module 820, a first matching module 830, a determination module 840, a second extraction module 850, a second matching module 860 and an identification module 870.
The acquisition module 810 is used to acquire the mixed audio data corresponding to the target multimedia file, where the mixed audio data comprises the audio data of the target multimedia file and the audio watermark data. The target multimedia file may be a video (or audio). While it is being played, the recognition device starts a recording device to record the sound and thus obtains the mixed audio data of this video (or audio). Since audio watermark data has been added to the video (or audio) in advance, the recorded mixed audio data contains both the audio data of the target multimedia file itself and the added audio watermark data. The recognition device may be an intelligent mobile communication terminal, such as a mobile phone or a PAD; it may also be a computer, or an independent recognition unit embedded in a device that needs to perform multimedia file identification.
The first extraction module 820 is used to extract the audio watermark data from the mixed audio data according to the features of the audio watermark data. For example, if the audio watermark data is voiceprint code data, then, because voiceprint code data is specific information added in the high-frequency band of the audio data, the voiceprint code data can be obtained by extracting the high-frequency part of the mixed audio data.
The first matching module 830 is used to match the audio watermark data against preset audio watermark samples to obtain a first matching result. This first matching result may be a multimedia file group composed of multiple multimedia files; that is, by this method the target multimedia file is determined to lie within this multimedia file group.
The determination module 840 is used to determine, among the preset feature samples, the feature sample portion corresponding to the first matching result. With this module, when audio fingerprint matching is performed, the characteristic information of the audio data of the target multimedia file need not be matched against the preset feature samples as a whole; it only needs to be matched against the feature sample portion corresponding to the first matching result.
The second extraction module 850 is used to extract the characteristic information of the audio data of the target multimedia file from the mixed audio data. Any feature extraction method for audio data in the prior art may be adopted; for example, the temporal feature data of the audio, such as the amplitudes of audio fragments, may be extracted, or the time-frequency feature data of the audio may be extracted.
The second matching module 860 is used to match the characteristic information against the feature sample portion to obtain a second matching result, and the identification module 870 is used to identify the target multimedia file according to the second matching result.
The recognition device for a multimedia file provided by this embodiment effectively combines audio-watermark-based video identification with audio-fingerprint-based video identification. Compared with using audio-watermark-based video identification alone, it improves the granularity of identification and widens the range of application; compared with using audio-fingerprint-based video identification alone, it shortens the identification time.
Preferably, the mixed audio data also includes user voice data, and the device further comprises a third extraction module, a third matching module and a verification module. The third extraction module is used to extract the user voice data from the mixed audio data; the third matching module is used to match the user voice data against preset speech samples to obtain a third matching result; and one target multimedia file is selected, according to the third matching result, from the target multimedia files identified according to the second matching result.
With this preferred embodiment, the accuracy of identifying the target multimedia file can be improved by incorporating the user voice data.
Further preferably, the step specifically performed by the first extraction module when extracting the audio watermark data is: extracting the high-frequency audio data from the mixed audio data. The step specifically performed by the second extraction module when extracting the characteristic information is: extracting the characteristic information of the low-frequency audio data in the mixed audio data. The steps specifically performed by the third extraction module when extracting the user voice data are: extracting the low-frequency audio data from the mixed audio data, and removing the audio data of the target multimedia file from it to obtain the user voice data.
Preferably, the second extraction module comprises a left-right channel data extraction module, a stereo data synthesis module and a fingerprint information extraction module. The left-right channel data extraction module is used to extract the left channel data and the right channel data of the low-frequency part of the mixed audio data. The stereo data synthesis module merges the left channel data and the right channel data using the following formula to obtain the stereo data of the low-frequency part: s = a*l + b*r, where a + b = 1, s is the stereo data of the low-frequency part, l is the left channel data of the low-frequency part, r is the right channel data of the low-frequency part, and a and b are preset parameters. The fingerprint information extraction module is used to extract the time-frequency feature data of the stereo data to obtain the fingerprint information of the target multimedia file, where the fingerprint information forms the characteristic information of the audio data of the target multimedia file.
With this preferred embodiment, when identifying the target multimedia file, the acquired audio data of the target multimedia file is stereo data obtained by merging the left channel data and the right channel data, and correspondingly the preset feature samples are also features of stereo data. The source data type of the characteristic information of the target multimedia file is thus consistent with that of the feature samples, both using stereo data, which improves the accuracy of identification. Moreover, when merging the left and right channel data into stereo data, the weight parameters a and b allow the proportions of the left and right channel data in the stereo data to be adjusted according to actual needs.
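The merge formula s = a*l + b*r with a + b = 1 can be sketched as follows; the channel data and the weights are illustrative values, not taken from the patent.

```python
import numpy as np

def merge_stereo(l, r, a=0.5, b=0.5):
    """Merge left and right channel data into one stream, s = a*l + b*r,
    under the constraint a + b = 1."""
    assert abs((a + b) - 1.0) < 1e-9
    return a * np.asarray(l) + b * np.asarray(r)

left = np.array([1.0, 0.0, 0.5])
right = np.array([0.0, 1.0, 0.5])
s = merge_stereo(left, right, a=0.7, b=0.3)  # emphasize the left channel
```

The a + b = 1 constraint keeps the merged signal's scale comparable to the originals, so fingerprints computed from it remain comparable to the preset stereo feature samples.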
Further preferably, if the target multimedia file is a sub-multimedia file of a second multimedia file, the first matching result is the identification information of the second multimedia file, the second matching result is the identification information of the target multimedia file, the feature samples are at least one multimedia record stored in a preset feature database, and each multimedia record comprises the fingerprint information of a multimedia file and the identification information of the multimedia file corresponding to that fingerprint information. Then the step specifically performed by the determination module when determining the feature sample portion is: locating, in the feature database, the one or more multimedia records corresponding to the identification information of the second multimedia file; and the step specifically performed by the second matching module when obtaining the second matching result is: matching the fingerprint information of the target multimedia file against the one or more located multimedia records to determine the identification information of the target multimedia file.
Further preferably, the stereo data of the low-frequency part is N pieces of stereo data, where the i-th piece of stereo data is s_i = a_i*l + b_i*r, with a_i + b_i = 1 and i = 1, 2, 3, …, N. The second matching module then comprises a matching rate determination module and an identification information determination module. The matching rate determination module is used to match the time-frequency feature data of each piece of stereo data against the one or more located multimedia records to obtain the multiple matching rates corresponding to the stereo data; the identification information determination module is used to determine the identification information of the target multimedia file according to the multimedia record corresponding to the maximum value among the multiple matching rates.
With this preferred embodiment, when establishing the characteristic information of the target multimedia file, multiple groups of weight parameters are provided, transforming the left and right channel data of the target multimedia file into multiple pieces of stereo data, and the characteristic information corresponding to each piece of stereo data is computed, so that the characteristic information of the target multimedia file comprises multiple groups of features. When identifying the target multimedia file, each group of characteristic information is matched against the located multimedia file records, and the target multimedia file is identified from the multimedia file record corresponding to the maximum matching rate, further increasing the accuracy of identification.
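A minimal sketch of this multi-weight matching follows; it assumes a cosine-similarity stand-in for the patent's unspecified matching rate, and the weight pairs, channel data, and candidate records are invented for the example.

```python
import numpy as np

def identify_by_best_mix(l, r, weights, records, match_rate):
    """Form one mix s_i = a_i*l + b_i*r per weight pair (a_i + b_i = 1),
    score each mix against every candidate record, and return the
    identification info of the record with the highest matching rate."""
    best_rate, best_id = -1.0, None
    for a, b in weights:
        mix = a * l + b * r
        for rec in records:
            rate = match_rate(mix, rec["features"])
            if rate > best_rate:
                best_rate, best_id = rate, rec["id"]
    return best_id

def cosine(x, y):
    """Toy matching rate: cosine similarity of feature vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

l = np.array([1.0, 0.0, 1.0])
r = np.array([0.0, 1.0, 1.0])
records = [
    {"id": "clip_A", "features": np.array([1.0, 0.0, 1.0])},  # left-heavy
    {"id": "clip_B", "features": np.array([0.0, 1.0, 0.0])},
]
weights = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]  # N = 3 weight pairs
winner = identify_by_best_mix(l, r, weights, records, cosine)
```

Trying several weightings hedges against not knowing how the reference fingerprints were mixed: whichever mix best resembles the stored features wins.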
The above are merely preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person familiar with this technology could readily conceive within the technical scope disclosed by the present invention shall be encompassed within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A recognition method for a multimedia file, characterized in that it comprises:
acquiring mixed audio data corresponding to a target multimedia file, wherein the mixed audio data comprises audio data of the target multimedia file and audio watermark data;
extracting the audio watermark data from the mixed audio data;
matching the audio watermark data against preset audio watermark samples to obtain a first matching result;
determining, among preset feature samples, the feature sample portion corresponding to the first matching result;
extracting characteristic information of the audio data of the target multimedia file from the mixed audio data;
matching the characteristic information against the feature sample portion to obtain a second matching result; and
identifying the target multimedia file according to the second matching result.
2. The recognition method for a multimedia file according to claim 1, characterized in that the mixed audio data further comprises user voice data, and the method further comprises:
extracting the user voice data from the mixed audio data;
matching the user voice data against preset speech samples to obtain a third matching result; and
selecting, according to the third matching result, one target multimedia file from the target multimedia files identified according to the second matching result.
3. The recognition method for a multimedia file according to claim 2, characterized in that:
extracting the audio watermark data from the mixed audio data comprises: extracting the high-frequency audio data from the mixed audio data;
extracting the characteristic information of the audio data of the target multimedia file from the mixed audio data comprises: extracting the characteristic information of the low-frequency audio data in the mixed audio data; and
extracting the user voice data from the mixed audio data comprises: extracting the low-frequency audio data from the mixed audio data, and removing the audio data of the target multimedia file from the low-frequency audio data to obtain the user voice data.
4. The recognition method for a multimedia file according to claim 1, characterized in that extracting the characteristic information of the audio data of the target multimedia file from the mixed audio data comprises:
extracting left channel data and right channel data of the low-frequency part of the mixed audio data;
merging the left channel data and the right channel data using the formula s = a*l + b*r to obtain stereo data of the low-frequency part, wherein a + b = 1, s is the stereo data of the low-frequency part, l is the left channel data of the low-frequency part, r is the right channel data of the low-frequency part, and a and b are preset parameters; and
extracting time-frequency feature data of the stereo data to obtain fingerprint information of the target multimedia file, wherein the fingerprint information forms the characteristic information of the audio data of the target multimedia file.
5. The recognition method for a multimedia file according to claim 4, characterized in that, if the target multimedia file is a sub-multimedia file of a second multimedia file, the first matching result is identification information of the second multimedia file, the second matching result is identification information of the target multimedia file, the feature samples are at least one multimedia record stored in a preset feature database, and each multimedia record comprises fingerprint information of a multimedia file and identification information of the multimedia file corresponding to that fingerprint information, then:
determining, among the preset feature samples, the feature sample portion corresponding to the first matching result comprises: locating, in the feature database, one or more multimedia records corresponding to the identification information of the second multimedia file; and
matching the characteristic information against the feature sample portion to obtain the second matching result comprises: matching the fingerprint information of the target multimedia file against the one or more located multimedia records to determine the identification information of the target multimedia file.
6. The recognition method for a multimedia file according to claim 5, characterized in that the stereo data of the low-frequency part is N pieces of stereo data, wherein the i-th piece of stereo data is s_i = a_i*l + b_i*r, with a_i + b_i = 1 and i = 1, 2, 3, …, N; then matching the fingerprint information of the target multimedia file against the one or more located multimedia records to determine the identification information of the target multimedia file comprises:
matching the time-frequency feature data of each piece of stereo data against the one or more located multimedia records to obtain multiple matching rates corresponding to the stereo data; and
determining the identification information of the target multimedia file according to the multimedia record corresponding to the maximum value among the multiple matching rates.
7. A recognition device for a multimedia file, characterized in that it comprises:
an acquisition module, for acquiring mixed audio data corresponding to a target multimedia file, wherein the mixed audio data comprises audio data of the target multimedia file and audio watermark data;
a first extraction module, for extracting the audio watermark data from the mixed audio data;
a first matching module, for matching the audio watermark data against preset audio watermark samples to obtain a first matching result;
a determination module, for determining, among preset feature samples, the feature sample portion corresponding to the first matching result;
a second extraction module, for extracting characteristic information of the audio data of the target multimedia file from the mixed audio data;
a second matching module, for matching the characteristic information against the feature sample portion to obtain a second matching result; and
an identification module, for identifying the target multimedia file according to the second matching result.
8. The recognition device for a multimedia file according to claim 7, characterized in that the mixed audio data further comprises user voice data, and the device further comprises:
a third extraction module, for extracting the user voice data from the mixed audio data;
a third matching module, for matching the user voice data against preset speech samples to obtain a third matching result; and
a verification module, for selecting, according to the third matching result, one target multimedia file from the target multimedia files identified according to the second matching result.
9. The multimedia file recognition device according to claim 8, characterized in that:
The step specifically performed by the first extraction module when extracting the audio watermark data is: extracting the audio data of the high-frequency part of the mixed audio data;
The step specifically performed by the second extraction module when extracting the feature information is: extracting feature information of the audio data of the low-frequency part of the mixed audio data;
The step specifically performed by the third extraction module when extracting the user voice data is: extracting the audio data of the low-frequency part of the mixed audio data, and removing the audio data of the target multimedia file from the low-frequency audio data, to obtain the user voice data.
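The band separation described in claim 9 (watermark in the high-frequency part, programme audio and its fingerprintable features in the low-frequency part) can be illustrated with a simple FFT brick-wall split. The 4 kHz cutoff and 16 kHz sample rate below are assumptions for the sketch, not values from the patent.

```python
import numpy as np

def split_bands(mix: np.ndarray, sample_rate: int, cutoff_hz: float = 4000.0):
    """Return (low_band, high_band) of a mono signal via an FFT brick-wall split."""
    spectrum = np.fft.rfft(mix)
    freqs = np.fft.rfftfreq(mix.size, d=1.0 / sample_rate)
    low_spec = spectrum.copy()
    low_spec[freqs > cutoff_hz] = 0.0      # keep programme audio below cutoff
    high_spec = spectrum - low_spec        # keep the watermark carrier above it
    return (np.fft.irfft(low_spec, n=mix.size),
            np.fft.irfft(high_spec, n=mix.size))

# Toy mix: a loud 440 Hz "programme" tone plus a quiet 6 kHz "watermark" carrier
sr = 16000
t = np.arange(sr) / sr
mix = np.sin(2 * np.pi * 440 * t) + 0.1 * np.sin(2 * np.pi * 6000 * t)
low, high = split_bands(mix, sr)
# low is dominated by the 440 Hz tone, high by the 6 kHz carrier
```

The two bands sum back to the original mix, which is also what makes the subtraction step of claim 9 (removing the known programme audio from the low band to recover the user's voice) plausible in principle.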
10. The multimedia file recognition device according to claim 7, characterized in that the second extraction module comprises:
A left/right channel extraction module, configured to extract left channel data and right channel data of the low-frequency part of the mixed audio data;
A stereo data synthesis module, configured to merge the left channel data and the right channel data using the formula s = a*l + b*r, where a + b = 1, s is the stereo data of the low-frequency part, l is the left channel data of the low-frequency part, r is the right channel data of the low-frequency part, and a and b are preset parameters; and
A fingerprint extraction module, configured to extract time-frequency feature data of the stereo data to obtain fingerprint information of the target multimedia file, wherein the fingerprint information constitutes the feature information of the audio data of the target multimedia file.
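The claimed downmix s = a*l + b*r with a + b = 1 is an affine blend of the two channels; a = b = 0.5 gives the familiar mono downmix. A minimal sketch (weights chosen for illustration, not taken from the patent):

```python
import numpy as np

def downmix(left: np.ndarray, right: np.ndarray, a: float = 0.5) -> np.ndarray:
    """Blend two channels as s = a*l + b*r under the claim's constraint a + b = 1."""
    b = 1.0 - a
    return a * left + b * right

l = np.array([1.0, 0.0, -1.0])
r = np.array([0.0, 1.0, 1.0])
print(downmix(l, r))  # equal 50/50 blend of the two channels
```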
11. The multimedia file recognition device according to claim 10, characterized in that, if the target multimedia file is a sub-multimedia file of a second multimedia file, the first matching result is identification information of the second multimedia file, the second matching result is identification information of the target multimedia file, the feature samples are at least one multimedia record stored in a preset feature database, and each multimedia record comprises fingerprint information of a multimedia file and identification information of the multimedia file corresponding to that fingerprint information, then:
The step specifically performed by the determination module when determining the feature sample portion is: locating, in the feature database, one or more multimedia records corresponding to the identification information of the second multimedia file;
The step specifically performed by the second matching module when obtaining the second matching result is: matching the fingerprint information of the target multimedia file against the one or more located multimedia records, to determine the identification information of the target multimedia file.
12. The multimedia file recognition device according to claim 11, characterized in that the stereo data of the low-frequency part consists of N stereo data streams, wherein the i-th stereo data stream is s_i = a_i*l + b_i*r, with a_i + b_i = 1 and i = 1, 2, 3, ..., N, and the second matching module comprises:
A matching rate determination module, configured to match the time-frequency feature data of each stereo data stream against the one or more located multimedia records, to obtain a plurality of matching rates corresponding to the stereo data; and
An identification information determination module, configured to determine the identification information of the target multimedia file according to the multimedia record corresponding to the maximum value among the plurality of matching rates.
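The selection step of claim 12 can be sketched as a maximum search over matching rates: each of the N downmixes is matched against the located records, and the record behind the highest rate determines the file. The token-overlap matcher below stands in for real time-frequency fingerprint comparison; all data is hypothetical.

```python
def match_rate(query_tokens: set, record_tokens: set) -> float:
    """Fraction of the query's fingerprint tokens found in the record."""
    return len(query_tokens & record_tokens) / len(query_tokens)

# Located multimedia records: file ID -> fingerprint tokens
records = {"episode-A1": {"x", "y", "z"}, "episode-A2": {"x", "q", "r"}}
# Fingerprint tokens extracted from N = 3 stereo downmixes s_i
queries = [{"x", "y"}, {"x", "q"}, {"y", "z"}]

# Keep the record with the maximum matching rate over all downmixes
best_id, best_rate = None, -1.0
for q in queries:
    for rec_id, toks in records.items():
        rate = match_rate(q, toks)
        if rate > best_rate:
            best_id, best_rate = rec_id, rate

print(best_id, best_rate)  # episode-A1 1.0
```

Trying several (a_i, b_i) blends and keeping the best match makes the identification robust to how the original programme was panned across the two channels.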
CN201410849018.9A 2014-12-29 2014-12-29 Identification method and device for multimedia file Pending CN104598541A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410849018.9A CN104598541A (en) 2014-12-29 2014-12-29 Identification method and device for multimedia file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410849018.9A CN104598541A (en) 2014-12-29 2014-12-29 Identification method and device for multimedia file

Publications (1)

Publication Number Publication Date
CN104598541A true CN104598541A (en) 2015-05-06

Family

ID=53124326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410849018.9A Pending CN104598541A (en) 2014-12-29 2014-12-29 Identification method and device for multimedia file

Country Status (1)

Country Link
CN (1) CN104598541A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1728157A (en) * 2004-07-30 2006-02-01 英特尔公司 Pattern matching architecture
CN102075797A (en) * 2010-12-29 2011-05-25 深圳市同洲电子股份有限公司 Channel or program voice browsing method and digital television receiving terminal
CN102273200A (en) * 2008-11-07 2011-12-07 数字标记公司 Content interaction methods and systems employing portable devices
CN203522960U (en) * 2013-07-16 2014-04-02 湖南大学 Multimedia playing device with functions of voice controlling and humming searching
CN103747277A (en) * 2014-01-10 2014-04-23 北京酷云互动科技有限公司 Multimedia program identification method and device


Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11783862B2 (en) 2014-12-19 2023-10-10 Snap Inc. Routing messages by message parameter
US11902287B2 (en) 2015-03-18 2024-02-13 Snap Inc. Geo-fence authorization provisioning
CN104936022A (en) * 2015-06-03 2015-09-23 无锡天脉聚源传媒科技有限公司 Video identification method and apparatus
CN105430494A (en) * 2015-12-02 2016-03-23 百度在线网络技术(北京)有限公司 Method and device for identifying audio from video in video playback equipment
CN105554590B (en) * 2015-12-10 2018-12-04 杭州当虹科技有限公司 A kind of live broadcast stream media identifying system based on audio-frequency fingerprint
CN105554590A (en) * 2015-12-10 2016-05-04 杭州当虹科技有限公司 Live streaming media recognition system based on audio fingerprint
CN105898381A (en) * 2015-12-15 2016-08-24 乐视致新电子科技(天津)有限公司 Content transmission method, content play method, content server and intelligent equipment
CN105916021A (en) * 2015-12-15 2016-08-31 乐视致新电子科技(天津)有限公司 Audio and video identification method based on ultrasonic waves and audio and video identification system thereof
US11317142B2 (en) 2016-02-29 2022-04-26 Roku, Inc. Media channel identification with multi-match detection and disambiguation based on location
US11336956B2 (en) 2016-02-29 2022-05-17 Roku, Inc. Media channel identification with multi-match detection and disambiguation based on single-match
US11463765B2 (en) 2016-02-29 2022-10-04 Roku, Inc. Media channel identification and action with multi-match detection based on reference stream comparison
CN110891186A (en) * 2016-02-29 2020-03-17 格雷斯诺特公司 Media presentation device
US11432037B2 (en) 2016-02-29 2022-08-30 Roku, Inc. Method and system for detecting and responding to changing of media channel
US11627372B2 (en) 2016-02-29 2023-04-11 Roku, Inc. Media channel identification with multi-match detection and disambiguation based on single-match
US11617009B2 (en) 2016-02-29 2023-03-28 Roku, Inc. Media channel identification and action with multi-match detection and disambiguation based on matching with differential reference-fingerprint feature
US11206447B2 (en) 2016-02-29 2021-12-21 Roku, Inc. Media channel identification with multi-match detection and disambiguation based on time of broadcast
US11290776B2 (en) 2016-02-29 2022-03-29 Roku, Inc. Media channel identification and action with multi-match detection and disambiguation based on matching with differential reference-fingerprint feature
US11412296B2 (en) 2016-02-29 2022-08-09 Roku, Inc. Media channel identification with video multi-match detection and disambiguation based on audio fingerprint
CN107241617A (en) * 2016-03-29 2017-10-10 北京新媒传信科技有限公司 The recognition methods of video file and device
US10719551B2 (en) 2016-04-19 2020-07-21 Tencent Technology (Shenzhen) Company Limited Song determining method and device and storage medium
WO2017181852A1 (en) * 2016-04-19 2017-10-26 腾讯科技(深圳)有限公司 Song determining method and device, and storage medium
CN106055570A (en) * 2016-05-19 2016-10-26 中国农业大学 Video retrieval device based on audio data and video retrieval method for same
CN110462616B (en) * 2017-03-27 2023-02-28 斯纳普公司 Method for generating a spliced data stream and server computer
CN110462616A (en) * 2017-03-27 2019-11-15 斯纳普公司 Generate splicing data flow
US11558678B2 (en) 2017-03-27 2023-01-17 Snap Inc. Generating a stitched data stream
CN109803173B (en) * 2017-11-16 2022-08-19 腾讯科技(深圳)有限公司 Video transcoding method and device and storage device
CN109803173A (en) * 2017-11-16 2019-05-24 腾讯科技(深圳)有限公司 A kind of video transcoding method, device and storage equipment
CN110062291A (en) * 2019-04-29 2019-07-26 王子孟 A kind of digital watermarking addition and extracting method, apparatus and system
CN112394224B (en) * 2020-11-04 2021-08-10 武汉大学 Audio file generation time tracing dynamic matching method and system
CN112394224A (en) * 2020-11-04 2021-02-23 武汉大学 Audio file generation time tracing dynamic matching method and system

Similar Documents

Publication Publication Date Title
CN104598541A (en) Identification method and device for multimedia file
US10025841B2 (en) Play list generation method and apparatus
TWI333380B (en) A system and method for providing user control over repeating objects embedded in a stream
JP4045330B2 (en) Reproducing apparatus having a function of communicating with a text display and title remote database
CN104572952B (en) The recognition methods of live multimedia file and device
KR100676863B1 (en) System and method for providing music search service
US20050144455A1 (en) Fast hash-based multimedia object metadata retrieval
US20050197724A1 (en) System and method to generate audio fingerprints for classification and storage of audio clips
US20100023328A1 (en) Audio Recognition System
US20070199037A1 (en) Broadcast program content retrieving and distributing system
WO2004081817A1 (en) Improved data retrieval method and system
US9426411B2 (en) Method and apparatus for generating summarized information, and server for the same
CN102411578A (en) Multimedia playing system and method
US20160203824A1 (en) Audio signal communication method and system thereof
US20050091271A1 (en) Systems and methods that schematize audio/video data
US20090271413A1 (en) Trial listening content distribution system and terminal apparatus
US20130132842A1 (en) Systems and methods for user interaction
US20140078331A1 (en) Method and system for associating sound data with an image
WO2009078613A1 (en) Index database creating apparatus and index database retrieving apparatus
JP2009147775A (en) Program reproduction method, apparatus, program, and medium
KR20090003375A (en) A method for playback of contents appropriate to context of mobile communication terminal
CN104202628B (en) The identifying system and method for client terminal playing program
Hellmuth et al. Using MPEG-7 audio fingerprinting in real-world applications
US20150026147A1 (en) Method and system for searches of digital content
CN101341721B (en) Share the method and apparatus of data content between transmitter and receiver

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20190507