CN110047515A

CN110047515A - Audio recognition method, apparatus, device, and storage medium

Info

Publication number: CN110047515A
Application number: CN201910270746.7A
Authority: CN
Inventors: 鲁霄
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date: 2019-04-04
Filing date: 2019-04-04
Publication date: 2019-07-23
Anticipated expiration: 2039-04-04
Also published as: WO2020199384A1; CN110047515B

Abstract

The embodiments of the present invention disclose an audio recognition method, apparatus, device, and storage medium. An embodiment may extract the audio fingerprint of the audio to be identified as a benchmark fingerprint and calculate the similarity between the benchmark fingerprint and the audio fingerprints in a preset fingerprint library; screen a candidate fingerprint set out of the fingerprint library according to that similarity; select a reference fingerprint from the candidate fingerprint set and obtain the unisonance fingerprints of the reference fingerprint; and select the target audio corresponding to the audio to be identified from among the audio corresponding to the reference fingerprint and its unisonance fingerprints. The scheme makes audio recognition more fine-grained and yields a more accurate target audio.

Description

Audio recognition method, apparatus, device, and storage medium

Technical field

The present invention relates to the field of communication technology, and in particular to an audio recognition method, apparatus, device, and storage medium.

Background

Song recognition gives music lovers a very convenient way to search for the music they like: the user only needs to record the music playing in the environment, or hum a fragment of a song, and input it into the application software to find out which song it is. Current song recognition mainly searches a massive song library according to the feature information of the input song and selects the song most similar to the input.

In researching and practicing the prior art, the inventors of the present invention found that an audio fragment uploaded by a user may correspond to audio of multiple versions, while the audio recognition process of current music platforms is coarse and does not consider the differences between versions. As a result, the song that a music platform selects from the user's fragment may not be the true source of the fragment, and may not be what the user actually wants. The accuracy of current audio recognition is therefore poor.

Summary of the invention

The embodiments of the present invention provide an audio recognition method, apparatus, device, and storage medium, intended to improve the accuracy of audio recognition.

An embodiment of the present invention provides an audio recognition method, comprising:

extracting the audio fingerprint of the audio to be identified as a benchmark fingerprint, and calculating the similarity between the benchmark fingerprint and the audio fingerprints in a preset fingerprint library;

screening a candidate fingerprint set out of the fingerprint library according to the similarity between the benchmark fingerprint and the audio fingerprints in the fingerprint library;

selecting a reference fingerprint from the candidate fingerprint set, and obtaining the unisonance fingerprints of the reference fingerprint;

selecting, from the audio corresponding to the reference fingerprint and its unisonance fingerprints, the target audio corresponding to the audio to be identified.

In some embodiments, obtaining the unisonance fingerprints of the reference fingerprint comprises:

calculating the overlap degree between the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set;

selecting, according to the overlap degree, the unisonance fingerprints of the reference fingerprint from the other candidate fingerprints.

In some embodiments, calculating the overlap degree between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set comprises:

obtaining the longest common subsequence of the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set, and counting the length of the longest common subsequence;

calculating, according to the length of the longest common subsequence, the overlap degree between the reference fingerprint and the other candidate fingerprint.

In some embodiments, selecting the unisonance fingerprints of the reference fingerprint from the other candidate fingerprints according to the overlap degree comprises:

screening out, from the other candidate fingerprints, the candidate fingerprints whose overlap degree with the reference fingerprint is greater than or equal to a preset threshold, as the unisonance fingerprints of the reference fingerprint.

In some embodiments, the method further comprises:

if no candidate fingerprint whose overlap degree with the reference fingerprint is greater than or equal to the preset threshold is found, determining the audio corresponding to the reference fingerprint as the target audio corresponding to the audio to be identified.

In some embodiments, selecting a reference fingerprint from the candidate fingerprint set comprises:

determining, as the reference fingerprint, the candidate fingerprint in the candidate fingerprint set whose similarity to the benchmark fingerprint is the largest.

In some embodiments, calculating the similarity between the benchmark fingerprint and the audio fingerprints in the preset fingerprint library comprises:

counting, for each audio fingerprint in the preset fingerprint library, the number of identical hash values that the benchmark fingerprint and that audio fingerprint contain;

calculating, according to the number of identical hash values, the similarity between the benchmark fingerprint and each audio fingerprint in the fingerprint library.

In some embodiments, selecting the target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and its unisonance fingerprints comprises:

taking the audio corresponding to the reference fingerprint and its unisonance fingerprints as unisonance audio, and obtaining the version information of the unisonance audio;

determining the version priority of the unisonance audio according to the version information;

taking the unisonance audio with the highest version priority as the target audio corresponding to the audio to be identified.

In addition, an embodiment of the present invention further provides an audio recognition apparatus, comprising:

a fingerprint unit, configured to extract the audio fingerprint of the audio to be identified as a benchmark fingerprint and to calculate the similarity between the benchmark fingerprint and the audio fingerprints in a preset fingerprint library;

a candidate unit, configured to screen a candidate fingerprint set out of the fingerprint library according to the similarity between the benchmark fingerprint and the audio fingerprints in the fingerprint library;

a unisonance unit, configured to select a reference fingerprint from the candidate fingerprint set and to obtain the unisonance fingerprints of the reference fingerprint;

an audio unit, configured to select, from the audio corresponding to the reference fingerprint and its unisonance fingerprints, the target audio corresponding to the audio to be identified.

In addition, an embodiment of the present invention further provides an audio recognition device comprising a memory, a processor, and an audio recognition program that is stored in the memory and can run on the processor. When executed by the processor, the audio recognition program implements the steps of any audio recognition method provided by the embodiments of the present invention.

In some embodiments, the audio recognition device further comprises an audio collection device for collecting the audio to be identified.

In addition, an embodiment of the present invention further provides a storage medium storing a plurality of instructions, the instructions being suitable for loading by a processor to execute the steps of any audio recognition method provided by the embodiments of the present invention.

The embodiments of the present invention extract the audio fingerprint of the audio to be identified as a benchmark fingerprint and calculate the similarity between the benchmark fingerprint and the audio fingerprints in a preset fingerprint library; screen a candidate fingerprint set out of the fingerprint library according to that similarity; select a reference fingerprint from the candidate fingerprint set and obtain the unisonance fingerprints of the reference fingerprint; and select, from the audio corresponding to the reference fingerprint and its unisonance fingerprints, the target audio corresponding to the audio to be identified. Thus, after the scheme retrieves candidate fingerprints that approximate the benchmark fingerprint, the candidates match the benchmark fingerprint but may still be uncertain, for example because the audio to be identified exists in several versions. The scheme therefore goes on to select a reference fingerprint from the candidate fingerprint set and then, by calculating overlap degrees, selects unisonance fingerprints from the other candidate fingerprints in the set, which screens the candidates further. The reference fingerprint and its unisonance fingerprints obtained through this repeated screening are the audio fingerprints that are closest to the benchmark fingerprint of the audio to be identified and whose corresponding audio is identical, or can be regarded as identical. The target audio selected from the audio corresponding to the reference fingerprint and its unisonance fingerprints is therefore the audio of the optimal version and can serve as the true source of the audio to be identified, which ensures the accuracy of both the content and the version of the target audio and improves the overall efficiency and user experience of audio recognition. By screening the audio fingerprints in the fingerprint library layer by layer, the scheme refines the granularity of audio recognition, so that retrieval yields a more accurate target audio.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1a is a schematic diagram of a scenario of the information interaction system provided by an embodiment of the present invention;

Fig. 1b is a flow diagram of the audio recognition method provided by an embodiment of the present invention;

Fig. 2a is a schematic diagram of an audio recognition scenario provided by an embodiment of the present invention;

Fig. 2b is a schematic diagram of a candidate fingerprint set provided by an embodiment of the present invention;

Fig. 2c is a schematic diagram of a recognition result display interface provided by an embodiment of the present invention;

Fig. 3 is a structural diagram of the audio recognition apparatus provided by an embodiment of the present invention;

Fig. 4a is a structural diagram of the audio recognition device provided by an embodiment of the present invention;

Fig. 4b is a structural diagram of another audio recognition device provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

The embodiments of the present invention provide an audio recognition method, apparatus, device, and storage medium.

As shown in Fig. 1a, an embodiment of the present invention provides an information interaction system that includes the audio recognition apparatus provided by any embodiment of the present invention. The audio recognition apparatus may be integrated in a device such as a server. The system may also include other devices, for example a client. The client may be a device such as a terminal or a personal computer (PC, Personal Computer) and is used to collect the audio to be identified and/or upload it to the server.

The client sends a recording or a local audio file to the server as the audio to be identified and requests audio recognition. The server receives the audio to be identified from the client, extracts its audio fingerprint as a benchmark fingerprint, and calculates the similarity between the benchmark fingerprint and the audio fingerprints in a preset fingerprint library. According to that similarity, it screens a candidate fingerprint set out of the fingerprint library. It then selects a reference fingerprint from the candidate fingerprint set and obtains the unisonance fingerprints of the reference fingerprint. Finally, from the audio corresponding to the reference fingerprint and its unisonance fingerprints, it selects the target audio corresponding to the audio to be identified.

Thus, after the scheme retrieves candidate fingerprints that approximate the benchmark fingerprint, the candidates match the benchmark fingerprint but may still be uncertain, for example because the audio to be identified exists in several versions. The scheme therefore goes on to select a reference fingerprint from the candidate fingerprint set and then, by calculating overlap degrees, selects unisonance fingerprints from the other candidate fingerprints in the set, which screens the candidates further. The reference fingerprint and its unisonance fingerprints obtained through this repeated screening are the audio fingerprints that are closest to the benchmark fingerprint of the audio to be identified and whose corresponding audio is identical, or can be regarded as identical. The target audio selected from the audio corresponding to the reference fingerprint and its unisonance fingerprints is therefore the audio of the optimal version and can serve as the true source of the audio to be identified, which ensures the accuracy of both the content and the version of the target audio and improves the overall efficiency and user experience of audio recognition. By screening the audio fingerprints in the fingerprint library layer by layer, the scheme refines the granularity of audio recognition, so that retrieval yields a more accurate target audio.

Each is described in detail below.

This embodiment is described from the perspective of the audio recognition apparatus, which may be integrated in a network device. The network device may be a terminal, a server, or similar equipment, where the terminal may be a mobile phone, a tablet computer, a notebook computer, a personal computer (PC, Personal Computer), or the like.

As shown in Fig. 1b, the audio recognition method may proceed as follows:

101. Obtain the audio fingerprint of the audio to be identified as the benchmark fingerprint, and calculate the similarity between the benchmark fingerprint and the audio fingerprints in the preset fingerprint library.

The preset fingerprint library stores the audio fingerprint of each audio in the audio library, together with the mapping between each audio fingerprint and the corresponding audio in the audio library. For example, the audio recognition apparatus may extract the audio fingerprint of every audio in the audio library in advance, store the extracted fingerprints in the fingerprint library, and record the mapping between each audio and its audio fingerprint.
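As a rough illustration of such a fingerprint library, the sketch below keeps each fingerprint next to the identifier of the audio it was extracted from, so that the fingerprint-to-audio mapping can be read back later. The names `LibraryEntry`, `audio_id`, and `build_fingerprint_library`, and the idea of passing the extractor in as a function, are assumptions made for illustration and do not come from the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LibraryEntry:
    audio_id: str    # hypothetical identifier of an audio in the audio library
    hash_seq: tuple  # the audio fingerprint, characterized as a hash sequence


def build_fingerprint_library(audio_library, extract_fingerprint):
    """Extract a fingerprint for every audio and keep the fingerprint-to-audio mapping.

    audio_library maps an audio identifier to its samples; extract_fingerprint
    is whatever extraction routine the system uses (passed in as an assumption).
    """
    return [
        LibraryEntry(audio_id, tuple(extract_fingerprint(samples)))
        for audio_id, samples in audio_library.items()
    ]
```

Storing the identifier inside each entry is only one way to record the mapping; a separate lookup table keyed by fingerprint would serve equally well.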

For example, the audio recognition apparatus obtains the audio to be identified, extracts its audio fingerprint, and uses that fingerprint as the benchmark fingerprint for querying the audio fingerprints that are closest or most similar to it.

In some embodiments, the audio recognition apparatus may receive an audio recognition request and obtain the audio to be identified, then perform audio fingerprint extraction on the audio to be identified to obtain a hash sequence and use the hash sequence as the benchmark fingerprint.

For example, a user may input an audio recognition request through a client. After receiving the request, the audio recognition apparatus notifies the client to start audio collection, so as to record the user's humming or the sound in the environment. The resulting recording is the audio to be identified for this request. Of course, the user may also upload audio stored locally on the client, or audio downloaded from the network, to the audio recognition apparatus. In either case, the apparatus obtains the audio recognition request and its corresponding audio to be identified.

The client may be a recording device, or a terminal device with an audio collection function such as a mobile phone, a tablet, or a personal computer.

Then, the audio recognition apparatus performs audio fingerprint extraction on the audio signal of the audio to be identified to obtain its audio fingerprint, which contains the audio feature information of the audio to be identified. Audio fingerprint extraction may specifically include framing, windowing, FFT (Fast Fourier Transform) frequency-domain transformation, local peak extraction, and conversion to a hash sequence.

Specifically, after obtaining the audio to be identified, the apparatus frames and windows its audio signal. Framing cuts the whole audio signal into multiple segments according to a preset rule, each segment being one frame, so that the signal is stationary at the microscopic level and provides a stationary input for later audio signal processing. The apparatus then applies a preset window function, such as a Hamming window, to each frame, so that the framed audio signal is more continuous and exhibits the characteristics of a periodic function.

Next, the apparatus performs an FFT frequency-domain transformation on each frame of the audio signal to obtain a spectrum containing frequency-domain information. It then extracts the local peaks in the spectrum and converts them into a hash sequence; this hash sequence is the audio fingerprint of the audio to be identified. Note that the hash sequence may include multiple hash values.
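The extraction pipeline described above (framing, windowing, frequency-domain transformation, local peak extraction, hash conversion) can be sketched roughly as follows. This is a minimal standard-library illustration, not the patented implementation: the frame length, hop size, the choice of one peak per frame, and the scheme that packs two consecutive peak bins into one integer are all assumptions, and a naive DFT stands in for the FFT (it yields the same magnitudes, only more slowly).

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Framing: cut the signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def hamming(n):
    """Coefficients of a Hamming window for a frame of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..n/2 (stand-in for an FFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * b * k / n)
                    for k, x in enumerate(frame)))
            for b in range(n // 2 + 1)]


def strongest_peak(spectrum):
    """Bin index of the largest local maximum in the spectrum, or None."""
    peaks = [b for b in range(1, len(spectrum) - 1)
             if spectrum[b - 1] < spectrum[b] >= spectrum[b + 1]]
    return max(peaks, key=lambda b: spectrum[b]) if peaks else None


def extract_fingerprint(signal, frame_len=64, hop=32):
    """Return a hash sequence built from per-frame spectral peaks."""
    window = hamming(frame_len)
    peak_bins = []
    for frame in split_frames(signal, frame_len, hop):
        windowed = [x * w for x, w in zip(frame, window)]
        peak = strongest_peak(magnitude_spectrum(windowed))
        if peak is not None:
            peak_bins.append(peak)
    # Hypothetical hashing scheme: pack each pair of consecutive peak bins
    # into a single integer hash value.
    return [(a << 16) | b for a, b in zip(peak_bins, peak_bins[1:])]
```

A production system would normally keep several peaks per frame and encode time offsets between them; the single-peak version here only shows how the stages connect.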

The audio recognition apparatus uses the audio fingerprint of the audio to be identified as the benchmark fingerprint and calculates the similarity between the benchmark fingerprint and the audio fingerprints in the preset fingerprint library, thereby retrieving or matching audio fingerprints.

In some embodiments, the benchmark fingerprint and the audio fingerprints in the fingerprint library are characterized by hash sequences, and the step "calculating the similarity between the benchmark fingerprint and the audio fingerprints in the preset fingerprint library" may include: counting, for each audio fingerprint in the preset fingerprint library, the number of identical hash values that the benchmark fingerprint and that audio fingerprint contain; and calculating, according to the number of identical hash values, the similarity between the benchmark fingerprint and each audio fingerprint in the fingerprint library.

Taking any audio fingerprint in the fingerprint library as an example, the audio recognition apparatus compares the hash values in the hash sequence of the benchmark fingerprint one by one with the hash values in the hash sequence of that audio fingerprint and counts the number of identical hash values. The apparatus uses this count as the similarity between the benchmark fingerprint and that audio fingerprint. In this way, the apparatus calculates the similarity between the benchmark fingerprint and each audio fingerprint in the fingerprint library.
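The counting step can be sketched as below. The text does not say how repeated hash values are to be counted, so this sketch assumes a multiset interpretation: a hash value that occurs twice in both sequences contributes two matches. The function name `hash_similarity` is illustrative.

```python
from collections import Counter


def hash_similarity(benchmark_hashes, library_hashes):
    """Number of identical hash values shared by two hash sequences.

    Assumption: duplicates are matched as a multiset, so each occurrence
    can be paired with at most one occurrence in the other sequence.
    """
    shared = Counter(benchmark_hashes) & Counter(library_hashes)
    return sum(shared.values())


# Hash values 1 and 3 (twice) are shared, giving a similarity of 3.
print(hash_similarity([1, 2, 3, 3], [3, 3, 4, 1]))
```

If duplicates should be ignored instead, comparing `set(benchmark_hashes) & set(library_hashes)` would be the corresponding variant.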

102. According to the similarity between the benchmark fingerprint and the audio fingerprints in the fingerprint library, screen a candidate fingerprint set out of the fingerprint library.

For example, the audio recognition apparatus may, according to a preset similarity threshold, screen out the audio fingerprints in the fingerprint library whose similarity to the benchmark fingerprint is greater than the threshold, as the candidate fingerprints that match the benchmark fingerprint.

Note that a candidate fingerprint matching the benchmark fingerprint can be understood as one whose corresponding audio is identical to the audio to be identified, or can be regarded as identical, for example the same song, or the same song with a different arrangement.

The audio recognition apparatus then places the screened candidate fingerprints in one set to obtain the candidate fingerprint set, which thus includes one or more candidate fingerprints that match the benchmark fingerprint.

103. Select a reference fingerprint from the candidate fingerprint set, and obtain the unisonance fingerprints of the reference fingerprint.

The reference fingerprint is the candidate fingerprint most similar to the benchmark fingerprint. For example, the audio recognition apparatus may determine, as the reference fingerprint, the candidate fingerprint in the candidate fingerprint set whose similarity to the benchmark fingerprint is the largest.
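The candidate screening of step 102 and the reference selection at the start of step 103 can be sketched together as follows, assuming the similarities have already been computed and are held in a dictionary keyed by a fingerprint identifier. The function names and the dictionary layout are illustrative assumptions.

```python
def screen_candidates(similarities, similarity_threshold):
    """Keep the fingerprints whose similarity exceeds the preset threshold."""
    return {fid: s for fid, s in similarities.items() if s > similarity_threshold}


def select_reference(candidates):
    """Return the candidate with the largest similarity, or None if there is none."""
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Illustrative similarities (counts of identical hash values) per fingerprint.
similarities = {"fp-a": 120, "fp-b": 95, "fp-c": 12}
candidates = screen_candidates(similarities, similarity_threshold=50)
reference = select_reference(candidates)  # "fp-a"
```

When two candidates tie for the largest similarity, `max` returns the first one encountered; the text does not specify a tie-breaking rule.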

Then, the audio recognition apparatus selects the unisonance fingerprints of the reference fingerprint. A unisonance fingerprint can be understood as a fingerprint whose corresponding audio is identical to the audio corresponding to the reference fingerprint, or can be regarded as identical. For example, the song library of a music platform may contain multiple audios that have different numbers but are in fact the same song, such as different versions of the same song, cover versions by different singers, or the same song included in different albums or radio stations. Multiple audios that belong to the same song are defined as unisonance audio, and their audio fingerprints are unisonance fingerprints.

In some embodiments, the step "obtaining the unisonance fingerprints of the reference fingerprint" may include: calculating the overlap degree between the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set; and selecting, according to the overlap degree, the unisonance fingerprints of the reference fingerprint from the other candidate fingerprints.

The overlap degree between the reference fingerprint and another candidate fingerprint can be calculated by means such as correlation or the longest common subsequence. For the correlation approach, the variance between the hash sequences of the reference fingerprint and the other candidate fingerprint may be calculated, and the variance value used as the overlap degree between the two. The audio recognition apparatus then takes the other candidate fingerprints whose variance value meets a preset requirement as the unisonance fingerprints of the reference fingerprint.

Taking the longest common subsequence (LCS, Longest Common Subsequence) as an example, the step "calculating the overlap degree between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set" may include: obtaining the longest common subsequence of the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set, and counting the length of the longest common subsequence; and calculating, according to the length of the longest common subsequence, the overlap degree between the reference fingerprint and the other candidate fingerprint.

Here, the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set are characterized by hash sequences.

A hash sequence is a particular kind of sequence. A subsequence of a sequence is the sequence obtained by removing zero or more elements from it without changing the relative order of the remaining elements. If a sequence is simultaneously a subsequence of several hash sequences, it is a common subsequence of those hash sequences. The longest common subsequence of the hash sequences is the longest of their common subsequences, and its length is the number of elements it contains.

For example, dynamic programming (DP, Dynamic Programming) can be used to calculate the length of the longest common subsequence of the hash sequences of the reference fingerprint and another candidate fingerprint. In this embodiment, the length is calculated as follows:

nlcs = LCS(res[i].hash_seq, res[0].hash_seq)

where nlcs is the length of the longest common subsequence, LCS is the dynamic programming function that computes the longest common subsequence length, res[i].hash_seq is the hash sequence of the i-th candidate fingerprint, and res[0].hash_seq is the hash sequence of the reference fingerprint.

For example, let the hash sequence of the reference fingerprint be X = {A, B, C, B, D, A, B} and the hash sequence of any other candidate fingerprint be Y = {B, D, C, A, B, A}. Sequences such as {A, B} and {B, C, B, A} are subsequences of both X and Y and are therefore common subsequences of X and Y. This embodiment does not enumerate all common subsequences of X and Y. Among them, the sequence {B, C, B, A} contains 4 elements, which is the greatest length any common subsequence of X and Y reaches, so it is a longest common subsequence of X and Y and nlcs = 4.
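The LCS length used above can be computed with the standard dynamic-programming recurrence. The sketch below is one conventional way to write it, keeping only two rows of the table; the function name `lcs_length` is an assumption, and the patent does not prescribe this particular formulation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)          # DP row for the previous element of a
    for x in a:
        curr = [0]                     # LCS with an empty prefix of b is 0
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)          # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop x or drop y
        prev = curr
    return prev[-1]


# The example from the text: X and Y share {B, C, B, A}.
X = ["A", "B", "C", "B", "D", "A", "B"]
Y = ["B", "D", "C", "A", "B", "A"]
print(lcs_length(X, Y))  # prints 4
```

The running time is proportional to the product of the two sequence lengths, which is acceptable here because the comparison runs only within the candidate fingerprint set rather than over the whole fingerprint library.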

Taking any other candidate fingerprint as an example, after obtaining the length of the longest common subsequence of the reference fingerprint and that candidate, the audio recognition apparatus calculates the overlap degree between the two. For example, the following formula can be used:

sim = nlcs / hash_seq_cnt × 100%

where sim is the overlap degree between the reference fingerprint and the other candidate fingerprint, nlcs is the length of the longest common subsequence, and hash_seq_cnt is the length of the hash sequence of the reference fingerprint. In some embodiments, the formula may be coded as int sim = nlcs * 1.0 / hash_seq_cnt * 100.
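Applied to the earlier example, where the longest common subsequence has length 4 and the reference hash sequence has 7 elements, the formula gives an overlap degree of roughly 57%. The sketch below is a minimal rendering of the formula; the function name and the guard against an empty reference sequence are assumptions.

```python
def overlap_degree(nlcs, hash_seq_cnt):
    """Overlap degree, in percent, of a candidate with the reference fingerprint."""
    if hash_seq_cnt <= 0:
        raise ValueError("the reference hash sequence must not be empty")
    return nlcs / hash_seq_cnt * 100


sim = overlap_degree(4, 7)  # about 57.14
sim_int = int(sim)          # 57, matching the integer form given in the text
```

Because the denominator is the length of the reference hash sequence only, the measure is not symmetric: a short candidate that is fully contained in a long reference still scores low.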

In this way, the audio recognition apparatus can calculate the overlap degree between the reference fingerprint and each of the other candidate fingerprints.

The audio recognition apparatus can then select the unisonance fingerprints of the reference fingerprint from the other candidate fingerprints.

For example, the apparatus may take the other candidate fingerprint with the largest overlap degree as the unisonance fingerprint of the reference fingerprint. Alternatively, the apparatus may sort the overlap degrees in descending order and take the other candidate fingerprints ranked within a preset number of top positions as the unisonance fingerprints of the reference fingerprint.

In some embodiments, the step "selecting, according to the overlap degree, the unisonance fingerprints of the reference fingerprint from the other candidate fingerprints" may include: screening out, from the other candidate fingerprints, the candidate fingerprints whose overlap degree with the reference fingerprint is greater than or equal to a preset threshold, as the unisonance fingerprints of the reference fingerprint.

The preset threshold can be adjusted flexibly according to actual needs, for example 25%.

In this way, the audio recognition apparatus screens the unisonance fingerprints of the reference fingerprint out of the other candidate fingerprints in the candidate fingerprint set.

By calculating similarity, this embodiment saves the labor and time cost of manually marking unisonance audio in the audio library and avoids manual information entry falling behind. When audio is stored, no additional manual marking or classification of unisonance audio is needed, which removes the risk of erroneous or omitted records and reduces maintenance cost. This embodiment therefore improves the accuracy and efficiency of identifying unisonance fingerprints and unisonance audio.

In some embodiments, if no candidate fingerprint whose overlap degree with the reference fingerprint is greater than or equal to the preset threshold is found, the audio corresponding to the reference fingerprint is determined as the target audio corresponding to the audio to be identified.

That is, when the audio recognition apparatus cannot find a unisonance fingerprint of the reference fingerprint, it determines that the candidate fingerprint set contains no other candidate fingerprint that closely approximates the reference fingerprint. The apparatus therefore determines the audio corresponding to the reference fingerprint according to the mapping between audio fingerprints and audio in the fingerprint library, and determines that audio as the target audio corresponding to the audio to be identified.
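The threshold screening and the fallback to the reference fingerprint can be sketched as follows, assuming the overlap degrees have already been computed as percentages and stored per candidate. The function names, the dictionary layout, and the 25% default (taken from the example threshold in the text) are illustrative.

```python
def select_unisonance(overlaps, threshold=25.0):
    """Candidates whose overlap degree with the reference meets the threshold.

    overlaps maps each other candidate fingerprint to its overlap degree (percent).
    """
    return [fid for fid, sim in overlaps.items() if sim >= threshold]


def fingerprints_for_target(reference_id, overlaps, threshold=25.0):
    """Reference fingerprint plus its unisonance fingerprints.

    If no candidate meets the threshold, only the reference fingerprint is
    returned, so its audio becomes the target audio.
    """
    return [reference_id] + select_unisonance(overlaps, threshold)


print(fingerprints_for_target("fp-a", {"fp-b": 57.1, "fp-c": 10.0}))  # ['fp-a', 'fp-b']
print(fingerprints_for_target("fp-a", {"fp-c": 10.0}))                # ['fp-a']
```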

104, in the reference fingerprint and its corresponding audio of unisonance fingerprint, the corresponding mesh of the audio to be identified is selected Mark with phonetic symbols frequency.

After obtaining reference fingerprint and its unisonance fingerprint, speech recognizing device is according to audio-frequency fingerprint each in fingerprint base and audio Mapping relations, determine reference fingerprint and its corresponding audio of unisonance fingerprint.

Then, the audio identification device selects the target audio from the audio corresponding to the reference fingerprint and its unisonance fingerprints. For example, the audio identification device takes all of the audio corresponding to the reference fingerprint and its unisonance fingerprints as the target audio corresponding to the audio to be identified. This avoids missing audio that is substantially identical to the audio to be identified merely because of a version difference, and improves the accuracy of audio fingerprint matching.

In some embodiments, the audio corresponding to the reference fingerprint and its unisonance fingerprints can also be screened according to actual needs. The step "from the audio corresponding to the reference fingerprint and its unisonance fingerprints, select the target audio corresponding to the audio to be identified" may include: obtaining the audio corresponding to the reference fingerprint and its unisonance fingerprints as unisonance audio, and obtaining the version information of the unisonance audio; determining the version priority of the unisonance audio according to the version information; and taking the unisonance audio with the highest version priority as the target audio corresponding to the audio to be identified.

The version information includes information such as the source, singer, listing date and/or issue date of the audio, and may be preset information carried by the audio itself. Unisonance audio may be audio that differs in version information, such as source and/or version.

For example, according to the source information of the unisonance audio, the audio identification device sets the version priority of audio whose source is an album to the highest and the version priority of audio whose source is a radio station to the lowest. The audio identification device thus determines the unisonance audio whose source is an album as the target audio.

For example, the audio identification device orders the unisonance audio chronologically by listing date, setting the version priority of the audio with the earliest listing date to the highest and that of the audio with the latest listing date to the lowest. The audio identification device thus determines the unisonance audio with the earliest listing date as the target audio.

The target audio is thus the audio that is most similar to the audio to be identified and whose version is the most accurate.
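The version-priority screening described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the `UnisonanceAudio` record, its field names, the numeric source ranking and the tie-break by earliest date are all hypothetical. The text only states that an album source can rank highest, a radio source lowest, and that an earlier listing date can rank higher; combining the two rules in one sort key is a choice made here for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical ranking: a lower number means a higher version priority.
SOURCE_RANK = {"album": 0, "radio": 2}
DEFAULT_RANK = 1  # assumed rank for any source the text does not mention


@dataclass(frozen=True)
class UnisonanceAudio:
    audio_id: int
    source: str
    listing_date: date


def select_target_audio(candidates):
    """Return the unisonance audio with the highest version priority."""
    if not candidates:
        raise ValueError("no unisonance audio to choose from")
    # Sort key: source rank first, then earliest listing date (assumed tie-break).
    return min(
        candidates,
        key=lambda a: (SOURCE_RANK.get(a.source, DEFAULT_RANK), a.listing_date),
    )


# Illustrative data only; the ids and dates are invented.
versions = [
    UnisonanceAudio(101, "radio", date(2015, 3, 1)),
    UnisonanceAudio(102, "album", date(2016, 7, 9)),
    UnisonanceAudio(103, "album", date(2014, 1, 20)),
]
target = select_target_audio(versions)
print(target.audio_id)  # 103: an album source with the earliest listing date
```

A different ordering of the two criteria, or a different set of version fields, would only change the sort key.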

As can be seen from the above, the embodiment of the present invention can extract the audio fingerprint of the audio to be identified as the benchmark fingerprint, and calculate the similarity between the benchmark fingerprint and the audio fingerprints in the preset fingerprint library; according to that similarity, screen out from the fingerprint library a candidate fingerprint set that contains the audio fingerprints approximating the benchmark fingerprint; then select a reference fingerprint from the candidate fingerprint set and obtain the unisonance fingerprints of the reference fingerprint; and select the target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and its unisonance fingerprints.

Thus, after the scheme retrieves the candidate fingerprints that approximate the benchmark fingerprint, the candidate fingerprints match the benchmark fingerprint but may still be uncertain because of, for example, the version of the audio to be identified. The scheme therefore further selects a reference fingerprint from the candidate fingerprint set, and then selects unisonance fingerprints from the other candidate fingerprints in the set by calculating the degree of overlap, which screens the candidate fingerprints further. The reference fingerprint and unisonance fingerprints obtained through these successive screenings are the audio fingerprints that are closest to the benchmark fingerprint of the audio to be identified and whose corresponding audio is identical or can be regarded as identical. The target audio selected from the corresponding audio is therefore the audio of the optimal version and can serve as the true origin or source of the audio to be identified. This ensures the accuracy of both the content and the version of the target audio and improves the overall efficiency and user experience of audio identification. By screening the audio fingerprints in the fingerprint library layer by layer, the scheme refines the granularity of audio identification, so that a more accurate target audio is retrieved.

The method described in the preceding embodiment is described in further detail below by way of example.

For example, referring to Fig. 2a, this embodiment is described with the audio identification device integrated in a server cluster. The server cluster includes a feature extraction server, leaf servers and a root server, and the system may include one or more of each. This embodiment is described with a system that includes one feature extraction server, multiple leaf servers and one root server.

(1) The client uploads the audio to be identified.

The user can upload recorded audio or locally stored audio to the feature extraction server through audio identification software, music software or the like installed on the client.

(2) Audio fingerprint extraction.

The feature extraction server extracts the audio fingerprint of the audio to be identified as the benchmark fingerprint. The feature extraction server then sends the benchmark fingerprint to each leaf server for audio fingerprint matching.

(3) Fingerprint matching.

Each leaf server extracts a part of the audio fingerprints from the fingerprint library for audio fingerprint matching. For example, each leaf server can extract its corresponding audio fingerprints from the fingerprint library according to a preset allocation rule and match them, so that the mass of data is split and processed in parallel, which increases audio identification speed.

Take any one leaf server as an example.

The leaf server calculates the similarity between the benchmark fingerprint and each audio fingerprint in the fingerprint library. For example, the leaf server can count, for each audio fingerprint in the fingerprint library, the number of identical hash values that it shares with the benchmark fingerprint, and use that number as the similarity between the benchmark fingerprint and that audio fingerprint.

Then, the leaf server determines the audio fingerprints whose similarity value with the benchmark fingerprint is greater than the preset similarity threshold as candidate fingerprints, and sends the candidate fingerprints to the root server.
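The leaf-server matching step can be sketched as follows, as an illustration only. Several details are assumptions that the text does not specify: the allocation rule (here, audio id modulo the number of leaves), the use of threads to stand in for separate leaf servers, and counting identical hash values as a multiset intersection. The function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def shared_hash_count(benchmark, fingerprint):
    """Count hash values common to both sequences (multiset intersection)."""
    return sum((Counter(benchmark) & Counter(fingerprint)).values())


def leaf_match(benchmark, shard, threshold):
    """One leaf: score its shard and keep fingerprints above the threshold."""
    hits = []
    for audio_id, fingerprint in shard.items():
        score = shared_hash_count(benchmark, fingerprint)
        if score > threshold:  # strictly greater, as in the description
            hits.append((audio_id, score))
    return hits


def shard_library(library, leaf_count):
    """Hypothetical allocation rule: assign fingerprints by audio id modulo."""
    shards = [{} for _ in range(leaf_count)]
    for audio_id, fingerprint in library.items():
        shards[audio_id % leaf_count][audio_id] = fingerprint
    return shards


def match_on_leaves(benchmark, library, leaf_count, threshold):
    """Send the benchmark fingerprint to every leaf and merge the candidates."""
    shards = shard_library(library, leaf_count)
    with ThreadPoolExecutor(max_workers=leaf_count) as pool:
        results = list(pool.map(lambda s: leaf_match(benchmark, s, threshold), shards))
    candidates = [hit for leaf_hits in results for hit in leaf_hits]
    # Highest similarity first, which is the order the root server needs.
    return sorted(candidates, key=lambda hit: hit[1], reverse=True)


# Illustrative library: audio id -> hash sequence (values invented).
library = {
    1: [10, 20, 30, 40],
    2: [10, 20, 99, 98],
    3: [77, 78, 79, 80],
    4: [10, 20, 30, 41],
}
benchmark = [10, 20, 30, 40]
print(match_on_leaves(benchmark, library, leaf_count=2, threshold=1))
# [(1, 4), (4, 3), (2, 2)]
```

In a real deployment each leaf would be a separate server holding its own shard; the thread pool here only mimics the fan-out and merge.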

(4) Unisonance identification.

After obtaining the candidate fingerprints sent by each leaf server, the root server places the candidate fingerprints into a candidate fingerprint set, and then selects the reference fingerprint and its unisonance fingerprints from the candidate fingerprint set.

For example, the root server takes the candidate fingerprint in the candidate fingerprint set that has the largest similarity value with the benchmark fingerprint as the reference fingerprint.

Then, the root server calculates the degree of overlap between the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set. In one implementation, the root server obtains the longest common subsequence of the reference fingerprint and each other candidate fingerprint, counts the length of that longest common subsequence, and uses the length as the degree of overlap between the reference fingerprint and that candidate fingerprint.

Then, according to the degree of overlap, the root server selects the unisonance fingerprints of the reference fingerprint from the other candidate fingerprints. In one implementation, the root server screens out, from the other candidate fingerprints, the candidate fingerprints whose degree of overlap with the reference fingerprint is greater than or equal to the preset threshold, and takes them as the unisonance fingerprints of the reference fingerprint.

The root server thus identifies the unisonance fingerprints.

For example, in Fig. 2b, idx is the rank of a candidate fingerprint by its similarity value with the benchmark fingerprint, where the audio fingerprint with idx 0 has the largest similarity value with the benchmark fingerprint; id is the number of the audio corresponding to the candidate fingerprint, so that the corresponding audio can be found from the id; score is the similarity value between the candidate fingerprint and the benchmark fingerprint, and the larger the value, the more similar the candidate fingerprint is to the benchmark fingerprint; lcs is the longest common subsequence length between the candidate fingerprint and the reference fingerprint, that is, their degree of overlap.

Taking Fig. 2b as an example, with a similarity threshold of 9, the candidate fingerprint set configured by the root server contains 35 candidate fingerprints in total; that is, the similarity value (score) between each of these 35 candidate fingerprints and the benchmark fingerprint is greater than 9.

The audio fingerprint with idx 0 has the largest similarity value with the benchmark fingerprint and is therefore the reference fingerprint; accordingly, its lcs with itself is 100. The root server calculates the lcs length between the reference fingerprint and each candidate fingerprint with idx 0 to 34 in the candidate fingerprint set, and uses it as the degree of overlap. If the preset threshold is 25, the root server takes all other candidate fingerprints whose degree of overlap is 25 or more as unisonance fingerprints of the reference fingerprint.
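The root-server selection over a table shaped like Fig. 2b can be sketched as follows. This is an illustration under assumptions: the `Candidate` record and function name are hypothetical, the rows are invented and are not the values of Fig. 2b, and the lcs column is taken as already computed against the reference fingerprint, as it appears in the figure.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    audio_id: int  # "id" column: number of the audio the fingerprint maps to
    score: int     # similarity value with the benchmark fingerprint
    lcs: float     # degree of overlap with the reference fingerprint


def split_reference_and_unisonance(candidates, lcs_threshold):
    """Pick the reference fingerprint and its unisonance fingerprints."""
    if not candidates:
        raise ValueError("candidate fingerprint set is empty")
    # Rank by similarity; position 0 plays the role of idx 0 in the table.
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    reference, others = ranked[0], ranked[1:]
    unisonance = [c for c in others if c.lcs >= lcs_threshold]
    return reference, unisonance


# Invented rows for illustration; not the data shown in Fig. 2b.
rows = [
    Candidate(5001, 12, 8.0),
    Candidate(5002, 48, 100.0),
    Candidate(5003, 31, 62.0),
    Candidate(5004, 15, 25.0),
]
reference, unisonance = split_reference_and_unisonance(rows, lcs_threshold=25)
print(reference.audio_id, [c.audio_id for c in unisonance])
# 5002 [5003, 5004]
```

If the returned list is empty, the description falls back to the audio of the reference fingerprint as the target audio.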

(5) Audio screening.

After obtaining the reference fingerprint and its unisonance fingerprints, the root server selects the target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and its unisonance fingerprints.

For example, the root server obtains the audio corresponding to the reference fingerprint and its unisonance fingerprints as unisonance audio, and obtains the version information of the unisonance audio; determines the version priority of the unisonance audio according to the version information; and takes the unisonance audio with the highest version priority as the target audio corresponding to the audio to be identified.

Taking Fig. 2b above as an example, if the root server determines that the audio corresponding to the candidate fingerprint with idx 26 is the target audio, it outputs the audio id of that audio.

(6) Result output.

The root server returns the target audio obtained by screening to the client, so that the client can play it to the user.

For example, in Fig. 2c, the client obtains the audio id returned by the root server, retrieves the target audio corresponding to that number from the audio library, and shows it to the user on the identification result display interface. Of course, the display interface can also show information about the target audio, such as its title, singer and source (for example, an album), and provide a play button for the user to play it.

As can be seen from the above, the user can upload the audio that needs to be identified to the server cluster, and the server cluster performs fingerprint matching in parallel on the leaf servers, which increases audio retrieval speed. The root server further screens the matching results of the leaf servers to select the target audio whose content is closest to the audio to be identified and whose version best matches the user's needs, which improves audio identification efficiency and user experience.

To better implement the above method, an embodiment of the present invention also provides an audio identification device. The audio identification device can be integrated in a network device, and the network device can be a terminal, a server or similar equipment.

For example, as shown in Fig. 3, the audio identification device may include a fingerprint unit 301, a candidate unit 302, a unisonance unit 303 and an audio unit 304, as follows:

(1) Fingerprint unit 301;

The fingerprint unit 301 is configured to extract the audio fingerprint of the audio to be identified as the benchmark fingerprint, and to calculate the similarity between the benchmark fingerprint and the audio fingerprints in the preset fingerprint library.

For example, the fingerprint unit 301 obtains the audio to be identified, extracts its audio fingerprint, and uses that audio fingerprint as the benchmark fingerprint for querying the audio fingerprints that are closest or most similar to it.

In some embodiments, the fingerprint unit 301 can receive an audio identification request and obtain the audio to be identified; and perform audio fingerprint extraction on the audio to be identified to obtain a hash sequence, which is used as the benchmark fingerprint.

For example, the user can use the client to input an audio identification request. After receiving the request, the fingerprint unit 301 notifies the client to start audio collection, so that the user's humming or the sound in the environment is recorded to obtain the audio to be identified, which is the audio corresponding to this audio identification request. Of course, the user can also upload audio stored locally on the client or downloaded from the network to the fingerprint unit 301, so that the fingerprint unit 301 obtains the audio identification request and its corresponding audio to be identified.

Then, the fingerprint unit 301 performs audio fingerprint extraction on the audio signal of the audio to be identified to obtain its audio fingerprint, which contains the audio feature information of the audio to be identified. Extracting the audio fingerprint from the audio signal may specifically include framing, windowing, FFT (Fast Fourier Transform) frequency-domain transformation, extraction of local peaks, and conversion to a hash sequence.

Specifically, after obtaining the audio to be identified, the fingerprint unit 301 performs framing and windowing on its audio signal. Framing cuts the whole audio signal into multiple segments according to a preset rule, each segment being one frame, so that the audio signal is stationary on a microscopic scale and can serve as a stationary input for later audio signal processing. The fingerprint unit 301 then applies a preset window function, such as a Hamming window, to each frame of audio, so that the framed audio signal is more continuous and exhibits the characteristics of a periodic function.

Then, the fingerprint unit 301 performs the FFT frequency-domain transformation on each frame of the audio signal to obtain a spectrum containing frequency-domain information. The fingerprint unit 301 then extracts the local peaks in the spectrum and converts them into a hash sequence; this hash sequence is the audio fingerprint of the audio to be identified. It should be noted that the hash sequence may include multiple hash values.
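The extraction steps above (framing, windowing, FFT, local peaks, hash sequence) can be sketched as follows. This is a toy illustration, not the fingerprinting scheme of the embodiment. The frame size, hop size, the choice of only the strongest local peak per frame, and the way two consecutive peak bins are packed into one hash value are all assumptions, since the text does not specify them. A small recursive radix-2 FFT is written out so the sketch depends only on the standard library; a numerical library would normally be used instead.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def hamming(n):
    """Hamming window coefficients for a frame of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def strongest_local_peak(magnitudes):
    """Index of the largest local maximum, or 0 if the frame has none."""
    best = 0  # bin 0 is never tested below, so 0 doubles as "no peak"
    for k in range(1, len(magnitudes) - 1):
        if magnitudes[k - 1] < magnitudes[k] > magnitudes[k + 1]:
            if best == 0 or magnitudes[k] > magnitudes[best]:
                best = k
    return best


def extract_fingerprint(signal, frame_size=64, hop=32):
    """Return a hash sequence built from per-frame spectral peaks."""
    window = hamming(frame_size)
    peaks = []
    for start in range(0, len(signal) - frame_size + 1, hop):  # framing
        frame = signal[start:start + frame_size]
        windowed = [s * w for s, w in zip(frame, window)]       # windowing
        spectrum = fft(windowed)                                # FFT
        magnitudes = [abs(c) for c in spectrum[:frame_size // 2 + 1]]
        peaks.append(strongest_local_peak(magnitudes))          # local peak
    # Hypothetical hash: pack the peak bins of two consecutive frames.
    return [(a << 16) | b for a, b in zip(peaks, peaks[1:])]


# A pure tone that falls exactly on FFT bin 8 of a 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
fingerprint = extract_fingerprint(tone)
print(len(fingerprint), fingerprint[0] == ((8 << 16) | 8))
```

Practical systems usually keep several peaks per frame and hash peak pairs together with their time offset; the single-peak version here only shows the flow of the steps.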

Using the audio fingerprint of the audio to be identified as the benchmark fingerprint, the fingerprint unit 301 calculates the similarity between the benchmark fingerprint and the audio fingerprints in the preset fingerprint library, thereby retrieving or matching audio fingerprints.

In some embodiments, the benchmark fingerprint and the audio fingerprints in the fingerprint library are represented by hash sequences, and the fingerprint unit 301 can be configured to: count, for each audio fingerprint in the preset fingerprint library, the number of identical hash values that it shares with the benchmark fingerprint; and calculate the similarity between the benchmark fingerprint and each audio fingerprint in the fingerprint library according to the number of identical hash values.

Taking any audio fingerprint in the fingerprint library as an example, the fingerprint unit 301 compares the hash values in the hash sequence of the benchmark fingerprint one by one with the hash values in the hash sequence of that audio fingerprint, counts the number of identical hash values, and uses that number as the similarity between the benchmark fingerprint and the audio fingerprint. In this way, the fingerprint unit 301 calculates the similarity between the benchmark fingerprint and each audio fingerprint in the fingerprint library.

(2) Candidate unit 302;

The candidate unit 302 is configured to screen out a candidate fingerprint set from the fingerprint library according to the similarity between the benchmark fingerprint and the audio fingerprints in the fingerprint library.

For example, according to a preset similarity threshold, the candidate unit 302 can screen out the audio fingerprints in the fingerprint library whose similarity value with the benchmark fingerprint is greater than the similarity threshold, as candidate fingerprints that match the benchmark fingerprint.

The candidate unit 302 then places the candidate fingerprints obtained by screening into the same set to obtain the candidate fingerprint set. The candidate fingerprint set thus contains one or more candidate fingerprints that match the benchmark fingerprint.

(3) Unisonance unit 303;

The unisonance unit 303 is configured to select a reference fingerprint from the candidate fingerprint set and to obtain the unisonance fingerprints of the reference fingerprint.

The reference fingerprint is the candidate fingerprint most similar to the benchmark fingerprint. For example, the unisonance unit 303 can determine the candidate fingerprint in the candidate fingerprint set that has the largest similarity value with the benchmark fingerprint as the reference fingerprint.

Then, the unisonance unit 303 selects the unisonance fingerprints of the reference fingerprint. It should be noted that a unisonance fingerprint can be understood as a fingerprint whose corresponding audio is identical, or can be regarded as identical, to the audio corresponding to the reference fingerprint. For example, the song library of a music platform may contain multiple audio items that have different numbers but are in fact the same song, such as different versions of the same song, cover versions by different singers, or the same song included in different albums or radio stations. Multiple audio items that belong to the same song are defined as unisonance audio, and their audio fingerprints are unisonance fingerprints.

In some embodiments, the unisonance unit 303 can specifically be configured to: calculate the degree of overlap between the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set; and select the unisonance fingerprints of the reference fingerprint from the other candidate fingerprints according to the degree of overlap.

The degree of overlap between the reference fingerprint and the other candidate fingerprints can be calculated by means such as correlation or the longest common subsequence. For correlation, the variance of the hash sequences of the reference fingerprint and another candidate fingerprint can be calculated, and the variance value used as the degree of overlap between them. The unisonance unit 303 then takes the other candidate fingerprints whose variance value meets a preset requirement as the unisonance fingerprints of the reference fingerprint.

Taking the longest common subsequence (LCS, Longest Common Subsequence) as an example, the unisonance unit 303 can be configured to: obtain the longest common subsequence of the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set, and count the length of the longest common subsequence; and calculate the degree of overlap between the reference fingerprint and the other candidate fingerprints according to the length of the longest common subsequence. For example:

nlcs = LCS(res[i].hash_seq, res[0].hash_seq)

Taking any other candidate fingerprint as an example, after obtaining the length of its longest common subsequence with the reference fingerprint, the unisonance unit 303 calculates the degree of overlap between the reference fingerprint and that candidate fingerprint. For example, the following formula can be used:

sim = nlcs / hash_seq_cnt × 100%

In this way, the unisonance unit 303 can calculate the degree of overlap between the reference fingerprint and each of the other candidate fingerprints.
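The longest-common-subsequence calculation and the percentage formula above can be sketched as follows. This is a minimal illustration using the standard dynamic-programming recurrence. The text does not define `hash_seq_cnt`; the sketch assumes it is the number of hash values in the reference fingerprint's hash sequence, which is consistent with the reference fingerprint scoring 100 against itself. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two hash sequences."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def overlap_percent(candidate_seq, reference_seq):
    """sim = nlcs / hash_seq_cnt * 100, with hash_seq_cnt assumed to be
    the length of the reference fingerprint's hash sequence."""
    if not reference_seq:
        raise ValueError("reference hash sequence is empty")
    nlcs = lcs_length(candidate_seq, reference_seq)
    return nlcs / len(reference_seq) * 100


reference_seq = [1, 2, 3, 4]
print(overlap_percent(reference_seq, reference_seq))  # 100.0
print(overlap_percent([1, 9, 3, 8], reference_seq))   # 50.0
```

Unlike a plain count of shared hash values, the subsequence requirement respects the order of the hashes, so two fingerprints only overlap strongly when their hashes occur in the same sequence.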

Then, the unisonance unit 303 can select the unisonance fingerprints of the reference fingerprint from the other candidate fingerprints.

For example, the unisonance unit 303 can take the other candidate fingerprint with the largest degree of overlap as the unisonance fingerprint of the reference fingerprint; alternatively, the unisonance unit 303 can sort the other candidate fingerprints by degree of overlap in descending order and take those within a preset number of top positions as the unisonance fingerprints of the reference fingerprint.

In some embodiments, the unisonance unit 303 can be configured to: screen out, from the other candidate fingerprints, the candidate fingerprints whose degree of overlap with the reference fingerprint is greater than or equal to the preset threshold, as the unisonance fingerprints of the reference fingerprint.

As a result, the unisonance unit 303 screens the other candidate fingerprints in the candidate fingerprint set and obtains the unisonance fingerprints of the reference fingerprint.

The unisonance unit 303 thus identifies unisonance audio by calculating similarity, which eliminates the labor and time cost of manually marking unisonance audio in the audio library and avoids delays in manual information entry. When audio is put into storage, no additional manual marking or classification of unisonance audio is needed, which removes the risk of erroneous or omitted entries and reduces maintenance cost. The present embodiment therefore improves the accuracy and efficiency of identifying unisonance fingerprints and unisonance audio.

In some embodiments, if no candidate fingerprint is found whose degree of overlap with the reference fingerprint is greater than or equal to the preset threshold, the audio unit 304 determines the audio corresponding to the reference fingerprint as the target audio corresponding to the audio to be identified.

Thus, when no unisonance fingerprint of the reference fingerprint can be found, the unisonance unit 303 determines that the candidate fingerprint set contains no other candidate fingerprint that closely approximates the reference fingerprint. The audio unit 304 therefore determines the audio corresponding to the reference fingerprint from the mapping between the audio fingerprints in the fingerprint library and the audio, and determines that audio as the target audio corresponding to the audio to be identified.

(4) Audio unit 304;

The audio unit 304 is configured to select the target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and its unisonance fingerprints.

After the reference fingerprint and its unisonance fingerprints are obtained, the audio unit 304 determines the audio corresponding to the reference fingerprint and its unisonance fingerprints from the mapping between the audio fingerprints in the fingerprint library and the audio.

Then, the audio unit 304 selects the target audio from the audio corresponding to the reference fingerprint and its unisonance fingerprints. For example, the audio unit 304 takes all of the audio corresponding to the reference fingerprint and its unisonance fingerprints as the target audio corresponding to the audio to be identified.

In some embodiments, the audio corresponding to the reference fingerprint and its unisonance fingerprints can also be screened according to actual needs. The audio unit 304 can specifically be configured to: obtain the audio corresponding to the reference fingerprint and its unisonance fingerprints as unisonance audio, and obtain the version information of the unisonance audio; determine the version priority of the unisonance audio according to the version information; and take the unisonance audio with the highest version priority as the target audio corresponding to the audio to be identified.

For example, according to the source information of the unisonance audio, the audio unit 304 sets the version priority of audio whose source is an album to the highest and the version priority of audio whose source is a radio station to the lowest. The audio unit 304 thus determines the unisonance audio whose source is an album as the target audio.

For example, the audio unit 304 orders the unisonance audio chronologically by listing date, setting the version priority of the audio with the earliest listing date to the highest and that of the audio with the latest listing date to the lowest. The audio unit 304 thus determines the unisonance audio with the earliest listing date as the target audio.

As can be seen from the above, in the embodiment of the present invention the fingerprint unit 301 can extract the audio fingerprint of the audio to be identified as the benchmark fingerprint, and calculate the similarity between the benchmark fingerprint and the audio fingerprints in the preset fingerprint library; the candidate unit 302 screens out from the fingerprint library, according to that similarity, a candidate fingerprint set that contains the audio fingerprints approximating the benchmark fingerprint; the unisonance unit 303 then selects a reference fingerprint from the candidate fingerprint set and obtains the unisonance fingerprints of the reference fingerprint; and the audio unit 304 selects the target audio corresponding to the audio to be identified from the audio corresponding to the reference fingerprint and its unisonance fingerprints.

Thus, after the scheme retrieves the candidate fingerprints that approximate the benchmark fingerprint, the candidate fingerprints match the benchmark fingerprint but may still be uncertain because of, for example, the version of the audio to be identified. The scheme therefore further selects a reference fingerprint from the candidate fingerprint set, and then selects unisonance fingerprints from the other candidate fingerprints in the set by calculating the degree of overlap, which screens the candidate fingerprints further. The reference fingerprint and unisonance fingerprints obtained through these successive screenings are the audio fingerprints that are closest to the benchmark fingerprint of the audio to be identified and whose corresponding audio is identical or can be regarded as identical. The target audio selected from the corresponding audio is therefore the audio of the optimal version and can serve as the true origin or source of the audio to be identified. This ensures the accuracy of both the content and the version of the target audio and improves the overall efficiency and user experience of audio identification. By screening the audio fingerprints in the fingerprint library layer by layer, the scheme refines the granularity of audio identification, so that a more accurate target audio is retrieved.

An embodiment of the present invention also provides audio identification equipment. Fig. 4a shows a schematic structural diagram of the audio identification equipment involved in the embodiment of the present invention. Specifically:

The audio identification equipment may include components such as a processor 401 with one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a power supply 403 and an input unit 404. Those skilled in the art will understand that the structure shown in Fig. 4a does not limit the audio identification equipment, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components. Specifically:

The processor 401 is the control center of the audio identification equipment. It connects the various parts of the entire equipment through various interfaces and lines, and performs the various functions of the equipment and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the equipment as a whole. Optionally, the processor 401 may include one or more processing cores. Preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 401.

The memory 402 can be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area. The program storage area can store the operating system, the application programs required by at least one function (such as the audio identification function) and the like; the data storage area can store data created through use of the audio identification equipment and the like. In addition, the memory 402 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Correspondingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.

The audio identification equipment further includes a power supply 403 that supplies power to all components. Preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions such as charging management, discharging management and power consumption management are implemented through the power management system. The power supply 403 may also include any components such as one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, and a power status indicator.

The audio identification equipment may also include an input unit 404, which can be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

In addition, referring to Fig. 4b, the audio identification equipment may also include an audio collecting device 405, which is used to collect the audio to be identified. For example, the audio collecting device 405 can collect the audio to be identified by recording or similar means.

Although not shown, the audio recognition device may also include a display unit and the like, which are not described here. Specifically, in this embodiment, the processor 401 in the audio recognition device loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402, thereby implementing various functions, as follows:

Extract the audio fingerprint of the audio to be identified as a benchmark fingerprint, and calculate the similarity between the benchmark fingerprint and the audio fingerprints in a preset fingerprint library; according to the similarity between the benchmark fingerprint and the audio fingerprints in the fingerprint library, screen out a candidate fingerprint set from the fingerprint library; select a reference fingerprint from the candidate fingerprint set, and obtain the unisonance fingerprints of the reference fingerprint; from the audio corresponding to the reference fingerprint and its unisonance fingerprints, select the target audio corresponding to the audio to be identified.
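
As a non-limiting illustration, the screening and selection steps above may be sketched in Python as follows. The sketch assumes that a similarity score has already been computed for every fingerprint in the library. The names screen_candidates and pick_reference, the similarity floor, and the cap on the number of candidates are hypothetical choices made for the example and are not specified by this embodiment. Taking the candidate with the highest similarity as the reference fingerprint is one possible selection rule.

```python
def screen_candidates(scores, min_similarity=0.5, max_candidates=10):
    """Screen a candidate set out of the library.

    scores: dict mapping an audio id to its similarity with the
    fingerprint of the audio to be identified.
    min_similarity and max_candidates are assumed tuning values.
    """
    kept = [(audio_id, s) for audio_id, s in scores.items() if s >= min_similarity]
    # Highest similarity first, so the cap keeps the best matches.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_candidates]


def pick_reference(candidates):
    """Return the id of the candidate with the highest similarity, or None."""
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])[0]


scores = {"a": 0.9, "b": 0.6, "c": 0.2}
candidates = screen_candidates(scores)
reference_id = pick_reference(candidates)
```

In this usage, "c" falls below the assumed floor and is dropped, and "a" becomes the reference fingerprint.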

The processor 401 may also run the application programs stored in the memory 402 to implement the following functions:

Calculate the degree of overlap between the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set; according to the degree of overlap, select the unisonance fingerprints of the reference fingerprint from the other candidate fingerprints.

Obtain the longest common subsequence of the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set, and count the length of the longest common subsequence; according to the length of the longest common subsequence, calculate the degree of overlap between the reference fingerprint and that candidate fingerprint.
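
As a non-limiting illustration, the longest-common-subsequence measure described above may be sketched in Python as follows. The sketch models each fingerprint as a sequence of integer hash values. Dividing the subsequence length by the length of the shorter fingerprint is an assumption made for the example, since this embodiment states only that the measure is calculated according to that length. The function names and the threshold value of 0.8 are likewise hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap(reference, candidate):
    """LCS length normalised by the shorter fingerprint (assumed normalisation)."""
    shorter = min(len(reference), len(candidate))
    if shorter == 0:
        return 0.0
    return lcs_length(reference, candidate) / shorter


def unisonance_ids(reference, others, threshold=0.8):
    """Ids of the other candidates whose measure reaches the assumed threshold."""
    return [audio_id for audio_id, fp in others.items()
            if overlap(reference, fp) >= threshold]


reference = [1, 2, 3, 4, 5]
others = {"x": [1, 3, 4, 5, 9], "y": [7, 7, 7]}
matches = unisonance_ids(reference, others)
```

The two-row table keeps memory proportional to the length of one fingerprint, which may matter when fingerprints contain many hash values.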

Count, for each audio fingerprint in the preset fingerprint library, the number of identical hash values that it shares with the benchmark fingerprint; according to the number of identical hash values, calculate the similarity between the benchmark fingerprint and each audio fingerprint in the fingerprint library.
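
As a non-limiting illustration, the hash-counting step above may be sketched in Python as follows. The sketch models each fingerprint as a list of integer hash values and the library as a dictionary from audio id to fingerprint. Counting repeated hash values as a multiset and dividing the shared count by the length of the query fingerprint are assumptions made for the example; this embodiment states only that the similarity is calculated according to the number of identical hash values. All function names are hypothetical.

```python
from collections import Counter


def shared_hash_count(query_fp, library_fp):
    """Number of hash values common to both fingerprints (multiset intersection)."""
    common = Counter(query_fp) & Counter(library_fp)
    return sum(common.values())


def similarity(query_fp, library_fp):
    """Shared-hash count divided by the query length (assumed normalisation)."""
    if not query_fp:
        return 0.0
    return shared_hash_count(query_fp, library_fp) / len(query_fp)


def score_library(query_fp, library):
    """Similarity of the query fingerprint to every fingerprint in the library."""
    return {audio_id: similarity(query_fp, fp) for audio_id, fp in library.items()}


query_fp = [1, 2, 3, 4]
library = {"a": [1, 2, 3, 9], "b": [7, 8]}
scores = score_library(query_fp, library)
```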

Take the audio corresponding to the reference fingerprint and its unisonance fingerprints as unisonance audio, and obtain the version information of the unisonance audio; according to the version information, determine the version priority of the unisonance audio; take the unisonance audio with the highest version priority as the target audio corresponding to the audio to be identified.
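
As a non-limiting illustration, selection by version priority may be sketched in Python as follows. The version labels, the numeric priorities in VERSION_PRIORITY, and the record layout with "id" and "version" keys are all hypothetical; this embodiment does not specify how version information is encoded or ranked.

```python
# Hypothetical ranking: a larger number means a higher version priority.
VERSION_PRIORITY = {"original": 3, "live": 2, "cover": 1}


def select_target(unisonance_audio):
    """Return the id of the unisonance audio with the highest version priority.

    unisonance_audio: list of dicts with 'id' and 'version' keys (assumed layout).
    Unknown version labels receive priority 0. Returns None for an empty list.
    """
    if not unisonance_audio:
        return None
    best = max(unisonance_audio,
               key=lambda audio: VERSION_PRIORITY.get(audio["version"], 0))
    return best["id"]


group = [
    {"id": "s1", "version": "cover"},
    {"id": "s2", "version": "original"},
    {"id": "s3", "version": "live"},
]
target_id = select_target(group)
```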

For the specific implementation of each of the above operations, refer to the foregoing embodiments; details are not repeated here.

Those of ordinary skill in the art will appreciate that all or some of the steps in the various methods of the above embodiments may be completed by instructions, or by instructions controlling relevant hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, an embodiment of the present invention provides a storage medium storing a plurality of instructions that can be loaded by a processor to execute the steps of any audio identification method provided by the embodiments of the present invention. For example, the instructions may execute the following steps:

The instructions may also execute the following steps:

The storage medium may include: read-only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), a magnetic disk, an optical disc, and the like.

Because the instructions stored in the storage medium can execute the steps of any audio identification method provided by the embodiments of the present invention, they can achieve the beneficial effects achievable by any such method; see the foregoing embodiments for details, which are not repeated here.

The audio identification method, apparatus, device, and storage medium provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. Meanwhile, those skilled in the art may, following the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. An audio identification method, characterized by comprising:

Extracting the audio fingerprint of the audio to be identified as a benchmark fingerprint, and calculating the similarity between the benchmark fingerprint and the audio fingerprints in a preset fingerprint library;

According to the similarity between the benchmark fingerprint and the audio fingerprints in the fingerprint library, screening out a candidate fingerprint set from the fingerprint library;

From the audio corresponding to the reference fingerprint and its unisonance fingerprints, selecting the target audio corresponding to the audio to be identified.

2. The method according to claim 1, characterized in that obtaining the unisonance fingerprints of the reference fingerprint comprises:

3. The method according to claim 2, characterized in that calculating the degree of overlap between the reference fingerprint and the other candidate fingerprints in the candidate fingerprint set comprises:

Obtaining the longest common subsequence of the reference fingerprint and each of the other candidate fingerprints in the candidate fingerprint set, and counting the length of the longest common subsequence;

According to the length of the longest common subsequence, calculating the degree of overlap between the reference fingerprint and the other candidate fingerprints.

4. The method according to claim 2, characterized in that selecting the unisonance fingerprints of the reference fingerprint from the other candidate fingerprints according to the degree of overlap comprises:

Screening out, from the other candidate fingerprints, the candidate fingerprints whose degree of overlap with the reference fingerprint is greater than or equal to a preset threshold, as the unisonance fingerprints of the reference fingerprint.

5. The method according to claim 4, characterized in that the method further comprises:

If no candidate fingerprint is found whose degree of overlap with the reference fingerprint is greater than or equal to the preset threshold, determining the audio corresponding to the reference fingerprint as the target audio corresponding to the audio to be identified.

6. The method according to claim 1, characterized in that selecting a reference fingerprint from the candidate fingerprint set comprises:

Determining, as the reference fingerprint, the candidate fingerprint in the candidate fingerprint set that has the highest similarity to the benchmark fingerprint.

7. The method according to claim 1, characterized in that calculating the similarity between the benchmark fingerprint and the audio fingerprints in the preset fingerprint library comprises:

According to the number of identical hash values, calculating the similarity between the benchmark fingerprint and each audio fingerprint in the fingerprint library.

8. The method according to any one of claims 1-7, characterized in that selecting, from the audio corresponding to the reference fingerprint and its unisonance fingerprints, the target audio corresponding to the audio to be identified comprises:

Taking the audio corresponding to the reference fingerprint and its unisonance fingerprints as unisonance audio, and obtaining the version information of the unisonance audio;

9. An audio identification apparatus, characterized by comprising:

A fingerprint unit, configured to extract the audio fingerprint of the audio to be identified as a benchmark fingerprint, and to calculate the similarity between the benchmark fingerprint and the audio fingerprints in a preset fingerprint library;

A candidate unit, configured to screen out a candidate fingerprint set from the fingerprint library according to the similarity between the benchmark fingerprint and the audio fingerprints in the fingerprint library;

A unisonance unit, configured to select a reference fingerprint from the candidate fingerprint set and to obtain the unisonance fingerprints of the reference fingerprint;

An audio unit, configured to select, from the audio corresponding to the reference fingerprint and its unisonance fingerprints, the target audio corresponding to the audio to be identified.

10. An audio recognition device, characterized in that the audio recognition device comprises: a memory, a processor, and an audio identification program stored on the memory and executable on the processor, wherein the audio identification program, when executed by the processor, implements the steps of the method according to claim 1.

11. An audio recognition device, characterized in that the audio recognition device further comprises an audio collecting device, the audio collecting device being used to collect the audio to be identified.

12. A storage medium, characterized in that the storage medium stores a plurality of instructions, the instructions being suitable for loading by a processor to execute the steps of the audio identification method according to any one of claims 1 to 8.