CN107886941A - Audio annotation method and device - Google Patents
Audio annotation method and device
- Publication number
- CN107886941A CN107886941A CN201610866294.5A CN201610866294A CN107886941A CN 107886941 A CN107886941 A CN 107886941A CN 201610866294 A CN201610866294 A CN 201610866294A CN 107886941 A CN107886941 A CN 107886941A
- Authority
- CN
- China
- Prior art keywords
- audio
- audio fragment
- fragment
- vector
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/686—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
Abstract
The present invention relates to an audio annotation method and device. The method includes: cutting an audio file into multiple audio fragments and selecting at least one audio fragment from them; extracting acoustic features and image features of the at least one audio fragment, the acoustic features including tonality and/or frequency-distribution features; combining, for each selected audio fragment, its acoustic features and image features to obtain the feature vector of that fragment; and training and predicting on the feature vectors of the at least one audio fragment to obtain a corresponding probability distribution, the label corresponding to the maximum probability value being selected as the label of the audio file. The method solves the problems that audio annotation consumes substantial human resources, takes a long time, and has low accuracy, and achieves efficient, high-accuracy audio annotation.
Description
Technical field
The present invention relates to the field of audio analysis and processing, and in particular to an audio annotation method and device.
Background art
Annotating audio files provides an indispensable source of training knowledge for the field of audio analysis and processing.
In the prior art, two methods are commonly used to annotate audio: manual annotation (i.e., expert annotation) and statistics-based annotation. Manual annotation requires substantial human resources, is time-consuming, and its accuracy is not high. Statistics-based annotation can only annotate audio resembling audio that has already been annotated; audio unlike anything previously annotated cannot be handled, so its accuracy is also relatively low.
Summary of the invention
The purpose of the present invention is to solve the prior-art problems that audio annotation consumes substantial human resources, takes a long time, and has low accuracy.
In a first aspect, the invention provides an audio annotation method comprising the following steps: cutting an audio file into multiple audio fragments and selecting at least one audio fragment from them; extracting the acoustic features and image features of the at least one audio fragment, the acoustic features including tonality and/or frequency-distribution features; combining, for each selected audio fragment, its acoustic features and image features to obtain the feature vector of that fragment; and training and predicting on the feature vectors of the at least one audio fragment to obtain a corresponding probability distribution, the label corresponding to the maximum probability value being selected as the label of the audio file.
Preferably, the step of cutting the audio file into multiple audio fragments and selecting at least one audio fragment from them includes: cutting the audio file into multiple audio fragments with the ffmpeg tool, and selecting three audio fragments from them: the beginning part, the middle part, and the ending part.
Preferably, the step of extracting the acoustic features of the at least one audio fragment includes: extracting the acoustic features of the at least one audio fragment with the marsyas tool; when the audio fragment is mono, extracting 31 features including tonality features and frequency-distribution features and computing their mean square deviations, yielding 31 further values; when the audio fragment is stereo, extracting 62 features and computing their mean square deviations, yielding 62 further values; and combining a mono fragment's values into a 62-dimensional vector and a stereo fragment's values into a 124-dimensional vector.
The step of extracting the image features of the at least one audio fragment includes: extracting the spectrogram of the at least one audio fragment with the marsyas tool; cutting the spectrogram into 10 sub-images; computing the 59-dimensional LBP vector of each sub-image; and concatenating the 10 sub-images' LBP vectors into one 590-dimensional vector.
Preferably, the step of combining the acoustic features and image features of each selected audio fragment to obtain the feature vector of that fragment includes: when the audio fragment is mono, concatenating the 62-dimensional vector and the 590-dimensional vector into a 652-dimensional vector, obtaining the feature vector of the fragment; and when the audio fragment is stereo, concatenating the 124-dimensional vector and the 590-dimensional vector into a 714-dimensional vector, obtaining the feature vector of the fragment.
Preferably, the step of training and predicting on the feature vectors of the at least one audio fragment, obtaining the corresponding probability distribution, and selecting the label corresponding to the maximum probability value as the label of the audio file includes: training and predicting with an SVM on the feature vectors of the at least one audio fragment and a training set; obtaining the probability distribution of the at least one audio fragment over preset labels; fusing the probability distributions of the at least one audio fragment over the preset labels according to a preset rule; and selecting the label corresponding to the maximum probability value as the label of the audio file.
In a second aspect, the invention provides an audio annotation device, the device including: a cutting module, an extraction module, a combination module, and a training-and-prediction module. The cutting module is configured to cut an audio file into multiple audio fragments and select at least one audio fragment from them; the extraction module is configured to extract the acoustic features and image features of the at least one audio fragment, the acoustic features including tonality and/or frequency-distribution features; the combination module is configured to combine, for each selected audio fragment, its acoustic features and image features to obtain the feature vector of that fragment; and the training-and-prediction module is configured to train and predict on the feature vectors of the at least one audio fragment, obtain a corresponding probability distribution, and select the label corresponding to the maximum probability value as the label of the audio file.
Preferably, the cutting module is specifically configured to cut the audio file into multiple audio fragments with the ffmpeg tool and to select three audio fragments from them: the beginning part, the middle part, and the ending part.
Preferably, the device further includes a computation module.
The extraction module is specifically configured to extract the acoustic features of the at least one audio fragment with the marsyas tool: when the audio fragment is mono, 31 features including tonality features and frequency-distribution features are extracted and their mean square deviations computed, yielding 31 further values; when the audio fragment is stereo, 62 features are extracted and their mean square deviations computed, yielding 62 further values. The combination module is further configured to combine a mono fragment's values into a 62-dimensional vector and a stereo fragment's values into a 124-dimensional vector. The extraction module is further configured to extract the spectrogram of the at least one audio fragment with the marsyas tool; the cutting module is further configured to cut the spectrogram into 10 sub-images; the computation module is configured to compute the 59-dimensional LBP vector of each sub-image; and the combination module is further configured to concatenate the 10 sub-images' LBP vectors into one 590-dimensional vector.
Preferably, the combination module is specifically configured to: when the audio fragment is mono, concatenate the 62-dimensional vector and the 590-dimensional vector into a 652-dimensional vector, obtaining the feature vector of the fragment; and when the audio fragment is stereo, concatenate the 124-dimensional vector and the 590-dimensional vector into a 714-dimensional vector, obtaining the feature vector of the fragment.
Preferably, the training-and-prediction module is specifically configured to train and predict with an SVM on the feature vectors of the at least one audio fragment and a training set, and to obtain the probability distribution of the at least one audio fragment over the preset labels; the combination module is further configured to fuse the probability distributions of the at least one audio fragment over the preset labels according to a preset rule, and to select the label corresponding to the maximum probability value as the label of the audio file.
The present invention solves the problems that audio annotation consumes substantial human resources, takes a long time, and has low accuracy, and achieves efficient, high-accuracy audio annotation.
Brief description of the drawings
Fig. 1 is a flow diagram of the audio annotation method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the audio annotation device provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of another audio annotation device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To facilitate understanding of the embodiments of the present invention, they are further explained below with reference to the drawings and specific examples; the examples do not limit the embodiments of the present invention.
Fig. 1 is a flow diagram of the audio annotation method provided by an embodiment of the present invention. As shown in Fig. 1, the method may be executed by a terminal, which may be referred to as user equipment (User Equipment, UE), a mobile station (Mobile Station, MS), a mobile terminal (Mobile Terminal), a computer, a microcomputer, and so on. For example, the terminal may be a portable computer or a computer with a mobile terminal; the present invention places no limitation on this.
S110: cut the audio file into multiple audio fragments and select at least one audio fragment from them.
Audio files include but are not limited to songs, recordings, and piano pieces. Here, the audio file can be cut into multiple audio fragments with the ffmpeg tool, and three fragments are selected from them: the beginning part, the middle part, and the ending part.
The following uses a song as the example audio file. When the audio file is a song in mp3 format, 30-40 s from the beginning of the song is chosen as one audio fragment, 10 s from the middle as another, and 30-40 s from the ending as a third.
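As a sketch under stated assumptions (the fragment offsets, output file names, and the 200 s song length are invented for the example; the patent only names ffmpeg), the three cuts could be issued as:

```python
# Illustrative sketch: build the ffmpeg commands that cut the beginning,
# middle, and ending fragments of a song. Fragment lengths follow the text
# (30-40 s at the ends, 10 s in the middle); the file names and the song
# duration are assumptions for the example.
def fragment_commands(src, duration_s):
    """Return one ffmpeg command (as an argv list) per fragment."""
    spans = [
        ("begin.mp3", 0.0, 35.0),                    # beginning, ~30-40 s
        ("middle.mp3", duration_s / 2 - 5.0, 10.0),  # middle, 10 s
        ("end.mp3", duration_s - 35.0, 35.0),        # ending, ~30-40 s
    ]
    return [
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(length),
         "-i", src, "-c", "copy", out]
        for out, start, length in spans
    ]

cmds = fragment_commands("song.mp3", 200.0)
```

Each command can then be executed with `subprocess.run(cmd, check=True)`; `-c copy` avoids re-encoding the mp3 stream.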
S120: extract the acoustic features and image features of the at least one audio fragment, the acoustic features including tonality and/or frequency-distribution features.
Optionally, the step of extracting the acoustic features of the at least one audio fragment includes:
extracting the acoustic features of the at least one audio fragment with the marsyas tool: when the audio fragment is mono, 31 features including tonality features and frequency-distribution features are extracted and their mean square deviations are computed, yielding 31 further values; when the audio fragment is stereo, 62 features are extracted and their mean square deviations computed, yielding 62 further values;
combining a mono fragment's values into a 62-dimensional vector and a stereo fragment's values into a 124-dimensional vector.
Specifically, for a song in mp3 format: when the song is mono, the 31 features of the beginning fragment, including tonality and/or frequency-distribution features, are extracted and their mean square deviations computed, yielding another 31 values and thus 62 values in total; the 62 values are combined into a 62-dimensional vector as the acoustic features of that fragment. During extraction, the sampling window size may be 1024 samples. For a mono song, the acoustic features of the middle and ending fragments are extracted in the same way and are not described again here.
Because most mp3 songs are stereo: when the song is stereo, after the acoustic features of the beginning fragment are extracted in one channel, they are extracted in the other channel as well, giving 124 values in total; the 124 values are combined into a 124-dimensional vector as the acoustic features of the beginning fragment. For a stereo song, the acoustic features of the middle and ending fragments are extracted in the same way and are not described again here.
Optionally, the step of extracting the image features of the at least one audio fragment includes: extracting the spectrogram of the at least one audio fragment with the marsyas tool; cutting the spectrogram into 10 sub-images; computing the 59-dimensional LBP vector of each sub-image; and concatenating the 10 sub-images' LBP vectors into one 590-dimensional vector.
Specifically, the spectrogram of the beginning fragment is extracted with the sound2png tool in marsyas; using a linear frequency division, the spectrogram of the fragment is cut into 10 sub-images; a 59-dimensional LBP (local binary pattern) vector is computed for each sub-image; and the 10 LBP vectors, 59 dimensions each, are concatenated into one 590-dimensional vector as the image features of the fragment.
The image features of the middle and ending fragments are extracted in the same way as those of the beginning fragment and are not described again here.
S130: combine, for each selected audio fragment, its acoustic features and image features to obtain the feature vector of that fragment.
Optionally, the step of combining the acoustic features and image features of each selected audio fragment to obtain its feature vector includes:
when the audio fragment is mono, concatenating the 62-dimensional vector and the 590-dimensional vector into a 652-dimensional vector, obtaining the feature vector of the fragment;
when the audio fragment is stereo, concatenating the 124-dimensional vector and the 590-dimensional vector into a 714-dimensional vector, obtaining the feature vector of the fragment.
Here, combining the acoustic features and image features of the same audio fragment means that the acoustic features and image features of the beginning fragment are combined, the acoustic features and image features of the middle fragment are combined, and the acoustic features and image features of the ending fragment are combined.
S140: train and predict on the feature vectors of the at least one audio fragment to obtain the corresponding probability distribution, and select the label corresponding to the maximum probability value as the label of the audio file.
Optionally, this step includes: training and predicting with a support vector machine (Support Vector Machine, SVM) on the feature vectors of the at least one audio fragment and a training set; obtaining the probability distribution of the at least one audio fragment over the preset labels; fusing the probability distributions of the at least one audio fragment over the preset labels according to a preset rule; and selecting the label corresponding to the maximum probability value as the label of the audio file.
Taking mp3 songs as the example audio, competing products' data was analyzed. The genre labels of Xiami Music comprise 23 major labels in total: blues, country, classical, electronic, folk, rap, jazz, easy listening, new age, pop, rhythm and blues, reggae, rock, singer-songwriter, children, Chinese style, ringtones, audiobooks, world music, stage/screen/entertainment, experimental, Latin, and metal; these 23 major labels contain 614 minor labels. Because the 5 major labels singer-songwriter, children, Chinese style, ringtones, and audiobooks are content-oriented, the 2 major labels Latin and metal are easily confused, and the styles of the 3 major labels world music, stage/screen/entertainment, and experimental are indefinite, all of these were deleted. According to the characteristics of the music content in the Kuwo song library, the major labels ringtones, audiobooks, DJ, red songs, children's songs, and Chinese opera were added. The preset labels thus formed include, but are not limited to, any combination of: blues, country, classical, electronic, folk, rap, jazz, easy listening, new age, pop, rhythm and blues, rock, reggae, ringtones, audiobooks, DJ, red songs, children's songs, and Chinese opera.
When choosing the training set, taking 13 preset labels as the example: the major labels of Xiami Music's more than 3 million songs can be crawled. First, the songs carrying only a single major label are filtered out, about 1 million in total. The 13 major labels together contain about 500 minor labels; to ensure that songs belonging to the minor labels are evenly distributed in the training set and reach a sufficient quantity, more than 10 songs are chosen under each minor label. The final sample set contains 500 songs per major label, 6,500 songs in total, and these 6,500 songs form the training set.
Using the SVM, the feature vectors of the song to be predicted and the pre-established training set are used for training and prediction, yielding the probability distribution of the song.
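The patent names an SVM but no particular implementation; the following sketch uses scikit-learn's SVC (an assumption) on synthetic stand-in data of the mono feature dimensionality:

```python
# Sketch of the SVM training/prediction step with scikit-learn's SVC.
# The 652-dim synthetic vectors and the 4-label setup are assumptions
# standing in for real fragment features and the preset labels.
import numpy as np
from sklearn.svm import SVC

n_labels = 4
rng = np.random.default_rng(0)
# Stand-in training set: 80 mono fragment feature vectors (652-dim).
X_train = rng.normal(size=(80, 652))
y_train = np.arange(80) % n_labels          # every label represented

clf = SVC(probability=True)                 # enables predict_proba
clf.fit(X_train, y_train)

# Probability distribution of one fragment over the preset labels.
fragment = rng.normal(size=(1, 652))
proba = clf.predict_proba(fragment)[0]      # sums to 1 over the labels
```

With `probability=True`, SVC fits Platt scaling internally so that `predict_proba` returns a per-label probability distribution, which is what the fusion step below consumes.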
Optionally, the step of fusing the probability distributions of the at least one audio fragment over the preset labels according to a preset rule and selecting the label corresponding to the maximum probability value as the label of the audio file includes:
when the preset rule is the Max rule, computing, on each preset label, the maximum of the at least one audio fragment's probabilities, and selecting the label corresponding to the maximum of the resulting values as the label of the audio file;
when the preset rule is the Min rule, computing, on each preset label, the minimum of the at least one audio fragment's probabilities, and selecting the label corresponding to the maximum of the resulting values as the label of the audio file;
when the preset rule is the Sum rule, computing, on each preset label, the sum of the at least one audio fragment's probabilities, and selecting the label corresponding to the maximum of the resulting values as the label of the audio file;
when the preset rule is the Product rule, computing, on each preset label, the product of the at least one audio fragment's probabilities, and selecting the label corresponding to the maximum of the resulting values as the label of the audio file.
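The four rules can be sketched compactly; `distributions` holds one probability list per fragment, with one entry per preset label (the sample values reuse the three-fragment, four-label example discussed in this description):

```python
# Sketch of the four fusion rules over per-fragment probability
# distributions. The file's label is the index with the maximum fused value.
from math import prod

def fuse_label(distributions, rule):
    """Return the index of the label selected under the given rule."""
    n_labels = len(distributions[0])
    # Gather, for each label, the probabilities of all fragments.
    cols = [[d[j] for d in distributions] for j in range(n_labels)]
    op = {"max": max, "min": min, "sum": sum, "product": prod}[rule]
    fused = [op(c) for c in cols]
    return fused.index(max(fused))

fragments = [[0.8, 0.1, 0.05, 0.05],
             [0.2, 0.2, 0.2, 0.4],
             [0.7, 0.1, 0.1, 0.1]]
choice = fuse_label(fragments, "min")   # → 0 (the first label)
```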
Taking 13 preset labels and 3 audio fragments as the example: under the Max rule, the maximum of the 3 fragments' probabilities on each of the 13 labels is computed, and the label corresponding to the maximum of the resulting values is selected as the label of the audio file; under the Min rule, the minimum on each of the 13 labels is computed, and the label corresponding to the maximum of the resulting values is selected; under the Sum rule, the sum on each of the 13 labels is computed, and the label corresponding to the maximum of the resulting values is selected; under the Product rule, the product on each of the 13 labels is computed, and the label corresponding to the maximum of the resulting values is selected.
In one embodiment, suppose a song has 3 audio fragments, A, B, and C, whose probability distributions over 4 labels are A: [0.8, 0.1, 0.05, 0.05], B: [0.2, 0.2, 0.2, 0.4], and C: [0.7, 0.1, 0.1, 0.1]. Under the Max rule, the maximum of A, B, and C is taken on each label, giving [0.8, 0.2, 0.2, 0.4]; the label corresponding to 0.8 is the label of the song. Under the Min rule, the minimum on each label is taken, giving [0.2, 0.1, 0.05, 0.05]; the label corresponding to 0.2 is the label of the song. Under the Sum rule, the sum of A, B, and C on each label is computed, giving [1.7, 0.4, 0.35, 0.55]; the label corresponding to 1.7 is the label of the song. Under the Product rule, the product of A, B, and C on each label is computed, giving [0.112, 0.002, 0.001, 0.002]; the label corresponding to 0.112 is the label of the song.
Before audio files are actually labeled, the accuracy of the audio annotation method provided by the embodiment of the present invention is verified on the ismir2004 dataset. In the result-fusion experiments, the best results tend to come from the Min rule; therefore, in production, the result obtained with the Min rule is selected as the final result.
With the audio annotation method provided by the embodiment of the present invention, an audio file is cut into multiple audio fragments and at least one audio fragment is selected from them; the acoustic features and image features of the at least one audio fragment are extracted, the acoustic features including tonality and/or frequency-distribution features; for each selected fragment, its acoustic features and image features are combined to obtain its feature vector; and the feature vectors of the at least one audio fragment are used for training and prediction, a corresponding probability distribution is obtained, and the label corresponding to the maximum probability value is selected as the label of the audio file. This solves the problems that audio annotation consumes substantial human resources, takes a long time, and has low accuracy, and achieves efficient, high-accuracy audio annotation.
Fig. 2 is a schematic diagram of the audio annotation device provided by an embodiment of the present invention. As shown in Fig. 2, the audio annotation device includes: a cutting module 210, an extraction module 220, a combination module 230, and a training-and-prediction module 240.
The cutting module 210 is configured to cut an audio file into multiple audio fragments and select at least one audio fragment from them.
The extraction module 220 is configured to extract the acoustic features and image features of the at least one audio fragment, the acoustic features including tonality and/or frequency-distribution features.
The combination module 230 is configured to combine, for each selected audio fragment, its acoustic features and image features to obtain the feature vector of that fragment.
The training-and-prediction module 240 is configured to train and predict on the feature vectors of the at least one audio fragment, obtain the corresponding probability distribution, and select the label corresponding to the maximum probability value as the label of the audio file.
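As a hypothetical sketch only (class and method names are invented for illustration; the patent describes modules, not code), the four modules could cooperate as follows:

```python
# Hypothetical skeleton of the four-module device. Each collaborator is
# expected to expose one operation mirroring the description above; the
# names (cut, acoustic, image, combine, predict) are illustrative.
class AudioAnnotator:
    def __init__(self, cutter, extractor, combiner, predictor):
        self.cutter = cutter          # cutting module 210
        self.extractor = extractor    # extraction module 220
        self.combiner = combiner      # combination module 230
        self.predictor = predictor    # training-and-prediction module 240

    def annotate(self, audio_file):
        fragments = self.cutter.cut(audio_file)
        vectors = [self.combiner.combine(self.extractor.acoustic(f),
                                         self.extractor.image(f))
                   for f in fragments]
        proba = self.predictor.predict(vectors)   # one value per label
        # Select the label with the maximum probability value.
        return max(range(len(proba)), key=proba.__getitem__)
```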
Preferably, the cutting module 210 is specifically configured to cut the audio file into multiple audio fragments with the ffmpeg tool and to select three fragments from them: the beginning part, the middle part, and the ending part.
Preferably, the extraction module 220 is specifically configured to extract the acoustic features of the at least one audio fragment with the marsyas tool: when the audio fragment is mono, 31 features including tonality features and frequency-distribution features are extracted and their mean square deviations computed, yielding 31 further values; when the audio fragment is stereo, 62 features are extracted and their mean square deviations computed, yielding 62 further values.
The combination module 230 is further configured to combine a mono fragment's values into a 62-dimensional vector and a stereo fragment's values into a 124-dimensional vector.
The extraction module 220 is further configured to extract the spectrogram of the at least one audio fragment with the marsyas tool, and the cutting module 210 is further configured to cut the spectrogram into 10 sub-images.
Fig. 3 is a schematic diagram of another audio annotation device provided by an embodiment of the present invention, which adds one module, the computation module 250, to the audio annotation device of Fig. 2. As shown in Fig. 3, the computation module 250 is configured to compute the 59-dimensional LBP vector of each spectrogram sub-image.
The combination module 230 is further configured to concatenate the 10 sub-images' 59-dimensional LBP vectors into one 590-dimensional vector.
Preferably, the combination module 230 is specifically configured to: when the audio fragment is mono, concatenate the 62-dimensional vector and the 590-dimensional vector into a 652-dimensional vector, obtaining the feature vector of the fragment; and when the audio fragment is stereo, concatenate the 124-dimensional vector and the 590-dimensional vector into a 714-dimensional vector, obtaining the feature vector of the fragment.
Preferably, the training-and-prediction module 240 is specifically configured to train and predict with the SVM on the feature vectors of the at least one audio fragment and the training set, and to obtain the probability distribution of the at least one audio fragment over the preset labels.
The combination module 230 is further configured to fuse the probability distributions of the at least one audio fragment over the preset labels according to a preset rule, and to select the label corresponding to the maximum probability value as the label of the audio file.
Optionally, the computing module 250 is further configured to:
when the preset rule is the Max rule, compute the maximum of the probabilities of the at least one audio fragment on each preset label, and select the label corresponding to the maximum resulting value as the label of the audio file;
when the preset rule is the Min rule, compute the minimum of the probabilities of the at least one audio fragment on each preset label, and select the label corresponding to the maximum resulting value as the label of the audio file;
when the preset rule is the Sum rule, compute the sum of the probabilities of the at least one audio fragment on each preset label, and select the label corresponding to the maximum resulting value as the label of the audio file;
when the preset rule is the Product rule, compute the product of the probabilities of the at least one audio fragment on each preset label, and select the label corresponding to the maximum resulting value as the label of the audio file.
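The four combination rules can be sketched as follows, assuming each fragment's SVM output is a probability distribution over the preset labels (label names here are purely illustrative):

```python
import numpy as np

# each rule collapses the per-fragment probabilities, label by label
RULES = {"Max": np.max, "Min": np.min, "Sum": np.sum, "Product": np.prod}

def label_audio(fragment_probs, labels, rule="Sum"):
    """Combine per-fragment probability distributions over the preset
    labels with the chosen rule, then pick the label whose combined
    score is largest.

    fragment_probs: (n_fragments, n_labels) array of SVM probabilities.
    """
    scores = RULES[rule](np.asarray(fragment_probs), axis=0)
    return labels[int(np.argmax(scores))]

probs = [[0.7, 0.2, 0.1],   # fragment 1
         [0.5, 0.3, 0.2],   # fragment 2
         [0.6, 0.1, 0.3]]   # fragment 3
```

Note that all four rules first reduce across fragments (axis 0) and only then take the argmax over labels, matching the "select the label corresponding to the maximum resulting value" wording above.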
With the audio annotation device provided by the embodiments of the present invention, the cutting module cuts an audio file into multiple audio fragments and selects at least one audio fragment therefrom; the extraction module extracts acoustic features and image features of the at least one audio fragment, the acoustic features comprising tonality and/or frequency-distribution features; the synthesis module combines the acoustic features and the image features of a same audio fragment among the at least one audio fragment, obtaining a feature vector of the same audio fragment among the at least one audio fragment; the training prediction module trains and predicts on the feature vectors of the at least one audio fragment, obtains the corresponding probability distributions, and selects the label corresponding to the maximum probability value as the label of the audio file. This solves the problems of audio annotation being labor-intensive, time-consuming, and of low accuracy, and achieves efficient, high-accuracy audio annotation.
Those of ordinary skill in the art will further appreciate that the modules and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented with hardware, a software module executed by a processor, or a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above embodiments further describe in detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing is merely an embodiment of the present invention and is not intended to limit its protection scope; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. An audio annotation method, characterized by comprising the following steps:
cutting an audio file into multiple audio fragments, and selecting at least one audio fragment therefrom;
extracting acoustic features and image features of the at least one audio fragment, the acoustic features comprising tonality and/or frequency-distribution features;
combining the acoustic features and the image features of a same audio fragment among the at least one audio fragment, to obtain a feature vector of the same audio fragment among the at least one audio fragment;
training and predicting on the feature vectors of the at least one audio fragment to obtain corresponding probability distributions, and selecting the label corresponding to the maximum probability value as the label of the audio file.
2. The method according to claim 1, characterized in that the step of cutting an audio file into multiple audio fragments and selecting at least one audio fragment therefrom comprises:
cutting the audio file into multiple audio fragments by means of the ffmpeg tool, and selecting therefrom three audio fragments: a beginning portion, a middle portion, and an ending portion.
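The segmentation of claim 2 can be sketched as follows. The helper only builds ffmpeg command lines rather than running them; the wrapper, the clip length, and the output names are illustrative assumptions (the `-ss`/`-t` options are standard ffmpeg seek/duration flags):

```python
def ffmpeg_cut_commands(src, duration, clip_len=30.0):
    """Build ffmpeg commands extracting three equal-length fragments:
    the beginning, the middle, and the end of the file.

    src: input audio path; duration: total length in seconds;
    clip_len: length of each extracted fragment in seconds.
    """
    starts = {
        "begin": 0.0,
        "middle": max(0.0, duration / 2 - clip_len / 2),
        "end": max(0.0, duration - clip_len),
    }
    return [
        ["ffmpeg", "-ss", str(start), "-t", str(clip_len),
         "-i", src, f"{name}.wav"]
        for name, start in starts.items()
    ]

cmds = ffmpeg_cut_commands("song.mp3", duration=180.0)
```

Each command could then be run with `subprocess.run(cmd, check=True)` on a system where ffmpeg is installed.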
3. The method according to claim 1, characterized in that the step of extracting acoustic features of the at least one audio fragment comprises:
extracting the acoustic features of the at least one audio fragment by means of the marsyas tool: when the audio fragment is monophonic, extracting 31 features including tonality features and frequency-distribution features, and computing the mean square deviation of each, obtaining a further 31 feature values; when the audio fragment is two-channel, extracting 62 features and computing the mean square deviation of each, obtaining a further 62 feature values;
combining the features of the monophonic audio fragment into one 62-dimensional vector, and the features of the two-channel audio fragment into one 124-dimensional vector;
and the step of extracting image features of the at least one audio fragment comprises:
extracting the spectrogram of the at least one audio fragment by means of the marsyas tool;
cutting the spectrogram into 10 spectral sub-images;
computing the 59-dimensional LBP vector of each spectral sub-image;
concatenating the 59-dimensional LBP vectors of the 10 spectral sub-images into one 590-dimensional vector.
4. The method according to any one of claims 1-3, characterized in that the step of combining the acoustic features and the image features of a same audio fragment among the at least one audio fragment to obtain the feature vector of the same audio fragment among the at least one audio fragment comprises:
when the audio fragment is monophonic, concatenating the 62-dimensional vector with the 590-dimensional vector into a 652-dimensional vector, to obtain the feature vector of the same audio fragment among the at least one audio fragment;
when the audio fragment is two-channel, concatenating the 124-dimensional vector with the 590-dimensional vector into a 714-dimensional vector, to obtain the feature vector of the same audio fragment among the at least one audio fragment.
5. The method according to claim 1, characterized in that the step of training and predicting on the feature vectors of the at least one audio fragment to obtain corresponding probability distributions and selecting the label corresponding to the maximum probability value as the label of the audio file comprises:
training and predicting on the feature vectors of the at least one audio fragment together with a training set by means of an SVM;
obtaining the probability distribution of the at least one audio fragment over preset labels;
combining the probability distributions of the at least one audio fragment over the preset labels according to a preset rule, and selecting the label corresponding to the maximum probability value as the label of the audio file.
6. An audio annotation device, characterized in that the device comprises: a cutting module, an extraction module, a synthesis module, and a training prediction module;
the cutting module is configured to cut an audio file into multiple audio fragments, and to select at least one audio fragment therefrom;
the extraction module is configured to extract acoustic features and image features of the at least one audio fragment, the acoustic features comprising tonality and/or frequency-distribution features;
the synthesis module is configured to combine the acoustic features and the image features of a same audio fragment among the at least one audio fragment, to obtain a feature vector of the same audio fragment among the at least one audio fragment;
the training prediction module is configured to train and predict on the feature vectors of the at least one audio fragment, to obtain corresponding probability distributions, and to select the label corresponding to the maximum probability value as the label of the audio file.
7. The device according to claim 6, characterized in that the cutting module is specifically configured to cut the audio file into multiple audio fragments by means of the ffmpeg tool, and to select therefrom three audio fragments: a beginning portion, a middle portion, and an ending portion.
8. The device according to claim 6, characterized in that the device further comprises a computing module;
the extraction module is specifically configured to extract the acoustic features of the at least one audio fragment by means of the marsyas tool: when the audio fragment is monophonic, 31 features including tonality features and frequency-distribution features are extracted and the mean square deviation of each is computed, obtaining a further 31 feature values; when the audio fragment is two-channel, 62 features are extracted and the mean square deviation of each is computed, obtaining a further 62 feature values;
the synthesis module is further configured to combine the features of the monophonic audio fragment into one 62-dimensional vector, and the features of the two-channel audio fragment into one 124-dimensional vector;
the extraction module is further configured to extract the spectrogram of the at least one audio fragment by means of the marsyas tool;
the cutting module is further configured to cut the spectrogram into 10 spectral sub-images;
the computing module is configured to compute the 59-dimensional LBP vector of each spectral sub-image;
the synthesis module is further configured to concatenate the 59-dimensional LBP vectors of the 10 spectral sub-images into one 590-dimensional vector.
9. The device according to any one of claims 6-8, characterized in that the synthesis module is specifically configured to: when the audio fragment is monophonic, concatenate the 62-dimensional vector with the 590-dimensional vector into a 652-dimensional vector, obtaining the feature vector of the same audio fragment among the at least one audio fragment; when the audio fragment is two-channel, concatenate the 124-dimensional vector with the 590-dimensional vector into a 714-dimensional vector, obtaining the feature vector of the same audio fragment among the at least one audio fragment; and concatenate the 59-dimensional LBP vectors of the 10 spectral sub-images into one 590-dimensional vector.
10. The device according to claim 6, characterized in that:
the training prediction module is specifically configured to train and predict on the feature vectors of the at least one audio fragment together with a training set by means of an SVM, and to obtain the probability distribution of the at least one audio fragment over preset labels;
the synthesis module is further configured to combine the probability distributions of the at least one audio fragment over the preset labels according to a preset rule, and to select the label corresponding to the maximum probability value as the label of the audio file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610866294.5A CN107886941A (en) | 2016-09-29 | 2016-09-29 | A kind of audio mask method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107886941A true CN107886941A (en) | 2018-04-06 |
Family
ID=61768845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610866294.5A Pending CN107886941A (en) | 2016-09-29 | 2016-09-29 | A kind of audio mask method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107886941A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101398825A (en) * | 2007-09-29 | 2009-04-01 | 三星电子株式会社 | Rapid music assorting and searching method and device |
CN101599271A (en) * | 2009-07-07 | 2009-12-09 | 华中科技大学 | A kind of recognition methods of digital music emotion |
CN103186527A (en) * | 2011-12-27 | 2013-07-03 | 北京百度网讯科技有限公司 | System for building music classification model, system for recommending music and corresponding method |
CN104090876A (en) * | 2013-04-18 | 2014-10-08 | 腾讯科技(深圳)有限公司 | Classifying method and classifying device for audio files |
CN104123934A (en) * | 2014-07-23 | 2014-10-29 | 泰亿格电子(上海)有限公司 | Speech composition recognition method and system |
CN104735528A (en) * | 2015-03-02 | 2015-06-24 | 青岛海信电器股份有限公司 | Sound effect matching method and device |
CN105872855A (en) * | 2016-05-26 | 2016-08-17 | 广州酷狗计算机科技有限公司 | Labeling method and device for video files |
Non-Patent Citations (2)
Title |
---|
Chang Yiwei: "Research on User Music Classification Methods Based on Music Libraries", China Master's Theses Full-text Database, Information Science and Technology * |
Tian Hui: "Support Vector Machine Ensembles and Their Application in Music Classification", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065076A (en) * | 2018-09-05 | 2018-12-21 | 深圳追科技有限公司 | Setting method, device, equipment and the storage medium of audio tag |
CN112562647A (en) * | 2020-11-24 | 2021-03-26 | 中电海康集团有限公司 | Method and device for marking audio starting point |
CN112562647B (en) * | 2020-11-24 | 2022-09-06 | 中电海康集团有限公司 | Method and device for marking audio starting point |
CN115757784A (en) * | 2022-11-21 | 2023-03-07 | 中科世通亨奇(北京)科技有限公司 | Corpus labeling method and apparatus based on labeling model and label template screening |
CN115757784B (en) * | 2022-11-21 | 2023-07-07 | 中科世通亨奇(北京)科技有限公司 | Corpus labeling method and device based on labeling model and label template screening |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180406 |