CN107886941A - Audio annotation method and device - Google Patents
Audio annotation method and device
- Publication number
- CN107886941A CN107886941A CN201610866294.5A CN201610866294A CN107886941A CN 107886941 A CN107886941 A CN 107886941A CN 201610866294 A CN201610866294 A CN 201610866294A CN 107886941 A CN107886941 A CN 107886941A
- Authority
- CN
- China
- Prior art keywords
- audio
- audio fragment
- fragment
- vector
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/686—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
Abstract
The present invention relates to an audio annotation method and device. The method includes: cutting an audio file into multiple audio fragments and selecting at least one audio fragment from them; extracting acoustic features and image features of the at least one audio fragment, the acoustic features including tonality and/or frequency-distribution features; combining, for each selected audio fragment, its acoustic features and image features to obtain the feature vector of that fragment; and training and predicting on the feature vectors of the at least one audio fragment to obtain a corresponding probability distribution, the label corresponding to the maximum probability value being selected as the label of the audio file. The method solves the problems that audio annotation consumes substantial human resources, takes a long time, and has low accuracy, and achieves efficient, high-accuracy audio annotation.
Description
Technical field
The present invention relates to the field of audio analysis and processing, and in particular to an audio annotation method and device.
Background art
Annotating audio files provides an indispensable source of training knowledge for the field of audio analysis and processing.
In the prior art, two methods are commonly used to annotate audio: manual annotation (i.e., expert annotation) and statistics-based annotation. Manual annotation requires substantial human resources, is time-consuming, and its accuracy is not high. Statistics-based annotation can only annotate audio resembling audio that has already been annotated; audio unlike anything previously annotated cannot be handled, so its accuracy is also relatively low.
Summary of the invention
The purpose of the present invention is to solve the prior-art problems that audio annotation consumes substantial human resources, takes a long time, and has low accuracy.
In a first aspect, the invention provides an audio annotation method comprising the following steps: cutting an audio file into multiple audio fragments and selecting at least one audio fragment from them; extracting the acoustic features and image features of the at least one audio fragment, the acoustic features including tonality and/or frequency-distribution features; combining, for each selected audio fragment, its acoustic features and image features to obtain the feature vector of that fragment; and training and predicting on the feature vectors of the at least one audio fragment to obtain a corresponding probability distribution, the label corresponding to the maximum probability value being selected as the label of the audio file.
Preferably, the step of cutting the audio file into multiple audio fragments and selecting at least one audio fragment from them includes: cutting the audio file into multiple audio fragments with the ffmpeg tool, and selecting three audio fragments from them: the beginning part, the middle part, and the ending part.
Preferably, the step of extracting the acoustic features of the at least one audio fragment includes: extracting the acoustic features of the at least one audio fragment with the marsyas tool; when the audio fragment is mono, extracting 31 features including tonality features and frequency-distribution features and computing their mean square deviations, yielding 31 further values; when the audio fragment is stereo, extracting 62 features and computing their mean square deviations, yielding 62 further values; and combining a mono fragment's values into a 62-dimensional vector and a stereo fragment's values into a 124-dimensional vector.
The step of extracting the image features of the at least one audio fragment includes: extracting the spectrogram of the at least one audio fragment with the marsyas tool; cutting the spectrogram into 10 sub-images; computing the 59-dimensional LBP vector of each sub-image; and concatenating the 10 sub-images' LBP vectors into one 590-dimensional vector.
Preferably, the step of combining the acoustic features and image features of each selected audio fragment to obtain the feature vector of that fragment includes: when the audio fragment is mono, concatenating the 62-dimensional vector and the 590-dimensional vector into a 652-dimensional vector, obtaining the feature vector of the fragment; and when the audio fragment is stereo, concatenating the 124-dimensional vector and the 590-dimensional vector into a 714-dimensional vector, obtaining the feature vector of the fragment.
Preferably, the step of training and predicting on the feature vectors of the at least one audio fragment, obtaining the corresponding probability distribution, and selecting the label corresponding to the maximum probability value as the label of the audio file includes: training and predicting with an SVM on the feature vectors of the at least one audio fragment and a training set; obtaining the probability distribution of the at least one audio fragment over preset labels; fusing the probability distributions of the at least one audio fragment over the preset labels according to a preset rule; and selecting the label corresponding to the maximum probability value as the label of the audio file.
In a second aspect, the invention provides an audio annotation device, the device including: a cutting module, an extraction module, a combination module, and a training-and-prediction module. The cutting module is configured to cut an audio file into multiple audio fragments and select at least one audio fragment from them; the extraction module is configured to extract the acoustic features and image features of the at least one audio fragment, the acoustic features including tonality and/or frequency-distribution features; the combination module is configured to combine, for each selected audio fragment, its acoustic features and image features to obtain the feature vector of that fragment; and the training-and-prediction module is configured to train and predict on the feature vectors of the at least one audio fragment, obtain a corresponding probability distribution, and select the label corresponding to the maximum probability value as the label of the audio file.
Preferably, the cutting module is specifically configured to cut the audio file into multiple audio fragments with the ffmpeg tool and to select three audio fragments from them: the beginning part, the middle part, and the ending part.
Preferably, the device further includes a computation module.
The extraction module is specifically configured to extract the acoustic features of the at least one audio fragment with the marsyas tool: when the audio fragment is mono, 31 features including tonality features and frequency-distribution features are extracted and their mean square deviations computed, yielding 31 further values; when the audio fragment is stereo, 62 features are extracted and their mean square deviations computed, yielding 62 further values. The combination module is further configured to combine a mono fragment's values into a 62-dimensional vector and a stereo fragment's values into a 124-dimensional vector. The extraction module is further configured to extract the spectrogram of the at least one audio fragment with the marsyas tool; the cutting module is further configured to cut the spectrogram into 10 sub-images; the computation module is configured to compute the 59-dimensional LBP vector of each sub-image; and the combination module is further configured to concatenate the 10 sub-images' LBP vectors into one 590-dimensional vector.
Preferably, the combination module is specifically configured to: when the audio fragment is mono, concatenate the 62-dimensional vector and the 590-dimensional vector into a 652-dimensional vector, obtaining the feature vector of the fragment; and when the audio fragment is stereo, concatenate the 124-dimensional vector and the 590-dimensional vector into a 714-dimensional vector, obtaining the feature vector of the fragment.
Preferably, the training-and-prediction module is specifically configured to train and predict with an SVM on the feature vectors of the at least one audio fragment and a training set, and to obtain the probability distribution of the at least one audio fragment over the preset labels; the combination module is further configured to fuse the probability distributions of the at least one audio fragment over the preset labels according to a preset rule, and to select the label corresponding to the maximum probability value as the label of the audio file.
The present invention solves the problems that audio annotation consumes substantial human resources, takes a long time, and has low accuracy, and achieves efficient, high-accuracy audio annotation.
Brief description of the drawings
Fig. 1 is a flow diagram of the audio annotation method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the audio annotation device provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of another audio annotation device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To facilitate understanding of the embodiments of the present invention, they are further explained below with reference to the drawings and specific examples; the examples do not limit the embodiments of the present invention.
Fig. 1 is a flow diagram of the audio annotation method provided by an embodiment of the present invention. As shown in Fig. 1, the method may be executed by a terminal, which may be referred to as user equipment (User Equipment, UE), a mobile station (Mobile Station, MS), a mobile terminal (Mobile Terminal), a computer, a microcomputer, and so on. For example, the terminal may be a portable computer or a computer with a mobile terminal; the present invention places no limitation on this.
S110: cut the audio file into multiple audio fragments and select at least one audio fragment from them.
Audio files include but are not limited to songs, recordings, and piano pieces. Here, the audio file can be cut into multiple audio fragments with the ffmpeg tool, and three fragments are selected from them: the beginning part, the middle part, and the ending part.
The following uses a song as the example audio file. When the audio file is a song in mp3 format, 30-40 s from the beginning of the song is chosen as one audio fragment, 10 s from the middle as another, and 30-40 s from the ending as a third.
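As a sketch under stated assumptions (the fragment offsets, output file names, and the 200 s song length are invented for the example; the patent only names ffmpeg), the three cuts could be issued as:

```python
# Illustrative sketch: build the ffmpeg commands that cut the beginning,
# middle, and ending fragments of a song. Fragment lengths follow the text
# (30-40 s at the ends, 10 s in the middle); the file names and the song
# duration are assumptions for the example.
def fragment_commands(src, duration_s):
    """Return one ffmpeg command (as an argv list) per fragment."""
    spans = [
        ("begin.mp3", 0.0, 35.0),                    # beginning, ~30-40 s
        ("middle.mp3", duration_s / 2 - 5.0, 10.0),  # middle, 10 s
        ("end.mp3", duration_s - 35.0, 35.0),        # ending, ~30-40 s
    ]
    return [
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(length),
         "-i", src, "-c", "copy", out]
        for out, start, length in spans
    ]

cmds = fragment_commands("song.mp3", 200.0)
```

Each command can then be executed with `subprocess.run(cmd, check=True)`; `-c copy` avoids re-encoding the mp3 stream.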
S120: extract the acoustic features and image features of the at least one audio fragment, the acoustic features including tonality and/or frequency-distribution features.
Optionally, the step of extracting the acoustic features of the at least one audio fragment includes:
extracting the acoustic features of the at least one audio fragment with the marsyas tool: when the audio fragment is mono, 31 features including tonality features and frequency-distribution features are extracted and their mean square deviations are computed, yielding 31 further values; when the audio fragment is stereo, 62 features are extracted and their mean square deviations computed, yielding 62 further values;
combining a mono fragment's values into a 62-dimensional vector and a stereo fragment's values into a 124-dimensional vector.
Specifically, for a song in mp3 format: when the song is mono, the 31 features of the beginning fragment, including tonality and/or frequency-distribution features, are extracted and their mean square deviations computed, yielding another 31 values and thus 62 values in total; the 62 values are combined into a 62-dimensional vector as the acoustic features of that fragment. During extraction, the sampling window size may be 1024 samples. For a mono song, the acoustic features of the middle and ending fragments are extracted in the same way and are not described again here.
Because most mp3 songs are stereo: when the song is stereo, after the acoustic features of the beginning fragment are extracted in one channel, they are extracted in the other channel as well, giving 124 values in total; the 124 values are combined into a 124-dimensional vector as the acoustic features of the beginning fragment. For a stereo song, the acoustic features of the middle and ending fragments are extracted in the same way and are not described again here.
Optionally, the step of extracting the image features of the at least one audio fragment includes: extracting the spectrogram of the at least one audio fragment with the marsyas tool; cutting the spectrogram into 10 sub-images; computing the 59-dimensional LBP vector of each sub-image; and concatenating the 10 sub-images' LBP vectors into one 590-dimensional vector.
Specifically, the spectrogram of the beginning fragment is extracted with the sound2png tool in marsyas; using a linear frequency division, the spectrogram of the fragment is cut into 10 sub-images; a 59-dimensional LBP (local binary pattern) vector is computed for each sub-image; and the 10 LBP vectors, 59 dimensions each, are concatenated into one 590-dimensional vector as the image features of the fragment.
The image features of the middle and ending fragments are extracted in the same way as those of the beginning fragment and are not described again here.
S130: combine, for each selected audio fragment, its acoustic features and image features to obtain the feature vector of that fragment.
Optionally, the step of combining the acoustic features and image features of each selected audio fragment to obtain its feature vector includes:
when the audio fragment is mono, concatenating the 62-dimensional vector and the 590-dimensional vector into a 652-dimensional vector, obtaining the feature vector of the fragment;
when the audio fragment is stereo, concatenating the 124-dimensional vector and the 590-dimensional vector into a 714-dimensional vector, obtaining the feature vector of the fragment.
Here, combining the acoustic features and image features of the same audio fragment means that the acoustic features and image features of the beginning fragment are combined, the acoustic features and image features of the middle fragment are combined, and the acoustic features and image features of the ending fragment are combined.
S140: train and predict on the feature vectors of the at least one audio fragment to obtain the corresponding probability distribution, and select the label corresponding to the maximum probability value as the label of the audio file.
Optionally, this step includes: training and predicting with a support vector machine (Support Vector Machine, SVM) on the feature vectors of the at least one audio fragment and a training set; obtaining the probability distribution of the at least one audio fragment over the preset labels; fusing the probability distributions of the at least one audio fragment over the preset labels according to a preset rule; and selecting the label corresponding to the maximum probability value as the label of the audio file.
Taking mp3 songs as the example audio, competing products' data was analyzed. The genre labels of Xiami Music comprise 23 major labels in total: blues, country, classical, electronic, folk, rap, jazz, easy listening, new age, pop, rhythm and blues, reggae, rock, singer-songwriter, children, Chinese style, ringtones, audiobooks, world music, stage/screen/entertainment, experimental, Latin, and metal; these 23 major labels contain 614 minor labels. Because the 5 major labels singer-songwriter, children, Chinese style, ringtones, and audiobooks are content-oriented, the 2 major labels Latin and metal are easily confused, and the styles of the 3 major labels world music, stage/screen/entertainment, and experimental are indefinite, all of these were deleted. According to the characteristics of the music content in the Kuwo song library, the major labels ringtones, audiobooks, DJ, red songs, children's songs, and Chinese opera were added. The preset labels thus formed include, but are not limited to, any combination of: blues, country, classical, electronic, folk, rap, jazz, easy listening, new age, pop, rhythm and blues, rock, reggae, ringtones, audiobooks, DJ, red songs, children's songs, and Chinese opera.
When choosing the training set, taking 13 preset labels as the example: the major labels of Xiami Music's more than 3 million songs can be crawled. First, the songs carrying only a single major label are filtered out, about 1 million in total. The 13 major labels together contain about 500 minor labels; to ensure that songs belonging to the minor labels are evenly distributed in the training set and reach a sufficient quantity, more than 10 songs are chosen under each minor label. The final sample set contains 500 songs per major label, 6,500 songs in total, and these 6,500 songs form the training set.
Using the SVM, the feature vectors of the song to be predicted and the pre-established training set are used for training and prediction, yielding the probability distribution of the song.
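The patent names an SVM but no particular implementation; the following sketch uses scikit-learn's SVC (an assumption) on synthetic stand-in data of the mono feature dimensionality:

```python
# Sketch of the SVM training/prediction step with scikit-learn's SVC.
# The 652-dim synthetic vectors and the 4-label setup are assumptions
# standing in for real fragment features and the preset labels.
import numpy as np
from sklearn.svm import SVC

n_labels = 4
rng = np.random.default_rng(0)
# Stand-in training set: 80 mono fragment feature vectors (652-dim).
X_train = rng.normal(size=(80, 652))
y_train = np.arange(80) % n_labels          # every label represented

clf = SVC(probability=True)                 # enables predict_proba
clf.fit(X_train, y_train)

# Probability distribution of one fragment over the preset labels.
fragment = rng.normal(size=(1, 652))
proba = clf.predict_proba(fragment)[0]      # sums to 1 over the labels
```

With `probability=True`, SVC fits Platt scaling internally so that `predict_proba` returns a per-label probability distribution, which is what the fusion step below consumes.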
Optionally, the step of fusing the probability distributions of the at least one audio fragment over the preset labels according to a preset rule and selecting the label corresponding to the maximum probability value as the label of the audio file includes:
when the preset rule is the Max rule, computing, on each preset label, the maximum of the at least one audio fragment's probabilities, and selecting the label corresponding to the maximum of the resulting values as the label of the audio file;
when the preset rule is the Min rule, computing, on each preset label, the minimum of the at least one audio fragment's probabilities, and selecting the label corresponding to the maximum of the resulting values as the label of the audio file;
when the preset rule is the Sum rule, computing, on each preset label, the sum of the at least one audio fragment's probabilities, and selecting the label corresponding to the maximum of the resulting values as the label of the audio file;
when the preset rule is the Product rule, computing, on each preset label, the product of the at least one audio fragment's probabilities, and selecting the label corresponding to the maximum of the resulting values as the label of the audio file.
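The four rules can be sketched compactly; `distributions` holds one probability list per fragment, with one entry per preset label (the sample values reuse the three-fragment, four-label example discussed in this description):

```python
# Sketch of the four fusion rules over per-fragment probability
# distributions. The file's label is the index with the maximum fused value.
from math import prod

def fuse_label(distributions, rule):
    """Return the index of the label selected under the given rule."""
    n_labels = len(distributions[0])
    # Gather, for each label, the probabilities of all fragments.
    cols = [[d[j] for d in distributions] for j in range(n_labels)]
    op = {"max": max, "min": min, "sum": sum, "product": prod}[rule]
    fused = [op(c) for c in cols]
    return fused.index(max(fused))

fragments = [[0.8, 0.1, 0.05, 0.05],
             [0.2, 0.2, 0.2, 0.4],
             [0.7, 0.1, 0.1, 0.1]]
choice = fuse_label(fragments, "min")   # → 0 (the first label)
```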
Taking 13 preset labels and 3 audio fragments as the example: under the Max rule, the maximum of the 3 fragments' probabilities on each of the 13 labels is computed, and the label corresponding to the maximum of the resulting values is selected as the label of the audio file; under the Min rule, the minimum on each of the 13 labels is computed, and the label corresponding to the maximum of the resulting values is selected; under the Sum rule, the sum on each of the 13 labels is computed, and the label corresponding to the maximum of the resulting values is selected; under the Product rule, the product on each of the 13 labels is computed, and the label corresponding to the maximum of the resulting values is selected.
In one embodiment, suppose a song has 3 audio fragments, A, B, and C, whose probability distributions over 4 labels are A: [0.8, 0.1, 0.05, 0.05], B: [0.2, 0.2, 0.2, 0.4], and C: [0.7, 0.1, 0.1, 0.1]. Under the Max rule, the maximum of A, B, and C is taken on each label, giving [0.8, 0.2, 0.2, 0.4]; the label corresponding to 0.8 is the label of the song. Under the Min rule, the minimum on each label is taken, giving [0.2, 0.1, 0.05, 0.05]; the label corresponding to 0.2 is the label of the song. Under the Sum rule, the sum of A, B, and C on each label is computed, giving [1.7, 0.4, 0.35, 0.55]; the label corresponding to 1.7 is the label of the song. Under the Product rule, the product of A, B, and C on each label is computed, giving [0.112, 0.002, 0.001, 0.002]; the label corresponding to 0.112 is the label of the song.
Before audio files are actually labeled, the accuracy of the audio annotation method provided by the embodiment of the present invention is verified on the ismir2004 dataset. In the result-fusion experiments, the best results tend to come from the Min rule; therefore, in production, the result obtained with the Min rule is selected as the final result.
With the audio annotation method provided by the embodiment of the present invention, an audio file is cut into multiple audio fragments and at least one audio fragment is selected from them; the acoustic features and image features of the at least one audio fragment are extracted, the acoustic features including tonality and/or frequency-distribution features; for each selected fragment, its acoustic features and image features are combined to obtain its feature vector; and the feature vectors of the at least one audio fragment are used for training and prediction, a corresponding probability distribution is obtained, and the label corresponding to the maximum probability value is selected as the label of the audio file. This solves the problems that audio annotation consumes substantial human resources, takes a long time, and has low accuracy, and achieves efficient, high-accuracy audio annotation.
Fig. 2 is a schematic diagram of the audio annotation device provided by an embodiment of the present invention. As shown in Fig. 2, the audio annotation device includes: a cutting module 210, an extraction module 220, a combination module 230, and a training-and-prediction module 240.
The cutting module 210 is configured to cut an audio file into multiple audio fragments and select at least one audio fragment from them.
The extraction module 220 is configured to extract the acoustic features and image features of the at least one audio fragment, the acoustic features including tonality and/or frequency-distribution features.
The combination module 230 is configured to combine, for each selected audio fragment, its acoustic features and image features to obtain the feature vector of that fragment.
The training-and-prediction module 240 is configured to train and predict on the feature vectors of the at least one audio fragment, obtain the corresponding probability distribution, and select the label corresponding to the maximum probability value as the label of the audio file.
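As a hypothetical sketch only (class and method names are invented for illustration; the patent describes modules, not code), the four modules could cooperate as follows:

```python
# Hypothetical skeleton of the four-module device. Each collaborator is
# expected to expose one operation mirroring the description above; the
# names (cut, acoustic, image, combine, predict) are illustrative.
class AudioAnnotator:
    def __init__(self, cutter, extractor, combiner, predictor):
        self.cutter = cutter          # cutting module 210
        self.extractor = extractor    # extraction module 220
        self.combiner = combiner      # combination module 230
        self.predictor = predictor    # training-and-prediction module 240

    def annotate(self, audio_file):
        fragments = self.cutter.cut(audio_file)
        vectors = [self.combiner.combine(self.extractor.acoustic(f),
                                         self.extractor.image(f))
                   for f in fragments]
        proba = self.predictor.predict(vectors)   # one value per label
        # Select the label with the maximum probability value.
        return max(range(len(proba)), key=proba.__getitem__)
```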
Preferably, the cutting module 210 is specifically configured to cut the audio file into multiple audio fragments with the ffmpeg tool and to select three fragments from them: the beginning part, the middle part, and the ending part.
Preferably, the extraction module 220 is specifically configured to extract the acoustic features of the at least one audio fragment with the marsyas tool: when the audio fragment is mono, 31 features including tonality features and frequency-distribution features are extracted and their mean square deviations computed, yielding 31 further values; when the audio fragment is stereo, 62 features are extracted and their mean square deviations computed, yielding 62 further values.
The combination module 230 is further configured to combine a mono fragment's values into a 62-dimensional vector and a stereo fragment's values into a 124-dimensional vector.
The extraction module 220 is further configured to extract the spectrogram of the at least one audio fragment with the marsyas tool, and the cutting module 210 is further configured to cut the spectrogram into 10 sub-images.
Fig. 3 is a schematic diagram of another audio annotation device provided by an embodiment of the present invention, which adds one module, the computation module 250, to the audio annotation device of Fig. 2. As shown in Fig. 3, the computation module 250 is configured to compute the 59-dimensional LBP vector of each spectrogram sub-image.
The combination module 230 is further configured to concatenate the 10 sub-images' 59-dimensional LBP vectors into one 590-dimensional vector.
Preferably, the combination module 230 is specifically configured to: when the audio fragment is mono, concatenate the 62-dimensional vector and the 590-dimensional vector into a 652-dimensional vector, obtaining the feature vector of the fragment; and when the audio fragment is stereo, concatenate the 124-dimensional vector and the 590-dimensional vector into a 714-dimensional vector, obtaining the feature vector of the fragment.
Preferably, the training-and-prediction module 240 is specifically configured to train and predict with the SVM on the feature vectors of the at least one audio fragment and the training set, and to obtain the probability distribution of the at least one audio fragment over the preset labels.
The combination module 230 is further configured to fuse the probability distributions of the at least one audio fragment over the preset labels according to a preset rule, and to select the label corresponding to the maximum probability value as the label of the audio file.
Optionally, the computing module 250 is further configured to:
when the preset rule is the Max rule, compute the maximum of the probabilities of the at least one audio fragment on each preset label, and select the label corresponding to the maximum resulting value as the label of the audio file;
when the preset rule is the Min rule, compute the minimum of the probabilities of the at least one audio fragment on each preset label, and select the label corresponding to the maximum resulting value as the label of the audio file;
when the preset rule is the Sum rule, compute the sum of the probabilities of the at least one audio fragment on each preset label, and select the label corresponding to the maximum resulting value as the label of the audio file;
when the preset rule is the Product rule, compute the product of the probabilities of the at least one audio fragment on each preset label, and select the label corresponding to the maximum resulting value as the label of the audio file.
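The four combination rules can be sketched as follows, assuming each fragment's SVM output is a probability distribution over the preset labels (label names here are purely illustrative):

```python
import numpy as np

# each rule collapses the per-fragment probabilities, label by label
RULES = {"Max": np.max, "Min": np.min, "Sum": np.sum, "Product": np.prod}

def label_audio(fragment_probs, labels, rule="Sum"):
    """Combine per-fragment probability distributions over the preset
    labels with the chosen rule, then pick the label whose combined
    score is largest.

    fragment_probs: (n_fragments, n_labels) array of SVM probabilities.
    """
    scores = RULES[rule](np.asarray(fragment_probs), axis=0)
    return labels[int(np.argmax(scores))]

probs = [[0.7, 0.2, 0.1],   # fragment 1
         [0.5, 0.3, 0.2],   # fragment 2
         [0.6, 0.1, 0.3]]   # fragment 3
```

Note that all four rules first reduce across fragments (axis 0) and only then take the argmax over labels, matching the "select the label corresponding to the maximum resulting value" wording above.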
With the audio annotation device provided by the embodiments of the present invention, the cutting module cuts an audio file into multiple audio fragments and selects at least one audio fragment therefrom; the extraction module extracts acoustic features and image features of the at least one audio fragment, the acoustic features comprising tonality and/or frequency-distribution features; the synthesis module combines the acoustic features and the image features of a same audio fragment among the at least one audio fragment, obtaining a feature vector of the same audio fragment among the at least one audio fragment; the training prediction module trains and predicts on the feature vectors of the at least one audio fragment, obtains the corresponding probability distributions, and selects the label corresponding to the maximum probability value as the label of the audio file. This solves the problems of audio annotation being labor-intensive, time-consuming, and of low accuracy, and achieves efficient, high-accuracy audio annotation.
Those of ordinary skill in the art will further appreciate that the modules and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented with hardware, a software module executed by a processor, or a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above embodiments further describe in detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing is merely an embodiment of the present invention and is not intended to limit its protection scope; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. An audio annotation method, characterized by comprising the following steps:
cutting an audio file into multiple audio fragments, and selecting at least one audio fragment therefrom;
extracting acoustic features and image features of the at least one audio fragment, the acoustic features comprising tonality and/or frequency-distribution features;
combining the acoustic features and the image features of a same audio fragment among the at least one audio fragment, to obtain a feature vector of the same audio fragment among the at least one audio fragment;
training and predicting on the feature vectors of the at least one audio fragment to obtain corresponding probability distributions, and selecting the label corresponding to the maximum probability value as the label of the audio file.
2. The method according to claim 1, characterized in that the step of cutting an audio file into multiple audio fragments and selecting at least one audio fragment therefrom comprises:
cutting the audio file into multiple audio fragments by means of the ffmpeg tool, and selecting therefrom three audio fragments: a beginning portion, a middle portion, and an ending portion.
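The segmentation of claim 2 can be sketched as follows. The helper only builds ffmpeg command lines rather than running them; the wrapper, the clip length, and the output names are illustrative assumptions (the `-ss`/`-t` options are standard ffmpeg seek/duration flags):

```python
def ffmpeg_cut_commands(src, duration, clip_len=30.0):
    """Build ffmpeg commands extracting three equal-length fragments:
    the beginning, the middle, and the end of the file.

    src: input audio path; duration: total length in seconds;
    clip_len: length of each extracted fragment in seconds.
    """
    starts = {
        "begin": 0.0,
        "middle": max(0.0, duration / 2 - clip_len / 2),
        "end": max(0.0, duration - clip_len),
    }
    return [
        ["ffmpeg", "-ss", str(start), "-t", str(clip_len),
         "-i", src, f"{name}.wav"]
        for name, start in starts.items()
    ]

cmds = ffmpeg_cut_commands("song.mp3", duration=180.0)
```

Each command could then be run with `subprocess.run(cmd, check=True)` on a system where ffmpeg is installed.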
3. The method according to claim 1, characterized in that the step of extracting acoustic features of the at least one audio fragment comprises:
extracting the acoustic features of the at least one audio fragment by means of the marsyas tool: when the audio fragment is monophonic, extracting 31 features including tonality features and frequency-distribution features, and computing the mean square deviation of each, obtaining a further 31 feature values; when the audio fragment is two-channel, extracting 62 features and computing the mean square deviation of each, obtaining a further 62 feature values;
combining the features of the monophonic audio fragment into one 62-dimensional vector, and the features of the two-channel audio fragment into one 124-dimensional vector;
and the step of extracting image features of the at least one audio fragment comprises:
extracting the spectrogram of the at least one audio fragment by means of the marsyas tool;
cutting the spectrogram into 10 spectral sub-images;
computing the 59-dimensional LBP vector of each spectral sub-image;
concatenating the 59-dimensional LBP vectors of the 10 spectral sub-images into one 590-dimensional vector.
4. The method according to any one of claims 1-3, characterized in that the step of combining the acoustic features and the image features of a same audio fragment among the at least one audio fragment to obtain the feature vector of the same audio fragment among the at least one audio fragment comprises:
when the audio fragment is monophonic, concatenating the 62-dimensional vector with the 590-dimensional vector into a 652-dimensional vector, to obtain the feature vector of the same audio fragment among the at least one audio fragment;
when the audio fragment is two-channel, concatenating the 124-dimensional vector with the 590-dimensional vector into a 714-dimensional vector, to obtain the feature vector of the same audio fragment among the at least one audio fragment.
5. The method according to claim 1, characterized in that the step of training and predicting on the feature vectors of the at least one audio fragment to obtain corresponding probability distributions and selecting the label corresponding to the maximum probability value as the label of the audio file comprises:
training and predicting on the feature vectors of the at least one audio fragment together with a training set by means of an SVM;
obtaining the probability distribution of the at least one audio fragment over preset labels;
combining the probability distributions of the at least one audio fragment over the preset labels according to a preset rule, and selecting the label corresponding to the maximum probability value as the label of the audio file.
6. An audio annotation device, characterized in that the device comprises: a cutting module, an extraction module, a synthesis module, and a training prediction module;
the cutting module is configured to cut an audio file into multiple audio fragments, and to select at least one audio fragment therefrom;
the extraction module is configured to extract acoustic features and image features of the at least one audio fragment, the acoustic features comprising tonality and/or frequency-distribution features;
the synthesis module is configured to combine the acoustic features and the image features of a same audio fragment among the at least one audio fragment, to obtain a feature vector of the same audio fragment among the at least one audio fragment;
the training prediction module is configured to train and predict on the feature vectors of the at least one audio fragment, to obtain corresponding probability distributions, and to select the label corresponding to the maximum probability value as the label of the audio file.
7. The device according to claim 6, characterized in that the cutting module is specifically configured to cut the audio file into multiple audio fragments by means of the ffmpeg tool, and to select therefrom three audio fragments: a beginning portion, a middle portion, and an ending portion.
8. The device according to claim 6, characterized in that the device further comprises a computing module;
the extraction module is specifically configured to extract the acoustic features of the at least one audio fragment by means of the marsyas tool: when the audio fragment is monophonic, 31 features including tonality features and frequency-distribution features are extracted and the mean square deviation of each is computed, obtaining a further 31 feature values; when the audio fragment is two-channel, 62 features are extracted and the mean square deviation of each is computed, obtaining a further 62 feature values;
the synthesis module is further configured to combine the features of the monophonic audio fragment into one 62-dimensional vector, and the features of the two-channel audio fragment into one 124-dimensional vector;
the extraction module is further configured to extract the spectrogram of the at least one audio fragment by means of the marsyas tool;
the cutting module is further configured to cut the spectrogram into 10 spectral sub-images;
the computing module is configured to compute the 59-dimensional LBP vector of each spectral sub-image;
the synthesis module is further configured to concatenate the 59-dimensional LBP vectors of the 10 spectral sub-images into one 590-dimensional vector.
9. The device according to any one of claims 6-8, characterized in that the synthesis module is specifically configured to: when the audio fragment is monophonic, concatenate the 62-dimensional vector with the 590-dimensional vector into a 652-dimensional vector, obtaining the feature vector of the same audio fragment among the at least one audio fragment; when the audio fragment is two-channel, concatenate the 124-dimensional vector with the 590-dimensional vector into a 714-dimensional vector, obtaining the feature vector of the same audio fragment among the at least one audio fragment; and concatenate the 59-dimensional LBP vectors of the 10 spectral sub-images into one 590-dimensional vector.
10. The device according to claim 6, characterized in that:
the training prediction module is specifically configured to train and predict on the feature vectors of the at least one audio fragment together with a training set by means of an SVM, and to obtain the probability distribution of the at least one audio fragment over preset labels;
the synthesis module is further configured to combine the probability distributions of the at least one audio fragment over the preset labels according to a preset rule, and to select the label corresponding to the maximum probability value as the label of the audio file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610866294.5A CN107886941A (en) | 2016-09-29 | 2016-09-29 | A kind of audio mask method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107886941A true CN107886941A (en) | 2018-04-06 |
Family
ID=61768845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610866294.5A Pending CN107886941A (en) | 2016-09-29 | 2016-09-29 | A kind of audio mask method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107886941A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101398825A (en) * | 2007-09-29 | 2009-04-01 | 三星电子株式会社 | Rapid music assorting and searching method and device |
CN101599271A (en) * | 2009-07-07 | 2009-12-09 | 华中科技大学 | A kind of recognition methods of digital music emotion |
CN103186527A (en) * | 2011-12-27 | 2013-07-03 | 北京百度网讯科技有限公司 | System for building music classification model, system for recommending music and corresponding method |
CN104090876A (en) * | 2013-04-18 | 2014-10-08 | 腾讯科技(深圳)有限公司 | Classifying method and classifying device for audio files |
CN104123934A (en) * | 2014-07-23 | 2014-10-29 | 泰亿格电子(上海)有限公司 | Speech composition recognition method and system |
CN104735528A (en) * | 2015-03-02 | 2015-06-24 | 青岛海信电器股份有限公司 | Sound effect matching method and device |
CN105872855A (en) * | 2016-05-26 | 2016-08-17 | 广州酷狗计算机科技有限公司 | Labeling method and device for video files |
Non-Patent Citations (2)
Title |
---|
Chang Yiwei: "Research on User Music Classification Methods Based on Music Libraries", China Master's Theses Full-text Database, Information Science and Technology * |
Tian Hui: "Support Vector Machine Ensembles and Their Application in Music Classification", China Master's Theses Full-text Database, Information Science and Technology * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065076A (en) * | 2018-09-05 | 2018-12-21 | 深圳追科技有限公司 | Setting method, device, equipment and the storage medium of audio tag |
CN112562647A (en) * | 2020-11-24 | 2021-03-26 | 中电海康集团有限公司 | Method and device for marking audio starting point |
CN112562647B (en) * | 2020-11-24 | 2022-09-06 | 中电海康集团有限公司 | Method and device for marking audio starting point |
CN115757784A (en) * | 2022-11-21 | 2023-03-07 | 中科世通亨奇(北京)科技有限公司 | Corpus labeling method and apparatus based on labeling model and label template screening |
CN115757784B (en) * | 2022-11-21 | 2023-07-07 | 中科世通亨奇(北京)科技有限公司 | Corpus labeling method and device based on labeling model and label template screening |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20180406 |