CN108877783A - Method and apparatus for determining the audio type of audio data - Google Patents
Method and apparatus for determining the audio type of audio data
- Publication number: CN108877783A
- Application number: CN201810732941.2A
- Authority
- CN
- China
- Prior art keywords
- module
- rgcnn
- audio
- data
- obtains
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L15/08 — Speech recognition; Speech classification or search
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/04 — Speech recognition; Segmentation; Word boundary detection
- G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
- G10L25/30 — Speech or voice analysis techniques using neural networks
- G10L25/78 — Detection of presence or absence of voice signals
- G06N3/02 — Computing arrangements based on biological models; Neural networks
- G06N3/045 — Neural network architectures; Combinations of networks
Abstract
The invention discloses a method and apparatus for determining the audio type of audio data, belonging to the field of network technology. The method includes: downsampling the input audio data; splitting the downsampled audio data to obtain multiple audio units; extracting the feature data corresponding to each audio unit in the multiple audio units; arranging the feature data according to the temporal order of the corresponding audio units to obtain a feature matrix of the input audio data; and determining the audio type of the input audio data based on the feature matrix and a pre-trained classification model. The invention improves the accuracy of detecting whether an audio track is pure-music (instrumental) audio.
Description
Technical field
The present invention relates to the field of network technology, and in particular to a method and apparatus for determining the audio type of audio data.
Background technique
As living standards improve, more and more people enjoy listening to music to relax. Generally, the audio on a music platform or music website can be divided into vocal audio, which contains human voice, and pure-music audio, which does not. Classifying vocal audio versus pure-music audio is currently a popular research topic in the field of audio detection.
Currently, a common method for detecting whether an audio track is pure-music audio is to divide the whole track to be judged into multiple segments, determine one by one whether each segment contains voice, and thereby decide whether each segment is pure-music audio. If all segments of the whole track are pure-music audio, the whole track is determined to be pure-music audio; if at least one segment among all segments of the whole track is not pure-music audio, the whole track is determined not to be pure-music audio.
In implementing the present invention, the inventors found that the prior art has at least the following problem:
the determination of whether an individual segment contains voice may be erroneous, and when deciding whether the whole track is pure-music audio these per-segment errors accumulate into a larger error, which reduces the accuracy of detecting whether the audio is pure-music audio.
Summary of the invention
To solve the problems in the prior art, embodiments of the present invention provide a method and apparatus for determining the audio type of audio data. The technical solution is as follows:
In a first aspect, a method for determining the audio type of audio data is provided, the method comprising:
downsampling the input audio data;
splitting the downsampled audio data to obtain multiple audio units;
extracting the feature data corresponding to each audio unit in the multiple audio units;
arranging the feature data according to the temporal order of the corresponding audio units, to obtain a feature matrix of the input audio data; and
determining the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
Optionally, extracting the feature data corresponding to each audio unit in the multiple audio units includes:
extracting the frequency data of each audio unit; and
inputting the frequency data of each audio unit into a pre-trained feature extraction model to obtain the feature data corresponding to each audio unit.
Optionally, the feature extraction model includes at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module;
inputting the frequency data of each audio unit into the pre-trained feature extraction model to obtain the feature data corresponding to each audio unit includes:
for each audio unit, processing the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and inputting the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, the feature extraction model includes N RGCNN modules, and each RGCNN module in the N RGCNN modules includes a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product module and an element-wise sum module, where N is a positive integer;
processing the frequency data of the audio unit based on the at least one RGCNN module to obtain the intermediate feature matrix, and inputting the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit, includes:
for the 1st RGCNN module, inputting the frequency data into the convolutional layer without an activation function in the 1st RGCNN module to obtain a non-activated convolution feature matrix of the 1st RGCNN module; inputting the frequency data into the convolutional layer with an activation function in the 1st RGCNN module to obtain an activated convolution feature matrix of the 1st RGCNN module; inputting the non-activated convolution feature matrix and the activated convolution feature matrix of the 1st RGCNN module into the element-wise product module to obtain a feature product matrix corresponding to the 1st RGCNN module; and inputting the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise sum module to obtain an intermediate feature matrix corresponding to the 1st RGCNN module;
for the i-th RGCNN module, inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer without an activation function in the i-th RGCNN module to obtain a non-activated convolution feature matrix of the i-th RGCNN module; inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer with an activation function in the i-th RGCNN module to obtain an activated convolution feature matrix of the i-th RGCNN module; inputting the non-activated convolution feature matrix and the activated convolution feature matrix of the i-th RGCNN module into the element-wise product module to obtain a feature product matrix corresponding to the i-th RGCNN module; and inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise sum module to obtain an intermediate feature matrix corresponding to the i-th RGCNN module, where i is any integer greater than 1 and not greater than N; and
inputting the intermediate feature matrix corresponding to the N-th RGCNN module into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, in the same RGCNN module, the dilation coefficient of the convolutional layer without an activation function is the same as the dilation coefficient of the convolutional layer with an activation function; and
the dilation coefficient of the convolutional layers of the i-th RGCNN module is greater than the dilation coefficient of the convolutional layers of the (i-1)-th RGCNN module.
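In the claim above, the "dilation coefficient" is the dilation rate of the convolution: with kernel size k and dilation d, one layer spans (k - 1) * d + 1 input samples, so stacking modules with growing dilation widens the receptive field without pooling. The sketch below computes this for a stack whose dilation doubles per module; the doubling schedule is an illustrative assumption, since the patent only requires that module i use a larger dilation than module i - 1.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated conv layers applied in sequence.

    Each layer with dilation d extends the field by (kernel_size - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

# Dilation doubling per RGCNN module is an illustrative choice.
print(receptive_field(3, [1, 2, 4, 8, 16]))  # 63
print(receptive_field(3, [1] * 5))           # 11 — the undilated stack grows far slower
```

With five modules and kernel size 3, doubling the dilation already covers 63 positions of the frequency data versus 11 for plain convolutions, which is the usual motivation for dilated stacks.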
Optionally, before extracting the feature data corresponding to each audio unit in the multiple audio units, the method further includes:
obtaining multiple first training samples, where each first training sample includes the frequency data and the audio type of a sample audio unit; and
training an initial feature extraction model based on the multiple first training samples and a preset first training function, to obtain the feature extraction model.
Optionally, after obtaining the feature extraction model, the method further includes:
obtaining multiple second training samples, where each second training sample includes the frequency data of each sample audio unit in a piece of sample audio data and the audio type of the sample audio data;
obtaining multiple pieces of sample feature data based on the feature extraction model and the frequency data in the multiple second training samples; and
training an initial classification model based on the multiple pieces of sample feature data, the audio types in the multiple second training samples and a preset second training function, to obtain the classification model.
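The two-stage scheme above — first train the feature extractor on unit-level labels, then freeze it, derive per-track sample feature data, and train the classification model on track-level labels — can be illustrated with tiny stand-in models. In this sketch both "models" are logistic regressions trained by gradient descent and all data is synthetic; the patent's actual models are the RGCNN extractor and a classification network, and the preset training functions are not specified, so everything below is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.5, steps=200):
    """Gradient-descent logistic regression; stand-in for the preset training functions."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Stage 1: train the "feature extraction model" on unit-level (frequency data, audio type).
X_units = rng.standard_normal((200, 8))
y_units = (X_units[:, 0] > 0).astype(float)            # synthetic unit-level labels
w_extract = train_logreg(X_units, y_units)

# Stage 2: freeze stage 1, turn each track's units into sample feature data,
# then train the "classification model" on track-level audio types.
def track_features(units):
    scores = sigmoid(units @ w_extract)                # frozen extractor applied per unit
    return np.array([scores.mean(), scores.std()])

tracks = [rng.standard_normal((10, 8)) + s for s in (0.0, 1.0) for _ in range(30)]
y_tracks = np.array([0.0] * 30 + [1.0] * 30)
X_tracks = np.stack([track_features(t) for t in tracks])
w_classify = train_logreg(X_tracks, y_tracks)
acc = ((sigmoid(X_tracks @ w_classify) > 0.5) == y_tracks).mean()
```

The design point the sketch preserves is that stage 2 never updates `w_extract`: the classifier only ever sees features produced by the already-trained extractor, matching the claim's ordering.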
In a second aspect, an apparatus for determining the audio type of audio data is provided, the apparatus comprising:
a processing module, configured to downsample the input audio data;
a splitting module, configured to split the downsampled audio data to obtain multiple audio units;
an extraction module, configured to extract the feature data corresponding to each audio unit in the multiple audio units;
an arrangement module, configured to arrange the feature data according to the temporal order of the corresponding audio units, to obtain a feature matrix of the input audio data; and
a determination module, configured to determine the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
Optionally, the extraction module is configured to:
extract the frequency data of each audio unit; and
input the frequency data of each audio unit into a pre-trained feature extraction model to obtain the feature data corresponding to each audio unit.
Optionally, the feature extraction model includes at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module;
the extraction module is configured to:
for each audio unit, process the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and input the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, the feature extraction model includes N RGCNN modules, and each RGCNN module in the N RGCNN modules includes a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product module and an element-wise sum module, where N is a positive integer;
the extraction module is configured to:
for the 1st RGCNN module, input the frequency data into the convolutional layer without an activation function in the 1st RGCNN module to obtain a non-activated convolution feature matrix of the 1st RGCNN module; input the frequency data into the convolutional layer with an activation function in the 1st RGCNN module to obtain an activated convolution feature matrix of the 1st RGCNN module; input the non-activated convolution feature matrix and the activated convolution feature matrix of the 1st RGCNN module into the element-wise product module to obtain a feature product matrix corresponding to the 1st RGCNN module; and input the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise sum module to obtain an intermediate feature matrix corresponding to the 1st RGCNN module;
for the i-th RGCNN module, input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer without an activation function in the i-th RGCNN module to obtain a non-activated convolution feature matrix of the i-th RGCNN module; input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer with an activation function in the i-th RGCNN module to obtain an activated convolution feature matrix of the i-th RGCNN module; input the non-activated convolution feature matrix and the activated convolution feature matrix of the i-th RGCNN module into the element-wise product module to obtain a feature product matrix corresponding to the i-th RGCNN module; and input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise sum module to obtain an intermediate feature matrix corresponding to the i-th RGCNN module, where i is any integer greater than 1 and not greater than N; and
input the intermediate feature matrix corresponding to the N-th RGCNN module into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, in the same RGCNN module, the dilation coefficient of the convolutional layer without an activation function is the same as the dilation coefficient of the convolutional layer with an activation function; and
the dilation coefficient of the convolutional layers of the i-th RGCNN module is greater than the dilation coefficient of the convolutional layers of the (i-1)-th RGCNN module.
Optionally, the apparatus further includes:
a first obtaining module, configured to obtain multiple first training samples before the feature data corresponding to each audio unit in the multiple audio units is extracted, where each first training sample includes the frequency data and the audio type of a sample audio unit; and
a first training module, configured to train an initial feature extraction model based on the multiple first training samples and a preset first training function, to obtain the feature extraction model.
Optionally, the apparatus further includes:
a second obtaining module, configured to obtain multiple second training samples after the feature extraction model is obtained, where each second training sample includes the frequency data of each sample audio unit in a piece of sample audio data and the audio type of the sample audio data;
a third obtaining module, configured to obtain multiple pieces of sample feature data based on the feature extraction model and the frequency data in the multiple second training samples; and
a second training module, configured to train an initial classification model based on the multiple pieces of sample feature data, the audio types in the multiple second training samples and a preset second training function, to obtain the classification model.
In a third aspect, a server is provided. The server includes a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the method for determining the audio type of audio data described in the first aspect.
In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the method for determining the audio type of audio data described in the first aspect.
The technical solutions provided by the embodiments of the present invention bring at least the following beneficial effects:
in the embodiments of the present invention, the target audio data is classified as a whole based on the multiple pieces of feature data of the target audio data, and the audio type corresponding to the target audio data is determined accordingly, without classifying each audio unit separately. This prevents error accumulation and can therefore improve the accuracy of detecting whether the audio is pure-music audio.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings described below are merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a method for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 2 is a spectrogram diagram for a method for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 3 is a model schematic diagram for a method for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 4 is a model structure schematic diagram for a method for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 5 is a scenario schematic diagram for a method for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of an apparatus for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of an apparatus for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 8 is a structural schematic diagram of an apparatus for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of a server structure according to an embodiment of the present invention.
Specific embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
An embodiment of the present invention provides a method for determining the audio type of audio data, and the method may be implemented by a server.
The server may include components such as a processor and a memory. The processor may be a CPU (Central Processing Unit) or the like, and may be used for processing such as extracting the audio units of the target audio data, extracting the feature data corresponding to each audio unit, arranging the feature data into the feature matrix of the target audio data, and determining the audio type of the target audio data. The memory may be a RAM (Random Access Memory), a Flash memory or the like, and may be used to store received data, data required by the processing, data generated during the processing and the like, such as the target audio data, the audio units, the feature data and the audio type of the target audio data. The server may further include a transceiver, an image detection component, an audio output component and an audio input component. The transceiver may be used for data transmission with other devices and may include an antenna, a matching circuit, a modem and the like. The image detection component may be a camera or the like. The audio output component may be a speaker, an earphone or the like. The audio input component may be a microphone or the like.
As shown in Fig. 1, the processing flow of the method may include the following steps.
In step 101, the input audio data is downsampled.
In a possible implementation, when a user wants to detect the audio type of the input audio data (referred to as the target audio data), i.e. to determine whether the target audio data belongs to the pure-music type or the vocal type, the target audio data first needs to be preprocessed, for example downsampled. Preferably, the target audio data can be downsampled to 16000 Hz. This has three benefits: first, it yields input data in a uniform data format; second, it reduces the amount of input data; third, it avoids the influence of the high-frequency part of the spectrum on the target audio data.
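A minimal way to bring audio to a uniform 16000 Hz rate is linear interpolation with NumPy, as sketched below. This only illustrates the uniform-rate idea; a production resampler (the patent does not specify one) would low-pass filter first to avoid aliasing.

```python
import numpy as np

def downsample(audio, in_rate, out_rate=16000):
    """Resample a 1-D signal to out_rate by linear interpolation.

    Note: a real resampler would apply an anti-aliasing filter first;
    this sketch omits that step for brevity.
    """
    n_out = int(len(audio) * out_rate / in_rate)
    t_in = np.arange(len(audio)) / in_rate
    t_out = np.arange(n_out) / out_rate
    return np.interp(t_out, t_in, audio)

audio_44k = np.random.default_rng(0).standard_normal(44100 * 2)  # 2 s at 44.1 kHz
audio_16k = downsample(audio_44k, 44100)
print(audio_16k.shape)  # (32000,) — 2 s at 16 kHz
```

Whatever the source rate, the output is always 16000 samples per second, which gives the uniform input format the embodiment mentions.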
In step 102, the downsampled audio data is split to obtain multiple audio units.
In a possible implementation, after the downsampled audio data is obtained, the target audio data can be split according to a preset duration, dividing the target audio data into multiple audio segments, each of which is an audio unit. A preferred preset duration is 3 s. If the duration of the last audio unit is less than 3 s, this audio unit can be discarded.
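The splitting rule above (fixed 3 s units at 16 kHz, trailing remainder shorter than one unit discarded) can be sketched as:

```python
import numpy as np

def split_units(audio, rate=16000, unit_seconds=3):
    """Cut audio into consecutive fixed-length units; drop a short trailing remainder."""
    unit_len = rate * unit_seconds
    n_units = len(audio) // unit_len        # integer division discards the remainder
    return [audio[i * unit_len:(i + 1) * unit_len] for i in range(n_units)]

audio = np.zeros(16000 * 10)                # 10 s of audio at 16 kHz
units = split_units(audio)
print(len(units))                           # 3 full units; the final 1 s is dropped
```

Integer division implements the discard rule directly: a 10 s track yields three 3 s units, and the last second never becomes a unit.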
In step 103, the feature data corresponding to each audio unit in the multiple audio units is extracted.
In a possible implementation, after the multiple audio units are obtained, feature extraction can be performed on each audio unit to obtain the feature data corresponding to each audio unit.
Optionally, the feature extraction can be implemented by a feature extraction model, and the corresponding processing of step 103 can be as follows: extracting the frequency data of each audio unit, and inputting the frequency data of each audio unit into a pre-trained feature extraction model to obtain the feature data corresponding to each audio unit.
In a possible implementation, a mel-spectrogram is extracted for each of the obtained audio units. The mel-spectrogram corresponding to an audio unit can be as shown in Fig. 2, where the horizontal axis of the mel-spectrogram represents time and the vertical axis represents the band number, each band number representing a frequency band; a preferred number of bands is 128.
According to the mel-spectrogram of each audio unit, the mean of each frequency band is computed along the time direction, yielding 128 band means, and the variance of each frequency band is computed along the time direction, yielding 128 band variances. To normalize the input data, the ratio of the mean to the variance of each band can be computed, and the ratios are taken as the frequency data of the audio unit.
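Given a mel-spectrogram with 128 bands (rows) over time (columns), the per-band feature just described — the ratio of each band's mean to its variance along the time axis — can be computed as below. Computing the mel-spectrogram itself (e.g. with an audio library) is outside this sketch, and the epsilon guard is an added safeguard not specified in the patent.

```python
import numpy as np

def band_features(mel_spec, eps=1e-8):
    """mel_spec: (n_bands, n_frames). Returns per-band mean / variance along time.

    eps guards against division by zero for silent bands (an assumption,
    not part of the patent's description).
    """
    means = mel_spec.mean(axis=1)           # 128 per-band means
    variances = mel_spec.var(axis=1)        # 128 per-band variances
    return means / (variances + eps)

# Stand-in spectrogram: 128 mel bands over 94 frames of random magnitudes.
mel_spec = np.abs(np.random.default_rng(1).standard_normal((128, 94)))
features = band_features(mel_spec)
print(features.shape)  # (128,)
```

The result is one 128-dimensional frequency-data vector per audio unit, which is exactly what the feature extraction model takes as input in the next step.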
The frequency data of each audio unit is then input into the pre-trained feature extraction model, which performs feature extraction on the frequency data and outputs the feature data corresponding to each audio unit.
Optionally, the pre-trained feature extraction model includes at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module. When the frequency data of each audio unit is input into the pre-trained feature extraction model, the specific processing is: for each audio unit, the frequency data of the audio unit is processed based on the at least one RGCNN module to obtain an intermediate feature matrix, and the intermediate feature matrix is input into the global pooling module to obtain the feature data corresponding to the audio unit.
In a possible implementation, the structure of the pre-trained feature extraction model may include at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module. When the frequency data of an audio unit is input into the feature extraction model, all RGCNN modules process the input frequency data according to a preset processing flow to obtain an intermediate feature matrix; the obtained intermediate feature matrix is then input into the global pooling module, which applies global pooling to it to obtain the feature data of the audio unit in vector form.
Optionally, the above feature extraction model may include N RGCNN modules, and each RGCNN module in the N RGCNN modules includes a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product module and an element-wise sum module, where the activation function is the sigmoid activation function and N is a positive integer.
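A minimal NumPy sketch of one RGCNN module as just described: a dilated convolution without activation, a parallel dilated convolution passed through a sigmoid, an element-wise product of the two outputs, and an element-wise residual sum with the module input; stacking such modules with growing dilation and applying global average pooling yields the unit's feature vector. The random weights, single channel, "same" padding and average pooling are implementation assumptions made so the sketch runs; the patent does not fix these details.

```python
import numpy as np

rng = np.random.default_rng(0)

def dilated_conv(x, w, dilation):
    """'Same'-padded dilated 1-D convolution. x: (c_in, t), w: (c_out, c_in, k)."""
    c_out, c_in, k = w.shape
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    y = np.zeros((c_out, t))
    for j in range(k):
        y += np.einsum('oi,it->ot', w[:, :, j], xp[:, j * dilation:j * dilation + t])
    return y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rgcnn_module(x, w_lin, w_gate, dilation):
    """One module: (conv without activation) * sigmoid(conv with activation), plus residual sum."""
    linear = dilated_conv(x, w_lin, dilation)           # non-activated convolution feature matrix
    gate = sigmoid(dilated_conv(x, w_gate, dilation))   # activated convolution feature matrix
    return x + linear * gate                            # element-wise product, element-wise sum

def extractor(freq_data, n_modules=5, k=3):
    x = freq_data[None, :]                              # (1, 128) frequency data
    for i in range(n_modules):
        w_lin = 0.1 * rng.standard_normal((1, 1, k))    # random stand-in weights
        w_gate = 0.1 * rng.standard_normal((1, 1, k))
        x = rgcnn_module(x, w_lin, w_gate, dilation=2 ** i)  # dilation grows per module
    return x.mean(axis=1)                               # global (average) pooling

features = extractor(rng.standard_normal(128))
print(features.shape)
```

Note that, matching the claim, the residual sum of the first module adds the raw frequency data to the feature product matrix, and each later module adds the previous module's intermediate feature matrix; a real model would use many channels, so the pooled feature vector would have more than one component.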
For the 1st RGCNN module, the frequency data is input into the convolutional layer without an activation function in the 1st RGCNN module to obtain the non-activated convolution feature matrix of the 1st RGCNN module, and into the convolutional layer with an activation function in the 1st RGCNN module to obtain the activated convolution feature matrix of the 1st RGCNN module. The non-activated convolution feature matrix and the activated convolution feature matrix of the 1st RGCNN module are input into the element-wise product module to obtain the feature product matrix corresponding to the 1st RGCNN module. The frequency data and the feature product matrix corresponding to the 1st RGCNN module are input into the element-wise sum module to obtain the intermediate feature matrix corresponding to the 1st RGCNN module. For the i-th RGCNN module, the intermediate feature matrix corresponding to the (i-1)-th RGCNN module is input into the convolutional layer without an activation function in the i-th RGCNN module to obtain the non-activated convolution feature matrix of the i-th RGCNN module, and into the convolutional layer with an activation function in the i-th RGCNN module to obtain the activated convolution feature matrix of the i-th RGCNN module. The non-activated convolution feature matrix and the activated convolution feature matrix of the i-th RGCNN module are input into the element-wise product module to obtain the feature product matrix corresponding to the i-th RGCNN module. The intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module are input into the element-wise sum module to obtain the intermediate feature matrix corresponding to the i-th RGCNN module, where i is any integer greater than 1 and not greater than N. The intermediate feature matrix corresponding to the N-th RGCNN module is input into the global pooling module to obtain the feature data corresponding to the audio unit.
In one possible embodiment, the value of N is preferably 4 to 6, i.e. the number of RGCNN modules is preferably 4 to 6; in the present embodiment the number of RGCNN modules is taken as 5 for the purpose of illustration. Each RGCNN module includes a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product computing module and an element-wise summation computing module.
After the frequency data of the audio unit are obtained through the above steps, as shown in Fig. 3, the frequency data are simultaneously input into the convolutional layer without an activation function and the convolutional layer with an activation function in the 1st RGCNN module. The non-activated convolutional feature matrix of the 1st RGCNN module is obtained from the convolutional layer without an activation function, and the activated convolutional feature matrix of the 1st RGCNN module is obtained from the convolutional layer with an activation function. The obtained non-activated and activated convolutional feature matrices are input into the element-wise product computing module to obtain the feature product matrix corresponding to the 1st RGCNN module, and the feature product matrix corresponding to the 1st RGCNN module and the frequency data are input into the element-wise summation computing module to obtain the intermediate feature matrix corresponding to the 1st RGCNN module.
The intermediate feature matrix corresponding to the 1st RGCNN module is then input into the convolutional layer without an activation function and the convolutional layer with an activation function in the 2nd RGCNN module. The non-activated convolutional feature matrix of the 2nd RGCNN module is obtained from the convolutional layer without an activation function, and the activated convolutional feature matrix of the 2nd RGCNN module is obtained from the convolutional layer with an activation function. The obtained non-activated and activated convolutional feature matrices are input into the element-wise product computing module to obtain the feature product matrix corresponding to the 2nd RGCNN module, and the intermediate feature matrix corresponding to the 1st RGCNN module and the feature product matrix corresponding to the 2nd RGCNN module are input into the element-wise summation computing module to obtain the intermediate feature matrix corresponding to the 2nd RGCNN module.
The intermediate feature matrix corresponding to the 2nd RGCNN module is then input into the convolutional layer without an activation function and the convolutional layer with an activation function in the 3rd RGCNN module, and processing continues as in the steps above to obtain the intermediate feature matrix of the 3rd RGCNN module. The above processing steps are repeated in the same way until the intermediate feature matrix of the last RGCNN module is obtained.
The intermediate feature matrix corresponding to the last RGCNN module is input into the global pooling module. Through the global pooling processing of this module, each row of the intermediate feature matrix is reduced to a single value, yielding the feature data of the audio unit in vector form.
Optionally, within the same RGCNN module, the dilation coefficient of the convolutional layer without an activation function is identical to the dilation coefficient of the convolutional layer with an activation function, and the dilation coefficient of the convolutional layers of the i-th RGCNN module is greater than that of the convolutional layers of the (i-1)-th RGCNN module.
Here, the dilation coefficient of a convolutional layer indicates the range over which features are extracted: the larger the dilation coefficient, the more global the extracted features; the smaller the dilation coefficient, the more local the extracted features.
In one possible embodiment, within any one of the above N RGCNN modules, the dilation coefficient of the convolutional layer without an activation function is identical to the dilation coefficient of the convolutional layer with an activation function. The dilation coefficients of the convolutional layers in any two RGCNN modules differ, and the dilation coefficients increase along the order in which feature extraction is performed: for any i-th RGCNN module among the N RGCNN modules other than the 1st, the dilation coefficient of its convolutional layers is greater than that of the convolutional layers of the (i-1)-th RGCNN module. Preferably, the number of RGCNN modules may be set to 5, with the dilation coefficients of the convolutional layers in the 5 RGCNN modules arranged in exponentially increasing form: the dilation coefficient of the convolutional layers in the 1st RGCNN module is set to 2, in the 2nd RGCNN module to 4, in the 3rd RGCNN module to 8, in the 4th RGCNN module to 16, and in the 5th RGCNN module to 32.
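As a rough illustration, the gated residual computation of one RGCNN module and the exponentially increasing dilation coefficients can be sketched together in plain Python. This is a minimal sketch under stated assumptions, not the patented implementation: the kernel size of 3, the hand-set kernel weights and the 1-D toy sequence (standing in for one row of the feature matrix) are all illustrative; only the element-wise product of the two convolution branches, the residual summation, the doubling dilation coefficients and the global pooling mirror the text.

```python
import math

def dilated_conv1d(x, weights, dilation):
    """1-D dilated convolution with zero padding (output length == input length)."""
    k = len(weights)
    out = []
    for i in range(len(x)):
        s = 0.0
        for j, w in enumerate(weights):
            idx = i + (j - k // 2) * dilation
            if 0 <= idx < len(x):
                s += w * x[idx]
        out.append(s)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def rgcnn_block(x, w_lin, w_gate, dilation):
    """One RGCNN module: (conv without activation) * sigmoid(conv) + residual."""
    lin = dilated_conv1d(x, w_lin, dilation)                          # no activation
    gate = [sigmoid(v) for v in dilated_conv1d(x, w_gate, dilation)]  # sigmoid branch
    product = [a * g for a, g in zip(lin, gate)]                      # element-wise product module
    return [xi + p for xi, p in zip(x, product)]                      # element-wise summation module

def extract_feature(x, dilations=(2, 4, 8, 16, 32)):
    """Chain the modules with doubling dilation coefficients, then global-pool."""
    w_lin, w_gate = [0.2, 0.5, 0.2], [0.1, 0.3, 0.1]  # hypothetical weights
    for d in dilations:
        x = rgcnn_block(x, w_lin, w_gate, d)
    return sum(x) / len(x)  # global pooling: one value per row

def receptive_field(kernel_size, dilations):
    """Extraction range grows with the dilation coefficients."""
    return 1 + (kernel_size - 1) * sum(dilations)

feat = extract_feature([0.0, 1.0, -0.5, 0.3, 0.8, -0.2, 0.4, 0.1])
span = receptive_field(3, [2, 4, 8, 16, 32])
```

Under the kernel-size-3 assumption, the five dilation coefficients 2, 4, 8, 16, 32 give a combined extraction range of 125 input positions, illustrating how later modules extract increasingly global features while earlier modules stay local.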
Optionally, the training process of the above feature extraction model may be as follows: obtain multiple first training samples; based on the multiple first training samples and a preset first training function, train the initial feature extraction model to obtain the feature extraction model.
Here, each first training sample includes the frequency data and the audio type of a sample audio unit.
In one possible embodiment, before the feature extraction model is used, it first needs to be trained. First, multiple training samples for training the feature extraction model (i.e. the first training samples) are obtained; each first training sample includes the frequency data and the audio type of a sample audio unit.
The process of obtaining the frequency data of the sample audio units may be as follows: first obtain sample audio data and downsample it to 16000 Hz, then divide the sample audio data into sample audio units according to a preset duration. Preferably, the preset duration may be 3 s, i.e. the sample audio data is divided into multiple sample audio units of 3 s duration, and the audio type of each sample audio unit (i.e. the sample audio type) is determined, that is, whether each sample audio unit is pure-music (instrumental) audio or vocal audio.
Then, the frequency data of each sample audio unit are extracted; the corresponding processing can refer to the processing steps above and is not repeated here.
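The cutting step described above can be sketched as follows. This is an illustrative sketch: the plain-Python list of samples stands in for already-downsampled audio, and zero-padding a unit that falls short of the preset duration follows the padding behaviour described later for the sample audio data.

```python
def split_into_units(samples, sample_rate=16000, unit_seconds=3):
    """Cut a downsampled sample sequence into fixed-duration audio units,
    zero-padding the last unit if it falls short of the preset duration."""
    unit_len = sample_rate * unit_seconds
    units = []
    for start in range(0, len(samples), unit_len):
        unit = samples[start:start + unit_len]
        unit = unit + [0.0] * (unit_len - len(unit))  # pad short tail with zeros
        units.append(unit)
    return units

# ~6.25 s of silence at 16 kHz -> two full 3 s units plus one padded unit
units = split_into_units([0.0] * 100000)
```

Each resulting unit then has exactly 48000 samples (3 s at 16000 Hz), matching the preset duration.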
The frequency data of the sample audio units are input into the initial feature extraction model, and the feature extraction of the initial feature extraction model yields the sample feature data corresponding to each sample audio unit. The sample feature data are input into a fully connected model which, as shown in Fig. 4, determines the music type from each piece of sample feature data; the fully connected model includes one regularization module (Dropout) and two dense connection modules (Dense). Through the fully connected model, the test audio type corresponding to the sample audio unit is obtained; the test audio type is a probability value.
The error value between the test audio type and the sample audio type is calculated, and it is determined whether the obtained error value is less than a preset error value threshold. If the calculated error value is not less than the preset error value threshold, the adjustment values of the coefficients in the initial feature extraction model are determined from the error value, and the coefficients in the initial feature extraction model are adjusted. Multiple test audio types are obtained from the multiple sample audio units, multiple error values are obtained from the multiple test audio types and sample audio types, and the initial feature extraction model is trained with each error value until the calculated error value is less than the preset error value threshold; the current feature extraction model is then determined as the trained feature extraction model, and the training process ends.
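The error-threshold training loop described above can be illustrated on a toy one-coefficient model. The quadratic error, the gradient-style adjustment and the one-coefficient "model" are hypothetical stand-ins; only the stopping rule (keep adjusting coefficients from the error value until the error drops below a preset threshold) mirrors the text.

```python
def train_until_threshold(samples, predict, coeff, lr=0.1, threshold=1e-3, max_steps=10000):
    """Adjust the model coefficient from the error value until the error value
    drops below the preset error threshold."""
    error = float("inf")
    for _ in range(max_steps):
        error = sum((predict(coeff, x) - t) ** 2 for x, t in samples) / len(samples)
        if error < threshold:
            break
        # adjustment value derived from the error (here: a gradient step)
        grad = sum(2 * (predict(coeff, x) - t) * x for x, t in samples) / len(samples)
        coeff -= lr * grad
    return coeff, error

# hypothetical one-coefficient "model": predicted probability = coeff * feature
samples = [(1.0, 0.9), (0.5, 0.45)]
coeff, err = train_until_threshold(samples, lambda c, x: c * x, coeff=0.0)
```

In the patent, the same loop runs over the coefficients of the whole feature extraction model rather than a single scalar.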
Optionally, the above classification model also needs to be trained, and the corresponding processing may be as follows: obtain multiple second training samples; based on the feature extraction model and the frequency data in the multiple second training samples, obtain multiple pieces of sample feature data; based on the multiple pieces of sample feature data, the audio types in the multiple second training samples and a preset second training function, train the initial classification model to obtain the classification model.
Here, each second training sample includes the frequency data of each sample audio unit in a piece of sample audio data and the audio type of that sample audio data. The sample audio units are obtained by splitting the sample audio data according to the preset duration; the duration of each piece of sample audio data is set to 8 minutes, with audio units of insufficient duration padded with 0, and the duration of each sample audio unit is preferably 3 s.
In one possible embodiment, after the trained feature extraction model has been obtained through the above processing steps, the initial classification model can be trained. First, multiple training samples for training the initial classification model (i.e. the second training samples) are obtained; each second training sample includes the frequency data of each sample audio unit in a piece of sample audio data and the audio type of that sample audio data. Then, the frequency data in each second training sample are input into the trained feature extraction model to obtain the sample feature data of each second training sample. The obtained sample feature data are input into the initial classification model; preferably, to cancel the influence of the 0s added to the sample audio data to pad out its duration, a mask layer can be added before the initial classification model. Through the classification processing of the initial classification model on the sample feature data, the test music type of the second training sample is obtained.
The error value between the test audio type and the sample audio type in the second training sample is calculated, and it is determined whether the obtained error value is less than a preset error value threshold. If the calculated error value is not less than the preset error value threshold, the adjustment values of the coefficients in the initial classification model are determined from the error value, and the coefficients in the initial classification model are adjusted. Multiple test audio types are obtained from the multiple sample audio units, multiple error values are obtained from the multiple test audio types and sample audio types, and the initial classification model is trained with each error value until the calculated error value is less than the preset error value threshold; the current classification model is then determined as the trained classification model, and the training process ends.
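The masking step mentioned above can be sketched as follows. This is a hypothetical illustration of a mask layer placed before the classification model: the representation of feature rows as lists and the boolean padding flags are assumptions made for illustration.

```python
def mask_padding(feature_rows, is_padding):
    """Zero out feature rows produced by zero-padded audio units so the 0s
    added to reach the fixed duration do not influence classification."""
    return [([0.0] * len(row) if pad else row)
            for row, pad in zip(feature_rows, is_padding)]

# second row comes from a fully padded unit and is masked out
masked = mask_padding([[0.5, 0.2], [0.1, 0.4]], [False, True])
```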
At step 104, the feature data are arranged according to the temporal order of the audio units to which each piece of feature data corresponds, yielding the feature matrix of the input audio data.
In one possible embodiment, after the feature data of each audio unit have been obtained through the above steps, the temporal order of the audio units is determined, and the feature data of the audio units are arranged according to that temporal order to obtain the feature matrix of the input audio data, as shown in Fig. 5.
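The arranging step can be sketched as a simple sort-and-stack. The (start_time, feature_vector) pair representation is an assumption made for illustration; only the ordering of rows by the unit's position in time mirrors the text.

```python
def build_feature_matrix(unit_features):
    """unit_features: list of (start_time, feature_vector) pairs, one per audio
    unit. Rows of the result are ordered by the unit's temporal order."""
    ordered = sorted(unit_features, key=lambda pair: pair[0])
    return [vec for _, vec in ordered]

# units arrive out of order; the matrix rows follow the timeline
matrix = build_feature_matrix([(6, [0.3, 0.1]), (0, [0.9, 0.2]), (3, [0.5, 0.7])])
```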
In step 105, the audio type of the input audio data is determined based on the feature matrix and a pre-trained classification model.
Here, the audio type includes the pure-music type or the vocal type. Preferably, the above classification model may be an RNN (Recurrent Neural Network) model.
In one possible embodiment, after the feature matrix of the target audio data has been obtained through step 104 above, the feature matrix of the target audio data is input into the pre-trained classification model, and the audio type of the target audio data is determined through the classification processing of the classification model. The audio type of audio data may include the pure-music type and the vocal type; correspondingly, the trained classification model may be either a classification model for determining the probability that audio data is of the pure-music type, or a classification model for determining the probability that audio data is of the vocal type.
If the feature matrix of the target audio data is input into a classification model for determining the probability that audio data is of the pure-music type, the classification model outputs the probability that the target audio data is of the pure-music type. In this case, when the output probability is greater than a first preset probability threshold, the target audio data can be determined to be of the pure-music type; when the output probability is not greater than the first preset probability threshold, the target audio data can be determined to be of the vocal type.
If the feature matrix of the target audio data is input into a classification model for determining the probability that audio data is of the vocal type, the classification model outputs the probability that the target audio data is of the vocal type. In this case, when the output probability is greater than a second preset probability threshold, the target audio data can be determined to be of the vocal type; when the output probability is not greater than the second preset probability threshold, the target audio data can be determined to be of the pure-music type.
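The threshold decision in the first case above can be sketched as follows; the 0.5 default is a hypothetical stand-in for the unspecified "first preset probability threshold".

```python
def classify(prob_pure_music, threshold=0.5):
    """Map the classification model's output probability to an audio type."""
    return "pure-music" if prob_pure_music > threshold else "vocal"

label = classify(0.82)
```

The second case is symmetric, with the model's output read as the probability of the vocal type and compared against the second preset probability threshold.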
In the embodiment of the present invention, the target audio data is classified as a whole based on its multiple pieces of feature data, and the audio type corresponding to the target audio data is determined from this, without classifying each audio unit separately. This prevents error accumulation and can therefore improve the accuracy of detecting whether audio is pure-music audio.
Based on the same technical idea, an embodiment of the present invention also provides an apparatus for determining the audio type of audio data. The apparatus may be the server in the above embodiments and, as shown in Fig. 6, includes: a processing module 610, a cutting module 620, an extraction module 630, an arranging module 640 and a determining module 650.
The processing module 610 is configured to perform downsampling processing on input audio data;
the cutting module 620 is configured to cut the downsampled audio data to obtain multiple audio units;
the extraction module 630 is configured to extract the feature data corresponding to each audio unit of the multiple audio units;
the arranging module 640 is configured to arrange the feature data according to the temporal order of the audio units to which each piece of feature data corresponds, to obtain the feature matrix of the input audio data;
the determining module 650 is configured to determine the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
Optionally, the extraction module 630 is configured to:
extract the frequency data of each audio unit; and
input the frequency data of each audio unit into a pre-trained feature extraction model respectively, to obtain the feature data corresponding to each audio unit.
Optionally, the feature extraction model includes at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module;
the extraction module 630 is configured to:
for each audio unit, process the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and input the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, the feature extraction model includes N RGCNN modules; each RGCNN module of the N RGCNN modules includes a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product computing module and an element-wise summation computing module, where N is a positive integer;
the extraction module 630 is configured to:
for the 1st RGCNN module, input the frequency data into the convolutional layer without an activation function in the 1st RGCNN module to obtain the non-activated convolutional feature matrix of the 1st RGCNN module, and input the frequency data into the convolutional layer with an activation function in the 1st RGCNN module to obtain the activated convolutional feature matrix of the 1st RGCNN module; input the non-activated and activated convolutional feature matrices of the 1st RGCNN module into the element-wise product computing module to obtain the feature product matrix corresponding to the 1st RGCNN module; input the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise summation computing module to obtain the intermediate feature matrix corresponding to the 1st RGCNN module;
for the i-th RGCNN module, input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer without an activation function in the i-th RGCNN module to obtain the non-activated convolutional feature matrix of the i-th RGCNN module; input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer with an activation function in the i-th RGCNN module to obtain the activated convolutional feature matrix of the i-th RGCNN module; input the non-activated and activated convolutional feature matrices of the i-th RGCNN module into the element-wise product computing module to obtain the feature product matrix corresponding to the i-th RGCNN module; input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise summation computing module to obtain the intermediate feature matrix corresponding to the i-th RGCNN module; where i is any integer greater than 1 and not greater than N; and
input the intermediate feature matrix corresponding to the N-th RGCNN module into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, within the same RGCNN module, the dilation coefficient of the convolutional layer without an activation function is identical to the dilation coefficient of the convolutional layer with an activation function;
the dilation coefficient of the convolutional layers of the i-th RGCNN module is greater than the dilation coefficient of the convolutional layers of the (i-1)-th RGCNN module.
Optionally, as shown in Fig. 7, the apparatus further includes:
a first obtaining module 710, configured to obtain multiple first training samples before the feature data corresponding to each audio unit of the multiple audio units is extracted, where each first training sample includes the frequency data and the audio type of a sample audio unit; and
a first training module 720, configured to train the initial feature extraction model based on the multiple first training samples and a preset first training function, to obtain the feature extraction model.
Optionally, as shown in Fig. 8, the apparatus further includes:
a second obtaining module 810, configured to obtain multiple second training samples after the feature extraction model is obtained, where each second training sample includes the frequency data of each sample audio unit in a piece of sample audio data and the audio type of that sample audio data;
a third obtaining module 820, configured to obtain multiple pieces of sample feature data based on the feature extraction model and the frequency data in the multiple second training samples; and
a second training module 830, configured to train the initial classification model based on the multiple pieces of sample feature data, the audio types in the multiple second training samples and a preset second training function, to obtain the classification model.
In the embodiment of the present invention, the target audio data is classified as a whole based on its multiple pieces of feature data, and the audio type corresponding to the target audio data is determined from this, without classifying each audio unit separately. This prevents error accumulation and can therefore improve the accuracy of detecting whether audio is pure-music audio.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method and is not elaborated here.
It should be noted that when the apparatus for determining the audio type of audio data provided by the above embodiment determines the audio type of audio data, the division into the above functional modules is only an example; in practical applications, the above functions may be allocated to different functional modules as needed, i.e. the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for determining the audio type of audio data provided by the above embodiment and the method embodiment for determining the audio type of audio data belong to the same concept; for the specific implementation process, refer to the method embodiment, which is not repeated here.
Fig. 9 is a structural schematic diagram of a computer device provided by an embodiment of the present invention. The computer device may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 901 and one or more memories 902, where at least one instruction is stored in the memory 902, and the at least one instruction is loaded and executed by the processor 901 to implement the following method steps for determining the audio type of audio data:
performing downsampling processing on input audio data;
cutting the downsampled audio data to obtain multiple audio units;
extracting the feature data corresponding to each audio unit of the multiple audio units;
arranging the feature data according to the temporal order of the audio units to which each piece of feature data corresponds, to obtain the feature matrix of the input audio data; and
determining the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
Optionally, the at least one instruction is loaded and executed by the processor 901 to implement the following method steps:
extracting the frequency data of each audio unit; and
inputting the frequency data of each audio unit into a pre-trained feature extraction model respectively, to obtain the feature data corresponding to each audio unit.
Optionally, the at least one instruction is loaded and executed by the processor 901 to implement the following method steps:
for each audio unit, processing the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and inputting the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, the at least one instruction is loaded and executed by the processor 901 to implement the following method steps:
for the 1st RGCNN module, inputting the frequency data into the convolutional layer without an activation function in the 1st RGCNN module to obtain the non-activated convolutional feature matrix of the 1st RGCNN module, and inputting the frequency data into the convolutional layer with an activation function in the 1st RGCNN module to obtain the activated convolutional feature matrix of the 1st RGCNN module; inputting the non-activated and activated convolutional feature matrices of the 1st RGCNN module into the element-wise product computing module to obtain the feature product matrix corresponding to the 1st RGCNN module; inputting the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise summation computing module to obtain the intermediate feature matrix corresponding to the 1st RGCNN module;
for the i-th RGCNN module, inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer without an activation function in the i-th RGCNN module to obtain the non-activated convolutional feature matrix of the i-th RGCNN module; inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer with an activation function in the i-th RGCNN module to obtain the activated convolutional feature matrix of the i-th RGCNN module; inputting the non-activated and activated convolutional feature matrices of the i-th RGCNN module into the element-wise product computing module to obtain the feature product matrix corresponding to the i-th RGCNN module; inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise summation computing module to obtain the intermediate feature matrix corresponding to the i-th RGCNN module; where i is any integer greater than 1 and not greater than N; and
inputting the intermediate feature matrix corresponding to the N-th RGCNN module into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, the at least one instruction is loaded and executed by the processor 901 to implement the following method steps:
obtaining multiple first training samples, where each first training sample includes the frequency data and the audio type of a sample audio unit; and
training the initial feature extraction model based on the multiple first training samples and a preset first training function, to obtain the feature extraction model.
Optionally, the at least one instruction is loaded and executed by the processor 901 to implement the following method steps:
obtaining multiple second training samples, where each second training sample includes the frequency data of each sample audio unit in a piece of sample audio data and the audio type of that sample audio data;
obtaining multiple pieces of sample feature data based on the feature extraction model and the frequency data in the multiple second training samples; and
training the initial classification model based on the multiple pieces of sample feature data, the audio types in the multiple second training samples and a preset second training function, to obtain the classification model.
In the embodiment of the present invention, the target audio data is classified as a whole based on its multiple pieces of feature data, and the audio type corresponding to the target audio data is determined from this, without classifying each audio unit separately. This prevents error accumulation and can therefore improve the accuracy of detecting whether audio is pure-music audio.
In an exemplary embodiment, a computer-readable storage medium is also provided. At least one instruction, at least one program, code set or instruction set is stored in the storage medium, and the at least one instruction, at least one program, code set or instruction set is loaded and executed by a processor to implement the method of determining the audio type of audio data in the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented in hardware, or may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (16)
1. A method of determining the audio type of audio data, characterized in that the method comprises:
performing downsampling processing on input audio data;
cutting the downsampled audio data to obtain multiple audio units;
extracting the feature data corresponding to each audio unit of the multiple audio units;
arranging the feature data according to the temporal order of the audio units to which each piece of feature data corresponds, to obtain the feature matrix of the input audio data; and
determining the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
2. The method according to claim 1, characterized in that the extracting of the feature data corresponding to each audio unit of the multiple audio units comprises:
extracting the frequency data of each audio unit; and
inputting the frequency data of each audio unit into a pre-trained feature extraction model respectively, to obtain the feature data corresponding to each audio unit.
3. The method according to claim 2, characterized in that the feature extraction model comprises at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module; and
the inputting of the frequency data of each audio unit into the pre-trained feature extraction model respectively to obtain the feature data corresponding to each audio unit comprises:
for each audio unit, processing the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and inputting the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
4. The method according to claim 3, characterized in that the feature extraction model comprises N RGCNN modules, each RGCNN module of the N RGCNN modules comprising a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product computing module and an element-wise summation computing module, where N is a positive integer; and
the processing of the frequency data of the audio unit based on the at least one RGCNN module to obtain the intermediate feature matrix, and the inputting of the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit, comprise:
for the 1st RGCNN module, inputting the frequency data into the convolutional layer without an activation function in the 1st RGCNN module to obtain the non-activated convolutional feature matrix of the 1st RGCNN module, and inputting the frequency data into the convolutional layer with an activation function in the 1st RGCNN module to obtain the activated convolutional feature matrix of the 1st RGCNN module; inputting the non-activated and activated convolutional feature matrices of the 1st RGCNN module into the element-wise product computing module to obtain the feature product matrix corresponding to the 1st RGCNN module; inputting the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise summation computing module to obtain the intermediate feature matrix corresponding to the 1st RGCNN module;
for the i-th RGCNN module, inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer without an activation function in the i-th RGCNN module to obtain the non-activated convolutional feature matrix of the i-th RGCNN module; inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer with an activation function in the i-th RGCNN module to obtain the activated convolutional feature matrix of the i-th RGCNN module; inputting the non-activated and activated convolutional feature matrices of the i-th RGCNN module into the element-wise product computing module to obtain the feature product matrix corresponding to the i-th RGCNN module; inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise summation computing module to obtain the intermediate feature matrix corresponding to the i-th RGCNN module; where i is any integer greater than 1 and not greater than N; and
By global pool module described in the corresponding intermediate features Input matrix of n-th RGCNN module, the audio unit pair is obtained
The characteristic answered.
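The per-module computation recited above is a gated residual convolution: a non-activated convolution and an activated convolution of the same input are multiplied element-wise, and the product is summed element-wise with the module input. A minimal single-channel NumPy sketch follows; the sigmoid gate, kernel size 3, and the doubling dilation schedule are assumptions for illustration, not details fixed by the claims:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Same'-padded single-channel 1-D convolution with the given dilation factor."""
    k = len(w)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * dilation] for j in range(k))
                     for i in range(len(x))])

def rgcnn_module(x, w_lin, w_gate, dilation):
    """One gated residual module: (non-activated conv) * sigmoid(activated conv) + input."""
    linear = dilated_conv1d(x, w_lin, dilation)                         # conv, no activation
    gate = 1.0 / (1.0 + np.exp(-dilated_conv1d(x, w_gate, dilation)))   # conv + activation
    return x + linear * gate                # element-wise product, then element-wise sum

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
out = x
for dilation in [1, 2, 4]:                  # dilation grows with the module index
    out = rgcnn_module(out, rng.standard_normal(3), rng.standard_normal(3), dilation)
features = out.mean(keepdims=True)          # global (average) pooling over time
print(out.shape, features.shape)
```

The residual sum keeps each module's output the same length as its input, so modules can be stacked freely before the global pooling step.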
5. The method according to claim 4, wherein, within the same RGCNN module, the dilation factor of the convolutional layer without an activation function is identical to the dilation factor of the convolutional layer with an activation function; and
the dilation factor of the convolutional layers of the i-th RGCNN module is greater than the dilation factor of the convolutional layers of the (i-1)-th RGCNN module.
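Because the dilation factor grows with the module index, the stacked modules see a rapidly growing temporal context. A short sketch of the resulting receptive field, assuming for illustration a kernel size of 3 and one effective convolution depth per module:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated conv modules, one conv depth per module."""
    rf = 1
    for d in dilations:
        rf += d * (kernel_size - 1)  # each layer widens the field by dilation*(k-1)
    return rf

# With doubling dilations, the field grows roughly exponentially in the module count.
print(receptive_field(3, [1, 2, 4, 8]))  # 1 + 2*(1+2+4+8) = 31
```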
6. The method according to claim 2, wherein, before extracting the feature data corresponding to each audio unit in the plurality of audio units, the method further comprises:
obtaining a plurality of first training samples, wherein each first training sample comprises frequency data of a sample audio unit and an audio type; and
training an initial feature extraction model based on the plurality of first training samples and a preset first training function, to obtain the feature extraction model.
7. The method according to claim 6, wherein, after obtaining the feature extraction model, the method further comprises:
obtaining a plurality of second training samples, wherein each second training sample comprises frequency data of each sample audio unit in sample audio data and an audio type of the sample audio data;
obtaining a plurality of pieces of sample feature data based on the feature extraction model and the frequency data in the plurality of second training samples; and
training an initial classification model based on the plurality of pieces of sample feature data, the audio types in the plurality of second training samples, and a preset second training function, to obtain the classification model.
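The two-stage scheme of claims 6 and 7 can be sketched as follows: a frozen feature extractor maps frequency data to sample feature data, and an initial classification model is then trained on those features. In this sketch a hypothetical random projection stands in for the extractor obtained in the first stage, and plain logistic regression with gradient descent stands in for the preset second training function; neither is prescribed by the claims:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the (already trained, frozen) feature extraction model.
W_extract = rng.standard_normal((8, 4))     # hypothetical extractor weights

def extract(freq_data):
    """Map frequency data to sample feature data (toy fixed projection)."""
    return np.tanh(freq_data @ W_extract)

# Toy second training set: frequency data plus a binary audio type label.
X_freq = rng.standard_normal((200, 8))
y = (X_freq.sum(axis=1) > 0).astype(float)

feats = extract(X_freq)                     # sample feature data

# Train an initial classification model (logistic regression) on the features.
w, b = np.zeros(4), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    w -= 0.5 * (feats.T @ (p - y)) / len(y)  # cross-entropy gradient step
    b -= 0.5 * (p - y).mean()

acc = ((1.0 / (1.0 + np.exp(-(feats @ w + b))) > 0.5) == y).mean()
print(feats.shape, float(acc))
```

Keeping the extractor fixed in the second stage means only the lightweight classifier is fitted to whole-recording labels.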
8. An apparatus for determining the audio type of audio data, wherein the apparatus comprises:
a processing module, configured to perform down-sampling processing on input audio data;
a cutting module, configured to cut the down-sampled audio data to obtain a plurality of audio units;
an extraction module, configured to extract feature data corresponding to each audio unit in the plurality of audio units;
an arrangement module, configured to arrange the feature data according to the temporal order of the audio units to which the feature data correspond, to obtain a feature matrix of the input audio data; and
a determination module, configured to determine the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
9. The apparatus according to claim 8, wherein the extraction module is configured to:
extract frequency data of each audio unit; and
input the frequency data of each audio unit into a pre-trained feature extraction model, to obtain the feature data corresponding to that audio unit.
10. The apparatus according to claim 9, wherein the feature extraction model comprises at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module; and
the extraction module is configured to:
for each audio unit, process the frequency data of the audio unit with the at least one RGCNN module to obtain an intermediate feature matrix, and input the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
11. The apparatus according to claim 10, wherein the feature extraction model comprises N RGCNN modules, each of the N RGCNN modules comprising a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product module, and an element-wise summation module, N being a positive integer; and
the extraction module is configured to:
for the 1st RGCNN module, input the frequency data into the convolutional layer without an activation function in the 1st RGCNN module to obtain a non-activated convolution feature matrix of the 1st RGCNN module; input the frequency data into the convolutional layer with an activation function in the 1st RGCNN module to obtain an activated convolution feature matrix of the 1st RGCNN module; input the non-activated convolution feature matrix and the activated convolution feature matrix of the 1st RGCNN module into the element-wise product module to obtain a feature product matrix corresponding to the 1st RGCNN module; and input the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise summation module to obtain an intermediate feature matrix corresponding to the 1st RGCNN module;
for an i-th RGCNN module, input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer without an activation function in the i-th RGCNN module to obtain a non-activated convolution feature matrix of the i-th RGCNN module; input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer with an activation function in the i-th RGCNN module to obtain an activated convolution feature matrix of the i-th RGCNN module; input the non-activated convolution feature matrix and the activated convolution feature matrix of the i-th RGCNN module into the element-wise product module to obtain a feature product matrix corresponding to the i-th RGCNN module; and input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise summation module to obtain an intermediate feature matrix corresponding to the i-th RGCNN module, wherein i is any integer greater than 1 and not greater than N; and
input the intermediate feature matrix corresponding to the N-th RGCNN module into the global pooling module to obtain the feature data corresponding to the audio unit.
12. The apparatus according to claim 11, wherein, within the same RGCNN module, the dilation factor of the convolutional layer without an activation function is identical to the dilation factor of the convolutional layer with an activation function; and
the dilation factor of the convolutional layers of the i-th RGCNN module is greater than the dilation factor of the convolutional layers of the (i-1)-th RGCNN module.
13. The apparatus according to claim 9, wherein the apparatus further comprises:
a first obtaining module, configured to obtain a plurality of first training samples before the feature data corresponding to each audio unit in the plurality of audio units are extracted, wherein each first training sample comprises frequency data of a sample audio unit and an audio type; and
a first training module, configured to train an initial feature extraction model based on the plurality of first training samples and a preset first training function, to obtain the feature extraction model.
14. The apparatus according to claim 13, wherein the apparatus further comprises:
a second obtaining module, configured to obtain a plurality of second training samples after the feature extraction model is obtained, wherein each second training sample comprises frequency data of each sample audio unit in sample audio data and an audio type of the sample audio data;
a third obtaining module, configured to obtain a plurality of pieces of sample feature data based on the feature extraction model and the frequency data in the plurality of second training samples; and
a second training module, configured to train an initial classification model based on the plurality of pieces of sample feature data, the audio types in the plurality of second training samples, and a preset second training function, to obtain the classification model.
15. A server, wherein the server comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the method for determining the audio type of audio data according to any one of claims 1 to 7.
16. A computer-readable storage medium, wherein the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for determining the audio type of audio data according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810732941.2A CN108877783B (en) | 2018-07-05 | 2018-07-05 | Method and apparatus for determining audio type of audio data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108877783A true CN108877783A (en) | 2018-11-23 |
CN108877783B CN108877783B (en) | 2021-08-31 |
Family
ID=64299655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810732941.2A Active CN108877783B (en) | 2018-07-05 | 2018-07-05 | Method and apparatus for determining audio type of audio data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108877783B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047514A (en) * | 2019-05-30 | 2019-07-23 | 腾讯音乐娱乐科技(深圳)有限公司 | A kind of accompaniment degree of purity appraisal procedure and relevant device |
CN110097895A (en) * | 2019-05-14 | 2019-08-06 | 腾讯音乐娱乐科技(深圳)有限公司 | A kind of absolute music detection method, device and storage medium |
CN110853457A (en) * | 2019-10-31 | 2020-02-28 | 中国科学院自动化研究所南京人工智能芯片创新研究院 | Interactive music teaching guidance method |
CN110955789A (en) * | 2019-12-31 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Multimedia data processing method and equipment |
CN111444382A (en) * | 2020-03-30 | 2020-07-24 | 腾讯科技(深圳)有限公司 | Audio processing method and device, computer equipment and storage medium |
CN112989106A (en) * | 2021-05-18 | 2021-06-18 | 北京世纪好未来教育科技有限公司 | Audio classification method, electronic device and storage medium |
CN113053410A (en) * | 2021-02-26 | 2021-06-29 | 北京国双科技有限公司 | Voice recognition method, voice recognition device, computer equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH01186999A (en) * | 1988-01-20 | 1989-07-26 | Ricoh Co Ltd | Speaker collating method |
CN1662956A (en) * | 2002-06-19 | 2005-08-31 | 皇家飞利浦电子股份有限公司 | Mega speaker identification (ID) system and corresponding methods therefor |
US20050195720A1 (en) * | 2004-02-24 | 2005-09-08 | Shigetaka Nagatani | Data processing apparatus, data processing method, reproducing apparatus, and reproducing method |
CN102760444A (en) * | 2012-04-25 | 2012-10-31 | 清华大学 | Support vector machine based classification method of base-band time-domain voice-frequency signal |
CN105895110A (en) * | 2016-06-30 | 2016-08-24 | 北京奇艺世纪科技有限公司 | Method and device for classifying audio files |
CN106571150A (en) * | 2015-10-12 | 2017-04-19 | 阿里巴巴集团控股有限公司 | Method and system for positioning human acoustic zone of music |
CN107293290A (en) * | 2017-07-31 | 2017-10-24 | 郑州云海信息技术有限公司 | The method and apparatus for setting up Speech acoustics model |
CN108010514A (en) * | 2017-11-20 | 2018-05-08 | 四川大学 | A kind of method of speech classification based on deep neural network |
Non-Patent Citations (1)
Title |
---|
HAN, BING et al.: "An Automatic News Audio Classification Method Based on Selective Ensemble SVM", Pattern Recognition and Artificial Intelligence *
Also Published As
Publication number | Publication date |
---|---|
CN108877783B (en) | 2021-08-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108877783A (en) | The method and apparatus for determining the audio types of audio data | |
CN108305641B (en) | Method and device for determining emotion information | |
CN109346087B (en) | Noise robust speaker verification method and apparatus against bottleneck characteristics of a network | |
US11282514B2 (en) | Method and apparatus for recognizing voice | |
CN106469555B (en) | Voice recognition method and terminal | |
US11854536B2 (en) | Keyword spotting apparatus, method, and computer-readable recording medium thereof | |
TWI740315B (en) | Sound separation method, electronic and computer readable storage medium | |
CN113555007B (en) | Voice splicing point detection method and storage medium | |
CN116403250A (en) | Face recognition method and device with shielding | |
CN113763966B (en) | End-to-end text irrelevant voiceprint recognition method and system | |
CN109545226A (en) | A kind of audio recognition method, equipment and computer readable storage medium | |
CN110570871A (en) | TristouNet-based voiceprint recognition method, device and equipment | |
CN113793620A (en) | Voice noise reduction method, device and equipment based on scene classification and storage medium | |
CN106663421A (en) | Voice recognition system and voice recognition method | |
CN113593546B (en) | Terminal equipment awakening method and device, storage medium and electronic device | |
CN115201769A (en) | Radar signal pulse repetition interval generation method, device, equipment and medium | |
CN115221351A (en) | Audio matching method and device, electronic equipment and computer-readable storage medium | |
CN111582456B (en) | Method, apparatus, device and medium for generating network model information | |
CN111291186B (en) | Context mining method and device based on clustering algorithm and electronic equipment | |
CN114566160A (en) | Voice processing method and device, computer equipment and storage medium | |
CN114329042A (en) | Data processing method, device, equipment, storage medium and computer program product | |
CN112071331A (en) | Voice file repairing method and device, computer equipment and storage medium | |
CN111898529A (en) | Face detection method and device, electronic equipment and computer readable medium | |
CN111768764A (en) | Voice data processing method and device, electronic equipment and medium | |
CN114971643B (en) | Abnormal transaction identification method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |