CN108877783A - Method and apparatus for determining the audio type of audio data - Google Patents

Method and apparatus for determining the audio type of audio data

Info

Publication number
CN108877783A
CN108877783A (application CN201810732941.2A; granted as CN108877783B)
Authority
CN
China
Prior art keywords
module
rgcnn
audio
data
obtains
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810732941.2A
Other languages
Chinese (zh)
Other versions
CN108877783B (en)
Inventor
王征韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN201810732941.2A
Publication of CN108877783A
Application granted
Publication of CN108877783B
Status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/04 - Segmentation; Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 - Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and apparatus for determining the audio type of audio data, belonging to the field of network technology. The method includes: down-sampling the input audio data; splitting the down-sampled audio data to obtain multiple audio units; extracting the feature data corresponding to each audio unit in the multiple audio units; arranging the feature data according to the time order of the corresponding audio units to obtain a feature matrix of the input audio data; and determining the audio type of the input audio data based on the feature matrix and a pre-trained classification model. The present invention can improve the accuracy of detecting whether an audio track is pure-music audio.

Description

Method and apparatus for determining the audio type of audio data
Technical field
The present invention relates to the field of network technology, and in particular to a method and apparatus for determining the audio type of audio data.
Background
As living standards continue to rise, more and more people listen to music to relax. In general, the audio on a music platform or music website can be divided into vocal audio, which contains human voice, and pure-music audio, which does not. Classifying vocal audio and pure-music audio is currently a popular research topic in the field of audio detection.
At present, the usual method for detecting whether an audio track is pure music is to divide the whole track into multiple segments and determine, segment by segment, whether each segment contains voice, thereby determining whether each segment is pure music. If all segments of the whole track are pure music, the whole track is determined to be pure music; if at least one segment among all segments of the whole track is not pure music, the whole track is determined not to be pure music.
In the process of implementing the present invention, the inventor found that the prior art has at least the following problems:
errors may occur when determining whether an individual segment contains voice; these errors accumulate into a larger error when determining whether all segments are pure music, which in turn reduces the accuracy of detecting whether an audio track is pure-music audio.
Summary of the invention
To solve the problems in the prior art, embodiments of the present invention provide a method and apparatus for determining the audio type of audio data. The technical solution is as follows:
In a first aspect, a method for determining the audio type of audio data is provided, the method including:
down-sampling the input audio data;
splitting the down-sampled audio data to obtain multiple audio units;
extracting the feature data corresponding to each audio unit in the multiple audio units;
arranging the feature data according to the time order of the corresponding audio units to obtain a feature matrix of the input audio data;
determining the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
Optionally, extracting the feature data corresponding to each audio unit in the multiple audio units includes:
extracting the frequency data of each audio unit;
inputting the frequency data of each audio unit into a pre-trained feature extraction model to obtain the feature data corresponding to each audio unit.
Optionally, the feature extraction model includes at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module;
inputting the frequency data of each audio unit into the pre-trained feature extraction model to obtain the feature data corresponding to each audio unit includes:
for each audio unit, processing the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and inputting the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, the feature extraction model includes N RGCNN modules, each of which includes a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product module, and an element-wise sum module, where N is a positive integer;
processing the frequency data of the audio unit based on the at least one RGCNN module to obtain the intermediate feature matrix, and inputting the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit, includes:
for the 1st RGCNN module, inputting the frequency data into the convolutional layer without an activation function in the 1st RGCNN module to obtain the non-activated convolutional feature matrix of the 1st RGCNN module, and inputting the frequency data into the convolutional layer with an activation function in the 1st RGCNN module to obtain the activated convolutional feature matrix of the 1st RGCNN module; inputting the non-activated convolutional feature matrix and the activated convolutional feature matrix of the 1st RGCNN module into the element-wise product module to obtain the feature product matrix corresponding to the 1st RGCNN module; and inputting the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise sum module to obtain the intermediate feature matrix corresponding to the 1st RGCNN module;
for the i-th RGCNN module, inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer without an activation function in the i-th RGCNN module to obtain the non-activated convolutional feature matrix of the i-th RGCNN module, and inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer with an activation function in the i-th RGCNN module to obtain the activated convolutional feature matrix of the i-th RGCNN module; inputting the non-activated convolutional feature matrix and the activated convolutional feature matrix of the i-th RGCNN module into the element-wise product module to obtain the feature product matrix corresponding to the i-th RGCNN module; and inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise sum module to obtain the intermediate feature matrix corresponding to the i-th RGCNN module, where i is any integer greater than 1 and not greater than N;
inputting the intermediate feature matrix corresponding to the N-th RGCNN module into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, within the same RGCNN module, the dilation coefficient of the convolutional layer without an activation function is identical to that of the convolutional layer with an activation function;
the dilation coefficient of the convolutional layers of the i-th RGCNN module is greater than that of the convolutional layers of the (i-1)-th RGCNN module.
Optionally, before extracting the feature data corresponding to each audio unit in the multiple audio units, the method further includes:
obtaining multiple first training samples, where each first training sample includes the frequency data and the audio type of a sample audio unit;
training an initial feature extraction model based on the multiple first training samples and a preset first training function to obtain the feature extraction model.
Optionally, after obtaining the feature extraction model, the method further includes:
obtaining multiple second training samples, where each second training sample includes the frequency data of each sample audio unit in a piece of sample audio data and the audio type of the sample audio data;
obtaining multiple pieces of sample feature data based on the feature extraction model and the frequency data in the multiple second training samples;
training an initial classification model based on the multiple pieces of sample feature data, the audio types in the multiple second training samples, and a preset second training function to obtain the classification model.
In a second aspect, an apparatus for determining the audio type of audio data is provided, the apparatus including:
a processing module, configured to down-sample the input audio data;
a splitting module, configured to split the down-sampled audio data to obtain multiple audio units;
an extraction module, configured to extract the feature data corresponding to each audio unit in the multiple audio units;
an arrangement module, configured to arrange the feature data according to the time order of the corresponding audio units to obtain a feature matrix of the input audio data;
a determination module, configured to determine the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
Optionally, the extraction module is configured to:
extract the frequency data of each audio unit;
input the frequency data of each audio unit into a pre-trained feature extraction model to obtain the feature data corresponding to each audio unit.
Optionally, the feature extraction model includes at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module;
the extraction module is configured to:
for each audio unit, process the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and input the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, the feature extraction model includes N RGCNN modules, each of which includes a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product module, and an element-wise sum module, where N is a positive integer;
the extraction module is configured to:
for the 1st RGCNN module, input the frequency data into the convolutional layer without an activation function in the 1st RGCNN module to obtain the non-activated convolutional feature matrix of the 1st RGCNN module, and input the frequency data into the convolutional layer with an activation function in the 1st RGCNN module to obtain the activated convolutional feature matrix of the 1st RGCNN module; input the non-activated convolutional feature matrix and the activated convolutional feature matrix of the 1st RGCNN module into the element-wise product module to obtain the feature product matrix corresponding to the 1st RGCNN module; and input the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise sum module to obtain the intermediate feature matrix corresponding to the 1st RGCNN module;
for the i-th RGCNN module, input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer without an activation function in the i-th RGCNN module to obtain the non-activated convolutional feature matrix of the i-th RGCNN module, and input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer with an activation function in the i-th RGCNN module to obtain the activated convolutional feature matrix of the i-th RGCNN module; input the non-activated convolutional feature matrix and the activated convolutional feature matrix of the i-th RGCNN module into the element-wise product module to obtain the feature product matrix corresponding to the i-th RGCNN module; and input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise sum module to obtain the intermediate feature matrix corresponding to the i-th RGCNN module, where i is any integer greater than 1 and not greater than N;
input the intermediate feature matrix corresponding to the N-th RGCNN module into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, within the same RGCNN module, the dilation coefficient of the convolutional layer without an activation function is identical to that of the convolutional layer with an activation function;
the dilation coefficient of the convolutional layers of the i-th RGCNN module is greater than that of the convolutional layers of the (i-1)-th RGCNN module.
Optionally, the apparatus further includes:
a first obtaining module, configured to obtain multiple first training samples before the feature data corresponding to each audio unit in the multiple audio units is extracted, where each first training sample includes the frequency data and the audio type of a sample audio unit;
a first training module, configured to train an initial feature extraction model based on the multiple first training samples and a preset first training function to obtain the feature extraction model.
Optionally, the apparatus further includes:
a second obtaining module, configured to obtain multiple second training samples after the feature extraction model is obtained, where each second training sample includes the frequency data of each sample audio unit in a piece of sample audio data and the audio type of the sample audio data;
a third obtaining module, configured to obtain multiple pieces of sample feature data based on the feature extraction model and the frequency data in the multiple second training samples;
a second training module, configured to train an initial classification model based on the multiple pieces of sample feature data, the audio types in the multiple second training samples, and a preset second training function to obtain the classification model.
In a third aspect, a server is provided. The server includes a processor and a memory, and the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the method for determining the audio type of audio data described in the first aspect.
In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for determining the audio type of audio data described in the first aspect.
The technical solutions provided by the embodiments of the present invention bring at least the following beneficial effects:
in the embodiments of the present invention, the target audio data is classified as a whole based on multiple pieces of its feature data, and the audio type corresponding to the target audio data is thereby determined without classifying each audio unit separately. This prevents error accumulation and thus improves the accuracy of detecting whether an audio track is pure-music audio.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 2 is a schematic spectrogram for a method for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a model for a method for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a model structure for a method for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a scenario for a method for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an apparatus for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an apparatus for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an apparatus for determining the audio type of audio data according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
An embodiment of the present invention provides a method for determining the audio type of audio data, which may be implemented by a server.
The server may include components such as a processor and a memory. The processor may be a CPU (Central Processing Unit) or the like, and may be used for processing such as extracting the audio units of the target audio data, extracting the feature data corresponding to each audio unit, arranging the feature data into the feature matrix of the target audio data, and determining the audio type of the target audio data. The memory may be a RAM (Random Access Memory), a Flash memory, or the like, and may be used to store received data, data required by the processing procedures, data generated during processing, and so on, such as the target audio data, the audio units, the feature data, and the audio type of the target audio data. The server may also include a transceiver, an image detection component, an audio output component, an audio input component, and the like. The transceiver may be used for data transmission with other devices and may include an antenna, a matching circuit, a modem, and so on. The image detection component may be a camera or the like. The audio output component may be a speaker, earphones, or the like. The audio input component may be a microphone or the like.
As shown in Fig. 1, the processing flow of the method may include the following steps:
In step 101, the input audio data is down-sampled.
In one possible embodiment, when a user wants to detect the audio type of input audio data (referred to as target audio data), i.e., to determine whether the target audio data belongs to the pure-music type or the vocal type, the target audio data first needs to undergo some processing, such as down-sampling. Preferably, the target audio data may be down-sampled to 16000 Hz. This has three benefits: first, it produces input data in a uniform format; second, it reduces the amount of input data; third, it caps the height of the spectrum, avoiding the influence of very high frequencies on the target audio data.
In step 102, the down-sampled audio data is split to obtain multiple audio units.
In one possible embodiment, after the down-sampled audio data is obtained, the target audio data may be split according to a preset duration into multiple audio segments, each of which is an audio unit. The preferred preset duration is 3 s; if the duration of the last audio unit is less than 3 s, that audio unit may be discarded. A sketch of these two steps follows.
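A minimal sketch of steps 101 and 102, assuming librosa is available; "song.mp3" is a hypothetical path, while the 16000 Hz rate and the 3 s unit duration come from the text above:

```python
import librosa

audio, sr = librosa.load("song.mp3", sr=16000)  # down-sample to 16000 Hz on load

unit_len = 3 * sr                               # 3-second audio units
n_units = len(audio) // unit_len                # a trailing remainder < 3 s is discarded
units = [audio[i * unit_len:(i + 1) * unit_len] for i in range(n_units)]
```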
In step 103, the feature data corresponding to each audio unit in the multiple audio units is extracted.
In one possible embodiment, after the multiple audio units are obtained, feature extraction may be performed on each audio unit to obtain the feature data corresponding to each audio unit.
Optionally, the above feature extraction may be implemented by a feature extraction model, and the corresponding processing of step 103 may be as follows: extract the frequency data of each audio unit; input the frequency data of each audio unit into a pre-trained feature extraction model to obtain the feature data corresponding to each audio unit.
In one possible embodiment, a mel-spectrogram is extracted for each of the obtained audio units. The mel-spectrogram corresponding to one audio unit may be as shown in Fig. 2, where the horizontal axis represents time and the vertical axis represents the band index, each band covering a frequency range. The preferred number of bands is 128.
From the mel-spectrogram of each audio unit, the mean of each band is computed along the time direction, yielding 128 band means, and the variance of each band is computed along the time direction, yielding 128 band variances. To normalize the input data, the ratio of the mean to the variance of each band may be computed, and these ratios are taken as the frequency data of the audio unit, as in the sketch below.
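A possible computation of the frequency data, assuming librosa's mel spectrogram stands in for the mel-spectrogram described above; the eps guard against division by zero is an addition:

```python
import librosa
import numpy as np

def frequency_data(unit, sr=16000, n_mels=128, eps=1e-8):
    # 128-band mel spectrogram: rows are bands, columns are time frames
    mel = librosa.feature.melspectrogram(y=unit, sr=sr, n_mels=n_mels)
    band_mean = mel.mean(axis=1)         # per-band mean along time (128 values)
    band_var = mel.var(axis=1)           # per-band variance along time (128 values)
    return band_mean / (band_var + eps)  # mean/variance ratio as the frequency data
```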
The frequency data of each audio unit is input into the pre-trained feature extraction model, which performs feature extraction on the frequency data and yields the feature data corresponding to each audio unit.
Optionally, the above trained feature extraction model includes at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module. When the frequency data of each audio unit is input into the pre-trained feature extraction model, the specific processing is: for each audio unit, process the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and input the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
In one possible embodiment, the structure of the trained feature extraction model may include at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module. When the frequency data of an audio unit is input into the feature extraction model, all RGCNN modules process the input frequency data according to a preset processing flow to obtain an intermediate feature matrix; the obtained intermediate feature matrix is then input into the global pooling module, which applies global pooling to it and yields the feature data of the audio unit in vector form.
Optionally, the above feature extraction model may include N RGCNN modules, each of which includes a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product module, and an element-wise sum module, where the activation function is the sigmoid activation function and N is a positive integer.
For the 1st RGCNN module, the frequency data is input into the convolutional layer without an activation function in the 1st RGCNN module to obtain the non-activated convolutional feature matrix of the 1st RGCNN module, and into the convolutional layer with an activation function in the 1st RGCNN module to obtain the activated convolutional feature matrix of the 1st RGCNN module; the non-activated and activated convolutional feature matrices of the 1st RGCNN module are input into the element-wise product module to obtain the feature product matrix corresponding to the 1st RGCNN module; the frequency data and the feature product matrix corresponding to the 1st RGCNN module are input into the element-wise sum module to obtain the intermediate feature matrix corresponding to the 1st RGCNN module. For the i-th RGCNN module, the intermediate feature matrix corresponding to the (i-1)-th RGCNN module is input into the convolutional layer without an activation function in the i-th RGCNN module to obtain the non-activated convolutional feature matrix of the i-th RGCNN module, and into the convolutional layer with an activation function in the i-th RGCNN module to obtain the activated convolutional feature matrix of the i-th RGCNN module; the non-activated and activated convolutional feature matrices of the i-th RGCNN module are input into the element-wise product module to obtain the feature product matrix corresponding to the i-th RGCNN module; the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module are input into the element-wise sum module to obtain the intermediate feature matrix corresponding to the i-th RGCNN module, where i is any integer greater than 1 and not greater than N. The intermediate feature matrix corresponding to the N-th RGCNN module is input into the global pooling module to obtain the feature data corresponding to the audio unit.
In one possible embodiment, the value of N is preferably 4 to 6, i.e., the number of RGCNN modules is preferably 4 to 6; in this embodiment, 5 RGCNN modules are used for illustration. Each RGCNN module includes a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product module, and an element-wise sum module.
After the frequency data of an audio unit is obtained through the above steps, as shown in Fig. 3, the frequency data is simultaneously input into the convolutional layer without an activation function and the convolutional layer with an activation function in the 1st RGCNN module. The convolutional layer without an activation function yields the non-activated convolutional feature matrix of the 1st RGCNN module, and the convolutional layer with an activation function yields its activated convolutional feature matrix. The two matrices are input into the element-wise product module to obtain the feature product matrix corresponding to the 1st RGCNN module, and this feature product matrix together with the frequency data is input into the element-wise sum module to obtain the intermediate feature matrix corresponding to the 1st RGCNN module.
The intermediate feature matrix corresponding to the 1st RGCNN module is then input into the convolutional layer without an activation function and the convolutional layer with an activation function in the 2nd RGCNN module. The convolutional layer without an activation function yields the non-activated convolutional feature matrix of the 2nd RGCNN module, and the convolutional layer with an activation function yields its activated convolutional feature matrix. The two matrices are input into the element-wise product module to obtain the feature product matrix corresponding to the 2nd RGCNN module, and this feature product matrix together with the intermediate feature matrix of the 1st RGCNN module is input into the element-wise sum module to obtain the intermediate feature matrix corresponding to the 2nd RGCNN module.
The intermediate feature matrix corresponding to the 2nd RGCNN module is then input into the convolutional layer without an activation function and the convolutional layer with an activation function in the 3rd RGCNN module, which are processed according to the above steps to obtain the intermediate feature matrix of the 3rd RGCNN module. The above processing steps are repeated in the same way until the intermediate feature matrix of the last RGCNN module is obtained.
The intermediate feature matrix corresponding to the last RGCNN module is input into the global pooling module, whose global pooling reduces each row of the intermediate feature matrix to a single value, yielding the feature data of the audio unit in vector form.
Optionally, within the same RGCNN module, the dilation coefficient of the convolutional layer without an activation function is identical to that of the convolutional layer with an activation function; the dilation coefficient of the convolutional layers of the i-th RGCNN module is greater than that of the convolutional layers of the (i-1)-th RGCNN module.
Here, the dilation coefficient of a convolutional layer indicates the range over which features are extracted: the larger the dilation coefficient, the more global the extracted features; the smaller the dilation coefficient, the more local the extracted features.
In one possible embodiment, among the above N RGCNN modules, within the same RGCNN module, the dilation coefficient of the convolutional layer without an activation function is identical to that of the convolutional layer with an activation function. The dilation coefficients of the convolutional layers in any two RGCNN modules differ, and they increase in the order in which feature extraction is performed; for example, if the i-th RGCNN module is any module among the N RGCNN modules other than the 1st, the dilation coefficient of the convolutional layers of the i-th RGCNN module is greater than that of the (i-1)-th RGCNN module. Preferably, 5 RGCNN modules may be used, with the dilation coefficients of their convolutional layers growing exponentially: the dilation coefficient of the convolutional layers in the 1st RGCNN module is set to 2, in the 2nd to 4, in the 3rd to 8, in the 4th to 16, and in the 5th to 32.
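The following is a minimal Keras sketch of such a stack. Only the gating structure (a convolution without an activation function multiplied element-wise by a sigmoid-activated convolution, plus a residual element-wise sum) and the dilation coefficients 2, 4, 8, 16, 32 come from the text above; the convolution axis (along the 128 bands), kernel size, channel width, the initial width-matching projection, and the use of average pooling in the global pooling module are assumptions.

```python
from tensorflow.keras import layers, Model

def rgcnn_block(x, dilation, channels=64, kernel=3):
    # gated dilated convolution: a linear branch multiplied by a sigmoid gate
    linear = layers.Conv1D(channels, kernel, padding="same",
                           dilation_rate=dilation)(x)           # no activation function
    gate = layers.Conv1D(channels, kernel, padding="same",
                         dilation_rate=dilation,
                         activation="sigmoid")(x)               # sigmoid activation
    gated = layers.Multiply()([linear, gate])                   # element-wise product
    return layers.Add()([x, gated])                             # element-wise (residual) sum

def build_feature_extractor(n_mels=128, channels=64):
    inp = layers.Input(shape=(n_mels, 1))                       # 128-dim frequency data
    x = layers.Conv1D(channels, 1, padding="same")(inp)         # width-matching projection (assumption)
    for dilation in (2, 4, 8, 16, 32):                          # five RGCNN modules
        x = rgcnn_block(x, dilation, channels)
    out = layers.GlobalAveragePooling1D()(x)                    # vector-form feature data
    return Model(inp, out)
```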
Optionally, the training process of the above feature extraction model may be as follows: obtain multiple first training samples; train an initial feature extraction model based on the multiple first training samples and a preset first training function to obtain the feature extraction model.
Here, each first training sample includes the frequency data and the audio type of a sample audio unit.
In one possible embodiment, before the feature extraction model is used, it first needs to be trained. First, multiple training samples (i.e., first training samples) for training the feature extraction model are obtained; each first training sample includes the frequency data and the audio type of a sample audio unit.
The process of obtaining the frequency data of the sample audio units may be: first obtain sample audio data and down-sample it to 16000 Hz, then split the sample audio data into sample audio units according to a preset duration. Preferably, the preset duration may be 3 s, i.e., the sample audio data is split into multiple sample audio units of 3 s each, and the audio type of each sample audio unit (the sample audio type) is determined, i.e., whether each sample audio unit is pure-music audio or vocal audio.
Then, the frequency data of each sample audio unit is extracted; the corresponding processing may refer to the above processing steps and is not repeated here.
The frequency data of the sample audio units is input into the initial feature extraction model, whose feature extraction yields the sample feature data corresponding to each sample audio unit. The sample feature data is input into a fully connected model which, as shown in Fig. 4, determines the music type from each piece of sample feature data; the fully connected model includes one regularization module (Dropout) and two dense blocks (Dense). Through the fully connected model, the predicted audio type corresponding to the sample audio unit is obtained as a probability value.
The error between the predicted audio type and the sample audio type is computed, and it is determined whether the obtained error is below a preset error threshold. If the computed error is not below the preset error threshold, adjustment values for the coefficients of the initial feature extraction model are determined from the error, and the coefficients of the initial feature extraction model are adjusted. Multiple predicted audio types are obtained from multiple sample audio units, multiple errors are obtained from the predicted audio types and the sample audio types, and the initial feature extraction model is trained on each error until the computed error is below the preset error threshold; the current feature extraction model is then taken as the trained feature extraction model, and the training process ends. A sketch of such a training setup follows.
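A sketch of this training setup, reusing build_feature_extractor from the sketch above. The one-Dropout/two-Dense head follows Fig. 4 as described; the dropout rate, hidden width, sigmoid output, binary cross-entropy loss, and Adam optimizer are assumptions standing in for the preset first training function.

```python
from tensorflow.keras import layers, Model

extractor = build_feature_extractor()
h = layers.Dropout(0.5)(extractor.output)         # regularization module (Dropout)
h = layers.Dense(64, activation="relu")(h)        # first dense block
h = layers.Dense(1, activation="sigmoid")(h)      # second dense block -> probability
trainer = Model(extractor.input, h)
trainer.compile(optimizer="adam", loss="binary_crossentropy")
# trainer.fit(unit_frequency_data, unit_audio_types, ...)  # hypothetical training arrays
```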
Optionally, the above classification model also needs to be trained; the corresponding processing may be as follows: obtain multiple second training samples; obtain multiple pieces of sample feature data based on the feature extraction model and the frequency data in the multiple second training samples; train an initial classification model based on the multiple pieces of sample feature data, the audio types in the multiple second training samples, and a preset second training function to obtain the classification model.
Here, each second training sample includes the frequency data of each sample audio unit in a piece of sample audio data and the audio type of that sample audio data. The sample audio units are obtained by splitting the sample audio data according to a preset duration; the duration of each piece of sample audio data is set to 8 minutes, with audio of insufficient duration padded with zeros, and the duration of each sample audio unit is preferably 3 s.
In one possible embodiment, after the trained feature extraction model is obtained through the above processing steps, the initial classification model may be trained. First, multiple training samples (i.e., second training samples) for training the initial classification model are obtained; each second training sample includes the frequency data of each sample audio unit in a piece of sample audio data and the audio type of the sample audio data. Then, the frequency data in each second training sample is input into the trained feature extraction model to obtain the sample feature data of each second training sample. The obtained sample feature data is input into the initial classification model. Preferably, to cancel the influence of the zeros added to the sample audio data to pad its duration, a mask layer may be added before the initial classification model. The initial classification model classifies the sample feature data and yields the predicted music type of the second training sample.
The error between the predicted audio type and the sample audio type in the second training sample is computed, and it is determined whether the obtained error is below a preset error threshold. If the computed error is not below the preset error threshold, adjustment values for the coefficients of the initial classification model are determined from the error, and the coefficients of the initial classification model are adjusted. Multiple predicted audio types are obtained from multiple sample audio units, multiple errors are obtained from the predicted audio types and the sample audio types, and the initial classification model is trained on each error until the computed error is below the preset error threshold; the current classification model is then taken as the trained classification model, and the training process ends. A sketch of such a classifier follows.
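A sketch of one possible classifier under these assumptions: a GRU stands in for the RNN classification model, the mask layer skips the zero-padded units, and 160 rows correspond to 8 minutes of 3 s units; the hidden width and sigmoid output are assumptions.

```python
from tensorflow.keras import layers, Model

def build_classifier(max_units=160, feat_dim=64):
    # 8 min of 3 s units = 160 rows; feat_dim matches the extractor output width
    inp = layers.Input(shape=(max_units, feat_dim))
    x = layers.Masking(mask_value=0.0)(inp)          # mask layer skips zero-padded units
    x = layers.GRU(64)(x)                            # recurrent (RNN) summary of the sequence
    out = layers.Dense(1, activation="sigmoid")(x)   # probability of one audio type
    return Model(inp, out)
```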
In step 104, the feature data is arranged according to the time order of the corresponding audio units to obtain the feature matrix of the input audio data.
In one possible embodiment, after the feature data of each audio unit is obtained through the above steps, the time order of the audio units is determined, and the feature data of the audio units is arranged according to that time order to obtain the feature matrix of the input audio data, as shown in Fig. 5 and sketched below.
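A minimal sketch of this arrangement, assuming the same fixed-length zero padding that the training samples use:

```python
import numpy as np

def feature_matrix(unit_features, max_units=160):
    # rows follow the time order of the audio units; unused rows stay zero,
    # mirroring the zero padding used for the training samples
    mat = np.zeros((max_units, len(unit_features[0])), dtype=np.float32)
    for t, feat in enumerate(unit_features):
        mat[t] = feat
    return mat
```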
In step 105, the audio type of the input audio data is determined based on the feature matrix and a pre-trained classification model.
Here, the audio type is either the pure-music type or the vocal type. Preferably, the above classification model may be an RNN (Recurrent Neural Network) model.
In one possible embodiment, after the feature matrix of the target audio data is obtained through step 104 above, the feature matrix of the target audio data is input into the pre-trained classification model, whose classification yields the audio type of the target audio data. The audio type of audio data may be the pure-music type or the vocal type; correspondingly, the trained classification model may be either a classification model for determining the probability that the audio data is of the pure-music type, or a classification model for determining the probability that the audio data is of the vocal type.
If the feature matrix of the target audio data is input into a classification model for determining the probability that the audio data is of the pure-music type, the classification model outputs the probability that the target audio data is of the pure-music type. In this case, when the output probability is greater than a first predetermined probability threshold, the target audio data may be determined to be of the pure-music type; when the output probability is not greater than the first predetermined probability threshold, the target audio data may be determined to be of the vocal type.
If the feature matrix of the target audio data is input into a classification model for determining the probability that the audio data is of the vocal type, the classification model outputs the probability that the target audio data is of the vocal type. In this case, when the output probability is greater than a second predetermined probability threshold, the target audio data may be determined to be of the vocal type; when the output probability is not greater than the second predetermined probability threshold, the target audio data may be determined to be of the pure-music type. A sketch of this decision rule follows.
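A sketch of the decision rule for the pure-music case; the 0.5 value is only a placeholder for the first predetermined probability threshold, which the text leaves open:

```python
def decide_audio_type(prob_pure_music, threshold=0.5):
    # threshold stands in for the first predetermined probability threshold
    return "pure music" if prob_pure_music > threshold else "vocal"
```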
In the embodiments of the present invention, the target audio data is classified as a whole based on multiple pieces of its feature data, and the audio type corresponding to the target audio data is thereby determined without classifying each audio unit separately. This prevents error accumulation and thus improves the accuracy of detecting whether an audio track is pure-music audio.
Based on the same technical concept, an embodiment of the present invention further provides an apparatus for determining the audio type of audio data, which may be the server in the above embodiments. As shown in Fig. 6, the apparatus includes: a processing module 610, a splitting module 620, an extraction module 630, an arrangement module 640, and a determination module 650.
The processing module 610 is configured to down-sample the input audio data;
the splitting module 620 is configured to split the down-sampled audio data to obtain multiple audio units;
the extraction module 630 is configured to extract the feature data corresponding to each audio unit in the multiple audio units;
the arrangement module 640 is configured to arrange the feature data according to the time order of the corresponding audio units to obtain a feature matrix of the input audio data;
the determination module 650 is configured to determine the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
Optionally, the extraction module 630 is configured to:
extract the frequency data of each audio unit;
input the frequency data of each audio unit into a pre-trained feature extraction model to obtain the feature data corresponding to each audio unit.
Optionally, the feature extraction model includes at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module;
the extraction module 630 is configured to:
for each audio unit, process the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and input the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, the feature extraction model includes N RGCNN modules, each of which includes a convolutional layer without an activation function, a convolutional layer with an activation function, an element-wise product module, and an element-wise sum module, where N is a positive integer;
the extraction module 630 is configured to:
for the 1st RGCNN module, input the frequency data into the convolutional layer without an activation function in the 1st RGCNN module to obtain the non-activated convolutional feature matrix of the 1st RGCNN module, and input the frequency data into the convolutional layer with an activation function in the 1st RGCNN module to obtain the activated convolutional feature matrix of the 1st RGCNN module; input the non-activated convolutional feature matrix and the activated convolutional feature matrix of the 1st RGCNN module into the element-wise product module to obtain the feature product matrix corresponding to the 1st RGCNN module; and input the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise sum module to obtain the intermediate feature matrix corresponding to the 1st RGCNN module;
for the i-th RGCNN module, input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer without an activation function in the i-th RGCNN module to obtain the non-activated convolutional feature matrix of the i-th RGCNN module, and input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolutional layer with an activation function in the i-th RGCNN module to obtain the activated convolutional feature matrix of the i-th RGCNN module; input the non-activated convolutional feature matrix and the activated convolutional feature matrix of the i-th RGCNN module into the element-wise product module to obtain the feature product matrix corresponding to the i-th RGCNN module; and input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise sum module to obtain the intermediate feature matrix corresponding to the i-th RGCNN module, where i is any integer greater than 1 and not greater than N;
input the intermediate feature matrix corresponding to the N-th RGCNN module into the global pooling module to obtain the feature data corresponding to the audio unit.
Optionally, within the same RGCNN module, the dilation coefficient of the convolutional layer without an activation function is identical to that of the convolutional layer with an activation function;
the dilation coefficient of the convolutional layers of the i-th RGCNN module is greater than that of the convolutional layers of the (i-1)-th RGCNN module.
Optionally, as shown in Fig. 7, the apparatus further includes:
a first obtaining module 710, configured to obtain multiple first training samples before the feature data corresponding to each audio unit in the multiple audio units is extracted, where each first training sample includes the frequency data and the audio type of a sample audio unit;
a first training module 720, configured to train an initial feature extraction model based on the multiple first training samples and a preset first training function to obtain the feature extraction model.
Optionally, as shown in Fig. 8, the apparatus further includes:
a second obtaining module 810, configured to obtain multiple second training samples after the feature extraction model is obtained, where each second training sample includes the frequency data of each sample audio unit in a piece of sample audio data and the audio type of the sample audio data;
a third obtaining module 820, configured to obtain multiple pieces of sample feature data based on the feature extraction model and the frequency data in the multiple second training samples;
a second training module 830, configured to train an initial classification model based on the multiple pieces of sample feature data, the audio types in the multiple second training samples, and a preset second training function to obtain the classification model.
In the embodiments of the present invention, the target audio data is classified as a whole based on multiple pieces of its feature data, and the audio type corresponding to the target audio data is thereby determined without classifying each audio unit separately. This prevents error accumulation and thus improves the accuracy of detecting whether an audio track is pure-music audio.
With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method and is not elaborated here.
It should be noted that when the apparatus for determining the audio type of audio data provided by the above embodiments determines the audio type of audio data, the division into the above functional modules is only used as an example; in practical applications, the above functions may be assigned to different functional modules as needed, i.e., the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for determining the audio type of audio data provided by the above embodiments belongs to the same concept as the embodiments of the method for determining the audio type of audio data; see the method embodiments for the specific implementation process, which is not repeated here.
Fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present invention. The computer device may differ considerably depending on configuration and performance, and may include one or more processors (central processing units, CPU) 901 and one or more memories 902, where the memory 902 stores at least one instruction that is loaded and executed by the processor 901 to implement the following steps of the method for determining the audio type of audio data:
down-sampling the input audio data;
splitting the down-sampled audio data to obtain multiple audio units;
extracting the feature data corresponding to each audio unit in the multiple audio units;
arranging the feature data according to the time order of the corresponding audio units to obtain a feature matrix of the input audio data;
determining the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
Optionally, at least one instruction is loaded by the processor 901 and is executed to realize following methods step:
Extract the frequency data of each audio unit;
The frequency data of each audio unit are inputted to Feature Selection Model trained in advance respectively, obtain each sound The corresponding characteristic of frequency unit.
Optionally, at least one instruction is loaded by the processor 901 and is executed to realize following methods step:
For each audio unit, based at least one described RGCNN module, to the frequency data of the audio unit into Row processing, obtains intermediate features matrix, by global pool module described in the intermediate features Input matrix, obtains the audio list The corresponding characteristic of member.
Optionally, the at least one instruction is loaded and executed by the processor 901 to implement the following method steps:
for the 1st RGCNN module, inputting the frequency data into the convolution layer without an activation function in the 1st RGCNN module to obtain the non-activated convolution feature matrix of the 1st RGCNN module; inputting the frequency data into the convolution layer with an activation function in the 1st RGCNN module to obtain the activated convolution feature matrix of the 1st RGCNN module; inputting the non-activated convolution feature matrix and the activated convolution feature matrix of the 1st RGCNN module into the element-wise product module to obtain the feature product matrix corresponding to the 1st RGCNN module; and inputting the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise sum module to obtain the intermediate feature matrix corresponding to the 1st RGCNN module;
for the i-th RGCNN module, inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolution layer without an activation function in the i-th RGCNN module to obtain the non-activated convolution feature matrix of the i-th RGCNN module; inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolution layer with an activation function in the i-th RGCNN module to obtain the activated convolution feature matrix of the i-th RGCNN module; inputting the non-activated convolution feature matrix and the activated convolution feature matrix of the i-th RGCNN module into the element-wise product module to obtain the feature product matrix corresponding to the i-th RGCNN module; and inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise sum module to obtain the intermediate feature matrix corresponding to the i-th RGCNN module, where i is any integer greater than 1 and not greater than N;
inputting the intermediate feature matrix corresponding to the N-th RGCNN module into the global pooling module to obtain the feature data corresponding to the audio unit.
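Read as a network, each RGCNN module is a gated convolution with a residual connection: the outputs of the two parallel convolution layers are multiplied element-wise, and the product is summed element-wise with the module's input. The PyTorch sketch below instantiates this wiring; the sigmoid gate, the use of 1-D convolutions, the channel count, the doubling dilation rates, and the average pooling are all assumptions chosen for illustration (the embodiment fixes only the wiring, and claim 5 only requires the dilation rates to grow across modules).

```python
import torch
import torch.nn as nn

class RGCNNModule(nn.Module):
    """One module: conv without activation x conv with activation, plus residual."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation  # keep the time length unchanged
        # Convolution layer without an activation function.
        self.linear_conv = nn.Conv1d(channels, channels, kernel_size,
                                     padding=pad, dilation=dilation)
        # Convolution layer with an activation function (sigmoid gate, assumed).
        self.gated_conv = nn.Conv1d(channels, channels, kernel_size,
                                    padding=pad, dilation=dilation)

    def forward(self, x):
        # Element-wise product of the two branches, then element-wise sum
        # with the module's input (the residual connection).
        product = self.linear_conv(x) * torch.sigmoid(self.gated_conv(x))
        return x + product

class FeatureExtractor(nn.Module):
    """N stacked RGCNN modules with growing dilation, then global pooling."""
    def __init__(self, channels=64, n_modules=4):
        super().__init__()
        # Dilation doubles per module, satisfying the growth required by claim 5.
        self.modules_list = nn.ModuleList(
            RGCNNModule(channels, dilation=2 ** i) for i in range(n_modules))
        self.pool = nn.AdaptiveAvgPool1d(1)  # one plausible global pooling

    def forward(self, freq_data):
        # freq_data: (batch, channels, frames); assumes the frequency data
        # already has `channels` channels (e.g. n_mels == channels).
        h = freq_data
        for module in self.modules_list:
            h = module(h)                # intermediate feature matrix
        return self.pool(h).squeeze(-1)  # feature data: (batch, channels)
```

With matching padding each module preserves the time length, so the element-wise residual sum is well-defined at every depth.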
Optionally, the at least one instruction is loaded and executed by the processor 901 to implement the following method steps:
obtaining multiple first training samples, where each first training sample includes the frequency data and the audio type of a sample audio unit;
training an initial feature extraction model based on the multiple first training samples and a preset first training function, to obtain the feature extraction model.
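The disclosure does not specify the "first training function". A minimal sketch, assuming the extractor is supervised through a temporary classification head with cross-entropy loss; the head, the 64-dimensional feature size, the Adam optimizer, and the learning rate are all assumptions.

```python
import torch
import torch.nn as nn

def train_feature_extractor(extractor, samples, n_types, epochs=10, lr=1e-3):
    """samples: iterable of (frequency_data, audio_type_label) pairs."""
    head = nn.Linear(64, n_types)  # temporary head; 64 = assumed feature size
    params = list(extractor.parameters()) + list(head.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # stands in for the 'first training function'
    for _ in range(epochs):
        for freq_data, label in samples:
            logits = head(extractor(freq_data))
            loss = loss_fn(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return extractor  # the head is discarded; only the extractor is kept
```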
Optionally, the at least one instruction is loaded and executed by the processor 901 to implement the following method steps:
obtaining multiple second training samples, where each second training sample includes the frequency data of each sample audio unit in sample audio data and the audio type of the sample audio data;
obtaining multiple pieces of sample feature data based on the feature extraction model and the frequency data in the multiple second training samples;
training an initial classification model based on the multiple pieces of sample feature data, the audio types in the multiple second training samples, and a preset second training function, to obtain the classification model.
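In this second stage the trained extractor serves as a fixed front end that turns each sample audio unit's frequency data into sample feature data, and only the classification model is fitted. A sketch under the same assumptions, with cross-entropy again standing in for the "second training function":

```python
import torch
import torch.nn as nn

def train_classifier(extractor, classifier, second_samples, epochs=10, lr=1e-3):
    """second_samples: iterable of (list_of_unit_freq_data, audio_type_label)."""
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # stands in for the 'second training function'
    extractor.eval()  # the extractor is fixed while the classifier is trained
    for _ in range(epochs):
        for unit_freqs, label in second_samples:
            with torch.no_grad():
                # Sample feature data: one feature vector per sample audio unit,
                # stacked in time order into the feature matrix.
                feats = torch.stack([extractor(f) for f in unit_freqs], dim=1)
            loss = loss_fn(classifier(feats), label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```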
In the embodiments of the present invention, the target audio data is classified as a whole based on its multiple pieces of feature data, and the audio type of the target audio data is determined accordingly, without classifying each audio unit separately. This prevents error accumulation and can therefore improve the accuracy of detecting whether the audio is pure-music audio.
In an exemplary embodiment, a computer-readable storage medium is also provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the method for determining the audio type of audio data in the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (16)

1. A method for determining the audio type of audio data, characterized in that the method includes:
performing down-sampling on input audio data;
segmenting the down-sampled audio data to obtain multiple audio units;
extracting feature data corresponding to each audio unit among the multiple audio units;
arranging the feature data according to the time order of the audio unit corresponding to each piece of feature data, to obtain a feature matrix of the input audio data;
determining the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
2. The method according to claim 1, characterized in that the extracting feature data corresponding to each audio unit among the multiple audio units includes:
extracting frequency data of each audio unit;
inputting the frequency data of each audio unit into a pre-trained feature extraction model, to obtain the feature data corresponding to each audio unit.
3. The method according to claim 2, characterized in that the feature extraction model includes at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module;
the inputting the frequency data of each audio unit into a pre-trained feature extraction model to obtain the feature data corresponding to each audio unit includes:
for each audio unit, processing the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and inputting the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
4. The method according to claim 3, characterized in that the feature extraction model includes N RGCNN modules, and each of the N RGCNN modules includes a convolution layer without an activation function, a convolution layer with an activation function, an element-wise product module, and an element-wise sum module, where N is a positive integer;
the processing the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and inputting the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit, includes:
for the 1st RGCNN module, inputting the frequency data into the convolution layer without an activation function in the 1st RGCNN module to obtain a non-activated convolution feature matrix of the 1st RGCNN module; inputting the frequency data into the convolution layer with an activation function in the 1st RGCNN module to obtain an activated convolution feature matrix of the 1st RGCNN module; inputting the non-activated convolution feature matrix and the activated convolution feature matrix of the 1st RGCNN module into the element-wise product module to obtain a feature product matrix corresponding to the 1st RGCNN module; and inputting the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise sum module to obtain an intermediate feature matrix corresponding to the 1st RGCNN module;
for the i-th RGCNN module, inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolution layer without an activation function in the i-th RGCNN module to obtain a non-activated convolution feature matrix of the i-th RGCNN module; inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolution layer with an activation function in the i-th RGCNN module to obtain an activated convolution feature matrix of the i-th RGCNN module; inputting the non-activated convolution feature matrix and the activated convolution feature matrix of the i-th RGCNN module into the element-wise product module to obtain a feature product matrix corresponding to the i-th RGCNN module; and inputting the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise sum module to obtain an intermediate feature matrix corresponding to the i-th RGCNN module, where i is any integer greater than 1 and not greater than N;
inputting the intermediate feature matrix corresponding to the N-th RGCNN module into the global pooling module to obtain the feature data corresponding to the audio unit.
5. The method according to claim 4, characterized in that, within a same RGCNN module, the dilation rate of the convolution layer without an activation function is the same as the dilation rate of the convolution layer with an activation function;
the dilation rate of the convolution layers of the i-th RGCNN module is greater than the dilation rate of the convolution layers of the (i-1)-th RGCNN module.
6. The method according to claim 2, characterized in that, before the extracting feature data corresponding to each audio unit among the multiple audio units, the method further includes:
obtaining multiple first training samples, where each first training sample includes frequency data and an audio type of a sample audio unit;
training an initial feature extraction model based on the multiple first training samples and a preset first training function, to obtain the feature extraction model.
7. The method according to claim 6, characterized in that, after the obtaining the feature extraction model, the method further includes:
obtaining multiple second training samples, where each second training sample includes frequency data of each sample audio unit in sample audio data and an audio type of the sample audio data;
obtaining multiple pieces of sample feature data based on the feature extraction model and the frequency data in the multiple second training samples;
training an initial classification model based on the multiple pieces of sample feature data, the audio types in the multiple second training samples, and a preset second training function, to obtain the classification model.
8. A device for determining the audio type of audio data, characterized in that the device includes:
a processing module, configured to perform down-sampling on input audio data;
a segmentation module, configured to segment the down-sampled audio data to obtain multiple audio units;
an extraction module, configured to extract feature data corresponding to each audio unit among the multiple audio units;
an arrangement module, configured to arrange the feature data according to the time order of the audio unit corresponding to each piece of feature data, to obtain a feature matrix of the input audio data;
a determination module, configured to determine the audio type of the input audio data based on the feature matrix and a pre-trained classification model.
9. The device according to claim 8, characterized in that the extraction module is configured to:
extract frequency data of each audio unit;
input the frequency data of each audio unit into a pre-trained feature extraction model, to obtain the feature data corresponding to each audio unit.
10. The device according to claim 9, characterized in that the feature extraction model includes at least one dilated gated residual convolutional neural network (RGCNN) module and a global pooling module;
the extraction module is configured to:
for each audio unit, process the frequency data of the audio unit based on the at least one RGCNN module to obtain an intermediate feature matrix, and input the intermediate feature matrix into the global pooling module to obtain the feature data corresponding to the audio unit.
11. The device according to claim 10, characterized in that the feature extraction model includes N RGCNN modules, and each of the N RGCNN modules includes a convolution layer without an activation function, a convolution layer with an activation function, an element-wise product module, and an element-wise sum module, where N is a positive integer;
the extraction module is configured to:
for the 1st RGCNN module, input the frequency data into the convolution layer without an activation function in the 1st RGCNN module to obtain a non-activated convolution feature matrix of the 1st RGCNN module; input the frequency data into the convolution layer with an activation function in the 1st RGCNN module to obtain an activated convolution feature matrix of the 1st RGCNN module; input the non-activated convolution feature matrix and the activated convolution feature matrix of the 1st RGCNN module into the element-wise product module to obtain a feature product matrix corresponding to the 1st RGCNN module; and input the frequency data and the feature product matrix corresponding to the 1st RGCNN module into the element-wise sum module to obtain an intermediate feature matrix corresponding to the 1st RGCNN module;
for the i-th RGCNN module, input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolution layer without an activation function in the i-th RGCNN module to obtain a non-activated convolution feature matrix of the i-th RGCNN module; input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module into the convolution layer with an activation function in the i-th RGCNN module to obtain an activated convolution feature matrix of the i-th RGCNN module; input the non-activated convolution feature matrix and the activated convolution feature matrix of the i-th RGCNN module into the element-wise product module to obtain a feature product matrix corresponding to the i-th RGCNN module; and input the intermediate feature matrix corresponding to the (i-1)-th RGCNN module and the feature product matrix corresponding to the i-th RGCNN module into the element-wise sum module to obtain an intermediate feature matrix corresponding to the i-th RGCNN module, where i is any integer greater than 1 and not greater than N;
input the intermediate feature matrix corresponding to the N-th RGCNN module into the global pooling module to obtain the feature data corresponding to the audio unit.
12. The device according to claim 11, characterized in that, within a same RGCNN module, the dilation rate of the convolution layer without an activation function is the same as the dilation rate of the convolution layer with an activation function;
the dilation rate of the convolution layers of the i-th RGCNN module is greater than the dilation rate of the convolution layers of the (i-1)-th RGCNN module.
13. The device according to claim 9, characterized in that the device further includes:
a first obtaining module, configured to obtain multiple first training samples before the feature data corresponding to each audio unit among the multiple audio units is extracted, where each first training sample includes frequency data and an audio type of a sample audio unit;
a first training module, configured to train an initial feature extraction model based on the multiple first training samples and a preset first training function, to obtain the feature extraction model.
14. The device according to claim 13, characterized in that the device further includes:
a second obtaining module, configured to obtain multiple second training samples after the feature extraction model is obtained, where each second training sample includes frequency data of each sample audio unit in sample audio data and an audio type of the sample audio data;
a third obtaining module, configured to obtain multiple pieces of sample feature data based on the feature extraction model and the frequency data in the multiple second training samples;
a second training module, configured to train an initial classification model based on the multiple pieces of sample feature data, the audio types in the multiple second training samples, and a preset second training function, to obtain the classification model.
15. A server, characterized in that the server includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method for determining the audio type of audio data according to any one of claims 1 to 7.
16. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the method for determining the audio type of audio data according to any one of claims 1 to 7.
CN201810732941.2A 2018-07-05 2018-07-05 Method and apparatus for determining audio type of audio data Active CN108877783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810732941.2A CN108877783B (en) 2018-07-05 2018-07-05 Method and apparatus for determining audio type of audio data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810732941.2A CN108877783B (en) 2018-07-05 2018-07-05 Method and apparatus for determining audio type of audio data

Publications (2)

Publication Number Publication Date
CN108877783A 2018-11-23
CN108877783B CN108877783B (en) 2021-08-31

Family

ID=64299655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810732941.2A Active CN108877783B (en) 2018-07-05 2018-07-05 Method and apparatus for determining audio type of audio data

Country Status (1)

Country Link
CN (1) CN108877783B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01186999A (en) * 1988-01-20 1989-07-26 Ricoh Co Ltd Speaker collating method
CN1662956A (en) * 2002-06-19 2005-08-31 皇家飞利浦电子股份有限公司 Mega speaker identification (ID) system and corresponding methods therefor
US20050195720A1 (en) * 2004-02-24 2005-09-08 Shigetaka Nagatani Data processing apparatus, data processing method, reproducing apparatus, and reproducing method
CN102760444A (en) * 2012-04-25 2012-10-31 清华大学 Support vector machine based classification method of base-band time-domain voice-frequency signal
CN106571150A (en) * 2015-10-12 2017-04-19 阿里巴巴集团控股有限公司 Method and system for positioning human acoustic zone of music
CN105895110A (en) * 2016-06-30 2016-08-24 北京奇艺世纪科技有限公司 Method and device for classifying audio files
CN107293290A (en) * 2017-07-31 2017-10-24 郑州云海信息技术有限公司 The method and apparatus for setting up Speech acoustics model
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HAN Bing et al., "A Method for Automatic Classification of News Audio Based on Selective Ensemble SVM", Pattern Recognition and Artificial Intelligence *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097895A (en) * 2019-05-14 2019-08-06 腾讯音乐娱乐科技(深圳)有限公司 A kind of absolute music detection method, device and storage medium
CN110097895B (en) * 2019-05-14 2021-03-16 腾讯音乐娱乐科技(深圳)有限公司 Pure music detection method, pure music detection device and storage medium
CN110047514A (en) * 2019-05-30 2019-07-23 腾讯音乐娱乐科技(深圳)有限公司 A kind of accompaniment degree of purity appraisal procedure and relevant device
CN110853457A (en) * 2019-10-31 2020-02-28 中国科学院自动化研究所南京人工智能芯片创新研究院 Interactive music teaching guidance method
CN110955789A (en) * 2019-12-31 2020-04-03 腾讯科技(深圳)有限公司 Multimedia data processing method and equipment
CN110955789B (en) * 2019-12-31 2024-04-12 腾讯科技(深圳)有限公司 Multimedia data processing method and equipment
CN111444382A (en) * 2020-03-30 2020-07-24 腾讯科技(深圳)有限公司 Audio processing method and device, computer equipment and storage medium
CN111444382B (en) * 2020-03-30 2021-08-17 腾讯科技(深圳)有限公司 Audio processing method and device, computer equipment and storage medium
CN113053410A (en) * 2021-02-26 2021-06-29 北京国双科技有限公司 Voice recognition method, voice recognition device, computer equipment and storage medium
CN112989106A (en) * 2021-05-18 2021-06-18 北京世纪好未来教育科技有限公司 Audio classification method, electronic device and storage medium
CN112989106B (en) * 2021-05-18 2021-07-30 北京世纪好未来教育科技有限公司 Audio classification method, electronic device and storage medium

Also Published As

Publication number Publication date
CN108877783B (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN108877783A (en) The method and apparatus for determining the audio types of audio data
CN108305641B (en) Method and device for determining emotion information
CN109346087B (en) Noise robust speaker verification method and apparatus against bottleneck characteristics of a network
US11282514B2 (en) Method and apparatus for recognizing voice
CN106469555B (en) Voice recognition method and terminal
US11854536B2 (en) Keyword spotting apparatus, method, and computer-readable recording medium thereof
TWI740315B (en) Sound separation method, electronic and computer readable storage medium
CN113555007B (en) Voice splicing point detection method and storage medium
CN116403250A (en) Face recognition method and device with shielding
CN113763966B (en) End-to-end text irrelevant voiceprint recognition method and system
CN109545226A (en) A kind of audio recognition method, equipment and computer readable storage medium
CN110570871A (en) TristouNet-based voiceprint recognition method, device and equipment
CN113793620A (en) Voice noise reduction method, device and equipment based on scene classification and storage medium
CN106663421A (en) Voice recognition system and voice recognition method
CN113593546B (en) Terminal equipment awakening method and device, storage medium and electronic device
CN115201769A (en) Radar signal pulse repetition interval generation method, device, equipment and medium
CN115221351A (en) Audio matching method and device, electronic equipment and computer-readable storage medium
CN111582456B (en) Method, apparatus, device and medium for generating network model information
CN111291186B (en) Context mining method and device based on clustering algorithm and electronic equipment
CN114566160A (en) Voice processing method and device, computer equipment and storage medium
CN114329042A (en) Data processing method, device, equipment, storage medium and computer program product
CN112071331A (en) Voice file repairing method and device, computer equipment and storage medium
CN111898529A (en) Face detection method and device, electronic equipment and computer readable medium
CN111768764A (en) Voice data processing method and device, electronic equipment and medium
CN114971643B (en) Abnormal transaction identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant