CN111460214B

CN111460214B - Classification model training method, audio classification method, device, medium and equipment

Info

Publication number: CN111460214B
Application number: CN202010255326.4A
Authority: CN
Inventors: 王康; 何怡; 许凌
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing ByteDance Network Technology Co Ltd
Priority date: 2020-04-02
Filing date: 2020-04-02
Publication date: 2024-04-19
Anticipated expiration: 2040-04-02
Also published as: CN111460214A

Abstract

The disclosure relates to a classification model training method, an audio classification method, an apparatus, a medium and a device. The method comprises: acquiring an initial audio classification model, wherein the initial audio classification model is trained on a plurality of first audios belonging to common languages; acquiring a plurality of second audios belonging to uncommon languages, and determining the language features and the language of each second audio; setting the fully connected layer in the initial audio classification model according to the total number of languages to which the second audios belong, so as to obtain an intermediate audio classification model; and training the intermediate audio classification model by taking the language features of the second audios as model input data and the languages of the second audios as model output data, so as to obtain a target audio classification model. In this way, the accuracy of identifying and classifying uncommon languages can be improved, and the problems of poor model performance and low accuracy caused by the small number of samples available for uncommon languages are alleviated.

Description

Classification model training method, audio classification method, device, medium and equipment

Technical Field

The disclosure relates to the field of computer technology, and in particular to a classification model training method, an audio classification method, an apparatus, a medium and a device.

Background

In audio processing scenarios, it is sometimes necessary to identify which language the speech in a piece of audio belongs to, which can also be regarded as classifying the audio by its content.

In the related art, models are generally trained in advance for the target language to be identified. Multiple training approaches may be used; after the corresponding models are obtained, their recognition performance in the same recognition scenario is compared, and the best-performing model is selected as the model for identifying the target language. The selected model is then used whenever it is later necessary to identify whether the speech in an audio belongs to the target language.

The above approach works well when sufficient training data is available for the target language, for example, when the target language is a common language such as Chinese or English. If little training data is available for the target language, for example, when the target language is an uncommon language such as Hindi or Spanish, the trained model is less accurate because of the insufficient training data. With the above approach, therefore, even the best-performing of the multiple models cannot reach the required recognition accuracy, and the language of the speech in the audio cannot be accurately identified.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In a first aspect, the present disclosure provides an audio classification model training method, the method comprising:

acquiring an initial audio classification model, wherein the initial audio classification model is trained on a plurality of first audios belonging to common languages;

acquiring a plurality of second audios belonging to uncommon languages, and determining the language features and the language of each second audio;

setting the fully connected layer in the initial audio classification model according to the total number of languages to which the second audios belong, so as to obtain an intermediate audio classification model; and

training the intermediate audio classification model by taking the language features of the second audios as model input data and the languages of the second audios as model output data, so as to obtain a target audio classification model.

In a second aspect, the present disclosure provides an audio classification method, the method comprising:

splitting audio to be processed to obtain a plurality of audio segments to be processed;

respectively inputting each audio segment to be processed into a target audio classification model to obtain an output result of the target audio classification model, wherein the target audio classification model is trained according to the audio classification model training method of the first aspect of the disclosure, and the output result indicates the probability that the audio segment input into the target audio classification model corresponds to each of the languages to which the second audios belong; and

determining the language to which the audio to be processed belongs according to the probability, for each audio segment to be processed, that the audio segment corresponds to each of the languages to which the second audios belong.

In a third aspect, the present disclosure provides an audio classification model training apparatus, the apparatus comprising:

a first acquisition module, configured to acquire an initial audio classification model, wherein the initial audio classification model is trained on a plurality of first audios belonging to common languages;

a second acquisition module, configured to acquire a plurality of second audios belonging to uncommon languages and to determine the language features and the language of each second audio;

a setting module, configured to set the fully connected layer in the initial audio classification model according to the total number of languages to which the second audios belong, so as to obtain an intermediate audio classification model; and

a model training module, configured to train the intermediate audio classification model by taking the language features of the second audios as model input data and the languages to which the second audios belong as model output data, so as to obtain a target audio classification model.

In a fourth aspect, the present disclosure provides an audio classification apparatus, the apparatus comprising:

a segmentation module, configured to split the audio to be processed to obtain a plurality of audio segments to be processed;

a classification module, configured to respectively input each audio segment to be processed into a target audio classification model to obtain an output result of the target audio classification model, wherein the target audio classification model is trained according to the audio classification model training method of the first aspect of the disclosure, and the output result indicates the probability that the audio segment input into the target audio classification model corresponds to each of the languages to which the second audios belong; and

a determining module, configured to determine the language to which the audio to be processed belongs according to the probability, for each audio segment to be processed, that the audio segment corresponds to each of the languages to which the second audios belong.

In a fifth aspect, the present disclosure provides a non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processing device performs the steps of the method of the first aspect of the disclosure or which when executed by a processing device performs the steps of the method of the second aspect of the disclosure.

In a sixth aspect, the present disclosure provides an electronic device comprising:

a storage device having a computer program stored thereon; and

a processing device for executing the computer program in the storage device to implement the steps of the method according to the first aspect of the present disclosure or to implement the steps of the method according to the second aspect of the present disclosure.

According to the above technical solution, an initial audio classification model is acquired, a plurality of second audios belonging to uncommon languages are acquired, and the language features and the language of each second audio are determined. The fully connected layer in the initial audio classification model is set according to the total number of languages to which the second audios belong to obtain an intermediate audio classification model. The intermediate audio classification model is then trained, with the language features of the second audios as model input data and the languages of the second audios as model output data, to obtain a target audio classification model. Because the initial audio classification model is trained on a plurality of first audios belonging to common languages, it already has a basic language classification capability. Starting from an initial audio classification model with good language classification capability and then performing targeted training on uncommon languages can therefore improve the accuracy of identifying and classifying uncommon languages, and alleviates the problems of poor model performance and low accuracy caused by the small number of samples available for uncommon languages.

Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.

Drawings

The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.

In the drawings:

FIG. 1 is a flow chart of an audio classification model training method provided in accordance with one embodiment of the present disclosure;

FIG. 2 is a flow chart of an audio classification method provided in accordance with an embodiment of the present disclosure;

FIG. 3 is a block diagram of an audio classification model training apparatus provided in accordance with an embodiment of the present disclosure;

FIG. 4 is a block diagram of an audio classification device provided in accordance with an embodiment of the present disclosure;

FIG. 5 is a block diagram of an electronic device provided in accordance with one embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "including" and variations thereof as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.

It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.

It should be noted that references to "a" or "one" and to "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.

The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

As described in the background, in the prior art, language recognition of the speech in an audio is generally performed by training models in advance for the target language to be recognized. Multiple training approaches may be used; after the corresponding models are obtained, their recognition performance in the same recognition scenario is compared, and the best-performing model is selected as the audio classification model for recognizing the target language. That audio classification model is then used whenever it is later necessary to recognize whether the speech in an audio belongs to the target language. For example, if the target languages are Chinese, English and Hindi, models are trained directly on the Chinese, English and Hindi training data in the above manner to obtain a plurality of models for identifying Chinese, English and Hindi, and the model with the best recognition performance is selected as the audio classification model for identifying Chinese, English and Hindi.

The above approach works well when sufficient training data is available for the target language, for example, when the target language is a common language such as Chinese or English, for which thousands of hours of audio or more are available as training data. If the target language is an uncommon language such as Hindi or Spanish, only a few hundred hours, a few tens of hours, or even less audio is available as training data; for example, only about 150 hours of Hindi audio may be usable as training data. Because little training data is available for an uncommon language, the trained model is less accurate, so that even if the best-performing of the multiple models is selected as the audio classification model in the above manner, its recognition accuracy cannot reach the required standard and the language of the speech in the audio cannot be accurately recognized. Thus, in the above example, the resulting audio classification model for recognizing Chinese, English and Hindi does not classify languages well, because the training data for Hindi itself is insufficient.

In order to solve the above problems in the prior art, the present disclosure provides a classification model training method, an audio classification method, an apparatus, a medium, and a device.

FIG. 1 is a flow chart of an audio classification model training method provided in accordance with one embodiment of the present disclosure. As shown in FIG. 1, the method may include the following steps.

In step 11, an initial audio classification model is obtained.

The initial audio classification model is trained on a plurality of first audios belonging to common languages. A common language is a language for which sufficient training data is available, for example, Chinese or English.

Unless otherwise indicated, a language as referred to in the present disclosure may be the language of a country (e.g., Chinese, English, French) or a dialect of a region (e.g., Sichuanese, Cantonese).

Prior to step 11, the method provided by the present disclosure may further include the steps of:

acquiring a plurality of first audios belonging to common languages, and determining the language features of each first audio and the language to which each first audio belongs; and

training a neural network model by taking the language features of the first audios as model input data and the languages to which the first audios belong as model output data, so as to obtain the initial audio classification model.

The plurality of first audios belonging to common languages may be acquired from the data sets (each including audios belonging to that common language) respectively corresponding to a plurality of (i.e., two or more) common languages, and, in consideration of the training effect of the model, the amounts of first audio corresponding to the respective common languages may be kept roughly equal. For example, 4000 h (hours) of first audio of Chinese and English is obtained in total, consisting of 2000 h of first audio of Chinese and 2000 h of first audio of English.
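By way of illustration only, the balancing of audio amounts across languages described above can be sketched as follows. The function name `balance_by_duration`, the clip identifiers and the rule of capping every language at the size of the smallest data set are assumptions made for this sketch; the disclosure only states that the amounts may be kept roughly equal.

```python
def balance_by_duration(datasets):
    """Select clips so that every language contributes roughly the same hours.

    `datasets` maps a language name to a list of (clip_id, hours) pairs.
    The per-language budget is the total duration of the smallest data set
    (an assumption; the disclosure only asks for comparable amounts).
    """
    budget = min(sum(hours for _, hours in clips) for clips in datasets.values())
    selected = {}
    for language, clips in datasets.items():
        used, kept = 0.0, []
        for clip_id, hours in clips:
            if used + hours > budget:
                break  # stop once the per-language budget would be exceeded
            kept.append(clip_id)
            used += hours
        selected[language] = kept
    return selected


# Hypothetical toy data sets; real data sets would hold thousands of hours.
datasets = {
    "Chinese": [("zh-1", 2.0), ("zh-2", 1.0), ("zh-3", 1.0)],
    "English": [("en-1", 1.5), ("en-2", 1.5)],
}
selected = balance_by_duration(datasets)
print(selected)
```

In this toy case the English data set (3 h) sets the budget, so the third Chinese clip is left out.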

In one possible embodiment, the language features of each first audio may be extracted by a pre-trained feature extraction model. For example, the feature extraction model may be trained based on the AudioSet dataset. In particular, the feature extraction model may be obtained from the pretrained model based on the AudioSet dataset provided by Google. The pretrained model is the audio classification model used in the prior art mentioned above, and it classifies the audio input to it to identify which language that audio belongs to. When the last layer of the pretrained model (i.e., the last fully connected layer) is removed, the remainder of the model is still able to extract the features of the audio that the pretrained model uses to produce its classification results. Because these features help classify the language of the audio, they can be regarded as the language features of the audio. The remaining part of the model can therefore be used as a feature extraction model for extracting the language features of the first audio. In one possible example, the pretrained model is a CNN (Convolutional Neural Network) model, and the language features of each first audio are the features extracted by the convolutional neural network that help classify that first audio.

A first audio is input into the feature extraction model to obtain the language features of that first audio output by the feature extraction model. The language features of each first audio form a feature vector, such as an N-dimensional feature vector, where N may be, for example, 128.
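By way of illustration only, the idea of removing the last fully connected layer of a pretrained classifier and using the remainder as a feature extraction model can be sketched as follows. The layer sizes (64, 256, 10), the random placeholder weights and the helper names `make_dense` and `extract_features` are assumptions for this sketch and do not describe the actual pretrained model; only the 128-dimensional feature size follows the example value of N given above.

```python
import random


def make_dense(in_dim, out_dim, rng):
    """Return a dense ReLU layer as a callable; weights are random placeholders."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

    def layer(x):
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return layer


rng = random.Random(0)

# Hypothetical stand-in for the pretrained classifier:
# two hidden layers followed by a final fully connected classification layer.
pretrained_layers = [
    make_dense(64, 256, rng),
    make_dense(256, 128, rng),
    make_dense(128, 10, rng),  # last fully connected layer, to be removed
]

# The feature extraction model keeps every layer except the last one.
feature_layers = pretrained_layers[:-1]


def extract_features(frame):
    """Run the input through the truncated model and return its feature vector."""
    for layer in feature_layers:
        frame = layer(frame)
    return frame


features = extract_features([rng.random() for _ in range(64)])
print(len(features))  # 128, the dimension of the layer before the removed one
```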

As described above, the data set of a given language stores a plurality of audios belonging to that language; therefore, the language to which each first audio belongs is known when the first audio is acquired.

After the language features and the language of each first audio are determined, these data can be used for model training. The model training process is as follows: the language features of the first audios are used as the input data of the model, the languages of the first audios are used as the true output of the model, and the neural network model is trained to obtain the initial audio classification model. In each training iteration, the language features of one first audio are used as model input data, and the language to which that first audio belongs is used as the true output. The initial audio classification model may be, for example, an LSTM (Long Short-Term Memory) network model.

The input of the initial audio classification model is the feature vector of an audio (i.e., the N-dimensional feature vector described above), and the output is the probability that the input audio corresponds to each of the languages to which the first audios belong. The output may take the form of an M-dimensional vector, where M is the total number of languages to which the first audios used to train the initial audio classification model belong. The greater the probability value corresponding to a language, the more likely the audio belongs to that language. For example, if the initial audio classification model is trained on first audios corresponding to the two common languages Chinese and English, its output is a 2-dimensional vector whose elements represent the probability that the input belongs to Chinese and the probability that it belongs to English, respectively.

It should be noted that, the manner of training the neural network model belongs to the prior art, and is well known to those skilled in the art, and will not be described in detail herein.

Because the initial audio classification model is trained on a plurality of first audios belonging to common languages, for which ample training data is available, its classification performance is good, and the parameters in the model give it a basic language classification capability. In the solution provided by the present disclosure, training of the initial audio classification model may be considered the first stage of training of the final desired model.

In step 12, a plurality of second audios belonging to uncommon languages are acquired, and the language features and the language of each second audio are determined.

Here, an uncommon language refers to a language for which little training data is available, for example, Hindi. The plurality of second audios belonging to uncommon languages may be acquired from the data sets (each including audios belonging to that uncommon language) respectively corresponding to a plurality of (i.e., two or more) uncommon languages, and, in consideration of the training effect of the model, the amounts of second audio corresponding to the respective uncommon languages may be kept roughly equal. For example, if the languages to which the second audios belong include Hindi A and Hindi B (two dialects of Hindi), 800 h of second audio of Hindi A and Hindi B may be obtained in total, consisting of 400 h of second audio of Hindi A and 400 h of second audio of Hindi B. Where the data set of an uncommon language is insufficient, the data can be expanded by copying and then used for model training, and the copied data may be processed to some extent during expansion. For example, if only 200 h of audio is available for Hindi A, that 200 h of audio can be copied, with some noise added to the copy to form new audio, and the original 200 h of audio together with the new 200 h of audio is used as the 400 h of audio for model training.

Likewise, as described above, the data set of a given language stores a plurality of audios belonging to that language; therefore, the language to which each second audio belongs is known when the second audio is acquired.
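By way of illustration only, expanding a small data set by copying the audio and adding noise to the copy can be sketched as follows. The use of Gaussian noise, the `noise_std` value and the function name `augment_with_noise` are assumptions for this sketch; the disclosure only states that some noise may be added to the copied audio.

```python
import random


def augment_with_noise(clips, noise_std=0.005, seed=0):
    """Return the original clips followed by noisy copies of them.

    Each clip is a list of samples in [-1.0, 1.0]. Gaussian noise is an
    assumption; any other perturbation could be used in its place.
    """
    rng = random.Random(seed)
    noisy = [[s + rng.gauss(0.0, noise_std) for s in clip] for clip in clips]
    return clips + noisy


# Hypothetical toy clips standing in for 200 h of audio.
original = [[0.0, 0.1, -0.1], [0.2, 0.0, -0.2]]
expanded = augment_with_noise(original)
print(len(original), len(expanded))  # the amount of data is doubled
```

The originals are kept unchanged, so the expanded set contains twice as much audio as the input.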

In one possible implementation, determining the language features of each second audio in step 12 may include the steps of:

extracting the language features of each second audio through a pre-trained feature extraction model.

The feature extraction model is trained based on the AudioSet dataset. The feature extraction model and the language features have already been described above in connection with determining the language features of each first audio, and the description is not repeated here. The principle of determining the language features of each second audio is the same as that for each first audio; it is only necessary to change the processing object from the first audio to the second audio in the description given above.

In step 13, the fully connected layer in the initial audio classification model is set according to the total number of languages to which the second audios belong, so as to obtain an intermediate audio classification model.

The initial audio classification model includes an input layer, intermediate layers, and an output layer. The input layer passes the input data to the intermediate layers. The intermediate layers perform the model's internal operations on the input data; the last of the intermediate layers is a fully connected layer whose output is M-dimensional data corresponding to M language categories, where M is the total number of languages to which the first audios used to train the initial audio classification model belong. The fully connected layer is connected to the output layer through an activation function, and the output layer obtains an M-dimensional vector, i.e., M probability values, from the M-dimensional data output by the fully connected layer.

As described above, in the solution provided by this disclosure, training of the initial audio classification model may be considered the first stage of training of the final desired model, which gives the model good language classification capability. The second stage of training can then begin. In the second stage, targeted training is performed on the uncommon languages, and the categories of the fully connected layer need to be changed accordingly to suit this training. The fully connected layer in the initial audio classification model is therefore set according to the total number of languages to which the second audios belong, so as to obtain the intermediate audio classification model.
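By way of illustration only, the conversion of the M-dimensional output of the fully connected layer into M probability values can be sketched as follows. The disclosure does not name the activation function; softmax is assumed here because it is a common choice for turning class scores into probabilities, and the score values are hypothetical.

```python
import math


def softmax(scores):
    """Map M raw scores to M probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


languages = ["Chinese", "English"]  # M = 2 in the example above
scores = [2.0, 0.5]                 # hypothetical fully connected layer output
probs = softmax(scores)
predicted = languages[probs.index(max(probs))]
print(probs, predicted)
```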

In one possible embodiment, step 13 may comprise the steps of:

setting the categories contained in the fully connected layer in the initial audio classification model, so that the number of categories contained in the fully connected layer is the same as the total number of languages to which the second audios belong, and the categories contained in the fully connected layer are in one-to-one correspondence with the languages to which the second audios belong.

For example, if the initial audio classification model is trained on data of Chinese and English, the fully connected layer of the initial audio classification model has 2 categories, corresponding to Chinese and English respectively. If the second stage of training is to perform targeted training on Hindi A and Hindi B, the total number of categories of the fully connected layer is set to 2, with the categories corresponding to Hindi A and Hindi B respectively, and the intermediate audio classification model is thereby obtained.

Thus, the fully connected layer of the intermediate audio classification model matches the second stage of training, and further training of the intermediate audio classification model may begin.
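By way of illustration only, setting the fully connected layer so that its categories correspond one-to-one with the languages of the second audios can be sketched as follows. The dictionary-based model representation, the helper names `new_head` and `build_intermediate_model`, and the choice to re-initialize the new layer's weights are assumptions for this sketch; the disclosure does not specify how the layer's parameters are initialized.

```python
import random


def new_head(feature_dim, labels, rng):
    """Create a fully connected layer with one output category per label."""
    return {
        "labels": list(labels),
        "weights": [[rng.uniform(-0.01, 0.01) for _ in range(feature_dim)]
                    for _ in labels],
        "bias": [0.0 for _ in labels],
    }


def build_intermediate_model(initial_model, second_languages, seed=0):
    """Keep the trained layers and set the fully connected layer's categories."""
    rng = random.Random(seed)
    feature_dim = len(initial_model["head"]["weights"][0])
    return {
        "body": initial_model["body"],  # first-stage parameters are reused
        "head": new_head(feature_dim, second_languages, rng),
    }


# Hypothetical initial model trained on Chinese and English.
initial_model = {
    "body": {"description": "layers before the fully connected layer"},
    "head": new_head(128, ["Chinese", "English"], random.Random(1)),
}
intermediate = build_intermediate_model(initial_model, ["Hindi A", "Hindi B"])
print(intermediate["head"]["labels"], len(intermediate["head"]["weights"]))
```

The same function would yield a 3-category layer if three uncommon languages were passed in, since the number of output rows follows the number of labels.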

In step 14, the language feature of the second audio is used as model input data, the language to which the second audio belongs is used as model output data, and the intermediate audio classification model is trained to obtain the target audio classification model.

The training process and the input data format of the intermediate audio classification model are consistent with those of the initial audio classification model.

After the language features and the language of each second audio are determined, these data can be used for model training. The model training process is as follows: the intermediate audio classification model is trained by taking the language features of the second audios as model input data and the languages of the second audios as the true output of the model, so as to obtain the target audio classification model. In each training iteration, the language features of one second audio are used as model input data, and the language to which that second audio belongs is used as the true output. As noted above, the intermediate audio classification model may be an LSTM model.

The input of the target audio classification model is the feature vector of an audio (such as the N-dimensional feature vector described above), and the output is the probability that the input audio corresponds to each of the languages to which the second audios belong. The output may take the form of a K-dimensional vector, where K is the total number of languages to which the second audios used to train the target audio classification model belong. The greater the probability value corresponding to a language, the more likely the audio belongs to that language. For example, if the target audio classification model is trained on second audios corresponding to the two uncommon languages Hindi A and Hindi B, its output is a 2-dimensional vector whose elements represent the probability that the input data belongs to Hindi A and the probability that it belongs to Hindi B, respectively.
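By way of illustration only, a training loop that takes language features as input and the language label as the true output can be sketched as follows. For brevity the sketch trains only a softmax output layer with plain stochastic gradient descent on 2-dimensional toy features; the actual model may be an LSTM with 128-dimensional inputs, and the optimizer, learning rate, number of epochs and toy data are all assumptions made for this sketch.

```python
import math


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, bias, x):
    """Return the K class probabilities for one feature vector."""
    return softmax([sum(w * v for w, v in zip(row, x)) + b
                    for row, b in zip(weights, bias)])


def train(features, labels, num_classes, epochs=200, lr=0.5):
    """Fit a softmax layer by stochastic gradient descent on cross-entropy."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):  # one audio per update
            probs = forward(weights, bias, x)
            for k in range(num_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)
                bias[k] -= lr * grad
                for j in range(dim):
                    weights[k][j] -= lr * grad * x[j]
    return weights, bias


languages = ["Hindi A", "Hindi B"]  # K = 2
# Hypothetical toy language features and their true language indices.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
weights, bias = train(features, labels, num_classes=len(languages))

probs = forward(weights, bias, [0.95, 0.05])
print(languages[probs.index(max(probs))], probs)
```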

In addition, in training the intermediate audio classification model to obtain the target audio classification model, not only the second audio belonging to the common language may be trained in the training process of the second stage in the above manner, but also the intermediate audio classification model may be trained by combining the common language and the common language audio in the training process of the second stage in the above manner, in the same manner as that given above, that is, using the language features of the audio belonging to the common language (or the language features of the audio belonging to the non-common language) as the model input data and using the language to which the input audio belongs as the model output data, so as to obtain the target audio classification model. The balance and quantity of training data are the same as those given above, and the description thereof will not be repeated here. For example, in the second stage of model training, training may be performed in combination with chinese, english, and hindi, and the final target audio classification model is a model for classifying chinese, english, and hindi.

Fig. 2 is a flow chart of an audio classification method provided in accordance with an embodiment of the present disclosure. As shown in fig. 2, the method may include the following steps.

In step 21, the audio to be processed is segmented to obtain a plurality of audio segments to be processed.

The longer the audio, the more computing power its processing requires and the more problems are likely to arise. The whole audio to be processed can therefore first be segmented into a plurality of audio segments to be processed, which are then processed individually, thereby effectively reducing the computational load of audio processing and improving its efficiency and accuracy.

In one possible implementation, the audio to be processed can be split into equal parts, so that the resulting audio segments to be processed have the same duration and the same data format, which makes subsequent processing more efficient.
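Equal splitting can be sketched as follows. This is a minimal illustration, assuming the audio is already decoded into a flat sequence of samples. The function name `split_equal` and its parameters are hypothetical, and dropping a trailing remainder shorter than one segment is a choice of this sketch rather than something the text prescribes.

```python
def split_equal(samples, sample_rate, segment_seconds):
    """Split a sample sequence into consecutive segments of equal duration.

    A trailing remainder shorter than one full segment is dropped so that
    every returned segment has exactly the same length (sketch assumption).
    """
    segment_len = int(sample_rate * segment_seconds)
    if segment_len <= 0:
        raise ValueError("segment length must be positive")
    full_segments = len(samples) // segment_len
    return [
        samples[i * segment_len:(i + 1) * segment_len]
        for i in range(full_segments)
    ]
```

For example, at a hypothetical sample rate of 16000 Hz, `split_equal(samples, 16000, 1.0)` yields one-second segments of 16000 samples each.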

In another possible implementation, the audio to be processed can be analyzed and segmented by using the parts of the audio that contain no human voice as segmentation points, which ensures that the content within each resulting audio segment to be processed is closely related and facilitates subsequent language identification.
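Segmentation at parts without human voice can be sketched as below. The text does not say how such parts are detected, so this sketch substitutes a simple frame-energy threshold as a stand-in for voice detection. The function name `split_on_silence`, the frame length, and the threshold are all hypothetical.

```python
def split_on_silence(samples, frame_len, energy_threshold):
    """Split samples at frames whose mean energy falls below a threshold.

    Low-energy frames are treated as "no human voice" (a proxy assumed by
    this sketch) and are discarded; consecutive louder frames are joined
    into one segment.
    """
    segments, current = [], []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= energy_threshold:
            current.extend(frame)      # voiced by this proxy: keep accumulating
        elif current:
            segments.append(current)   # a quiet frame closes the running segment
            current = []
    if current:
        segments.append(current)
    return segments
```

A practical system would more likely use a dedicated voice activity detector; the energy test here only shows where the segmentation points fall.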

In step 22, each audio segment to be processed is input to the target audio classification model to obtain an output result of the target audio classification model.

The target audio classification model is trained by the audio classification model training method provided by any embodiment of the disclosure. Accordingly, the output result is used for indicating the probability that the audio segment to be processed input to the target audio classification model corresponds to each of the languages to which the second audio belongs.

In step 23, for each audio segment to be processed, the language to which the audio segment to be processed belongs is determined according to the probability that the audio segment to be processed corresponds to each of the languages to which the second audio belongs.

According to the probability that the audio fragment to be processed corresponds to each language in the languages to which the second audio belongs, the language to which the audio fragment to be processed belongs can be determined, and furthermore, according to the language to which each audio fragment to be processed belongs, the language to which the audio to be processed belongs can be determined.

In one possible implementation manner, the language to which the audio segment to be processed belongs may be determined by the following manner:

if, among the probabilities that the audio segment to be processed corresponds to each of the languages to which the second audio belongs, there is a probability greater than a preset probability threshold, determining the language corresponding to that probability as the language to which the audio segment to be processed belongs.

The preset probability threshold may be set according to an empirical value. The above describes how the language of a single audio segment to be processed is determined; the language of every audio segment to be processed can be determined in the same way.
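The threshold-based decision can be sketched as follows. The function name `language_by_threshold` is hypothetical. Two behaviors are assumptions of this sketch, since the text leaves them open: returning `None` when no probability exceeds the threshold, and choosing the highest probability when several exceed it.

```python
def language_by_threshold(probabilities, threshold):
    """Return the language whose probability exceeds the preset threshold.

    probabilities: mapping from language name to probability for one segment.
    Returns None when no probability exceeds the threshold (sketch assumption).
    """
    above = {lang: p for lang, p in probabilities.items() if p > threshold}
    if not above:
        return None
    # If several languages pass the threshold, pick the most probable one
    # (sketch assumption; with a threshold of 0.5 or more at most one can pass).
    return max(above, key=above.get)
```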

In another possible embodiment, step 23 may comprise the steps of:

for each audio segment to be processed, determining, according to the maximum probability corresponding to the audio segment to be processed, the language corresponding to the maximum probability as the language to which the audio segment to be processed belongs;

and determining the language of the audio to be processed according to the language of each audio fragment to be processed.

In this embodiment, the language to which an audio segment to be processed belongs is determined as follows: according to the maximum probability corresponding to the audio segment to be processed, the language corresponding to the maximum probability is determined as the language to which the audio segment to be processed belongs. As described above, the larger the probability that the audio segment to be processed corresponds to a certain language, the more likely the audio segment belongs to that language, so the language to which it belongs can be determined directly from the maximum probability value. The above describes how the language of a single audio segment to be processed is determined; the language of every audio segment to be processed can be determined in the same way.
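The maximum-probability rule amounts to taking the index of the largest entry in the probability vector. A minimal sketch follows, in which `language_by_max` and the label list are hypothetical names:

```python
def language_by_max(probability_vector, languages):
    """Return the language at the position of the largest probability.

    probability_vector: K probabilities output by the model for one segment.
    languages: K language names in the same order as the vector entries.
    """
    if len(probability_vector) != len(languages):
        raise ValueError("one probability is expected per language")
    if not languages:
        raise ValueError("at least one language is required")
    best = max(range(len(probability_vector)), key=probability_vector.__getitem__)
    return languages[best]
```

Unlike the threshold variant, this rule always yields a language, even when the largest probability is small.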

In one possible embodiment, determining the language to which the audio to be processed belongs according to the language to which each audio piece to be processed belongs may include the following steps:

counting the languages to which the audio segments to be processed belong, so as to determine the language that occurs most often;

and determining the language that occurs most often as the language to which the audio to be processed belongs.

After an audio to be processed is segmented, a plurality of audio segments to be processed are obtained, each of which corresponds to one language. The larger the proportion of audio segments belonging to a certain language, the more likely the audio to be processed belongs to that language. Therefore, the languages to which the audio segments to be processed belong can be counted to determine the language that occurs most often, and that language is determined as the language to which the audio to be processed belongs. For example, if audio C to be processed is split into 10 audio segments to be processed, of which 8 belong to the Indian language and the remaining 2 belong to Chinese, it can be determined that the language to which audio C belongs is the Indian language.
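The counting step can be sketched with Python's standard `collections.Counter`. The function name `language_of_audio` is hypothetical, and the text does not say how ties are resolved; in this sketch a tie goes to the language that was encountered first.

```python
from collections import Counter


def language_of_audio(segment_languages):
    """Return the language assigned to the largest number of segments."""
    if not segment_languages:
        raise ValueError("at least one segment language is required")
    counts = Counter(segment_languages)
    # most_common(1) returns the (language, count) pair with the highest count.
    language, _ = counts.most_common(1)[0]
    return language


# The example from the text: 8 of 10 segments are Indian, 2 are Chinese.
example = ["Indian"] * 8 + ["Chinese"] * 2
result = language_of_audio(example)
```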

According to the above scheme, the audio to be processed is segmented to obtain a plurality of audio segments to be processed, each audio segment to be processed is input into the target audio classification model to obtain an output result of the target audio classification model, and the language to which the audio to be processed belongs is determined according to the probability that each audio segment to be processed corresponds to each of the languages to which the second audio belongs. Because the target audio classification model is trained by the audio classification model training method provided by any embodiment of the disclosure, it has excellent recognition and classification performance and can improve the accuracy of determining the language of the audio to be processed.

Fig. 3 is a block diagram of an audio classification model training apparatus provided in accordance with an embodiment of the present disclosure. As shown in fig. 3, the apparatus 30 may include:

a first obtaining module 31, configured to obtain an initial audio classification model, where the initial audio classification model is trained on a plurality of first audios belonging to common languages;

a second obtaining module 32, configured to obtain a plurality of second audios belonging to uncommon languages, and to determine the language features and the language of each second audio;

a setting module 33, configured to set the fully connected layer in the initial audio classification model according to the total number of languages to which the second audios belong, so as to obtain an intermediate audio classification model; and

a model training module 34, configured to train the intermediate audio classification model by using the language features of the second audio as model input data and the language to which the second audio belongs as model output data, so as to obtain a target audio classification model.

Optionally, the setting module 33 is configured to set the categories included in the fully connected layer in the initial audio classification model, so that the number of categories included in the fully connected layer is the same as the total number of languages to which the second audio belongs, and the categories included in the fully connected layer are in one-to-one correspondence with the languages to which the second audio belongs.
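What the setting module does to the fully connected layer can be sketched in plain Python. The model here is a hypothetical stand-in (`AudioClassifier`), not the structure used in the disclosure. Re-initializing the layer's weights with small random values is also an assumption of this sketch, since the text only requires that the number of categories equal the number of languages and that the two correspond one to one.

```python
import random
from dataclasses import dataclass, field


@dataclass
class AudioClassifier:
    """Hypothetical model holding only what the sketch needs."""
    feature_dim: int                                  # size of the vector entering the FC layer
    fc_weights: list = field(default_factory=list)    # K rows of feature_dim weights
    fc_bias: list = field(default_factory=list)       # K biases
    classes: list = field(default_factory=list)       # category i <-> classes[i]


def set_fc_layer(model, second_audio_languages, seed=0):
    """Resize the fully connected layer to one category per distinct language."""
    # Distinct languages in first-seen order give the one-to-one category mapping.
    languages = list(dict.fromkeys(second_audio_languages))
    rng = random.Random(seed)
    model.fc_weights = [
        [rng.uniform(-0.01, 0.01) for _ in range(model.feature_dim)]
        for _ in languages
    ]
    model.fc_bias = [0.0] * len(languages)
    model.classes = languages
    return model
```

After this step the layers before the fully connected layer keep what was learned from the first audio, and only the resized layer starts from fresh values before the second training stage.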

Optionally, the second obtaining module 32 is configured to extract the language features of each second audio through a pre-trained feature extraction model, where the feature extraction model is trained on the AudioSet dataset.

The specific manner in which the modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the method embodiments and is not repeated here.

Fig. 4 is a block diagram of an audio classification device provided in accordance with an embodiment of the present disclosure. As shown in fig. 4, the apparatus 40 may include:

the segmentation module 41 is configured to segment the audio to be processed to obtain a plurality of audio segments to be processed;

The classification module 42 is configured to input each of the audio segments to be processed into a target audio classification model to obtain an output result of the target audio classification model, where the target audio classification model is trained according to the audio classification model training method according to any embodiment of the disclosure, and the output result is used to indicate a probability that the audio segment to be processed input into the target audio classification model corresponds to each of the languages to which the second audio belongs;

the determining module 43 is configured to determine, for each of the audio segments to be processed, a language to which the audio segment to be processed belongs according to a probability that the audio segment to be processed corresponds to each of the languages to which the second audio belongs.

Optionally, the determining module 43 includes:

the first determining submodule is configured to determine, for each audio segment to be processed and according to the maximum probability corresponding to that audio segment, the language corresponding to the maximum probability as the language to which the audio segment to be processed belongs;

and the second determining submodule is used for determining the language to which the audio to be processed belongs according to the language to which each audio fragment to be processed belongs.

Optionally, the second determining sub-module includes:

the statistics submodule is configured to count the languages to which the audio segments to be processed belong, so as to determine the language that occurs most often;

and the third determining submodule is configured to determine the language that occurs most often as the language to which the audio to be processed belongs.
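The way the modules of apparatus 40 fit together can be sketched as a small class that composes a segmentation function, a model function, and the two decision steps. All names here (`AudioClassificationDevice`, `split_fn`, `model_fn`) are hypothetical, and the model is passed in as a plain callable standing in for the trained target audio classification model.

```python
from collections import Counter


class AudioClassificationDevice:
    """Hypothetical composition of modules 41, 42 and 43."""

    def __init__(self, split_fn, model_fn, languages):
        self.split_fn = split_fn    # segmentation module 41: audio -> list of segments
        self.model_fn = model_fn    # classification module 42: segment -> K probabilities
        self.languages = languages  # K language names, in the model's output order

    def classify(self, audio):
        segments = self.split_fn(audio)
        if not segments:
            raise ValueError("segmentation produced no segments")
        per_segment = []
        for segment in segments:
            probs = self.model_fn(segment)
            # First determining submodule: language of the maximum probability.
            best = max(range(len(probs)), key=probs.__getitem__)
            per_segment.append(self.languages[best])
        # Statistics and third determining submodules: most frequent language.
        return Counter(per_segment).most_common(1)[0][0]


# Stub functions for illustration only; a real model_fn would run the trained model.
device = AudioClassificationDevice(
    split_fn=lambda audio: [audio[i:i + 2] for i in range(0, len(audio), 2)],
    model_fn=lambda seg: [0.9, 0.1] if sum(seg) > 0 else [0.2, 0.8],
    languages=["A", "B"],
)
decision = device.classify([1, 1, 1, 1, 0, 0])
```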

Referring now to fig. 5, a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

As shown in fig. 5, the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

In general, the following devices may be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 607 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; a storage device 608 including, for example, a magnetic tape, a hard disk, and the like; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 shows an electronic device 600 having various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above-described functions defined in the methods of the embodiments of the present disclosure are performed.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some implementations, the electronic device may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire an initial audio classification model, wherein the initial audio classification model is trained on a plurality of first audios belonging to common languages; acquire a plurality of second audios belonging to uncommon languages, and determine the language features and the language of each second audio; set the fully connected layer in the initial audio classification model according to the total number of languages to which the second audios belong, so as to obtain an intermediate audio classification model; and train the intermediate audio classification model by taking the language features of the second audio as model input data and the language to which the second audio belongs as model output data, so as to obtain a target audio classification model.

Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: segment audio to be processed to obtain a plurality of audio segments to be processed; input each audio segment to be processed into a target audio classification model to obtain an output result of the target audio classification model, wherein the target audio classification model is trained by the audio classification model training method according to any embodiment of the disclosure, and the output result indicates the probability that the audio segment to be processed input into the target audio classification model corresponds to each of the languages to which the second audio belongs; and, for each audio segment to be processed, determine the language to which the audio to be processed belongs according to the probability that the audio segment to be processed corresponds to each of the languages to which the second audio belongs.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases the name of a module does not constitute a limitation on the module itself; for example, the first obtaining module may also be described as "a module that obtains an initial audio classification model".

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, there is provided an audio classification model training method, the method comprising:

According to one or more embodiments of the present disclosure, there is provided an audio classification model training method, wherein setting the fully connected layer in the initial audio classification model according to the total number of languages to which the second audio belongs, so as to obtain an intermediate audio classification model, includes:

According to one or more embodiments of the present disclosure, there is provided an audio classification model training method, wherein the determining language features of each of the second audio includes:

extracting the language features of each second audio through a pre-trained feature extraction model, wherein the feature extraction model is trained on the AudioSet dataset.

According to one or more embodiments of the present disclosure, there is provided an audio classification method, the method comprising:

Respectively inputting each audio fragment to be processed into a target audio classification model to obtain an output result of the target audio classification model, wherein the target audio classification model is obtained by training according to the audio classification model training method according to any embodiment of the disclosure, and the output result is used for indicating the probability that the audio fragment to be processed input into the target audio classification model corresponds to each language in the languages to which the second audio belongs;

According to one or more embodiments of the present disclosure, there is provided an audio classification method, wherein for each of the audio pieces to be processed, determining, according to a probability that the audio piece to be processed corresponds to each of the languages to which the second audio belongs, the language to which the audio to be processed belongs includes:

and determining the language to which the audio to be processed belongs according to the language to which each audio fragment to be processed belongs.

According to one or more embodiments of the present disclosure, there is provided an audio classification method, wherein the determining, according to a language to which each of the audio clips to be processed belongs, the language to which the audio clip to be processed belongs includes:

According to one or more embodiments of the present disclosure, there is provided an audio classification model training apparatus, the apparatus comprising:

According to one or more embodiments of the present disclosure, there is provided an audio classification model training apparatus, wherein the setting module is configured to set the categories included in the fully connected layer in the initial audio classification model, so that the number of categories included in the fully connected layer is the same as the total number of languages to which the second audio belongs, and the categories included in the fully connected layer are in one-to-one correspondence with the languages to which the second audio belongs.

According to one or more embodiments of the present disclosure, there is provided an audio classification model training apparatus, wherein the second obtaining module is configured to extract the language features of each second audio through a pre-trained feature extraction model, where the feature extraction model is trained on the AudioSet dataset.

According to one or more embodiments of the present disclosure, there is provided an audio classification apparatus, the apparatus comprising:

the classification module is configured to input each audio segment to be processed into a target audio classification model to obtain an output result of the target audio classification model, wherein the target audio classification model is trained by the audio classification model training method according to any embodiment of the disclosure, and the output result indicates the probability that the audio segment to be processed input into the target audio classification model corresponds to each of the languages to which the second audio belongs;

According to one or more embodiments of the present disclosure, there is provided an audio classification apparatus, wherein the determining module includes:

According to one or more embodiments of the present disclosure, there is provided an audio classification apparatus, wherein the second determination submodule includes:

According to one or more embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the audio classification model training method provided by any embodiment of the present disclosure, or which, when executed by a processing device, implements the steps of the audio classification method provided by any embodiment of the present disclosure.

According to one or more embodiments of the present disclosure, there is provided an electronic device including:

a storage device having a computer program stored thereon;

and a processing device configured to execute the computer program in the storage device, so as to implement the steps of the audio classification model training method provided by any embodiment of the disclosure or the steps of the audio classification method provided by any embodiment of the disclosure.

The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by substituting the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.

Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims. The specific manner in which the modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the method embodiments and is not repeated here.

Claims

1. A method of training an audio classification model, the method comprising:

setting the categories contained in a fully connected layer in the initial audio classification model, so that the number of categories contained in the fully connected layer is the same as the sum of the number of languages to which the first audio belongs and the number of languages to which the second audio belongs, and the categories contained in the fully connected layer are in one-to-one correspondence with the languages to which the first audio and the second audio belong, so as to obtain an intermediate audio classification model;

and training the intermediate audio classification model by taking the language features of the first audio and the second audio as model input data and taking the languages to which the first audio and the second audio belong as model output data, so as to obtain a target audio classification model.

2. The method of claim 1, wherein said determining the language features of each of the second audios comprises:

3. A method of audio classification, the method comprising:

Respectively inputting each audio fragment to be processed into a target audio classification model to obtain an output result of the target audio classification model, wherein the target audio classification model is trained according to the audio classification model training method according to claim 1 or 2, and the output result is used for indicating the probability that the audio fragment to be processed input into the target audio classification model corresponds to each language in the languages to which the second audio belongs;

4. The method of claim 3, wherein for each of the audio segments to be processed, determining the language to which the audio segment to be processed belongs based on the probability that the audio segment to be processed corresponds to each of the languages to which the second audio belongs, comprises:

5. The method of claim 4, wherein the determining the language to which the audio to be processed belongs according to the language to which each of the audio segments to be processed belongs comprises:

6. An audio classification model training apparatus, the apparatus comprising:

a setting module configured to set the categories contained in a fully connected layer in the initial audio classification model, so that the number of categories contained in the fully connected layer equals the sum of the number of languages to which the first audio belongs and the number of languages to which the second audio belongs, and the categories contained in the fully connected layer are in one-to-one correspondence with the languages to which the first audio and the second audio belong, so as to obtain an intermediate audio classification model;

a model training module configured to train the intermediate audio classification model by taking the language features of the first audio and the second audio as model input data and taking the languages of the first audio and the second audio as model output data, so as to obtain a target audio classification model.

7. An audio classification device, the device comprising:

a classification module configured to input each audio segment to be processed into a target audio classification model to obtain an output result of the target audio classification model, wherein the target audio classification model is trained by the audio classification model training method according to claim 1 or 2, and the output result indicates the probability that the audio segment to be processed that was input into the target audio classification model corresponds to each of the languages to which the second audio belongs;

8. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the program when executed by a processing device implements the steps of the method according to claim 1 or 2, or the program when executed by a processing device implements the steps of the method according to any one of claims 3-5.

9. An electronic device, comprising:

a storage device having a computer program stored thereon;

a processing device for executing the computer program in the storage device to carry out the steps of the method of claim 1 or 2, or to carry out the steps of the method of any one of claims 3-5.