CN112397075A - Human voice audio recognition model training method, audio classification method and system - Google Patents

Human voice audio recognition model training method, audio classification method and system

Info

Publication number
CN112397075A
Authority
CN
China
Prior art keywords
audio
sub
original audio
classification
belonging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011436155.1A
Other languages
Chinese (zh)
Inventor
贾杨
夏龙
吴凡
张金阳
张兆元
郭常圳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ape Power Future Technology Co Ltd
Original Assignee
Beijing Ape Power Future Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ape Power Future Technology Co Ltd filed Critical Beijing Ape Power Future Technology Co Ltd
Priority to CN202011436155.1A priority Critical patent/CN112397075A/en
Publication of CN112397075A publication Critical patent/CN112397075A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches

Abstract

The application provides a human voice audio recognition model training method, an audio classification method and an audio classification system. The human voice audio recognition model training method comprises the following steps: obtaining a time-frequency two-dimensional graph of the sub-audio in the training audio as the input of a neural network, and obtaining a probability value that the sub-audio belongs to a specific classification; and optimizing the parameters of the neural network by using the probability value that the sub-audio belongs to the specific classification and the preset classification of the sub-audio, so that the probability value obtained by the neural network converges to the preset sub-audio classification. The method provided by the invention can be used to identify the classification of human voice audio.

Description

Human voice audio recognition model training method, audio classification method and system
Technical Field
The present application relates to the field of data processing technologies, and in particular to a human voice audio recognition model training method, an audio classification method, and an audio classification system.
Background
In speech-related technical practice and product design, for example in a voice interaction scenario, it is often necessary to acquire a user's voice input in order to perform a related service, for example to determine whether the user's voice input is a correct answer, or which instruction it represents.
In many business scenarios, the person providing the voice input is required to be a user in a specific age range. For example, online education requires children or students of the corresponding age to give voice answers or complete related interactive tasks by voice according to the requirements of online teaching; or a published recording task requires voice recordings from users of the corresponding age group.
In practice, however, in the above scenarios that place a requirement on the age of the user providing the voice input, the obtained human voice audio may for various reasons include voice data of other ages. For example, in online education, adults may give voice answers on behalf of children, or adults may complete audio recording tasks instead of children.
In the prior art, the main means of solving this problem is based on user portraits: whether the user currently performing the voice interaction is a child is judged, for example, through face recognition. This generally requires obtaining several pictures containing the user's complete portrait and classifying them with a trained image-based deep learning model so as to distinguish the user's age group. In some application scenarios, however, capturing the user's portrait is difficult.
Disclosure of Invention
The application provides a human voice audio recognition model training method, an audio classification method and an audio classification system.
A human voice audio recognition model training method comprises the following steps: obtaining a time-frequency two-dimensional graph of the sub-audio in the training audio as the input of a neural network, and obtaining a probability value that the sub-audio belongs to a specific classification; and optimizing the parameters of the neural network by using the probability value that the sub-audio belongs to the specific classification and the preset classification of the sub-audio, so that the probability value of the sub-audio belonging to the specific classification obtained by the neural network converges to the preset sub-audio classification.
The method further comprises the following steps: acquiring the human voice audio in a plurality of original audios with preset classifications; segmenting the human voice audio to obtain sub-audio of the corresponding preset classification; and splicing sub-audio of different preset classifications to obtain a training audio.
In the above method, the segmenting the human voice audio to obtain sub-audio of corresponding preset classification includes: and segmenting the human voice audio by adopting a segmentation window with the duration of 500 milliseconds and the step length of 250 milliseconds to obtain sub audio with the duration of 500 milliseconds.
The obtaining of the time-frequency two-dimensional graph of the sub-audio in the training audio comprises: obtaining training audio, and segmenting the training audio according to a preset method to obtain sub-audio in the training audio; and calculating a Mel frequency cepstrum coefficient to obtain a time-frequency two-dimensional graph of the sub-audio.
The method further comprises the following steps: carrying out variable speed processing on the obtained human voice audio; and segmenting the human voice audio subjected to the variable speed processing to obtain sub-audio of the corresponding preset classification.
An audio classification method using a model obtained by the above method comprises: obtaining original audio of a user; segmenting the original audio to obtain sub-audio; obtaining a time-frequency two-dimensional graph of the sub-audio as the input of a neural network, and obtaining a probability value that the sub-audio belongs to a specific classification; and obtaining a weight value of the original audio belonging to a specific classification according to the probability value.
In the above method, the obtaining of the weight value of the original audio belonging to a specific classification includes: calculating the average of the probabilities of all sub-audios of the original audio belonging to the specific classification as the normalized probability of the original audio; searching the stored normalized probabilities of a preset number of original audios of the user preceding the current original audio; if the number of normalized probabilities greater than an upper judgment limit reaches a first preset number, the weight value of the current original audio belonging to the specific classification is 1; if the number of normalized probabilities smaller than a lower judgment limit reaches a second preset number, the weight value of the current original audio belonging to the specific classification is 0; otherwise, the weight value of the current original audio belonging to the specific classification is obtained according to a preset algorithm.
In the above method, the obtaining of the weight value of the original audio belonging to a specific classification includes:
counting a first number of sub-audios in the current original audio whose probability values are greater than a specific classification probability threshold; and obtaining the weight value of the current original audio belonging to the specific classification from the ratio of the first number to the number of all sub-audios in the current original audio.
The method further comprises the following steps: storing the original audio of the user and playing the original audio according to the instruction; and storing the weight value of the original audio belonging to the specific classification corresponding to the original audio.
The method further comprises the following steps: selecting original audio meeting specific requirements according to the weight value of the original audio belonging to a specific classification; or, calculating a user equity value according to the weight value of the original audio belonging to the specific classification.
A human voice audio classification system comprising:
the server receives the original audio, sends the original audio to the storage unit for storage, and stores the storage address of the original audio in the storage unit in the database;
searching the database, segmenting the original audio in the storage unit according to the storage address to obtain sub-audio, and sending the sub-audio to a neural network;
receiving the probability value, returned by the neural network, that the sub-audio belongs to the specific classification, and obtaining a weight value of the original audio belonging to the specific classification according to the probability value;
the database stores the storage address of the original audio in the storage unit and the weight value of the original audio belonging to the specific classification;
the storage unit is used for storing the original audio acquired by the server;
and the neural network is used for obtaining the probability value that the sub-audio belongs to the specific classification.
According to the human voice audio recognition model training method, the training audio is segmented to obtain a plurality of sub-audios, and the probability value output by the neural network for each sub-audio is compared with its preset classification to optimize the neural network parameters, so that the output probability values of the neural network continuously converge. A neural network trained on sub-audio in this way achieves a better output effect.
Further, the training audio is composed of the human voice parts of original audios of different classifications, so that the training audio contains little noise other than human voice, or few non-human-voice sub-audios, which reduces interference to the neural network.
After the original audios of different classifications are segmented, the resulting sub-audios are randomly spliced, and the training audio so obtained trains the neural network to a better effect. For example, sub-audios of young children and sub-audios of adults are randomly spliced into a training audio, which gives the neural network a stronger ability to distinguish young-child audio from adult audio.
After the human voice audio in the original audio is obtained, variable speed processing is first applied and the result is then segmented into sub-audio, so that the sub-audio used for training covers multiple speech speeds and the trained neural network can adapt to more voice qualities and scenarios.
According to the audio classification method, the original audio is segmented and the category of each sub-audio is identified, so that a more accurate classification is obtained. The method is applicable when there are multiple sound sources in the original audio; because identification is performed per sub-audio classification, the duration of a certain type of audio in the original audio can be counted, and it can be judged whether the original audio meets business requirements, for example that the proportion of adult voice duration or young-child voice duration in the original audio is not lower than a certain requirement, or other similar parameters.
The application provides two methods for calculating the audio classification weight value, which can accurately calculate the classification weight value according to how much audio in the original audio meets the classification requirement; the weight value can then be used to evaluate the quality of the original audio or the equity of the person who recorded it.
The application also provides a human voice audio classification system by which the above methods can be realized, with corresponding effect.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application, as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
FIG. 1 is a schematic diagram of an embodiment of the audio classification system of the present application;
FIG. 2 is a schematic diagram of a neural network training process of the present application;
FIG. 3 is a schematic diagram of VAD module operation;
FIG. 4 is a schematic diagram of a data enhancement module data enhancement process flow;
FIG. 5 is a schematic diagram of splicing to generate training audio.
Detailed Description
Preferred embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
The following description takes a product scenario of speech-related technology as an example, in order to explain the implementation of the invention more clearly and intuitively; the invention is not limited to application in the following scenarios.
In speech products, recording tasks are often issued to obtain usable training data, for example when the system requires audio of young children in order to train a neural network capable of recognizing young children's voices. The pronunciation of adults naturally differs from that of children, so a deep learning model trained on adult corpora performs poorly when recognizing children's audio.
In order to obtain more voice audio of young children or of people of a specific age, a recording task is openly solicited on the internet, and participants are assumed by default to record according to the published recording requirements. However, such recorded audio is often contaminated with large amounts of "dirty data", i.e. audio that does not meet the recording requirements; for example, recording tasks designed for young children include large amounts of adult recordings.
In order to solve the above problems, the present invention obtains sub-audio by segmenting the original audio; obtains a time-frequency two-dimensional graph of the sound signal by calculating MFCCs; classifies the audio through a deep learning algorithm, for example judging the age of the recording user from the audio; and, based on the probability normalization mechanism provided by the invention, obtains the classification probability value of the original audio from the classification probabilities of the sub-audios given by the deep learning algorithm. A reward payment mechanism is thereby improved, encouraging the target user group to participate in recording and discouraging non-target groups from participating.
The invention also combines Python, Java, the Thrift framework, HTML5 and JavaScript to realize a deep-learning-based human voice audio classification system, and uses the above method to identify the age bracket of the recording user. See the schematic system structure shown in fig. 1, which comprises: a client, a server and a neural network.
The client: the user registers in the client module to record. The user registers and obtains an exclusive id in the client module, selects a recording task after logging in, reads the recording requirements and then enters the recording interface.
The server stores the recorded audio to OSS and stores the corresponding OSS address and the user information in a MySQL database, where OSS is an Object Storage Service and MySQL is a relational database management system (RDBMS) that uses the Structured Query Language (SQL) for database management.
The server queries the MySQL database for newly added audio, for example periodically, and sends unreviewed audio to the neural network module for identification. After the neural network identification result is obtained, i.e. the probability values, returned by the neural network, of the sub-audios belonging to the specific classification, the server performs probability normalization to obtain the weight value of the original audio belonging to the specific classification, judges the audio review result in combination with historical information, and updates the record in MySQL. A reward payment policy is then applied and the recording reward is given to the user. Age classification in the present invention means classifying the recorded people into categories such as young children, adults and teenagers according to the audio; the description below uses a classification task of young children versus adults.
The database stores the storage address of the original audio in the storage unit and the weight value of the original audio belonging to the specific classification;
the storage unit is used for storing the original audio acquired by the server; and reading the original audio to play according to the instruction of the client.
And the neural network is used for obtaining the probability value that the sub-audio belongs to the specific classification.
The training of the neural network is described in detail below.
Referring to fig. 2, in the training phase the training audio passes through a voice activity detection (VAD) module, a data enhancement module, a sub-audio segmentation module, feature extraction, the network forward operation and loss calculation, and the parameters are continuously optimized to make the model converge. When the trained neural network identifies human voice audio, on the other hand, the original audio passes through the voice activity detection module, the sub-audio segmentation module, feature extraction and the network forward operation; compared with the training process of the neural network, the recognition process omits the data enhancement and the loss-calculation parameter-optimization parts.
The training process of the neural network is as follows.
The audio data for neural network training is obtained from: public Chinese and English speech recognition data sets of adults and young children; or audio recorded in independently issued recording tasks for young children and adults, manually reviewed and labeled as young-child or adult voice.
Voice activity detection (VAD) module.
The original audio data contains non-human-voice components, whether recorded by adults or by young children. For example, since young children are at the language-learning stage, their recordings contain considerable non-voice components, mostly caused by pauses in the children's speech.
If the neural network judged audio age directly from such original audio, the data distribution of the non-human-voice parts would directly influence the recognition result. For example, a model trained in this way might classify a non-human-voice audio clip as a young child, which degrades the neural network's performance in recognizing audio age.
Referring to fig. 3, the present invention uses a voice activity detection (VAD) module to identify the human voice parts in the original audio A, labeled as voice parts in fig. 3, and cuts and splices the original audio according to the VAD result to derive the pure human voice audio in the original audio.
The original audio used for neural network training is processed with the voice activity detection module to obtain, correspondingly, pure human voice audio labeled as young child and pure human voice audio labeled as adult.
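By way of illustration only, the following Python sketch approximates such a voice activity step with a simple energy threshold; the patent does not prescribe a particular VAD algorithm, and the function name, frame length and threshold here are assumptions.

```python
# Minimal sketch only: an energy-based stand-in for the VAD module described
# above; the frame length and threshold are illustrative assumptions.
import numpy as np

def extract_voiced_audio(samples, sr, frame_ms=30, energy_ratio=0.1):
    """Keep frames whose RMS energy exceeds a fraction of the clip's overall RMS,
    then splice the kept frames together to approximate 'pure human voice' audio."""
    frame_len = int(sr * frame_ms / 1000)
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    overall_rms = np.sqrt(np.mean(samples ** 2)) + 1e-12
    voiced = [f for f in frames
              if np.sqrt(np.mean(f ** 2)) > energy_ratio * overall_rms]
    # Cut-and-splice, as in FIG. 3: concatenate only the voiced frames.
    return np.concatenate(voiced) if voiced else np.zeros(0, dtype=samples.dtype)
```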
And a data enhancement module.
A data enhancement module is adopted to increase the richness of the data and thereby improve the performance of the deep learning model, as shown in fig. 4.
First, the pure human voice audio data obtained by the voice activity detection module is input. The module applies variable-speed, pitch-preserving enhancement to the pure human voice data, changing the speaking speed of the recorder while keeping the speaker's pitch.
Second, because various noises exist in real environments while the recorded audio usually comes from a quieter environment, in order to increase the model's generalization to real audio, the data enhancement module uses independently recorded and open-source noise data sets as candidates and superimposes noise at a random signal-to-noise ratio on the human voice data.
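As a rough illustration of these two enhancement steps, the sketch below time-stretches the voice without changing pitch (librosa is assumed here) and superimposes a noise clip scaled to a randomly chosen signal-to-noise ratio; the rate choices, SNR range and the noise_bank argument are illustrative assumptions, not values from the patent.

```python
# Illustrative sketch of the two enhancement steps: pitch-preserving speed change
# and noise superposition at a random SNR. Parameter values are assumptions.
import numpy as np
import librosa

def augment_voice(voice, rate_choices=(0.9, 1.0, 1.1),
                  snr_db_range=(5.0, 30.0), noise_bank=None):
    # 1) Variable speed without changing pitch (time stretch keeps the pitch).
    rate = float(np.random.choice(rate_choices))
    voice = librosa.effects.time_stretch(voice, rate=rate)

    # 2) Superimpose a noise clip at a random signal-to-noise ratio.
    if noise_bank:
        noise = noise_bank[np.random.randint(len(noise_bank))]
        noise = np.resize(noise, voice.shape)              # tile/trim to length
        snr_db = np.random.uniform(*snr_db_range)
        p_voice = np.mean(voice ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_voice / (p_noise * 10.0 ** (snr_db / 10.0)))
        voice = voice + scale * noise
    return voice
```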
Finally, as shown in fig. 4, to resemble real-life scenes in which an adult and a young child read together, the young-child audio and the adult audio processed by the two preceding steps are respectively cut and randomly spliced. The minimum granularity of the cutting is the window size of the sub-audio segmentation module, and the resulting enhanced audio is used as training audio for neural network training.
Specifically, refer to fig. 5.
In order to handle mixed audio scenes of adults and young children, the time nodes of the different types of audio are estimated. The audio that has undergone variable speed processing and noise superposition is segmented to obtain a number of sub-audios, each of which inherits the classification label of the corresponding part of the original audio. As shown, a number of sub-audios classified as young child are obtained from the original audio classified as young child, and similarly a number of sub-audios classified as adult are obtained from the original audio classified as adult. Pure human voice audio of different lengths is thus divided into sub-audio of fixed length and randomly spliced; as shown in fig. 5, sub-audio from young-child audio is spliced with sub-audio from adult audio to obtain a training audio.
Part of the audio segments of the training audio come from young-child audio, and the other part come from adult audio. Fig. 5 is only one example: the young-child audio segments in the training audio do not necessarily come from the same young-child audio, and may be sub-audios obtained from multiple audios classified as young child; similarly, the adult audio segments in the training audio may be randomly selected from multiple audios labeled as adult.
As to the splicing manner of the sub-audios, as shown in fig. 5, the first young-child audio segment is a single sub-audio of a young-child audio, another young-child segment in the training audio consists of 2 or 3 sub-audios of young-child audios, and an audio segment consisting of several sub-audios of adult audios lies between the two young-child segments. The invention does not limit the specific splicing manner of the training audio.
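The splicing described above and in FIG. 5 could look roughly like the following sketch, in which fixed-length segments cut from labeled young-child and adult voice clips are shuffled and concatenated into one training audio while each segment keeps the label of its source; the segment length and count are illustrative assumptions.

```python
# Sketch of random splicing: segments cut from labeled clips are shuffled and
# concatenated; each spliced segment inherits the label of the clip it came from.
import numpy as np

def splice_training_audio(child_clips, adult_clips, sr, seg_ms=500, n_segments=20):
    seg_len = int(sr * seg_ms / 1000)
    pool = []                         # (segment, label): 1 = young child, 0 = adult
    for clips, label in ((child_clips, 1), (adult_clips, 0)):
        for clip in clips:
            for start in range(0, len(clip) - seg_len + 1, seg_len):
                pool.append((clip[start:start + seg_len], label))
    np.random.shuffle(pool)           # random order of child/adult segments
    chosen = pool[:n_segments]
    training_audio = np.concatenate([seg for seg, _ in chosen])
    segment_labels = [label for _, label in chosen]
    return training_audio, segment_labels
```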
Obviously, the above is one implementation of the data enhancement module; in other implementations the data enhancement module may also directly perform random splicing of the human voice audio without variable-speed processing and/or without superimposing random noise. The order of the steps does not limit the invention.
The sub-audio segmentation module segments the training audio output by the data enhancement module, so that each sub-audio in the training audio is used to train the neural network. This is what later allows the neural network to estimate the time points of different types of audio in an original audio.
In the embodiment of the invention, the training audio is segmented with a 500 ms segmentation window and a 250 ms segmentation step to obtain the sub-audio; the invention does not, however, limit the segmentation window duration or the segmentation step duration. On one hand, the segmentation window used by the sub-audio segmentation module for the training audio may have the same duration as the segmentation window used to cut the young-child audio and the adult audio that are spliced into the training audio as shown in fig. 5. On the other hand, the two durations may also differ; in that case, as a preferred implementation, the segmentation window duration used for the young-child audio and the adult audio in fig. 5 should be an integral multiple of the segmentation window duration used by the sub-audio segmentation module for the training audio, although the invention does not exclude other duration values.
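A minimal sketch of this sliding-window segmentation, with the 500 ms window and 250 ms step of this embodiment as defaults, might be:

```python
# Sketch of the sub-audio segmentation module: a 500 ms window moved in 250 ms
# steps over the training audio (both durations are configurable).
def segment_audio(samples, sr, window_ms=500, step_ms=250):
    window = int(sr * window_ms / 1000)
    step = int(sr * step_ms / 1000)
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, step)]
```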
The feature extraction module computes the short-time Fourier transform of each sub-audio obtained by segmenting the training audio. For example, for each 500 ms sub-audio obtained above, the feature extraction module computes the short-time Fourier transform with a window length of 25 ms and a step length of 10 ms, and calculates the MFCC features.
Mel-frequency cepstral coefficients (MFCC) are spectral coefficients obtained by a linear transformation of the logarithmic energy spectrum on a nonlinear Mel scale of sound frequency; they conform to the human ear's perception of sound frequency and can represent the time-frequency characteristics of sound.
The MFCC features extracted from a sub-audio form a feature map that is treated as a two-dimensional image, from which the probability that the sub-audio belongs to a certain classification is obtained. For example, in this embodiment, the probability value that a sub-audio belongs to the young-child classification is obtained from the MFCC time-frequency features using a VGG network architecture.
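By way of illustration, MFCC extraction over a 500 ms sub-audio with a 25 ms analysis window and 10 ms step could be sketched as follows; librosa is assumed here, and the sample rate and number of coefficients are illustrative choices rather than values stated in the patent.

```python
# Sketch of the feature extraction module: MFCCs computed with a 25 ms window
# and 10 ms step, giving a time-frequency feature map treated as a 2-D image.
import librosa

def mfcc_image(sub_audio, sr=16000, n_mfcc=40):
    n_fft = int(0.025 * sr)        # 25 ms analysis window
    hop_length = int(0.010 * sr)   # 10 ms step
    return librosa.feature.mfcc(y=sub_audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
```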
VGG is a well-known convolutional network structure for image recognition proposed by the Visual Geometry Group of Oxford at ILSVRC 2014. The present invention may also use other network architectures to achieve the same goal.
The entropy calculation module calculates the cross-entropy loss between the predicted probability that a sub-audio belongs to a certain classification and the sub-audio's classification label, and returns it to the network forward operation module to optimize the network parameters until convergence.
The network forward operation module optimizes the network parameters using the cross entropy and uses the optimized parameters to determine the probability that a sub-audio of the training audio belongs to a certain classification; after the cross-entropy loss from the entropy calculation module is obtained, the network parameters are optimized again. These steps are repeated, so that by optimizing the network parameters the predicted classification probability of each sub-audio continuously converges to its actual classification.
The network forward operation module calculates the influence of the input-layer nodes on the hidden-layer nodes, i.e. it walks the network forward once from the input layer through the hidden layers to the output layer, calculating the influence of each node on the nodes of the next layer. In a particular implementation, a convolutional neural network (CNN) may be employed: the hidden layers mainly consist of a series of convolutional layers, pooling layers and fully connected layers. A convolutional layer projects the information in its receptive field onto an element of the next layer, enriching the information. A pooling layer achieves down-sampling using nonlinear pooling functions of various forms such as max pooling and mean pooling. The fully connected layers fuse the high-level feature information abstracted by the convolutional and pooling layers and finally realize the classification.
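The forward-operation and entropy-calculation modules together amount to an ordinary supervised training loop. The sketch below uses a small convolutional classifier in PyTorch purely as an illustration; it is not the VGG architecture of the embodiment, and the layer sizes and learning rate are assumptions.

```python
# Illustrative sketch (not the embodiment's VGG): a small CNN over the MFCC
# "image" of one sub-audio, optimized with cross-entropy loss until convergence.
import torch
import torch.nn as nn

class SubAudioClassifier(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x):                  # x: (batch, 1, n_mfcc, n_frames)
        return self.classifier(self.features(x).flatten(1))

model = SubAudioClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()          # cross entropy vs. the sub-audio label

def train_step(mfcc_batch, labels):
    """One forward pass, loss computation and parameter update."""
    optimizer.zero_grad()
    loss = criterion(model(mfcc_batch), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```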
The above illustrates an implementation of the human voice audio recognition model training method. The audio classification method using the trained model is described below by example.
An original audio is passed through the trained neural network and segmented, and the probability values of its sub-audios belonging to the specific classification are calculated. Specifically, the original audio is processed by the voice activity detection module, the sub-audio segmentation module, feature extraction and the network forward operation.
Referring to fig. 1, the neural network outputs the probability values of the sub-audios of the original audio belonging to a specific classification, specifically an array of classification probability values P = {p_1, p_2, ..., p_i, ..., p_n}, where p_i denotes the probability that the i-th sub-audio belongs to the specific classification, for example the probability that each sub-audio of an original audio belongs to a young child.
The server obtains the weight value of the original audio belonging to the specific classification from these probability values, for use in specific services. For example, in the scenario of a recording task issued on the network, suppose the issued task solicits audio recorded by young children and rewards user equity, for example a monetary reward, for audio recorded by young children.
The embodiment of the invention provides two ways to obtain the weight value of the audio classification of the original audio and determine the reward based on the weight value.
Suppose the reward for a child recording a piece of audio is c.
In the first implementation, the user's history information is referred to, on the assumption that the same user id does not switch recording participants within a short time.
1) A tolerance factor alpha is set.
2) The normalized probability of the current original audio is calculated from the array P as the mean of the sub-audio probabilities:
p_a = (1/n) · (p_1 + p_2 + ... + p_n)
where p_i represents the probability that the i-th sub-audio is classified as young child and n is the number of sub-audios.
3) The server queries the log records of the user id and looks up the identification results of the previous alpha original audios; for example, with alpha = 10, the identification results of the 10 original audios before the current original audio are retrieved.
4) If, among the identification results of the previous alpha original audios, there is an audio whose young-child classification probability is greater than the upper judgment threshold Tr_up, the current original audio is classified as young child, i.e. the weight value of the young-child classification is 1. Correspondingly, the product of the standard reward c and the weight value 1 gives a user equity reward of c for this original audio.
If none of the identification results of the previous alpha original audios has a young-child classification probability greater than the upper judgment threshold Tr_up, it is then judged whether the normalized probability p_a of the current original audio is lower than the lower judgment threshold Tr_down. If it is lower, the weight value of the current original audio classified as young child is 0, and correspondingly the product of the standard reward c and the weight value 0 gives a user equity reward of 0 for this original audio. If the normalized probability p_a is not lower than the lower judgment threshold Tr_down, the weight value of the current original audio classified as young child is obtained from p_a according to a preset algorithm [formula image not reproduced], and the user equity reward for this original audio is correspondingly the product of the standard reward c and this weight value.
This method assumes that the recording person will not switch within a short time; for example, when a young child has recently been recording, the recordings will not immediately switch to an adult. Therefore, with reference to the original audio previously recorded by the user, if a certain index requirement is met, such as exceeding the upper limit of the young-child classification probability, the current original audio is also judged to be audio recorded by a young child.
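Put as a sketch, this first scheme might look as follows; the threshold values, the history list and, in particular, the fallback formula for the intermediate case are illustrative assumptions, since the text above leaves that case to a preset algorithm.

```python
# Sketch of the first weighting scheme. `sub_probs` are the young-child
# probabilities of the current audio's sub-audios; `history` holds the stored
# normalized probabilities of the user's previous audios. Thresholds mirror
# Tr_up / Tr_down above; the interpolation fallback is an assumption.
def weight_method_one(sub_probs, history, tr_up=0.8, tr_down=0.3, alpha=10):
    p_a = sum(sub_probs) / len(sub_probs)       # normalized probability of this audio
    recent = history[-alpha:]                   # previous alpha recognition results
    if any(p > tr_up for p in recent):          # user recently judged as young child
        return 1.0
    if p_a < tr_down:
        return 0.0
    # Intermediate case ("preset algorithm"): a simple linear interpolation,
    # used here purely for illustration.
    return min(1.0, (p_a - tr_down) / (tr_up - tr_down))

# The user equity reward is then the standard reward c times this weight:
# reward = c * weight_method_one(sub_probs, history)
```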
Another implementation of the present invention to obtain the weight values is described below.
In the second implementation, the sub-audio recognition result array is referred to, so that young-child data is used as much as possible.
1) A judgment threshold Tr for young-child audio is set.
2) The sub-audio recognition array P is discriminated and classified in combination with Tr:
if p_i is greater than Tr, the i-th sub-audio is considered young-child audio and p_i is assigned the value 1;
otherwise, the i-th sub-audio is considered non-young-child audio and p_i is assigned the value 0.
3) The weight value for classifying the original audio as young child is the ratio of the number of sub-audios judged to be young-child audio to the total number n of sub-audios:
w = (p_1 + p_2 + ... + p_n) / n (with the p_i binarized as above)
Correspondingly, the user equity reward for the current original audio is:
reward = c · w
In the above ways, the weight value of the original audio belonging to the specific classification is obtained; the invention does not limit other methods of obtaining the weight value.
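A corresponding sketch of the second scheme (the threshold value is an illustrative assumption):

```python
# Sketch of the second weighting scheme: binarize each sub-audio probability
# against the judgment threshold Tr, take the ratio of young-child sub-audios
# to all sub-audios as the weight, and scale the standard reward c by it.
def weight_method_two(sub_probs, tr=0.5):
    flags = [1 if p > tr else 0 for p in sub_probs]   # 1 = judged young-child audio
    return sum(flags) / len(flags)

def equity_reward(sub_probs, c, tr=0.5):
    return c * weight_method_two(sub_probs, tr)
```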
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A method for training a human voice audio recognition model is characterized by comprising the following steps:
obtaining a time-frequency two-dimensional graph of the sub-audio in the training audio as the input of a neural network, and obtaining a probability value that the sub-audio belongs to a specific classification;
and optimizing the parameters of the neural network by using the probability value of the sub-audio belonging to the specific classification and the classification of the preset sub-audio, so that the probability value of the sub-audio belonging to the specific classification obtained by the neural network converges to the preset sub-audio classification.
2. The method of claim 1, further comprising:
acquiring human voice audio in a plurality of preset classified original audio;
segmenting the human voice audio to obtain sub-audio of the corresponding preset classification;
and splicing the sub-audios of different preset classifications to obtain a training audio.
3. The method of claim 2, wherein the segmenting the human voice audio into sub-audio of corresponding preset classifications comprises:
and segmenting the human voice audio by adopting a segmentation window with the duration of 500 milliseconds and the step length of 250 milliseconds to obtain sub audio with the duration of 500 milliseconds.
4. The method of claim 1, wherein obtaining the time-frequency two-dimensional graph of the sub-audio in the training audio comprises:
obtaining training audio, and segmenting the training audio according to a preset method to obtain sub-audio in the training audio;
and calculating a Mel frequency cepstrum coefficient to obtain a time-frequency two-dimensional graph of the sub-audio.
5. The method of claim 2, further comprising:
carrying out variable speed processing on the obtained human voice audio;
and segmenting the human voice audio subjected to the variable speed processing to obtain corresponding preset classified sub-audio.
6. A method of audio classification using a model obtained by a method as claimed in any one of claims 1 to 5, comprising:
obtaining original audio of a user;
segmenting the original audio to obtain sub-audio;
obtaining a time-frequency two-dimensional graph of the sub-audio as the input of a neural network, and obtaining a probability value that the sub-audio belongs to a specific classification;
and obtaining a weighted value of the original audio belonging to a specific classification according to the probability value.
7. The method of claim 6, wherein the obtaining the weight value of the original audio belonging to a specific class comprises:
calculating the probability average value of all sub-audios of the original audio belonging to a specific classification as the normalized probability of the original audio;
searching the stored normalized probabilities of a preset number of original audios of the user preceding the current original audio; if the number of normalized probabilities greater than an upper judgment limit reaches a first preset number, the weight value of the current original audio belonging to the specific classification is 1;
if the number of normalized probabilities smaller than a lower judgment limit reaches a second preset number, the weight value of the current original audio belonging to the specific classification is 0;
otherwise, obtaining the weight value of the current original audio belonging to the specific classification according to a preset algorithm.
8. The method of claim 6, wherein the obtaining the weight value of the original audio belonging to a specific class comprises:
counting a first number of sub-audios in the current original audio whose probability values are greater than a specific classification probability threshold;
and obtaining the weight value of the current original audio, which belongs to the specific classification, by using the ratio of the first number to the number of all sub-audios in the current original audio.
9. The method according to one of claims 6 to 8, characterized in that the method further comprises:
storing the original audio of the user and playing the original audio according to the instruction;
and storing the weight value of the original audio belonging to the specific classification corresponding to the original audio.
10. The method of claim 6, further comprising:
selecting original audio meeting specific requirements according to the weight value of the original audio belonging to a specific classification;
or, calculating a user equity value according to the weight value of the original audio belonging to the specific classification.
11. A human voice audio classification system using the method of any one of claims 6 to 10, comprising:
the server receives the original audio, sends the original audio to the storage unit for storage, and stores the storage address of the original audio in the storage unit in the database;
searching the database, segmenting the original audio in the storage unit according to the storage address to obtain sub-audio, and sending the sub-audio to a neural network;
receiving the probability value, returned by the neural network, that the sub-audio belongs to the specific classification, and obtaining a weight value of the original audio belonging to the specific classification according to the probability value;
the database stores the storage address of the original audio in the storage unit and the weight value of the original audio belonging to the specific classification;
the storage unit is used for storing the original audio acquired by the server;
and the neural network is used for obtaining the probability value that the sub-audio belongs to the specific classification.
CN202011436155.1A 2020-12-10 2020-12-10 Human voice audio recognition model training method, audio classification method and system Pending CN112397075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011436155.1A CN112397075A (en) 2020-12-10 2020-12-10 Human voice audio recognition model training method, audio classification method and system


Publications (1)

Publication Number Publication Date
CN112397075A true CN112397075A (en) 2021-02-23

Family

ID=74625258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011436155.1A Pending CN112397075A (en) 2020-12-10 2020-12-10 Human voice audio recognition model training method, audio classification method and system

Country Status (1)

Country Link
CN (1) CN112397075A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2988105A2 (en) * 2014-08-20 2016-02-24 HST High Soft Tech GmbH Device and method for the automatic recognition and classification of audible acoustic signals in a monitored area
US20180197547A1 (en) * 2017-01-10 2018-07-12 Fujitsu Limited Identity verification method and apparatus based on voiceprint
US10380997B1 (en) * 2018-07-27 2019-08-13 Deepgram, Inc. Deep learning internal state index-based search and classification
CN111179915A (en) * 2019-12-30 2020-05-19 苏州思必驰信息科技有限公司 Age identification method and device based on voice
CN111354372A (en) * 2018-12-21 2020-06-30 中国科学院声学研究所 Audio scene classification method and system based on front-end and back-end joint training



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination