CN113763928A - Audio category prediction method and device, storage medium and electronic equipment - Google Patents

Audio category prediction method and device, storage medium and electronic equipment

Info

Publication number: CN113763928A
Application number: CN202110578096.XA
Authority: CN (China)
Legal status: Pending
Applicant/assignee: Tencent Technology (Shenzhen) Co., Ltd.
Inventors: 林炳怀 (Lin Binghuai), 王丽园 (Wang Liyuan)
Other languages: Chinese (zh)
Prior art keywords: audio, features, feature, compression, feature set

Classifications

    • G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L15/063 — Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L2015/025 — Phonemes, fenemes or fenones being the recognition units

Abstract

The application discloses an audio category prediction method and apparatus, a storage medium, and an electronic device, and relates to the field of artificial intelligence. The method comprises the following steps: acquiring acoustic feature information and phoneme alignment information of at least one audio, each audio being labeled with a corresponding audio category; performing deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio by using an audio analysis model to obtain the deep pronunciation features of each audio; performing compression mapping processing on the deep pronunciation features of the audios of the same audio category to obtain at least one compression feature set; performing Gaussian process construction based on the at least one compression feature set to obtain an audio category prediction function; and adjusting parameters in the audio analysis model based on the audio category prediction function, so that the trained audio analysis model predicts the audio category of the audio to be analyzed. The method and apparatus effectively reduce the computational complexity of audio category prediction and improve the prediction effect.

Description

Audio category prediction method and device, storage medium and electronic equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an audio category prediction method, an audio category prediction device, a storage medium and electronic equipment.
Background
Audio category prediction, i.e., analyzing the category to which an audio belongs (for example, scoring the audio recorded in a spoken-language examination), is difficult because of many factors, such as regional differences in pronunciation, the quality of the audio recording, and the recording environment.
At present, related-art schemes either collect a large number of audio samples for large-scale learning, which leads to high computational complexity, or select only a part of the large number of audio samples for learning, which leads to a poor audio category prediction effect because the selected samples are not reliable.
Therefore, existing audio category prediction work suffers from high computational complexity and a poor prediction effect.
Disclosure of Invention
The embodiment of the application provides an audio category prediction method and a related device, which can effectively reduce the calculation complexity in the audio category prediction work and improve the audio category prediction effect.
In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:
according to an embodiment of the present application, an audio class prediction method includes: acquiring acoustic feature information and phoneme alignment information of at least one audio, each audio being labeled with a corresponding audio category; performing deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio by using an audio analysis model to obtain the deep pronunciation features of each audio; performing compression mapping processing on the deep pronunciation features of the audios of the same audio category to obtain at least one compression feature set; performing Gaussian process construction based on the at least one compression feature set to obtain an audio category prediction function; and adjusting parameters in the audio analysis model based on the audio category prediction function, so as to obtain the audio category of the audio to be analyzed as predicted by the trained audio analysis model.
According to an embodiment of the present application, an audio class prediction apparatus includes: the acquisition module is used for acquiring acoustic feature information and phoneme alignment information of at least one audio, and each audio marks a corresponding audio category; the input module is used for carrying out deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio by adopting an audio analysis model to obtain the deep pronunciation feature of each audio; the compression module is used for carrying out compression mapping processing on the deep pronunciation characteristics of the audios of the same audio category to obtain at least one compression characteristic set; the Gaussian module is used for carrying out Gaussian process construction based on the at least one compression feature set to obtain an audio class prediction function; and the prediction module is used for adjusting parameters in the audio analysis model based on the audio category prediction function so as to obtain the audio category of the audio to be analyzed predicted by the trained audio analysis model.
In some embodiments of the present application, the acoustic feature information of the audio includes at least one frame of sub-acoustic features of the audio, and the phoneme alignment information of the audio includes a pronunciation start and stop time period of a word in the audio; the input module includes: the depth extraction unit is used for performing depth extraction processing on at least one frame of sub-acoustic features of each audio frequency to obtain at least one frame of depth features corresponding to each audio frequency; the first normalization unit is used for normalizing the depth features belonging to the same pronunciation starting and stopping time period in at least one frame of depth features corresponding to each audio to obtain the word features of words in each audio; and the second normalization unit is used for performing normalization processing on the word characteristics of the words in each audio to obtain the deep pronunciation characteristics of each audio.
In some embodiments of the present application, the depth extraction unit includes: the network input subunit is used for inputting at least one frame of sub-acoustic features of each audio frequency into a feature extraction neural network; and the network extraction subunit is used for performing depth extraction processing on at least one frame of input sub-acoustic features based on the feature extraction neural network to obtain at least one frame of depth features corresponding to each audio frequency.
In some embodiments of the present application, the compression module comprises: the clustering unit is used for clustering the deep pronunciation characteristics of each audio according to the audio category corresponding to each audio to obtain at least one characteristic cluster, each characteristic cluster comprises the deep pronunciation characteristics of at least one audio, and each characteristic cluster corresponds to one audio category; and the mapping unit is used for compressing and mapping each feature cluster into a corresponding compression feature set to obtain at least one compression feature set, wherein the feature dimension of each compression feature set is smaller than that of the corresponding feature cluster.
In some embodiments of the present application, the mapping unit includes: a mapping input subunit, used for inputting each feature cluster into a compression mapping neural network; and a network mapping subunit, used for compressing and mapping the input feature clusters into corresponding compression feature sets based on the compression mapping neural network.
In some embodiments of the present application, the gaussian module comprises: a sample construction unit, used for constructing a training feature set and a test feature set based on the compressed features in the at least one compression feature set; a matrix generation unit, used for performing covariance operation processing on the compressed features in the training feature set and the test feature set based on a covariance function to generate a target covariance matrix; a mean value generation unit, used for performing mean value operation processing on the compressed features in the training feature set and the test feature set based on a mean value function to generate a target mean value vector; and a function construction unit, configured to construct the audio class prediction function based on the target covariance matrix and the target mean vector.
In some embodiments of the present application, the matrix generating unit includes: the first covariance generation subunit is used for carrying out covariance operation processing on the compressed features in the training feature set based on the covariance function to obtain a first covariance matrix; the second covariance generation subunit is used for carrying out covariance operation processing on the compressed features in the test feature set based on the covariance function to obtain a second covariance matrix; the third generating subunit is configured to perform covariance operation processing on the compressed features in the training feature set and the test feature set based on the covariance function to obtain a third covariance matrix; a matrix determination subunit, configured to use the first covariance matrix, the second covariance matrix, and the third covariance matrix as the target covariance matrix.
In some embodiments of the present application, the mean generating unit includes: a first mean value generating subunit, configured to perform mean value operation processing on the compressed features in the training feature set based on the mean value function to obtain a first mean value vector; a second mean value generating subunit, configured to perform mean value operation processing on the compressed features in the test feature set based on the mean value function to obtain a second mean value vector; and a mean determination subunit, configured to use the first mean vector and the second mean vector as the target mean vector.
In some embodiments of the present application, the function construction unit includes: a distribution function obtaining subunit, configured to obtain an audio class distribution function that satisfies gaussian distribution and is generated based on the mean function and the covariance function; a training class obtaining subunit, configured to obtain an audio class corresponding to the compression feature in the training feature set; and the prediction function construction subunit is used for constructing an audio class prediction function for predicting the audio class corresponding to the compression feature in the test feature set based on the audio class distribution function, the audio class corresponding to the compression feature in the training feature set, the target covariance matrix and the target mean vector.
In some embodiments of the present application, the prediction module comprises: a target posterior probability determining unit for determining a target posterior probability of the audio class predicted by the audio class prediction function; and the estimation unit is used for carrying out maximum likelihood estimation on the basis of the target posterior probability so as to adjust the parameters in the audio analysis model.
In some embodiments of the present application, the target posterior probability determining unit includes: a first probability obtaining subunit, configured to obtain a prior probability of an audio category predicted by the audio category prediction function; the second probability obtaining subunit is used for determining the posterior probability of the audio category predicted by the audio category prediction function; and the third probability obtaining subunit is configured to use a product of the prior probability and the posterior probability as the target posterior probability.
In some embodiments of the present application, the audio to be analyzed includes spoken audio that follows reading of the target text; the prediction module comprises: the spoken language information extraction unit is used for extracting acoustic feature information and phoneme alignment information corresponding to the spoken language audio based on the target text; and the spoken language classification unit is used for inputting the acoustic feature information and the phoneme alignment information corresponding to the spoken language audio into the trained audio analysis model so as to output the audio category of the spoken language audio.
According to another embodiment of the present application, a storage medium has stored thereon computer-readable instructions which, when executed by a processor of a computer, cause the computer to perform the method of the embodiments of the present application.
According to another embodiment of the present application, an electronic device includes: a memory storing computer readable instructions; and a processor for reading the computer readable instructions stored in the memory to perform the methods of the embodiments.
According to another embodiment of the present application, a computer program product or computer program comprises computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described in the embodiments of this application.
In the embodiment of the application, acoustic feature information and phoneme alignment information of at least one audio are obtained, each audio being labeled with a corresponding audio category; deep extraction processing is performed on the acoustic feature information and the phoneme alignment information of each audio by using an audio analysis model to obtain the deep pronunciation features of each audio; compression mapping processing is performed on the deep pronunciation features of the audios of the same audio category to obtain at least one compression feature set; Gaussian process construction is performed based on the at least one compression feature set to obtain an audio category prediction function; and parameters in the audio analysis model are adjusted based on the audio category prediction function to obtain the audio category of the audio to be analyzed as predicted by the trained audio analysis model.
In this way, the at least one compression feature set obtained through deep feature extraction processing and compression mapping processing is used as the input for constructing the Gaussian process, which establishes a sparse Gaussian deep kernel learning process, so that the computational complexity is effectively reduced while the accuracy of audio category prediction is ensured.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 shows a schematic diagram of a system to which embodiments of the present application may be applied.
FIG. 2 shows a schematic diagram of another system to which embodiments of the present application may be applied.
Fig. 3 shows a flow diagram of an audio class prediction method according to an embodiment of the present application.
Fig. 4 shows a flow diagram of a depth feature extraction method according to an embodiment of the present application.
FIG. 5 shows a schematic diagram of a deep feature extraction network according to one embodiment of the present application.
FIG. 6 shows a flow diagram of a compression mapping process according to one embodiment of the present application.
Fig. 7 shows a flow chart for likelihood estimation according to an embodiment of the application.
Fig. 8 shows a terminal interface diagram for performing an audio category prediction process in one scenario.
Fig. 9 shows another terminal interface diagram for performing the audio category prediction process in one scenario.
Fig. 10 shows another terminal interface diagram for performing the audio category prediction process in one scenario.
Fig. 11 shows a block diagram of an audio class prediction apparatus according to another embodiment of the present application.
FIG. 12 shows a block diagram of an electronic device according to an embodiment of the application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description that follows, specific embodiments of the present application will be described with reference to steps and symbols executed by one or more computers, unless otherwise indicated. Accordingly, these steps and operations will at times be referred to as being performed by a computer: the computer performs operations with its processing unit on electronic signals that represent data in a structured form. These operations transform the data or maintain it at locations in the computer's memory system, which may be reconfigured or otherwise altered in a manner well known to those skilled in the art. The data is maintained in data structures, which are physical locations of the memory having particular properties defined by the data format. However, while the principles of the application are described in the foregoing terms, this is not intended to be limiting, and those of ordinary skill in the art will appreciate that various of the steps and operations described below may also be implemented in hardware.
FIG. 1 shows a schematic diagram of a system 100 to which embodiments of the present application may be applied. As shown in fig. 1, the system 100 may include a server 101 and a terminal 102. The server 101 and the terminal 102 may be directly or indirectly connected by wireless communication, and the application is not limited thereto.
Data can be transmitted between the server 101 and the terminal 102 through a target protocol link. The target protocol link may be a link based on a transport layer protocol, such as a Transmission Control Protocol (TCP) link, a User Datagram Protocol (UDP) link, or a link based on another transport layer protocol.
The server 101 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like.
In one embodiment of the present example, the server 101 is a cloud server, and the server 101 may provide an artificial intelligence cloud service, such as an artificial intelligence cloud service that supports a Massively Multiplayer Online Role Playing Game (MMORPG). The artificial intelligence cloud service is also generally called AIaaS (AI as a Service). This is a service model for artificial intelligence platforms: the AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service model is similar to an AI-themed application store: all developers can access one or more of the artificial intelligence services provided by the platform through an API (Application Programming Interface), and some qualified developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate, and maintain their own dedicated cloud artificial intelligence services. For example, the server 101 may provide an artificial-intelligence-based audio category prediction service.
The terminal 102 may be any device, and the terminal 102 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, a VR/AR device, an intelligent watch, a computer, and the like.
In an embodiment of the present example, the server 101 may obtain acoustic feature information and phoneme alignment information of at least one audio, where each audio specifies a corresponding audio category; performing deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio by adopting an audio analysis model to obtain deep pronunciation features of each audio; carrying out compression mapping processing on the deep pronunciation characteristics of the audios of the same audio category to obtain at least one compression characteristic set; performing Gaussian process construction based on at least one compression feature set to obtain an audio class prediction function; and adjusting parameters in the audio analysis model based on the audio category prediction function to obtain the audio category of the audio to be analyzed predicted by the trained audio analysis model.
In an example, referring to fig. 1, the server 101 may send the obtained relevant information (including the audio to be analyzed and the text corresponding to the audio to be analyzed) of the audio to be analyzed to the speech recognition module, obtain the acoustic feature information and the phoneme alignment information of the audio to be analyzed, then send the acoustic feature information and the phoneme alignment information of the audio to be analyzed to the audio category prediction module, and predict the audio category of the audio to be analyzed by using the trained audio analysis model therein.
Fig. 2 shows a schematic diagram of another system 200 to which embodiments of the present application may be applied. As shown in fig. 2, the system 200 may be a distributed system formed by a client 201, a plurality of nodes 202 connected by a network communication.
Taking a blockchain system as an example of a distributed system, referring to fig. 2, fig. 2 is an optional structural schematic diagram of the distributed system 200 applied to the blockchain system provided in the embodiment of the present application. The system is formed by a plurality of nodes 202 and a client 201; a Peer-to-Peer (P2P) network is formed between the nodes, and the P2P protocol is an application layer protocol running on top of the Transmission Control Protocol (TCP). In the distributed system, any machine, such as a server, may join and become a node 202 (each node 202 may be a server 101 as in fig. 1), and an audio category prediction service may be provided in a node 202, which includes a hardware layer, a middle layer, an operating system layer, and an application layer.
Referring to the functions of each node in the blockchain system shown in fig. 2, the functions involved include:
1) routing, a basic function that a node has, is used to support communication between nodes.
Besides the routing function, the node may also have the following functions:
2) the application, which is deployed in a blockchain, implements specific services according to actual service requirements, records data related to the implemented functions to form record data, carries a digital signature in the record data to indicate the source of the task data, and sends the record data to other nodes in the blockchain system, so that the other nodes add the record data to a temporary block when the source and integrity of the record data are verified successfully.
For example, the services implemented by the application include:
2.1) a wallet, used to provide electronic money transaction functions, including initiating a transaction (i.e., sending the transaction record of the current transaction to other nodes in the blockchain system; after the other nodes verify it successfully, the record data of the transaction is stored in a temporary block of the blockchain as a response confirming that the transaction is valid); of course, the wallet also supports querying the electronic money remaining at an electronic money address;
2.2) a shared ledger, used to provide functions such as storage, query, and modification of account data; record data of the operations on the account data is sent to the other nodes in the blockchain system, and after the other nodes verify its validity, the record data is stored in a temporary block as a response acknowledging that the account data is valid, and a confirmation may also be sent to the node initiating the operation;
2.3) smart contracts, computerized agreements that can enforce the terms of a contract, implemented by code deployed on the shared ledger and executed when certain conditions are met, used to complete automated transactions according to actual business requirements, such as querying the logistics status of goods purchased by a buyer and transferring the buyer's electronic money to the merchant's address after the buyer signs for the goods; of course, smart contracts are not limited to contracts for executing transactions, and may also execute contracts that process received information.
3) the blockchain, which comprises a series of blocks that are linked to one another in the chronological order in which they were generated; new blocks cannot be removed once they have been added to the blockchain, and the blocks record the record data submitted by the nodes in the blockchain system.
In an embodiment of this example, the node 202 may obtain acoustic feature information and phoneme alignment information of at least one audio, where each audio specifies a corresponding audio category; performing deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio by adopting an audio analysis model to obtain deep pronunciation features of each audio; carrying out compression mapping processing on the deep pronunciation characteristics of the audios of the same audio category to obtain at least one compression characteristic set; performing Gaussian process construction based on at least one compression feature set to obtain an audio class prediction function; and adjusting parameters in the audio analysis model based on the audio category prediction function to obtain the audio category of the audio to be analyzed predicted by the trained audio analysis model.
Fig. 3 schematically shows a flow chart of an audio class prediction method according to an embodiment of the present application. The execution subject of the audio category prediction method may be any terminal, such as the server 101 shown in fig. 1 or the node 202 shown in fig. 2.
As shown in fig. 3, the audio class prediction method may include steps S310 to S350.
Step S310, obtaining acoustic characteristic information and phoneme alignment information of at least one audio, and marking a corresponding audio category for each audio;
step S320, performing deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio by adopting an audio analysis model to obtain deep pronunciation features of each audio;
step S330, carrying out compression mapping processing on the deep pronunciation characteristics of the audios of the same audio category to obtain at least one compression characteristic set;
step S340, constructing a Gaussian process based on at least one compression feature set to obtain an audio class prediction function;
step S350, adjusting parameters in the audio analysis model based on the audio category prediction function to obtain the audio category of the audio to be analyzed predicted by the trained audio analysis model.
The following describes a specific process of each step performed when audio category prediction is performed.
In step S310, acoustic feature information and phoneme alignment information of at least one audio are obtained, and each audio designates a corresponding audio category.
In the embodiment of the present example, the audio is speech, such as spoken language or singing, and the audio category is a label describing the category to which the audio belongs; it can be understood that the audio categories may be defined according to requirements. The number of audios in the at least one audio can be selected according to requirements; the at least one audio may cover at least one audio category, and each audio category may include a predetermined number of audios. For example, the at least one audio may include 500 audios, there may be five audio categories, and each audio category may include 100 audios.
In one example, the audio includes spoken audio (e.g., spoken english audio) that follows the target text, and the audio categories may include five: one star, two stars, three stars, four stars, and five stars, the higher the star level, the higher the score corresponding to the audio.
Acoustic feature information is a general term for the acoustic representation of the various elements of a sound, that is, the physical quantities that represent the acoustic characteristics of the audio; it includes, for example, Mel-frequency cepstral coefficients (MFCCs), fundamental frequency features, formant features, and the like. The phoneme alignment information is the pronunciation time alignment information of the words in the audio, and may include the pronunciation start and stop time period corresponding to each word in the audio (i.e., the time period from the pronunciation start time to the pronunciation stop time of each word); for example, the pronunciation start and stop time period of a certain word may be 1.5 s to 2 s.
When acquiring the acoustic feature information and the phoneme alignment information of each audio, the audio and the text corresponding to the audio are first acquired. Then, acoustic features are extracted from the audio signal, for example Mel-frequency cepstral coefficients (MFCCs), where the extracted acoustic features include one sub-acoustic feature per frame; at the same time, the feature start and stop time period of each frame of sub-acoustic features in the audio signal (the time period between the start time and the end time corresponding to the sub-acoustic feature) is recorded. Phoneme alignment is then performed between the audio and its corresponding text through speech recognition, and the pronunciation start and stop time periods of the words in the audio are obtained as the phoneme alignment information. The audio and its corresponding text may be sent to a trained information extractor based on Automatic Speech Recognition (ASR) technology to obtain the acoustic feature information and the phoneme alignment information of the audio.
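As an illustration of this step, the following minimal sketch extracts per-frame MFCC sub-acoustic features and pairs them with word-level pronunciation start and stop time periods. The use of librosa, the frame parameters, and the simple alignment list are assumptions for illustration only; in practice the phoneme alignment information would come from an ASR-based forced aligner as described above.

```python
# Sketch: per-frame sub-acoustic features (MFCC) plus word alignment info.
import librosa
import numpy as np

def extract_features(wav_path, sr=16000, frame_len=0.025, hop=0.010):
    """Return (frames, frame_times): one MFCC sub-acoustic feature per frame
    and the start/end time of each frame in seconds."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=13,
        n_fft=int(frame_len * sr), hop_length=int(hop * sr),
    )                                    # shape: (13, num_frames)
    frames = mfcc.T                      # (num_frames, 13), one row per frame
    starts = np.arange(frames.shape[0]) * hop
    frame_times = np.stack([starts, starts + frame_len], axis=1)
    return frames, frame_times

# Hypothetical phoneme-alignment output: pronunciation start/stop period per word.
alignment = [("I", 1.0, 1.5), ("like", 1.5, 2.1), ("apple", 2.1, 2.8)]
```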
In step S320, an audio analysis model is used to perform deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio, so as to obtain a deep pronunciation feature of each audio.
In the embodiment of the present example, the audio analysis model, that is, the model to be trained for performing the audio category prediction, may include a deep extraction processing unit, a compression mapping processing unit, and a gaussian process building unit.
The deep extraction processing is a process of extracting deep pronunciation features representing richer audio information than the acoustic feature information, and the deep extraction processing unit can perform deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio to obtain the deep pronunciation features of each audio.
The deep pronunciation feature of each audio is obtained through deep extraction processing of the acoustic feature information and the phoneme alignment information of that audio, so that when pseudo samples (namely, the compression feature sets) are constructed through the compression mapping processing in the subsequent steps, the pseudo samples retain a strong ability to represent the audio information while sparsifying the acoustic feature information.
In one embodiment, the acoustic feature information of the audio includes at least one frame of sub-acoustic features of the audio, and the phoneme alignment information of the audio includes a pronunciation start and stop time period of a word in the audio; in step S320, performing deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio to obtain a deep pronunciation feature of each audio, including:
performing depth extraction processing on at least one frame of sub-acoustic features of each audio to obtain at least one frame of depth features corresponding to each audio; normalizing the depth features belonging to the same pronunciation starting and stopping time period in at least one frame of depth features corresponding to each audio to obtain the word features of words in each audio; and carrying out normalization processing on the word characteristics of the words in each audio to obtain the deep pronunciation characteristics of each audio.
The acoustic feature information of the audio includes at least one frame of sub-acoustic features of the audio; that is, the acoustic feature information of each audio is composed of frame-by-frame sub-acoustic features, so the acoustic feature information is at the frame level. Each frame of sub-acoustic features may be a multi-element feature; for example, one sub-acoustic feature may be a three-element feature [1, 0.2, 3], where 1 is the initial pronunciation feature of a phoneme, 0.2 is the middle stable pronunciation feature of the phoneme, and 3 is the tail pronunciation feature of the phoneme.
That is, for each audio, depth extraction processing is performed on the at least one frame of sub-acoustic features included in its acoustic feature information to obtain a depth feature corresponding to each frame of sub-acoustic features, and thus at least one frame of depth features corresponding to the audio.
When performing depth extraction processing on the at least one frame of sub-acoustic features of an audio, one approach is convolution processing: a feature matrix is first constructed from the at least one frame of sub-acoustic features, with each frame of sub-acoustic features corresponding to one position in the feature matrix; a convolution operation is then performed on the feature matrix based on a preset convolution matrix to obtain a convolution result matrix, and the feature value at the position corresponding to each frame of sub-acoustic features is taken from the convolution result matrix as the depth feature corresponding to that frame. Another approach is long short-term memory fusion processing, which, for each frame of sub-acoustic features, aggregates partial features of the sub-acoustic features of the other frames (for example, by dot products), thereby obtaining the depth feature corresponding to each frame of sub-acoustic features. It can be understood that in other embodiments, the at least one frame of sub-acoustic features of the audio may also be subjected to depth extraction processing by an attention-based fusion method or the like.
For example, referring to fig. 4, the acoustic feature information of the audio "I like apple" includes six frames of sub-acoustic features [1, 0.2, 3], [1.2, 3, 0.5], [0.3, 2, 3], [0.2, 3, 0.4], [1, 1.2, 3], and [2, 3.5, 4]; by performing depth extraction processing on these six frames of sub-acoustic features, six frames of depth features [1, 2, 3], [2, 3, 4], [1, 1, 3], [1, 3, 4], [4, 2, 3], and [2, 3, 4] can be obtained.
When extracting the sub-acoustic features of each frame, a feature start-stop time period (a time period from a start time to an end time corresponding to the sub-acoustic features) corresponding to the sub-acoustic features in the audio signal can be recorded, so that the feature start-stop time period corresponding to the depth features corresponding to the sub-acoustic features of each frame is determined, and further, the depth features belonging to the same pronunciation start-stop time period can be determined according to the pronunciation start-stop time period of each word.
For example, referring to fig. 4, the pronunciation start and stop time period corresponding to "I" in the audio "I like apple" may be 1 s to 1.5 s. According to the feature start and stop time periods of the sub-acoustic features, the sub-acoustic features [1, 0.2, 3] and [1.2, 3, 0.5] can be determined to belong to the pronunciation start and stop time period corresponding to "I", and therefore the depth feature [1, 2, 3] corresponding to [1, 0.2, 3] and the depth feature [2, 3, 4] corresponding to [1.2, 3, 0.5] also belong to the pronunciation start and stop time period corresponding to "I".
The normalization processing may be averaging or summing the depth features belonging to the same pronunciation start and stop time period; in this example, averaging is selected. For example, the depth features [1, 2, 3] and [2, 3, 4] corresponding to "I" are averaged element-wise to obtain the word feature [1.5, 2.5, 3.5] corresponding to "I".
Finally, normalization processing such as averaging or summing may be performed on the word features of the words in the audio; in this example, averaging is selected. For example, the word feature [1.5, 2.5, 3.5] of "I", the word feature [1, 2, 3.5] of "like", and the word feature [3, 2.5, 3.5] of "apple" are averaged element-wise to obtain the deep pronunciation feature [1.83, 2.33, 3.5] of the audio "I like apple".
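The following sketch reproduces this pooling (frame depth features → word features → deep pronunciation feature) on the illustrative "I like apple" values; the frame time stamps are assumed for the example.

```python
import numpy as np

# Per-frame depth features and the start/end time of each frame (illustrative).
depth_feats = np.array([[1, 2, 3], [2, 3, 4], [1, 1, 3],
                        [1, 3, 4], [4, 2, 3], [2, 3, 4]], dtype=float)
frame_times = [(1.0, 1.25), (1.25, 1.5), (1.5, 1.8),
               (1.8, 2.1), (2.1, 2.45), (2.45, 2.8)]
alignment = [("I", 1.0, 1.5), ("like", 1.5, 2.1), ("apple", 2.1, 2.8)]

def pool_utterance(depth_feats, frame_times, alignment):
    word_feats = []
    for _, start, end in alignment:
        # Average the depth features whose frames fall inside the word's
        # pronunciation start/stop period -> one word feature per word.
        idx = [i for i, (s, e) in enumerate(frame_times) if s >= start and e <= end]
        word_feats.append(depth_feats[idx].mean(axis=0))
    # Average the word features -> one deep pronunciation feature per audio.
    return np.mean(word_feats, axis=0)

print(pool_utterance(depth_feats, frame_times, alignment))
# -> approximately [1.83, 2.33, 3.5], the utterance-level deep pronunciation feature
```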
In one embodiment, performing depth extraction processing on at least one frame of sub-acoustic features of each audio to obtain at least one frame of depth features corresponding to each audio includes:
inputting at least one frame of sub-acoustic features of each audio frequency into a feature extraction neural network; and performing depth extraction processing on at least one frame of input sub-acoustic features based on the feature extraction neural network to obtain at least one frame of depth features corresponding to each audio frequency.
The feature extraction neural network may be a convolutional neural network (capable of performing convolution processing), a long short-term memory network (capable of performing long short-term memory fusion processing), or the like; it may be a stack of multiple neural network layers, which may be the same or different, for example a stack of 3 convolutional layers is used as the feature extraction neural network. Performing deep extraction processing such as convolution processing or long short-term memory fusion processing based on the feature extraction neural network, and using the feature extraction neural network as the deep extraction processing unit in the audio analysis model, can ensure the stability of the audio analysis model for audio category prediction.
The feature extraction neural network may be pre-trained, and its pre-training process may be as follows. For the target feature extraction model shown in fig. 5, the target feature extraction model may include the feature extraction neural network and a fully connected network. An acoustic feature information set can be obtained, where each piece of acoustic feature information in the set includes at least one frame of sub-acoustic features and each sub-acoustic feature is labeled with a feature tag (senone tag). Then, each piece of acoustic feature information in the set is used as an input feature of the target feature extraction model: depth features are extracted based on the feature extraction neural network, the depth features are processed by the fully connected network, and the feature tag (senone tag) of each sub-acoustic feature in the acoustic feature information is output. The target feature extraction model is trained iteratively until the accuracy of its predicted feature tags (senone tags) meets the prediction accuracy requirement, and the feature extraction neural network in the trained target feature extraction model is then used as the feature extraction neural network in the audio analysis model, which can further reduce the computational complexity of the audio category prediction work.
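A minimal sketch of such a target feature extraction model is given below, assuming PyTorch; the use of three stacked 1-D convolutions, the hidden size, and the number of senone tags are illustrative assumptions rather than values taken from the application.

```python
import torch
import torch.nn as nn

class TargetFeatureExtractionModel(nn.Module):
    """Sketch of Fig. 5: a feature extraction network (here 3 stacked 1-D
    convolutions, an assumption) followed by a fully connected network that
    predicts a senone tag for every frame. After pre-training, only the
    feature_extractor part is kept as the deep extraction processing unit."""
    def __init__(self, feat_dim=13, hidden=128, num_senones=3000):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden, num_senones)

    def forward(self, sub_acoustic):          # (batch, frames, feat_dim)
        x = sub_acoustic.transpose(1, 2)       # (batch, feat_dim, frames)
        depth = self.feature_extractor(x)      # (batch, hidden, frames)
        depth = depth.transpose(1, 2)          # (batch, frames, hidden)
        return depth, self.classifier(depth)   # depth features + senone logits

# Pre-training setup (sketch): per-frame cross-entropy on senone tags.
model = TargetFeatureExtractionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
```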
In step S330, the deep pronunciation features of the audios of the same audio category are compressed and mapped to obtain at least one compressed feature set.
In this exemplary embodiment, the at least one audio may include at least one audio category of audio, each audio category may include a predetermined number of audios, and the audio included in each audio category is the same audio category of audio, for example, all the audios corresponding to the audio category of one star are the same audio category of audio.
The compression mapping processing is a process of compressing and mapping a deep pronunciation feature set (a set composed of deep pronunciation features of audios of the same audio category) into a compressed feature set of a lower feature dimension, and the deep pronunciation features of the audios of the same audio category can be subjected to compression mapping processing in a compression mapping processing unit in an audio analysis model to obtain at least one compressed feature set.
Referring to FIG. 6, the deep pronunciation features of the n audios belonging to the same audio category (label) l1 are denoted h1, h2, ..., hn. By performing compression mapping processing on the deep pronunciation features of the audios of this audio category (label) l1, m (m << n) compressed features h̃1, ..., h̃m are obtained, which form the compression feature set corresponding to the audio category (label) l1; C compression feature sets are obtained in total.
Therefore, each compression feature set corresponds to one audio category, each compression feature set includes at least one compressed feature, and the feature dimension (i.e., the number of compressed features) of each compression feature set is smaller than that of the deep pronunciation feature set (the set composed of the deep pronunciation features of the audios of the same audio category) of the corresponding audio category.
When performing compression mapping processing on the deep pronunciation features of the audios of the same audio category, in one embodiment the compression mapping may be performed through convolution processing to obtain the compression feature set corresponding to the audio category: a feature matrix is first constructed from the deep pronunciation features of the audios of the same audio category, a convolution operation is then performed on the feature matrix based on a preset convolution matrix to obtain a convolution result matrix, and each row of the convolution result matrix is used as one compressed feature, thereby obtaining the compression feature set. In another embodiment, the deep pronunciation features of the audios of the same audio category may be compressed and mapped through long short-term memory extraction compression: for each deep pronunciation feature, partial feature information of the other deep pronunciation features is aggregated (for example, by dot products) to obtain a corresponding compressed feature, and part of the compressed features are then extracted from all the compressed features to obtain the compression feature set. It can be understood that in other embodiments, the compression mapping processing may be performed in other compression mapping manners.
By compressing the deep pronunciation feature sets, sparse pseudo samples (namely, the at least one compression feature set) can be explicitly constructed, so that the sparse Gaussian process in the subsequent steps can reliably reduce the amount of calculation while guaranteeing the accuracy of audio category prediction.
In one embodiment, in step S330, the compressing and mapping the deep pronunciation features of the audios of the same audio category to obtain at least one compressed feature set, including:
clustering the deep pronunciation characteristics of each audio according to the audio category corresponding to each audio to obtain at least one characteristic cluster, wherein each characteristic cluster comprises the deep pronunciation characteristics of at least one audio and corresponds to one audio category; and compressing and mapping each feature cluster into a corresponding compression feature set to obtain at least one compression feature set, wherein the feature dimension of each compression feature set is smaller than that of the corresponding feature cluster.
The deep pronunciation features of the audios of the same audio category can be aggregated together according to the audio category corresponding to each audio to obtain at least one feature cluster. For example, the n deep pronunciation features h1, h2, ..., hn of the audios belonging to the same audio category (label) l1 may form one feature cluster. By performing compression mapping processing on this feature cluster, m (m << n) compressed features h̃1, ..., h̃m are obtained, which form one compression feature set.
In one embodiment, compression mapping each feature cluster to a corresponding set of compressed features includes:
inputting each feature cluster into a compression mapping neural network; and compressing and mapping each input feature cluster into a corresponding compression feature set based on the compression mapping neural network.
The compression mapping neural network may be a convolutional neural network (capable of performing convolution processing), a long short-term memory network (capable of performing long short-term memory extraction compression processing), or another neural network; it may be a stack of multiple neural network layers, which may be the same or different, for example a stack of 3 convolutional layers is used as the compression mapping neural network. Performing compression mapping based on the compression mapping neural network, and using the compression mapping neural network as the compression mapping processing unit in the audio analysis model, can further ensure the stability of the audio analysis model for audio category prediction.
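The following sketch shows one possible compression mapping neural network, assuming PyTorch. Mapping a feature cluster of n deep pronunciation features to m << n compressed features via m learnable query vectors is only one illustrative realization; the application itself does not prescribe this particular architecture.

```python
import torch
import torch.nn as nn

class CompressionMapping(nn.Module):
    """Sketch: map a feature cluster of n deep pronunciation features belonging
    to one audio category into m << n compressed features (pseudo samples).
    Attention over m learnable query vectors is one possible realization (an
    assumption); any compression mapping neural network could be used."""
    def __init__(self, feat_dim=128, num_compressed=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_compressed, feat_dim))
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, cluster):                                # (n, feat_dim)
        attn = torch.softmax(self.queries @ cluster.T, dim=-1)  # (m, n)
        compressed = attn @ cluster                             # (m, feat_dim)
        return self.proj(compressed)                            # compression feature set

# One compression feature set per audio category (C sets in total).
cluster = torch.randn(100, 128)     # n = 100 deep pronunciation features
mapper = CompressionMapping()
pseudo_samples = mapper(cluster)    # (8, 128): m = 8 compressed features
```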
In step S340, a gaussian process is constructed based on at least one set of compressed features to obtain an audio class prediction function.
In the embodiment of the present example, the Gaussian process may be determined by a mean function m(x) and a covariance function (i.e., a Gaussian kernel) K, where x is a compressed feature. Gaussian process construction is the process of evaluating the covariance function and the mean function based on the at least one compression feature set and obtaining the audio category prediction function from them. The audio category prediction function obtained through the Gaussian process construction is an audio category prediction function with parameters to be trained; the audio category prediction function in the trained audio analysis model can be used to predict the audio category of the audio to be classified.
Based on the input features (i.e., acoustic features) X = {x_1, ..., x_N}, the Gaussian process construction yields outputs f = {f(x_1), ..., f(x_N)}, and f(X) obeys the Gaussian distribution shown in formula (1), where m(X) is the mean of the Gaussian distribution generated by applying the mean function to the input features, and K(X, X) is the covariance matrix generated by applying the covariance function (i.e., the Gaussian kernel) to the input features.
f ~ N(m(X), K(X, X))    (1)
For the output Y, it is assumed that there is Gaussian noise between the output Y and the output f, so the distribution of the final output Y is as shown in formula (2), where σ²I is the noise term (σ² being a hyperparameter and I the identity matrix).
Y ~ N(m(X), K(X, X) + σ²I)    (2)
Assuming that the joint probability distribution of the output Y (audio categories) of the training data X and the output y* (audio category) of the test data x* conforms to a Gaussian distribution, formula (3) is obtained, where the upper-case K denotes the covariance matrix of the training data, the lower-case k denotes the covariance between the training data and the test data (and between the test data and itself), and the mean m(X) may be set to 0.
[Y, y*]ᵀ ~ N(0, [[K(X, X) + σ²I, k(X, x*)], [k(x*, X), k(x*, x*)]])    (3)
For each audio category y* to be predicted, its posterior probability is as shown in formula (4), where D may include the hyperparameter σ²I.
p(y* | x*, X, Y, D) = N(ŷ*, Σ*)    (4)
Finally, the audio category prediction function is obtained, as shown in formulas (5) and (6), where ŷ* is the predicted audio category and Σ* is the uncertainty score of the predicted audio category; a larger uncertainty score indicates a higher uncertainty of the predicted audio category.
ŷ* = k(x*, X)(K(X, X) + σ²I)⁻¹ Y    (5)
Σ* = k(x*, x*) − k(x*, X)(K(X, X) + σ²I)⁻¹ k(X, x*)    (6)
The covariance function K can be selected according to requirements; in one example it is a kernel function k(x_i, x_j | θ) with kernel parameters θ, as shown in formula (7).
Finally, in the present application the covariance function in the aforementioned Gaussian process is evaluated according to the mapping k(x_i, x_j | θ) → k(g(x_i, w), g(x_j, w) | θ, w): the at least one compression feature set obtained through the deep feature extraction processing and the compression mapping processing is used as the input for constructing the Gaussian process, so that the Gaussian kernel k(x_i, x_j | θ) based on the acoustic features (where x_i may be the i-th acoustic feature, x_j the j-th acoustic feature, and θ the parameters of the covariance function) is converted into a sparse Gaussian deep kernel k(g(x_i, w), g(x_j, w) | θ, w) (where g may denote the deep feature extraction processing and the compression mapping processing, and w the parameters of that processing). In this way, the computational complexity is effectively reduced while the accuracy of audio category prediction is ensured.
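A minimal numerical sketch of formulas (5) and (6) is given below, assuming NumPy and a radial basis function kernel standing in for the unspecified covariance function of formula (7); the compressed features produced by the deep extraction and compression mapping g(·) play the role of the training inputs.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Assumed covariance function k(., . | theta); the application leaves the
    exact kernel to be chosen according to requirements (formula (7))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(H_train, y_train, H_test, noise=0.1):
    """Formulas (5) and (6): predicted audio category and uncertainty score,
    computed from compressed (pseudo-sample) features."""
    K = rbf_kernel(H_train, H_train) + noise * np.eye(len(H_train))  # K(X,X)+sigma^2 I
    k_star = rbf_kernel(H_test, H_train)                             # k(x*, X)
    K_inv = np.linalg.inv(K)
    mean = k_star @ K_inv @ y_train                                   # formula (5)
    var = rbf_kernel(H_test, H_test) - k_star @ K_inv @ k_star.T      # formula (6)
    return mean, np.diag(var)

# H below stands for g(x, w), i.e. deep extraction + compression mapping, so the
# kernel is effectively the sparse deep kernel k(g(x_i, w), g(x_j, w) | theta, w).
H_train = np.random.randn(40, 16)                          # compressed features
y_train = np.random.randint(1, 6, size=40).astype(float)   # star ratings 1..5
H_test = np.random.randn(3, 16)
mean, uncertainty = gp_predict(H_train, y_train, H_test)
```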
In one embodiment, step S340, performing a gaussian process construction based on at least one compressed feature set to obtain an audio class prediction function, includes:
constructing a training feature set and a testing feature set based on compressed features in at least one compressed feature set; performing covariance operation processing on the compressed features in the training feature set and the test feature set based on a covariance function to generate a target covariance matrix; performing mean value operation processing on the compressed features in the training feature set and the test feature set based on a mean value function to generate a target mean value vector; and constructing an audio class prediction function based on the covariance matrix and the target mean vector.
From the compressed features in the at least one compression feature set, a part of the compressed features can be selected to construct a training feature set, denoted here as h(x̄), and another part can be selected to construct a test feature set h(x); the dimensions of the compressed features in the training feature set and the test feature set can be chosen according to an agreed dimension.

The type of the covariance function K can be selected according to requirements. The covariance operation computes the covariance between every two compressed features to form the target covariance matrix, and each element in the target covariance matrix is the covariance between a pair of compressed features. The mean function m(x) may compute the mean of the compressed features to obtain the target mean vector. Furthermore, based on the obtained target covariance matrix and the target mean vector, the audio class prediction function μ* of the parameters to be trained, now defined over the compression feature set, can be generated.
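As an illustrative sketch only — the random split rule and all names are assumptions — the training feature set and test feature set might be built from the compression feature sets as follows:

```python
import numpy as np

def build_feature_sets(compression_sets, n_train_per_set):
    # compression_sets: list of compression feature sets, one per audio category
    # (each a 2-D array of compressed features). From every set, part of the
    # compressed features goes into the training feature set and the remainder
    # into the test feature set.
    train, test = [], []
    for feats in compression_sets:
        idx = np.random.permutation(len(feats))
        train.append(feats[idx[:n_train_per_set]])
        test.append(feats[idx[n_train_per_set:]])
    return np.vstack(train), np.vstack(test)
```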
In one embodiment, the performing covariance operation on the compressed features in the training feature set and the test feature set based on a covariance function to generate a target covariance matrix includes:
performing covariance operation processing on the compressed features in the training feature set based on a covariance function to obtain a first covariance matrix; performing covariance operation processing on the compression characteristics in the test characteristic set based on a covariance function to obtain a second covariance matrix; performing covariance operation processing on the compressed features in the training feature set and the test feature set based on a covariance function to obtain a third covariance matrix; and taking the first covariance matrix, the second covariance matrix and the third covariance matrix as target covariance matrices.
For the compressed features h(x̄) in the training feature set, covariance operation processing can be performed to generate a first covariance matrix that characterizes the covariance between every two compressed features h(x̄) in the training feature set; this first covariance matrix based on the compressed features h(x̄) may convert K(X, X) in formula (3) into K(h(x̄), h(x̄)).
Covariance operation processing is performed on the compressed features h(x) in the test feature set based on the covariance function, generating a second covariance matrix that characterizes the covariance between every two compressed features h(x) in the test feature set; this second covariance matrix based on the compressed features h(x) may convert k(x*, x*) in formula (3) into k(h(x), h(x)).
Covariance operation processing based on the covariance function is performed jointly on the compressed features in the training feature set and in the test feature set, generating a third covariance matrix that characterizes the covariance between the compressed features h(x) in the test feature set and the compressed features h(x̄) in the training feature set; this third covariance matrix based on h(x) and h(x̄) may convert k(x*, X) in formula (3) into k(h(x), h(x̄)).
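A minimal sketch of computing these three covariance matrices from the two feature sets is given below; the Gaussian kernel, the fixed length scale, and the variable names h_train / h_test are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / length_scale**2)

def target_covariance_blocks(h_train, h_test, length_scale=1.0):
    # h_train: compressed features of the training feature set; h_test: of the test feature set.
    K1 = rbf(h_train, h_train, length_scale)   # first covariance matrix  (training vs. training)
    K2 = rbf(h_test,  h_test,  length_scale)   # second covariance matrix (test vs. test)
    K3 = rbf(h_test,  h_train, length_scale)   # third covariance matrix  (test vs. training)
    return K1, K2, K3
```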
in one embodiment, performing a mean operation on compressed features in a training feature set and a test feature set based on a mean function to generate a target mean vector includes:
performing mean value operation processing on the compressed features in the test feature set based on a mean value function to obtain a first mean vector; performing mean value operation processing on the compressed features in the training feature set based on the mean value function to obtain a second mean vector; and taking the first mean vector and the second mean vector as the target mean vector.
Taking the mean of all the compressed features h(x) in the test feature set gives a first mean vector μ1, and taking the mean of all the compressed features h(x̄) in the training feature set gives a second mean vector μ2; a mean vector can then be generated based on μ1 and μ2 to replace the zero mean in formula (3).
In one embodiment, constructing the audio class prediction function based on the covariance matrix and the target mean vector comprises:
acquiring an audio class distribution function which is generated based on a mean function and a covariance function and meets Gaussian distribution; acquiring an audio category corresponding to the compression feature in the training feature set; and constructing an audio class prediction function for predicting the audio class corresponding to the compression features in the test feature set based on the audio class distribution function, the audio class corresponding to the compression features in the training feature set, the target covariance matrix and the target mean vector.
The audio class distribution function may be the function satisfying the Gaussian distribution N shown in formula (1), generated from the pre-selected mean function and covariance function. For each compressed feature in the training feature set, the audio category of the compression feature set it originates from can be determined, giving the audio categories Y corresponding to the compressed features in the training feature set.
Then, the audio categories corresponding to the compressed features in the training feature set, the target covariance matrix and the target mean vector are substituted into the audio class distribution function, yielding the audio class prediction function, shown in formulas (5) and (6), that predicts the audio classes corresponding to the compressed features in the test feature set. Furthermore, parameters in the audio analysis model can be adjusted according to the prediction result of the audio class prediction function, so as to obtain the trained audio analysis model.
In step S350, parameters in the audio analysis model are adjusted based on the audio category prediction function, so as to obtain an audio category of the audio to be analyzed predicted by the trained audio analysis model.
In the embodiment of the present example, the audio class prediction function is a function whose parameters are to be trained. It is generated based on the compression feature sets and can predict the audio class of a target compressed feature (for example, a compressed feature in the test feature set) to obtain a prediction result; according to the error of the prediction result, the parameters in the audio analysis model can be adjusted until the prediction accuracy of the audio class prediction function meets the target requirement, thereby obtaining the trained audio analysis model. The audio class of the audio to be analyzed can then be predicted based on the trained audio analysis model.
When the parameters in the audio analysis model are adjusted, the adjusted parameters may include the parameters of the depth extraction processing in the audio analysis model (for example, the parameters of the feature extraction neural network), the parameters of the compression mapping processing (for example, the parameters of the compression mapping neural network), and the parameters of the Gaussian process (for example, the parameters in the audio class prediction function, which may include the hyperparameters of the covariance function). These are adjusted jointly, forming an overall sparse Gaussian deep kernel learning process, so that the computational complexity is effectively reduced while the accuracy of the audio analysis model for audio class prediction is ensured.
In one embodiment, in step S350, adjusting parameters in the audio analysis model based on the audio class prediction function includes:
determining a target posterior probability of the audio category predicted by the audio category prediction function; and carrying out maximum likelihood estimation based on the target posterior probability so as to adjust parameters in the audio analysis model.
The target posterior probability can be determined from the audio class predicted by the audio class prediction function, for example by means of Bayesian estimation, and its determination formula can be written as:

p([Y, y]ᵀ) = p(Y | y) · p(y)

where T denotes the transpose, Y is the audio class corresponding to the compressed features in the training feature set, y is the predicted audio class corresponding to the compressed features in the test feature set, and p([Y, y]ᵀ) is the target posterior probability.
Then, maximum likelihood estimation is performed on the target posterior probability and gradient descent is carried out, so that the parameters in the audio analysis model can be updated and adjusted.
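For illustration of the gradient-based adjustment only, the sketch below optimizes a standard Gaussian-process negative log marginal likelihood over compressed features; the objective, the use of SciPy's optimizer, and all names and toy data are assumptions, and the application's actual target posterior may differ.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, H, Y):
    # Stand-in objective: GP negative log marginal likelihood over compressed
    # features H with class labels Y; illustrates adjusting the kernel length
    # scale and noise level by gradient-based optimization.
    log_ls, log_noise = params
    ls, noise = np.exp(log_ls), np.exp(log_noise)
    d2 = np.sum(H**2, axis=1)[:, None] + np.sum(H**2, axis=1)[None, :] - 2.0 * H @ H.T
    K = np.exp(-0.5 * d2 / ls**2) + noise * np.eye(len(H))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    return 0.5 * Y @ alpha + np.sum(np.log(np.diag(L)))

H = np.random.randn(20, 4)   # toy compressed features
Y = np.random.randn(20)      # toy audio class labels/scores
res = minimize(neg_log_likelihood, x0=np.zeros(2), args=(H, Y), method="L-BFGS-B")
```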
In one embodiment, determining the target posterior probability for the audio class predicted by the audio class prediction function comprises:
acquiring the prior probability of the audio category predicted by the audio category prediction function; determining a posterior probability of the audio class predicted by the audio class prediction function; and taking the product of the prior probability and the posterior probability as the target posterior probability.
The prior probability of the audio class predicted by the audio class prediction function may specifically be the prior probability p(y) of the audio class corresponding to the compressed features in the test feature set; the posterior probability of the audio class may be, for example, p(Y | y); and the product of the prior probability and the posterior probability is then p(Y | y) · p(y).
By combining the prior probability and the posterior probability and then performing maximum likelihood estimation with respect to the parameters in the audio analysis model, the distribution of the sparse pseudo samples (i.e., the compression feature sets) can be brought closer to that of the real deep pronunciation features during training of the audio analysis model, which further ensures the prediction accuracy.
When the target posterior probability is obtained by combining the prior probability and the posterior probability in this way, the maximum likelihood estimation can be performed according to the following likelihood estimation formula, which maximizes the logarithm of the target posterior probability with respect to the model parameters:

argmax over {θ, σy, l} of log p([Y, y]ᵀ)

In the likelihood estimation formula, θ may be the parameters of the depth extraction processing and of the compression mapping processing, while σy and l may be parameters of the Gaussian process. The likelihood estimation formula can be decomposed into two loss functions, loss1 and loss2, one for each feature set.
as shown in FIG. 7, the loss function loss1 may be based on the compression characteristics h (x) (including h (x)) in the test feature set1) To h (x)n) Estimate of the corresponding prediction error, the loss function loss2 may represent the compression features in the training-based feature set
Figure BDA0003085179190000206
(comprises
Figure BDA0003085179190000207
To
Figure BDA0003085179190000208
) An estimate of the corresponding prediction error.
In an application scenario, the audio in the foregoing embodiments is spoken audio, and the audio to be analyzed includes spoken audio produced by reading after a target text; obtaining the audio category of the audio to be analyzed predicted by the trained audio analysis model then includes:
extracting acoustic feature information and phoneme alignment information corresponding to the spoken language audio based on the target text; and inputting acoustic characteristic information and phoneme alignment information corresponding to the spoken language audio into the trained audio analysis model so as to output the audio category of the spoken language audio.
Taking as an example spoken audio produced by reading after the target text, for instance spoken audio generated when a user reads after an English text in an English learning application, the acoustic feature information and phoneme alignment information corresponding to the spoken audio may be extracted, based on the spoken audio and the target text, according to the embodiment corresponding to step S310. The acoustic feature information and phoneme alignment information corresponding to the spoken audio are input into the trained audio analysis model; the deep pronunciation features corresponding to the spoken audio are extracted by the trained depth extraction processing of step S320; compression mapping processing is performed on these deep pronunciation features by the trained compression mapping processing of step S330 to obtain the compressed features corresponding to the spoken audio; and the compressed features are predicted by the trained audio class prediction function of step S340, so as to obtain the audio category corresponding to the spoken audio and output the uncertainty corresponding to that audio category.
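A high-level sketch of this inference flow is given below; every attribute and function name on the `model` object is a hypothetical placeholder used for illustration, not an actual API of the embodiments.

```python
def predict_spoken_audio(audio, target_text, model):
    # Hypothetical end-to-end inference with a trained audio analysis model.
    acoustic, alignment = model.extract_features(audio, target_text)  # step S310 analogue
    deep_feats = model.deep_extract(acoustic, alignment)              # deep pronunciation features
    compressed = model.compress(deep_feats)                           # compression mapping
    category, uncertainty = model.gp_predict(compressed)              # audio class + uncertainty score
    return category, uncertainty
```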
Referring to the terminal interface changes shown in figs. 8 to 10: in the terminal interface shown in fig. 8, the user may trigger (e.g., click or long-press) "start reading" and then read after the text "I like apple", and a terminal (e.g., the terminal 102 shown in fig. 1 or the terminal corresponding to the client 201 shown in fig. 2) may collect the user's spoken audio. Then, in the terminal interface shown in fig. 9, the user may trigger (e.g., click or release the long press) "end reading", so that the terminal stops collecting the spoken audio. At this point the terminal may send the spoken audio and the text to a server (e.g., the server 101 shown in fig. 1 or the node 202 shown in fig. 2), and the server may predict, based on the spoken audio and the text and using the trained audio analysis model, the audio category of the spoken audio and the score of that audio category. Finally, the result is displayed on the terminal interface as shown in fig. 10, where the audio category is four stars and the uncertainty score (i.e., the score confidence) of the four stars is 0.5.
In this way, based on steps S310 to S350, at least one compression feature set obtained based on the depth feature extraction processing and the compression mapping processing is used as an input of the gaussian process to construct the gaussian process, a relationship between acoustic feature information and a pseudo sample (i.e., the compression feature set) is explicitly modeled, a sparse gaussian depth kernel learning process is established, and the accuracy of audio class prediction can be ensured while the computational complexity is effectively reduced.
In order to better implement the audio class prediction method provided by the embodiments of the present application, an embodiment of the present application further provides an audio class prediction apparatus based on the audio class prediction method. The meanings of the terms are the same as in the audio class prediction method above, and specific implementation details can refer to the description in the method embodiments. Fig. 11 shows a block diagram of an audio class prediction apparatus according to an embodiment of the present application.
As shown in fig. 11, the audio class prediction apparatus 400 may include an obtaining module 410, an input module 420, a compression module 430, a gaussian module 440, and a prediction module 450, and the audio class prediction apparatus 400 may be applied to a terminal.
The obtaining module 410 may be configured to obtain acoustic feature information and phoneme alignment information of at least one audio, where each audio specifies a corresponding audio category; the input module 420 may be configured to perform deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio by using an audio analysis model to obtain a deep pronunciation feature of each audio; the compression module 430 may be configured to perform compression mapping on the deep pronunciation features of the audios of the same audio category to obtain at least one compression feature set; the gaussian module 440 may be configured to perform a gaussian process construction based on the at least one set of compression features to obtain an audio class prediction function; the prediction module 450 may be configured to adjust parameters in the audio analysis model based on the audio category prediction function to obtain an audio category of the audio to be analyzed predicted by the trained audio analysis model.
In some embodiments of the present application, the acoustic feature information of the audio includes at least one frame of sub-acoustic features of the audio, and the phoneme alignment information of the audio includes a pronunciation start and stop time period of a word in the audio; the input module includes: the depth extraction unit is used for performing depth extraction processing on at least one frame of sub-acoustic features of each audio frequency to obtain at least one frame of depth features corresponding to each audio frequency; the first normalization unit is used for normalizing the depth features belonging to the same pronunciation starting and stopping time period in at least one frame of depth features corresponding to each audio to obtain the word features of words in each audio; and the second normalization unit is used for performing normalization processing on the word characteristics of the words in each audio to obtain the deep pronunciation characteristics of each audio.
In some embodiments of the present application, the depth extraction unit includes: the network input subunit is used for inputting at least one frame of sub-acoustic features of each audio frequency into a feature extraction neural network; and the network extraction subunit is used for performing depth extraction processing on at least one frame of input sub-acoustic features based on the feature extraction neural network to obtain at least one frame of depth features corresponding to each audio frequency.
In some embodiments of the present application, the compression module comprises: the clustering unit is used for clustering the deep pronunciation characteristics of each audio according to the audio category corresponding to each audio to obtain at least one characteristic cluster, each characteristic cluster comprises the deep pronunciation characteristics of at least one audio, and each characteristic cluster corresponds to one audio category; and the mapping unit is used for compressing and mapping each feature cluster into a corresponding compression feature set to obtain at least one compression feature set, wherein the feature dimension of each compression feature set is smaller than that of the corresponding feature cluster.
In some embodiments of the present application, the mapping unit includes: the mapping input subunit, configured to input each feature cluster into the compression mapping neural network respectively; and the network mapping subunit, configured to compression-map the input feature clusters into corresponding compression feature sets based on the compression mapping neural network.
In some embodiments of the present application, the gaussian module comprises: the sample construction module is used for constructing a training feature set and a testing feature set based on the compressed features in the at least one compressed feature set; the matrix generation unit is used for carrying out covariance operation processing on the compressed features in the training feature set and the test feature set based on a covariance function so as to generate a target covariance matrix; the mean value generating unit is used for carrying out mean value operation processing on the compressed features in the training feature set and the test feature set based on a mean value function so as to generate a target mean value vector; a function construction unit, configured to construct the audio class prediction function based on the covariance matrix and the target mean vector.
In some embodiments of the present application, the matrix generating unit includes: the first covariance generation subunit is used for carrying out covariance operation processing on the compressed features in the training feature set based on the covariance function to obtain a first covariance matrix; the second covariance generation subunit is used for carrying out covariance operation processing on the compressed features in the test feature set based on the covariance function to obtain a second covariance matrix; the third generating subunit is configured to perform covariance operation processing on the compressed features in the training feature set and the test feature set based on the covariance function to obtain a third covariance matrix; a matrix determination subunit, configured to use the first covariance matrix, the second covariance matrix, and the third covariance matrix as the target covariance matrix.
In some embodiments of the present application, the mean generating unit includes: the first mean value generating subunit is configured to perform mean value operation processing on the compressed features in the training feature set based on the mean value function to obtain a first mean value vector; the second mean value generating subunit is configured to perform mean value operation processing on the compressed features in the training feature set based on the mean value function to obtain a second mean value vector; a mean determination subunit, configured to use the first mean vector and the second mean vector as the target mean vector.
In some embodiments of the present application, the function construction unit includes: a distribution function obtaining subunit, configured to obtain an audio class distribution function that satisfies gaussian distribution and is generated based on the mean function and the covariance function; a training class obtaining subunit, configured to obtain an audio class corresponding to the compression feature in the training feature set; and the prediction function construction subunit is used for constructing an audio class prediction function for predicting the audio class corresponding to the compression feature in the test feature set based on the audio class distribution function, the audio class corresponding to the compression feature in the training feature set, the target covariance matrix and the target mean vector.
In some embodiments of the present application, the prediction module comprises: a target posterior probability determining unit for determining a target posterior probability of the audio class predicted by the audio class prediction function; and the estimation unit is used for carrying out maximum likelihood estimation on the basis of the target posterior probability so as to adjust the parameters in the audio analysis model.
In some embodiments of the present application, the target posterior probability determining unit includes: a first probability obtaining subunit, configured to obtain a prior probability of an audio category predicted by the audio category prediction function; the second probability obtaining subunit is used for determining the posterior probability of the audio category predicted by the audio category prediction function; and the third probability obtaining subunit is configured to use a product of the prior probability and the posterior probability as the target posterior probability.
In some embodiments of the present application, the audio to be analyzed includes spoken audio that follows reading of the target text; the prediction module comprises: the spoken language information extraction unit is used for extracting acoustic feature information and phoneme alignment information corresponding to the spoken language audio based on the target text; and the spoken language classification unit is used for inputting the acoustic feature information and the phoneme alignment information corresponding to the spoken language audio into the trained audio analysis model so as to output the audio category of the spoken language audio.
In this way, based on the audio class prediction apparatus 400, at least one compression feature set obtained based on the depth feature extraction processing and the compression mapping processing can be used as the input of the gaussian process to construct the gaussian process, explicitly model the relationship between the acoustic feature information and the pseudo sample (i.e., the compression feature set), establish the sparse gaussian depth kernel learning process, and can effectively reduce the computational complexity and ensure the audio class prediction accuracy.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
In addition, an embodiment of the present application further provides an electronic device, where the electronic device may be a terminal or a server, as shown in fig. 12, which shows a schematic structural diagram of the electronic device according to the embodiment of the present application, and specifically:
the electronic device may include components such as a processor 501 of one or more processing cores, memory 502 of one or more computer-readable storage media, a power supply 503, and an input unit 504. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 12 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 501 is a control center of the electronic device, connects various parts of the entire computer device by using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 502 and calling data stored in the memory 502, thereby performing overall monitoring of the electronic device. Optionally, processor 501 may include one or more processing cores; preferably, the processor 501 may integrate an application processor and a modem processor, wherein the application processor mainly handles operating systems, user pages, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 501.
The memory 502 may be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by operating the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the computer device, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 501 with access to the memory 502.
The electronic device further comprises a power supply 503 for supplying power to each component, and preferably, the power supply 503 may be logically connected to the processor 501 through a power management system, so that functions of managing charging, discharging, power consumption, and the like are realized through the power management system. The power supply 503 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may also include an input unit 504, where the input unit 504 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 501 in the electronic device loads an executable file corresponding to a process of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application programs stored in the memory 502, so as to implement various functions, for example, the processor 501 may execute:
acquiring acoustic feature information and phoneme alignment information of at least one audio, and calibrating a corresponding audio category for each audio; performing deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio by adopting an audio analysis model to obtain deep pronunciation features of each audio; carrying out compression mapping processing on the deep pronunciation characteristics of the audios of the same audio category to obtain at least one compression characteristic set; performing Gaussian process construction based on the at least one compression feature set to obtain an audio class prediction function; and adjusting parameters in the audio analysis model based on the audio category prediction function to obtain the audio category of the audio to be analyzed predicted by the trained audio analysis model.
In one embodiment, the acoustic feature information of the audio includes at least one frame of sub-acoustic features of the audio, and the phoneme alignment information of the audio includes a pronunciation start and stop time period of a word in the audio; the deep extraction processing is performed on the acoustic feature information and the phoneme alignment information of each audio to obtain the deep pronunciation feature of each audio, and the deep pronunciation feature processing method includes: performing depth extraction processing on at least one frame of sub-acoustic features of each audio to obtain at least one frame of depth features corresponding to each audio; normalizing the depth features belonging to the same pronunciation starting and stopping time period in at least one frame of depth features corresponding to each audio to obtain the word features of words in each audio; and carrying out normalization processing on the word characteristics of the words in the audio to obtain the deep pronunciation characteristics of the audio.
In an embodiment, the performing depth extraction processing on at least one frame of sub-acoustic features of each of the audios to obtain at least one frame of depth features corresponding to each of the audios includes: inputting at least one frame of sub-acoustic features of each audio frequency into a feature extraction neural network; and performing depth extraction processing on at least one frame of input sub-acoustic features based on the feature extraction neural network to obtain at least one frame of depth features corresponding to each audio frequency.
In one embodiment, the compressing and mapping the deep pronunciation features of the audios of the same audio category to obtain at least one compression feature set includes: clustering the deep pronunciation characteristics of each audio according to the audio category corresponding to each audio to obtain at least one characteristic cluster, wherein each characteristic cluster comprises the deep pronunciation characteristics of at least one audio and corresponds to one audio category; and compressing and mapping each feature cluster into a corresponding compression feature set to obtain at least one compression feature set, wherein the feature dimension of each compression feature set is smaller than that of the corresponding feature cluster.
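For illustration of this clustering-then-compression step only, a minimal sketch is given below; `compress_fn` is a placeholder for the compression mapping neural network, and all names are assumptions.

```python
import numpy as np

def cluster_and_compress(deep_features, categories, compress_fn):
    # Group deep pronunciation features by their audio category, producing one
    # feature cluster per category, then compression-map each cluster into a
    # compression feature set; compress_fn is assumed to return features of
    # lower dimension than its input cluster.
    clusters = {}
    for feat, cat in zip(deep_features, categories):
        clusters.setdefault(cat, []).append(feat)
    return {cat: compress_fn(np.stack(feats)) for cat, feats in clusters.items()}
```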
In one embodiment, the compression mapping of each feature cluster to a corresponding set of compression features includes: inputting each feature cluster into a compression mapping neural network respectively; and processing the input feature cluster into a corresponding compression feature set based on the compression mapping neural network.
In one embodiment, the constructing a gaussian process based on the at least one set of compressed features to obtain an audio class prediction function includes: constructing a training feature set and a testing feature set based on the compressed features in the at least one compressed feature set; performing covariance operation processing on the compressed features in the training feature set and the test feature set based on a covariance function to generate a target covariance matrix; performing mean operation processing on the compressed features in the training feature set and the test feature set based on a mean function to generate a target mean vector; and constructing the audio class prediction function based on the covariance matrix and the target mean vector.
In one embodiment, the performing covariance operation on the compressed features in the training feature set and the test feature set based on a covariance function to generate a target covariance matrix includes: performing covariance operation processing on the compressed features in the training feature set based on the covariance function to obtain a first covariance matrix; performing covariance operation processing on the compressed features in the test feature set based on the covariance function to obtain a second covariance matrix; performing covariance operation processing on the compressed features in the training feature set and the test feature set based on the covariance function to obtain a third covariance matrix; and taking the first covariance matrix, the second covariance matrix and the third covariance matrix as the target covariance matrix.
In one embodiment, the performing a mean operation on the compressed features in the training feature set and the test feature set based on a mean function to generate a target mean vector includes: performing mean operation processing on the compressed features in the training feature set based on the mean function to obtain a first mean vector; performing mean operation processing on the compressed features in the training feature set based on the mean function to obtain a second mean vector; taking the first mean vector and the second mean vector as the target mean vector.
In one embodiment, the constructing the audio class prediction function based on the covariance matrix and the target mean vector comprises: acquiring an audio class distribution function which is generated based on the mean function and the covariance function and meets Gaussian distribution; acquiring an audio category corresponding to the compression feature in the training feature set; and constructing an audio class prediction function for predicting the audio class corresponding to the compression feature in the test feature set based on the audio class distribution function, the audio class corresponding to the compression feature in the training feature set, the target covariance matrix and the target mean vector.
In one embodiment, the adjusting the parameters in the audio analysis model based on the audio class prediction function includes: determining a target posterior probability of the audio category predicted by the audio category prediction function; and carrying out maximum likelihood estimation based on the target posterior probability so as to adjust parameters in the audio analysis model.
In one embodiment, the determining the target posterior probability of the audio class predicted by the audio class prediction function comprises: acquiring the prior probability of the audio category predicted by the audio category prediction function; determining a posterior probability of an audio class predicted by the audio class prediction function; and taking the product of the prior probability and the posterior probability as the target posterior probability.
In one embodiment, the audio to be analyzed comprises spoken audio for reading with the target text; the obtaining of the audio category of the trained audio analysis model for predicting the audio to be analyzed includes: extracting acoustic feature information and phoneme alignment information corresponding to the spoken language audio based on the target text; and inputting the acoustic characteristic information and the phoneme alignment information corresponding to the spoken language audio into the trained audio analysis model so as to output the audio category of the spoken language audio.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by a computer program, which may be stored in a computer-readable storage medium and loaded and executed by a processor, or by related hardware controlled by the computer program.
To this end, the present application further provides a storage medium, in which a computer program is stored, where the computer program can be loaded by a processor to execute the steps in any one of the methods provided in the present application.
Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the computer program stored in the storage medium can execute the steps in any method provided in the embodiments of the present application, the beneficial effects that can be achieved by the methods provided in the embodiments of the present application can be achieved, for details, see the foregoing embodiments, and are not described herein again.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the method provided in the various alternative implementations of the above embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the embodiments that have been described above and shown in the drawings, but that various modifications and changes can be made without departing from the scope thereof.

Claims (15)

1. An audio class prediction method, comprising:
acquiring acoustic feature information and phoneme alignment information of at least one audio, and calibrating a corresponding audio category for each audio;
performing deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio by adopting an audio analysis model to obtain deep pronunciation features of each audio;
carrying out compression mapping processing on the deep pronunciation characteristics of the audios of the same audio category to obtain at least one compression characteristic set;
performing Gaussian process construction based on the at least one compression feature set to obtain an audio class prediction function;
and adjusting parameters in the audio analysis model based on the audio category prediction function to obtain the audio category of the audio to be analyzed predicted by the trained audio analysis model.
2. The method according to claim 1, wherein the acoustic feature information of the audio includes at least one frame of sub-acoustic features of the audio, and the phoneme alignment information of the audio includes a pronunciation start and stop time period of a word in the audio;
the deep extraction processing is performed on the acoustic feature information and the phoneme alignment information of each audio to obtain the deep pronunciation feature of each audio, and the deep pronunciation feature processing method includes:
performing depth extraction processing on at least one frame of sub-acoustic features of each audio to obtain at least one frame of depth features corresponding to each audio;
normalizing the depth features belonging to the same pronunciation starting and stopping time period in at least one frame of depth features corresponding to each audio to obtain the word features of words in each audio;
and carrying out normalization processing on the word characteristics of the words in the audio to obtain the deep pronunciation characteristics of the audio.
3. The method according to claim 2, wherein the performing depth extraction processing on at least one frame of sub-acoustic features of each of the audios to obtain at least one frame of depth features corresponding to each of the audios comprises:
inputting at least one frame of sub-acoustic features of each audio frequency into a feature extraction neural network;
and performing depth extraction processing on at least one frame of input sub-acoustic features based on the feature extraction neural network to obtain at least one frame of depth features corresponding to each audio frequency.
4. The method according to claim 1, wherein the compressing and mapping the deep-sounding features of the audios of the same audio category to obtain at least one compressed feature set comprises:
clustering the deep pronunciation characteristics of each audio according to the audio category corresponding to each audio to obtain at least one characteristic cluster, wherein each characteristic cluster comprises the deep pronunciation characteristics of at least one audio and corresponds to one audio category;
and compressing and mapping each feature cluster into a corresponding compression feature set to obtain at least one compression feature set, wherein the feature dimension of each compression feature set is smaller than that of the corresponding feature cluster.
5. The method of claim 4, wherein the compression mapping each of the feature clusters to a corresponding set of compressed features comprises:
inputting each feature cluster into a compression mapping neural network respectively;
and processing the input feature cluster into a corresponding compression feature set based on the compression mapping neural network.
6. The method according to claim 1, wherein said performing a gaussian process construction based on said at least one set of compression features to obtain an audio class prediction function comprises:
constructing a training feature set and a testing feature set based on the compressed features in the at least one compressed feature set;
performing covariance operation processing on the compressed features in the training feature set and the test feature set based on a covariance function to generate a target covariance matrix;
performing mean operation processing on the compressed features in the training feature set and the test feature set based on a mean function to generate a target mean vector;
and constructing the audio class prediction function based on the covariance matrix and the target mean vector.
7. The method of claim 6, wherein the covariance based function processing the compressed features in the training feature set and the test feature set to generate a target covariance matrix comprises:
performing covariance operation processing on the compressed features in the training feature set based on the covariance function to obtain a first covariance matrix;
performing covariance operation processing on the compressed features in the test feature set based on the covariance function to obtain a second covariance matrix;
performing covariance operation processing on the compressed features in the training feature set and the test feature set based on the covariance function to obtain a third covariance matrix;
and taking the first covariance matrix, the second covariance matrix and the third covariance matrix as the target covariance matrix.
8. The method of claim 6, wherein the averaging the compressed features in the training feature set and the test feature set based on a mean function to generate a target mean vector comprises:
performing mean operation processing on the compressed features in the training feature set based on the mean function to obtain a first mean vector;
performing mean operation processing on the compressed features in the training feature set based on the mean function to obtain a second mean vector;
taking the first mean vector and the second mean vector as the target mean vector.
9. The method of claim 6, wherein the constructing the audio class prediction function based on the covariance matrix and the target mean vector comprises:
acquiring an audio class distribution function which is generated based on the mean function and the covariance function and meets Gaussian distribution;
acquiring an audio category corresponding to the compression feature in the training feature set;
and constructing an audio class prediction function for predicting the audio class corresponding to the compression feature in the test feature set based on the audio class distribution function, the audio class corresponding to the compression feature in the training feature set, the target covariance matrix and the target mean vector.
10. The method of claim 1, wherein the adjusting parameters in the audio analysis model based on the audio class prediction function comprises:
determining a target posterior probability of the audio category predicted by the audio category prediction function;
and carrying out maximum likelihood estimation based on the target posterior probability so as to adjust parameters in the audio analysis model.
11. The method according to claim 10, wherein said determining a target a posteriori probability for the audio class predicted by the audio class prediction function comprises:
acquiring the prior probability of the audio category predicted by the audio category prediction function;
determining a posterior probability of an audio class predicted by the audio class prediction function;
and taking the product of the prior probability and the posterior probability as the target posterior probability.
12. The method of claim 1, wherein the audio to be analyzed comprises spoken audio followed by target text;
the obtaining of the audio category of the trained audio analysis model for predicting the audio to be analyzed includes:
extracting acoustic feature information and phoneme alignment information corresponding to the spoken language audio based on the target text;
and inputting the acoustic characteristic information and the phoneme alignment information corresponding to the spoken language audio into the trained audio analysis model so as to output the audio category of the spoken language audio.
13. An audio class prediction apparatus, comprising:
the acquisition module is used for acquiring acoustic feature information and phoneme alignment information of at least one audio, and each audio marks a corresponding audio category;
the input module is used for carrying out deep extraction processing on the acoustic feature information and the phoneme alignment information of each audio by adopting an audio analysis model to obtain the deep pronunciation feature of each audio;
the compression module is used for carrying out compression mapping processing on the deep pronunciation characteristics of the audios of the same audio category to obtain at least one compression characteristic set;
the Gaussian module is used for carrying out Gaussian process construction based on the at least one compression feature set to obtain an audio class prediction function;
and the prediction module is used for adjusting parameters in the audio analysis model based on the audio category prediction function so as to obtain the audio category of the audio to be analyzed predicted by the trained audio analysis model.
14. A storage medium having stored thereon computer readable instructions which, when executed by a processor of a computer, cause the computer to perform the method of any one of claims 1 to 12.
15. An electronic device, comprising: a memory storing computer readable instructions; a processor reading computer readable instructions stored by the memory to perform the method of any of claims 1 to 12.
CN202110578096.XA 2021-05-26 2021-05-26 Audio category prediction method and device, storage medium and electronic equipment Pending CN113763928A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110578096.XA CN113763928A (en) 2021-05-26 2021-05-26 Audio category prediction method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110578096.XA CN113763928A (en) 2021-05-26 2021-05-26 Audio category prediction method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113763928A true CN113763928A (en) 2021-12-07

Family

ID=78787219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110578096.XA Pending CN113763928A (en) 2021-05-26 2021-05-26 Audio category prediction method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113763928A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114678009A (en) * 2022-05-30 2022-06-28 深圳市房帮帮互联网科技有限公司 Blind person internet system based on voice technology

Similar Documents

Publication Publication Date Title
CN109344908B (en) Method and apparatus for generating a model
CN110366734B (en) Optimizing neural network architecture
WO2019196546A1 (en) Method and apparatus for determining risk probability of service request event
US11418461B1 (en) Architecture for dynamic management of dialog message templates
CN109376267B (en) Method and apparatus for generating a model
US10346782B2 (en) Adaptive augmented decision engine
CN108021934B (en) Method and device for recognizing multiple elements
CN111523640B (en) Training method and device for neural network model
CN111783873B (en) User portrait method and device based on increment naive Bayes model
US20220138770A1 (en) Method and apparatus for analyzing sales conversation based on voice recognition
CN116127020A (en) Method for training generated large language model and searching method based on model
CN112463968B (en) Text classification method and device and electronic equipment
CN107767152B (en) Product purchasing tendency analysis method and server
Wu et al. Acoustic to articulatory mapping with deep neural network
US20230050655A1 (en) Dialog agents with two-sided modeling
CN112036954A (en) Item recommendation method and device, computer-readable storage medium and electronic device
CN115759001A (en) Language model training method, text prediction method and device
CN115130542A (en) Model training method, text processing device and electronic equipment
CN114065720A (en) Conference summary generation method and device, storage medium and electronic equipment
CN113763928A (en) Audio category prediction method and device, storage medium and electronic equipment
CN112434953A (en) Customer service personnel assessment method and device based on computer data processing
CN116756281A (en) Knowledge question-answering method, device, equipment and medium
CN113362852A (en) User attribute identification method and device
CN116563034A (en) Purchase prediction method, device, equipment and storage medium based on artificial intelligence
CN111161238A (en) Image quality evaluation method and device, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination