CN113555034A - Compressed audio recognition method, device and storage medium


Info

Publication number
CN113555034A
CN113555034A
Authority
CN
China
Prior art keywords
audio
acoustic model
training
probability
decoding
Prior art date
Legal status
Granted
Application number
CN202110887281.7A
Other languages
Chinese (zh)
Other versions
CN113555034B (en)
Inventor
王润宇
付立
资礼波
李萧萧
Current Assignee
Jingdong Shuke Haiyi Information Technology Co Ltd
Original Assignee
Jingdong Shuke Haiyi Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Shuke Haiyi Information Technology Co Ltd
Priority to CN202110887281.7A
Publication of CN113555034A
Application granted
Publication of CN113555034B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis

Abstract

The present disclosure relates to a compressed audio recognition method, apparatus and storage medium. The method includes: obtaining compressed audio, and performing a discrete Fourier transform on the compressed audio to obtain spectral features corresponding to the compressed audio; encoding the spectral features with a lossy acoustic model to obtain a probability matrix corresponding to the spectral features; and decoding the probability matrix with the lossy acoustic model and a decoder to obtain the recognized audio. These technical means solve the prior-art problem that speech recognition models have low recognition accuracy on lossy compressed audio.

Description

Compressed audio recognition method, device and storage medium
Technical Field
The present disclosure relates to the field of communications, and in particular, to a compressed audio recognition method, apparatus, and storage medium.
Background
With the advance of technology, intelligent speech recognition is widely used in smart hardware, voice input, call centers, and other scenarios. However, most speech recognition systems are deployed in the cloud, and users need to upload audio collected by their devices to the cloud system for processing. At present, the audio uploaded by users is mostly in lossless compressed audio file formats. Lossless compressed audio has no signal distortion and yields high speech recognition accuracy, but the files are large, which puts pressure on the network and can cause network services to collapse under heavy load. It is therefore preferable to transmit uploads in a lossy compression format, which achieves a high compression ratio: the compressed file can be as small as 10% of the original file size. However, lossy compression causes feature loss, and because existing speech recognition models are usually trained on lossless audio training sets, their recognition accuracy on lossy compressed audio is low.
In the course of implementing the disclosed concept, the inventors found at least the following technical problem in the related art: the recognition accuracy of speech recognition models on lossy compressed audio is low.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problem, embodiments of the present disclosure provide a compressed audio recognition method, apparatus and storage medium, so as to at least solve the prior-art problem that the recognition accuracy of speech recognition models on lossy compressed audio is low.
The purpose of the present disclosure is achieved by the following technical solutions:
in a first aspect, an embodiment of the present disclosure provides a compressed audio recognition method, including: obtaining compressed audio, and performing a discrete Fourier transform on the compressed audio to obtain spectral features corresponding to the compressed audio; encoding the spectral features with a lossy acoustic model to obtain a probability matrix corresponding to the spectral features; and decoding the probability matrix with the lossy acoustic model and a decoder to obtain the recognized audio.
In an exemplary embodiment, before the spectral features are encoded with the lossy acoustic model to obtain the probability matrix corresponding to the spectral features, the method further includes: obtaining training lossless audio, and performing incremental learning training on an original acoustic model with the training lossless audio to obtain a lossless acoustic model; performing lossy compression on the training lossless audio to obtain training lossy audio; and performing incremental learning training on the lossless acoustic model with the training lossy audio to obtain the lossy acoustic model.
In an exemplary embodiment, before the training lossless audio is obtained, the method further includes: obtaining common everyday text data and/or common professional text data; performing a text-to-audio operation on the obtained text data to obtain the training lossless audio; and storing the training lossless audio in an audio data center.
In an exemplary embodiment, the incremental learning training includes: inputting training audio into an embedding layer and outputting a plurality of character vectors and/or word vectors through the embedding layer, where the training audio includes the training lossless audio and the training lossy audio, and the original acoustic model, the lossless acoustic model and the lossy acoustic model all include the embedding layer; performing statistical processing on the character vectors and/or word vectors with a statistical language algorithm to obtain a feature vector list; and training an acoustic model according to the feature vector list, where the acoustic model includes the original acoustic model, the lossless acoustic model and the lossy acoustic model.
In an exemplary embodiment, training the acoustic model according to the feature vector list includes: counting the categories and occurrence frequencies of the character vectors and/or word vectors in the feature vector list; determining a Huffman tree according to those categories and occurrence frequencies; and training the acoustic model according to the Huffman tree.
In an exemplary embodiment, training the acoustic model according to the Huffman tree includes: determining vector adjacency probabilities according to the Huffman tree, where a vector adjacency probability is the probability that a given character vector or word vector is followed by another character vector or word vector; normalizing the vector adjacency probabilities with a normalized exponential function; and training the acoustic model according to the normalized vector adjacency probabilities.
In an exemplary embodiment, decoding the probability matrix with the lossy acoustic model and the decoder to obtain the recognized audio includes: performing a first decoding operation on the probability matrix with the decoder to obtain a plurality of decoding results, where the decoding operation includes the first decoding operation; calculating, according to the vector adjacency probabilities, the decoding error probability of each of the plurality of decoding results, where a vector adjacency probability is the probability that a given character vector or word vector is followed by another character vector or word vector; and determining the decoding result corresponding to the largest of the decoding error probabilities as the recognized audio.
In an exemplary embodiment, decoding the probability matrix with the lossy acoustic model and the decoder to obtain the recognized audio includes: performing, with the decoder, a first decoding operation on a preset portion of the probability matrix to obtain a target decoding result, where the decoding operation includes the first decoding operation and the target decoding result is the decoding result of that first decoding operation; and performing, with the decoder, the first decoding operation multiple times on the remaining portion of the probability matrix outside the preset portion, correcting the decoding result of each first decoding operation according to the vector adjacency probabilities to obtain the recognized audio, where a vector adjacency probability is the probability that a given character vector or word vector is followed by another character vector or word vector.
In a second aspect, an embodiment of the present disclosure provides a compressed audio recognition apparatus, including: an obtaining module, configured to obtain compressed audio and perform a discrete Fourier transform on it to obtain spectral features corresponding to the compressed audio; an encoding module, configured to encode the spectral features with a lossy acoustic model to obtain a probability matrix corresponding to the spectral features; and a decoding module, configured to decode the probability matrix with the lossy acoustic model and the decoder to obtain the recognized audio.
In a third aspect, embodiments of the present disclosure provide an electronic device. The electronic device includes a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface and the memory communicate with each other through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the compressed audio recognition method described above when executing the program stored in the memory.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the compressed audio recognition method described above.
Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has at least some or all of the following advantages: compressed audio is obtained, and a discrete Fourier transform is performed on it to obtain spectral features corresponding to the compressed audio; the spectral features are encoded with a lossy acoustic model to obtain a probability matrix corresponding to the spectral features; and the probability matrix is decoded with the lossy acoustic model and a decoder to obtain the recognized audio. These technical means solve the prior-art problem that the recognition accuracy of speech recognition models on lossy compressed audio is low, thereby improving the recognition accuracy on lossy compressed audio.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the related art are briefly introduced below; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 schematically illustrates a hardware block diagram of a computer terminal of a compressed audio recognition method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a compressed audio recognition method of an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of a compressed audio recognition method according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a flow diagram of acoustic model training in accordance with an embodiment of the present disclosure;
fig. 5 is a block diagram schematically illustrating a structure of a compressed audio recognition apparatus according to an embodiment of the present disclosure;
fig. 6 schematically shows a block diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided by the embodiments of the present disclosure may be executed in a computer terminal or a similar computing device. Taking a computer terminal as an example, Fig. 1 schematically shows a hardware block diagram of a computer terminal for a compressed audio recognition method according to an embodiment of the present disclosure. As shown in Fig. 1, the computer terminal may include one or more processors 102 (only one is shown in Fig. 1), which may include, but are not limited to, a processing device such as a microprocessor (MPU) or a programmable logic device (PLD), and a memory 104 for storing data. Optionally, the computer terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will understand that the structure shown in Fig. 1 is merely illustrative and does not limit the structure of the computer terminal; for example, the computer terminal may include more or fewer components than shown in Fig. 1, or have an equivalent or different configuration.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the compressed audio recognition method in the embodiments of the present disclosure, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the above-mentioned methods. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to a computer terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
In an embodiment of the present disclosure, a compressed audio recognition method is provided. Fig. 2 schematically illustrates a flowchart of a compressed audio recognition method according to an embodiment of the present disclosure. As shown in Fig. 2, the flow includes the following steps:
step S202, obtaining compressed audio, and performing a discrete Fourier transform on the compressed audio to obtain spectral features corresponding to the compressed audio;
step S204, encoding the spectral features with a lossy acoustic model to obtain a probability matrix corresponding to the spectral features;
step S206, decoding the probability matrix with the lossy acoustic model and the decoder to obtain the recognized audio.
In the embodiments of the present disclosure, compressed audio is obtained and a discrete Fourier transform is performed on it to obtain the corresponding spectral features; the spectral features are encoded with a lossy acoustic model to obtain the corresponding probability matrix; and the probability matrix is decoded with the lossy acoustic model and a decoder to obtain the recognized audio. These technical means solve the prior-art problem that the recognition accuracy of speech recognition models on lossy compressed audio is low, thereby improving the recognition accuracy on lossy compressed audio.
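For illustration, the overall flow of steps S202 to S206 can be sketched as follows, assuming a trained lossy acoustic model exposing an encode function and a decoder exposing a decode function; all names and framing parameters here are illustrative assumptions, not definitions from the patent:

```python
import numpy as np

def extract_spectral_features(audio: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """S202: frame the waveform and apply a discrete Fourier transform to
    each frame, yielding one spectral feature vector per frame."""
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, hop)]
    # Magnitude spectrum of each frame; rfft keeps the non-negative bins.
    return np.stack([np.abs(np.fft.rfft(f)) for f in frames])

def recognize(audio, lossy_acoustic_model, decoder):
    features = extract_spectral_features(audio)               # S202
    prob_matrix = lossy_acoustic_model.encode(features)       # S204
    return decoder.decode(prob_matrix, lossy_acoustic_model)  # S206
```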
Before step S204 is executed, that is, before the spectral features are encoded with the lossy acoustic model to obtain the probability matrix corresponding to the spectral features, the method further includes: obtaining training lossless audio, and performing incremental learning training on an original acoustic model with the training lossless audio to obtain a lossless acoustic model; performing lossy compression on the training lossless audio to obtain training lossy audio; and performing incremental learning training on the lossless acoustic model with the training lossy audio to obtain the lossy acoustic model.
Incremental learning is a type of deep learning that closely resembles the way humans learn: people learn and absorb new things every day as they grow, learning is gradual, and previously acquired knowledge is usually not forgotten. For traditional deep learning techniques, extracting useful information from new data and new scenarios is a hard problem. As the scale of data grows, the demands on time and space increase rapidly, eventually causing the learning rate to fall behind the data update rate. To adapt quickly to a new scenario, a traditional deep learning method must add the new training data to the original training data and retrain, a process with a huge time cost; and the more data accumulates, the higher that cost becomes. Compared with conventional deep learning, incremental learning has significant advantages in two respects: on the one hand, since incremental learning does not need to store the historical training data, it reduces storage-space usage; on the other hand, incremental training on a new scenario makes full use of the historical information of old scenarios, greatly reducing subsequent training time.
Existing deep-learning-based incremental learning methods fall mainly into three categories, distinguished by how old-scenario data is used and how the model preserves old-scenario knowledge: (1) incremental learning without old data, which freezes the last layer of the deep learning model and minimizes the variation of the shared parameters in the feature extraction layers, although these parameters may conflict between old and new scenarios; (2) incremental learning with synthetic data, which uses a generative model to synthesize data from old scenarios but cannot effectively synthesize complex scenario data; and (3) incremental learning with old data, which selects a small number of samples from the training data of old scenarios, retains the classifiers of all incremental steps, and uses them for distillation. The incremental learning method used in the embodiments of the present disclosure is mainly incremental learning without old data.
Optionally, the lossy acoustic model may also be trained as follows: obtain training lossless audio, and perform lossy compression on it to obtain training lossy audio; then perform incremental learning training on an original acoustic model with the training lossy audio to obtain the lossy acoustic model.
The lossy acoustic model is used to recognize the compressed audio; the compressed audio is the object of recognition and may be either lossy or lossless.
Optionally, before the training lossless audio is obtained, the method further includes: obtaining common everyday text data and/or common professional text data; performing a text-to-audio operation on the obtained text data to obtain the training lossless audio; and storing the training lossless audio in an audio data center.
The common everyday text data is text frequently used in daily life, such as the text of news broadcasts. The common professional text data is text from the technical field in which the acoustic model is typically used; for example, if the acoustic model is used for audio recognition in the computer field, the common professional text data is professional text from the computer field.
In an alternative embodiment, the incremental learning training includes: inputting training audio into an embedding layer and outputting a plurality of character vectors and/or word vectors through the embedding layer, where the training audio includes the training lossless audio and the training lossy audio, and the original acoustic model, the lossless acoustic model and the lossy acoustic model all include the embedding layer; performing statistical processing on the character vectors and/or word vectors with a statistical language algorithm to obtain a feature vector list; and training an acoustic model according to the feature vector list, where the acoustic model includes the original acoustic model, the lossless acoustic model and the lossy acoustic model.
The Embedding layer converts discrete variables into continuous vectors; using it yields the corresponding vector matrix. The embodiments of the present disclosure obtain a plurality of character vectors and/or word vectors through the embedding layer. The incremental learning training can train the original acoustic model to obtain the lossless acoustic model, and train the lossless acoustic model to obtain the lossy acoustic model. Performing statistical processing on the character vectors and/or word vectors with a statistical language algorithm yields a feature vector list, i.e., features are extracted from the character vectors and/or word vectors. When the lossless acoustic model performs feature extraction, the spectrum of lossless compressed audio is complete, while the spectrum of lossy compressed audio suffers heavy feature loss; this is because the original acoustic model, trained only on lossless compressed audio, cannot efficiently encode the spectral features of lossy compressed audio. Therefore, lossy compressed audio is used to form a training set with which the lossless acoustic model is trained to obtain the lossy acoustic model, strengthening the lossy acoustic model's ability to encode the features of lossy compressed audio while retaining, to the greatest extent, its ability to encode the features of lossless audio.
The character vectors and/or word vectors are statistically processed with a statistical language algorithm to obtain the feature vector list. The statistical language algorithm may be the N-Gram, an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the text content, producing a sequence of fragments of length N. Each fragment is called a gram; the occurrence frequency of all grams is counted and filtered against a preset threshold to form a key-gram list, which is the vector feature space of the text, each gram in the list being one feature-vector dimension. The algorithm rests on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no others, so the probability of a complete sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting in the corpus how many times the N words occur together. Binary Bi-grams and ternary Tri-grams are commonly used. By exploiting collocation information between adjacent words in context, the N-Gram can, for example, automatically convert input into Chinese characters.
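A minimal sketch of the N-Gram statistics described above, assuming a simple token list (the function names are illustrative):

```python
from collections import Counter

def ngram_feature_list(tokens, n=2, threshold=2):
    """Slide a window of size n over the token sequence, count every gram,
    and keep grams at or above the threshold as the key-gram list."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= threshold}

def bigram_prob(tokens, prev, word):
    """Estimate P(word | prev) by directly counting co-occurrences."""
    pairs = sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (prev, word))
    prev_count = tokens[:-1].count(prev)
    return pairs / prev_count if prev_count else 0.0
```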
in an alternative embodiment, training the acoustic model according to the feature vector list includes: counting the categories and occurrence frequencies of the character vectors and/or word vectors in the feature vector list; determining a Huffman tree according to those categories and occurrence frequencies; and training the acoustic model according to the Huffman tree.
The present disclosure constructs a Huffman tree from the categories and occurrence frequencies of the character vectors and/or word vectors. In this way, the number of decisions required to output different character/word vectors (categories) differs: the more frequently a vector occurs, the closer it sits to the root node and the fewer decisions it requires, which makes the final overall decision process more efficient. The Huffman tree is a special binary tree, the binary tree with the shortest weighted path length, also called an optimal binary tree: given N weights as the weights of N leaf nodes, a binary tree constructed so that its weighted path length is minimal is called a Huffman tree.
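A minimal sketch of the Huffman-tree construction described above, assuming string category labels and integer frequencies; because higher-frequency categories sit closer to the root, they receive shorter codes:

```python
import heapq
import itertools

def build_huffman(freqs):
    """freqs: dict mapping a category label (e.g. a character or word) to
    its occurrence frequency. Returns a dict mapping each category to its
    binary Huffman code."""
    counter = itertools.count()  # tie-breaker so heapq never compares nodes
    heap = [(freq, next(counter), label) for label, freq in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))
    codes = {}
    def walk(node, prefix=""):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a category label
            codes[node] = prefix or "0"
    walk(heap[0][2])
    return codes

# e.g. build_huffman({"今天": 10, "天气": 7, "怎么样": 3}) gives the most
# frequent entry the shortest code.
```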
In an alternative embodiment, training the acoustic model according to the Huffman tree includes: determining vector adjacency probabilities according to the Huffman tree, where a vector adjacency probability is the probability that a given character vector or word vector is followed by another character vector or word vector; normalizing the vector adjacency probabilities with a normalized exponential function; and training the acoustic model according to the normalized vector adjacency probabilities.
All leaf nodes of the Huffman tree are traversed to determine the vector adjacency probabilities, i.e., the probability that one character vector is followed by another character vector, the probability that a character vector is followed by a word vector, and the probability that one word vector is followed by another word vector; from the vector adjacency probabilities one can judge which character or word is most likely to follow a given character or word. The vector adjacency probabilities are then normalized with a normalized exponential function, namely the softmax function.
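A small sketch of the softmax normalization mentioned above, applied to raw adjacency scores (the scores themselves are made-up numbers):

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    z = scores - scores.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Raw adjacency scores for three candidate successors of some vector:
adjacency = softmax(np.array([5.0, 2.0, 0.5]))
# adjacency sums to 1; entry i is the normalized probability that the
# vector is followed by candidate i.
```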
There are several alternative embodiments of step S206. In one of them, decoding the probability matrix with the lossy acoustic model and the decoder to obtain the recognized audio includes: performing a first decoding operation on the probability matrix with the decoder to obtain a plurality of decoding results, where the decoding operation includes the first decoding operation; calculating, according to the vector adjacency probabilities, the decoding error probability of each of the plurality of decoding results; and determining the decoding result corresponding to the largest of the decoding error probabilities as the recognized audio.
The decoding operation includes decoding the probability matrix with the lossy acoustic model and the decoder, decoding the probability matrix with the decoder alone, and decoding the probability matrix with the lossy acoustic model alone. The first decoding operation is decoding the probability matrix with the decoder, and the decoding operation includes the first decoding operation. Performing the first decoding operation on the probability matrix with the decoder yields a plurality of decoding results. The decoding error probability of each decoding result is then calculated from the vector adjacency probabilities, and the decoding result corresponding to the largest of the decoding error probabilities is determined as the recognized audio. The lossy acoustic model holds the vector adjacency probabilities.
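One possible reading of this selection step, sketched under stated assumptions: candidates are token sequences, `adj_prob` is an assumed lookup of the stored vector adjacency probability for a token pair, and ranking candidates by summed log adjacency probability before keeping the top-scoring one is an interpretation, not the patent's literal formula:

```python
import math

def score_result(tokens, adj_prob):
    """Sum of log adjacency probabilities over consecutive token pairs."""
    return sum(math.log(adj_prob(a, b) + 1e-12)
               for a, b in zip(tokens, tokens[1:]))

def pick_result(candidates, adj_prob):
    """Return the candidate decoding result with the highest score."""
    return max(candidates, key=lambda tokens: score_result(tokens, adj_prob))
```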
In another alternative embodiment of step S206, decoding the probability matrix with the lossy acoustic model and the decoder to obtain the recognized audio includes: performing, with the decoder, a first decoding operation on a preset portion of the probability matrix to obtain a target decoding result, where the decoding operation includes the first decoding operation and the target decoding result is the decoding result of that first decoding operation; and performing, with the decoder, the first decoding operation multiple times on the remaining portion of the probability matrix outside the preset portion, correcting the decoding result of each first decoding operation according to the vector adjacency probabilities to obtain the recognized audio.
The decoder performs the first decoding operation on a preset portion of the probability matrix to obtain a target decoding result, where the preset portion is the head of the probability matrix, or a head portion of preset size. The remaining portion is divided into parts of the preset size, the division preserving the original order of the probability matrix. The decoder then performs the first decoding operation on these parts in sequence, and the decoding result of each first decoding operation is corrected according to the vector adjacency probabilities to obtain the recognized audio. For example, if the target decoding result is "today" and the second first decoding operation yields an implausible continuation, but according to the target decoding result and the vector adjacency probabilities "today" is most probably followed by "how is the weather", then the combined result of the first two decoding operations is corrected to "how is the weather today".
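A sketch of this segmented decoding under stated assumptions: `first_decode` and `correct` are hypothetical helpers standing in for the decoder's first decoding operation and the adjacency-probability correction, and the probability matrix is indexable by frame:

```python
def segmented_decode(prob_matrix, preset, first_decode, correct):
    """Decode the preset head portion first, then decode the remaining
    parts of the same size in their original order, correcting each
    partial result against the running context."""
    target = first_decode(prob_matrix[:preset])  # the target decoding result
    recognized = list(target)
    for start in range(preset, len(prob_matrix), preset):
        part = first_decode(prob_matrix[start:start + preset])
        # Revise the partial result (and, if needed, the running context)
        # toward the continuation with the highest adjacency probability.
        recognized = correct(recognized, part)
    return recognized
```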
Alternatively, incremental learning refers to a learning system that can continuously learn new knowledge from new samples while retaining most of the previously learned knowledge. The training set for incremental learning can be divided into $T$ scenarios. Each scenario $t \in \{1, \dots, T\}$ has input samples and their labels $(x_t, y_t)$ belonging to the scenario data set $D_t$. The incremental model $f$ with parameters $\theta$ is trained once on the training set of one or more scenarios. When the model is used for inference, the output probability matrix is $h_\theta(x)$, representing the probability distribution over the categories, and the final output is
$$f_\theta(x) = \operatorname{argmax}\bigl(\operatorname{softmax}(h_\theta(x))\bigr),$$
where $f_\theta(x)$ is the finally output inference result of the model. The goal of incremental learning is to correctly predict, at any stage of the training process, the samples $(x_t, y_t)$ of every already-trained scenario $t \in \{t_1, \dots, t_C\}$. The training goal can be expressed as
$$\theta^{*} = \operatorname*{argmin}_{\theta} \sum_{c=1}^{C} L_{t_c}, \qquad L_t = \mathbb{E}_{(x_t, y_t) \sim D_t}\bigl[L\bigl(f_\theta(x_t), y_t\bigr)\bigr].$$
The above training goal can be understood as a constraint equation: $c$ numbers the training scenarios (here from $1$ to $C$), $t_c$ denotes the $c$-th scenario, $L_t$ denotes the loss of the model in scenario $t$, and $L$ denotes the loss function of the model. The loss $L_t$ is a mathematical expectation computed over the data set $D_t$, which can be understood as a mean or a sum; argmin returns the argument minimizing its operand, and softmax is the normalized exponential function, which is not described in detail here.
In order to better understand the technical solutions, the embodiments of the present disclosure also provide an alternative embodiment for explaining the technical solutions.
Fig. 3 schematically shows a flowchart of a compressed audio recognition method according to an embodiment of the present disclosure, as shown in fig. 3:
Feature extraction: a discrete Fourier transform is performed on the compressed audio to extract the spectral features corresponding to the compressed audio;
Acoustic model encoding: the spectral features are encoded with a lossy acoustic model to obtain a probability matrix corresponding to the spectral features;
Decoder decoding: the probability matrix is decoded with the lossy acoustic model and a decoder to obtain the recognized audio.
Fig. 4 schematically illustrates a flow chart of acoustic model training according to an embodiment of the present disclosure, as shown in fig. 4:
S402, obtaining common everyday text data and/or common professional text data;
S404, performing a text-to-audio operation on the obtained text data to obtain training lossless audio;
S406, performing incremental learning training on the original acoustic model with the training lossless audio to obtain a lossless acoustic model;
S408, performing lossy compression on the training lossless audio to obtain training lossy audio;
S410, performing incremental learning training on the lossless acoustic model with the training lossy audio to obtain the lossy acoustic model.
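For illustration, the training flow S402 to S410 can be sketched as a function of pluggable helpers; `text_to_audio`, `lossy_compress` and `incremental_train` are assumptions, not components named in the patent:

```python
def build_lossy_acoustic_model(text_corpus, original_model,
                               text_to_audio, lossy_compress, incremental_train):
    """S402-S410 with assumed helpers: text_to_audio(text) -> waveform,
    lossy_compress(waveform) -> waveform, and
    incremental_train(model, audio_list) -> model."""
    lossless_audio = [text_to_audio(t) for t in text_corpus]            # S402-S404
    lossless_model = incremental_train(original_model, lossless_audio)  # S406
    lossy_audio = [lossy_compress(a) for a in lossless_audio]           # S408
    return incremental_train(lossless_model, lossy_audio)               # S410
```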
In the embodiments of the present disclosure, compressed audio is obtained and a discrete Fourier transform is performed on it to obtain the corresponding spectral features; the spectral features are encoded with a lossy acoustic model to obtain the corresponding probability matrix; and the probability matrix is decoded with the lossy acoustic model and a decoder to obtain the recognized audio. These technical means solve the prior-art problem that the recognition accuracy of speech recognition models on lossy compressed audio is low, thereby improving the recognition accuracy on lossy compressed audio.
From the description of the above embodiments, those skilled in the art can clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though the former is in many cases the better implementation. Based on this understanding, the technical solution of the present disclosure, or the part of it contributing over the prior art, can be embodied as a software product stored in a storage medium (such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc) and containing several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the method of the embodiments of the present disclosure.
In this embodiment, a compressed audio recognition apparatus is also provided. The apparatus is used to implement the above embodiments and preferred embodiments; descriptions already given are not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 5 is a block diagram schematically illustrating a compressed audio recognition apparatus according to an alternative embodiment of the present disclosure, and as shown in fig. 5, the apparatus includes:
an obtaining module 502, configured to obtain compressed audio and perform a discrete Fourier transform on it to obtain the spectral features corresponding to the compressed audio;
an encoding module 504, configured to encode the spectral features with a lossy acoustic model to obtain a probability matrix corresponding to the spectral features;
a decoding module 506, configured to decode the probability matrix with the lossy acoustic model and the decoder to obtain the recognized audio.
In the embodiments of the present disclosure, compressed audio is obtained and a discrete Fourier transform is performed on it to obtain the corresponding spectral features; the spectral features are encoded with a lossy acoustic model to obtain the corresponding probability matrix; and the probability matrix is decoded with the lossy acoustic model and a decoder to obtain the recognized audio. These technical means solve the prior-art problem that the recognition accuracy of speech recognition models on lossy compressed audio is low, thereby improving the recognition accuracy on lossy compressed audio.
Optionally, the encoding module 504 is further configured to obtain training lossless audio and perform incremental learning training on the original acoustic model with it to obtain a lossless acoustic model; perform lossy compression on the training lossless audio to obtain training lossy audio; and perform incremental learning training on the lossless acoustic model with the training lossy audio to obtain the lossy acoustic model.
Incremental learning is a type of deep learning that closely resembles the way humans learn: people learn and absorb new things every day as they grow, learning is gradual, and previously acquired knowledge is usually not forgotten. For traditional deep learning techniques, extracting useful information from new data and new scenarios is a hard problem. As the scale of data grows, the demands on time and space increase rapidly, eventually causing the learning rate to fall behind the data update rate. To adapt quickly to a new scenario, a traditional deep learning method must add the new training data to the original training data and retrain, a process with a huge time cost; and the more data accumulates, the higher that cost becomes. Compared with conventional deep learning, incremental learning has significant advantages in two respects: on the one hand, since incremental learning does not need to store the historical training data, it reduces storage-space usage; on the other hand, incremental training on a new scenario makes full use of the historical information of old scenarios, greatly reducing subsequent training time.
Existing deep-learning-based incremental learning methods fall mainly into three categories, distinguished by how old-scenario data is used and how the model preserves old-scenario knowledge: (1) incremental learning without old data, which freezes the last layer of the deep learning model and minimizes the variation of the shared parameters in the feature extraction layers, although these parameters may conflict between old and new scenarios; (2) incremental learning with synthetic data, which uses a generative model to synthesize data from old scenarios but cannot effectively synthesize complex scenario data; and (3) incremental learning with old data, which selects a small number of samples from the training data of old scenarios, retains the classifiers of all incremental steps, and uses them for distillation. The incremental learning method used in the embodiments of the present disclosure is mainly incremental learning without old data.
Optionally, the encoding module 504 is further configured to obtain training lossless audio and perform lossy compression on it to obtain training lossy audio, and to perform incremental learning training on an original acoustic model with the training lossy audio to obtain the lossy acoustic model.
The lossy acoustic model is used to recognize the compressed audio; the compressed audio is the object of recognition and may be either lossy or lossless.
Optionally, the encoding module 504 is further configured to obtain common everyday text data and/or common professional text data; perform a text-to-audio operation on the obtained text data to obtain training lossless audio; and store the training lossless audio in an audio data center.
The common everyday text data is text frequently used in daily life, such as the text of news broadcasts. The common professional text data is text from the technical field in which the acoustic model is typically used; for example, if the acoustic model is used for audio recognition in the computer field, the common professional text data is professional text from the computer field.
Optionally, the encoding module 504 is further configured to input training audio into an embedding layer and output a plurality of character vectors and/or word vectors through the embedding layer, where the training audio includes the training lossless audio and the training lossy audio, and the original acoustic model, the lossless acoustic model and the lossy acoustic model all include the embedding layer; to perform statistical processing on the character vectors and/or word vectors with a statistical language algorithm to obtain a feature vector list; and to train an acoustic model according to the feature vector list, where the acoustic model includes the original acoustic model, the lossless acoustic model and the lossy acoustic model.
The Embedding layer converts discrete variables into continuous vectors; using it yields the corresponding vector matrix. The embodiments of the present disclosure obtain a plurality of character vectors and/or word vectors through the embedding layer. The incremental learning training can train the original acoustic model to obtain the lossless acoustic model, and train the lossless acoustic model to obtain the lossy acoustic model. Performing statistical processing on the character vectors and/or word vectors with a statistical language algorithm yields a feature vector list, i.e., features are extracted from the character vectors and/or word vectors. When the lossless acoustic model performs feature extraction, the spectrum of lossless compressed audio is complete, while the spectrum of lossy compressed audio suffers heavy feature loss; this is because the original acoustic model, trained only on lossless compressed audio, cannot efficiently encode the spectral features of lossy compressed audio. Therefore, lossy compressed audio is used to form a training set with which the lossless acoustic model is trained to obtain the lossy acoustic model, strengthening the lossy acoustic model's ability to encode the features of lossy compressed audio while retaining, to the greatest extent, its ability to encode the features of lossless audio.
The character vectors and/or word vectors are statistically processed with a statistical language algorithm to obtain the feature vector list. The statistical language algorithm may be the N-Gram, an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the text content, producing a sequence of fragments of length N. Each fragment is called a gram; the occurrence frequency of all grams is counted and filtered against a preset threshold to form a key-gram list, which is the vector feature space of the text, each gram in the list being one feature-vector dimension. The algorithm rests on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no others, so the probability of a complete sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting in the corpus how many times the N words occur together. Binary Bi-grams and ternary Tri-grams are commonly used. By exploiting collocation information between adjacent words in context, the N-Gram can, for example, automatically convert input into Chinese characters.
optionally, the encoding module 504 is further configured to count the categories and occurrence frequencies of the character vectors and/or word vectors in the feature vector list; determine a Huffman tree according to those categories and occurrence frequencies; and train the acoustic model according to the Huffman tree.
The present disclosure constructs a Huffman tree from the categories and occurrence frequencies of the character vectors and/or word vectors. In this way, the number of decisions required to output different character/word vectors (categories) differs: the more frequently a vector occurs, the closer it sits to the root node and the fewer decisions it requires, which makes the final overall decision process more efficient. The Huffman tree is a special binary tree, the binary tree with the shortest weighted path length, also called an optimal binary tree: given N weights as the weights of N leaf nodes, a binary tree constructed so that its weighted path length is minimal is called a Huffman tree.
Optionally, the encoding module 504 is further configured to determine vector adjacency probabilities according to the Huffman tree, where a vector adjacency probability is the probability that a given character vector or word vector is followed by another character vector or word vector; normalize the vector adjacency probabilities with a normalized exponential function; and train the acoustic model according to the normalized vector adjacency probabilities.
All leaf nodes of the Huffman tree are traversed to determine the vector adjacency probabilities, i.e., the probability that one character vector is followed by another character vector, the probability that a character vector is followed by a word vector, and the probability that one word vector is followed by another word vector; from the vector adjacency probabilities one can judge which character or word is most likely to follow a given character or word. The vector adjacency probabilities are then normalized with a normalized exponential function, namely the softmax function.
Optionally, the decoding module 506 is further configured to perform a first decoding operation on the probability matrix with the decoder to obtain a plurality of decoding results, where the decoding operation includes the first decoding operation; to calculate, according to the vector adjacency probabilities, the decoding error probability of each of the plurality of decoding results; and to determine the decoding result corresponding to the largest of the decoding error probabilities as the recognized audio.
The decoding operation includes decoding the probability matrix with the lossy acoustic model and the decoder, decoding the probability matrix with the decoder alone, and decoding the probability matrix with the lossy acoustic model alone. The first decoding operation is decoding the probability matrix with the decoder, and the decoding operation includes the first decoding operation. Performing the first decoding operation on the probability matrix with the decoder yields a plurality of decoding results. The decoding error probability of each decoding result is then calculated from the vector adjacency probabilities, and the decoding result corresponding to the largest of the decoding error probabilities is determined as the recognized audio. The lossy acoustic model holds the vector adjacency probabilities.
Optionally, the decoding module 506 is further configured to perform, with the decoder, a first decoding operation on a preset portion of the probability matrix to obtain a target decoding result, where the decoding operation includes the first decoding operation and the target decoding result is the decoding result of that first decoding operation; and to perform, with the decoder, the first decoding operation multiple times on the remaining portion of the probability matrix outside the preset portion, correcting the decoding result of each first decoding operation according to the vector adjacency probabilities to obtain the recognized audio.
The decoder performs the first decoding operation on a preset portion of the probability matrix to obtain a target decoding result, where the preset portion is the head of the probability matrix, or a head portion of preset size. The remaining portion is divided into parts of the preset size, the division preserving the original order of the probability matrix. The decoder then performs the first decoding operation on these parts in sequence, and the decoding result of each first decoding operation is corrected according to the vector adjacency probabilities to obtain the recognized audio. For example, if the target decoding result is "today" and the second first decoding operation yields an implausible continuation, but according to the target decoding result and the vector adjacency probabilities "today" is most probably followed by "how is the weather", then the combined result of the first two decoding operations is corrected to "how is the weather today".
Optionally, incremental learning refers to a learning system that can continuously learn new knowledge from new samples while retaining most of the knowledge learned previously. The training set for incremental learning can be divided into $T$ scenarios; for each scenario $t \in \{1, \dots, T\}$, the input samples and their labels $(x_t, y_t)$ belong to that scenario's data set $D_t$. The incremental training model $f$ with model parameters $\theta$ is trained once on the training set of one or more scenarios. When the model is used for inference, the output probability matrix is $h_\theta(x)$, and the probability distribution over the categories is:

$$f_\theta(x) = \operatorname{softmax}\big(h_\theta(x)\big)$$

where $f_\theta(x)$ is the inference result finally output by the model. The goal of incremental learning is to correctly predict, at any stage of the training process, the samples $(x_t, y_t)$ of the already trained scenarios $t \in \{1, \dots, t_C\}$. The training goal can be expressed as:

$$\theta^{\ast} = \operatorname*{argmin}_{\theta} \sum_{c=1}^{C} L_{t_c}(f_\theta)$$

where $c$ indexes the training scenarios (here from 1 to $C$), $t_c$ denotes the $c$-th scenario, $L_t$ denotes the loss of the model in scenario $t$, and $L$ denotes the loss function of the model:

$$L_t(f_\theta) = \mathbb{E}_{(x_t, y_t) \sim D_t}\big[L\big(f_\theta(x_t), y_t\big)\big]$$

that is, the loss $L_t$ is a mathematical expectation computed over the data set $D_t$, which in practice can be taken as a mean or a sum. $\operatorname{argmin}$ returns the argument that minimizes its objective, and $\operatorname{softmax}$ is the normalized exponential function; neither is described in further detail here.
It should be noted that the above modules may be implemented by software or by hardware; in the latter case this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor, or the modules are located in different processors in any combination.
Embodiments of the present disclosure provide an electronic device.
Fig. 6 schematically shows a block diagram of an electronic device provided in an embodiment of the present disclosure.
Referring to fig. 6, an electronic device 600 provided in the embodiment of the present disclosure includes a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 communicate with one another through the communication bus 604; the memory 603 is configured to store a computer program; and the processor 601 is configured to implement the steps in any of the above method embodiments when executing the program stored in the memory 603.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute, by means of a computer program, the following steps:
S1, acquiring compressed audio, and performing a discrete Fourier transform on the compressed audio to obtain spectral features corresponding to the compressed audio;
S2, performing an encoding operation on the spectral features through a lossy acoustic model to obtain a probability matrix corresponding to the spectral features;
and S3, decoding the probability matrix through the lossy acoustic model and the decoder to obtain the identification audio.
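A compact sketch of steps S1 to S3 under assumed interfaces (the frame and hop sizes, `acoustic_model.encode`, and `decoder.decode` are illustrative stand-ins, not the disclosure's API):

```python
import numpy as np

def spectral_features(audio: np.ndarray, frame: int = 400, hop: int = 160) -> np.ndarray:
    """S1: frame the decompressed waveform and take a per-frame discrete
    Fourier transform, yielding a (num_frames, frame // 2 + 1) magnitude
    spectrogram."""
    assert len(audio) >= frame, "waveform shorter than one frame"
    starts = range(0, len(audio) - frame + 1, hop)
    frames = np.stack([audio[i:i + frame] for i in starts])
    return np.abs(np.fft.rfft(frames, axis=1))

def recognize(audio, acoustic_model, decoder):
    feats = spectral_features(audio)            # S1: spectral features
    prob_matrix = acoustic_model.encode(feats)  # S2: lossy acoustic model encodes
    return decoder.decode(prob_matrix)          # S3: decode the probability matrix
```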
Embodiments of the present disclosure also provide a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of any of the method embodiments described above.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1, acquiring compressed audio, and performing a discrete Fourier transform on the compressed audio to obtain spectral features corresponding to the compressed audio;
S2, performing an encoding operation on the spectral features through a lossy acoustic model to obtain a probability matrix corresponding to the spectral features;
and S3, decoding the probability matrix through the lossy acoustic model and the decoder to obtain the identification audio.
The computer-readable storage medium may be contained in the apparatus/device described in the above embodiments; or may be present alone without being assembled into the device/apparatus. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present disclosure described above may be implemented by a general-purpose computing device, and they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; in some cases the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately as individual integrated circuit modules, or multiple of them may be fabricated as a single integrated circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A compressed audio recognition method, comprising:
obtaining compressed audio, and performing discrete Fourier transform on the compressed audio to obtain a frequency spectrum characteristic corresponding to the compressed audio;
coding the spectrum characteristics through a lossy acoustic model to obtain a probability matrix corresponding to the spectrum characteristics;
and decoding the probability matrix through the lossy acoustic model and a decoder to obtain the identification audio.
2. The method of claim 1, wherein before the operation of encoding the spectral feature by the lossy acoustic model to obtain the probability matrix corresponding to the spectral feature, the method further comprises:
acquiring training lossless audio, and performing incremental learning training on an original acoustic model through the training lossless audio to obtain a lossless acoustic model;
carrying out lossy compression processing on the training lossless audio to obtain a training lossy audio;
and performing incremental learning training on the lossless acoustic model through the training lossy audio to obtain the lossy acoustic model.
3. The method of claim 2, wherein prior to obtaining training lossless audio, the method further comprises:
acquiring common everyday text data and/or common professional text data;
performing a text-to-audio operation on the acquired common everyday text data and/or common professional text data to obtain the training lossless audio;
storing the training lossless audio in an audio data center.
4. The method of claim 2, wherein the incremental learning training comprises:
inputting training audio into an embedding layer and outputting a plurality of character vectors and/or a plurality of word vectors through the embedding layer, wherein the training audio comprises: the training lossless audio and the training lossy audio, and the original acoustic model, the lossless acoustic model, and the lossy acoustic model all comprise the embedding layer;
carrying out statistical processing on the character vectors and/or the word vectors through a statistical language algorithm to obtain a feature vector list;
training an acoustic model according to the feature vector list, wherein the acoustic model comprises: the original acoustic model, the lossless acoustic model, and the lossy acoustic model.
5. The method of claim 4, wherein the training an acoustic model from the list of feature vectors comprises:
counting the categories and occurrence frequencies of the character vectors and/or the word vectors in the feature vector list;
determining a Huffman tree according to the categories and the occurrence frequencies of the character vectors and/or the word vectors;
and training the acoustic model according to the Huffman tree.
6. The method of claim 5, wherein the training the acoustic model according to the Huffman tree comprises:
determining a vector adjacency probability according to the Huffman tree, wherein the vector adjacency probability comprises: the probability of one character vector being followed by another character vector, and the probability of one word vector being followed by another word vector;
normalizing the vector adjacency probability through a normalized exponential function;
and training the acoustic model according to the normalized vector adjacency probability.
7. The method of claim 1, wherein the decoding, by the lossy acoustic model and decoder, the probability matrix to obtain the identified audio comprises:
performing a first decoding operation on the probability matrix by the decoder to obtain a plurality of decoding results, wherein the decoding operation comprises the first decoding operation;
respectively calculating the decoding error probability of each of the plurality of decoding results according to the vector adjacency probability, wherein the vector adjacency probability comprises: the probability of one character vector being followed by another character vector, and the probability of one word vector being followed by another word vector;
and determining the decoding result corresponding to the decoding error probability with the largest value among the plurality of decoding error probabilities as the identification audio.
8. The method of claim 1, wherein the decoding, by the lossy acoustic model and decoder, the probability matrix to obtain the identified audio comprises:
performing, by the decoder, a first decoding operation on a preset portion of the probability matrix to obtain a target decoding result, wherein the decoding operation comprises the first decoding operation;
performing, by the decoder, the first decoding operation multiple times on the remaining part of the probability matrix outside the preset portion, and, when performing the first decoding operation on the remaining part multiple times, correcting the decoding result of each first decoding operation according to the vector adjacency probability to obtain the identification audio, wherein the vector adjacency probability comprises: the probability of one character vector being followed by another character vector, and the probability of one word vector being followed by another word vector, and the target decoding result is the decoding result of the first of the first decoding operations.
9. A compressed audio recognition apparatus, comprising:
the acquisition module is used for acquiring compressed audio and performing discrete Fourier transform on the compressed audio to obtain spectral characteristics corresponding to the compressed audio;
the coding module is used for coding the frequency spectrum characteristics through a lossy acoustic model so as to obtain a probability matrix corresponding to the frequency spectrum characteristics;
and the decoding module is used for decoding the probability matrix through the lossy acoustic model and the decoder so as to obtain the identification audio.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 8.
CN202110887281.7A 2021-08-03 2021-08-03 Compressed audio identification method, device and storage medium Active CN113555034B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110887281.7A CN113555034B (en) 2021-08-03 2021-08-03 Compressed audio identification method, device and storage medium


Publications (2)

Publication Number Publication Date
CN113555034A true CN113555034A (en) 2021-10-26
CN113555034B CN113555034B (en) 2024-03-01

Family

ID=78133658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110887281.7A Active CN113555034B (en) 2021-08-03 2021-08-03 Compressed audio identification method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113555034B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1569203A2 (en) * 2004-02-27 2005-08-31 Samsung Electronics Co., Ltd. Lossless audio decoding/encoding method and apparatus
US20050192799A1 (en) * 2004-02-27 2005-09-01 Samsung Electronics Co., Ltd. Lossless audio decoding/encoding method, medium, and apparatus
CN1684371A (en) * 2004-02-27 2005-10-19 三星电子株式会社 Lossless audio decoding/encoding method and apparatus
CN105070299A (en) * 2015-07-01 2015-11-18 浙江天格信息技术有限公司 Hi-Fi tone quality identifying method based on pattern recognition
CN107895571A (en) * 2016-09-29 2018-04-10 亿览在线网络技术(北京)有限公司 Lossless audio file identification method and device
CN109147805A (en) * 2018-06-05 2019-01-04 安克创新科技股份有限公司 Audio sound quality enhancing based on deep learning
WO2020258661A1 (en) * 2019-06-26 2020-12-30 平安科技(深圳)有限公司 Speaking person separation method and apparatus based on recurrent neural network and acoustic features
CN113096670A (en) * 2021-03-30 2021-07-09 北京字节跳动网络技术有限公司 Audio data processing method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨新辉; 舒海燕; 曲天书; 张涛; 窦维蓓: "从有损到无损的音频编解码框架" ("An Audio Codec Framework from Lossy to Lossless"), 电声技术 (Audio Engineering), no. 12



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 601, 6 / F, building 2, No. 18, Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Jingdong Technology Information Technology Co.,Ltd.

Address before: 601, 6 / F, building 2, No. 18, Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant before: Jingdong Shuke Haiyi Information Technology Co.,Ltd.

GR01 Patent grant