CN113555034B

CN113555034B - Compressed audio identification method, device and storage medium

Info

Publication number: CN113555034B
Application number: CN202110887281.7A
Authority: CN
Inventors: 王润宇; 付立; 资礼波; 李萧萧
Original assignee: Jingdong Technology Information Technology Co Ltd
Current assignee: Jingdong Technology Information Technology Co Ltd
Priority date: 2021-08-03
Filing date: 2021-08-03
Publication date: 2024-03-01
Anticipated expiration: 2041-08-03
Also published as: CN113555034A

Abstract

The present disclosure relates to a compressed audio recognition method, apparatus, and storage medium, the method comprising: obtaining compressed audio, and performing discrete Fourier transform on the compressed audio to obtain frequency spectrum characteristics corresponding to the compressed audio; performing coding operation on the spectrum characteristics through a lossy acoustic model to obtain a probability matrix corresponding to the spectrum characteristics; and decoding the probability matrix through the lossy acoustic model and a decoder to obtain the identification audio. By adopting the technical means, the problems that in the prior art, the recognition accuracy of a voice recognition model on lossy compressed audio is low and the like are solved.

Description

Compressed audio identification method, device and storage medium

Technical Field

The present disclosure relates to the field of communications, and in particular, to a method, an apparatus, and a storage medium for identifying compressed audio.

Background

With the progress of technology, intelligent speech recognition is widely used in the scenes of intelligent hardware, speech input, call centers and the like. However, most of the voice recognition systems are deployed based on the cloud, and users need to upload the audio collected by the device to the cloud system for processing. At present, most of audio uploaded by users is in a lossless compression audio file format, and the lossless compression audio file is characterized by undistorted signals, high speech recognition accuracy, large file, pressure on a network and network service breakdown when the network load is large. Therefore, during uploading, the method is preferably to use a lossy compression format for transmission, and the method is characterized in that the compression ratio of the audio file is high, and the minimum compressed file is 10% of the original file size. However, the problem of feature missing is caused by lossy compression, and the existing speech recognition model is usually trained on a lossless audio training set, so that the recognition accuracy of the speech recognition model on lossy compressed audio is low.

In the process of implementing the disclosed concept, the inventor finds that at least the following technical problems exist in the related art: and the recognition accuracy of the voice recognition model on the lossy compressed audio is low.

Disclosure of Invention

In order to solve the above technical problems or at least partially solve the above technical problems, embodiments of the present disclosure provide a method, an apparatus, and a storage medium for identifying compressed audio, so as to at least solve the problems in the prior art that the accuracy of identifying lossy compressed audio by a speech recognition model is low.

The aim of the disclosure is achieved by the following technical scheme:

in a first aspect, embodiments of the present disclosure provide a compressed audio recognition method, including: obtaining compressed audio, and performing discrete Fourier transform on the compressed audio to obtain frequency spectrum characteristics corresponding to the compressed audio; performing coding operation on the spectrum characteristics through a lossy acoustic model to obtain a probability matrix corresponding to the spectrum characteristics; and decoding the probability matrix through the lossy acoustic model and a decoder to obtain the identification audio.

In an exemplary embodiment, before the encoding operation is performed on the spectral features through the lossy acoustic model to obtain the probability matrix corresponding to the spectral features, the method further includes: acquiring training lossless audio, and performing incremental learning training on an original acoustic model through the training lossless audio to obtain a lossless acoustic model; carrying out lossy compression processing on the training lossless audio to obtain training lossy audio; and performing incremental learning training on the lossless acoustic model through the training lossy audio to obtain the lossy acoustic model.

In one exemplary embodiment, before acquiring the training lossless audio, the method further comprises: acquiring life common text data and/or professional common text data; performing text-to-audio operation on the acquired life common text data and/or the professional common text data to obtain training lossless audio; the training lossless audio is stored in an audio data center.

In one exemplary embodiment, the incremental learning training includes: inputting training audio into an embedding layer and outputting a plurality of word vectors and/or a plurality of word vectors through the embedding layer, wherein the training audio comprises: the training lossless audio and the training lossy audio, the original acoustic model, the lossless acoustic model, and the lossy acoustic model each comprising the embedded layer; carrying out statistical processing on the plurality of word vectors and/or the plurality of word vectors through a statistical language algorithm to obtain a feature vector list; training an acoustic model according to the feature vector list, wherein the acoustic model comprises: the original acoustic model, the lossless acoustic model, and the lossy acoustic model.

In an exemplary embodiment, the training the acoustic model according to the feature vector list includes: counting categories and occurrence frequencies of the plurality of word vectors and/or the plurality of word vectors in the feature vector list; determining a Huffman tree according to the plurality of word vectors and/or the categories and the occurrence frequencies of the plurality of word vectors; and training the acoustic model according to the Huffman tree.

In one exemplary embodiment, the training the acoustic model according to the huffman tree comprises: determining a vector adjacent probability according to the Huffman tree, wherein the vector adjacent probability comprises: the probability of one word vector followed by another word vector, and the probability of one word vector followed by one word vector; normalizing the vector adjacent probability through a normalization exponential function; and training the acoustic model according to the vector adjacent probability after normalization processing.

In one exemplary embodiment, the decoding of the probability matrix by the lossy acoustic model and decoder to obtain the identified audio comprises: performing a first decoding operation on the probability matrix by the decoder to obtain a plurality of decoding results, wherein the decoding operation comprises the first decoding operation; calculating a decoding error probability of each decoding result of the plurality of decoding results according to a vector adjacent probability, wherein the vector adjacent probability comprises: the probability of one word vector followed by another word vector, and the probability of one word vector followed by one word vector; and determining a decoding result corresponding to the decoding error probability with the largest numerical value in the decoding error probabilities as the identification audio.

In one exemplary embodiment, the decoding of the probability matrix by the lossy acoustic model and decoder to obtain the identified audio comprises: performing a first decoding operation on a preset part in the probability matrix through the decoder to obtain a target decoding result, wherein the decoding operation comprises the first decoding operation; performing, by the decoder, the first decoding operation for a plurality of times on a remaining portion except for a preset portion in the probability matrix, and performing, when the first decoding operation is performed for a plurality of times on the remaining portion, correction processing on a decoding result of each of the first decoding operations according to vector adjacent probabilities, to obtain the identified audio, where the vector adjacent probabilities include: the target decoding result is the decoding result of the first decoding operation.

In a second aspect, embodiments of the present disclosure provide a compressed audio recognition apparatus, comprising: the acquisition module is used for acquiring compressed audio and performing discrete Fourier transform on the compressed audio to obtain frequency spectrum characteristics corresponding to the compressed audio; the coding module is used for carrying out coding operation on the spectrum characteristics through the lossy acoustic model so as to obtain a probability matrix corresponding to the spectrum characteristics; and the decoding module is used for decoding the probability matrix through the lossy acoustic model and a decoder so as to obtain the identification audio.

In a third aspect, embodiments of the present disclosure provide an electronic device. The electronic equipment comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus; a memory for storing a computer program; and a processor for implementing the compressed audio recognition method or the image processing method as described above when executing the program stored on the memory.

In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium. The computer readable storage medium stores a computer program which, when executed by a processor, implements the compressed audio recognition method or the image processing method as described above.

Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has at least part or all of the following advantages: obtaining compressed audio, and performing discrete Fourier transform on the compressed audio to obtain frequency spectrum characteristics corresponding to the compressed audio; performing coding operation on the spectrum characteristics through a lossy acoustic model to obtain a probability matrix corresponding to the spectrum characteristics; and decoding the probability matrix through the lossy acoustic model and a decoder to obtain the identification audio. By adopting the technical means, the problems that in the prior art, the recognition accuracy of the voice recognition model on the lossy compressed audio is low and the like are solved, so that the recognition accuracy of the lossy compressed audio is improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the related art will be briefly described below, and it will be apparent to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

Fig. 1 schematically illustrates a hardware block diagram of a computer terminal of a compressed audio recognition method according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of a method of compressed audio recognition according to an embodiment of the present disclosure;

FIG. 3 schematically illustrates a flow diagram of a method of compressed audio recognition according to an embodiment of the present disclosure;

FIG. 4 schematically illustrates a flow diagram of acoustic model training in accordance with an embodiment of the present disclosure;

FIG. 5 schematically illustrates a block diagram of a compressed audio recognition device according to an embodiment of the present disclosure;

fig. 6 schematically illustrates a block diagram of an electronic device according to an embodiment of the disclosure.

Detailed Description

The present disclosure will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other.

It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.

The method embodiments provided by the embodiments of the present disclosure may be performed in a computer terminal or similar computing device. Taking a computer terminal as an example, fig. 1 schematically illustrates a hardware block diagram of a computer terminal of a compressed audio identification method according to an embodiment of the disclosure. As shown in fig. 1, the computer terminal may include one or more (only one is shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a microprocessor (Microprocessor Unit, abbreviated MPU) or a programmable logic device (Programmable logic device, abbreviated PLD)) and the like and a memory 104 for storing data, and optionally, the above-mentioned computer terminal may further include a transmission device 106 for communication functions and an input-output device 108.

The memory 104 may be used to store computer programs, such as software programs and modules of application software, such as computer programs corresponding to the compressed audio recognition method in the embodiments of the present disclosure, and the processor 102 executes the computer programs stored in the memory 104 to perform various functional applications and data processing, i.e., implement the above-described method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.

In the embodiment of the present disclosure, a compressed audio identification method is provided, fig. 2 schematically shows a flowchart of a compressed audio identification method according to an embodiment of the present disclosure, as shown in fig. 2, where the flowchart includes the following steps:

step S202, obtaining compressed audio, and performing discrete Fourier transform on the compressed audio to obtain frequency spectrum characteristics corresponding to the compressed audio;

step S204, performing coding operation on the spectrum features through a lossy acoustic model to obtain probability matrixes corresponding to the spectrum features;

and step S206, decoding the probability matrix through the lossy acoustic model and a decoder to obtain the identification audio.

According to the method and the device, compressed audio is obtained, and discrete Fourier transform is carried out on the compressed audio so as to obtain frequency spectrum characteristics corresponding to the compressed audio; performing coding operation on the spectrum characteristics through a lossy acoustic model to obtain a probability matrix corresponding to the spectrum characteristics; and decoding the probability matrix through the lossy acoustic model and a decoder to obtain the identification audio. By adopting the technical means, the problems that in the prior art, the recognition accuracy of the voice recognition model on the lossy compressed audio is low and the like are solved, so that the recognition accuracy of the lossy compressed audio is improved.

Before performing step S204, that is, before performing an encoding operation on the spectral features by using the lossy acoustic model to obtain a probability matrix corresponding to the spectral features, the method further includes: acquiring training lossless audio, and performing incremental learning training on an original acoustic model through the training lossless audio to obtain a lossless acoustic model; carrying out lossy compression processing on the training lossless audio to obtain training lossy audio; and performing incremental learning training on the lossless acoustic model through the training lossy audio to obtain the lossy acoustic model.

Incremental learning is one of deep learning, and is very similar to the learning model of human beings. Because people learn and accept new things every day during the growth process. Learning is gradual and the knowledge obtained is not usually forgotten by humans. For traditional deep learning techniques, it is a difficult problem to obtain useful information from new data and new scenes. With the increasing size of data, the demands for time and space will also increase rapidly, eventually leading to learning speeds that are not compatible with data update speeds. In order to quickly adapt to a new scene, the traditional deep learning method needs to add new training data into the original training data to perform training again, and the process needs huge time cost. Meanwhile, the more data is accumulated, the higher the time cost. Incremental learning has significant advantages over traditional deep learning in two ways: on one hand, the increment learning does not need to save history training data, so that the occupation of storage space is reduced; on the other hand, in the incremental training of the new scene, the history information of the old scene is fully utilized, so that the time of the subsequent training is greatly reduced.

The existing incremental learning method based on deep learning is mainly divided into three types, and the main differences are the use mode of old scene data and the mode of reserving old scene knowledge by a model: incremental learning of old data is not used. The method freezes the last layer of the deep learning model and minimizes the variation of the shared parameters in the feature extraction layer. However, there may be a conflict in these parameters between the old and new scenes; incremental learning using synthetic data: this method synthesizes data from old scenes using a generative model, but cannot synthesize complex scene data effectively; using incremental learning from old data, the method selects a small number of samples from training data of the old scene, retains the classifier for all incremental steps, and uses it as distillation. The incremental learning method used in the embodiments of the present disclosure is mainly based on incremental learning that does not use old data.

Alternatively, the training of the lossy acoustic model may also be performed by the following method: obtaining training lossless audio, and performing lossy compression processing on the training lossless audio to obtain training lossy audio; and performing incremental learning training on the original acoustic model through the training lossy audio to obtain the lossy acoustic model.

The lossy acoustic model is used to identify compressed audio, which may be either lossy or lossless, as the identified object.

Optionally, before acquiring the training lossless audio, the method further comprises: acquiring life common text data and/or professional common text data; performing text-to-audio operation on the acquired life common text data and/or the professional common text data to obtain training lossless audio; the training lossless audio is stored in an audio data center.

The life general text data is text about texts frequently used in life of people, such as news simulcast text. The professional usual text data is text of a technical field commonly used for an acoustic model, such as an acoustic model used for audio recognition in a computer field, and then the professional usual text data is professional text in the computer field.

In an alternative embodiment, the incremental learning training includes: inputting training audio into an embedding layer and outputting a plurality of word vectors and/or a plurality of word vectors through the embedding layer, wherein the training audio comprises: the training lossless audio and the training lossy audio, the original acoustic model, the lossless acoustic model, and the lossy acoustic model each comprising the embedded layer; carrying out statistical processing on the plurality of word vectors and/or the plurality of word vectors through a statistical language algorithm to obtain a feature vector list; training an acoustic model according to the feature vector list, wherein the acoustic model comprises: the original acoustic model, the lossless acoustic model, and the lossy acoustic model.

The embedded layer refers to an Embedding layer, wherein the Embedding layer is a mode of converting discrete variables into continuous vectors, and a corresponding word vector matrix can be obtained by using the Embedding layer. Embodiments of the present disclosure obtain multiple word vectors and/or multiple word vectors through the use of an embedding layer. The incremental learning training can be used for training the original acoustic model to obtain the lossless acoustic model; training the lossless acoustic model to obtain the lossy acoustic model. And carrying out statistical processing on the plurality of word vectors and/or the plurality of word vectors through a statistical language algorithm to obtain a feature vector list, namely extracting features according to the plurality of word vectors and/or the plurality of word vectors. Feature extraction is performed through the lossless acoustic model, the frequency spectrum of the lossless compression audio is complete, and a large number of feature deletions exist in the frequency spectrum of the lossy compression audio. This is because the original acoustic model cannot efficiently encode the spectral features of lossy compressed audio because it is trained on only lossless compressed audio. Therefore, the invention uses the lossy compressed audio to form a training set, trains the lossless acoustic model to obtain the lossy acoustic model, enhances the coding capability of the lossy acoustic model on the lossy compressed audio characteristics, and simultaneously furthest maintains the coding capability of the lossy acoustic model on the lossless audio characteristics.

And carrying out statistical processing on the plurality of word vectors and/or the plurality of word vectors through a statistical language algorithm to obtain a feature vector list. The statistical language algorithm may be N-Gram, which is an algorithm based on a statistical language model. The basic idea is to perform a sliding window operation of size N on the content in the text according to bytes, forming a sequence of byte fragments of length N. Each byte segment is called a gram, statistics is carried out on the occurrence frequency of all the grams, filtering is carried out according to a preset threshold value, a key gram list, namely a vector feature space of the text, is formed, and each gram in the list is a feature vector dimension. The algorithm is based on the assumption that the occurrence of the nth word is related to only the preceding N-1 words, but not to any other word, and the probability of the whole sentence is the product of the occurrence probabilities of the words. These probabilities can be obtained by directly counting the number of simultaneous occurrences of N words from the corpus. Binary Bi-Gram and ternary Tri-Gram are commonly used. The N-Gram can utilize collocation information between adjacent words in the context to realize automatic conversion to Chinese characters,

in an alternative embodiment, training the acoustic model according to the feature vector list includes: counting categories and occurrence frequencies of the plurality of word vectors and/or the plurality of word vectors in the feature vector list; determining a Huffman tree according to the plurality of word vectors and/or the categories and the occurrence frequencies of the plurality of word vectors; and training the acoustic model according to the Huffman tree.

Embodiments of the present disclosure construct a huffman tree using the plurality of word vectors and/or the categories and frequency of occurrence of the plurality of word vectors. Thus, when different word vectors and/or word vectors (classes) are output, the number of determinations required is actually different. The more frequently occurring word vectors and/or word vectors, the closer to the root node, and the fewer the number of determinations required. Thereby the final overall judging efficiency is higher. The huffman tree is a special binary tree, which is the binary tree with the shortest weighted path length, also called the optimal binary tree. And setting N weights as weights of N leaf nodes of the binary tree, constructing a binary tree, and if the weighted path length of the binary tree is minimum, calling the binary tree as a Huffman tree.

In an alternative embodiment, training the acoustic model according to the huffman tree comprises: determining a vector adjacent probability according to the Huffman tree, wherein the vector adjacent probability comprises: the probability of one word vector followed by another word vector, and the probability of one word vector followed by one word vector; normalizing the vector adjacent probability through a normalization exponential function; and training the acoustic model according to the vector adjacent probability after normalization processing.

The adjacent probability of the vector can be determined by traversing all leaf nodes of the Huffman tree, namely, the probability that one word vector is followed by another word vector and the probability that one word vector is followed by one word vector are determined; the probability of what word or word follows a word or word can be determined to be the largest by the vector adjacent probability. And normalizing the vector adjacent probability by a normalized exponential function, wherein the normalized exponential function is a softmax function.

In step S206, there are a plurality of alternative embodiments, wherein one alternative embodiment is to perform a decoding operation on the probability matrix through the lossy acoustic model and a decoder to obtain the identified audio, including: performing a first decoding operation on the probability matrix by the decoder to obtain a plurality of decoding results, wherein the decoding operation comprises the first decoding operation; calculating a decoding error probability of each decoding result of the plurality of decoding results according to a vector adjacent probability, wherein the vector adjacent probability comprises: the probability of one word vector followed by another word vector, and the probability of one word vector followed by one word vector; and determining a decoding result corresponding to the decoding error probability with the largest numerical value in the decoding error probabilities as the identification audio.

The decoding operation includes decoding the probability matrix by the lossy acoustic model and a decoder, decoding the probability matrix by a decoder, and decoding the probability matrix by the lossy acoustic model. The first decoding operation is decoding the probability matrix by the decoder, the decoding operation including the first decoding operation. And performing a first decoding operation on the probability matrix by the decoder to obtain a plurality of decoding results. And respectively calculating the decoding error probability of each decoding result in the plurality of decoding results according to the vector adjacent probability, and determining the decoding result corresponding to the decoding error probability with the largest numerical value in the plurality of decoding error probabilities as the identification audio. Wherein the lossy acoustic model preserves vector neighbor probabilities.

In step S206, there are a plurality of alternative embodiments, wherein one alternative embodiment is to perform a decoding operation on the probability matrix through the lossy acoustic model and a decoder to obtain the identified audio, including: performing a first decoding operation on a preset part in the probability matrix through the decoder to obtain a target decoding result, wherein the decoding operation comprises the first decoding operation; performing, by the decoder, the first decoding operation for a plurality of times on a remaining portion except for a preset portion in the probability matrix, and performing, when the first decoding operation is performed for a plurality of times on the remaining portion, correction processing on a decoding result of each of the first decoding operations according to vector adjacent probabilities, to obtain the identified audio, where the vector adjacent probabilities include: the target decoding result is the decoding result of the first decoding operation.

And performing a first decoding operation on a preset part in the probability matrix through the decoder to obtain a target decoding result, wherein the preset part is a part with a preset size of the head or the head of the probability matrix. Dividing the rest part according to a preset size, wherein the original sequence of the probability matrix is reserved in the division. And sequentially carrying out multiple times of first decoding operation on the multi-part data of which the residual part except the preset part is divided by the preset size in the probability matrix through the decoder, and carrying out correction processing on the decoding result of each first decoding operation according to vector adjacent probability when carrying out the multiple times of first decoding operation so as to obtain the identification audio. For example, the target decoding result is "today", the decoding result of the second first decoding operation is "how day", but the probability that "today" is followed by "how weather" is determined to be the largest based on the target decoding result and the vector adjacent probability, so the decoding result of the first two first decoding operations "how day today" is corrected to "how weather today".

Alternatively, incremental learning refers to a learning system that can continually learn new knowledge from new samples and can store most of the previously learned knowledge. The training set for incremental learning may be divided into T scenarios. Each scene T e {1, …, T }, input sample and its label (x _t T) data set D belonging to the scene _t . The incremental training model f and model parameters θ are trained on a training set of one or more scenes at a time. During model reasoning, the output probability matrix is h ^θ (x) The probability distribution representing each category includes:

wherein f ^θ (x) And (5) reasoning the result for the finally output model. The goal of incremental learning is to correctly predict the already trained scene t e {1, …, t at any training stage in the training process _C Samples (x) _t ,y _t ) The training targets can be expressed as:

above-mentionedThe goal of training can be understood as a constraint equation. c represents the number of the training scene (here from 1 to c), t _c Representing the c-th scene, lt represents the model's loss (loss) at t scene, L represents the model's loss function,the loss representing Lt is a mathematical expectation calculated over this dataset of Dt and can be understood as a mean or sum. Argmin is a mathematical function, softmax is a normalized exponential function, and no further description is given.

In order to better understand the above technical solution, the embodiments of the present disclosure further provide an alternative embodiment, which is used to explain the above technical solution.

Fig. 3 schematically illustrates a flowchart of a compressed audio recognition method according to an embodiment of the disclosure, as shown in fig. 3:

Feature extraction: performing discrete Fourier transform on the compressed audio to extract spectral features corresponding to the compressed audio;

acoustic model coding: performing coding operation on the spectrum characteristics through a lossy acoustic model to obtain a probability matrix corresponding to the spectrum characteristics;

the decoder decodes: and decoding the probability matrix through the lossy acoustic model and a decoder to obtain the identification audio.

Fig. 4 schematically illustrates a flow diagram of acoustic model training according to an embodiment of the present disclosure, as shown in fig. 4:

s402, acquiring life common text data and/or professional common text data;

s404, performing text-to-audio operation on the acquired life common text data and/or the professional common text data to obtain training lossless audio;

s406, performing incremental learning training on the original acoustic model through the training lossless audio to obtain a lossless acoustic model;

s408, carrying out lossy compression processing on the training lossless audio to obtain training lossy audio;

and S410, performing incremental learning training on the lossless acoustic model through the training lossy audio to obtain the lossy acoustic model.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present disclosure may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as Read-Only Memory (ROM), random access Memory (RandomAccess Memory, RAM), magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the embodiments of the present disclosure.

The embodiment also provides a compressed audio recognition device, which is used for implementing the above embodiment and the preferred implementation, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.

Fig. 5 schematically illustrates a block diagram of a compressed audio recognition device according to an alternative embodiment of the present disclosure, as shown in fig. 5, the device includes:

the obtaining module 502 is configured to obtain compressed audio, and perform discrete fourier transform on the compressed audio to obtain a spectrum feature corresponding to the compressed audio;

the encoding module 504 is configured to perform an encoding operation on the spectral feature through a lossy acoustic model, so as to obtain a probability matrix corresponding to the spectral feature;

a decoding module 506, configured to perform a decoding operation on the probability matrix through the lossy acoustic model and a decoder to obtain the identified audio.

Optionally, the encoding module 504 is further configured to obtain training lossless audio, and perform incremental learning training on the original acoustic model through the training lossless audio to obtain a lossless acoustic model; carrying out lossy compression processing on the training lossless audio to obtain training lossy audio; and performing incremental learning training on the lossless acoustic model through the training lossy audio to obtain the lossy acoustic model.

Optionally, the encoding module 504 is further configured to obtain training lossless audio, and perform lossy compression processing on the training lossless audio to obtain training lossy audio; and performing incremental learning training on the original acoustic model through the training lossy audio to obtain the lossy acoustic model.

Optionally, the encoding module 504 is further configured to obtain life usual text data and/or professional usual text data; performing text-to-audio operation on the acquired life common text data and/or the professional common text data to obtain training lossless audio; the training lossless audio is stored in an audio data center.

Optionally, the encoding module 504 is further configured to input training audio into an embedding layer, and output a plurality of word vectors and/or a plurality of word vectors through the embedding layer, wherein the training audio includes: the training lossless audio and the training lossy audio, the original acoustic model, the lossless acoustic model, and the lossy acoustic model each comprising the embedded layer; carrying out statistical processing on the plurality of word vectors and/or the plurality of word vectors through a statistical language algorithm to obtain a feature vector list; training an acoustic model according to the feature vector list, wherein the acoustic model comprises: the original acoustic model, the lossless acoustic model, and the lossy acoustic model.

optionally, the encoding module 504 is further configured to count categories and occurrence frequencies of the plurality of word vectors and/or the plurality of word vectors in the feature vector list; determining a Huffman tree according to the plurality of word vectors and/or the categories and the occurrence frequencies of the plurality of word vectors; and training the acoustic model according to the Huffman tree.

Optionally, the encoding module 504 is further configured to determine a vector neighboring probability according to the huffman tree, where the vector neighboring probability includes: the probability of one word vector followed by another word vector, and the probability of one word vector followed by one word vector; normalizing the vector adjacent probability through a normalization exponential function; and training the acoustic model according to the vector adjacent probability after normalization processing.

Optionally, the decoding module 506 is further configured to perform a first decoding operation on the probability matrix by the decoder to obtain a plurality of decoding results, where the decoding operation includes the first decoding operation; calculating a decoding error probability of each decoding result of the plurality of decoding results according to a vector adjacent probability, wherein the vector adjacent probability comprises: the probability of one word vector followed by another word vector, and the probability of one word vector followed by one word vector; and determining a decoding result corresponding to the decoding error probability with the largest numerical value in the decoding error probabilities as the identification audio.

Optionally, the decoding module 506 is further configured to perform a first decoding operation on a preset portion in the probability matrix by using the decoder to obtain a target decoding result, where the decoding operation includes the first decoding operation; performing, by the decoder, the first decoding operation for a plurality of times on a remaining portion except for a preset portion in the probability matrix, and performing, when the first decoding operation is performed for a plurality of times on the remaining portion, correction processing on a decoding result of each of the first decoding operations according to vector adjacent probabilities, to obtain the identified audio, where the vector adjacent probabilities include: the target decoding result is the decoding result of the first decoding operation.

Alternatively, incremental learning refers to a learning system that can continually learn new knowledge from new samples and can store most of the previously learned knowledge. The training set for incremental learning may be divided into T scenarios. Each scene T e {1, …, T }, input sample and its label (x _t ,y _t ) Data set D belonging to the scene _t . The incremental training model f and model parameters θ are trained on a training set of one or more scenes at a time. During model reasoning, the output probability matrix is h ^θ (x) The probability distribution representing each category includes:

c represents the number of the training scene (here from 1 to c), t _c Representing the c-th scene, lt represents the loss (loss) of the model under the t scene, and L represents the loss function of the modelThe number of the product is the number,the loss representing Lt is a mathematical expectation calculated over this dataset of Dt and can be understood as a mean or sum. Argmin is a mathematical function, softmax is a normalized exponential function, and no further description is given.

It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.

Embodiments of the present disclosure provide an electronic device.

Referring to fig. 6, an electronic device 600 provided by an embodiment of the present disclosure includes a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 complete communication with each other through the communication bus 604; a memory 603 for storing a computer program; the processor 601 is configured to implement the steps of any of the method embodiments described above when executing a program stored on a memory.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the input/output device is connected to the processor.

Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:

s1, obtaining compressed audio, and performing discrete Fourier transform on the compressed audio to obtain frequency spectrum characteristics corresponding to the compressed audio;

s2, carrying out coding operation on the spectrum characteristics through a lossy acoustic model to obtain a probability matrix corresponding to the spectrum characteristics;

S3, decoding the probability matrix through the lossy acoustic model and a decoder to obtain the identification audio.

Embodiments of the present disclosure also provide a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of any of the method embodiments described above.

Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:

The computer-readable storage medium may be embodied in the apparatus/means described in the above embodiments; or may exist alone without being assembled into the apparatus/device. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.

According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments and optional implementations, and this embodiment is not described herein.

It will be appreciated by those skilled in the art that the modules or steps of the disclosure described above may be implemented in a general purpose computing device, they may be centralized on a single computing device, or distributed across a network of computing devices, or they may alternatively be implemented in program code executable by computing devices, such that they may be stored in a memory device for execution by the computing devices and, in some cases, the steps shown or described may be performed in a different order than what is shown or described, or they may be implemented as individual integrated circuit modules, or as individual integrated circuit modules. As such, the present disclosure is not limited to any specific combination of hardware and software.

The foregoing description of the preferred embodiments of the present disclosure is merely illustrative of and not intended to be limiting of the present disclosure, as various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principles of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A method of compressed audio recognition, comprising:

obtaining compressed audio, and performing discrete Fourier transform on the compressed audio to obtain frequency spectrum characteristics corresponding to the compressed audio;

performing coding operation on the spectrum characteristics through a lossy acoustic model to obtain a probability matrix corresponding to the spectrum characteristics;

decoding the probability matrix through the lossy acoustic model and a decoder to obtain identification audio;

the lossy acoustic model is obtained by training lossy audio to perform incremental learning training on a lossless acoustic model, and the lossless acoustic model is obtained by training lossless audio to perform incremental learning training on an original acoustic model;

or the lossy acoustic model is obtained by training lossy audio to perform incremental learning training on the original acoustic model.

2. The method of claim 1, wherein before the encoding the spectral features by the lossy acoustic model to obtain the probability matrix corresponding to the spectral features, the method further comprises:

acquiring training lossless audio, and performing incremental learning training on an original acoustic model through the training lossless audio to obtain a lossless acoustic model; carrying out lossy compression processing on the training lossless audio to obtain training lossy audio; performing incremental learning training on the lossless acoustic model through the training lossy audio to obtain the lossy acoustic model;

or performing incremental learning training on the original acoustic model through the training lossy audio to obtain the lossy acoustic model.

3. The method of claim 2, wherein prior to obtaining training lossless audio, the method further comprises:

acquiring life common text data and/or professional common text data;

performing text-to-audio operation on the acquired life common text data and/or the professional common text data to obtain training lossless audio;

the training lossless audio is stored in an audio data center.

4. The method of claim 1, wherein the incremental learning training comprises:

inputting training audio into an embedding layer and outputting a plurality of word vectors and/or a plurality of word vectors through the embedding layer, wherein the training audio comprises: the training lossless audio and the training lossy audio, the original acoustic model, the lossless acoustic model, and the lossy acoustic model each comprising the embedded layer;

carrying out statistical processing on the plurality of word vectors and/or the plurality of word vectors through a statistical language algorithm to obtain a feature vector list;

training an acoustic model according to the feature vector list, wherein the acoustic model comprises: the original acoustic model, the lossless acoustic model, and the lossy acoustic model.

5. The method of claim 4, wherein training the acoustic model from the list of feature vectors comprises:

counting categories and occurrence frequencies of the plurality of word vectors and/or the plurality of word vectors in the feature vector list;

determining a Huffman tree according to the plurality of word vectors and/or the categories and the occurrence frequencies of the plurality of word vectors;

And training the acoustic model according to the Huffman tree.

6. The method of claim 5, wherein the training the acoustic model according to the huffman tree comprises:

determining a vector adjacent probability according to the Huffman tree, wherein the vector adjacent probability comprises: the probability of one word vector followed by another word vector, and the probability of one word vector followed by one word vector;

normalizing the vector adjacent probability through a normalization exponential function;

and training the acoustic model according to the vector adjacent probability after normalization processing.

7. The method of claim 1, wherein decoding the probability matrix by the lossy acoustic model and decoder to obtain the identified audio comprises:

performing a first decoding operation on the probability matrix by the decoder to obtain a plurality of decoding results, wherein the decoding operation comprises the first decoding operation;

calculating a decoding error probability of each decoding result of the plurality of decoding results according to a vector adjacent probability, wherein the vector adjacent probability comprises: the probability of one word vector followed by another word vector, and the probability of one word vector followed by one word vector;

And determining a decoding result corresponding to the decoding error probability with the largest numerical value in the decoding error probabilities as the identification audio.

8. The method of claim 1, wherein decoding the probability matrix by the lossy acoustic model and decoder to obtain the identified audio comprises:

performing a first decoding operation on a preset part in the probability matrix through the decoder to obtain a target decoding result, wherein the decoding operation comprises the first decoding operation;

performing, by the decoder, the first decoding operation for a plurality of times on a remaining portion except for a preset portion in the probability matrix, and performing, when the first decoding operation is performed for a plurality of times on the remaining portion, correction processing on a decoding result of each of the first decoding operations according to vector adjacent probabilities, to obtain the identified audio, where the vector adjacent probabilities include: the target decoding result is the decoding result of the first decoding operation, and the target decoding result is the decoding result of the first decoding operation.

9. A compressed audio recognition apparatus, comprising:

the acquisition module is used for acquiring compressed audio and performing discrete Fourier transform on the compressed audio to obtain frequency spectrum characteristics corresponding to the compressed audio;

the coding module is used for carrying out coding operation on the spectrum characteristics through the lossy acoustic model so as to obtain a probability matrix corresponding to the spectrum characteristics;

the decoding module is used for decoding the probability matrix through the lossy acoustic model and a decoder so as to obtain the identification audio;

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any one of claims 1 to 8.