CN110942782A - Voice compression method, voice decompression method, voice compression device, voice decompression device and electronic equipment - Google Patents


Info

Publication number
CN110942782A
CN110942782A (application CN201911260327.1A)
Authority
CN
China
Prior art keywords
voice data
neural network
compression
frequency domain
decompression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911260327.1A
Other languages
Chinese (zh)
Inventor
文仕学
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN201911260327.1A priority Critical patent/CN110942782A/en
Publication of CN110942782A publication Critical patent/CN110942782A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/038 Speech enhancement using band spreading techniques
    • G10L 21/0388 Details of processing therefor
    • G10L 21/04 Time compression or expansion
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The embodiment of the invention provides a voice compression method, a voice decompression method, corresponding devices and electronic equipment, wherein the voice compression method comprises the following steps: acquiring original voice data; and performing frequency domain compression and/or time domain compression on the original voice data according to a coding neural network to obtain compressed voice data. Because training data is used, the neural network can be trained to learn which frequency components of the voice data to discard, without applying knowledge from the acoustic field; therefore, in the embodiment of the invention, the difficulty of designing an encoder for voice data compression is lower, and the voice data can be compressed with an encoder of low design difficulty.

Description

Voice compression method, voice decompression method, voice compression device, voice decompression device and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for compressing and decompressing voice, and an electronic device.
Background
With the continuous development of science and technology, the sampling capability of voice acquisition equipment has improved, so the space occupied by acquired voice data grows larger and larger. To facilitate storage and transmission of the voice data, the voice data may be compressed.
Among them, lossy compression is one of the commonly used compression methods. Lossy compression is achieved by using a lossy compression encoder (e.g., an MP3 (MPEG Audio Layer 3) encoder) to discard portions of the original speech data (e.g., components corresponding to frequency bands or frequencies to which the human ear is not sensitive). However, determining which frequency bands or frequencies the human ear is insensitive to requires applying knowledge from the acoustic field, so the design difficulty of a lossy compression encoder is high.
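The band-discarding idea behind such lossy encoders can be pictured with a minimal sketch. Note that the fixed spectral cutoff below is only an illustrative stand-in: a real encoder such as MP3 relies on a psychoacoustic model, not a hard cutoff, and all names here are hypothetical.

```python
import numpy as np

def lossy_band_discard(signal, sample_rate, cutoff_hz):
    """Crude lossy compression: zero out spectral components above cutoff_hz.

    Purely illustrative; real codecs use psychoacoustic models rather than
    a fixed cutoff frequency.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0          # discard "insensitive" bins
    return np.fft.irfft(spectrum, n=len(signal))

# One second of a 1 kHz tone plus a quiet 15 kHz tone at 44.1 kHz.
sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t) + 0.1 * np.sin(2 * np.pi * 15000 * t)
y = lossy_band_discard(x, sr, cutoff_hz=8000)  # the 15 kHz component is dropped
```

After the call, the 1 kHz component survives while the 15 kHz component is removed, at the cost of an irreversible loss of information.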
Disclosure of Invention
The embodiment of the invention provides a voice compression method, so as to compress voice data with an encoder of low design difficulty.
The embodiment of the invention also provides a voice decompression method for decompressing the voice data compressed by the voice compression method.
Correspondingly, the embodiment of the invention also provides a voice compression and decompression device and electronic equipment, which are used for ensuring the realization and application of the method.
In order to solve the above problem, an embodiment of the present invention discloses a voice compression method, which specifically includes: acquiring original voice data; and carrying out frequency domain compression and/or time domain compression on the original voice data according to the coding neural network to obtain compressed voice data.
Optionally, the frequency domain compression comprises: performing frequency domain transformation on the original voice data to obtain a speech spectrum matrix corresponding to the original voice data; and inputting the speech spectrum matrix corresponding to the original voice data into the coding neural network to obtain frequency domain compressed voice data output by the coding neural network.
Optionally, the time-domain compression comprises: and inputting the original voice data into the coding neural network to obtain time domain compressed voice data output by the coding neural network.
Optionally, the method further comprises the step of training the encoding neural network: acquiring training voice data; performing frequency domain compression and/or time domain compression on the training voice data according to the coding neural network to obtain compressed voice data; carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to a decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the coding neural network.
The embodiment of the invention also discloses a voice compression device, which specifically comprises: the first acquisition module is used for acquiring original voice data; and the compression module is used for carrying out frequency domain compression and/or time domain compression on the original voice data according to the coding neural network to obtain compressed voice data.
Optionally, the compression module comprises: the frequency domain compression submodule is used for carrying out frequency domain transformation on the original voice data to obtain a speech spectrum matrix corresponding to the original voice data; and inputting the speech spectrum matrix corresponding to the original speech data into the coding neural network to obtain frequency domain compressed speech data output by the coding neural network.
Optionally, the compression module comprises: and the time domain compression submodule is used for inputting the original voice data into the coding neural network to obtain time domain compressed voice data output by the coding neural network.
Optionally, the apparatus further comprises: the first training module is used for acquiring training voice data; performing frequency domain compression and/or time domain compression on the training voice data according to the coding neural network to obtain compressed voice data; carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to a decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the coding neural network.
The embodiment of the invention also discloses a readable storage medium, and when the instructions in the storage medium are executed by a processor of the electronic equipment, the electronic equipment can execute the voice compression method according to any embodiment of the invention.
An embodiment of the present invention also discloses an electronic device for speech compression, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include instructions for: acquiring original voice data; and carrying out frequency domain compression and/or time domain compression on the original voice data according to the coding neural network to obtain compressed voice data.
Optionally, the frequency domain compression comprises: performing frequency domain transformation on the original voice data to obtain a speech spectrum matrix corresponding to the original voice data; and inputting the speech spectrum matrix corresponding to the original voice data into the coding neural network to obtain frequency domain compressed voice data output by the coding neural network.
Optionally, the time-domain compression comprises: and inputting the original voice data into the coding neural network to obtain time domain compressed voice data output by the coding neural network.
Optionally, the instructions further include instructions for training the coding neural network: acquiring training voice data; performing frequency domain compression and/or time domain compression on the training voice data according to the coding neural network to obtain compressed voice data; performing frequency domain decompression and/or time domain decompression on the compressed voice data according to a decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the coding neural network.
The embodiment of the invention also discloses a voice decompression method, which specifically comprises the following steps: acquiring compressed voice data; and carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data.
Optionally, the compressed speech data comprises frequency domain compressed speech data, and the frequency domain decompression comprises: and inputting the frequency domain compressed voice data into the decoding neural network to obtain the frequency domain decompressed voice data output by the decoding neural network.
Optionally, after obtaining the frequency domain decompressed speech data output by the decoding neural network, the method further includes: and carrying out time domain transformation on the frequency domain decompressed voice data to obtain corresponding time domain decompressed voice data.
Optionally, the compressed voice data includes time-domain compressed voice data, and the time-domain decompression includes: and inputting the time domain compressed voice data into the decoding neural network to obtain the time domain decompressed voice data output by the decoding neural network.
Optionally, the method further comprises: and playing the time domain decompressed voice data.
Optionally, the method further comprises the step of training the decoding neural network: acquiring training voice data; performing frequency domain compression and/or time domain compression on the training voice data according to a coding neural network to obtain compressed voice data; carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the decoding neural network.
The embodiment of the invention also discloses a voice decompression device, which specifically comprises: the second acquisition module is used for acquiring compressed voice data; and the decompression module is used for carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data.
Optionally, the compressed voice data includes frequency domain compressed voice data, and the decompression module includes: and the frequency domain decompression submodule is used for inputting the frequency domain compressed voice data into the decoding neural network to obtain the frequency domain decompressed voice data output by the decoding neural network.
Optionally, the apparatus further comprises: and the time domain transformation module is used for performing time domain transformation on the frequency domain decompressed speech data after the frequency domain decompressed speech data output by the decoding neural network are obtained, so as to obtain corresponding time domain decompressed speech data.
Optionally, the compressed voice data includes time-domain compressed voice data, and the decompression module includes: and the time domain decompression submodule is used for inputting the time domain compressed voice data into the decoding neural network to obtain the time domain decompressed voice data output by the decoding neural network.
Optionally, the apparatus further comprises: and the playing module is used for playing the time domain decompressed voice data.
Optionally, the apparatus further comprises: the second training module is used for acquiring training voice data; performing frequency domain compression and/or time domain compression on the training voice data according to a coding neural network to obtain compressed voice data; carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the decoding neural network.
The embodiment of the invention also discloses a readable storage medium, and when the instructions in the storage medium are executed by a processor of the electronic equipment, the electronic equipment can execute the voice decompression method according to any one of the embodiments of the invention.
An embodiment of the present invention also discloses an electronic device for speech decompression, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include instructions for: acquiring compressed voice data; and performing frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data.
Optionally, the compressed speech data comprises frequency domain compressed speech data, and the frequency domain decompression comprises: and inputting the frequency domain compressed voice data into the decoding neural network to obtain the frequency domain decompressed voice data output by the decoding neural network.
Optionally, after the frequency domain decompressed speech data output by the decoding neural network are obtained, the operations further include: performing time domain transformation on the frequency domain decompressed speech data to obtain corresponding time domain decompressed speech data.
Optionally, the compressed voice data includes time-domain compressed voice data, and the time-domain decompression includes: and inputting the time domain compressed voice data into the decoding neural network to obtain the time domain decompressed voice data output by the decoding neural network.
Optionally, the method further comprises: and playing the time domain decompressed voice data.
Optionally, the instructions further include instructions for training the decoding neural network: acquiring training voice data; performing frequency domain compression and/or time domain compression on the training voice data according to a coding neural network to obtain compressed voice data; performing frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the decoding neural network.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, original voice data may be obtained, and then frequency domain compression and/or time domain compression is performed on the original voice data according to a coding neural network to obtain compressed voice data. Because training data is used, the neural network can be trained to learn which frequency components of the voice data to discard, without applying knowledge from the acoustic field; therefore, in the embodiment of the invention, the difficulty of designing an encoder for voice data compression is lower, and the voice data can be compressed with an encoder of low design difficulty.
Drawings
FIG. 1 is a flow chart of the steps of one embodiment of the speech compression method of the present invention;
FIG. 2 is a flow chart of the steps of an embodiment of a speech decompression method of the present invention;
FIG. 3 is a flow chart of the steps of a neural network training method embodiment of the present invention;
FIG. 4 is a flow chart of the steps of an embodiment of a method of speech compression and decompression of the present invention;
FIG. 5 is a flow chart of the steps of an alternative embodiment of a method of speech compression and decompression of the present invention;
FIG. 6 is a block diagram of a voice compression apparatus according to an embodiment of the present invention;
FIG. 7 is a block diagram of an alternative embodiment of a speech compression apparatus of the present invention;
FIG. 8 is a block diagram of an embodiment of a speech decompression apparatus according to the present invention;
FIG. 9 is a block diagram of an alternative embodiment of a speech decompression apparatus of the present invention;
FIG. 10 is a block diagram illustrating an electronic device for speech compression and decompression in accordance with an exemplary embodiment;
fig. 11 is a schematic structural diagram of an electronic device for speech compression and decompression according to another exemplary embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
One of the core ideas of the embodiment of the invention is to use a neural network to compress the voice data: training data is used to train the neural network to learn which components of the voice data to discard, without applying knowledge from the acoustic field, so the difficulty of designing an encoder for voice data compression is lower.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a speech compression method according to the present invention is shown, which may specifically include the following steps:
step 102, obtaining original voice data.
In the embodiment of the invention, when a certain segment of voice data needs to be compressed, that segment of voice data may be obtained; the segment of voice data to be compressed is referred to as the original voice data. Step 104 may then be performed to compress the original voice data.
And step 104, performing frequency domain compression and/or time domain compression on the original voice data according to the coding neural network to obtain compressed voice data.
In the embodiment of the invention, the coding neural network may be trained in advance so that the trained coding neural network can realize data coding (also called data compression); the training process of the coding neural network is explained later. After the original voice data are obtained, the trained coding neural network may be used to compress the original voice data to generate a compact representation (referred to below as compressed voice data). Compared with the original voice data, the compressed voice data are more compact and occupy less space, thereby reducing the storage space required for the original data and facilitating its transmission.
The compression manner of the original speech data by the encoding neural network may include multiple manners, such as frequency domain compression and/or time domain compression, which is not limited in this embodiment of the present invention.
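The notion of a "compact representation" can be sketched as a network whose output has fewer values than its input. The single tanh layer below is only a hypothetical stand-in for a trained coding neural network; the sizes (256 samples in, 64 latent values out) and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyEncoder:
    """One-layer stand-in for a trained coding neural network: maps a frame
    of n_in samples to n_latent values (n_latent < n_in), so the output
    occupies less space than the input."""

    def __init__(self, n_in=256, n_latent=64):
        self.w = rng.standard_normal((n_latent, n_in)) * 0.05
        self.b = np.zeros(n_latent)

    def __call__(self, frame):
        return np.tanh(self.w @ frame + self.b)  # compact representation

encoder = TinyEncoder()
raw_frame = rng.standard_normal(256)   # one frame of raw voice samples
compressed = encoder(raw_frame)        # 64 values: 4x fewer than the input
```

The compression ratio here is fixed by the layer sizes; a real network learned from data would also decide *what* to keep, not just how much.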
In summary, in the embodiments of the present invention, original voice data may be obtained, and then frequency domain compression and/or time domain compression is performed on the original voice data according to a coding neural network to obtain compressed voice data. Because training data is used, the neural network can be trained to learn which frequency components of the voice data to discard, without applying knowledge from the acoustic field; therefore, in the embodiment of the invention, the difficulty of designing an encoder for voice data compression is lower, and the voice data can be compressed with an encoder of low design difficulty.
Correspondingly, the embodiment of the invention also provides a voice decompression method to decompress the voice data compressed by the above voice compression method.
Referring to fig. 2, a flowchart illustrating steps of an embodiment of a speech decompression method according to the present invention is shown, which may specifically include the following steps:
step 202, obtaining compressed voice data.
In the embodiment of the invention, when voice data compressed by the coding neural network need to be decompressed, those voice data may be obtained; the voice data compressed by the coding neural network are referred to as the compressed voice data. Step 204 is then executed to decompress the compressed voice data.
And 204, performing frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data.
In the embodiment of the present invention, the decoding neural network may also be trained in advance, so that the trained decoding neural network can implement data decoding (also referred to as data decompression); the training process of the decoding neural network is explained later. After the compressed voice data are obtained, the trained decoding neural network is used to decompress the compressed voice data to obtain decompressed voice data, which may then be played, etc. In this embodiment of the present invention, the decompression manner of the decoding neural network may also include multiple types, such as frequency domain decompression and/or time domain decompression, which is not limited in this embodiment of the present invention. The decompression manner of the decoding neural network corresponds to the compression manner of the coding neural network: if the compression manner of the coding neural network is frequency domain compression, the decompression manner of the decoding neural network is correspondingly frequency domain decompression; if the compression manner of the coding neural network is time domain compression, the decompression manner of the decoding neural network is correspondingly time domain decompression.
In summary, in the embodiments of the present invention, compressed voice data may be obtained, and then frequency domain decompression and/or time domain decompression are performed on the compressed voice data according to a decoding neural network, so as to obtain decompressed voice data; and then realizing the decompression of the voice data compressed by the coding neural network.
In this embodiment of the present invention, the coding neural network and the decoding neural network may be any type of neural network, such as CNNs (Convolutional Neural Networks), LSTM networks (Long Short-Term Memory networks), DNNs (Deep Neural Networks), and the like, which is not limited in this embodiment of the present invention.
The encoding neural network and the decoding neural network are a pair of matched neural networks, and the decoding neural network and the encoding neural network can be connected and then trained; i.e. the output of the encoding neural network is connected to the input of the decoding neural network. The training process of the decoding neural network and the encoding neural network is explained below.
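The chained arrangement described here, with the coding network's output feeding the decoding network's input, can be sketched as follows. The weights are random and untrained; shapes and names are illustrative assumptions, and the point is only that the decoder's input dimension matches the encoder's output dimension.

```python
import numpy as np

rng = np.random.default_rng(1)

n_in, n_latent = 256, 64
# The decoder mirrors the encoder's shapes, so the encoder's output can be
# fed directly into the decoder's input during joint training.
w_enc = rng.standard_normal((n_latent, n_in)) * 0.05
w_dec = rng.standard_normal((n_in, n_latent)) * 0.05

def encode(x):
    """Stand-in coding network: frame -> compact latent code."""
    return np.tanh(w_enc @ x)

def decode(z):
    """Stand-in decoding network: latent code -> reconstructed frame."""
    return w_dec @ z

frame = rng.standard_normal(n_in)
latent = encode(frame)                 # (64,) compact code
reconstructed = decode(latent)         # chained: encoder output -> decoder input
```

Because the two networks are connected end to end, a single reconstruction error at the decoder's output can drive weight updates in both.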
Referring to FIG. 3, a flowchart illustrating the steps of one embodiment of a neural network training method of the present invention is shown. The method comprises the following steps:
step 302, training voice data is obtained.
In the embodiment of the invention, voice data may be collected in various ways, and the collected segments of voice data form the training voice data; the training voice data are then used to train the coding neural network and the decoding neural network.
And 304, performing frequency domain compression and/or time domain compression on the training voice data according to the coding neural network to obtain compressed voice data.
And step 306, performing frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data.
In the embodiment of the present invention, the coding neural network may first be used to compress the training speech data and output compressed speech data. The compressed speech data output by the coding neural network are then input into the decoding neural network, which decompresses them and outputs decompressed speech data. The decompressed speech data are then compared with the corresponding training speech data to adjust the weights of the coding neural network and the decoding neural network.
And 308, comparing the decompressed voice data with the training voice data, and adjusting the weight of the coding neural network.
And 310, comparing the decompressed voice data with the training voice data, and adjusting the weight of the decoding neural network.
In one example of the present invention, the weights of the encoding neural network and the weights of the decoding neural network may be updated simultaneously, i.e., step 308 and step 310 are performed simultaneously. Specifically, the decompressed speech data and the training speech data may be compared, and then the weight of the encoding neural network and the weight of the decoding neural network may be updated simultaneously according to the comparison result.
In one example of the present invention, the weights of the encoding neural network and the weights of the decoding neural network may be updated alternately, i.e., step 308 and step 310 are performed alternately. When the decompressed speech data are compared with the training speech data to adjust the weight of the coding neural network, the weight of the decoding neural network is kept unchanged; and when the decompressed speech data are compared with the training speech data to adjust the weight of the decoding neural network, the weight of the coding neural network is kept unchanged.
Whether the weights of the coding neural network and the decoding neural network are updated simultaneously or alternately, one way to adjust the weights may be to calculate a loss function using the decompressed speech data and the training speech data, and then update the weights with the goal of minimizing the loss function.
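This loss-driven update can be pictured with a linear autoencoder trained by stochastic gradient descent on a mean-squared-error loss, updating both sets of weights simultaneously. This is a NumPy sketch under simplifying assumptions (tiny linear networks, synthetic low-rank "training frames"); an actual implementation would use deeper networks and real speech data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_latent, lr = 16, 4, 0.01

w_enc = rng.standard_normal((n_latent, n_in)) * 0.1   # coding network weights
w_dec = rng.standard_normal((n_in, n_latent)) * 0.1   # decoding network weights

# Toy "training voice data": 512 frames drawn from a 4-dimensional source,
# so a 4-value latent code can in principle represent them well.
basis = rng.standard_normal((n_in, n_latent))
data = (basis @ rng.standard_normal((n_latent, 512))).T

def dataset_loss():
    """Mean-squared error between decompressed frames and training frames."""
    return np.mean((data - (data @ w_enc.T) @ w_dec.T) ** 2)

initial_loss = dataset_loss()
for _ in range(2000):
    x = data[rng.integers(len(data))]   # one training frame
    z = w_enc @ x                       # compress
    err = w_dec @ z - x                 # decompress and compare
    # Gradients of the per-frame MSE; both networks are updated together.
    g_dec = np.outer(err, z) * (2 / n_in)
    g_enc = np.outer(w_dec.T @ err, x) * (2 / n_in)
    w_dec -= lr * g_dec
    w_enc -= lr * g_enc
final_loss = dataset_loss()
```

Minimizing the reconstruction loss is what forces the coding network to keep the components that matter for reconstruction and discard the rest, with no acoustic knowledge built in.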
In an optional embodiment of the present invention, in step 304, performing frequency domain compression on the training speech data according to the coding neural network to obtain compressed speech data may be implemented as follows: frequency domain transformation is performed on the training speech data to obtain a speech spectrum matrix corresponding to the training speech data, and the speech spectrum matrix corresponding to the training speech data is input into the coding neural network to obtain frequency domain compressed speech data output by the coding neural network. Correspondingly, in step 306, performing frequency domain decompression on the compressed speech data according to the decoding neural network to obtain decompressed speech data may be implemented by inputting the frequency domain compressed speech data into the decoding neural network to obtain the frequency domain decompressed speech data output by the decoding neural network. Further, before steps 308 and 310 are performed, time domain transformation may be performed on the frequency domain decompressed speech data to obtain corresponding time domain decompressed speech data. Then, in step 308, the time domain decompressed speech data are compared with the training speech data to adjust the weight of the coding neural network; and in step 310, the time domain decompressed speech data are compared with the training speech data to adjust the weight of the decoding neural network.
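The frequency domain pipeline (transform to a spectrum matrix, compress, decompress, transform back to the time domain) can be sketched as below. For illustration, simple bin truncation and zero-padding stand in for the coding and decoding neural networks; frame length, bin count, and all function names are assumptions.

```python
import numpy as np

def to_spectrum_matrix(signal, frame_len=256):
    """Frequency domain transformation: split into frames and FFT each one.
    Rows are frames, columns are frequency bins (the "spectrum matrix")."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.fft.rfft(frames, axis=1)

def to_time_domain(spectrum, frame_len=256):
    """Time domain transformation: inverse FFT each frame and concatenate."""
    return np.fft.irfft(spectrum, n=frame_len, axis=1).ravel()

# Stand-ins for the coding/decoding neural networks: keep only the lowest
# `keep` bins (compress), then zero-pad back to full size (decompress).
def nn_compress(spectrum, keep=32):
    return spectrum[:, :keep]

def nn_decompress(compressed, n_bins):
    out = np.zeros((compressed.shape[0], n_bins), dtype=complex)
    out[:, : compressed.shape[1]] = compressed
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
spec = to_spectrum_matrix(x)                         # (4, 129) spectrum matrix
small = nn_compress(spec)                            # (4, 32): frequency domain compressed
y = to_time_domain(nn_decompress(small, spec.shape[1]))  # back to time domain
```

In training, `y` would be the signal compared against the original frames to compute the loss, exactly because it has been transformed back to the time domain first.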
In an optional embodiment of the present invention, in step 304, performing time domain compression on the training speech data according to the coding neural network to obtain compressed speech data may be implemented by directly inputting the training speech data into the coding neural network to obtain the time domain compressed speech data output by the coding neural network. Correspondingly, in step 306, performing time domain decompression on the compressed speech data according to the decoding neural network to obtain decompressed speech data may be implemented by inputting the time domain compressed speech data into the decoding neural network to obtain the time domain decompressed speech data output by the decoding neural network. The time domain decompressed speech data output by the decoding neural network can then be directly compared with the training speech data to adjust the weights of the coding neural network and the decoding neural network.
Furthermore, the coding neural network and the decoding neural network are trained on training voice data, so that the coding neural network learns how to compress voice data and the decoding neural network learns how to decompress the voice data compressed by the coding neural network; compared with the prior art, in which an encoder is designed by applying acoustic domain knowledge, the design difficulty is lower.
The speech compression method and the speech decompression method are described together below.
Referring to fig. 4, a flowchart illustrating steps of an embodiment of a speech compression and decompression method according to the present invention is shown, which may specifically include the following steps:
step 402, obtaining original voice data.
In this embodiment of the present invention, the original voice data may be of multiple types, such as recordings and music, which is not limited in the embodiments of the present invention.
In this embodiment, during the training process the coding neural network performs time domain compression and the decoding neural network performs time domain decompression; accordingly, when the coding neural network is used for compression, time domain compression may be performed on the original voice data, as in step 404; and when the decoding neural network is used for decompression, time domain decompression may be performed on the compressed voice data, as in step 408.
Step 404, inputting the original voice data into the coding neural network to obtain time domain compressed voice data output by the coding neural network.
Step 406, acquiring the time domain compressed voice data.
Step 408, inputting the time domain compressed voice data into the decoding neural network to obtain time domain decompressed voice data.
In the embodiment of the present invention, the original voice data may be input directly into the trained coding neural network, which compresses it directly to generate a compact representation in the time domain (referred to below as time domain compressed voice data). The time domain compressed voice data can then be acquired and input into the decoding neural network, which decompresses it to restore the decompressed voice data. The decompressed voice data output by the decoding neural network is time domain data, and is referred to below as time domain decompressed voice data.
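A minimal sketch of this time domain inference path follows. The frame size, latent size, and the random linear matrices standing in for the trained coding and decoding neural networks are all hypothetical (a real deployment would load trained weights); the point is the data flow of steps 404 to 408: raw samples in, a smaller compact representation, then a same-length frame out, with no frequency transform anywhere.

```python
# Illustrative time domain pipeline: raw frame -> encoder -> compact
# representation -> decoder -> restored frame (assumed sizes: 512 -> 128).
import random

random.seed(1)
FRAME, LATENT = 512, 128  # 4:1 time domain compression (assumption)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# random stand-ins for the trained encoder / decoder weight matrices
encoder = [[random.gauss(0, 0.05) for _ in range(FRAME)] for _ in range(LATENT)]
decoder = [[random.gauss(0, 0.05) for _ in range(LATENT)] for _ in range(FRAME)]

frame = [random.uniform(-1, 1) for _ in range(FRAME)]  # original voice data
compressed = matvec(encoder, frame)         # step 404: time domain compression
decompressed = matvec(decoder, compressed)  # step 408: time domain decompression

print(len(compressed), len(decompressed))   # prints: 128 512
```

The compressed representation carries a quarter of the values of the original frame, and the decoder's output has the original length, so it can be played directly as in step 410.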
Step 410, playing the time domain decompressed voice data.
When the decompressed voice data needs to be played, the decompressed voice data can be directly played.
In summary, in the embodiment of the present invention, original voice data may be acquired and input into the coding neural network to obtain time domain compressed voice data output by the coding neural network; the time domain compressed voice data is then input into the decoding neural network to obtain time domain decompressed voice data. Because the original voice data is compressed directly, compression efficiency is improved; and because the time domain decompressed voice data output by the decoding neural network can be played directly, decoding efficiency is improved.
Secondly, the encoding neural network and the decoding neural network can perform parallel computation, which can further improve the efficiency of compression and decompression. Compared with codecs in the prior art, the encoding neural network can determine more accurately, through learning, which frequency components of the original voice data to discard, so the compression rate is higher and the restored voice data is more faithful to the original.
Referring to fig. 5, a flowchart illustrating steps of an alternative embodiment of a speech compression and decompression method according to the present invention is shown, which may specifically include the following steps:
step 502, obtaining original voice data.
In this embodiment, during the training process the encoding neural network performs frequency domain compression and the decoding neural network performs frequency domain decompression; accordingly, when the encoding neural network is used for compression, frequency domain compression may be performed on the original voice data, as in steps 504 and 506; and when the decoding neural network is used for decompression, frequency domain decompression may be performed on the compressed voice data, as in step 510.
Step 504, performing frequency domain transformation on the original voice data to obtain a spectrum matrix corresponding to the original voice data.
Step 506, inputting the speech spectrum matrix into the encoding neural network to obtain frequency domain compressed speech data output by the encoding neural network.
In the embodiment of the present invention, frequency domain transformation, such as a Fast Fourier Transform (FFT), may be performed on the original voice data to obtain the speech spectrum matrix corresponding to the original voice data. The speech spectrum matrix is then input into the trained encoding neural network, which compresses it to generate a compact representation in the frequency domain (referred to below as frequency domain compressed voice data). For example, if the original voice data consists of N frames, frequency domain transformation is performed on each frame to obtain a 257-dimensional spectrum column per frame, so the speech spectrum matrix corresponding to the original voice data is 257×N. The 257×N matrix is input into the trained encoding neural network, which extracts 128 dimensions from each frame's spectrum to obtain 128×N frequency domain compressed voice data, thereby generating a compact representation in the frequency domain. Here, N is a positive integer.
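The 257-dimensional spectrum in the example is consistent with a 512-point real-input FFT, since 512 / 2 + 1 = 257 non-redundant bins. The sketch below illustrates this dimension bookkeeping with a naive real-input DFT and a hypothetical linear stand-in for the trained encoding neural network; the use of magnitude values and the random 128-dimensional projection are assumptions for illustration only.

```python
# Frequency domain compression path for one frame: 512 time samples ->
# 257 spectrum bins -> 128-dim compact representation (assumed encoder).
import cmath
import random

FFT_SIZE, LATENT = 512, 128
random.seed(2)

def rdft(frame):
    """Naive real-input DFT: keep the n//2 + 1 non-redundant bins."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame))
            for k in range(n // 2 + 1)]

frame = [random.uniform(-1, 1) for _ in range(FFT_SIZE)]  # one frame of voice data
spectrum = rdft(frame)                   # frequency domain transformation
magnitudes = [abs(c) for c in spectrum]  # one 257-dim column of the speech spectrum matrix

# hypothetical trained encoder: project 257 dims down to 128
encoder = [[random.gauss(0, 0.05) for _ in range(len(magnitudes))]
           for _ in range(LATENT)]
compressed = [sum(w * m for w, m in zip(row, magnitudes)) for row in encoder]

print(len(magnitudes), len(compressed))  # prints: 257 128
```

Stacking one such column per frame yields the 257×N speech spectrum matrix and the 128×N compressed representation described above; a production system would use a genuine FFT rather than this O(n²) DFT.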
Step 508, acquiring the frequency domain compressed voice data.
Step 510, inputting the frequency domain compressed voice data into the decoding neural network to obtain frequency domain decompressed voice data.
The frequency domain compressed voice data is then acquired and input into the trained decoding neural network, which decompresses it and outputs the corresponding decompressed voice data. The decompressed voice data output by the decoding neural network is frequency domain data, and is referred to below as frequency domain decompressed voice data.
And step 512, performing time domain transformation on the frequency domain decompressed speech data to obtain corresponding time domain decompressed speech data.
Step 514, playing the time domain decompressed voice data.
Since only time domain data can be played directly, and the data obtained by decompressing the frequency domain compressed data in step 510 is frequency domain data, the frequency domain decompressed voice data must be converted into time domain data, that is, time domain decompressed voice data, before it can be played. Time domain transformation may therefore be performed on the frequency domain decompressed voice data to obtain the corresponding time domain decompressed voice data, which may then be played.
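The role of this final time domain transformation can be shown with a small transform round trip; the 16-sample frame and the naive DFT/inverse-DFT pair below are purely illustrative, but they demonstrate that inverting the frequency domain representation recovers playable time domain samples.

```python
# Round trip for step 512: time domain -> frequency domain -> time domain.
import cmath
import random

random.seed(3)
N = 16
frame = [random.uniform(-1, 1) for _ in range(N)]  # playable time domain samples

def dft(x):
    return [sum(v * cmath.exp(-2j * cmath.pi * k * t / N)
                for t, v in enumerate(x)) for k in range(N)]

def idft(X):
    # inverse transform; .real drops negligible imaginary round-off
    return [(sum(V * cmath.exp(2j * cmath.pi * k * t / N)
                 for k, V in enumerate(X)) / N).real for t in range(N)]

restored = idft(dft(frame))  # time domain decompressed voice data
max_err = max(abs(a - b) for a, b in zip(frame, restored))
print(max_err < 1e-9)        # prints: True
```

In the actual method the decoding neural network alters the frequency domain data before this inverse step, so the restored samples approximate rather than exactly equal the original; the transform itself, however, is lossless up to floating-point error.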
In summary, in the embodiment of the present invention, original voice data may be acquired, frequency domain transformation may be performed on it to obtain the corresponding speech spectrum matrix, and the speech spectrum matrix may be input into the encoding neural network to obtain frequency domain compressed voice data output by the encoding neural network; the frequency domain compressed voice data is then input into the decoding neural network to obtain frequency domain decompressed voice data. Compression of voice data by an encoder of low design difficulty, and decompression of the compressed voice data by the corresponding decoder, can thereby be realized. Compared with inputting the original voice data directly into the encoding neural network for compression, inputting the speech spectrum matrix of the original voice data allows the encoding neural network to determine more accurately which frequency components to discard, so the compression rate is higher and the restored voice data is more faithful to the original.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 6, a block diagram of a voice compression apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:
a first obtaining module 602, configured to obtain original voice data;
the compressing module 604 is configured to perform frequency domain compression and/or time domain compression on the original voice data according to the coding neural network, so as to obtain compressed voice data.
Referring to fig. 7, a block diagram of an alternative embodiment of a speech compression apparatus of the present invention is shown.
In an alternative embodiment of the present invention, the compressing module 604 includes:
a frequency domain compression sub-module 6042, configured to perform frequency domain transformation on the original voice data to obtain a spectrum matrix corresponding to the original voice data; and inputting the speech spectrum matrix corresponding to the original speech data into the coding neural network to obtain frequency domain compressed speech data output by the coding neural network.
In an alternative embodiment of the present invention, the compression module includes:
and a time domain compression sub-module 6044, configured to input the original voice data into the coding neural network, so as to obtain time domain compressed voice data output by the coding neural network.
In an optional embodiment of the present invention, the apparatus further comprises:
a first training module 606 for obtaining training speech data; performing frequency domain compression and/or time domain compression on the training voice data according to the coding neural network to obtain compressed voice data; carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to a decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the coding neural network.
In summary, in the embodiments of the present invention, original voice data may be acquired, and frequency domain compression and/or time domain compression is then performed on the original voice data according to the coding neural network to obtain compressed voice data. Because training data is used, the neural network can be trained to learn which frequency components of the voice data to discard, and no acoustic domain knowledge needs to be applied; therefore, in the embodiments of the present invention, the difficulty of designing an encoder for voice data compression is lower, and voice data can be compressed by an encoder of low design difficulty.
Referring to fig. 8, a block diagram of a voice decompression apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:
a second obtaining module 802, configured to obtain compressed voice data;
the decompression module 804 is configured to perform frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network, so as to obtain decompressed voice data.
Referring to fig. 9, a block diagram of an alternative embodiment of a speech decompression apparatus of the present invention is shown.
In an optional embodiment of the present invention, the compressed voice data includes frequency domain compressed voice data, and the decompression module 804 includes:
the frequency domain decompression submodule 8042 is configured to input the frequency domain compressed voice data into the decoding neural network, so as to obtain frequency domain decompressed voice data output by the decoding neural network.
In an optional embodiment of the present invention, the apparatus further comprises:
a time domain transforming module 806, configured to perform time domain transformation on the frequency domain decompressed speech data after obtaining the frequency domain decompressed speech data output by the decoding neural network, so as to obtain corresponding time domain decompressed speech data.
In an optional embodiment of the present invention, the compressed voice data includes time-domain compressed voice data, and the decompression module 804 includes:
the time domain decompression submodule 8044 is configured to input the time domain compressed voice data into the decoding neural network, so as to obtain the time domain decompressed voice data output by the decoding neural network.
In an optional embodiment of the present invention, the apparatus further comprises:
a playing module 808, configured to play the time-domain decompressed speech data.
In an optional embodiment of the present invention, the apparatus further comprises:
a second training module 810 for obtaining training speech data; performing frequency domain compression and/or time domain compression on the training voice data according to a coding neural network to obtain compressed voice data; carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the decoding neural network.
In summary, in the embodiments of the present invention, compressed voice data may be acquired, and frequency domain decompression and/or time domain decompression is then performed on the compressed voice data according to the decoding neural network to obtain decompressed voice data, thereby realizing decompression of the voice data compressed by the coding neural network.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Fig. 10 is a block diagram illustrating an electronic device 1000 for speech compression and decompression according to an exemplary embodiment. For example, the electronic device 1000 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 10, electronic device 1000 may include one or more of the following components: processing component 1002, memory 1004, power component 1006, multimedia component 1008, audio component 1010, input/output (I/O) interface 1012, sensor component 1014, and communications component 1016.
The processing component 1002 generally controls overall operation of the electronic device 1000, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 1002 may include one or more processors 1020 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 1002 may include one or more modules that facilitate interaction between processing component 1002 and other components. For example, the processing component 1002 can include a multimedia module to facilitate interaction between the multimedia component 1008 and the processing component 1002.
The memory 1004 is configured to store various types of data to support operation at the device 1000. Examples of such data include instructions for any application or method operating on the electronic device 1000, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1004 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 1006 provide power to the various components of electronic device 1000. Power components 1006 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 1000.
The multimedia component 1008 includes a screen that provides an output interface between the electronic device 1000 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1008 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 1000 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 1010 is configured to output and/or input audio signals. For example, the audio component 1010 may include a Microphone (MIC) configured to receive external audio signals when the electronic device 1000 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 1004 or transmitted via the communication component 1016. In some embodiments, audio component 1010 also includes a speaker for outputting audio signals.
I/O interface 1012 provides an interface between processing component 1002 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1014 includes one or more sensors for providing various aspects of status assessment for the electronic device 1000. For example, sensor assembly 1014 may detect the open/closed status of device 1000, the relative positioning of components, such as a display and keypad of electronic device 1000, the change in position of electronic device 1000 or a component of electronic device 1000, the presence or absence of user contact with electronic device 1000, the orientation or acceleration/deceleration of electronic device 1000, and the change in temperature of electronic device 1000. The sensor assembly 1014 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1014 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1014 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1016 is configured to facilitate wired or wireless communication between the electronic device 1000 and other devices. The electronic device 1000 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1016 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1016 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 1000 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 1004 comprising instructions, executable by the processor 1020 of the electronic device 1000 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of speech compression, the method comprising: acquiring original voice data; and carrying out frequency domain compression and/or time domain compression on the original voice data according to the coding neural network to obtain compressed voice data.
Optionally, the frequency domain compression comprises: carrying out frequency domain transformation on the original voice data to obtain a spectrum matrix corresponding to the original voice data; and inputting the speech spectrum matrix corresponding to the original speech data into the coding neural network to obtain frequency domain compressed speech data output by the coding neural network.
Optionally, the time-domain compression comprises: and inputting the original voice data into the coding neural network to obtain time domain compressed voice data output by the coding neural network.
Optionally, the method further comprises the step of training the encoding neural network: acquiring training voice data; performing frequency domain compression and/or time domain compression on the training voice data according to the coding neural network to obtain compressed voice data; carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to a decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the coding neural network.
A non-transitory computer readable storage medium having instructions therein which, when executed by a processor of an electronic device, enable the electronic device to perform a method of speech decompression, the method comprising: acquiring compressed voice data; and carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data.
Optionally, the compressed speech data comprises frequency domain compressed speech data, and the frequency domain decompression comprises: and inputting the frequency domain compressed voice data into the decoding neural network to obtain the frequency domain decompressed voice data output by the decoding neural network.
Optionally, after obtaining the frequency domain decompressed speech data output by the decoding neural network, the method further includes: and carrying out time domain transformation on the frequency domain decompressed voice data to obtain corresponding time domain decompressed voice data.
Optionally, the compressed voice data includes time-domain compressed voice data, and the time-domain decompression includes: and inputting the time domain compressed voice data into the decoding neural network to obtain the time domain decompressed voice data output by the decoding neural network.
Optionally, the method further comprises: and playing the time domain decompressed voice data.
Optionally, the method further comprises the step of training the decoding neural network: acquiring training voice data; performing frequency domain compression and/or time domain compression on the training voice data according to a coding neural network to obtain compressed voice data; carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the decoding neural network.
Fig. 11 is a schematic structural diagram of an electronic device 1100 for speech compression and decompression according to another exemplary embodiment of the present invention. The electronic device 1100 may be a server, whose configuration and performance may vary widely; it may include one or more central processing units (CPUs) 1122 (e.g., one or more processors), memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) storing applications 1142 or data 1144. The memory 1132 and the storage media 1130 may provide transient or persistent storage. The programs stored on a storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 1122 may be configured to communicate with the storage medium 1130 and to execute, on the server, the series of instruction operations in the storage medium 1130.
The server may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1158, one or more keyboards 1156, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
An electronic device for speech compression, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured for execution by one or more processors, the one or more programs including instructions for: acquiring original voice data; and performing frequency domain compression and/or time domain compression on the original voice data according to the coding neural network to obtain compressed voice data.
Optionally, the frequency domain compression comprises: carrying out frequency domain transformation on the original voice data to obtain a spectrum matrix corresponding to the original voice data; and inputting the speech spectrum matrix corresponding to the original speech data into the coding neural network to obtain frequency domain compressed speech data output by the coding neural network.
Optionally, the time-domain compression comprises: and inputting the original voice data into the coding neural network to obtain time domain compressed voice data output by the coding neural network.
Optionally, further comprising instructions for training the encoded neural network to: acquiring training voice data; performing frequency domain compression and/or time domain compression on the training voice data according to the coding neural network to obtain compressed voice data; carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to a decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the coding neural network.
An electronic device for speech decompression, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured for execution by one or more processors, the one or more programs including instructions for: acquiring compressed voice data; and performing frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data.
Optionally, the compressed speech data comprises frequency domain compressed speech data, and the frequency domain decompression comprises: and inputting the frequency domain compressed voice data into the decoding neural network to obtain the frequency domain decompressed voice data output by the decoding neural network.
Optionally, after obtaining the frequency domain decompressed speech data output by the decoding neural network, the electronic device further includes: and carrying out time domain transformation on the frequency domain decompressed voice data to obtain corresponding time domain decompressed voice data.
Optionally, the compressed voice data includes time-domain compressed voice data, and the time-domain decompression includes: and inputting the time domain compressed voice data into the decoding neural network to obtain the time domain decompressed voice data output by the decoding neural network.
Optionally, the method further comprises: and playing the time domain decompressed voice data.
Optionally, further comprising instructions for training the decoding neural network to: acquiring training voice data; performing frequency domain compression and/or time domain compression on the training voice data according to a coding neural network to obtain compressed voice data; carrying out frequency domain decompression and/or time domain decompression on the compressed voice data according to the decoding neural network to obtain decompressed voice data; and comparing the decompressed voice data with the training voice data, and adjusting the weight of the decoding neural network.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, those skilled in the art may make additional variations and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all alterations and modifications that fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
The foregoing describes in detail the voice compression and decompression methods, apparatuses, and electronic devices provided by the present invention. Specific examples are used herein to explain the principles and implementation of the invention, and the description of the above embodiments is intended only to aid understanding of the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method of speech compression, comprising:
acquiring original voice data;
performing frequency domain compression and/or time domain compression on the original voice data according to an encoding neural network to obtain compressed voice data.
2. The method of claim 1, wherein the frequency domain compression comprises:
performing a frequency domain transform on the original voice data to obtain a speech spectrum matrix corresponding to the original voice data;
inputting the speech spectrum matrix corresponding to the original voice data into the encoding neural network to obtain frequency domain compressed voice data output by the encoding neural network.
3. A method of speech decompression, comprising:
acquiring compressed voice data;
performing frequency domain decompression and/or time domain decompression on the compressed voice data according to a decoding neural network to obtain decompressed voice data.
4. The method of claim 3, wherein the compressed speech data comprises frequency domain compressed speech data, and wherein the frequency domain decompression comprises:
inputting the frequency domain compressed voice data into the decoding neural network to obtain the frequency domain decompressed voice data output by the decoding neural network.
5. A speech compression apparatus, comprising:
a first acquisition module, configured to acquire original voice data;
a compression module, configured to perform frequency domain compression and/or time domain compression on the original voice data according to an encoding neural network to obtain compressed voice data.
6. A speech decompression apparatus, comprising:
a second acquisition module, configured to acquire compressed voice data;
a decompression module, configured to perform frequency domain decompression and/or time domain decompression on the compressed voice data according to a decoding neural network to obtain decompressed voice data.
7. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the speech compression method of any one of claims 1 to 2.
8. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the speech decompression method of any one of claims 3 to 4.
9. An electronic device for speech compression, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring original voice data;
performing frequency domain compression and/or time domain compression on the original voice data according to an encoding neural network to obtain compressed voice data.
10. An electronic device for speech decompression, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring compressed voice data;
performing frequency domain decompression and/or time domain decompression on the compressed voice data according to a decoding neural network to obtain decompressed voice data.
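The speech spectrum matrix referred to in claim 2 is, in practice, produced by a short-time Fourier transform of the raw waveform. A minimal sketch, assuming a Hann window, 256-sample frames, and 50% overlap (none of these parameters come from the patent):

```python
import numpy as np

def speech_spectrum_matrix(signal, frame_len=256, hop=128):
    """Frame the signal, window each frame, and take FFT magnitudes;
    rows are time frames, columns are frequency bins."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, frame_len // 2 + 1)

# a 1 kHz tone sampled at 8 kHz; its energy falls in bin 1000 / 8000 * 256 = 32
t = np.arange(8000) / 8000.0
spec = speech_spectrum_matrix(np.sin(2 * np.pi * 1000.0 * t))
```

Each row of this matrix is what would be fed to the encoding neural network for frequency domain compression; real systems typically use a log-magnitude or mel-scaled version of it.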
CN201911260327.1A 2019-12-10 2019-12-10 Voice compression method, voice decompression method, voice compression device, voice decompression device and electronic equipment Pending CN110942782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911260327.1A CN110942782A (en) 2019-12-10 2019-12-10 Voice compression method, voice decompression method, voice compression device, voice decompression device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911260327.1A CN110942782A (en) 2019-12-10 2019-12-10 Voice compression method, voice decompression method, voice compression device, voice decompression device and electronic equipment

Publications (1)

Publication Number Publication Date
CN110942782A true CN110942782A (en) 2020-03-31

Family

ID=69910778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911260327.1A Pending CN110942782A (en) 2019-12-10 2019-12-10 Voice compression method, voice decompression method, voice compression device, voice decompression device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110942782A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180190313A1 (en) * 2016-12-30 2018-07-05 Facebook, Inc. Audio Compression Using an Artificial Neural Network
US20190065486A1 (en) * 2017-08-24 2019-02-28 Microsoft Technology Licensing, Llc Compression of word embeddings for natural language processing systems
CN110211610A (en) * 2019-06-20 2019-09-06 平安科技(深圳)有限公司 Assess the method, apparatus and storage medium of audio signal loss
CN110491400A (en) * 2019-08-21 2019-11-22 杭州派尼澳电子科技有限公司 A kind of voice signal method for reconstructing based on depth self-encoding encoder


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xu Wei (ed.): "Digital Television Network Production and Broadcasting Technology", 31 May 2008, China Radio and Television Press *
Xue Hui et al.: "An Audio Data Compression Scheme Based on BP Networks", Application Research of Computers *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476137A (en) * 2020-04-01 2020-07-31 北京埃德尔黛威新技术有限公司 Novel pipeline leakage early warning online correlation positioning data compression method and equipment
CN111476137B (en) * 2020-04-01 2023-08-01 北京埃德尔黛威新技术有限公司 Novel pipeline leakage early warning online relevant positioning data compression method and device
CN113823296A (en) * 2021-06-15 2021-12-21 腾讯科技(深圳)有限公司 Voice data processing method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN108198569B (en) Audio processing method, device and equipment and readable storage medium
CN110097890B (en) Voice processing method and device for voice processing
KR101852087B1 (en) Method, device, program and recording medium for compressing firmware program, method, device, program and recording medium for decompressing firmware program
CN109887515B (en) Audio processing method and device, electronic equipment and storage medium
CN111640424B (en) Voice recognition method and device and electronic equipment
CN113362812B (en) Voice recognition method and device and electronic equipment
CN108364635B (en) Voice recognition method and device
CN111242303B (en) Network training method and device, and image processing method and device
CN110942143A (en) Toy detection acceleration method and device based on convolutional neural network
CN110942782A (en) Voice compression method, voice decompression method, voice compression device, voice decompression device and electronic equipment
CN111862995A (en) Code rate determination model training method, code rate determination method and device
JP2022515274A (en) Detector placement method, detector placement device and non-temporary computer readable storage medium
CN105721656A (en) Background noise generation method and device
CN115273831A (en) Voice conversion model training method, voice conversion method and device
US11354520B2 (en) Data processing method and apparatus providing translation based on acoustic model, and storage medium
CN106782576B (en) Audio mixing method and device
CN105513101B (en) Image processing method and device
EP3993430A1 (en) Method for encoding live broadcast data and electronic device
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN107437412B (en) Acoustic model processing method, voice synthesis method, device and related equipment
CN113497852A (en) Automatic volume adjustment method, apparatus, medium, and device
CN111583958A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN109887492B (en) Data processing method and device and electronic equipment
CN109309764B (en) Audio data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200331