CN111292725B - Voice decoding method and device - Google Patents


Info

Publication number
CN111292725B
CN111292725B · CN202010128594.XA
Authority
CN
China
Prior art keywords
frame
speech
audio
decoding result
voice
Prior art date
Legal status
Active
Application number
CN202010128594.XA
Other languages
Chinese (zh)
Other versions
CN111292725A
Inventor
王磊
冯大航
陈孝良
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN202010128594.XA
Publication of CN111292725A
Application granted
Publication of CN111292725B

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 — Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application provides a speech decoding method and apparatus. The method comprises: acquiring speech to be decoded, and obtaining a plurality of frame combinations from a plurality of audio frames according to a preset frame shift parameter and a preset frame skipping parameter, wherein each frame combination comprises at least one audio frame; performing feature extraction on each frame combination to obtain the audio features of each frame combination; inputting the audio features of each frame combination into a trained chain model to obtain the speech decoding result of each frame combination; and determining the speech decoding result of the speech to be decoded according to the speech decoding result of each frame combination. Because a plurality of frame combinations are obtained from the plurality of audio frames, more audio frames of the speech to be decoded participate in decoding, and the number of obtained speech decoding results increases.

Description

Voice decoding method and device
Technical Field
The present application relates to the field of speech recognition, and in particular, to a speech decoding method and apparatus.
Background
Speech decoding refers to the process of recognizing speech as text (Chinese characters). At present, a chain model (Chain model) that has completed training has a speech decoding function. Specifically, for speech to be decoded, the audio feature of a frame combination is extracted from the speech, and the extracted audio feature is input into the trained chain model to obtain the speech decoding result.
However, the decoding accuracy of this approach is low; that is, the accuracy of the obtained decoding result is low.
Disclosure of Invention
The application provides a voice decoding method and a voice decoding device, and aims to solve the problem of low voice decoding precision.
In order to achieve the above object, the present application provides the following technical solutions:
the application provides a voice decoding method, which comprises the following steps:
acquiring a voice to be decoded; wherein the speech to be decoded comprises a plurality of audio frames;
obtaining a plurality of frame combinations from the plurality of audio frames according to preset frame shift parameters and frame skip parameters; wherein the frame combination comprises at least one of the audio frames;
extracting the characteristics of each frame combination to obtain the audio characteristics of each frame combination;
respectively inputting the audio features of each frame combination into the trained chain model to obtain the voice decoding result of each frame combination;
and determining the voice decoding result of the voice to be decoded according to the voice decoding result of each frame combination.
Optionally, the determining, according to the speech decoding result of each frame combination, the speech decoding result of the speech to be decoded includes:
determining the probability of the voice decoding result of each frame combination;
and taking the speech decoding result with the highest probability as the speech decoding result of the speech to be decoded.
Optionally, the method further includes:
the chain model is pre-trained.
Optionally, obtaining a plurality of frame combinations from the plurality of audio frames according to a preset frame shift parameter and a preset frame skip parameter, including:
calculating the frame sequence parameter of each audio frame in the frame combination according to the preset frame shift parameter and the frame skipping parameter;
obtaining a plurality of frame combinations from the plurality of audio frames according to the frame order parameter.
The present application also provides a speech decoding apparatus, including:
the first acquisition module is used for acquiring the voice to be decoded; wherein the speech to be decoded comprises a plurality of audio frames;
the second acquisition module is used for acquiring a plurality of frame combinations from the plurality of audio frames according to preset frame shift parameters and frame skip parameters; wherein the frame combination comprises at least one of the audio frames;
the feature extraction module is used for extracting features of the frame combinations to obtain audio features of the frame combinations;
the input module is used for respectively inputting the audio features of each frame combination into the trained chain model to obtain the voice decoding result of each frame combination;
and the determining module is used for determining the voice decoding result of the voice to be decoded according to the voice decoding result of each frame combination.
Optionally, the determining module is configured to determine a speech decoding result of the speech to be decoded according to the speech decoding result of each frame combination, and includes:
the determining module is specifically configured to determine a probability of a speech decoding result of each frame combination; and taking the speech decoding result with the highest probability as the speech decoding result of the speech to be decoded.
Optionally, the apparatus further comprises: a training module for pre-training the chain model.
Optionally, the second obtaining module is configured to obtain a plurality of frame combinations from the plurality of audio frames according to a preset frame shift parameter and a preset frame skip parameter, and includes:
the second obtaining module is specifically configured to calculate a frame sequence parameter of each audio frame in the frame combination according to the preset frame shift parameter and the preset frame skip parameter; obtaining a plurality of frame combinations from the plurality of audio frames according to the frame order parameter.
The present application also provides a storage medium including a stored program, wherein the program executes any of the above-described speech decoding methods.
The application also provides a device, which comprises at least one processor, at least one memory connected with the processor, and a bus; the processor and the memory complete mutual communication through the bus; the processor is used for calling the program instructions in the memory to execute any one of the voice decoding methods.
In the speech decoding method and the speech decoding device, speech to be decoded is obtained, wherein the speech to be decoded comprises a plurality of audio frames; obtaining a plurality of frame combinations from a plurality of audio frames according to preset frame shift parameters and frame skipping parameters; wherein the frame combination comprises at least one audio frame; extracting the characteristics of each frame combination to obtain the audio characteristics of each frame combination; and respectively inputting the audio features of each frame combination into the trained chain model to obtain the voice decoding result of each frame combination, and determining the voice decoding result of the voice to be decoded according to the voice decoding result of each frame combination.
Because a plurality of frame combinations are obtained from the plurality of audio frames, more audio frames of the speech to be decoded participate in decoding and more speech decoding results are obtained. The speech decoding result of the speech to be decoded is then determined according to the speech decoding result of each frame combination, which improves the decoding precision of the obtained speech decoding result.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 (a) is a schematic diagram of audio frame distribution in speech according to an embodiment of the present application;
fig. 1 (b), fig. 1 (c), and fig. 1 (d) are schematic diagrams of audio frame distributions in a training set disclosed in an embodiment of the present application, respectively;
FIG. 2 is a flowchart of a speech decoding method disclosed in the embodiments of the present application;
fig. 3 is a schematic structural diagram of a speech decoding apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an apparatus disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The speech decoding process of the speech to be decoded in the present embodiment will be described with the speech shown in fig. 1 (a) as the speech to be decoded. Fig. 2 is a speech decoding method according to an embodiment of the present application, including the following steps:
s201, obtaining the voice to be decoded.
In the present embodiment, the speech to be decoded includes a plurality of audio frames. As an example, the speech to be decoded may comprise 9 audio frames; in this embodiment, each segment of the speech after segmentation is called a frame, and this speech has 9 audio frames. As shown in fig. 1 (a), the 9 audio frames may be represented as [1,2,3,4,5,6,7,8,9]; fig. 1 (a) contains 9 cells, each cell representing one audio frame. Specifically, the frame length of each audio frame is 10 ms.
S202, obtaining a plurality of frame combinations from a plurality of audio frames according to preset frame shift parameters and frame skipping parameters.
In this step, any one frame combination includes at least one audio frame in the speech to be decoded.
In this step, the preset frame shift parameters may include 0, 1 and 2. Optionally, the value of the frame skipping parameter may be 3. Of course, in practice, the values of the frame shift parameter and the frame skipping parameter may be determined according to the actual situation; this embodiment does not limit their specific values.
In this embodiment, in order to obtain a better speech decoding result, the frame shift parameter used in this step is the same as the frame shift parameter used to construct the training set for training the preset chain model; that is, the frame shift parameter preset in this step is the same as the one used during the training process.
The frame length of any audio frame represents the time duration of that frame; since each frame is 10 ms long, a frame skipping parameter of 3 means that one frame is taken from the audio every 30 ms.
As an example, the values of the frame shift parameter include 0, 1 and 2. When the frame shift parameter is 0, one frame is taken every 3 frames starting from the first audio frame of the speech, and the obtained training set is represented as [1,4,7], as shown in fig. 1 (b). When the frame shift parameter is 1, each audio frame of the shift-0 training set is shifted right by 1 position, giving the training set [2,5,8], as shown in fig. 1 (c). When the frame shift parameter is 2, each audio frame of the shift-0 training set is shifted right by 2 positions, giving the training set [3,6,9], as shown in fig. 1 (d).
Optionally, in this step, a manner of obtaining a plurality of frame combinations from the speech to be decoded according to the frame shift parameter and the frame skip parameter may include steps A1 to A2:
a1, calculating frame sequence parameters of each audio frame in the frame combination according to preset frame shift parameters and frame skipping parameters.
In this step, the frame sequence parameter of each of the plurality of audio frames represents the order in which that audio frame is fed into the trained chain model.
Take the frame shift parameter values 0, 1 and 2 and the frame skipping parameter value 3 as an example. For the speech shown in fig. 1 (a) as the speech to be decoded, first, when the value of the frame shift parameter is 0, starting from the 1st audio frame of the 9 audio frames, the value of the frame sequence parameter of the 1st audio frame is determined to be 1; skipping 3 frames from the 1st audio frame gives the 4th audio frame, whose frame sequence parameter is determined to be 2; and skipping 3 frames from the 4th audio frame gives the 7th audio frame, whose frame sequence parameter is determined to be 3.
Then, under the condition that the frame shift parameter is 1, starting from the 2 nd audio frame in the 9 audio frames, determining the value of the frame sequence parameter of the 2 nd audio frame to be 4; skipping 3 frames from the 2 nd audio frame to obtain a 5 th audio frame, and determining the value of a frame sequence parameter of the 5 th audio frame to be 5; and skipping 3 frames from the 5 th audio frame to obtain an 8 th audio frame, and determining that the value of the frame sequence parameter of the 8 th audio frame is 6.
Finally, under the condition that the frame shift parameter is 2, starting from the 3 rd audio frame in the 9 audio frames, determining the value of the frame sequence parameter of the 3 rd audio frame to be 7; skipping 3 frames from the 3 rd audio frame to obtain a 6 th audio frame, and determining that the value of a frame sequence parameter of the 6 th audio frame is 8; and skipping 3 frames from the 6 th audio frame to obtain a 9 th audio frame, and determining that the value of the frame sequence parameter of the 9 th audio frame is 9.
And A2, obtaining a plurality of frame combinations from a plurality of audio frames according to the frame sequence parameters.
In this step, a plurality of frame combinations are obtained from the plurality of audio frames according to the sequence indicated by the frame sequence parameter of each audio frame in the plurality of audio frames.
Continuing the example in A1, the audio frames of the speech to be decoded, arranged in the order indicated by the frame sequence parameters, are: 1,4,7,2,5,8,3,6,9.
In this step, 1,4,7 is combined as one frame, 2,5,8 is combined as one frame, and 3,6,9 is combined as one frame, so that three frame combinations are obtained.
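Steps A1 and A2 above can be sketched as follows. This is a minimal illustration using the embodiment's example values (9 audio frames, frame shift parameters 0, 1 and 2, frame skipping parameter 3); the function and variable names are assumptions, not anything specified in the patent.

```python
# Hypothetical sketch of steps A1/A2: build frame combinations from
# preset frame shift and frame skipping parameters (1-based frame indices).

def frame_combinations(num_frames, frame_shifts, frame_skip):
    """Return one frame combination per frame shift value."""
    combinations = []
    for shift in frame_shifts:  # e.g. 0, 1, 2
        # Start at frame 1+shift and take every frame_skip-th frame.
        combo = list(range(1 + shift, num_frames + 1, frame_skip))
        combinations.append(combo)
    return combinations

combos = frame_combinations(9, (0, 1, 2), 3)
# combos == [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

# Step A1's frame sequence parameter: the 1-based position of each audio
# frame in the flattened order 1,4,7,2,5,8,3,6,9.
order = [f for combo in combos for f in combo]
frame_seq = {frame: i + 1 for i, frame in enumerate(order)}
```

Each sublist of `combos` corresponds to one of the three frame combinations obtained in this step, and `frame_seq` reproduces the frame sequence parameter values worked out in step A1 (e.g. the 7th audio frame has frame sequence parameter 3).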
S203, extracting the characteristics of each frame combination to obtain the audio characteristics of each frame combination.
In this step, feature extraction is performed on each frame combination to obtain the audio features of each frame combination; for any frame combination, the specific implementation of feature extraction is known in the prior art and is not described again here.
Taking 1,4,7 as a frame combination, 2,5,8 as a frame combination, 3,6,9 as a frame combination as an example, in this step, the feature of 1,4,7 in the frame combination is extracted, the feature of 2,5,8 in the frame combination is extracted, and the feature of 3,6,9 in the frame combination is extracted.
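The patent leaves feature extraction to the prior art and does not specify which features are used. As a purely illustrative stand-in (the function name, the log-energy feature, and the sample values below are all assumptions, not from the patent; in practice MFCC or filterbank features would typically be used), one feature per audio frame in a combination might be computed as:

```python
import math

def frame_features(frames):
    """frames: list of per-frame sample lists for one frame combination.

    Returns one toy log-energy feature per audio frame.
    """
    feats = []
    for samples in frames:
        energy = sum(s * s for s in samples)   # frame energy
        feats.append(math.log(energy + 1e-10)) # log-energy, floored to avoid log(0)
    return feats

# Made-up samples standing in for audio frames 1, 4 and 7 of one combination.
combo_frames = [[0.1, -0.2, 0.05], [0.3, 0.1, -0.1], [0.0, 0.2, 0.1]]
feats = frame_features(combo_frames)  # one feature value per frame
```

The same call would be repeated for the combinations 2,5,8 and 3,6,9, yielding one feature vector per frame combination as this step requires.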
And S204, respectively inputting the audio features of each frame combination into the trained chain model to obtain the voice decoding result of each frame combination.
Continuing with the frame combinations 1,4,7 and 2,5,8 and 3,6,9 as an example, in this step, the audio features of frame combination 1,4,7 are input into the trained chain model to obtain the speech decoding result of frame combination 1,4,7; the audio features of frame combination 2,5,8 are input into the trained chain model to obtain the speech decoding result of frame combination 2,5,8; and the audio features of frame combination 3,6,9 are input into the trained chain model to obtain the speech decoding result of frame combination 3,6,9.
S205, determining the voice decoding result of the voice to be decoded according to the voice decoding result of each frame combination.
Since the speech decoding result of each frame combination is obtained, in this step, the speech decoding result of the speech to be decoded can be determined according to the speech decoding result of each frame combination.
Optionally, an optimal speech decoding result may be selected from the speech decoding results of each frame combination as the speech decoding result of the speech to be decoded. The step of selecting the optimal speech decoding result may include steps B1 to B2:
and B1, determining the probability of the voice decoding result of each frame combination.
In this step, for the audio features of any frame combination input into the trained chain model, the model obtains the speech decoding result of that frame combination, together with the probability of that result, through forward computation; the probability represents the accuracy of the result.
In this step, the probability of the speech decoding result for each frame combination is determined.
And B2, taking the voice decoding result with the highest probability as the voice decoding result of the voice to be decoded.
In this step, the speech decoding result with the highest probability is used as the speech decoding result of the speech to be decoded.
In this step, the speech decoding result with the highest probability is used as the speech decoding result of the speech to be decoded, that is, the speech decoding result of the speech to be decoded is the optimal value in the speech decoding results of each frame combination, so that the accuracy of the speech decoding result of the speech to be decoded can be ensured to be higher, and the decoding accuracy is further improved.
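Steps B1 and B2 can be sketched as follows. The decoding results and probabilities below are made-up placeholders, since the patent only specifies that the trained chain model returns each decoding result together with a probability obtained by forward computation.

```python
# Hypothetical sketch of steps B1/B2: pick the decoding result whose
# probability is highest among the per-frame-combination results.

def best_decoding(results_with_probs):
    """results_with_probs: list of (decoding_result, probability) pairs."""
    result, prob = max(results_with_probs, key=lambda rp: rp[1])
    return result

# Placeholder outputs standing in for the chain model's forward computation.
per_combo = [
    ("ni hao", 0.62),  # result for frame combination 1,4,7
    ("ni hao", 0.71),  # result for frame combination 2,5,8
    ("mi hao", 0.35),  # result for frame combination 3,6,9
]
final = best_decoding(per_combo)  # "ni hao", the highest-probability result
```

The result returned by `best_decoding` is taken as the speech decoding result of the speech to be decoded, matching step B2's selection of the highest-probability result.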
In one possible implementation, the method further includes pre-training the chain model. The training process of the chain model comprises: training the preset chain model with a training set to obtain a trained chain model that has a speech decoding function. Specifically, the training of the preset chain model can be implemented with the Kaldi open-source toolkit. When Kaldi trains the chain model, the modeling unit uses one state for each triphone combination (triphone), and training uses a low-frame-rate mode of 30 ms. Experiments show that this model outperforms a CTC (Connectionist Temporal Classification) model and decodes faster.
In the embodiment of the application, in the training process of the chain model, in order to enable all audio frames of the preset voice to participate in training, a training set is constructed according to the frame skipping parameter and the frame shifting parameter.
Fig. 3 is a speech decoding apparatus according to an embodiment of the present application, including: a first acquisition module 301, a second acquisition module 302, a feature extraction module 303, an input module 304, and a determination module 305.
a first obtaining module 301, configured to obtain a speech to be decoded; wherein the speech to be decoded comprises a plurality of audio frames;
a second obtaining module 302, configured to obtain a plurality of frame combinations from the plurality of audio frames according to a preset frame shift parameter and a preset frame skip parameter; wherein the frame combination comprises at least one audio frame;
a feature extraction module 303, configured to perform feature extraction on each frame combination to obtain an audio feature of each frame combination;
an input module 304, configured to input the audio features of each frame combination into the trained chain model, respectively, to obtain a speech decoding result of each frame combination;
a determining module 305, configured to determine a speech decoding result of the speech to be decoded according to the speech decoding result of each frame combination.
Optionally, the determining module 305 is configured to determine a speech decoding result of the speech to be decoded according to the speech decoding result of each frame combination, and includes:
a determining module 305, specifically configured to determine a probability of a speech decoding result of each frame combination; and taking the speech decoding result with the highest probability as the speech decoding result of the speech to be decoded.
Optionally, the apparatus may further include: a training module for pre-training the chain model.
Optionally, the second obtaining module 302 is configured to obtain a plurality of frame combinations from the plurality of audio frames according to a preset frame shift parameter and a preset frame skip parameter, where the obtaining includes:
a second obtaining module 302, configured to specifically calculate a frame sequence parameter of each audio frame in the frame combination according to a preset frame shift parameter and a preset frame skip parameter; a plurality of frame combinations are obtained from the plurality of audio frames according to the frame order parameter.
The voice decoding device comprises a processor and a memory, wherein the first acquiring module 301, the second acquiring module 302, the feature extracting module 303, the input module 304, the determining module 305 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, which calls the corresponding program unit from the memory. One or more kernels may be provided, and the precision of decoding the speech to be decoded is improved by adjusting kernel parameters.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the speech decoding method when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the voice decoding method is executed when the program runs.
An embodiment of the present invention provides an apparatus, as shown in fig. 4, the apparatus includes at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory complete mutual communication through a bus; the processor is used for calling the program instructions in the memory to execute the voice decoding method. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product adapted to perform a program for initializing the following method steps when executed on a data processing device:
acquiring a voice to be decoded; wherein the speech to be decoded comprises a plurality of audio frames;
obtaining a plurality of frame combinations from a plurality of audio frames according to preset frame shift parameters and frame skipping parameters; wherein the frame combination comprises at least one audio frame;
extracting the characteristics of each frame combination to obtain the audio characteristics of each frame combination;
respectively inputting the audio features of each frame combination into the trained chain model to obtain the voice decoding result of each frame combination;
and determining the voice decoding result of the voice to be decoded according to the voice decoding result of each frame combination.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
The functions described in the method of the embodiment of the present application, if implemented in the form of software functional units and sold or used as independent products, may be stored in a storage medium readable by a computing device. Based on such understanding, part of the contribution to the prior art of the embodiments of the present application or part of the technical solution may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computing device (which may be a personal computer, a server, a mobile computing device or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A method for speech decoding, comprising:
acquiring speech to be decoded, wherein the speech to be decoded comprises a plurality of audio frames;
calculating a frame sequence parameter of each audio frame in a frame combination according to a frame skip parameter and a plurality of preset frame shift parameters;
obtaining a plurality of frame combinations from the plurality of audio frames according to the frame sequence parameters, wherein each frame combination comprises at least one audio frame, and the preset frame shift parameters are the same as the frame shift parameters used in training a chain model;
performing feature extraction on each frame combination to obtain audio features of each frame combination;
inputting the audio features of each frame combination into the trained chain model to obtain a speech decoding result for each frame combination; and
determining the speech decoding result of the speech to be decoded according to the speech decoding result of each frame combination.
2. The method according to claim 1, wherein determining the speech decoding result of the speech to be decoded according to the speech decoding result of each frame combination comprises:
determining a probability for the speech decoding result of each frame combination; and
taking the speech decoding result with the highest probability as the speech decoding result of the speech to be decoded.
3. The method of claim 1, further comprising:
the chain model is pre-trained.
4. A speech decoding apparatus, comprising:
a first acquisition module, configured to acquire speech to be decoded, wherein the speech to be decoded comprises a plurality of audio frames;
a second acquisition module, configured to obtain a plurality of frame combinations from the plurality of audio frames according to preset frame shift parameters and a frame skip parameter, wherein each frame combination comprises at least one audio frame, and the preset frame shift parameters are the same as the frame shift parameters used in training a chain model;
a feature extraction module, configured to perform feature extraction on each frame combination to obtain audio features of each frame combination;
an input module, configured to input the audio features of each frame combination into the trained chain model to obtain a speech decoding result for each frame combination; and
a determining module, configured to determine the speech decoding result of the speech to be decoded according to the speech decoding result of each frame combination;
wherein the second acquisition module is specifically configured to calculate a frame sequence parameter of each audio frame in a frame combination according to the frame skip parameter and the plurality of preset frame shift parameters, and to obtain the plurality of frame combinations from the plurality of audio frames according to the frame sequence parameters.
5. The apparatus of claim 4, wherein the determining module is specifically configured to determine a probability for the speech decoding result of each frame combination, and to take the speech decoding result with the highest probability as the speech decoding result of the speech to be decoded.
6. The apparatus of claim 4, further comprising: a training module for pre-training the chain model.
7. A storage medium comprising a stored program, wherein the program, when executed, performs the speech decoding method of any one of claims 1-3.
8. A speech decoding device, comprising at least one processor, at least one memory connected to the processor, and a bus, wherein the processor and the memory communicate with each other through the bus, and the processor is configured to invoke program instructions in the memory to perform the speech decoding method of any one of claims 1-3.
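The frame-combination logic recited in claims 1 and 4 can be sketched roughly as follows. This is an illustrative reading only: the function names, the way the frame skip and preset frame shifts select frame indices, and the stand-in scoring model are all assumptions for demonstration, not the patented implementation.

```python
def frame_combinations(num_frames, frame_skip, frame_shifts):
    """Build one frame combination (a list of frame indices) per preset shift.

    Each combination starts at a preset frame-shift offset and then takes
    every `frame_skip`-th frame, mirroring the claimed derivation of a
    frame sequence parameter from the skip and shift parameters.
    """
    combos = []
    for shift in frame_shifts:
        combo = list(range(shift, num_frames, frame_skip))
        if combo:  # each frame combination must contain at least one audio frame
            combos.append(combo)
    return combos


def decode(speech_frames, frame_skip, frame_shifts, model):
    """Decode each frame combination and keep the highest-probability result.

    `model` stands in for the trained chain model: it maps a feature
    sequence to a (decoding_result, probability) pair.
    """
    best = None
    for combo in frame_combinations(len(speech_frames), frame_skip, frame_shifts):
        features = [speech_frames[i] for i in combo]  # stand-in for feature extraction
        result, prob = model(features)
        if best is None or prob > best[1]:
            best = (result, prob)
    return best[0] if best else None
```

For example, with 10 frames, a frame skip of 3, and preset shifts (0, 1, 2), the combinations are the index lists [0, 3, 6, 9], [1, 4, 7], and [2, 5, 8]; the decoder then returns whichever combination's result the model scores highest, matching claim 2's probability-based selection.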
CN202010128594.XA 2020-02-28 2020-02-28 Voice decoding method and device Active CN111292725B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010128594.XA CN111292725B (en) 2020-02-28 2020-02-28 Voice decoding method and device


Publications (2)

Publication Number Publication Date
CN111292725A CN111292725A (en) 2020-06-16
CN111292725B true CN111292725B (en) 2022-11-25

Family

ID=71020229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010128594.XA Active CN111292725B (en) 2020-02-28 2020-02-28 Voice decoding method and device

Country Status (1)

Country Link
CN (1) CN111292725B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768764B (en) * 2020-06-23 2024-01-19 北京猎户星空科技有限公司 Voice data processing method and device, electronic equipment and medium
CN112863497B (en) * 2020-12-31 2022-10-21 思必驰科技股份有限公司 Method and device for speech recognition, electronic equipment and computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5749066A (en) * 1995-04-24 1998-05-05 Ericsson Messaging Systems Inc. Method and apparatus for developing a neural network for phoneme recognition
CN101484935A (en) * 2006-09-29 2009-07-15 LG Electronics Inc. Methods and apparatuses for encoding and decoding object-based audio signals
WO2009097738A1 (en) * 2008-01-30 2009-08-13 Institute Of Computing Technology, Chinese Academy Of Sciences Method and system for audio matching
KR101475862B1 (en) * 2013-09-24 2014-12-23 (주)파워보이스 Encoding apparatus and method for encoding sound code, decoding apparatus and method for decoding the sound code
CN105096939A (en) * 2015-07-08 2015-11-25 百度在线网络技术(北京)有限公司 Voice wake-up method and device
CN107424620A (en) * 2017-07-27 2017-12-01 苏州科达科技股份有限公司 A kind of audio-frequency decoding method and device
WO2019222996A1 (en) * 2018-05-25 2019-11-28 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for voice recognition
CN110767231A (en) * 2019-09-19 2020-02-07 平安科技(深圳)有限公司 Voice control equipment awakening word identification method and device based on time delay neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007142840A (en) * 2005-11-18 2007-06-07 Canon Inc Information processing apparatus and information processing method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fast decoding algorithm for speech recognition based on recurrent neural networks; Zhang Ge et al.; Journal of Electronics & Information Technology (Issue 04); full text *

Also Published As

Publication number Publication date
CN111292725A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
CN109065044B (en) Awakening word recognition method and device, electronic equipment and computer readable storage medium
CN108630193B (en) Voice recognition method and device
CN111292725B (en) Voice decoding method and device
CN103916513A (en) Method and device for recording communication message at communication terminal
CN110473519B (en) Voice processing method and device
CN109754789A (en) The recognition methods of phoneme of speech sound and device
CN111382241A (en) Session scene switching method and device
CN103514882A (en) Voice identification method and system
CN112735407A (en) Conversation processing method and device
CN112749299A (en) Method and device for determining video type, electronic equipment and readable storage medium
CN110969276B (en) Decision prediction method, decision prediction model obtaining method and device
CN114022955A (en) Action recognition method and device
CN109727603A (en) Method of speech processing, device, user equipment and storage medium
CN113611329A (en) Method and device for detecting abnormal voice
CN109559733B (en) Voice rhythm processing method and device
CN112908315A (en) Question-answer intention judgment method based on voice characteristics and voice recognition
CN113012680B (en) Speech technology synthesis method and device for speech robot
CN115171735A (en) Voice activity detection method, storage medium and electronic equipment
CN115620706A (en) Model training method, device, equipment and storage medium
CN113889086A (en) Training method of voice recognition model, voice recognition method and related device
CN108922547B (en) Identity identification method and device and electronic equipment
CN111710338B (en) Speaking playing method and device
CN109559753B (en) Speech recognition method and device
CN112397073A (en) Audio data processing method and device
CN111539520A (en) Method and device for enhancing robustness of deep learning model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant