CN111402906A - Speech decoding method, apparatus, engine and storage medium - Google Patents

Info

Publication number
CN111402906A
Authority
CN
China
Prior art keywords
decoding
thread
level
voice
channel
Prior art date
Legal status
Pending
Application number
CN202010155132.7A
Other languages
Chinese (zh)
Inventor
赵伟伟
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202010155132.7A
Publication of CN111402906A

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 — Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 — Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Abstract

The invention discloses a speech decoding method, apparatus, engine and storage medium. The method is applied to a speech decoding engine: when a plurality of voice decoding requests are received, a plurality of thread-level decoding channels are applied for, the plurality of voice decoding requests corresponding one-to-one to the plurality of thread-level decoding channels; the thread-level decoding channels each call a general model to perform parallel decoding processing on the voice stream data in the voice decoding requests, obtain decoding results, and respond to the voice decoding requests based on the decoding results. In this way, a plurality of voice decoding requests are processed in parallel through a plurality of thread-level decoding channels that share one general model, so that thread-level parallel processing of voice decoding is realized, hardware cost is reduced, and the concurrency capability and decoding efficiency of voice decoding are improved.

Description

Speech decoding method, apparatus, engine and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech decoding method, apparatus, engine, and storage medium.
Background
With the development of computer technology, more and more technologies (big data, distributed computing, blockchain, artificial intelligence, etc.) are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech). However, the security and real-time requirements of the financial industry also place higher demands on these technologies.
Speech decoding is an important component of speech recognition. Currently, a general model is typically used to decode voice stream data to obtain the text corresponding to that data. If parallel processing is required to improve speech decoding efficiency, it can only be achieved by deploying more copies of the general model at the process level; however, the general model is large, so hardware cost increases greatly.
Disclosure of Invention
The invention provides a speech decoding method, apparatus, engine and storage medium, aiming to have a plurality of decoding channels share one general model, realize thread-level parallel processing, reduce hardware cost, and improve the concurrency capability and decoding efficiency of speech decoding.
To achieve the above object, the present invention provides a speech decoding method, including:
when receiving a plurality of voice decoding requests, applying for a plurality of thread-level decoding channels, wherein the plurality of voice decoding requests correspond to the plurality of thread-level decoding channels one to one;
and calling a general model by using the thread-level decoding channels respectively, performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results, and responding to the voice decoding requests based on the decoding results.
Preferably, the thread-level decoding channel comprises a channel decoding unit, a data cache region and a callback interface unit;
the step of respectively calling the general models by using the thread-level decoding channels, performing parallel decoding processing on the voice stream data in the voice decoding requests to obtain decoding results, and responding to the voice decoding requests based on the decoding results comprises:
respectively caching voice stream data in the voice decoding requests by utilizing data caching areas of the thread-level decoding channels;
respectively calling a universal model by utilizing the channel decoding units of the thread-level decoding channels, and performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results;
and respectively responding to the plurality of voice decoding requests by utilizing the callback interface units of the plurality of thread-level decoding channels based on the decoding results.
Preferably, the buffering the voice stream data in the plurality of voice decoding requests respectively by using the data buffering areas of the plurality of thread-level decoding channels includes:
for any particular thread-level decoding channel of the plurality of thread-level decoding channels, checking a data state of a data buffer of the particular thread-level decoding channel;
if the data state of the data cache region of the specific thread-level decoding channel is waiting for data, the voice stream data corresponding to the specific thread-level decoding channel is directly and temporarily stored in the data cache region of the specific thread-level decoding channel;
if the data state of the data buffer area of the specific thread-level decoding channel is data present, the voice stream data corresponding to the specific thread-level decoding channel is temporarily stored at the end of the data buffer area of the specific thread-level decoding channel.
Preferably, the step of using the channel decoding units of the multiple thread-level decoding channels to respectively call a general model to perform parallel decoding processing on the voice stream data in the multiple voice decoding requests, and obtaining the decoding result includes:
respectively calling a universal model by utilizing the channel decoding units of the thread-level decoding channels;
and on the basis of the general model, converting the voice stream data into a characteristic vector set in parallel in each channel decoding unit, and converting the characteristic vector set into a decoding result.
Preferably, the thread-level decode channel further comprises a state control unit,
the method further comprises the following steps:
updating the running state of the thread-level decoding channel, the data state of the data cache region and the registration state of the callback interface unit in real time through a state control unit of the thread-level decoding channel so as to execute corresponding steps according to the running state, the data state and the registration state; and/or
And receiving an external control signal through a state control unit of the thread-level decoding channel, and adjusting the running state of the thread-level decoding channel according to the external control signal.
Preferably, the thread-level decode channel includes a reclaim unit;
after the step of calling the generic models respectively by using the decoding channels at the thread levels, performing parallel decoding processing on the voice stream data in the voice decoding requests, and responding to the voice decoding requests based on the decoding results, the method further includes:
clearing voice stream data in a data buffer area of the thread-level decoding channel through a recovery unit of the thread-level decoding channel so as to store the voice stream data again by utilizing the data buffer area; and/or
And clearing the state information recorded by the state control unit of the thread-level decoding channel through the recovery unit of the thread-level decoding channel so that the state control unit can record the state of the thread-level decoding channel again.
Preferably, the registration state of the callback interface unit is registered or unregistered;
before the step of using the callback interface unit of the plurality of thread-level decoding channels to respectively respond to the plurality of speech decoding requests based on the decoding results, the method further comprises:
checking the registration state of the callback interface of the thread-level decoding channel recorded by the state control unit of each thread-level decoding channel;
if the registration state of the callback interface unit of the thread-level decoding channel is registered, executing the following steps: respectively responding to the plurality of voice decoding requests by utilizing callback interface units of the plurality of thread-level decoding channels based on the decoding results;
and if the registration state of the callback interface unit of the thread-level decoding channel is unregistered, clearing the voice stream data of the data cache region through the recovery unit of the thread-level decoding channel, and clearing the state information updated by the state control unit.
In addition, to achieve the above object, the present invention also provides a speech decoding apparatus comprising:
the device comprises an application module, a processing module and a processing module, wherein the application module is used for applying for a plurality of thread-level decoding channels when a plurality of voice decoding requests are received, and the plurality of voice decoding requests correspond to the plurality of thread-level decoding channels one to one;
and the decoding module is used for calling a universal model by utilizing the thread-level decoding channels respectively, performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results, and responding to the voice decoding requests based on the decoding results.
In addition, to achieve the above object, the present invention further provides a speech decoding engine, which includes a processor, a memory and a speech decoding program stored in the memory, wherein when the speech decoding program is executed by the processor, the steps of the speech decoding method are implemented.
In addition, to achieve the above object, the present invention also provides a computer storage medium having a speech decoding program stored thereon, the speech decoding program implementing the steps of the speech decoding method as described above when executed by a processor.
Compared with the prior art, the invention provides a speech decoding method, apparatus, engine and storage medium. The method is applied to a speech decoding engine: when a plurality of voice decoding requests are received, a plurality of thread-level decoding channels are applied for, the plurality of voice decoding requests corresponding one-to-one to the plurality of thread-level decoding channels; the thread-level decoding channels each call a general model to perform parallel decoding processing on the voice stream data in the voice decoding requests, obtain decoding results, and respond to the voice decoding requests based on the decoding results. In this way, a plurality of voice decoding requests are processed in parallel through a plurality of thread-level decoding channels that share one general model, so that thread-level parallel processing of voice decoding is realized, hardware cost is reduced, and the concurrency capability and decoding efficiency of voice decoding are improved.
Drawings
FIG. 1 is a diagram illustrating a hardware architecture of a speech decoding engine according to various embodiments of the present invention;
FIG. 2 is a schematic diagram of the speech decoding engine of the present invention;
FIG. 3 is a schematic diagram of the components of the speech decoding channel of the present invention;
FIG. 4 is a flowchart illustrating a first embodiment of a speech decoding method according to the present invention;
FIG. 5 is a schematic view of a voice stream data processing flow of an embodiment of the voice decoding method of the present invention;
FIG. 6 is a functional block diagram of a speech decoding apparatus according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The speech decoding engine mainly involved in the embodiments of the invention is an engine capable of establishing network connections; the speech decoding engine may be a server, a cloud platform, or the like.
Referring to fig. 1, fig. 1 is a schematic diagram of a hardware structure of a speech decoding engine according to embodiments of the present invention. In this embodiment of the present invention, the speech decoding engine may include a processor 1001 (e.g., a Central Processing Unit, CPU), a communication bus 1002, an input port 1003, an output port 1004, and a memory 1005. The communication bus 1002 is used for realizing connection and communication among these components; the input port 1003 is used for data input; the output port 1004 is used for data output; the memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a magnetic disk memory, and may optionally be a storage device independent of the processor 1001. Those skilled in the art will appreciate that the hardware configuration depicted in fig. 1 does not limit the present invention, and may include more or fewer components than those shown, a combination of some components, or a different arrangement of components.
With continued reference to fig. 1, the memory 1005 of fig. 1, which is one type of readable storage medium, may include an operating system, a network communication module, an application program module, and a voice decoding program. In fig. 1, the network communication module is mainly used for connecting to a server and performing data communication with the server; and the processor 1001 may call the speech decoding program stored in the memory 1005 and execute the speech decoding method provided by the embodiment of the present invention.
Further, referring to fig. 2, fig. 2 is a schematic diagram illustrating the components of the speech decoding engine according to the present invention. The decoding engine is an important component of speech recognition technology: it interfaces with the network communication protocol, manages voice stream data and the decoding algorithm, returns decoding results, and responds to various control signals. In terms of the device on which it runs, decoding engines can be divided into terminal decoding engines (e.g., handheld devices, embedded devices), server decoding engines (e.g., cloud services), and the like. A decoder is usually designed for a specific scenario; for example, a terminal decoding engine only needs one-way decoding and must reduce power consumption, whereas a cloud service needs as many parallel decoding paths as possible and must reduce deployment cost.
The general model is a speech decoding model trained from a huge amount of data. It can serve as a training basis for vertical-domain recognition and, through transfer learning combined with domain data, yields a high-precision model. It is understood that the general model comprises a decoding algorithm, which can be a beam search algorithm, a Viterbi algorithm, etc. The general model extracts features from voice stream data by methods such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC) and Linear Prediction Coefficient (LPC) analysis, converting the voice stream data into a feature vector set; the feature vector set is converted into probabilities by means of an acoustic model, from which the final decoding result of the speech decoding process is produced.
The general model provides support for the multiple decoding channels to decode voice streams. With continued reference to fig. 2, the decoding engine includes a plurality of thread-level decoding channels: thread-level decoding channel 1, thread-level decoding channel 2, ..., thread-level decoding channel n. The thread-level decoding channels can process voice streams in parallel. Each thread-level decoding channel corresponds to one voice decoding request, so that a timely and quick response can be provided for each voice decoding request. Moreover, because the plurality of thread-level decoding channels are built on top of one general model, the volume of the general model is not increased, and compared with process-level parallel processing the hardware cost can be greatly reduced. Each path uses one thread-level decoding channel; the thread-level decoding channels are isolated from one another but use the same general model, so one process can comprise n thread-level decoding channels, i.e., n-way parallel decoding capability. Because the thread-level decoding channels run at the thread level, one process can comprise any number of parallel decoding channels; compared with the one-process-one-decoding-channel approach, this not only multiplies processing efficiency but also greatly improves the usage efficiency of memory and GPU memory.
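As an illustration of this sharing arrangement, the following is a minimal Python sketch, not taken from the patent: one process loads a single general model object, and each voice decoding request is served by its own thread-level channel that calls that shared model. The GenericModel class and its decode method are hypothetical placeholders.

import concurrent.futures

class GenericModel:
    """Hypothetical stand-in for the large, shared general model.
    Loaded once per process; every thread-level channel calls it."""
    def decode(self, audio_bytes: bytes) -> str:
        # Placeholder for feature extraction plus acoustic/language model decoding.
        return f"<text for {len(audio_bytes)} bytes of audio>"

# One process, one model instance, n thread-level decoding channels.
SHARED_MODEL = GenericModel()

def thread_level_channel(request_id: int, audio_bytes: bytes) -> str:
    """One decoding channel: runs on its own thread but reuses SHARED_MODEL."""
    return f"request {request_id}: {SHARED_MODEL.decode(audio_bytes)}"

if __name__ == "__main__":
    requests = {1: b"\x00" * 16000, 2: b"\x00" * 32000, 3: b"\x00" * 8000}
    # n channels inside one process instead of n single-channel processes,
    # so the model occupies memory (or GPU memory) only once.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(requests)) as pool:
        for result in pool.map(lambda kv: thread_level_channel(*kv), requests.items()):
            print(result)

In this sketch the decoding work is simulated, but the structural point matches the description above: the channels are isolated from one another while the model instance is shared.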
Further, referring to fig. 3, fig. 3 is a schematic diagram of the voice decoding channel according to the present invention. The thread-level decoding channel comprises a data cache region, a channel decoding unit, a callback interface unit and a state control unit.
The data buffer area is used for storing voice stream data for the channel decoding unit to read; the buffer capacity of the data buffer area can be specifically set according to actual needs.
The channel decoding unit extracts and reads voice stream data from the data buffer and, after obtaining the voice stream data, calls the general model to decode it and obtain a decoding result. Generally, the decoding result is the optimal text corresponding to the voice stream data.
the callback interface unit is used for returning the decoding result to the client;
the state control unit is used for updating the states of the data cache region, the channel decoding unit and the callback interface unit in real time; the state control unit is also used for influencing an external control signal and adjusting the running state of the channel according to the external control signal.
Further, the thread-level decoding channel further includes a recovery unit, where the recovery unit is configured to clear the data buffer, clear the state information updated by the state control unit, and clear the data format information of the voice stream data recorded by the state control unit. Thus, the thread-level decode channel may be reused after a task is completed.
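For orientation only, the channel composition described above can be pictured as the following Python sketch; all class and field names are assumptions for illustration, not identifiers from the patent.

from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class StateControlUnit:
    running_state: str = "stopped"             # running / stopped
    data_state: str = "waiting_for_data"       # waiting_for_data / data_present / transfer_ended
    registration_state: str = "unregistered"   # registered / unregistered
    data_format: dict = field(default_factory=dict)  # stream format, sample rate, encoding

@dataclass
class ThreadLevelDecodeChannel:
    data_buffer: deque = field(default_factory=deque)  # holds incoming voice stream chunks
    callback: Optional[Callable[[str], None]] = None   # callback interface unit
    state: StateControlUnit = field(default_factory=StateControlUnit)

    def recycle(self) -> None:
        """Recovery unit: clear the buffer and recorded state so the channel can be reused."""
        self.data_buffer.clear()
        self.state = StateControlUnit()
        self.callback = None

Later sketches in this description reuse these illustrative names.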
The embodiment of the invention provides a voice decoding method.
Referring to fig. 4, fig. 4 is a flowchart illustrating a speech decoding method according to a first embodiment of the present invention.
Step S101: when receiving a plurality of voice decoding requests, applying for a plurality of thread-level decoding channels, wherein the plurality of voice decoding requests correspond to the plurality of thread-level decoding channels one to one;
after the network connection between the speech decoding engine and the client is established, the speech decoding engine can receive the speech decoding request sent by the client. The speech decoding engine comprises a plurality of thread-level decoding channels for decoding, so that network connection can be established with a plurality of clients simultaneously, and a plurality of corresponding speech decoding requests can be received.
And the voice decoding engine establishes network connection with a plurality of clients, and applies for a corresponding thread-level decoding channel for each client after the connection is successful. The plurality of speech decoding requests correspond one-to-one to the plurality of thread-level decoding channels.
Step S102: and calling a general model by using the thread-level decoding channels respectively, performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results, and responding to the voice decoding requests based on the decoding results.
The thread-level decoding channel comprises a channel decoding unit, a data cache region and a callback interface unit, which operate cooperatively to store voice stream data, decode the voice stream data to obtain a decoding result, and return the decoding result in response to the voice decoding request.
Specifically, the step S102: the steps of respectively calling a general model by using the thread-level decoding channels, performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results, and responding to the voice decoding requests based on the decoding results comprise:
step S102 a: respectively caching voice stream data in the voice decoding requests by utilizing data caching areas of the thread-level decoding channels;
the data buffer area is used for buffering voice stream data received based on the voice decoding request, and the multiple data buffer areas of the multiple thread-level decoding channels respectively store the voice stream data in different voice decoding requests. The voice stream data corresponding to the voice decoding requests are respectively and temporarily stored in different data cache regions, and the different voice stream data are isolated from each other, so that the integrity of the voice stream data is ensured, and the voice stream data is stored orderly and is not easy to lose.
Specifically, the step S102 a: the step of buffering the voice stream data in the plurality of voice decoding requests respectively by using the data buffer areas of the plurality of thread-level decoding channels comprises:
step S102a 1: for any particular thread-level decoding channel of the plurality of thread-level decoding channels, checking a data state of a data buffer of the particular thread-level decoding channel;
each voice decoding request corresponds to a particular thread-level decoding channel, and the particular thread-level decoding channel corresponding to the voice decoding request is marked as a particular thread-level decoding channel.
When the voice stream data is stored in the data buffer area of the specific thread-level decoding channel, the data state of the data buffer area of the specific thread-level decoding channel needs to be determined first, and a corresponding voice stream data storage mode is selected according to the data state.
Step S102a 2: if the data state of the data cache region of the specific thread-level decoding channel is waiting for data, the voice stream data corresponding to the specific thread-level decoding channel is directly and temporarily stored in the data cache region of the specific thread-level decoding channel;
in this embodiment, if the data state of the data buffer of the specific thread-level decoding channel is waiting for data, the voice stream data corresponding to the specific thread-level decoding channel is directly and temporarily stored in the data buffer of the specific thread-level decoding channel.
Step S102a 3: if the data state of the data buffer area of the specific thread-level decoding channel is data present, the voice stream data corresponding to the specific thread-level decoding channel is temporarily stored at the end of the data buffer area of the specific thread-level decoding channel.
Therefore, the voice stream data in the data buffer area can be sequentially extracted by the channel decoding unit, the loss of the voice stream data is avoided, and the undecoded voice stream data is not missed in the decoding process.
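A hedged sketch of this buffering rule, reusing the illustrative ThreadLevelDecodeChannel from the earlier sketch (the state labels are assumptions):

def buffer_voice_stream(channel: ThreadLevelDecodeChannel, chunk: bytes) -> None:
    """Temporarily store an incoming chunk according to the buffer's data state."""
    if channel.state.data_state == "waiting_for_data":
        # The buffer is waiting for data: the chunk is stored directly as the first pending entry.
        channel.data_buffer.append(chunk)
        channel.state.data_state = "data_present"
    elif channel.state.data_state == "data_present":
        # Undecoded data is already queued: append at the end so the channel
        # decoding unit reads chunks strictly in arrival order.
        channel.data_buffer.append(chunk)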
Step S102 b: respectively calling a universal model by utilizing the channel decoding units of the thread-level decoding channels, and performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results;
in this embodiment, the general model provides support for the multiple thread-level decoding channels, and the multiple thread-level decoding channels decode respective voice stream data by using the general model to obtain a decoding result.
Specifically, the step S102b includes:
step S102b 1: respectively calling a universal model by utilizing the channel decoding units of the thread-level decoding channels;
and after the channel decoding unit extracts the voice stream data from the data cache region, calling a general model, and decoding the voice stream data by the general model.
Step S102b 2: and on the basis of the general model, converting the voice stream data into a characteristic vector set in parallel in each channel decoding unit, and converting the characteristic vector set into a decoding result.
The general model is a speech decoding model trained from a huge amount of data; it can serve as a training basis for vertical-domain recognition and, through transfer learning combined with domain data, yields a high-precision model. The general model includes a decoding algorithm, which can be a beam search algorithm, a Viterbi algorithm, etc. In this embodiment, the general model converts the voice stream data into a feature vector set using feature extraction methods such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC) and Linear Prediction Coefficient (LPC) analysis, converts the feature vector set into probabilities by means of an acoustic model, and converts these into the final decoding result, which is provided as the output of the speech decoding process.
In this embodiment, decoding may be performed based on the Viterbi algorithm. The Viterbi algorithm is a general decoding algorithm: a dynamic-programming method for finding the shortest path through a sequence. Specifically, it finds the most likely sequence of hidden states (the Viterbi path) that produces a given sequence of observed events, particularly in the context of Markov information sources and hidden Markov models. In this embodiment, based on the voice stream data, the most likely hidden state sequence is obtained using the Viterbi algorithm and recorded as the decoding result. Generally, the decoding result is the text corresponding to the voice stream data. Alternatively, the voice stream data may be decoded based on a beam search algorithm. Beam search is a heuristic graph search algorithm, generally used when the solution space of the graph is relatively large: to reduce the space and time taken by the search, some lower-quality nodes are pruned at each depth of the expansion and some higher-quality nodes are kept. This reduces space consumption and improves time efficiency.
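As a concrete, toy-sized illustration of the Viterbi step mentioned here (not the patent's actual decoder), the sketch below finds the most likely hidden-state sequence for a short observation sequence under a small hidden Markov model.

import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observation sequence.

    obs:     list of observation indices, length T
    start_p: (S,)   initial state probabilities
    trans_p: (S, S) transition probabilities
    emit_p:  (S, O) emission probabilities
    """
    S, T = len(start_p), len(obs)
    logp = np.full((T, S), -np.inf)       # best log-probability ending in each state
    back = np.zeros((T, S), dtype=int)    # backpointers for path recovery

    logp[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(S):
            scores = logp[t - 1] + np.log(trans_p[:, s]) + np.log(emit_p[s, obs[t]])
            back[t, s] = int(np.argmax(scores))
            logp[t, s] = scores[back[t, s]]

    path = [int(np.argmax(logp[-1]))]     # trace back from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 2 hidden states, 3 observation symbols.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], start, trans, emit))  # -> [0, 0, 1]

In a real decoder the observations would come from the acoustic feature vectors and the state space from the general model; beam search would additionally prune low-scoring paths at each step instead of keeping all of them.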
Step S102c: and respectively responding to the plurality of voice decoding requests by utilizing the callback interface units of the plurality of thread-level decoding channels based on the decoding results.
After the connection with the client corresponding to the voice decoding request is successfully established and the corresponding decoding channel is obtained, a callback interface unit is registered so that the callback interface unit can return the decoding result to the corresponding client.
After the corresponding channel decoding unit and the callback interface unit are obtained, initializing the corresponding data cache region, the channel decoding unit and the callback interface unit for receiving the voice stream corresponding to the voice decoding request, decoding the voice stream by the channel decoding unit, and responding the decoding result to the plurality of voice decoding requests by the callback interface unit.
In this embodiment, network connections may be established with multiple clients simultaneously or sequentially. After a network connection is established, voice stream data uploaded by the client is received in real time through the communication protocol determined by the client; the communication protocol may be TCP (Transmission Control Protocol), HTTP (Hypertext Transfer Protocol), WebSocket (a full-duplex communication protocol over TCP), MRCP (Media Resource Control Protocol), or the like. The client may be a web page, a microphone, a mobile terminal, or the like.
Each client corresponds to one decoding channel; the decoding channels are isolated from each other and each performs its own voice decoding process, and the decoding processes of different decoding channels do not interfere with each other. For example, suppose three speech decoding requests come from three clients (client A, client B and client C); three decoding channels are correspondingly applied for: decoding channel A, decoding channel B, decoding channel C. It can be understood that, if the number of voice decoding requests, that is, the number of corresponding clients, exceeds the preset maximum number of decoding channels, the excess voice decoding requests are added to a queuing sequence according to a queuing rule and marked as queued voice decoding requests; after one or more decoding channels complete their current voice decoding tasks, a corresponding number of the queued voice decoding requests are admitted. After a decoding channel is determined, the channel decoding unit in the decoding channel is initialized, and a callback interface unit is registered for returning the corresponding decoding result through the callback interface unit.
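A minimal sketch of this admission and queuing behaviour, under the assumption of a fixed channel pool and first-in-first-out queuing (the patent does not specify the queuing rule beyond what is stated above):

import queue

free_channels: "queue.Queue[str]" = queue.Queue()
for name in ("decoding channel A", "decoding channel B", "decoding channel C"):
    free_channels.put(name)

queued_requests: "queue.Queue[str]" = queue.Queue()  # queued voice decoding requests

def admit(request_id: str) -> None:
    """Assign a free decoding channel, or queue the request if none is free."""
    try:
        channel = free_channels.get_nowait()
        print(f"{request_id} -> {channel}")
    except queue.Empty:
        queued_requests.put(request_id)
        print(f"{request_id} queued")

def release(channel: str) -> None:
    """When a channel finishes its task, hand it to the next queued request, if any."""
    try:
        print(f"{queued_requests.get_nowait()} -> {channel} (dequeued)")
    except queue.Empty:
        free_channels.put(channel)

for client in ("client A", "client B", "client C", "client D"):
    admit(client)              # client D exceeds the three channels and is queued
release("decoding channel B")  # client D is then admitted to the freed channel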
Furthermore, the thread-level decoding channel further comprises a state control unit, and the state control unit is used for updating the states of the data buffer, the channel decoding unit and the callback interface unit in real time; the state control unit is also used for responding to an external control signal and adjusting the running state of the channel according to the external control signal.
The method further comprises the following steps:
step S200: recording the running state of the thread-level decoding channel, the data state of the data cache region and the registration state of the callback interface unit in real time through a state control unit of the thread-level decoding channel so as to execute corresponding steps according to the running state, the data state and the registration state;
it is understood that, as the speech decoding engine processes the voice stream data, the running state of a decoding channel in the speech decoding engine, the data state of its data cache, and the registration state of its callback interface unit may change. In order to better monitor the voice decoding process, this embodiment updates the running state, the data state and the registration state in real time through the state control unit, so that corresponding steps are executed according to the running state, the data state and the registration state. The running state includes running and stopped; the data state includes data present, waiting for data, and end-of-transmission data received; the registration state includes registered and unregistered.
Step S300: and receiving an external control signal through a state control unit of the thread-level decoding channel, and adjusting the running state of the thread-level decoding channel according to the external control signal.
Further, the state control unit of the thread-level decoding channel may also receive an external control signal, where the external control signal is sent by the client that sent the voice decoding request. The external control signal includes normal data stream signals and a network connection error signal, where the normal data stream signals include start, pause, end, and the like. If the external control signal is a start signal, the running state of the thread-level decoding channel is updated to running; if the external control signal is an end signal, the running state of the thread-level decoding channel is updated to stopped.
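A hedged sketch of how such a state control unit might map external control signals onto the running state, reusing the illustrative ThreadLevelDecodeChannel from the earlier sketch (the signal names are taken from the description above; the pause and error handling shown is an assumption):

def handle_external_signal(channel: ThreadLevelDecodeChannel, signal: str) -> None:
    """Adjust the channel's running state according to an external control signal."""
    if signal == "start":
        channel.state.running_state = "running"
    elif signal == "end":
        channel.state.running_state = "stopped"
    elif signal in ("pause", "network_error"):
        # Assumption: a pause or a network connection error also halts the channel;
        # the description only spells out the start and end cases explicitly.
        channel.state.running_state = "stopped"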
Further, the thread-level decode channel includes a reclaim unit; the recovery unit is configured to empty the voice stream data in the data buffer of the thread-level decoding channel, so as to store the voice stream data again by using the data buffer. The recovery unit is further configured to empty the state information recorded by the state control unit of the thread-level decoding channel, so that the state control unit updates the state of the thread-level decoding channel again.
Further, after the step of calling the generic model by using the decoding channels at the thread levels, performing parallel decoding processing on the voice stream data in the voice decoding requests, and responding to the voice decoding requests based on the decoding result, the method further includes:
step S2011a, emptying the voice stream data in the data buffer area of the thread-level decoding channel through the recovery unit of the thread-level decoding channel, so as to store the voice stream data again by utilizing the data buffer area;
in the process of executing the decoding process on the voice stream data, if a network connection is interrupted, a stopped external control instruction is received, or the like, some or all of the voice stream data stored in the data buffer may not be extracted yet, and thus, the voice stream data is still stored in the data buffer. At this time, the voice stream data in the data buffer of the thread-level decoding channel needs to be emptied by the recovery unit, so that when the thread-level decoding channel needs to perform the voice decoding process again, the voice stream data can be stored again by using the data buffer.
Step S2011b, clearing the state information recorded by the state control unit of the thread-level decoding channel through the recovery unit of the thread-level decoding channel, so that the state control unit can record the state of the thread-level decoding channel again.
While the thread-level decoding channel performs voice decoding, the state control unit of the thread-level decoding channel records state information of the thread-level decoding channel, the data cache region and the callback interface unit. It can be understood that, when a speech decoding process ends, the recovery unit needs to clear the state information recorded by the state control unit, so that the state control unit can record the state of the thread-level decoding channel anew.
Further, when the thread-level decoding channel receives voice stream data, the data format information of the voice stream data is recorded and updated by the state control unit, where the data format information includes the data stream format, the data stream sampling rate, the data stream encoding, and the like. After the corresponding voice decoding process ends, the recovery unit clears the data format information of the voice stream data recorded by the state control unit. In this way, disk space of the speech decoding engine can be saved.
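Continuing the same hedged sketch, the recovery step after a decoding task could look as follows; the function is an illustration of the behaviour described above, not the patent's implementation.

def recycle_channel(channel: ThreadLevelDecodeChannel) -> None:
    """Recovery unit: make the channel reusable for the next voice decoding request."""
    channel.data_buffer.clear()         # drop any unread voice stream data
    channel.state = StateControlUnit()  # reset running/data/registration state and the recorded
                                        # data format (stream format, sampling rate, encoding)
    channel.callback = None             # the next request registers its own callback interface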
Further, the registration state of the callback interface unit is registered or unregistered; if the client corresponding to the voice decoding request registers the callback interface, the state control unit updates the state of the corresponding callback interface to be registered; otherwise, the registration state is unregistered.
Before the step of responding to the speech decoding requests respectively based on the decoding results by using the callback interface units of the thread-level decoding channels, the step S102c further includes:
step S102c 0: checking the registration state of the callback interface of the thread-level decoding channel recorded by the state control unit of each thread-level decoding channel;
if the registration state of the callback interface unit of the thread-level decoding channel is registered, executing the following steps: respectively responding to the plurality of voice decoding requests by utilizing callback interface units of the plurality of thread-level decoding channels based on the decoding results;
and if the registration state of the callback interface unit of the thread-level decoding channel is unregistered, clearing the voice stream data of the data cache region through the recovery unit of the thread-level decoding channel, and clearing the state information updated by the state control unit.
It is to be understood that, if the registration status is registered, the decoding result may be directly responded to the plurality of clients corresponding to the plurality of speech decoding requests through the plurality of registered callback interface units. If the registration state is unregistered, it indicates that there is no legal callback interface unit, and it is necessary to clear the voice stream data in the data cache region and clear the state information updated by the state control unit through the recovery unit of the thread-level decoding channel.
Referring to fig. 5, fig. 5 is a schematic view of a voice stream data processing flow according to an embodiment of the present invention, which illustrates an overall flow of the voice decoding method according to the present invention, taking a specific thread-level decoding channel as an example.
As shown in fig. 5, first, a voice decoding request is received, a thread-level decoding channel is applied for, the decoding unit is initialized, and a callback interface unit is registered; the state control unit then updates the running state of the thread-level decoding channel to running, the state of the callback interface unit to registered, and the state of the data cache area to waiting for data. Furthermore, data format information of the voice stream data, including the data stream format, the data stream sampling rate, the data stream encoding, and the like, may also be updated by the state control unit.
Then, the voice stream data is sent to the thread-level decoding channel through the communication protocol, and the thread-level decoding channel temporarily stores the voice stream data in the data cache region; if the data buffer still holds undecoded data, the new data is appended at the end of the buffer, and the state of the data buffer is updated to 'data present' through the state control unit.
Further, the voice stream data stored in the data buffer is extracted through the channel decoding unit, and the data buffer is emptied; the data state is updated to 'waiting for data' by the state control unit; and the voice stream data is decoded to generate a decoding result.
After the decoding result is obtained, checking the registration state of the callback interface unit: and if the registration state is unregistered, clearing state information recorded by a channel state control unit of the thread-level decoding channel through a recovery unit.
Otherwise, if the registration state is registered, returning a decoding result through a callback interface;
further, it is determined whether there is an end data stream signal, where the end data stream transmission signal may be an external control signal, and if there is no end data stream signal, the steps are performed: sending the voice stream data to a thread level decoding channel through a communication protocol, wherein the thread level decoding channel temporarily stores the voice stream data in a data cache region; if the data buffer area has un-decoded data, the data buffer area is added at the end of the buffer area, and the state of the data buffer area is updated to be 'data-present' through a state control unit.
If the data stream ending signal exists, the data state is updated to be 'data transmission ending' through the control unit, and the decoding unit reads all data in the data cache region to obtain a decoding result;
then checking the registration state of the callback interface, if the registration state is registered, returning a decoding result through the callback interface unit, and updating the running state of the thread-level decoding channel to be 'stopped running' through the state control unit architecture; and if the registration state is unregistered, clearing state information recorded by a channel state control unit of the thread-level decoding channel through a recovery unit.
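Read as pseudocode, the FIG. 5 flow for a single channel can be summarized as below. This is a hedged sketch that reuses the illustrative names introduced earlier in this description (SHARED_MODEL, buffer_voice_stream, recycle_channel, ThreadLevelDecodeChannel); it is not a verbatim rendering of the figure.

def run_channel(channel: ThreadLevelDecodeChannel, incoming_chunks, end_after: int) -> None:
    """One pass of the FIG. 5 style flow on a single thread-level decoding channel."""
    channel.state.running_state = "running"
    channel.state.registration_state = "registered"
    channel.state.data_state = "waiting_for_data"

    for i, chunk in enumerate(incoming_chunks):
        buffer_voice_stream(channel, chunk)            # temp-store in the data buffer
        pending = b"".join(channel.data_buffer)        # channel decoding unit reads all buffered data
        channel.data_buffer.clear()
        channel.state.data_state = "waiting_for_data"
        result = SHARED_MODEL.decode(pending)          # decode via the shared general model
        if channel.state.registration_state == "registered" and channel.callback:
            channel.callback(result)                   # return the decoding result via the callback
        if i + 1 >= end_after:                         # simulated end-of-data-stream signal
            channel.state.data_state = "transfer_ended"
            break

    channel.state.running_state = "stopped"
    recycle_channel(channel)                           # recovery unit cleans up for reuse

# Example: two chunks of audio, with the end signal after the second one.
ch = ThreadLevelDecodeChannel(callback=print)
run_channel(ch, [b"\x00" * 3200, b"\x00" * 3200], end_after=2)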
According to this scheme, when a plurality of voice decoding requests are received, a plurality of thread-level decoding channels are applied for, the plurality of voice decoding requests corresponding one-to-one to the plurality of thread-level decoding channels; the thread-level decoding channels each call a general model to perform parallel decoding processing on the voice stream data in the voice decoding requests, obtain decoding results, and respond to the voice decoding requests based on the decoding results. In this way, a plurality of voice decoding requests are processed in parallel through a plurality of thread-level decoding channels that share one general model, so that thread-level parallel processing of voice decoding is realized, hardware cost is reduced, and the concurrency capability and decoding efficiency of voice decoding are improved.
In addition, the embodiment also provides a voice decoding device. Referring to fig. 6, fig. 6 is a functional block diagram of a speech decoding apparatus according to a first embodiment of the present invention.
In this embodiment, the speech decoding apparatus is a virtual apparatus stored in the memory 1005 of the speech decoding engine shown in fig. 1, so as to implement all functions of the speech decoding program: applying for a plurality of thread-level decoding channels when a plurality of voice decoding requests are received, the plurality of voice decoding requests corresponding one-to-one to the plurality of thread-level decoding channels; and calling a general model by using the thread-level decoding channels respectively, performing parallel decoding processing on the voice stream data in the voice decoding requests to obtain decoding results, and responding to the voice decoding requests based on the decoding results.
Specifically, the speech decoding apparatus includes:
an application module 10, configured to apply for multiple thread-level decoding channels when receiving multiple voice decoding requests, where the multiple voice decoding requests correspond to the multiple thread-level decoding channels one to one;
the decoding module 20 is configured to respectively invoke a generic model by using the multiple thread-level decoding channels, perform parallel decoding processing on the voice stream data in the multiple voice decoding requests, obtain a decoding result, and respond to the multiple voice decoding requests based on the decoding result.
Further, the decoding module includes:
the buffer unit is used for respectively buffering voice stream data in the voice decoding requests by utilizing the data buffer areas of the thread-level decoding channels;
the calling unit is used for calling the general models respectively by utilizing the channel decoding units of the thread-level decoding channels, and performing parallel decoding processing on the voice stream data in the voice decoding requests to obtain decoding results;
a response unit, configured to respectively respond to the multiple speech decoding requests based on the decoding results by using the callback interface units of the multiple thread-level decoding channels.
Further, the buffer unit includes:
a check subunit, configured to check, for any particular thread-level decoding channel of the multiple thread-level decoding channels, a data state of a data buffer of the particular thread-level decoding channel;
a first temporary storage subunit, configured to, if the data state of the data buffer of the specific thread-level decoding channel is waiting for data, directly temporarily store the voice stream data corresponding to the specific thread-level decoding channel in the data buffer of the specific thread-level decoding channel;
and a second temporary storage subunit, configured to, if the data state of the data buffer of the specific thread-level decoding channel is data present, temporarily store the voice stream data corresponding to the specific thread-level decoding channel at the end of the data buffer of the specific thread-level decoding channel.
Further, the calling unit includes:
the calling subunit is used for respectively calling the universal model by utilizing the channel decoding units of the thread-level decoding channels;
and the decoding subunit is configured to, based on the general model, convert the voice stream data into a feature vector set in parallel in each of the channel decoding units, and convert the feature vector set into a decoding result.
Further, the speech decoding apparatus further includes:
the updating module is used for updating the running state of the thread-level decoding channel, the data state of the data cache region and the registration state of the callback interface unit in real time through a state control unit of the thread-level decoding channel so as to execute corresponding steps according to the running state, the data state and the registration state; and/or
And the control module is used for receiving an external control signal through a state control unit of the thread-level decoding channel and adjusting the running state of the thread-level decoding channel according to the external control signal.
Further, the decoding module further comprises:
the first emptying unit is used for emptying the voice stream data in the data buffer area of the thread-level decoding channel through the recovery unit of the thread-level decoding channel so as to store the voice stream data again by utilizing the data buffer area; and/or
And the second emptying unit is used for emptying the state information recorded by the state control unit of the thread-level decoding channel through the recovery unit of the thread-level decoding channel so that the state control unit can record the state of the thread-level decoding channel again.
Further, the response unit further includes:
the checking subunit is configured to check the registration state of the callback interface of the thread-level decoding channel, which is recorded by the state control unit of each thread-level decoding channel;
an execution subunit, configured to, if the registration state of the callback interface unit of the thread-level decoding channel is registered, execute the following steps: respectively responding to the plurality of voice decoding requests by utilizing callback interface units of the plurality of thread-level decoding channels based on the decoding results;
and the emptying subunit is configured to empty, if the registration state of the callback interface unit of the thread-level decoding channel is unregistered, the voice stream data in the data cache region through the recovery unit of the thread-level decoding channel, and empty the state information updated by the state control unit.
In addition, an embodiment of the present invention further provides a computer storage medium, where a speech decoding program is stored on the computer storage medium, and when the speech decoding program is executed by a processor, the steps of the speech decoding method are implemented, which are not described herein again.
Compared with the prior art, the speech decoding method, apparatus, engine and storage medium provided by the invention are applied to a speech decoding engine: when a plurality of voice decoding requests are received, a plurality of thread-level decoding channels are applied for, the plurality of voice decoding requests corresponding one-to-one to the plurality of thread-level decoding channels; the thread-level decoding channels each call a general model to perform parallel decoding processing on the voice stream data in the voice decoding requests, obtain decoding results, and respond to the voice decoding requests based on the decoding results. In this way, a plurality of voice decoding requests are processed in parallel through a plurality of thread-level decoding channels that share one general model, so that thread-level parallel processing of voice decoding is realized, hardware cost is reduced, and the concurrency capability and decoding efficiency of voice decoding are improved.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for causing a terminal device to execute the method according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention and is not intended to limit the scope of the present invention, and all equivalent structures or flow transformations made by the present specification and drawings, or applied directly or indirectly to other related arts, are included in the scope of the present invention.

Claims (10)

1. A speech decoding method applied to a speech decoding engine, comprising:
when receiving a plurality of voice decoding requests, applying for a plurality of thread-level decoding channels, wherein the plurality of voice decoding requests correspond to the plurality of thread-level decoding channels one to one;
and calling a general model by using the thread-level decoding channels respectively, performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results, and responding to the voice decoding requests based on the decoding results.
2. The method of claim 1, wherein the thread-level decode channel comprises a channel decode unit, a data cache, and a callback interface unit; the step of respectively calling the general models by using the thread-level decoding channels, performing parallel decoding processing on the voice stream data in the voice decoding requests to obtain decoding results, and responding to the voice decoding requests based on the decoding results comprises:
respectively caching voice stream data in the voice decoding requests by utilizing data caching areas of the thread-level decoding channels;
respectively calling a universal model by utilizing the channel decoding units of the thread-level decoding channels, and performing parallel decoding processing on voice stream data in the voice decoding requests to obtain decoding results;
and respectively responding to the plurality of voice decoding requests by utilizing the callback interface units of the plurality of thread-level decoding channels based on the decoding results.
3. The method according to claim 2, wherein the buffering voice stream data in the plurality of voice decoding requests with the data buffers of the plurality of thread-level decoding channels respectively comprises:
for any particular thread-level decoding channel of the plurality of thread-level decoding channels, checking a data state of a data buffer of the particular thread-level decoding channel;
if the data state of the data cache region of the specific thread-level decoding channel is waiting for data, the voice stream data corresponding to the specific thread-level decoding channel is directly and temporarily stored in the data cache region of the specific thread-level decoding channel;
if the data state of the data buffer area of the specific thread-level decoding channel is data present, the voice stream data corresponding to the specific thread-level decoding channel is temporarily stored at the end of the data buffer area of the specific thread-level decoding channel.
4. The method according to claim 2, wherein the step of respectively calling the general model by utilizing the channel decoding units of the thread-level decoding channels, and performing parallel decoding processing on the voice stream data in the voice decoding requests to obtain the decoding results comprises:
respectively calling the general model by utilizing the channel decoding units of the thread-level decoding channels;
and, on the basis of the general model, converting the voice stream data into a feature vector set in parallel in each channel decoding unit, and converting the feature vector set into a decoding result.
5. The method of claim 2, wherein the thread-level decoding channel further comprises a state control unit,
and the method further comprises:
updating, in real time through the state control unit of the thread-level decoding channel, the running state of the thread-level decoding channel, the data state of the data buffer, and the registration state of the callback interface unit, so as to execute corresponding steps according to the running state, the data state, and the registration state; and/or
receiving an external control signal through the state control unit of the thread-level decoding channel, and adjusting the running state of the thread-level decoding channel according to the external control signal.
6. The method of claim 1, wherein the thread-level decoding channel comprises a recovery unit;
after the step of respectively calling the general model by using the thread-level decoding channels, performing parallel decoding processing on the voice stream data in the voice decoding requests, and responding to the voice decoding requests based on the decoding results, the method further comprises:
clearing the voice stream data in the data buffer of the thread-level decoding channel through the recovery unit of the thread-level decoding channel, so that the data buffer can be used to store voice stream data again; and/or
clearing the state information updated by the state control unit of the thread-level decoding channel through the recovery unit of the thread-level decoding channel, so that the state control unit can update the state of the thread-level decoding channel again.
7. The method of claim 2, wherein the registration state of the callback interface unit is registered or unregistered;
before the step of respectively responding to the plurality of voice decoding requests by utilizing the callback interface units of the plurality of thread-level decoding channels based on the decoding results, the method further comprises:
checking, for each thread-level decoding channel, the registration state of the callback interface unit of the thread-level decoding channel as recorded by the state control unit of the thread-level decoding channel;
if the registration state of the callback interface unit of the thread-level decoding channel is registered, executing the following step: respectively responding to the plurality of voice decoding requests by utilizing the callback interface units of the plurality of thread-level decoding channels based on the decoding results;
and if the registration state of the callback interface unit of the thread-level decoding channel is unregistered, clearing the voice stream data in the data buffer through the recovery unit of the thread-level decoding channel, and clearing the state information updated by the state control unit.
8. A speech decoding apparatus, characterized in that the speech decoding apparatus comprises:
an application module, configured to apply for a plurality of thread-level decoding channels when a plurality of voice decoding requests are received, wherein the plurality of voice decoding requests correspond to the plurality of thread-level decoding channels one to one; and
a decoding module, configured to respectively call a general model by utilizing the thread-level decoding channels, perform parallel decoding processing on the voice stream data in the voice decoding requests to obtain decoding results, and respond to the voice decoding requests based on the decoding results.
9. A speech decoding engine, comprising a processor, a memory, and a speech decoding program stored in the memory, wherein the speech decoding program, when executed by the processor, implements the steps of the speech decoding method according to any one of claims 1 to 7.
10. A computer storage medium, having a speech decoding program stored thereon, wherein the speech decoding program, when executed by a processor, implements the steps of the speech decoding method according to any one of claims 1 to 7.
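
As an illustration of the parallel decoding flow recited in claims 1 to 7, the following minimal C++ sketch shows several thread-level decoding channels sharing a single read-only general model, with each channel holding its own data buffer, buffer state, and callback. All type names, member names, and the placeholder framing and decoding logic are assumptions introduced for illustration only and do not appear in the specification; a production engine would extract acoustic feature vectors and search a decoder graph instead.

#include <functional>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Shared resources loaded once and used read-only by every channel
// (the "general model" of claim 1; contents here are placeholders).
struct GenericModel {
    std::string name = "general-model";
};

enum class BufferState { WaitingForData, HasData };

// One thread-level decoding channel: private buffer, state, and callback.
struct ThreadLevelChannel {
    std::vector<float> data_buffer;
    BufferState buffer_state = BufferState::WaitingForData;
    std::function<void(const std::string&)> callback;  // registered or empty

    // Claim 3: a first chunk is stored directly; later chunks are appended at the end.
    void Cache(const std::vector<float>& chunk) {
        if (buffer_state == BufferState::WaitingForData) {
            data_buffer = chunk;
            buffer_state = BufferState::HasData;
        } else {
            data_buffer.insert(data_buffer.end(), chunk.begin(), chunk.end());
        }
    }

    // Claim 4 (placeholder): convert buffered audio to frames, then to a decoding result.
    std::string Decode(const GenericModel& model) {
        std::size_t num_frames = data_buffer.size() / 160;  // fake framing
        return "[" + model.name + "] decoded " + std::to_string(num_frames) + " frames";
    }

    // Claims 6-7: clear the buffer and state so the channel can be reused.
    void Recycle() {
        data_buffer.clear();
        buffer_state = BufferState::WaitingForData;
        callback = nullptr;
    }
};

int main() {
    GenericModel model;                           // one shared model for all channels
    std::vector<ThreadLevelChannel> channels(3);  // one channel per decoding request
    std::mutex print_mutex;
    std::vector<std::thread> workers;

    for (int i = 0; i < 3; ++i) {
        channels[i].callback = [&print_mutex, i](const std::string& result) {
            std::lock_guard<std::mutex> lock(print_mutex);
            std::cout << "request " << i << ": " << result << "\n";
        };
        workers.emplace_back([&model, &channels, i] {
            channels[i].Cache(std::vector<float>(1600 * (i + 1), 0.0f));   // fake audio chunk
            const std::string result = channels[i].Decode(model);          // shared, read-only model
            if (channels[i].callback) channels[i].callback(result);        // respond only if registered
            channels[i].Recycle();                                         // free the channel
        });
    }
    for (auto& t : workers) t.join();
    return 0;
}

Because each worker thread writes only to its own channel while the shared model is used read-only, no locking is needed around the model itself; this is what lets the channels decode in parallel while the model is loaded only once.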
CN202010155132.7A 2020-03-06 2020-03-06 Speech decoding method, apparatus, engine and storage medium Pending CN111402906A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010155132.7A CN111402906A (en) 2020-03-06 2020-03-06 Speech decoding method, apparatus, engine and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010155132.7A CN111402906A (en) 2020-03-06 2020-03-06 Speech decoding method, apparatus, engine and storage medium

Publications (1)

Publication Number Publication Date
CN111402906A (en) 2020-07-10

Family

ID=71430610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010155132.7A Pending CN111402906A (en) 2020-03-06 2020-03-06 Speech decoding method, apparatus, engine and storage medium

Country Status (1)

Country Link
CN (1) CN111402906A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000026902A1 (en) * 1998-11-04 2000-05-11 Syvox Corporation Apparatus and method for improved memory and resource management in a single-user or multi-user speech recognition system
CN1486044A (en) * 2002-09-28 2004-03-31 Huawei Technologies Co Ltd Method for scheduling multi-channel coding-decoding task in VOIP network
US20090154690A1 (en) * 2007-12-17 2009-06-18 Wai Wu Parallel signal processing system and method
CN102177542A (en) * 2008-10-10 2011-09-07 艾利森电话股份有限公司 Energy conservative multi-channel audio coding
US20120233616A1 (en) * 2009-11-27 2012-09-13 Simon Moy Stream data processing method and stream processor
CN104683860A (en) * 2015-02-02 2015-06-03 北京神州天脉网络计算机有限公司 Multipath audio and video concurrent decoding acceleration card and decoding acceleration method for same
CN107710323A (en) * 2016-01-22 2018-02-16 弗劳恩霍夫应用研究促进协会 Resampled using spectrum domain to encode or decode the device and method of audio multichannel signal
US20190303797A1 (en) * 2018-03-30 2019-10-03 International Business Machines Corporation System and method for cognitive multilingual speech training and recognition
CN110570838A (en) * 2019-08-02 2019-12-13 北京葡萄智学科技有限公司 Voice stream processing method and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112822183A (en) * 2020-12-30 2021-05-18 北京捷通华声科技股份有限公司 Voice processing method and device, computer readable storage medium and processor
CN112822183B (en) * 2020-12-30 2023-08-22 北京捷通华声科技股份有限公司 Speech processing method, device, computer readable storage medium and processor

Similar Documents

Publication Publication Date Title
US10580408B1 (en) Speech recognition services
US11869487B1 (en) Allocation of local and remote resources for speech processing
US8099284B2 (en) System and method for speech recognition system
US9330669B2 (en) System and method for performing dual mode speech recognition
US20160071519A1 (en) Speech model retrieval in distributed speech recognition systems
US11443169B2 (en) Adaptation of model for recognition processing
CN112183120A (en) Speech translation method, device, equipment and storage medium
CN110310657B (en) Audio data processing method and device
CA2756140A1 (en) Service oriented speech recognition for in-vehicle automated interaction
CN107808007A (en) Information processing method and device
CN110995943B (en) Multi-user streaming voice recognition method, system, device and medium
CN103514882A (en) Voice identification method and system
US20210249001A1 (en) Dialog System Capable of Semantic-Understanding Mapping Between User Intents and Machine Services
JP2024508196A (en) Artificial Intelligence System for Incorporating Context with Augmented Self-Attention
CN111402906A (en) Speech decoding method, apparatus, engine and storage medium
CN105630869B (en) A kind of storage method and device of voice data
US20170140751A1 (en) Method and device of speech recognition
JP2023162265A (en) Text echo cancellation
US10964318B2 (en) Dialogue management
US20220399013A1 (en) Response method, terminal, and storage medium
Lojka et al. Multi-thread parallel speech recognition for mobile applications
CN114023309A (en) Speech recognition system, related method, device and equipment
CN111414748A (en) Traffic data processing method and device
CN108986792B (en) Training and scheduling method and system for voice recognition model of voice conversation platform
CN116959437A (en) Speech recognition method, apparatus, device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination