WO2022042159A1

WO2022042159A1 - Delay control method and apparatus

Info

Publication number: WO2022042159A1
Application number: PCT/CN2021/108217
Authority: WO
Inventors: 陈江; 胡正伦
Original assignee: 百果园技术(新加坡)有限公司; 陈江
Priority date: 2020-08-31
Filing date: 2021-07-23
Publication date: 2022-03-03
Also published as: CN112017666A

Abstract

A delay control method and an apparatus. The delay control method is applied in a speech recognition system, the speech recognition system comprises a delay control parameter, and the method comprises: determining a delay level for a speech signal to be recognized (110); determining a target delay estimated time for the speech signal to be recognized according to the delay level (120); determining a time-varying delay time and a non-time-varying delay time for the speech signal to be recognized (130); determining whether it is necessary to adjust a speech recognition delay time of the speech recognition system by combining the time-varying delay time, the non-time-varying delay time, and the target delay estimated time (140); and in response to determining that it is necessary to adjust the speech recognition delay time of the speech recognition system, adjusting the value of the delay control parameter (150).

Description

Delay control method and device

This application claims the priority of the Chinese patent application with application number 202010901269.2 filed with the China Patent Office on August 31, 2020, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the technical field of speech recognition, for example, to a delay control method and device.

Background

Automatic Speech Recognition (ASR) takes speech as its research object and enables machines to automatically recognize and understand human spoken language through speech signal processing and pattern recognition. Speech recognition technology allows machines to convert speech signals into corresponding text or commands through a process of recognition and understanding. With the development of information technology, speech recognition is gradually becoming a key technology in computer information processing, and its application scenarios are becoming more and more extensive, for example scenarios involving sensitive content, human-computer interaction, and others.

In the process of using speech recognition technology, there are inevitably delays such as network delay and speech decoding delay, which make it difficult for the real-time conversion of speech to text to meet business requirements. In order to improve the real-time performance of speech recognition, in the related art, the speech decoder can be pruned or compressed in the training phase of the ASR model, but this will lead to the loss of the recognition rate of speech recognition.

Summary of the invention

The present application provides a delay control method and device to solve the problem that the recognition rate is impaired in order to improve the real-time performance of speech recognition.

The present application provides a delay control method. The method is applied to a speech recognition system. The speech recognition system includes delay control parameters, and the method includes:

determining the delay level of the speech signal to be recognized;

determining the target delay estimation time of the speech signal to be recognized according to the delay level;

determining the time-varying delay time and the time-invariant delay time of the speech signal to be recognized;

judging, by combining the time-varying delay time, the time-invariant delay time and the target delay estimation time, whether the speech recognition delay time of the speech recognition system needs to be adjusted; and

if it is determined that the speech recognition delay time of the speech recognition system needs to be adjusted, adjusting the value of the delay control parameter.

The present application also provides a delay control device, the device is applied in a speech recognition system, the speech recognition system includes delay control parameters, and the device includes:

a delay level determining unit, configured to determine the delay level of the speech signal to be recognized;

a target delay determination unit, configured to determine the target delay estimation time of the speech signal to be recognized according to the delay level;

a time-varying and time-invariant delay determining unit, configured to determine the time-varying delay time and the time-invariant delay time of the speech signal to be recognized;

a delay adjustment judging unit, configured to combine the time-varying delay time, the time-invariant delay time and the target delay estimation time to determine whether the speech recognition delay time of the speech recognition system needs to be adjusted; and

a delay control parameter adjustment unit, configured to adjust the value of the delay control parameter when it is determined that the speech recognition delay time of the speech recognition system needs to be adjusted.

The present application also provides a server, including a memory, a processor, and a computer program stored in the memory and running on the processor, where the processor implements the above-mentioned delay control method when executing the program.

The present application also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the above-mentioned delay control method is implemented.

Description of drawings

FIG. 1 is a flowchart of an embodiment of a delay control method provided in Embodiment 1 of the present application;

FIG. 2 is a schematic diagram of a server framework provided by Embodiment 1 of the present application;

FIG. 3 is a flowchart of another embodiment of a delay control method provided in Embodiment 2 of the present application;

FIG. 4 is a schematic diagram of another server framework provided by Embodiment 2 of the present application;

FIG. 5 is a structural block diagram of a delay control device provided in Embodiment 3 of the present application;

FIG. 6 is a schematic structural diagram of a server according to Embodiment 4 of the present application.

Detailed description

The present application will be described below with reference to the accompanying drawings and embodiments.

Embodiment 1

FIG. 1 is a flowchart of an embodiment of a delay control method provided in Embodiment 1 of the present application. The delay control method of the embodiment of the present application may be applied to an ASR system.

The ASR system can be located in the server. As shown in FIG. 2, the server can include not only the ASR system but also a voice decoder. The voice data packets received from the client are decoded by the voice decoder into Pulse Code Modulation (PCM) data, which is sent to the ASR system. The ASR system performs speech recognition decoding on the PCM data and outputs the speech recognition result, which can be text.

This embodiment may include the following steps:

Step 110: Determine the delay level of the speech signal to be recognized.

In one example, the voice signal to be recognized may be PCM data obtained after the data packet input to the server is decoded by the voice decoder. After the ASR system obtains the to-be-recognized speech signal, the delay level of the to-be-recognized speech signal may be determined first.

In this embodiment, the delay level represents the degree of delay that suits the current context and is incurred when the ASR system performs speech recognition on the speech signal to be recognized. As an example, the delay level may include, but is not limited to, a high delay level, a medium delay level, or a low delay level.

In one implementation, the delay level of the speech signal to be recognized may be determined in a rule-based manner.

Step 120: Determine a target delay estimation time of the speech signal to be recognized according to the delay level.

In this step, the target delay estimation time may be the text generation delay of the speech signal to be recognized currently estimated according to the previous speech recognition result. For different speech scenarios, the target delay estimation time is different, so for different delay levels, the target delay estimation time is different, and the higher the delay level, the larger the target delay estimation time.

For example, for the high delay level, a high recognition rate and high word accuracy are the first priority, and the target delay estimation time is relatively large; for example, it can be set to a preset maximum value. For the medium delay level, both the recognition rate and the decoding speed are taken into account, and the target delay estimation time is set so as to achieve a higher recognition rate without affecting the user's experience; for example, it can be set to 150ms, since the user's experience is affected when the delay exceeds 150ms. For the low delay level, user experience is the first priority, and the target delay estimation time is set so as to improve user experience while keeping an appropriate recognition rate; for example, it can be set to 100ms.

Step 130: Determine the time-varying delay time and the time-invariant delay time of the speech signal to be recognized.

As can be seen from Figure 2, the processing of speech data includes stages such as network transmission, speech decoding, and speech recognition, and each stage may bring about time-varying delay or time-invariant delay. Then, in step 130, the time-varying delay time and the time-invariant delay time of the speech signal to be recognized can be obtained.

As an example, the time-invariant delay time may include, but is not limited to, the network transmission delay time; the time-varying delay time may include, but is not limited to, the jitter buffer delay time, the speech decoding delay time, and the speech recognition delay estimation time.

The above-mentioned network transmission delay time, jitter buffer delay time, speech decoding delay time, and speech recognition delay estimation time are only exemplary descriptions of the time-varying delay time and the time-invariant delay time in this embodiment. Other delay times in the speech decoding process may also be acquired, and this embodiment does not limit the acquired other delay times in the speech decoding process.

Step 140: Determine whether the speech recognition delay time of the speech recognition system needs to be adjusted in combination with the time-varying delay time, the non-time-varying delay time, and the target delay estimation time.

In this step, the time-varying delay time and the time-invariant delay time may be compared with the target delay estimation time, and it is determined whether the speech recognition delay time of the speech recognition system needs to be adjusted according to the comparison result.

In one embodiment, step 140 may include the following steps:

Determine whether the sum of the network transmission delay time, the jitter buffer delay time, the speech decoding delay time and the speech recognition delay estimation time is less than or equal to the target delay estimation time. If the sum is less than or equal to the target delay estimation time, it is determined that the speech recognition delay time of the speech recognition system does not need to be adjusted; if the sum is greater than the target delay estimation time, it is determined that the speech recognition delay time of the speech recognition system needs to be adjusted.
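The comparison in step 140 can be sketched in Python as follows. This is a minimal illustration only: the function name, parameter names, and sample values are hypothetical, and only the "sum greater than target" rule is taken from the text.

```python
def needs_adjustment(network_ms, jitter_buffer_ms, decode_ms,
                     asr_estimate_ms, target_ms):
    """Return True when the summed delays exceed the target delay estimation time."""
    total_ms = network_ms + jitter_buffer_ms + decode_ms + asr_estimate_ms
    # A sum less than or equal to the target means no adjustment is needed.
    return total_ms > target_ms


# Illustrative values (assumed, not from the text): 141 ms against a 150 ms target.
print(needs_adjustment(100, 0, 1, 40, target_ms=150))     # False
# Same target with a jitter buffer and a slower codec: 410 ms in total.
print(needs_adjustment(100, 60, 210, 40, target_ms=150))  # True
```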

Step 150: If it is determined that the speech recognition delay time of the speech recognition system needs to be adjusted, adjust the value of the delay control parameter.

In this embodiment, the ASR system has an adjustable variable, namely the delay control parameter. When it is determined that the speech recognition delay time of the ASR system needs to be adjusted, the value of the delay control parameter can be adjusted to change the decoding speed, thereby achieving the purpose of adjusting the speech recognition delay.

In this embodiment, an adaptively adjusted delay control parameter is configured in the ASR system. The delay level, the time-varying delay time and the time-invariant delay time of the speech signal to be recognized are determined, and the target delay estimation time of the speech signal to be recognized is determined according to the delay level. When it is determined, by combining the time-varying delay time, the time-invariant delay time and the target delay estimation time, that the speech recognition delay time of the ASR system needs to be adjusted, the value of the delay control parameter in the ASR system can be adjusted. This achieves the purpose of dynamically adjusting the speech recognition delay according to the delay level of the current context and improves the ability of the ASR system to adapt quickly to a changing delay environment: the ASR system can use high-precision speech recognition decoding in a low-latency environment, and use low-precision, low-latency speech recognition decoding in a high-latency environment to meet the target delay.

Embodiment 2

FIG. 3 is a flowchart of another embodiment of a delay control method provided in Embodiment 2 of the present application. This embodiment is described on the basis of Embodiment 1 and gives examples of the time-varying delay time and the time-invariant delay time. As shown in FIG. 4, the network delay t1 can be obtained when the data packet from the client is received, the speech decoding delay time t2 can be obtained when the voice decoder performs speech decoding, and the speech recognition delay estimation time t3' is estimated when the speech recognition decoder in the ASR system performs speech recognition. Combining these with the target delay estimation time p, the adaptive ASR delay control module determines whether delay adjustment is required; if so, the value of the delay control parameter in the ASR system is adjusted to achieve the purpose of dynamically adjusting the speech recognition delay.

As shown in Figure 3, this embodiment includes the following steps:

Step 310: Determine the delay level of the speech signal to be recognized, where the delay level includes a high delay level, a medium delay level or a low delay level.

In one implementation, the delay level of the speech signal to be recognized may be determined in a rule-based manner. Then, in one embodiment, step 310 may include the following steps:

Step 310-1, if the previous speech recognition result contains a preset sensitive word, determine that the delay level of the speech signal to be recognized is a high delay level.

In one example, the previous speech recognition result may be the speech recognition result for a period of time before the current speech signal to be recognized. The length of that period may be determined according to actual requirements and is not limited in this embodiment.

In one implementation, a rule base of sensitive words can be established in advance. If the speech recognition results of the preceding period hit sensitive words in the rule base, for example, the conversation contains sensitive words relating to violence, terrorism, politics, pornography, etc., then high word accuracy and a high recognition rate must be ensured in this context in order to identify potential sensitive words. That is, a high recognition rate is the first priority and the corresponding delay is relatively high, so it can be determined that the delay level of the speech signal to be recognized is the high delay level.

Step 310-2, if the previous speech recognition result contains preset rare words and the number of rare words is greater than or equal to a preset number, determine that the delay level of the speech signal to be recognized is a medium delay level.

In one implementation, a rare word rule base may be established in advance. If the speech recognition results of the preceding period hit rare words in the rare word rule base and the number of hit rare words is greater than or equal to a preset number, for example, more than 5 rare words appear in a news broadcast, then in this context the recognition difficulty for the ASR system is relatively high, the decoding speed is relatively low, and the delay is relatively high, but the precision required is not as high as for recognizing sensitive words. It can therefore be determined that the delay level of the speech signal to be recognized is the medium delay level.

Step 310-3, if the previous speech recognition result contains neither preset sensitive words nor preset rare words, or the number of rare words it contains is less than the preset number, determine that the delay level of the speech signal to be recognized is a low delay level.

In this step, if the speech recognition results of the preceding period contain no sensitive words or rare words, or contain rare words but fewer than the preset number, for example in speech recognition of entertainment content, then giving the user a timely text experience is the highest priority in this context, and it can be determined that the delay level of the speech signal to be recognized is the low delay level. Of course, an appropriate word accuracy still needs to be ensured at the low delay level.
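The rule-based classification of steps 310-1 to 310-3 might look like the following Python sketch. The rule-base contents, the threshold of 5, the counting of every hit (rather than distinct words), and the precedence of the sensitive-word rule over the rare-word rule are all assumptions made for illustration; the text only states the three rules themselves.

```python
from enum import Enum


class DelayLevel(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical rule bases; a real system would load these from configuration.
SENSITIVE_WORDS = {"violence", "terrorism"}
RARE_WORDS = {"zymurgy", "quixotic", "obsequious", "perspicacious", "sesquipedalian"}
RARE_WORD_THRESHOLD = 5  # the "preset number" of rare words (assumed value)


def classify_delay_level(previous_words):
    """Map the words recognized in the preceding period to a delay level."""
    # Step 310-1: any sensitive-word hit gives the high delay level.
    if any(word in SENSITIVE_WORDS for word in previous_words):
        return DelayLevel.HIGH
    # Step 310-2: enough rare-word hits give the medium delay level.
    rare_hits = sum(1 for word in previous_words if word in RARE_WORDS)
    if rare_hits >= RARE_WORD_THRESHOLD:
        return DelayLevel.MEDIUM
    # Step 310-3: everything else is the low delay level.
    return DelayLevel.LOW
```

For instance, `classify_delay_level(["hello", "violence"])` yields `DelayLevel.HIGH`, while a transcript containing a single rare word stays at `DelayLevel.LOW`.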

Step 320: Determine a target delay estimation time of the speech signal to be recognized according to the delay level.

In one embodiment, the above-mentioned high delay level, medium delay level or low delay level may be searched in a preset delay level data table to obtain the corresponding target delay estimation time.

In this embodiment, in order to improve the efficiency of obtaining the target delay estimation time, a delay level data table may be created in advance, which records the association between a plurality of delay levels and the corresponding target delay estimation times. After the current delay level of the speech signal to be recognized is determined, the corresponding target delay estimation time can be obtained by querying the delay level data table. For example, when it is determined that the current delay level of the speech signal to be recognized is a low delay level, the target delay estimation time obtained by looking up the table may be 100ms; when the delay level is a medium delay level, the target delay estimation time obtained may be 150ms; and if the delay level is a high delay level, the target delay estimation time is a preset maximum value.
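A delay level data table of this kind can be sketched as a simple dictionary lookup. The 100ms and 150ms entries come from the text; the value chosen for the preset maximum, the string keys, and the function name are hypothetical.

```python
# Hypothetical value for the "preset maximum"; the text does not specify it.
PRESET_MAX_DELAY_MS = 1000.0

# Delay level -> target delay estimation time in milliseconds.
DELAY_LEVEL_TABLE = {
    "low": 100.0,
    "medium": 150.0,
    "high": PRESET_MAX_DELAY_MS,
}


def target_delay_ms(level):
    """Look up the target delay estimation time for a delay level."""
    try:
        return DELAY_LEVEL_TABLE[level]
    except KeyError:
        raise ValueError(f"unknown delay level: {level!r}") from None


print(target_delay_ms("medium"))  # 150.0
```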

Step 330: Determine a time-varying delay time and a time-invariant delay time of the speech signal to be recognized, where the time-invariant delay time includes a network transmission delay time, and the time-varying delay time includes a jitter buffer delay time, a speech decoding delay time and a speech recognition delay estimation time.

In this embodiment, the network transmission delay time refers to the transmission delay from the client to the server, and this delay is time-invariant. In one implementation, the timestamp at which the server receives the data packet and the timestamp at which the client sends the data packet can be obtained, and their difference is the network transmission delay time, that is, network transmission delay time = timestamp at which the server receives the data packet - timestamp at which the client sends the data packet.

The jitter buffer delay time is the time-varying delay generated by Jitter Buffer Management (JBM). Jitter buffer management first stores the received voice data in a buffer, and then selects voice data from the buffer according to the delay time of the current network and the time obtained by the upper layer of the current network, so as to eliminate the jitter caused by unstable network transmission quality. Jitter buffer management ensures smooth speech at the expense of latency.

The determination of the jitter buffer delay time depends on the algorithm design. In one embodiment, the jitter buffer delay time may be determined in the following manner: acquiring multiple network transmission delay times of the speech signal within a preset time period before the speech signal to be recognized, and calculating the jitter between adjacent network transmission delay times; calculating the standard deviation of the jitter, and adjusting the length of the jitter buffer according to the standard deviation, the adjusted length being used as the jitter buffer delay time.

In this embodiment, the length of the jitter buffer is taken as the jitter buffer delay time. For example, assuming that the network transmission delay times of 5 data packets in the preceding period are 100ms, 100ms, 30ms, 60ms, and 40ms, the jitters between adjacent delay times are 0ms, 70ms, 30ms, and 20ms, respectively. The standard deviation of the four jitter values is then calculated as sigma=25.4ms, which means that the jitter is relatively large and the network quality is unstable. The length of the jitter buffer can be adjusted according to the standard deviation, for example to 2*sigma=50.8ms, that is, the jitter buffer delay time=50.8ms.

For another example, if the network quality is good and there is no jitter, such as the standard deviation sigma=0ms, the length of the jitter buffer can be adjusted to 0ms, that is, the jitter buffer delay time=0ms.
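The jitter buffer calculation can be reproduced with a short Python sketch. It assumes that jitter is the absolute difference between adjacent transmission delays and that the population standard deviation is intended; the function name and the `multiplier` parameter are hypothetical. Under these assumptions the standard deviation of the sample jitters comes out at about 25.5ms, so the figures of 25.4ms and 50.8ms quoted in the example appear to be truncated.

```python
from statistics import pstdev


def jitter_buffer_delay_ms(transmission_delays_ms, multiplier=2.0):
    """Jitter buffer length derived from the spread of adjacent-delay jitter."""
    # Jitter between each pair of adjacent network transmission delay times.
    jitters = [abs(later - earlier)
               for earlier, later in zip(transmission_delays_ms,
                                         transmission_delays_ms[1:])]
    if not jitters:
        return 0.0  # fewer than two packets: no jitter can be measured
    # Population standard deviation of the jitter values (assumed choice).
    sigma = pstdev(jitters)
    return multiplier * sigma


unstable = jitter_buffer_delay_ms([100, 100, 30, 60, 40])  # roughly 51 ms
stable = jitter_buffer_delay_ms([100, 100, 100])            # 0.0
print(round(unstable, 1), stable)
```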

Since both the network transmission delay time and the jitter buffer delay time are delays generated in the network transmission stage, the sum of the two can be called network delay, that is, network delay = network transmission delay time + jitter buffer delay time.

The voice decoding delay time is the delay generated by using the voice decoder to decode the received voice data packets into PCM data. Different voice codec algorithms will produce different voice decoding delay times.

In an embodiment, the speech decoding delay time may be determined in the following manner: determining the target codec algorithm and code rate corresponding to the speech signal to be recognized, and obtaining the speech decoding delay time corresponding to the target codec algorithm and the code rate.

Different codec algorithms and different code rates can be selected according to the application and the network quality. For example, if the current network jitter is relatively large and the user is in live broadcast or singing mode with background music, using Advanced Audio Coding (AAC) or Moving Picture Experts Group Audio Layer 3 (MP3) encoding gives better sound quality; if the user is in call mode and the network is relatively good, using Adaptive Multi-Rate Wideband (AMR-WB) can meet user needs.

In one implementation, a data table recording the speech decoding delay times corresponding to different codec algorithms and different code rates can be created in advance, and the corresponding speech decoding delay time is obtained by looking up this table. For example, the speech decoding delay time of AMR-WB is 0.9375ms; that of MP3 is 140ms; that of AAC is 210ms; and that of High Efficiency-Advanced Audio Coding (HE-AAC) is 360ms.

The speech recognition delay estimation time is the estimated decoding delay of the ASR system for the recognized speech signal. Since the decoding speed of the ASR system is different under different speech or noise conditions, the estimated speech recognition delay estimation time is also dynamically changed.

In one embodiment, the speech recognition delay estimation time may be determined in the following manner: acquiring the speech recognition delay time of the speech signal of the previous unit time; inputting the speech signal to be recognized and the speech recognition delay time of the speech signal of the previous unit time into a trained delay prediction model, and obtaining the speech recognition delay estimation time output by the delay prediction model.

In this embodiment, a delay prediction model may be trained in advance, and the delay time of the current speech signal is then estimated using the speech recognition delay time of the speech signal of the previous unit time as a reference. The speech recognition delay time of the speech signal of the previous unit time and the current speech signal to be recognized are input into the delay prediction model, which processes them and outputs the speech recognition delay estimation time of the current speech signal to be recognized.

Exemplarily, the delay prediction model may be a neural network model, and a general neural network model training method may be used to train the delay prediction model, and this embodiment does not limit the training method of the delay prediction model.

Step 340: If it is determined that the sum of the network transmission delay time, the jitter buffer delay time, the speech decoding delay time and the speech recognition delay estimation time is greater than the target delay estimation time, determine that the speech recognition delay time of the speech recognition system needs to be adjusted.

In this embodiment, the goal of delay control is to make p>=d, where d=t1+t2+t3'; t1 is the network delay, that is, t1=network transmission delay time+jitter buffer delay time; t2 is the speech decoding delay time; t3' is the speech recognition delay estimation time; and p is the target delay estimation time.

After obtaining t1, t2, t3', if the sum d of the three is less than or equal to p, it means that the decoding delay obtained according to the current speech recognition decoding meets the requirements, and there is no need to adjust the speech recognition delay time of the speech recognition system. If d exceeds p, it means that the decoding delay obtained according to the current speech recognition decoding does not meet the requirements, and the speech recognition delay time of the speech recognition system needs to be adjusted.

Step 350: Calculate the difference between the estimated target delay time and the sum of the network transmission delay time, the jitter buffer delay time, and the speech decoding delay time, and use the difference as the available speech recognition delay time.

In this step, under the control objective of p>=d, the available speech recognition delay time of the ASR system can be calculated. During implementation, the difference between the target delay estimation time and the sum of the network transmission delay time, the jitter buffer delay time and the speech decoding delay time can be calculated, and the difference can be used as the available speech recognition delay time.

For example, when the network condition is good and the user is in voice call mode, the following delays are obtained: target delay estimation time = 500ms, t1 = network transmission delay time + jitter buffer delay time = 100ms + 0ms = 100ms, and t2 = AMR-WB decoding delay = 1ms. The available speech recognition delay time is then t3 = 500ms - 100ms - 1ms = 399ms. That is, with a good network, the speech recognition delay of the ASR system for the speech signal can be close to 400ms; the decoding speed does not need to be very fast, and the recognition accuracy is higher.

For another example, when the network condition is poor (jitter 30ms) and the user is in live broadcast or singing mode, AAC/MP3 encoding is used because it gives better sound quality with music, so the delays are as follows: target delay estimation time = 500ms, t1 = network transmission delay time + jitter buffer delay time = 100ms + 60ms = 160ms, and t2 = AAC decoding delay = 210ms. The available speech recognition delay time is then t3 = 500ms - 160ms - 210ms = 130ms. That is, in the presence of network jitter, the speech recognition delay of the ASR system for the speech signal is only allowed to be 130ms; the decoding speed needs to be relatively fast, and the recognition accuracy is lower.
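The two worked examples can be checked with a few lines of Python. The function and variable names are hypothetical; the formula t3 = p - t1 - t2 and the millisecond values are those given in the text, with the AMR-WB decoding delay rounded to 1ms as in the first example.

```python
def available_asr_delay_ms(target_ms, network_ms, jitter_buffer_ms, decode_ms):
    """Delay budget left for speech recognition: t3 = p - t1 - t2."""
    t1 = network_ms + jitter_buffer_ms  # network delay
    return target_ms - t1 - decode_ms


# Good network, voice call mode with AMR-WB.
good_network = available_asr_delay_ms(500, 100, 0, 1)
# Jittery network, live broadcast or singing mode with AAC.
poor_network = available_asr_delay_ms(500, 100, 60, 210)
print(good_network, poor_network)  # 399 130
```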

Step 360: Adjust the value of the delay control parameter according to the available speech recognition delay time.

In one embodiment, the delay control parameter may be a parameter related to a beam search (Beam search) algorithm. As an example, the delay control parameter may include beam search width.

When the ASR system performs speech recognition, the beam search algorithm is used to find the best path through the vocabulary, where the search space consists of all possible character tokens. Starting from the root node <sos>, the k nodes with the largest probability among all possible expansion nodes are selected as the next layer of nodes; then, among the possible expansion nodes of these k nodes, the k nodes with the largest cumulative product of probabilities are selected and expanded, and so on, forming a tree structure with k nodes in each layer. Finally, the path with the highest score is selected and backtracked to obtain the character sequence with the highest probability. Beam search is a search strategy with a limited range, so the beam search width k needs to be large enough to approach the optimal solution, but at the cost of a large amount of computation and a long delay.

Usually, the value of the beam search width k is a fixed value determined in advance according to prior knowledge. Although increasing the beam width can improve the recognition rate, it does so at the cost of decoding speed. A fixed search range is inefficient because it keeps candidates whose confidence is far lower than that of the current best candidate. In this embodiment, the beam search width is changed into an adjustable variable, and when the speech recognition delay time of the speech recognition system needs to be adjusted, the decoding speed can be adjusted by adjusting the beam search width.
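The search procedure described above can be illustrated with a small, generic beam search in Python. This is a sketch under stated assumptions rather than the decoder of any particular ASR system: `next_probs` is a hypothetical callback returning next-token probabilities for a path, `toy_model` is an invented two-step distribution, and scores are accumulated as log-probabilities, which is equivalent to ranking by the cumulative product.

```python
import math


def beam_search(next_probs, k, max_len):
    """Keep the k best partial paths per layer; return the best token sequence."""
    beams = [(("<sos>",), 0.0)]  # (path, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for path, score in beams:
            for token, prob in next_probs(path).items():
                candidates.append((path + (token,), score + math.log(prob)))
        if not candidates:
            break
        # Retain only the k highest-scoring expansions for the next layer.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:k]
    best_path, _ = max(beams, key=lambda item: item[1])
    return best_path[1:]  # drop the <sos> root


def toy_model(path):
    """Invented distribution: 'a' looks best first, but 'b' leads to a better path."""
    if len(path) == 1:
        return {"a": 0.6, "b": 0.4}
    if path[-1] == "a":
        return {"x": 0.55, "y": 0.45}
    return {"x": 0.9, "y": 0.1}


print(beam_search(toy_model, k=1, max_len=2))  # ('a', 'x'), probability 0.33
print(beam_search(toy_model, k=2, max_len=2))  # ('b', 'x'), probability 0.36
```

With k=1 the search commits to the locally best token and misses the better path, while k=2 finds it at the cost of evaluating twice as many expansions, which is the accuracy versus delay trade-off the text describes.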

In an embodiment, the above step 360 may include the following steps:

When the available speech recognition delay time is within a preset high delay range, the beam search width is expanded; when the available speech recognition delay time is within a preset low delay range, the beam search width is reduced.

In this embodiment, when the available speech recognition delay time is within a preset low delay range, that is, for a low delay scenario, the search range can be narrowed by narrowing the beam search width, thereby increasing the decoding speed. When the available speech recognition delay time is within the preset high delay range, that is, for high delay scenarios, the search range can be expanded by increasing the beam search width, the decoding speed can be reduced, and the decoding accuracy can be improved.
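One possible way to express this adjustment is sketched below in Python. The range boundaries (150ms and 300ms), the step size, and the width limits are hypothetical values chosen for illustration; the text only states that the width is expanded in the high delay range and reduced in the low delay range.

```python
def adjust_beam_width(width, available_ms,
                      low_delay_max_ms=150.0, high_delay_min_ms=300.0,
                      step=2, min_width=1, max_width=32):
    """Widen the beam when the delay budget is large, narrow it when it is small."""
    if available_ms >= high_delay_min_ms:
        # High delay range: more time available, so search more widely.
        return min(width + step, max_width)
    if available_ms <= low_delay_max_ms:
        # Low delay range: little time available, so search more narrowly.
        return max(width - step, min_width)
    return width  # between the two ranges: leave the width unchanged


print(adjust_beam_width(8, 399))  # 10
print(adjust_beam_width(8, 130))  # 6
```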

In other embodiments, the delay control parameter may further include a pruning factor, and the above-mentioned adjustment of the value of the delay control parameter further includes:

In the process of finding the best path through the vocabulary using the beam search algorithm, a pruning condition is determined based on the pruning factor; candidate characters whose confidence meets the pruning condition are retained, and candidate characters whose confidence does not meet the pruning condition are discarded.

In an example, the pruning condition determined based on the pruning factor may be as follows: after the candidate character with the best confidence is determined, the product of the pruning factor and the best confidence is calculated as the pruning condition. Candidate characters whose confidence is greater than this product are retained, and candidate characters whose confidence is less than or equal to this product are discarded. Written out from this description, the condition for retaining a candidate character c is:

P(c) > α · P_best

Here P(c) denotes the confidence of candidate character c, P_best denotes the best confidence, and α is the pruning factor; the larger α is set, the stricter the selection of candidate characters.
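A minimal sketch of this product-based pruning is given below. It assumes that confidences are positive probabilities and that the pruning factor lies in the interval [0, 1); the function name and the dictionary representation of the candidates are illustrative choices, not part of this application.

```python
def prune_by_ratio(candidates: dict[str, float], alpha: float) -> dict[str, float]:
    """Keep candidates whose confidence exceeds alpha times the best confidence.

    Assumes positive confidences and 0 <= alpha < 1, so that the best
    candidate itself always survives.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1)")
    if not candidates:
        return {}
    best = max(candidates.values())
    threshold = alpha * best  # the pruning condition
    # Confidence less than or equal to the threshold is discarded.
    return {c: p for c, p in candidates.items() if p > threshold}
```

A larger `alpha` raises the threshold, so fewer candidates survive.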

In another example, the pruning condition determined based on the pruning factor may be as follows: if the difference between the best confidence and the confidence of a candidate character exceeds the pruning factor, the candidate character is discarded; if the difference does not exceed the pruning factor, the candidate character is retained. Written out from this description, the condition for retaining a candidate character c is:

P_best − P(c) ≤ η

Here P(c) denotes the confidence of candidate character c, P_best denotes the best confidence, and η is the pruning factor; the smaller η is set, the stricter the selection of candidate characters.
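The difference-based variant can be sketched in the same way. The example assumes scores on a scale where a larger value means higher confidence (log-probabilities are used in the usage below); the function name and data layout are illustrative assumptions.

```python
def prune_by_margin(candidates: dict[str, float], eta: float) -> dict[str, float]:
    """Keep candidates whose confidence is within eta of the best confidence."""
    if eta < 0:
        raise ValueError("eta is assumed to be non-negative")
    if not candidates:
        return {}
    best = max(candidates.values())
    # A candidate is discarded when its gap to the best confidence exceeds eta.
    return {c: s for c, s in candidates.items() if best - s <= eta}
```

A smaller `eta` tolerates a smaller gap to the best candidate, so fewer candidates survive.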

With an adjustable beam search width, a larger width can be set. Combined with the screening by the pruning factor, the actual search width will then be smaller than the beam search width, and the delay is reduced.

In other embodiments, the delay control parameter may further include a path weight, and the above-mentioned adjustment of the value of the delay control parameter further includes:

If the number of descendant characters of a candidate character exceeds the preset maximum number of candidates, the path weight of the descendant character is decreased.

In this embodiment, it is observed that, under the above beam search width and pruning factor strategies, the candidates with high scores often come from the same historical path. If one candidate character (token) is too strong, the candidate tokens of the next layer may all be derived from that strong token, so candidate positions need to be reserved for other tokens with lower scores, to avoid the search paths concentrating on the descendants of a single node. In implementation, the maximum number of candidates from the same node can be limited: if the number of descendant characters of a candidate character exceeds the preset maximum number of candidates, the path weight of the descendant characters is reduced, so that other nodes with lower scores have relatively higher path weights and are more easily selected. The goal of increasing branch diversity is thus achieved without increasing the beam size.
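One possible reading of this strategy is sketched below. It assumes log-domain scores, applies a fixed `penalty` only to those descendants of a node that exceed `max_per_parent`, and then keeps the top `beam_width` candidates. The tuple layout, the function name, and the choice to penalize only the surplus descendants are assumptions made for the example; this application does not fix these details.

```python
from collections import defaultdict


def select_diverse(
    expansions: list[tuple[str, str, float]],
    beam_width: int,
    max_per_parent: int,
    penalty: float,
) -> list[tuple[str, str, float]]:
    """Select beam candidates while limiting descendants of a single node.

    Each expansion is (parent_id, token, log_score).
    """
    ranked = sorted(expansions, key=lambda e: e[2], reverse=True)
    seen: dict[str, int] = defaultdict(int)
    rescored = []
    for parent, token, score in ranked:
        seen[parent] += 1
        if seen[parent] > max_per_parent:
            # Surplus descendant of a dominant node: reduce its path weight.
            score -= penalty
        rescored.append((parent, token, score))
    # Re-rank after the weight reduction and keep the beam.
    rescored.sort(key=lambda e: e[2], reverse=True)
    return rescored[:beam_width]
```

With a beam of three and at most two descendants per node, a third strong descendant of the same node is down-weighted, which lets a lower-scoring candidate from another node enter the beam without enlarging it.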

In this embodiment, adaptively adjustable delay control parameters, such as an adjustable beam search width, a pruning factor and a path weight, are configured in the ASR system. After the delay level of the speech signal to be recognized is determined, the network transmission delay time, the jitter buffer delay time, the speech decoding delay time, the speech recognition delay estimation time and the target delay estimation time are combined to judge whether the speech recognition delay needs to be dynamically adjusted. When it does, the value of the delay control parameter is adjusted to adjust the speech recognition delay. When the network delay is high, speech decoding is simplified to keep speech recognition real-time: the recognition rate is lower, but at least basic fluency is maintained. When the network delay is normal, high-quality speech recognition decoding is used to improve the recognition rate.
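The judgment summarized above can be sketched as follows: adjustment is needed when the sum of the four delay components exceeds the target delay estimation time, and the available speech recognition delay time is the target minus the network transmission, jitter buffer and speech decoding delays. The class name, field names and the use of milliseconds are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DelayEstimate:
    network_ms: float               # network transmission delay time
    jitter_buffer_ms: float         # jitter buffer delay time
    decode_ms: float                # speech decoding delay time
    recognition_estimate_ms: float  # speech recognition delay estimation time


def needs_adjustment(d: DelayEstimate, target_ms: float) -> bool:
    """True when the total estimated delay exceeds the target delay estimation time."""
    total = (
        d.network_ms
        + d.jitter_buffer_ms
        + d.decode_ms
        + d.recognition_estimate_ms
    )
    return total > target_ms


def available_recognition_ms(d: DelayEstimate, target_ms: float) -> float:
    """Time budget left for recognition after the other delay components."""
    return target_ms - (d.network_ms + d.jitter_buffer_ms + d.decode_ms)
```

The available time computed this way is the quantity according to which the delay control parameters, such as the beam search width, are then adjusted.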

Embodiment 3

FIG. 5 is a structural block diagram of a delay control apparatus provided in Embodiment 3 of the present application. The apparatus may be applied to a speech recognition system, where the speech recognition system includes delay control parameters, and the apparatus may include:

The delay level determination unit 510 is configured to determine the delay level of the speech signal to be recognized; the target delay determination unit 520 is configured to determine the target delay estimation time of the speech signal to be recognized according to the delay level; the time-varying and time-invariant delay determination unit 530 is configured to determine the time-varying delay time and the non-time-varying delay time of the speech signal to be recognized; the delay adjustment judging unit 540 is configured to combine the time-varying delay time, the non-time-varying delay time and the target delay estimation time to judge whether it is necessary to adjust the speech recognition delay time of the speech recognition system; and the delay control parameter adjustment unit 550 is configured to adjust the value of the delay control parameter when it is determined that the speech recognition delay time of the speech recognition system needs to be adjusted.

The delay control device provided in the embodiment of the present application can execute the delay control method provided by any embodiment of the present application, and has functional modules and effects corresponding to the execution method.

Embodiment 4

FIG. 6 is a schematic structural diagram of a server according to Embodiment 4 of the present application. As shown in FIG. 6, the server includes a processor 610, a memory 620, an input device 630, and an output device 640. The number of processors 610 in the server may be one or more, and one processor 610 is taken as an example in FIG. 6. The processor 610, the memory 620, the input device 630 and the output device 640 in the server can be connected through a bus or by other means, and connection through a bus is taken as an example in FIG. 6.

As a computer-readable storage medium, the memory 620 may be configured to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the delay control method in the embodiments of the present application. The processor 610 executes various functional applications and data processing of the server by running the software programs, instructions and modules stored in the memory 620, that is, to implement the above-mentioned delay control method.

Embodiment 5

Embodiment 5 of the present application further provides a storage medium including computer-executable instructions, where the computer-executable instructions are used to execute the delay control method in the foregoing embodiments when executed by a processor of a server.

For the embodiments of the apparatus, electronic equipment, and storage medium, since they are basically similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for related parts.

Claims

A delay control method, applied in a speech recognition system, wherein the speech recognition system includes delay control parameters, and the method includes:

Determine the delay level of the speech signal to be recognized;

Determine the target delay estimation time of the speech signal to be recognized according to the delay level;

determining the time-varying delay time and the time-invariant delay time of the to-be-recognized speech signal;

Combining the time-varying delay time, the non-time-varying delay time and the target delay estimation time, it is judged whether it is necessary to adjust the speech recognition delay time of the speech recognition system;

In response to determining that the speech recognition delay time of the speech recognition system needs to be adjusted, the value of the delay control parameter is adjusted.
The delay control method according to claim 1, wherein the time-invariant delay time includes a network transmission delay time, and the time-variant delay time includes a jitter buffer delay time, a speech decoding delay time, and a speech recognition delay estimation time;

The combination of the time-varying delay time, the non-time-varying delay time, and the target delay estimation time to determine whether it is necessary to adjust the speech recognition delay time of the speech recognition system includes:

Determine whether the sum of the network transmission delay time, the jitter buffer delay time, the speech decoding delay time and the speech recognition delay estimation time is less than or equal to the target delay estimation time;

In response to the sum of the network transmission delay time, the jitter buffer delay time, the speech decoding delay time and the speech recognition delay estimation time being less than or equal to the target delay estimation time, determining that the speech recognition delay time of the speech recognition system does not need to be adjusted;

In response to the sum of the network transmission delay time, the jitter buffer delay time, the speech decoding delay time and the speech recognition delay estimation time being greater than the target delay estimation time, determining that the speech recognition delay time of the speech recognition system needs to be adjusted.
The delay control method according to claim 2, wherein the determining the time-varying delay time of the speech signal to be recognized comprises:

Acquiring multiple network transmission delay times of the speech signal within a preset time period before the to-be-recognized speech signal, and calculating the jitter between adjacent network transmission delay times;

Calculate the standard deviation of the jitter, adjust the length of the jitter buffer according to the standard deviation, and use the length of the jitter buffer as the jitter buffer delay time.
The delay control method according to claim 2, wherein the determining the time-varying delay time of the speech signal to be recognized comprises:

Determine the target codec algorithm and code rate corresponding to the voice signal to be recognized, and obtain the target codec algorithm and the corresponding voice decoding delay time of the code rate.
The delay control method according to claim 2, wherein the determining the time-varying delay time of the speech signal to be recognized comprises:

Obtain the speech recognition delay time of the speech signal of the last unit time;

Inputting the speech signal to be recognized and the speech recognition delay time of the speech signal of the last unit time into a trained delay prediction model, and obtaining the speech recognition delay estimation time output by the delay prediction model.
The delay control method according to any one of claims 2-5, wherein the adjusting the value of the delay control parameter comprises:

calculating the difference between the target delay estimation time and the sum of the network transmission delay time, the jitter buffer delay time and the speech decoding delay time, and using the difference as the available speech recognition delay time;

The value of the delay control parameter is adjusted according to the available speech recognition delay time.
The delay control method of claim 6, wherein the delay control parameter includes a beam search width;

The adjusting the value of the delay control parameter according to the available speech recognition delay time includes:

When the available speech recognition delay time is within a preset high delay range, increasing the beam search width;

When the available speech recognition delay time is within a preset low delay range, the beam search width is reduced.
The delay control method according to claim 7, wherein the delay control parameter further comprises a pruning factor, and the adjusting the value of the delay control parameter further comprises:

In the process of using the beam search algorithm to find the path formed by the vocabulary, a pruning condition is determined based on the pruning factor; candidate characters whose confidence meets the pruning condition are retained, and candidate characters whose confidence does not meet the pruning condition are discarded.
The delay control method according to claim 8, wherein the delay control parameter further comprises a path weight, and the adjusting the value of the delay control parameter further comprises:

In the case that the number of descendant characters of a candidate character exceeds the preset maximum number of candidates, the path weight of descendant characters of the candidate character is reduced.
The delay control method according to any one of claims 1-5, wherein the determining the target delay estimation time of the speech signal to be recognized according to the delay level comprises:

The delay level is searched in a preset delay level data table to obtain a target delay estimation time corresponding to the delay level, wherein the delay level includes a high delay level, a medium delay level or a low delay level.
The delay control method according to claim 10, wherein the determining the delay level of the speech signal to be recognized comprises:

In the case that a preset sensitive word is included in the previous speech recognition result, determine that the delay level of the to-be-recognized speech signal is the high delay level;

In the case where the previous speech recognition result contains a preset rare word and the number of the rare word is greater than or equal to the preset number, determine that the delay level of the speech signal to be recognized is the medium delay level;

In the case that the previous speech recognition result does not contain a preset sensitive word or rare word, or the number of rare words contained is less than the preset number, it is determined that the delay level of the speech signal to be recognized is the low delay level.
A delay control device is applied in a speech recognition system, the speech recognition system includes delay control parameters, and the device includes:

A delay level determination unit, configured to determine the delay level of the speech signal to be recognized;

a target delay determination unit, configured to determine the target delay estimation time of the speech signal to be recognized according to the delay level;

a time-varying and time-invariant delay determining unit, configured to determine the time-varying delay time and the time-invariant delay time of the speech signal to be recognized;

A delay adjustment and judgment unit, configured to combine the time-varying delay time, the non-time-varying delay time and the target delay estimation time to determine whether it is necessary to adjust the speech recognition delay time of the speech recognition system;

The delay control parameter adjustment unit is configured to adjust the value of the delay control parameter in response to determining that the speech recognition delay time of the speech recognition system needs to be adjusted.
A server, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein, when the processor executes the program, the delay control method according to any one of claims 1-11 is implemented.
A computer-readable storage medium storing a computer program, wherein when the program is executed by a processor, the delay control method according to any one of claims 1-11 is implemented.