WO2022042159A1 - Delay control method and apparatus - Google Patents

Delay control method and apparatus

Info

Publication number
WO2022042159A1
Authority
WO
WIPO (PCT)
Prior art keywords
delay
time
speech recognition
speech
delay time
Prior art date
Application number
PCT/CN2021/108217
Other languages
French (fr)
Chinese (zh)
Inventor
陈江
胡正伦
Original Assignee
百果园技术(新加坡)有限公司
陈江
Priority date
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司, 陈江
Publication of WO2022042159A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/10 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a multipulse excitation

Definitions

  • the present application relates to the technical field of speech recognition, for example, to a delay control method and device.
  • the speech decoder can be pruned or compressed in the training phase of the ASR model, but this impairs the recognition rate of speech recognition.
  • the present application provides a delay control method and device to solve the problem that the recognition rate is impaired in order to improve the real-time performance of speech recognition.
  • the present application provides a delay control method.
  • the method is applied to a speech recognition system.
  • the speech recognition system includes delay control parameters, and the method includes:
  • the value of the delay control parameter is adjusted.
  • the present application also provides a delay control device, the device is applied in a speech recognition system, the speech recognition system includes delay control parameters, and the device includes:
  • a delay level determining unit configured to determine the delay level of the speech signal to be recognized
  • a target delay determination unit configured to determine the target delay estimation time of the speech signal to be recognized according to the delay level
  • a time-varying and time-invariant delay determining unit configured to determine the time-varying delay time and the time-invariant delay time of the speech signal to be recognized
  • a delay adjustment judging unit configured to combine the time-varying delay time, the non-time-varying delay time and the target delay estimation time to determine whether it is necessary to adjust the speech recognition delay time of the speech recognition system;
  • the delay control parameter adjustment unit is configured to adjust the value of the delay control parameter when it is determined that the speech recognition delay time of the speech recognition system needs to be adjusted.
  • the present application also provides a server, including a memory, a processor, and a computer program stored in the memory and running on the processor, where the processor implements the above-mentioned delay control method when executing the program.
  • the present application also provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the above-mentioned delay control method is implemented.
  • FIG. 1 is a flowchart of an embodiment of a delay control method provided in Embodiment 1 of the present application;
  • FIG. 2 is a schematic diagram of a server framework provided by Embodiment 1 of the present application.
  • FIG. 3 is a flowchart of another embodiment of a delay control method provided in Embodiment 2 of the present application.
  • FIG. 4 is a schematic diagram of another server framework provided by Embodiment 2 of the present application.
  • FIG. 5 is a structural block diagram of a delay control device provided in Embodiment 3 of the present application.
  • FIG. 6 is a schematic structural diagram of a server according to Embodiment 4 of the present application.
  • FIG. 1 is a flowchart of an embodiment of a delay control method provided in Embodiment 1 of the present application.
  • the delay control method of the embodiment of the present application may be applied to an ASR system.
  • the ASR system can be located in the server.
  • the server can include not only the ASR system, but also a voice decoder.
  • the voice data packets received from the client are decoded into Pulse Code Modulation (PCM) data by the voice decoder.
  • Step 110 Determine the delay level of the speech signal to be recognized.
  • the voice signal to be recognized may be PCM data obtained after the data packet input to the server is decoded by the voice decoder. After the ASR system obtains the to-be-recognized speech signal, the delay level of the to-be-recognized speech signal may be determined first.
  • the delay level can be used to represent the delay level that is suitable for the current context and generated when the ASR system performs speech recognition on the speech signal to be recognized.
  • the delay level may include, but is not limited to, a high delay level, a medium delay level, or a low delay level.
  • the delay level of the speech signal to be recognized may be determined in a rule-based manner.
  • Step 120 Determine a target delay estimation time of the speech signal to be recognized according to the delay level.
  • the target delay estimation time may be the text generation delay of the speech signal to be recognized currently estimated according to the previous speech recognition result.
  • the target delay estimation time differs across delay levels: the higher the delay level, the larger the target delay estimation time.
  • for a high delay level, the target delay estimation time can be set to a preset maximum value; for a medium delay level, both the recognition rate and the decoding speed are taken into account, and the target delay estimation time is set so that the user's experience is not affected while a higher recognition rate is maintained.
  • for example, for a medium delay level the target delay estimation time can be set to 150ms; when the delay exceeds 150ms, the user's experience is affected.
  • for a low delay level, the target delay estimation time can be set to 100ms.
  • Step 130 Determine the time-varying delay time and the time-invariant delay time of the speech signal to be recognized.
  • the processing of speech data includes stages such as network transmission, speech decoding, and speech recognition, and each stage may bring about time-varying delay or time-invariant delay. Then, in step 130, the time-varying delay time and the time-invariant delay time of the speech signal to be recognized can be obtained.
  • the time-invariant delay time may include, but is not limited to, network transmission delay time; the time-variant delay time may include, but is not limited to: jitter buffer delay time, speech decoding delay time, and speech recognition delay estimation time.
  • Step 140 Determine whether the speech recognition delay time of the speech recognition system needs to be adjusted in combination with the time-varying delay time, the non-time-varying delay time, and the target delay estimation time.
  • the time-varying delay time and the time-invariant delay time may be compared with the target delay estimation time, and it is determined whether the speech recognition delay time of the speech recognition system needs to be adjusted according to the comparison result.
  • step 140 may include the following steps:
  • Step 150 if it is determined that the speech recognition delay time of the speech recognition system needs to be adjusted, adjust the value of the delay control parameter.
  • the ASR system has an adjustable variable, that is, the delay control parameter.
  • when it is determined that the speech recognition delay time of the ASR system needs to be adjusted, the value of the delay control parameter can be adjusted to change the decoding speed, thereby achieving the purpose of adjusting the speech recognition delay.
  • in this embodiment, an adaptively adjusted delay control parameter is configured in the ASR system. The delay level, the time-varying delay time and the time-invariant delay time of the speech signal to be recognized are determined, and the target delay estimation time of the speech signal to be recognized is determined according to the delay level. Then, combining the time-varying delay time, the time-invariant delay time and the target delay estimation time, when it is determined that the speech recognition delay time of the ASR system needs to be adjusted, the value of the delay control parameter in the ASR system is adjusted.
  • in this way, the ASR system can use high-precision speech recognition decoding in a low-latency environment, and use low-precision, low-latency speech recognition decoding in a high-latency environment to meet the target delay.
  • FIG. 3 is a flowchart of another embodiment of a delay control method provided in Embodiment 2 of the present application. This embodiment of the present application is described on the basis of Embodiment 1.
  • the time-varying delay time and the non-time-varying delay time are exemplified.
  • the network delay t1 can be obtained when the data packet from the client is received, the speech decoding delay time t2 can be obtained when the voice decoder is used to decode the data packet, and the speech recognition delay estimation time t3' is estimated when the speech recognition decoder in the ASR system performs speech recognition.
  • the adaptive ASR determines whether delay adjustment is required, and if delay adjustment is required, the value of the delay control parameter in the ASR system is adjusted to achieve the purpose of dynamically adjusting the speech recognition delay.
  • this embodiment includes the following steps:
  • Step 310 Determine the delay level of the speech signal to be recognized, where the delay level includes a high delay level, a medium delay level or a low delay level.
  • the delay level can be used to represent the delay level that is suitable for the current context and generated when the ASR system performs speech recognition on the speech signal to be recognized.
  • the delay level may include, but is not limited to, a high delay level, a medium delay level, or a low delay level.
  • step 310 may include the following steps:
  • the previous speech recognition result may be the speech recognition result of a period of time before the current speech signal to be recognized, wherein the length of the previous period of time may be determined according to actual requirements.
  • the length is not limited.
  • a rule base of sensitive words can be established in advance. If the speech recognition results of the preceding period hit sensitive words in the rule base, for example, the conversation contains sensitive words relating to violence, terrorism, politics or pornography, then in this context high word accuracy and a high recognition rate must be ensured in order to identify potential sensitive words. That is, a high recognition rate is the first priority and the corresponding delay is relatively high, so it can be determined that the delay level of the speech signal to be recognized is the high delay level.
  • Step 310-2 if the previous speech recognition result contains a preset rare word and the number of the rare word is greater than or equal to the preset number, determine that the delay level of the speech signal to be recognized is a medium delay level.
  • a rare word rule base may be established in advance. If the speech recognition results of the preceding period hit rare words in the rare word rule base, and the number of rare words hit is greater than or equal to a preset number (for example, more than 5 rare words appear in a news broadcast), then in this context the recognition difficulty for the ASR system is relatively high, the decoding speed is relatively low, and the delay is relatively high, but the recognition does not need to be as precise as for sensitive words. Therefore, it can be determined that the delay level of the speech signal to be recognized is a medium delay level.
  • Step 310-3 if the previous speech recognition result contains neither the preset sensitive words nor the preset rare words, or the number of rare words contained is less than the preset number, determine that the delay level of the speech signal to be recognized is a low delay level.
  • if the speech recognition results of the preceding period hit neither sensitive words nor rare words, or the number of rare words is less than the preset number (for example, speech recognition of entertainment content), then in this context giving the user a timely text experience is the highest priority, so it can be determined that the delay level of the speech signal to be recognized is a low delay level. Of course, an appropriate word accuracy rate still needs to be ensured at the low delay level.
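The rule-based classification described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name, the representation of the previous recognition result as a list of words, and counting every occurrence of a rare word (rather than distinct rare words) are all assumptions.

```python
from enum import Enum


class DelayLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def determine_delay_level(previous_words, sensitive_words, rare_words, rare_threshold):
    """Classify the delay level from the previous recognition result.

    previous_words: words recognized in the preceding period (assumed format).
    sensitive_words / rare_words: the two rule bases, given as sets.
    rare_threshold: the preset number of rare-word hits.
    """
    # Sensitive-word rule: any hit means accuracy comes first -> high delay level.
    if any(word in sensitive_words for word in previous_words):
        return DelayLevel.HIGH
    # Step 310-2: enough rare-word hits -> medium delay level.
    rare_hits = sum(1 for word in previous_words if word in rare_words)
    if rare_hits >= rare_threshold:
        return DelayLevel.MEDIUM
    # Step 310-3: neither rule fired -> low delay level.
    return DelayLevel.LOW
```

The sensitive-word rule is checked first because the text treats a high recognition rate as the first priority whenever a sensitive word is hit.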
  • Step 320 Determine a target delay estimation time of the speech signal to be recognized according to the delay level.
  • the target delay estimation time may be the text generation delay of the speech signal to be recognized currently estimated according to the previous speech recognition result.
  • the target delay estimation time differs across delay levels: the higher the delay level, the larger the target delay estimation time.
  • the above-mentioned high delay level, medium delay level or low delay level may be searched in a preset delay level data table to obtain the corresponding target delay estimation time.
  • a delay level data table may be created in advance, and the delay level data table records the association relationship between a plurality of delay levels and the corresponding target delay estimation time.
  • the corresponding target delay estimation time can be obtained by querying the delay level data table. For example, when the current delay level of the speech signal to be recognized is a low delay level, the target delay estimation time obtained by looking up the table may be 100ms; when it is a medium delay level, the target delay estimation time obtained may be 150ms; when it is a high delay level, the target delay estimation time is a preset maximum value.
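A delay level data table of this kind could be as simple as a dictionary. In the sketch below, the 100ms and 150ms entries are the example values from the text; the preset maximum for the high level is not specified there, so `PRESET_MAX_MS` is a hypothetical placeholder, as are the key strings and the function name.

```python
# Hypothetical placeholder: the text only says "a preset maximum value".
PRESET_MAX_MS = 500.0

# Association between delay levels and target delay estimation times (ms).
DELAY_LEVEL_TABLE_MS = {
    "low": 100.0,            # example value from the text
    "medium": 150.0,         # example value from the text
    "high": PRESET_MAX_MS,   # assumption, see above
}


def target_delay_ms(delay_level):
    """Look up the target delay estimation time for a delay level."""
    try:
        return DELAY_LEVEL_TABLE_MS[delay_level]
    except KeyError:
        raise ValueError(f"unknown delay level: {delay_level!r}") from None
```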
  • Step 330 Determine a time-varying delay time and a time-invariant delay time of the speech signal to be recognized, where the time-invariant delay time includes a network transmission delay time, and the time-varying delay time includes a jitter buffer delay time, a speech decoding delay time and a speech recognition delay estimation time.
  • the network transmission delay time refers to the transmission delay from the client to the server, and the delay is time-invariant.
  • the jitter buffer delay time is the time-varying delay generated by Jitter Buffer Management (JBM).
  • jitter buffer management first stores the received voice data in a buffer, and then selects voice data from the buffer according to the delay time of the current network and the time obtained by the upper layer of the current network, so as to eliminate the jitter caused by unstable network transmission quality. Using jitter buffer management ensures smooth speech at the expense of latency.
  • the jitter buffer delay time depends on the algorithm design.
  • the jitter buffer delay time may be determined in the following manner: acquiring multiple network transmission delay times of the speech signal within a preset time period before the speech signal to be recognized, and calculating the jitter between adjacent network transmission delay times; calculating the standard deviation of the jitter, and adjusting the length of the jitter buffer according to the standard deviation, the length being used as the jitter buffer delay time.
  • network delay = network transmission delay time + jitter buffer delay time.
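The jitter buffer calculation can be sketched as below. The text does not say how the standard deviation is mapped to a buffer length, so the linear `scale` factor and the `min_ms`/`max_ms` clamp are assumptions, as is taking the jitter to be the absolute difference between adjacent delay samples.

```python
from statistics import pstdev


def jitter_buffer_delay_ms(transmission_delays_ms, scale=2.0, min_ms=0.0, max_ms=200.0):
    """Derive a jitter buffer delay from recent network transmission delays.

    scale, min_ms and max_ms are hypothetical tuning parameters.
    """
    if len(transmission_delays_ms) < 2:
        return min_ms  # not enough samples to measure jitter
    # Jitter between adjacent network transmission delay times.
    jitter = [abs(later - earlier)
              for earlier, later in zip(transmission_delays_ms, transmission_delays_ms[1:])]
    # Standard deviation of the jitter drives the buffer length.
    spread = pstdev(jitter)
    return min(max(scale * spread, min_ms), max_ms)


def network_delay_ms(transmission_delay_ms, jitter_delay_ms):
    """network delay = network transmission delay time + jitter buffer delay time."""
    return transmission_delay_ms + jitter_delay_ms
```

With a perfectly steady network the jitter is constant, its standard deviation is zero, and the sketch falls back to the minimum buffer length.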
  • the voice decoding delay time is the delay generated by using the voice decoder to decode the received voice data packets into PCM data. Different voice codec algorithms will produce different voice decoding delay times.
  • the speech decoding delay time may be determined in the following manner: determining the target codec algorithm and code rate corresponding to the speech signal to be recognized, and obtaining the speech decoding delay time corresponding to the target codec algorithm and the code rate.
  • for example, the codec algorithm may be Advanced Audio Coding (AAC), Moving Picture Experts Group Audio Layer 3 (MP3) or Adaptive Multi-Rate Wideband (AMR-WB).
  • a data table recording the speech decoding delay times corresponding to different codec algorithms and different code rates can be created in advance, and the corresponding speech decoding delay time is obtained by looking up the table.
  • the voice decoding delay time of AMR-WB is 0.9375ms
  • the voice decoding delay time of MP3 is 140ms
  • the voice decoding delay time of AAC is 210ms
  • the voice decoding delay time of High Efficiency-Advanced Audio Coding (HE-AAC) is 360ms.
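The decoding delay lookup could be sketched as a table keyed by codec name. The four values are the ones listed above; the text also keys the table by code rate but gives no per-rate values, so the sketch omits that dimension. The function name and the case-insensitive lookup are assumptions.

```python
# Speech decoding delay per codec algorithm, in milliseconds (values from the text).
DECODE_DELAY_MS = {
    "AMR-WB": 0.9375,
    "MP3": 140.0,
    "AAC": 210.0,
    "HE-AAC": 360.0,
}


def speech_decoding_delay_ms(codec):
    """Return the recorded decoding delay for a codec name (case-insensitive)."""
    try:
        return DECODE_DELAY_MS[codec.upper()]
    except KeyError:
        raise ValueError(f"no decoding delay recorded for codec {codec!r}") from None
```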
  • the speech recognition delay estimation time is the estimated decoding delay of the ASR system for the recognized speech signal. Since the decoding speed of the ASR system is different under different speech or noise conditions, the estimated speech recognition delay estimation time is also dynamically changed.
  • the speech recognition delay estimation time may be determined in the following manner: acquiring the speech recognition delay time of the speech signal of the previous unit time; inputting the speech signal to be recognized and that speech recognition delay time into the trained delay prediction model, and obtaining the speech recognition delay estimation time output by the delay prediction model.
  • a delay prediction model may be pre-trained, and then the delay time of the current speech signal is estimated by using the speech recognition delay time of the speech signal of a unit time as a reference. Input the speech recognition delay time of the speech signal of the last unit time and the current speech signal to be recognized into the delay prediction model, which is processed by the delay prediction model to output the speech recognition delay estimation time of the current speech signal to be recognized.
  • the delay prediction model may be a neural network model, and a general neural network model training method may be used to train the delay prediction model, and this embodiment does not limit the training method of the delay prediction model.
  • Step 340 if it is determined that the sum of the network transmission delay time, the jitter buffer delay time, the speech decoding delay time and the speech recognition delay estimation time is greater than the target delay estimation time, determine that the speech recognition delay time of the speech recognition system needs to be adjusted.
  • Step 350 Calculate the difference between the estimated target delay time and the sum of the network transmission delay time, the jitter buffer delay time, and the speech decoding delay time, and use the difference as the available speech recognition delay time.
  • the available speech recognition delay time of the ASR system can be calculated.
  • the difference between the target delay estimation time and the sum of the network transmission delay time, the jitter buffer delay time and the speech decoding delay time can be calculated, and the difference can be used as the available speech recognition delay time.
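  • for example, suppose the target delay estimation time is 500ms.
  • if, after the network transmission delay time, the jitter buffer delay time and the speech decoding delay time are subtracted, the speech recognition delay for the speech signal to be recognized is only allowed to be 130ms, then the decoding speed needs to be relatively fast, and the recognition accuracy that can be achieved is not as high.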
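Steps 340 and 350 amount to a comparison and a subtraction, sketched below. The function names are illustrative. In the usage example, the 500ms target and the 130ms result come from the text, while the individual component delays (100ms, 60ms, 210ms) and the 200ms recognition estimate are hypothetical values chosen only so that the arithmetic reproduces that 130ms figure.

```python
def needs_adjustment(target_ms, network_ms, jitter_ms, decode_ms, asr_estimate_ms):
    """Step 340: adjust when the total estimated delay exceeds the target."""
    return network_ms + jitter_ms + decode_ms + asr_estimate_ms > target_ms


def available_asr_delay_ms(target_ms, network_ms, jitter_ms, decode_ms):
    """Step 350: budget left for speech recognition after the other delays."""
    return target_ms - (network_ms + jitter_ms + decode_ms)


# Hypothetical component delays; only the 500ms target and the 130ms
# budget are taken from the text.
budget = available_asr_delay_ms(500.0, network_ms=100.0, jitter_ms=60.0, decode_ms=210.0)
adjust = needs_adjustment(500.0, 100.0, 60.0, 210.0, asr_estimate_ms=200.0)
```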
  • Step 360 Adjust the value of the delay control parameter according to the available speech recognition delay time.
  • the delay control parameter may be a parameter related to a beam search (Beam search) algorithm.
  • the delay control parameter may include beam search width.
  • the beam search algorithm is used to find the best path through the vocabulary, where the search space formed by the vocabulary is all possible character tokens.
  • starting from the root node <sos>, the k nodes with the largest expansion probability among all possible expansion nodes are selected as the next layer of nodes; then, among the possible expansion nodes of these k nodes, the k nodes with the largest cumulative (multiplied) probability are selected for expansion, and so on, forming a tree structure with k nodes in each layer. Finally, the path with the highest score is selected and backtracked to obtain the character sequence with the highest probability.
  • Beam search is a search strategy with limited range, so the value of beam search width k needs to be large enough to get close to the optimal solution, but at the cost of a large amount of computation and a long delay.
  • conventionally, the value of the beam search width k is a fixed value determined in advance according to prior knowledge. Although increasing the beam width can improve the recognition rate, the cost is a lower decoding speed. A fixed search range is inefficient because it also covers candidates whose confidence is far lower than that of the current best candidate.
  • the beam search width is modified to an adjustable variable, and when the speech recognition delay time of the speech recognition system needs to be adjusted, the decoding speed can be adjusted by adjusting the beam search width.
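To make the role of the width k concrete, here is a minimal beam search sketch in which `beam_width` is an ordinary argument that a caller can change between utterances. It is a generic textbook formulation, not the decoder of the application: `next_token_probs` is a hypothetical callable that maps a token sequence to a dictionary of next-token probabilities, and scores are accumulated as log-probabilities, which is equivalent to multiplying probabilities.

```python
import math


def beam_search(next_token_probs, beam_width, max_len, sos="<sos>", eos="<eos>"):
    """Return (best_sequence, log_probability) using a beam of size beam_width."""
    beams = [([sos], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                candidates.append((seq, score))  # finished paths carry over unchanged
                continue
            for token, prob in next_token_probs(seq).items():
                if prob > 0.0:
                    candidates.append((seq + [token], score + math.log(prob)))
        if not candidates:
            break  # nothing left to expand
        # Keep only the k best paths in this layer.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]
```

A wider beam evaluates more candidates per layer, which is where the extra computation and delay come from; a beam of width 1 degenerates to greedy decoding.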
  • step 360 may include the following steps:
  • when the available speech recognition delay time is within a preset high delay range, the beam search width is expanded; when the available speech recognition delay time is within a preset low delay range, the beam search width is reduced.
  • when the available speech recognition delay time is within a preset low delay range, that is, for a low delay scenario, the search range can be narrowed by reducing the beam search width, thereby increasing the decoding speed.
  • when the available speech recognition delay time is within the preset high delay range, that is, for a high delay scenario, the search range can be expanded by increasing the beam search width, which reduces the decoding speed and improves the decoding accuracy.
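The width adjustment can be sketched as a small function. The text specifies only the direction of the change for each range; the range boundaries (`low_max_ms`, `high_min_ms`), the step size and the width limits below are hypothetical values.

```python
def adjust_beam_width(current_width, available_ms,
                      low_max_ms=150.0, high_min_ms=300.0,
                      step=2, min_width=1, max_width=32):
    """Narrow or widen the beam according to the available recognition delay.

    All thresholds and limits are illustrative assumptions.
    """
    if available_ms <= low_max_ms:
        # Low delay range: narrow the search to decode faster.
        return max(min_width, current_width - step)
    if available_ms >= high_min_ms:
        # High delay range: widen the search to decode more accurately.
        return min(max_width, current_width + step)
    return current_width  # between the two ranges: leave the width unchanged
```

For instance, with the defaults above an available delay of 130ms would shrink a width of 8 to 6.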
  • the delay control parameter may further include a pruning factor, and the above-mentioned adjustment of the value of the delay control parameter further includes:
  • determining a pruning condition based on the pruning factor, retaining the candidate characters whose confidence meets the pruning condition, and discarding the candidate characters whose confidence does not meet the pruning condition.
  • the above-mentioned pruning condition determined based on the pruning factor may be: after determining the candidate character with the best confidence, calculating the product of the pruning factor and the best confidence as the pruning condition. Then, the candidate characters whose confidence is greater than the product of the pruning factor and the best confidence are retained, and the candidate characters whose confidence is less than or equal to that product are discarded. That is, with $s$ the confidence of a candidate character and $s_{\text{best}}$ the best confidence, a candidate character is retained when:
  • $s > \alpha \cdot s_{\text{best}}$
  • where $\alpha$ is the pruning factor, and the larger it is set, the stricter the selection of candidate characters.
  • alternatively, the above-mentioned pruning condition determined based on the pruning factor may be: if the difference between the best confidence and the confidence of a candidate character exceeds the pruning factor, the candidate character is discarded; if the difference does not exceed the pruning factor, the candidate character is retained. That is, a candidate character is retained when:
  • $s_{\text{best}} - s \le \beta$
  • where $\beta$ is the pruning factor, and the smaller it is set, the stricter the selection of candidate characters.
  • the actual search width will be smaller than the beam search width and the delay will be reduced.
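Both pruning conditions can be sketched in a few lines. The symbols `alpha` and `beta` for the pruning factor, the function names, and the representation of candidates as a dictionary from character to confidence are assumptions made for illustration.

```python
def prune_by_ratio(candidates, alpha):
    """Keep candidates whose confidence is greater than alpha * best confidence.

    A larger alpha is stricter. With alpha < 1 the best candidate always survives.
    """
    if not candidates:
        return {}
    best = max(candidates.values())
    return {char: conf for char, conf in candidates.items() if conf > alpha * best}


def prune_by_margin(candidates, beta):
    """Keep candidates whose confidence is within beta of the best confidence.

    A smaller beta is stricter.
    """
    if not candidates:
        return {}
    best = max(candidates.values())
    return {char: conf for char, conf in candidates.items() if best - conf <= beta}
```

Either function would be applied to the candidates of one layer before the top-k selection, so the number of paths actually expanded can fall below the beam search width.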
  • the delay control parameter may further include a path weight, and the above-mentioned adjustment of the value of the delay control parameter further includes:
  • when the number of descendant characters of a candidate character exceeds a preset maximum number of candidates, the path weight of the descendant characters is decreased.
  • with the above strategies of beam search width and pruning factor, candidates with high scores often come from the same historical path. If a candidate character token is too strong, the candidate tokens of the next layer may all be derived from that strong token, so it is necessary to reserve candidate positions for other tokens with lower scores, to avoid the search path concentrating on the descendants of a single node.
  • the maximum number of candidates from the same node can be limited. If the number of descendant characters of a candidate character exceeds the preset maximum number of candidates, the path weight of the descendant characters can be reduced, so that other nodes with lower scores have relatively higher weights and are easier to select. Based on the path weight of the descendant characters, the goal of increasing branch diversity is achieved without increasing the beam size.
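One possible reading of the path weight mechanism is sketched below. The text does not say which descendants are down-weighted or by how much, so this sketch makes two assumptions: only the descendants ranked beyond the per-parent maximum are penalized, and the penalty is a multiplicative `path_weight` applied to a probability-like score. All names are illustrative.

```python
from collections import defaultdict


def select_with_diversity(candidates, beam_width, max_per_parent, path_weight=0.5):
    """Pick beam_width candidates while discouraging a single dominant parent.

    candidates: list of (parent, token, score), with score in (0, 1].
    """
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    seen_per_parent = defaultdict(int)
    reweighted = []
    for parent, token, score in ranked:
        seen_per_parent[parent] += 1
        if seen_per_parent[parent] > max_per_parent:
            score *= path_weight  # excess descendant: reduce its path weight
        reweighted.append((parent, token, score))
    reweighted.sort(key=lambda c: c[2], reverse=True)
    return reweighted[:beam_width]
```

With candidates `a`, `b`, `c` from one parent (scores 0.40, 0.30, 0.20) and `d` from another (0.15), a beam of 3 with at most 2 candidates per parent selects `a`, `b` and `d`, since `c` is down-weighted to 0.10.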
  • adaptively adjusted delay control parameters are configured in the ASR system, such as adjustable beam search width, pruning factor and path weight, etc.
  • FIG. 5 is a structural block diagram of a delay control apparatus provided in Embodiment 3 of the present application.
  • the apparatus may be applied to a speech recognition system, where the speech recognition system includes delay control parameters, and the apparatus may include:
  • a delay level determination unit 510, configured to determine the delay level of the speech signal to be recognized;
  • a target delay determination unit 520, configured to determine the target delay estimation time of the speech signal to be recognized according to the delay level;
  • a time-varying and time-invariant delay determining unit 530, configured to determine the time-varying delay time and the time-invariant delay time of the speech signal to be recognized;
  • a delay adjustment judging unit 540, configured to combine the time-varying delay time, the time-invariant delay time and the target delay estimation time to determine whether it is necessary to adjust the speech recognition delay time of the speech recognition system;
  • a delay control parameter adjustment unit 550, configured to adjust the value of the delay control parameter when it is determined that the speech recognition delay time of the speech recognition system needs to be adjusted.
  • the delay control device provided in the embodiment of the present application can execute the delay control method provided by any embodiment of the present application, and has functional modules and effects corresponding to the execution method.
  • FIG. 6 is a schematic structural diagram of a server according to Embodiment 4 of the present application.
  • the server includes a processor 610, a memory 620, an input device 630, and an output device 640; the number of processors 610 in the server may be one or more, and one processor 610 is taken as an example in FIG. 6; the processor 610, the memory 620, the input device 630 and the output device 640 in the server can be connected through a bus or by other means, and connection through a bus is taken as an example in FIG. 6.
  • the memory 620 may be configured to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the delay control method in the embodiments of the present application.
  • the processor 610 executes various functional applications and data processing of the server by running the software programs, instructions and modules stored in the memory 620, that is, to implement the above-mentioned delay control method.
  • Embodiment 5 of the present application further provides a storage medium including computer-executable instructions, where the computer-executable instructions are used to execute the delay control method in the foregoing embodiments when executed by a processor of a server.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A delay control method and an apparatus. The delay control method is applied in a speech recognition system, the speech recognition system comprises a delay control parameter, and the method comprises: determining a delay level for a speech signal to be recognized (110); determining a target delay estimated time for the speech signal to be recognized according to the delay level (120); determining a time-varying delay time and a non-time-varying delay time for the speech signal to be recognized (130); determining whether it is necessary to adjust a speech recognition delay time of the speech recognition system by combining the time-varying delay time, the non-time-varying delay time, and the target delay estimated time (140); and in response to determining that it is necessary to adjust the speech recognition delay time of the speech recognition system, adjusting the value of the delay control parameter (150).

Description

Delay control method and apparatus
This application claims priority to Chinese patent application No. 202010901269.2, filed with the China Patent Office on August 31, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of speech recognition, for example, to a delay control method and apparatus.
Background
Automatic Speech Recognition (ASR) takes speech as its object of study and, through speech signal processing and pattern recognition, enables machines to automatically recognize and understand spoken human language. Speech recognition technology allows a machine to convert a speech signal into the corresponding text or commands through a process of recognition and understanding. With the development of information technology, speech recognition is gradually becoming a key technology in computer information processing, and its application scenarios are becoming increasingly broad; for example, it can be applied to adding subtitles, identifying sensitive content in conversations, and human-computer interaction.
When speech recognition technology is used, delays such as network delay and speech decoding delay are unavoidable, which makes it difficult for speech-to-text conversion to meet the real-time requirements of services. To improve the real-time performance of speech recognition, the related art may prune or compress the speech decoder during the training phase of the ASR model, but doing so impairs the recognition rate.
Summary
The present application provides a delay control method and apparatus, to solve the problem that the recognition rate is impaired when the real-time performance of speech recognition is improved.
The present application provides a delay control method. The method is applied in a speech recognition system, the speech recognition system includes a delay control parameter, and the method includes:
determining a delay level of a speech signal to be recognized;
determining a target delay estimated time of the speech signal to be recognized according to the delay level;
determining a time-varying delay time and a non-time-varying delay time of the speech signal to be recognized;
determining, by combining the time-varying delay time, the non-time-varying delay time, and the target delay estimated time, whether a speech recognition delay time of the speech recognition system needs to be adjusted;
in response to determining that the speech recognition delay time of the speech recognition system needs to be adjusted, adjusting the value of the delay control parameter.
The present application also provides a delay control apparatus. The apparatus is applied in a speech recognition system, the speech recognition system includes a delay control parameter, and the apparatus includes:
a delay level determination unit, configured to determine a delay level of a speech signal to be recognized;
a target delay determination unit, configured to determine a target delay estimated time of the speech signal to be recognized according to the delay level;
a time-varying and non-time-varying delay determination unit, configured to determine a time-varying delay time and a non-time-varying delay time of the speech signal to be recognized;
a delay adjustment judgment unit, configured to determine, by combining the time-varying delay time, the non-time-varying delay time, and the target delay estimated time, whether a speech recognition delay time of the speech recognition system needs to be adjusted;
a delay control parameter adjustment unit, configured to adjust the value of the delay control parameter when it is determined that the speech recognition delay time of the speech recognition system needs to be adjusted.
The present application also provides a server, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above delay control method when executing the program.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above delay control method.
Brief Description of the Drawings
FIG. 1 is a flowchart of a delay control method provided in Embodiment 1 of the present application;
FIG. 2 is a schematic diagram of a server framework provided in Embodiment 1 of the present application;
FIG. 3 is a flowchart of another delay control method provided in Embodiment 2 of the present application;
FIG. 4 is a schematic diagram of another server framework provided in Embodiment 2 of the present application;
FIG. 5 is a structural block diagram of a delay control apparatus provided in Embodiment 3 of the present application;
FIG. 6 is a schematic structural diagram of a server provided in Embodiment 4 of the present application.
Detailed Description
The present application is described below with reference to the accompanying drawings and embodiments.
Embodiment 1
FIG. 1 is a flowchart of a delay control method provided in Embodiment 1 of the present application. The delay control method of this embodiment may be applied in an ASR system.
The ASR system may be located in a server. As shown in FIG. 2, in addition to the ASR system, the server may include a speech decoder. Speech data packets sent by a client are decoded by the speech decoder into Pulse Code Modulation (PCM) data and passed to the ASR system, which performs speech recognition decoding on the PCM data and outputs a speech recognition result; the result may be text.
This embodiment may include the following steps:
Step 110: determine a delay level of a speech signal to be recognized.
In one example, the speech signal to be recognized may be the PCM data obtained after a data packet sent to the server is decoded by the speech decoder. After the ASR system obtains the speech signal to be recognized, it may first determine the delay level of that signal.
In this embodiment, the delay level indicates the degree of delay, suited to the current context, that arises when the ASR system performs speech recognition on the speech signal to be recognized. As an example, the delay level may include, but is not limited to, a high delay level, a medium delay level, or a low delay level.
In one implementation, the delay level of the speech signal to be recognized may be determined in a rule-based manner.
Step 120: determine a target delay estimated time of the speech signal to be recognized according to the delay level.
In this step, the target delay estimated time may be the text generation delay of the current speech signal to be recognized, estimated from prior speech recognition results. The target delay estimated time differs between speech scenarios and therefore between delay levels: the higher the delay level, the larger the target delay estimated time.
For example, at the high delay level, a high recognition rate and high word accuracy are the first priority, so the target delay estimated time is relatively large; for instance, it may be set to a preset maximum value. At the medium delay level, recognition rate and decoding speed are balanced, and the target delay estimated time is set so as not to affect the user experience while keeping a relatively high recognition rate; for instance, it may be set to 150 ms, since the user experience is affected once the delay exceeds 150 ms. At the low delay level, user experience is the first priority, and the target delay estimated time is set so as to improve the user experience while keeping an adequate recognition rate; for instance, it may be set to 100 ms.
Step 130: determine a time-varying delay time and a non-time-varying delay time of the speech signal to be recognized.
As FIG. 2 shows, the processing of speech data includes stages such as network transmission, speech decoding, and speech recognition, and each stage may introduce a time-varying or a non-time-varying delay. In step 130, the time-varying delay time and the non-time-varying delay time of the speech signal to be recognized are obtained.
As an example, the non-time-varying delay time may include, but is not limited to, a network transmission delay time; the time-varying delay time may include, but is not limited to, a jitter buffer delay time, a speech decoding delay time, and a speech recognition delay estimated time.
The network transmission delay time, jitter buffer delay time, speech decoding delay time, and speech recognition delay estimated time are only exemplary time-varying and non-time-varying delay times in this embodiment. Other delay times arising during speech decoding may also be obtained, and this embodiment places no limitation on them.
Step 140: determine, by combining the time-varying delay time, the non-time-varying delay time, and the target delay estimated time, whether the speech recognition delay time of the speech recognition system needs to be adjusted.
In this step, the time-varying delay time and the non-time-varying delay time may be compared with the target delay estimated time, and whether the speech recognition delay time of the speech recognition system needs to be adjusted is determined from the result of the comparison.
In one implementation, step 140 may include the following steps:
Determine whether the sum of the network transmission delay time, the jitter buffer delay time, the speech decoding delay time, and the speech recognition delay estimated time is less than or equal to the target delay estimated time. If the sum is less than or equal to the target delay estimated time, determine that the speech recognition delay time of the speech recognition system does not need to be adjusted. If the sum is greater than the target delay estimated time, determine that the speech recognition delay time of the speech recognition system needs to be adjusted.
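The comparison in step 140 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation; the function name, the parameter names, and the use of milliseconds are assumptions made for the example.

```python
def needs_adjustment(network_ms, jitter_buffer_ms, decode_ms,
                     asr_estimate_ms, target_ms):
    """Return True when the summed delays exceed the target delay estimated time.

    All arguments are delay times in milliseconds (hypothetical names).
    """
    total_ms = network_ms + jitter_buffer_ms + decode_ms + asr_estimate_ms
    # Sum <= target: no adjustment needed; sum > target: adjustment needed.
    return total_ms > target_ms


# Illustrative values only: 40 + 50 + 1 + 80 = 171 ms against a 150 ms target.
print(needs_adjustment(40, 50, 1, 80, 150))
```

A sum exactly equal to the target is treated as acceptable, matching the "less than or equal to" condition in the text.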
Step 150: if it is determined that the speech recognition delay time of the speech recognition system needs to be adjusted, adjust the value of the delay control parameter.
In this embodiment, the ASR system has an adjustable variable, namely the delay control parameter. When it is determined that the speech recognition delay time of the ASR system needs to be adjusted, the value of the delay control parameter can be adjusted to change the decoding speed and thereby adjust the speech recognition delay.
In this embodiment, an adaptively adjusted delay control parameter is configured in the ASR system. The delay level, the time-varying delay time, and the non-time-varying delay time of the speech signal to be recognized are determined, and the target delay estimated time is determined according to the delay level. When the time-varying delay time, the non-time-varying delay time, and the target delay estimated time together indicate that the speech recognition delay time of the ASR system needs to be adjusted, the value of the delay control parameter in the ASR system is adjusted. The speech recognition delay is thus adjusted dynamically according to the delay level of the current context, which improves the ability of the ASR system to adapt quickly to a changing delay environment: in a low-delay environment the ASR system uses high-precision speech recognition decoding, and in a high-delay environment it uses lower-precision, low-delay speech recognition decoding to meet the target delay.
Embodiment 2
FIG. 3 is a flowchart of another delay control method provided in Embodiment 2 of the present application, which is described on the basis of Embodiment 1. This embodiment gives an exemplary description of the time-varying and non-time-varying delay times. As shown in FIG. 4, the network delay t1 can be obtained when the data packets sent by the client are received, the speech decoding delay time t2 can be obtained when the speech decoder performs speech decoding, and the speech recognition delay estimated time t3′ is estimated for the speech recognition performed by the speech recognition decoder in the ASR system. Combining these with the target delay estimated time p, the adaptive ASR delay control module determines whether delay adjustment is needed; if so, the value of the delay control parameter in the ASR system is adjusted, so that the speech recognition delay is adjusted dynamically.
As shown in FIG. 3, this embodiment includes the following steps:
Step 310: determine a delay level of a speech signal to be recognized, where the delay level includes a high delay level, a medium delay level, or a low delay level.
In this embodiment, the delay level indicates the degree of delay, suited to the current context, that arises when the ASR system performs speech recognition on the speech signal to be recognized. As an example, the delay level may include, but is not limited to, a high delay level, a medium delay level, or a low delay level.
In one implementation, the delay level of the speech signal to be recognized may be determined in a rule-based manner. In that case, step 310 may include the following steps:
Step 310-1: if a prior speech recognition result contains a preset sensitive word, determine that the delay level of the speech signal to be recognized is the high delay level.
In one example, the prior speech recognition result may be the speech recognition result for a period of time preceding the current speech signal to be recognized. The length of that period may be set according to actual requirements and is not limited in this embodiment.
In one implementation, a sensitive-word rule base may be established in advance. If the speech recognition results from the preceding period hit a sensitive word in the rule base, for example the conversation contains sensitive words relating to violence and terrorism, politics, or pornography, then high word accuracy and a high recognition rate are needed in this context to identify potential sensitive words. That is, a high recognition rate is the first priority and the delay is correspondingly high, so the delay level of the speech signal to be recognized may be determined to be the high delay level.
Step 310-2: if a prior speech recognition result contains preset rare words and the number of rare words is greater than or equal to a preset number, determine that the delay level of the speech signal to be recognized is the medium delay level.
In one implementation, a rare-word rule base may be established in advance. If the speech recognition results from the preceding period hit rare words in the rule base and the number of hits is greater than or equal to the preset number, for example more than 5 rare words during a news broadcast, recognition is relatively difficult for the ASR system in this context, so the decoding speed is relatively low and the delay relatively high. Because the precision required is nevertheless lower than for sensitive-word recognition, the delay level of the speech signal to be recognized may be determined to be the medium delay level.
Step 310-3: if a prior speech recognition result contains neither preset sensitive words nor rare words, or the number of rare words it contains is less than the preset number, determine that the delay level of the speech signal to be recognized is the low delay level.
In this step, if the speech recognition results from the preceding period contain no sensitive words or rare words, or contain rare words but fewer than the preset number, for example in speech recognition of entertainment content, then giving the user timely text is the highest priority in this context, and the delay level of the speech signal to be recognized may be determined to be the low delay level. An adequate word accuracy still needs to be ensured at the low delay level.
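The rule-based classification in steps 310-1 to 310-3 can be sketched as below. This is an illustrative reading of the rules, not the applicant's code: the names `DelayLevel` and `classify_delay_level`, the word-set rule bases, the default threshold of 5, and the choice to count every occurrence of a rare word are all assumptions.

```python
from enum import Enum


class DelayLevel(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


def classify_delay_level(prior_words, sensitive_words, rare_words,
                         rare_threshold=5):
    """Classify the delay level from prior recognition results.

    prior_words: words recognized in the preceding period.
    sensitive_words, rare_words: hypothetical rule bases, given as sets.
    rare_threshold: the "preset number" of rare words (assumed value).
    """
    words = list(prior_words)
    # Step 310-1: any sensitive word means accuracy comes first.
    if any(w in sensitive_words for w in words):
        return DelayLevel.HIGH
    # Step 310-2: enough rare words means a medium delay level.
    rare_count = sum(1 for w in words if w in rare_words)
    if rare_count >= rare_threshold:
        return DelayLevel.MEDIUM
    # Step 310-3: otherwise timely text for the user comes first.
    return DelayLevel.LOW
```

The sensitive-word check runs first, so a result that contains both sensitive and rare words is classified at the high delay level; the text does not state this ordering explicitly, and it is an assumption of the sketch.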
Step 320: determine a target delay estimated time of the speech signal to be recognized according to the delay level.
In this step, the target delay estimated time may be the text generation delay of the current speech signal to be recognized, estimated from prior speech recognition results. The target delay estimated time differs between speech scenarios and therefore between delay levels: the higher the delay level, the larger the target delay estimated time.
In one implementation, the high, medium, or low delay level may be looked up in a preset delay level data table to obtain the corresponding target delay estimated time.
In this embodiment, to obtain the target delay estimated time more efficiently, a delay level data table may be created in advance that records the association between multiple delay levels and their corresponding target delay estimated times. Once the delay level of the current speech signal to be recognized has been determined, the corresponding target delay estimated time is obtained by querying this table. For example, when the delay level is the low delay level, the target delay estimated time obtained from the table may be 100 ms; when the delay level is the medium delay level, it may be 150 ms; and when the delay level is the high delay level, it is the preset maximum value.
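A delay level data table of this kind might look like the following sketch. The 100 ms and 150 ms entries are the example values from the text; representing the "preset maximum value" as infinity, and the names `TARGET_DELAY_MS` and `target_delay_ms`, are assumptions made for illustration.

```python
import math

# Hypothetical delay level data table: delay level -> target delay estimated time (ms).
TARGET_DELAY_MS = {
    "high": math.inf,   # stand-in for the "preset maximum value" (assumption)
    "medium": 150.0,    # example value from the text
    "low": 100.0,       # example value from the text
}


def target_delay_ms(level):
    """Look up the target delay estimated time for a delay level."""
    return TARGET_DELAY_MS[level]


print(target_delay_ms("medium"))
```

With an infinite target at the high delay level, the summed delay never exceeds the target, so no speed-up of decoding is triggered in that context; a real deployment would presumably choose a finite maximum.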
Step 330: determine a time-varying delay time and a non-time-varying delay time of the speech signal to be recognized, where the non-time-varying delay time includes a network transmission delay time, and the time-varying delay time includes a jitter buffer delay time, a speech decoding delay time, and a speech recognition delay estimated time.
In this embodiment, the network transmission delay time is the transmission delay from the client to the server, and this delay is non-time-varying. In one implementation, the timestamp at which the server receives a data packet and the timestamp at which the client sent it are obtained, and their difference is the network transmission delay time; that is, network transmission delay time = timestamp at which the server receives the data packet - timestamp at which the client sent the data packet.
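As a small illustration of the timestamp difference, the sketch below assumes both timestamps are in milliseconds and that the client and server clocks are synchronized; the text does not discuss clock synchronization, and the function name is hypothetical.

```python
def network_transmission_delay_ms(server_recv_ts_ms, client_send_ts_ms):
    """Network transmission delay = server receive timestamp - client send timestamp.

    Assumes both timestamps come from synchronized clocks (an assumption).
    """
    return server_recv_ts_ms - client_send_ts_ms


# Illustrative timestamps only.
print(network_transmission_delay_ms(1_000_140, 1_000_100))
```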
The jitter buffer delay time is the time-varying delay introduced by Jitter Buffer Management (JBM). Jitter buffer management first stores the received speech data in a buffer and then selects speech data from the buffer according to the delay time of the current network and the time obtained by the upper layer of the current network, in order to eliminate the jitter caused by unstable network transmission quality. Jitter buffer management ensures smooth speech at the cost of added delay.
How the jitter buffer delay time is determined depends on the algorithm design. In one embodiment, it may be determined as follows: obtain multiple network transmission delay times of the speech signals within a preset period before the speech signal to be recognized, and calculate the jitter between adjacent network transmission delay times; then calculate the standard deviation of the jitter and adjust the length of the jitter buffer according to the standard deviation, the length being used as the jitter buffer delay time.
In this embodiment, the length of the jitter buffer is taken as the jitter buffer delay time. For example, suppose the network transmission delay times of five data packets observed over the preceding period are 100 ms, 100 ms, 30 ms, 60 ms, and 40 ms. The four corresponding jitters are 0 ms, 70 ms, 30 ms, and 20 ms, and their standard deviation is sigma = 25.4 ms, which indicates relatively large jitter and unstable network quality. The length of the jitter buffer can be adjusted according to the standard deviation, for example to 2*sigma = 50.8 ms, so that the jitter buffer delay time = 50.8 ms.
As another example, when the network quality is very good and there is no jitter, so that the standard deviation sigma = 0 ms, the length of the jitter buffer can be set to 0 ms, so that the jitter buffer delay time = 0 ms.
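The jitter buffer calculation can be sketched as follows, under stated assumptions: jitter is taken as the absolute difference between adjacent network transmission delay times, the standard deviation is the population standard deviation, and the buffer length is twice that value, as in the text's example. The function name and the `scale` parameter are hypothetical. On the text's example delays this computes sigma of about 25.5 ms (which the text gives as 25.4 ms), so the result is about 51.0 ms rather than exactly 50.8 ms.

```python
import statistics


def jitter_buffer_delay_ms(transmission_delays_ms, scale=2.0):
    """Estimate the jitter buffer delay from recent network transmission delays.

    Jitter is the absolute difference between adjacent delays; the buffer
    length is scale * (population standard deviation of the jitters).
    """
    jitters = [abs(later - earlier)
               for earlier, later in zip(transmission_delays_ms,
                                         transmission_delays_ms[1:])]
    if not jitters:
        return 0.0  # fewer than two delays: no jitter can be measured
    sigma = statistics.pstdev(jitters)
    return scale * sigma


# Delays from the text's example: jitters are 0, 70, 30, 20 ms.
print(round(jitter_buffer_delay_ms([100, 100, 30, 60, 40]), 1))
```

A constant series of delays gives zero jitter everywhere and therefore a buffer length of 0 ms, matching the second example in the text.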
Because the network transmission delay time and the jitter buffer delay time both arise in the network transmission stage, their sum may be called the network delay; that is, network delay = network transmission delay time + jitter buffer delay time.
The speech decoding delay time is the delay incurred when the speech decoder decodes the received speech data packets into PCM data. Different speech codec algorithms produce different speech decoding delay times.
In one embodiment, the speech decoding delay time may be determined as follows: determine the target codec algorithm and bit rate corresponding to the speech signal to be recognized, and obtain the speech decoding delay time corresponding to that codec algorithm and bit rate.
Different codec algorithms and bit rates may be selected for different applications and network qualities. For example, if current network jitter is relatively large and the user is in live-streaming or singing mode, encoding with Advanced Audio Coding (AAC) or Moving Picture Experts Group Audio Layer 3 (MP3) gives better sound quality against a music background; if the user is in call mode and the network is relatively good, Adaptive Multi-Rate Wideband (AMR-WB) is sufficient to meet the user's needs.
In one implementation, a data table recording the speech decoding delay times corresponding to different codec algorithms and bit rates may be created in advance. Once the codec algorithm and bit rate of the current speech signal have been determined, the corresponding speech decoding delay time is obtained by looking it up in the table. For example, the speech decoding delay time is 0.9375 ms for AMR-WB, 140 ms for MP3, 210 ms for AAC, and 360 ms for High Efficiency Advanced Audio Coding (HE-AAC).
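The decoding delay lookup might be sketched as below. The four values are the ones quoted in the text; the table is keyed by codec only because the text gives no per-bit-rate figures, and the names `DECODE_DELAY_MS` and `speech_decoding_delay_ms` are hypothetical.

```python
# Hypothetical lookup table: codec -> speech decoding delay time (ms),
# using the example values quoted in the text. A fuller table would also
# be keyed by bit rate, for which the text gives no figures.
DECODE_DELAY_MS = {
    "AMR-WB": 0.9375,
    "MP3": 140.0,
    "AAC": 210.0,
    "HE-AAC": 360.0,
}


def speech_decoding_delay_ms(codec):
    """Return the speech decoding delay time for a codec name."""
    try:
        return DECODE_DELAY_MS[codec]
    except KeyError:
        raise ValueError(f"no decoding delay recorded for codec {codec!r}")


print(speech_decoding_delay_ms("AMR-WB"))
```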
The speech recognition delay estimated time is the estimated decoding delay of the ASR system for the speech signal to be recognized. Because the decoding speed of the ASR system differs under different speech or noise conditions, the speech recognition delay estimated time also changes dynamically.
In one embodiment, the speech recognition delay estimated time may be determined as follows: obtain the speech recognition delay time of the speech signal of the previous unit time; input the speech signal to be recognized and the speech recognition delay time of the speech signal of the previous unit time into a trained delay prediction model, and obtain the speech recognition delay estimated time output by the delay prediction model.
In this embodiment, a delay prediction model may be trained in advance, and the speech recognition delay time of the speech signal of the previous unit time is then used as a reference for estimating the delay time of the current speech signal. The speech recognition delay time of the speech signal of the previous unit time and the current speech signal to be recognized are input into the delay prediction model, which processes them and outputs the speech recognition delay estimated time of the current speech signal to be recognized.
Exemplarily, the delay prediction model may be a neural network model and may be trained in the usual way for neural network models; this embodiment places no limitation on how the delay prediction model is trained.
步骤340,若判定所述网络传输延迟时间、所述抖动缓冲延迟时间、所述语音解码延迟时间与所述语音识别延迟估计时间之和大于所述目标延迟估计时间,则判定需要调整所述语音识别系统的语音识别延迟时间。 Step 340, if it is determined that the sum of the network transmission delay time, the jitter buffer delay time, the speech decoding delay time and the speech recognition delay estimation time is greater than the target delay estimation time, then it is determined that the speech needs to be adjusted. The speech recognition delay time of the recognition system.
在该实施例中,延迟控制的目标是使得p>=d,其中,d=t1+t2+t3′,t1为网络延迟,即t1=网络传输延迟时间+抖动缓冲延迟时间;t2为语音解码延迟时间;t3′为语音识别延迟估计时间,p为目标延迟估计时间。In this embodiment, the goal of delay control is to make p>=d, where d=t1+t2+t3′, t1 is the network delay, that is, t1=network transmission delay time+jitter buffer delay time; t2 is speech decoding Delay time; t3' is the speech recognition delay estimation time, p is the target delay estimation time.
在获得t1、t2、t3′以后,如果三者之和d小于或等于p,则表示按照当前的语音识别解码得到的解码延迟符合要求,无需对语音识别系统的语音识别延迟时间进行调整。如果d超过p,则表示按照当前的语音识别解码得到的解码延迟不符合要求,需要对语音识别系统的语音识别延迟时间进行调整。After obtaining t1, t2, t3', if the sum d of the three is less than or equal to p, it means that the decoding delay obtained according to the current speech recognition decoding meets the requirements, and there is no need to adjust the speech recognition delay time of the speech recognition system. If d exceeds p, it means that the decoding delay obtained according to the current speech recognition decoding does not meet the requirements, and the speech recognition delay time of the speech recognition system needs to be adjusted.
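The check in step 340 amounts to one comparison. The following Python sketch shows it under the definitions given above (d = t1 + t2 + t3′ compared against p); the function and parameter names are hypothetical.

```python
def needs_adjustment(network_ms: float, jitter_buffer_ms: float,
                     decode_ms: float, asr_estimate_ms: float,
                     target_ms: float) -> bool:
    """Return True when the total estimated delay d exceeds the target p."""
    t1 = network_ms + jitter_buffer_ms   # network delay
    t2 = decode_ms                       # speech decoding delay
    t3_est = asr_estimate_ms             # estimated speech recognition delay (t3')
    d = t1 + t2 + t3_est
    # d <= p means the current decoding delay is acceptable.
    return d > target_ms
```

Note that d equal to p counts as meeting the requirement, matching the "less than or equal to" wording in the text.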
Step 350: calculate the difference between the target delay estimation time and the sum of the network transmission delay time, the jitter buffer delay time, and the speech decoding delay time, and use the difference as the available speech recognition delay time.
In this step, under the control goal p >= d, the available speech recognition delay time of the ASR system can be calculated. In implementation, the difference between the target delay estimation time and the sum of the network transmission delay time, the jitter buffer delay time, and the speech decoding delay time is calculated, and this difference is used as the available speech recognition delay time.
For example, when the network condition is good and the user is in voice call mode, the delays are as follows: target delay estimation time = 500 ms; t1 = network transmission delay time + jitter buffer delay time = 100 ms + 0 ms = 100 ms; t2 = AMR-WB decoding delay, approximately 1 ms. The available speech recognition delay time is then t3 = 500 ms - 100 ms - 1 ms = 399 ms. That is, when the network is good, the ASR system can spend up to about 400 ms recognizing the speech signal; decoding does not need to be very fast, so recognition accuracy is relatively high.
As another example, when the network condition is poor (jitter of 30 ms) and the user is in live streaming or singing mode, AAC/MP3 encoding gives better sound quality when music is present, so the delays are as follows: target delay estimation time = 500 ms; t1 = network transmission delay time + jitter buffer delay time = 100 ms + 60 ms = 160 ms; t2 = AAC decoding delay = 210 ms. The available speech recognition delay time is then t3 = 500 ms - 160 ms - 210 ms = 130 ms. That is, when network jitter is present, the ASR system is allowed only 130 ms to recognize the speech signal; decoding must be faster, and recognition accuracy is correspondingly lower.
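The two worked examples above can be reproduced with a short Python sketch. The function name `available_asr_delay_ms` is hypothetical; the arithmetic is the difference defined in step 350.

```python
def available_asr_delay_ms(target_ms: int, network_ms: int,
                           jitter_buffer_ms: int, decode_ms: int) -> int:
    """Available speech recognition delay t3 = p - (t1 + t2), in milliseconds."""
    t1 = network_ms + jitter_buffer_ms
    return target_ms - (t1 + decode_ms)


# Good network, voice call, AMR-WB (decoding delay rounded to 1 ms).
good_network = available_asr_delay_ms(500, 100, 0, 1)
# Poor network, live streaming or singing, AAC.
poor_network = available_asr_delay_ms(500, 100, 60, 210)
```

Here `good_network` is 399 and `poor_network` is 130, matching the figures in the text.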
Step 360: adjust the value of the delay control parameter according to the available speech recognition delay time.
In one embodiment, the delay control parameter may be a parameter related to the beam search algorithm. As an example, the delay control parameter may include the beam search width.
When performing speech recognition, the ASR system uses the beam search algorithm to find the best path through the vocabulary; the search space formed by the vocabulary is the set of all possible character tokens. Starting from the root node <sos>, the k nodes with the highest expansion probability among all possible expansions are selected as the next layer. Among the possible expansions of these k nodes, the k nodes with the highest cumulative (multiplied) expansion probability are then selected, and so on, forming a tree with k nodes in each layer. Finally, the path with the highest score is selected and traced back to obtain the most probable character sequence. Beam search is a search strategy with a limited range, so the beam search width k must be large enough to approach the optimal solution, at the cost of more computation and longer delay.
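The beam search procedure described above can be sketched in Python as follows. This is a generic textbook-style beam search, not the applicant's decoder. The scoring callback `next_log_probs`, the `<eos>` end token, and the toy model are assumptions made for illustration. Log probabilities are summed, which is equivalent to multiplying the probabilities along a path.

```python
import heapq
import math


def beam_search(next_log_probs, k, max_len, sos="<sos>", eos="<eos>"):
    """Keep the k best partial paths per layer; return (score, tokens) of the best."""
    beams = [(0.0, [sos])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:
                candidates.append((score, seq))  # finished paths are carried over
                continue
            for token, log_p in next_log_probs(seq).items():
                candidates.append((score + log_p, seq + [token]))
        # Keep only the k highest-scoring expansions for the next layer.
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        if all(seq[-1] == eos for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])


def toy_model(prefix):
    """Hypothetical scorer: 'a' looks best at first, but 'b' leads to a better path."""
    last = prefix[-1]
    if last == "<sos>":
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if last == "a":
        return {"x": math.log(0.5), "y": math.log(0.5)}
    if last == "b":
        return {"z": math.log(0.9), "w": math.log(0.1)}
    return {"<eos>": 0.0}
```

With the toy model, `beam_search(toy_model, 1, 3)` commits to "a" and ends with path probability 0.3, while `beam_search(toy_model, 2, 3)` keeps "b" alive and finds the path through "z" with probability 0.36. This illustrates why a wider beam can improve accuracy at the cost of more expansions per layer.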
Usually the beam search width k is a fixed value determined in advance from prior knowledge. Increasing the beam width can improve the recognition rate, but at the cost of lower decoding speed. This fixed-range approach is inefficient because its search range extends to candidates whose confidence is far below the current best confidence. In this embodiment, the beam search width is changed into an adjustable variable; when the speech recognition delay time of the speech recognition system needs to be adjusted, the decoding speed can be adjusted by adjusting the beam search width.
In one embodiment, step 360 may include the following steps:
When the available speech recognition delay time is within a preset high-delay range, the beam search width is increased; when the available speech recognition delay time is within a preset low-delay range, the beam search width is reduced.
In this embodiment, when the available speech recognition delay time is within the preset low-delay range, that is, in a low-delay scenario, the search range can be narrowed by reducing the beam search width, which speeds up decoding. When the available speech recognition delay time is within the preset high-delay range, that is, in a high-delay scenario, the search range can be widened by increasing the beam search width, which lowers decoding speed and improves decoding accuracy.
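A minimal Python sketch of this adjustment rule follows. The text does not give the boundaries of the "preset" high-delay and low-delay ranges, the step size, or the width limits, so every default value below is a hypothetical placeholder.

```python
def adjust_beam_width(width: int, available_ms: float,
                      low_max_ms: float = 150.0,   # assumed upper bound of the low-delay range
                      high_min_ms: float = 300.0,  # assumed lower bound of the high-delay range
                      step: int = 2,
                      min_width: int = 1,
                      max_width: int = 32) -> int:
    """Widen the beam when time is plentiful, narrow it when time is scarce."""
    if available_ms >= high_min_ms:
        return min(width + step, max_width)   # high-delay range: search more, decode slower
    if available_ms <= low_max_ms:
        return max(width - step, min_width)   # low-delay range: search less, decode faster
    return width                              # between the two ranges: leave unchanged
```

Under these assumed thresholds, the 399 ms budget from the first example above would widen a beam of 8 to 10, and the 130 ms budget from the second would narrow it to 6.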
In other embodiments, the delay control parameter may further include a pruning factor, in which case adjusting the value of the delay control parameter further includes:
In the process of using the beam search algorithm to find the best path through the vocabulary, determining a pruning condition based on the pruning factor, retaining candidate characters whose confidence meets the pruning condition, and discarding candidate characters whose confidence does not meet the pruning condition.
In one example, the pruning condition determined from the pruning factor may be as follows: after the candidate character with the best confidence is determined, the product of the pruning factor and the best confidence is calculated and used as the pruning condition. Candidate characters whose relative confidence is greater than this product are retained, and candidate characters whose relative confidence is less than or equal to this product are discarded. That is, the formula for the pruning condition is as follows:
Figure PCTCN2021108217-appb-000001
where α is the pruning factor; the larger it is set, the stricter the screening of candidate characters.
In another example, the pruning condition determined from the pruning factor may be as follows: if the difference between the confidence of a candidate character and the best confidence exceeds the pruning factor, the candidate character is discarded; if the difference does not exceed the pruning factor, the candidate character is retained. That is, the formula for the pruning condition is as follows:
Figure PCTCN2021108217-appb-000002
where η is the pruning factor; the smaller it is set, the stricter the screening of candidate characters.
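The two pruning conditions can be sketched in Python from their prose descriptions. The original formulas are images that are not reproduced in this text, so the code follows the wording only. The function names are hypothetical, confidences are assumed to be positive probabilities, and α is assumed to lie in [0, 1).

```python
def prune_by_ratio(candidates: dict, alpha: float) -> dict:
    """Keep candidates whose confidence is greater than alpha * best confidence."""
    best = max(candidates.values())
    return {tok: c for tok, c in candidates.items() if c > alpha * best}


def prune_by_gap(candidates: dict, eta: float) -> dict:
    """Keep candidates whose gap to the best confidence does not exceed eta."""
    best = max(candidates.values())
    return {tok: c for tok, c in candidates.items() if best - c <= eta}


# Hypothetical confidences for three candidate characters.
confidences = {"a": 0.6, "b": 0.3, "c": 0.05}
kept_loose = prune_by_ratio(confidences, 0.4)   # threshold 0.24: keeps "a" and "b"
kept_strict = prune_by_ratio(confidences, 0.6)  # threshold 0.36: keeps only "a"
```

Raising α or lowering η both shrink the surviving set, which is the "stricter screening" behavior the text describes.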
By modifying the beam search width, a relatively large width can be set; combined with screening by the pruning factor, the actual search width becomes smaller than the beam search width, which reduces delay.
In other embodiments, the delay control parameter may further include a path weight, in which case adjusting the value of the delay control parameter further includes:
If the number of descendant characters of a candidate character exceeds a preset maximum number of candidates, reducing the path weight of those descendant characters.
In this embodiment, it has been observed that with the beam search width and pruning factor strategy above, high-scoring candidate characters often come from the same history path. If one candidate token is too dominant, the candidate tokens of the next layer may all derive from that dominant token. It is therefore necessary to reserve candidate slots for other, lower-scoring tokens so that the search paths do not concentrate on the descendants of a single node. In implementation, the maximum number of candidates from the same node can be limited: if the number of descendant characters of a candidate character exceeds the preset maximum number of candidates, the path weight of those descendant characters can be reduced so that other, lower-scoring nodes have a higher weight than those descendants and are more likely to be selected. This increases branch diversity without increasing the beam size.
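One way to read the path-weight rule is sketched below in Python. The text does not say how much the weight is lowered or which descendants are affected, so this sketch makes two assumptions: only descendants beyond the per-parent limit are penalized, and the penalty is a fixed amount subtracted from a log-domain score. All names are hypothetical.

```python
from collections import defaultdict


def select_with_diversity(candidates, k, max_per_parent, penalty):
    """candidates: list of (parent, token, score). Returns the k best after reweighting."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    seen = defaultdict(int)
    reweighted = []
    for parent, token, score in ranked:
        seen[parent] += 1
        if seen[parent] > max_per_parent:
            score -= penalty  # lower the path weight of the excess descendants
        reweighted.append((parent, token, score))
    reweighted.sort(key=lambda c: c[2], reverse=True)
    return reweighted[:k]


# Hypothetical layer: parent "p1" dominates with three strong descendants.
layer = [("p1", "a", -0.1), ("p1", "b", -0.2), ("p1", "c", -0.3), ("p2", "d", -0.9)]
chosen = [tok for _, tok, _ in select_with_diversity(layer, 3, 2, 1.0)]
```

Here `chosen` is `["a", "b", "d"]`: the third descendant of "p1" is pushed below the descendant of "p2", so a second history path survives without enlarging the beam.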
In this embodiment, adaptively adjustable delay control parameters, such as an adjustable beam search width, pruning factor, and path weight, are configured in the ASR system. After the delay level of the speech signal to be recognized is determined, the network transmission delay time, jitter buffer delay time, speech decoding delay time, estimated speech recognition delay time, and target delay estimation time are used to determine whether the speech recognition delay needs to be dynamically adjusted. When it does, the speech recognition delay is adjusted by changing the values of the delay control parameters. As a result, when network delay is high, speech recognition decoding is simplified to shorten its time so that real-time recognition is maintained; the recognition rate is lower, but basic fluency is at least preserved. When network delay is normal, high-quality speech recognition decoding is used to improve the recognition rate.
Embodiment 3
Figure 5 is a structural block diagram of a delay control apparatus provided in Embodiment 3 of the present application. The apparatus may be applied to a speech recognition system that includes a delay control parameter, and the apparatus may include:
a delay level determination unit 510, configured to determine the delay level of the speech signal to be recognized; a target delay determination unit 520, configured to determine the target delay estimation time of the speech signal to be recognized according to the delay level; a time-varying and time-invariant delay determination unit 530, configured to determine the time-varying delay time and the time-invariant delay time of the speech signal to be recognized; a delay adjustment judgment unit 540, configured to determine, by combining the time-varying delay time, the time-invariant delay time, and the target delay estimation time, whether the speech recognition delay time of the speech recognition system needs to be adjusted; and a delay control parameter adjustment unit 550, configured to adjust the value of the delay control parameter when it is determined that the speech recognition delay time of the speech recognition system needs to be adjusted.
The delay control apparatus provided in this embodiment of the present application can execute the delay control method provided in any embodiment of the present application, and has the functional modules and effects corresponding to the executed method.
Embodiment 4
Figure 6 is a schematic structural diagram of a server provided in Embodiment 4 of the present application. As shown in Figure 6, the server includes a processor 610, a memory 620, an input device 630, and an output device 640. The server may have one or more processors 610; Figure 6 takes one processor 610 as an example. The processor 610, memory 620, input device 630, and output device 640 in the server may be connected by a bus or by other means; Figure 6 takes a bus connection as an example.
As a computer-readable storage medium, the memory 620 may be configured to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the delay control method in the embodiments of the present application. By running the software programs, instructions, and modules stored in the memory 620, the processor 610 executes the various functional applications and data processing of the server, that is, implements the delay control method described above.
Embodiment 5
Embodiment 5 of the present application further provides a storage medium containing computer-executable instructions which, when executed by a processor of a server, are used to execute the delay control method in the foregoing embodiments.
Because the apparatus, electronic device, and storage medium embodiments are basically similar to the method embodiments, they are described relatively briefly; for related details, refer to the corresponding parts of the method embodiments.

Claims (14)

  1. A delay control method, applied to a speech recognition system that includes a delay control parameter, the method comprising:
    determining a delay level of a speech signal to be recognized;
    determining a target delay estimation time of the speech signal to be recognized according to the delay level;
    determining a time-varying delay time and a time-invariant delay time of the speech signal to be recognized;
    determining, by combining the time-varying delay time, the time-invariant delay time, and the target delay estimation time, whether a speech recognition delay time of the speech recognition system needs to be adjusted; and
    in response to determining that the speech recognition delay time of the speech recognition system needs to be adjusted, adjusting the value of the delay control parameter.
  2. The delay control method according to claim 1, wherein the time-invariant delay time includes a network transmission delay time, and the time-varying delay time includes a jitter buffer delay time, a speech decoding delay time, and an estimated speech recognition delay time;
    the determining, by combining the time-varying delay time, the time-invariant delay time, and the target delay estimation time, whether the speech recognition delay time of the speech recognition system needs to be adjusted comprises:
    determining whether the sum of the network transmission delay time, the jitter buffer delay time, the speech decoding delay time, and the estimated speech recognition delay time is less than or equal to the target delay estimation time;
    in response to the sum of the network transmission delay time, the jitter buffer delay time, the speech decoding delay time, and the estimated speech recognition delay time being less than or equal to the target delay estimation time, determining that the speech recognition delay time of the speech recognition system does not need to be adjusted; and
    in response to the sum of the network transmission delay time, the jitter buffer delay time, the speech decoding delay time, and the estimated speech recognition delay time being greater than the target delay estimation time, determining that the speech recognition delay time of the speech recognition system needs to be adjusted.
  3. The delay control method according to claim 2, wherein the determining the time-varying delay time of the speech signal to be recognized comprises:
    acquiring a plurality of network transmission delay times of speech signals within a preset time period before the speech signal to be recognized, and calculating the jitter between adjacent network transmission delay times; and
    calculating the standard deviation of the jitter, adjusting the length of a jitter buffer according to the standard deviation, and using the length of the jitter buffer as the jitter buffer delay time.
  4. The delay control method according to claim 2, wherein the determining the time-varying delay time of the speech signal to be recognized comprises:
    determining a target codec algorithm and a bit rate corresponding to the speech signal to be recognized, and acquiring the speech decoding delay time corresponding to the target codec algorithm and the bit rate.
  5. The delay control method according to claim 2, wherein the determining the time-varying delay time of the speech signal to be recognized comprises:
    acquiring the speech recognition delay time of the speech signal of the previous unit time; and
    inputting the speech signal to be recognized and the speech recognition delay time of the speech signal of the previous unit time into a trained delay prediction model, and obtaining the estimated speech recognition delay time output by the delay prediction model.
  6. The delay control method according to any one of claims 2-5, wherein the adjusting the value of the delay control parameter comprises:
    calculating the difference between the target delay estimation time and the sum of the network transmission delay time, the jitter buffer delay time, and the speech decoding delay time, and using the difference as an available speech recognition delay time; and
    adjusting the value of the delay control parameter according to the available speech recognition delay time.
  7. The delay control method according to claim 6, wherein the delay control parameter includes a beam search width;
    the adjusting the value of the delay control parameter according to the available speech recognition delay time comprises:
    increasing the beam search width when the available speech recognition delay time is within a preset high-delay range; and
    reducing the beam search width when the available speech recognition delay time is within a preset low-delay range.
  8. The delay control method according to claim 7, wherein the delay control parameter further includes a pruning factor, and the adjusting the value of the delay control parameter further comprises:
    in the process of using a beam search algorithm to find the path through the vocabulary, determining a pruning condition based on the pruning factor, retaining candidate characters whose confidence meets the pruning condition, and discarding candidate characters whose confidence does not meet the pruning condition.
  9. The delay control method according to claim 8, wherein the delay control parameter further includes a path weight, and the adjusting the value of the delay control parameter further comprises:
    reducing the path weight of the descendant characters of a candidate character when the number of descendant characters of the candidate character exceeds a preset maximum number of candidates.
  10. The delay control method according to any one of claims 1-5, wherein the determining the target delay estimation time of the speech signal to be recognized according to the delay level comprises:
    looking up the delay level in a preset delay level data table to obtain the target delay estimation time corresponding to the delay level, wherein the delay level includes a high delay level, a medium delay level, or a low delay level.
  11. The delay control method according to claim 10, wherein the determining the delay level of the speech signal to be recognized comprises:
    determining that the delay level of the speech signal to be recognized is the high delay level when a previous speech recognition result contains a preset sensitive word;
    determining that the delay level of the speech signal to be recognized is the medium delay level when a previous speech recognition result contains preset rare words and the number of the rare words is greater than or equal to a preset number; and
    determining that the delay level of the speech signal to be recognized is the low delay level when a previous speech recognition result contains no preset sensitive word or rare word, or contains fewer rare words than the preset number.
  12. A delay control apparatus, applied to a speech recognition system that includes a delay control parameter, the apparatus comprising:
    a delay level determination unit, configured to determine a delay level of a speech signal to be recognized;
    a target delay determination unit, configured to determine a target delay estimation time of the speech signal to be recognized according to the delay level;
    a time-varying and time-invariant delay determination unit, configured to determine a time-varying delay time and a time-invariant delay time of the speech signal to be recognized;
    a delay adjustment judgment unit, configured to determine, by combining the time-varying delay time, the time-invariant delay time, and the target delay estimation time, whether a speech recognition delay time of the speech recognition system needs to be adjusted; and
    a delay control parameter adjustment unit, configured to adjust the value of the delay control parameter in response to determining that the speech recognition delay time of the speech recognition system needs to be adjusted.
  13. A server, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the delay control method according to any one of claims 1-11.
  14. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the delay control method according to any one of claims 1-11.
PCT/CN2021/108217 2020-08-31 2021-07-23 Delay control method and apparatus WO2022042159A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010901269.2 2020-08-31
CN202010901269.2A CN112017666A (en) 2020-08-31 2020-08-31 Delay control method and device

Publications (1)

Publication Number Publication Date
WO2022042159A1 true WO2022042159A1 (en) 2022-03-03

Family

ID=73516444

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/108217 WO2022042159A1 (en) 2020-08-31 2021-07-23 Delay control method and apparatus

Country Status (2)

Country Link
CN (1) CN112017666A (en)
WO (1) WO2022042159A1 (en)

Also Published As

Publication number Publication date
CN112017666A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
WO2022042159A1 (en) Delay control method and apparatus
EP3413305B1 (en) Dual mode speech recognition
US20210109706A1 (en) Audio output control
US9514747B1 (en) Reducing speech recognition latency
US10121471B2 (en) Language model speech endpointing
CN105869629B (en) Audio recognition method and device
US11862162B2 (en) Adapting an utterance cut-off period based on parse prefix detection
US10854186B1 (en) Processing audio data received from local devices
US20220262352A1 (en) Improving custom keyword spotting system accuracy with text-to-speech-based data augmentation
US9613624B1 (en) Dynamic pruning in speech recognition
US9129609B2 (en) Speech speed conversion factor determining device, speech speed conversion device, program, and storage medium
EP3092639B1 (en) A methodology for enhanced voice search experience
US10854192B1 (en) Domain specific endpointing
US20130090925A1 (en) System and method for supplemental speech recognition by identified idle resources
CN112242144A (en) Voice recognition decoding method, device and equipment based on streaming attention model and computer readable storage medium
US11017763B1 (en) Synthetic speech processing
US9218806B1 (en) Generation and use of multiple speech processing transforms
CN115428066A (en) Synthesized speech processing
US20160284364A1 (en) Voice detection method
US9449598B1 (en) Speech recognition with combined grammar and statistical language models
US11348579B1 (en) Volume initiated communications
US9583095B2 (en) Speech processing device, method, and storage medium
JP5621786B2 (en) Voice detection device, voice detection method, and voice detection program
CN115762521A (en) Keyword identification method and related device
CN104934040B (en) Method and device for adjusting the duration of an audio signal

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21859999; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21859999; Country of ref document: EP; Kind code of ref document: A1)