CN117524192A - Speaker pause processing method in speech recognition - Google Patents

Speaker pause processing method in speech recognition

Info

Publication number
CN117524192A
CN117524192A · Application CN202311481957.8A
Authority
CN
China
Prior art keywords
asr
voice
waiting
sentence
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311481957.8A
Other languages
Chinese (zh)
Inventor
黄明明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Shenzhi Technology Co ltd
Original Assignee
Beijing Zhongke Shenzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Shenzhi Technology Co ltd filed Critical Beijing Zhongke Shenzhi Technology Co ltd
Priority to CN202311481957.8A priority Critical patent/CN117524192A/en
Publication of CN117524192A publication Critical patent/CN117524192A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING › G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search › G10L15/18 using natural language modelling
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A method for handling speaker pauses in speech recognition, comprising the steps of: step one: acquire speech and recognize it with ASR; step two: perform sentence-break detection by prompting an LLM (large language model); step three: if step two finds the meaning of the broken-off sentence complete, output the ASR text to the subsequent processing flow; step four: if the meaning is incomplete, wait up to a time-length threshold for further speech; if speech appears, wait for it to finish, merge the two or more ASR texts, and send the result to the subsequent processing flow; if no speech is found within the threshold, stop waiting and send the ASR text directly to the subsequent flow. The invention solves the problem of sentence breaks caused by pauses in ASR, and offers extremely low development investment, very high accuracy, and low concurrency pressure, so it can be widely popularized and applied in the field of speech recognition.

Description

Speaker pause processing method in speech recognition
Technical Field
The invention belongs to the field of speech recognition, and particularly relates to a method for processing speaker pauses in speech recognition.
Background
With the development of artificial intelligence (AI) technology, automatic speech recognition (ASR) has become widely used across society. ASR usually sits at the front of a multi-stage pipeline: the text recognized by ASR is sent to subsequent stages for processing. In most cases, the flow after ASR depends on the user having finished speaking, i.e., on the "complete" text recognized by the ASR, and the subsequent flow then makes corresponding feedback actions based on the "complete semantics" of what the user said.
By interaction style, ASR clients fall into two categories:
1. The beginning and end of the utterance are controlled manually. For example: press a key to speak, release it to stop.
2. An algorithm automatically determines the beginning and end of the utterance. For example:
A dedicated VAD model (WebRTC VAD, Silero VAD) treats detected human voice as the start of speech and its absence as the end.
Inside the ASR algorithm, a period of time (e.g., 500 ms) during which the ASR recognizes no new text is treated as a sentence end. (In the industry, Paraformer and WeNet detect endpoints in this way.)
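The timeout-style endpointing described above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited systems; the 500 ms default and the `detect_endpoint` interface are assumptions:

```python
def detect_endpoint(partial_results, timeout_ms=500):
    """Declare a sentence end when the ASR produces no new text for timeout_ms.

    partial_results: list of (timestamp_ms, text) partial hypotheses from a
    streaming ASR. Returns the timestamp at which an endpoint is declared,
    or None if the stream ends without one.
    """
    last_change_ms = None
    last_text = ""
    for ts_ms, text in partial_results:
        if text != last_text:
            # The hypothesis grew: the user is still talking.
            last_text = text
            last_change_ms = ts_ms
        elif last_change_ms is not None and ts_ms - last_change_ms >= timeout_ms:
            return ts_ms  # no new text for timeout_ms -> endpoint
    return None
```

A streaming recognizer would feed this function its partial hypotheses as they arrive; Paraformer and WeNet apply the same idea inside their decoders.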
The first type of interaction basically guarantees that the user's utterance is complete, but the interaction process is not automatic or intelligent enough.
The second type is widely used in more automatic, intelligent scenarios: it supports full-flow voice interaction with no key presses at all. Its disadvantage is that it cannot guarantee the integrity of the user's utterance (the user may pause to think or breathe). If the ASR result is used directly, the user may have said only half a sentence with the rest still to come, so the "completeness of the user's intent" cannot be guaranteed, and neither can the quality of the subsequent processing stages.
The patent numbered CN202310131353.4, titled "Speech recognition method, device, equipment and medium", discloses the following scheme. Segmenting the target voice-stream information into voice segments according to sentence-breaking feature information comprises: statistically processing the sentence-breaking feature information to obtain a sentence-breaking time threshold and target sentence-breaking words; continuously detecting the target voice stream to obtain sentence-breaking information, including a pause position, a pause duration, and a target detection word; judging whether the pause duration exceeds the sentence-breaking time threshold; if it does, determining whether the target detection word belongs to the target sentence-breaking words; and if it does, segmenting the voice stream at the pause position to obtain the target voice-segment information. Its defect lies in the target detection words: in 2B business, customer-facing scenarios are numerous, so the requirement to supply a comprehensive set of target detection words is unrealistic and the practicality is weak.
The patent numbered CN202110983301.0, titled "Speech sentence-breaking method, computer equipment and storage medium", discloses a scheme that combines the silence information of the speech data (speech pause features) with the ASR-recognized text and feeds both into a sentence-breaking prediction model. The information in that patent (figure 2, S30, and figure 3 of its specification) indicates that it uses the silence information and the ASR text as features, trains a sentence-breaking prediction model (including but not limited to convolutional neural networks, conditional random fields, and recurrent neural networks), and uses the model to compute a cumulative silence score. Its defects: first, model training cost is high. Training requires specially collecting massive sentence-breaking text and speech-pause data, and a model structure must be selected or designed and then trained, so the whole process takes a long period and heavy labor investment. Second, the calling frequency is high. That patent must compute text information for every two adjacent speech packets, and its example gives a packet duration of 20 ms, at which the model is called 50 times per second. Although 20 ms is only an example value, the packet clearly cannot be much longer, so the model's call frequency is necessarily tens of times per second, imposing high concurrency pressure on the server. In short, the scheme has high input cost and high server concurrency pressure.
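For reference, the call-frequency figure quoted above follows directly from the packet duration (illustrative arithmetic only):

```python
# One model call per speech packet: 20 ms packets -> 50 calls per second.
packet_ms = 20                         # example packet duration from the cited patent
calls_per_second = 1000 // packet_ms   # calls per 1000 ms of audio
```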
Disclosure of Invention
The invention provides a speaker pause processing method in voice recognition, which is used for solving the defects in the prior art.
The invention is realized by the following technical scheme:
a method for speaker stall handling in speech recognition, comprising the steps of:
step one: acquiring voice, and recognizing the voice through ASR;
step two: realizing sentence-breaking detection by using the Prompt word operation of the LLM large language model;
step three: outputting the ASR text to a subsequent processing flow if the second step detects that the meaning of the sentence breaking is complete;
step four: if the meaning of the sentence breaking is not complete, waiting for a threshold value of a certain time period, waiting for the speaking voice, if the voice exists, waiting for the speaking, combining the ASR texts for two or more times, sending the ASR texts into a subsequent processing flow, and if the speaking voice is not found in the threshold value of the time period, not waiting any more, and directly sending the ASR texts into the subsequent processing flow.
In the above method, the speech recognition in step one is either offline speech recognition (the ASR + VAD scheme, also called offline ASR: recognition starts only after the utterance ends, and sentence-break detection is called once when the VAD detects silence) or online real-time speech recognition (the ASR + endpoint-detection scheme, also called online ASR: recognition runs while the user is speaking, and sentence-break detection is called only when an Endpoint occurs).
In the above method for handling speaker pauses in speech recognition, the waiting-duration threshold in step four is 100 ms to 1000 ms.
In the above method for handling speaker pauses in speech recognition, the upper limit on the number of waits in step four is 2.
The invention has the following advantages: it solves the problem of sentence breaks caused by pauses in ASR, with extremely low development investment, very high accuracy, and low concurrency pressure, so it can be widely popularized and applied in the field of speech recognition.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings needed for the embodiments or the prior-art description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flow chart of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
As shown in fig. 1, a method for handling speaker pauses in speech recognition includes the steps of:
step one: acquire speech and recognize it with ASR;
step two: perform sentence-break detection by prompting an LLM (large language model);
step three: if step two finds the meaning of the broken-off sentence complete, output the ASR text to the subsequent processing flow;
step four: if the meaning is incomplete, wait up to a time-length threshold for further speech; if speech appears, wait for it to finish, merge the two or more ASR texts, and send the result to the subsequent processing flow; if no speech is found within the threshold, stop waiting and send the ASR text directly to the subsequent flow.
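The four steps above can be sketched as follows. This is a hypothetical illustration of the flow, not code from the patent; `asr_recognize`, `llm_is_complete`, `wait_for_speech`, and `downstream` are assumed interfaces supplied by the caller:

```python
def process_utterance(asr_recognize, llm_is_complete, wait_for_speech,
                      downstream, wait_ms=500, max_waits=2):
    """Sketch of the claimed four-step flow.

    asr_recognize()           -> recognized text for one utterance (step one)
    llm_is_complete(text)     -> True if the LLM judges the text complete (step two)
    wait_for_speech(timeout)  -> the next utterance's ASR text, or None on timeout
    downstream(text)          -> the subsequent processing flow
    """
    text = asr_recognize()                                   # step one: ASR
    waits = 0
    while not llm_is_complete(text) and waits < max_waits:   # step two: LLM check
        waits += 1
        more = wait_for_speech(wait_ms)                      # step four: wait 100-1000 ms
        if more is None:
            break                                            # timeout: stop waiting
        text = text + more                                   # merge the ASR texts
    downstream(text)                                         # step three: send onward
    return text
```

With the claimed defaults (waiting threshold of 100-1000 ms, at most 2 waits), the loop either merges the follow-up speech into one complete text or times out and forwards the partial text unchanged.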
Specifically, in step one of this embodiment, the speech recognition is either offline speech recognition (the ASR + VAD scheme, also called offline ASR: recognition starts only after the utterance ends, and sentence-break detection is called once when the VAD detects silence) or online real-time speech recognition (the ASR + endpoint-detection scheme, also called online ASR: recognition runs while the user is speaking, and sentence-break detection is called only when an Endpoint occurs).
Specifically, the waiting-duration threshold in step four of this embodiment is 100 ms to 1000 ms and may be adjusted to the actual situation.
Further, the upper limit on the number of waits in step four of this embodiment is 2, and it may also be adjusted to the actual situation.
Preferably, an LLM is a general-purpose model that can chat, answer questions, generate content, and so on, and its ICL (in-context learning) capability is increasingly strong. Based on an LLM, only one Prompt needs to be written to implement the sentence-break detection function, i.e., to detect whether the semantics are complete. An example Prompt: "Ignoring punctuation, does the following sentence express a complete meaning? Please answer directly with 'complete' or 'incomplete'. The sentence is:\n hello, what is your name", where "hello, what is your name" is the result of ASR recognition of the user's speech and "\n" is a line feed. More and more enterprises deploy LLMs privately, and since only one Prompt is needed, the function can be implemented on top of such an LLM. In practice, this function can share one LLM service with other LLM functions; only its Prompt differs from theirs, and the investment to complete it is roughly one person-day.
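Wrapping the Prompt above around a generic LLM call could look like the following sketch. The `llm` callable is an assumption standing in for any privately deployed model endpoint, and the English prompt wording paraphrases the patent's example:

```python
# Assumed prompt wording, paraphrasing the patent's example Prompt.
PROMPT_TEMPLATE = (
    "Ignoring punctuation, does the following sentence express a complete "
    "meaning? Answer directly with \"complete\" or \"incomplete\". "
    "The sentence is:\n{text}"
)

def is_sentence_complete(llm, asr_text):
    """llm: any callable that takes a prompt string and returns the model's reply.

    Returns True when the LLM judges the ASR text semantically complete.
    """
    reply = llm(PROMPT_TEMPLATE.format(text=asr_text))
    # Treat any reply other than "incomplete" as complete -- a permissive
    # default, in the spirit of the patent's fallback of sending text onward
    # rather than waiting indefinitely.
    return "incomplete" not in reply.strip().lower()
```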
Preferably, the comprehensive capability of an LLM far exceeds that of a task-specific NLP model. This is because the LLM, a GPT (Generative Pre-trained Transformer), completes pre-training on massive text data, then instruction fine-tuning on massive SFT (supervised fine-tuning) data, and often further steps such as reinforcement learning. This training gives the LLM strong instruction-following ability and strong in-context learning (ICL) ability, i.e., a strong capability to understand and execute Prompts, which ensures the accuracy of semantic sentence-break detection.
Example 1
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features replaced by equivalents, and that such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. A method for handling speaker pauses in speech recognition, characterized by comprising the following steps:
step one: acquire speech and recognize it with ASR;
step two: perform sentence-break detection by prompting an LLM (large language model);
step three: if step two finds the meaning of the broken-off sentence complete, output the ASR text to the subsequent processing flow;
step four: if the meaning is incomplete, wait up to a time-length threshold for further speech; if speech appears, wait for it to finish, merge the two or more ASR texts, and send the result to the subsequent processing flow; if no speech is found within the threshold, stop waiting and send the ASR text directly to the subsequent flow.
2. The method of handling speaker pauses in speech recognition according to claim 1, wherein in step one the speech recognition is offline speech recognition or online real-time speech recognition.
3. The method of handling speaker pauses in speech recognition according to claim 1, wherein the waiting-duration threshold in step four is 100-1000 ms.
4. The method of handling speaker pauses in speech recognition according to claim 1, wherein the upper limit on the number of waits in step four is 2.
CN202311481957.8A 2023-11-08 2023-11-08 Speaker pause processing method in speech recognition Pending CN117524192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311481957.8A CN117524192A (en) 2023-11-08 2023-11-08 Speaker pause processing method in speech recognition


Publications (1)

Publication Number Publication Date
CN117524192A (en) 2024-02-06

Family

ID=89761920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311481957.8A Pending CN117524192A (en) 2023-11-08 2023-11-08 Speaker pause processing method in speech recognition

Country Status (1)

Country Link
CN (1) CN117524192A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140372119A1 (en) * 2008-09-26 2014-12-18 Google, Inc. Compounded Text Segmentation
US20190318759A1 (en) * 2018-04-12 2019-10-17 Qualcomm Incorporated Context-based detection of end-point of utterance
CN112995419A (en) * 2021-02-05 2021-06-18 支付宝(杭州)信息技术有限公司 Voice conversation processing method and system
KR20220158573A (en) * 2021-05-24 2022-12-01 네이버 주식회사 Method and system for controlling for persona chatbot
US20230083512A1 (en) * 2021-09-10 2023-03-16 Salesforce.Com, Inc. Systems and methods for factual extraction from language model
KR20230129875A (en) * 2022-03-02 2023-09-11 네이버 주식회사 Method and system for goods recommendation
CN116775183A (en) * 2023-05-31 2023-09-19 腾讯科技(深圳)有限公司 Task generation method, system, equipment and storage medium based on large language model
CN116823203A (en) * 2023-07-17 2023-09-29 先看看闪聘(江苏)数字科技有限公司 Recruitment system and recruitment method based on AI large language model
CN116955561A (en) * 2023-07-24 2023-10-27 百度国际科技(深圳)有限公司 Question answering method, question answering device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN107437415B (en) Intelligent voice interaction method and system
CN107993665B (en) Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
CN110364178B (en) Voice processing method and device, storage medium and electronic equipment
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN111816172A (en) Voice response method and device
CN113076770A (en) Intelligent figure portrait terminal based on dialect recognition
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
CN114708856A (en) Voice processing method and related equipment thereof
Raux Flexible turn-taking for spoken dialog systems
Montenegro et al. Analysis of the sensitivity of the End-Of-Turn Detection task to errors generated by the Automatic Speech Recognition process
Jia et al. A deep learning system for sentiment analysis of service calls
CN112185392A (en) Voice recognition processing system for power supply intelligent client
CN117524192A (en) Speaker pause processing method in speech recognition
KR20210123545A (en) Method and apparatus for conversation service based on user feedback
CN111009236A (en) Voice recognition method based on DBLSTM + CTC acoustic model
CN113345423B (en) Voice endpoint detection method, device, electronic equipment and storage medium
CN113946670A (en) Contrast type context understanding enhancement method for dialogue emotion recognition
CN114328867A (en) Intelligent interruption method and device in man-machine conversation
CN113160821A (en) Control method and device based on voice recognition
CN112506405A (en) Artificial intelligent voice large screen command method based on Internet supervision field
JP2005258235A (en) Interaction controller with interaction correcting function by feeling utterance detection
CN110910904A (en) Method for establishing voice emotion recognition model and voice emotion recognition method
WO2023092399A1 (en) Speech recognition method, speech recognition apparatus, and system
CN116483960B (en) Dialogue identification method, device, equipment and storage medium
KR102533368B1 (en) Method and system for analyzing fluency using big data

Legal Events

Date Code Title Description
PB01 Publication
CB02 Change of applicant information

Country or region after: China

Address after: Room 911, 9th Floor, Block B, Xingdi Center, Building 2, No.10, Jiuxianqiao North Road, Jiangtai Township, Chaoyang District, Beijing, 100000

Applicant after: Beijing Zhongke Shenzhi Technology Co.,Ltd.

Address before: Room 605, 6th Floor, Block B, Xingdi Center, Building 2, No. 10 Jiuxianqiao North Road, Jiangtai Township, Chaoyang District, Beijing, 100000

Applicant before: Beijing Zhongke Shenzhi Technology Co.,Ltd.

Country or region before: China

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination