CN117524192A - Speaker pause processing method in speech recognition - Google Patents
- Publication number
- CN117524192A (application CN202311481957.8A)
- Authority
- CN
- China
- Prior art keywords
- asr
- voice
- waiting
- sentence
- speech recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
  - G10L15/005—Language recognition
  - G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
  - G10L15/08—Speech classification or search
    - G10L15/18—Speech classification or search using natural language modelling
  - G10L15/24—Speech recognition using non-acoustical features
  - G10L15/26—Speech to text systems
Abstract
A method for speaker pause handling in speech recognition, comprising the steps of: step one: acquiring speech and recognizing it through ASR; step two: performing sentence-break detection with a Prompt to an LLM (large language model); step three: if step two finds the sentence semantically complete, outputting the ASR text to the subsequent processing flow; step four: if the sentence is not semantically complete, waiting up to a time threshold for further speech; if new speech arrives, merging the two or more ASR texts and sending them to the subsequent processing flow; if no speech is detected within the threshold, waiting no longer and sending the ASR text onward directly. The invention solves the problem of incorrect sentence breaks caused by pauses in ASR, and offers extremely low research and development cost, very high accuracy, and low concurrency pressure, so it can be widely popularized and applied in the field of speech recognition.
Description
Technical Field
The invention belongs to the field of speech recognition, and particularly relates to a method for processing speaker pauses in speech recognition.
Background
With the development of artificial intelligence (AI) technology, speech recognition (ASR) has come into wide use. ASR is usually the front end of a multi-step pipeline: the text recognized by ASR is sent to subsequent flows for processing. In most cases, the flows after ASR depend on the "complete" text that the ASR recognizes from the user's speech, and the subsequent flow makes corresponding feedback actions based on the "complete semantics" of what the user said.
In terms of interaction, ASR clients fall into two categories:
The first: the beginning and end of the utterance are controlled manually. For example: press a key to speak, release it to stop.
The second: an algorithm automatically determines the beginning and end of the utterance. For example:
A dedicated VAD model (webrtc-VAD, silero-VAD) treats the detection of a human voice as the start of speaking and its absence as the end.
Inside the ASR algorithm itself, if no new text is recognized for a period of time (e.g., 500 ms), a sentence end is assumed (in industry, Paraformer and WeNet detect endpoints this way).
The first interaction mode basically guarantees that the user's utterance is complete, but the interaction is not automatic or intelligent enough.
The second interaction mode is widely used in more automatic, more intelligent scenarios: it supports fully hands-free voice interaction with no key presses. Its drawback is that it cannot guarantee the completeness of the user's utterance (the user may pause to think or to breathe). If the ASR result is used directly, the user may have said only half a sentence with the rest still to come, so the completeness of the user's intent cannot be guaranteed, and neither can the quality of the subsequent flow links.
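The endpoint-style detection of the second mode can be sketched as follows. This is a toy energy-threshold illustration, not a real VAD: production systems use trained models such as webrtc-VAD or silero-VAD, and the function name, frame length, threshold, and 500 ms window here are all assumptions for illustration.

```python
# Toy sketch of automatic endpoint detection: a run of silent frames
# lasting endpoint_ms is treated as a sentence end.
def detect_endpoints(frame_energies, silence_threshold=0.1,
                     frame_ms=20, endpoint_ms=500):
    """Return indices of frames at which an endpoint (sentence end) fires."""
    needed_silent_frames = endpoint_ms // frame_ms  # e.g. 500 // 20 = 25
    endpoints = []
    silent_run = 0
    in_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            in_speech = True            # human voice detected: speaking
            silent_run = 0
        elif in_speech:
            silent_run += 1             # count consecutive silent frames
            if silent_run >= needed_silent_frames:
                endpoints.append(i)     # endpoint fires: sentence end assumed
                in_speech = False
                silent_run = 0
    return endpoints
```

Note that such a detector fires on any sufficiently long pause, which is exactly the background problem: a pause for thinking or breathing is indistinguishable from a real sentence end at the acoustic level.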
Patent CN202310131353.4, "A speech recognition method, apparatus, device and medium", discloses the following scheme for segmenting the target speech stream according to sentence-break feature information to obtain target speech segments: statistically process the sentence-break feature information to obtain a sentence-break time threshold and target sentence-break words; continuously detect the target speech stream to obtain sentence-break information, including pause position, pause duration and a detected word; judge whether the pause duration exceeds the sentence-break time threshold; if so, determine whether the detected word belongs to the target sentence-break words; and if it does, segment the speech stream at the pause position to obtain the target speech segments. Its defect is the reliance on target detection words: in 2B business, customers face numerous scenarios, so supplying a comprehensive set of target detection words is unrealistic and the scheme's practicability is weak.
Patent CN202110983301.0, "Speech sentence-breaking method, computer device and storage medium", discloses the following scheme: silence information of the speech data (a speech-pause feature) is combined with the ASR-recognized text and fed into a sentence-break prediction model. The patent (figures 2-S30 and 3 of its specification) indicates that silence information and ASR text are used as features to train a sentence-break prediction model (including but not limited to convolutional neural networks, conditional random fields, and recurrent neural networks), which computes a cumulative silence score. Its defects are: (1) High model training cost. Training requires specially collecting massive sentence-break text and speech-pause data, and a model structure must be selected or designed and then trained; the whole process has a long cycle and a large labor investment. (2) High call frequency. The patent computes text information for every two adjacent speech packets, and its example gives a packet duration of 20 ms, which implies a model call frequency of 50 calls/second. Although 20 ms is only an example value, the packet cannot be much longer, so the call frequency is necessarily tens of calls per second, imposing high concurrency pressure on the server. In short, the scheme has a high input cost and high server concurrency pressure.
Disclosure of Invention
The invention provides a speaker pause processing method in speech recognition to remedy the above defects in the prior art.
The invention is realized by the following technical scheme:
a method for speaker stall handling in speech recognition, comprising the steps of:
step one: acquiring voice, and recognizing the voice through ASR;
step two: realizing sentence-breaking detection by using the Prompt word operation of the LLM large language model;
step three: outputting the ASR text to a subsequent processing flow if the second step detects that the meaning of the sentence breaking is complete;
step four: if the meaning of the sentence breaking is not complete, waiting for a threshold value of a certain time period, waiting for the speaking voice, if the voice exists, waiting for the speaking, combining the ASR texts for two or more times, sending the ASR texts into a subsequent processing flow, and if the speaking voice is not found in the threshold value of the time period, not waiting any more, and directly sending the ASR texts into the subsequent processing flow.
In the above method, the speech recognition in step one is either offline speech recognition (the ASR+VAD scheme, also called offline ASR: recognition starts after the utterance finishes, and sentence-break detection is called once when the VAD detects silence) or online real-time speech recognition (the ASR+endpoint-detection scheme, also called online ASR: recognition runs while the user is speaking, and sentence-break detection is called once when an endpoint occurs).
In the above method for speaker pause processing in speech recognition, the waiting-time threshold in step four is 100 ms to 1000 ms.
In the above method for speaker pause processing in speech recognition, the number of waits in step four is capped at 2.
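The four steps, together with the waiting threshold and the wait cap above, can be sketched as control logic. This is a minimal illustration only: the function names and the way the ASR recognizer, the LLM completeness check, and the voice-waiting primitive are injected as callables are assumptions, not part of the patent.

```python
# Sketch of the four-step flow. All collaborators are injected so the
# control logic is self-contained and testable.
def process_utterance(get_asr_text, is_complete, wait_for_voice,
                      wait_ms=500, max_waits=2):
    """Return the text to hand to the downstream processing flow.

    get_asr_text() -> str: next recognized segment (step one).
    is_complete(text) -> bool: LLM Prompt-based check (steps two/three).
    wait_for_voice(ms) -> bool: True if new speech arrived in the window.
    """
    assert 100 <= wait_ms <= 1000       # threshold range stated in the patent
    text = get_asr_text()
    waits = 0
    while not is_complete(text) and waits < max_waits:
        if not wait_for_voice(wait_ms):  # no new speech within the threshold:
            break                        # stop waiting, send what we have
        text = text + get_asr_text()     # merge the two (or more) ASR texts
        waits += 1
    return text                          # handed to the subsequent flow
```

For example, if the user says "I want to" (pause) "book a flight", the incomplete first segment triggers a wait, the second segment arrives, and the merged text is sent downstream in one piece.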
The invention has the advantages that it solves the problem of incorrect sentence breaks caused by pauses in ASR, with extremely low research and development cost, very high accuracy, and low concurrency pressure, so it can be widely popularized and applied in the field of speech recognition.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flow chart of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort fall within the scope of the present invention.
As shown in fig. 1, a method for speaker pause processing in speech recognition includes the steps of:
step one: acquiring speech and recognizing it through ASR;
step two: performing sentence-break detection with a Prompt to an LLM (large language model);
step three: if step two finds the sentence semantically complete, outputting the ASR text to the subsequent processing flow;
step four: if the sentence is not semantically complete, waiting up to a time threshold for further speech; if new speech arrives, merging the two or more ASR texts and sending them to the subsequent processing flow; if no speech is detected within the threshold, waiting no longer and sending the ASR text onward directly.
Specifically, the speech recognition in step one of this embodiment is either offline speech recognition (the ASR+VAD scheme, also called offline ASR: recognition starts after the utterance finishes, and sentence-break detection is called once when the VAD detects silence) or online real-time speech recognition (the ASR+endpoint-detection scheme, also called online ASR: recognition runs while the user is speaking, and sentence-break detection is called once when an endpoint occurs).
Specifically, the waiting-time threshold in step four of this embodiment is 100 ms to 1000 ms and may be adjusted to the actual situation.
Further, the number of waits in step four of this embodiment is capped at 2 and may be adjusted to the actual situation.
Preferably, the LLM is a general-purpose model capable of chat, question answering, content generation and so on, with increasingly strong ICL (in-context learning) capability. Based on the LLM, writing a single Prompt is enough to implement sentence-break detection, i.e., detecting whether the semantics are complete. An example Prompt: "Ignoring punctuation, is the following sentence semantically complete? Answer only 'complete' or 'incomplete'. The sentence is:\n'Hello, may I ask what your name is'", where the quoted sentence is the result of ASR recognition of the user's speech and "\n" is a line break. More and more enterprises deploy LLMs privately, and since only one Prompt is needed, the function can be built on such an LLM. In practice this function can share one LLM service with other LLM functions; only its Prompt differs from theirs, so the function can be completed with an input cost of roughly one person-day.
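The Prompt-based check described above can be sketched as follows. The template wording paraphrases the example Prompt in the description; the `llm` callable stands in for whatever privately deployed LLM service is used, and its single-string interface is an assumption for illustration.

```python
# Sketch of Prompt-based sentence-break (semantic completeness) detection.
PROMPT_TEMPLATE = (
    "Ignoring punctuation, is the following sentence semantically complete? "
    "Answer only \"complete\" or \"incomplete\". The sentence is:\n{text}"
)

def is_sentence_complete(asr_text, llm):
    """True if the LLM judges the ASR-recognized text to be complete.

    llm: callable taking the prompt string and returning the model's reply.
    """
    reply = llm(PROMPT_TEMPLATE.format(text=asr_text))
    # Check for "incomplete" first, since "complete" is a substring of it.
    return "incomplete" not in reply.strip().lower()
```

Parsing the reply by substring rather than exact match is a deliberate hedge: real LLM replies often add punctuation or extra words around the requested answer.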
Preferably, the comprehensive capability of an LLM far exceeds that of a task-specific NLP model. This is because the LLM is pre-trained as a GPT (Generative Pre-trained Transformer) on massive text data, then instruction-tuned on large amounts of SFT (supervised fine-tuning) data, and often further trained with methods such as reinforcement learning. This training gives the LLM strong instruction-following ability and strong in-context learning (ICL) ability, i.e., a strong capacity to understand and execute a Prompt, which guarantees the accuracy of semantic sentence-break detection.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (4)
1. A method for speaker pause handling in speech recognition, characterized in that it comprises the following steps:
step one: acquiring speech and recognizing it through ASR;
step two: performing sentence-break detection with a Prompt to an LLM (large language model);
step three: if step two finds the sentence semantically complete, outputting the ASR text to the subsequent processing flow;
step four: if the sentence is not semantically complete, waiting up to a time threshold for further speech; if new speech arrives, merging the two or more ASR texts and sending them to the subsequent processing flow; if no speech is detected within the threshold, waiting no longer and sending the ASR text onward directly.
2. The method for speaker pause handling in speech recognition according to claim 1, wherein in step one the speech recognition is offline speech recognition or online real-time speech recognition.
3. The method for speaker pause handling in speech recognition according to claim 1, wherein the waiting-time threshold in step four is 100 ms to 1000 ms.
4. The method for speaker pause handling in speech recognition according to claim 1, wherein the number of waits in step four is capped at 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311481957.8A CN117524192A (en) | 2023-11-08 | 2023-11-08 | Speaker pause processing method in speech recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117524192A true CN117524192A (en) | 2024-02-06 |
Family
ID=89761920
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311481957.8A Pending CN117524192A (en) | 2023-11-08 | 2023-11-08 | Speaker pause processing method in speech recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117524192A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140372119A1 (en) * | 2008-09-26 | 2014-12-18 | Google, Inc. | Compounded Text Segmentation |
US20190318759A1 (en) * | 2018-04-12 | 2019-10-17 | Qualcomm Incorporated | Context-based detection of end-point of utterance |
CN112995419A (en) * | 2021-02-05 | 2021-06-18 | 支付宝(杭州)信息技术有限公司 | Voice conversation processing method and system |
KR20220158573A (en) * | 2021-05-24 | 2022-12-01 | 네이버 주식회사 | Method and system for controlling for persona chatbot |
US20230083512A1 (en) * | 2021-09-10 | 2023-03-16 | Salesforce.Com, Inc. | Systems and methods for factual extraction from language model |
KR20230129875A (en) * | 2022-03-02 | 2023-09-11 | 네이버 주식회사 | Method and system for goods recommendation |
CN116775183A (en) * | 2023-05-31 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Task generation method, system, equipment and storage medium based on large language model |
CN116823203A (en) * | 2023-07-17 | 2023-09-29 | 先看看闪聘(江苏)数字科技有限公司 | Recruitment system and recruitment method based on AI large language model |
CN116955561A (en) * | 2023-07-24 | 2023-10-27 | 百度国际科技(深圳)有限公司 | Question answering method, question answering device, electronic equipment and storage medium |
Legal Events
- PB01: Publication
- CB02: Change of applicant information
  - Address after: Room 911, 9th Floor, Block B, Xingdi Center, Building 2, No. 10 Jiuxianqiao North Road, Jiangtai Township, Chaoyang District, Beijing, 100000
  - Applicant after: Beijing Zhongke Shenzhi Technology Co.,Ltd.
  - Address before: Room 605, 6th Floor, Block B, Xingdi Center, Building 2, No. 10 Jiuxianqiao North Road, Jiangtai Township, Chaoyang District, Beijing, 100000
  - Applicant before: Beijing Zhongke Shenzhi Technology Co.,Ltd.
  - Country or region (before and after): China
- SE01: Entry into force of request for substantive examination