CN117524192A - Speaker pause processing method in speech recognition - Google Patents

Speaker pause processing method in speech recognition

Info

Publication number
CN117524192A
CN117524192A · Application CN202311481957.8A
Authority
CN
China
Prior art keywords
asr
voice
waiting
sentence
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311481957.8A
Other languages
Chinese (zh)
Inventor
黄明明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Shenzhi Technology Co ltd
Original Assignee
Beijing Zhongke Shenzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Shenzhi Technology Co ltd filed Critical Beijing Zhongke Shenzhi Technology Co ltd
Priority to CN202311481957.8A priority Critical patent/CN117524192A/en
Publication of CN117524192A publication Critical patent/CN117524192A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING › G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search › G10L15/18 using natural language modelling
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A method for handling speaker pauses in speech recognition, comprising the steps of: step one: acquire speech and recognize it with ASR; step two: perform sentence-break detection by prompting an LLM (large language model); step three: if step two finds the meaning of the broken-off sentence complete, output the ASR text to the subsequent processing flow; step four: if the meaning is incomplete, wait up to a time-length threshold for further speech; if speech appears, wait for it to finish, merge the two or more ASR texts, and send the result to the subsequent processing flow; if no speech is found within the threshold, stop waiting and send the ASR text directly to the subsequent flow. The invention solves the problem of sentence breaks caused by pauses in ASR, and offers extremely low development investment, very high accuracy, and low concurrency pressure, so it can be widely popularized and applied in the field of speech recognition.

Description

Speaker pause processing method in speech recognition
Technical Field
The invention belongs to the field of speech recognition, and particularly relates to a method for processing speaker pauses in speech recognition.
Background
With the development of artificial intelligence (AI) technology, automatic speech recognition (ASR) has become widely used across society. ASR usually sits at the front of a multi-stage pipeline: the text recognized by ASR is sent to subsequent stages for processing. In most cases, the flow after ASR depends on the user having finished speaking, i.e., on the "complete" text recognized by the ASR, and the subsequent flow then makes corresponding feedback actions based on the "complete semantics" of what the user said.
By interaction style, ASR clients fall into two categories:
1. The beginning and end of the utterance are controlled manually. For example: press a key to speak, release it to stop.
2. An algorithm automatically determines the beginning and end of the utterance. For example:
A dedicated VAD model (WebRTC VAD, Silero VAD) treats detected human voice as the start of speech and its absence as the end.
Inside the ASR algorithm, a period of time (e.g., 500 ms) during which the ASR recognizes no new text is treated as a sentence end. (In the industry, Paraformer and WeNet detect endpoints in this way.)
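The timeout-style endpointing described above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited systems; the 500 ms default and the `detect_endpoint` interface are assumptions:

```python
def detect_endpoint(partial_results, timeout_ms=500):
    """Declare a sentence end when the ASR produces no new text for timeout_ms.

    partial_results: list of (timestamp_ms, text) partial hypotheses from a
    streaming ASR. Returns the timestamp at which an endpoint is declared,
    or None if the stream ends without one.
    """
    last_change_ms = None
    last_text = ""
    for ts_ms, text in partial_results:
        if text != last_text:
            # The hypothesis grew: the user is still talking.
            last_text = text
            last_change_ms = ts_ms
        elif last_change_ms is not None and ts_ms - last_change_ms >= timeout_ms:
            return ts_ms  # no new text for timeout_ms -> endpoint
    return None
```

A streaming recognizer would feed this function its partial hypotheses as they arrive; Paraformer and WeNet apply the same idea inside their decoders.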
The first type of interaction basically guarantees that the user's utterance is complete, but the interaction process is not automatic or intelligent enough.
The second type is widely used in more automatic, intelligent scenarios: it supports full-flow voice interaction with no key presses at all. Its disadvantage is that it cannot guarantee the integrity of the user's utterance (the user may pause to think or breathe). If the ASR result is used directly, the user may have said only half a sentence with the rest still to come, so the "completeness of the user's intent" cannot be guaranteed, and neither can the quality of the subsequent processing stages.
The patent numbered CN202310131353.4, titled "Speech recognition method, device, equipment and medium", discloses the following scheme. Segmenting the target voice-stream information into voice segments according to sentence-breaking feature information comprises: statistically processing the sentence-breaking feature information to obtain a sentence-breaking time threshold and target sentence-breaking words; continuously detecting the target voice stream to obtain sentence-breaking information, including a pause position, a pause duration, and a target detection word; judging whether the pause duration exceeds the sentence-breaking time threshold; if it does, determining whether the target detection word belongs to the target sentence-breaking words; and if it does, segmenting the voice stream at the pause position to obtain the target voice-segment information. Its defect lies in the target detection words: in 2B business, customer-facing scenarios are numerous, so the requirement to supply a comprehensive set of target detection words is unrealistic and the practicality is weak.
The patent numbered CN202110983301.0, titled "Speech sentence-breaking method, computer equipment and storage medium", discloses a scheme that combines the silence information of the speech data (speech pause features) with the ASR-recognized text and feeds both into a sentence-breaking prediction model. The information in that patent (figure 2, S30, and figure 3 of its specification) indicates that it uses the silence information and the ASR text as features, trains a sentence-breaking prediction model (including but not limited to convolutional neural networks, conditional random fields, and recurrent neural networks), and uses the model to compute a cumulative silence score. Its defects: first, model training cost is high. Training requires specially collecting massive sentence-breaking text and speech-pause data, and a model structure must be selected or designed and then trained, so the whole process takes a long period and heavy labor investment. Second, the calling frequency is high. That patent must compute text information for every two adjacent speech packets, and its example gives a packet duration of 20 ms, at which the model is called 50 times per second. Although 20 ms is only an example value, the packet clearly cannot be much longer, so the model's call frequency is necessarily tens of times per second, imposing high concurrency pressure on the server. In short, the scheme has high input cost and high server concurrency pressure.
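For reference, the call-frequency figure quoted above follows directly from the packet duration (illustrative arithmetic only):

```python
# One model call per speech packet: 20 ms packets -> 50 calls per second.
packet_ms = 20                         # example packet duration from the cited patent
calls_per_second = 1000 // packet_ms   # calls per 1000 ms of audio
```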
Disclosure of Invention
The invention provides a speaker pause processing method in voice recognition, which is used for solving the defects in the prior art.
The invention is realized by the following technical scheme:
a method for speaker stall handling in speech recognition, comprising the steps of:
step one: acquiring voice, and recognizing the voice through ASR;
step two: realizing sentence-breaking detection by using the Prompt word operation of the LLM large language model;
step three: outputting the ASR text to a subsequent processing flow if the second step detects that the meaning of the sentence breaking is complete;
step four: if the meaning of the sentence breaking is not complete, waiting for a threshold value of a certain time period, waiting for the speaking voice, if the voice exists, waiting for the speaking, combining the ASR texts for two or more times, sending the ASR texts into a subsequent processing flow, and if the speaking voice is not found in the threshold value of the time period, not waiting any more, and directly sending the ASR texts into the subsequent processing flow.
In the above method, the speech recognition in step one is either offline speech recognition (the ASR + VAD scheme, also called offline ASR: recognition starts only after the utterance ends, and sentence-break detection is called once when the VAD detects silence) or online real-time speech recognition (the ASR + endpoint-detection scheme, also called online ASR: recognition runs while the user is speaking, and sentence-break detection is called only when an Endpoint occurs).
In the above method for handling speaker pauses in speech recognition, the waiting-duration threshold in step four is 100 ms to 1000 ms.
In the above method for handling speaker pauses in speech recognition, the upper limit on the number of waits in step four is 2.
The invention has the following advantages: it solves the problem of sentence breaks caused by pauses in ASR, with extremely low development investment, very high accuracy, and low concurrency pressure, so it can be widely popularized and applied in the field of speech recognition.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings needed for the embodiments or the prior-art description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flow chart of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
As shown in fig. 1, a method for handling speaker pauses in speech recognition includes the steps of:
step one: acquire speech and recognize it with ASR;
step two: perform sentence-break detection by prompting an LLM (large language model);
step three: if step two finds the meaning of the broken-off sentence complete, output the ASR text to the subsequent processing flow;
step four: if the meaning is incomplete, wait up to a time-length threshold for further speech; if speech appears, wait for it to finish, merge the two or more ASR texts, and send the result to the subsequent processing flow; if no speech is found within the threshold, stop waiting and send the ASR text directly to the subsequent flow.
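The four steps above can be sketched as follows. This is a hypothetical illustration of the flow, not code from the patent; `asr_recognize`, `llm_is_complete`, `wait_for_speech`, and `downstream` are assumed interfaces supplied by the caller:

```python
def process_utterance(asr_recognize, llm_is_complete, wait_for_speech,
                      downstream, wait_ms=500, max_waits=2):
    """Sketch of the claimed four-step flow.

    asr_recognize()           -> recognized text for one utterance (step one)
    llm_is_complete(text)     -> True if the LLM judges the text complete (step two)
    wait_for_speech(timeout)  -> the next utterance's ASR text, or None on timeout
    downstream(text)          -> the subsequent processing flow
    """
    text = asr_recognize()                                   # step one: ASR
    waits = 0
    while not llm_is_complete(text) and waits < max_waits:   # step two: LLM check
        waits += 1
        more = wait_for_speech(wait_ms)                      # step four: wait 100-1000 ms
        if more is None:
            break                                            # timeout: stop waiting
        text = text + more                                   # merge the ASR texts
    downstream(text)                                         # step three: send onward
    return text
```

With the claimed defaults (waiting threshold of 100-1000 ms, at most 2 waits), the loop either merges the follow-up speech into one complete text or times out and forwards the partial text unchanged.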
Specifically, in step one of this embodiment, the speech recognition is either offline speech recognition (the ASR + VAD scheme, also called offline ASR: recognition starts only after the utterance ends, and sentence-break detection is called once when the VAD detects silence) or online real-time speech recognition (the ASR + endpoint-detection scheme, also called online ASR: recognition runs while the user is speaking, and sentence-break detection is called only when an Endpoint occurs).
Specifically, the waiting-duration threshold in step four of this embodiment is 100 ms to 1000 ms and may be adjusted to the actual situation.
Further, the upper limit on the number of waits in step four of this embodiment is 2, and it may also be adjusted to the actual situation.
Preferably, an LLM is a general-purpose model that can chat, answer questions, generate content, and so on, and its ICL (in-context learning) capability is increasingly strong. Based on an LLM, only one Prompt needs to be written to implement the sentence-break detection function, i.e., to detect whether the semantics are complete. An example Prompt: "Ignoring punctuation, does the following sentence express a complete meaning? Please answer directly with 'complete' or 'incomplete'. The sentence is:\n hello, what is your name", where "hello, what is your name" is the result of ASR recognition of the user's speech and "\n" is a line feed. More and more enterprises deploy LLMs privately, and since only one Prompt is needed, the function can be implemented on top of such an LLM. In practice, this function can share one LLM service with other LLM functions; only its Prompt differs from theirs, and the investment to complete it is roughly one person-day.
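Wrapping the Prompt above around a generic LLM call could look like the following sketch. The `llm` callable is an assumption standing in for any privately deployed model endpoint, and the English prompt wording paraphrases the patent's example:

```python
# Assumed prompt wording, paraphrasing the patent's example Prompt.
PROMPT_TEMPLATE = (
    "Ignoring punctuation, does the following sentence express a complete "
    "meaning? Answer directly with \"complete\" or \"incomplete\". "
    "The sentence is:\n{text}"
)

def is_sentence_complete(llm, asr_text):
    """llm: any callable that takes a prompt string and returns the model's reply.

    Returns True when the LLM judges the ASR text semantically complete.
    """
    reply = llm(PROMPT_TEMPLATE.format(text=asr_text))
    # Treat any reply other than "incomplete" as complete -- a permissive
    # default, in the spirit of the patent's fallback of sending text onward
    # rather than waiting indefinitely.
    return "incomplete" not in reply.strip().lower()
```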
Preferably, the comprehensive capability of an LLM far exceeds that of a task-specific NLP model. This is because the LLM, a GPT (Generative Pre-trained Transformer), completes pre-training on massive text data, then instruction fine-tuning on massive SFT (supervised fine-tuning) data, and often further steps such as reinforcement learning. This training gives the LLM strong instruction-following ability and strong in-context learning (ICL) ability, i.e., a strong capability to understand and execute Prompts, which ensures the accuracy of semantic sentence-break detection.
Example 1
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features replaced by equivalents, and that such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. A method for handling speaker pauses in speech recognition, characterized by comprising the following steps:
step one: acquire speech and recognize it with ASR;
step two: perform sentence-break detection by prompting an LLM (large language model);
step three: if step two finds the meaning of the broken-off sentence complete, output the ASR text to the subsequent processing flow;
step four: if the meaning is incomplete, wait up to a time-length threshold for further speech; if speech appears, wait for it to finish, merge the two or more ASR texts, and send the result to the subsequent processing flow; if no speech is found within the threshold, stop waiting and send the ASR text directly to the subsequent flow.
2. The method of handling speaker pauses in speech recognition according to claim 1, wherein in step one the speech recognition is offline speech recognition or online real-time speech recognition.
3. The method of handling speaker pauses in speech recognition according to claim 1, wherein the waiting-duration threshold in step four is 100-1000 ms.
4. The method of handling speaker pauses in speech recognition according to claim 1, wherein the upper limit on the number of waits in step four is 2.
CN202311481957.8A 2023-11-08 2023-11-08 Speaker pause processing method in speech recognition Pending CN117524192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311481957.8A CN117524192A (en) 2023-11-08 2023-11-08 Speaker pause processing method in speech recognition


Publications (1)

Publication Number Publication Date
CN117524192A (en) 2024-02-06

Family

ID=89761920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311481957.8A Pending CN117524192A (en) 2023-11-08 2023-11-08 Speaker pause processing method in speech recognition

Country Status (1)

Country Link
CN (1) CN117524192A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140372119A1 (en) * 2008-09-26 2014-12-18 Google, Inc. Compounded Text Segmentation
US20190318759A1 (en) * 2018-04-12 2019-10-17 Qualcomm Incorporated Context-based detection of end-point of utterance
CN112995419A (en) * 2021-02-05 2021-06-18 支付宝(杭州)信息技术有限公司 Voice conversation processing method and system
KR20220158573A (en) * 2021-05-24 2022-12-01 네이버 주식회사 Method and system for controlling for persona chatbot
US20230083512A1 (en) * 2021-09-10 2023-03-16 Salesforce.Com, Inc. Systems and methods for factual extraction from language model
KR20230129875A (en) * 2022-03-02 2023-09-11 네이버 주식회사 Method and system for goods recommendation
CN116775183A (en) * 2023-05-31 2023-09-19 腾讯科技(深圳)有限公司 Task generation method, system, equipment and storage medium based on large language model
CN116823203A (en) * 2023-07-17 2023-09-29 先看看闪聘(江苏)数字科技有限公司 Recruitment system and recruitment method based on AI large language model
CN116955561A (en) * 2023-07-24 2023-10-27 百度国际科技(深圳)有限公司 Question answering method, question answering device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN107437415B (en) Intelligent voice interaction method and system
CN107993665B (en) Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
CN110364178B (en) Voice processing method and device, storage medium and electronic equipment
CN112233680A (en) Speaker role identification method and device, electronic equipment and storage medium
CN111816172A (en) Voice response method and device
CN113076770A (en) Intelligent figure portrait terminal based on dialect recognition
CN111009235A (en) Voice recognition method based on CLDNN + CTC acoustic model
CN114708856A (en) Voice processing method and related equipment thereof
Raux Flexible turn-taking for spoken dialog systems
Montenegro et al. Analysis of the sensitivity of the End-Of-Turn Detection task to errors generated by the Automatic Speech Recognition process
Jia et al. A deep learning system for sentiment analysis of service calls
CN112185392A (en) Voice recognition processing system for power supply intelligent client
CN117524192A (en) Speaker pause processing method in speech recognition
KR20210123545A (en) Method and apparatus for conversation service based on user feedback
CN111009236A (en) Voice recognition method based on DBLSTM + CTC acoustic model
CN113345423B (en) Voice endpoint detection method, device, electronic equipment and storage medium
CN113946670A (en) Contrast type context understanding enhancement method for dialogue emotion recognition
CN114328867A (en) Intelligent interruption method and device in man-machine conversation
CN113160821A (en) Control method and device based on voice recognition
CN112506405A (en) Artificial intelligent voice large screen command method based on Internet supervision field
JP2005258235A (en) Interaction controller with interaction correcting function by feeling utterance detection
CN110910904A (en) Method for establishing voice emotion recognition model and voice emotion recognition method
WO2023092399A1 (en) Speech recognition method, speech recognition apparatus, and system
CN116483960B (en) Dialogue identification method, device, equipment and storage medium
KR102533368B1 (en) Method and system for analyzing fluency using big data

Legal Events

Date Code Title Description
PB01 Publication
CB02 Change of applicant information

Country or region after: China

Address after: Room 911, 9th Floor, Block B, Xingdi Center, Building 2, No.10, Jiuxianqiao North Road, Jiangtai Township, Chaoyang District, Beijing, 100000

Applicant after: Beijing Zhongke Shenzhi Technology Co.,Ltd.

Address before: Room 605, 6th Floor, Block B, Xingdi Center, Building 2, No. 10 Jiuxianqiao North Road, Jiangtai Township, Chaoyang District, Beijing, 100000

Applicant before: Beijing Zhongke Shenzhi Technology Co.,Ltd.

Country or region before: China

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination