CN114420103A - Voice processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114420103A
CN114420103A
Authority
CN
China
Prior art keywords
processed, text, voice information, voice, preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210077930.1A
Other languages
Chinese (zh)
Inventor
张文权
吕贵林
高洪伟
马剑桥
富文泰
高楚
张彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
FAW Group Corp
Original Assignee
FAW Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by FAW Group Corp filed Critical FAW Group Corp
Priority to CN202210077930.1A priority Critical patent/CN114420103A/en
Publication of CN114420103A publication Critical patent/CN114420103A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the invention discloses a voice processing method, a voice processing device, an electronic device and a storage medium, wherein the method comprises the following steps: when voice information to be processed is monitored, determining the ending time of the voice information to be processed; if the interval duration between the current time and the ending time is greater than a preset mute duration threshold, performing integrity analysis on the text to be processed corresponding to the voice information to be processed; and when the integrity analysis result meets a preset analysis result, determining a regulation and control instruction corresponding to the text to be processed, so as to execute a corresponding function based on the regulation and control instruction. According to the technical scheme of the embodiment of the invention, the voice activity of the user is dynamically monitored while whether the input of the user's voice command has finished is judged automatically, which improves the intelligence of the vehicle-mounted voice system and the success rate of human-computer voice interaction.

Description

Voice processing method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the field of intelligent voice interaction, in particular to a voice processing method and device, electronic equipment and a storage medium.
Background
At present, with the popularization of man-machine intelligent voice interaction technology in vehicle-mounted systems, a user can exchange information with a vehicle effectively using natural language. For example, after a user inputs an instruction to the vehicle by voice, the vehicle-mounted system can receive and recognize the instruction so as to control the vehicle to execute a corresponding action.
In the prior art, a vehicle typically detects the user's voice activity from the audio the user inputs: if no new audio is received within a period of time after the user's last audio, the system judges that the input of the user's voice instruction has finished, and then recognizes and processes the received audio.
However, different users have different habits when inputting voice commands, and a user inevitably pauses while inputting audio. In that case, under the prior-art scheme, the system may recognize the partial content before the pause as a complete command and stop receiving the user's subsequent audio. The audio received by the system is then incomplete, the corresponding voice command cannot be generated, and the user's experience of the voice interaction is degraded.
Disclosure of Invention
The invention provides a voice processing method, a voice processing device, an electronic device and a storage medium, which dynamically monitor the voice activity of a user while automatically judging whether the input of the user's voice command has finished, thereby improving the intelligence of the vehicle-mounted voice system and the success rate of human-computer voice interaction.
In a first aspect, an embodiment of the present invention provides a speech processing method, where the method includes:
when the voice information to be processed is monitored, determining the ending time of the voice information to be processed;
if the interval duration between the current time and the ending time is greater than a preset mute duration threshold, performing integrity analysis on the text to be processed corresponding to the voice information to be processed;
and when the integrity analysis result meets a preset analysis result, determining a regulation and control instruction corresponding to the text to be processed so as to execute a corresponding function based on the regulation and control instruction.
In a second aspect, an embodiment of the present invention further provides a speech processing apparatus, where the apparatus includes:
a to-be-processed voice information monitoring module, which is used for determining the ending time of the to-be-processed voice information when the to-be-processed voice information is monitored;
the integrity analysis module is used for carrying out integrity analysis on the text to be processed corresponding to the voice information to be processed if the interval duration between the current time and the ending time is greater than a preset mute duration threshold;
and the regulating and controlling instruction determining module is used for determining a regulating and controlling instruction corresponding to the text to be processed when the integrity analysis result meets a preset analysis result so as to execute a corresponding function based on the regulating and controlling instruction.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the voice processing method according to any embodiment of the present invention.
In a fourth aspect, embodiments of the present invention further provide a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a speech processing method according to any one of the embodiments of the present invention.
According to the technical scheme of the embodiment of the invention, when voice information to be processed is monitored, the ending time of the voice information to be processed is determined, so that the mute duration of the currently received audio can be judged. If the interval duration between the current time and the ending time is greater than the preset mute duration threshold, integrity analysis is performed on the text to be processed corresponding to the voice information to be processed. When the integrity analysis result meets the preset analysis result, a regulation and control instruction corresponding to the text to be processed is determined, so that a corresponding function is executed based on the regulation and control instruction. In this way, whether the input of the user's voice command has finished is judged automatically while the user's voice activity is dynamically monitored, which solves the problem of partial user audio being processed as a complete command, improves the intelligence of the vehicle-mounted voice system and the success rate of human-machine voice interaction, and improves the user's experience.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, a brief description of the drawings used in describing the embodiments is given below. It should be clear that the described drawings show only some of the embodiments of the invention, not all of them, and that a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a speech processing method according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a speech processing method according to a second embodiment of the present invention;
fig. 3 is a flowchart illustrating a speech processing method according to a third embodiment of the present invention;
fig. 4 is a block diagram of a speech processing apparatus according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a voice processing method according to the first embodiment of the present invention. This embodiment is applicable to the situation in which a vehicle-mounted voice system recognizes and processes a voice instruction input by a user. The method may be executed by a voice processing apparatus, which may be implemented in the form of software and/or hardware; the hardware may be an electronic device such as a mobile terminal, a PC terminal or a server.
As shown in fig. 1, the method specifically includes the following steps:
s110, when the voice information to be processed is monitored, determining the ending time of the voice information to be processed.
The voice information to be processed may be voice information sent by a user and monitored by the vehicle-mounted voice system. For example, after the vehicle-mounted voice system receives the audio information "please start navigation to City A" sent by the user, the system may use that audio information as the voice information to be processed. Further, when the system uses a piece of audio information sent by the user as the voice information to be processed, it may also determine the ending time of that audio. In practice, the ending time of the voice information to be processed may be the time at which the user stops speaking after completing the audio input, or the time of at least one pause made by the user while issuing the voice instruction.
In this embodiment, the voice information to be processed can be monitored in various ways. For example, the vehicle-mounted voice system may monitor the voice activity of the user in real time and determine the ending time of the audio information input by the user; this can be understood as the vehicle-mounted voice system being in an awake state at all times after the vehicle is started. In practice, in order to save system resources, the vehicle-mounted voice system may instead monitor the audio sent by the user only after receiving a system start instruction. For example, a user may start the vehicle-mounted voice system by voice or by a touch operation and then input the related audio information, and the system determines the voice information to be processed and its ending time based on the received audio. It should be understood by those skilled in the art that the specific manner of monitoring the voice information to be processed may be selected according to the actual situation, which is not specifically limited by the embodiment of the present disclosure.
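The monitoring step above can be sketched in Python. This is an illustrative assumption, not the patent's implementation: audio is modeled as a stream of (timestamp, energy) frames, and the ending time is the timestamp of the last frame whose energy exceeds a simple voicing threshold (the threshold value is an invented tuning parameter).

```python
# Illustrative sketch: determining the ending time of a monitored utterance
# from a stream of audio frames. Each frame is a (timestamp_ms, energy) pair;
# a frame counts as voiced when its energy exceeds a simple threshold.
VOICED_ENERGY_THRESHOLD = 0.02  # assumed tuning value, not from the patent

def utterance_end_time(frames):
    """Return the timestamp of the last voiced frame, or None if only silence."""
    end_time = None
    for timestamp_ms, energy in frames:
        if energy > VOICED_ENERGY_THRESHOLD:
            end_time = timestamp_ms
    return end_time
```

A real system would update this incrementally as frames arrive rather than scanning a completed list.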
S120, if the interval duration between the current time and the ending time is greater than a preset mute duration threshold, performing integrity analysis on the text to be processed corresponding to the voice information to be processed.
In this embodiment, the interval duration between the current time and the ending time may be understood as a period of mute duration. Specifically, after the vehicle-mounted voice system determines the ending time of the monitored voice information to be processed, it can take that time as the starting point of the mute duration and accumulate the mute duration from it, so as to preliminarily judge whether the input of the user's voice instruction has finished.
Further, if it is determined that the interval duration between the current time and the ending time (i.e., the mute duration) is greater than the preset mute duration threshold, it is preliminarily determined that the current input of the user's voice instruction has finished, and the subsequent processing procedure is executed. Illustratively, after the user utters the audio "please turn on navigation to City A", the vehicle-mounted voice system accumulates the mute duration following the input, and when the mute duration is judged to be greater than a preset mute duration threshold of 300 ms, the subsequent processing steps can be executed on the voice information to be processed. It should be noted that in a vehicle-mounted voice system the preset mute duration threshold is generally selected within a range of 300 to 800 ms, but in practice the user may also manually set the threshold to a value outside this range, which is not specifically limited in the embodiment of the present disclosure.
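The silence check above reduces to a single comparison. A minimal sketch, with an assumed function name and millisecond interface; the 300 ms default reflects the 300 to 800 ms range mentioned in the text:

```python
PRESET_MUTE_DURATION_MS = 300  # within the 300-800 ms range given above

def mute_duration_exceeded(current_time_ms, end_time_ms,
                           threshold_ms=PRESET_MUTE_DURATION_MS):
    """True when the silence since the utterance ended exceeds the threshold."""
    return (current_time_ms - end_time_ms) > threshold_ms
```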
In this embodiment, when the system monitors a section of voice information to be processed and determines that the subsequent mute duration is greater than the preset mute duration threshold, it is further necessary to perform integrity analysis on the voice information to be processed in order to accurately determine whether the user has finished inputting the voice instruction this time.
Specifically, the vehicle-mounted voice system may recognize the voice information to be processed using a pre-trained voice recognition model, so as to obtain the text to be processed corresponding to the voice information to be processed. The integrity of the text to be processed is then judged to determine whether the input of the user's voice instruction has finished. If the input is determined to be complete, the subsequent processing step is executed; if not, the system continues to monitor the audio information subsequently sent by the user, or sends the user a prompt in voice or text form through the vehicle-mounted voice system or the display device. Illustratively, when the user pauses after uttering an incomplete audio such as "please open, go to" and the pause duration exceeds the preset mute duration threshold, the system can recognize the speech with the voice recognition model and obtain the corresponding text to be processed. After judging the integrity of that text, the system determines that it is not a complete text and that a corresponding instruction cannot be generated from it, so the system prompts the user by voice to continue issuing the voice instruction and keeps monitoring the audio information the user subsequently sends.
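The branch just described can be sketched as follows. The completeness rule here is a toy stand-in (an utterance trailing off in a function word is treated as unfinished); the actual integrity analysis is performed by the model-based submodules described later.

```python
def naive_is_complete(text):
    # Toy rule standing in for the integrity analysis: a request that trails
    # off in a function word is treated as unfinished.
    return not text.rstrip().lower().endswith(("to", "go to", "open"))

def handle_utterance(text, is_complete=naive_is_complete):
    """Dispatch complete text for instruction generation; otherwise keep listening."""
    if is_complete(text):
        return ("DISPATCH", text)        # hand over for instruction generation
    return ("KEEP_LISTENING", text)      # prompt the user / await more audio
```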
And S130, when the integrity analysis result meets a preset analysis result, determining a regulation and control instruction corresponding to the text to be processed so as to execute a corresponding function based on the regulation and control instruction.
In this embodiment, after the vehicle-mounted voice system performs integrity analysis on the text to be processed corresponding to the voice information to be processed, a corresponding integrity analysis result can be obtained. The integrity analysis result at least reflects whether the vehicle can generate a corresponding regulation and control instruction based on the text to be processed; correspondingly, the preset analysis result is the information indicating that the vehicle can generate the corresponding regulation and control instruction based on the text to be processed.
Based on this, when the system determines that the integrity analysis result corresponding to the text to be processed meets the preset analysis result, it can issue the text to be processed to a module that deeply analyzes the text and generates the corresponding regulation and control instruction. That module can then issue the generated regulation and control instruction to the control units associated with the various parts of the vehicle, so that those units control the vehicle to execute specific actions and the function corresponding to the user's audio information is realized.
It should be noted that, if it is determined that the integrity analysis result of the text to be processed does not satisfy the preset analysis result, the vehicle-mounted voice system continues to monitor the user's subsequent audio information and may determine new voice information to be processed and its corresponding ending time. Then, when the new mute duration exceeds the preset mute duration threshold, the system converts the newly monitored voice information to be processed into a text to be processed, performs integrity analysis again on that text in combination with the other texts to be processed, and obtains a new integrity analysis result, so as to determine whether the new integrity analysis result satisfies the preset analysis result.
According to the technical scheme of this embodiment, when voice information to be processed is monitored, the ending time of the voice information to be processed is determined, so that the mute duration of the currently received audio can be judged. If the interval duration between the current time and the ending time is greater than the preset mute duration threshold, integrity analysis is performed on the text to be processed corresponding to the voice information to be processed. When the integrity analysis result meets the preset analysis result, a regulation and control instruction corresponding to the text to be processed is determined, so that a corresponding function is executed based on the regulation and control instruction. Whether the input of the user's voice command has finished is thus judged automatically while the user's voice activity is dynamically monitored, which solves the problem of partial user audio being processed as a complete command, improves the intelligence of the vehicle-mounted voice system and the success rate of human-machine voice interaction, and improves the user's experience.
Example two
Fig. 2 is a schematic flowchart of a voice processing method according to a second embodiment of the present invention. On the basis of the foregoing embodiment, the voice information to be processed is acquired by an audio receiving device and denoised, which further improves the accuracy with which the system identifies the user's audio information. After word segmentation is performed on the text to be processed, the completion semantic discrimination submodule and the grammar detection submodule analyze the integrity of the text, so that the completeness of the user's voice instruction is judged accurately and a corresponding regulation and control instruction is generated, or the user's subsequent audio information continues to be monitored; this differentiated processing realizes dynamic detection of the user's voice activity. Finally, a fallback timeout strategy is deployed in the system, further improving the user's experience. For the specific implementation, refer to the technical scheme of this embodiment. Technical terms that are the same as or correspond to those of the above embodiments are not repeated herein.
As shown in fig. 2, the method specifically includes the following steps:
s210, collecting voice information to be processed based on a preset audio receiving device; and determining the time stamp of the voice information to be processed, and taking the time stamp corresponding to the finally received voice information to be processed as the end time of the voice information to be processed.
The preset audio receiving device may be any of various voice collecting devices provided with microphones, for example a device that is connected to a voice transmission device and comprises an audio input interface, a gain amplifier and a plurality of audio processing modules. The audio receiving device and the system may be connected in a wired or wireless manner.
In this embodiment, after the system collects the voice information to be processed, the timestamp of each section of the voice information to be processed can be determined, and the corresponding ending time is then determined from the timestamps. A timestamp is data generated by digital signature technology; the signed object includes information such as the original file information, the signature parameters and the signing time. It can be understood that the vehicle-mounted voice system can at least determine the starting time and the ending time of the voice information to be processed based on the timestamp information.
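Deriving the ending time from per-segment timestamps can be sketched as below. The (start_ms, end_ms) pair representation of each received audio chunk is an assumption made for illustration:

```python
def end_time_from_segments(segments):
    """Ending time = end timestamp of the last received chunk (None if empty).

    segments: list of (start_ms, end_ms) pairs, one per received audio chunk.
    """
    if not segments:
        return None
    return max(end_ms for _, end_ms in segments)
```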
Optionally, after the voice information to be processed is collected, denoising processing can be performed on the voice information to be processed to obtain voice information to be recognized; and inputting the voice information to be recognized into a voice recognition module to obtain a text to be processed corresponding to the voice information to be recognized, and carrying out integrity analysis on the text to be processed when the interval duration between the current time and the ending time is greater than a preset mute duration threshold.
In this embodiment, when the system acquires the voice information to be processed through the audio receiving device, all signal processing devices, both analog and digital, are susceptible to noise. Therefore, in order to further improve the accuracy with which the system identifies the user's audio information, the system can denoise the voice information to be processed. Noise includes random noise and white noise with a uniform frequency distribution, and denoising is the removal of such noise from the signal. For example, when a driver inputs audio information into the vehicle-mounted voice system, various noises may exist inside and outside the vehicle; the system can remove the noise from the received audio information based on a denoising technique, so as to ensure the accuracy of the text to be processed obtained in the subsequent voice conversion.
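As a toy stand-in for the denoising stage described above, the sketch below implements a simple noise gate that zeroes samples beneath an estimated noise floor. Production systems use spectral methods; this only illustrates where the stage sits in the pipeline.

```python
def noise_gate(samples, noise_floor):
    """Crude denoising: zero out samples whose magnitude is below the noise floor.

    samples: list of float amplitudes; noise_floor: estimated noise amplitude.
    """
    return [s if abs(s) > noise_floor else 0.0 for s in samples]
```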
In this embodiment, the voice recognition module is a module for converting the voice information to be processed into a text to be processed; it can be understood that the voice recognition module includes a pre-trained voice recognition model. Illustratively, a convolutional neural network is used as the algorithm model, and a model training set and a verification set are constructed from a plurality of voices related to the user. The constructed sets can then be used to train the voice recognition model while the model parameters are estimated and optimized. When the target detection evaluation index of the algorithm model reaches a preset threshold, the voice recognition model is considered trained, and the voice recognition module can output the corresponding text to be processed based on the voice information to be processed.
S220, if the interval duration between the current time and the ending time is greater than a preset mute duration threshold, performing word segmentation processing on the text to be processed based on a semantic understanding module to obtain a field to be analyzed; and carrying out integrity analysis on the text to be processed according to the field to be analyzed.
The vehicle-mounted Voice system may perform logic determination on the mute duration by using a Voice Activity Detection (VAD) module. VAD, also known as voice endpoint detection or voice boundary detection, is capable of identifying and eliminating long periods of silence from a voice signal stream.
In this embodiment, when the vehicle-mounted voice system determines that the mute duration is greater than the preset mute duration threshold, word segmentation can be performed on the text to be processed by the semantic understanding module. For example, when the text to be processed is "please start navigation to City A", the system's word segmentation can determine that the fields to be analyzed are "start", "go to", "City A" and "navigation". The semantic understanding module further includes a completion semantic discrimination submodule and a grammar detection submodule; after the two submodules process at least one field to be analyzed, the integrity of the corresponding text to be processed can be determined, as described below.
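The segmentation step can be sketched as below. For English text a whitespace split suffices; a real semantic understanding module (particularly for Chinese) would use a trained segmenter, so this is only a placeholder for the interface:

```python
def segment(text):
    """Toy word segmentation: split the text into candidate fields to analyse."""
    return text.lower().split()
```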
Optionally, at least one semantic feature corresponding to the text to be processed is constructed according to the fields to be analyzed; the at least one semantic feature is input into the completion semantic discrimination submodule to obtain a completion semantic discrimination result, and into the grammar detection submodule to obtain a grammar integrity detection result.
Continuing with the above example, when the four fields to be analyzed are determined to be "open", "go to", "City A" and "navigate", a semantic feature "open navigation to City A" corresponding to the text to be processed can be constructed from these fields. On the one hand, the completion semantic discrimination submodule can determine from this semantic feature that the text to be processed carries the semantics of "audio input completed", yielding a completion semantic discrimination result indicating that the audio input is finished. On the other hand, the grammar detection submodule can determine from the semantic feature that the grammatical structure of the text to be processed is complete, that is, the subsequent modules can generate corresponding instructions based on the text, yielding a grammar integrity detection result indicating that the grammar of the text to be processed is complete.
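The two submodules above can be sketched with rule-based stand-ins. The actual submodules are model-based; the rules below (a function-word check and a navigation-slot check) are invented purely to show how the two results combine into one integrity decision:

```python
FUNCTION_WORDS = {"to", "go", "open", "the"}  # assumed illustrative set

def completion_semantics_ok(fields):
    # Stand-in for the completion semantic discrimination submodule: an
    # utterance whose last field is a function word reads as unfinished.
    return bool(fields) and fields[-1] not in FUNCTION_WORDS

def grammar_ok(fields):
    # Stand-in for the grammar detection submodule: a navigation request needs
    # a "navigation" action and must not end on the destination preposition.
    return "navigation" in fields and fields[-1] != "to"

def integrity_result(fields):
    """Both submodule results must pass for the text to count as complete."""
    return completion_semantics_ok(fields) and grammar_ok(fields)
```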
S230, when the completion semantic discrimination result and the grammar integrity detection result meet preset conditions, determining at least one target regulation and control instruction corresponding to the text to be processed; and issuing the at least one target regulation and control instruction to each application program in the vehicle-mounted machine system so that each application program responds to the at least one target regulation and control instruction.
In this embodiment, when the completion semantic discrimination submodule determines that the text to be processed carries the semantics of "audio input completed" and the grammar detection submodule determines that the grammar of the text to be processed is complete, the text to be processed can be sent to the module that deeply analyzes it and generates the corresponding regulation and control instruction. This module stores in advance a mapping table representing the association between texts and regulation and control instructions; when the module determines the corresponding target regulation and control instruction from the text to be processed, it can feed the target regulation and control instruction back to each application program of the vehicle-mounted machine system, so that the application programs control the vehicle to execute the corresponding actions and realize the function corresponding to the audio information input by the user. It should be understood by those skilled in the art that the system may generate one or more target regulation and control instructions after analyzing the text to be processed, the specific number being determined by the audio information sent by the user.
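The pre-stored mapping table and the fan-out to application programs can be sketched as below. The intent names, command codes and application identifiers are all invented for illustration; the text only states that such a table exists:

```python
# Hypothetical mapping table between analysed intents and control commands.
COMMAND_TABLE = {
    "navigate": "NAV_START",
    "air_conditioning": "AC_ON",
}

def dispatch(intent, apps):
    """Issue the target command for an intent to every registered application."""
    command = COMMAND_TABLE.get(intent)
    if command is None:
        return []  # no instruction can be generated for an unknown intent
    return [(app, command) for app in apps]
```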
It should be noted that, when the integrity analysis result does not satisfy the preset analysis result (that is, at least one of the completion semantic discrimination result and the grammar integrity detection result does not satisfy the preset condition), the system monitors the voice information to be processed after the end time and determines the end time corresponding to the newly monitored voice information to be processed. When the interval duration between the current time and this end time is greater than the preset mute duration threshold, all the voice information to be processed before the end time is determined, and integrity analysis is performed on the text to be processed corresponding to that voice information.
For example, when the text to be processed is determined to be "please open to go", the corresponding fields to be analyzed are "open" and "go". At this time, the completion semantic discrimination submodule may determine, based on the corresponding semantic features, that the text to be processed does not carry completion semantics, indicating that the input of the user's audio information has not finished; meanwhile, the grammar detection submodule may also judge, based on the corresponding semantic features, that the user has not described the destination information, so the grammar of the text to be processed is not complete. Therefore, the system can control the audio receiving device to continue acquiring the user's audio information, issue a corresponding instruction to the voice recognition module, and continue denoising and semantic recognition on the subsequently acquired audio information to obtain the corresponding text to be processed. After the ending time of the newly acquired voice information to be processed is again determined according to the scheme of this embodiment, all texts to be processed are combined and integrity analysis is performed on the combined text once more: if the resulting integrity analysis result meets the preset analysis result, a corresponding regulation and control instruction is generated; if it still does not meet the preset analysis result, the user's subsequent audio information continues to be monitored.
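The merge-and-reanalyze step above can be sketched as follows. The completeness predicate is an illustrative stand-in (a destination must follow "go to"); the real check is performed by the submodules described earlier.

```python
def is_complete(text):
    # Stand-in for the completion-semantic and grammar checks: require that
    # something follows "go to" (illustrative assumption, not the patent's rule).
    return "go to" in text and not text.rstrip().endswith("go to")

def merge_and_analyze(segments):
    """Join all monitored text segments into one text to be processed and
    re-run the integrity analysis on the combined result."""
    merged = " ".join(segments)
    return merged, is_complete(merged)
```

The first pause over "please open ... go to" fails the check and the system keeps listening; once the destination arrives, the combined text passes and an instruction can be generated.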
It can be understood that, by performing VAD detection on the audio information input by the user, the user's voice instruction can continue to be received after the user's voice pauses. This process not only realizes dynamic detection of the user's voice activity, but also substantially realizes dynamic extension of the mute duration threshold, thereby effectively avoiding the problem that the system cannot generate an instruction because it processed incomplete audio information.
Meanwhile, in practical application, a fallback timeout strategy can be deployed in the system. Specifically, a timeout duration can be preset. When the user pauses many times while inputting audio, so that the system can never generate a corresponding target regulation and control instruction based on the combined text to be processed, and the total duration of voice recognition exceeds the preset timeout duration, the system can prompt and guide the user in the form of a voice broadcast or displayed text. Deploying this fallback timeout strategy further improves the user experience.
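The fallback timeout strategy can be sketched as a simple check run alongside recognition. The return values are hypothetical labels for illustration; the patent only specifies that the user is prompted by voice broadcast or displayed text.

```python
def check_timeout(recognition_start, now, timeout_s, instruction_generated):
    """Return a user-facing action when recognition has run past the preset
    timeout duration without producing any target instruction; otherwise
    keep listening. Times are in seconds."""
    if not instruction_generated and (now - recognition_start) > timeout_s:
        return "prompt_user"   # e.g. voice broadcast or on-screen guidance
    return "keep_listening"
```

A caller would invoke this each time a silence window elapses without the integrity analysis succeeding.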
According to the technical scheme of this embodiment, the voice information to be processed is collected by the audio receiving device and denoised, further improving the accuracy with which the system recognizes the user's audio information; after word segmentation of the text to be processed, its integrity is analyzed by the completion semantic discrimination submodule and the grammar detection submodule, so that the completeness of the user's voice instruction is judged accurately, and either a corresponding regulation and control instruction is generated or the user's subsequent audio information continues to be monitored, realizing dynamic detection of the user's voice activity through differentiated processing; finally, a fallback timeout strategy is deployed in the system, further improving the user experience.
EXAMPLE III
As an alternative to the foregoing embodiments, fig. 3 is a flowchart illustrating a speech processing method according to a third embodiment of the present invention. To clearly describe the technical solution of this embodiment, the application scenario may be described as recognizing and processing a voice instruction input by a user based on a voice system; however, the application is not limited to this scenario and may be applied to various scenarios requiring processing of monitored voice information.
Referring to fig. 3, when a user inputs audio information through the vehicle-mounted voice system, the system may use the received audio information as the voice information to be processed, perform denoising processing on the received audio information, and further recognize the denoised voice information to be processed by using the voice recognition module to obtain a corresponding text to be processed.
With reference to fig. 3, when the user stops inputting audio information into the system, the VAD detection module may preliminarily determine that the voice activity of the user is finished, and at this time, the system may determine the integrity of the text to be processed by using the language model module. Specifically, it may be determined whether the text to be processed has a semantic meaning of "audio input complete", and meanwhile, in order to ensure that the subsequent module can accurately generate the corresponding target control instruction, it is further required to detect the grammar of the text to be processed, that is, to determine whether the text to be processed meets the requirement of preset grammar integrity. When one of the two judgment results does not meet the preset result, the system feeds the judgment result back to the voice recognition module, namely, the module is controlled to continuously maintain the voice recognition function, receive the subsequent audio information of the user, and continuously convert the voice information to be processed into the corresponding text to be processed. And if the judgment result output by the language model module does not meet the preset result all the time within the specified time, prompting and guiding the user in a voice broadcast or text display mode based on a timeout strategy pre-deployed in the system.
With reference to fig. 3, when the two judgment results output by the language model module both satisfy the preset result, the system may issue the text to be processed to the semantic understanding module, and the semantic understanding module processes the text to be processed, so as to output the corresponding regulation and control instruction. Specifically, the dialog management function in the system may be utilized to distribute the control instruction corresponding to the text to be processed to each application program in the car machine system, and each application program executes a corresponding function based on the received control instruction and feeds back a corresponding reply to the user.
The beneficial effects of the above technical scheme are: while dynamically monitoring the user's voice activity, the system automatically judges whether the user's voice instruction has finished being input, which solves the problem of partial user audio being processed as a complete instruction, improves the intelligence of the vehicle-mounted voice system and the success rate of human-machine voice interaction, and improves the user experience.
Example four
Fig. 4 is a block diagram of a speech processing apparatus according to a fourth embodiment of the present invention, which is capable of executing a speech processing method according to any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the execution method. As shown in fig. 4, the apparatus specifically includes: a pending voice message listening module 310, an integrity analysis module 320, and a regulation instruction determination module 330.
The to-be-processed voice information monitoring module 310 is configured to determine an ending time of the to-be-processed voice information when the to-be-processed voice information is monitored.
And the integrity analysis module 320 is configured to perform integrity analysis on the to-be-processed text corresponding to the to-be-processed voice information if the interval duration between the current time and the ending time is greater than a preset mute duration threshold.
And a control instruction determining module 330, configured to determine, when the integrity analysis result meets a preset analysis result, a control instruction corresponding to the text to be processed, so as to execute a corresponding function based on the control instruction.
On the basis of the above technical solutions, the to-be-processed voice information monitoring module 310 includes a to-be-processed voice information acquisition unit and an end time determination unit.
And the voice information acquisition unit to be processed is used for acquiring the voice information to be processed based on a preset audio receiving device.
And the ending time determining unit is used for determining the time stamp of the voice information to be processed and taking the time stamp corresponding to the finally received voice information to be processed as the ending time of the voice information to be processed.
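A hedged sketch of the ending time determination and silence check performed by these units. Chunks are modeled as `(timestamp, audio_bytes)` pairs; the representation is an assumption for illustration, not the patent's data format.

```python
def end_time(chunks):
    """Take the timestamp of the last received voice chunk as the ending
    time of the voice information to be processed."""
    return max(ts for ts, _ in chunks)

def silence_exceeded(chunks, now, threshold_s):
    """True when the interval between the current time and the ending time
    is greater than the preset mute duration threshold."""
    return (now - end_time(chunks)) > threshold_s
```

Only when `silence_exceeded` returns true does the integrity analysis module take over.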
On the basis of the technical schemes, the voice processing device also comprises a text to be processed determining module.
The to-be-processed text determining module is used for denoising the to-be-processed voice information to obtain to-be-recognized voice information; and inputting the voice information to be recognized into a voice recognition module to obtain a text to be processed corresponding to the voice information to be recognized, and carrying out integrity analysis on the text to be processed when the interval duration between the current time and the ending time is greater than a preset mute duration threshold.
On the basis of the above technical solutions, the integrity analysis module 320 includes a word segmentation processing unit and an integrity analysis unit.
And the word segmentation processing unit is used for carrying out word segmentation processing on the text to be processed based on the semantic understanding module to obtain a field to be analyzed.
And the integrity analysis unit is used for carrying out integrity analysis on the text to be processed according to the field to be analyzed.
On the basis of the technical schemes, the semantic understanding module comprises a semantic discrimination completion sub-module and a grammar detection sub-module.
Optionally, the integrity analysis unit is further configured to construct at least one semantic feature corresponding to the text to be processed according to the field to be analyzed; and inputting the at least one semantic feature into the finished semantic discrimination submodule to obtain a finished semantic discrimination result, and inputting the at least one semantic feature into the grammar detection submodule to obtain a grammar integrity detection result.
On the basis of the above technical solutions, the regulation instruction determining module 330 includes a target regulation instruction determining unit and an instruction issuing unit.
And the target regulation and control instruction determining unit is used for determining at least one target regulation and control instruction corresponding to the text to be processed when the completed semantic discrimination result and the grammar integrity detection result both meet preset conditions.
And the instruction issuing unit is used for issuing the at least one target regulation and control instruction to each application program in the vehicle-mounted machine system so that each application program responds to the at least one target regulation and control instruction.
Optionally, the to-be-processed voice information monitoring module 310 is further configured to monitor, when the integrity analysis result does not satisfy the preset analysis result, the to-be-processed voice information after the ending time, and determine the ending time corresponding to the to-be-processed voice information that continues to be monitored, so as to determine, when the interval duration between the current time and the ending time is greater than the preset mute duration threshold, all the to-be-processed voice information before the ending time, and perform integrity analysis on the to-be-processed text corresponding to the to-be-processed voice information.
In the technical solution provided by this embodiment, when voice information to be processed is monitored, its ending time is determined so as to judge the mute duration of the currently received audio. If the interval duration between the current time and the ending time is greater than the preset mute duration threshold, integrity analysis is performed on the text to be processed corresponding to the voice information to be processed. When the integrity analysis result meets the preset analysis result, a regulation and control instruction corresponding to the text to be processed is determined, so that a corresponding function is executed based on the instruction. By automatically judging, while dynamically monitoring the user's voice activity, whether the user's voice instruction has finished being input, the problem of partial user audio being processed as a complete instruction is solved, the intelligence of the vehicle-mounted voice system and the success rate of human-machine voice interaction are improved, and the user experience is improved.
The voice processing device provided by the embodiment of the invention can execute the voice processing method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
It should be noted that, the units and modules included in the apparatus are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the embodiment of the invention.
EXAMPLE five
Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention. FIG. 5 illustrates a block diagram of an exemplary electronic device 40 suitable for use in implementing embodiments of the present invention. The electronic device 40 shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 5, electronic device 40 is embodied in the form of a general purpose computing device. The components of electronic device 40 may include, but are not limited to: one or more processors or processing units 401, a system memory 402, and a bus 403 that couples the various system components (including the system memory 402 and the processing unit 401).
Bus 403 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Electronic device 40 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 40 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 402 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)404 and/or cache memory 405. The electronic device 40 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 406 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 403 by one or more data media interfaces. Memory 402 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 408 having a set (at least one) of program modules 407 may be stored, for example, in memory 402, such program modules 407 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 407 generally perform the functions and/or methods of the described embodiments of the invention.
The electronic device 40 may also communicate with one or more external devices 409 (e.g., keyboard, pointing device, display 410, etc.), with one or more devices that enable a user to interact with the electronic device 40, and/or with any devices (e.g., network card, modem, etc.) that enable the electronic device 40 to communicate with one or more other computing devices. Such communication may be through input/output (I/O) interface 411. Also, the electronic device 40 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 412. As shown, the network adapter 412 communicates with the other modules of the electronic device 40 over the bus 403. It should be appreciated that although not shown in FIG. 5, other hardware and/or software modules may be used in conjunction with electronic device 40, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 401 executes various functional applications and data processing, for example, implementing a voice processing method provided by an embodiment of the present invention, by running a program stored in the system memory 402.
EXAMPLE six
An embodiment of the present invention also provides a storage medium containing computer-executable instructions for performing a speech processing method when executed by a computer processor.
The method comprises the following steps:
when the voice information to be processed is monitored, determining the ending time of the voice information to be processed;
if the interval duration of the current time and the ending time is greater than a preset mute duration threshold, performing integrity analysis on the text to be processed corresponding to the voice information to be processed;
and when the integrity analysis result meets a preset analysis result, determining a regulation and control instruction corresponding to the text to be processed so as to execute a corresponding function based on the regulation and control instruction.
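The three steps above can be sketched end to end as one function. The ASR, integrity analysis, and instruction lookup are passed in as callables because the patent treats them as separate modules; all names and the chunk representation are illustrative assumptions.

```python
def process_voice(chunks, now, silence_threshold_s, asr, analyze, lookup):
    """Hedged sketch of the claimed method:
    (1) determine the ending time of the monitored voice information,
    (2) perform integrity analysis once the mute duration threshold passes,
    (3) determine the regulation and control instruction when the analysis
        result meets the preset analysis result. Returns None otherwise."""
    end = max(ts for ts, _ in chunks)                 # step 1: ending time
    if (now - end) <= silence_threshold_s:
        return None                                   # still inside the mute window
    text = asr(b"".join(audio for _, audio in chunks))
    if not analyze(text):                             # step 2: integrity analysis
        return None
    return lookup(text)                               # step 3: target instruction(s)
```

With stub modules, a long enough silence after a complete utterance yields instructions, while a short silence or an incomplete text yields `None` and the system keeps listening.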
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for embodiments of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method of speech processing, comprising:
when the voice information to be processed is monitored, determining the ending time of the voice information to be processed;
if the interval duration of the current time and the ending time is greater than a preset mute duration threshold, performing integrity analysis on the text to be processed corresponding to the voice information to be processed;
and when the integrity analysis result meets a preset analysis result, determining a regulation and control instruction corresponding to the text to be processed so as to execute a corresponding function based on the regulation and control instruction.
2. The method of claim 1, wherein determining an ending time of the pending voice message when the pending voice message is monitored comprises:
acquiring the voice information to be processed based on a preset audio receiving device;
and determining the time stamp of the voice information to be processed, and taking the time stamp corresponding to the finally received voice information to be processed as the end time of the voice information to be processed.
3. The method of claim 1, further comprising:
denoising the voice information to be processed to obtain voice information to be recognized;
and inputting the voice information to be recognized into a voice recognition module to obtain a text to be processed corresponding to the voice information to be recognized, and carrying out integrity analysis on the text to be processed when the interval duration between the current time and the ending time is greater than a preset mute duration threshold.
4. The method of claim 1, wherein the performing integrity analysis on the text to be processed corresponding to the speech information to be processed comprises:
performing word segmentation processing on the text to be processed based on a semantic understanding module to obtain a field to be analyzed;
and carrying out integrity analysis on the text to be processed according to the field to be analyzed.
5. The method according to claim 4, wherein the semantic understanding module includes a completion semantic discrimination sub-module and a grammar detection sub-module, and the performing integrity analysis on the text to be processed according to the field to be analyzed includes:
constructing at least one semantic feature corresponding to the text to be processed according to the field to be analyzed;
and inputting the at least one semantic feature into the finished semantic discrimination submodule to obtain a finished semantic discrimination result, and inputting the at least one semantic feature into the grammar detection submodule to obtain a grammar integrity detection result.
6. The method according to claim 5, wherein when the integrity analysis result meets a preset analysis result, determining a regulation instruction corresponding to the text to be processed so as to execute a corresponding function based on the regulation instruction comprises:
when the semantic discrimination completion result and the grammar integrity detection result both meet preset conditions, determining at least one target regulation and control instruction corresponding to the text to be processed;
and issuing the at least one target regulation and control instruction to each application program in the vehicle-mounted machine system so that each application program responds to the at least one target regulation and control instruction.
7. The method of claim 1, further comprising:
and when the integrity analysis result does not meet the preset analysis result, monitoring the voice information to be processed after the ending time, determining the ending time corresponding to the voice information to be processed which continues to be monitored, determining all the voice information to be processed before the ending time when the interval duration between the current time and the ending time is greater than a preset mute duration threshold, and performing integrity analysis on the text to be processed corresponding to the voice information to be processed.
8. A speech processing apparatus, comprising:
the system comprises a to-be-processed voice information monitoring module, a processing module and a processing module, wherein the to-be-processed voice information monitoring module is used for determining the ending time of the to-be-processed voice information when the to-be-processed voice information is monitored;
the integrity analysis module is used for carrying out integrity analysis on the text to be processed corresponding to the voice information to be processed if the interval duration between the current time and the ending time is greater than a preset mute duration threshold;
and the regulating and controlling instruction determining module is used for determining a regulating and controlling instruction corresponding to the text to be processed when the integrity analysis result meets a preset analysis result so as to execute a corresponding function based on the regulating and controlling instruction.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the speech processing method of any of claims 1-7.
10. A storage medium containing computer-executable instructions for performing the speech processing method of any of claims 1-7 when executed by a computer processor.
CN202210077930.1A 2022-01-24 2022-01-24 Voice processing method and device, electronic equipment and storage medium Pending CN114420103A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210077930.1A CN114420103A (en) 2022-01-24 2022-01-24 Voice processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210077930.1A CN114420103A (en) 2022-01-24 2022-01-24 Voice processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114420103A true CN114420103A (en) 2022-04-29

Family

ID=81274532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210077930.1A Pending CN114420103A (en) 2022-01-24 2022-01-24 Voice processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114420103A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115512687A (en) * 2022-11-08 2022-12-23 之江实验室 Voice sentence-breaking method and device, storage medium and electronic equipment


Similar Documents

Publication Publication Date Title
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
CN108520743B (en) Voice control method of intelligent device, intelligent device and computer readable medium
CN107657950B (en) Automobile voice control method, system and device based on cloud and multi-command words
CN107886944B (en) Voice recognition method, device, equipment and storage medium
CN110047481B (en) Method and apparatus for speech recognition
CN111429899A (en) Speech response processing method, device, equipment and medium based on artificial intelligence
CN113362828B (en) Method and apparatus for recognizing speech
WO2020024620A1 (en) Voice information processing method and device, apparatus, and storage medium
CN108831477A (en) A kind of audio recognition method, device, equipment and storage medium
CN113779208A (en) Method and device for man-machine conversation
CN111833870A (en) Awakening method and device of vehicle-mounted voice system, vehicle and medium
CN111933149A (en) Voice interaction method, wearable device, terminal and voice interaction system
CN114582333A (en) Voice recognition method and device, electronic equipment and storage medium
CN114420103A (en) Voice processing method and device, electronic equipment and storage medium
CN111400463B (en) Dialogue response method, device, equipment and medium
CN108962226B (en) Method and apparatus for detecting end point of voice
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN113643704A (en) Test method, upper computer, system and storage medium of vehicle-mounted machine voice system
CN113330513A (en) Voice information processing method and device
CN112863496B (en) Voice endpoint detection method and device
CN113593565B (en) Intelligent home device management and control method and system
CN114155845A (en) Service determination method and device, electronic equipment and storage medium
CN113096651A (en) Voice signal processing method and device, readable storage medium and electronic equipment
CN112542157A (en) Voice processing method and device, electronic equipment and computer readable storage medium
CN113241073B (en) Intelligent voice control method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination