WO2022222045A1 - Speech information processing method, and device - Google Patents

Speech information processing method, and device

Info

Publication number
WO2022222045A1
WO2022222045A1 PCT/CN2021/088522 CN2021088522W WO2022222045A1 WO 2022222045 A1 WO2022222045 A1 WO 2022222045A1 CN 2021088522 W CN2021088522 W CN 2021088522W WO 2022222045 A1 WO2022222045 A1 WO 2022222045A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice information
valid
sensitivity
invalid
voice
Prior art date
Application number
PCT/CN2021/088522
Other languages
French (fr)
Chinese (zh)
Inventor
杨世辉
聂为然
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN202180001492.4A priority Critical patent/CN113330513A/en
Priority to PCT/CN2021/088522 priority patent/WO2022222045A1/en
Publication of WO2022222045A1 publication Critical patent/WO2022222045A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Definitions

  • the present application relates to the technical field of speech processing, and in particular to methods and devices for processing speech information.
  • In intelligent voice interaction scenarios, smart devices have two commonly used modes for listening to user voices: continuous listening mode and full-time wake-up-free mode, which is also called full-time listening mode. In the continuous listening or full-time listening state, the smart device needs to determine whether what the user says is a valid instruction addressed to it; that is, it needs to distinguish human-machine dialogue from human-human dialogue.
  • the voice information collected by the device includes chat data.
  • the rule matching module is often used, or the inference module (such as a neural network inference module) is used for judgment.
  • the voice information is a valid voice control command.
  • the validity of the same voice information, or of voice information with the same semantics, may differ between scenarios. For example, a sentence may be a valid voice control command in the current scenario but mere chat, and therefore invalid information, in another scenario.
  • the existing solutions for determining the validity of voice information cannot adapt to valid voice information recognition under different usage environments and scenarios, which easily leads to low recognition accuracy and false triggering by invalid voices.
  • the present application provides a voice information processing method and device, which can improve the accuracy of effective voice recognition and reduce the false trigger rate of invalid voices in different intelligent voice interaction scenarios.
  • the present application provides a voice information processing method, the method comprising:
  • acquiring first voice information; and in the case that the first voice information is determined to be a valid voice control instruction based on a judgment condition, executing the operation indicated by the first voice information, wherein the judgment condition is obtained by adjusting based on the environmental conditions in which the first voice information is generated.
  • this application adaptively adjusts the judgment condition for judging the validity of voice information according to the environmental conditions under which the voice information is received. This makes it possible to better judge the validity of voice information under different environmental conditions, improve the accuracy of the validity judgment, and reduce the false trigger rate of invalid signals.
  • the environmental conditions in which the first voice information is generated include one or more of the following: the number of people who spoke within a second preset time period before the device obtained the first voice information, the number of people within a preset range when the first voice information is generated, the confidence level of the first voice information, or the signal-to-noise ratio of the first voice information.
  • the probability that the voice information received by the device is idle chat is invalid voice.
  • the higher the confidence level and/or the signal-to-noise ratio of the voice information, the higher the probability that the device correctly recognizes the sentences of the voice information, which in turn affects the recognition of its validity. Adjusting the judgment condition for judging the validity of the voice information accordingly makes it possible to better judge that validity, improve the accuracy of the validity judgment, and reduce the false trigger rate of invalid signals.
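  • The environmental factors listed above can be folded into a single sensitivity value. The following is a minimal sketch only: the `Environment` fields, the cut-off values, the step size of 0.1, and the direction chosen for the two people-count factors are all assumptions made for illustration and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical container for the four environmental conditions."""
    speakers_recent: int    # people who spoke within the preset time period
    people_nearby: int      # people within the preset range
    asr_confidence: float   # confidence level of the voice information, 0..1
    snr_db: float           # signal-to-noise ratio in dB


def sensitivity_from_environment(env: Environment, base: float = 0.5) -> float:
    """Return a sensitivity in [0, 1]; higher means a lower validity threshold."""
    s = base
    # Assumption: several speakers or bystanders make idle chat more likely.
    if env.speakers_recent > 1:
        s -= 0.1
    if env.people_nearby > 1:
        s -= 0.1
    # Assumption: confident, clean recognition justifies a higher sensitivity.
    if env.asr_confidence >= 0.8:
        s += 0.1
    if env.snr_db >= 15.0:
        s += 0.1
    return min(1.0, max(0.0, s))
```

  • A single speaker in a quiet environment ends up with a higher sensitivity than a group in a noisy one, which is the qualitative behavior the passage describes.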
  • the judgment condition is adjusted and obtained based on the environmental condition where the first voice information is generated, including: the judgment condition is adjusted and obtained based on the environmental condition and the continuous listening duration of the device.
  • the judgment condition for judging the validity of the voice information is adaptively adjusted according to the environmental conditions when the voice information is generated and the continuous listening duration of the device, which makes it possible to further improve the judgment of the validity of the voice information, improve the accuracy of the validity judgment, and reduce the false trigger rate of invalid signals.
  • the judgment condition is adjusted based on the environmental conditions and the continuous listening duration of the device, including: the judgment condition is adjusted based on the environmental conditions, the continuous listening duration, and the situation of historical voice information.
  • Historical voice information can also help to judge the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to historically acquired valid voice information, the probability that it is a valid voice command is high. Conversely, if the currently acquired voice information is highly similar to invalid voice information acquired in the past, the probability that it is an invalid voice command is high. Therefore, in this application, in addition to the environmental conditions in which the voice information is generated and the listening duration of the device described above, the historical voice information is also used to adaptively adjust the judgment condition for judging the validity of the voice information, which can further improve the accuracy of the validity judgment and reduce the false trigger rate of invalid signals.
  • the judgment condition is adjusted and obtained based on an environmental condition where the first voice information is generated, including: the judgment condition is adjusted and obtained based on the environmental condition and historical voice information.
  • the judgment condition for judging the validity of the voice information is adaptively adjusted by combining the environmental conditions in which the voice information is generated with historical voice information, which can further improve the judgment of the validity of the voice information, improve the accuracy of the validity judgment, and reduce the false trigger rate of invalid signals.
  • the situation of the historical voice information includes one or more of the following:
  • a second similarity between the acoustic features of the first voice information and those of the historical invalid voice information.
  • the historical voice information that can be used to help judge the validity of the currently acquired voice information includes one or more of the above items. Adaptively adjusting the judgment condition based on the one or more items makes it possible to better judge the validity of the voice information, improve the accuracy of the validity judgment, and reduce the false trigger rate of invalid signals.
  • the sensitivity of the decision condition is increased
  • the sensitivity of the decision condition is lowered.
  • if the probability that the voice information is valid is large, the threshold for the validity judgment can be lowered, that is, the sensitivity of the judgment condition can be increased; if the probability of being valid is small, the threshold can be raised, that is, the sensitivity of the judgment condition can be reduced. In this way, voice information received under different environmental conditions can be recognized flexibly and more accurately, instead of applying one fixed judgment condition across the board to the voice information in every scenario.
  • the threshold of validity judgment can be increased, that is, the sensitivity of the judgment condition can be reduced, so that voice can be more accurately recognized. whether the information is valid.
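  • The inverse relationship between sensitivity and the validity threshold can be illustrated with a short sketch. The linear mapping, the function names, and the default bounds of 0.2 and 0.9 are hypothetical choices for illustration; the application does not specify a particular mapping.

```python
def threshold_from_sensitivity(sensitivity: float,
                               min_threshold: float = 0.2,
                               max_threshold: float = 0.9) -> float:
    """Map a sensitivity in [0, 1] to a validity threshold.

    Assumed linear mapping: sensitivity 0 gives the highest threshold,
    sensitivity 1 gives the lowest threshold.
    """
    s = min(max(sensitivity, 0.0), 1.0)
    return max_threshold - s * (max_threshold - min_threshold)


def is_valid_command(validity_score: float, sensitivity: float) -> bool:
    """Accept the voice information when its score reaches the threshold."""
    return validity_score >= threshold_from_sensitivity(sensitivity)
```

  • With these assumed defaults, the same validity score of 0.5 is accepted at a sensitivity of 0.9 but rejected at a sensitivity of 0.1.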
  • the situation of the historical voice information includes a first time interval between when the first voice information is acquired and when valid voice information was most recently acquired; the longer the first time interval, the lower the sensitivity of the judgment condition is adjusted.
  • the threshold is to reduce the sensitivity of the decision condition, so that whether the voice information is valid or not can be more accurately identified.
  • the situation of the historical voice information includes a second time interval between when the first voice information is acquired and when invalid voice information was most recently acquired; the longer the second time interval, the lower the sensitivity of the judgment condition is adjusted.
  • the threshold is to reduce the sensitivity of the decision condition, so that whether the voice information is valid or not can be more accurately identified.
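  • One way to make sensitivity fall as a time interval grows is an exponential decay. The sketch below applies this to the first time interval only; the half-life form and the default of 30 seconds are assumptions for illustration, and any monotonically decreasing function would match the description equally well.

```python
def sensitivity_after_interval(base_sensitivity: float,
                               seconds_since_last_valid: float,
                               half_life_s: float = 30.0) -> float:
    """Lower the sensitivity as the first time interval grows.

    Assumed form: the sensitivity halves every `half_life_s` seconds.
    """
    elapsed = max(0.0, seconds_since_last_valid)
    return base_sensitivity * 0.5 ** (elapsed / half_life_s)
```

  • Immediately after a valid command the sensitivity is unchanged, and under these assumed parameters it has dropped to half its base value after 30 seconds.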
  • the situation of the historical voice information includes the first time interval between when the first voice information is acquired and when valid voice information was most recently acquired, and the second time interval between when the first voice information is acquired and when invalid voice information was most recently acquired; in the case that the first time interval is smaller than the second time interval, the sensitivity of the judgment condition is increased.
  • the first time interval being smaller than the second time interval indicates that not much time has passed between the most recent acquisition of historical valid voice information and the acquisition of the first voice information. The probability that the first voice information is a valid voice instruction is therefore relatively large, so the validity threshold can be lowered, that is, the sensitivity of the judgment condition can be increased, so that whether the voice information is valid can be identified more accurately.
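  • The comparison of the two intervals can be sketched as a simple rule. The step size and the handling of missing history (`None`) are assumptions for illustration; the text only states that the sensitivity is increased when the first interval is the smaller one.

```python
from typing import Optional


def adjust_by_recency(sensitivity: float,
                      since_last_valid_s: Optional[float],
                      since_last_invalid_s: Optional[float],
                      step: float = 0.1) -> float:
    """Raise the sensitivity when valid input is more recent than invalid input.

    `None` means no such voice information has been seen yet (an assumed
    convention); in that case the sensitivity is left unchanged.
    """
    if since_last_valid_s is None or since_last_invalid_s is None:
        return sensitivity
    if since_last_valid_s < since_last_invalid_s:
        return min(1.0, sensitivity + step)
    return sensitivity
```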
  • the situation of the historical voice information includes the proportions of valid voice information and invalid voice information within a first preset time period before the first voice information is obtained; in the case that the proportion of the valid voice information is smaller than the proportion of the invalid voice information and the proportion of the valid voice information is on the rise, the sensitivity of the judgment condition is increased; in the case that the proportion is on a downward trend, the sensitivity of the judgment condition is lowered.
  • the situation of the historical voice information includes whether the device and the user are in a voice dialogue state at the time the first voice information is obtained; in the case that this state exists, the sensitivity of the judgment condition is increased.
  • the state of the device and the user's voice dialogue refers to the state in which the device and the user are in a voice conversation.
  • the device can track this state through the dialogue state tracking function. If this state currently exists, it indicates that the above-mentioned first voice information is likely to be a valid voice command. Therefore, the validity threshold can be lowered, that is, the sensitivity of the judgment condition can be increased, so that whether the voice information is valid can be identified more accurately.
  • the device may receive the sensitivity of the specified judgment condition, adjust the judgment condition based on the sensitivity, and then use the adjusted judgment condition to judge whether the above-mentioned first voice information is valid.
  • the above-specified sensitivity is the sensitivity input by the user, and the device can more flexibly adjust the sensitivity of the decision condition based on the user's needs, so as to better meet the user's needs.
  • the present application provides another voice information processing method.
  • the method includes: acquiring first voice information; and in the case that the first voice information is determined to be a valid voice control instruction based on a judgment condition, executing the operation indicated by the first voice information, wherein the judgment condition is obtained by adjusting based on the continuous listening duration of the device.
  • by adaptively adjusting the judgment condition for judging the validity of the voice information according to the continuous listening duration of the device, the validity of the voice information can be better judged, the accuracy of the validity judgment can be improved, and the false trigger rate of invalid signals can be reduced.
  • the present application provides another voice information processing method.
  • the method includes: acquiring first voice information; and in the case that the first voice information is determined to be a valid voice control instruction based on a judgment condition, executing the operation indicated by the first voice information, wherein the judgment condition is adjusted based on historical voice information.
  • Historical voice information can also help to judge the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to historically acquired valid voice information, the probability that it is a valid voice command is high. Conversely, if the currently acquired voice information is highly similar to invalid voice information acquired in the past, the probability that it is an invalid voice command is high. Therefore, in the present application, by adaptively adjusting the judgment condition for judging the validity of the voice information based on the historical voice information, the validity of the voice information can be better judged, the accuracy of the validity judgment can be improved, and the false trigger rate of invalid signals can be reduced.
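  • The similarity-based reasoning above can be sketched with cosine similarity over feature vectors. Cosine similarity, the feature-vector representation, the step size, and the function names are assumptions chosen for illustration; the application does not state which similarity measure is used.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjust_by_history(sensitivity: float,
                      current: Sequence[float],
                      valid_history: Sequence[Sequence[float]],
                      invalid_history: Sequence[Sequence[float]],
                      step: float = 0.1) -> float:
    """Nudge the sensitivity toward whichever history the input resembles more."""
    best_valid = max((cosine_similarity(current, h) for h in valid_history),
                     default=0.0)
    best_invalid = max((cosine_similarity(current, h) for h in invalid_history),
                       default=0.0)
    if best_valid > best_invalid:
        return min(1.0, sensitivity + step)   # resembles past valid commands
    if best_invalid > best_valid:
        return max(0.0, sensitivity - step)   # resembles past idle chat
    return sensitivity
```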
  • the present application provides a voice information processing device, the device comprising:
  • an execution unit configured to execute the operation indicated by the first voice information when it is determined based on a judgment condition that the first voice information is a valid voice control instruction, wherein the judgment condition is obtained by adjusting based on the environmental conditions in which the first voice information is generated.
  • the environmental conditions in which the first voice information is generated include one or more of the following: the number of people who spoke within a second preset time period before the device obtained the first voice information, the number of people within a preset range when the first voice information is generated, the confidence level of the first voice information, or the signal-to-noise ratio of the first voice information.
  • the judgment condition is adjusted and obtained based on an environmental condition where the first voice information is generated, including: the judgment condition is adjusted and obtained based on the environmental condition and the continuous listening duration of the device.
  • the judgment condition is adjusted based on the environmental conditions and the continuous listening duration of the device, including: the judgment condition is adjusted based on the environmental conditions, the continuous listening duration, and the situation of historical voice information.
  • the judgment condition is adjusted and obtained based on an environmental condition where the first voice information is generated, including: the judgment condition is adjusted and obtained based on the environmental condition and historical voice information.
  • the situation of the historical voice information includes one or more of the following:
  • a second similarity between the acoustic features of the first voice information and those of the historical invalid voice information.
  • the sensitivity of the decision condition is increased
  • the sensitivity of the decision condition is lowered.
  • the situation of the historical voice information includes a first time interval between when the first voice information is acquired and when valid voice information was most recently acquired; the longer the first time interval, the lower the sensitivity of the judgment condition is adjusted.
  • the situation of the historical voice information includes a second time interval between when the first voice information is acquired and when invalid voice information was most recently acquired; the longer the second time interval, the lower the sensitivity of the judgment condition is adjusted.
  • the situation of the historical voice information includes the first time interval between when the first voice information is acquired and when valid voice information was most recently acquired, and the second time interval between when the first voice information is acquired and when invalid voice information was most recently acquired; in the case that the first time interval is smaller than the second time interval, the sensitivity of the judgment condition is increased.
  • the situation of the historical voice information includes the proportions of valid voice information and invalid voice information within a first preset time period before the first voice information is obtained; in the case that the proportion of the valid voice information is smaller than the proportion of the invalid voice information and the proportion of the valid voice information is on the rise, the sensitivity of the judgment condition is increased; in the case that the proportion is on a downward trend, the sensitivity of the judgment condition is lowered.
  • the situation of the historical voice information includes whether the device and the user are in a voice dialogue state at the time the first voice information is obtained; in the case that this state exists, the sensitivity of the judgment condition is increased.
  • the present application provides a device, which may include a processor and a memory, for implementing the voice information processing method described in the first aspect above.
  • the memory is coupled to the processor, and when the processor executes the computer program stored in the memory, the method described in the first aspect or any possible implementation manner of the first aspect can be implemented.
  • the device may also include a communication interface for the device to communicate with other devices, and the communication interface may, by way of example, be a transceiver, circuit, bus, module, or other type of communication interface.
  • the device may include:
  • a processor configured to obtain the first voice information and, in the case that the first voice information is determined to be a valid voice control instruction based on a judgment condition, execute the operation indicated by the first voice information, wherein the judgment condition is obtained by adjusting based on the environmental conditions in which the first voice information is generated.
  • the computer program in the memory in this application can be pre-stored or downloaded from the Internet when the device is used and stored, and this application does not specifically limit the source of the computer program in the memory.
  • the coupling in the embodiments of the present application is an indirect coupling or connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
  • embodiments of the present application provide a chip system, which is applied to an electronic device; the chip system includes an interface circuit and a processor; the interface circuit and the processor are interconnected by lines; the interface circuit is used to receive data from a memory of the electronic device A signal is sent to the processor, where the signal includes computer instructions stored in the memory; when the processor executes the computer instructions, the system-on-a-chip executes the method described in the first aspect and any possible implementation manner thereof.
  • the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is executed by a processor to implement the method described in the first aspect or any possible implementation manner of the first aspect.
  • the present application provides a computer program product.
  • when the computer program product is executed by a processor, the method described in the first aspect or any possible implementation manner of the first aspect is executed.
  • FIG. 1 shows a schematic diagram of a system architecture to which the voice information processing method provided by the present application is applicable
  • FIG. 2 shows a schematic flowchart of a voice information processing method provided by the present application
  • FIG. 3 shows a schematic structural diagram of an invalid rejection model provided by the present application
  • FIG. 4 and FIG. 5 are schematic diagrams showing the sensitivity of adjusting decision conditions based on influencing factors provided by the present application
  • FIG. 6A and FIG. 6B are schematic diagrams showing the sensitivity of adjusting decision conditions based on influencing factors provided by the present application.
  • FIG. 6C and FIG. 6D are schematic diagrams showing the change of the proportion of voice information in the present application.
  • FIG. 7 shows a schematic diagram of the sensitivity of adjusting decision conditions based on influencing factors provided by the present application
  • FIG. 8A and FIG. 8B are schematic diagrams of judging the correlation degree of voice information in the present application.
  • FIG. 9 is a schematic diagram showing the sensitivity of adjusting the decision condition based on the influencing factors provided by the present application.
  • FIG. 10 shows a schematic flowchart of another voice information processing method provided by the present application.
  • FIG. 11 shows a schematic flowchart of the validity recognition of voice information provided by the application
  • FIG. 12 is a schematic diagram of a logical structure of an apparatus provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a logical structure of another apparatus provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a hardware structure of a device provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a hardware structure of another apparatus provided by an embodiment of the present application.
  • Automatic speech recognition (ASR)
  • the overall process of building a speech recognition system includes two parts: training and recognition.
  • the training is usually done offline.
  • Signal processing and knowledge mining are performed on pre-collected massive speech and language databases to obtain the acoustic model and the language model required by the speech recognition system (an acoustic model represents the variability of acoustics, phonetics, environment, speaker gender, accent, etc.; a language model is a knowledge representation of a set of word sequences).
  • the recognition process is usually completed online, and the real-time voice of the user is automatically recognized.
  • the recognition process can usually be divided into two modules, front-end and back-end. The main function of the front-end module is to perform endpoint detection (removing redundant silence and non-speech sounds), noise reduction, feature extraction, etc. The function of the back-end module is to use the trained acoustic model and language model to perform statistical pattern recognition (also known as decoding) on the feature vectors of the user's speech and obtain the text information contained in them.
  • the back-end module also contains an adaptive feedback module, which can self-learn from the user's voice so as to make necessary corrections to the acoustic model and the voice model and further improve the accuracy of recognition.
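  • The endpoint detection step of the front-end module can be illustrated with a simple energy-based sketch. This is a minimal illustration under stated assumptions, not the method of the application: the frame size, the energy threshold, and the function name are hypothetical, and production systems typically use more robust voice activity detection.

```python
from typing import List


def trim_silence(samples: List[float],
                 frame_size: int = 4,
                 energy_threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing frames whose mean energy is below a threshold."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples), frame_size)]
    # Indices of frames whose mean squared amplitude exceeds the threshold.
    voiced = [i for i, frame in enumerate(frames)
              if sum(x * x for x in frame) / len(frame) > energy_threshold]
    if not voiced:
        return []
    start = voiced[0] * frame_size
    end = (voiced[-1] + 1) * frame_size
    return samples[start:end]
```

  • Only silence at the two ends is removed; quiet frames between voiced frames are kept so that pauses inside an utterance survive.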
  • Voiceprint recognition is a type of biometric technology, also known as speaker recognition, which is a technology that identifies the speaker's identity through sound.
  • There are two types of voiceprint recognition technologies, namely speaker identification and speaker confirmation. Different tasks and applications use different voiceprint recognition technologies. For example, identification technology may be required when narrowing the scope of criminal investigations, while confirmation technology may be required for banking transactions.
  • Speech synthesis, also known as text to speech (TTS) technology, is a technology that converts text information generated by a computer or input from external sources into understandable and fluent spoken language output. In effect, it acts as an artificial mouth that lets the machine speak like a human.
  • Task-based dialogue can be understood as a sequential decision-making process.
  • the machine needs to update and maintain the internal dialogue state by understanding user sentences, and then select the next optimal action according to the current dialogue state (such as confirming requirements, querying restriction conditions, providing results, etc.) to complete the task.
  • the task-based dialogue system commonly used in the industry is a system with a modular structure, which generally includes four key modules:
  • Natural language understanding: identifies and parses the user's text input to obtain computer-understandable semantic labels such as slot values and intents.
  • Dialogue state tracking: maintains the current dialogue state according to the dialogue history.
  • the dialogue state is the cumulative semantic representation of the entire dialogue history, generally slot-value pairs.
  • Dialogue policy: outputs the next system action according to the current dialogue state.
  • generally, the dialogue state tracking module and the dialogue policy module are collectively referred to as the dialogue manager (DM) module.
  • Natural language generation: converts system actions into natural language output.
  • This modular system structure is highly interpretable and easy to implement. Most practical task-based dialogue systems in the industry use this structure.
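  • The four modules described above can be sketched as four small functions chained together. This is a toy illustration only: the weather-query intent, the city slot, the keyword matching, and the response templates are hypothetical stand-ins for trained components.

```python
from typing import Dict, Optional

Frame = Dict[str, Optional[str]]


def understand(text: str) -> Frame:
    """Toy natural language understanding: keyword-based intent and slot."""
    lowered = text.lower()
    frame: Frame = {"intent": None, "city": None}
    if "weather" in lowered:
        frame["intent"] = "query_weather"
    for city in ("beijing", "shenzhen"):
        if city in lowered:
            frame["city"] = city
    return frame


def track_state(state: Frame, frame: Frame) -> Frame:
    """Dialogue state tracking: accumulate slot-value pairs over turns."""
    new_state = dict(state)
    for key, value in frame.items():
        if value is not None:
            new_state[key] = value
    return new_state


def choose_action(state: Frame) -> str:
    """Dialogue policy: pick the next system action from the current state."""
    if state.get("intent") != "query_weather":
        return "ask_intent"
    if state.get("city") is None:
        return "ask_city"
    return "provide_weather"


def generate(action: str, state: Frame) -> str:
    """Natural language generation: render the action from a template."""
    templates = {
        "ask_intent": "What would you like to do?",
        "ask_city": "Which city?",
        "provide_weather": "Here is the weather for {city}.",
    }
    return templates[action].format(**state)
```

  • Because the state accumulates across turns, a user who first asks about the weather and then names a city in a second turn reaches the `provide_weather` action without repeating the intent.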
  • Computer vision, also known as machine vision, is a science that studies how to make machines "see". Its main task is to obtain information about the corresponding scene by processing the collected pictures or videos.
  • the invalid rejection model is used to judge the validity of the user's voice information obtained by the device.
  • the validity can be used to indicate whether the voice information is a valid voice control instruction for the device that obtains the voice information.
  • the voice information may be text information or the like obtained by converting the voice signal received by the device.
  • the device may receive a lot of voice information from the user, but some voice information is just the voice information of chatting between users, which is invalid information for the device.
  • the voice information with which the user actually interacts with the device is valid information for the device, and the valid information is the user's voice control instructions.
  • the invalid recognition model may include a pre-judgment module and a decision-making module for the validity of voice information.
  • the pre-judgment module includes a rule matching module and a reasoning module, and is used to make a preliminary judgment on the validity of the voice information. Specifically:
  • the rule matching module can match the input voice information against preset rules, such as preset sentences. If the preset sentences contain a sentence matching the input voice information, the input voice information is valid; if they do not, the input voice information is invalid.
  • the inference module can be a deep learning prediction model trained on large-scale data using neural networks, or a traditional machine learning model (for example, a supervised learning model such as a support vector machine (SVM)).
  • the decision-making module can make a final judgment decision on the processing result of at least one of the rule matching module and the reasoning module by synthesizing the judgment conditions, and determine whether the voice information is valid, which can greatly improve the accuracy of the validity judgment of the voice information.
  • the comprehensive judgment condition will be introduced later, and will not be described in detail here.
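  • The interplay of the rule matching module, the inference module, and the decision-making module can be sketched as follows. This is a simplified illustration under stated assumptions: the preset sentences, the keyword-ratio score standing in for a trained model, and the rule that an exact match is always accepted are all hypothetical.

```python
# Hypothetical preset sentences used by the rule matching module.
PRESET_SENTENCES = {"open the window", "turn on the air conditioner"}

# Hypothetical command keywords; a real inference module would be a trained model.
COMMAND_KEYWORDS = {"open", "close", "turn", "play"}


def rule_match(text: str) -> bool:
    """Rule matching: valid if the text equals a preset sentence."""
    return text.strip().lower() in PRESET_SENTENCES


def inference_score(text: str) -> float:
    """Stand-in for the inference module: share of command-like words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in COMMAND_KEYWORDS for word in words) / len(words)


def decide(text: str, threshold: float) -> bool:
    """Decision-making: combine both pre-judgments against the threshold."""
    if rule_match(text):
        return True
    return inference_score(text) >= threshold
```

  • The `threshold` argument is where an adjusted judgment condition would enter: a higher sensitivity corresponds to passing a lower threshold.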
  • the above invalid recognition model can also be called a validity judgment model, etc.
  • the following exemplarily introduces the system architecture to which the voice information processing method is applicable.
  • FIG. 1 exemplarily shows a system architecture diagram used by the voice information processing method provided by the present application.
  • the system architecture may include an audio manager 110, a video manager 120, a memory 130 and a processor 140, which may be connected by a bus 150.
  • Audio manager 110 may include a speaker and microphone array.
  • a loudspeaker is a transducer that converts electrical signals into sound signals, and is used to output the sound of the device.
  • a microphone is an energy conversion device that converts a sound signal into an electrical signal, and is used to collect human voice and other sound information.
  • Video manager 120 may include an array of cameras. Cameras can convert optical image signals into electrical signals for storage or transmission.
  • the memory 130 is used to store computer programs and data.
  • the memory 130 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), compact disc read-only memory (CD-ROM), etc.
  • the memory 130 may store computer programs or codes for models such as automatic speech recognition model, voiceprint recognition model, computer vision model, invalid recognition model, natural language understanding model, dialogue management model, and speech synthesis model.
  • the processor 140 may be a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
  • a processor may also be a combination that performs computing functions, such as a combination comprising one or more microprocessors, a combination of a digital signal processor and a microprocessor, and the like.
  • the processor 140 may be configured to read the computer program and data stored in the above-mentioned memory 130, and execute the voice information processing method provided by the embodiment of the present application.
  • the bus 150 may be a desktop data bus (desktop bus, D-BUS); D-BUS is an inter-process communication (IPC) mechanism optimized for desktop environments, used for inter-process communication or process-kernel communication.
  • the bus 150 may be a data bus (DB), an address bus (AB), a control bus (CB), and the like.
  • the system architecture shown in FIG. 1 may be a system architecture of a terminal device or a server or other device.
  • the terminal device may include, but is not limited to, any device based on an intelligent operating system that can perform human-computer interaction with the user through input devices such as keyboards, virtual keyboards, touchpads, touchscreens, and voice-activated devices, for example smart phones, tablet computers, handheld computers, wearable electronic devices or in-vehicle devices (such as in-vehicle computers).
  • the server may be an edge server or a cloud server, the server may be a virtual server or a physical server, etc., which is not limited in this application.
  • the system architecture shown in FIG. 1 above is only an example, and does not constitute a limitation on the system architecture applicable to the embodiments of the present application.
  • the method can be applied to the system architecture shown in FIG. 1 above, that is, the method can be executed by the above-mentioned terminal device, server or other device, or by a processing device such as a chip or a processor in the terminal device or the server. The execution body of the method is collectively referred to as a device in the following description.
  • the terminal device may first receive the voice information, and then the terminal device sends the received voice information to the server for processing.
  • the voice information sent by the terminal device to the server may be original information received by the terminal device, or may be voice information preprocessed by the terminal device.
  • a voice information processing method provided by an embodiment of the application may include, but is not limited to, the following steps:
  • the device may receive the user's voice signal through a microphone. Then, the device can recognize the voice signal through the automatic speech recognition (ASR) model to obtain voice information corresponding to the voice signal, and the voice information can include text information and the like.
  • the voice interaction function between the device and the user can be woken up by receiving a wake-up signal from the user, for example, receiving a specific wake-up word from the user.
  • the device can detect and receive the user's voice signal through the microphone, and the process of detecting and receiving the user's voice signal can be referred to as a listening process of the device.
  • in order to reduce the repetitive operation of having to wake up the device every time a voice control command is issued, there are currently two main listening methods: continuous listening and full-time listening.
  • the continuous listening method means that, after the device is awakened or a voice command operation succeeds, the device does not need to be awakened again within a period of time (such as 30 s); during this period it can continue to listen, perform voice interaction with the user, and execute the user's voice control commands.
  • the full-time listening method means that the device only needs to be woken up once after it is started; until the device is turned off, it can listen continuously, interact with the user by voice, and execute the user's voice control instructions.
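  • as a minimal sketch of the two listening methods described above, the following Python snippet checks whether the device may accept speech without a new wake-up. The names `ListeningMode` and `is_listening`, and the choice to treat the window boundary as still listening, are illustrative assumptions and are not taken from this application.

```python
from enum import Enum


class ListeningMode(Enum):
    CONTINUOUS = "continuous"  # listen for a limited window after a trigger
    FULL_TIME = "full_time"    # listen until the device is turned off


def is_listening(mode: ListeningMode,
                 woken: bool,
                 seconds_since_trigger: float,
                 window_s: float = 30.0) -> bool:
    """Return True if speech can be accepted without waking the device again.

    woken: whether the device has been woken up since it was started.
    seconds_since_trigger: time since the last wake-up or the last
        successful voice command (only used for continuous listening).
    window_s: length of the continuous-listening window (30 s is the
        example value from the text).
    """
    if not woken:
        return False
    if mode is ListeningMode.FULL_TIME:
        return True
    # Assumption: the boundary value itself still counts as listening.
    return seconds_since_trigger <= window_s
```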
  • the above-mentioned first voice information may be voice information corresponding to any voice signal received by the device in the listening stage.
  • FIG. 3 exemplarily shows a process flow diagram of the invalid recognition model.
  • the invalid recognition model receives voice information, for example the above-mentioned first voice information, and selects a pre-judgment module for judging the validity of the voice information based on the voice information and preset selection conditions; that is, it selects at least one of the above-mentioned reasoning module and rule matching module to pre-judge the validity of the voice information.
  • the selection condition may be a condition set based on factors affecting the validity of the voice information.
  • the selection condition may be: when the listening time of the device is greater than the first threshold, the rule matching module is selected to judge the validity of the voice information; when the listening time of the device is less than the second threshold, the reasoning module is selected to judge the validity of the voice information; and when the listening time of the device is between the second threshold and the first threshold, the rule matching module and the reasoning module can both be selected to judge the validity of the voice information.
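  • the selection condition above can be sketched as a small function. This is only an illustration under stated assumptions: the function name, the module labels and the handling of values exactly equal to a threshold (treated here as "between") are hypothetical, since the text does not specify them.

```python
from typing import Set


def select_prejudgment_modules(listening_time_s: float,
                               first_threshold_s: float,
                               second_threshold_s: float) -> Set[str]:
    """Choose the pre-judgment module(s) from the device's listening time."""
    if second_threshold_s > first_threshold_s:
        raise ValueError("second threshold must not exceed the first threshold")
    if listening_time_s > first_threshold_s:
        # Long listening time: use the stricter rule matching module.
        return {"rule_matching"}
    if listening_time_s < second_threshold_s:
        # Short listening time: use the reasoning (inference) module.
        return {"inference"}
    # In between (boundaries included, by assumption): use both modules.
    return {"rule_matching", "inference"}
```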
  • the influencing factor of the validity of the voice information is not limited to the listening time of the device, which will be introduced in detail below, and will not be described in detail here.
  • the device inputs the acquired speech information into the reasoning module, and obtains the output result after calculation.
  • the output result may be a predicted probability that the input voice information is valid; this probability is then compared with a preset judgment threshold to obtain a pre-judgment result. Specifically, if the probability is greater than the judgment threshold, the pre-judgment result is that the input voice information is valid, and if the probability is less than the judgment threshold, the pre-judgment result is that the input voice information is invalid.
  • for example, suppose the judgment threshold is 70%. If the probability that the voice information is valid, as predicted by the reasoning module, is 80%, which is greater than 70%, then the voice information is valid information. If the predicted probability is 50%, which is less than 70%, then the voice information is invalid information.
  • the result output by the above-mentioned reasoning module is not limited to the probability that the voice information is valid, and can also take other data forms, such as a score: if the score exceeds the judgment threshold, the voice information is valid, and so on. This application does not limit this.
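  • the threshold comparison performed on the reasoning module's output can be illustrated as follows. The function name is hypothetical, and treating a probability exactly equal to the threshold as invalid is an assumption, because the text only covers the "greater than" and "less than" cases.

```python
def prejudge_by_probability(valid_probability: float,
                            judgment_threshold: float = 0.70) -> bool:
    """Return True (valid) if the predicted probability exceeds the threshold.

    Assumption: a probability equal to the threshold is treated as invalid.
    """
    if not 0.0 <= valid_probability <= 1.0:
        raise ValueError("probability must be within [0, 1]")
    return valid_probability > judgment_threshold
```

  • with the 70% threshold from the example, a predicted probability of 80% yields a valid pre-judgment and 50% yields an invalid one.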
  • the device inputs the acquired voice information into the rule matching module, and the rule matching module compares the input voice information with the information in the preset rule base to obtain the pre-judgment result. If information in the preset rule base matches the input voice information, the pre-judgment result is that the input voice information is valid. Conversely, if no information in the preset rule base matches the input voice information, the pre-judgment result is that the input voice information is invalid.
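  • a minimal sketch of rule matching against preset sentences is given below. It assumes that "matching" means equality after lowercasing and whitespace normalization; the text does not define the matching rule, so this choice and the function names are purely illustrative.

```python
from typing import Iterable


def _normalize(text: str) -> str:
    # Lowercase and collapse runs of whitespace (an assumed normalization).
    return " ".join(text.lower().split())


def rule_match(voice_text: str, preset_sentences: Iterable[str]) -> bool:
    """Return True (valid) if the text matches any preset sentence."""
    normalized = _normalize(voice_text)
    return any(_normalize(sentence) == normalized
               for sentence in preset_sentences)
```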
  • the pre-judgment result can be input into the decision-making module, and the decision-making module can use the comprehensive judgment condition to judge whether the pre-judgment result is reasonable, so as to output a final indication of whether the voice information is valid.
  • for example, suppose the comprehensive judgment condition is that valid voice information includes no fewer than 3 characters. If the input voice information contains fewer than 3 characters and the pre-judgment result output by the inference module or the rule matching module is that the voice information is valid, the pre-judgment result is unreasonable; the decision-making module then determines that the voice information is invalid and outputs final indication information indicating that the voice information is invalid. Conversely, if the input voice information contains no fewer than 3 characters, the pre-judgment result output by the inference module or the rule matching module is reasonable, and the decision-making module finally determines that the voice information is valid and outputs indication information indicating that the voice information is valid.
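  • the character-count check described above might be sketched as follows. The function name is hypothetical, and two points are assumptions not stated in the text: whitespace is excluded from the character count, and a pre-judgment of "invalid" is passed through unchanged.

```python
def decide_with_min_length(voice_text: str,
                           prejudged_valid: bool,
                           min_chars: int = 3) -> bool:
    """Final validity decision using a minimum-character condition."""
    # Assumption: only non-whitespace characters are counted.
    char_count = sum(1 for ch in voice_text if not ch.isspace())
    if prejudged_valid and char_count < min_chars:
        # A "valid" pre-judgment on very short input is deemed unreasonable.
        return False
    # Otherwise the pre-judgment is considered reasonable and is kept.
    return prejudged_valid
```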
  • the above comprehensive judgment condition is not limited to the above examples, and may also be other forms of conditions.
  • the comprehensive judgment condition may be a voting mechanism: if the votes that the voice information is valid are in the majority, the voice information is determined to be valid, and if the votes that the voice information is invalid are in the majority, the voice information is determined to be invalid.
  • if the inference module and the rule matching module are both selected to pre-judge the validity of the voice information, the acquired voice information is input into the inference module and the rule matching module respectively, and the two modules pre-judge the validity of the voice information according to their own processes (see the description above, not repeated here), each obtaining its own validity pre-judgment result. The two pre-judgment results are then input into the decision-making module, which makes a final decision on them based on its comprehensive judgment condition and outputs the final result of the invalid recognition model.
  • the comprehensive judgment condition may be: the valid voice information includes no less than 3 characters, and then, the decision-making module checks the rationality of the above-mentioned two pre-judgment results based on the comprehensive judgment condition, and the specific inspection process refers to the previous description, which will not be repeated here.
  • the comprehensive judgment condition may be a voting mechanism, that is, if the votes that the voice information is valid are in the majority, the voice information is determined to be valid, and if the votes that the voice information is invalid are in the majority, the voice information is determined to be invalid. If both of the above pre-judgment results are valid, the final judgment result of the voice information is also valid. If both pre-judgment results are invalid, the final judgment result is also invalid. If one of the two pre-judgment results is valid and the other is invalid, a further judgment can be made, for example according to priority: if the priority of the reasoning module is higher than that of the rule matching module, the pre-judgment result of the reasoning module is output as the final result; if the priority of the rule matching module is higher than that of the inference module, the pre-judgment result of the rule matching module is output as the final result.
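  • the voting mechanism with a priority-based tie-break can be sketched as below. The function name, the module labels and the numeric priority values are illustrative assumptions; the text only states that the higher-priority module's result is used when the two results disagree.

```python
from typing import Dict


def vote(prejudgments: Dict[str, bool], priority: Dict[str, int]) -> bool:
    """Combine per-module pre-judgments (True = valid) into a final result."""
    if not prejudgments:
        raise ValueError("at least one pre-judgment result is required")
    valid_votes = sum(1 for is_valid in prejudgments.values() if is_valid)
    invalid_votes = len(prejudgments) - valid_votes
    if valid_votes != invalid_votes:
        # A clear majority decides.
        return valid_votes > invalid_votes
    # Tie: follow the module with the highest priority value.
    top_module = max(prejudgments, key=lambda name: priority[name])
    return prejudgments[top_module]
```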
  • the above comprehensive judgment condition is only an example, and its main purpose is to more accurately synthesize the pre-judgment results of the reasoning module and/or the rule matching module to judge the validity of the acquired voice information.
  • the comprehensive judgment condition may also be other conditions that can achieve the purpose, which is not limited in this solution.
  • the above judgment conditions in S202 may include one or more of the selection conditions in the invalid recognition model, the judgment threshold applied to the output result of the inference module, and the comprehensive judgment conditions. That is, in this application, in order to improve the accuracy of valid speech recognition and reduce the false trigger rate of invalid speech in different scenarios, the above judgment conditions are adjusted flexibly in different voice interaction scenarios based on one or more influencing factors that can affect the validity judgment of the input voice information, so that the validity recognition of voice information is more flexible and better suited to the context and scene at that time.
  • the above-mentioned adjustment of the decision condition based on the influencing factor of the validity of the first voice information may be:
  • when it is determined, based on one or more of the influencing factors of the validity of the voice information, that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the judgment condition is adjusted up; the higher the sensitivity of the judgment condition, the higher the probability that the first voice information is determined to be valid through the judgment condition.
  • when it is determined, based on one or more of the influencing factors of the validity of the voice information, that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the judgment condition is adjusted down; the lower the sensitivity of the judgment condition, the lower the probability that the first voice information is determined to be valid through the judgment condition.
  • the above-mentioned influencing factors that can affect the validity recognition of the input speech information may include one or more of the following:
  • the second time interval; the proportion of valid voice information to invalid voice information within the first preset time period before the device obtains the voice information; the first degree of association between the semantics of the voice information and the semantics of the valid voice information last obtained by the device; the second degree of association between the semantics of the voice information and the semantics of the invalid voice information last obtained by the device; the third degree of association between the first voice information and the valid voice information last obtained by the device; the state of the voice dialogue between the device and the user when the current voice information is obtained; the first similarity between the acoustic features of the voice information and those of historical valid voice information; and the second similarity between the acoustic features of the voice information and those of historical invalid voice information.
  • the device can adjust the selection conditions in the above-mentioned invalid recognition model based on a first factor, and the first factor can include one or more of the above-mentioned influencing factors.
  • the specific adjustment process will be introduced later, and will not be described in detail here.
  • after the device obtains the above-mentioned first voice information, it can adjust, based on a second factor, the judgment threshold applied to the output result of the inference module in the above-mentioned invalid recognition model, and the second factor can include one or more of the above-mentioned influencing factors.
  • the second factor and the influencing factors included in the above-mentioned first factor may be different, or may be partially the same, or may be completely the same, which is specifically determined according to the actual situation, which is not limited in this solution. The specific adjustment process will be introduced later, and will not be described in detail here.
  • after the device obtains the above-mentioned first voice information, it can adjust the comprehensive judgment condition of the decision-making module in the above-mentioned invalid recognition model based on a third factor, and the third factor can include one or more of the above-mentioned influencing factors.
  • the third factor may be different from the influencing factors included in the above-mentioned first factor and the above-mentioned second factor, or may be partially the same, or may be completely the same, which is specifically determined according to the actual situation, which is not limited in this solution. The specific adjustment process will be introduced later, and will not be described in detail here.
  • the above selection conditions, judgment threshold and comprehensive judgment conditions can all be adjusted together, or only one or two of them can be adjusted; the specific choice can be based on actual needs, and this solution does not limit this.
  • after the device acquires the above-mentioned first voice information and adjusts the judgment conditions in the invalid recognition model based on the above-mentioned influencing factors, the device identifies the validity of the first voice information based on the adjusted invalid recognition model.
  • the device can select one or both of the above rule matching module and inference module, based on the adjusted selection conditions, to pre-judge the validity of the first voice information.
  • if the device has adjusted the judgment threshold of the above-mentioned reasoning module, and the pre-judgment module selected by the device for judging the validity of the first voice information includes the reasoning module, then after the reasoning module outputs data indicating the validity of the first voice information, the device can judge whether the first voice information is valid based on that data and the adjusted judgment threshold.
  • if the device has adjusted the comprehensive judgment condition of the decision-making module in the above invalid recognition model, then after obtaining the pre-judgment results of the above-mentioned rule matching module and/or inference module, it can make a comprehensive judgment on those results based on the adjusted comprehensive judgment condition, so as to determine the validity of the above-mentioned first voice information.
  • the device starts to perform semantic understanding of the first voice information.
  • the processor in the device can call the natural language understanding model in the memory to perform semantic understanding of the first voice information and obtain its specific meaning.
  • the device After understanding the meaning of the first voice information, the device performs a corresponding operation based on the meaning to provide the user with the desired service.
  • the meaning of the first voice information is, for the device, a control instruction for executing the corresponding operation.
  • the judgment condition may include one or more of the selection conditions, the judgment threshold and the comprehensive judgment conditions in the above invalid recognition model, and the adjustment process described below can be applied to the adjustment of one or more of the selection conditions, the judgment threshold and the comprehensive judgment conditions.
  • the sensitivity refers to the degree of relaxation and strictness of the judgment condition. The stricter the judgment condition, the lower the sensitivity, and the looser the judgment condition, the higher the sensitivity.
  • the inference module's prediction of the validity of the voice information is fuzzy matching, whereas the rule matching module's prediction is pattern matching: the result is either yes or no, which is relatively strict. Therefore, when selecting a pre-judgment module, if the voice information obtained by the device has a high probability of being valid, either the inference module or the rule matching module can be selected for pre-judgment; if the aim is to improve the accuracy of recognizing valid voice information, the inference module can be chosen. If the probability that the voice information obtained by the device is valid is small, the rule matching module can be selected for pre-judgment in order to effectively avoid false triggering by invalid information.
  • the device can adjust the selection conditions in a stricter direction, that is, lower the sensitivity of the selection conditions.
  • for example, the device can adjust the selection condition to: if the listening time of the device is less than 5 seconds, the reasoning module is selected for pre-judgment; if the listening time of the device is greater than 10 seconds, the rule matching module is selected for pre-judgment; if the listening time of the device is between 5 seconds and 10 seconds, the reasoning module and the rule matching module are both selected for pre-judgment.
  • the device can also adjust the selection condition in a looser direction, that is, increase the sensitivity of the selection condition. For example: if the listening time of the device is less than 15 seconds, the reasoning module is selected for pre-judgment; if the listening time of the device is greater than 25 seconds, the rule matching module is selected for pre-judgment; if the listening time of the device is between 15 seconds and 25 seconds, the inference module and the rule matching module are both selected for pre-judgment.
  • for the judgment threshold of the above reasoning module, assume that the standard judgment threshold is 70%, that is, when the probability predicted by the reasoning module that the voice information is valid is greater than 70%, the voice information is determined to be valid.
  • if the judgment threshold is adjusted to 80%, the judgment condition is adjusted in the strict direction: the probability predicted by the reasoning module that the voice information is valid must be greater than 80% before the voice information can be judged valid, so the sensitivity of the judgment condition is reduced.
  • if the judgment threshold is adjusted to 60%, the judgment condition is adjusted in the relaxed direction: the probability predicted by the inference module that the voice information is valid only needs to be greater than 60% for the voice information to be judged valid, so the sensitivity of the judgment condition is improved.
  • for the comprehensive judgment condition, assume that the condition is that valid voice information includes no fewer than 3 characters. If the condition is adjusted to require no fewer than 5 characters, the requirement on the voice information is raised and becomes stricter, so the sensitivity of the comprehensive judgment condition is reduced. If the condition is adjusted to require no fewer than 2 characters, the requirement on the voice information is lowered and becomes more relaxed, so the sensitivity of the comprehensive judgment condition is improved.
  • adjusting the sensitivity in negative correlation means that when the value corresponding to the influencing factor increases, the sensitivity is adjusted lower, and the greater the increase, the lower the sensitivity is adjusted; when the value corresponding to the influencing factor decreases, the sensitivity is adjusted higher, and the greater the decrease, the higher the sensitivity is adjusted.
  • adjusting the sensitivity in positive correlation means that when the value corresponding to the influencing factor increases, the sensitivity is adjusted higher, and the greater the increase, the higher the sensitivity is adjusted; when the value corresponding to the influencing factor decreases, the sensitivity is adjusted lower, and the greater the decrease, the lower the sensitivity is adjusted.
  • the specific adjustment amount of the sensitivity adjustment described in this application can be set according to the actual situation, which is not limited in this application.
  • the adjustment of the sensitivity of the above judgment condition has a range. For example, for the above judgment threshold, the maximum is 100% and the minimum is 0.
  • the adjustment range of the sensitivity of the judgment condition is determined according to the actual situation, and this application does not limit this.
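  • to make the positive and negative correlation adjustments concrete, the sketch below applies them to the judgment threshold and clamps the result to the range from 0 to 100%. The linear step size, the function name and the parameter names are assumptions chosen for illustration; the application leaves the specific adjustment amount open. Because a higher sensitivity means a looser condition, raising the sensitivity lowers the threshold.

```python
def adjust_judgment_threshold(base_threshold: float,
                              factor_change: float,
                              correlation: str,
                              step: float = 0.05) -> float:
    """Adjust a probability threshold from a change in an influencing factor.

    factor_change: increase (positive) or decrease (negative) of the factor.
    correlation: "positive" or "negative" correlation adjustment.
    step: assumed threshold change per unit of factor change (linear).
    """
    if correlation == "positive":
        sensitivity_change = step * factor_change
    elif correlation == "negative":
        sensitivity_change = -step * factor_change
    else:
        raise ValueError("correlation must be 'positive' or 'negative'")
    # Higher sensitivity means a looser condition, i.e. a lower threshold.
    new_threshold = base_threshold - sensitivity_change
    # Keep the threshold within its valid range [0, 1].
    return min(1.0, max(0.0, new_threshold))
```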
  • the adjustment process of the above judgment condition is introduced based on the influence factor of the environmental situation in which the above-mentioned first voice information is generated.
  • the environmental conditions in which the first voice information is generated include one or more of the following: the number of speakers within the second preset time period before the device acquires the first voice information (hereinafter referred to as the number of speakers), the number of people within a preset range when the first voice information is generated (hereinafter referred to as the number of people around), the confidence level of the first voice information, the signal-to-noise ratio of the first voice information, and so on.
  • the number of speakers specifically refers to the number of different voiceprints included in the first voice information. Because each person has a different voiceprint, the number of voiceprints can be used to represent the number of people speaking in the first voice information.
  • FIG. 4 takes the above listed several environmental influence factors as examples to describe how to adjust the above judgment conditions based on the environmental influence factors.
  • the device may acquire the number of people around and the number of speakers of the first voice information. Specifically, the device can call the computer vision model in the memory to drive the camera to capture pictures or videos of the surrounding environment, and then analyze the captured pictures and videos to obtain the number of people around and the number of speakers.
  • the number of speakers can be obtained by analyzing which people's mouths are moving in the video within the second preset duration.
  • the number of people around includes the number of speakers.
  • the second preset duration may be, for example, 5 seconds, 10 seconds, or 1 minute, etc., which is not limited in this application.
  • the device can identify the voiceprint features in the voice signal received by the device within the second preset duration by calling the voiceprint recognition model in the memory, and the number of different voiceprint features identified is the number of speakers.
  • the voiceprint recognition model may be a dynamic monitoring model to flexibly adapt to voiceprint recognition in different situations.
  • after the device obtains the number of people around (assume m people, where m is a positive integer) and the number of speakers (assume n people, where n is a positive integer), it first determines whether the number of speakers n is 0. If n is 0, the above first voice information does not include human voice information, and the corresponding judgment conditions do not need to be adjusted.
  • if the number of speakers n is not 0, the first voice information includes human voice information. The device then judges whether the number of people around m is greater than 1; if m is not greater than 1, it can judge whether m is 1.
  • if m is 1, the first voice information uttered by that person is very likely a voice control command sent to the device. In this case, the sensitivity of the judgment condition can be increased so that the validity of the first voice information is better recognized.
  • alternatively, when m is 1, the currently acquired first voice information can be regarded as a voice control instruction for the device, that is, valid information. In this case, the sensitivity of the judgment condition can be adjusted to the highest, or the invalid recognition model can skip further validity judgment and directly output an indication that the first voice information is valid.
  • if m is greater than 1, the first voice information is likely to be the content of small talk, which may be invalid voice information for the device.
  • in this case, the sensitivity of the decision condition is lowered according to the size of m: the larger the number of people around m, the lower the sensitivity of the decision condition. The more people there are around, the higher the probability that the first voice information is chatting voice. Therefore, stricter judgment conditions need to be set to identify the validity of the first voice information, so as to prevent invalid voice information from falsely triggering related service operations and wasting the resources of the device.
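  • the flow above, based on the number of speakers n and the number of people around m, can be sketched as a function that returns a sensitivity change. The function name, the linear scaling with m and the step size are illustrative assumptions; the text only states the direction of each adjustment. The case where m is neither 1 nor greater than 1 is not described in the text and is left unadjusted here.

```python
def sensitivity_delta_from_people(speakers_n: int,
                                  people_around_m: int,
                                  step: float = 0.1) -> float:
    """Return a sensitivity change: positive = looser, negative = stricter."""
    if speakers_n == 0:
        # No human voice in the first voice information: no adjustment.
        return 0.0
    if people_around_m == 1:
        # A single person is very likely addressing the device.
        return step
    if people_around_m > 1:
        # More people around: more likely chatting, so be stricter.
        # Assumption: the reduction grows linearly with m.
        return -step * (people_around_m - 1)
    # Case not covered by the text: leave the sensitivity unchanged.
    return 0.0
```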
  • the device can call the automatic speech recognition model in the memory to calculate the confidence of the first voice information, or use the channel information to calculate the signal-to-noise ratio of the first voice information, or the confidence Both the degree and the signal-to-noise ratio are calculated, and then the sensitivity of the decision condition is adjusted based on this confidence degree and/or the signal-to-noise ratio.
  • the sensitivity of the decision condition can be adjusted in negative correlation with the confidence and/or the signal-to-noise ratio. The higher the confidence, the higher the probability that the first voice information has been correctly recognized, and the higher the signal-to-noise ratio, the better the quality of the collected first voice information. In that case, even if the decision condition is strict (low sensitivity), the validity of the first voice information can still be recognized well, and invalid chatting voice can be effectively filtered.
  • conversely, when the confidence and/or the signal-to-noise ratio is low, the sensitivity of the decision condition can be appropriately increased, that is, the decision condition can be relaxed, so that the validity of the first voice information can be better recognized.
  • the device may set a confidence threshold and/or a signal-to-noise ratio threshold for voice information. If the confidence of the first voice information is greater than the confidence threshold and/or the signal-to-noise ratio is greater than the signal-to-noise ratio threshold, then the higher the confidence and/or the signal-to-noise ratio, the lower the sensitivity of the decision condition is adjusted. If the confidence of the first voice information is smaller than the confidence threshold and/or the signal-to-noise ratio is smaller than the signal-to-noise ratio threshold, then the lower the confidence and/or the signal-to-noise ratio, the higher the sensitivity of the decision condition is adjusted.
  • the confidence threshold may be, for example, 50% or 60%, and the signal-to-noise ratio threshold may be, for example, 50 dB or 60 dB; the present application does not limit either threshold.
  • the device does not need to set a confidence threshold and/or a signal-to-noise ratio threshold for the voice information; instead, it can set a corresponding adjusted decision condition for each range of confidence and/or signal-to-noise ratio.
  • for confidence below 31%, the sensitivity can be increased and the judgment threshold set to 50%; within the confidence range of 31% to 60%, the judgment threshold can be set to 60%; within the range of 61% to 70%, no adjustment is made and the original 70% threshold is kept; within the range of 71% to 100%, the sensitivity can be adjusted down and the judgment threshold set to 80%.
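The range-based adjustment described above can be sketched as a simple lookup. This is a minimal illustration only: the function name and table names are hypothetical, the percentages for the 31% to 100% ranges come from the text, and treating everything up to 30% as the first range is an assumption.

```python
from bisect import bisect_left

# Inclusive upper bound (percent) of each confidence range and the judgment
# threshold (percent) assigned to it. The lowest range (up to 30%) is an
# assumption; the other ranges and all thresholds follow the text.
RANGE_UPPER_BOUNDS = [30, 60, 70, 100]
RANGE_THRESHOLDS = [50, 60, 70, 80]


def threshold_for_confidence(confidence_pct: float) -> int:
    """Map an ASR confidence (percent) to a judgment threshold (percent)."""
    if not 0 <= confidence_pct <= 100:
        raise ValueError("confidence must be within 0..100")
    # bisect_left finds the first range whose upper bound is >= confidence.
    index = bisect_left(RANGE_UPPER_BOUNDS, confidence_pct)
    return RANGE_THRESHOLDS[index]
```

A higher confidence selects a higher threshold, which corresponds to the negative correlation between confidence and sensitivity.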
  • the device can individually adjust the sensitivity of the decision condition based on any one of them.
  • the device can comprehensively adjust the sensitivity of the decision condition based on any of the multiple influencing factors.
  • a weight may be configured for each of the multiple influencing factors, and the sensitivity of the decision condition may be adjusted in a weighted manner. For example, for the adjustment of the above judgment threshold, assume that the adjustment is based on three influencing factors: the number m of surrounding people, the confidence, and the signal-to-noise ratio.
  • the weights corresponding to the three factors are w1, w2 and w3.
  • the adjusted judgment thresholds calculated from the three factors individually are a1, a2 and a3.
  • the adjusted judgment threshold determined by synthesizing the three factors is (a1*w1 + a2*w2 + a3*w3). It should be noted that this weighted synthesis is only an example. In actual implementation, the largest or the smallest adjustment among the multiple influencing factors can instead be taken as the final adjustment result, and so on. This scheme does not limit the specific synthesis calculation.
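The weighted synthesis and the largest/smallest-adjustment alternative can be sketched as follows. The function names are hypothetical, and requiring the weights to sum to 1 is an assumption made so that the result stays on the same scale as the individual thresholds; the source does not state it.

```python
import math
from typing import Sequence


def weighted_threshold(adjusted: Sequence[float], weights: Sequence[float]) -> float:
    """Compute a1*w1 + a2*w2 + ... for per-factor adjusted thresholds."""
    if len(adjusted) != len(weights):
        raise ValueError("one weight is required per adjusted threshold")
    if not math.isclose(sum(weights), 1.0):
        # Assumption: weights are normalized so the result is a valid threshold.
        raise ValueError("weights are assumed to sum to 1")
    return sum(a * w for a, w in zip(adjusted, weights))


def extreme_threshold(adjusted: Sequence[float], base: float, most: bool = True) -> float:
    """Pick the threshold that differs most (or least) from the base threshold."""
    pick = max if most else min
    return pick(adjusted, key=lambda a: abs(a - base))
```

For example, with per-factor thresholds of 80, 60 and 70 and weights 0.5, 0.3 and 0.2, `weighted_threshold` evaluates 80*0.5 + 60*0.3 + 70*0.2.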
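  • FIG. 5 exemplarily shows the adjustment of the decision condition based on the continuous listening duration (hereinafter referred to as t1) before the device acquires the above-mentioned first voice information, the first time interval (Δt1) between the device acquiring the first voice information and its most recent acquisition of valid voice information, and the second time interval (Δt2) between the device acquiring the first voice information and its most recent acquisition of invalid voice information.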
  • after the device acquires the above-mentioned first voice information, it can acquire the duration t1 for which it has been continuously listening up to the time the first voice information is acquired, the first time interval Δt1 between the acquisition of the first voice information and the most recent acquisition of valid voice information, and the second time interval Δt2 between the acquisition of the first voice information and the most recent acquisition of invalid voice information.
  • t1, Δt1 and Δt2 can be obtained through timing and calculation by a timer.
  • the device can adjust the sensitivity of the above judgment condition in negative correlation with t1, that is, the longer the continuous listening duration t1, the lower the sensitivity of the judgment condition. This is because, when the device is woken up, it enters a new round of continuous listening. Voice information obtained early in that round is more likely to be valid; as time passes, the voice information obtained by the device is more likely to be chat between users, so the sensitivity needs to be reduced to limit false triggering.
  • for example, assume the judgment condition is the judgment threshold applied to the output of the above inference module. When t1 is small, the judgment threshold can be 60%: the condition is relatively loose and the sensitivity is high. As t1 gradually increases, the judgment threshold is raised by a preset increment, such as 1%; that is, as t1 increases, the judgment threshold becomes larger, the condition becomes stricter, and the sensitivity gradually decreases. It should be noted that this is just an example, and the present application does not limit the specific negative correlation adjustment method.
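The stepwise tightening described above can be sketched as below. The 60% starting threshold and the 1% increment come from the text; the function name, the 10-second step interval and the 95% cap are hypothetical values chosen only for illustration.

```python
def threshold_after_listening(
    t1_seconds: float,
    initial_pct: float = 60.0,   # starting threshold, from the text
    step_pct: float = 1.0,       # preset increment, from the text
    step_seconds: float = 10.0,  # assumption: one increment per 10 s of listening
    max_pct: float = 95.0,       # assumption: cap so the threshold stays reachable
) -> float:
    """Raise the judgment threshold as the continuous listening time t1 grows."""
    if t1_seconds < 0:
        raise ValueError("t1 must be non-negative")
    steps = int(t1_seconds // step_seconds)
    return min(initial_pct + steps * step_pct, max_pct)
```

A larger returned threshold means a stricter condition, that is, a lower sensitivity.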
  • the device may determine whether Δt1 is greater than the first time interval threshold T1. If Δt1 is greater than T1, the sensitivity of the decision condition is not adjusted according to Δt1. This is because, when Δt1 is greater than T1, the time covered by the first time interval Δt1 can be considered to overlap with the above-mentioned continuous listening duration t1; the sensitivity of the judgment condition can be adjusted through t1, and there is no need to adjust it again according to Δt1.
  • if Δt1 does not exceed T1, the device adjusts the sensitivity of the decision condition in negative correlation with Δt1. This is because, within the period of length T1 after the device obtains valid voice information, the longer the interval, the greater the probability that the voice information obtained by the device is invalid voice information such as chat. Therefore, to reduce false triggers, the device can lower the sensitivity as Δt1 grows.
  • the device may determine whether Δt2 is greater than the second time interval threshold T2. If Δt2 is greater than T2, the sensitivity of the decision condition is not adjusted according to Δt2. This is because, when Δt2 is greater than T2, the time covered by the second time interval Δt2 can be considered to overlap with the above-mentioned continuous listening duration t1; the sensitivity of the judgment condition can be adjusted through t1, and there is no need to adjust it again according to Δt2.
  • if Δt2 does not exceed T2, the device adjusts the sensitivity of the decision condition in negative correlation with Δt2. This is because, within the period of length T2 after the device acquires invalid voice information, the longer the interval, the greater the probability that the voice information acquired by the device is invalid voice information such as chat. Therefore, to reduce false triggers, the device can lower the sensitivity as Δt2 grows.
  • the device can compare whether Δt1 is smaller than Δt2, and if so, increase the sensitivity of the decision condition. Δt1 being smaller than Δt2 means that the voice information acquired immediately before the first voice information was valid voice information. In that case, the first voice information is more likely to be a supplement to or modification of the previous voice information, and is therefore more likely to be valid. To better identify the validity of the first voice information, the device may relax the judgment condition, that is, increase the sensitivity.
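The interval-based rules above can be sketched as two small helpers operating on a judgment threshold in percent. The function names, the linear slope and the 5% relaxation are hypothetical; the source only fixes the direction of each adjustment.

```python
def adjust_for_interval(
    threshold_pct: float,
    interval_s: float,            # Δt1 or Δt2, in seconds
    limit_s: float,               # T1 or T2, in seconds
    pct_per_second: float = 0.5,  # assumption: linear negative correlation
) -> float:
    """Tighten the threshold as the interval grows, only while it is within the limit."""
    if interval_s > limit_s:
        # Beyond the limit this factor is not used; t1 governs the adjustment.
        return threshold_pct
    return threshold_pct + pct_per_second * interval_s


def relax_if_last_was_valid(
    threshold_pct: float,
    dt1_s: float,
    dt2_s: float,
    relax_pct: float = 5.0,  # assumption: fixed relaxation amount
) -> float:
    """Loosen the threshold when the most recent prior utterance was valid (Δt1 < Δt2)."""
    if dt1_s < dt2_s:
        return threshold_pct - relax_pct
    return threshold_pct
```

Raising the threshold corresponds to lowering the sensitivity, and lowering the threshold corresponds to raising it.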
  • the adjustment process shown in FIG. 5 above is an example of implementation of the present application.
  • the sensitivity of the judgment condition is dynamically adjusted in real time according to the continuous listening duration and the time intervals since the most recent valid and invalid voice information. As a result, at different listening stages, voice information with the same content faces different thresholds for being judged valid, so that valid voice can be better recognized, false triggering by invalid voice can be reduced, and the user's voice interaction experience can be improved.
  • the device can individually adjust the sensitivity of the decision condition based on any one of them.
  • the device can comprehensively adjust the sensitivity of the decision condition based on any of the multiple influencing factors.
  • FIGS. 6A and 6B exemplarily illustrate the adjustment of the decision condition based on the influence factor of the ratio of valid voice information and invalid voice information in the first preset time period before the device obtains the above-mentioned first voice information.
  • the first preset duration may be the duration of continuous listening before the device obtains the first voice information, or it may be any duration before the device obtains the first voice information. This arbitrary duration may be pre-configured, which is not limited in this application.
  • the above-mentioned proportion of valid voice information within the first preset duration refers to the proportion of valid voice information acquired by the device to all voice information acquired by the device within the first preset duration.
  • the ratio is the reciprocal of the number of invalid voice information acquired between the time when a valid voice control instruction was last received and the time when the first voice information was acquired. If the number of invalid voice information acquired during the period is 0, the proportion of the valid voice information is 1.
  • the proportion of invalid voice information within the first preset duration refers to the proportion of invalid voice information acquired by the device to all voice information acquired by the device within the first preset duration. Or, the ratio is the reciprocal of the number of valid voice information acquired between the time when an invalid voice control instruction is received last time and the time when the first voice information is acquired. If the number of valid voice information acquired during the period is 0, then the proportion of invalid voice information is 1.
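The two ways of obtaining the proportions can be sketched as follows. The function names and the representation of the history as a list of validity flags are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def proportions(history: Sequence[bool]) -> Tuple[float, float]:
    """Return (f1, f2): shares of valid and invalid utterances in the window.

    `history` holds one flag per utterance acquired within the first preset
    duration (True = judged valid). An empty window yields (0.0, 0.0).
    """
    if not history:
        return 0.0, 0.0
    valid = sum(history)
    total = len(history)
    return valid / total, (total - valid) / total


def reciprocal_proportion(opposite_count: int) -> float:
    """Alternative definition: reciprocal of the count of opposite-class utterances.

    For f1, `opposite_count` is the number of invalid utterances acquired since
    the last valid instruction; a count of 0 gives a proportion of 1.
    """
    if opposite_count < 0:
        raise ValueError("count must be non-negative")
    return 1.0 if opposite_count == 0 else 1.0 / opposite_count
```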
  • after the device obtains the first voice information, it obtains the proportion of valid voice information (referred to as f1) and the proportion of invalid voice information (referred to as f2) within the first preset duration, and may compare f1 with f2 (see FIG. 6A). If f1 is greater than f2, more valid voice information was obtained within the first preset duration and the user is frequently interacting with the device by voice, so the sensitivity of the above judgment condition can be adjusted in positive correlation with the parameter (f1-f2).
  • the device may adjust the sensitivity of the above decision condition based on f1 and f2. For example, the larger f1 is, the higher the sensitivity is adjusted, and the larger f2 is, the lower the sensitivity is adjusted, and so on.
  • the device can adjust the sensitivity of the decision condition according to the change rate of f1 and the change rate of f2.
  • the device can adjust the sensitivity of the decision condition in positive correlation with the rate of change of f1. That is, the larger the rate of change of f1, the greater the probability that the first voice information is valid, so the sensitivity is adjusted higher and the judgment condition is looser; the smaller the rate of change of f1, the lower the probability that the first voice information is valid, so the sensitivity is adjusted lower and the judgment condition is stricter.
  • five example values of the rate of change of f1 are given in the figure.
  • assuming that the judgment condition to be adjusted is the judgment threshold applied to the output of the above inference module, and that the judgment threshold before adjustment is 70%, the adjusted judgment thresholds corresponding to the five rates of change of f1, sorted from smallest to largest, are 85%, 80%, 78%, 68% and 65%. It should be noted that the lower the judgment threshold, the higher the sensitivity: increasing the sensitivity here means lowering the judgment threshold, and lowering the sensitivity means raising the judgment threshold.
  • the device can adjust the sensitivity of the decision condition in negative correlation with the rate of change of f2. That is, the smaller the rate of change of f2, the higher the proportion of valid voice information and the greater the probability that the first voice information is valid, so the sensitivity is adjusted higher and the judgment condition is looser; the larger the rate of change of f2, the smaller the proportion of valid voice information and the lower the probability that the first voice information is valid, so the sensitivity is adjusted lower and the judgment condition is stricter.
  • the device obtains the proportion of valid voice information (referred to as f1) and the proportion of invalid voice information (referred to as f2) within the first preset duration, and does not need to compare f1 and f2.
  • in this case, the sensitivity of the above judgment condition can also be adjusted in positive correlation with the parameter (f1-f2), in positive correlation with the rate of change of f1, and/or in negative correlation with the rate of change of f2 (see FIG. 6B).
  • for the specific adjustment process, refer to the above description of FIG. 6A, which is not repeated here.
  • the device can individually adjust the sensitivity of the decision condition based on any one of them.
  • the device can comprehensively adjust the sensitivity of the decision condition based on any of the multiple influencing factors.
  • FIG. 7 exemplarily shows the adjustment of the decision condition based on the first correlation degree between the semantics of the first voice information and the semantics of the valid voice information most recently acquired by the device, and on the second correlation degree between the semantics of the first voice information and the semantics of the invalid voice information most recently acquired by the device.
  • the device can acquire the most recently acquired valid voice information (referred to as the recent historical valid voice information) and, based on the semantics obtained by analyzing the first voice information and the recent historical valid voice information, analyze the degree of association between the two pieces of voice information (referred to as the first degree of association).
  • semantic understanding of the first speech information may be performed by invoking a natural language understanding model in the memory.
  • the device can calculate the specific first correlation degree, and then adjust the sensitivity of the decision condition in positive correlation with the calculated first correlation degree.
  • if the first correlation degree is greater than a certain threshold, the probability that the first voice information is valid voice information is high, and the greater the first correlation degree, the higher the sensitivity is adjusted; if the first correlation degree is smaller than a certain threshold, the probability that the first voice information is valid voice information is low, and the smaller the first correlation degree, the lower the sensitivity is adjusted.
  • the device does not need to set the threshold of the first correlation degree, but can set the corresponding adjustment decision conditions within each range of the first correlation degree.
  • taking the case where the judgment condition is the judgment threshold of the above inference model as an example, and assuming that the initial judgment threshold is 70%: in the range of the first correlation degree from 0 to 30%, the sensitivity can be lowered and the judgment threshold adjusted to 80%; in the range from 31% to 60%, the judgment threshold can be set to 75%; in the range from 61% to 70%, no adjustment is made and the original 70% threshold is kept; in the range from 71% to 100%, the sensitivity can be increased and the judgment threshold set to 60%.
  • when the first degree of relevance is judged to be 100%, the sensitivity can be adjusted to the highest, or the invalid rejection model can skip further validity judgment and directly output an indication that the first voice information is valid.
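The range table for the first correlation degree, together with the 100% shortcut, can be sketched as follows. The function names are hypothetical, and the rule that an utterance is valid when the model score reaches the threshold is an assumption consistent with "a lower threshold means a higher sensitivity".

```python
from typing import Optional


def threshold_for_first_association(degree_pct: float) -> Optional[int]:
    """Return the judgment threshold (percent), or None to bypass the judgment."""
    if not 0 <= degree_pct <= 100:
        raise ValueError("degree must be within 0..100")
    if degree_pct >= 100:
        return None  # fully related: output "valid" without further judgment
    if degree_pct <= 30:
        return 80    # weakly related: lower sensitivity
    if degree_pct <= 60:
        return 75
    if degree_pct <= 70:
        return 70    # keep the initial threshold
    return 60        # strongly related: higher sensitivity


def is_valid(model_score_pct: float, degree_pct: float) -> bool:
    """Assumption: valid when the inference model's score reaches the threshold."""
    threshold = threshold_for_first_association(degree_pct)
    return True if threshold is None else model_score_pct >= threshold
```

With a model score of 65%, the same utterance is rejected at a 20% correlation degree but accepted at an 85% correlation degree.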
  • the device can obtain the most recently acquired invalid voice information (referred to as the recent historical invalid voice information) and, based on the semantics obtained by analyzing the first voice information and the recent historical invalid voice information, analyze the degree of association between the two pieces of voice information (referred to as the second degree of association). If the semantics of the two pieces of voice information are not related, that is, the second degree of association is zero, the sensitivity of the decision condition is not adjusted.
  • the device can calculate the specific second correlation degree, and then adjust the sensitivity of the decision condition in negative correlation with the calculated second correlation degree.
  • if the second correlation degree is greater than a certain threshold, the probability that the first voice information is invalid voice information is high, and the greater the second correlation degree, the lower the sensitivity is adjusted; if the second correlation degree is smaller than a certain threshold, the probability that the first voice information is invalid voice information is low, and the smaller the second correlation degree, the higher the sensitivity is adjusted.
  • the device does not need to set the threshold of the second correlation degree, but can set the corresponding adjustment decision conditions within each range of the second correlation degree.
  • taking the case where the judgment condition is the judgment threshold of the above inference model as an example, and assuming that the initial judgment threshold is 70%: in the range of the second correlation degree from 0 to 30%, the sensitivity can be increased and the judgment threshold adjusted to 60%; in the range from 31% to 60%, the judgment threshold can be set to 65%; in the range from 61% to 70%, no adjustment is made and the original 70% threshold is kept; in the range from 71% to 100%, the sensitivity can be lowered and the judgment threshold set to 80%.
  • when the second degree of relevance is judged to be 100%, the sensitivity can be adjusted to the lowest level, or the invalid rejection model can skip further validity judgment and directly output an indication that the first voice information is invalid.
  • in addition to adjusting the sensitivity of the judgment condition based on the first degree of association between the semantics of the first voice information and the semantics of the valid voice information most recently obtained by the device, the device may also adjust the sensitivity of the judgment condition based on a third degree of association between the first voice information and the valid voice information most recently obtained by the device.
  • the third degree of association refers to the degree of association between the content of the first voice information and the content of the valid voice information most recently obtained by the device, whereas the above-mentioned first degree of association refers to the degree of association between the semantics of the two pieces of voice information.
  • to facilitate understanding of the first degree of association and the third degree of association, reference may be made to FIG. 8A and FIG. 8B.
  • the semantic association inference model is a pre-trained neural network model or a machine learning model or the like.
  • in FIG. 8B, similarly, assume that "help me play music" is the latest valid voice information acquired by the device, and "I usually like to listen to singer A's songs" is the first voice information.
  • the two pieces of voice information can be structurally parsed through a natural language understanding model. Specifically, structural analysis of "help me play music" shows that the domain of this voice information is music and its intent is to play music. Structural analysis of "I usually like to listen to singer A's songs" shows that the domain of this voice information is music and the singer is singer A.
  • the correlation judgment model may be, for example, a dialogue state tracking (DST) model or the like.
  • the first correlation degree of the two pieces of voice information "help me play music" and "I usually like to listen to singer A's songs" output in FIG. 8A may be zero, that is, their semantics are not related, while the third degree of relevance of the same two pieces of voice information output in FIG. 8B may be 100%, that is, the two pieces of voice information are related.
  • the third degree of correlation between the first voice information and the valid voice information most recently obtained by the device, obtained by the method described in FIG. 8B, may be exactly 0 or 100%. That is, when the above correlation judgment model outputs indication information of "unrelated", the third degree of correlation is 0, and when it outputs indication information of "related", the third degree of correlation is 100%.
  • the third degree of association between the first voice information and the valid voice information most recently obtained by the device, obtained in the manner described in FIG. 8B, may also be a specific percentage (for example, 60% or 90%) or a similarity score; whether the two are related can then be determined by comparison with a preset threshold.
  • after obtaining the third degree of correlation between the first voice information and the valid voice information most recently obtained by the device, the device can adjust the sensitivity of the decision condition in positive correlation with the third degree of correlation.
  • when the third degree of correlation is zero, that is, when the first voice information is not related to the valid voice information most recently acquired by the device, the sensitivity of the decision condition is not adjusted.
  • after the device obtains the above-mentioned first voice information, it can obtain the state of the voice dialogue between the device and the user up to the time the first voice information is obtained.
  • the state may be, for example, a state of judgment or small talk; the device may learn the state based on dialogue state tracking (DST) technology. If a state exists in which the device is in voice dialogue with the user, the user and the device have been conducting an extended interactive dialogue, and the device can increase the sensitivity of the decision condition according to the continuous dialogue state. If no such state exists, the user has not been conducting an extended interactive dialogue with the device, and the device may leave the sensitivity of the decision condition unadjusted with respect to this factor.
  • the device can individually adjust the sensitivity of the decision condition based on any one of them.
  • the device can comprehensively adjust the sensitivity of the decision condition based on any of the multiple influencing factors.
  • FIG. 9 exemplarily shows the adjustment of the decision condition based on the first similarity between the acoustic features of the first voice information and those of historically valid voice information, and on the second similarity between the acoustic features of the first voice information and those of historically invalid voice information.
  • the acoustic features include features such as intonation and/or speed of speech.
  • after acquiring the above-mentioned first voice information, the device extracts the acoustic features of the first voice information by invoking the acoustic model stored in the memory, and then compares the extracted acoustic features with the acoustic features of historically valid voice information (one or more pieces) to obtain the similarity between them (referred to as the first similarity). If the similarity between the acoustic features of the first voice information and the acoustic features of the historically valid voice information is zero, the device may not adjust the sensitivity of the decision condition according to the first similarity.
  • the sensitivity of the decision condition can be adjusted in positive correlation with the similarity; that is, the greater the similarity (exemplarily, the similarity can be the largest of the obtained similarities, or their average, etc.), the higher the sensitivity is adjusted.
  • if the similarity between the acoustic features of the first voice information and the acoustic features of one or more pieces of historically valid voice information is greater than a certain threshold (for example, the threshold may be any value between 60% and 100%), the acoustic features of the first voice information are similar to those of the historically valid voice information, and the device can increase the sensitivity of the decision condition to a preset value.
  • in each such case, the judgment threshold is adjusted to 60%.
  • after acquiring the above-mentioned first voice information, the device extracts the acoustic features of the first voice information by invoking the acoustic model stored in the memory, and then compares the extracted acoustic features with the acoustic features of historically invalid voice information (one or more pieces) to obtain the similarity between them (referred to as the second similarity). If the similarity between the acoustic features of the first voice information and the acoustic features of the historically invalid voice information is zero, the device may not adjust the sensitivity of the decision condition according to the second similarity.
  • the sensitivity of the decision condition may be adjusted in negative correlation with the similarity; that is, the greater the similarity (exemplarily, the similarity can be the largest of the obtained similarities, or their average, etc.), the lower the sensitivity is adjusted.
  • if the similarity between the acoustic features of the first voice information and the acoustic features of one or more pieces of historically invalid voice information is greater than a certain threshold (for example, the threshold may be any value between 60% and 100%), the acoustic features of the first voice information are similar to those of the historically invalid voice information, and the device can lower the sensitivity of the decision condition to a preset value. For example, taking the above judgment threshold as an example and assuming that the original judgment threshold is 70%, whenever this similarity is greater than the threshold, the judgment threshold is adjusted to 75%.
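The preset-value adjustment for acoustic similarity to historically invalid voice can be sketched as below. Representing acoustic features as numeric vectors and measuring similarity with cosine similarity are assumptions; the source does not name a similarity measure. The 60% similarity limit and the 75% preset come from the examples in the text, and the function names are hypothetical.

```python
import math
from statistics import mean
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def adjust_for_invalid_similarity(
    threshold_pct: float,
    features: Sequence[float],
    invalid_history: Sequence[Sequence[float]],
    similarity_limit: float = 0.6,  # example lower end of the 60%..100% range
    preset_pct: float = 75.0,       # preset threshold from the example
    aggregate: Callable[[Sequence[float]], float] = max,  # or statistics.mean
) -> float:
    """Raise the threshold to a preset when the utterance sounds like past invalid speech."""
    if not invalid_history:
        return threshold_pct
    similarities = [cosine_similarity(features, past) for past in invalid_history]
    if aggregate(similarities) > similarity_limit:
        return preset_pct
    return threshold_pct
```

Passing `aggregate=mean` uses the average of the obtained similarities instead of the largest one, matching the two options mentioned in the text.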
  • the device can individually adjust the sensitivity of the decision condition based on any one of them.
  • the device can comprehensively adjust the sensitivity of the decision condition based on any of the multiple influencing factors.
  • the device may receive an instruction input by the user, and adaptively adjust the sensitivity of the decision condition based on the instruction.
  • the instruction may be, for example, a specific decision condition sensitivity specified by the user, or may be an instruction such as turning off or canceling the voice information validity recognition.
  • the sensitivity of the above judgment condition can be adaptively adjusted according to the user's preference, so as to better meet the user's needs and improve the user experience.
  • the adjustment of the sensitivity of the above-mentioned judgment condition may be performed by another device or apparatus (for example, a server corresponding to the above-mentioned device) based on the above-mentioned one or more influencing factors, with the adjusted result then sent to the above-mentioned device.
  • the above-mentioned device may directly judge the validity of the above-mentioned first voice information based on the adjusted judgment condition.
  • FIG. 10 shows a voice information processing method provided by the present application, and the method includes but is not limited to the following steps:
  • for the specific implementation of this step, reference may be made to the description of step S201 in FIG. 2 above, which is not repeated here.
  • the operation indicated by the first voice information is executed, wherein the judgment condition is obtained by adjustment based on the environmental condition in which the first voice information is generated.
  • the device can adaptively adjust the judgment condition for judging whether the first voice information is a valid voice command based on the environment in which the first voice information is generated.
  • the device uses the adjusted judgment condition to determine whether the first voice information is valid.
  • the device starts to perform semantic understanding on the first voice information.
  • the processor in the device can call the natural language understanding model in the memory to perform semantic understanding of the first voice information and obtain its specific meaning.
  • the device performs a corresponding operation based on the meaning to provide the user with the desired service.
  • the meaning of the first voice information is, for the device, a control instruction for executing the corresponding operation.
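The overall flow described above (adjust the judgment condition, judge validity, then understand and execute only valid voice information) can be sketched as a small pipeline. Every name here is hypothetical: the four callables stand in for the validity inference model, the adjustment logic, the natural language understanding model and the operation executor, and the rule "valid when the score reaches the threshold" is an assumption.

```python
from typing import Callable, Optional, TypeVar

Audio = TypeVar("Audio")
Meaning = TypeVar("Meaning")
Result = TypeVar("Result")


def process_voice(
    audio: Audio,
    score_validity: Callable[[Audio], float],     # inference model score, percent
    adjust_threshold: Callable[[float], float],   # environment-based adjustment
    understand: Callable[[Audio], Meaning],       # semantic understanding (NLU)
    execute: Callable[[Meaning], Result],         # perform the indicated operation
    base_threshold_pct: float = 70.0,
) -> Optional[Result]:
    """Run NLU and execute the operation only if the voice information is judged valid."""
    threshold = adjust_threshold(base_threshold_pct)
    if score_validity(audio) < threshold:
        return None  # judged invalid: no semantic understanding, no operation
    meaning = understand(audio)
    return execute(meaning)
```

Skipping semantic understanding for invalid voice information is what avoids wasting device resources on chat that was never addressed to the device.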
  • the device can receive the sensitivity of the judgment condition specified by user input, and then adaptively adjust, based on that sensitivity, the judgment condition for judging whether the first voice information is a valid voice command, so that the judgment sensitivity specified by the user is achieved when the adjusted judgment condition is used to judge whether voice information is valid.
  • the device uses the adjusted judgment condition to judge whether the first voice information is valid.
  • the device starts to perform semantic understanding on the first voice information to obtain the meaning of the first voice information, and performs corresponding operations based on the meaning to provide the user with the desired service.
  • the meaning of the first voice information is, for the device, a control instruction for executing the corresponding operation.
  • the environment in which the above-mentioned first voice information is generated includes one or more of the following: the number of speakers within the second preset time period when the device obtains the first voice information, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
  • the probability that the voice information received by the device is idle chat, that is, invalid voice.
  • the higher the confidence and/or the signal-to-noise ratio of the voice information, the higher the probability that the device correctly recognizes the sentences of the voice information, which in turn affects the recognition of the validity of the voice information. Adjusting the judgment conditions for judging the validity of the voice information accordingly makes it possible to judge validity better, improve the accuracy of validity judgment, and reduce the false trigger rate of invalid signals.
  • when the above-mentioned environmental conditions indicate that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the above-mentioned judgment condition is increased; when the environmental conditions indicate that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the judgment condition is lowered.
  • the embodiment of the present application adaptively adjusts the judgment conditions for judging the validity of the voice information according to the environmental conditions under which the voice information is received, so that the validity of the voice information can be judged better under different environmental conditions, the accuracy of validity judgment is improved, and the false trigger rate of invalid signals is reduced.
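The environment-based rule described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the application: the `Environment` fields mirror the four environmental conditions listed above, while `estimate_p_valid`, its cut-off values, the 0 to 1 sensitivity scale and the `step` size are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical snapshot of the conditions under which speech was captured."""
    num_speakers: int   # speakers heard within the preset time period
    num_people: int     # people within the preset range
    confidence: float   # recognition confidence, assumed to lie in [0, 1]
    snr_db: float       # signal-to-noise ratio in dB


def estimate_p_valid(env: Environment) -> float:
    """Toy heuristic (an assumption): crowds lower, clean audio raises the estimate."""
    p = 0.5
    if env.num_speakers > 1 or env.num_people > 1:
        p -= 0.2   # several people present: idle chat is more likely
    if env.confidence >= 0.8:
        p += 0.15
    if env.snr_db >= 15.0:
        p += 0.15
    return min(max(p, 0.0), 1.0)


def adjust_sensitivity(sensitivity: float, env: Environment, step: float = 0.1) -> float:
    """Raise sensitivity when 'valid' is more likely than 'invalid', lower it otherwise."""
    p_valid = estimate_p_valid(env)
    p_invalid = 1.0 - p_valid
    if p_valid > p_invalid:
        sensitivity += step
    elif p_valid < p_invalid:
        sensitivity -= step
    # Keep the result on the assumed [0, 1] scale.
    return min(max(sensitivity, 0.0), 1.0)
```

With these assumed values, a single speaker with clean, confidently recognized audio raises the sensitivity by one step, and a noisy multi-speaker scene lowers it by one step.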
  • the above judgment condition is adjusted and obtained based on the environmental condition where the first voice information is generated, including: the judgment condition is adjusted and obtained based on the environmental condition and the continuous listening duration of the device.
  • the device can adaptively adjust the sensitivity of the above-mentioned decision condition in combination with the environmental conditions in which the first voice information is generated and the duration of the device's continuous listening to the voice information.
  • for the specific implementation of adjusting the decision condition based on the duration of the device's continuous listening to the voice information, reference may be made to the corresponding description in FIG. 5 above, and details are not repeated here.
  • the device may configure a weight for each of the foregoing environmental conditions and listening duration, and comprehensively adjust the sensitivity of the decision condition in a weighted manner.
  • for example, suppose the judgment thresholds adjusted based on the two influencing factors, the environmental situation and the listening duration, are a4 and a5, and the corresponding weights are w4 and w5; then the adjusted judgment threshold determined by combining the two factors is (a4*w4+a5*w5).
  • this weighted synthesis method is only an example. In actual implementation, the most adjusted or least adjusted result among the multiple influencing factors may be taken as the final adjustment result, and so on. This scheme does not limit the specific synthesis calculation process.
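The weighted combination (a4*w4+a5*w5) and the "most adjusted" or "least adjusted" alternatives can be illustrated with a short sketch. The function name `combine_thresholds`, the `base` threshold used to measure how far each factor moved the threshold, and the factor names are assumptions made for this example; the application does not prescribe them.

```python
def combine_thresholds(base, adjusted, weights=None, mode="weighted"):
    """Combine per-factor adjusted thresholds into one final threshold.

    base:     threshold before any adjustment (assumed reference point)
    adjusted: dict mapping factor name -> threshold adjusted for that factor
    weights:  dict mapping factor name -> weight (needed for mode="weighted")
    mode:     "weighted", "most" (largest change) or "least" (smallest change)
    """
    if not adjusted:
        return base
    if mode == "weighted":
        if weights is None:
            raise ValueError("weights are required for weighted combination")
        # Same form as a4*w4 + a5*w5 in the text, for any number of factors.
        return sum(adjusted[name] * weights[name] for name in adjusted)
    if mode == "most":
        return max(adjusted.values(), key=lambda a: abs(a - base))
    if mode == "least":
        return min(adjusted.values(), key=lambda a: abs(a - base))
    raise ValueError(f"unknown mode: {mode}")


# Usage with made-up numbers: a4 = 0.6 (environment), a5 = 0.3 (listening duration).
thresholds = {"environment": 0.6, "listening": 0.3}
weights = {"environment": 0.7, "listening": 0.3}
final = combine_thresholds(0.5, thresholds, weights)
```

Whether the weights should sum to 1 is not stated in the text; the sketch simply evaluates the weighted sum as written.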
  • the judgment condition for the validity of the voice information is adaptively adjusted according to the environmental conditions when the voice information is generated and the continuous listening duration of the device, so that the validity of the voice information can be judged still better, the accuracy of validity judgment is improved, and the false trigger rate of invalid signals is reduced.
  • the above judgment condition is adjusted based on the environmental condition and the continuous listening duration of the device, including: the judgment condition is adjusted based on the environmental condition, the continuous listening duration and historical voice information.
  • the situation of the historical voice information includes one or more of the following: the first time interval between the time the first voice information is acquired and the time valid voice information was last acquired; the second time interval between the time the first voice information is acquired and the time invalid voice information was last acquired; the proportion of valid voice information and invalid voice information within the first preset time period before the first voice information is acquired; the first degree of relevance between the semantics of the first voice information and the semantics of the most recently acquired valid voice information; the second degree of relevance between the semantics of the first voice information and the semantics of the most recently acquired invalid voice information; a third degree of relevance; the state of the voice dialogue between the device and the user up to the time the first voice information is acquired; the first similarity between the acoustic features of the first voice information and those of historical valid voice information; the second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
  • the sensitivity of the above-mentioned decision condition is increased.
  • the sensitivity of the above-mentioned judgment condition is increased;
  • the proportion of the valid voice information is smaller than the proportion of the invalid voice information, the proportion of the valid voice information is on an upward trend, and the sensitivity of the judgment condition is increased; the proportion of the valid voice information is on a downward trend , the sensitivity of the decision condition is reduced.
  • the sensitivity of the judgment condition is adjusted to be higher.
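The trend rule stated above (raise the sensitivity when the proportion of valid voice information is rising, lower it when it is falling) might look like the following sketch. Splitting the window into an earlier half and a recent half to detect the trend, the 0 to 1 sensitivity scale and the `step` size are assumptions made for illustration only.

```python
def valid_ratio(flags):
    """Share of valid utterances in a list of booleans (True = valid)."""
    return sum(flags) / len(flags) if flags else 0.0


def adjust_by_trend(sensitivity, history, step=0.1):
    """Adjust sensitivity from the trend of the valid share in a time window.

    history: chronological validity flags for the utterances heard within
             the preset time period before the current utterance.
    """
    if len(history) < 2:
        return sensitivity  # not enough data to speak of a trend
    mid = len(history) // 2
    earlier = valid_ratio(history[:mid])
    recent = valid_ratio(history[mid:])
    if recent > earlier:      # valid share is on an upward trend
        sensitivity += step
    elif recent < earlier:    # valid share is on a downward trend
        sensitivity -= step
    return min(max(sensitivity, 0.0), 1.0)
```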
  • the device can adaptively adjust the sensitivity of the above judgment condition in combination with the environment in which the first voice information is generated, the duration of the device's continuous listening to the voice information, and the historical voice information heard by the device.
  • for the specific implementation of adjusting the judgment condition based on the environmental conditions in which the first voice information is generated, reference may be made to the corresponding description in FIG. 4; for the specific implementation of adjusting the judgment condition based on the continuous listening duration, reference may be made to the corresponding description in FIG. 5; and for the specific implementation of adjusting the judgment condition based on the historical voice information heard by the device, reference may be made to the corresponding descriptions in FIG. 5, FIG. 6A, FIG. 6B, FIG. 7 or FIG. 9. Details are not repeated here.
  • the sensitivity of the judgment condition is adjusted in combination with the above-mentioned environmental conditions, listening duration, and historical voice information.
  • for example, the most adjusted or the least adjusted result may be taken as the final adjustment result, and so on. This scheme does not limit the specific comprehensive calculation process.
  • the historical voice information can also help to judge the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to valid voice information acquired in the past, the probability that the currently acquired voice information is a valid voice command is high. Conversely, if the currently acquired voice information is highly similar to invalid voice information acquired in the past, the probability that the currently acquired voice information is an invalid voice command is high. Therefore, in the embodiment of the present application, in addition to the environmental conditions and the listening duration described above, the historical voice information is also used to adaptively adjust the judgment conditions for judging the validity of the voice information, so that the validity of the voice information can be judged still better, the accuracy of validity judgment is improved, and the false trigger rate of invalid signals is reduced.
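As a rough illustration of the similarity argument above, the sketch below compares the acoustic features of the current utterance with stored features of historical valid and invalid utterances. Representing acoustic features as plain lists of floats, using cosine similarity as the similarity measure, and the `step` size are all assumptions for this example; the text does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def adjust_by_history(sensitivity, features, valid_history, invalid_history, step=0.1):
    """Raise sensitivity if the utterance resembles past valid speech more
    than past invalid speech, and lower it in the opposite case."""
    sim_valid = max((cosine(features, h) for h in valid_history), default=0.0)
    sim_invalid = max((cosine(features, h) for h in invalid_history), default=0.0)
    if sim_valid > sim_invalid:
        sensitivity += step
    elif sim_valid < sim_invalid:
        sensitivity -= step
    return min(max(sensitivity, 0.0), 1.0)
```

Taking the best match in each history (rather than an average) is a design choice of the sketch: a single close match to a past valid command is treated as enough evidence.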
  • the above judgment condition is adjusted and obtained based on the environmental condition where the first voice information is generated, including: the judgment condition is adjusted and obtained based on the environmental condition and historical voice information.
  • the device can adaptively adjust the sensitivity of the above-mentioned judgment condition in combination with the environmental conditions where the first voice information is generated and the historical voice information heard by the device.
  • for the specific implementation of adjusting the judgment conditions based on the environmental conditions in which the first voice information is generated, reference may be made to the corresponding description in FIG. 4, which will not be repeated here; for the specific implementation of adjusting the judgment conditions based on the historical voice information heard by the device, reference may be made to the corresponding descriptions in FIG. 5, FIG. 6A, FIG. 6B, FIG. 7 or FIG. 9, and details are not repeated here.
  • the sensitivity of the decision condition is adjusted in combination with the above-mentioned environmental conditions and historical voice information. For example, the most adjusted or the least adjusted result may be taken as the final adjustment result, and so on; this scheme does not limit the specific comprehensive calculation process.
  • the judgment conditions for judging the validity of the voice information are adaptively adjusted in combination with the environmental conditions in which the voice information is generated and the historical voice information, so that the validity of the voice information can be judged still better, the accuracy of validity judgment is improved, and the false trigger rate of invalid signals is reduced.
  • the present application provides another voice information processing method.
  • the method includes: acquiring first voice information; and, in the case that the first voice information is determined to be a valid voice control instruction based on a judgment condition, executing the operation indicated by the first voice information, wherein the judgment condition is obtained by adjusting based on the continuous listening duration of the device.
  • for the specific implementation of executing the operation indicated by the first voice information when it is determined based on the judgment condition that the first voice information is a valid voice control command, reference may be made to the description of step S203 above, and details are not repeated here.
  • for the specific implementation of adjusting the above-mentioned judgment condition based on the duration of the device's continuous listening to the voice information, reference may be made to the corresponding description in FIG. 5, which will not be repeated here.
  • by adaptively adjusting the judgment condition for judging the validity of the voice information according to the continuous listening duration of the device, the validity of the voice information can be judged better, the accuracy of validity judgment can be improved, and the false trigger rate of invalid signals can be reduced.
  • the present application provides another voice information processing method.
  • the method includes: acquiring first voice information; and, in the case that the first voice information is determined to be a valid voice control instruction based on a judgment condition, executing the operation indicated by the first voice information, wherein the judgment condition is adjusted based on historical voice information.
  • the historical voice information can also help to judge the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to valid voice information acquired in the past, the probability that the currently acquired voice information is a valid voice command is high. Conversely, if the currently acquired voice information is highly similar to invalid voice information acquired in the past, the probability that the currently acquired voice information is an invalid voice command is high. Therefore, in the present application, by adaptively adjusting the judgment conditions for judging the validity of the voice information according to the historical voice information, the validity of the voice information can be judged better, the accuracy of validity judgment can be improved, and the false trigger rate of invalid signals can be reduced.
  • to facilitate an overall understanding of the voice information processing method provided by the present application, reference may be made, for example, to the flowchart shown in FIG. 11.
  • the voice interaction system of the device is awakened, and then the system starts to listen to the user's voice. After the system acquires the user's voice information, the voice information is input into the above-mentioned invalid recognition model to identify the validity of the voice information. If it is recognized that the voice information is valid, the voice information is semantically understood, and instructions are parsed and executed based on the understood semantics.
  • the voice interaction system then determines whether to continue listening to the user's voice; if so, it performs the operation of listening to the voice, and if not, it performs the operation of ending listening. Exemplarily, whether to continue listening may be determined according to a preset listening duration: if the preset listening duration has not been exceeded, listening continues; otherwise, listening is terminated.
  • the system determines whether to continue listening to the user's voice, and if so, performs the operation of listening to the voice. If it is determined not to continue listening, perform the operation of ending listening.
  • the two steps of judging whether to continue listening to the user's voice and performing semantic understanding can also be carried out simultaneously, or it can first be determined whether to continue listening to the user and semantic understanding can be performed afterwards.
  • the present application does not limit the sequence of execution of the two operations.
  • the semantics of the understood speech information can also be fed back to the process of judging the validity of the speech information, for example, input into the above-mentioned invalid rejection model and used to adjust the sensitivity of the above-mentioned judgment conditions.
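The overall flow described for FIG. 11 can be summarized in a minimal sketch. The callables `is_valid`, `understand` and `execute` stand in for the validity recognition model, the semantic understanding step and instruction execution; they, the `(timestamp, text)` representation of utterances and the function name `run_session` are hypothetical placeholders, not interfaces defined by the application.

```python
def run_session(utterances, is_valid, understand, execute, max_listen_s):
    """Process utterances heard after wake-up until the listening window ends.

    utterances:   iterable of (seconds_since_wakeup, text) pairs
    is_valid:     callable deciding whether the text is a valid command
    understand:   callable mapping text to its meaning
    execute:      callable carrying out the understood instruction
    max_listen_s: preset listening duration in seconds
    """
    executed = []
    for elapsed_s, text in utterances:
        if elapsed_s > max_listen_s:
            break  # preset listening duration exceeded: end listening
        if not is_valid(text):
            continue  # invalid speech is rejected, listening goes on
        meaning = understand(text)
        execute(meaning)
        executed.append(meaning)
    return executed


# Usage with toy stand-ins for the three processing steps.
log = []
result = run_session(
    [(1.0, "open the window"), (2.0, "what did you have for lunch"), (40.0, "play music")],
    is_valid=lambda t: t.startswith(("open", "play")),
    understand=str.upper,
    execute=log.append,
    max_listen_s=30.0,
)
```

The sketch checks the listening window before handling each utterance; as noted above, the application does not fix the order of the continue-listening check relative to semantic understanding.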
  • the above-mentioned embodiments of the voice information processing method provided by the present application are mainly introduced by taking the judgment conditions in the invalid recognition model as an example.
  • the decision condition may not be limited to be the decision condition in the invalid rejection model.
  • the scheme of adjusting the judgment condition of the validity of the voice information is within the protection scope of the present application.
  • the voice information processing method starts from one or more influencing factors that affect the validity judgment of voice information, and adjusts in real time the sensitivity of the judgment condition for the validity of the voice information obtained by the device, so that the device can flexibly and effectively determine the validity of voice information in different scenarios and for different user states. This can improve the accuracy of voice information validity recognition, reduce the false trigger rate of invalid voice information, save the computing resources wasted by devices due to false triggering, and improve the user experience during voice interaction.
  • each device includes corresponding hardware structures and/or software modules for performing each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.
  • the device may be divided into functional modules according to the foregoing method examples.
  • each functional module may be divided corresponding to each function, or two or more functions may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. It should be noted that, the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.
  • FIG. 12 shows a schematic diagram of a possible logical structure of the device, and the device may be the above-mentioned device, or may be a chip in the device, or may be the device processing system, etc.
  • the apparatus 1200 includes an acquisition unit 1201, an adjustment unit 1202, a semantic understanding unit 1203 and an execution unit 1204, wherein:
  • the obtaining unit 1201 is configured to obtain the first voice information.
  • the obtaining unit 1201 may be implemented by a communication interface or a transceiver, and may perform the operations described in step 201 shown in FIG. 2 .
  • the adjustment unit 1202 is used to adjust the judgment condition based on an influencing factor of the validity of the first voice information, where the judgment condition is one or more judgment conditions in the validity judgment model of the first voice information, and the validity indicates whether the first voice information is a valid voice control instruction for the device that obtained the first voice information.
  • the adjustment unit 1202 may be implemented by a processor, and may perform the operations described in step 202 shown in FIG. 2 .
  • the semantic understanding unit 1203 is configured to perform semantic understanding on the first voice information when it is determined that the first voice information is valid based on the adjusted judgment condition.
  • the semantic understanding unit 1203 may be implemented by a processor, and may perform the semantic understanding operation described in step 203 shown in FIG. 2 .
  • the execution unit 1204 is configured to execute the instruction of the first voice information.
  • the execution unit 1204 may be implemented by a processor, and may perform the execution operations described in step 203 shown in FIG. 2 .
  • the adjustment unit 1202 is specifically used for:
  • the sensitivity of the judgment condition is increased, where a higher sensitivity of the judgment condition indicates a higher probability that the first voice information is determined to be valid by the judgment condition;
  • the sensitivity of the judgment condition is lowered, where a lower sensitivity of the judgment condition indicates a lower probability that the first voice information is determined to be valid by the judgment condition.
  • the judgment condition includes a selection condition of a pre-judgment module of the validity of the first speech information in the validity judgment model, and the pre-judgment module includes a rule matching module and a reasoning module.
  • the judgment condition includes a judgment threshold of an inference module used to predict the validity of the first voice information in the validity judgment model.
  • the judgment condition includes a comprehensive judgment condition of a decision module in the validity judgment model; the comprehensive judgment condition is a judgment condition for determining whether the first speech signal is valid based on a prejudgment result; the prejudgment result is the pre-judgment result of the validity of the first voice information by the pre-judgment module in the validity judgment model.
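The three kinds of judgment condition listed above (which pre-judgment modules are selected, the threshold of the inference module, and the comprehensive condition of the decision module) can be pictured with a small sketch. Everything concrete here is assumed for illustration: keyword rules as the rule matching module, a single validity score compared against a threshold as the inference module, and an any/all vote as the comprehensive condition.

```python
def rule_match(text, rules):
    """Rule matching module: verdict of the first matching rule, or None."""
    for keyword, verdict in rules.items():
        if keyword in text:
            return verdict
    return None


def infer(score, threshold):
    """Inference module: valid if the (assumed) validity score reaches the threshold."""
    return score >= threshold


def decide(text, score, rules, threshold,
           use_rules=True, use_inference=True, require_all=False):
    """Decision module combining the pre-judgment results.

    use_rules / use_inference: selection condition of the pre-judgment modules
    threshold:                 judgment threshold of the inference module
    require_all:               comprehensive condition (all votes vs. any vote)
    """
    votes = []
    if use_rules:
        verdict = rule_match(text, rules)
        if verdict is not None:
            votes.append(verdict)
    if use_inference:
        votes.append(infer(score, threshold))
    if not votes:
        return False
    return all(votes) if require_all else any(votes)
```

Under these assumptions, raising the sensitivity would correspond to lowering `threshold` or accepting any single positive vote, and lowering it to raising `threshold` or requiring agreement of all selected modules.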
  • the influencing factor is one or more of the following:
  • the continuous listening duration of the device 1200;
  • the first time interval between the time the first voice information is obtained and the time valid voice information was last obtained;
  • the first degree of relevance between the semantics of the first voice information and the semantics of the most recently acquired valid voice information;
  • the second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
  • the environment in which the first voice information is generated includes one or more of the following:
  • the number of speakers within the second preset time period in which the device 1200 obtains the first voice information, the number of people within the preset range when the first voice information is generated, the confidence level of the first voice information, or the signal-to-noise ratio of the first voice information.
  • FIG. 13 shows a schematic diagram of a possible logical structure of the device, and the device may be the above-mentioned device, or may be a chip in the device, or may be the device processing system, etc.
  • the apparatus 1300 includes an acquisition unit 1301 and an execution unit 1302, wherein:
  • the obtaining unit 1301 is configured to obtain the first voice information.
  • the obtaining unit 1301 may be implemented by a communication interface or a transceiver, and may perform the operations described in step S1001 shown in FIG. 10 .
  • the executing unit 1302 is configured to execute the operation indicated by the first voice information in the case that the first voice information is determined to be a valid voice control instruction based on a judgment condition, wherein the judgment condition is adjusted based on the environmental conditions in which the first voice information is generated.
  • the execution unit 1302 may be implemented by a processor, and may perform the operations described in step S1002 shown in FIG. 10 .
  • FIG. 14 shows a schematic diagram of a possible hardware structure of the device provided by the present application, and the device may be the device in the method described in the foregoing embodiment.
  • the device 1400 includes: a processor 1401, a memory 1402 and a communication interface 1403.
  • the processor 1401, the communication interface 1403, and the memory 1402 may be connected to each other or connected through a bus 1404.
  • the memory 1402 is used to store computer programs and data of the device 1400. The memory 1402 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM).
  • the software or program codes required to perform the functions of all or part of the units in FIG. 14 are stored in the memory 1402 .
  • the processor 1401 can not only call the program codes in the memory 1402 to realize some functions, but also cooperate with other components (e.g., the communication interface 1403) to jointly perform other functions (e.g., the function of receiving or sending data) described in the embodiment of FIG. 14.
  • there may be multiple communication interfaces 1403, which are used to support the device 1400 in communicating, such as receiving or sending data or signals.
  • the processor 1401 may be a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
  • a processor may also be a combination that performs computing functions, such as a combination comprising one or more microprocessors, a combination of a digital signal processor and a microprocessor, and the like.
  • the processor 1401 can be used to read the program stored in the above-mentioned memory 1402, and perform the following operations:
  • the judgment condition is one or more judgment conditions in the validity judgment model of the first voice information, and the validity indicates whether the first voice information is a valid voice control instruction for the device 1400 that obtained the first voice information; if it is determined that the first voice information is valid based on the adjusted judgment condition, semantic understanding is performed on the first voice information and the instruction of the first voice information is executed.
  • the adjustment of the decision condition based on the influencing factor of the validity of the first voice information includes:
  • the sensitivity of the judgment condition is increased, where a higher sensitivity of the judgment condition indicates a higher probability that the first voice information is determined to be valid by the judgment condition;
  • the sensitivity of the judgment condition is lowered, where a lower sensitivity of the judgment condition indicates a lower probability that the first voice information is determined to be valid by the judgment condition.
  • FIG. 15 is a schematic structural diagram of another voice information processing apparatus provided by an embodiment of the present application.
  • the apparatus may be the device in the above-mentioned embodiment, or may be a chip in the device, or may be a processing system in the device, etc., and can implement the above-mentioned voice information processing method and various optional embodiments thereof provided by the present application.
  • the voice information processing apparatus 1500 includes: a processor 1501, and an interface circuit 1502 coupled to the processor 1501. It should be understood that although only one processor and one interface circuit are shown in FIG. 15, the voice information processing apparatus 1500 may include other numbers of processors and interface circuits.
  • the interface circuit 1502 is used to communicate with other components of the apparatus 1500, such as memory or other processors.
  • the processor 1501 is used for signal interaction with other components through the interface circuit 1502 .
  • the interface circuit 1502 may be an input/output interface of the processor 1501 .
  • the processor 1501 reads computer programs or instructions in a memory coupled thereto through the interface circuit 1502, and decodes and executes the computer programs or instructions. It should be understood that these computer programs or instructions may include various functional programs in the above-described methods. When the corresponding function program is decoded and executed by the processor 1501, the voice information processing apparatus 1500 can be made to implement the solution in the voice information processing method provided by the embodiments of the present application.
  • these functional programs are stored in a memory outside the voice information processing apparatus 1500 .
  • when the function program is decoded and executed by the processor 1501, part or all of the content of the function program is temporarily stored in the internal memory.
  • these functional programs are stored in the internal memory of the voice information processing apparatus 1500 .
  • the voice information processing apparatus 1500 may be set in the device of the embodiment of the present application.
  • part of the content of these function programs is stored in a memory outside the voice information processing apparatus 1500 , and other parts of the content of these function programs are stored in a memory inside the voice information processing apparatus 1500 .
  • any of the apparatuses or devices shown in FIG. 1, FIG. 12, FIG. 13, FIG. 14 and FIG. 15 may be combined with each other, and the relevant design details of the apparatus or device shown in any of these figures and of each optional embodiment may be referred to each other, and may also refer to the voice information processing method shown in either FIG. 2 or FIG. 10 and the relevant design details of each optional embodiment. Details are not repeated here.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is executed by a processor to implement the operations performed by the server in any one of the foregoing embodiments and possible embodiments thereof.
  • the embodiments of the present application also provide a computer program product, when the computer program product is read and executed by a computer, the operations performed by the server in any one of the foregoing embodiments and possible embodiments thereof will be executed.
  • the embodiments of the present application also provide a computer program, which, when executed on a computer, enables the computer to implement the operations performed by the server in any one of the foregoing embodiments and possible embodiments.
  • the present application provides a voice information processing method and device, which can improve the accuracy of effective voice recognition and reduce the false trigger rate of invalid voices in different intelligent voice interaction scenarios.
  • the terms "first", "second" and the like are used to distinguish between identical or similar items that have substantially the same functions and effects. It should be understood that there are no logical or timing dependencies between "first", "second" and "nth", and no restrictions on number or execution order. It will also be understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another. For example, a first image may be referred to as a second image, and, similarly, a second image may be referred to as a first image, without departing from the scope of the various described examples. Both the first image and the second image may be images, and in some cases, may be separate and distinct images.
  • the size of the sequence number of each process does not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
  • references throughout the specification to "one embodiment," "an embodiment," and "one possible implementation" mean that a particular feature, structure, or characteristic associated with the embodiment or implementation is included in at least one embodiment of the application. Thus, appearances of "in one embodiment" or "in an embodiment" or "one possible implementation" in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephone Function (AREA)

Abstract

A speech information processing method and apparatus, and a device, a chip system, a computer-readable storage medium and a computer program product. The method comprises: S1001, acquiring first speech information; and S1002, when it is determined, on the basis of a decision condition, that the first speech information is a valid speech control instruction, executing an operation indicated by the first speech information, wherein the decision condition is obtained by adjustment on the basis of the environmental conditions in which the first speech information was generated. By means of the method, the accuracy of valid-speech recognition can be increased and the false trigger rate of invalid speech can be reduced in different intelligent speech interaction scenarios.

Description

Voice information processing method and device
Technical field
The present application relates to the technical field of speech processing, and in particular to a voice information processing method and device.
Background
In intelligent voice interaction scenarios, smart devices commonly listen to user speech in one of two modes: a continuous listening mode and a full-time wake-up-free mode, the latter also called the full-time listening mode. In either the continuous listening state or the full-time listening state, the smart device needs to determine whether what the user says is a valid instruction addressed to the device; that is, it needs to distinguish human-to-machine dialogue from human-to-human dialogue.
Specifically, in the listening state, the voice information collected by the device includes chit-chat. To prevent the smart device from being falsely triggered by chit-chat, a rule matching module or an inference module (such as a neural network inference module) is often used to determine whether the acquired voice information is a valid voice control instruction. However, the validity of the same voice information, or of voice information with the same semantics, may differ across usage environments and scenarios. For example, a sentence may be a valid voice control instruction in the current scenario but mere chit-chat, and therefore invalid information, in another scenario. Existing validity determination schemes cannot adapt to such differences across usage environments and scenarios, which tends to result in low recognition accuracy and false triggering by invalid voice.
In summary, how to improve the accuracy of valid-voice recognition and reduce the false trigger rate of invalid voice in different intelligent voice interaction scenarios is a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
The present application provides a voice information processing method and device that can improve the accuracy of valid-voice recognition and reduce the false trigger rate of invalid voice in different intelligent voice interaction scenarios.
In a first aspect, the present application provides a voice information processing method, the method comprising:
acquiring first voice information; and, when the first voice information is determined based on a decision condition to be a valid voice control instruction, performing the operation indicated by the first voice information, wherein the decision condition is obtained by adjustment based on the environmental conditions in which the first voice information was generated.
The environmental conditions in which voice information is generated strongly affect whether it is a valid voice control instruction: the same or similar voice information may be a valid instruction under one set of environmental conditions but not necessarily under another. The present application therefore adaptively adjusts the decision condition used to judge validity according to the environmental conditions in which the voice information is received. This makes it possible to judge the validity of voice information better under different environmental conditions, improving the accuracy of validity determination and reducing the false trigger rate of invalid signals.
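By way of a non-limiting illustration, the two steps of the first aspect can be sketched as follows. The names process_voice and stricter_when_crowded, the numeric validity score, the dictionary key recent_speakers and all constants are assumptions made for this sketch only; the application does not prescribe a particular scoring model or adjustment rule.

```python
from typing import Callable, Dict


def stricter_when_crowded(base_threshold: float, env: Dict[str, int]) -> float:
    """Hypothetical adjustment rule: each extra recent speaker raises the threshold."""
    extra_speakers = max(0, env.get("recent_speakers", 1) - 1)
    return min(0.95, base_threshold + 0.1 * extra_speakers)


def process_voice(
    validity_score: float,
    env: Dict[str, int],
    adjust: Callable[[float, Dict[str, int]], float],
    execute: Callable[[], None],
    base_threshold: float = 0.5,
) -> bool:
    """Adjust the decision condition from the environment, then decide and act."""
    threshold = adjust(base_threshold, env)  # decision condition after adjustment
    if validity_score >= threshold:          # judged to be a valid control instruction
        execute()                            # perform the indicated operation
        return True
    return False                             # treated as invalid, nothing is triggered
```

With the same score of 0.6, the sketch executes the operation when only one speaker was heard recently and ignores the utterance when four speakers were heard, which mirrors the idea that identical voice information can be valid in one environment and invalid in another.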
In a possible implementation, the environmental conditions in which the first voice information is generated include one or more of the following: the number of speakers within a second preset duration ending when the device acquires the first voice information; the number of people within a preset range when the first voice information is generated; the confidence of the first voice information; or the signal-to-noise ratio of the first voice information.
The more speakers there are within a period of time, and/or the more people there are nearby when the voice information is generated, the more likely it is that the voice information received by the device is chit-chat, that is, invalid voice. In addition, a higher confidence and/or signal-to-noise ratio indicates a higher probability that the device can correctly recognize the sentence in the voice information, which also affects the determination of validity. Adaptively adjusting the decision condition based on one or more of these items therefore makes it possible to judge the validity of voice information better, improving the accuracy of validity determination and reducing the false trigger rate of invalid signals.
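The four environmental items listed above can be grouped into a single record, as in the following sketch. The class name, field names and the cut-off values (more than one speaker, more than one person, confidence below 0.6, SNR below 10 dB) are assumptions chosen for illustration; treating low confidence or low SNR as a cue against validity is likewise an assumption, since the text only states that these items influence the validity determination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentConditions:
    recent_speaker_count: int  # speakers heard within the second preset duration
    nearby_people_count: int   # people within the preset range
    asr_confidence: float      # confidence of the voice information, 0.0 to 1.0
    snr_db: float              # signal-to-noise ratio in decibels


def chit_chat_cues(env: EnvironmentConditions) -> int:
    """Count how many of the four items point towards invalid voice (hypothetical cut-offs)."""
    cues = 0
    cues += env.recent_speaker_count > 1  # several people talking
    cues += env.nearby_people_count > 1   # several people present
    cues += env.asr_confidence < 0.6      # recognition result is uncertain
    cues += env.snr_db < 10.0             # noisy capture
    return cues
```

A caller could, for example, lower the sensitivity of the decision condition as the returned count grows.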
In a possible implementation, the decision condition being obtained by adjustment based on the environmental conditions in which the first voice information was generated includes: the decision condition is obtained by adjustment based on the environmental conditions and the continuous listening duration of the device.
The longer the device has been continuously listening, the more likely it is that the voice information it hears is invalid voice. The present application therefore combines the environmental conditions in which the voice information is generated with the continuous listening duration of the device to adaptively adjust the decision condition, which makes it possible to judge the validity of voice information still better, improving the accuracy of validity determination and reducing the false trigger rate of invalid signals.
In a possible implementation, the decision condition being obtained by adjustment based on the environmental conditions and the continuous listening duration of the device includes: the decision condition is obtained by adjustment based on the environmental conditions, the continuous listening duration, and the situation of historical voice information.
Historical voice information can also help to judge the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to historically acquired valid voice information, it is more likely to be a valid voice instruction; conversely, if it is highly similar to historically acquired invalid voice information, it is more likely to be an invalid voice instruction. Therefore, in addition to the environmental conditions and the listening duration described above, the present application also uses historical voice information to adaptively adjust the decision condition, which makes it possible to judge the validity of voice information still better, improving the accuracy of validity determination and reducing the false trigger rate of invalid signals.
In a possible implementation, the decision condition being obtained by adjustment based on the environmental conditions in which the first voice information was generated includes: the decision condition is obtained by adjustment based on the environmental conditions and the situation of historical voice information.
As described above, the present application combines the environmental conditions in which the voice information is generated with historical voice information to adaptively adjust the decision condition, which likewise makes it possible to judge the validity of voice information better, improving the accuracy of validity determination and reducing the false trigger rate of invalid signals.
In a possible implementation, the situation of the historical voice information includes one or more of the following:
a first time interval between the acquisition of the first voice information and the most recent acquisition of valid voice information;
a second time interval between the acquisition of the first voice information and the most recent acquisition of invalid voice information;
the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
a first degree of association between the semantics of the first voice information and the most recently acquired valid voice information;
a second degree of association between the semantics of the first voice information and the most recently acquired invalid voice information;
a third degree of association between the first voice information and the valid voice information most recently acquired by the device;
the state of the voice dialogue between the device and the user as of the time the first voice information is acquired;
a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
In the present application, the historical voice information that can help to judge the validity of the currently acquired voice information includes one or more of the items above. Adaptively adjusting the decision condition based on any one or more of these items makes it possible to judge the validity of voice information better, improving the accuracy of validity determination and reducing the false trigger rate of invalid signals.
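As one hedged example of the last two items in the list above, the first and second similarities could be computed from acoustic feature vectors. Cosine similarity and the helper names below are assumptions for this sketch; the application does not state which similarity measure or feature representation is used.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def leans_valid(
    current: Sequence[float],
    valid_history: Sequence[Sequence[float]],
    invalid_history: Sequence[Sequence[float]],
) -> bool:
    """True when the first similarity (to valid history) exceeds the second (to invalid history)."""
    first = max((cosine_similarity(current, h) for h in valid_history), default=0.0)
    second = max((cosine_similarity(current, h) for h in invalid_history), default=0.0)
    return first > second
```

A result of True would suggest raising the sensitivity of the decision condition, and False would suggest leaving it unchanged or lowering it.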
In a possible implementation, when the environmental conditions indicate that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is increased;
when the environmental conditions indicate that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the decision condition is lowered.
In the embodiments of the present application, if received voice information is more likely to be valid, the threshold for judging it valid can be lowered, that is, the sensitivity of the decision condition can be increased; if it is less likely to be valid, the threshold can be raised, that is, the sensitivity can be lowered. The validity of voice information received under different environmental conditions can thus be identified flexibly and more accurately, instead of applying one fixed decision condition to the voice information in every scenario.
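The inverse relationship between sensitivity and the validity threshold can be made concrete with a small sketch. The step size of 0.1, the 0-to-1 sensitivity scale and the mapping threshold = 1 - sensitivity are assumptions chosen for illustration, not values from the application.

```python
def adjust_sensitivity(sensitivity: float, p_valid: float, step: float = 0.1) -> float:
    """Raise sensitivity when validity is more likely than invalidity, lower it otherwise."""
    p_invalid = 1.0 - p_valid
    if p_valid > p_invalid:
        sensitivity += step
    elif p_valid < p_invalid:
        sensitivity -= step
    return min(1.0, max(0.0, sensitivity))  # keep the value on the 0..1 scale


def threshold_for(sensitivity: float) -> float:
    """Higher sensitivity means a lower bar for accepting an utterance as valid."""
    return 1.0 - sensitivity
```

For instance, starting from a sensitivity of 0.5, an estimated validity probability of 0.8 raises the sensitivity and therefore lowers the threshold returned by threshold_for, while a probability of 0.2 does the opposite.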
In a possible implementation, the longer the continuous listening duration of the device, the lower the sensitivity of the decision condition is set.
The longer the device has been continuously listening, the more likely it is that the voice information it hears is invalid voice. In the present application, the threshold for judging voice information valid can therefore be raised, that is, the sensitivity of the decision condition can be lowered, so that whether voice information is valid can be identified more accurately.
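One way to realize "longer listening, lower sensitivity" is a monotonically decreasing function of the listening duration. The exponential half-life form, the 60-second half-life and the floor of 0.1 below are assumptions for this sketch; the text only requires that sensitivity decrease as the duration grows.

```python
def sensitivity_for_listening(
    base_sensitivity: float,
    listening_seconds: float,
    half_life_seconds: float = 60.0,
    floor: float = 0.1,
) -> float:
    """Halve the sensitivity every half_life_seconds, but never go below floor."""
    decayed = base_sensitivity * 0.5 ** (listening_seconds / half_life_seconds)
    return max(floor, decayed)
```

The floor keeps the device from becoming entirely unresponsive after a very long listening session; whether such a floor is desirable is a design choice outside the text.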
In a possible implementation, the situation of the historical voice information includes a first time interval between the acquisition of the first voice information and the most recent acquisition of valid voice information; the longer the first time interval, the lower the sensitivity of the decision condition is set.
The longer the interval between the acquisition of the current voice signal and the most recent acquisition of valid voice information, the more likely it is that the current voice signal is an invalid voice instruction. In the present application, the threshold for judging voice information valid can therefore be raised, that is, the sensitivity of the decision condition can be lowered, so that whether voice information is valid can be identified more accurately.
In a possible implementation, the situation of the historical voice information includes a second time interval between the acquisition of the first voice information and the most recent acquisition of invalid voice information; the longer the second time interval, the lower the sensitivity of the decision condition is set.
The longer the interval between the acquisition of the current voice signal and the most recent acquisition of invalid voice information, the more likely it is that the current voice signal is an invalid voice instruction. In the present application, the threshold for judging voice information valid can therefore be raised, that is, the sensitivity of the decision condition can be lowered, so that whether voice information is valid can be identified more accurately.
In a possible implementation, the situation of the historical voice information includes a first time interval between the acquisition of the first voice information and the most recent acquisition of valid voice information, and a second time interval between the acquisition of the first voice information and the most recent acquisition of invalid voice information; when the first time interval is shorter than the second time interval, the sensitivity of the decision condition is increased.
In the present application, a first time interval shorter than the second time interval indicates that not much time has passed between the most recent acquisition of historical valid voice information and the acquisition of the first voice information, so the first voice information is relatively likely to be a valid voice instruction. The threshold for judging it valid can therefore be lowered, that is, the sensitivity of the decision condition can be increased, so that whether voice information is valid can be identified more accurately.
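The comparison of the two intervals can be sketched as below. The function name, the step of 0.1 and the choice to leave the sensitivity unchanged when the first interval is not shorter are assumptions; the text specifies only the case in which the sensitivity is increased.

```python
def adjust_for_intervals(
    sensitivity: float,
    seconds_since_last_valid: float,    # first time interval
    seconds_since_last_invalid: float,  # second time interval
    step: float = 0.1,
) -> float:
    """Raise sensitivity when the last valid utterance is more recent than the last invalid one."""
    if seconds_since_last_valid < seconds_since_last_invalid:
        return min(1.0, sensitivity + step)
    return sensitivity  # assumption: no change in the case the text leaves open
```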
In a possible implementation, the situation of the historical voice information includes the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
when the proportion of valid voice information is greater than the proportion of invalid voice information, the sensitivity of the decision condition is increased;
when the proportion of valid voice information is smaller than the proportion of invalid voice information, the sensitivity of the decision condition is increased if the proportion of valid voice information is trending upward, and lowered if the proportion of valid voice information is trending downward.
In the present application, if valid voice information accounts for the larger share within the first preset duration, the currently acquired first voice information is more likely to be a valid instruction, so the threshold for judging it valid can be lowered and the sensitivity of the decision condition increased. In addition, if the proportion of valid voice information is smaller than that of invalid voice information but is trending upward, valid voice information is becoming more frequent and the first voice signal is more likely to be a valid instruction, so the threshold can likewise be lowered and the sensitivity increased, so that whether voice information is valid can be identified more accurately.
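The proportion-and-trend rule above can be illustrated with the following sketch. Representing the history as a list of booleans (oldest first), estimating the trend by comparing the older and newer halves of the window, and the step of 0.1 are all assumptions made for illustration; the application does not say how the trend is measured.

```python
from typing import Sequence


def _valid_share(labels: Sequence[bool]) -> float:
    """Share of valid entries in a window; 0.0 for an empty window."""
    return sum(labels) / len(labels) if labels else 0.0


def adjust_for_history(sensitivity: float, labels: Sequence[bool], step: float = 0.1) -> float:
    """labels holds True (valid) or False (invalid) for each utterance in the window, oldest first."""
    if not labels:
        return sensitivity
    share = _valid_share(labels)
    if share > 0.5:  # valid utterances outnumber invalid ones
        return min(1.0, sensitivity + step)
    if share < 0.5:  # invalid utterances dominate, so look at the trend
        half = len(labels) // 2
        older, newer = _valid_share(labels[:half]), _valid_share(labels[half:])
        if newer > older:  # share of valid voice is rising
            return min(1.0, sensitivity + step)
        if newer < older:  # share of valid voice is falling
            return max(0.0, sensitivity - step)
    return sensitivity  # equal shares or flat trend: left unchanged (assumption)
```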
In a possible implementation, the situation of the historical voice information includes the state of the voice dialogue between the device and the user as of the time the first voice information is acquired; when a voice dialogue between the device and the user is in progress, the sensitivity of the decision condition is increased.
The state of the voice dialogue between the device and the user refers to a state in which the device and the user are conversing by voice, which the device can track through a dialogue state tracking function. If this state currently exists, the first voice information is very likely a valid voice instruction, so the threshold for judging it valid can be lowered and the sensitivity of the decision condition increased, so that whether voice information is valid can be identified more accurately.
In a possible implementation, the device may receive a specified sensitivity for the decision condition, adjust the decision condition based on that sensitivity, and then use the adjusted decision condition to determine whether the first voice information is valid.
In the present application, the specified sensitivity is a sensitivity input by the user. The device can thus adjust the sensitivity of the decision condition more flexibly according to the user's needs, and thereby better meet those needs.
In a possible implementation, the present application provides another voice information processing method, the method comprising: acquiring first voice information; and, when the first voice information is determined based on a decision condition to be a valid voice control instruction, performing the operation indicated by the first voice information, wherein the decision condition is obtained by adjustment based on the continuous listening duration of the device.
In the present application, the longer the device has been continuously listening, the more likely it is that the voice information it hears is invalid voice. The decision condition can therefore be adaptively adjusted according to the continuous listening duration of the device, which makes it possible to judge the validity of voice information better, improving the accuracy of validity determination and reducing the false trigger rate of invalid signals.
In a possible implementation, the present application provides another voice information processing method, the method comprising: acquiring first voice information; and, when the first voice information is determined based on a decision condition to be a valid voice control instruction, performing the operation indicated by the first voice information, wherein the decision condition is obtained by adjustment based on historical voice information.
Historical voice information can also help to judge the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to historically acquired valid voice information, it is more likely to be a valid voice instruction; conversely, if it is highly similar to historically acquired invalid voice information, it is more likely to be an invalid voice instruction. In the present application, the decision condition is therefore adaptively adjusted according to historical voice information, which makes it possible to judge the validity of voice information better, improving the accuracy of validity determination and reducing the false trigger rate of invalid signals.
In a second aspect, the present application provides a voice information processing apparatus, the apparatus comprising:
an acquisition unit, configured to acquire first voice information;
an execution unit, configured to perform the operation indicated by the first voice information when the first voice information is determined based on a decision condition to be a valid voice control instruction, wherein the decision condition is obtained by adjustment based on the environmental conditions in which the first voice information was generated.
In a possible implementation, the environmental conditions in which the first voice information is generated include one or more of the following: the number of speakers within a second preset duration ending when the device acquires the first voice information; the number of people within a preset range when the first voice information is generated; the confidence of the first voice information; or the signal-to-noise ratio of the first voice information.
In a possible implementation, the decision condition being obtained by adjustment based on the environmental conditions in which the first voice information was generated includes: the decision condition is obtained by adjustment based on the environmental conditions and the continuous listening duration of the device.
In a possible implementation, the decision condition being obtained by adjustment based on the environmental conditions and the continuous listening duration of the device includes: the decision condition is obtained by adjustment based on the environmental conditions, the continuous listening duration, and the situation of historical voice information.
In a possible implementation, the decision condition being obtained by adjustment based on the environmental conditions in which the first voice information was generated includes: the decision condition is obtained by adjustment based on the environmental conditions and the situation of historical voice information.
In a possible implementation, the situation of the historical voice information includes one or more of the following:
a first time interval between the acquisition of the first voice information and the most recent acquisition of valid voice information;
a second time interval between the acquisition of the first voice information and the most recent acquisition of invalid voice information;
the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
a first degree of association between the semantics of the first voice information and the most recently acquired valid voice information;
a second degree of association between the semantics of the first voice information and the most recently acquired invalid voice information;
a third degree of association between the first voice information and the valid voice information most recently acquired by the device;
the state of the voice dialogue between the device and the user as of the time the first voice information is acquired;
a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
一种可能的实施方式中,在所述环境情况指示所述第一语音信息有效的概率大于无效的概率的情况下,所述判决条件的灵敏度被调高;In a possible implementation manner, when the environmental condition indicates that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is increased;
在所述环境情况指示所述第一语音信息有效的概率小于无效的概率的情况下,所述判决条件的灵敏度被调低。In the case where the environmental conditions indicate that the probability that the first voice information is valid is smaller than the probability that it is invalid, the sensitivity of the decision condition is lowered.
一种可能的实施方式中,所述设备的持续聆听时长越长所述判决条件的灵敏度被调得越低。In a possible implementation manner, the longer the continuous listening time of the device is, the lower the sensitivity of the decision condition is adjusted.
一种可能的实施方式中,所述历史语音信息的情况包括获取所述第一语音信息时与最近一次获取到有效语音信息之间的第一时间间隔;所述第一时间间隔越长所述判决条件的灵敏度被调得越低。In a possible implementation manner, the situation of the historical voice information includes a first time interval between when the first voice information is acquired and when valid voice information is acquired most recently; the longer the first time interval, the The sensitivity of the decision condition is adjusted lower.
一种可能的实施方式中,所述历史语音信息的情况包括获取所述第一语音信息时与最近一次获取到无效语音信息之间的第二时间间隔;所述第二时间间隔越长所述判决条件的灵敏度被调得越低。In a possible implementation manner, the situation of the historical voice information includes a second time interval between when the first voice information is acquired and when invalid voice information is acquired most recently; the longer the second time interval, the The sensitivity of the decision condition is adjusted lower.
一种可能的实施方式中,所述历史语音信息的情况包括获取所述第一语音信息时与最近一次获取到有效语音信息之间的第一时间间隔,以及包括获取所述第一语音信息时与最近一次获取到无效语音信息之间的第二时间间隔;在所述第一时间间隔小于所述第二时间间隔的情况下,所述判决条件的灵敏度被调高。In a possible implementation manner, the situation of the historical voice information includes the first time interval between when the first voice information is acquired and the last time valid voice information is acquired, and includes the time when the first voice information is acquired. The second time interval between the latest acquisition of invalid voice information; in the case that the first time interval is smaller than the second time interval, the sensitivity of the decision condition is increased.
In a possible implementation, the situation of the historical voice information includes the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
when the proportion of valid voice information is greater than the proportion of invalid voice information, the sensitivity of the decision condition is raised;
when the proportion of valid voice information is smaller than the proportion of invalid voice information: if the proportion of valid voice information is trending upward, the sensitivity of the decision condition is raised; if the proportion of valid voice information is trending downward, the sensitivity of the decision condition is lowered.
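The proportion-based rule above can be sketched as follows. The function name, the 0-to-1 sensitivity scale, the step size, and the representation of the trend as a signed change in the valid proportion are all assumptions for illustration; the application only states the direction of each adjustment.

```python
def adjust_by_ratio(sensitivity: float,
                    valid_ratio: float,
                    invalid_ratio: float,
                    valid_trend: float,
                    step: float = 0.1) -> float:
    """Adjust the sensitivity from the share of valid vs. invalid utterances.

    valid_trend > 0 means the valid share is rising, < 0 means it is falling
    (an illustrative encoding of "upward" and "downward" trend).
    """
    if valid_ratio > invalid_ratio:
        sensitivity += step              # valid speech dominates: raise
    elif valid_ratio < invalid_ratio:
        if valid_trend > 0:
            sensitivity += step          # still a minority, but rising: raise
        elif valid_trend < 0:
            sensitivity -= step          # a minority and falling: lower
    # Equal shares or a flat trend are not covered by the text; leave unchanged.
    return min(1.0, max(0.0, sensitivity))
```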
In a possible implementation, the situation of the historical voice information includes the state of the voice dialogue between the device and the user up to the time at which the first voice information is acquired; when a voice dialogue between the device and the user is in progress, the sensitivity of the decision condition is raised.
In a third aspect, this application provides a device. The device may include a processor and a memory and is configured to implement the voice information processing method described in the first aspect. The memory is coupled to the processor; when the processor executes a computer program stored in the memory, the method described in the first aspect or in any possible implementation of the first aspect can be implemented. The device may further include a communication interface through which the device communicates with other devices. For example, the communication interface may be a transceiver, a circuit, a bus, a module, or another type of communication interface.
In a possible implementation, the device may include:
a memory, configured to store a computer program; and
a processor, configured to acquire first voice information and, when the first voice information is determined based on a decision condition to be a valid voice control instruction, perform the operation indicated by the first voice information, where the decision condition is obtained by adjustment based on the environment in which the first voice information is generated.
It should be noted that the computer program in the memory may be pre-stored, or may be downloaded from the Internet and stored when the device is used; this application does not specifically limit the source of the computer program in the memory. Coupling in the embodiments of this application is an indirect coupling or connection between apparatuses, units, or modules; it may be electrical, mechanical, or of another form, and is used for information exchange between the apparatuses, units, or modules.
In a fourth aspect, an embodiment of this application provides a chip system applied to an electronic apparatus. The chip system includes an interface circuit and a processor, which are interconnected by a line. The interface circuit is configured to receive a signal from a memory of the electronic apparatus and send the signal to the processor, where the signal includes computer instructions stored in the memory. When the processor executes the computer instructions, the chip system performs the method described in the first aspect or in any possible implementation of the first aspect.
In a fifth aspect, this application provides a computer-readable storage medium storing a computer program, where the computer program is executed by a processor to implement the method described in the first aspect or in any possible implementation of the first aspect.
In a sixth aspect, this application provides a computer program product; when the computer program product is executed by a processor, the method described in the first aspect or in any possible implementation of the first aspect is performed.
The solutions provided in the second to sixth aspects are used to implement, or to cooperate in implementing, the corresponding method provided in the first aspect, and can therefore achieve beneficial effects that are the same as or correspond to those of the method in the first aspect. Details are not repeated here.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a system architecture to which the voice information processing method provided in this application is applicable;
FIG. 2 is a schematic flowchart of the voice information processing method provided in this application;
FIG. 3 is a schematic structural diagram of an invalid rejection model provided in this application;
FIG. 4 and FIG. 5 are schematic diagrams of adjusting the sensitivity of the decision condition based on influencing factors, as provided in this application;
FIG. 6A and FIG. 6B are schematic diagrams of adjusting the sensitivity of the decision condition based on influencing factors, as provided in this application;
FIG. 6C and FIG. 6D are schematic diagrams of changes in the proportion of voice information in this application;
FIG. 7 is a schematic diagram of adjusting the sensitivity of the decision condition based on influencing factors, as provided in this application;
FIG. 8A and FIG. 8B are schematic diagrams of judging the degree of association of voice information in this application;
FIG. 9 is a schematic diagram of adjusting the sensitivity of the decision condition based on influencing factors, as provided in this application;
FIG. 10 is a schematic flowchart of another voice information processing method provided in this application;
FIG. 11 is a schematic flowchart of voice information validity recognition provided in this application;
FIG. 12 is a schematic diagram of the logical structure of an apparatus provided in an embodiment of this application;
FIG. 13 is a schematic diagram of the logical structure of another apparatus provided in an embodiment of this application;
FIG. 14 is a schematic diagram of the hardware structure of a device provided in an embodiment of this application;
FIG. 15 is a schematic diagram of the hardware structure of another apparatus provided in an embodiment of this application.
Detailed Description
For ease of understanding, the technical terms used in the embodiments of this application are introduced first.
1. Automatic speech recognition (ASR) generally refers to a technology that takes speech as its object of study and, through speech signal processing and pattern recognition, enables a machine to automatically recognize and understand human speech, so that the machine converts a speech signal into the corresponding text or command.
Building a speech recognition system involves two main parts: training and recognition. Training is usually completed offline: signal processing and knowledge mining are performed on a large, pre-collected speech and language database to obtain the acoustic model and the language model required by the speech recognition system. The acoustic model is a knowledge representation of differences in acoustics, phonetics, environmental variables, speaker gender, accent, and the like; the language model is a knowledge representation of how sequences of words are composed. Recognition is usually completed online and automatically recognizes the user's speech in real time. The recognition process can in turn be divided into a front-end module and a back-end module. The front-end module mainly performs endpoint detection (removing redundant silence and non-speech sound), noise reduction, feature extraction, and the like. The back-end module uses the trained acoustic model and language model to perform statistical pattern recognition (also called decoding) on the feature vectors of the user's speech and obtain the text they contain. In addition, the back-end module includes an adaptive feedback module that can learn from the user's speech, make the necessary corrections to the acoustic model and the language model, and further improve recognition accuracy.
2. Voiceprint recognition (VR)
Voiceprint recognition, also called speaker recognition, is a biometric technology that determines a speaker's identity from the voice. There are two types of voiceprint recognition: speaker identification and speaker verification. Different tasks and applications use different types; for example, identification may be needed to narrow the scope of a criminal investigation, whereas verification is needed for bank transactions.
3. Speech synthesis
Speech synthesis, also called text-to-speech (TTS) technology, converts text generated by the computer itself or input from outside into intelligible, fluent spoken output. It is equivalent to fitting the machine with an artificial mouth so that it can speak like a person.
4. Task-oriented dialogue system
A task-oriented dialogue can be understood as a sequential decision-making process. During the dialogue, the machine updates and maintains an internal dialogue state by understanding the user's utterances, and then selects the best next action according to the current dialogue state (for example, confirming a requirement, asking about constraints, or providing a result) in order to complete the task.
The task-oriented dialogue systems commonly used in the industry have a modular structure that generally includes four key modules:
Natural language understanding (NLU): recognizes and parses the user's text input to obtain semantic labels that a computer can understand, such as slot values and intents.
Dialogue state tracking (DST): maintains the current dialogue state according to the dialogue history. The dialogue state is a cumulative semantic representation of the entire dialogue history, generally expressed as slot-value pairs.
Dialogue policy (DP): outputs the next system action according to the current dialogue state. The dialogue state tracking module and the dialogue policy module are generally referred to together as the dialogue manager (DM) module.
Natural language generation (NLG): converts the system action into natural language output.
This modular structure is highly interpretable and easy to deploy, and most practical task-oriented dialogue systems in the industry adopt it.
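As an informal illustration of how the dialogue state tracking and dialogue policy modules fit together, the following sketch accumulates slot-value pairs and picks the next action. The class and function names, the slot names, and the action strings are hypothetical and are not taken from this application.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DialogueState:
    """Cumulative semantic representation of the dialogue: slot-value pairs."""
    slots: Dict[str, str] = field(default_factory=dict)


def track_state(state: DialogueState, nlu_slots: Dict[str, str]) -> DialogueState:
    """DST step: merge the slots parsed by NLU for this turn into the state."""
    merged = dict(state.slots)
    merged.update(nlu_slots)
    return DialogueState(merged)


def choose_action(state: DialogueState,
                  required: Tuple[str, ...] = ("destination", "time")) -> str:
    """DP step: request the first missing slot, otherwise provide a result."""
    for name in required:
        if name not in state.slots:
            return f"request({name})"
    return "provide_result"


state = track_state(DialogueState(), {"destination": "airport"})
first_action = choose_action(state)    # "time" is still missing
state = track_state(state, {"time": "8 am"})
second_action = choose_action(state)   # all required slots are filled
```

An NLG module would then turn an action such as `request(time)` into a sentence for the user.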
5. Computer vision (CV)
Computer vision, also called machine vision, is the science of making machines "see". Its main task is to obtain information about a scene by processing captured images or video.
6. Invalid rejection model
The invalid rejection model is used to judge the validity of the user's voice information acquired by the device. Validity indicates whether the voice information is a valid voice control instruction for the device that acquired it. The voice information may be, for example, text information converted from the voice signal received by the device.
While listening, the device may receive a great deal of voice information from users, but some of it is merely chat between users and is invalid for the device. Only the voice information with which a user actually interacts with the device is valid for the device; this valid information constitutes the user's voice control instructions.
In this application, the invalid rejection model may include a pre-judgment module and a decision module for voice information validity. The pre-judgment module includes a rule matching module and an inference module and makes a preliminary judgment on the validity of the voice information, where:
The rule matching module may match the input voice information against preset rules, such as preset sentences. If one of the preset sentences matches the input voice information, the input voice information is valid; if none of the preset sentences matches, the input voice information is invalid.
The inference module may be a deep learning prediction model obtained by training on large-scale data using a neural network or conventional machine learning (for example, a supervised learning model such as a support vector machine (SVM)). When the device inputs the acquired voice information into the inference module, the module can predict, for example, the probability that the voice information is valid, or directly output whether it is valid.
The decision module may apply a comprehensive judgment condition to the processing result of at least one of the rule matching module and the inference module to make the final decision on whether the voice information is valid, which can greatly improve the accuracy of the validity judgment. The comprehensive judgment condition is introduced later and is not detailed here.
It should be noted that the invalid rejection model may also be called a validity judgment model or the like. The following description uses the invalid rejection model as an example; the name of the model used to judge the validity of the voice information acquired by the device does not limit this application.
To better explain the voice information processing method provided in the embodiments of this application, the system architecture to which the method is applicable is described below by way of example.
Referring to FIG. 1, FIG. 1 shows an example of a system architecture used by the voice information processing method provided in this application. The system architecture may include an audio manager 110, a video manager 120, a memory 130, and a processor 140, which may be connected by a bus 150.
The audio manager 110 may include a speaker and a microphone array. A speaker is a transducer that converts an electrical signal into a sound signal and is used to output the device's sound. A microphone is an energy conversion device that converts a sound signal into an electrical signal and is used to capture sound information such as human speech.
The video manager 120 may include a camera array. A camera converts an optical image signal into an electrical signal for storage or transmission.
The memory 130 is used to store computer programs and data. The memory 130 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM).
In this application, the memory 130 may store the computer programs or code of models such as an automatic speech recognition model, a voiceprint recognition model, a computer vision model, an invalid rejection model, a natural language understanding model, a dialogue management model, and a speech synthesis model.
The processor 140 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The processor 140 may be configured to read the computer programs and data stored in the memory 130 and perform the voice information processing method provided in the embodiments of this application.
This application does not limit the type of the bus 150. For example, the bus 150 may be a desktop bus (D-BUS), which is an inter-process communication (IPC) mechanism optimized for desktop environments and used for communication between processes or between a process and the kernel. Alternatively, the bus 150 may be a data bus (DB), an address bus (AB), a control bus (CB), or the like.
For example, the system architecture shown in FIG. 1 may be the system architecture of a device such as a terminal device or a server. The terminal device may include, but is not limited to, any device based on an intelligent operating system that can interact with a user through input devices such as a keyboard, a virtual keyboard, a touchpad, a touchscreen, or a voice control device, for example a smartphone, a tablet computer, a handheld computer, a wearable electronic device, or an in-vehicle device (such as an in-vehicle computer). The server may be an edge server or a cloud server, and may be a virtual server or a physical server; this application does not limit this.
The system architecture shown in FIG. 1 is only an example and does not limit the system architecture to which the embodiments of this application are applicable.
The following describes a voice information processing method provided in an embodiment of this application. The method is applicable to the system architecture shown in FIG. 1; that is, the method may be performed by a device such as the terminal device or server described above, or by a processing apparatus such as a chip or processor in the terminal device or server. In the following description, the entity that performs the method is referred to as the device. Optionally, if the method is performed by a server or by a chip or processor in a server, the terminal device may first receive the voice information and then send it to the server for processing. The voice information that the terminal device sends to the server may be the original information received by the terminal device, or voice information that the terminal device has preprocessed.
Referring to FIG. 2, a voice information processing method provided in an embodiment of this application may include, but is not limited to, the following steps:
S201. Acquire first voice information.
In a specific embodiment, the device may receive the user's voice signal through a microphone. The device may then recognize the voice signal using an automatic speech recognition (ASR) model to obtain the voice information corresponding to the voice signal; the voice information may include text information and the like.
Specifically, the voice interaction function between the device and the user may be woken up when the device receives a wake-up signal from the user, for example a specific wake-up word. After being woken up, the device can detect and receive the user's voice signal through the microphone; this process of detecting and receiving the user's voice signal may be called the listening process of the device. To avoid having to wake the device before every voice control instruction, two listening modes are currently in main use: continuous listening and full-time listening.
Continuous listening means that, for a period of time (for example, 30 s) after the device is woken up or a voice instruction is executed successfully, the device does not need to be woken up again; during that period it keeps listening, interacts with the user by voice, and executes the user's voice control instructions.
Full-time listening means that the device needs to be woken up only once after it starts; from then until the device is turned off, it keeps listening, interacts with the user by voice, and executes the user's voice control instructions.
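The difference between the two listening modes can be sketched as a small timing helper. The class name, the method names, and the use of `None` to represent full-time listening are assumptions for illustration; only the 30 s example window comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListeningWindow:
    """Tracks whether the device is still listening without a new wake-up.

    window_s is the continuous-listening period in seconds (30 s in the
    example above); window_s=None models full-time listening.
    """
    window_s: Optional[float] = 30.0
    last_trigger: Optional[float] = None

    def trigger(self, now: float) -> None:
        """Call on wake-up or after a voice instruction succeeds."""
        self.last_trigger = now

    def is_listening(self, now: float) -> bool:
        if self.last_trigger is None:
            return False                 # never woken up
        if self.window_s is None:
            return True                  # full-time listening
        return now - self.last_trigger <= self.window_s
```

In continuous mode each successful instruction calls `trigger` again, which restarts the window; in full-time mode a single `trigger` keeps the device listening until it is turned off.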
The first voice information may be the voice information corresponding to any voice signal received by the device during the listening stage.
S202. Adjust a decision condition based on factors that influence the validity of the first voice information, where the decision condition is one or more judgment conditions in the invalid rejection model used to judge the validity of the first voice information.
To make the invalid rejection model easier to understand, refer to FIG. 3, which shows an example of the processing flow of the invalid rejection model. First, the invalid rejection model receives voice information, for example the first voice information, and, based on the voice information and a preset selection condition, selects the pre-judgment module to be used for judging its validity; that is, it selects at least one of the inference module and the rule matching module to pre-judge the validity of the voice information.
The selection condition may be a condition set based on the factors that influence the validity of voice information. For example, the selection condition may be as follows: when the listening duration of the device is greater than a first threshold, the rule matching module is selected to judge the validity of the voice information; when the listening duration of the device is less than a second threshold, the inference module is selected; and when the listening duration of the device is between the second threshold and the first threshold, both the rule matching module and the inference module may be selected. It should be noted that the factors influencing the validity of voice information are not limited to the listening duration of the device; they are described in detail below and are not elaborated here.
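The example selection condition can be written as a short sketch. The function name, the module labels, and the handling of values exactly equal to a threshold are assumptions; the text only describes the three ranges and implies that the second threshold does not exceed the first.

```python
from typing import FrozenSet


def select_prejudgment_modules(listening_s: float,
                               first_threshold: float,
                               second_threshold: float) -> FrozenSet[str]:
    """Choose which pre-judgment module(s) to run from the listening duration.

    Assumes second_threshold <= first_threshold. Durations equal to either
    threshold are treated as "between" (an illustrative choice).
    """
    if second_threshold > first_threshold:
        raise ValueError("second_threshold must not exceed first_threshold")
    if listening_s > first_threshold:
        return frozenset({"rule_matching"})
    if listening_s < second_threshold:
        return frozenset({"inference"})
    return frozenset({"rule_matching", "inference"})
```

With hypothetical thresholds of 60 s and 10 s, a device that has been listening for 30 s would run both modules.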
If only the inference module is selected to pre-judge the validity of the voice information, the device inputs the acquired voice information into the inference module, which computes an output result. For example, the output result may be the predicted probability that the input voice information is valid; this probability is then compared with a preset judgment threshold to obtain the pre-judgment result. Specifically, if the probability is greater than the judgment threshold, the pre-judgment result is that the input voice information is valid; if the probability is less than the judgment threshold, the pre-judgment result is that the input voice information is invalid. For example, suppose the judgment threshold is 70%, so that voice information is determined to be valid whenever its probability of being valid exceeds 70%. If the inference module predicts a validity probability of 80%, which is greater than 70%, the voice information is valid. If the inference module predicts a validity probability of 50%, which is less than 70%, the voice information is invalid.
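The threshold comparison in this example amounts to a single test. In the sketch below the function name is hypothetical, and treating a probability exactly equal to the threshold as invalid is an assumption, since the text covers only the "greater than" and "less than" cases.

```python
def prejudge_by_probability(p_valid: float, threshold: float = 0.70) -> bool:
    """Return True (valid) when the predicted probability exceeds the threshold.

    The 0.70 default mirrors the 70% example; equality is treated as invalid,
    which is an illustrative choice not stated in the text.
    """
    return p_valid > threshold
```

Lowering `threshold` makes the model accept more utterances as valid, which is one plausible way to read "raising the sensitivity" of this decision condition.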
It should be noted that the result output by the inference module is not limited to a validity probability and may take other forms. For example, it may be a score, where a score above the judgment threshold indicates that the voice information is valid. This application does not limit this.
If only the rule matching module is selected to pre-judge the validity of the voice information, the device inputs the acquired voice information into the rule matching module, which compares it with the information in a preset rule base to obtain the pre-judgment result. If the rule base contains information that matches the input voice information, the pre-judgment result is that the input voice information is valid. Conversely, if the rule base contains no information that matches the input voice information, the pre-judgment result is that the input voice information is invalid.
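A minimal sketch of rule matching against a preset rule base follows. The matching criterion used here (exact equality after lower-casing and whitespace normalization) and the sample sentences are assumptions; the application does not specify how a match is determined.

```python
from typing import AbstractSet


def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


# Hypothetical rule base of preset sentences.
RULE_BASE = frozenset(
    normalize(s)
    for s in ("open the window", "turn on the air conditioner", "play music")
)


def prejudge_by_rules(text: str, rules: AbstractSet[str] = RULE_BASE) -> bool:
    """Valid if the normalized input equals one of the preset sentences."""
    return normalize(text) in rules
```

A production rule base would more likely use patterns or templates than literal sentences; the set lookup is only meant to show the valid-if-matched, invalid-otherwise behavior.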
When only the inference module or only the rule matching module is selected to pre-judge the validity of the voice information, the pre-judgment result, once obtained, may be input into the decision module. The decision module uses the comprehensive judgment condition to judge whether the pre-judgment result is reasonable and outputs a final indication of whether the voice information is valid. For example, suppose the comprehensive judgment condition is that valid voice information contains no fewer than 3 characters. If the input voice information contains fewer than 3 characters but the pre-judgment result output by the inference module or the rule matching module is that the voice information is valid, the pre-judgment result is unreasonable; the decision module therefore determines that the voice information is invalid and outputs final indication information to that effect. Conversely, if the input voice information contains no fewer than 3 characters and the pre-judgment result output by the inference module or the rule matching module is that the voice information is valid, the pre-judgment result is reasonable; the decision module finally determines that the voice information is valid and outputs indication information to that effect.
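The character-count check in this example can be sketched as follows. The function name is hypothetical, and two details are assumptions: surrounding whitespace is ignored when counting characters, and a pre-judgment of "invalid" is passed through unchanged, since the text does not describe that case.

```python
def decide(text: str, prejudged_valid: bool, min_chars: int = 3) -> bool:
    """Final decision: veto a "valid" pre-judgment for too-short input.

    min_chars=3 mirrors the example condition that valid voice information
    contains no fewer than 3 characters.
    """
    if prejudged_valid and len(text.strip()) < min_chars:
        return False          # pre-judgment is unreasonable: override to invalid
    return prejudged_valid    # otherwise keep the pre-judgment result
```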
It should be noted that the comprehensive judgment condition is not limited to the above example and may take other forms. In one possible implementation, the comprehensive judgment condition may be a voting mechanism: if the votes for the voice information being valid are in the majority, the voice information is determined to be valid; if the votes for it being invalid are in the majority, it is determined to be invalid.
Alternatively, in one possible implementation, where only the inference module or only the rule matching module is selected to pre-judge the validity of the voice information, no further comprehensive judgment is needed; instead, the result output by the inference module or the rule matching module is output as the final result of the invalid rejection model.
If both the inference module and the rule matching module are selected to pre-judge the validity of the voice information, the acquired voice information is input into the inference module and the rule matching module separately. Each module pre-judges the validity of the voice information according to its own procedure (see the description above, which is not repeated here) and produces its own validity pre-judgment result. The two pre-judgment results are then input into the decision module, which makes a final decision on them based on its comprehensive judgment condition and outputs the final result of the invalid rejection model.
For example, the comprehensive judgment condition may be that valid voice information includes no fewer than 3 characters; the decision module then checks the reasonableness of the two pre-judgment results based on this condition. For the specific checking process, see the preceding description, which is not repeated here.
For example, in one possible implementation, the comprehensive judgment condition may be a voting mechanism: if the votes for the voice information being valid are in the majority, the voice information is determined to be valid; if the votes for it being invalid are in the majority, it is determined to be invalid. If both validity pre-judgment results are "valid", the final decision for the voice information is also "valid". If both are "invalid", the final decision is also "invalid". If one is "valid" and the other is "invalid", a further judgment may be made, for example according to priority: if the inference module has a higher priority than the rule matching module, the pre-judgment result of the inference module is output as the final result; if the rule matching module has a higher priority than the inference module, the pre-judgment result of the rule matching module is output as the final result.
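The two-module voting mechanism with a priority tie-break can be sketched as follows. The function name `vote` and the string labels `"inference"` and `"rule_matching"` are hypothetical names chosen for illustration; the patent text does not prescribe any particular interface.

```python
def vote(inference_valid: bool, rule_valid: bool, prefer: str = "inference") -> bool:
    """Combine two pre-judgment results into a final validity decision.

    `prefer` names the module whose result wins when the two disagree
    (a hypothetical way of encoding the priority described in the text).
    """
    if prefer not in ("inference", "rule_matching"):
        raise ValueError("prefer must be 'inference' or 'rule_matching'")
    # Unanimous votes decide the outcome directly.
    if inference_valid == rule_valid:
        return inference_valid
    # Split vote: fall back to the higher-priority module.
    return inference_valid if prefer == "inference" else rule_valid


if __name__ == "__main__":
    print(vote(True, False, prefer="inference"))      # inference module wins
    print(vote(True, False, prefer="rule_matching"))  # rule matching wins
```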
It should be noted that the above comprehensive judgment conditions are only examples. Their main purpose is to combine the pre-judgment results of the inference module and/or the rule matching module so as to determine the validity of the acquired voice information relatively accurately. In specific embodiments, the comprehensive judgment condition may be any other condition that achieves this purpose, which is not limited in this solution.
Based on the above description of FIG. 3, the decision condition described in S202 may include one or more of the following in the invalid rejection model: the selection condition, the judgment threshold applied to the output of the inference module, and the comprehensive judgment condition. That is, in this application, in order to improve the accuracy of valid speech recognition and reduce the false trigger rate of invalid speech across different scenarios, the decision condition may be flexibly adjusted in different voice interaction scenarios based on one or more influencing factors that can affect the validity judgment of the input voice information, so that validity recognition is more flexible and better matches the context and scenario at the time.
In one possible implementation, adjusting the decision condition based on the factors influencing the validity of the first voice information may be as follows:
When analysis based on one or more validity influencing factors shows that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is increased; the higher the sensitivity of the decision condition, the higher the probability that the first voice information is determined to be valid through that decision condition. When the analysis shows that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the decision condition is decreased; the lower the sensitivity of the decision condition, the lower the probability that the first voice information is determined to be valid through that decision condition. The sensitivity of the decision condition and the specific adjustment process are described later and are not detailed here.
Optionally, the influencing factors that can affect the validity recognition of the input voice information may include one or more of the following:
The environment in which the voice information is generated; the continuous listening duration of the device; a first time interval between the time the device acquires the voice information and the time it most recently acquired valid voice information; a second time interval between the time the device acquires the voice information and the time it most recently acquired invalid voice information; the proportions of valid and invalid voice information within a first preset duration before the device acquires the voice information; a first degree of association between the semantics of the voice information and those of the valid voice information most recently acquired by the device; a second degree of association between the semantics of the voice information and those of the invalid voice information most recently acquired by the device; a third degree of association between the first voice information and the valid voice information most recently acquired by the device; the state of the voice dialogue between the device and the user up to the time the current voice information is acquired; a first similarity between the acoustic features of the voice information and those of historical valid voice information; and a second similarity between the acoustic features of the voice information and those of historical invalid voice information.
In one possible implementation, after acquiring the first voice information, the device may adjust the selection condition in the invalid rejection model based on a first factor, which may include one or more of the above influencing factors. The specific adjustment process is described later and is not detailed here.
In one possible implementation, after acquiring the first voice information, the device may adjust the judgment threshold applied to the output of the inference module in the invalid rejection model based on a second factor, which may include one or more of the above influencing factors. The influencing factors included in the second factor and in the first factor may be different, partially the same, or completely the same, as determined by the actual situation; this is not limited in this solution. The specific adjustment process is described later and is not detailed here.
In one possible implementation, after acquiring the first voice information, the device may adjust the comprehensive judgment condition of the decision module in the invalid rejection model based on a third factor, which may include one or more of the above influencing factors. The influencing factors included in the third factor may be different from, partially the same as, or completely the same as those included in the first factor and the second factor, as determined by the actual situation; this is not limited in this solution. The specific adjustment process is described later and is not detailed here.
In a specific implementation, the selection condition, the judgment threshold, and the comprehensive judgment condition may all be adjusted together, or only one or two of them may be adjusted. The choice may be made according to actual requirements and is not limited in this solution.
S203. When the first voice information is determined to be valid based on the adjusted decision condition, perform semantic understanding on the first voice information and execute the instruction of the first voice information.
In a specific embodiment, after acquiring the first voice information and adjusting the decision condition in the invalid rejection model based on the above influencing factors, the device identifies the validity of the first voice information based on the adjusted invalid rejection model.
In one possible implementation, if the device has adjusted the selection condition in the invalid rejection model, the device may, based on the adjusted selection condition, select one or both of the rule matching module and the inference module to pre-judge the validity of the first voice information.
In one possible implementation, if the device has adjusted the judgment threshold of the inference module, and the pre-judgment modules that the device selects for judging the validity of the first voice information include the inference module, then after the inference module outputs data indicating the validity of the first voice information, the device may determine whether the first voice information is valid based on that data and the adjusted judgment threshold.
In one possible implementation, if the device has adjusted the comprehensive judgment condition of the decision module in the invalid rejection model, then after the pre-judgment results of the rule matching module and/or the inference module are obtained, a comprehensive judgment may be made on those results based on the adjusted comprehensive judgment condition, thereby determining the validity of the first voice information.
For the specific process of identifying the validity of the first voice information, see the description of FIG. 3 above, which is not repeated here.
When the first voice information is valid, the device begins semantic understanding of it. Specifically, the processor in the device may call the natural language understanding model in the memory to perform semantic understanding of the first voice information and obtain its specific meaning. Once the device understands the meaning of the first voice information, it performs the corresponding operation based on that meaning to provide the service the user needs. For the device, the meaning of the first voice information is the control instruction for performing the corresponding operation.
The following describes, for each of the different factors influencing the validity of voice information, how the decision condition used in identifying the validity of the first voice information is adjusted. It should be noted that the decision condition may include one or more of the selection condition, the judgment threshold, and the comprehensive judgment condition in the invalid rejection model, and the adjustment process described below applies to the adjustment of any one or more of them.
Before the adjustment process is described, the concepts involved are introduced first:
Sensitivity of the decision condition: the sensitivity refers to how lenient or strict the decision condition is. The stricter the decision condition, the lower the sensitivity; the more lenient the decision condition, the higher the sensitivity.
For example, regarding the selection condition for choosing the pre-judgment model: in general, the inference module predicts the likelihood that the voice information is valid, which is a form of fuzzy matching, whereas the rule matching module performs pattern-matching pre-judgment, in which the input either matches or does not, and is therefore comparatively strict. Accordingly, when the pre-judgment model is selected, if the voice information acquired by the device has a relatively high probability of being valid, either the inference module or the rule matching module may be selected for pre-judgment; or, if the aim is to improve the accuracy with which the voice information is recognized as valid, the inference module may be selected. If the voice information acquired by the device has a relatively low probability of being valid, the rule matching module may be selected so as to effectively avoid false triggering by invalid information.
For example, suppose the selection condition is: when the listening duration of the device is less than 10 seconds, the inference module is selected for pre-judgment; when the listening duration is greater than 20 seconds, the rule matching module is selected; and when the listening duration is between 10 and 20 seconds, both the inference module and the rule matching module are selected. To filter invalid information better and reduce false triggering, the device may adjust the selection condition in the stricter direction, that is, lower its sensitivity. For example, the selection condition may be adjusted to: when the listening duration is less than 5 seconds, the inference module is selected; when it is greater than 10 seconds, the rule matching module is selected; and when it is between 5 and 10 seconds, both modules are selected. Conversely, to recognize valid voice information better, the device may adjust the selection condition in the more lenient direction, that is, raise its sensitivity. For example, the selection condition may be adjusted to: when the listening duration is less than 15 seconds, the inference module is selected; when it is greater than 25 seconds, the rule matching module is selected; and when it is between 15 and 25 seconds, both modules are selected.
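The listening-duration selection condition and its stricter and more lenient variants can be sketched as follows. The class name `SelectionCondition`, its field names, and the module labels are hypothetical; the boundary values are the ones given in the example above, and treating a duration exactly on a boundary as falling in the "both modules" band is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionCondition:
    """Listening-duration bands that decide which pre-judgment module(s) run."""
    inference_below_s: float = 10.0  # shorter than this: inference module only
    rule_above_s: float = 20.0       # longer than this: rule matching module only

    def select(self, listening_s: float) -> set:
        if listening_s < self.inference_below_s:
            return {"inference"}
        if listening_s > self.rule_above_s:
            return {"rule_matching"}
        # In between (boundaries included, by assumption): use both modules.
        return {"inference", "rule_matching"}


DEFAULT = SelectionCondition(10.0, 20.0)
STRICT = SelectionCondition(5.0, 10.0)    # lower sensitivity
LENIENT = SelectionCondition(15.0, 25.0)  # higher sensitivity

if __name__ == "__main__":
    for name, cond in (("default", DEFAULT), ("strict", STRICT), ("lenient", LENIENT)):
        print(name, sorted(cond.select(12.0)))
```

With a 12-second listening duration, the strict variant already falls back to rule matching alone, while the lenient variant still trusts the inference module alone.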
For example, regarding the judgment threshold of the inference module, suppose the standard judgment threshold is 70%; that is, if the inference module predicts a probability greater than 70% that the voice information is valid, the voice information is determined to be valid. If the judgment threshold is raised to 80%, the decision condition is adjusted in the stricter direction: the probability predicted by the inference module must now exceed 80% for the voice information to be determined valid, so the sensitivity of the decision condition is lowered. If, instead, the judgment threshold is lowered to 60%, the decision condition is adjusted in the more lenient direction: the predicted probability only needs to exceed 60% for the voice information to be determined valid, so the sensitivity of the decision condition is raised.
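The effect of moving the judgment threshold can be illustrated with a short sketch. The function name `is_valid` is hypothetical, and probabilities are expressed as fractions (0.70 for 70%) purely for convenience.

```python
def is_valid(p_valid: float, threshold: float = 0.70) -> bool:
    """Apply the judgment threshold to the inference module's predicted probability."""
    return p_valid > threshold


if __name__ == "__main__":
    p = 0.75  # hypothetical probability predicted by the inference module
    print(is_valid(p, 0.70))  # standard threshold
    print(is_valid(p, 0.80))  # stricter threshold, lower sensitivity
    print(is_valid(p, 0.60))  # more lenient threshold, higher sensitivity
```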
For example, regarding the comprehensive judgment condition, suppose the condition is that valid voice information includes no fewer than 3 characters. If the condition is adjusted to require no fewer than 5 characters, the requirement on the voice information becomes higher and stricter, so the sensitivity of the comprehensive judgment condition is lowered. If the condition is adjusted to require no fewer than 2 characters, the requirement becomes lower and more lenient, so the sensitivity of the comprehensive judgment condition is raised.
Negatively correlated sensitivity adjustment: when the value of the influencing factor increases, the sensitivity is lowered, and the larger the increase, the more the sensitivity is lowered; when the value of the influencing factor decreases, the sensitivity is raised, and the larger the decrease, the more the sensitivity is raised.
Positively correlated sensitivity adjustment: when the value of the influencing factor increases, the sensitivity is raised, and the larger the increase, the more the sensitivity is raised; when the value of the influencing factor decreases, the sensitivity is lowered, and the larger the decrease, the more the sensitivity is lowered.
It should be noted that the amount by which the sensitivity is raised or lowered in this application may be set according to the actual situation and is not limited in this application. In addition, the sensitivity of the decision condition can only be adjusted within a range; for example, the judgment threshold can be adjusted to at most 100% and at least 0. The adjustment range of the sensitivity of the decision condition is determined according to the actual situation and is not limited in this solution.
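Positively and negatively correlated adjustment within a bounded range can be sketched as follows. Representing sensitivity as a number in [0, 1], using a linear step, and the names `adjust_sensitivity` and `gain` are all assumptions made for illustration; the text leaves the adjustment amount and range to the implementer.

```python
def adjust_sensitivity(sensitivity: float, delta: float, correlation: str,
                       gain: float = 0.5, lo: float = 0.0, hi: float = 1.0) -> float:
    """Shift `sensitivity` in response to a change `delta` in an influencing factor.

    correlation='positive': the factor going up raises the sensitivity.
    correlation='negative': the factor going up lowers the sensitivity.
    The result is clamped to [lo, hi], mirroring the bounded adjustment range.
    """
    if correlation not in ("positive", "negative"):
        raise ValueError("correlation must be 'positive' or 'negative'")
    sign = 1.0 if correlation == "positive" else -1.0
    return min(hi, max(lo, sensitivity + sign * gain * delta))


if __name__ == "__main__":
    print(adjust_sensitivity(0.5, 0.2, "positive"))  # raised
    print(adjust_sensitivity(0.5, 0.2, "negative"))  # lowered
    print(adjust_sensitivity(0.9, 1.0, "positive"))  # clamped at the upper bound
```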
First, the adjustment of the decision condition is described for the influencing factor of the environment in which the first voice information is generated. For example, the environment in which the first voice information is generated includes one or more of the following: the number of speakers within a second preset duration ending when the device acquires the first voice information (hereinafter, the number of speakers); the number of people within a preset range when the first voice information is generated (hereinafter, the number of surrounding people); the confidence of the first voice information; and the signal-to-noise ratio of the first voice information. The number of speakers specifically refers to the number of different voiceprints included in the first voice information: because every person's voiceprint is different, the number of voiceprints can represent the number of speakers of the first voice information.
Referring to FIG. 4, FIG. 4 uses the environmental influencing factors listed above as examples to show how the decision condition is adjusted based on environmental influencing factors.
While acquiring the first voice information, the device may obtain the number of surrounding people and the number of speakers for the first voice information. Specifically, the device may call the computer vision model in the memory to drive a camera to capture pictures or video of the surroundings, and then analyze the captured pictures and video to obtain both numbers; the number of speakers can be obtained by analyzing whose mouths are moving in the video within the second preset duration. The number of surrounding people includes the number of speakers. The second preset duration may be, for example, 5 seconds, 10 seconds, or 1 minute, which is not limited in this application.
Alternatively, the device may call the voiceprint recognition model in the memory to identify the voiceprint features in the voice signal received by the device within the second preset duration; the number of different voiceprint features identified is the number of speakers. Optionally, the voiceprint recognition model may be a dynamic monitoring model, so as to adapt flexibly to voiceprint recognition in different situations.
After the device obtains the number of surrounding people (assumed to be m, where m is a positive integer) and the number of speakers (assumed to be n, where n is a positive integer), it first determines whether the number of speakers n is 0. If n is 0, the first voice information does not include human speech, and the corresponding decision condition does not need to be adjusted.
If the number of speakers n is not 0, the first voice information includes human speech. The device then determines whether the number of surrounding people m is greater than 1; if m is not greater than 1, it may determine whether m is 1.
If m is 1, there is only one person in the surroundings, and the first voice information uttered by that person is very likely a voice control instruction addressed to the device. The sensitivity of the decision condition may therefore be raised so that the validity of the first voice information is recognized better.
Alternatively, if m is 1, the currently acquired first voice information is treated by default as a voice control instruction addressed to the device, that is, as valid information. In that case, the sensitivity of the decision condition may be set to the maximum, or the invalid rejection model may skip any further validity judgment and directly output an indication that the first voice information is valid.
If m is not 1, the detection may be in error, and this information cannot be used to adjust the sensitivity of the decision condition, so no adjustment is made.
When the number of speakers n is not 0 and the number of surrounding people m is greater than 1, the first voice information is very likely small talk and may be invalid voice information for the device. The device may therefore lower the sensitivity of the decision condition based on the number of surrounding people: the larger m is, the lower the sensitivity is set. This is because the more people there are in the surroundings, the more likely the first voice information is small talk, so a stricter decision condition is needed to identify its validity and to prevent invalid voice information from falsely triggering service operations and wasting the device's resources.
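The branching on the number of speakers n and the number of surrounding people m can be sketched as follows. The function name `environment_adjustment`, the linear `step`, and the convention of returning a signed sensitivity change are assumptions; the text only fixes the direction of the adjustment in each branch.

```python
def environment_adjustment(n_speakers: int, m_people: int, step: float = 0.1) -> float:
    """Return a signed change to the sensitivity of the decision condition.

    Positive values raise the sensitivity, negative values lower it, and
    0.0 means 'do not adjust'. The linear scaling by `step` is an assumption.
    """
    if n_speakers == 0:
        return 0.0                      # no human speech: leave the condition alone
    if m_people > 1:
        return -step * (m_people - 1)   # likely small talk: more people, lower sensitivity
    if m_people == 1:
        return step                     # a single person: likely a command to the device
    return 0.0                          # speech detected but nobody seen: detection may be wrong


if __name__ == "__main__":
    for n, m in ((0, 3), (1, 1), (2, 2), (2, 5), (1, 0)):
        print(n, m, environment_adjustment(n, m))
```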
另外,设备获取到第一语音信息之后,可以调用存储器中的自动语音识别模型来计算该第一语音信息的置信度,或者利用声道信息计算该第一语音信息的信噪比,或者该置信度和信噪比都计算出来,然后,基于该置信度和/或信噪比调整判决条件的灵敏度。In addition, after acquiring the first voice information, the device can call the automatic speech recognition model in the memory to calculate the confidence of the first voice information, or use the channel information to calculate the signal-to-noise ratio of the first voice information, or the confidence Both the degree and the signal-to-noise ratio are calculated, and then the sensitivity of the decision condition is adjusted based on this confidence degree and/or the signal-to-noise ratio.
具体的,可以基于该置信度和/或信噪比负相关调整判决条件的灵敏度,这是因为该置信 度越高,表明该第一语音信息被正确识别的概率越大,该信噪比越高,表明该采集的第一语音信息的质量越好,此时,即使判决条件的灵敏度苛刻也可以较好地识别出该第一语音信息的有效性,还可以有效地过滤闲聊的无效语音。Specifically, the sensitivity of the decision condition can be adjusted based on the confidence and/or the negative correlation of the SNR, because the higher the confidence, the higher the probability that the first voice information is correctly recognized, and the higher the SNR. High, indicating that the quality of the collected first voice information is better. At this time, even if the sensitivity of the judgment condition is harsh, the validity of the first voice information can be better recognized, and the invalid voice of chatting can be effectively filtered.
相反,若该置信度越低,表明该第一语音信息被正确识别的概率越小,该信噪比越低,表明该采集的第一语音信息的质量越差,可能语音内容的识别有误,为了提高设备语音交互的鲁棒性,可以适当提高判决条件的灵敏度,将判决条件调宽松一些,从而可以较好地识别该第一语音信息的有效性。On the contrary, if the confidence level is lower, it indicates that the probability of the first voice information being correctly recognized is smaller, and the signal-to-noise ratio is lower, indicating that the quality of the collected first voice information is worse, and the recognition of the voice content may be wrong. , in order to improve the robustness of the voice interaction of the device, the sensitivity of the decision condition can be appropriately increased, and the decision condition can be relaxed, so that the validity of the first voice information can be better recognized.
示例性地,设备可以设定一个语音信息的置信度阈值和/或信噪比阈值,若第一语音信息的置信度大于置信度阈值和/或信噪比大于信噪比阈值,那么,置信度和/或信噪比越高,判决条件的灵敏度调得越低。若第一语音信息的置信度小于置信度阈值和/或信噪比小于信噪比阈值,那么,置信度和/或信噪比越低,判决条件的灵敏度调得越高。该置信度阈值例如可以是50%或者60%等,该信噪比的阈值例如可以是50db或者60db等等,本申请对置信度阈值和信噪比阈值不做限制。Exemplarily, the device may set a confidence threshold and/or a signal-to-noise ratio threshold for voice information. If the confidence of the first voice information is greater than the confidence threshold and/or the signal-to-noise ratio is greater than the signal-to-noise ratio threshold, then the confidence The higher the degree and/or the signal-to-noise ratio, the lower the sensitivity of the decision condition. If the confidence of the first speech information is smaller than the confidence threshold and/or the SNR is smaller than the SNR threshold, then the lower the confidence and/or the SNR, the higher the sensitivity of the decision condition is adjusted. The confidence threshold may be, for example, 50% or 60%, and the signal-to-noise ratio threshold may be, for example, 50db or 60db, and the present application does not limit the confidence threshold and the signal-to-noise ratio threshold.
Exemplarily, in a possible implementation, the device does not need to set a confidence threshold and/or a signal-to-noise ratio threshold for voice information, but may instead define how the decision condition is adjusted within each range of confidence and/or signal-to-noise ratio. For example, take the case where the decision condition is the judgment threshold of the above inference model, and assume the initial judgment threshold is 70%. When the confidence is in the range of 0 to 30%, the sensitivity may be raised by setting the judgment threshold to 50%; when the confidence is in the range of 31% to 60%, the judgment threshold may be set to 60%; when the confidence is in the range of 61% to 70%, no adjustment is needed and the original threshold of 70% is kept; when the confidence is in the range of 71% to 100%, the sensitivity may be lowered by setting the judgment threshold to 80%.
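The banded example above can be sketched as a simple lookup. This is only an illustration under stated assumptions: the function name `threshold_for_confidence` is hypothetical, confidence is expressed as a fraction in [0, 1], and values that fall between the integer-percent bands of the text (for example 30.5%) are assigned to the higher band.

```python
def threshold_for_confidence(confidence: float, initial: float = 0.70) -> float:
    """Map recognition confidence to an adjusted judgment threshold.

    Band edges and threshold values follow the example in the text;
    a lower threshold means a higher sensitivity.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence <= 0.30:
        return 0.50      # low confidence: raise sensitivity
    if confidence <= 0.60:
        return 0.60
    if confidence <= 0.70:
        return initial   # keep the original threshold
    return 0.80          # high confidence: lower sensitivity


print(threshold_for_confidence(0.25))  # 0.5
print(threshold_for_confidence(0.90))  # 0.8
```

The same shape of lookup would apply to a signal-to-noise ratio, with band edges chosen for that quantity.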
It should be noted that, among the influencing factors described above (the number of speakers n, the number of surrounding people m, the confidence and the signal-to-noise ratio), the device may adjust the sensitivity of the decision condition based on any single factor alone. Alternatively, the device may adjust the sensitivity of the decision condition based on any combination of several of these factors. Exemplarily, a weight may be configured for each of the factors, and the sensitivity of the decision condition may be adjusted in a weighted manner. For example, for the adjustment of the above judgment threshold, assume that the three factors of the number of surrounding people m, the confidence and the signal-to-noise ratio are combined, that the weights set for the three factors are w1, w2 and w3, and that the adjusted judgment thresholds computed from the three factors individually are a1, a2 and a3. The adjusted judgment threshold determined by combining the three factors is then (a1*w1+a2*w2+a3*w3). It should be noted that this weighted combination is only an example. In an actual implementation, the largest or the smallest adjustment among the factors may instead be taken as the final result, and so on. This solution does not limit the specific combination calculation.
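A minimal sketch of the weighted combination (a1*w1+a2*w2+a3*w3) follows. The function name `combine_thresholds` and the sample numbers are hypothetical. The text does not say whether the weights sum to 1; the example uses weights that do, so that the result stays on the same scale as the individual thresholds.

```python
from typing import Sequence


def combine_thresholds(adjusted: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-factor adjusted thresholds: a1*w1 + a2*w2 + ..."""
    if not adjusted or len(adjusted) != len(weights):
        raise ValueError("need one weight per adjusted threshold")
    return sum(a * w for a, w in zip(adjusted, weights))


# Hypothetical values for (surrounding people m, confidence, SNR).
a = [0.80, 0.60, 0.70]
w = [0.50, 0.25, 0.25]
combined = combine_thresholds(a, w)  # 0.40 + 0.15 + 0.175
```

Taking `max(a)` or `min(a)` instead would correspond to the alternative mentioned in the text of using the largest or smallest adjustment.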
Referring to FIG. 5, FIG. 5 exemplarily shows a schematic diagram of adjusting the sensitivity of the decision condition based on three influencing factors: the continuous listening duration before the device acquires the first voice information (hereinafter t1), the first time interval between the device acquiring the first voice information and the most recent acquisition of valid voice information (hereinafter Δt1), and the second time interval between the device acquiring the first voice information and the most recent acquisition of invalid voice information (hereinafter Δt2).
Specifically, after acquiring the first voice information, the device may obtain the duration t1 for which it has been continuously listening up to the acquisition of the first voice information, the first time interval Δt1 between the acquisition of the first voice information and the most recent acquisition of valid voice information, and the second time interval Δt2 between the acquisition of the first voice information and the most recent acquisition of invalid voice information. Exemplarily, t1, Δt1 and Δt2 may be obtained by timing with a timer and calculation.
After obtaining t1, the device may adjust the sensitivity of the decision condition in negative correlation with t1, that is, the longer the continuous listening duration t1, the lower the sensitivity of the decision condition is set. This is because, after the device is woken up, it enters a new round of continuous listening. In the early part of the continuous listening stage, the user's voice information acquired by the device is more likely to be valid, so a higher sensitivity should be kept. As time passes, the voice information acquired by the device is more likely to be conversation between users, and the sensitivity needs to be lowered to reduce false triggering. The device may therefore adjust the sensitivity of the decision condition in negative correlation with the continuous listening duration.
To make this negative-correlation adjustment based on t1 easier to understand, an example is given. Assume that the decision condition is the judgment threshold applied to the output of the above inference module. At the start of continuous listening, the judgment threshold may be 60%, which is a relatively loose condition with a high sensitivity. As t1 gradually increases, each time t1 grows by one unit interval (for example, 5 seconds), the judgment threshold is increased by a preset increment, for example 1%. In other words, as t1 increases, the judgment threshold becomes larger, the condition becomes stricter, and the sensitivity gradually decreases. It should be noted that this is only an example, and this application does not limit the specific manner of negative-correlation adjustment.
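The stepwise example above (start at 60%, add 1% per 5-second unit) might be sketched as follows. The function name `threshold_after_listening` is hypothetical, and the upper bound `cap` is an assumption added here so the threshold cannot exceed a usable value; the text does not specify one.

```python
def threshold_after_listening(
    t1_seconds: float,
    start: float = 0.60,
    step: float = 0.01,
    unit_seconds: float = 5.0,
    cap: float = 0.95,  # assumed ceiling, not from the text
) -> float:
    """Judgment threshold that grows with continuous listening time t1."""
    if t1_seconds < 0:
        raise ValueError("t1 must be non-negative")
    completed_units = int(t1_seconds // unit_seconds)
    return min(start + completed_units * step, cap)


early = threshold_after_listening(0)    # loose condition, high sensitivity
later = threshold_after_listening(12)   # two completed 5-second units
```

A continuous function of t1 (linear or saturating) would equally satisfy the negative correlation; the step form simply mirrors the example.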
After obtaining the first time interval Δt1, the device may determine whether Δt1 is greater than a first time interval threshold T1. If Δt1 is greater than T1, the sensitivity of the decision condition is not adjusted. This is because, when Δt1 is greater than T1, the time span covered by the first time interval Δt1 can be regarded as overlapping with the continuous listening duration t1, so adjusting the sensitivity of the decision condition based on t1 is sufficient and no further adjustment based on Δt1 is needed.
If Δt1 is less than T1, the sensitivity of the decision condition is adjusted in negative correlation with Δt1. This is because, within a period of length T1 after the device acquires valid voice information, the longer the interval, the more likely it is that the voice information acquired by the device is invalid voice information such as casual chat. To reduce false triggering, the device may therefore adjust the sensitivity of the decision condition in negative correlation.
After obtaining the second time interval Δt2, the device may determine whether Δt2 is greater than a second time interval threshold T2. If Δt2 is greater than T2, the sensitivity of the decision condition is not adjusted. This is because, when Δt2 is greater than T2, the time span covered by the second time interval Δt2 can be regarded as overlapping with the continuous listening duration t1, so adjusting the sensitivity of the decision condition based on t1 is sufficient and no further adjustment based on Δt2 is needed.
If Δt2 is less than T2, the sensitivity of the decision condition is adjusted in negative correlation with Δt2. This is because, within a period of length T2 after the device acquires invalid voice information, the longer the interval, the more likely it is that the voice information acquired by the device is invalid voice information such as casual chat. To reduce false triggering, the device may therefore adjust the sensitivity of the decision condition in negative correlation.
In addition, for the first time interval Δt1 and the second time interval Δt2 obtained above, the device may compare whether Δt1 is less than Δt2 and, if so, raise the sensitivity of the decision condition. This is because, in that case, the voice information acquired immediately before the first voice information is valid voice information, so the first voice information is more likely to be an addition to or a modification of that preceding voice information, that is, the first voice information is more likely to be valid. To better identify the validity of the first voice information, the device may therefore adjust the decision condition in the looser direction, that is, raise the sensitivity.
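The three interval rules described above can be collected into one small decision helper. This is a sketch only: the function name, the returned labels and the sample values are hypothetical, and the boundary case where an interval exactly equals its threshold is not covered by the text, so it is treated here as "no adjustment".

```python
from typing import Dict, Union


def interval_adjustments(
    dt1: float, dt2: float, t1_threshold: float, t2_threshold: float
) -> Dict[str, Union[str, bool]]:
    """Decide how dt1 and dt2 influence the sensitivity.

    "negative": adjust sensitivity in negative correlation with the interval.
    "none":     leave sensitivity to the listening-duration rule (t1).
    """
    return {
        "dt1": "negative" if dt1 < t1_threshold else "none",
        "dt2": "negative" if dt2 < t2_threshold else "none",
        # Last utterance before this one was valid: likely a follow-up.
        "raise_sensitivity": dt1 < dt2,
    }


# Valid speech 3 s ago, invalid speech 20 s ago, both thresholds 10 s.
decision = interval_adjustments(3.0, 20.0, 10.0, 10.0)
```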
The adjustment flow shown in FIG. 5 is one implementation example of this application. Using the length of the continuous listening time and the time intervals to the most recent valid and invalid voice information, the sensitivity of the decision condition is dynamically adjusted in real time, so that in different listening stages the bar for judging voice information to be valid differs, even for voice information with the same content. Valid speech can thus be better identified, false triggering by invalid speech is reduced, and the user's voice interaction experience is improved.
It should be noted that, among the influencing factors shown in FIG. 5, the device may adjust the sensitivity of the decision condition based on any single factor alone. Alternatively, the device may adjust the sensitivity of the decision condition based on any combination of several of these factors.
Referring to FIG. 6A and FIG. 6B, FIG. 6A and FIG. 6B exemplarily show schematic diagrams of adjusting the sensitivity of the decision condition based on the proportions of valid voice information and invalid voice information within a first preset duration before the device acquires the first voice information.
Exemplarily, the first preset duration may be the duration for which the device has been continuously listening before acquiring the first voice information, or it may be any duration before the device acquires the first voice information. Such a duration may be preconfigured, which is not limited in this application.
The proportion of valid voice information within the first preset duration refers to the ratio of the valid voice information acquired by the device to all voice information acquired by the device within the first preset duration. Alternatively, this proportion is the reciprocal of the number of pieces of invalid voice information acquired between the time point at which a valid voice control instruction was most recently received and the acquisition of the first voice information. If the number of pieces of invalid voice information acquired during that period is 0, the proportion of valid voice information is 1.
The proportion of invalid voice information within the first preset duration refers to the ratio of the invalid voice information acquired by the device to all voice information acquired by the device within the first preset duration. Alternatively, this proportion is the reciprocal of the number of pieces of valid voice information acquired between the time point at which an invalid voice control instruction was most recently received and the acquisition of the first voice information. If the number of pieces of valid voice information acquired during that period is 0, the proportion of invalid voice information is 1.
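Both definitions of the proportions can be written down directly. The sketch below assumes the validity decisions made within the first preset duration are kept as a list of booleans (True for valid); the function names and that representation are illustrative assumptions, not part of the original description.

```python
from typing import Sequence, Tuple


def validity_proportions(history: Sequence[bool]) -> Tuple[float, float]:
    """Return (f1, f2): shares of valid and invalid voice information."""
    if not history:
        raise ValueError("no voice information in the window")
    f1 = sum(history) / len(history)
    return f1, 1.0 - f1


def reciprocal_proportion(count_of_opposite_kind: int) -> float:
    """Alternative definition: 1 / count, or 1 when the count is 0."""
    if count_of_opposite_kind < 0:
        raise ValueError("count must be non-negative")
    return 1.0 if count_of_opposite_kind == 0 else 1.0 / count_of_opposite_kind


f1, f2 = validity_proportions([True, False, True, True])  # 3 valid of 4
```

Under the alternative definition, the valid proportion would be `reciprocal_proportion(k)` with k the number of invalid pieces acquired since the last valid instruction.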
In a specific embodiment, after acquiring the first voice information, the device obtains the proportion of valid voice information (f1 for short) and the proportion of invalid voice information (f2 for short) within the first preset duration, and may compare f1 with f2 (see FIG. 6A). If f1 is greater than f2, more valid voice information was acquired within the first preset duration and the user is interacting with the device by voice frequently, so the sensitivity of the decision condition may be adjusted in positive correlation with the parameter (f1-f2). That is, the larger the proportion of valid voice information, the greater the probability that the first voice information is valid, and the higher the sensitivity of the decision condition is set. The validity of the acquired voice information can thus be better identified, and the chance of missing valid voice information is reduced.
In a possible implementation, the device may adjust the sensitivity of the decision condition based on f1 and f2. For example, the larger the proportion f1, the higher the sensitivity is set, and the smaller the proportion f2, the lower the sensitivity is set, and so on.
In FIG. 6A, if f1 is not greater than f2, the device may adjust the sensitivity of the decision condition according to the rate of change of f1 and the rate of change of f2.
Exemplarily, construct a coordinate system with the number of times voice information has been acquired as the horizontal axis (or, equivalently, the continuous listening time as the horizontal axis) and f1 as the vertical axis. In this coordinate system, the slope of the line connecting f1 at the most recent acquisition of valid voice information with f1 at the preceding acquisition of valid voice information is the rate of change of f1. For ease of understanding, see FIG. 6C. In FIG. 6C, it is assumed that voice information has been received 6 times before the first voice information is acquired, and FIG. 6C exemplarily shows the proportion of valid voice information after each acquisition and validity judgment. In FIG. 6C, the rate of change of f1 obtained by the device after acquiring the first voice information is k=-10%.
Similarly, construct a coordinate system with the number of times voice information has been acquired as the horizontal axis (or, equivalently, the continuous listening time as the horizontal axis) and f2 as the vertical axis. In this coordinate system, the slope of the line connecting f2 at the most recent acquisition of invalid voice information with f2 at the preceding acquisition of invalid voice information is the rate of change of f2. For ease of understanding, see FIG. 6D. In FIG. 6D, it is assumed that voice information has been received 6 times before the first voice information is acquired, and FIG. 6D exemplarily shows the proportion of invalid voice information after each acquisition and validity judgment. In FIG. 6D, the rate of change of f2 obtained by the device after acquiring the first voice information is k=10%.
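The rate of change can be illustrated numerically. The sketch below makes two assumptions that are not confirmed by the figures themselves: the slope is taken between consecutive acquisitions (one step on the horizontal axis), and the six validity judgments are valid, invalid, valid, valid, invalid, invalid. That sequence is hypothetical, chosen because it yields a final rate of -10% for f1, consistent with the value quoted for FIG. 6C. Function names are illustrative.

```python
from typing import List, Sequence


def running_valid_proportion(history: Sequence[bool]) -> List[float]:
    """f1 after each acquisition: valid count so far / total so far."""
    proportions = []
    valid = 0
    for total, is_valid in enumerate(history, start=1):
        valid += is_valid
        proportions.append(valid / total)
    return proportions


def rates_of_change(series: Sequence[float]) -> List[float]:
    """Slope between consecutive points, with a horizontal step of 1."""
    return [b - a for a, b in zip(series, series[1:])]


history = [True, False, True, True, False, False]  # hypothetical sequence
f1_series = running_valid_proportion(history)      # 1, 1/2, 2/3, 3/4, 3/5, 1/2
k_f1 = [round(k, 3) for k in rates_of_change(f1_series)]
# k_f1 == [-0.5, 0.167, 0.083, -0.15, -0.1]
```

Because f2 = 1 - f1 under the first definition of the proportions, the rates of change of f2 for the same sequence are the negatives of these values, ending at 10%.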
Based on the above description, the case where f1 is not greater than f2 indicates that voice interaction between the user and the device is decreasing. To reduce false triggering by invalid speech, the device may adjust the sensitivity of the decision condition in positive correlation with the rate of change of f1. That is, the larger the rate of change of f1, the greater the probability that the first voice information is valid, the higher the sensitivity is set, and the looser the decision condition; the smaller the rate of change of f1, the smaller the probability that the first voice information is valid, the lower the sensitivity is set, and the stricter the decision condition. For example, referring to FIG. 6C, several rates of change of f1 are given there: k=-50%, k=16.6%, k=8.3%, k=-15% and k=-10%, which sorted from smallest to largest are -50%<-15%<-10%<8.3%<16.6%. Assume that the decision condition being adjusted is the judgment threshold applied to the output of the above inference module, and that the judgment threshold before adjustment is 70%. The adjusted judgment thresholds corresponding to these five rates of change of f1, in ascending order, are then 85%, 80%, 78%, 68% and 65%, respectively. It should be noted that the lower the judgment threshold, the higher the sensitivity; here, raising the sensitivity means lowering the judgment threshold, and lowering the sensitivity means raising the judgment threshold.
Also in the case where f1 is not greater than f2, the device may adjust the sensitivity of the decision condition in negative correlation with the rate of change of f2. That is, the smaller the rate of change of f2, the more the proportion of valid voice information is increasing and the greater the probability that the first voice information is valid, so the higher the sensitivity is set and the looser the decision condition; the larger the rate of change of f2, the more the proportion of valid voice information is decreasing and the smaller the probability that the first voice information is valid, so the lower the sensitivity is set and the stricter the decision condition. For example, referring to FIG. 6D, several rates of change of f2 are given there: k=50%, k=-16.6%, k=-8.3%, k=15% and k=10%, which sorted from smallest to largest are -16.6%<-8.3%<10%<15%<50%. Assume that the decision condition being adjusted is the judgment threshold applied to the output of the above inference module, and that the judgment threshold before adjustment is 70%. The adjusted judgment thresholds corresponding to these five rates of change of f2, in ascending order, are then 65%, 68%, 78%, 80% and 85%, respectively.
Alternatively, after acquiring the first voice information, the device obtains the proportion of valid voice information (f1 for short) and the proportion of invalid voice information (f2 for short) within the first preset duration and, without comparing f1 with f2, may adjust the sensitivity of the decision condition in positive correlation with the parameter (f1-f2), in positive correlation with the rate of change of f1, and/or in negative correlation with the rate of change of f2 (see FIG. 6B). For the specific adjustment process, refer to the description of FIG. 6A above, which is not repeated here.
It should be noted that, among the influencing factors shown in FIG. 6A or FIG. 6B, the device may adjust the sensitivity of the decision condition based on any single factor alone. Alternatively, the device may adjust the sensitivity of the decision condition based on any combination of several of these factors.
Referring to FIG. 7, FIG. 7 exemplarily shows a schematic diagram of adjusting the sensitivity of the decision condition based on the following influencing factors: a first degree of association between the semantics of the first voice information and the semantics of the valid voice information most recently acquired by the device, a second degree of association between the semantics of the first voice information and the semantics of the invalid voice information most recently acquired by the device, a third degree of association between the first voice information and the valid voice information most recently acquired by the device, and the state of the voice dialogue between the device and the user up to the acquisition of the first voice information.
In a specific embodiment, after acquiring the first voice information, the device may obtain the most recently acquired valid voice information (the most recent historical valid voice information for short) and, based on the parsed semantics of the first voice information and of the most recent historical valid voice information, analyze the degree of association between the two pieces of voice information (the first degree of association for short). Specifically, semantic understanding of the first voice information may be performed by invoking a natural language understanding model stored in the memory.
If the semantics of the two pieces of voice information are not associated, that is, the first degree of association is zero, the sensitivity of the decision condition is not adjusted. If the semantics of the two pieces of voice information are associated, for example they are identical, there is a continuation relationship (for example, the most recent historical valid voice information means "turn on the air conditioner" and the first voice information means "make the temperature a bit higher"), there is a progressive relationship (for example, the most recent historical valid voice information means "make the temperature a bit higher" and the first voice information means "a bit higher still"), or there is an opposing relationship (for example, the most recent historical valid voice information means "turn on the air conditioner" and the first voice information means "turn it off"), the device may calculate the specific first degree of association and then adjust the sensitivity of the decision condition in positive correlation with the calculated first degree of association.
Exemplarily, if the first degree of association is greater than a certain threshold, the first voice information is more likely to be valid voice information, and the larger the first degree of association, the higher the sensitivity is set. Conversely, if the first degree of association is less than the threshold, the first voice information is less likely to be valid voice information, and the smaller the first degree of association, the lower the sensitivity is set.
Exemplarily, in a possible implementation, the device does not need to set a threshold for the first degree of association, but may instead define how the decision condition is adjusted within each range of the first degree of association. For example, take the case where the decision condition is the judgment threshold of the above inference model, and assume the initial judgment threshold is 70%. When the first degree of association is in the range of 0 to 30%, the sensitivity may be lowered by setting the judgment threshold to 80%; when the first degree of association is in the range of 31% to 60%, the judgment threshold may be set to 75%; when the first degree of association is in the range of 61% to 70%, no adjustment is needed and the original threshold of 70% is kept; when the first degree of association is in the range of 71% to 100%, the sensitivity may be raised by setting the judgment threshold to 60%.
In a possible implementation, when the first degree of association is determined to be 100%, the sensitivity may be set to the highest level; alternatively, the invalid rejection model performs no further validity judgment and directly outputs an indication that the first voice information is valid.
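The banded adjustment for the first degree of association, together with the 100% shortcut, could look like the sketch below. It follows only the banded variant (so a degree of association of 0 falls in the first band), the names are hypothetical, and the return convention (a threshold plus an optional forced decision) is an assumption made for illustration.

```python
from typing import Optional, Tuple


def adjust_for_first_association(
    association: float, initial: float = 0.70
) -> Tuple[float, Optional[bool]]:
    """Return (judgment_threshold, forced_decision).

    forced_decision is True when the utterance is output as valid
    without further judgment, otherwise None.
    """
    if not 0.0 <= association <= 1.0:
        raise ValueError("association must be in [0, 1]")
    if association == 1.0:
        return initial, True   # skip the validity judgment entirely
    if association <= 0.30:
        return 0.80, None      # weak association: lower sensitivity
    if association <= 0.60:
        return 0.75, None
    if association <= 0.70:
        return initial, None   # keep the original threshold
    return 0.60, None          # strong association: raise sensitivity


threshold, forced = adjust_for_first_association(0.85)
```

For the second degree of association the direction is reversed, since a strong association with recent invalid speech argues for a stricter condition.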
In a specific embodiment, after acquiring the first voice information, the device may obtain the most recently acquired invalid voice information (the most recent historical invalid voice information for short) and, based on the parsed semantics of the first voice information and of the most recent historical invalid voice information, analyze the degree of association between the two pieces of voice information (the second degree of association for short). If the semantics of the two pieces of voice information are not associated, that is, the second degree of association is zero, the sensitivity of the decision condition is not adjusted. If the semantics of the two pieces of voice information are associated, for example they are identical, there is a continuation relationship (for example, the most recent historical invalid voice information means "we can go to Shenzhen on Sunday" and the first voice information means "we can go on Saturday"), there is a progressive relationship (for example, the most recent historical invalid voice information means "getting up at six in the morning is early" and the first voice information means "I can get up even earlier"), or there is an opposing relationship (for example, the most recent historical invalid voice information means "let's go to Shenzhen" and the first voice information means "no, let's not"), the device may calculate the specific second degree of association and then adjust the sensitivity of the decision condition in negative correlation with the calculated second degree of association.
Exemplarily, if the second degree of association is greater than a certain threshold, the first voice information is more likely to be invalid voice information, and the larger the second degree of association, the lower the sensitivity is set. Conversely, if the second degree of association is less than the threshold, the first voice information is less likely to be invalid voice information, and the smaller the second degree of association, the higher the sensitivity is set.
Exemplarily, in a possible implementation, the device does not need to set a threshold for the second degree of association, but may instead define how the decision condition is adjusted within each range of the second degree of association. For example, take the case where the decision condition is the judgment threshold of the above inference model, and assume the initial judgment threshold is 70%. When the second degree of association is in the range of 0 to 30%, the sensitivity may be raised by setting the judgment threshold to 60%; when the second degree of association is in the range of 31% to 60%, the judgment threshold may be set to 65%; when the second degree of association is in the range of 61% to 70%, no adjustment is needed and the original threshold of 70% is kept; when the second degree of association is in the range of 71% to 100%, the sensitivity may be lowered by setting the judgment threshold to 80%.
In a possible implementation, when the second correlation degree is determined to be 100%, the sensitivity can be set to the lowest level; alternatively, the invalid rejection model performs no further validity judgment and directly outputs an indication that the first voice information is invalid.
In a specific embodiment, in addition to adjusting the sensitivity of the decision condition based on the first correlation degree between the semantics of the first voice information and the semantics of the valid voice information most recently acquired by the device, the device can also adjust the sensitivity of the decision condition based on a third correlation degree between the first voice information and the most recently acquired valid voice information. The third correlation degree refers to the correlation between the content of the two pieces of voice information, whereas the first correlation degree refers to the correlation between their semantics. For ease of understanding of the first and third correlation degrees, see FIG. 8A and FIG. 8B.
Referring first to FIG. 8A, assume that "play music for me" is the valid voice information most recently acquired by the device, and "I usually like to listen to singer A's songs" is the first voice information. To obtain the first correlation degree of the two pieces of voice information, their semantic information is first obtained through a natural language understanding model and then input into a semantic correlation inference model, which outputs the first correlation degree of the two pieces of semantic information. The semantic correlation inference model is a pre-trained neural network model, machine learning model, or the like.
Referring to FIG. 8B, similarly assume that "play music for me" is the valid voice information most recently acquired by the device, and "I usually like to listen to singer A's songs" is the first voice information. To obtain the third correlation degree of the two pieces of voice information, a natural language understanding model can parse each of them into structured information. Specifically, structured parsing of "play music for me" shows that the domain of the voice information is music and its intent is to play music. Structured parsing of "I usually like to listen to singer A's songs" shows that the domain of the voice information is music and the singer is singer A. The two pieces of structured information are then input into a correlation judgment model, which outputs the third correlation degree of the two pieces of voice information. The correlation judgment model may be, for example, a dialogue state tracking (DST) model.
The first correlation degree output in FIG. 8A for the two pieces of voice information "play music for me" and "I usually like to listen to singer A's songs" may be zero, that is, their semantics are not associated, whereas the third correlation degree output in FIG. 8B for the same two pieces of voice information may be 100%, that is, the two pieces of voice information are associated.
In a possible implementation, the third correlation degree between the first voice information and the valid voice information most recently acquired by the device, obtained in the manner described for FIG. 8B, may be exactly 0 or 100%: if the correlation judgment model outputs an indication of "not related", the third correlation degree is 0; if it outputs an indication of "related", the third correlation degree is 100%.
In another possible implementation, the third correlation degree between the first voice information and the valid voice information most recently acquired by the device, obtained in the manner described for FIG. 8B, may instead be a specific percentage (for example, 60% or 90%) or a similarity score, which is then compared with a preset threshold to determine whether the two pieces of voice information are associated.
After obtaining the third correlation degree between the first voice information and the valid voice information most recently acquired by the device, the device can adjust the sensitivity of the decision condition in positive correlation with the third correlation degree. For the specific manner of positive-correlation adjustment, refer to the above description of adjusting the sensitivity of the decision condition in positive correlation with the first correlation degree; details are not repeated here. In addition, when the third correlation degree is zero, that is, when the first voice information is not related to the valid voice information most recently acquired by the device, the sensitivity of the decision condition is not adjusted.
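As a rough illustration of positive-correlation adjustment, the sketch below lowers the judgment threshold (that is, raises the sensitivity) in proportion to the third correlation degree and leaves it unchanged when the correlation is zero. The linear mapping, the function name, and the `max_step` parameter are assumptions made for illustration only; the text does not prescribe a specific formula.

```python
def threshold_for_third_correlation(third_correlation: float,
                                    current_threshold: float = 0.70,
                                    max_step: float = 0.10) -> float:
    """Lower the judgment threshold as the third correlation degree grows.

    third_correlation is a fraction in [0.0, 1.0]. A value of 0.0 means
    "not related", in which case the threshold is left unchanged.
    The linear rule and max_step are hypothetical.
    """
    if not 0.0 <= third_correlation <= 1.0:
        raise ValueError("third_correlation must be within [0.0, 1.0]")
    if third_correlation == 0.0:
        return current_threshold  # unrelated: no adjustment
    return current_threshold - max_step * third_correlation


if __name__ == "__main__":
    for value in (0.0, 0.5, 1.0):
        print(value, round(threshold_for_third_correlation(value), 3))
```

With a binary correlation judgment model, only the values 0.0 and 1.0 occur, so the rule reduces to "keep the threshold" or "apply the full step".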
In a specific embodiment, after acquiring the first voice information, the device can obtain the state of the voice dialogue between the device and the user up to the time the first voice information is acquired. The state may be, for example, a state in which the device, in response to the user's voice control instructions, is selecting, asking, judging, or chatting. Specifically, the device can obtain this state through dialogue state tracking (DST). If such a dialogue state exists, the user and the device have been engaged in a sustained interactive dialogue, and the device can raise the sensitivity of the decision condition accordingly. If no such dialogue state exists, the user has not been engaged in a sustained interactive dialogue with the device, and the device need not adjust the sensitivity of the decision condition based on this factor.
It should be noted that, for the influencing factors shown in FIG. 7, the device can adjust the sensitivity of the decision condition based on any one of them alone, or based on any combination of them.
Referring to FIG. 9, FIG. 9 exemplarily shows the adjustment of the sensitivity of the decision condition based on two influencing factors: a first similarity between the acoustic features of the first voice information and those of historical valid voice information, and a second similarity between the acoustic features of the first voice information and those of historical invalid voice information. Exemplarily, the acoustic features include features such as the intonation and/or speaking rate of the speech.
In a specific embodiment, after acquiring the first voice information, the device extracts its acoustic features by invoking an acoustic model stored in the memory, and then compares the extracted acoustic features with the acoustic features of historical valid voice information (one or more pieces) to obtain the similarity between them (referred to as the first similarity). If the similarity between the acoustic features of the first voice information and those of every piece of historical valid voice information is zero, the device need not adjust the sensitivity of the decision condition based on the first similarity. If the similarity with one or more pieces of historical valid voice information is not zero, the sensitivity of the decision condition can be adjusted in positive correlation: the greater the similarity (exemplarily, the largest of the obtained similarities, or their average), the higher the sensitivity is set.
In a possible implementation, when the similarity between the acoustic features of the first voice information and those of one or more pieces of historical valid voice information is greater than a certain threshold (for example, any value between 60% and 100%), the acoustic features are considered similar, and the device can raise the sensitivity of the decision condition to a preset value. For example, taking the above judgment threshold as an example and assuming the original judgment threshold is 70%, whenever this similarity exceeds the threshold, the judgment threshold is set to 60%.
In a specific embodiment, after acquiring the first voice information, the device extracts its acoustic features by invoking an acoustic model stored in the memory, and then compares the extracted acoustic features with the acoustic features of historical invalid voice information (one or more pieces) to obtain the similarity between them (referred to as the second similarity). If the similarity between the acoustic features of the first voice information and those of every piece of historical invalid voice information is zero, the device need not adjust the sensitivity of the decision condition based on the second similarity. If the similarity with one or more pieces of historical invalid voice information is not zero, the sensitivity of the decision condition can be adjusted in negative correlation: the greater the similarity (exemplarily, the largest of the obtained similarities, or their average), the lower the sensitivity is set.
In a possible implementation, when the similarity between the acoustic features of the first voice information and those of one or more pieces of historical invalid voice information is greater than a certain threshold (for example, any value between 60% and 100%), the acoustic features are considered similar, and the device can lower the sensitivity of the decision condition to a preset value. For example, taking the above judgment threshold as an example and assuming the original judgment threshold is 70%, whenever this similarity exceeds the threshold, the judgment threshold is set to 75%.
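The two preset-value rules for acoustic similarity (against historical valid and historical invalid voice information) can be sketched together as follows. This is an illustrative sketch only: the function names are hypothetical, the similarities are assumed to be already computed as fractions in [0.0, 1.0], and the "similar" cutoff of 60% is simply one value from the range the text allows.

```python
from statistics import mean
from typing import Sequence


def aggregate_similarity(similarities: Sequence[float], mode: str = "max") -> float:
    """Reduce several per-utterance similarities to one score (max or average)."""
    if not similarities:
        return 0.0
    return max(similarities) if mode == "max" else mean(similarities)


def threshold_from_acoustic_history(similarities: Sequence[float],
                                    history_is_valid: bool,
                                    current_threshold: float = 0.70,
                                    similar_above: float = 0.60,
                                    mode: str = "max") -> float:
    """Apply the preset-value rule for the first or second similarity.

    history_is_valid=True  -> compared against historical valid voice
                              information: threshold drops to 0.60.
    history_is_valid=False -> compared against historical invalid voice
                              information: threshold rises to 0.75.
    """
    score = aggregate_similarity(similarities, mode)
    if score <= similar_above:
        return current_threshold  # not similar enough: no adjustment
    return 0.60 if history_is_valid else 0.75


if __name__ == "__main__":
    print(threshold_from_acoustic_history([0.2, 0.9], history_is_valid=True))
    print(threshold_from_acoustic_history([0.2, 0.9], history_is_valid=False))
```

Choosing the maximum makes a single close match sufficient to trigger the adjustment, whereas the average requires the utterance to resemble the history as a whole.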
It should be noted that, for the influencing factors shown in FIG. 9, the device can adjust the sensitivity of the decision condition based on any one of them alone, or based on any combination of them.
In a possible implementation, the device can receive an instruction input by the user and adaptively adjust the sensitivity of the decision condition based on that instruction. Exemplarily, the instruction may specify a particular sensitivity for the decision condition, or may be an instruction to turn off or cancel validity recognition of voice information. The embodiments of this application can thus adapt the sensitivity of the decision condition to the user's preferences, which better meets user needs and improves the user experience.
In a possible implementation, the sensitivity of the decision condition may be adjusted by another device or apparatus (for example, a server corresponding to the device) based on one or more of the above influencing factors and then sent to the device. After receiving the adjusted decision condition, the device can directly use it to judge the validity of the first voice information.
Referring to FIG. 10, FIG. 10 shows a voice information processing method provided by this application. The method includes but is not limited to the following steps:
S1001. Acquire first voice information.
For the specific implementation of this step, refer to the description of step S201 in FIG. 2 above; details are not repeated here.
S1002. When the first voice information is determined, based on a decision condition, to be a valid voice control instruction, perform the operation indicated by the first voice information, where the decision condition is obtained by adjustment based on the environmental condition in which the first voice information is generated.
In a specific embodiment, after acquiring the first voice information, the device can adaptively adjust, based on the environmental condition in which the first voice information is generated, the decision condition used to judge whether the first voice information is a valid voice instruction. For the specific implementation of adjusting the decision condition based on this environmental condition, refer to the corresponding description of FIG. 4 above; details are not repeated here.
After the adjustment is complete, the device uses the adjusted decision condition to judge whether the first voice information is valid. If the first voice information is valid, the device performs semantic understanding on it. Specifically, the processor in the device can invoke the natural language understanding model in the memory to obtain the specific meaning of the first voice information. Having understood the meaning, the device performs the corresponding operation to provide the service the user needs. For the device, the meaning of the first voice information is the control instruction for performing the corresponding operation.
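The flow of step S1002 (judge validity with the adjusted decision condition, then understand and execute) can be sketched as below. The sketch assumes the decision condition is a judgment threshold applied to a validity score, with a score at or above the threshold meaning "valid"; the function name and the `understand` and `execute` callbacks are hypothetical stand-ins for the natural language understanding model and the device's command executor.

```python
from typing import Callable, Optional


def handle_voice_information(text: str,
                             validity_score: float,
                             adjusted_threshold: float,
                             understand: Callable[[str], str],
                             execute: Callable[[str], str]) -> Optional[str]:
    """Execute the utterance only if it passes the adjusted decision condition.

    Returns the result of the executed operation, or None when the
    utterance is rejected as invalid voice information.
    """
    if validity_score < adjusted_threshold:
        return None  # rejected: no semantic understanding, no operation
    meaning = understand(text)  # semantic understanding of valid speech
    return execute(meaning)     # the meaning acts as the control instruction


if __name__ == "__main__":
    result = handle_voice_information(
        "play music", 0.72, 0.70,
        understand=lambda t: t.upper(),
        execute=lambda m: "done:" + m,
    )
    print(result)
```

Note that the same utterance with the same score can be accepted or rejected depending on how the threshold was adjusted, which is the point of adapting the decision condition.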
In a possible implementation, the device can receive a sensitivity for the decision condition specified by user input and then, based on that sensitivity, adaptively adjust the decision condition used to judge whether the first voice information is a valid voice instruction, so that judging validity with the adjusted decision condition achieves the user-specified sensitivity. After adjusting the decision condition based on the user-specified sensitivity, the device uses the adjusted decision condition to judge whether the first voice information is valid. If the first voice information is valid, the device performs semantic understanding on it to obtain its meaning and performs the corresponding operation based on that meaning to provide the service the user needs. For the device, the meaning of the first voice information is the control instruction for performing the corresponding operation.
In a possible implementation, for the specific implementation of performing the operation indicated by the first voice information when the first voice information is determined, based on the decision condition, to be a valid voice control instruction, refer to the description of step S203 in FIG. 2 above; details are not repeated here.
Optionally, the environmental condition in which the first voice information is generated includes one or more of the following: the number of speakers within a second preset duration ending when the device acquires the first voice information, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
The more speakers there are within a period of time, and/or the more people are nearby when the voice information is generated, the more likely the voice information received by the device is small talk, that is, invalid voice. In addition, a higher confidence and/or signal-to-noise ratio of the voice information indicates a higher probability that the device recognizes the sentence correctly, which also affects the recognition of validity. Therefore, adaptively adjusting the decision condition based on one or more of these items allows the validity of voice information to be judged better, improves the accuracy of validity discrimination, and reduces the false trigger rate caused by invalid signals.
In a specific embodiment, when the environmental condition indicates that the first voice information is more likely to be valid than invalid, the sensitivity of the decision condition is raised; when the environmental condition indicates that the first voice information is less likely to be valid than invalid, the sensitivity of the decision condition is lowered. For the specific implementation, refer to the corresponding description of FIG. 4 above; details are not repeated here.
The environmental condition in which voice information is generated strongly affects whether it is a valid voice control instruction: the same or similar voice information may be a valid instruction in one environment but not necessarily in another. Therefore, the embodiments of this application adaptively adjust the decision condition for voice information received under different environmental conditions, which allows the validity of voice information to be judged better in each environment, improves the accuracy of validity discrimination, and reduces the false trigger rate caused by invalid signals.
In a possible implementation, that the decision condition is obtained by adjustment based on the environmental condition in which the first voice information is generated includes: the decision condition is obtained by adjustment based on the environmental condition and the continuous listening duration of the device.
In a specific embodiment, the device can adaptively adjust the sensitivity of the decision condition by combining the environmental condition in which the first voice information is generated with the duration for which the device has been continuously listening for voice information. For the specific implementation of adjusting the decision condition based on the environmental condition, refer to the corresponding description of FIG. 4 above; details are not repeated here.
Optionally, the longer the continuous listening duration of the device, the lower the sensitivity of the decision condition is set. For the specific implementation of adjusting the decision condition based on the continuous listening duration, refer to the corresponding description of FIG. 5 above; details are not repeated here.
Optionally, in a specific implementation, the device can assign a weight to each of the environmental condition and the listening duration and adjust the sensitivity of the decision condition by weighting. For example, for the adjustment of the above judgment threshold, assume that the two influencing factors, the environmental condition and the listening duration, are combined, that the weights set for the two factors are w4 and w5, and that the adjusted judgment thresholds calculated from the two factors individually are a4 and a5. The adjusted judgment threshold determined from both factors together is then (a4*w4+a5*w5). It should be noted that this weighted combination is only an example. In an actual implementation, the largest or smallest adjustment among the influencing factors may instead be taken as the final result, and so on. This solution does not limit the specific calculation used for the combination.
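The weighted combination (a4*w4+a5*w5) and the alternative of taking the largest or smallest adjustment can be sketched as follows. The function names are hypothetical, and the example weights are assumed to sum to 1 so that the combined value remains a valid threshold; the text itself does not state that constraint.

```python
from typing import Sequence


def combine_weighted(adjusted: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-factor adjusted thresholds, e.g. a4*w4 + a5*w5."""
    if len(adjusted) != len(weights):
        raise ValueError("adjusted and weights must have the same length")
    return sum(a * w for a, w in zip(adjusted, weights))


def combine_extreme(adjusted: Sequence[float], original: float,
                    pick: str = "most") -> float:
    """Return the per-factor threshold that moved most (or least) from the original."""
    distance = lambda a: abs(a - original)
    return max(adjusted, key=distance) if pick == "most" else min(adjusted, key=distance)


if __name__ == "__main__":
    a4, a5 = 0.60, 0.75   # thresholds from environment and listening duration
    w4, w5 = 0.5, 0.5     # assumed weights
    print(round(combine_weighted([a4, a5], [w4, w5]), 3))
    print(combine_extreme([a4, a5], original=0.70, pick="most"))
```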
The longer the device continuously listens for speech, the more likely the voice information it hears is invalid voice. Therefore, the embodiments of this application combine the environmental condition in which the voice information is generated with the continuous listening duration of the device to adaptively adjust the decision condition, which further improves the judgment of validity, improves the accuracy of validity discrimination, and reduces the false trigger rate caused by invalid signals.
In a possible implementation, that the decision condition is obtained by adjustment based on the environmental condition and the continuous listening duration of the device includes: the decision condition is obtained by adjustment based on the environmental condition, the continuous listening duration, and the situation of historical voice information.
Optionally, the situation of historical voice information includes one or more of the following: a first time interval between the acquisition of the first voice information and the most recent acquisition of valid voice information; a second time interval between the acquisition of the first voice information and the most recent acquisition of invalid voice information; the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired; a first correlation degree between the semantics of the first voice information and the semantics of the most recently acquired valid voice information; a second correlation degree between the semantics of the first voice information and the semantics of the most recently acquired invalid voice information; a third correlation degree between the first voice information and the valid voice information most recently acquired by the device; the state of the voice dialogue between the device and the user up to the time the first voice information is acquired; a first similarity between the acoustic features of the first voice information and those of historical valid voice information; and a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
Optionally, the longer the first time interval, the lower the sensitivity of the decision condition is set.
Optionally, the longer the second time interval, the lower the sensitivity of the decision condition is set.
Optionally, when the first time interval is shorter than the second time interval, the sensitivity of the decision condition is raised.
Optionally, when the proportion of valid voice information is greater than the proportion of invalid voice information, the sensitivity of the decision condition is raised. When the proportion of valid voice information is less than the proportion of invalid voice information, the sensitivity of the decision condition is raised if the proportion of valid voice information is trending upward, and lowered if it is trending downward.
Optionally, when a state of voice dialogue between the device and the user exists, the sensitivity of the decision condition is raised.
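Among the optional rules above, the one based on the proportions of valid and invalid voice information has the most branches and can be sketched as a small decision function. The function name and the signed `valid_trend` parameter (positive for an upward trend in the valid proportion, negative for a downward trend) are assumptions for illustration. The text does not say what happens when the proportions are equal or the trend is flat, so the sketch leaves the sensitivity unchanged in those cases.

```python
def sensitivity_direction(valid_ratio: float,
                          invalid_ratio: float,
                          valid_trend: float) -> str:
    """Return "raise", "lower", or "keep" for the decision-condition sensitivity.

    valid_ratio / invalid_ratio: proportions within the first preset duration.
    valid_trend: signed change of the valid proportion (hypothetical measure).
    """
    if valid_ratio > invalid_ratio:
        return "raise"
    if valid_ratio < invalid_ratio:
        if valid_trend > 0:
            return "raise"  # mostly invalid so far, but valid share is growing
        if valid_trend < 0:
            return "lower"  # mostly invalid and valid share is shrinking
    return "keep"  # cases the text does not specify


if __name__ == "__main__":
    print(sensitivity_direction(0.7, 0.3, 0.0))
    print(sensitivity_direction(0.3, 0.7, 0.1))
    print(sensitivity_direction(0.3, 0.7, -0.1))
```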
In this embodiment, the device can adaptively adjust the sensitivity of the decision condition by combining the environmental condition in which the first voice information is generated, the duration for which the device has been continuously listening for voice information, and the historical voice information heard by the device. For the specific implementation of adjusting the decision condition based on the environmental condition, refer to the corresponding description of FIG. 4 above; for adjustment based on the continuous listening duration, refer to the corresponding description of FIG. 5 above; for adjustment based on the historical voice information heard by the device, refer to the corresponding descriptions of FIG. 5, FIG. 6A, FIG. 6B, FIG. 7, or FIG. 9 above. Details are not repeated here.
可选的,本实施例中结合上述环境情况、聆听时长和历史语音信息来调整判决条件的灵敏度,可以是采用上述介绍的加权平均的综合调整方法来综合调整,或者可以是取多个影响因素中调整最多或最少的作为最后调整的结果等等,本方案对具体综合的计算过程不做限制。Optionally, in this embodiment, the sensitivity of the judgment condition is adjusted in combination with the above-mentioned environmental conditions, listening duration, and historical voice information. The most or the least adjusted result is the result of the final adjustment, etc. This scheme does not limit the specific comprehensive calculation process.
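As a rough illustration of the combination options just listed, the sketch below merges per-factor sensitivity adjustments either by weighted average or by picking the largest or smallest adjustment. The function name, the dictionary layout, the mode strings and the reading of "largest" and "smallest" as magnitude are assumptions for the example; the application leaves the calculation open.

```python
def combine_adjustments(adjustments, weights=None, mode="weighted"):
    """Combine signed per-factor sensitivity adjustments into one value.

    adjustments: dict mapping a factor name to its proposed change.
    weights:     optional dict with one weight per factor (weighted mode only).
    mode:        "weighted", "largest" or "smallest" (hypothetical labels).
    """
    if not adjustments:
        return 0.0
    if mode == "weighted":
        if weights is None:
            weights = {name: 1.0 for name in adjustments}
        total = sum(weights[name] for name in adjustments)
        if total == 0:
            raise ValueError("weights must not sum to zero")
        return sum(adjustments[n] * weights[n] for n in adjustments) / total
    if mode == "largest":
        # The factor that adjusts the most, judged by magnitude.
        return max(adjustments.values(), key=abs)
    if mode == "smallest":
        # The factor that adjusts the least, judged by magnitude.
        return min(adjustments.values(), key=abs)
    raise ValueError(f"unknown mode: {mode}")
```

A caller might pass `{"environment": 0.2, "listening": -0.1, "history": 0.05}` and switch `mode` to compare how conservative each strategy is for a given product.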
Historical voice information can also help determine the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to previously acquired valid voice information, it is more likely to be a valid voice instruction; conversely, if it is highly similar to previously acquired invalid voice information, it is more likely to be an invalid voice instruction. Therefore, in addition to the environmental conditions in which the voice information is generated and the device's listening duration described above, this embodiment of the present application also uses historical voice information to adaptively adjust the judgment condition for judging the validity of voice information. This further improves the validity judgment, increases the accuracy of identifying valid voice information, and reduces the false trigger rate caused by invalid signals.
In a possible implementation, that the judgment condition is obtained through adjustment based on the environmental conditions in which the first voice information is generated includes: the judgment condition is obtained through adjustment based on the environmental conditions and on historical voice information.
In this embodiment, the device may adaptively adjust the sensitivity of the judgment condition by combining the environmental conditions in which the first voice information is generated and the historical voice information the device has heard. Specifically, for adjusting the judgment condition based on the environmental conditions in which the first voice information is generated, refer to the description corresponding to FIG. 4 above; for adjusting the judgment condition based on the historical voice information heard by the device, refer to the description corresponding to FIG. 5, FIG. 6A, FIG. 6B, FIG. 7 or FIG. 9 above. Details are not repeated here.
Optionally, when this embodiment combines the environmental conditions and the historical voice information to adjust the sensitivity of the judgment condition, the combined adjustment may use the weighted-average method introduced above, or may take the largest or the smallest of the adjustments contributed by the individual influencing factors as the final adjustment, and so on. This solution does not limit the specific calculation used to combine them.
Based on the foregoing description, this embodiment of the present application combines the environmental conditions in which the voice information is generated with historical voice information to adaptively adjust the judgment condition for judging the validity of voice information. This further improves the validity judgment, increases the accuracy of identifying valid voice information, and reduces the false trigger rate caused by invalid signals.
In a possible implementation, the present application provides another voice information processing method, including: acquiring first voice information; and, when the first voice information is determined based on a judgment condition to be a valid voice control instruction, performing the operation indicated by the first voice information, where the judgment condition is obtained through adjustment based on the continuous listening duration of the device.
In a specific embodiment, for acquiring the first voice information, refer to the description of step S201 in FIG. 2 above. For performing the operation indicated by the first voice information when it is determined based on the judgment condition to be a valid voice control instruction, refer to the description of step S203 in FIG. 2 above. For adjusting the judgment condition based on the duration for which the device has been continuously listening for voice information, refer to the description corresponding to FIG. 5 above. Details are not repeated here.
In the present application, the longer the device has been continuously listening, the more likely it is that the voice information it hears is invalid. Adaptively adjusting the judgment condition for judging the validity of voice information according to the device's continuous listening duration therefore makes the validity judgment more reliable, increases the accuracy of identifying valid voice information, and reduces the false trigger rate caused by invalid signals.
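One way to picture the relationship between listening duration and sensitivity is a monotonically decreasing function of elapsed time. The sketch below uses exponential decay with a lower bound; the decay shape, the `half_life` and `floor` parameters and the function name are all assumptions for illustration, since this passage only states that longer listening makes invalid speech more likely.

```python
import math


def sensitivity_for_listening(base, seconds_listening, half_life=30.0, floor=0.1):
    """Lower the sensitivity as continuous listening time grows.

    base:              sensitivity right after wake-up (hypothetical scale).
    seconds_listening: how long the device has been listening continuously.
    half_life:         seconds after which the sensitivity halves (assumed).
    floor:             lowest sensitivity the function will return (assumed).
    """
    if seconds_listening < 0:
        raise ValueError("seconds_listening must be non-negative")
    decayed = base * math.pow(0.5, seconds_listening / half_life)
    return max(floor, decayed)
```

Any other non-increasing mapping, such as a step function over duration bands, would fit the same description.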
In a possible implementation, the present application provides another voice information processing method, including: acquiring first voice information; and, when the first voice information is determined based on a judgment condition to be a valid voice control instruction, performing the operation indicated by the first voice information, where the judgment condition is obtained through adjustment based on historical voice information.
In a specific embodiment, for acquiring the first voice information, refer to the description of step S201 in FIG. 2 above. For performing the operation indicated by the first voice information when it is determined based on the judgment condition to be a valid voice control instruction, refer to the description of step S203 in FIG. 2 above. For adjusting the judgment condition based on the historical voice information heard by the device, refer to the description corresponding to FIG. 5, FIG. 6A, FIG. 6B, FIG. 7 or FIG. 9 above. Details are not repeated here.
Historical voice information can also help determine the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to previously acquired valid voice information, it is more likely to be a valid voice instruction; conversely, if it is highly similar to previously acquired invalid voice information, it is more likely to be an invalid voice instruction. Therefore, in the present application, adaptively adjusting the judgment condition for judging the validity of voice information according to historical voice information makes the validity judgment more reliable, increases the accuracy of identifying valid voice information, and reduces the false trigger rate caused by invalid signals.
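The similarity reasoning above can be illustrated with feature vectors and cosine similarity. This is a sketch under assumptions: the application does not fix a similarity measure here, and the function names, the plain-list feature representation and the use of the best match in each history are choices made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def history_bias(current, valid_history, invalid_history):
    """Positive when the current features look more like past valid speech.

    A positive result suggests raising the sensitivity of the judgment
    condition; a negative result suggests lowering it.
    """
    best_valid = max((cosine_similarity(current, v) for v in valid_history),
                     default=0.0)
    best_invalid = max((cosine_similarity(current, v) for v in invalid_history),
                       default=0.0)
    return best_valid - best_invalid
```

The returned value could then be scaled into a sensitivity change and merged with the adjustments from other influencing factors.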
To make the voice information processing method provided by the present application easier to understand as a whole, refer, as an example, to the flowchart shown in FIG. 11. In FIG. 11, the voice interaction system of the device is first woken up, and the system then starts listening to the user's voice. After acquiring the user's voice information, the system inputs it into the invalid rejection model described above to identify whether the voice information is valid. If the voice information is identified as valid, semantic understanding is performed on it, and the instruction is parsed and executed based on the understood semantics.
After semantic understanding, the voice interaction system determines whether to continue listening to the user's voice. If so, it continues listening; if it determines not to continue, it ends listening. For example, whether to continue listening may be decided according to a preset listening duration: if the preset listening duration has not yet been exceeded, listening continues; otherwise listening ends.
If the invalid rejection model identifies the voice information as invalid, the system likewise determines whether to continue listening to the user's voice. If so, it continues listening; if it determines not to continue, it ends listening.
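The flow described for FIG. 11 can be sketched as a simple loop. Everything named here is hypothetical: `run_session`, the callables `is_valid`, `understand` and `execute`, and the injectable `clock` stand in for the invalid rejection model, semantic understanding and instruction execution, whose real interfaces are not given in this passage.

```python
import time


def run_session(utterances, is_valid, understand, execute,
                max_listen_seconds, clock=time.monotonic):
    """Process utterances after wake-up until the listening window closes.

    utterances:         iterable of acquired voice information.
    is_valid:           stand-in for the invalid rejection model.
    understand:         stand-in for semantic understanding.
    execute:            stand-in for instruction parsing and execution.
    max_listen_seconds: preset listening duration.
    """
    started = clock()
    results = []
    for utterance in utterances:
        if is_valid(utterance):
            # Valid: understand the semantics, then parse and execute.
            results.append(execute(understand(utterance)))
        # Valid or not, decide whether to keep listening.
        if clock() - started > max_listen_seconds:
            break
    return results
```

Passing a fake `clock` makes the loop easy to exercise without waiting for real time to pass.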
In a possible implementation, in the flow shown in FIG. 11, once the voice information has been judged valid, the step of determining whether to continue listening to the user's voice and the step of semantic understanding may also be performed at the same time, or the determination of whether to continue listening may be made first and semantic understanding performed afterwards. The present application does not limit the order in which these two operations are performed.
In addition, after semantic understanding is performed on the voice information, the resulting semantics may be fed back into the validity identification process, for example input into the invalid rejection model for use in adjusting the sensitivity of the judgment condition.
In addition, it should be noted that the embodiments of the voice information processing method provided by the present application are described above mainly using the judgment conditions in the invalid rejection model as an example. In practical applications, however, the judgment condition for the validity of voice information is not limited to the judgment conditions in the invalid rejection model. Any solution that adjusts the judgment condition for the validity of voice information based on one or more of the above influencing factors of validity identification falls within the protection scope of the present application.
In summary, the voice information processing method provided by the present application starts from one or more factors that influence the judgment of voice information validity and adjusts, in real time, the sensitivity of the judgment condition with which the device judges the validity of acquired voice information. The device can thus judge the validity of voice information flexibly and effectively across different scenarios and user states. This improves the accuracy of validity identification, reduces the false trigger rate caused by invalid voice information, saves the computing resources that the device would otherwise waste on false triggers, and can also improve the user experience during voice interaction.
The foregoing mainly describes the voice information processing method provided by the embodiments of the present application. It can be understood that, to implement the corresponding functions described above, each device includes the hardware structures and/or software modules that perform those functions. In combination with the units and steps of the examples described in the embodiments disclosed herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered to go beyond the scope of the present application.
In the embodiments of the present application, the device may be divided into functional modules according to the foregoing method examples. For example, one functional module may be provided for each function, or two or more functions may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division into modules in the embodiments of the present application is illustrative and is merely a division by logical function; other divisions are possible in actual implementation.
Where one functional module is provided for each function, FIG. 12 shows a schematic diagram of a possible logical structure of an apparatus. The apparatus may be the device described above, a chip in the device, a processing system in the device, or the like. The apparatus 1200 includes an acquisition unit 1201, an adjustment unit 1202, a semantic understanding unit 1203 and an execution unit 1204, where:
The acquisition unit 1201 is configured to acquire first voice information. The acquisition unit 1201 may be implemented by a communication interface or a transceiver, and may perform the operations described in step S201 shown in FIG. 2.
The adjustment unit 1202 is configured to adjust a judgment condition based on an influencing factor of the validity of the first voice information, where the judgment condition is one or more judgment conditions in a validity judgment model for the first voice information, and the validity indicates whether the first voice information is a valid voice control instruction for the device that acquired it. The adjustment unit 1202 may be implemented by a processor, and may perform the operations described in step S202 shown in FIG. 2.
The semantic understanding unit 1203 is configured to perform semantic understanding on the first voice information when the first voice information is determined to be valid based on the adjusted judgment condition. The semantic understanding unit 1203 may be implemented by a processor, and may perform the semantic understanding operation described in step S203 shown in FIG. 2.
The execution unit 1204 is configured to execute the instruction of the first voice information. The execution unit 1204 may be implemented by a processor, and may perform the execution operation described in step S203 shown in FIG. 2.
In a possible implementation, the adjustment unit 1202 is specifically configured to:
increase the sensitivity of the judgment condition when analysis based on the influencing factor shows that the first voice information is more likely to be valid than invalid, where a higher sensitivity of the judgment condition indicates a higher probability that the first voice information is determined to be valid by the judgment condition; and
decrease the sensitivity of the judgment condition when analysis based on the influencing factor shows that the first voice information is less likely to be valid than invalid, where a lower sensitivity of the judgment condition indicates a lower probability that the first voice information is determined to be valid by the judgment condition.
In a possible implementation, the judgment condition includes a selection condition for the pre-judgment modules that pre-judge the validity of the first voice information in the validity judgment model, where the pre-judgment modules include a rule matching module and an inference module.
In a possible implementation, the judgment condition includes a judgment threshold of the inference module that is used in the validity judgment model to pre-judge the validity of the first voice information.
In a possible implementation, the judgment condition includes a comprehensive judgment condition of a decision module in the validity judgment model. The comprehensive judgment condition is the condition for determining, based on pre-judgment results, whether the first voice information is valid, and a pre-judgment result is the result produced by a pre-judgment module in the validity judgment model when it pre-judges the validity of the first voice information.
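To make the three kinds of judgment condition more concrete, the sketch below models them as adjustable fields: which pre-judgment modules are selected, the inference module's threshold, and how the decision module combines pre-judgment results. The class and field names, the substring-based rule matching and the any/all combination are assumptions for illustration, not the modules as defined in the application.

```python
from dataclasses import dataclass


@dataclass
class JudgmentConditions:
    """Adjustable conditions of a hypothetical validity judgment model."""
    use_rules: bool = True             # select the rule matching module
    use_inference: bool = True         # select the inference module
    inference_threshold: float = 0.5   # score needed for a "valid" pre-judgment
    require_all: bool = False          # decision module: all must agree, or any


def judge(text, inference_score, rules, conditions):
    """Return True when the voice information is judged valid.

    inference_score is assumed to be the inference module's probability
    that the text is a valid instruction; rules is a list of phrases that
    the rule matching module treats as valid.
    """
    votes = []
    if conditions.use_rules:
        votes.append(any(rule in text for rule in rules))
    if conditions.use_inference:
        votes.append(inference_score >= conditions.inference_threshold)
    if not votes:
        return False
    return all(votes) if conditions.require_all else any(votes)
```

Under this reading, raising the sensitivity corresponds to lowering `inference_threshold` or accepting any positive pre-judgment, and lowering the sensitivity corresponds to raising the threshold or requiring all selected modules to agree.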
In a possible implementation, the influencing factor is one or more of the following:
the environmental conditions in which the first voice information is generated;
the continuous listening duration of the apparatus 1200;
a first time interval between the acquisition of the first voice information and the most recent acquisition of valid voice information;
a second time interval between the acquisition of the first voice information and the most recent acquisition of invalid voice information;
the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
a first degree of association between the semantics of the first voice information and the semantics of the most recently acquired valid voice information;
a second degree of association between the semantics of the first voice information and the semantics of the most recently acquired invalid voice information;
a third degree of association between the first voice information and the valid voice information most recently acquired by the apparatus 1200;
the state of the voice dialogue between the apparatus 1200 and the user up to the time the first voice information is acquired;
a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
In a possible implementation, the environmental conditions in which the first voice information is generated include one or more of the following:
the number of speakers within a second preset duration ending when the apparatus 1200 acquires the first voice information, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
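The four environmental items listed above could be gathered into one record and mapped to a sensitivity change, as in the sketch below. The direction assigned to each item (fewer speakers, fewer people, higher confidence and higher signal-to-noise ratio all raise the sensitivity), the thresholds and the step size are assumptions for the example; the actual adjustment is described with FIG. 4, which lies outside this passage.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Environmental conditions when the first voice information is generated."""
    recent_speakers: int   # speakers within the second preset duration
    people_in_range: int   # people within the preset range
    confidence: float      # confidence of the first voice information
    snr_db: float          # signal-to-noise ratio, in decibels (assumed unit)


def environment_adjustment(env, step=0.05, min_confidence=0.6, min_snr_db=10.0):
    """Return a signed sensitivity change from the environment (assumed rules)."""
    delta = 0.0
    delta += step if env.recent_speakers <= 1 else -step
    delta += step if env.people_in_range <= 1 else -step
    delta += step if env.confidence >= min_confidence else -step
    delta += step if env.snr_db >= min_snr_db else -step
    return delta
```

A single occupant speaking clearly in a quiet space would push the sensitivity up by four steps under these assumed rules, while several people talking over a noisy background would push it down by the same amount.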
For the specific operations and beneficial effects of the units in the apparatus 1200 shown in FIG. 12, refer to the corresponding descriptions in the foregoing method embodiments. Details are not repeated here.
Where one functional module is provided for each function, FIG. 13 shows a schematic diagram of a possible logical structure of an apparatus. The apparatus may be the device described above, a chip in the device, a processing system in the device, or the like. The apparatus 1300 includes an acquisition unit 1301 and an execution unit 1302, where:
The acquisition unit 1301 is configured to acquire first voice information. The acquisition unit 1301 may be implemented by a communication interface or a transceiver, and may perform the operations described in step S1001 shown in FIG. 10.
The execution unit 1302 is configured to perform the operation indicated by the first voice information when the first voice information is determined based on a judgment condition to be a valid voice control instruction, where the judgment condition is obtained through adjustment based on the environmental conditions in which the first voice information is generated. The execution unit 1302 may be implemented by a processor, and may perform the operations described in step S1002 shown in FIG. 10.
For the specific operations and beneficial effects of the units in the apparatus 1300 shown in FIG. 13, refer to the corresponding descriptions in the foregoing method embodiments. Details are not repeated here.
FIG. 14 is a schematic diagram of a possible hardware structure of the device provided by the present application. The device may be the device in the methods described in the foregoing embodiments. The device 1400 includes a processor 1401, a memory 1402 and a communication interface 1403. The processor 1401, the communication interface 1403 and the memory 1402 may be connected to one another directly or via a bus 1404.
For example, the memory 1402 is configured to store the computer programs and data of the device 1400. The memory 1402 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), compact disc read-only memory (CD-ROM), and the like.
When the embodiment shown in FIG. 14 is implemented, the software or program code required to perform the functions of all or some of the units in FIG. 14 is stored in the memory 1402.
When the embodiment of FIG. 14 is implemented, if the memory 1402 stores the software or program code required for the functions of only some of the units, the processor 1401 may, in addition to calling the program code in the memory 1402 to implement those functions, cooperate with other components (such as the communication interface 1403) to complete the other functions described in the embodiment of FIG. 14 (such as receiving or sending data).
There may be more than one communication interface 1403. The communication interface 1403 supports communication by the device 1400, for example receiving or sending data or signals.
For example, the processor 1401 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The processor 1401 may be configured to read the program stored in the memory 1402 and perform the following operations:
acquiring first voice information; adjusting a judgment condition based on an influencing factor of the validity of the first voice information, where the judgment condition is one or more judgment conditions in a validity judgment model for the first voice information, and the validity indicates whether the first voice information is a valid voice control instruction for the device 1400 that acquired it; and, when the first voice information is determined to be valid based on the adjusted judgment condition, performing semantic understanding on the first voice information and executing the instruction of the first voice information.
In a possible implementation, adjusting the judgment condition based on the influencing factor of the validity of the first voice information includes:
increasing the sensitivity of the judgment condition when analysis based on the influencing factor shows that the first voice information is more likely to be valid than invalid, where a higher sensitivity of the judgment condition indicates a higher probability that the first voice information is determined to be valid by the judgment condition; and
decreasing the sensitivity of the judgment condition when analysis based on the influencing factor shows that the first voice information is less likely to be valid than invalid, where a lower sensitivity of the judgment condition indicates a lower probability that the first voice information is determined to be valid by the judgment condition.
For the specific operations and beneficial effects of the units in the device 1400 shown in FIG. 14, refer to the corresponding descriptions in the foregoing method embodiments. Details are not repeated here.
FIG. 15 is a schematic structural diagram of another voice information processing apparatus provided by an embodiment of the present application. The apparatus may be the device in the foregoing embodiments, a chip in the device, a processing system in the device, or the like, and can implement the voice information processing method provided by the present application and its optional embodiments. As shown in FIG. 15, the voice information processing apparatus 1500 includes a processor 1501 and an interface circuit 1502 coupled to the processor 1501. It should be understood that although FIG. 15 shows only one processor and one interface circuit, the voice information processing apparatus 1500 may include other numbers of processors and interface circuits.
The interface circuit 1502 is configured to communicate with other components of the apparatus 1500, such as a memory or another processor. The processor 1501 is configured to exchange signals with other components through the interface circuit 1502. The interface circuit 1502 may be an input/output interface of the processor 1501.
For example, the processor 1501 reads, through the interface circuit 1502, computer programs or instructions in a memory coupled to it, and decodes and executes these computer programs or instructions. It should be understood that these computer programs or instructions may include the functional programs of the foregoing methods. When the corresponding functional program is decoded and executed by the processor 1501, the voice information processing apparatus 1500 is enabled to implement the solutions in the voice information processing method provided by the embodiments of the present application.
可选的,这些功能程序存储在语音信息处理装置1500外部的存储器中。当该功能程序被处理器1501译码并执行时,内存储器中临时存放该功能程序的部分或全部内容。Optionally, these functional programs are stored in a memory outside the voice information processing apparatus 1500 . When the function program is decoded and executed by the processor 1501, part or all of the content of the function program is temporarily stored in the internal memory.
可选的,这些功能程序存储在语音信息处理装置1500内部的存储器中。当语音信息处理装置1500内部的存储器中存储有该功能程序时,语音信息处理装置1500可被设置在本申请实施例的设备中。Optionally, these functional programs are stored in the internal memory of the voice information processing apparatus 1500 . When the function program is stored in the internal memory of the voice information processing apparatus 1500, the voice information processing apparatus 1500 may be set in the device of the embodiment of the present application.
可选的,这些功能程序的部分内容存储在语音信息处理装置1500外部的存储器中,这些功能程序的其他部分内容存储在语音信息处理装置1500内部的存储器中。Optionally, part of the content of these function programs is stored in a memory outside the voice information processing apparatus 1500 , and other parts of the content of these function programs are stored in a memory inside the voice information processing apparatus 1500 .
应理解,图1,图12或图13,图14和图15任一所示的装置或设备可以互相结合,图1,图12或图13,图14和图15任一所示的装置或设备以及各可选实施例相关设计细节可互相参考,也可以参考图2或图10任一所示的语音信息处理方法以及各可选实施例相关设计细节。此处不再重复赘述。It should be understood that any of the apparatuses or devices shown in FIG. 1 , FIG. 12 or FIG. 13 , FIG. 14 and FIG. 15 may be combined with each other, and the apparatus or apparatus shown in any of The relevant design details of the device and each optional embodiment can be referred to each other, and can also be referred to the voice information processing method shown in any one of FIG. 2 or FIG. 10 and the relevant design details of each optional embodiment. It will not be repeated here.
An embodiment of the present application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the operations performed by the server in any one of the foregoing embodiments and their possible embodiments.
An embodiment of the present application further provides a computer program product. When the computer program product is read and executed by a computer, the operations performed by the server in any one of the foregoing embodiments and their possible embodiments are executed.
An embodiment of the present application further provides a computer program. When the computer program is executed on a computer, it causes the computer to implement the operations performed by the server in any one of the foregoing embodiments and their possible embodiments.
In summary, the present application provides a voice information processing method and apparatus that can improve the accuracy of valid-speech recognition and reduce the false-trigger rate of invalid speech in different intelligent voice interaction scenarios.
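As an illustration of the idea summarized above, the following minimal Python sketch assumes that the judgment condition is a confidence threshold and that raising the "sensitivity" corresponds to lowering that threshold. The names `Environment`, `adjusted_threshold`, and `is_valid_command`, the direction of each adjustment, and all numeric values are hypothetical; the application does not prescribe a particular formula.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Environmental conditions when the first voice information is generated."""
    speakers: int       # number of speakers within the preset duration
    people_nearby: int  # number of people within the preset range
    snr_db: float       # signal-to-noise ratio of the voice information


BASE_THRESHOLD = 0.5  # hypothetical default confidence threshold


def adjusted_threshold(env: Environment) -> float:
    """Return the confidence threshold after adjusting for the environment.

    Assumption: several speakers or a low SNR make an utterance less likely
    to be a command for the device, so the sensitivity is lowered, which is
    modeled here as a higher threshold.
    """
    threshold = BASE_THRESHOLD
    if env.speakers > 1 or env.people_nearby > 1:
        threshold += 0.2
    if env.snr_db < 10.0:
        threshold += 0.1
    return min(threshold, 0.95)


def is_valid_command(confidence: float, env: Environment) -> bool:
    """Decide whether the utterance is treated as a valid voice control instruction."""
    return confidence >= adjusted_threshold(env)
```

With these assumed values, an utterance with confidence 0.6 is accepted when one person is present in a quiet environment and rejected when three people are talking.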
In the present application, terms such as "first" and "second" are used to distinguish between identical or similar items that have basically the same role and function. It should be understood that "first", "second", and "nth" have no logical or temporal dependency on one another and do not limit quantity or execution order. It should also be understood that although the following description uses the terms first, second, and so on to describe various elements, these elements should not be limited by the terms. The terms are only used to distinguish one element from another. For example, without departing from the scope of the various described examples, a first image may be referred to as a second image, and similarly, a second image may be referred to as a first image. Both the first image and the second image may be images, and in some cases may be separate and different images.
It should also be understood that, in the embodiments of the present application, the magnitude of the sequence number of each process does not imply an execution order. The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
It should also be understood that the term "include" (also "includes", "including", "comprises", and/or "comprising"), when used in this specification, specifies the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that "one embodiment", "an embodiment", and "a possible implementation" mentioned throughout the specification mean that a particular feature, structure, or characteristic related to the embodiment or implementation is included in at least one embodiment of the present application. Therefore, "in one embodiment", "in an embodiment", or "a possible implementation" appearing in various places throughout the specification does not necessarily refer to the same embodiment. In addition, these particular features, structures, or characteristics may be combined in one or more embodiments in any suitable manner.
Finally, it should be noted that the foregoing embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (30)

  1. A voice information processing method, characterized in that the method comprises:
    acquiring first voice information;
    performing the operation indicated by the first voice information when the first voice information is determined, based on a judgment condition, to be a valid voice control instruction, wherein the judgment condition is obtained by adjustment based on the environmental conditions in which the first voice information is generated.
  2. The method according to claim 1, characterized in that the environmental conditions in which the first voice information is generated include one or more of the following:
    the number of speakers within a second preset duration ending when the device acquires the first voice information, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
  3. The method according to claim 1 or 2, characterized in that the judgment condition being obtained by adjustment based on the environmental conditions in which the first voice information is generated comprises:
    the judgment condition is obtained by adjustment based on the environmental conditions and the continuous listening duration of the device.
  4. The method according to claim 3, characterized in that the judgment condition being obtained by adjustment based on the environmental conditions and the continuous listening duration of the device comprises:
    the judgment condition is obtained by adjustment based on the environmental conditions, the continuous listening duration, and the situation of historical voice information.
  5. The method according to claim 1 or 2, characterized in that the judgment condition being obtained by adjustment based on the environmental conditions in which the first voice information is generated comprises:
    the judgment condition is obtained by adjustment based on the environmental conditions and the situation of historical voice information.
  6. The method according to claim 4 or 5, characterized in that the situation of the historical voice information includes one or more of the following:
    a first time interval between acquiring the first voice information and the most recent acquisition of valid voice information;
    a second time interval between acquiring the first voice information and the most recent acquisition of invalid voice information;
    the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
    a first degree of semantic association between the first voice information and the most recently acquired valid voice information;
    a second degree of semantic association between the first voice information and the most recently acquired invalid voice information;
    a third degree of association between the first voice information and the valid voice information most recently acquired by the device;
    the state of the voice dialogue between the device and the user up to the time the first voice information is acquired;
    a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
    a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
  7. The method according to any one of claims 1 to 6, characterized in that:
    when the environmental conditions indicate that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the judgment condition is increased;
    when the environmental conditions indicate that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the judgment condition is decreased.
  8. The method according to claim 3 or 4, characterized in that the longer the continuous listening duration of the device, the lower the sensitivity of the judgment condition is adjusted.
  9. The method according to any one of claims 4 to 6, characterized in that the situation of the historical voice information includes a first time interval between acquiring the first voice information and the most recent acquisition of valid voice information;
    the longer the first time interval, the lower the sensitivity of the judgment condition is adjusted.
  10. The method according to any one of claims 4 to 6, characterized in that the situation of the historical voice information includes a second time interval between acquiring the first voice information and the most recent acquisition of invalid voice information;
    the longer the second time interval, the lower the sensitivity of the judgment condition is adjusted.
  11. The method according to any one of claims 4 to 6, characterized in that the situation of the historical voice information includes a first time interval between acquiring the first voice information and the most recent acquisition of valid voice information, and a second time interval between acquiring the first voice information and the most recent acquisition of invalid voice information;
    when the first time interval is shorter than the second time interval, the sensitivity of the judgment condition is increased.
  12. The method according to any one of claims 4 to 6, characterized in that the situation of the historical voice information includes the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
    when the proportion of valid voice information is greater than the proportion of invalid voice information, the sensitivity of the judgment condition is increased;
    when the proportion of valid voice information is less than the proportion of invalid voice information: if the proportion of valid voice information is trending upward, the sensitivity of the judgment condition is increased; if the proportion of valid voice information is trending downward, the sensitivity of the judgment condition is decreased.
  13. The method according to any one of claims 4 to 6, characterized in that the situation of the historical voice information includes the state of the voice dialogue between the device and the user up to the time the first voice information is acquired;
    when a voice dialogue between the device and the user is in progress, the sensitivity of the judgment condition is increased.
  14. A voice information processing apparatus, characterized in that the apparatus comprises:
    an acquisition unit, configured to acquire first voice information;
    an execution unit, configured to perform the operation indicated by the first voice information when the first voice information is determined, based on a judgment condition, to be a valid voice control instruction, wherein the judgment condition is obtained by adjustment based on the environmental conditions in which the first voice information is generated.
  15. The apparatus according to claim 14, characterized in that the environmental conditions in which the first voice information is generated include one or more of the following:
    the number of speakers within a second preset duration ending when the device acquires the first voice information, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
  16. The apparatus according to claim 14 or 15, characterized in that the judgment condition being obtained by adjustment based on the environmental conditions in which the first voice information is generated comprises:
    the judgment condition is obtained by adjustment based on the environmental conditions and the continuous listening duration of the device.
  17. The apparatus according to claim 16, characterized in that the judgment condition being obtained by adjustment based on the environmental conditions and the continuous listening duration of the device comprises:
    the judgment condition is obtained by adjustment based on the environmental conditions, the continuous listening duration, and the situation of historical voice information.
  18. The apparatus according to claim 14 or 15, characterized in that the judgment condition being obtained by adjustment based on the environmental conditions in which the first voice information is generated comprises:
    the judgment condition is obtained by adjustment based on the environmental conditions and the situation of historical voice information.
  19. The apparatus according to claim 17 or 18, characterized in that the situation of the historical voice information includes one or more of the following:
    a first time interval between acquiring the first voice information and the most recent acquisition of valid voice information;
    a second time interval between acquiring the first voice information and the most recent acquisition of invalid voice information;
    the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
    a first degree of semantic association between the first voice information and the most recently acquired valid voice information;
    a second degree of semantic association between the first voice information and the most recently acquired invalid voice information;
    a third degree of association between the first voice information and the valid voice information most recently acquired by the device;
    the state of the voice dialogue between the device and the user up to the time the first voice information is acquired;
    a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
    a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
  20. The apparatus according to any one of claims 14 to 19, characterized in that:
    when the environmental conditions indicate that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the judgment condition is increased;
    when the environmental conditions indicate that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the judgment condition is decreased.
  21. The apparatus according to claim 16 or 17, characterized in that the longer the continuous listening duration of the device, the lower the sensitivity of the judgment condition is adjusted.
  22. The apparatus according to any one of claims 17 to 19, characterized in that the situation of the historical voice information includes a first time interval between acquiring the first voice information and the most recent acquisition of valid voice information;
    the longer the first time interval, the lower the sensitivity of the judgment condition is adjusted.
  23. The apparatus according to any one of claims 17 to 19, characterized in that the situation of the historical voice information includes a second time interval between acquiring the first voice information and the most recent acquisition of invalid voice information;
    the longer the second time interval, the lower the sensitivity of the judgment condition is adjusted.
  24. The apparatus according to any one of claims 17 to 19, characterized in that the situation of the historical voice information includes a first time interval between acquiring the first voice information and the most recent acquisition of valid voice information, and a second time interval between acquiring the first voice information and the most recent acquisition of invalid voice information;
    when the first time interval is shorter than the second time interval, the sensitivity of the judgment condition is increased.
  25. The apparatus according to any one of claims 17 to 19, characterized in that the situation of the historical voice information includes the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
    when the proportion of valid voice information is greater than the proportion of invalid voice information, the sensitivity of the judgment condition is increased;
    when the proportion of valid voice information is less than the proportion of invalid voice information: if the proportion of valid voice information is trending upward, the sensitivity of the judgment condition is increased; if the proportion of valid voice information is trending downward, the sensitivity of the judgment condition is decreased.
  26. The apparatus according to any one of claims 17 to 19, characterized in that the situation of the historical voice information includes the state of the voice dialogue between the device and the user up to the time the first voice information is acquired;
    when a voice dialogue between the device and the user is in progress, the sensitivity of the judgment condition is increased.
  27. A device, characterized in that the device comprises a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory so that the device performs the method according to any one of claims 1 to 13.
  28. A chip system, characterized in that the chip system is applied to an electronic apparatus; the chip system comprises an interface circuit and a processor; the interface circuit and the processor are interconnected by a line; the interface circuit is configured to receive a signal from a memory of the electronic apparatus and send the signal to the processor, the signal including computer instructions stored in the memory; when the processor executes the computer instructions, the chip system performs the method according to any one of claims 1 to 13.
  29. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the method according to any one of claims 1 to 13.
  30. A computer program product, characterized in that when the computer program product is executed by a processor, the method according to any one of claims 1 to 13 is performed.
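
The sensitivity rules recited in claims 8, 11, 12, and 13 can be sketched as follows. This is a minimal illustration only: it assumes that sensitivity is a single number in the range 0 to 1 and that each rule shifts it by a fixed step. The class `History`, the function `sensitivity`, the 60-second scale, and the step size are hypothetical and are not taken from the claims, which state only the direction of each adjustment.

```python
from dataclasses import dataclass


@dataclass
class History:
    listening_s: float      # continuous listening duration of the device
    since_valid_s: float    # first time interval (since last valid voice information)
    since_invalid_s: float  # second time interval (since last invalid voice information)
    valid_ratio: float      # share of valid voice information in the preset window, 0..1
    valid_trend: float      # > 0 if the valid share is rising, < 0 if falling
    in_dialogue: bool       # whether a device-user voice dialogue is in progress


def sensitivity(h: History, base: float = 0.5, step: float = 0.1) -> float:
    """Return an adjusted sensitivity in [0, 1] (higher accepts more utterances)."""
    s = base
    # Claim 8: the longer the device has been listening, the lower the sensitivity.
    s -= step * min(h.listening_s / 60.0, 1.0)
    # Claim 11: a valid utterance is more recent than an invalid one.
    if h.since_valid_s < h.since_invalid_s:
        s += step
    # Claim 12: compare the valid and invalid shares, then look at the trend.
    if h.valid_ratio > 0.5:
        s += step
    elif h.valid_ratio < 0.5:
        if h.valid_trend > 0:
            s += step
        elif h.valid_trend < 0:
            s -= step
    # Claim 13: an ongoing dialogue raises the sensitivity.
    if h.in_dialogue:
        s += step
    return max(0.0, min(1.0, s))
```

A fixed additive step is only one possible choice; a weighted or learned combination would satisfy the same directional rules.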
PCT/CN2021/088522 2021-04-20 2021-04-20 Speech information processing method, and device WO2022222045A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180001492.4A CN113330513A (en) 2021-04-20 2021-04-20 Voice information processing method and device
PCT/CN2021/088522 WO2022222045A1 (en) 2021-04-20 2021-04-20 Speech information processing method, and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/088522 WO2022222045A1 (en) 2021-04-20 2021-04-20 Speech information processing method, and device

Publications (1)

Publication Number Publication Date
WO2022222045A1 true WO2022222045A1 (en) 2022-10-27

Family

ID=77427019

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088522 WO2022222045A1 (en) 2021-04-20 2021-04-20 Speech information processing method, and device

Country Status (2)

Country Link
CN (1) CN113330513A (en)
WO (1) WO2022222045A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376513B (en) * 2022-10-19 2023-05-12 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116483960B (en) * 2023-03-30 2024-01-02 阿波罗智联(北京)科技有限公司 Dialogue identification method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578468A (en) * 2012-08-01 2014-02-12 联想(北京)有限公司 Method for adjusting confidence coefficient threshold of voice recognition and electronic device
WO2014114049A1 (en) * 2013-01-24 2014-07-31 华为终端有限公司 Voice recognition method and device
US20170178627A1 (en) * 2015-12-22 2017-06-22 Intel Corporation Environmental noise detection for dialog systems
CN109326289A (en) * 2018-11-30 2019-02-12 深圳创维数字技术有限公司 Exempt to wake up voice interactive method, device, equipment and storage medium
CN110718223A (en) * 2019-10-28 2020-01-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for voice interaction control
CN110782891A (en) * 2019-10-10 2020-02-11 珠海格力电器股份有限公司 Audio processing method and device, computing equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102044243B (en) * 2009-10-15 2012-08-29 华为技术有限公司 Method and device for voice activity detection (VAD) and encoder
KR101698369B1 (en) * 2015-11-24 2017-01-20 주식회사 인텔로이드 Method and apparatus for information providing using user speech signal
CN107622770B (en) * 2017-09-30 2021-03-16 百度在线网络技术(北京)有限公司 Voice wake-up method and device
CN108320742B (en) * 2018-01-31 2021-09-14 广东美的制冷设备有限公司 Voice interaction method, intelligent device and storage medium
CN110148405B (en) * 2019-04-10 2021-07-13 北京梧桐车联科技有限责任公司 Voice instruction processing method and device, electronic equipment and storage medium
CN110211605A (en) * 2019-05-24 2019-09-06 珠海多士科技有限公司 Smart machine speech sensitivity adjusting method, device, equipment and storage medium
CN110556107A (en) * 2019-08-23 2019-12-10 宁波奥克斯电气股份有限公司 control method and system capable of automatically adjusting voice recognition sensitivity, air conditioner and readable storage medium
CN111580773B (en) * 2020-04-15 2023-11-14 北京小米松果电子有限公司 Information processing method, device and storage medium

Also Published As

Publication number Publication date
CN113330513A (en) 2021-08-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21937290

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21937290

Country of ref document: EP

Kind code of ref document: A1