CN113330513A - Voice information processing method and device - Google Patents

Voice information processing method and device

Info

Publication number
CN113330513A
CN113330513A (application CN202180001492.4A)
Authority
CN
China
Prior art keywords
voice information
condition
information
voice
sensitivity
Prior art date
Legal status
Pending
Application number
CN202180001492.4A
Other languages
Chinese (zh)
Inventor
杨世辉
聂为然
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN113330513A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command

Abstract

Embodiments of this application disclose a voice information processing method and device. The method includes: acquiring first voice information; and executing the operation indicated by the first voice information when the first voice information is determined, based on a decision condition, to be a valid voice control instruction, where the decision condition is adjusted based on the environmental conditions under which the first voice information was generated. The method and device improve the accuracy of valid-voice recognition in different intelligent voice interaction scenarios and reduce the false-trigger rate of invalid voice.

Description

Voice information processing method and device
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method and an apparatus for processing speech information.
Background
In intelligent voice interaction scenarios, a smart device commonly listens to the user's voice in one of two modes: a continuous listening mode and a full-time wake-free mode, the latter also called a full-time listening mode. In the continuous or full-time listening state, the smart device must determine whether the user's speech is a valid instruction directed at the device, that is, distinguish human-machine dialogue content from human-human dialogue content.
Specifically, in the listening state the voice information acquired by the device includes chat content. To avoid the smart device being falsely triggered by chat, a rule matching module or an inference module (such as a neural network inference module) is typically used to decide whether acquired voice information is a valid voice control instruction. However, the validity of the same voice information, or of voice information with the same semantics, may differ across usage environments and scenarios: a sentence may be a valid voice control instruction in the current scenario yet be mere chat, that is, invalid information, in another. Existing validity-decision schemes cannot adapt to recognizing valid voice under different usage environments and scenarios, so recognition accuracy is low and invalid voice easily causes false triggering.
In summary, how to improve the accuracy of valid-voice recognition in different intelligent voice interaction scenarios and reduce the false-trigger rate of invalid voice is a technical problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
This application provides a voice information processing method and device that can improve the accuracy of valid-voice recognition and reduce the false-trigger rate of invalid voice in different intelligent voice interaction scenarios.
In a first aspect, the present application provides a method for processing voice information, including:
acquiring first voice information; and executing the operation indicated by the first voice information when the first voice information is determined, based on a decision condition, to be a valid voice control instruction, where the decision condition is adjusted based on the environmental conditions under which the first voice information was generated.
The environmental conditions under which voice information is generated strongly affect whether that information is a valid voice control instruction: the same or similar voice information may be a valid instruction under one environmental condition but not under another. This application therefore adaptively adjusts, for voice information received under different environmental conditions, the decision condition used to judge validity. Validity can thus be judged more reliably under different environmental conditions, the accuracy of validity decisions is improved, and the false-trigger rate of invalid signals is reduced.
In one possible embodiment, the environmental conditions under which the first voice information is generated include one or more of the following: the number of speakers within a second preset duration before the device obtains the first voice information, the number of speakers within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
The more speakers there are over a period of time, and/or the more people there are nearby when voice information is generated, the greater the probability that the voice information received by the device is chat, that is, invalid voice. In addition, the higher the confidence and/or signal-to-noise ratio of the voice information, the greater the probability that the device correctly recognizes its sentences, which also affects validity recognition. Adaptively adjusting the decision condition based on one or more of these items therefore allows validity to be judged better, improves decision accuracy, and reduces the false-trigger rate of invalid signals.
In a possible implementation, adjusting the decision condition based on the environmental conditions under which the first voice information is generated includes: adjusting the decision condition based on the environmental conditions and the continuous listening duration of the device.
The longer the device has been listening continuously, the greater the probability that listened-to voice information is invalid. This application therefore combines the environmental conditions at the time the voice information is generated with the device's continuous listening duration to adaptively adjust the decision condition, so that validity can be judged still better, decision accuracy is improved, and the false-trigger rate of invalid signals is reduced.
In a possible embodiment, adjusting the decision condition based on the environmental conditions and the continuous listening duration of the device includes: adjusting the decision condition based on the environmental conditions, the continuous listening duration, and the conditions of historical voice information.
Historical voice information can also help judge the validity of currently acquired voice information. For example, if the currently acquired voice information is highly similar to historically acquired valid voice information, it is likely a valid voice instruction; conversely, if it is highly similar to historically acquired invalid voice information, it is likely an invalid voice instruction. Therefore, in addition to the environmental conditions of voice generation and the device's listening duration, this application adaptively adjusts the decision condition in combination with historical voice information, so that validity can be judged still better, decision accuracy is improved, and the false-trigger rate of invalid signals is reduced.
In a possible implementation, adjusting the decision condition based on the environmental conditions under which the first voice information is generated includes: adjusting the decision condition based on the environmental conditions and the conditions of historical voice information.
Based on the foregoing description, adjusting the decision condition in combination with the environmental conditions of voice generation and historical voice information allows validity to be judged still better, improves decision accuracy, and reduces the false-trigger rate of invalid signals.
In one possible embodiment, the conditions of the historical voice information include one or more of the following:
a first time interval between the time the first voice information is acquired and the time valid voice information was last acquired;
a second time interval between the time the first voice information is acquired and the time invalid voice information was last acquired;
the ratio of valid voice information to invalid voice information within a first preset duration before the first voice information is acquired;
a first degree of association between the first voice information and the semantics of the most recently acquired valid voice information;
a second degree of association between the first voice information and the semantics of the most recently acquired invalid voice information;
a third degree of association between the first voice information and the valid voice information most recently obtained by the device;
the state of the voice conversation between the device and the user when the first voice information is acquired;
a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
In this application, the historical voice information that can help judge the validity of currently acquired voice information includes one or more of the above items. Adaptively adjusting the decision condition based on one or more of them allows validity to be judged better, improves decision accuracy, and reduces the false-trigger rate of invalid signals.
In a possible implementation, when the environmental conditions indicate that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is raised;
when the environmental conditions indicate that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the decision condition is lowered.
In the embodiments of this application, if the received voice information has a high probability of being valid, the validity threshold can be reduced, that is, the sensitivity of the decision condition raised; if the probability of being valid is low, the threshold can be raised, that is, the sensitivity lowered. Validity of voice information received under different environmental conditions can thus be recognized flexibly, improving recognition accuracy, instead of judging validity with one fixed decision condition in every scenario.
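As an aid to understanding, the following minimal Python sketch shows one way this sensitivity-versus-threshold relationship could be realized; the function name, the 0.1 step, and the probability inputs are illustrative assumptions rather than part of the claimed method.

    def adjust_threshold(base_threshold, prior_valid, prior_invalid):
        # Raising sensitivity means lowering the validity threshold;
        # lowering sensitivity means raising it. The 0.1 step is an
        # arbitrary example value, not taken from this application.
        step = 0.1
        if prior_valid > prior_invalid:    # environment suggests valid speech
            return max(0.0, base_threshold - step)
        if prior_valid < prior_invalid:    # environment suggests invalid speech
            return min(1.0, base_threshold + step)
        return base_threshold

    # Example: a base threshold of 0.7 drops to 0.6 when the environment
    # indicates the speech is more likely valid, so more utterances pass.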
In a possible embodiment, the longer the device's continuous listening duration, the lower the sensitivity of the decision condition is adjusted.
Because the longer the device listens continuously, the higher the probability that listened-to voice information is invalid, the validity threshold can be raised, that is, the sensitivity of the decision condition lowered, so that whether voice information is valid can be recognized more accurately.
In one possible embodiment, the conditions of the historical voice information include a first time interval between the time the first voice information is acquired and the time valid voice information was last acquired; the longer the first time interval, the lower the sensitivity of the decision condition is adjusted.
Because the longer the interval between acquiring the current voice signal and last acquiring valid voice information, the higher the probability that the current voice signal is an invalid voice instruction, the validity threshold can be raised, that is, the sensitivity of the decision condition lowered, so that whether voice information is valid can be recognized more accurately.
In one possible implementation, the conditions of the historical voice information include a second time interval between the time the first voice information is acquired and the time invalid voice information was last acquired; the longer the second time interval, the lower the sensitivity of the decision condition is adjusted.
Because the longer the interval between acquiring the current voice signal and last acquiring invalid voice information, the higher the probability that the current voice signal is an invalid voice instruction, the validity threshold can be raised, that is, the sensitivity of the decision condition lowered, so that whether voice information is valid can be recognized more accurately.
In a possible implementation, the conditions of the historical voice information include both the first time interval between acquiring the first voice information and last acquiring valid voice information and the second time interval between acquiring the first voice information and last acquiring invalid voice information; when the first time interval is smaller than the second time interval, the sensitivity of the decision condition is raised.
In this application, the first time interval being smaller than the second indicates that not much time has passed between the acquired first voice information and the most recently acquired valid historical voice information, so the probability that the first voice information is a valid voice instruction is relatively high. The validity threshold can therefore be reduced, that is, the sensitivity of the decision condition raised, so that whether voice information is valid can be recognized more accurately.
In a possible implementation, the conditions of the historical voice information include the ratio of valid voice information to invalid voice information within a first preset duration before the first voice information is acquired;
when the proportion of valid voice information is greater than that of invalid voice information, the sensitivity of the decision condition is raised;
when the proportion of valid voice information is smaller than that of invalid voice information: if the proportion of valid voice information is trending upward, the sensitivity of the decision condition is raised; if it is trending downward, the sensitivity is lowered.
In this application, the larger the proportion of valid voice information within the first preset duration, the greater the probability that the currently acquired first voice information is a valid instruction, so the validity threshold can be reduced and the sensitivity of the decision condition raised. Moreover, if the proportion of valid voice information is smaller than that of invalid voice information but trending upward, indicating more and more valid voice information, the probability that the first voice signal is a valid instruction is also higher, so the threshold can likewise be reduced and the sensitivity raised, allowing whether voice information is valid to be recognized more accurately.
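A minimal Python sketch of this ratio-and-trend rule follows; the 0.5 cutoff, the step size, and the way the trend is represented are illustrative assumptions.

    def adjust_for_ratio(sensitivity, valid_ratio, ratio_trend):
        # valid_ratio: share of valid utterances in the preset window.
        # ratio_trend: change of that share versus the previous window
        # (> 0 rising, < 0 falling). Step size 0.1 is an example value.
        step = 0.1
        if valid_ratio > 0.5:           # valid speech dominates the window
            return sensitivity + step
        if ratio_trend > 0:             # valid share is rising
            return sensitivity + step
        if ratio_trend < 0:             # valid share is falling
            return sensitivity - step
        return sensitivity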
In a possible implementation, the conditions of the historical voice information include the state of the voice conversation between the device and the user when the first voice information is acquired; when a voice conversation state exists between the device and the user, the sensitivity of the decision condition is raised.
The voice conversation state refers to an ongoing voice dialogue between the device and the user, which the device can follow through a dialogue state tracking function. If such a state currently exists, the first voice information is more likely a valid voice instruction, so the validity threshold can be reduced, that is, the sensitivity of the decision condition raised, and whether voice information is valid can be recognized more accurately.
In one possible implementation, the device may receive a specified sensitivity for the decision condition, adjust the decision condition based on that sensitivity, and then use the adjusted decision condition to determine whether the first voice information is valid.
In this application, the specified sensitivity is a sensitivity input by the user, so the device can adjust the sensitivity of the decision condition more flexibly according to the user's needs and thus better satisfy them.
In one possible embodiment, the present application provides another method for processing voice information, including: acquiring first voice information; and executing the operation indicated by the first voice information when the first voice information is determined, based on a decision condition, to be a valid voice control instruction, where the decision condition is adjusted based on the continuous listening duration of the device.
In this application, the longer the device listens continuously, the higher the probability that listened-to voice information is invalid. Adaptively adjusting the decision condition according to the device's continuous listening duration therefore allows validity to be judged better, improves decision accuracy, and reduces the false-trigger rate of invalid signals.
In one possible embodiment, the present application provides another method for processing voice information, including: acquiring first voice information; and executing the operation indicated by the first voice information when the first voice information is determined, based on a decision condition, to be a valid voice control instruction, where the decision condition is adjusted based on historical voice information.
Historical voice information can also help judge the validity of currently acquired voice information: if the currently acquired voice information is highly similar to historically acquired valid voice information, it is likely a valid voice instruction; conversely, if it is highly similar to historically acquired invalid voice information, it is likely an invalid voice instruction. Adaptively adjusting the decision condition according to historical voice information therefore allows validity to be judged better, improves decision accuracy, and reduces the false-trigger rate of invalid signals.
In a second aspect, the present application provides a speech information processing apparatus, the apparatus comprising:
an acquisition unit configured to acquire first voice information;
and an execution unit configured to execute the operation indicated by the first voice information when the first voice information is determined, based on a decision condition, to be a valid voice control instruction, where the decision condition is adjusted based on the environmental conditions under which the first voice information was generated.
In one possible embodiment, the environmental conditions under which the first voice information is generated include one or more of the following: the number of speakers within a second preset duration before the device obtains the first voice information, the number of speakers within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
In a possible implementation, adjusting the decision condition based on the environmental conditions under which the first voice information is generated includes: adjusting the decision condition based on the environmental conditions and the continuous listening duration of the device.
In a possible embodiment, adjusting the decision condition based on the environmental conditions and the continuous listening duration of the device includes: adjusting the decision condition based on the environmental conditions, the continuous listening duration, and the conditions of historical voice information.
In a possible implementation, adjusting the decision condition based on the environmental conditions under which the first voice information is generated includes: adjusting the decision condition based on the environmental conditions and the conditions of historical voice information.
In one possible embodiment, the conditions of the historical voice information include one or more of the following:
a first time interval between the time the first voice information is acquired and the time valid voice information was last acquired;
a second time interval between the time the first voice information is acquired and the time invalid voice information was last acquired;
the ratio of valid voice information to invalid voice information within a first preset duration before the first voice information is acquired;
a first degree of association between the first voice information and the semantics of the most recently acquired valid voice information;
a second degree of association between the first voice information and the semantics of the most recently acquired invalid voice information;
a third degree of association between the first voice information and the valid voice information most recently obtained by the device;
the state of the voice conversation between the device and the user when the first voice information is acquired;
a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
In a possible implementation, when the environmental conditions indicate that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is raised;
when the environmental conditions indicate that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the decision condition is lowered.
In a possible embodiment, the longer the device's continuous listening duration, the lower the sensitivity of the decision condition is adjusted.
In one possible embodiment, the conditions of the historical voice information include a first time interval between the time the first voice information is acquired and the time valid voice information was last acquired; the longer the first time interval, the lower the sensitivity of the decision condition is adjusted.
In one possible implementation, the conditions of the historical voice information include a second time interval between the time the first voice information is acquired and the time invalid voice information was last acquired; the longer the second time interval, the lower the sensitivity of the decision condition is adjusted.
In a possible implementation, the conditions of the historical voice information include both the first time interval between acquiring the first voice information and last acquiring valid voice information and the second time interval between acquiring the first voice information and last acquiring invalid voice information; when the first time interval is smaller than the second time interval, the sensitivity of the decision condition is raised.
In a possible implementation, the conditions of the historical voice information include the ratio of valid voice information to invalid voice information within a first preset duration before the first voice information is acquired;
when the proportion of valid voice information is greater than that of invalid voice information, the sensitivity of the decision condition is raised;
when the proportion of valid voice information is smaller than that of invalid voice information: if the proportion of valid voice information is trending upward, the sensitivity of the decision condition is raised; if it is trending downward, the sensitivity is lowered.
In a possible implementation, the conditions of the historical voice information include the state of the voice conversation between the device and the user when the first voice information is acquired; when a voice conversation state exists between the device and the user, the sensitivity of the decision condition is raised.
In a third aspect, the present application provides an apparatus, which may include a processor and a memory, for implementing the voice information processing method described in the first aspect above. The memory is coupled to the processor, and the processor may implement the method according to the first aspect or any of the possible implementations of the first aspect when executing the computer program stored in the memory. The device may also include a communication interface for the device to communicate with other devices, which may be, for example, a transceiver, circuit, bus, module, or other type of communication interface.
In one possible implementation, the apparatus may include:
a memory for storing a computer program;
the processor is used for acquiring first voice information; and executing the operation indicated by the first voice information under the condition of determining that the first voice information is valid voice control instructions based on a judgment condition, wherein the judgment condition is adjusted based on the environmental condition in which the first voice information is generated.
It should be noted that, in the present application, the computer program in the memory may be stored in advance, or may be downloaded from the internet and stored when the device is used. The coupling in the embodiments of the present application is an indirect coupling or connection between devices, units or modules, which may be in an electrical, mechanical or other form, and is used for information interaction between the devices, units or modules.
In a fourth aspect, an embodiment of the present application provides a chip system applied to an electronic device. The chip system includes an interface circuit and a processor, interconnected through a line. The interface circuit is configured to receive signals from a memory of the electronic device and send them to the processor, the signals comprising computer instructions stored in the memory; when the processor executes the computer instructions, the chip system performs the method described in the first aspect and any one of its possible implementations.
In a fifth aspect, the present application provides a computer-readable storage medium storing a computer program for execution by a processor to implement the method of the first aspect or any one of the possible implementation manners of the first aspect.
In a sixth aspect, the present application provides a computer program product, which when executed by a processor, performs the method according to the first aspect or any one of the possible implementations of the first aspect.
The solutions provided in the second through sixth aspects implement or cooperate with the method provided in the first aspect, and can therefore achieve the same or corresponding beneficial effects as that method; details are not repeated here.
Drawings
Fig. 1 is a schematic diagram of a system architecture to which the voice information processing method provided in this application applies;
Fig. 2 is a schematic flow chart of a voice information processing method provided in this application;
Fig. 3 is a schematic structural diagram of an invalid rejection model provided in this application;
Figs. 4 and 5 are diagrams illustrating adjustment of the sensitivity of decision conditions based on influencing factors in this application;
Figs. 6A and 6B are diagrams illustrating adjustment of the sensitivity of decision conditions based on influencing factors in this application;
Figs. 6C and 6D are schematic diagrams illustrating changes in the proportion of voice information in this application;
Fig. 7 is a diagram illustrating adjustment of the sensitivity of decision conditions based on influencing factors in this application;
Figs. 8A and 8B are schematic diagrams illustrating determination of the voice information association degree in this application;
Fig. 9 is a diagram illustrating adjustment of the sensitivity of decision conditions based on influencing factors in this application;
Fig. 10 is a schematic flow chart of another voice information processing method provided in this application;
Fig. 11 is a schematic flow chart of voice information validity recognition provided in this application;
Fig. 12 is a schematic diagram of a logical structure of an apparatus according to an embodiment of this application;
Fig. 13 is a schematic diagram of a logical structure of another apparatus according to an embodiment of this application;
Fig. 14 is a schematic diagram of a hardware structure of an apparatus according to an embodiment of this application;
Fig. 15 is a schematic diagram of a hardware structure of another apparatus according to an embodiment of this application.
Detailed Description
For the sake of understanding, the technical terms related to the embodiments of the present application will be described first.
1. Automatic Speech Recognition (ASR) generally refers to technology that takes speech as its research object and, through speech signal processing and pattern recognition, lets a machine automatically recognize and understand human spoken speech and convert speech signals into corresponding text or commands.
The construction of a speech recognition system comprises two parts overall: training and recognition. Training is usually completed offline: signal processing and knowledge mining are performed on a large pre-collected corpus of speech and language data to obtain the acoustic model (a knowledge representation of variation in acoustics, phonetics, environment, speaker gender, accent, and so on) and the language model (a knowledge representation of word sequences) required by the speech recognition system. Recognition is usually performed online, automatically recognizing the user's real-time speech, and can generally be divided into two major modules, a front end and a back end: the front-end module mainly performs endpoint detection (removing superfluous silence and non-speech sounds), noise reduction, feature extraction, and so on; the back-end module uses the trained acoustic and language models to perform statistical pattern recognition (also called decoding) on the feature vectors of the user's speech to obtain the contained text information. The back end also includes an adaptive feedback module that self-learns from the user's speech, applying necessary corrections to the acoustic and language models to further improve recognition accuracy.
2. Voiceprint Recognition (VR)
Voiceprint recognition is one of the biometric identification technologies; also called speaker recognition, it distinguishes a speaker's identity by voice. Voiceprint recognition techniques fall into two categories: speaker identification and speaker verification. Different tasks and applications may use different techniques; for example, identification may be needed to narrow the field in a criminal investigation, while verification is needed for banking transactions.
3. Speech synthesis
Speech synthesis, also known as Text To Speech (TTS) technology, converts text generated by a computer or input from outside into intelligible, fluent spoken output; it is equivalent to fitting a machine with an artificial mouth so that it can speak like a human.
4. Task-based dialog system
Task-based dialog can be understood as a sequential decision-making process: the machine must update and maintain its internal dialog state by understanding the user's utterances, and then select the next optimal action according to the current dialog state (such as confirming the requirement, asking about constraints, or providing the result), thereby completing the task.
The task-based dialog systems currently used in industry adopt a modular structure, generally comprising four key modules:
Natural Language Understanding (NLU): recognizes and parses the user's text input to obtain computer-understandable semantic labels such as slot values and intents.
Dialog State Tracking (DST): maintains the current dialog state from the dialog history; the state is a cumulative semantic representation of the entire history, typically slot-value pairs. A sketch of this bookkeeping is given after this list.
Dialog Policy (DP): outputs the next system action according to the current dialog state. The dialog state tracking module and the dialog policy module are together referred to as the Dialog Manager (DM) module.
Natural Language Generation (NLG): converts the system action into natural language output.
This modular structure is highly interpretable and easy to put into practice, and is adopted by most practical task-based dialog systems in industry.
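As a concrete illustration of the slot-value bookkeeping described above, here is a toy Python sketch of a dialog state tracker; the class name, slot names, and values are assumptions for illustration only.

    class DialogStateTracker:
        """Keeps a cumulative slot-value dialog state across turns."""

        def __init__(self):
            self.state = {}

        def update(self, nlu_slots):
            # Newly understood slot values overwrite or extend the old state,
            # so the state accumulates the semantics of the whole history.
            self.state.update(nlu_slots)
            return self.state

    tracker = DialogStateTracker()
    tracker.update({"intent": "navigate", "destination": "airport"})
    tracker.update({"route": "avoid_tolls"})
    # state is now {'intent': 'navigate', 'destination': 'airport', 'route': 'avoid_tolls'}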
5. Computer Vision (CV)
Computer vision, also called machine vision, is the science of how to make a machine "see"; its main task is to process captured pictures or video to obtain information about the corresponding scene.
6. Invalid rejection model
The invalid rejection model judges the validity of the user voice information acquired by the device. Validity indicates whether the voice information is a valid voice control instruction for the device that acquired it. The voice information may be, for example, text information converted from the voice signal received by the device.
During listening the device may receive a great deal of user voice information, but some of it is merely chat between users, which is invalid information for the device. Voice information through which the user genuinely interacts with the device is valid information, and that valid information is the user's voice control instruction.
In this application, the invalid rejection model may include a prejudgment module and a decision module for voice information validity. The prejudgment module comprises a rule matching module and an inference module and makes a preliminary judgment on the validity of voice information. Wherein:
The rule matching module matches the input voice information against preset rules, for example preset sentences: the input voice information is valid if some preset sentence matches it, and invalid if none does.
The inference module may be a deep-learning prediction model obtained through large-scale data training with a neural network, or a traditional machine-learning model (e.g., a supervised model such as a Support Vector Machine (SVM)). The device inputs the acquired voice information into the inference module to predict the probability that it is valid, or to output a validity result directly.
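The two prejudgment paths can be pictured with the following Python sketch; the regular-expression rules and the predict_proba scoring interface are illustrative assumptions (the application does not prescribe a concrete API).

    import re

    class RuleMatcher:
        """Prejudgment path 1: match the utterance against preset patterns."""

        def __init__(self, patterns):
            self.patterns = [re.compile(p) for p in patterns]

        def is_valid(self, text):
            # Valid if any preset sentence/pattern matches, invalid otherwise.
            return any(p.search(text) for p in self.patterns)

    class InferenceModule:
        """Prejudgment path 2: a trained model scores validity probability."""

        def __init__(self, model):
            self.model = model  # any scorer: neural network, SVM, ...

        def valid_probability(self, features):
            # Assumes an sklearn-style predict_proba; purely illustrative.
            return float(self.model.predict_proba([features])[0][1])

    matcher = RuleMatcher([r"^(open|close|play) ", r"navigate to"])
    matcher.is_valid("open the sunroof")   # True: matches a preset rule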
The decision module makes the final judgment on the processing result of at least one of the rule matching module and the inference module by applying a comprehensive decision condition, determining whether the voice information is valid; this greatly improves the accuracy of validity judgment. The comprehensive decision condition is described later and is not detailed here.
The invalid rejection model may also be called a validity determination model or the like; the description below uses the invalid rejection model as an example, and the name of the model used to judge the validity of voice information acquired by the device does not limit this application.
To better understand the voice information processing method provided by the embodiments of this application, a system architecture to which the method applies is described below by way of example.
Referring to Fig. 1, Fig. 1 illustrates a system architecture used by the voice information processing method provided in this application. The system architecture may include an audio manager 110, a video manager 120, a memory 130, and a processor 140, which may be connected by a bus 150.
The audio manager 110 may include a speaker and a microphone array. A speaker is a transducer device that converts an electrical signal into an acoustic signal for outputting the sound of a device. The microphone is an energy conversion device for converting a sound signal into an electric signal, and is used for collecting sound information such as human voice.
The video manager 120 may include an array of cameras. The camera is capable of converting optical image signals into electrical signals for storage or transmission.
The memory 130 is used to store computer programs and data. The memory 130 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a compact disc read-only memory (CD-ROM), or the like.
In the present application, the memory 130 may store computer programs or codes of models such as an automatic speech recognition model, a voiceprint recognition model, a computer vision model, an invalid recognition model, a natural language understanding model, a dialogue management model, and a speech synthesis model.
The processor 140 may be a central processing unit, general purpose processor, digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, transistor logic device, hardware component, or any combination thereof. A processor may also be a combination of computing functions, e.g., a combination of one or more microprocessors, a digital signal processor and a microprocessor, or the like. The processor 140 may be configured to read the computer program and data stored in the memory 130, and execute the voice information processing method provided by the embodiment of the present application.
The present application does not limit the type of the bus 150. For example, the bus 150 may be a desktop data bus (D-BUS); D-BUS is an inter-process communication (IPC) mechanism optimized for desktop environments, used for communication between processes or between a process and the kernel. Alternatively, the bus 150 may be a data bus (DB), an address bus (AB), a control bus (CB), and the like.
The system architecture shown in Fig. 1 may be, for example, that of a terminal device or of a device such as a server. The terminal device may include, but is not limited to, any smart-operating-system-based device that can interact with a user through input means such as a keyboard, virtual keyboard, touchpad, touch screen, or voice control, such as a smartphone, tablet, handheld computer, wearable electronic device, or vehicle-mounted device (e.g., a vehicle-mounted computer). The server may be an edge server or a cloud server, and may be a virtual server or a physical server, which is not limited in this application.
The system architecture shown in Fig. 1 is only an example and does not limit the system architectures applicable to the embodiments of this application.
A voice information processing method provided by the embodiments of this application is described below. The method may be applied to the system architecture shown in Fig. 1; that is, it may be executed by the terminal device or the server, or by a processing apparatus such as a chip or processor within them. In the following description the execution subject is collectively referred to as the device. Optionally, if the execution subject is a server, or a chip or processor in a server, the terminal device may first receive the voice information and then send it to the server for processing; the voice information sent to the server may be the original information received by the terminal device or voice information preprocessed by the terminal device.
Referring to Fig. 2, the voice information processing method provided in the embodiments of this application may include, but is not limited to, the following steps:
s201, acquiring first voice information.
In particular embodiments, the device may receive a voice signal of a user through a microphone. Then, the device can recognize the speech signal through an automatic speech recognition ASR model to obtain speech information corresponding to the speech signal, wherein the speech information may include text information and the like.
In particular, the voice interaction function between the device and the user may be woken up by receiving a wake-up signal of the user, for example, receiving a specific wake-up word of the user. After being woken up, the device can detect and receive the voice signal of the user through the microphone, and the process of detecting and receiving the voice signal of the user can be called a listening process of the device. To reduce the repetitive operations that must wake up the device each time a voice control command is issued, two main listening modes currently exist: continuous listening and full-time listening.
Wherein, the continuous listening mode refers to: after the device is awakened or the voice command is successfully operated, the device does not need to be awakened again within a period of time (such as 30s), and can listen and perform voice interaction with the user to execute the voice control command of the user during the period of time.
The full-time listening mode includes: the device is only required to be awakened once after being started until the device is closed, and the device can listen and perform voice interaction with the user to execute the voice control instruction of the user.
The first voice information may be voice information corresponding to any one of the voice signals received by the device in the listening stage.
S202: Adjust the decision condition based on influencing factors of the validity of the first voice information, where the decision condition is one or more decision conditions in an invalid rejection model used to judge the validity of the first voice information.
To facilitate understanding of the invalid rejection model, refer to Fig. 3, which shows its processing flow schematically. First, the invalid rejection model receives voice information, for example the first voice information, and selects a prejudgment module for judging its validity based on the voice information and preset selection conditions, that is, selects at least one of the inference module and the rule matching module to prejudge validity.
The selection condition may be set based on influencing factors of voice information validity. Illustratively, the selection condition may be: when the device's listening duration is greater than a first threshold, the rule matching module is selected to judge validity; when the listening duration is less than a second threshold, the inference module is selected; and when the listening duration lies between the second threshold and the first threshold, the rule matching module and the inference module may both be selected, as sketched below. Note that the influencing factors of voice information validity are not limited to the device's listening duration; they are described in detail later.
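A Python sketch of such a selection condition follows; the 60 s and 600 s threshold values and the flag-based return type are illustrative assumptions, not disclosed parameters.

    from enum import Flag, auto

    class Prejudge(Flag):
        RULES = auto()
        INFERENCE = auto()

    def select_modules(listening_seconds, t_low=60.0, t_high=600.0):
        # Example selection condition keyed on continuous listening duration.
        if listening_seconds > t_high:
            return Prejudge.RULES                   # long listening: rules only
        if listening_seconds < t_low:
            return Prejudge.INFERENCE               # short listening: inference only
        return Prejudge.RULES | Prejudge.INFERENCE  # in between: use both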
If only the inference module is selected to prejudge validity, the device inputs the acquired voice information into the inference module and computes an output. For example, the output may be the probability that the input voice information is valid; this probability is then compared with a preset judgment threshold to obtain the prejudgment result. Specifically, if the probability is greater than the threshold, the prejudgment result is that the input is valid; if it is less, the result is that the input is invalid. For example, with a judgment threshold of 70%, voice information is deemed valid whenever its valid probability exceeds 70%: a predicted probability of 80% means the information is valid, while a predicted probability of 50% means it is invalid.
Note that the inference module's output is not limited to a validity probability; other data forms are possible, for example a score, where exceeding the judgment threshold indicates that the voice information is valid.
If only the rule matching module is selected to prejudge validity, the device inputs the acquired voice information into the rule matching module, which compares it with the information in a preset rule base to obtain the prejudgment result: if information in the rule base matches the input, the prejudgment result is that the input is valid; otherwise, the result is that the input is invalid.
When only the inference module or only the rule matching module is selected, the obtained prejudgment result can be input into the decision module, which checks whether the result is reasonable against the comprehensive decision condition and outputs the final indication of validity. For example, suppose the comprehensive decision condition is that valid voice information contains no fewer than 3 characters. If the input voice information has fewer than 3 characters but the prejudgment result output by the inference or rule matching module is "valid", the prejudgment is unreasonable: the decision module determines that the voice information is invalid and outputs indication information saying so. Conversely, if the input has no fewer than 3 characters and the prejudgment result is "valid", the decision module finally determines that the voice information is valid and outputs indication information indicating validity.
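The reasonableness check performed by the decision module can be sketched in Python as follows, using the three-character rule from the example above; the function shape is an assumption for illustration.

    MIN_CHARS = 3  # example comprehensive decision condition from the text above

    def final_decision(text, prejudged_valid):
        # A "valid" prejudgment is overruled as unreasonable when the
        # utterance is shorter than the minimum character count.
        if prejudged_valid and len(text) < MIN_CHARS:
            return False
        return prejudged_valid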
Note that the comprehensive decision condition is not limited to the above example and may be another type of condition. In one possible embodiment, it may be a voting mechanism: if more votes say the voice information is valid, it is determined valid; if more votes say it is invalid, it is determined invalid.
Alternatively, in a possible implementation, when only the inference module or only the rule matching module is selected to prejudge validity, the comprehensive judgment may be skipped, and the module's output is taken directly as the final result of the invalid rejection model.
If the inference module and the rule matching module are both selected to prejudge validity, the obtained voice information is input into each of them; the two modules prejudge validity according to their own procedures (see the description above, not repeated here) and produce their respective prejudgment results. Both results are then input into the decision module, which makes the final judgment based on the comprehensive decision condition and outputs the final result of the invalid rejection model.
For example, the comprehensive decision condition may again be that valid voice information contains no fewer than 3 characters; the decision module then checks the reasonableness of the two prejudgment results against this condition, as described above.
As another example, in one possible implementation the comprehensive decision condition may be a voting mechanism. If both validity prejudgment results are "valid", the final result is valid; if both are "invalid", the final result is invalid. If one is "valid" and the other "invalid", a further judgment can be made, for example by priority: if the inference module has higher priority than the rule matching module, its prejudgment result is output as the final result; if the rule matching module has higher priority, its result is output instead.
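A Python sketch of this disagreement-resolution policy (agreement decides directly; otherwise the higher-priority module wins) follows; treating priority as a boolean flag is an illustrative simplification.

    def fuse(rule_valid, infer_valid, inference_has_priority=True):
        # Both prejudgments agree: that shared result is final.
        if rule_valid == infer_valid:
            return rule_valid
        # They disagree: the higher-priority module's result is output.
        return infer_valid if inference_has_priority else rule_valid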
Note that these comprehensive decision conditions are only examples; their main purpose is to judge the validity of acquired voice information accurately based on the prejudgment results of the inference module and/or the rule matching module.
Based on the above description of fig. 3, the decision condition in S202 may include one or more of: the selection condition in the invalid rejection model, the judgment threshold applied to the result output by the inference module, and the comprehensive judgment condition. In other words, to improve the accuracy of valid speech recognition in different scenarios and reduce the false triggering rate of invalid speech, the decision condition can be flexibly adjusted based on one or more influencing factors that affect the validity judgment of the input voice information in a given voice interaction scenario, so that validity recognition of the voice information is more flexible and better matches the current context and scenario.
In a possible implementation, adjusting the decision condition based on the influencing factors of the validity of the first voice information may proceed as follows: when analysis based on one or more validity influencing factors indicates that the first voice information is more likely to be valid than invalid, the sensitivity of the decision condition is raised; the higher the sensitivity of the decision condition, the more likely the first voice information is to be determined valid by that condition. Conversely, when the analysis indicates that the first voice information is more likely to be invalid than valid, the sensitivity of the decision condition is lowered; the lower the sensitivity, the less likely the first voice information is to be determined valid. For the sensitivity of the decision condition and the specific adjustment process, refer to the description below.
Optionally, the influencing factors that can affect validity recognition of the input voice information may include one or more of the following: (1) the environmental conditions of the device when the voice information is generated; (2) the duration for which the device has been listening continuously; (3) a first time interval between the time the device acquires the voice information and the time it last acquired valid voice information; (4) a second time interval between the time the device acquires the voice information and the time it last acquired invalid voice information; (5) the proportions of valid and invalid voice information within a first preset duration before the device acquires the voice information; (6) a first degree of association between the semantics of the voice information and the semantics of the valid voice information last acquired by the device; (7) a second degree of association between the semantics of the voice information and the semantics of the invalid voice information last acquired by the device; (8) a third degree of association between the first voice information and the valid voice information last acquired by the device; (9) the state of the voice dialogue between the device and the user up to the time the current voice information is acquired; (10) a first similarity between the acoustic features of the voice information and those of historical valid voice information; and (11) a second similarity between the acoustic features of the voice information and those of historical invalid voice information.
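Purely as an organizational sketch, the influencing factors above could be carried in a single structure that the adjustment logic consumes; the field names, types, and grouping below are assumptions for illustration, not part of the scheme.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class InfluencingFactors:
        """Signals that may drive adjustment of the decision condition."""
        num_speakers: int = 0                       # speakers heard in the second preset duration
        num_surrounding: int = 0                    # people within the preset range
        confidence: Optional[float] = None          # ASR confidence of the utterance
        snr_db: Optional[float] = None              # signal-to-noise ratio
        listening_duration_s: float = 0.0           # continuous listening time t1
        dt_since_valid_s: Optional[float] = None    # first time interval
        dt_since_invalid_s: Optional[float] = None  # second time interval
        valid_ratio: Optional[float] = None         # f1 within the first preset duration
        invalid_ratio: Optional[float] = None       # f2 within the first preset duration
        first_association: Optional[float] = None   # semantic link to last valid utterance
        second_association: Optional[float] = None  # semantic link to last invalid utterance
        third_association: Optional[float] = None   # content link to last valid utterance
        in_dialogue: bool = False                   # ongoing voice dialogue state
        sim_to_valid: Optional[float] = None        # first acoustic similarity
        sim_to_invalid: Optional[float] = None      # second acoustic similarity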
In a possible implementation, after acquiring the first voice information, the device may adjust the selection condition in the invalid rejection model based on a first factor, where the first factor may include one or more of the above influencing factors. The specific adjustment process is described later.
In a possible implementation, after acquiring the first voice information, the device may adjust the judgment threshold applied to the output of the inference module in the invalid rejection model based on a second factor, where the second factor may include one or more of the above influencing factors. The influencing factors included in the second factor may be entirely different from, partially the same as, or identical to those included in the first factor, depending on the actual situation; this is not limited by the present solution. The specific adjustment process is described later.
In a possible implementation, after acquiring the first voice information, the device may adjust the comprehensive judgment condition of the decision module in the invalid rejection model based on a third factor, where the third factor may include one or more of the above influencing factors. The influencing factors included in the third factor may be entirely different from, partially the same as, or identical to those included in the first factor and the second factor, depending on the actual situation; this is not limited by the present solution. The specific adjustment process is described later.
In a specific implementation, the selection condition, the judgment threshold, and the comprehensive judgment condition may all be adjusted together, or only one or two of them may be selected for adjustment, according to actual requirements; the scheme is not limited in this respect.
S203: when the first voice information is determined to be valid based on the adjusted decision condition, perform semantic understanding on the first voice information and execute the instruction it carries.
In a specific embodiment, after acquiring the first voice information, the device adjusts the decision condition in the invalid rejection model based on the influencing factors and then recognizes the validity of the first voice information with the adjusted invalid rejection model.
In a possible implementation, if the device has adjusted the selection condition in the invalid rejection model, it may select one or both of the rule matching module and the inference module, based on the adjusted selection condition, to pre-judge the validity of the first voice information.
In a possible implementation, if the device has adjusted the judgment threshold of the inference module, and the inference module is among the modules selected to pre-judge the validity of the first voice information, then after the inference module outputs data indicating the validity of the first voice information, the device may judge whether the first voice information is valid based on that data and the adjusted judgment threshold.
In a possible implementation, if the device has adjusted the comprehensive judgment condition of the decision module in the invalid rejection model, then after the pre-judgment results of the rule matching module and/or the inference module are obtained, a comprehensive judgment may be performed on them based on the adjusted comprehensive judgment condition to determine the validity of the first voice information.
For the specific process of validity recognition of the first voice information, refer to the description of fig. 3; it is not repeated here.
When the first voice information is valid, the device begins semantic understanding of it: specifically, the processor in the device may call a natural language understanding model in the memory to perform semantic understanding on the first voice information and obtain its specific meaning. After the device understands the meaning of the first voice information, it executes the corresponding operation based on that meaning so as to provide the user with the required service. For the device, the meaning of the first voice information amounts to a control instruction for executing the corresponding operation.
The following describes, for each of the different influencing factors of voice information validity, the process of adjusting the decision condition during validity recognition of the first voice information. It should be noted that the decision condition may include one or more of the selection condition, the judgment threshold, and the comprehensive judgment condition in the invalid rejection model described above, and the adjustment processes described below apply to any one or more of them.
Before describing the adjustment process, the related concepts involved are first introduced:
sensitivity of decision conditions: the sensitivity refers to the loose and strict degree of the judgment condition, the more strict the judgment condition is, the lower the sensitivity is, and the looser the judgment condition is, the higher the sensitivity is.
Illustratively, for the selection condition used to choose the pre-judgment model: in general, the inference module predicts the probability that voice information is valid, a form of fuzzy matching, while the rule matching module pre-judges by pattern matching, i.e., a yes-or-no test, which is comparatively stricter. Therefore, when choosing the pre-judgment model, if the voice information acquired by the device has a high probability of being valid, the inference module or the rule matching module may be selected for pre-judgment; alternatively, to improve the accuracy of valid recognition, the inference module may be selected. If the probability that the acquired voice information is valid is small, the rule matching module may be selected for pre-judgment in order to effectively avoid false triggering by invalid information.
For example, assume the selection condition is: if the device's listening duration is less than 10 seconds, select the inference module for pre-judgment; if it exceeds 20 seconds, select the rule matching module; if it is between 10 and 20 seconds, select both modules. If the goal is to filter invalid information better and reduce false triggering, the device may adjust the selection condition in the stricter direction, i.e., lower its sensitivity, for example to: less than 5 seconds, inference module; more than 10 seconds, rule matching module; between 5 and 10 seconds, both modules. Conversely, if the goal is to recognize valid voice information better, the device may adjust the selection condition in the looser direction, i.e., raise its sensitivity, for example to: less than 15 seconds, inference module; more than 25 seconds, rule matching module; between 15 and 25 seconds, both modules.
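As a minimal sketch (the boundary values and function name are assumptions for illustration), such a selection condition can be expressed as a pair of adjustable boundaries:

    def select_modules(listening_s, low=10.0, high=20.0):
        """Pick pre-judgment modules from the continuous listening time.

        Lowering (low, high) makes the condition stricter (lower
        sensitivity); raising them makes it looser (higher sensitivity).
        """
        if listening_s < low:
            return ("inference",)
        if listening_s > high:
            return ("rule_matching",)
        return ("inference", "rule_matching")

    print(select_modules(12))                   # both modules, default boundaries
    print(select_modules(12, low=5, high=10))   # stricter: rule matching only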
Illustratively, for the judgment threshold of the inference module, suppose the standard judgment threshold is 70%: the voice information is judged valid when the inference module predicts its probability of being valid to be greater than 70%. If the judgment threshold is raised to 80%, the decision condition is adjusted in the stricter direction; the predicted probability must now exceed 80% for the voice information to be judged valid, so the sensitivity of the decision condition is reduced. If the judgment threshold is lowered to 60%, the decision condition is adjusted in the looser direction; the voice information is judged valid as long as the predicted probability exceeds 60%, so the sensitivity of the decision condition is raised.
For example, for the comprehensive judgment condition above, assume the condition is: valid voice information contains no fewer than 3 characters. If the condition is adjusted to require no fewer than 5 characters, the requirement on the voice information is raised and stricter, so the sensitivity of the comprehensive judgment condition is reduced. If the condition is adjusted to require no fewer than 2 characters, the requirement is lowered and looser, so the sensitivity of the comprehensive judgment condition is raised.
Negative-correlation adjustment of sensitivity: when the value of the influencing factor increases, the sensitivity is adjusted downward, and the larger the increase, the lower the sensitivity; when the value decreases, the sensitivity is adjusted upward, and the larger the decrease, the higher the sensitivity.
Positive-correlation adjustment of sensitivity: when the value of the influencing factor increases, the sensitivity is adjusted upward, and the larger the increase, the higher the sensitivity; when the value decreases, the sensitivity is adjusted downward, and the larger the decrease, the lower the sensitivity.
It should be noted that the amount by which the sensitivity is adjusted up or down can be set according to the actual situation; this is not limited in the present application. In addition, the sensitivity of the decision condition is adjusted within a range: for example, for adjustment of the judgment threshold, the maximum is 100% and the minimum is 0. The adjustment range is determined by the actual situation and is not limited by the present solution.
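The two adjustment directions can be captured in a small helper. A minimal sketch follows, in which the step size, the clamping range, and the use of the judgment threshold as the adjusted quantity are assumptions made for illustration (recall that a higher threshold means lower sensitivity):

    def adjust_threshold(threshold, factor_delta, correlation, step=0.01):
        """Shift the inference module's judgment threshold by one factor.

        correlation "positive": factor up -> sensitivity up  -> threshold down.
        correlation "negative": factor up -> sensitivity down -> threshold up.
        factor_delta: change in the influencing factor, in its own units.
        """
        direction = -1.0 if correlation == "positive" else 1.0
        new_threshold = threshold + direction * step * factor_delta
        return min(1.0, max(0.0, new_threshold))  # stay within [0, 1]

    t = 0.70
    t = adjust_threshold(t, factor_delta=5, correlation="negative")  # stricter: 0.75
    t = adjust_threshold(t, factor_delta=3, correlation="positive")  # looser: 0.72
    print(round(t, 2))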
First, the adjustment of the decision condition is introduced based on the environmental conditions under which the first voice information is generated. Illustratively, these environmental conditions include one or more of: the number of people speaking within a second preset duration up to the moment the device acquires the first voice information (hereinafter, the number of speakers); the number of people within a preset range when the first voice information is generated (hereinafter, the number of surrounding people); the confidence of the first voice information; the signal-to-noise ratio of the first voice information; and so on. The number of speakers specifically refers to the number of different voiceprints included in the first voice information: since every person's voiceprint is different, the number of speakers can be represented by the number of voiceprints.
Referring to fig. 4, fig. 4 illustrates, taking the environmental influencing factors listed above as examples, how the decision condition is adjusted based on them.
While acquiring the first voice information, the device can obtain the number of surrounding people and the number of speakers. Specifically, the device can call the computer vision model in the memory to drive the camera to capture pictures or video of the surrounding environment and then analyze them to determine the number of surrounding people and the number of speakers; the number of speakers can be obtained by counting, within the second preset duration, the number of people in the video whose mouths are moving. The surrounding people include the speakers. The second preset duration may be, for example, 5 seconds, 10 seconds, or 1 minute, which is not limited in this application.
Alternatively, the device may call the voiceprint recognition model in the memory to identify the voiceprint features in the speech signal received within the second preset duration; the number of distinct voiceprint features identified is the number of speakers. Optionally, the voiceprint recognition model may be a dynamically monitored model so as to adapt flexibly to voiceprint recognition under different conditions.
After acquiring the number of surrounding people (denoted m) and the number of speakers (denoted n), the device first determines whether n is 0. If n is 0, the first voice information contains no human speech, and the device does not need to adjust the corresponding decision condition.
If the number of speakers n is not 0, the first voice information contains human speech, and the device further determines whether the number of surrounding people m is greater than 1; if m is not greater than 1, it determines whether m is 1.
If m is 1, there is only one person in the surroundings, and the first voice information uttered by that person is, with high probability, a voice control instruction issued to the device; the sensitivity of the decision condition can therefore be raised to better recognize the validity of the first voice information.
Alternatively, if m is 1, the currently acquired first voice information may be treated by default as a voice control instruction for the device, i.e., as valid information. In that case the sensitivity of the decision condition may be set to its highest value, or the invalid rejection model may skip further validity determination and directly output an indication that the first voice information is valid.
If m is not 1 (i.e., m is 0 while the number of speakers n is not), the detection may be erroneous, and the sensitivity of the decision condition cannot be adjusted from this information, so no adjustment is made.
When the number of speakers n is not 0 and the number of surrounding people m is greater than 1, the first voice information is most likely chat content and may be invalid for the device. The device may then lower the sensitivity of the decision condition based on the number of surrounding people: the larger m is, the lower the sensitivity is adjusted. The more people there are around, the higher the probability that the first voice information is chat speech, so a stricter decision condition is needed to recognize its validity, avoiding false triggering of service operations by invalid speech and the resulting waste of device resources.
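A minimal sketch of this branch of the fig. 4 flow is given below; the step size and the treatment of the error case are assumptions for illustration:

    def adjust_for_people(n_speakers, m_surrounding, sensitivity, step=0.1):
        """Adjust decision-condition sensitivity from the people counts."""
        if n_speakers == 0:
            return sensitivity                   # no human speech: leave unchanged
        if m_surrounding == 1:
            return min(1.0, sensitivity + step)  # lone user: likely a command
        if m_surrounding > 1:
            # Likely chat: the more people around, the lower the sensitivity.
            return max(0.0, sensitivity - step * (m_surrounding - 1))
        return sensitivity                       # m == 0 with speech: detection error

    print(adjust_for_people(n_speakers=2, m_surrounding=4, sensitivity=0.5))  # 0.2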
In addition, after acquiring the first voice information, the device may call an automatic speech recognition model in the memory to compute the confidence of the first voice information, compute its signal-to-noise ratio from the received audio signal, or compute both, and then adjust the sensitivity of the decision condition based on the confidence and/or the signal-to-noise ratio.
Specifically, the sensitivity of the decision condition may be adjusted in negative correlation with the confidence and/or the signal-to-noise ratio. The higher the confidence, the higher the probability that the first voice information was correctly recognized; and the higher the signal-to-noise ratio, the better the quality of the captured first voice information. In that case, even with a strict decision condition (low sensitivity), the validity of the first voice information can still be recognized well and invalid chat speech effectively filtered.
Conversely, the lower the confidence, the smaller the probability that the first voice information was correctly recognized; and the lower the signal-to-noise ratio, the worse the quality of the captured first voice information, so the recognized content may be wrong. In that case a looser decision condition (higher sensitivity) helps avoid mistakenly rejecting valid voice information.
For example, the device may set a confidence threshold and/or a signal-to-noise threshold for the voice information. If the confidence of the first voice information is greater than the confidence threshold and/or its signal-to-noise ratio is greater than the signal-to-noise threshold, then the higher the confidence and/or signal-to-noise ratio, the lower the sensitivity of the decision condition is adjusted. If the confidence is smaller than the confidence threshold and/or the signal-to-noise ratio is smaller than the signal-to-noise threshold, then the lower the confidence and/or signal-to-noise ratio, the higher the sensitivity is adjusted. The confidence threshold may be, for example, 50% or 60%, and the signal-to-noise threshold may be, for example, 50 dB or 60 dB; neither is limited in the present application.
For example, in one possible implementation, the device need not set a confidence threshold and/or a signal-to-noise threshold; instead, it may associate an adjustment of the decision condition with each confidence and/or signal-to-noise range. Taking the judgment threshold of the inference model as the decision condition, with an initial judgment threshold of 70%: in the confidence range 0 to 30%, the sensitivity may be raised by setting the judgment threshold to 50%; in the range 31% to 60%, the judgment threshold may be set to 60%; in the range 61% to 70%, the original 70% threshold may be kept unchanged; and in the range 71% to 100%, the sensitivity may be lowered by setting the judgment threshold to 80%.
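A minimal sketch of such a piecewise mapping follows; the ranges and thresholds are the ones from the example above, and everything else is an assumption for illustration:

    def threshold_from_confidence(confidence):
        """Map ASR confidence to the inference module's judgment threshold.

        Lower confidence -> lower threshold (higher sensitivity), per the
        negative-correlation rule illustrated in the text.
        """
        bands = [
            (0.30, 0.50),  # confidence <= 30%  -> threshold 50%
            (0.60, 0.60),  # 31% .. 60%         -> threshold 60%
            (0.70, 0.70),  # 61% .. 70%         -> keep 70%
            (1.00, 0.80),  # 71% .. 100%        -> threshold 80%
        ]
        for upper, threshold in bands:
            if confidence <= upper:
                return threshold
        return 0.70

    print(threshold_from_confidence(0.25))  # 0.5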
It should be noted that, for the above influencing factors (the number of speakers n, the number of surrounding people m, the confidence, and the signal-to-noise ratio), the device can adjust the sensitivity of the decision condition based on any one of them individually. Alternatively, the device may adjust the sensitivity based on any several of them jointly. For example, a weight may be configured for each of the several influencing factors and the sensitivity adjusted in a weighted manner. For the adjustment of the judgment threshold, if the three factors m, confidence, and signal-to-noise ratio are combined, with weights w1, w2, and w3 and individually computed adjusted thresholds a1, a2, and a3, the jointly determined adjusted threshold is a1*w1 + a2*w2 + a3*w3. This weighted combination is only an example; in an actual implementation, the largest or smallest adjustment among the several influencing factors may instead be taken as the final result, and so on.
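A minimal sketch of the weighted combination (the weight values and the alternative "extreme" strategies are assumptions for illustration):

    def combine_thresholds(adjusted, weights=None, strategy="weighted"):
        """Merge per-factor adjusted judgment thresholds into one value."""
        if strategy == "weighted":
            weights = weights or [1.0 / len(adjusted)] * len(adjusted)
            return sum(a * w for a, w in zip(adjusted, weights))
        if strategy == "strictest":
            return max(adjusted)      # largest threshold, lowest sensitivity
        return min(adjusted)          # smallest threshold, highest sensitivity

    # m (surrounding people), confidence, and SNR give a1, a2, a3 with
    # weights w1, w2, w3: the result is a1*w1 + a2*w2 + a3*w3.
    print(combine_thresholds([0.80, 0.60, 0.70], [0.5, 0.3, 0.2]))  # 0.72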
Referring to fig. 5, fig. 5 illustrates adjusting the sensitivity of the decision condition based on three influencing factors: the duration of continuous listening up to the moment the device acquires the first voice information (hereinafter t1), the first time interval between acquiring the first voice information and last acquiring valid voice information (hereinafter Δt1), and the second time interval between acquiring the first voice information and last acquiring invalid voice information (hereinafter Δt2).
Specifically, after the device acquires the first voice information, it may obtain the continuous listening duration t1 up to that acquisition, the first time interval Δt1 since valid voice information was last acquired, and the second time interval Δt2 since invalid voice information was last acquired. Illustratively, t1, Δt1, and Δt2 may be timed and computed by a timer.
After obtaining t1, the device may adjust the sensitivity of the decision condition in negative correlation with t1: the longer the continuous listening duration t1, the lower the sensitivity is adjusted. This is because the device enters a new round of continuous listening when it wakes up, and the user's voice information acquired early in that phase is more likely to be valid, so the sensitivity is kept high; as time passes, the acquired voice information is increasingly likely to be conversation between users, and the sensitivity must be reduced to limit false triggering. The device therefore adjusts the sensitivity of the decision condition in negative correlation with the continuous listening duration.
To aid understanding of the negative-correlation adjustment with t1, consider an example. Suppose the decision condition is the judgment threshold applied to the inference module's output. At the beginning of continuous listening, the judgment threshold may be 60%, a relatively loose condition with relatively high sensitivity. As t1 grows, the judgment threshold is raised by a preset increment, for example by 1% for every unit interval (say 5 seconds) of t1; that is, as t1 increases, the threshold becomes larger and stricter, and the sensitivity gradually falls. This is only an example, and the present application does not limit the specific negative-correlation adjustment.
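A minimal sketch of this incremental rule; the initial value, step, and unit interval are the example's numbers, and the cap at 100% reflects the adjustment range noted earlier:

    def threshold_from_listening_time(t1_s, base=0.60, step=0.01, unit_s=5.0):
        """Raise the judgment threshold as continuous listening time grows.

        +1 percentage point per 5 s of listening, capped at 100%.
        """
        increments = int(t1_s // unit_s)
        return min(1.0, base + step * increments)

    print(threshold_from_listening_time(0))    # 0.60 at the start
    print(threshold_from_listening_time(47))   # 0.69 after 47 s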
After obtaining the first time interval Δt1, the device can determine whether Δt1 is greater than a first time-interval threshold T1. If Δt1 is greater than T1, the sensitivity of the decision condition is not adjusted: in that case Δt1 can be considered to overlap the listening duration t1, whose effect on the sensitivity has already been applied, so no further adjustment based on Δt1 is necessary.
If Δt1 is less than T1, the device adjusts the sensitivity of the decision condition in negative correlation with Δt1: within the window T1, the longer it has been since the device last acquired valid voice information, the greater the probability that the newly acquired voice information is invalid speech such as chat, so the device lowers the sensitivity accordingly to reduce false triggering.
After obtaining the second time interval Δt2, the device can determine whether Δt2 is greater than a second time-interval threshold T2. If Δt2 is greater than T2, the sensitivity of the decision condition is not adjusted: in that case Δt2 can be considered to overlap the listening duration t1, whose effect on the sensitivity has already been applied, so no further adjustment based on Δt2 is necessary.
If Δt2 is less than T2, the device adjusts the sensitivity of the decision condition in negative correlation with Δt2: within the window T2 after invalid voice information was acquired, the longer the interval, the higher the probability that the newly acquired voice information is likewise invalid speech such as chat, so the device lowers the sensitivity accordingly to reduce false triggering.
In addition, for the first time interval Δt1 and second time interval Δt2 obtained above, the device may compare whether Δt1 is smaller than Δt2 and, if so, raise the sensitivity of the decision condition. When the voice information acquired immediately before the first voice information was valid, there is a high probability that the first voice information supplements or modifies it, i.e., a high probability that the first voice information is itself valid; to recognize this validity better, the device may adjust the decision condition in the looser direction, i.e., raise the sensitivity.
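A minimal sketch of the Δt1/Δt2 branch of the fig. 5 flow (the window sizes, step, and linear scaling are assumptions for illustration):

    def adjust_for_intervals(sensitivity, dt1, dt2, T1=30.0, T2=30.0, step=0.05):
        """Adjust sensitivity from the intervals since the last valid (dt1)
        and last invalid (dt2) voice information, both in seconds."""
        if dt1 < T1:
            sensitivity -= step * (dt1 / T1)   # negative correlation with dt1
        if dt2 < T2:
            sensitivity -= step * (dt2 / T2)   # negative correlation with dt2
        if dt1 < dt2:
            sensitivity += step                # previous utterance was valid: loosen
        return min(1.0, max(0.0, sensitivity))

    print(adjust_for_intervals(0.5, dt1=10, dt2=40))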
The adjustment flow shown in fig. 5 is one implementation example of the present application: the sensitivity of the decision condition is adjusted dynamically, in real time, according to the continuous listening duration and the time intervals since valid and invalid voice information. As a result, at different points in the listening period, voice information acquired by the device, even voice information with identical content, faces different thresholds for being judged valid, so valid speech is recognized better, false triggering by invalid speech is reduced, and the user's voice interaction experience is improved.
It should be noted that, for the influencing factors shown in fig. 5, the device may adjust the sensitivity of the decision condition based on any one of them individually, or based on any several of them jointly.
Referring to fig. 6A and 6B, these figures illustrate adjusting the sensitivity of the decision condition based on the proportions of valid and invalid voice information within the first preset duration before the device acquires the first voice information.
For example, the first preset duration may be the duration of continuous listening before the device acquires the first voice information, or any pre-configured duration before the first voice information is acquired; the present application is not limited in this respect.
The proportion of valid voice information within the first preset duration is the ratio of the valid voice information acquired by the device to all voice information it acquired within that duration. Alternatively, the proportion may be the reciprocal of the number of invalid voice information items acquired between the last reception of a valid voice control instruction and the acquisition of the first voice information; if no invalid voice information was acquired in that period, the proportion of valid voice information is 1.
The proportion of invalid voice information within the first preset duration is the ratio of the invalid voice information acquired by the device to all voice information it acquired within that duration. Alternatively, the proportion may be the reciprocal of the number of valid voice information items acquired between the last acquisition of invalid voice information and the acquisition of the first voice information; if no valid voice information was acquired in that period, the proportion of invalid voice information is 1.
In a specific embodiment, after acquiring the first voice information, the device obtains the proportion of valid voice information (f1 for short) and the proportion of invalid voice information (f2 for short) within the first preset duration and may compare f1 and f2 (see fig. 6A). If f1 is greater than f2, more valid voice information was acquired within the first preset duration and the user is interacting frequently with the device by voice; the sensitivity of the decision condition may then be adjusted in positive correlation with the parameter (f1 - f2). That is, the larger the proportion of valid voice information, the higher the probability that the first voice information is valid, and the higher the sensitivity is adjusted, so that the validity of acquired voice information can be recognized better and the chance of missing valid voice information is reduced.
In one possible implementation, the device may adjust the sensitivity of the decision condition based on f1 and f2 separately: for example, the larger f1 is, the higher the sensitivity is adjusted, while the larger f2 is, the lower the sensitivity is adjusted, and so on.
In fig. 6A, if f1 is not greater than f2, the device can adjust the sensitivity of the decision condition according to the rates of change of f1 and f2.
For example, construct a coordinate system with the number of voice information acquisitions on the horizontal axis (or the continuous listening time) and f1 on the vertical axis. In this coordinate system, the slope of the line connecting the value of f1 at the most recent acquisition of voice information and its value at the acquisition before that is the rate of change of f1. For ease of understanding, see fig. 6C, which assumes that voice information had been received 6 times before the first voice information was acquired and shows the proportion of valid voice information at each acquisition and validity judgment. In fig. 6C, the rate of change of f1 obtained by the device after acquiring the first voice information is k = -10%.
Similarly, construct a coordinate system with the number of voice information acquisitions on the horizontal axis (or the continuous listening time) and f2 on the vertical axis. In this coordinate system, the slope of the line connecting the value of f2 at the most recent acquisition of voice information and its value at the acquisition before that is the rate of change of f2. For ease of understanding, see fig. 6D, which assumes that voice information had been received 6 times before the first voice information was acquired and shows the proportion of invalid voice information at each acquisition and validity judgment. In fig. 6D, the rate of change of f2 obtained by the device after acquiring the first voice information is k = 10%.
Based on the above, when f1 is not greater than f2, indicating that voice interaction between the user and the device has decreased, the device may, to reduce false triggering of invalid speech, adjust the sensitivity of the decision condition in positive correlation with the rate of change of f1. That is, the larger the rate of change of f1, the higher the probability that the first voice information is valid, the higher the sensitivity is adjusted, and the looser the decision condition; the smaller the rate of change of f1, the lower that probability, the lower the sensitivity is adjusted, and the stricter the decision condition. For example, referring to fig. 6C, several rates of change of f1 are given there by way of example: -50%, 16.6%, 8.3%, -15%, and -10%; in ascending order: -50% < -15% < -10% < 8.3% < 16.6%. Assuming the adjusted decision condition is the judgment threshold applied to the inference module's output, with a pre-adjustment threshold of 70%, the five rates of change of f1, sorted from smallest to largest, correspond to adjusted judgment thresholds of 85%, 80%, 78%, 68%, and 65%. Note that the lower the judgment threshold, the higher the sensitivity: raising the sensitivity lowers the threshold, and lowering the sensitivity raises it.
Likewise, when f1 is not greater than f2, the device can adjust the sensitivity of the decision condition in negative correlation with the rate of change of f2. The smaller the rate of change of f2, the higher the probability that the first voice information is valid, and the looser the decision condition; the larger the rate of change of f2, the smaller the proportion of valid voice information, i.e., the lower the probability that the first voice information is valid, so the sensitivity is adjusted lower and the decision condition made stricter. For example, referring to fig. 6D, several rates of change of f2 are given there by way of example: 50%, -16.6%, -8.3%, 15%, and 10%; in ascending order: -16.6% < -8.3% < 10% < 15% < 50%. Assuming the adjusted decision condition is the judgment threshold applied to the inference module's output, with a pre-adjustment threshold of 70%, the five rates of change of f2, sorted from smallest to largest, correspond to adjusted judgment thresholds of 65%, 68%, 78%, 80%, and 85%.
Alternatively, after acquiring the first voice information and obtaining the proportion of valid voice information (f1) and the proportion of invalid voice information (f2) within the first preset duration, the device need not compare f1 and f2; it may directly adjust the sensitivity of the decision condition in positive correlation with the parameter (f1 - f2), in positive correlation with the rate of change of f1, and/or in negative correlation with the rate of change of f2 (see fig. 6B). For the specific adjustment process, refer to the description of fig. 6A above; it is not repeated here.
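A minimal sketch of the proportion-based adjustment; the history representation, gain, and the linear mapping from rate of change to threshold shift are assumptions for illustration:

    def rate_of_change(history):
        """Slope between the last two recorded proportions (fig. 6C/6D style)."""
        if len(history) < 2:
            return 0.0
        return history[-1] - history[-2]

    def adjust_for_ratios(threshold, f1_history, f2_history, gain=0.3):
        """Shift the judgment threshold from the valid/invalid proportions."""
        f1, f2 = f1_history[-1], f2_history[-1]
        if f1 > f2:
            # Positive correlation with (f1 - f2): more valid speech
            # lately -> higher sensitivity -> lower threshold.
            threshold -= gain * (f1 - f2)
        else:
            # Positive correlation with f1's rate of change, negative
            # correlation with f2's rate of change.
            threshold -= gain * rate_of_change(f1_history)
            threshold += gain * rate_of_change(f2_history)
        return min(1.0, max(0.0, threshold))

    print(adjust_for_ratios(0.70, [0.5, 0.4], [0.5, 0.6]))  # stricter: 0.76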
It should be noted that, for the influencing factors shown in fig. 6A or fig. 6B, the device may adjust the sensitivity of the decision condition based on any one of them individually, or based on any several of them jointly.
Referring to fig. 7, fig. 7 illustrates adjusting the sensitivity of the decision condition based on four influencing factors: the first degree of association between the first voice information and the semantics of the valid voice information last acquired by the device; the second degree of association between the first voice information and the semantics of the invalid voice information last acquired by the device; the third degree of association between the first voice information and the valid voice information last acquired by the device; and the state of the voice dialogue between the device and the user up to the acquisition of the first voice information.
In a specific embodiment, after acquiring the first voice information, the device may obtain the valid voice information it acquired last (the most recent historical valid voice information for short) and analyze the degree of association between the two (the first degree of association for short) based on the parsed semantics of the first voice information and of that historical valid voice information. Specifically, semantic understanding of the first voice information may be performed by calling a natural language understanding model in the memory.
If the semantics of the two pieces of voice information are unrelated, i.e., the first degree of association is zero, the sensitivity of the decision condition is not adjusted. If their semantics are related (for example, the semantics are the same; there is an inheritance relationship, e.g., the most recent historical valid voice information means "turn on the air conditioner" and the first voice information means "raise the temperature a little"; a progressive relationship, e.g., "raise the temperature a little" followed by "raise it a little more"; or an opposing relationship, e.g., "turn on the air conditioner" followed by "turn it off"), the device may calculate a specific first degree of association and then adjust the sensitivity of the decision condition in positive correlation with it.
Illustratively, if the first degree of association is greater than a certain threshold, the probability that the first voice information is valid is relatively high, and the larger the first degree of association, the higher the sensitivity is adjusted; conversely, if it is smaller than the threshold, that probability is small, and the smaller the first degree of association, the lower the sensitivity is adjusted.
For example, in one possible implementation, the device need not set a threshold for the first degree of association; instead, it may associate an adjustment of the decision condition with each range of the first degree of association. Taking the judgment threshold of the inference model as the decision condition, with an initial judgment threshold of 70%: in the range 0 to 30% of the first degree of association, the sensitivity may be lowered by setting the judgment threshold to 80%; in the range 31% to 60%, the judgment threshold may be set to 75%; in the range 61% to 70%, the original 70% may be kept unchanged; and in the range 71% to 100%, the sensitivity may be raised by setting the judgment threshold to 60%.
In a possible implementation, when the first degree of association is determined to be 100%, the sensitivity may be set to the highest value, or the invalid rejection model may skip further validity determination and directly output an indication that the first voice information is valid.
In a specific embodiment, after acquiring the first voice information, the device may obtain the invalid voice information it acquired last (the most recent historical invalid voice information for short) and analyze the degree of association between the two (the second degree of association for short) based on their parsed semantics. If the semantics are unrelated, i.e., the second degree of association is zero, the sensitivity of the decision condition is not adjusted. If their semantics are related (for example, the semantics are the same; there is an inheritance relationship, e.g., the most recent historical invalid voice information means "we can go to Shenzhen on Sunday" and the first voice information means "Saturday works too"; a progressive relationship, e.g., "setting out at six in the morning is very early" followed by "I can set out even earlier"; or an opposing relationship, e.g., "let's go to Shenzhen" followed by "I'm not going"), the device may calculate a specific second degree of association and then adjust the sensitivity of the decision condition in negative correlation with it.
Illustratively, if the second degree of association is greater than a certain threshold, the probability that the first voice information is invalid is relatively high, and the larger the second degree of association, the lower the sensitivity is adjusted; conversely, if it is smaller than the threshold, that probability is small, and the smaller the second degree of association, the higher the sensitivity is adjusted.
For example, in a possible implementation, the device need not set a threshold for the second degree of association; instead, it may associate an adjustment of the decision condition with each range of the second degree of association. Taking the judgment threshold of the inference model as the decision condition, with an initial judgment threshold of 70%: in the range 0 to 30% of the second degree of association, the sensitivity may be raised by setting the judgment threshold to 60%; in the range 31% to 60%, the judgment threshold may be set to 65%; in the range 61% to 70%, the original 70% may be kept unchanged; and in the range 71% to 100%, the sensitivity may be lowered by setting the judgment threshold to 80%.
In a possible implementation, when the second degree of association is determined to be 100%, the sensitivity may be adjusted to the lowest value, or the invalid rejection model may skip further validity determination and directly output an indication that the first voice information is invalid.
In a specific embodiment, the device may adjust the sensitivity of the decision condition based on the first degree of association between the first voice information and the semantics of the valid voice information last acquired by the device, or based on the third degree of association between the first voice information and that valid voice information itself. The third degree of association refers to the degree of association between the contents of the first voice information and the valid voice information last acquired by the device, whereas the first degree of association refers to the degree of association between their semantics. For ease of understanding, see fig. 8A and 8B.
Referring first to fig. 8A, assume "help me play music" is the valid voice information last acquired by the device and "I usually like listening to singer A's songs" is the first voice information. To obtain the first degree of association between the two, their semantic information is first extracted through the natural language understanding model, and both pieces of semantic information are then input into a semantic association inference model for processing; the model outputs the first degree of association between them. The semantic association inference model is, for example, a pre-trained neural network model or machine learning model.
Referring to fig. 8B, similarly assume "help me play music" is the valid voice information last acquired by the device and "I usually like listening to singer A's songs" is the first voice information. To obtain the third degree of association between the two, they may be structurally analyzed through the natural language understanding model. Specifically, structural analysis of "help me play music" yields: the domain the voice information describes is music, and the intention is to play music. Structural analysis of "I usually like listening to singer A's songs" yields: the domain is music, and the singer is singer A. After the structured information of the two pieces of voice information is obtained, both are input into a correlation judgment model for processing, which outputs the third degree of association between them. The correlation judgment model may be, for example, a dialogue state tracking (DST) model.
The first degree of association output in fig. 8A for "help me play music" and "I usually like listening to singer A's songs" may be zero, i.e., their semantics are unrelated; whereas the third degree of association output in fig. 8B for the same two pieces of voice information may be 100%, i.e., they are associated.
In a possible implementation, the third degree of association obtained in the manner of fig. 8B between the first voice information and the valid voice information last acquired by the device may be simply 0 or 100%: if the correlation judgment model outputs an indication of no correlation, the third degree of association is 0; if it outputs an indication of correlation, the third degree of association is 100%.
In another possible implementation, the third degree of association obtained in the manner of fig. 8B may instead be a specific percentage (e.g., 60% or 90%) or a similarity score, in which case whether the two are associated can be determined by comparison with a preset threshold.
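A minimal sketch of the structured comparison in fig. 8B; the toy keyword parser and the rule that a shared domain counts as fully associated are assumptions for illustration, standing in for a real correlation judgment model such as a DST model:

    def parse_structure(utterance):
        """Toy stand-in for NLU structured analysis: returns the domain."""
        if "music" in utterance or "song" in utterance:
            return {"domain": "music"}
        return {"domain": "unknown"}

    def third_association(first_utt, last_valid_utt):
        """Content-level association: 100% if the domains match, else 0."""
        a = parse_structure(first_utt)
        b = parse_structure(last_valid_utt)
        return 1.0 if a["domain"] == b["domain"] != "unknown" else 0.0

    print(third_association("I usually like listening to singer A's songs",
                            "help me play music"))  # 1.0: same domain (music)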
After obtaining the third degree of association between the first voice information and the valid voice information last acquired by the device, the device may adjust the sensitivity of the decision condition in positive correlation with it; the specific positive-correlation adjustment may follow the adjustment based on the first degree of association described above and is not repeated here. In addition, when the third degree of association is zero, i.e., the first voice information is unrelated to the valid voice information last acquired by the device, the sensitivity of the decision condition is not adjusted.
In a specific embodiment, after acquiring the first voice information, the device may obtain the state of the voice dialogue between the device and the user up to that acquisition; the state may be, for example, the device selecting, querying, judging, or chatting on the basis of the user's voice control instructions. In particular, the device may learn this state through dialogue state tracking (DST) techniques. If such a dialogue state exists, the user has been conducting an extended interactive dialogue with the device, and the device may raise the sensitivity of the decision condition according to this continuing dialogue state. If no such state exists, the user has not been in an extended interactive dialogue with the device, and the device does not adjust the sensitivity based on this factor.
It should be noted that, for the influencing factors shown in fig. 7, the device may adjust the sensitivity of the decision condition based on any one of them individually, or based on any several of them jointly.
Referring to fig. 9, fig. 9 illustrates adjusting the sensitivity of the decision condition based on two influencing factors: the first similarity between the acoustic features of the first voice information and those of historical valid voice information, and the second similarity between the acoustic features of the first voice information and those of historical invalid voice information. Illustratively, the acoustic features include characteristics such as intonation and/or speaking rate.
In a specific embodiment, after acquiring the first voice information, the device extracts its acoustic features by calling an acoustic model stored in the memory and compares them with the acoustic features of historical valid voice information (one or more items), thereby obtaining the similarity between them (the first similarity for short). If this similarity is zero, the device may leave the sensitivity of the decision condition unadjusted with respect to the first similarity. If the similarity to one or more items of historical valid voice information is not zero, the sensitivity may be adjusted in positive correlation with it: the greater the similarity (which may be, for example, the maximum of the obtained similarities, or their average), the higher the sensitivity is adjusted.
In a possible implementation, in a case where the similarity between the acoustic features of the first voice information and those of the one or more pieces of historical valid voice information is greater than a threshold (the threshold may be any value between 60% and 100%), indicating that the acoustic features of the first voice information are similar to those of the historical valid voice information, the device may adjust the sensitivity of the decision condition to a preset value. For example, taking the judgment threshold above, assume the original judgment threshold is 70%; as long as the similarity exceeds the threshold, the judgment threshold is adjusted to 60%.
In a specific embodiment, after acquiring the first voice information, the device extracts the acoustic features of the first voice information by calling an acoustic model stored in a memory, and then compares the extracted acoustic features with the acoustic features of historical invalid voice information (which may be one or more pieces of historical invalid voice information), so as to obtain the similarity (referred to as the second similarity) between the acoustic features of the first voice information and those of the historical invalid voice information. If this similarity is zero, the device may not adjust the sensitivity of the decision condition according to the second similarity. If the similarity between the acoustic features of the first voice information and those of one or more pieces of historical invalid voice information is not zero, the sensitivity of the decision condition may be adjusted in negative correlation, that is, the greater the similarity, the lower the sensitivity is adjusted (the similarity used here may be, for example, the maximum of the obtained similarities, or the average of the obtained similarities, etc.).
In a possible implementation, in a case where the similarity between the acoustic features of the first voice information and those of the one or more pieces of historical invalid voice information is greater than a threshold (the threshold may be any value between 60% and 100%), indicating that the acoustic features of the first voice information are similar to those of the historical invalid voice information, the device may adjust the sensitivity of the decision condition to a preset value. For example, taking the judgment threshold above, assume the original judgment threshold is 70%; as long as the similarity exceeds the threshold, the judgment threshold is adjusted to 75%.
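The two similarity rules above can be pictured with a short, hedged sketch built around the example numbers just given (original threshold 70%, lowered to 60% on resemblance to valid history, raised to 75% on resemblance to invalid history). The aggregation choice, the 0.6 trigger, and the function names are illustrative assumptions, not the application's implementation.

```python
# Hedged sketch of the similarity rules above: resemblance to historical valid
# speech lowers the judgment threshold (to 60%), resemblance to historical
# invalid speech raises it (to 75%). Aggregation mode and trigger are assumed.
def aggregate(similarities: list[float], mode: str = "max") -> float:
    """Aggregate per-utterance similarities, e.g. by maximum or by average."""
    if not similarities:
        return 0.0
    return max(similarities) if mode == "max" else sum(similarities) / len(similarities)

def adjust_threshold(threshold: float,
                     sims_valid: list[float],
                     sims_invalid: list[float],
                     trigger: float = 0.6) -> float:
    """Return the adjusted judgment threshold (a lower threshold means a
    higher sensitivity of the decision condition)."""
    if aggregate(sims_valid) > trigger:    # sounds like earlier valid commands
        threshold = 0.60                   # e.g. 70% -> 60%, as in the text
    if aggregate(sims_invalid) > trigger:  # sounds like earlier rejected chat
        threshold = 0.75                   # e.g. 70% -> 75%, as in the text
    return threshold

print(adjust_threshold(0.70, sims_valid=[0.30, 0.82], sims_invalid=[0.10]))  # 0.6
```

Note that when an utterance resembles both valid and invalid history, this sketch lets the invalid-history rule win; the application itself leaves the combination strategy open.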
It should be noted that, for the influencing factors shown in fig. 9, the device may adjust the sensitivity of the decision condition based on either of them individually, or based on both in combination.
In one possible implementation, the device may receive an instruction input by a user and adaptively adjust the sensitivity of the decision condition based on the instruction. Illustratively, the instruction may be a specific decision-condition sensitivity specified by the user, or an instruction to turn off or cancel voice information validity recognition, and the like. In this way, the embodiment of the application can adaptively adjust the sensitivity of the decision condition according to the user's preference, better meeting user requirements and improving user experience.
In a possible implementation, the decision condition may be adjusted by another device or apparatus (for example, a server corresponding to the device) based on the one or more influencing factors and then sent to the device; after receiving the adjusted decision condition, the device may directly judge the validity of the first voice information based on it.
Referring to fig. 10, fig. 10 shows a method for processing voice information provided by the present application, which includes, but is not limited to, the following steps:
S1001, first voice information is obtained.
The specific implementation of this step can be referred to the description in step S201 in fig. 2, and is not described herein again.
S1002, in a case where the first voice information is determined to be a valid voice control instruction based on a judgment condition, the operation indicated by the first voice information is executed, where the judgment condition is adjusted based on the environmental condition in which the first voice information is generated.
In a specific embodiment, after the device acquires the first voice information, the decision condition for judging whether the first voice information is a valid voice instruction may be adaptively adjusted based on the environmental condition in which the first voice information is generated. For the specific implementation of adjusting the decision condition based on this environmental condition, refer to the corresponding description in fig. 4; details are not repeated here.
After the adjustment is completed, the device judges whether the first voice information is valid by using the adjusted judgment condition. If the first voice information is valid, the device starts to semantically understand it; specifically, the processor in the device may call a natural language understanding model in the memory to perform semantic understanding on the first voice information and obtain its specific meaning. After the device understands the meaning of the first voice information, it executes the corresponding operation based on that meaning so as to provide the required service for the user. For the device, the meaning of the first voice information is the control instruction for executing the corresponding operation.
In a possible case, the device may receive a sensitivity of the decision condition specified by user input, and then adaptively adjust the decision condition for judging whether the first voice information is a valid voice instruction based on that sensitivity, so that the user-specified sensitivity is reached when the adjusted decision condition is used to judge the validity of voice information. After adjusting the judgment condition based on the user-specified sensitivity, the device judges whether the first voice information is valid by using the adjusted judgment condition. If the first voice information is valid, the device starts to semantically understand it, obtains its meaning, and executes the corresponding operation based on that meaning to provide the required service for the user. For the device, the meaning of the first voice information is the control instruction for executing the corresponding operation.
In a possible implementation, for the specific implementation of executing the operation indicated by the first voice information when the first voice information is determined to be a valid voice control instruction based on the decision condition, refer to the description of step S203 in fig. 2; details are not repeated here.
Optionally, the environmental condition in which the first voice information is generated includes one or more of the following: the number of speakers within a second preset duration up to when the device acquires the first voice information, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
The more speakers there are within a period of time, and/or the more people there are nearby when the voice information is generated, the greater the probability that the voice information received by the device is chat between people, that is, invalid voice. In addition, the higher the confidence and/or signal-to-noise ratio of the voice information, the greater the probability that the device can correctly recognize what was said, which also affects validity recognition. Therefore, adaptively adjusting the judgment condition based on one or more of these items allows the validity of the voice information to be judged better, improving the accuracy of valid recognition and reducing the false triggering rate of invalid voice.
In a specific embodiment, in a case where the environmental condition indicates that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is increased; in a case where the environmental condition indicates that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the decision condition is lowered. For specific implementation, reference may be made to the corresponding description in fig. 4, which is not repeated here.
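A minimal sketch of this rule follows, assuming an invented heuristic for turning the environmental condition into a valid-versus-invalid probability estimate; the scoring constants and function names are not from the application.

```python
# Assumption-laden sketch of the environment rule above: when the environment
# suggests a valid command is more probable than chat, raise sensitivity;
# otherwise lower it. The scoring heuristic is invented for illustration.
def p_valid_from_env(num_speakers: int, num_people_nearby: int,
                     confidence: float, snr_db: float) -> float:
    score = confidence
    score += 0.1 if snr_db > 10 else -0.1      # clearer signal: more command-like
    score -= 0.1 * max(0, num_speakers - 1)    # more speakers: more chat-like
    score -= 0.05 * max(0, num_people_nearby - 1)
    return min(max(score, 0.0), 1.0)

def adjust_for_environment(sensitivity: float, p_valid: float,
                           step: float = 0.05) -> float:
    if p_valid > 0.5:   # valid judged more probable than invalid
        return min(1.0, sensitivity + step)
    if p_valid < 0.5:   # invalid judged more probable
        return max(0.0, sensitivity - step)
    return sensitivity

p = p_valid_from_env(num_speakers=1, num_people_nearby=1,
                     confidence=0.8, snr_db=15.0)
print(adjust_for_environment(0.5, p))  # raised, since p > 0.5
```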
The environmental condition in which voice information is generated has a great influence on whether that voice information is a valid voice control instruction: the same or similar voice information may be a valid instruction under one environmental condition but not under another. The embodiment of the application therefore adaptively adjusts, according to the environmental condition, the judgment condition used to judge the validity of received voice information, so that validity can be judged better under different environmental conditions, the accuracy of valid recognition is improved, and the false triggering rate of invalid voice is reduced.
In a possible implementation manner, the above-mentioned decision condition is adjusted based on an environmental condition in which the first speech information is generated, and includes: the decision condition is adjusted based on the environmental condition and the duration of continuous listening of the device.
In a specific embodiment, the device may adaptively adjust the sensitivity of the decision condition by combining the environmental condition in which the first voice information is generated with the duration for which the device has been continuously listening to voice information. For the implementation of adjusting the decision condition based on the environmental condition, refer to the corresponding description in fig. 4, which is not repeated here.
Optionally, the sensitivity of the decision condition is adjusted to be lower the longer the duration of continuous listening of the device. For specific implementation of the decision condition for adjusting the continuous listening duration of the voice information based on the device, reference may be made to the corresponding description in fig. 5, which is not described herein again.
Optionally, in a specific implementation, the device may configure a weight for each of the environmental condition and the listening duration, and comprehensively adjust the sensitivity of the decision condition in a weighted manner. For example, for the adjustment of the judgment threshold above, assume the two influencing factors, environmental condition and listening duration, are assigned weights w4 and w5, and the adjusted judgment thresholds calculated from the two factors individually are a4 and a5; the judgment threshold determined by combining the two factors is then a4 × w4 + a5 × w5. It should be noted that this weighted combination is only an example; in an actual implementation, the most (or least) adjusted result among the multiple influencing factors may instead be taken as the final adjustment, and so on.
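The weighted combination can be written out directly. In this sketch, a4 and a5 are the thresholds produced by the two factors and w4 and w5 their weights, as in the example above; the alternative strategy of keeping only the most-adjusted factor is also shown. The concrete numbers are illustrative assumptions.

```python
# Sketch of the weighted combination above: thresholds a4 and a5, produced by
# the environment and listening-duration factors, merged with weights w4, w5.
# The alternative "keep the most-adjusted factor" strategy is shown as well.
def weighted_threshold(adjusted: list[float], weights: list[float]) -> float:
    """Adjusted threshold = a4*w4 + a5*w5 (weights assumed to sum to 1)."""
    return sum(a * w for a, w in zip(adjusted, weights))

def extreme_threshold(original: float, adjusted: list[float]) -> float:
    """Alternative: keep the per-factor threshold that deviates most."""
    return max(adjusted, key=lambda a: abs(a - original))

a4, a5 = 0.60, 0.75   # illustrative per-factor adjusted thresholds
w4, w5 = 0.7, 0.3     # illustrative weights
print(weighted_threshold([a4, a5], [w4, w5]))  # 0.645
print(extreme_threshold(0.70, [a4, a5]))       # 0.60, the larger change
```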
Because the longer the device continuously listens, the greater the probability that the listened voice information is invalid voice, the embodiment of the application combines the environmental condition in which the voice information is generated with the device's continuous listening duration to adaptively adjust the judgment condition, so that the validity of the voice information can be judged better, the accuracy of valid recognition is improved, and the false triggering rate of invalid voice is reduced.
In a possible implementation, the above-mentioned decision condition is adjusted based on the environmental condition and the duration of continuous listening of the device, and includes: the decision condition is adjusted based on the environmental condition, the duration of continuous listening, and the condition of the historical speech information.
Optionally, the condition of the historical voice information includes one or more of the following: a first time interval between when the first voice information is acquired and when valid voice information was last acquired; a second time interval between when the first voice information is acquired and when invalid voice information was last acquired; a ratio of valid voice information to invalid voice information within a first preset duration before the first voice information is acquired; a first degree of association between the first voice information and the semantics of the most recently acquired valid voice information; a second degree of association between the first voice information and the semantics of the most recently acquired invalid voice information; a third degree of association between the first voice information and the valid voice information most recently acquired by the device; the state of the voice conversation between the device and the user up to when the first voice information is acquired; a first similarity between the acoustic features of the first voice information and those of historical valid voice information; and a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
Optionally, the sensitivity of the decision condition is adjusted to be lower as the first time interval is longer.
Optionally, the sensitivity of the decision condition is adjusted to be lower as the second time interval is longer.
Optionally, in a case where the first time interval is smaller than the second time interval, the sensitivity of the decision condition is increased.
Optionally, in a case where the proportion of valid voice information is greater than that of invalid voice information, the sensitivity of the judgment condition is increased;
in a case where the proportion of valid voice information is smaller than that of invalid voice information, the sensitivity of the judgment condition is increased if the proportion of valid voice information is on a rising trend, and lowered if it is on a falling trend.
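The following sketch illustrates the ratio rule just stated, assuming the proportions for the current and previous observation windows are already computed; the two-window comparison and the 0.05 step are invented for illustration.

```python
# Illustrative sketch of the ratio rule: when valid utterances are in the
# minority, the direction of adjustment follows the trend of their share.
# The two-window comparison and 0.05 step are assumptions.
def adjust_for_valid_ratio(sensitivity: float,
                           valid_share_now: float,
                           valid_share_before: float,
                           step: float = 0.05) -> float:
    if valid_share_now > 0.5:                   # valid outweighs invalid
        return min(1.0, sensitivity + step)
    if valid_share_now > valid_share_before:    # minority, but rising trend
        return min(1.0, sensitivity + step)
    if valid_share_now < valid_share_before:    # minority and falling trend
        return max(0.0, sensitivity - step)
    return sensitivity

print(adjust_for_valid_ratio(0.5, valid_share_now=0.4, valid_share_before=0.3))  # raised
print(adjust_for_valid_ratio(0.5, valid_share_now=0.3, valid_share_before=0.4))  # lowered
```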
Optionally, in a case where a state of voice conversation between the device and the user exists, the sensitivity of the decision condition is increased.
In this embodiment, the device may adaptively adjust the sensitivity of the decision condition by combining the environmental condition in which the first voice information is generated, the device's continuous listening duration, and the historical voice information the device has listened to. For adjusting the decision condition based on the environmental condition, refer to the corresponding description in fig. 4; for adjusting it based on the device's continuous listening duration, refer to the corresponding description in fig. 5; and for adjusting it based on the historical voice information, refer to the corresponding descriptions in fig. 5, fig. 6A, fig. 6B, fig. 7, or fig. 9. Details are not repeated here.
Optionally, in this embodiment, the sensitivity of the decision condition adjusted by combining the environmental condition, the listening duration, and the historical voice information may be computed using the weighted comprehensive adjustment described above, or by taking the most (or least) adjusted result among the influencing factors as the final adjustment; this solution does not limit the specific calculation process.
The historical voice information can also help judge the validity of the currently acquired voice information: if the currently acquired voice information is highly similar to historically acquired valid voice information, the probability that it is a valid voice instruction is high; conversely, if it is highly similar to historically acquired invalid voice information, the probability that it is invalid is high. Therefore, in the embodiment of the application, the judgment condition is adaptively adjusted by combining the historical voice information with the environmental condition and the listening duration described above, so that the validity of the voice information can be judged better, the accuracy of valid recognition is improved, and the false triggering rate of invalid voice is reduced.
In a possible implementation manner, the above-mentioned decision condition is adjusted based on an environmental condition in which the first speech information is generated, and includes: the decision condition is adjusted based on the environmental condition and the condition of the historical speech information.
In this embodiment, the device may adaptively adjust the sensitivity of the decision condition by combining the environmental condition in which the first voice information is generated with the historical voice information the device has listened to. For adjusting the decision condition based on the environmental condition, refer to the corresponding description in fig. 4; for adjusting it based on the historical voice information, refer to the corresponding descriptions in fig. 5, fig. 6A, fig. 6B, fig. 7, or fig. 9. Details are not repeated here.
Optionally, in this embodiment, the sensitivity of the decision condition adjusted by combining the environmental condition and the historical voice information may be computed using the weighted comprehensive adjustment described above, or by taking the most (or least) adjusted result among the influencing factors as the final adjustment; this solution does not limit the specific calculation process.
Based on the foregoing description, the embodiment of the application adaptively adjusts the judgment condition by combining the environmental condition in which the voice information is generated with the historical voice information, so that the validity of the voice information can be judged better, the accuracy of valid recognition is improved, and the false triggering rate of invalid voice is reduced.
In one possible embodiment, the present application provides another voice information processing method, including: acquiring first voice information; and executing the operation indicated by the first voice information in a case where the first voice information is determined to be a valid voice control instruction based on a judgment condition, where the judgment condition is adjusted based on the duration of continuous listening of the device.
In a specific embodiment, for the specific implementation of acquiring the first voice information, refer to the description of step S201 in fig. 2; for the operation executed when the first voice information is determined to be a valid voice control instruction based on the judgment condition, refer to the description of step S203 in fig. 2; and for adjusting the judgment condition based on the device's continuous listening duration, refer to the corresponding description in fig. 5. Details are not repeated here.
In the application, the longer the device continuously listens, the greater the probability that the listened voice information is invalid voice. Adaptively adjusting the judgment condition according to the device's continuous listening duration therefore allows the validity of the voice information to be judged better, improving the accuracy of valid recognition and reducing the false triggering rate of invalid voice.
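As one hedged illustration of this monotonic relationship, sensitivity could decay with continuous listening time. The linear form and its constants are assumptions, since the application only requires that longer listening map to lower sensitivity.

```python
# Hedged illustration: sensitivity decays linearly with continuous listening
# time. The linear form and its constants are assumptions; the application
# only requires that longer listening map to lower sensitivity.
def sensitivity_for_listening(base: float, seconds_listening: float,
                              decay_per_minute: float = 0.02) -> float:
    return max(0.0, base - decay_per_minute * (seconds_listening / 60.0))

print(sensitivity_for_listening(0.6, 0.0))    # 0.6 right after wake-up
print(sensitivity_for_listening(0.6, 300.0))  # about 0.5 after five minutes
```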
In one possible embodiment, the present application provides another method for processing speech information, including: acquiring first voice information; and executing the operation indicated by the first voice information under the condition that the first voice information is determined to be a valid voice control instruction based on a judgment condition, wherein the judgment condition is adjusted based on historical voice information.
In a specific embodiment, for the specific implementation of acquiring the first voice information, refer to the description of step S201 in fig. 2; for the operation executed when the first voice information is determined to be a valid voice control instruction based on the judgment condition, refer to the description of step S203 in fig. 2; and for adjusting the judgment condition based on the historical voice information the device has listened to, refer to the corresponding descriptions in fig. 5, fig. 6A, fig. 6B, fig. 7, or fig. 9. Details are not repeated here.
The historical voice information can also help judge the validity of the currently acquired voice information: if the currently acquired voice information is highly similar to historically acquired valid voice information, the probability that it is a valid voice instruction is high; conversely, if it is highly similar to historically acquired invalid voice information, the probability that it is invalid is high. Therefore, in the application, the judgment condition is adaptively adjusted based on the historical voice information, so that the validity of the voice information can be judged better, the accuracy of valid recognition is improved, and the false triggering rate of invalid voice is reduced.
To facilitate a general understanding of the voice information processing method provided in the present application, reference may be made to the flow chart shown in fig. 11. In fig. 11, the voice interaction system of the device first wakes up, and the system then starts listening to the user's voice. After the system acquires the user's voice information, the voice information is input into the invalid rejection model to recognize its validity. If the voice information is recognized as valid, semantic understanding is performed on it, and instruction parsing and execution are carried out based on the understood semantics.
After semantic understanding, the voice interaction system judges whether to continue listening to the user's voice; if so, it performs the listening operation, and if it determines not to continue, it performs the operation of ending listening. For example, whether to continue listening may be determined according to a preset listening duration: if the duration has not yet elapsed, listening continues; otherwise, listening ends.
If the invalid rejection model recognizes the voice information as invalid, the system likewise judges whether to continue listening to the user's voice; if so, it performs the listening operation, and if not, it performs the operation of ending listening.
In a possible embodiment, in the flow shown in fig. 11, after the voice information is determined to be valid, the step of judging whether to continue listening to the user's voice and the step of semantic understanding may be performed at the same time, or the judgment may be made first and semantic understanding performed afterwards; the application does not limit the order of these two operations.
After semantic understanding of the voice information, the understood semantics may be fed back to the validity recognition process, that is, input to the invalid rejection model to adjust the sensitivity of the decision condition.
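Putting the fig. 11 flow together, a compact sketch of the listen–reject–understand–execute loop with this semantic feedback path might look as follows. Every class and function here is a placeholder for components the application describes only abstractly; none of these APIs are defined by the patent itself.

```python
# Placeholder sketch of the fig. 11 loop: listen, run the invalid rejection
# model, understand and execute valid utterances, and feed the understood
# semantics back to the model. All classes and APIs here are invented stubs.
class RejectionModel:
    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold

    def is_valid(self, utterance: str) -> bool:
        # Stand-in for the model's real validity decision.
        return len(utterance) > 3

    def update_sensitivity(self, semantics: str) -> None:
        # Feedback path: understood semantics adjust the decision condition.
        self.threshold = max(0.5, self.threshold - 0.01)

def voice_loop(model: RejectionModel, utterances: list[str]) -> None:
    for utterance in utterances:          # stands in for continuous listening
        if model.is_valid(utterance):
            semantics = f"intent({utterance})"  # stand-in for semantic understanding
            print("execute:", semantics)        # stand-in for instruction execution
            model.update_sensitivity(semantics)
        # a real system would also check the preset listening duration here

voice_loop(RejectionModel(), ["hm", "open the window", "play music"])
```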
In addition, in the embodiments of the voice information processing method provided by the present application, the description above mainly takes the decision condition in the invalid rejection model as an example; in practical application, however, the decision condition for the validity of the voice information is not limited to the decision condition in the invalid rejection model. Any scheme that adjusts the decision condition for the validity of voice information based on one or more of the above influencing factors falls within the protection scope of the present application.
In summary, the voice information processing method provided by the application starts from one or more influencing factors that affect the judgment of voice information validity, and adjusts in real time the sensitivity of the judgment condition by which the device judges the validity of acquired voice information. The device can thereby flexibly and effectively judge the validity of voice information according to different scenarios and user states, improving the accuracy of validity recognition, reducing the false triggering rate of invalid voice, saving the computing resources otherwise wasted by false triggering, and improving the user experience in the voice interaction process.
The foregoing mainly introduces the voice information processing method provided in the embodiments of the present application. It is understood that each device comprises corresponding hardware structures and/or software modules for executing each function. The elements and steps of the various examples described in connection with the embodiments disclosed herein may be embodied as hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the device may be divided into the functional modules according to the method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that the division of the modules in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
In the case of dividing each functional module according to each function, fig. 12 shows a schematic diagram of a possible logical structure of an apparatus, which may be the above-mentioned device, or may be a chip in the device, or may be a processing system in the device, and the like. The apparatus 1200 includes an acquisition unit 1201, an adjustment unit 1202, a semantic understanding unit 1203, and an execution unit 1204. Wherein:
an obtaining unit 1201 is configured to obtain first voice information. The obtaining unit 1201 may be implemented by a communication interface or a transceiver, and may perform the operations described in step 201 shown in fig. 2.
An adjusting unit 1202, configured to adjust a determination condition based on an influence factor of validity of the first voice information, where the determination condition is one or more determination conditions in a validity determination model of the first voice information, and the validity is used to indicate whether the first voice information is a valid voice control instruction for a device that acquires the first voice information. The adjusting unit 1202 may be implemented by a processor, and may perform the operations described in step 202 shown in fig. 2.
A semantic understanding unit 1203 is configured to perform semantic understanding on the first speech information if it is determined that the first speech information is valid based on the adjusted decision condition. The semantic understanding unit 1203 may be implemented by a processor and may perform the semantic understanding operation described in step 203 shown in fig. 2.
An execution unit 1204 is configured to execute the instruction of the first voice message. The execution unit 1204 may be implemented by a processor, and may perform the execution operation described in step 203 shown in fig. 2.
In a possible implementation, the adjusting unit 1202 is specifically configured to:
increase the sensitivity of the decision condition in a case where it is analyzed, based on the influencing factors, that the probability that the first voice information is valid is greater than the probability that it is invalid, where the higher the sensitivity of the decision condition, the higher the probability that the first voice information is determined to be valid through the decision condition;
and lower the sensitivity of the decision condition in a case where it is analyzed, based on the influencing factors, that the probability that the first voice information is valid is less than the probability that it is invalid, where the lower the sensitivity of the decision condition, the lower the probability that the first voice information is determined to be valid through the decision condition.
In a possible implementation manner, the decision condition includes a selection condition of a pre-decision module of the validity of the first voice information in the validity judgment model, and the pre-decision module includes a rule matching module and an inference module.
In a possible implementation manner, the decision condition includes a decision threshold of an inference module in the validity decision model, which is used for prejudging validity of the first voice information.
In a possible implementation manner, the decision condition includes a comprehensive decision condition of a decision module in the validity decision model; the comprehensive judgment condition is a judgment condition for determining whether the first voice signal is effective or not based on a pre-judgment result; the prejudgment result is the prejudgment result of the prejudgment module in the effectiveness judgment model on the effectiveness of the first voice information.
In one possible embodiment, the influencing factor is one or more of the following:
the environment condition of the first voice information when being generated;
the duration of the continuous listening of the device 1200;
a first time interval between when the first voice message is acquired and when the effective voice message is acquired last time;
a second time interval between when the first voice message is acquired and when the invalid voice message is acquired last time;
a ratio of effective voice information to ineffective voice information within a first preset duration before the first voice information is acquired;
a first degree of association between the first voice message and the semantics of the effective voice message obtained last time;
a second degree of association between the first voice information and the semantics of the invalid voice information acquired last time;
a third degree of association between the first voice message and the valid voice message that was last acquired by the apparatus 1200;
the state of the device 1200 in voice conversation with the user by the time the first voice message is acquired;
a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
In one possible embodiment, the environmental condition in which the first speech information is generated includes one or more of the following:
the number of speakers within a second preset duration up to when the apparatus 1200 acquires the first voice information, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
For specific operations and advantages of each unit in the apparatus 1200 shown in fig. 12, reference may be made to the corresponding description in the foregoing method embodiment, and details are not described here again.
In the case of dividing each functional module according to each function, fig. 13 shows a schematic diagram of a possible logical structure of an apparatus, which may be the above-mentioned device, or may be a chip in the device, or may be a processing system in the device, and the like. The apparatus 1300 includes an obtaining unit 1301 and an executing unit 1302. Wherein:
an obtaining unit 1301 is configured to obtain the first voice information. The obtaining unit 1301 may be implemented by a communication interface or a transceiver, and may perform the operation described in step S1001 shown in fig. 10.
An executing unit 1302, configured to execute an operation indicated by the first voice information if it is determined that the first voice information is a valid voice control instruction based on a decision condition, where the decision condition is adjusted based on an environmental condition where the first voice information is generated. The execution unit 1302 may be implemented by a processor, and may execute the operation described in step S1002 shown in fig. 10.
For specific operations and beneficial effects of each unit in the apparatus 1300 shown in fig. 13, reference may be made to the corresponding description in the foregoing method embodiment, and details are not described here again.
Fig. 14 is a schematic diagram illustrating a possible hardware structure of the apparatus provided in the present application, where the apparatus may be the apparatus in the method according to the foregoing embodiment. The apparatus 1400 comprises: a processor 1401, a memory 1402, and a communication interface 1403. The processor 1401, the communication interface 1403 and the memory 1402 may be connected to each other or to each other through a bus 1404.
Illustratively, the memory 1402 is used for storing computer programs and data of the device 1400, and the memory 1402 may include, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable read-only memory (CD-ROM), and the like.
In the case of implementing the embodiment shown in fig. 14, software or program codes necessary for performing the functions of all or part of the units in fig. 14 are stored in the memory 1402.
In the case of implementing the embodiment of fig. 14, if software or program codes required for functions of partial units are stored in the memory 1402, the processor 1401 may cooperate with other components (such as the communication interface 1403) to perform other functions (such as functions of receiving or transmitting data) described in the embodiment of fig. 14, in addition to calling the program codes in the memory 1402 to implement the partial functions.
The number of the communication interfaces 1403 can be multiple, and is used for supporting the device 1400 to perform communication, such as receiving or sending data or signals.
The processor 1401 may be, for example, a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, transistor logic, a hardware component, or any combination thereof. A processor may also be a combination of computing functions, e.g., a combination of one or more microprocessors, a digital signal processor and a microprocessor, or the like. The processor 1401 may be configured to read the program stored in the memory 1402, and perform the following operations:
acquiring first voice information; adjusting a judgment condition based on the influence factors of the first voice information validity, where the judgment condition is one or more judgment conditions in a validity judgment model of the first voice information, and the validity is used to indicate whether the first voice information is a valid voice control instruction for the device 1400 that acquires the first voice information; and under the condition that the first voice information is determined to be effective based on the adjusted judgment condition, performing semantic understanding on the first voice information, and executing the instruction of the first voice information.
In a possible implementation, adjusting the decision condition based on the influencing factors of the validity of the first voice information includes:
increasing the sensitivity of the decision condition in a case where it is analyzed, based on the influencing factors, that the probability that the first voice information is valid is greater than the probability that it is invalid, where the higher the sensitivity of the decision condition, the higher the probability that the first voice information is determined to be valid through the decision condition;
and lowering the sensitivity of the decision condition in a case where it is analyzed, based on the influencing factors, that the probability that the first voice information is valid is less than the probability that it is invalid, where the lower the sensitivity of the decision condition, the lower the probability that the first voice information is determined to be valid through the decision condition.
For the specific operations and beneficial effects of the apparatus 1400 shown in fig. 14, reference may be made to the corresponding descriptions in the foregoing method embodiments; details are not repeated here.
Fig. 15 is a schematic structural diagram of another voice information processing apparatus provided in this embodiment. The apparatus may be the device in the foregoing embodiments, a chip in the device, or a processing system in the device, and can implement the voice information processing method provided in this application and its various optional embodiments. As shown in fig. 15, the voice information processing apparatus 1500 includes a processor 1501 and an interface circuit 1502 coupled to the processor 1501. It should be understood that although only one processor and one interface circuit are shown in fig. 15, the apparatus 1500 may include other numbers of processors and interface circuits.
The interface circuit 1502 is used, among other things, to communicate with other components of the apparatus 1500, such as a memory or other processor. The processor 1501 is used for signal interaction with other components through the interface circuit 1502. The interface circuit 1502 may be an input/output interface of the processor 1501.
For example, the processor 1501 reads computer programs or instructions in a memory coupled thereto through the interface circuit 1502, and decodes and executes the computer programs or instructions. It will be appreciated that these computer programs or instructions may comprise the respective functional procedures of the methods described above. When the corresponding functional programs are decoded and executed by the processor 1501, the speech information processing apparatus 1500 can be caused to implement the scheme in the speech information processing method provided in the embodiment of the present application.
Alternatively, these functional programs are stored in a memory external to the voice information processing apparatus 1500. When the functional program is decoded and executed by the processor 1501, part or all of the content of the functional program is temporarily stored in the internal memory.
Alternatively, these functional programs are stored in a memory inside the voice information processing apparatus 1500. When the memory inside the speech information processing apparatus 1500 stores the functional program, the speech information processing apparatus 1500 may be provided in the device of the embodiment of the present application.
Alternatively, part of the contents of these functional programs are stored in a memory outside the speech information processing apparatus 1500, and the other part of the contents of these functional programs are stored in a memory inside the speech information processing apparatus 1500.
It should be understood that the apparatuses or devices shown in fig. 1, fig. 12, fig. 13, fig. 14, and fig. 15 may be combined with one another, and the related design details of their various alternative embodiments may be cross-referenced; the voice information processing method shown in fig. 2 or fig. 10 and the related design details of its various alternative embodiments may likewise be cross-referenced. Details are not repeated here.
The embodiments of the present application also provide a computer-readable storage medium, where a computer program is stored, and the computer program is executed by a processor to implement the operations performed by the server in any one of the above embodiments and possible embodiments thereof.
The embodiments of the present application further provide a computer program product, when the computer program product is read and executed by a computer, the operations performed by the server in any of the above embodiments and possible embodiments thereof are executed.
Embodiments of the present application further provide a computer program, which when executed on a computer, will enable the computer to implement the operations performed by the server in any one of the above embodiments and possible embodiments.
In summary, the present application provides a voice information processing method and device, which can improve the accuracy of effective voice recognition and reduce the false triggering rate of invalid voice in different intelligent voice interaction scenarios.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first image may be referred to as a second image, and similarly, a second image may be referred to as a first image, without departing from the scope of the various described examples. Both the first image and the second image may be images, and in some cases, may be separate and distinct images.
It should also be understood that, in the embodiments of the present application, the size of the serial number of each process does not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It will be further understood that the terms "comprises," "comprising," "includes," and/or "including," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be appreciated that reference throughout this specification to "one embodiment," "an embodiment," "one possible implementation" means that a particular feature, structure, or characteristic described in connection with the embodiment or implementation is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" or "one possible implementation" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (30)

1. A method for processing speech information, the method comprising:
acquiring first voice information;
and executing the operation indicated by the first voice information in a case where the first voice information is determined to be a valid voice control instruction based on a judgment condition, wherein the judgment condition is adjusted based on the environmental condition in which the first voice information is generated.
2. The method of claim 1, wherein the environmental conditions in which the first speech information is generated comprise one or more of:
the number of speakers within a second preset duration up to when the device acquires the first voice information, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
3. The method according to claim 1 or 2, wherein the decision condition is adjusted based on an environmental condition in which the first speech information is generated, and comprises:
the decision condition is adjusted based on the environmental condition and the continuous listening time of the device.
4. The method of claim 3, wherein the decision condition is adjusted based on the environmental conditions and the duration of listening of the device, and comprises:
the decision condition is adjusted based on the environment condition, the continuous listening duration and the condition of the historical voice information.
5. The method according to claim 1 or 2, wherein the decision condition is adjusted based on an environmental condition in which the first speech information is generated, and comprises:
the judgment condition is obtained by adjusting based on the environment condition and the condition of the historical voice information.
6. The method of claim 4 or 5, wherein the condition of the historical speech information comprises one or more of the following:
a first time interval between when the first voice information is acquired and when the effective voice information is acquired last time;
a second time interval between when the first voice information is acquired and when the invalid voice information is acquired last time;
a ratio of effective voice information to ineffective voice information within a first preset duration before the first voice information is acquired;
a first association degree of the first voice information and the semantics of the effective voice information acquired last time;
a second degree of association between the first voice information and the semantics of invalid voice information acquired last time;
a third degree of association between the first voice information and effective voice information which is obtained by the device last time;
the state of the voice conversation between the device and the user up to when the first voice information is acquired;
a first similarity of acoustic features of the first voice information and historical valid voice information;
a second similarity of the acoustic features of the first voice information and historical invalid voice information.
7. The method according to any one of claims 1 to 6,
in the case that the environmental condition indicates that the probability that the first voice information is valid is greater than the probability that the first voice information is invalid, the sensitivity of the decision condition is increased;
in the event that the environmental condition indicates that the probability that the first speech information is valid is less than the probability of being invalid, the sensitivity of the decision condition is adjusted downward.
8. The method according to claim 3 or 4, wherein the sensitivity of the decision condition is adjusted to be lower the longer the duration of continuous listening of the device.
9. The method according to any one of claims 4 to 6, wherein the condition of the historical speech information comprises a first time interval between when the first speech information is acquired and when valid speech information was most recently acquired;
the sensitivity of the decision condition is adjusted to be lower the longer the first time interval.
10. The method according to any one of claims 4 to 6, wherein the condition of the historical speech information comprises a second time interval between when the first speech information is acquired and when invalid speech information was acquired last time;
the sensitivity of the decision condition is adjusted to be lower the longer the second time interval.
11. The method according to any one of claims 4 to 6, wherein the condition of the historical voice information comprises a first time interval between when the first voice information is acquired and when valid voice information is acquired last time, and comprises a second time interval between when the first voice information is acquired and when invalid voice information is acquired last time;
in case the first time interval is smaller than the second time interval, the sensitivity of the decision condition is adjusted higher.
12. The method according to any one of claims 4 to 6, wherein the condition of the historical voice information comprises a ratio of valid voice information to invalid voice information within a first preset time period before the first voice information is acquired;
in the case that the proportion of the valid voice information is larger than the proportion of the invalid voice information, the sensitivity of the decision condition is increased;
in a case where the proportion of the effective voice information is smaller than that of the ineffective voice information, the sensitivity of the judgment condition is increased if the proportion of the effective voice information is on a rising trend, and lowered if it is on a falling trend.
13. The method according to any one of claims 4 to 6, wherein the condition of the historical voice information comprises a state of the voice conversation between the device and the user up to when the first voice information is acquired;
in a case where a state of voice conversation between the device and the user exists, the sensitivity of the decision condition is increased.
14. A speech information processing apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire first voice information;
and the execution unit is used for executing the operation indicated by the first voice information under the condition that the first voice information is determined to be a valid voice control instruction based on a judgment condition, wherein the judgment condition is obtained by adjusting based on the environmental condition when the first voice information is generated.
15. The apparatus of claim 14, wherein the environmental conditions in which the first speech information is generated comprise one or more of:
the number of speakers within a second preset duration up to when the device acquires the first voice information, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
16. The apparatus according to claim 14 or 15, wherein the adjustment of the decision condition based on the environmental condition in which the first voice information is generated comprises:
the decision condition is adjusted based on the environmental condition and the continuous listening duration of the apparatus.
17. The apparatus according to claim 16, wherein the adjustment of the decision condition based on the environmental condition and the continuous listening duration of the apparatus comprises:
the decision condition is adjusted based on the environmental condition, the continuous listening duration, and the condition of the historical voice information.
18. The apparatus according to claim 14 or 15, wherein the adjustment of the decision condition based on the environmental condition in which the first voice information is generated comprises:
the decision condition is adjusted based on the environmental condition and the condition of the historical voice information.
19. The apparatus according to claim 17 or 18, wherein the condition of the historical voice information comprises one or more of the following (a hypothetical data structure for these items is sketched after this list):
a first time interval between when the first voice information is acquired and when valid voice information was most recently acquired;
a second time interval between when the first voice information is acquired and when invalid voice information was most recently acquired;
the proportions of valid voice information and invalid voice information within a first preset time period before the first voice information is acquired;
a first degree of association between the first voice information and the semantics of the most recently acquired valid voice information;
a second degree of association between the first voice information and the semantics of the most recently acquired invalid voice information;
a third degree of association between the first voice information and the valid voice information most recently acquired by the apparatus;
the state of the voice conversation between the apparatus and the user at the time the first voice information is acquired;
a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
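As promised above, a hypothetical container mirroring the nine items of claim 19; every field name is invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class HistoricalCondition:
    """Hypothetical record of the per-utterance history items of claim 19."""
    first_interval_s: float         # time since the last valid utterance
    second_interval_s: float        # time since the last invalid utterance
    valid_ratio: float              # share of valid speech in the preset window
    invalid_ratio: float            # share of invalid speech in the preset window
    assoc_last_valid: float         # semantic association with last valid speech
    assoc_last_invalid: float       # semantic association with last invalid speech
    assoc_device_last_valid: float  # association with the apparatus's last valid input
    in_conversation: bool           # dialogue state when the utterance arrives
    sim_valid_acoustic: float       # acoustic similarity to valid history
    sim_invalid_acoustic: float     # acoustic similarity to invalid history
```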
20. The apparatus according to any one of claims 14 to 19, wherein:
if the environmental condition indicates that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is increased;
if the environmental condition indicates that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the decision condition is decreased.
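Claim 20 reduces to a comparison of two probability estimates; a sketch with the estimator itself left abstract and the step size assumed:

```python
def adjust_for_environment(p_valid: float, p_invalid: float,
                           sensitivity: float) -> float:
    # Claim 20, sketched: raise sensitivity when the environment suggests the
    # utterance is more likely valid than invalid; lower it otherwise.
    if p_valid > p_invalid:
        sensitivity += 0.1
    elif p_valid < p_invalid:
        sensitivity -= 0.1
    return max(0.0, min(1.0, sensitivity))
```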
21. The apparatus according to claim 16 or 17, wherein the longer the continuous listening duration of the apparatus, the lower the sensitivity of the decision condition is adjusted.
22. The apparatus according to any one of claims 17 to 19, wherein the condition of the historical voice information comprises a first time interval between when the first voice information is acquired and when valid voice information was most recently acquired;
the longer the first time interval, the lower the sensitivity of the decision condition is adjusted.
23. The apparatus according to any one of claims 17 to 19, wherein the condition of the historical voice information comprises a second time interval between when the first voice information is acquired and when invalid voice information was most recently acquired;
the longer the second time interval, the lower the sensitivity of the decision condition is adjusted.
24. The apparatus according to any one of claims 17 to 19, wherein the condition of the historical voice information comprises a first time interval between when the first voice information is acquired and when valid voice information was most recently acquired, and a second time interval between when the first voice information is acquired and when invalid voice information was most recently acquired;
if the first time interval is shorter than the second time interval, the sensitivity of the decision condition is increased.
25. The apparatus according to any one of claims 17 to 19, wherein the condition of the historical voice information comprises the proportions of valid voice information and invalid voice information within a first preset time period before the first voice information is acquired;
if the proportion of valid voice information is greater than the proportion of invalid voice information, the sensitivity of the decision condition is increased;
if the proportion of valid voice information is less than the proportion of invalid voice information, the sensitivity of the decision condition is increased when the proportion of valid voice information is rising, and decreased when it is falling.
26. The apparatus according to any one of claims 17 to 19, wherein the condition of the historical voice information comprises the state of the voice conversation between the apparatus and the user at the time the first voice information is acquired;
if the apparatus is in a voice conversation with the user, the sensitivity of the decision condition is increased.
27. An apparatus, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the apparatus performs the method according to any one of claims 1 to 13.
28. A chip system, applied to an electronic device, wherein the chip system comprises an interface circuit and a processor; the interface circuit and the processor are interconnected through a line; the interface circuit is configured to receive a signal from a memory of the electronic device and send the signal to the processor, the signal comprising computer instructions stored in the memory; and when the processor executes the computer instructions, the chip system performs the method according to any one of claims 1 to 13.
29. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 13.
30. A computer program product, characterized in that, when the computer program product is executed by a processor, the method according to any one of claims 1 to 13 is performed.
CN202180001492.4A 2021-04-20 2021-04-20 Voice information processing method and device Pending CN113330513A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/088522 WO2022222045A1 (en) 2021-04-20 2021-04-20 Speech information processing method, and device

Publications (1)

Publication Number Publication Date
CN113330513A (en) 2021-08-31

Family

ID=77427019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180001492.4A Pending CN113330513A (en) 2021-04-20 2021-04-20 Voice information processing method and device

Country Status (2)

Country Link
CN (1) CN113330513A (en)
WO (1) WO2022222045A1 (en)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103578468B (en) * 2012-08-01 2017-06-27 联想(北京)有限公司 The method of adjustment and electronic equipment of a kind of confidence coefficient threshold of voice recognition
CN103065631B (en) * 2013-01-24 2015-07-29 华为终端有限公司 A kind of method of speech recognition, device
US9818404B2 (en) * 2015-12-22 2017-11-14 Intel Corporation Environmental noise detection for dialog systems
CN109326289B (en) * 2018-11-30 2021-10-22 深圳创维数字技术有限公司 Wake-up-free voice interaction method, device, equipment and storage medium
CN110782891B (en) * 2019-10-10 2022-02-18 珠海格力电器股份有限公司 Audio processing method and device, computing equipment and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102044243A (en) * 2009-10-15 2011-05-04 华为技术有限公司 Method and device for voice activity detection (VAD) and encoder
KR101698369B1 (en) * 2015-11-24 2017-01-20 주식회사 인텔로이드 Method and apparatus for information providing using user speech signal
CN107622770A (en) * 2017-09-30 2018-01-23 百度在线网络技术(北京)有限公司 voice awakening method and device
CN108320742A (en) * 2018-01-31 2018-07-24 广东美的制冷设备有限公司 Voice interactive method, smart machine and storage medium
CN110148405A (en) * 2019-04-10 2019-08-20 北京梧桐车联科技有限责任公司 Phonetic order processing method and processing device, electronic equipment and storage medium
CN110211605A (en) * 2019-05-24 2019-09-06 珠海多士科技有限公司 Smart machine speech sensitivity adjusting method, device, equipment and storage medium
CN110556107A (en) * 2019-08-23 2019-12-10 宁波奥克斯电气股份有限公司 control method and system capable of automatically adjusting voice recognition sensitivity, air conditioner and readable storage medium
CN110718223A (en) * 2019-10-28 2020-01-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for voice interaction control
CN111580773A (en) * 2020-04-15 2020-08-25 北京小米松果电子有限公司 Information processing method, device and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376513A (en) * 2022-10-19 2022-11-22 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116483960A (en) * 2023-03-30 2023-07-25 阿波罗智联(北京)科技有限公司 Dialogue identification method, device, equipment and storage medium
CN116483960B (en) * 2023-03-30 2024-01-02 阿波罗智联(北京)科技有限公司 Dialogue identification method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2022222045A1 (en) 2022-10-27

Similar Documents

Publication Publication Date Title
CN108182937B (en) Keyword recognition method, device, equipment and storage medium
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN110428810B (en) Voice wake-up recognition method and device and electronic equipment
CN105009204B (en) Speech recognition power management
CN111880856B (en) Voice wakeup method and device, electronic equipment and storage medium
CN108346425B (en) Voice activity detection method and device and voice recognition method and device
CN110047481B (en) Method and apparatus for speech recognition
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
CN111710337B (en) Voice data processing method and device, computer readable medium and electronic equipment
CN109686368B (en) Voice wake-up response processing method and device, electronic equipment and storage medium
CN111161728B (en) Awakening method, awakening device, awakening equipment and awakening medium of intelligent equipment
CN113330513A (en) Voice information processing method and device
CN115699165A (en) Automatic hotword threshold tuning
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN111326152A (en) Voice control method and device
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
KR20200007530A (en) Method for processing user voice input and electronic device supporting the same
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN111862943B (en) Speech recognition method and device, electronic equipment and storage medium
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN110853669B (en) Audio identification method, device and equipment
CN116343765A (en) Method and system for automatic context binding domain specific speech recognition
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination