WO2022222045A1

WO2022222045A1 - Speech information processing method, and device

Info

Publication number: WO2022222045A1
Application number: PCT/CN2021/088522
Authority: WO
Inventors: 杨世辉; 聂为然
Original assignee: 华为技术有限公司
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2022-10-27
Also published as: CN113330513A

Abstract

A speech information processing method and apparatus, and a device, a chip system, a computer-readable storage medium and a computer program product. The method comprises: S1001, acquiring first speech information; and S1002, when it is determined, on the basis of a decision condition, that the first speech information is an effective speech control instruction, executing an operation indicated by the first speech information, wherein the decision condition is obtained by means of an adjustment performed on the basis of the environmental conditions in which the first speech information is generated. By means of the method, the accuracy rate of effective speech recognition can be increased in different intelligent speech interaction scenarios, thereby reducing the false triggering rate of ineffective speech.

Description

Voice information processing method and device

Technical field

The present application relates to the technical field of speech processing, and in particular to methods and devices for processing speech information.

Background

In intelligent voice interaction scenarios, smart devices have two commonly used modes for listening to user voices, namely continuous listening mode and full-time wake-up-free mode. Full-time wake-up-free mode can also be called full-time listening mode. In the continuous listening or full-time listening state, the smart device needs to distinguish whether the user content is a valid instruction for it, that is, it needs to distinguish the content of the dialogue between man and machine, and the content of dialogue between man and man.

Specifically, in the listening state, the voice information collected by the device includes chat data. To prevent the smart device from being mistakenly triggered by chat content, a rule matching module or an inference module (such as a neural network inference module) is often used to judge whether the voice information is a valid voice control command. However, because usage environments and scenarios differ, the validity of the same voice information, or of voice information with the same semantics, may differ. For example, a sentence may be a valid voice control command in the current scenario but merely chat, and therefore invalid information, in another scenario. Existing solutions for determining the validity of voice information cannot adapt to recognizing valid voice information under different usage environments and scenarios, which easily leads to low recognition accuracy and false triggering by invalid voice.

To sum up, how to improve the accuracy of valid speech recognition and reduce the false trigger rate of invalid speech in different intelligent speech interaction scenarios is a technical problem that those skilled in the art need to solve urgently.

Summary of the invention

The present application provides a voice information processing method and device, which can improve the accuracy of effective voice recognition and reduce the false trigger rate of invalid voices in different intelligent voice interaction scenarios.

In a first aspect, the present application provides a voice information processing method, the method comprising:

Obtaining first voice information; and, in the case that the first voice information is determined to be a valid voice control instruction based on a judgment condition, executing the operation indicated by the first voice information, wherein the judgment condition is obtained by adjustment based on the environmental conditions in which the first voice information is generated.

The environmental conditions in which voice information is generated strongly influence whether that voice information is a valid voice control command: the same or similar voice information may be a valid command in one environment but not in another. Therefore, for voice information received under different environmental conditions, this application adaptively adjusts the judgment condition used to judge its validity. This allows the validity of voice information to be judged better under different environmental conditions, improves the accuracy of the validity judgment, and reduces the false trigger rate of invalid signals.
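As a rough illustration of the method of the first aspect, the following Python sketch applies a validity threshold that has already been adjusted for the environment and executes the command only when the utterance passes it. The function name `handle_utterance`, the numeric scores and the two example thresholds are assumptions made for illustration; the application does not prescribe them.

```python
from typing import Callable, List


def handle_utterance(text: str,
                     validity_score: float,
                     adjusted_threshold: float,
                     execute: Callable[[str], None]) -> bool:
    """Execute `text` as a command only if it passes the adjusted judgment condition.

    `validity_score` stands in for whatever the validity model outputs and
    `adjusted_threshold` for the environment-dependent judgment condition;
    both are hypothetical quantities in the range 0.0 to 1.0.
    """
    if validity_score >= adjusted_threshold:
        execute(text)
        return True
    return False


# The same utterance under two environments (threshold values are made up).
executed: List[str] = []
quiet_car = handle_utterance("open the window", 0.6, 0.5, executed.append)
group_chat = handle_utterance("open the window", 0.6, 0.7, executed.append)
```

In this sketch the same sentence is executed in the first call and rejected in the second, solely because the judgment condition differs between the two environments.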

In a possible implementation manner, the environmental conditions in which the first voice information is generated include one or more of the following: the number of speakers within a second preset time period up to the time the device obtains the first voice information, the number of people within a preset range when the first voice information is generated, the confidence level of the first voice information, or the signal-to-noise ratio of the first voice information.

The greater the number of speakers within a period of time, and/or the greater the number of people nearby when the voice information is generated, the greater the probability that the voice information received by the device is idle chat, that is, invalid voice. In addition, the higher the confidence and/or the signal-to-noise ratio of the voice information, the higher the probability that the device correctly recognizes the sentences of the voice information, which also affects the recognition of its validity. Adjusting the judgment condition for judging the validity of the voice information based on these factors allows the validity of the voice information to be judged better, improves the accuracy of the validity judgment, and reduces the false trigger rate of invalid signals.
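A minimal sketch of how these four environmental factors might be folded into a single validity threshold is given below. The field names, the weights and the reference values (for example the 20 dB signal-to-noise reference) are illustrative assumptions and are not taken from the application; in particular, the direction in which confidence and signal-to-noise ratio move the threshold is assumed here.

```python
from dataclasses import dataclass


@dataclass
class EnvironmentConditions:
    """Hypothetical container for the four factors listed above."""
    recent_speakers: int    # speakers heard within the second preset time period
    people_nearby: int      # people within the preset range
    asr_confidence: float   # confidence of the recognized text, 0.0 to 1.0
    snr_db: float           # signal-to-noise ratio in decibels


def validity_threshold(env: EnvironmentConditions, base: float = 0.5) -> float:
    """Return the score an utterance must reach to count as a valid command.

    Assumed heuristics: more speakers or bystanders raise the threshold
    (chat is more likely); high confidence and high SNR lower it.
    """
    threshold = base
    threshold += 0.05 * max(0, env.recent_speakers - 1)
    threshold += 0.02 * max(0, env.people_nearby - 1)
    threshold -= 0.10 * (env.asr_confidence - 0.5)
    threshold -= 0.005 * (env.snr_db - 20.0)
    # Clamp so the threshold stays a usable cut-off.
    return min(0.95, max(0.05, threshold))


quiet = EnvironmentConditions(recent_speakers=1, people_nearby=1,
                              asr_confidence=0.9, snr_db=30.0)
busy = EnvironmentConditions(recent_speakers=4, people_nearby=5,
                             asr_confidence=0.4, snr_db=5.0)
```

With these made-up weights, a single speaker in a quiet setting yields a threshold below the base value, while a noisy multi-person setting yields one well above it.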

In a possible implementation manner, that the judgment condition is obtained by adjustment based on the environmental conditions in which the first voice information is generated includes: the judgment condition is obtained by adjustment based on the environmental conditions and the continuous listening duration of the device.

The longer the device continues to listen, the greater the probability that the voice information it hears is invalid voice. Therefore, in this application, the judgment condition for judging the validity of voice information is adaptively adjusted according to both the environmental conditions when the voice information is generated and the continuous listening duration of the device. This further improves the judgment of the validity of the voice information, improves the accuracy of the validity judgment, and reduces the false trigger rate of invalid signals.
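One way to express "the longer the listening duration, the lower the sensitivity" is a monotonically decreasing function of time. The sketch below uses an exponential decay toward a floor; the decay shape, the 60-second half-life and the floor value are assumptions chosen only for illustration, since the application states the direction of the adjustment but not a formula.

```python
import math


def sensitivity_for_duration(listening_seconds: float,
                             initial: float = 1.0,
                             floor: float = 0.2,
                             half_life_s: float = 60.0) -> float:
    """Sensitivity decays from `initial` toward `floor` as listening continues.

    All parameter values are hypothetical; only the monotonic decrease
    reflects the behaviour described in the text.
    """
    if listening_seconds < 0:
        raise ValueError("listening_seconds must be non-negative")
    decay = math.exp(-math.log(2) * listening_seconds / half_life_s)
    return floor + (initial - floor) * decay
```

For example, `sensitivity_for_duration(60.0)` lies halfway between the initial value and the floor, and the value keeps shrinking toward the floor for longer durations without ever dropping below it.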

In a possible implementation manner, that the judgment condition is obtained by adjustment based on the environmental conditions and the continuous listening duration of the device includes: the judgment condition is obtained by adjustment based on the environmental conditions, the continuous listening duration, and the situation of historical voice information.

Historical voice information can also help judge the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to valid voice information acquired in the past, the probability that it is a valid voice command is high. Conversely, if the currently acquired voice information is highly similar to invalid voice information acquired in the past, the probability that it is an invalid voice command is high. Therefore, in this application, in addition to the environmental conditions in which the voice information is generated and the listening duration of the device described above, the historical voice information is also used to adaptively adjust the judgment condition for judging the validity of the voice information. This further improves the judgment of the validity of the voice information, improves the accuracy of the validity judgment, and reduces the false trigger rate of invalid signals.
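The similarity-based reasoning above can be sketched as follows. Cosine similarity over feature vectors is used here purely as an assumed similarity measure, and the function names, the step size and the toy two-dimensional vectors are hypothetical; the application does not specify how similarity is computed.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def adjust_by_similarity(sensitivity: float,
                         current: Sequence[float],
                         valid_history: Sequence[Sequence[float]],
                         invalid_history: Sequence[Sequence[float]],
                         step: float = 0.1) -> float:
    """Raise sensitivity if `current` resembles past valid utterances more
    than past invalid ones, and lower it in the opposite case."""
    sim_valid = max((cosine_similarity(current, v) for v in valid_history), default=0.0)
    sim_invalid = max((cosine_similarity(current, v) for v in invalid_history), default=0.0)
    if sim_valid > sim_invalid:
        sensitivity += step
    elif sim_valid < sim_invalid:
        sensitivity -= step
    return min(1.0, max(0.0, sensitivity))
```

Calling `adjust_by_similarity(0.5, [1.0, 0.0], [[0.9, 0.1]], [[0.0, 1.0]])` raises the sensitivity, because the current vector is closer to the stored valid example than to the invalid one.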

In a possible implementation manner, that the judgment condition is obtained by adjustment based on the environmental conditions in which the first voice information is generated includes: the judgment condition is obtained by adjustment based on the environmental conditions and historical voice information.

Based on the foregoing description, in this application the judgment condition for judging the validity of the voice information is adaptively adjusted by combining the environmental conditions in which the voice information is generated with historical voice information. This further improves the judgment of the validity of the voice information, improves the accuracy of the validity judgment, and reduces the false trigger rate of invalid signals.

In a possible implementation manner, the situation of the historical voice information includes one or more of the following:

the first time interval between when the first voice information is obtained and when valid voice information was most recently obtained;

the second time interval between when the first voice information is obtained and when invalid voice information was most recently obtained;

the proportions of valid voice information and invalid voice information within a first preset time period before the first voice information is obtained;

the first degree of semantic association between the first voice information and the most recently obtained valid voice information;

the second degree of semantic association between the first voice information and the most recently obtained invalid voice information;

the third degree of association between the first voice information and the most recent valid voice information obtained by the device;

the state of the voice dialogue between the device and the user when the first voice information is obtained;

the first similarity between the acoustic features of the first voice information and those of historical valid voice information;

the second similarity between the acoustic features of the first voice information and those of historical invalid voice information.

In this application, the situation of the historical voice information that can be used to help judge the validity of the currently acquired voice information includes one or more of the above items. Adaptively adjusting the judgment condition for judging the validity of the voice information based on the one or more items allows the validity of the voice information to be judged better, improves the accuracy of the validity judgment, and reduces the false trigger rate of invalid signals.

In a possible implementation manner, when the environmental condition indicates that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is increased;

In the case where the environmental conditions indicate that the probability that the first voice information is valid is smaller than the probability that it is invalid, the sensitivity of the decision condition is lowered.

In the embodiment of the present application, if the received voice information has a high probability of being valid, the threshold for the validity judgment can be lowered, that is, the sensitivity of the judgment condition is increased; if the probability of being valid is low, the threshold for the validity judgment can be raised, that is, the sensitivity of the judgment condition is lowered. In this way, the validity of voice information received under different environmental conditions can be recognized flexibly and more accurately, instead of applying one fixed judgment condition across the board to judge the validity of voice information in every scenario.
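The relationship between sensitivity and threshold described above can be made concrete with a small sketch. The linear mapping `threshold = 1 - sensitivity`, the fixed step size, and the treatment of "valid more probable than invalid" as `p_valid > 0.5` are assumptions made for illustration only.

```python
def threshold_from_sensitivity(sensitivity: float) -> float:
    """Map a sensitivity in [0, 1] to a validity threshold in [0, 1].

    Higher sensitivity means a lower threshold, so more utterances are accepted.
    """
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must lie in [0, 1]")
    return 1.0 - sensitivity


def adjust_sensitivity(current: float, p_valid: float, step: float = 0.1) -> float:
    """Raise sensitivity when validity is more probable than invalidity,
    lower it when validity is less probable (hypothetical fixed step)."""
    if p_valid > 0.5:
        current += step
    elif p_valid < 0.5:
        current -= step
    return min(1.0, max(0.0, current))


def is_valid_command(score: float, sensitivity: float) -> bool:
    """Judge an utterance whose validity score is `score`."""
    return score >= threshold_from_sensitivity(sensitivity)


raised = adjust_sensitivity(0.5, p_valid=0.8)
lowered = adjust_sensitivity(0.5, p_valid=0.2)
```

An utterance scoring 0.45 is accepted under the raised sensitivity and rejected under the lowered one, which is the flexibility a fixed judgment condition cannot offer.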

In a possible implementation manner, the longer the continuous listening time of the device is, the lower the sensitivity of the decision condition is adjusted.

The longer the device continues to listen, the higher the probability that the voice information it hears is invalid voice. Therefore, in this application, the threshold of the validity judgment can be raised, that is, the sensitivity of the judgment condition can be lowered, so that whether the voice information is valid can be recognized more accurately.

In a possible implementation manner, the situation of the historical voice information includes a first time interval between when the first voice information is acquired and when valid voice information was most recently acquired; the longer the first time interval, the lower the sensitivity of the decision condition is adjusted.

The longer the interval between the time the current voice signal is obtained and the time valid voice information was most recently obtained, the higher the probability that the current voice signal is an invalid voice command. Therefore, in this application, the threshold of the validity judgment can be raised, that is, the sensitivity of the decision condition can be lowered, so that whether the voice information is valid can be recognized more accurately.

In a possible implementation manner, the situation of the historical voice information includes a second time interval between when the first voice information is acquired and when invalid voice information was most recently acquired; the longer the second time interval, the lower the sensitivity of the decision condition is adjusted.

The longer the interval between the time the current voice signal is acquired and the time invalid voice information was most recently acquired, the higher the probability that the current voice signal is an invalid voice command. Therefore, in this application, the threshold of the validity judgment can be raised, that is, the sensitivity of the decision condition can be lowered, so that whether the voice information is valid can be recognized more accurately.

In a possible implementation manner, the situation of the historical voice information includes the first time interval between when the first voice information is acquired and when valid voice information was most recently acquired, and the second time interval between when the first voice information is acquired and when invalid voice information was most recently acquired; in the case that the first time interval is smaller than the second time interval, the sensitivity of the decision condition is increased.

In the present application, the first time interval being smaller than the second time interval indicates that not much time has passed between the most recent historical valid voice information and the first voice information, so the probability that the first voice information is a valid voice instruction is relatively high. Therefore, the threshold of the validity judgment can be lowered, that is, the sensitivity of the judgment condition can be increased, so that whether the voice information is valid can be recognized more accurately.
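The comparison of the two time intervals can be sketched as below. The function name, the step size, and the choice to leave the sensitivity unchanged when there is no history or when the first interval is not the smaller one are assumptions; the application only states that the sensitivity is increased when the first interval is smaller than the second.

```python
from typing import Optional


def adjust_by_recency(sensitivity: float,
                      since_last_valid_s: Optional[float],
                      since_last_invalid_s: Optional[float],
                      step: float = 0.1) -> float:
    """Raise sensitivity when the last valid utterance is more recent
    than the last invalid one (first interval < second interval).

    `None` means no such utterance has been seen yet; in that case the
    sensitivity is left unchanged (an assumed fallback).
    """
    if since_last_valid_s is None or since_last_invalid_s is None:
        return sensitivity
    if since_last_valid_s < since_last_invalid_s:
        return min(1.0, sensitivity + step)
    return sensitivity
```

For instance, a valid command 5 seconds ago and an invalid utterance 40 seconds ago would raise a sensitivity of 0.5 to roughly 0.6, whereas the reverse ordering leaves it at 0.5 in this sketch.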

In a possible implementation manner, the situation of the historical voice information includes the proportion of valid voice information and invalid voice information within a first preset time period before the first voice information is obtained;

In the case that the proportion of the valid voice information is greater than the proportion of the invalid voice information, the sensitivity of the judgment condition is increased;

In the case where the proportion of the valid voice information is smaller than the proportion of the invalid voice information: if the proportion of the valid voice information is on an upward trend, the sensitivity of the judgment condition is increased; if the proportion of the valid voice information is on a downward trend, the sensitivity of the judgment condition is lowered.

In the present application, if valid voice information accounts for the larger proportion within the first preset time period, the probability that the currently obtained first voice information is a valid command is relatively high. Therefore, the threshold of the validity judgment can be lowered, that is, the sensitivity of the judgment condition can be increased. In addition, if the proportion of valid voice information is smaller than the proportion of invalid voice information but is on an upward trend, this indicates that valid voice information is becoming more frequent, so the probability that the first voice information is a valid command is also relatively high. Therefore, the threshold of the validity judgment can be lowered and the sensitivity of the judgment condition increased, so that whether the voice information is valid can be recognized more accurately.
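The proportion-and-trend rule can be sketched as follows. Estimating the trend by comparing the valid share in the older half of the window with that in the newer half is an assumption made here for illustration, as are the function name and the step size; the application does not say how the trend is computed.

```python
from typing import Sequence


def adjust_by_history_ratio(sensitivity: float,
                            validity_flags: Sequence[bool],
                            step: float = 0.1) -> float:
    """Adjust sensitivity from the utterances in the first preset time period.

    `validity_flags` is ordered oldest first; True marks a valid utterance.
    """
    n = len(validity_flags)
    if n < 2:
        return sensitivity  # too little history to judge (assumed fallback)
    valid_ratio = sum(validity_flags) / n
    if valid_ratio > 0.5:
        sensitivity += step
    elif valid_ratio < 0.5:
        mid = n // 2
        older = sum(validity_flags[:mid]) / mid
        newer = sum(validity_flags[mid:]) / (n - mid)
        if newer > older:      # valid share is rising
            sensitivity += step
        elif newer < older:    # valid share is falling
            sensitivity -= step
    return min(1.0, max(0.0, sensitivity))
```

A window that is mostly valid raises the sensitivity outright, while a mostly invalid window raises or lowers it depending on whether valid utterances are becoming more or less frequent toward the end of the window.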

In a possible implementation manner, the situation of the historical voice information includes the state of the voice dialogue between the device and the user at the time the first voice information is obtained; in the case that the device and the user are in a voice dialogue, the sensitivity of the decision condition is increased.

The state of the voice dialogue between the device and the user refers to the state in which the device and the user are in a voice conversation. The device can track this state through the dialogue state tracking function. If this state currently exists, the first voice information is likely to be a valid voice command. Therefore, the threshold of the validity judgment can be lowered and the sensitivity of the judgment condition increased, so that whether the voice information is valid can be recognized more accurately.

In a possible implementation manner, the device may receive the sensitivity of the specified judgment condition, adjust the judgment condition based on the sensitivity, and then use the adjusted judgment condition to judge whether the above-mentioned first voice information is valid.

In this application, the above-specified sensitivity is the sensitivity input by the user, and the device can more flexibly adjust the sensitivity of the decision condition based on the user's needs, so as to better meet the user's needs.

In a possible implementation manner, the present application provides another voice information processing method. The method includes: acquiring first voice information; and, in the case that the first voice information is determined to be a valid voice control instruction based on a judgment condition, executing the operation indicated by the first voice information, wherein the judgment condition is obtained by adjustment based on the continuous listening duration of the device.

In this application, the longer the device continues to listen, the greater the probability that the voice information it hears is invalid voice. Adaptively adjusting the judgment condition for judging the validity of the voice information according to the continuous listening duration of the device therefore allows the validity of the voice information to be judged better, improves the accuracy of the validity judgment, and reduces the false trigger rate of invalid signals.

In a possible implementation manner, the present application provides another voice information processing method. The method includes: acquiring first voice information; and, in the case that the first voice information is determined to be a valid voice control instruction based on a judgment condition, executing the operation indicated by the first voice information, wherein the judgment condition is obtained by adjustment based on historical voice information.

Historical voice information can also help judge the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to valid voice information acquired in the past, the probability that it is a valid voice command is high. Conversely, if the currently acquired voice information is highly similar to invalid voice information acquired in the past, the probability that it is an invalid voice command is high. Therefore, in the present application, adaptively adjusting the judgment condition for judging the validity of the voice information according to the historical voice information allows the validity of the voice information to be judged better, improves the accuracy of the validity judgment, and reduces the false trigger rate of invalid signals.

In a second aspect, the present application provides a voice information processing device, the device comprising:

an acquisition unit for acquiring the first voice information;

an execution unit, configured to execute the operation indicated by the first voice information when it is determined based on a judgment condition that the first voice information is a valid voice control instruction, wherein the judgment condition is obtained by adjustment based on the environmental conditions in which the first voice information is generated.

In a possible implementation manner, that the judgment condition is obtained by adjustment based on the environmental conditions in which the first voice information is generated includes: the judgment condition is obtained by adjustment based on the environmental conditions and the continuous listening duration of the device.

In a third aspect, the present application provides a device, which may include a processor and a memory, for implementing the voice information processing method described in the first aspect above. The memory is coupled to the processor, and when the processor executes the computer program stored in the memory, the method described in the first aspect or any possible implementation manner of the first aspect can be implemented. The device may also include a communication interface for the device to communicate with other devices, and the communication interface may, by way of example, be a transceiver, circuit, bus, module, or other type of communication interface.

In one possible implementation, the device may include:

memory for storing computer programs;

a processor, configured to obtain the first voice information and, in the case that the first voice information is determined to be a valid voice control instruction based on a judgment condition, execute the operation indicated by the first voice information, wherein the judgment condition is obtained by adjustment based on the environmental conditions in which the first voice information is generated.

It should be noted that the computer program in the memory in this application can be pre-stored or downloaded from the Internet when the device is used and stored, and this application does not specifically limit the source of the computer program in the memory. The coupling in the embodiments of the present application is an indirect coupling or connection between devices, units or modules, which may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.

In a fourth aspect, embodiments of the present application provide a chip system, which is applied to an electronic device; the chip system includes an interface circuit and a processor; the interface circuit and the processor are interconnected by lines; the interface circuit is used to receive a signal from a memory of the electronic device and send the signal to the processor, where the signal includes computer instructions stored in the memory; when the processor executes the computer instructions, the chip system executes the method described in the first aspect and any possible implementation manner thereof.

In a fifth aspect, the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is executed by a processor to implement the method described in the first aspect or any possible implementation manner of the first aspect.

In a sixth aspect, the present application provides a computer program product. When the computer program product is executed by a processor, the method described in the first aspect or any possible implementation manner of the first aspect will be executed.

The solutions provided in the second to sixth aspects above are used to implement or cooperate with the methods provided in the first aspect, so they can achieve the same or corresponding beneficial effects as the corresponding methods in the first aspect, which are not repeated here.

Description of drawings

FIG. 1 is a schematic diagram of a system architecture to which the voice information processing method provided by the present application is applicable;

FIG. 2 is a schematic flowchart of a voice information processing method provided by the present application;

FIG. 3 is a schematic structural diagram of an invalid rejection model provided by the present application;

FIG. 4 and FIG. 5 are schematic diagrams of adjusting the sensitivity of the decision condition based on influencing factors, provided by the present application;

FIG. 6A and FIG. 6B are schematic diagrams of adjusting the sensitivity of the decision condition based on influencing factors, provided by the present application;

FIG. 6C and FIG. 6D are schematic diagrams showing the change of the proportion of voice information in the present application;

FIG. 7 is a schematic diagram of adjusting the sensitivity of the decision condition based on influencing factors, provided by the present application;

FIG. 8A and FIG. 8B are schematic diagrams of judging the degree of correlation of voice information in the present application;

FIG. 9 is a schematic diagram of adjusting the sensitivity of the decision condition based on influencing factors, provided by the present application;

FIG. 10 is a schematic flowchart of another voice information processing method provided by the present application;

FIG. 11 is a schematic flowchart of the validity recognition of voice information provided by the present application;

FIG. 12 is a schematic diagram of a logical structure of an apparatus provided by an embodiment of the present application;

FIG. 13 is a schematic diagram of a logical structure of another apparatus provided by an embodiment of the present application;

FIG. 14 is a schematic diagram of a hardware structure of a device provided by an embodiment of the present application;

FIG. 15 is a schematic diagram of a hardware structure of another apparatus provided by an embodiment of the present application.

Detailed description

For ease of understanding, the following first introduces the technical terms involved in the embodiments of the present application.

1. Automatic speech recognition (ASR)

Automatic speech recognition generally refers to a technique that takes speech as its research object and, through speech signal processing and pattern recognition, lets machines automatically recognize and understand human speech and transform it into corresponding text or commands.

The overall process of building a speech recognition system includes two parts: training and recognition. Training is usually done offline: signal processing and knowledge mining are performed on pre-collected massive speech and language databases to obtain the acoustic model and the language model required by the speech recognition system (an acoustic model is a knowledge representation of variables such as acoustics, phonetics, environment, speaker gender, and accent; a language model is a knowledge representation of a set of word sequences). Recognition is usually completed online and automatically recognizes the user's real-time voice. The recognition process can usually be divided into two modules, a front end and a back end. The main function of the front-end module is to perform endpoint detection (removing redundant silence and non-speech sounds), noise reduction, feature extraction, and so on. The function of the back-end module is to use the trained acoustic model and language model to perform statistical pattern recognition (also known as decoding) on the feature vectors of the user's speech and obtain the text information they contain. In addition, the back-end module contains an adaptive feedback module that can learn from the user's voice, so as to make necessary corrections to the acoustic model and the voice model and further improve recognition accuracy.
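To illustrate the endpoint detection step of the front-end module, the sketch below trims leading and trailing silence using a simple short-time energy threshold. This is a deliberately simplified stand-in chosen for illustration; the frame length, the threshold and the function names are assumptions, and practical front ends typically use more robust voice activity detection.

```python
from typing import List, Optional, Sequence, Tuple


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples: Sequence[float],
                     frame_len: int = 4,
                     energy_threshold: float = 0.01) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the speech region, or None if silent.

    A frame counts as speech when its energy exceeds `energy_threshold`
    (a hypothetical value); the region spans the first to the last such frame.
    """
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > energy_threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Toy signal: 8 silent samples, 8 "speech" samples, 4 silent samples.
signal = [0.0] * 8 + [0.5, -0.5, 0.4, -0.4] * 2 + [0.0] * 4
speech_region = detect_endpoints(signal)
```

Only the samples inside the returned region would then be passed on to noise reduction and feature extraction.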

2. Voiceprint recognition (VR)

Voiceprint recognition, also known as speaker recognition, is a type of biometric technology that identifies a speaker through his or her voice. There are two types of voiceprint recognition technologies, namely speaker identification and speaker verification. Different tasks and applications use different voiceprint recognition technologies. For example, identification may be required when narrowing the scope of a criminal investigation, while verification may be required for banking transactions.

3. Speech synthesis

Speech synthesis, also known as text to speech (TTS) technology, is a technology that converts text information generated by a computer or input from external sources into understandable and fluent spoken output. It is equivalent to giving the machine an artificial mouth, letting the machine speak like a human.

4. Task-based dialogue system

Task-based dialogue can be understood as a sequential decision-making process. During the dialogue, the machine needs to update and maintain its internal dialogue state by understanding the user's sentences, and then select the next optimal action according to the current dialogue state (such as confirming requirements, querying constraints, or providing results) to complete the task.

The task-based dialogue system commonly used in the industry is a system with a modular structure, which generally includes four key modules:

Natural language understanding (NLU): Identify and parse the user's text input to obtain computer-understandable semantic labels such as slot values and intents.

Dialogue state tracking (DST): Maintains the current dialogue state according to the dialogue history. The dialogue state is the cumulative semantic representation of the entire dialogue history, generally slot-value pairs.

Dialogue policy (DP): Outputs the next system action according to the current dialogue state. Generally, the dialogue state tracking module and the dialogue policy module are collectively referred to as the dialogue manager (DM) module.

Natural language generation (NLG): Convert system actions into natural language output.

This modular system structure is highly interpretable and easy to implement. Most practical task-based dialogue systems in the industry use this structure.
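The four modules can be wired together as in the toy sketch below. The keyword-based understanding, the single `destination` slot, the action names and the reply templates are all hypothetical and serve only to show how data flows from NLU through DST and DP to NLG.

```python
from typing import Dict


def nlu(text: str) -> Dict[str, str]:
    """Toy NLU: keyword matching for one intent and one slot."""
    parsed: Dict[str, str] = {}
    lowered = text.lower()
    if "navigate" in lowered:
        parsed["intent"] = "navigate"
    if " to " in lowered:
        parsed["destination"] = lowered.split(" to ", 1)[1].strip()
    return parsed


def dst(state: Dict[str, str], parsed: Dict[str, str]) -> Dict[str, str]:
    """Dialogue state tracking: accumulate slot-value pairs over turns."""
    new_state = dict(state)
    new_state.update(parsed)
    return new_state


def policy(state: Dict[str, str]) -> str:
    """Dialogue policy: choose the next system action from the state."""
    if state.get("intent") != "navigate":
        return "ask_intent"
    if "destination" not in state:
        return "ask_destination"
    return "start_navigation"


def nlg(action: str, state: Dict[str, str]) -> str:
    """Natural language generation: render the action with a template."""
    templates = {
        "ask_intent": "What would you like to do?",
        "ask_destination": "Where would you like to go?",
        "start_navigation": "Starting navigation to {destination}.",
    }
    return templates[action].format(**state)


# Two user turns; the state carries the intent over to the second turn.
state = dst({}, nlu("Navigate please"))
first_reply = nlg(policy(state), state)
state = dst(state, nlu("Go to the airport"))
second_reply = nlg(policy(state), state)
```

The second turn does not repeat the intent, yet the system can still act on it because the dialogue state accumulated it from the first turn.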

5. Computer vision (CV)

Computer vision, also known as machine vision, is a science that studies how to make machines "see". Its main task is to obtain information about the corresponding scene by processing the collected pictures or videos.

6. Invalid rejection model

The invalid rejection model is used to judge the validity of the user's voice information obtained by the device. The validity can be used to indicate whether the voice information is a valid voice control instruction for the device that obtains the voice information. The voice information may be text information or the like obtained by converting the voice signal received by the device.

During the listening process, the device may receive a lot of voice information from users, but some of it is merely chat between users, which is invalid information for the device. The voice information with which the user actually interacts with the device is valid information for the device, and this valid information constitutes the user's voice control instructions.

In this application, the invalid rejection model may include a pre-judgment module and a decision-making module for the validity of voice information. The pre-judgment module includes a rule matching module and an inference module, and is used to make a preliminary judgment on the validity of the voice information. Specifically:

The rule matching module can match the input voice information against preset rules, such as preset sentences. If the preset sentences contain a sentence matching the input voice information, the input voice information is valid; if they do not, the input voice information is invalid.
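The matching described above can be sketched as a lookup in a set of preset sentences. The sentences in `PRESET_SENTENCES` and the normalization step (lowercasing, dropping punctuation) are assumptions made for illustration; this application does not specify how matching is performed.

```python
import re

# Hypothetical preset sentences; a real rule base would be far larger.
PRESET_SENTENCES = {"turn on the air conditioner", "open the window", "play music"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace (illustrative only)."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(cleaned.split())


def rule_match(text: str, presets=PRESET_SENTENCES) -> bool:
    """Return True (valid) if the input matches a preset sentence, else False."""
    return normalize(text) in presets
```

Exact matching after normalization is the strictest possible rule; pattern-based rules would fit the same interface.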

The inference module can be a prediction model trained on large-scale data, using either a neural network (deep learning) or traditional machine learning (for example, a supervised learning model such as a support vector machine (SVM)). By inputting the acquired voice information into the inference module, the device can predict the probability that the voice information is valid, or directly output whether it is valid.

The decision-making module can apply a comprehensive judgment condition to the processing result of at least one of the rule matching module and the inference module, and make the final decision on whether the voice information is valid, which can improve the accuracy of the validity judgment. The comprehensive judgment condition is introduced later and is not described in detail here.

It should be noted that the above invalid rejection model can also be called an invalid recognition model, a validity judgment model, and so on. The following description takes the invalid rejection model as an example. The name of the model used to judge the validity of the voice information obtained by the device does not constitute a limitation on this application.

In order to better understand the voice information processing method provided by the embodiments of the present application, the following exemplarily introduces the system architecture to which the voice information processing method is applicable.

Referring to FIG. 1, FIG. 1 exemplarily shows a diagram of a system architecture to which the voice information processing method provided by the present application is applicable. The system architecture may include an audio manager 110, a video manager 120, a memory 130 and a processor 140, which may be connected by a bus 150.

Audio manager 110 may include a speaker and microphone array. A loudspeaker is a transducer that converts electrical signals into sound signals, and is used to output the sound of the device. A microphone is an energy conversion device that converts a sound signal into an electrical signal, and is used to collect human voice and other sound information.

Video manager 120 may include an array of cameras. Cameras can convert optical image signals into electrical signals for storage or transmission.

The memory 130 is used to store computer programs and data. The memory 130 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), compact disc read-only memory (CD-ROM), and the like.

In this application, the memory 130 may store computer programs or codes for models such as automatic speech recognition model, voiceprint recognition model, computer vision model, invalid recognition model, natural language understanding model, dialogue management model, and speech synthesis model.

The processor 140 may be a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. A processor may also be a combination that performs computing functions, such as a combination comprising one or more microprocessors, a combination of a digital signal processor and a microprocessor, and the like. The processor 140 may be configured to read the computer program and data stored in the above-mentioned memory 130, and execute the voice information processing method provided by the embodiment of the present application.

This application does not limit the type of the bus 150. For example, the bus 150 may be a desktop bus (D-BUS), which is an inter-process communication (IPC) mechanism optimized for desktop environments and is used for communication between processes or between a process and the kernel. Alternatively, the bus 150 may be a data bus (DB), an address bus (AB), a control bus (CB), and the like.

Exemplarily, the system architecture shown in FIG. 1 may be the system architecture of a terminal device, a server or another device. The terminal device may include, but is not limited to, any device based on an intelligent operating system that can perform human-computer interaction with the user through input devices such as keyboards, virtual keyboards, touchpads, touchscreens and voice-activated devices, for example smart phones, tablet computers, handheld computers, wearable electronic devices or in-vehicle devices (such as in-vehicle computers). The server may be an edge server or a cloud server, and may be a virtual server or a physical server, which is not limited in this application.

The system architecture shown in FIG. 1 above is only an example, and does not constitute a limitation on the system architecture applicable to the embodiments of the present application.

The following describes a voice information processing method provided by an embodiment of the present application. The method can be applied to the system architecture shown in FIG. 1 above; that is, the method can be executed by the above terminal device, server or other device, or by a processing device such as a chip or a processor in the terminal device or the server. The execution body of the method is collectively referred to as a device in the following description. Optionally, if the execution body of the method is a server or a chip or processor in the server, the terminal device may first receive the voice information and then send the received voice information to the server for processing. The voice information sent by the terminal device to the server may be the original information received by the terminal device, or may be voice information preprocessed by the terminal device.

Referring to FIG. 2 , a voice information processing method provided by an embodiment of the application may include, but is not limited to, the following steps:

S201. Acquire first voice information.

In a specific embodiment, the device may receive the user's voice signal through a microphone. The device can then recognize the voice signal through the automatic speech recognition (ASR) model to obtain the voice information corresponding to the voice signal, and the voice information can include text information and the like.

Specifically, the voice interaction function between the device and the user can be woken up by receiving a wake-up signal from the user, for example, receiving a specific wake-up word from the user. After being woken up, the device can detect and receive the user's voice signal through the microphone, and the process of detecting and receiving the user's voice signal can be referred to as a listening process of the device. In order to reduce the repetitive operation of having to wake up the device every time a voice control command is issued, there are currently two main listening methods: continuous listening and full-time listening.

The continuous listening mode means that, after the device is woken up or a voice command is executed successfully, the device does not need to be woken up again within a period of time (such as 30 s); during this period it can continue to listen, perform voice interaction with the user, and execute the user's voice control commands.

The full-time listening mode means that the device only needs to be woken up once after it is started; until the device is turned off, it can listen all the time, perform voice interaction with the user, and execute the user's voice control instructions.
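The difference between the two listening modes can be sketched as a check on whether the device currently accepts speech without a new wake-up. The names `ListeningMode` and `is_listening` are hypothetical; the 30-second window is the example value given above, and measuring it from the last wake-up or successful command follows the description of continuous listening.

```python
from enum import Enum
from typing import Optional


class ListeningMode(Enum):
    CONTINUOUS = "continuous"
    FULL_TIME = "full_time"


def is_listening(mode: ListeningMode, now_s: float, last_trigger_s: Optional[float],
                 powered_on: bool = True, window_s: float = 30.0) -> bool:
    """Return True if the device accepts speech without being woken up again.

    last_trigger_s is the time of the last wake-up or successful voice command,
    or None if the device has not been woken up since it was started.
    """
    if not powered_on or last_trigger_s is None:
        return False
    if mode is ListeningMode.FULL_TIME:
        # Woken up once: keeps listening until the device is turned off.
        return True
    # Continuous listening: only within the window after the last trigger.
    return 0 <= now_s - last_trigger_s <= window_s
```

For instance, 25 seconds after the last successful command a continuously listening device still accepts speech, while at 45 seconds it would need to be woken up again.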

The above-mentioned first voice information may be voice information corresponding to any voice signal received by the device in the listening stage.

S202. Adjust a judgment condition based on an influencing factor of the validity of the first voice information, where the judgment condition is one or more judgment conditions in an invalid recognition model for judging the validity of the first voice information.

To facilitate understanding of the above invalid rejection model, refer to FIG. 3, which exemplarily shows a processing flow diagram of the invalid rejection model. First, the invalid rejection model receives voice information, for example the above first voice information, and selects, based on the voice information and a preset selection condition, a pre-judgment module for judging the validity of the voice information; that is, it selects at least one of the above inference module and rule matching module to pre-judge the validity of the voice information.

The selection condition may be a condition set based on factors affecting the validity of the voice information. Exemplarily, the selection condition may be: when the listening time of the device is greater than a first threshold, the rule matching module is selected to judge the validity of the voice information; when the listening time of the device is less than a second threshold, the inference module is selected to judge the validity of the voice information; and when the listening time of the device is between the second threshold and the first threshold, both the rule matching module and the inference module are selected to judge the validity of the voice information. It should be noted that the factors influencing the validity of the voice information are not limited to the listening time of the device; they are introduced in detail below and are not described here.
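The example selection condition can be sketched as follows. The function name `select_modules` and the default threshold values (20 s and 10 s) are assumptions for illustration, and treating a listening time exactly equal to a threshold as "between" is also an assumption, since the text does not specify the boundary cases.

```python
def select_modules(listening_time_s: float,
                   first_threshold_s: float = 20.0,
                   second_threshold_s: float = 10.0) -> set:
    """Choose the pre-judgment module(s) from the device's listening time.

    Assumes second_threshold_s <= first_threshold_s.
    """
    if listening_time_s > first_threshold_s:
        return {"rule"}          # long listening time: strict rule matching
    if listening_time_s < second_threshold_s:
        return {"inference"}     # short listening time: inference module
    return {"rule", "inference"}  # in between: use both modules
```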

If only the inference module is selected to pre-judge the validity of the voice information, the device inputs the acquired voice information into the inference module and obtains an output result after calculation. Exemplarily, the output result may be the predicted probability that the input voice information is valid, and this probability is then compared with a preset judgment threshold to obtain a pre-judgment result. Specifically, if the probability is greater than the judgment threshold, the pre-judgment result is that the input voice information is valid; if the probability is less than the judgment threshold, the pre-judgment result is that the input voice information is invalid. For example, assume that the judgment threshold is 70%, so that voice information is determined to be valid as long as its predicted probability of being valid is greater than 70%. If the probability predicted by the inference module is 80%, which is greater than 70%, the voice information is valid. If the predicted probability is 50%, which is less than 70%, the voice information is invalid.

It should be noted that the result output by the above inference module is not limited to the probability that the voice information is valid, and can also take other forms, such as a score, where a score exceeding the judgment threshold indicates that the voice information is valid. This is not limited in this solution.
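The threshold comparison can be sketched in a few lines. The function name `prejudge` is hypothetical; the 0.70 default is the example threshold from the text. The text does not say how an output exactly equal to the threshold is handled, so treating it as invalid here is an assumption.

```python
def prejudge(score: float, judgment_threshold: float = 0.70) -> bool:
    """Return True (valid) if the inference output exceeds the threshold.

    `score` may be a probability in [0, 1] or any other score, as long as the
    threshold is expressed on the same scale.
    """
    return score > judgment_threshold
```

With the example values, `prejudge(0.80)` is a valid pre-judgment and `prejudge(0.50)` is an invalid one; a score-based output works the same way, for example `prejudge(85, judgment_threshold=70)`.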

If only the rule matching module is selected to pre-judge the validity of the voice information, the device inputs the acquired voice information into the rule matching module, and the rule matching module compares the input voice information with the information in the preset rule base to obtain the pre-judgment result. If information in the preset rule base matches the input voice information, the pre-judgment result is that the input voice information is valid. Conversely, if no information in the preset rule base matches the input voice information, the pre-judgment result is that the input voice information is invalid.

In the above case where only the inference module or only the rule matching module is selected to pre-judge the validity of the voice information, after the pre-judgment result is obtained, it can be input into the decision-making module, and the decision-making module can use the comprehensive judgment condition to judge whether the pre-judgment result is reasonable, so as to output a final indication of whether the voice information is valid. For example, suppose the comprehensive judgment condition is that valid voice information includes no fewer than 3 characters. If the input voice information contains fewer than 3 characters and the pre-judgment result output by the inference module or the rule matching module is that the voice information is valid, the pre-judgment result is unreasonable; the decision-making module then determines that the voice information is invalid and outputs final indication information indicating that the voice information is invalid. Otherwise, if the input voice information contains no fewer than 3 characters, the pre-judgment result output by the inference module or the rule matching module is reasonable; the decision-making module finally determines that the voice information is valid and outputs indication information indicating that the voice information is valid.
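The reasonableness check in this example can be sketched as follows. The names `count_characters` and `decide` are hypothetical. Counting non-whitespace characters is an assumption (the text does not define what counts as a character), and leaving an invalid pre-judgment unchanged is also an assumption, since the example only discusses pre-judgments of "valid".

```python
def count_characters(text: str) -> int:
    """Count non-whitespace characters (an illustrative definition)."""
    return sum(1 for ch in text if not ch.isspace())


def decide(text: str, prejudged_valid: bool, min_chars: int = 3) -> bool:
    """Final decision: overrule a 'valid' pre-judgment on too-short input."""
    if prejudged_valid and count_characters(text) < min_chars:
        return False  # pre-judgment is unreasonable, so the input is invalid
    return prejudged_valid
```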

It should be noted that the above comprehensive judgment condition is not limited to the above example, and may also take other forms. In a possible implementation, the comprehensive judgment condition may be a voting mechanism: if the votes for the voice information being valid are in the majority, the voice information is determined to be valid, and if the votes for the voice information being invalid are in the majority, the voice information is determined to be invalid.

Alternatively, in a possible implementation, when only the inference module or only the rule matching module is selected to pre-judge the validity of the voice information, no comprehensive judgment is needed, and the result output by the inference module or the rule matching module is used directly as the final result of the invalid rejection model.

If the inference module and the rule matching module are both selected to pre-judge the validity of the voice information, the acquired voice information is input into the inference module and the rule matching module respectively, and each module pre-judges the validity of the voice information according to its own process (see the above description, which is not repeated here) to obtain its own pre-judgment result. The two pre-judgment results are then input into the decision-making module, which makes a final decision on them based on the comprehensive judgment condition and outputs the final result of the invalid rejection model.

Exemplarily, the comprehensive judgment condition may be that valid voice information includes no fewer than 3 characters. The decision-making module then checks the reasonableness of the above two pre-judgment results based on the comprehensive judgment condition. For the specific checking process, refer to the previous description, which is not repeated here.

Exemplarily, in a possible implementation, the comprehensive judgment condition may be a voting mechanism; that is, if the votes for the voice information being valid are in the majority, the voice information is determined to be valid, and if the votes for the voice information being invalid are in the majority, the voice information is determined to be invalid. If both of the above pre-judgment results are valid, the final judgment result of the voice information is also valid. If both pre-judgment results are invalid, the final judgment result is also invalid. If one of the two pre-judgment results is valid and the other is invalid, a further judgment can be made, for example according to priority. If the priority of the inference module is higher than that of the rule matching module, the pre-judgment result of the inference module is output as the final result. If the priority of the rule matching module is higher than that of the inference module, the pre-judgment result of the rule matching module is output as the final result.
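The voting mechanism with a priority tie-break can be sketched as below. The function name `vote` and the string values of `priority` are hypothetical; the logic follows the three cases described above.

```python
def vote(inference_valid: bool, rule_valid: bool, priority: str = "inference") -> bool:
    """Combine two pre-judgment results; break a 1:1 tie by module priority."""
    if inference_valid == rule_valid:
        # Both modules agree, so the final result is their common result.
        return inference_valid
    if priority == "inference":
        return inference_valid
    if priority == "rule":
        return rule_valid
    raise ValueError(f"unknown priority: {priority!r}")
```

With only two voters a disagreement is always a tie, which is why the priority rule is needed in this case.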

It should be noted that the above comprehensive judgment condition is only an example, and its main purpose is to more accurately synthesize the pre-judgment results of the reasoning module and/or the rule matching module to judge the validity of the acquired voice information. In the specific embodiment, the comprehensive judgment condition may also be other conditions that can achieve the purpose, which is not limited in this solution.

Based on the above description of FIG. 3, the judgment conditions in S202 may include one or more of the following: the selection condition in the invalid rejection model, the judgment threshold applied to the output result of the inference module, and the comprehensive judgment condition. That is, in this application, to improve the accuracy of valid speech recognition and reduce the false trigger rate of invalid speech in different voice interaction scenarios, the above judgment conditions are flexibly adjusted based on one or more factors that can affect the judgment of the validity of the input voice information, so that the validity recognition of voice information is more flexible and better suited to the context and scenario at the time.

In a possible implementation manner, the above-mentioned adjustment of the decision condition based on the influencing factor of the validity of the first voice information may be:

When it is determined, based on one or more factors influencing the validity of voice information, that the first voice information is more likely to be valid than invalid, the sensitivity of the judgment condition is increased; the higher the sensitivity of the judgment condition, the higher the probability that the first voice information is determined to be valid through the judgment condition. When it is determined, based on one or more factors influencing the validity of voice information, that the first voice information is less likely to be valid than invalid, the sensitivity of the judgment condition is decreased; the lower the sensitivity of the judgment condition, the lower the probability that the first voice information is determined to be valid through the judgment condition. The sensitivity of the judgment condition and the specific adjustment process are introduced below and are not described in detail here.

Optionally, the above-mentioned influencing factors that can affect the validity recognition of the input speech information may include one or more of the following:

The environment in which the voice information is generated; the continuous listening time of the device; a first time interval between the time when the device obtains the voice information and the last time it obtained valid voice information; a second time interval between the time when the device obtains the voice information and the last time it obtained invalid voice information; the proportion of valid voice information and invalid voice information in a first preset time period before the device obtains the voice information; a first degree of association between the semantics of the voice information and the semantics of the valid voice information last obtained by the device; a second degree of association between the semantics of the voice information and the semantics of the invalid voice information last obtained by the device; a third degree of association between the first voice information and the valid voice information last obtained by the device; the state of the voice dialogue between the device and the user up to the time the voice information is obtained; a first similarity between the acoustic features of the voice information and those of historically valid voice information; and a second similarity between the acoustic features of the voice information and those of historically invalid voice information.

In a possible implementation, after acquiring the above first voice information, the device can adjust the selection condition in the above invalid rejection model based on a first factor, and the first factor can include one or more of the above influencing factors. The specific adjustment process is introduced later and is not described in detail here.

In a possible implementation, after the device obtains the above first voice information, it can adjust the judgment threshold applied to the output result of the inference module in the above invalid rejection model based on a second factor, and the second factor can include one or more of the above influencing factors. The influencing factors included in the second factor and those included in the first factor may be different, partially the same, or completely the same, as determined by the actual situation, which is not limited in this solution. The specific adjustment process is introduced later and is not described in detail here.

In a possible implementation, after the device obtains the above first voice information, it can adjust the comprehensive judgment condition of the decision-making module in the above invalid rejection model based on a third factor, and the third factor can include one or more of the above influencing factors. The influencing factors included in the third factor may be different from, partially the same as, or completely the same as those included in the first factor and the second factor, as determined by the actual situation, which is not limited in this solution. The specific adjustment process is introduced later and is not described in detail here.

In a specific implementation, the above selection condition, judgment threshold and comprehensive judgment condition can all be adjusted together, or only one or two of them can be adjusted. The specific choice can be made according to actual needs, which is not limited in this solution.

S203. Under the condition that the first voice information is determined to be valid based on the adjusted judgment condition, perform semantic understanding on the first voice information, and execute an instruction of the first voice information.

In a specific embodiment, after the device acquires the above first voice information and adjusts the judgment conditions in the invalid rejection model based on the above influencing factors, the device identifies the validity of the first voice information based on the adjusted invalid rejection model.

In a possible implementation, if the device adjusts the selection condition in the above invalid rejection model, the device can select, based on the adjusted selection condition, one or both of the above rule matching module and inference module to pre-judge the validity of the first voice information.

In a possible implementation, if the device adjusts the judgment threshold of the above inference module, and the pre-judgment module selected by the device for judging the validity of the first voice information includes the inference module, then after the inference module outputs data indicating the validity of the first voice information, the device can judge whether the first voice information is valid based on that data and the adjusted judgment threshold.

In a possible implementation, if the device adjusts the comprehensive judgment condition of the decision-making module in the above invalid rejection model, then after obtaining the pre-judgment results of the above rule matching module and/or inference module, the device can make a comprehensive judgment on those pre-judgment results based on the adjusted comprehensive judgment condition, so as to determine the validity of the above first voice information.

For the specific process of the validity identification of the above-mentioned first voice information, reference may be made to the description about the above-mentioned FIG. 3 , which will not be repeated here.

When the above first voice information is valid, the device starts to perform semantic understanding of the first voice information. Specifically, the processor in the device can call the natural language understanding model in the memory to perform semantic understanding of the first voice information and obtain its specific meaning. After understanding the meaning of the first voice information, the device performs a corresponding operation based on the meaning to provide the user with the desired service. For the device, the meaning of the first voice information is a control instruction for executing the corresponding operation.

The following describes, for different factors influencing the validity of voice information, the process of adjusting the judgment condition used in recognizing the validity of the above first voice information. It should be noted that the judgment condition may include one or more of the selection condition, the judgment threshold and the comprehensive judgment condition in the above invalid rejection model, and the adjustment process described below can be applied to the adjustment of one or more of them.

Before introducing the adjustment process, first introduce the relevant concepts involved in the adjustment process:

Sensitivity of the judgment condition: The sensitivity refers to the degree of relaxation and strictness of the judgment condition. The stricter the judgment condition, the lower the sensitivity, and the looser the judgment condition, the higher the sensitivity.

Exemplarily, consider the above selection condition for selecting a pre-judgment module. In general, the inference module pre-judges the validity of voice information by fuzzy matching, whereas the rule matching module pre-judges by pattern matching, so its result is a definite yes or no and is relatively strict. Therefore, when selecting a pre-judgment module, if the voice information obtained by the device has a high probability of being valid, either the inference module or the rule matching module can be selected for pre-judgment; alternatively, to improve the accuracy of recognizing valid voice information, the inference module can be selected. If the probability that the voice information obtained by the device is valid is small, the rule matching module can be selected for pre-judgment in order to effectively avoid false triggering by invalid information.

For example, suppose the selection conditions are: if the listening time of the device is less than 10 seconds, the inference module is selected for pre-judgment; if the listening time of the device is greater than 20 seconds, the rule matching module is selected; and if the listening time of the device is between 10 seconds and 20 seconds, both the inference module and the rule matching module are selected. To better filter invalid information and reduce false triggering, the device can adjust the selection conditions in the stricter direction, that is, lower the sensitivity of the selection conditions. For example, the selection conditions can be adjusted to: if the listening time of the device is less than 5 seconds, the inference module is selected; if the listening time is greater than 10 seconds, the rule matching module is selected; and if the listening time is between 5 seconds and 10 seconds, both the inference module and the rule matching module are selected. Conversely, to better recognize valid voice information, the device can adjust the selection conditions in the looser direction, that is, increase the sensitivity of the selection conditions. For example, the selection conditions can be adjusted to: if the listening time of the device is less than 15 seconds, the inference module is selected; if the listening time is greater than 25 seconds, the rule matching module is selected; and if the listening time is between 15 seconds and 25 seconds, both the inference module and the rule matching module are selected.
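The effect of tightening or loosening the selection conditions can be seen by applying the three sets of thresholds from this example to the same listening time. The class name `SelectionCondition` and the preset names are hypothetical, and treating a listening time exactly equal to a threshold as "between" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionCondition:
    inference_below_s: float  # below this listening time: inference module only
    rule_above_s: float       # above this listening time: rule matching only

    def select(self, listening_time_s: float) -> set:
        if listening_time_s < self.inference_below_s:
            return {"inference"}
        if listening_time_s > self.rule_above_s:
            return {"rule"}
        return {"inference", "rule"}


STANDARD = SelectionCondition(inference_below_s=10, rule_above_s=20)
STRICT = SelectionCondition(inference_below_s=5, rule_above_s=10)   # lower sensitivity
LOOSE = SelectionCondition(inference_below_s=15, rule_above_s=25)   # higher sensitivity
```

For a listening time of 12 seconds, the standard condition selects both modules, the strict condition selects only the rule matching module, and the loose condition selects only the inference module.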

Exemplarily, for the judgment threshold of the above inference module, assume that the standard judgment threshold is 70%; that is, the voice information is determined to be valid when the probability of validity predicted by the inference module is greater than 70%. If the judgment threshold is adjusted to 80%, the judgment condition is adjusted in the stricter direction: the probability predicted by the inference module must now be greater than 80% before the voice information can be judged valid, so the sensitivity of the judgment condition is reduced. If the judgment threshold is adjusted to 60%, the judgment condition is adjusted in the looser direction: the probability predicted by the inference module only needs to be greater than 60% for the voice information to be judged valid, so the sensitivity of the judgment condition is increased.

Exemplarily, for the above comprehensive judgment condition, assume that the comprehensive judgment condition is that valid voice information includes no fewer than 3 characters. If the comprehensive judgment condition is adjusted so that valid voice information must include no fewer than 5 characters, the requirement on the voice information becomes stricter, so the sensitivity of the comprehensive judgment condition is reduced. If the comprehensive judgment condition is adjusted so that valid voice information must include no fewer than 2 characters, the requirement on the voice information becomes looser, so the sensitivity of the comprehensive judgment condition is increased.
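To illustrate how the same input can be judged differently at different sensitivities, the following sketch bundles the example judgment thresholds (60%, 70%, 80%) and minimum character counts (2, 3, 5) into three presets. Combining both checks in one `JudgmentCondition` class is a simplification for illustration; in the model described above the threshold belongs to the inference module's pre-judgment and the character count to the decision-making module. Counting non-whitespace characters is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgmentCondition:
    threshold: float  # judgment threshold on the predicted validity probability
    min_chars: int    # minimum number of characters in valid voice information

    def is_valid(self, probability: float, text: str) -> bool:
        n_chars = sum(1 for ch in text if not ch.isspace())
        return probability > self.threshold and n_chars >= self.min_chars


STRICT = JudgmentCondition(threshold=0.80, min_chars=5)    # lower sensitivity
STANDARD = JudgmentCondition(threshold=0.70, min_chars=3)
LOOSE = JudgmentCondition(threshold=0.60, min_chars=2)     # higher sensitivity
```

A four-character input predicted valid with probability 0.75 passes the standard and loose conditions but is rejected by the strict one; a two-character input with probability 0.65 passes only the loose condition.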

Negatively correlated sensitivity adjustment: when the value corresponding to the influencing factor increases, the sensitivity is lowered, and the larger the increase, the more the sensitivity is lowered; when the value corresponding to the influencing factor decreases, the sensitivity is raised, and the larger the decrease, the more the sensitivity is raised.

Positively correlated sensitivity adjustment: when the value corresponding to the influencing factor increases, the sensitivity is raised, and the larger the increase, the more the sensitivity is raised; when the value corresponding to the influencing factor decreases, the sensitivity is lowered, and the larger the decrease, the more the sensitivity is lowered.

It should be noted that the specific amount of each sensitivity adjustment described in this application can be set according to the actual situation, which is not limited in this application. In addition, the sensitivity of the above judgment condition can only be adjusted within a range. For example, for the above judgment threshold, the maximum is 100% and the minimum is 0. The adjustment range of the sensitivity of the judgment condition is determined according to the actual situation, which is not limited in this application.
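The two adjustment directions and the bounded range can be sketched as follows, taking the judgment threshold as the judgment condition. This is only an illustrative sketch: the linear rule, the name `adjust_threshold` and the parameter `step` are assumptions, since the application leaves the specific adjustment amount open.

```python
def adjust_threshold(threshold: float, factor_delta: float,
                     step: float, correlation: str) -> float:
    """Shift a judgment threshold in response to a change in an influencing factor.

    factor_delta: change in the influencing factor (positive = increase).
    step: assumed sensitivity change per unit of factor change.
    correlation: "positive" or "negative".
    A higher sensitivity corresponds to a lower threshold.
    """
    if correlation not in ("positive", "negative"):
        raise ValueError("correlation must be 'positive' or 'negative'")
    sign = 1.0 if correlation == "positive" else -1.0
    sensitivity_change = sign * factor_delta * step
    new_threshold = threshold - sensitivity_change
    # Keep the threshold inside the 0 to 100% range mentioned above.
    return min(1.0, max(0.0, new_threshold))


# The factor rises by 2 units: negative correlation tightens the condition,
# positive correlation relaxes it.
print(adjust_threshold(0.70, 2, 0.05, "negative"))
print(adjust_threshold(0.70, 2, 0.05, "positive"))
```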

First, the process of adjusting the above judgment condition is introduced based on the influencing factor of the environmental conditions in which the above-mentioned first voice information is generated. Exemplarily, the environmental conditions in which the first voice information is generated include one or more of the following: the number of speakers within the second preset duration up to the time the device acquires the first voice information (hereinafter referred to as the number of speakers), the number of people within a preset range when the first voice information is generated (hereinafter referred to as the number of people around), the confidence of the first voice information, the signal-to-noise ratio of the first voice information, and so on. The number of speakers specifically refers to the number of different voiceprints included in the first voice information. Because each person has a different voiceprint, the number of voiceprints can be used to represent the number of people speaking in the first voice information.

Referring to FIG. 4, FIG. 4 takes the above listed several environmental influence factors as examples to describe how to adjust the above judgment conditions based on the environmental influence factors.

During the process of acquiring the above-mentioned first voice information, the device may acquire the number of people around and the number of speakers of the first voice information. Specifically, the device can call the computer vision model in the memory to drive the camera to capture pictures or videos of the surrounding environment, and then analyze the captured pictures and videos to obtain the number of people around and the number of speakers. The number of speakers can be obtained by analyzing which people's mouths are moving in the video within the second preset duration. The number of people around includes the number of speakers. The second preset duration may be, for example, 5 seconds, 10 seconds, or 1 minute, which is not limited in this application.

Alternatively, the device can identify the voiceprint features in the voice signal received by the device within the second preset duration by calling the voiceprint recognition model in the memory, and the number of different voiceprint features identified is the number of speakers. Optionally, the voiceprint recognition model may be a dynamic monitoring model to flexibly adapt to voiceprint recognition in different situations.

After the above device obtains the number of people around (assume m people, where m is a positive integer) and the number of speakers (assume n people, where n is a positive integer), it first determines whether the number of speakers n is 0. If n is 0, the above first voice information does not include human voice information, and the corresponding judgment condition does not need to be adjusted.

If the number of speakers n is not 0, it indicates that the first voice information includes human voice information. Further, it is judged whether the number of people around m is greater than 1. If m is not greater than 1, it can be judged whether m is 1.

If m is 1, there is only one person in the surrounding environment, and the first voice information uttered by that person is very likely to be a voice control command addressed to the device. In this case, the sensitivity of the judgment condition can be adjusted higher so that the validity of the first voice information is better recognized.

Or, if m is 1, by default, the currently acquired first voice information is a voice control instruction for the device, which is valid information. Then, the sensitivity of the judgment condition can be adjusted to the highest, or the invalid rejection model does not further perform validity judgment, and directly outputs an indication that the first voice information is valid.

If m is not 1, there may be an error in detection, and the sensitivity of the decision condition cannot be adjusted based on this information, so it is not adjusted.

When the number of speakers n is not 0 and the number of people around m is greater than 1, the first voice information is likely to be small talk, which may be invalid voice information for the device. The sensitivity of the decision condition can therefore be lowered according to the size of m: the larger the number of people around m, the lower the sensitivity of the decision condition. This is because the more people there are around, the higher the probability that the first voice information is chat. Stricter judgment conditions therefore need to be set to identify the validity of the first voice information, so as to prevent invalid voice information from falsely triggering related service operations and wasting the resources of the device.
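The branching on n and m described above for FIG. 4 can be summarized in a short sketch that uses the judgment threshold as the judgment condition. The function name, the per-person step of 2% and the relaxed threshold of 60% are illustrative assumptions; the application does not fix these amounts.

```python
def adjust_for_people(threshold: float, n_speakers: int, m_around: int,
                      step: float = 0.02, relaxed: float = 0.60) -> float:
    """Return an adjusted judgment threshold from the speaker and bystander counts."""
    if n_speakers == 0:
        # No human voice in the first voice information: no adjustment.
        return threshold
    if m_around > 1:
        # Several people around: chat is more likely, so the condition is
        # tightened, and more so as m grows (assumed linear rule).
        return min(1.0, threshold + step * (m_around - 1))
    if m_around == 1:
        # A single person: the utterance is probably meant for the device,
        # so the condition is relaxed (sensitivity raised).
        return min(threshold, relaxed)
    # m == 0 while n != 0: likely a detection error, so no adjustment.
    return threshold


print(adjust_for_people(0.70, n_speakers=2, m_around=4))
print(adjust_for_people(0.70, n_speakers=1, m_around=1))
```

The alternative described above for m equal to 1, in which the validity judgment is skipped and the first voice information is directly reported as valid, would replace the `m_around == 1` branch with an early acceptance.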

In addition, after acquiring the first voice information, the device can call the automatic speech recognition model in the memory to calculate the confidence of the first voice information, or use the channel information to calculate the signal-to-noise ratio of the first voice information, or calculate both the confidence and the signal-to-noise ratio, and then adjust the sensitivity of the decision condition based on the confidence and/or the signal-to-noise ratio.

Specifically, the sensitivity of the decision condition can be adjusted in negative correlation with the confidence and/or the signal-to-noise ratio. The higher the confidence, the higher the probability that the first voice information is correctly recognized; the higher the signal-to-noise ratio, the better the quality of the collected first voice information. In this case, even if the judgment condition is strict, the validity of the first voice information can still be well recognized, and invalid chat voice can be effectively filtered out.

Conversely, the lower the confidence, the smaller the probability that the first voice information is correctly recognized; the lower the signal-to-noise ratio, the worse the quality of the collected first voice information, and the recognition of the voice content may be wrong. In this case, in order to improve the robustness of the voice interaction of the device, the sensitivity of the decision condition can be appropriately increased, that is, the decision condition can be relaxed, so that the validity of the first voice information can be better recognized.

Exemplarily, the device may set a confidence threshold and/or a signal-to-noise ratio threshold for voice information. If the confidence of the first voice information is greater than the confidence threshold and/or the signal-to-noise ratio is greater than the signal-to-noise ratio threshold, then the higher the confidence and/or the signal-to-noise ratio, the lower the sensitivity of the decision condition is adjusted. If the confidence of the first voice information is smaller than the confidence threshold and/or the signal-to-noise ratio is smaller than the signal-to-noise ratio threshold, then the lower the confidence and/or the signal-to-noise ratio, the higher the sensitivity of the decision condition is adjusted. The confidence threshold may be, for example, 50% or 60%, and the signal-to-noise ratio threshold may be, for example, 50 dB or 60 dB. The present application does not limit the confidence threshold or the signal-to-noise ratio threshold.

Exemplarily, in a possible implementation manner, the device does not need to set the confidence threshold and/or the signal-to-noise ratio threshold of the voice information, but can instead set a corresponding adjusted decision condition for each range of confidence and/or signal-to-noise ratio. For example, taking the judgment condition as the judgment threshold of the above inference model, and assuming that the initial judgment threshold is 70%: within the confidence range of 0 to 30%, the sensitivity can be increased and the judgment threshold set to 50%; within the confidence range of 31% to 60%, the judgment threshold can be set to 60%; within the confidence range of 61% to 70%, no adjustment is made and the original 70% threshold is kept; within the confidence range of 71% to 100%, the sensitivity can be lowered and the judgment threshold set to 80%.
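The range-based mapping in the example above can be written as a small lookup table. The sketch below assumes that each range is closed at its upper bound, so that a confidence between 30% and 31% falls into the next range; the text does not specify how such values are handled. The names `UPPER_BOUNDS`, `THRESHOLDS` and `threshold_for_confidence` are hypothetical.

```python
import bisect

# Upper bound of each confidence range and the judgment threshold used in it,
# taken from the example in the text (initial threshold 70%).
UPPER_BOUNDS = [0.30, 0.60, 0.70, 1.00]
THRESHOLDS = [0.50, 0.60, 0.70, 0.80]


def threshold_for_confidence(confidence: float) -> float:
    """Look up the adjusted judgment threshold for a recognition confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # bisect_left returns the index of the first range whose upper bound is
    # greater than or equal to the confidence.
    index = bisect.bisect_left(UPPER_BOUNDS, confidence)
    return THRESHOLDS[index]


for c in (0.10, 0.45, 0.65, 0.90):
    print(c, threshold_for_confidence(c))
```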

It should be noted that, for the above-mentioned influencing factors of the number of speakers n, the number of people around m, the confidence and the signal-to-noise ratio, the device can adjust the sensitivity of the decision condition based on any one of them individually. Alternatively, the device can adjust the sensitivity of the decision condition comprehensively based on any several of these influencing factors. Exemplarily, a weight may be configured for each of the multiple influencing factors, and the sensitivity of the decision condition may be adjusted in a weighted manner. For example, for the adjustment of the above judgment threshold, assume that the adjustment is based on three influencing factors: the number of people around m, the confidence and the signal-to-noise ratio. The weights corresponding to the three factors are w1, w2 and w3, and the adjusted judgment thresholds calculated from the three factors individually are a1, a2 and a3. Then the adjusted judgment threshold determined by combining the three factors is (a1*w1+a2*w2+a3*w3). It should be noted that this weighted combination is only an example. In actual implementation, the largest or the smallest adjustment among the multiple influencing factors may instead be taken as the final adjustment result, and so on. This solution does not limit the specific calculation process of the combination.
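A minimal sketch of the combination step follows. It implements the weighted sum (a1*w1+a2*w2+a3*w3) and, as one possible reading of the alternative mentioned above, the selection of the candidate threshold that deviates most or least from the unadjusted threshold. The function name, the mode names and the sample weights are assumptions; in particular, the text does not state that the weights sum to 1.

```python
from typing import Optional, Sequence


def combine_thresholds(candidates: Sequence[float],
                       weights: Optional[Sequence[float]] = None,
                       mode: str = "weighted",
                       base: float = 0.70) -> float:
    """Combine per-factor adjusted thresholds into one final threshold."""
    if mode == "weighted":
        if weights is None or len(weights) != len(candidates):
            raise ValueError("one weight per candidate threshold is required")
        return sum(a * w for a, w in zip(candidates, weights))
    if mode == "most":
        # Candidate that moved furthest from the unadjusted threshold.
        return max(candidates, key=lambda a: abs(a - base))
    if mode == "least":
        # Candidate that moved least from the unadjusted threshold.
        return min(candidates, key=lambda a: abs(a - base))
    raise ValueError("unknown mode: " + mode)


# a1, a2, a3 from the number of people around, the confidence and the SNR.
a = [0.85, 0.60, 0.70]
w = [0.5, 0.3, 0.2]  # hypothetical weights w1, w2, w3
print(combine_thresholds(a, w))
print(combine_thresholds(a, mode="most"))
print(combine_thresholds(a, mode="least"))
```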

Referring to FIG. 5, FIG. 5 exemplarily shows a schematic diagram of adjusting the sensitivity of the judgment condition based on three influencing factors: the duration of continuous listening before the device acquires the above-mentioned first voice information (hereinafter referred to as t1), the first time interval between the device acquiring the first voice information and most recently acquiring valid voice information (hereinafter referred to as Δt1), and the second time interval between the device acquiring the first voice information and most recently acquiring invalid voice information (hereinafter referred to as Δt2).

Specifically, after the device acquires the above-mentioned first voice information, it can acquire the duration t1 for which the device has been continuously listening up to the time the first voice information is acquired, the first time interval Δt1 between the acquisition of the first voice information and the most recent acquisition of valid voice information, and the second time interval Δt2 between the acquisition of the first voice information and the most recent acquisition of invalid voice information. Exemplarily, t1, Δt1 and Δt2 can be obtained through timing by a timer and calculation.

After obtaining the t1, the device can adjust the sensitivity of the above judgment condition based on the negative correlation of the t1, that is, the longer the duration t1 of continuous listening is, the lower the sensitivity of the judgment condition is adjusted. This is because when the device is woken up, it begins to enter a new round of continuous listening stage. Generally, the user's voice information obtained by the device in the early stage of the continuous listening stage is more likely to be effective. With the passage of time, the voice information obtained by the device is more likely to be chat information between users. To reduce false triggering, the sensitivity needs to be reduced. Therefore, the device can adjust the sensitivity of the above judgment conditions based on the negative correlation of the continuous listening time length.

To facilitate understanding of adjusting the sensitivity of the decision condition in negative correlation with t1, an example is given. Assume that the judgment condition is the judgment threshold of the output result of the above inference module. In the initial stage of continuous listening, the judgment threshold can be 60%, so the condition is relatively loose and the sensitivity is high. As t1 gradually increases, each time t1 increases by one unit interval (such as a 5-second interval), the judgment threshold is increased by a preset increment, such as 1%. That is, as t1 increases, the judgment threshold becomes larger and larger, the condition becomes stricter and stricter, and the sensitivity gradually decreases. It should be noted that this is just an example, and the present application does not limit the specific negative correlation adjustment method.
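The stepwise rule in this example can be sketched as follows, using the values given in the text (initial threshold 60%, unit interval 5 seconds, increment 1%). The function name and the cap at 100% are assumptions added for illustration.

```python
def threshold_for_listening_time(t1_seconds: float,
                                 initial: float = 0.60,
                                 unit_seconds: float = 5.0,
                                 increment: float = 0.01,
                                 maximum: float = 1.0) -> float:
    """Raise the judgment threshold by one increment per elapsed unit interval."""
    if t1_seconds < 0:
        raise ValueError("t1 must be non-negative")
    completed_units = int(t1_seconds // unit_seconds)
    return min(maximum, initial + completed_units * increment)


for t1 in (0, 12, 60, 300):
    print(t1, threshold_for_listening_time(t1))
```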

After obtaining the above-mentioned first time interval Δt1, the device may determine whether Δt1 is greater than the first time interval threshold T1. If Δt1 is greater than T1, the sensitivity of the decision condition is not adjusted based on Δt1. This is because when Δt1 is greater than T1, the length of time covered by the first time interval Δt1 can be considered to overlap with the above-mentioned duration t1 of continuous listening, so the sensitivity of the judgment condition can be adjusted through t1 and there is no need to adjust it again according to Δt1.

If Δt1 is smaller than T1, the sensitivity of the decision condition is adjusted in negative correlation with Δt1. This is because, within a period of time after the device obtains valid voice information, namely the duration T1, the longer the interval, the greater the probability that the voice information obtained by the device is invalid voice information such as chat. Therefore, in order to reduce false triggering, the device can adjust the sensitivity of the decision condition in negative correlation with Δt1.

After obtaining the second time interval Δt2, the device may determine whether Δt2 is greater than the second time interval threshold T2. If Δt2 is greater than T2, the sensitivity of the decision condition is not adjusted based on Δt2. This is because when Δt2 is greater than T2, the length of time covered by the second time interval Δt2 can be considered to overlap with the above-mentioned duration t1 of continuous listening, so the sensitivity of the judgment condition can be adjusted through t1 and there is no need to adjust it again according to Δt2.

If Δt2 is smaller than T2, the sensitivity of the decision condition is adjusted in negative correlation with Δt2. This is because, within a period of time after the device acquires invalid voice information, namely the duration T2, the longer the interval, the greater the probability that the voice information acquired by the device is invalid voice information such as chat. Therefore, in order to reduce false triggering, the device can adjust the sensitivity of the decision condition in negative correlation with Δt2.

In addition, for the first time interval Δt1 and the second time interval Δt2 obtained above, the device can compare whether Δt1 is smaller than Δt2, and if so, increase the sensitivity of the decision condition. This is because, when the voice information acquired immediately before the first voice information is valid voice information, the first voice information is more likely to be an addition to or a modification of that previous voice information, that is, the first voice information is more likely to be valid. In order to better identify the validity of the first voice information, the device may adjust the judgment condition in a relaxed direction, that is, increase the sensitivity.
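The handling of Δt1 and Δt2 can be combined into one illustrative sketch, again using the judgment threshold as the judgment condition. The linear rule (threshold raised in proportion to the interval), the window lengths `T1` and `T2`, the `slope` and the `bonus` are all assumptions; the application only specifies the direction of each adjustment. Intervals exactly equal to T1 or T2 are not covered by the text and are left unadjusted here.

```python
def adjust_for_intervals(threshold: float, dt1: float, dt2: float,
                         T1: float = 30.0, T2: float = 30.0,
                         slope: float = 0.001, bonus: float = 0.05) -> float:
    """Adjust a judgment threshold from the intervals since the last valid
    (dt1) and last invalid (dt2) voice information, in seconds."""
    new_threshold = threshold
    if dt1 < T1:
        # Inside the window after valid voice information: the longer the
        # interval, the stricter the condition (negative correlation).
        new_threshold += slope * dt1
    if dt2 < T2:
        # Same rule for the window after invalid voice information.
        new_threshold += slope * dt2
    if dt1 < dt2:
        # The previous utterance was valid, so this one is more likely a
        # follow-up: relax the condition.
        new_threshold -= bonus
    return min(1.0, max(0.0, new_threshold))


print(adjust_for_intervals(0.70, dt1=10, dt2=20))
print(adjust_for_intervals(0.70, dt1=20, dt2=10))
print(adjust_for_intervals(0.70, dt1=40, dt2=50))
```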

The adjustment process shown in FIG. 5 above is an example implementation of the present application. The sensitivity of the judgment condition is dynamically adjusted in real time according to the duration of continuous listening and the time intervals since valid voice information and invalid voice information were acquired, so that at different listening stages, even if the voice information obtained by the device has the same content, the threshold for it to be judged valid differs. In this way, valid voice can be better recognized, false triggering by invalid voice can be reduced, and the user's voice interaction experience can be improved.

It should be noted that, for the several influencing factors shown in FIG. 5 , the device can individually adjust the sensitivity of the decision condition based on any one of them. Alternatively, the device can comprehensively adjust the sensitivity of the decision condition based on any of the multiple influencing factors.

Referring to FIGS. 6A and 6B, FIGS. 6A and 6B exemplarily show schematic diagrams of adjusting the sensitivity of the decision condition based on the influencing factor of the proportions of valid voice information and invalid voice information within the first preset duration before the device obtains the above-mentioned first voice information.

Exemplarily, the first preset duration may be the duration of continuous listening before the device obtains the first voice information, or the first preset duration may be any duration before the device obtains the first voice information. This arbitrary duration may be pre-configured, which is not limited in this application.

The above-mentioned proportion of valid voice information within the first preset duration refers to the proportion of valid voice information acquired by the device to all voice information acquired by the device within the first preset duration. Or, the ratio is the reciprocal of the number of invalid voice information acquired between the time when a valid voice control instruction was last received and the time when the first voice information was acquired. If the number of invalid voice information acquired during the period is 0, the proportion of the valid voice information is 1.

The proportion of invalid voice information within the first preset duration refers to the proportion of invalid voice information acquired by the device to all voice information acquired by the device within the first preset duration. Or, the ratio is the reciprocal of the number of valid voice information acquired between the time when an invalid voice control instruction is received last time and the time when the first voice information is acquired. If the number of valid voice information acquired during the period is 0, then the proportion of invalid voice information is 1.
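Both definitions of the proportions can be sketched from a history of validity judgments. In the sketch below, the history is a hypothetical list of booleans (True for valid voice information) covering the first preset duration, ordered from oldest to newest; the function names are assumptions. When the history contains no valid entry at all, the reciprocal variant simply counts every invalid entry, a case the text does not address.

```python
from typing import List, Tuple


def proportions(history: List[bool]) -> Tuple[float, float]:
    """Return (f1, f2): the shares of valid and invalid voice information."""
    if not history:
        return 0.0, 0.0
    valid = sum(history)
    total = len(history)
    return valid / total, (total - valid) / total


def valid_proportion_reciprocal(history: List[bool]) -> float:
    """Alternative definition: the reciprocal of the number of invalid voice
    information items acquired since the last valid one (1 if there are none)."""
    invalid_since_valid = 0
    for was_valid in reversed(history):
        if was_valid:
            break
        invalid_since_valid += 1
    return 1.0 if invalid_since_valid == 0 else 1.0 / invalid_since_valid


history = [True, False, True, True, False, False]
print(proportions(history))
print(valid_proportion_reciprocal(history))
```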

In a specific embodiment, after the device obtains the first voice information, the device obtains the proportion of valid voice information (referred to as f1) and the proportion of invalid voice information (referred to as f2) within the first preset duration, and the device may compare f1 and f2 (see FIG. 6A). If f1 is greater than f2, more valid voice information was obtained within the above-mentioned first preset duration and the user frequently interacts with the device by voice, so the sensitivity of the above judgment condition can be adjusted in positive correlation with the parameter (f1-f2). That is, the larger the proportion of valid voice information, the greater the probability that the first voice information is valid, and the higher the sensitivity of the judgment condition is adjusted, so that the validity of the acquired voice information can be better recognized and the possibility that valid voice information is missed is reduced.

In a possible implementation manner, the device may adjust the sensitivity of the above decision condition based on f1 and f2 separately. For example, the larger f1 is, the higher the sensitivity is adjusted, and the larger f2 is, the lower the sensitivity is adjusted, and so on.

In FIG. 6A, if f1 is not greater than f2, the device can adjust the sensitivity of the decision condition according to the change rate of f1 and the change rate of f2.

Exemplarily, a coordinate system is constructed with the number of times voice information is acquired as the horizontal axis (or with the duration of continuous listening as the horizontal axis) and f1 as the vertical axis. In this coordinate system, the slope of the line connecting the value of f1 when valid voice information was most recently acquired and the value of f1 when valid voice information was acquired the time before that is the change rate of f1. For ease of understanding, reference may be made to FIG. 6C. In FIG. 6C, it is assumed that voice information has been received 6 times before the above-mentioned first voice information is obtained, and FIG. 6C exemplarily shows the proportion of valid voice information after each time voice information is obtained and its validity is judged. Then, in FIG. 6C, after the device acquires the first voice information, the obtained change rate of f1 is k=-10%.

In the same way, exemplarily, a coordinate system is constructed with the number of times voice information is acquired as the horizontal axis (or with the duration of continuous listening as the horizontal axis) and f2 as the vertical axis. In this coordinate system, the slope of the line connecting the value of f2 when invalid voice information was most recently acquired and the value of f2 when invalid voice information was acquired the time before that is the change rate of f2. For ease of understanding, reference may be made to FIG. 6D. In FIG. 6D, it is assumed that voice information has been received 6 times before the above-mentioned first voice information is obtained, and FIG. 6D exemplarily shows the proportion of invalid voice information after each time voice information is obtained and its validity is judged. Then, in FIG. 6D, after the device acquires the first voice information, the obtained change rate of f2 is k=10%.

Based on the above description, when f1 is not greater than f2, the voice interaction between the user and the device has decreased. Then, in order to reduce false triggering by invalid voice, the device can adjust the sensitivity of the decision condition in positive correlation with the change rate of f1. That is, the larger the change rate of f1, the greater the probability that the first voice information is valid, the higher the sensitivity is adjusted and the looser the judgment condition; the smaller the change rate of f1, the lower the probability that the first voice information is valid, the lower the sensitivity is adjusted and the stricter the judgment condition. For example, referring to FIG. 6C above, several change rates of f1 are exemplarily given in FIG. 6C: k=-50%, k=16.6%, k=8.3%, k=-15% and k=-10%. Their order from small to large is: -50%<-15%<-10%<8.3%<16.6%. Assuming that the judgment condition to be adjusted is the judgment threshold of the output result of the above inference module, and assuming that the judgment threshold before adjustment is 70%, the adjusted judgment thresholds corresponding to the five change rates of f1, sorted from small to large, are 85%, 80%, 78%, 68% and 65%. It should be noted that the lower the judgment threshold, the higher the sensitivity, that is, increasing the sensitivity here means lowering the judgment threshold, and lowering the sensitivity means raising the judgment threshold.

In the case where f1 is not greater than f2, the device can also adjust the sensitivity of the decision condition in negative correlation with the change rate of f2. That is, the smaller the change rate of f2, the higher the proportion of valid voice information, that is, the greater the probability that the first voice information is valid, so the higher the sensitivity is adjusted and the looser the judgment condition; the larger the change rate of f2, the smaller the proportion of valid voice information, that is, the lower the probability that the first voice information is valid, so the lower the sensitivity is adjusted and the stricter the judgment condition. For example, referring to FIG. 6D above, several change rates of f2 are exemplarily given in FIG. 6D: k=50%, k=-16.6%, k=-8.3%, k=15% and k=10%. Their order from small to large is: -16.6%<-8.3%<10%<15%<50%. Assuming that the judgment condition to be adjusted is the judgment threshold of the output result of the above inference module, and assuming that the judgment threshold before adjustment is 70%, the adjusted judgment thresholds corresponding to the change rates of f2, sorted from small to large, are 65%, 68%, 78%, 80% and 85%.
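The change rates listed above for FIG. 6C and FIG. 6D can be reproduced with a short calculation. The sketch assumes a hypothetical sequence of six validity judgments (valid, invalid, valid, valid, invalid, invalid) and a horizontal axis with one unit per acquisition, so that the slope between adjacent points equals the difference of the proportions. This sequence is not given in the application; it is chosen only because it yields the same rates.

```python
from typing import List


def running_proportion(flags: List[bool]) -> List[float]:
    """Proportion of True entries after each acquisition."""
    result = []
    count = 0
    for index, flag in enumerate(flags, start=1):
        count += int(flag)
        result.append(count / index)
    return result


def change_rates(values: List[float]) -> List[float]:
    """Slope between adjacent points on a unit-spaced horizontal axis."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical judgment history: True = valid voice information.
history = [True, False, True, True, False, False]
f1 = running_proportion(history)
f2 = running_proportion([not flag for flag in history])

print([round(k, 3) for k in change_rates(f1)])
print([round(k, 3) for k in change_rates(f2)])
```

Because f2 equals 1 minus f1 at every point under this definition, each change rate of f2 is the negative of the corresponding change rate of f1, which is why the two threshold lists in the text mirror each other.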

Alternatively, after the device obtains the first voice information, the device obtains the proportion of valid voice information (referred to as f1) and the proportion of invalid voice information (referred to as f2) within the first preset duration, and the device does not need to compare f1 and f2. The device can adjust the sensitivity of the above judgment condition in positive correlation with the parameter (f1-f2), in positive correlation with the change rate of f1, and/or in negative correlation with the change rate of f2 (see FIG. 6B). For the specific adjustment process, refer to the above description of FIG. 6A, which will not be repeated here.

It should be noted that, for several influencing factors shown in FIG. 6A or FIG. 6B , the device can individually adjust the sensitivity of the decision condition based on any one of them. Alternatively, the device can comprehensively adjust the sensitivity of the decision condition based on any of the multiple influencing factors.

Referring to FIG. 7, FIG. 7 exemplarily shows a schematic diagram of adjusting the sensitivity of the decision condition based on the following influencing factors: the first correlation degree between the semantics of the first voice information and the semantics of the valid voice information most recently acquired by the device, the second correlation degree between the semantics of the first voice information and the semantics of the invalid voice information most recently acquired by the device, the third correlation degree between the first voice information and the valid voice information most recently acquired by the device, and the state of the voice dialogue between the device and the user up to the time the first voice information is obtained.

In a specific embodiment, after acquiring the above-mentioned first voice information, the device can acquire the most recently acquired valid voice information (referred to as the recent historical valid voice information for short) and, based on the semantics obtained by analyzing the first voice information and the semantics of the recent historical valid voice information, analyze the degree of association between the two pieces of voice information (referred to as the first degree of association for short). Specifically, semantic understanding of the first voice information may be performed by invoking a natural language understanding model in the memory.

If the semantics of the two pieces of voice information are not related, that is, the first degree of association is zero, the sensitivity of the decision condition is not adjusted. If the semantics of the two pieces of voice information are related, for example, the two pieces of voice information have the same semantics, there is an inheritance relationship (for example, the semantics of the recent historical valid voice information is "turn on the air conditioner" and the semantics of the first voice information is "the temperature is higher"), there is a progressive relationship (for example, the semantics of the recent historical valid voice information is "a little higher" and the semantics of the first voice information is "a little higher"), or there is an opposite relationship (for example, the semantics of the recent historical valid voice information is "turn on the air conditioner" and the semantics of the first voice information is "off"), and so on, the device can calculate the specific first degree of association and then adjust the sensitivity of the decision condition in positive correlation with the calculated first degree of association.

Exemplarily, if the first correlation degree is greater than a certain threshold, the probability that the first voice information is valid voice information is high, and the greater the first correlation degree, the higher the sensitivity is adjusted; if the first correlation degree is smaller than a certain threshold, the probability that the first voice information is valid voice information is small, and the smaller the first correlation degree, the lower the sensitivity is adjusted.

Exemplarily, in a possible implementation manner, the device does not need to set the threshold of the first correlation degree, but can instead set a corresponding adjusted decision condition for each range of the first correlation degree. For example, taking the judgment condition as the judgment threshold of the above inference model, and assuming that the initial judgment threshold is 70%: in the range of the first correlation degree from 0 to 30%, the sensitivity can be lowered and the judgment threshold adjusted to 80%; in the range from 31% to 60%, the judgment threshold can be set to 75%; in the range from 61% to 70%, no adjustment is made and the original 70% threshold is kept; in the range from 71% to 100%, the sensitivity can be increased and the judgment threshold set to 60%.

In a possible implementation, when the first correlation degree is judged to be 100%, the sensitivity can be adjusted to the highest, or the invalid rejection model does not perform further validity judgment and directly outputs an indication that the first voice information is valid.
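The range-based variant for the first correlation degree, together with the shortcut for a correlation degree of 100%, can be sketched as follows. The function name is hypothetical, and the ranges are assumed to be closed at their upper bounds, since the text does not say how values between, for example, 30% and 31% are treated.

```python
def decide_with_first_correlation(p_valid: float, r1: float) -> bool:
    """Judge validity using a threshold chosen from the first correlation degree.

    p_valid: probability of validity predicted by the inference model.
    r1: first correlation degree in [0, 1].
    """
    if not 0.0 <= r1 <= 1.0:
        raise ValueError("r1 must lie in [0, 1]")
    if r1 >= 1.0:
        # Fully related to the last valid voice information: output "valid"
        # directly without further validity judgment.
        return True
    if r1 <= 0.30:
        threshold = 0.80  # sensitivity lowered
    elif r1 <= 0.60:
        threshold = 0.75
    elif r1 <= 0.70:
        threshold = 0.70  # unchanged
    else:
        threshold = 0.60  # sensitivity raised
    return p_valid > threshold


print(decide_with_first_correlation(0.65, 0.90))
print(decide_with_first_correlation(0.65, 0.20))
```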

In a specific embodiment, after obtaining the above-mentioned first voice information, the device can obtain the most recently obtained invalid voice information (referred to as the recent invalid voice information for short) and, based on the semantics obtained by analyzing the first voice information and the semantics of the recent invalid voice information, analyze the degree of association between the two pieces of voice information (referred to as the second degree of association for short). If the semantics of the two pieces of voice information are not related, that is, the second degree of association is zero, the sensitivity of the decision condition is not adjusted. If the semantics of the two pieces of voice information are related, for example, the two pieces of voice information have the same semantics, there is an inheritance relationship (for example, the semantics of the recent invalid voice information is "We can go to Shenzhen on Sunday" and the semantics of the first voice information is "I can go on Saturday"), there is a progressive relationship (for example, the semantics of the recent invalid voice information is "get up early at six in the morning" and the semantics of the first voice information is "I can get up earlier"), or there is an opposite relationship (for example, the semantics of the recent invalid voice information is "Let's go to Shenzhen" and the semantics of the first voice information is "don't go"), and so on, the device can calculate the specific second degree of association and then adjust the sensitivity of the decision condition in negative correlation with the calculated second degree of association.

Exemplarily, if the second correlation degree is greater than a certain threshold, the probability that the first voice information is invalid voice information is high, and the greater the second correlation degree, the lower the sensitivity is adjusted; if the second correlation degree is smaller than the threshold, the probability that the first voice information is invalid voice information is low, and the smaller the second correlation degree, the higher the sensitivity is adjusted.

Exemplarily, in a possible implementation manner, the device does not need to set a threshold for the second correlation degree, but can instead set a corresponding adjustment of the decision condition for each range of the second correlation degree. For example, taking the decision condition being the judgment threshold of the above inference model as an example, and assuming that the initial judgment threshold is 70%: when the second correlation degree is in the range of 0 to 30%, the sensitivity can be increased and the judgment threshold adjusted to 60%; in the range of 31% to 60%, the judgment threshold can be set to 65%; in the range of 61% to 70%, no adjustment is made and the original threshold of 70% is kept; in the range of 71% to 100%, the sensitivity can be lowered and the judgment threshold set to 80%.

In a possible implementation, when the second degree of relevance is judged to be 100%, the sensitivity can be adjusted to the lowest level, or the invalid rejection model performs no further validity judgment and directly outputs that the first voice information is an invalid instruction.
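As a rough sketch of how the second correlation degree could feed the validity decision, the following combines the example ranges with the 100% shortcut. It rests on assumptions that are not stated in this passage: the inference model is taken to output a confidence percentage, and voice information is treated as valid when that confidence is at least the judgment threshold. The function names are hypothetical.

```python
def threshold_for_second_correlation(correlation_pct: int) -> int:
    """Map an integer second correlation degree (percent) to a judgment threshold.

    The ranges and thresholds are the example values from the text.
    """
    if not 0 <= correlation_pct <= 100:
        raise ValueError(f"correlation degree out of range: {correlation_pct}")
    if correlation_pct <= 30:
        return 60   # weak association with the last invalid speech: raise sensitivity
    if correlation_pct <= 60:
        return 65
    if correlation_pct <= 70:
        return 70   # keep the initial judgment threshold
    return 80       # strong association: lower sensitivity


def is_valid_command(model_confidence_pct: float, second_correlation_pct: int) -> bool:
    """Decide validity; the comparison rule (confidence >= threshold) is an assumption."""
    if second_correlation_pct == 100:
        # Shortcut from the text: skip further judgment and report invalid.
        return False
    return model_confidence_pct >= threshold_for_second_correlation(second_correlation_pct)
```

With a model confidence of 72%, for example, the same utterance would be accepted when the second correlation degree is 10% (threshold 60%) and rejected when it is 90% (threshold 80%).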

In a specific embodiment, in addition to adjusting the sensitivity of the judgment condition based on the first degree of association between the semantics of the first voice information and the semantics of the valid voice information most recently obtained by the device, the device may also adjust the sensitivity of the judgment condition based on a third degree of association between the first voice information and that most recently obtained valid voice information. The third degree of association refers to the degree of association between the content of the first voice information and the content of the valid voice information obtained by the device last time, whereas the above-mentioned first degree of association refers to the degree of association between the semantics of the two pieces of voice information. To facilitate understanding of the first degree of association and the third degree of association, reference may be made to FIG. 8A and FIG. 8B.

Referring first to FIG. 8A , it is assumed that "play music for me" is the latest valid voice information acquired by the device, and "I usually like to listen to singer A's songs" is the above-mentioned first voice information. In order to obtain the first correlation degree of the two pieces of speech information, after obtaining the semantic information of the two pieces of speech information through the natural language understanding model, the two pieces of semantic information are input into the semantic correlation inference model for processing. After being processed by the semantic correlation inference model, the first correlation degree of the two semantic information is output. The semantic association inference model is a pre-trained neural network model or a machine learning model or the like.

Referring to FIG. 8B, similarly, assume that "play music for me" is the latest valid voice information acquired by the device, and "I usually like to listen to singer A's songs" is the first voice information. In order to obtain the third degree of correlation between the two pieces of voice information, the two pieces of voice information can be structurally parsed through a natural language understanding model. Specifically, structural analysis of the voice information "play music for me" shows that the field it describes is music and the intent is to play music. Structural analysis of the voice information "I usually like to listen to singer A's songs" shows that the field it describes is music and the singer is singer A. After the structured information of the two pieces of voice information is obtained, the two pieces of structured information are input into the correlation judgment model for processing, and the correlation judgment model outputs the third correlation degree of the two pieces of voice information. The correlation judgment model may be, for example, a dialogue state tracking (DST) model or the like.

The first correlation degree output in FIG. 8A above for the two pieces of voice information "play music for me" and "I usually like to listen to singer A's songs" may be zero, that is, their semantics are not related; whereas the third degree of relevance output in FIG. 8B above for the same two pieces of voice information may be 100%, that is, the two pieces of voice information are related.
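To make the contrast between semantic association and content association more concrete, the following toy sketch compares two structured parses by their field alone. It is only a stand-in for the correlation judgment model: an actual DST model would be considerably richer, and the dictionary keys (`domain`, `intent`, `singer`) and the function name are hypothetical.

```python
from typing import Dict, Optional


def third_correlation(last_valid: Dict[str, Optional[str]],
                      current: Dict[str, Optional[str]]) -> int:
    """Return 100 if both structured parses describe the same field, otherwise 0.

    A deliberately simple rule used only for illustration.
    """
    domain = last_valid.get("domain")
    if domain is not None and domain == current.get("domain"):
        return 100
    return 0


# Structured parses corresponding to the example utterances in FIG. 8B.
last_valid_parse = {"domain": "music", "intent": "play_music"}
current_parse = {"domain": "music", "singer": "singer A"}
```

Under this rule, the two example utterances share the music field and are therefore treated as related in content, even though their semantics differ.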

In a possible implementation manner, the third degree of correlation between the first voice information and the last valid voice information obtained by the device, obtained in the manner described in FIG. 8B, may be a definite 0 or 100%. That is, when the above correlation judgment model outputs indication information of irrelevance, the third correlation degree is 0, and when the above correlation judgment model outputs indication information of relevance, the third correlation degree is 100%.

In another possible implementation manner, the third degree of association between the first voice information and the last valid voice information obtained by the device, obtained in the manner described in FIG. 8B, may also be a specific percentage (for example, 60% or 90%) or a similarity score, etc., and whether the two are related can then be determined by comparison with a preset threshold.

After obtaining the third degree of correlation between the above-mentioned first voice information and the last valid voice information obtained by the device, the device can adjust the sensitivity of the decision condition in positive correlation with the third degree of correlation. For the specific positive correlation adjustment method, reference may be made to the above description of adjusting the sensitivity of the decision condition in positive correlation with the first correlation degree, which will not be repeated here. In addition, when the third degree of correlation is zero, that is, when the first voice information is not related to the valid voice information acquired by the device last time, the sensitivity of the decision condition is not adjusted.

In a specific embodiment, after the device obtains the above-mentioned first voice information, it can obtain the state of the voice dialogue between the device and the user up to the time the first voice information is obtained, for example, a judgment state or a small-talk state. Specifically, the device may learn the state based on dialogue state tracking (DST) technology. If a state of voice dialogue between the device and the user exists, it indicates that the user and the device have been conducting a long interactive dialogue, and the device can then increase the sensitivity of the decision condition according to this continuous dialogue state. If no such state exists, the user has not been conducting a long interactive dialogue with the device, and the device may leave the sensitivity of the decision condition unadjusted with respect to this factor.

It should be noted that, for the several influencing factors shown in FIG. 7, the device can adjust the sensitivity of the decision condition based on any one of them individually. Alternatively, the device can comprehensively adjust the sensitivity of the decision condition based on any combination of several of these influencing factors.

Referring to FIG. 9, FIG. 9 exemplarily shows a schematic diagram of adjusting the sensitivity of the decision condition based on each of two influencing factors: the first similarity between the acoustic features of the first voice information and those of historical valid voice information, and the second similarity between the acoustic features of the first voice information and those of historical invalid voice information. Exemplarily, the acoustic features include features such as intonation and/or speed of speech.

In a specific embodiment, after acquiring the above-mentioned first voice information, the device extracts the acoustic features of the first voice information by invoking the acoustic model stored in the memory, and then compares the extracted acoustic features with the acoustic features of historical valid voice information (the comparison may be against the acoustic features of one or more pieces of historical valid voice information), obtaining the similarity between the acoustic features of the first voice information and the acoustic features of the historical valid voice information (referred to as the first similarity for short). If the first similarity is zero, the device may not adjust the sensitivity of the decision condition according to the first similarity. If the similarity between the acoustic features of the first voice information and the acoustic features of one or more pieces of historical valid voice information is not zero, the sensitivity of the decision condition can be adjusted in positive correlation with the similarity, that is, the greater the similarity, the higher the sensitivity is adjusted (exemplarily, the similarity used may be the largest of the obtained similarities, or the average of the obtained similarities, etc.).

In a possible implementation manner, when the similarity between the acoustic features of the first voice information and the acoustic features of one or more pieces of historical valid voice information is greater than a certain threshold (for example, the threshold may be any value between 60% and 100%), it indicates that the acoustic features of the first voice information are similar to those of the historical valid voice information, and the device can then increase the sensitivity of the decision condition to a preset value. For example, taking the above judgment threshold as an example, and assuming that the original judgment threshold is 70%, whenever the similarity between the acoustic features of the first voice information and the acoustic features of one or more pieces of historical valid voice information is greater than the threshold, the judgment threshold is adjusted to 60%.

In a specific embodiment, after acquiring the above-mentioned first voice information, the device extracts the acoustic features of the first voice information by invoking the acoustic model stored in the memory, and then compares the extracted acoustic features with the acoustic features of historical invalid voice information (the comparison may be against the acoustic features of one or more pieces of historical invalid voice information), obtaining the similarity between the acoustic features of the first voice information and the acoustic features of the historical invalid voice information (referred to as the second similarity for short). If the second similarity is zero, the device may not adjust the sensitivity of the decision condition according to the second similarity. If the similarity between the acoustic features of the first voice information and the acoustic features of one or more pieces of historical invalid voice information is not zero, the sensitivity of the decision condition may be adjusted in negative correlation with the similarity, that is, the greater the similarity, the lower the sensitivity is adjusted (exemplarily, the similarity used may be the largest of the obtained similarities, or the average of the obtained similarities, etc.).

In a possible implementation manner, when the similarity between the acoustic features of the first voice information and the acoustic features of one or more pieces of historical invalid voice information is greater than a certain threshold (for example, the threshold may be any value between 60% and 100%), it indicates that the acoustic features of the first voice information are similar to those of the historical invalid voice information, and the device can then lower the sensitivity of the decision condition to a preset value. For example, taking the above judgment threshold as an example, and assuming that the original judgment threshold is 70%, whenever the similarity between the acoustic features of the first voice information and the acoustic features of one or more pieces of historical invalid voice information is greater than the threshold, the judgment threshold is adjusted to 75%.
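The two acoustic similarity checks can be illustrated as follows. This is a sketch under several assumptions: acoustic features are represented as plain numeric vectors (extraction by the acoustic model is not shown), cosine similarity is used as the similarity measure although the text does not name one, the largest similarity over the history is taken, and the cutoff of 0.6 and the thresholds 60, 70 and 75 are the example values from the text. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def max_similarity(features: Sequence[float], history: List[Sequence[float]]) -> float:
    """Largest similarity between the current features and any historical entry."""
    return max((cosine_similarity(features, h) for h in history), default=0.0)


def thresholds_from_acoustics(features: Sequence[float],
                              valid_history: List[Sequence[float]],
                              invalid_history: List[Sequence[float]],
                              initial: int = 70,
                              cutoff: float = 0.6) -> Tuple[int, int]:
    """Return (threshold from first similarity, threshold from second similarity)."""
    first = 60 if max_similarity(features, valid_history) > cutoff else initial
    second = 75 if max_similarity(features, invalid_history) > cutoff else initial
    return first, second
```

The function returns the two candidate thresholds separately; how they are merged into one final threshold is left open here, as it is in the text.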

It should be noted that, for the several influencing factors shown in FIG. 9, the device can adjust the sensitivity of the decision condition based on any one of them individually. Alternatively, the device can comprehensively adjust the sensitivity of the decision condition based on any combination of several of these influencing factors.

In a possible implementation manner, the device may receive an instruction input by the user, and adaptively adjust the sensitivity of the decision condition based on the instruction. Exemplarily, the instruction may be, for example, a specific decision condition sensitivity specified by the user, or may be an instruction such as turning off or canceling the voice information validity recognition. In this embodiment of the present application, the sensitivity of the above judgment condition can be adaptively adjusted according to the user's preference, so as to better meet the user's needs and improve the user experience.

In a possible implementation manner, the sensitivity of the above-mentioned judgment condition may be adjusted by another apparatus or device (for example, a server corresponding to the above-mentioned device) based on one or more of the above influencing factors, and the adjusted judgment condition is then sent to the above-mentioned device. After receiving the adjusted judgment condition, the device may directly judge the validity of the above-mentioned first voice information based on it.

Referring to FIG. 10, FIG. 10 shows a voice information processing method provided by the present application, and the method includes but is not limited to the following steps:

S1001. Acquire first voice information.

For the specific implementation of this step, reference may be made to the description in step S201 in FIG. 2 above, which will not be repeated here.

S1002. When it is determined, based on a judgment condition, that the first voice information is a valid voice control instruction, execute the operation indicated by the first voice information, wherein the judgment condition is obtained by adjustment based on the environmental condition in which the first voice information is generated.

In a specific embodiment, after acquiring the above-mentioned first voice information, the device can adaptively adjust the judgment condition for judging whether the first voice information is a valid voice command based on the environment in which the first voice information is generated. Specifically, for the specific implementation of adjusting the decision condition based on the environmental situation where the first voice information is generated, reference may be made to the corresponding description in FIG. 4 above, which will not be repeated here.

After the adjustment is completed, the device uses the adjusted judgment condition to determine whether the first voice information is valid. When the first voice information is valid, the device starts to perform semantic understanding on the first voice information. Specifically, the processor in the device can call the natural language understanding model in the memory to perform semantic understanding of the first voice information and obtain its specific meaning. After understanding the meaning of the first voice information, the device performs a corresponding operation based on the meaning to provide the user with the desired service. The meaning of the first voice information is, for the device, a control instruction for executing the corresponding operation.

In a possible implementation manner, the device can receive a sensitivity of the judgment condition specified by user input, and then adaptively adjust the judgment condition for judging whether the first voice information is a valid voice command based on that sensitivity, so that the judgment sensitivity specified by the user is achieved when the adjusted judgment condition is used to judge whether voice information is valid. After adjusting the judgment condition based on the sensitivity specified by the user, the device uses the adjusted judgment condition to judge whether the first voice information is valid. When the first voice information is valid, the device starts to perform semantic understanding on the first voice information to obtain the meaning of the first voice information, and performs corresponding operations based on the meaning to provide the user with the desired service. The meaning of the first voice information is, for the device, a control instruction for executing the corresponding operation.

In a possible implementation manner, for the specific implementation of executing the operation indicated by the first voice information when the first voice information is determined, based on the judgment condition, to be a valid voice control command, reference may be made to the description of step S203 in FIG. 2 above, which is not repeated here.

Optionally, the environmental condition in which the above-mentioned first voice information is generated includes one or more of the following: the number of speakers within a second preset time period when the device obtains the first voice information, the number of people within a preset range when the first voice information is generated, the confidence level of the first voice information, or the signal-to-noise ratio of the first voice information.

In a specific embodiment, when the above-mentioned environmental condition indicates that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the above-mentioned judgment condition is increased; when the environmental condition indicates that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the judgment condition is lowered. For specific implementation, reference may be made to the corresponding description in FIG. 4, which is not repeated here.

The environmental condition in which voice information is generated has a great influence on whether the voice information is a valid voice control command: the same or similar voice information may be a valid command in one environmental condition but not necessarily in another. Therefore, the embodiment of the present application adaptively adjusts the judgment condition for judging the validity of voice information received under different environmental conditions, so that the validity of the voice information can be better judged in different environmental conditions, the accuracy of validity discrimination is improved, and the false trigger rate of invalid signals is reduced.

In a possible implementation manner, the above judgment condition is adjusted and obtained based on the environmental condition where the first voice information is generated, including: the judgment condition is adjusted and obtained based on the environmental condition and the continuous listening duration of the device.

In a specific embodiment, the device can adaptively adjust the sensitivity of the above-mentioned decision condition in combination with the environmental conditions in which the first voice information is generated and the duration of the device's continuous listening to the voice information. Specifically, for the specific implementation of adjusting the decision condition based on the environmental situation where the first voice information is generated, reference may be made to the corresponding description in FIG. 4 above, which will not be repeated here.

Optionally, the longer the continuous listening time of the device is, the lower the sensitivity of the judgment condition is adjusted. For the specific implementation of adjusting the decision condition based on the continuous listening duration of the device to the voice information, reference may be made to the corresponding description in FIG. 5 above, and details are not repeated here.

Optionally, in a specific implementation, the device may configure a weight for each of the foregoing environmental condition and listening duration, and comprehensively adjust the sensitivity of the decision condition in a weighted manner. For example, for the adjustment of the above judgment threshold, assume that the judgment thresholds adjusted according to the two influencing factors, the environmental condition and the listening duration, are a4 and a5 respectively, with corresponding weights w4 and w5; then the adjusted judgment threshold determined by combining the two factors is (a4*w4+a5*w5). It should be noted that this weighted combination is only an example. In actual implementation, the largest or smallest adjustment among the multiple influencing factors may instead be taken as the final adjustment result, etc. This solution does not limit the specific combination calculation.
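The two combination strategies mentioned above can be sketched as follows. The function names are hypothetical, and the sketch assumes that each influencing factor has already produced its own candidate threshold. Whether the weights should sum to 1 is not stated in the text; the code does not enforce it, but weights summing to 1 keep the result within the range of the candidate thresholds.

```python
from typing import Sequence


def combine_weighted(thresholds: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted combination, e.g. a4*w4 + a5*w5 for two factors."""
    if len(thresholds) != len(weights):
        raise ValueError("thresholds and weights must have the same length")
    return sum(a * w for a, w in zip(thresholds, weights))


def combine_by_extreme(thresholds: Sequence[float], initial: float,
                       largest: bool = True) -> float:
    """Pick the candidate that deviates most (or least) from the initial threshold."""
    pick = max if largest else min
    return pick(thresholds, key=lambda a: abs(a - initial))
```

For instance, with candidate thresholds of 60 and 80 and equal weights of 0.5, the weighted result is 70; with candidates of 60 and 75 around an initial threshold of 70, the largest-adjustment rule selects 60.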

The longer the device continuously listens, the greater the probability that the voice information it hears is invalid voice. Therefore, in the embodiment of the present application, the judgment condition for judging the validity of the voice information is adaptively adjusted according to both the environmental condition in which the voice information is generated and the continuous listening duration of the device, so that the validity of the voice information can be judged still better, the accuracy of validity discrimination is improved, and the false trigger rate of invalid signals is reduced.

In a possible implementation, the above judgment condition is adjusted based on the environmental condition and the continuous listening duration of the device, including: the judgment condition is adjusted based on the environmental condition, the continuous listening duration and historical voice information.

Optionally, the situation of the historical voice information includes one or more of the following: a first time interval between the time the first voice information is acquired and the time valid voice information was last acquired; a second time interval between the time the first voice information is acquired and the time invalid voice information was last acquired; the proportions of valid voice information and invalid voice information within a first preset time period before the first voice information is acquired; a first degree of association between the semantics of the first voice information and the semantics of the valid voice information acquired last time; a second degree of association between the semantics of the first voice information and the semantics of the invalid voice information acquired last time; a third degree of association between the first voice information and the content of the valid voice information acquired last time; the state of the voice dialogue between the device and the user up to the time the first voice information is acquired; a first similarity between the acoustic features of the first voice information and those of historical valid voice information; a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.

Optionally, the longer the first time interval is, the lower the sensitivity of the decision condition is adjusted.

Optionally, the longer the second time interval is, the lower the sensitivity of the decision condition is adjusted.

Optionally, in the case that the above-mentioned first time interval is smaller than the above-mentioned second time interval, the sensitivity of the above-mentioned decision condition is increased.

Optionally, in the case that the proportion of the above-mentioned valid voice information is greater than the proportion of the above-mentioned invalid voice information, the sensitivity of the above-mentioned judgment condition is increased.

In the case where the proportion of the valid voice information is smaller than the proportion of the invalid voice information: if the proportion of the valid voice information is on an upward trend, the sensitivity of the judgment condition is increased; if the proportion of the valid voice information is on a downward trend, the sensitivity of the judgment condition is reduced.
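The proportion-based rule above can be written out as a small decision function. This is a sketch only: the function name is hypothetical, the trend is approximated by comparing the current share of valid voice information with its share in an earlier window (an assumption, since the text does not say how the trend is measured), and cases the text leaves open, such as equal proportions or a flat trend, are treated here as "keep".

```python
def sensitivity_direction(valid_share: float, invalid_share: float,
                          previous_valid_share: float) -> str:
    """Return "raise", "lower" or "keep" for the sensitivity of the judgment condition."""
    if valid_share > invalid_share:
        return "raise"
    if valid_share < invalid_share:
        if valid_share > previous_valid_share:
            return "raise"   # valid share is trending upward
        if valid_share < previous_valid_share:
            return "lower"   # valid share is trending downward
    return "keep"            # cases not covered by the text
```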

Optionally, in the case that the above-mentioned state of voice dialogue between the device and the user exists, the sensitivity of the judgment condition is adjusted to be higher.

In this embodiment, the device can adaptively adjust the sensitivity of the above judgment condition in combination with the environmental condition in which the first voice information is generated, the duration of the device's continuous listening to voice information, and the historical voice information heard by the device. Specifically, for the specific implementation of adjusting the judgment condition based on the environmental condition in which the first voice information is generated, reference may be made to the corresponding description in FIG. 4, which will not be repeated here; for the specific implementation of adjusting the judgment condition based on the continuous listening duration of the device, reference may be made to the corresponding description in the above-mentioned FIG. 5, which will not be repeated here; for the specific implementation of adjusting the judgment condition based on the historical voice information heard by the device, reference may be made to the corresponding description in the above-mentioned FIG. 5, FIG. 6A, FIG. 6B, FIG. 7 or FIG. 9, which will not be repeated here.

Optionally, in this embodiment, the sensitivity of the judgment condition is adjusted in combination with the above-mentioned environmental condition, listening duration, and historical voice information, for example in a weighted manner, or by taking the largest or smallest adjustment among these factors as the final adjustment result, etc. This solution does not limit the specific combination calculation.

Historical voice information can also help to judge the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to valid voice information acquired in the past, the probability that the currently acquired voice information is a valid voice command is high. Conversely, if the currently acquired voice information is highly similar to invalid voice information acquired in the past, the probability that the currently acquired voice information is an invalid voice command is high. Therefore, in the embodiment of the present application, in addition to the environmental condition and the listening duration described above, the historical voice information is also used to adaptively adjust the judgment condition for judging the validity of the voice information, so that the validity of the voice information can be judged still better, the accuracy of validity discrimination is improved, and the false trigger rate of invalid signals is reduced.

In a possible implementation manner, the above judgment condition is adjusted and obtained based on the environmental condition where the first voice information is generated, including: the judgment condition is adjusted and obtained based on the environmental condition and historical voice information.

In this embodiment, the device can adaptively adjust the sensitivity of the above-mentioned judgment condition in combination with the environmental conditions where the first voice information is generated and the historical voice information heard by the device. Specifically, the specific implementation of adjusting the judgment conditions based on the environmental conditions where the first voice information is generated may refer to the corresponding description in FIG. 4 , which will not be repeated here; the specific implementation of adjusting the judgment conditions based on the historical voice information heard by the device Reference may be made to the corresponding descriptions in FIG. 5 , FIG. 6A , FIG. 6B , FIG. 7 or FIG. 9 , and details are not repeated here.

Optionally, in this embodiment, the sensitivity of the decision condition is adjusted in combination with the above-mentioned environmental condition and historical voice information, for example in a weighted manner, or by taking the largest or smallest adjustment among these factors as the final adjustment result, etc. This solution does not limit the specific combination calculation.

Based on the foregoing description, in the embodiment of the present application, the judgment condition for judging the validity of the voice information is adaptively adjusted in combination with the environmental condition in which the voice information is generated and the historical voice information, so that the validity of the voice information can be judged still better, the accuracy of validity discrimination is improved, and the false trigger rate of invalid signals is reduced.

In a specific embodiment, for the specific implementation of the above-mentioned acquisition of the first voice information, reference may be made to the description in step S201 in the above-mentioned FIG. 2, which will not be repeated here. For the specific implementation of executing the operation indicated by the first voice information when it is determined based on the judgment condition that the first voice information is a valid voice control command, reference may be made to the description in step S203 in FIG. 2, which will not be repeated here. For the specific implementation of adjusting the above-mentioned judgment condition based on the duration of the device's continuous listening to voice information, reference may be made to the corresponding description in FIG. 5, which will not be repeated here.

In a specific embodiment, for the specific implementation of the above-mentioned acquisition of the first voice information, reference may be made to the description in step S201 in the above-mentioned FIG. 2, which will not be repeated here. For the specific implementation of executing the operation indicated by the first voice information when it is determined based on the judgment condition that the first voice information is a valid voice control command, reference may be made to the description in step S203 in FIG. 2, which will not be repeated here. For the specific implementation of adjusting the decision condition based on the historical voice information heard by the device, reference may be made to the corresponding description in the above-mentioned FIG. 5, FIG. 6A, FIG. 6B, FIG. 7 or FIG. 9, which will not be repeated here.

To facilitate an overall understanding of the voice information processing method provided by the present application, reference may be made, for example, to the flowchart shown in FIG. 11. In FIG. 11, the voice interaction system of the device is first awakened, and the system then starts to listen to the user's voice. After the system acquires the user's voice information, the voice information is input into the above-mentioned invalid rejection model to identify the validity of the voice information. If the voice information is recognized as valid, semantic understanding is performed on the voice information, and an instruction is parsed and executed based on the understood semantics.

After semantic understanding, the voice interaction system determines whether to continue listening to the user's voice. If so, it continues listening; otherwise, it ends listening. Exemplarily, whether to continue listening may be determined according to a preset listening duration: if the preset listening duration has not been exceeded, listening continues; otherwise, listening is terminated.

If the invalid recognition model identifies the voice information as invalid, the system likewise determines whether to continue listening to the user's voice. If so, it continues listening; otherwise, it ends listening.
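The flow of FIG. 11 described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: the names `run_listening_session`, `is_valid`, `understand`, `execute`, `now` and `max_listen_seconds` are hypothetical, `is_valid` stands in for the validity check, and the preset listening duration is modeled with a caller-supplied clock.

```python
from typing import Callable, Iterable, List


def run_listening_session(
    utterances: Iterable[str],
    is_valid: Callable[[str], bool],
    understand: Callable[[str], str],
    execute: Callable[[str], None],
    now: Callable[[], float],
    max_listen_seconds: float,
) -> List[str]:
    """Process utterances until the preset listening duration is exceeded.

    All parameter names are hypothetical. `now` returns a time in seconds.
    Returns the list of instructions that were executed.
    """
    started = now()
    executed: List[str] = []
    for text in utterances:
        # Validity is checked first; invalid speech is never parsed.
        if is_valid(text):
            intent = understand(text)
            execute(intent)
            executed.append(intent)
        # For valid and invalid speech alike, decide whether to keep listening.
        if now() - started > max_listen_seconds:
            break
    return executed
```

In this sketch the decision to continue listening is made after semantic understanding; as noted for FIG. 11, the order of those two steps is not limited.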

In a possible implementation, in the process shown in FIG. 11 above, after the voice information is judged to be valid, the two steps of determining whether to continue listening to the user's voice and performing semantic understanding may also be carried out simultaneously, or the determination of whether to continue listening may be made first and semantic understanding performed afterwards. The present application does not limit the order in which the two operations are executed.

In addition, after the above-mentioned semantic understanding of the speech information, the semantics of the understood speech information can also be returned to the process of validating the speech information, for example, input into the above-mentioned invalid rejection model for the adjustment of the sensitivity of the above-mentioned judgment conditions.

In addition, it should be noted that the above embodiments of the voice information processing method provided by the present application are mainly described by taking the judgment conditions in the invalid recognition model as an example. The judgment condition is not limited to a judgment condition in the invalid rejection model. Any scheme that adjusts the judgment condition for the validity of voice information based on one or more of the above-mentioned factors influencing the validity identification of the voice information falls within the protection scope of the present application.

To sum up, the voice information processing method provided by this application starts from one or more factors that influence the validity judgment of voice information, and adjusts, in real time, the sensitivity of the judgment condition for the validity of the voice information obtained by the device. The device can thus flexibly and effectively determine the validity of voice information according to different scenarios and different user states, which improves the accuracy of validity recognition, reduces the false trigger rate of invalid voice information, and saves the computing resources that the device would otherwise waste on false triggers. It can also improve the user's experience during voice interaction.

The above mainly introduces the voice information processing method provided by the embodiments of the present application. It can be understood that, in order to implement the above-mentioned corresponding functions, each device includes corresponding hardware structures and/or software modules for performing each function. In combination with the units and steps of each example described in the embodiments disclosed herein, the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

In this embodiment of the present application, the device may be divided into functional modules according to the foregoing method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. It should be noted that, the division of modules in the embodiments of the present application is schematic, and is only a logical function division, and there may be other division manners in actual implementation.

In the case where each functional module is divided according to each function, FIG. 12 shows a schematic diagram of a possible logical structure of the apparatus. The apparatus may be the above-mentioned device, or may be a chip in the device, or may be a processing system in the device, etc. The apparatus 1200 includes an acquisition unit 1201, an adjustment unit 1202, a semantic understanding unit 1203 and an execution unit 1204, wherein:

The obtaining unit 1201 is configured to obtain the first voice information. The obtaining unit 1201 may be implemented by a communication interface or a transceiver, and may perform the operations described in step 201 shown in FIG. 2 .

The adjustment unit 1202 is configured to adjust the judgment condition based on an influencing factor of the validity of the first voice information, where the judgment condition is one or more judgment conditions in the validity judgment model of the first voice information, and the validity indicates whether the first voice information is a valid voice control instruction for the device that obtained the first voice information. The adjustment unit 1202 may be implemented by a processor, and may perform the operations described in step 202 shown in FIG. 2 .

The semantic understanding unit 1203 is configured to perform semantic understanding on the first voice information when it is determined that the first voice information is valid based on the adjusted judgment condition. The semantic understanding unit 1203 may be implemented by a processor, and may perform the semantic understanding operation described in step 203 shown in FIG. 2 .

The execution unit 1204 is configured to execute the instruction of the first voice information. The execution unit 1204 may be implemented by a processor, and may perform the execution operations described in step 203 shown in FIG. 2 .
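The cooperation of the four units can be illustrated with the following sketch. It is an assumption-laden illustration rather than the structure of the apparatus 1200: the function `process_voice` and its callable parameters are hypothetical, and the adjusted judgment condition is modeled here, purely for brevity, as a single score threshold.

```python
from typing import Callable, Optional


def process_voice(
    acquire: Callable[[], str],
    adjust: Callable[[str], float],
    score: Callable[[str], float],
    understand: Callable[[str], str],
    execute: Callable[[str], None],
) -> Optional[str]:
    """Run one utterance through acquire -> adjust -> understand -> execute.

    Returns the executed instruction, or None if the utterance is judged
    invalid. Modeling the judgment condition as a threshold is an assumption.
    """
    voice = acquire()                # role of the acquisition unit 1201
    threshold = adjust(voice)        # role of the adjustment unit 1202
    if score(voice) < threshold:     # judged invalid under the adjusted condition
        return None
    intent = understand(voice)       # role of the semantic understanding unit 1203
    execute(intent)                  # role of the execution unit 1204
    return intent
```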

In a possible implementation manner, the adjustment unit 1202 is specifically used for:

In the case that analysis of the influencing factor indicates that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the judgment condition is increased, where a higher sensitivity of the judgment condition indicates a higher probability that the first voice information is determined to be valid by the judgment condition;

In the case that analysis of the influencing factor indicates that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the judgment condition is lowered, where a lower sensitivity of the judgment condition indicates a lower probability that the first voice information is determined to be valid by the judgment condition.
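The two cases above can be sketched as a single update rule. The following is only an illustration: representing sensitivity as a number in [0, 1], the fixed `step`, and the function name `adjust_sensitivity` are all assumptions not stated in the present application.

```python
def adjust_sensitivity(sensitivity: float, p_valid: float, step: float = 0.1) -> float:
    """Raise or lower the sensitivity of the judgment condition.

    `p_valid` is the (hypothetical) probability, derived from the influencing
    factors, that the utterance is valid; the probability that it is invalid
    is taken as 1 - p_valid. The result is clamped to [0, 1].
    """
    p_invalid = 1.0 - p_valid
    if p_valid > p_invalid:
        sensitivity += step      # validity more likely: easier to accept
    elif p_valid < p_invalid:
        sensitivity -= step      # invalidity more likely: harder to accept
    # Equal probabilities leave the sensitivity unchanged.
    return min(1.0, max(0.0, sensitivity))
```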

In a possible implementation manner, the judgment condition includes a selection condition of a pre-judgment module of the validity of the first speech information in the validity judgment model, and the pre-judgment module includes a rule matching module and a reasoning module.

In a possible implementation manner, the judgment condition includes a judgment threshold of an inference module used to predict the validity of the first voice information in the validity judgment model.

In a possible implementation, the judgment condition includes a comprehensive judgment condition of a decision module in the validity judgment model; the comprehensive judgment condition is a judgment condition for determining whether the first speech signal is valid based on a prejudgment result; the prejudgment result is the pre-judgment result of the validity of the first voice information by the pre-judgment module in the validity judgment model.
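The three kinds of judgment condition named above (selection of the pre-judgment module, the judgment threshold of the inference module, and the comprehensive judgment condition of the decision module) can be pictured with the following sketch. All names, including `JudgmentConditions` and `judge_validity`, and the choice of "any vote" versus "all votes" as the comprehensive condition are hypothetical illustrations, not the model of the present application.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class JudgmentConditions:
    use_rule_matching: bool = True    # selection condition of the pre-judgment modules
    use_inference: bool = True
    inference_threshold: float = 0.5  # judgment threshold of the inference module
    require_all: bool = False         # comprehensive condition of the decision module


def judge_validity(
    text: str,
    rules: Set[str],
    score_fn: Callable[[str], float],
    cond: JudgmentConditions,
) -> bool:
    """Combine the pre-judgment results into a final validity decision."""
    votes: List[bool] = []
    if cond.use_rule_matching:
        votes.append(text in rules)                              # rule matching module
    if cond.use_inference:
        votes.append(score_fn(text) >= cond.inference_threshold)  # inference module
    if not votes:
        return False
    return all(votes) if cond.require_all else any(votes)
```

In this sketch, raising the sensitivity could correspond to lowering `inference_threshold` or setting `require_all` to `False`; lowering the sensitivity would do the opposite.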

In a possible embodiment, the influencing factor is one or more of the following:

the environmental conditions in which the first voice information is generated;

the continuous listening duration of the device 1200;

the second time interval between the time when the first voice information is acquired and the time when invalid voice information was most recently acquired;

the proportions of valid voice information and invalid voice information within the first preset time period before the first voice information is obtained;

the first degree of relevance between the semantics of the first voice information and the semantics of the most recently acquired valid voice information;

the second degree of relevance between the semantics of the first voice information and the semantics of the most recently acquired invalid voice information;

the third degree of association between the first voice information and the valid voice information most recently obtained by the device 1200;

the state of the voice dialogue between the device 1200 and the user up to the time the first voice information is obtained;

the first similarity between the acoustic features of the first voice information and those of historical valid voice information;

the second similarity between the acoustic features of the first voice information and those of historical invalid voice information.

In a possible implementation manner, the environment in which the first voice information is generated includes one or more of the following:

the number of speakers within a second preset time period up to the time the device 1200 obtains the first voice information, the number of people within a preset range when the first voice information is generated, the confidence level of the first voice information, or the signal-to-noise ratio of the first voice information.
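The four environmental quantities listed above can be gathered into one record, as in the following sketch. The field names, the cut-off values (one speaker, one person, confidence 0.8, 15 dB) and the majority-vote rule are arbitrary assumptions chosen for illustration; the present application does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentConditions:
    speaker_count: int    # speakers within the second preset time period
    people_in_range: int  # people within the preset range
    confidence: float     # confidence level of the first voice information, 0..1
    snr_db: float         # signal-to-noise ratio, in dB


def environment_favors_valid(env: EnvironmentConditions) -> bool:
    """Toy majority vote: True if more indicators suggest a valid command.

    Every cut-off below is a hypothetical placeholder.
    """
    votes = [
        env.speaker_count <= 1,    # a single speaker is less likely to be chatting
        env.people_in_range <= 1,
        env.confidence >= 0.8,
        env.snr_db >= 15.0,
    ]
    return sum(votes) > len(votes) / 2
```

A result of `True` would correspond to the case in which the sensitivity of the judgment condition is increased, and `False` to the case in which it is left unchanged or lowered.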

For the specific operations and beneficial effects of each unit in the apparatus 1200 shown in FIG. 12, reference may be made to the corresponding descriptions in the foregoing method embodiments, and details are not repeated here.

In the case where each functional module is divided according to each function, FIG. 13 shows a schematic diagram of a possible logical structure of the apparatus. The apparatus may be the above-mentioned device, or may be a chip in the device, or may be a processing system in the device, etc. The apparatus 1300 includes an acquisition unit 1301 and an execution unit 1302, wherein:

The obtaining unit 1301 is configured to obtain the first voice information. The obtaining unit 1301 may be implemented by a communication interface or a transceiver, and may perform the operations described in step S1001 shown in FIG. 10 .

The execution unit 1302 is configured to execute the operation indicated by the first voice information in the case that the first voice information is determined, based on a judgment condition, to be a valid voice control instruction, wherein the judgment condition is obtained by adjustment based on the environmental conditions in which the first voice information is generated. The execution unit 1302 may be implemented by a processor, and may perform the operations described in step S1002 shown in FIG. 10 .

For specific operations and beneficial effects of each unit in the apparatus 1300 shown in FIG. 13 , reference may be made to the corresponding descriptions in the foregoing method embodiments, and details are not repeated here.

FIG. 14 shows a schematic diagram of a possible hardware structure of the device provided by the present application, and the device may be the device in the method described in the foregoing embodiments. The device 1400 includes a processor 1401, a memory 1402 and a communication interface 1403. The processor 1401, the communication interface 1403 and the memory 1402 may be connected to each other, or may be connected to each other through a bus 1404.

Exemplarily, the memory 1402 is used to store computer programs and data of the device 1400, and the memory 1402 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or compact disc read-only memory (CD-ROM), etc.

In the case of implementing the embodiment shown in FIG. 14 , the software or program codes required to perform the functions of all or part of the units in FIG. 14 are stored in the memory 1402 .

In the case of implementing the embodiment of FIG. 14, if the software or program codes required for the functions of some units are stored in the memory 1402, the processor 1401 can not only call the program codes in the memory 1402 to realize some functions, but also cooperate with other components (e.g., the communication interface 1403) to jointly perform other functions (e.g., the function of receiving or sending data) described in the embodiment of FIG. 14 .

There may be multiple communication interfaces 1403, which are used to support communication by the device 1400, such as receiving or sending data or signals.

Illustratively, the processor 1401 may be a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. A processor may also be a combination that performs computing functions, such as a combination comprising one or more microprocessors, a combination of a digital signal processor and a microprocessor, and the like. The processor 1401 can be used to read the program stored in the above-mentioned memory 1402, and perform the following operations:

Acquiring first voice information; adjusting the judgment condition based on an influencing factor of the validity of the first voice information, where the judgment condition is one or more judgment conditions in the validity judgment model of the first voice information, and the validity indicates whether the first voice information is a valid voice control instruction for the device 1400 that obtained the first voice information; and, if the first voice information is determined to be valid based on the adjusted judgment condition, performing semantic understanding on the first voice information and executing the instruction of the first voice information.

In a possible implementation, the adjustment of the decision condition based on the influencing factor of the validity of the first voice information includes:

For specific operations and beneficial effects of each unit in the device 1400 shown in FIG. 14 , reference may be made to the corresponding descriptions in the foregoing method embodiments, and details are not repeated here.

FIG. 15 is a schematic structural diagram of another voice information processing apparatus provided by an embodiment of the present application. The apparatus may be the device in the above-mentioned embodiments, or may be a chip in the device, or may be a processing system in the device, etc., and can implement the above-mentioned voice information processing method and the various optional embodiments thereof provided by the present application. As shown in FIG. 15 , the voice information processing apparatus 1500 includes a processor 1501 and an interface circuit 1502 coupled to the processor 1501. It should be understood that, although only one processor and one interface circuit are shown in FIG. 15 , the voice information processing apparatus 1500 may include other numbers of processors and interface circuits.

The interface circuit 1502 is used to communicate with other components of the apparatus 1500, such as a memory or other processors. The processor 1501 is used for signal interaction with other components through the interface circuit 1502. The interface circuit 1502 may be an input/output interface of the processor 1501.

For example, the processor 1501 reads computer programs or instructions in a memory coupled thereto through the interface circuit 1502, and decodes and executes the computer programs or instructions. It should be understood that these computer programs or instructions may include various functional programs in the above-described methods. When the corresponding function program is decoded and executed by the processor 1501, the voice information processing apparatus 1500 can be made to implement the solution in the voice information processing method provided by the embodiments of the present application.

Optionally, these functional programs are stored in a memory outside the voice information processing apparatus 1500 . When the function program is decoded and executed by the processor 1501, part or all of the content of the function program is temporarily stored in the internal memory.

Optionally, these functional programs are stored in the internal memory of the voice information processing apparatus 1500 . When the function program is stored in the internal memory of the voice information processing apparatus 1500, the voice information processing apparatus 1500 may be set in the device of the embodiment of the present application.

Optionally, part of the content of these function programs is stored in a memory outside the voice information processing apparatus 1500 , and other parts of the content of these function programs are stored in a memory inside the voice information processing apparatus 1500 .

It should be understood that the apparatuses or devices shown in any of FIG. 1 , FIG. 12 , FIG. 13 , FIG. 14 and FIG. 15 may be combined with each other. The relevant design details of the apparatus or device shown in any of these figures and of each optional embodiment may be referred to each other, and reference may also be made to the voice information processing method shown in FIG. 2 or FIG. 10 and to the relevant design details of each optional embodiment thereof. Details are not repeated here.

Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is executed by a processor to implement the operations performed by the server in any one of the foregoing embodiments and possible embodiments thereof.

The embodiments of the present application also provide a computer program product, when the computer program product is read and executed by a computer, the operations performed by the server in any one of the foregoing embodiments and possible embodiments thereof will be executed.

The embodiments of the present application also provide a computer program, which, when executed on a computer, enables the computer to implement the operations performed by the server in any one of the foregoing embodiments and possible embodiments.

In summary, the present application provides a voice information processing method and device, which can improve the accuracy of effective voice recognition and reduce the false trigger rate of invalid voices in different intelligent voice interaction scenarios.

In this application, terms such as "first" and "second" are used to distinguish identical or similar items that have basically the same function and effect. It should be understood that there is no logical or timing dependency among "first", "second" and "nth", and no restriction on number or execution order. It will also be understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another. For example, a first image may be referred to as a second image, and, similarly, a second image may be referred to as a first image, without departing from the scope of the various described examples. Both the first image and the second image may be images, and in some cases, may be separate and distinct images.

It should also be understood that, in each embodiment of the present application, the magnitude of the sequence number of each process does not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.

It will also be understood that the term "includes" (also referred to as "includes", "including", "comprises" and/or "comprising"), when used in this specification, specifies the presence of stated features, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groupings thereof.

It should also be understood that references throughout the specification to "one embodiment," "an embodiment," and "one possible implementation" mean that a particular feature, structure, or characteristic associated with the embodiment or implementation is included in at least one embodiment of the present application. Thus, appearances of "in one embodiment" or "in an embodiment" or "one possible implementation" in various places throughout this specification are not necessarily referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, but not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of the technical features thereof can be equivalently replaced, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

A voice information processing method, characterized in that the method comprises:

obtaining first voice information;

in the case where it is determined, based on a judgment condition, that the first voice information is a valid voice control instruction, performing the operation indicated by the first voice information, wherein the judgment condition is obtained by adjustment based on the environmental conditions in which the first voice information is generated.
The method according to claim 1, characterized in that, the environmental conditions in which the first voice information is generated include one or more of the following:

the number of speakers within a second preset duration up to the time the device obtains the first voice information, the number of people within a preset range when the first voice information is generated, the confidence level of the first voice information, or the signal-to-noise ratio of the first voice information.
The method according to claim 1 or 2, wherein the judgment condition is adjusted and obtained based on an environmental condition where the first voice information is generated, including:

The judgment condition is adjusted and obtained based on the environmental conditions and the continuous listening duration of the device.
The method according to claim 3, wherein the judgment condition is adjusted and obtained based on the environmental conditions and the continuous listening duration of the device, comprising:

The judgment condition is adjusted based on the environmental conditions, the continuous listening duration and the historical voice information.
The method according to claim 1 or 2, wherein the judgment condition is adjusted and obtained based on an environmental condition where the first voice information is generated, including:

The judgment condition is adjusted based on the environmental conditions and historical voice information.
The method according to claim 4 or 5, wherein the situation of the historical voice information includes one or more of the following:

the first time interval between when the first voice information is obtained and the last time when valid voice information is obtained;

the second time interval between when the first voice information is obtained and when the invalid voice information is obtained most recently;

the ratio of valid voice information to invalid voice information within a first preset time period before the first voice information is obtained;

The first semantic correlation between the first voice information and the most recently acquired valid voice information;

The second degree of relevance of the semantics of the first voice information and the invalid voice information obtained last time;

the third degree of association between the first voice information and the last valid voice information obtained by the device;

The state of the voice dialogue between the device and the user when the first voice information is obtained;

the first similarity between the acoustic features of the first voice information and historically valid voice information;

The second similarity of the acoustic features of the first voice information and the historical invalid voice information.
The method according to any one of claims 1 to 6, wherein,

In the case that the environmental conditions indicate that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is increased;

In the case where the environmental conditions indicate that the probability that the first voice information is valid is smaller than the probability that it is invalid, the sensitivity of the decision condition is lowered.
The method according to claim 3 or 4, characterized in that, the longer the continuous listening time of the device is, the lower the sensitivity of the decision condition is adjusted.
The method according to any one of claims 4 to 6, wherein the situation of the historical voice information includes a first time interval between when the first voice information is acquired and valid voice information is acquired last time;

The longer the first time interval is, the lower the sensitivity of the decision condition is adjusted.
The method according to any one of claims 4 to 6, wherein the situation of the historical voice information includes a second time interval between when the first voice information is acquired and the invalid voice information is acquired last time;

The longer the second time interval is, the lower the sensitivity of the decision condition is adjusted.
The method according to any one of claims 4 to 6, wherein the situation of the historical voice information includes a first time interval between when the first voice information is acquired and valid voice information is acquired last time, and including the second time interval between when the first voice information is obtained and when the invalid voice information is obtained most recently;

In the case that the first time interval is smaller than the second time interval, the sensitivity of the decision condition is increased.
The method according to any one of claims 4 to 6, wherein the situation of the historical voice information includes the proportions of valid voice information and invalid voice information within a first preset time period before the first voice information is acquired;

In the case that the proportion of the valid voice information is greater than the proportion of the invalid voice information, the sensitivity of the judgment condition is increased;

In the case where the proportion of the valid voice information is smaller than the proportion of the invalid voice information: if the proportion of the valid voice information is on an upward trend, the sensitivity of the judgment condition is increased; if the proportion of the valid voice information is on a downward trend, the sensitivity of the judgment condition is lowered.
The method according to any one of claims 4 to 6, wherein the situation of the historical voice information includes a state of a voice dialogue between the device and the user until the first voice information is obtained;

In the case where the device is in a state of voice dialogue with the user, the sensitivity of the decision condition is increased.
A voice information processing device, characterized in that the device comprises:

an acquisition unit for acquiring the first voice information;

an execution unit, configured to execute the operation indicated by the first voice information when it is determined, based on a judgment condition, that the first voice information is a valid voice control instruction, wherein the judgment condition is obtained by adjustment based on the environmental conditions in which the first voice information is generated.
The device according to claim 14, wherein the environmental conditions in which the first voice information is generated include one or more of the following:

the number of speakers within a second preset duration up to the time the device obtains the first voice information, the number of people within a preset range when the first voice information is generated, the confidence level of the first voice information, or the signal-to-noise ratio of the first voice information.
The device according to claim 14 or 15, wherein the judgment condition is adjusted and obtained based on the environmental conditions in which the first voice information is generated, including:

The judgment condition is adjusted and obtained based on the environmental conditions and the continuous listening duration of the device.
The apparatus according to claim 16, wherein the judgment condition is adjusted and obtained based on the environmental conditions and the continuous listening duration of the device, comprising:

The judgment condition is adjusted based on the environmental conditions, the continuous listening duration and the historical voice information.
The device according to claim 14 or 15, wherein the judgment condition is adjusted and obtained based on the environmental conditions in which the first voice information is generated, including:

The judgment condition is adjusted based on the environmental conditions and historical voice information.
The device according to claim 17 or 18, wherein the situation of the historical voice information includes one or more of the following:

the first time interval between when the first voice information is obtained and the last time when valid voice information is obtained;

the second time interval between when the first voice information is obtained and when the invalid voice information is obtained most recently;

the ratio of valid voice information to invalid voice information within a first preset time period before the first voice information is obtained;

The first semantic correlation between the first voice information and the most recently acquired valid voice information;

The second degree of relevance of the semantics of the first voice information and the invalid voice information obtained last time;

the third degree of association between the first voice information and the last valid voice information obtained by the device;

The state of the voice dialogue between the device and the user when the first voice information is obtained;

the first similarity between the acoustic features of the first voice information and historically valid voice information;

The second similarity of the acoustic features of the first voice information and the historical invalid voice information.
The device according to any one of claims 14 to 19, characterized in that:

in the case that the environmental conditions indicate that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is increased;

in the case that the environmental conditions indicate that the probability that the first voice information is valid is smaller than the probability that it is invalid, the sensitivity of the decision condition is lowered.
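The adjustment rule in the preceding claim can be illustrated with a minimal sketch. The function name, the numeric representation of sensitivity as a value in [0, 1], the fixed step size, and the assumption that the probability of invalidity is the complement of the probability of validity are all hypothetical choices made for illustration; the claims do not specify them.

```python
def adjust_for_environment(sensitivity: float, p_valid: float, step: float = 0.25) -> float:
    """Return a sensitivity adjusted by the environment's estimate of validity.

    Assumptions (not from the claims): sensitivity lies in [0, 1], the
    probability of invalidity is 1 - p_valid, and the adjustment is a fixed step.
    """
    p_invalid = 1.0 - p_valid
    if p_valid > p_invalid:
        sensitivity += step   # environment suggests a valid command: raise sensitivity
    elif p_valid < p_invalid:
        sensitivity -= step   # environment suggests chat: lower sensitivity
    # Keep the result inside the assumed [0, 1] range.
    return min(1.0, max(0.0, sensitivity))
```

When the two probabilities are equal, this sketch leaves the sensitivity unchanged, which is one possible choice for a case the claim does not address.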
The device according to claim 16 or 17, wherein the longer the continuous listening duration of the device is, the lower the sensitivity of the decision condition is adjusted.
The device according to any one of claims 17 to 19, wherein the situation of the historical voice information includes a first time interval between when the first voice information is acquired and when valid voice information was most recently acquired;

the longer the first time interval is, the lower the sensitivity of the decision condition is adjusted.
The device according to any one of claims 17 to 19, wherein the situation of the historical voice information includes a second time interval between when the first voice information is acquired and when invalid voice information was most recently acquired;

the longer the second time interval is, the lower the sensitivity of the decision condition is adjusted.
The device according to any one of claims 17 to 19, wherein the situation of the historical voice information includes a first time interval between when the first voice information is acquired and when valid voice information was most recently acquired, and a second time interval between when the first voice information is acquired and when invalid voice information was most recently acquired;

in the case that the first time interval is smaller than the second time interval, the sensitivity of the decision condition is increased.
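The interval-based rules in the three preceding claims can be sketched as follows. This is only an illustration: the function names, the linear decay, the rate and step values, and the [0, 1] range for sensitivity are assumptions, since the claims state only the direction of each adjustment.

```python
def _clamp(value: float) -> float:
    """Keep a sensitivity value inside the assumed [0, 1] range."""
    return min(1.0, max(0.0, value))


def decay_with_interval(sensitivity: float, interval_s: float,
                        rate_per_s: float = 0.001) -> float:
    """Lower sensitivity as the interval since the last valid (or invalid)
    voice information grows. A linear decay is assumed for illustration."""
    return _clamp(sensitivity - rate_per_s * interval_s)


def boost_if_valid_more_recent(sensitivity: float, since_valid_s: float,
                               since_invalid_s: float, step: float = 0.1) -> float:
    """Raise sensitivity when the first interval (since the last valid voice
    information) is smaller than the second (since the last invalid one)."""
    if since_valid_s < since_invalid_s:
        return _clamp(sensitivity + step)
    return sensitivity  # the claim does not say what happens otherwise
```

For example, `decay_with_interval(0.8, 100)` yields a lower value than `decay_with_interval(0.8, 10)`, matching the "longer interval, lower sensitivity" direction of the claims.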
The device according to any one of claims 17 to 19, wherein the situation of the historical voice information includes the proportions of valid voice information and invalid voice information within a first preset time period before the first voice information is acquired;

in the case that the proportion of the valid voice information is greater than the proportion of the invalid voice information, the sensitivity of the decision condition is increased;

in the case that the proportion of the valid voice information is smaller than the proportion of the invalid voice information: if the proportion of the valid voice information is on an upward trend, the sensitivity of the decision condition is increased; if the proportion of the valid voice information is on a downward trend, the sensitivity of the decision condition is lowered.
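One reading of the proportion-based rule in the preceding claim is sketched below. The function name, the use of a single earlier measurement to determine the trend, the fixed step, and the [0, 1] range for sensitivity are assumptions for illustration; the claim itself does not say how the trend is measured.

```python
def adjust_for_proportion(sensitivity: float, valid_ratio: float,
                          previous_valid_ratio: float, step: float = 0.25) -> float:
    """Adjust sensitivity from the share of valid voice information in the
    preset window. valid_ratio and previous_valid_ratio are fractions in [0, 1];
    the invalid share is assumed to be 1 - valid_ratio."""
    invalid_ratio = 1.0 - valid_ratio
    if valid_ratio > invalid_ratio:
        sensitivity += step          # valid speech dominates: raise sensitivity
    elif valid_ratio < invalid_ratio:
        if valid_ratio > previous_valid_ratio:
            sensitivity += step      # minority, but rising: raise sensitivity
        elif valid_ratio < previous_valid_ratio:
            sensitivity -= step      # minority and falling: lower sensitivity
    # Keep the result inside the assumed [0, 1] range.
    return min(1.0, max(0.0, sensitivity))
```

Comparing against one previous ratio is the simplest trend estimate; a real system might instead fit a slope over several windows.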
The device according to any one of claims 17 to 19, wherein the situation of the historical voice information includes the state of the voice dialogue between the device and the user when the first voice information is obtained;

in the case that the device is in a voice dialogue with the user, the sensitivity of the decision condition is increased.
A device, characterized in that the device includes a processor and a memory, wherein the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the device executes the method of any one of claims 1 to 13.
A chip system, characterized in that the chip system is applied to an electronic device; the chip system includes an interface circuit and a processor; the interface circuit and the processor are interconnected by lines; the interface circuit is used for receiving a signal from a memory of the electronic device and sending the signal to the processor, the signal including computer instructions stored in the memory; when the processor executes the computer instructions, the chip system executes the method of any one of claims 1 to 13.
A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the method of any one of claims 1 to 13.
A computer program product, characterized in that, when the computer program product is executed by a processor, the method of any one of claims 1 to 13 is performed.