CN113330513B - Voice information processing method and equipment

Info

Publication number
CN113330513B
CN113330513B
Authority
CN
China
Prior art keywords
voice information
condition
information
voice
acquired
Prior art date
Legal status
Active
Application number
CN202180001492.4A
Other languages
Chinese (zh)
Other versions
CN113330513A
Inventor
杨世辉
聂为然
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN113330513A
Application granted
Publication of CN113330513B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command

Abstract

The embodiments of the application disclose a voice information processing method and device. The method includes: acquiring first voice information; and, when the first voice information is determined to be a valid voice control instruction based on a decision condition, executing the operation indicated by the first voice information, where the decision condition is adjusted based on the environmental conditions under which the first voice information was generated. The application can improve the accuracy of valid-voice recognition in different intelligent voice interaction scenarios and reduce the false-trigger rate of invalid voice.

Description

Voice information processing method and equipment
Technical Field
The application relates to the technical field of voice processing, in particular to a voice information processing method and equipment.
Background
In intelligent voice interaction scenarios, there are two common modes for listening to the user's voice: a continuous listening mode and a full-duration wake-free mode, the latter also referred to as a full-time listening mode. In continuous or full-time listening, the smart device needs to determine whether the user's utterance is a valid instruction directed at it, i.e. it needs to distinguish human-to-machine dialogue from human-to-human dialogue.
Specifically, in the listening state, the voice information collected by the device includes chit-chat. To avoid the smart device being falsely triggered by chit-chat content, a rule matching module or an inference module (such as a neural network inference module) is often used to determine whether the acquired voice information is a valid voice control instruction. However, the validity of the same voice information, or of voice information with the same semantics, may differ across usage environments and scenarios: a given sentence may be a valid voice control instruction in the current scenario but mere chit-chat, i.e. invalid information, in another. Existing validity judgment schemes cannot adapt validity recognition to different usage environments and scenarios, which easily leads to low recognition accuracy and false triggering by invalid voice.
In summary, how to improve the accuracy of valid-voice recognition in different intelligent voice interaction scenarios and reduce the false-trigger rate of invalid voice is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The application provides a voice information processing method and device, which can improve the accuracy of valid-voice recognition and reduce the false-trigger rate of invalid voice in different intelligent voice interaction scenarios.
In a first aspect, the present application provides a voice information processing method, including:
acquiring first voice information; and, when the first voice information is determined to be a valid voice control instruction based on a decision condition, executing the operation indicated by the first voice information, where the decision condition is adjusted based on the environmental conditions under which the first voice information was generated.
The environmental conditions under which voice information is generated strongly influence whether it is a valid voice control instruction: the same or similar voice information may be a valid instruction under one environmental condition but not under another. The application therefore adaptively adjusts the decision condition used to judge validity according to the voice information received under different environmental conditions. This allows validity to be judged better under different environmental conditions, improves the accuracy of validity judgment, and reduces the false-trigger rate of invalid signals.
In a possible implementation manner, the environmental conditions under which the first voice information is generated include one or more of the following: the number of speakers within a second preset duration before the device acquires the first voice information; the number of people within a preset range when the first voice information is generated; the confidence of the first voice information; or the signal-to-noise ratio of the first voice information.
The more speakers there are over a recent period, and/or the more people are nearby when the voice information is generated, the higher the probability that the voice information received by the device is chit-chat and therefore invalid. In addition, the higher the confidence and/or signal-to-noise ratio of the voice information, the higher the probability that the device correctly recognizes its sentences, which also affects validity recognition. Adaptively adjusting the decision condition based on one or more of these factors therefore allows validity to be judged better, improves the accuracy of validity judgment, and reduces the false-trigger rate of invalid signals.
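For illustration only, the following is a minimal sketch of how such environmental signals might be combined into a single estimate that the utterance is valid; the field names, weights, and thresholds are invented for the example and are not taken from the application:

```python
from dataclasses import dataclass

@dataclass
class EnvironmentalCondition:
    # Speakers heard within the second preset duration before acquisition.
    recent_speaker_count: int
    # People within the preset range when the voice information was generated.
    nearby_person_count: int
    # Confidence of the first voice information, in [0, 1].
    asr_confidence: float
    # Signal-to-noise ratio of the first voice information, in dB.
    snr_db: float

def validity_prior(env: EnvironmentalCondition) -> float:
    """Heuristic prior that the acquired voice is a valid instruction.

    More speakers / more bystanders -> more likely chit-chat (invalid);
    higher confidence / SNR -> more likely correctly recognized (valid).
    All weights are illustrative, not specified by the application.
    """
    prior = 0.5
    prior -= 0.05 * max(env.recent_speaker_count - 1, 0)
    prior -= 0.05 * max(env.nearby_person_count - 1, 0)
    prior += 0.2 * (env.asr_confidence - 0.5)
    prior += 0.01 * (env.snr_db - 10.0)
    return min(max(prior, 0.0), 1.0)
```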
In a possible implementation manner, the decision condition being adjusted based on the environmental conditions under which the first voice information was generated includes: the decision condition is adjusted based on the environmental conditions and the continuous listening duration of the device.
The longer the device has been continuously listening, the higher the probability that the voice information it hears is invalid. The application therefore combines the environmental conditions under which the voice information was generated with the device's continuous listening duration to adaptively adjust the decision condition, further improving the accuracy of validity judgment and reducing the false-trigger rate of invalid signals.
In a possible implementation manner, the decision condition being adjusted based on the environmental conditions and the continuous listening duration of the device includes: the decision condition is adjusted based on the environmental conditions, the continuous listening duration, and the condition of historical voice information.
Historical voice information can also help judge the validity of the currently acquired voice information. For example, the more similar the currently acquired voice information is to historically acquired valid voice information, the higher the probability that it is a valid voice instruction; conversely, the more similar it is to historically acquired invalid voice information, the higher the probability that it is invalid. Therefore, in addition to the environmental conditions and the device's listening duration, the application also uses historical voice information to adaptively adjust the decision condition, further improving the accuracy of validity judgment and reducing the false-trigger rate of invalid signals.
In a possible implementation manner, the decision condition being adjusted based on the environmental conditions under which the first voice information was generated includes: the decision condition is adjusted based on the environmental conditions and the condition of historical voice information.
As described above, combining the environmental conditions under which the voice information was generated with historical voice information to adaptively adjust the decision condition further improves the accuracy of validity judgment and reduces the false-trigger rate of invalid signals.
In a possible implementation manner, the condition of the historical voice information includes one or more of the following:
a first time interval between when the first voice information is acquired and when valid voice information was last acquired;
a second time interval between when the first voice information is acquired and when invalid voice information was last acquired;
the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
a first degree of semantic association between the first voice information and the valid voice information last acquired;
a second degree of semantic association between the first voice information and the invalid voice information last acquired;
a third degree of association between the first voice information and the valid voice information last acquired by the device;
the state of the voice dialogue between the device and the user at the time the first voice information is acquired;
a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
In the application, the historical voice information that can help judge the validity of the currently acquired voice information includes one or more of the above items. Adaptively adjusting the decision condition based on one or more of them allows validity to be judged better, improves the accuracy of validity judgment, and reduces the false-trigger rate of invalid signals.
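For illustration, the sketch below computes the first and second similarities listed above; cosine similarity over acoustic feature vectors is an assumption made for the example, since the application does not specify the similarity measure:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def history_similarity(features: np.ndarray,
                       valid_history: list[np.ndarray],
                       invalid_history: list[np.ndarray]) -> tuple[float, float]:
    """Return (first_similarity, second_similarity): how close the current
    utterance's acoustic features are to the most similar historical valid
    and invalid utterances, respectively."""
    first = max((cosine_similarity(features, v) for v in valid_history), default=0.0)
    second = max((cosine_similarity(features, v) for v in invalid_history), default=0.0)
    return first, second
```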
In a possible implementation, when the environmental conditions indicate that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is increased;
when the environmental conditions indicate that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the decision condition is decreased.
In the embodiment of the application, if the probability that the received voice information is valid is higher, the validity judgment threshold can be lowered, i.e. the sensitivity of the decision condition raised; if the probability is lower, the threshold can be raised, i.e. the sensitivity lowered. Voice information received under different environmental conditions can thus be recognized flexibly, improving recognition accuracy, rather than judging the voice information of every scenario against a fixed, one-size-fits-all decision condition.
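A minimal sketch of this threshold-sensitivity relationship (raising sensitivity means lowering the threshold); the swing size is an illustrative assumption:

```python
def adjusted_threshold(base_threshold: float, valid_prob: float,
                       max_swing: float = 0.1) -> float:
    """Higher sensitivity == lower threshold. If the environment suggests the
    utterance is more likely valid (valid_prob > 0.5), lower the threshold;
    otherwise raise it. max_swing bounds the adjustment (illustrative value)."""
    return base_threshold - max_swing * (2.0 * valid_prob - 1.0)
```

For example, with a base threshold of 0.7 and an estimated valid probability of 0.9, the threshold drops to 0.62, making a borderline utterance more likely to be accepted as a valid instruction.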
In a possible embodiment, the longer the continuous listening duration of the device, the lower the sensitivity of the decision condition is adjusted.
Since the longer the device continuously listens, the higher the probability that the voice information it hears is invalid, the validity judgment threshold can be raised, i.e. the sensitivity of the decision condition lowered, so that whether the voice information is valid can be recognized more accurately.
In a possible implementation manner, the condition of the historical voice information includes a first time interval between when the first voice information is acquired and when valid voice information was last acquired; the longer the first time interval, the lower the sensitivity of the decision condition is adjusted.
Because the longer the interval between acquiring the current voice signal and last acquiring valid voice information, the higher the probability that the current voice signal is an invalid voice instruction, the application can raise the validity judgment threshold, i.e. lower the sensitivity of the decision condition, so that whether the voice information is valid can be recognized more accurately.
In a possible implementation manner, the condition of the historical voice information includes a second time interval between when the first voice information is acquired and when invalid voice information was last acquired; the longer the second time interval, the lower the sensitivity of the decision condition is adjusted.
Because the longer the interval between acquiring the current voice signal and last acquiring invalid voice information, the higher the probability that the current voice signal is an invalid voice instruction, the application can raise the validity judgment threshold, i.e. lower the sensitivity of the decision condition, so that whether the voice information is valid can be recognized more accurately.
In a possible implementation manner, the condition of the historical voice information includes both a first time interval between when the first voice information is acquired and when valid voice information was last acquired, and a second time interval between when the first voice information is acquired and when invalid voice information was last acquired; when the first time interval is smaller than the second time interval, the sensitivity of the decision condition is increased.
In the application, the first time interval being smaller than the second time interval indicates that not much time has passed since the last valid voice information was acquired, so the probability that the first voice information is a valid voice instruction is relatively high. The validity judgment threshold can therefore be lowered, i.e. the sensitivity of the decision condition raised, so that whether the voice information is valid can be recognized more accurately.
In a possible implementation manner, the condition of the historical voice information includes the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
when the proportion of valid voice information is greater than the proportion of invalid voice information, the sensitivity of the decision condition is increased;
when the proportion of valid voice information is less than the proportion of invalid voice information: if the proportion of valid voice information is trending upward, the sensitivity of the decision condition is increased; if it is trending downward, the sensitivity of the decision condition is decreased.
In the application, if valid voice information accounts for the larger share within the first preset duration, the probability that the first voice information is a valid instruction is higher, so the validity judgment threshold is lowered and the sensitivity of the decision condition raised. In addition, if the proportion of valid voice information is smaller than that of invalid voice information but trending upward, valid voice information is becoming more frequent, and the probability that the first voice signal is a valid instruction is correspondingly higher, so the threshold can likewise be lowered and the sensitivity raised, allowing whether the voice information is valid to be recognized more accurately.
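A sketch of this proportion-and-trend rule; the step size and the way the trend is encoded are assumptions made for the example:

```python
def adjust_for_history_ratio(threshold: float,
                             valid_ratio: float,
                             valid_ratio_trend: float,
                             step: float = 0.05) -> float:
    """Adjust the decision threshold from the valid/invalid proportions in the
    first preset duration. valid_ratio is in [0, 1]; valid_ratio_trend > 0
    means the proportion of valid utterances is rising. Lowering the
    threshold == raising the sensitivity of the decision condition."""
    if valid_ratio > 0.5:
        return threshold - step   # mostly valid lately -> more sensitive
    if valid_ratio_trend > 0:
        return threshold - step   # valid share rising -> more sensitive
    return threshold + step       # valid share falling -> less sensitive
```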
In a possible implementation manner, the condition of the historical voice information includes the state of the voice dialogue between the device and the user when the first voice information is acquired; when a voice dialogue between the device and the user is in progress, the sensitivity of the decision condition is increased.
The state of the voice dialogue between the device and the user, which the device can track through its dialogue state tracking function, indicates whether a dialogue is in progress. If such a state currently exists, the first voice information is more likely a valid voice instruction, so the validity judgment threshold can be lowered, i.e. the sensitivity of the decision condition raised, so that whether the voice information is valid can be recognized more accurately.
In a possible embodiment, the device may receive a specified sensitivity for the decision condition, adjust the decision condition based on that sensitivity, and then use the adjusted decision condition to determine whether the first voice information is valid.
In the application, the specified sensitivity is a sensitivity input by the user; the device can adjust the sensitivity of the decision condition more flexibly according to the user's needs and thereby better meet them.
In one possible embodiment, the present application provides another voice information processing method, which includes: acquiring first voice information; and, when the first voice information is determined to be a valid voice control instruction based on a decision condition, executing the operation indicated by the first voice information, where the decision condition is adjusted based on the continuous listening duration of the device.
In the application, since the longer the device continuously listens, the higher the probability that the voice information it hears is invalid, adaptively adjusting the decision condition through the device's continuous listening duration allows validity to be judged better, improves the accuracy of validity judgment, and reduces the false-trigger rate of invalid signals.
In one possible embodiment, the present application provides another voice information processing method, which includes: acquiring first voice information; and, when the first voice information is determined to be a valid voice control instruction based on a decision condition, executing the operation indicated by the first voice information, where the decision condition is adjusted based on historical voice information.
Historical voice information can also help judge the validity of the currently acquired voice information. For example, the more similar the currently acquired voice information is to historically acquired valid voice information, the higher the probability that it is a valid voice instruction; conversely, the more similar it is to historically acquired invalid voice information, the higher the probability that it is invalid. Therefore, in the application, adaptively adjusting the decision condition through historical voice information allows validity to be judged better, improves the accuracy of validity judgment, and reduces the false-trigger rate of invalid signals.
In a second aspect, the present application provides a voice information processing apparatus, the apparatus comprising:
The acquisition unit is used for acquiring the first voice information;
and an execution unit, used for executing the operation indicated by the first voice information when the first voice information is determined to be a valid voice control instruction based on a decision condition, where the decision condition is adjusted based on the environmental conditions under which the first voice information was generated.
In a possible implementation manner, the environmental conditions under which the first voice information is generated include one or more of the following: the number of speakers within a second preset duration before the device acquires the first voice information; the number of people within a preset range when the first voice information is generated; the confidence of the first voice information; or the signal-to-noise ratio of the first voice information.
In a possible implementation manner, the decision condition being adjusted based on the environmental conditions under which the first voice information was generated includes: the decision condition is adjusted based on the environmental conditions and the continuous listening duration of the device.
In a possible implementation manner, the decision condition being adjusted based on the environmental conditions and the continuous listening duration of the device includes: the decision condition is adjusted based on the environmental conditions, the continuous listening duration, and the condition of historical voice information.
In a possible implementation manner, the decision condition being adjusted based on the environmental conditions under which the first voice information was generated includes: the decision condition is adjusted based on the environmental conditions and the condition of historical voice information.
In a possible implementation manner, the condition of the historical voice information includes one or more of the following:
a first time interval between when the first voice information is acquired and when valid voice information was last acquired;
a second time interval between when the first voice information is acquired and when invalid voice information was last acquired;
the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
a first degree of semantic association between the first voice information and the valid voice information last acquired;
a second degree of semantic association between the first voice information and the invalid voice information last acquired;
a third degree of association between the first voice information and the valid voice information last acquired by the device;
the state of the voice dialogue between the device and the user at the time the first voice information is acquired;
a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
In a possible implementation, when the environmental conditions indicate that the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is increased;
when the environmental conditions indicate that the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the decision condition is decreased.
In a possible embodiment, the longer the continuous listening duration of the device, the lower the sensitivity of the decision condition is adjusted.
In a possible implementation manner, the condition of the historical voice information includes a first time interval between when the first voice information is acquired and when valid voice information was last acquired; the longer the first time interval, the lower the sensitivity of the decision condition is adjusted.
In a possible implementation manner, the condition of the historical voice information includes a second time interval between when the first voice information is acquired and when invalid voice information was last acquired; the longer the second time interval, the lower the sensitivity of the decision condition is adjusted.
In a possible implementation manner, the condition of the historical voice information includes both a first time interval between when the first voice information is acquired and when valid voice information was last acquired, and a second time interval between when the first voice information is acquired and when invalid voice information was last acquired; when the first time interval is smaller than the second time interval, the sensitivity of the decision condition is increased.
In a possible implementation manner, the condition of the historical voice information includes the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
when the proportion of valid voice information is greater than the proportion of invalid voice information, the sensitivity of the decision condition is increased;
when the proportion of valid voice information is less than the proportion of invalid voice information: if the proportion of valid voice information is trending upward, the sensitivity of the decision condition is increased; if it is trending downward, the sensitivity of the decision condition is decreased.
In a possible implementation manner, the condition of the historical voice information includes the state of the voice dialogue between the device and the user when the first voice information is acquired; when a voice dialogue between the device and the user is in progress, the sensitivity of the decision condition is increased.
In a third aspect, the present application provides an apparatus, which may include a processor and a memory, for implementing the voice information processing method described in the first aspect. The memory is coupled to the processor, and the processor executes a computer program stored in the memory to implement the method according to the first aspect or any one of its possible implementation manners. The apparatus may also include a communication interface for communicating with other devices; the communication interface may be, for example, a transceiver, circuit, bus, module, or other type of communication interface.
In one possible implementation, the apparatus may include:
a memory for storing a computer program;
a processor for acquiring first voice information and executing, when the first voice information is determined to be a valid voice control instruction based on a decision condition, the operation indicated by the first voice information, where the decision condition is adjusted based on the environmental conditions under which the first voice information was generated.
The computer program in the memory may be stored in advance, or may be downloaded from the internet and then stored when the device is used; the source of the computer program in the memory is not particularly limited. The coupling in the embodiments of the present application is an indirect coupling or connection between devices, units, or modules, which may be electrical, mechanical, or of other form, for exchanging information between the devices, units, or modules.
In a fourth aspect, an embodiment of the present application provides a chip system, applied to an electronic device. The chip system comprises an interface circuit and a processor, interconnected by a circuit. The interface circuit is used to receive signals from a memory of the electronic device and send signals to the processor, the signals comprising computer instructions stored in the memory; when the computer instructions are executed by the processor, the chip system performs the method as described in the first aspect and any one of its possible implementations.
In a fifth aspect, the present application provides a computer readable storage medium storing a computer program for execution by a processor to implement the method of the first aspect or any one of the possible implementation manners of the first aspect.
In a sixth aspect, the present application provides a computer program product which, when executed by a processor, implements the method of the first aspect or any one of its possible implementation manners.
The solutions provided in the second through sixth aspects are used to implement, or cooperate in implementing, the methods provided in the first aspect, and can therefore achieve the same or corresponding beneficial effects as the corresponding methods of the first aspect; details are not repeated here.
Drawings
FIG. 1 is a schematic diagram of a system architecture to which the voice information processing method of the present application is applicable;
FIG. 2 is a schematic flow chart of a voice information processing method according to the present application;
FIG. 3 is a schematic structural diagram of an invalid rejection model according to the present application;
FIG. 4 and FIG. 5 are schematic diagrams of adjusting the sensitivity of a decision condition based on an influencing factor according to the present application;
FIG. 6A and FIG. 6B are schematic diagrams of adjusting the sensitivity of a decision condition based on an influencing factor according to the present application;
FIG. 6C and FIG. 6D are diagrams showing changes in the proportions of voice information according to the present application;
FIG. 7 is a schematic diagram of adjusting the sensitivity of a decision condition based on influencing factors according to the present application;
FIG. 8A and FIG. 8B are schematic diagrams illustrating voice information association judgment according to the present application;
FIG. 9 is a schematic diagram of adjusting the sensitivity of a decision condition based on influencing factors according to the present application;
FIG. 10 is a flowchart illustrating another voice information processing method according to the present application;
FIG. 11 is a schematic flow chart of voice information validity recognition according to the present application;
FIG. 12 is a schematic diagram of a logic structure of a device according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a logic structure of another device according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a hardware structure of a device according to an embodiment of the present application;
FIG. 15 is a schematic diagram of a hardware structure of another device according to an embodiment of the present application.
Detailed Description
For easy understanding, technical terms related to the embodiments of the present application will be first described below.
1. Automatic speech recognition (automatic speech recognition, ASR) generally refers to technology that takes speech as its research object and, through speech signal processing and pattern recognition, lets a machine automatically recognize and understand human spoken language and convert speech signals into corresponding text or commands.
Building a speech recognition system generally involves two major parts: training and recognition. Training is typically done offline: signal processing and knowledge mining are performed on a pre-collected, large-scale speech and language database to obtain the acoustic model (a knowledge representation of differences in acoustics, phonetics, environmental variables, speaker gender, accent, etc.) and the language model (a knowledge representation of sets of word sequences) required by the system. Recognition is usually performed online, automatically recognizing the user's real-time speech. The recognition process can generally be divided into a front-end module and a back-end module: the front end mainly performs endpoint detection (removing excess silence and non-speech sounds), noise reduction, feature extraction, and so on; the back end uses the trained acoustic model and language model to perform statistical pattern recognition (also called decoding) on the feature vectors of the user's speech, obtaining the text information they contain. The back end also includes an adaptive feedback module that self-learns from the user's voice and applies necessary corrections to the acoustic and language models, further improving recognition accuracy.
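As a toy illustration of the front-end endpoint-detection step only (a simple energy-threshold heuristic, not the application's implementation; the threshold value is an assumption):

```python
import numpy as np

def endpoint_detect(frames: np.ndarray, energy_thresh: float = 1e-3) -> np.ndarray:
    """Keep only frames whose short-time energy exceeds a threshold,
    discarding silence and very quiet non-speech. frames has shape
    (num_frames, frame_len); in practice the threshold would be tuned
    or replaced by a trained voice-activity detector."""
    energy = (frames ** 2).mean(axis=1)  # per-frame short-time energy
    return frames[energy > energy_thresh]
```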
2. Voiceprint recognition (voiceprint recognition, VR)
Voiceprint recognition, also called speaker recognition, is a biometric technology that distinguishes a speaker's identity by voice. Voiceprint recognition comes in two types: speaker identification and speaker verification. Different tasks and applications may use different voiceprint recognition techniques, such as identification techniques to narrow the scope of a criminal investigation and verification techniques for banking transactions.
3. Speech synthesis
Speech synthesis, also called text-to-speech (TTS) technology, converts text information generated by a computer or input from outside into intelligible, fluent spoken output; it is equivalent to fitting a machine with an artificial mouth so that it can speak like a human.
4. Task-oriented dialogue system
A task-oriented dialogue can be understood as a sequential decision process in which the machine updates the dialogue state it maintains by understanding the user's statements and then selects the next optimal action (e.g., confirming the demand, asking about constraints, providing results) based on the current dialogue state, thereby completing the task.
The task-oriented dialogue systems commonly used in industry adopt a modular structure and generally comprise four key modules:
Natural language understanding (natural language understanding, NLU): recognizes and parses the user's text input to obtain semantic labels that computers can understand, such as slot values and intents.
Dialog state tracking (dialog state tracking, DST): maintains the current dialogue state from the dialogue history; the state is a cumulative semantic representation of the entire dialogue history, typically slot-value pairs (a minimal sketch follows this list).
Dialog policy (dialogue policy, DP): outputs the next system action according to the current dialogue state. The dialog state tracking module and the dialog policy module are often collectively referred to as the dialog management (dialogue manager, DM) module.
Natural language generation (natural language generation, NLG): converts system actions into natural language output.
This modular structure is highly interpretable and easy to put into practice, and most practical task-oriented dialogue systems in industry adopt it.
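As an illustration of slot-value-pair state tracking, here is a hypothetical sketch (the slot names are invented examples, not from the application):

```python
from typing import Dict

class DialogStateTracker:
    """Minimal DST: the dialogue state is an accumulated set of slot-value
    pairs over the whole dialogue history, as described above."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}

    def update(self, nlu_slots: Dict[str, str]) -> Dict[str, str]:
        # Newly understood slots overwrite or extend the accumulated state.
        self.state.update(nlu_slots)
        return self.state

tracker = DialogStateTracker()
tracker.update({"intent": "navigate", "destination": "airport"})
tracker.update({"route_preference": "avoid_tolls"})
# state now: {'intent': 'navigate', 'destination': 'airport',
#             'route_preference': 'avoid_tolls'}
```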
5. Computer Vision (CV)
Computer vision, also known as machine vision, is the science of how to make machines "see"; its main task is to process acquired pictures or videos to obtain information about the corresponding scene.
6. Invalid rejection model
The invalid rejection model judges the validity of user voice information acquired by the device. Validity indicates whether the voice information is a valid voice control instruction for the device that acquired it. The voice information may be, for example, text information converted from a voice signal received by the device.
The device may receive a great deal of user voice information while listening, but some of it is merely chit-chat between users, which is not valid information for the device. The voice information through which a user actually interacts with the device is valid information, i.e. the user's voice control instructions.
In the application, the invalid rejection model may comprise a pre-judgment stage and a decision module for voice information validity. The pre-judgment stage comprises a rule matching module and an inference module and preliminarily judges the validity of the voice information. Specifically:
The rule matching module may match the input voice information against preset rules, e.g. preset sentences: if a preset sentence matches the input voice information, the input voice information is valid; if no preset sentence matches it, it is invalid.
The inference module may be a deep learning prediction model trained on large-scale data using neural networks, or a conventional machine learning model (e.g., a supervised learning model such as a support vector machine (support vector machine, SVM)). The device inputs the acquired voice information into the inference module, which predicts the probability that the voice information is valid or directly outputs a valid/invalid result.
The decision module can apply comprehensive judgment conditions to the processing result of at least one of the rule matching module and the inference module to make the final decision on whether the voice information is valid, greatly improving the accuracy of validity judgment. The comprehensive judgment conditions are described later and are not detailed here.
The invalid rejection model may also be called a validity judgment model or similar; the description below uses the invalid rejection model as an example, and the name of the model used to judge the validity of the voice information acquired by the device does not limit the present application.
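For illustration only, the sketch below outlines how the three modules described above might be composed. The rules, the model interface, the threshold value, the 3-character comprehensive check, and the simple "either module accepts" combination are all assumptions made for the example, not the application's specified decision logic:

```python
from typing import Callable, List

class InvalidRejectionModel:
    """Skeleton of an invalid-rejection model: a rule matching module, an
    inference module, and a decision module that makes the final call."""

    def __init__(self, rules: List[str],
                 inference: Callable[[str], float],
                 threshold: float = 0.7) -> None:
        self.rules = rules            # preset sentences for rule matching
        self.inference = inference    # returns P(valid) for an utterance
        self.threshold = threshold    # judgment threshold, adjustable at runtime

    def rule_match(self, text: str) -> bool:
        # Valid if any preset sentence matches the input (substring match here).
        return any(rule in text for rule in self.rules)

    def infer(self, text: str) -> bool:
        # Valid if the predicted validity probability exceeds the threshold.
        return self.inference(text) > self.threshold

    def decide(self, text: str) -> bool:
        # Comprehensive check mirroring the example given later in the text:
        # a valid instruction contains at least 3 characters.
        if len(text) < 3:
            return False
        return self.rule_match(text) or self.infer(text)
```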
In order to better understand a voice information processing method provided by the embodiment of the present application, a system architecture to which the voice information processing method is applicable is exemplarily described below.
Referring to FIG. 1, FIG. 1 schematically shows a system architecture to which the voice information processing method of the present application applies. The system architecture may include an audio manager 110, a video manager 120, a memory 130, and a processor 140, which may be connected by a bus 150.
The audio manager 110 may include a speaker and a microphone array. A speaker is a transducer that converts an electrical signal into a sound signal, used to output the device's sound. A microphone is a transducer that converts a sound signal into an electrical signal, used to collect sounds such as human voice.
The video manager 120 may include an array of cameras. The camera is capable of converting an optical image signal into an electrical signal for storage or transmission.
Memory 130 is used to store computer programs and data. The memory 130 may be, but is not limited to, a random access memory (random access memory, RAM), a read-only memory (ROM), an erasable programmable read-only memory (erasable programmable read only memory, EPROM), or a portable read-only memory (compact disc read-only memory, CD-ROM), etc.
In the present application, the memory 130 may store therein computer programs or codes of models such as an automatic speech recognition model, a voiceprint recognition model, a computer vision model, an invalid rejection model, a natural language understanding model, a dialogue management model, and a speech synthesis model.
The processor 140 may be a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. A processor may also be a combination that performs a computing function, such as a combination comprising one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The processor 140 may be configured to read the computer programs and data stored in the memory 130 and execute the voice information processing method provided by the embodiment of the present application.
The application does not limit the type of bus 150. By way of example, bus 150 may be a desktop bus (D-Bus), an inter-process communication (IPC) mechanism optimized for desktop environments for process-to-process or process-to-kernel communication. Alternatively, bus 150 may be a data bus (DB), an address bus (AB), a control bus (CB), or the like.
The system architecture shown in FIG. 1 may be, for example, that of a terminal device or a server. The terminal device may include, but is not limited to, any device based on an intelligent operating system that can interact with a user through input devices such as a keyboard, virtual keyboard, touch pad, touch screen, or voice-activated device, for example a smartphone, tablet, handheld computer, wearable electronic device, or vehicle-mounted device (e.g., an in-vehicle computer). The server may be an edge server or a cloud server, and may be a virtual server or a physical server; the application is not limited in this regard.
The system architecture shown in FIG. 1 is merely an example and does not limit the system architectures to which the embodiments of the present application are applicable.
The voice information processing method provided by the embodiment of the present application is applicable to the system architecture shown in FIG. 1; that is, it is performed by a device such as a terminal device or a server, or by a processing device such as a chip or processor in the terminal device or server. In the following description the executing subject is collectively referred to as the device. Alternatively, if the executing subject is a server, or a chip or processor in a server, the terminal device may first receive the voice information and then send it to the server for processing. The voice information sent by the terminal device to the server may be the original information it received, or voice information it has preprocessed.
Referring to FIG. 2, a voice information processing method provided in an embodiment of the application may include, but is not limited to, the following steps:
S201, acquiring first voice information.
In particular embodiments, the device may receive a user's voice signal through a microphone. The device may then recognize the voice signal through an automatic speech recognition (ASR) model to obtain the voice information corresponding to the voice signal, which may include text information and the like.
In particular, the voice interaction function between the device and the user may be awakened by receiving a wake-up signal from the user, e.g., a specific wake-up word. After being awakened, the device may detect and receive the user's voice signal through the microphone; this may be referred to as the device's listening process. To avoid having to wake the device before every voice control instruction, two main listening modes exist at present: continuous listening and full-time listening.
The continuous listening mode means: after the device is awakened, or after a voice instruction succeeds, the device does not need to be awakened again for a period of time (e.g. 30 s); during that period it keeps listening, interacts with the user by voice, and executes the user's voice control instructions.
The full-time listening mode means: after the device is powered on, it needs to be awakened only once; until it is powered off it keeps listening, interacts with the user by voice, and executes the user's voice control instructions.
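A toy model of the two listening modes is sketched below; the 30 s window mirrors the example above, while the class and its fields are assumptions made for illustration:

```python
import time

class ListeningSession:
    """Continuous mode listens for a fixed window after a wake-up or a
    successful instruction; full-time mode listens until powered off."""

    def __init__(self, mode: str, window_s: float = 30.0) -> None:
        self.mode = mode                       # "continuous" or "full_time"
        self.window_s = window_s
        self.last_activity = time.monotonic()

    def on_wake_or_command(self) -> None:
        # A wake-up or a successful voice instruction renews the window.
        self.last_activity = time.monotonic()

    def is_listening(self) -> bool:
        if self.mode == "full_time":
            return True
        return time.monotonic() - self.last_activity < self.window_s
```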
The first voice information may be the voice information corresponding to any voice signal received by the device during the listening stage.
S202, adjusting the decision condition based on influencing factors of the validity of the first voice information, where the decision condition is one or more judgment conditions in the invalid rejection model used to judge the validity of the first voice information.
To facilitate understanding of the invalid rejection model described above, refer to FIG. 3, which exemplarily shows the model's processing flow. First, the invalid rejection model receives the voice information, e.g. the first voice information, and, based on the voice information and a preset selection condition, selects the pre-judgment module(s) for judging its validity, i.e. selects at least one of the inference module and the rule matching module to pre-judge the validity of the voice information.
The selection condition may be set based on influencing factors of voice information validity. Illustratively, the selection condition may be: when the device's listening duration is greater than a first threshold, select the rule matching module to judge validity; when the listening duration is less than a second threshold, select the inference module; and when the listening duration lies between the second and first thresholds, select both modules simultaneously. The influencing factors of voice information validity are not limited to the device's listening duration; they are described in detail below.
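A minimal sketch of this selection condition; the function name and threshold parameters are placeholders, not values from the application:

```python
def select_prejudge_modules(listening_s: float,
                            low_s: float, high_s: float) -> set[str]:
    """Selection condition from the example above: long listening -> rule
    matching only; short listening -> inference only; in between -> both.
    high_s / low_s correspond to the first and second thresholds; their
    values are whatever a deployment chooses."""
    if listening_s > high_s:
        return {"rule_matching"}
    if listening_s < low_s:
        return {"inference"}
    return {"rule_matching", "inference"}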
If only the inference module is selected to pre-judge validity, the device inputs the acquired voice information into the inference module and computes an output result. For example, the output may be the predicted probability that the input voice information is valid, which is then compared with a preset judgment threshold to obtain the pre-judgment result: if the probability is greater than the threshold, the input voice information is pre-judged valid; if it is less, invalid. For example, with a judgment threshold of 70%, voice information is valid whenever its predicted valid probability exceeds 70%: a predicted probability of 80% makes it valid information, while a predicted probability of 50% makes it invalid.
It should be noted that the inference module's output is not limited to a validity probability; it may take other data forms, e.g. a score, where the voice information is valid if the score exceeds the judgment threshold. The application is not limited in this respect.
If only the rule matching module is selected to pre-judge validity, the device inputs the acquired voice information into the rule matching module, which compares it with the information in a preset rule base: if some information in the rule base matches the input voice information, the pre-judgment result is that the input voice information is valid; otherwise, the pre-judgment result is that it is invalid.
When only the inference module or only the rule matching module is selected for pre-judgment, the pre-judgment result can then be input into the decision module, which checks through the comprehensive judgment conditions whether the result is reasonable and outputs the final indication of validity. For example, suppose the comprehensive judgment condition is that valid voice information contains no fewer than 3 characters. If the input voice information has fewer than 3 characters but the inference or rule matching module pre-judged it valid, the pre-judgment is unreasonable; the decision module then determines the voice information invalid and outputs an indication to that effect. Conversely, if the input has no fewer than 3 characters, a valid pre-judgment is reasonable, and the decision module finally determines the voice information valid and outputs an indication of validity.
It should be noted that the comprehensive judgment condition is not limited to the above example; other kinds of conditions are possible. In one possible implementation, it may be a voting mechanism: the voice information is determined valid if more votes say it is valid, and invalid if more votes say it is invalid.
Alternatively, in a possible implementation, when only the inference module or only the rule matching module is selected for pre-judgment, no comprehensive judgment is needed, and that module's result is output as the final result of the invalid rejection model.
If both the inference module and the rule matching module are selected to pre-judge validity, the acquired voice information is input into each of them; the two modules pre-judge validity according to their own flows (see the descriptions above) to obtain their respective results, which are then input into the decision module. The decision module makes the final judgment on the two pre-judgment results based on the comprehensive judgment conditions and outputs the final result of the invalid rejection model.
Illustratively, the comprehensive judgment condition may be: valid voice information contains no fewer than 3 characters. The decision module then checks the reasonableness of the two pre-judgment results against this condition; the specific checking process is as described above and is not repeated here.
For example, in one possible implementation, the comprehensive judgment condition may be a voting mechanism: if both validity pre-judgment results are valid, the final result is valid; if both are invalid, the final result is invalid. If one result is valid and the other invalid, a further judgment can be made, e.g. by priority: if the inference module has higher priority than the rule matching module, its pre-judgment result is output as the final result; if the rule matching module has higher priority, its pre-judgment result is output as the final result.
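A sketch of this voting-plus-priority decision; the priority flag is illustrative, since the text leaves the priority assignment open:

```python
from typing import Optional

def vote(rule_valid: Optional[bool], infer_valid: Optional[bool],
         inference_has_priority: bool = True) -> bool:
    """Voting mechanism from the example above: agreement wins outright;
    on a split vote the higher-priority module's verdict is output."""
    votes = [v for v in (rule_valid, infer_valid) if v is not None]
    if not votes:
        return False              # no pre-judgment available
    if all(votes):
        return True               # both (or the only) votes say valid
    if not any(votes):
        return False              # both (or the only) votes say invalid
    # Split decision: defer to the higher-priority module.
    preferred = infer_valid if inference_has_priority else rule_valid
    return bool(preferred)
```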
It should be noted that the above comprehensive judgment condition is only an example; its main purpose is to judge the validity of the acquired voice information by combining the relatively accurate pre-judgment results of the inference module and/or the rule matching module. In a specific embodiment, the comprehensive judgment condition may be any other condition that achieves this purpose, which is not limited by the present scheme.
Based on the description of fig. 3, the decision conditions described in S202 may include one or more of the selection condition in the invalid rejection model, the judgment threshold applied to the output of the inference module, and the comprehensive judgment condition. To improve the accuracy of valid voice recognition and reduce the false triggering rate of invalid voice in different scenes, the decision conditions can be flexibly adjusted based on one or more influencing factors that affect the validity judgment of input voice information in different voice interaction scenes, making the validity recognition of voice information more flexible and better suited to the current context and scene.
In a possible implementation, adjusting the decision condition based on the influencing factors of the validity of the first voice information may work as follows:

When, based on one or more voice information validity influencing factors, the probability that the first voice information is valid is greater than the probability that it is invalid, the sensitivity of the decision condition is increased, where a higher sensitivity means a higher probability that the first voice information is determined to be valid by the decision condition. When the probability that the first voice information is valid is less than the probability that it is invalid, the sensitivity of the decision condition is reduced, where a lower sensitivity means a lower probability that the first voice information is determined to be valid by the decision condition. The sensitivity of the decision conditions and the specific adjustment procedure are described below.
Optionally, the above influencing factors that can influence the validity recognition of the input voice information may include one or more of the following:
The duration for which the device has continuously listened when the voice information is generated; a first time interval between the time the device acquires the voice information and the time it acquired the last valid voice information; a second time interval between the time the device acquires the voice information and the time it acquired the last invalid voice information; the duty ratios of valid voice information and invalid voice information within a first preset duration before the device acquires the voice information; a first association degree between the semantics of the voice information and the semantics of the last valid voice information acquired by the device; a second association degree between the semantics of the voice information and the semantics of the last invalid voice information acquired by the device; a third association degree between the voice information and the last valid voice information acquired by the device; the state of the voice dialogue between the device and the user when the voice information is acquired; a first similarity between the acoustic features of the voice information and those of historical valid voice information; and a second similarity between the acoustic features of the voice information and those of historical invalid voice information.
In a possible implementation, after the device acquires the first voice information, the selection condition in the invalid rejection model may be adjusted based on a first factor, where the first factor may include one or more of the above influencing factors. The specific adjustment procedure is described later and not detailed here.
In a possible implementation, after the device acquires the first voice information, the judgment threshold applied to the output of the inference module in the invalid rejection model may be adjusted based on a second factor, where the second factor may include one or more of the above influencing factors. The influencing factors included in the second factor may be entirely different from, partially the same as, or completely the same as those included in the first factor, depending on the actual situation; this is not limited here. The specific adjustment procedure is described later and not detailed here.
In a possible implementation, after the device acquires the first voice information, the comprehensive judgment condition of the decision module in the invalid rejection model may be adjusted based on a third factor, where the third factor may include one or more of the above influencing factors. The influencing factors included in the third factor may be entirely different from, partially the same as, or completely the same as those included in the first factor and the second factor, depending on the actual situation; this is not limited here. The specific adjustment procedure is described later and not detailed here.
In a specific implementation, the selection condition, the judgment threshold, and the comprehensive judgment condition may all be adjusted together, or only one or two of them may be selected for adjustment, according to actual requirements; this is not limited by the present scheme.
S203: in the case that the first voice information is determined to be valid based on the adjusted decision condition, perform semantic understanding on the first voice information and execute the operation it indicates.
In a specific embodiment, after the device acquires the first voice information and adjusts the decision condition in the invalid rejection model based on the influencing factors, it identifies the validity of the first voice information based on the adjusted invalid rejection model.
In a possible implementation, if the device has adjusted the selection condition in the invalid rejection model, it may, based on the adjusted selection condition, select one or both of the rule matching module and the inference module to pre-judge the validity of the first voice information.
In a possible implementation, if the device has adjusted the judgment threshold of the inference module, and the pre-judgment modules selected to judge the validity of the first voice information include the inference module, then after the inference module outputs data indicating the validity of the first voice information, the device may judge whether the first voice information is valid based on that data and the adjusted judgment threshold.
In a possible implementation, if the device has adjusted the comprehensive judgment condition of the decision module in the invalid rejection model, then after the pre-judgment results of the rule matching module and/or the inference module are obtained, a comprehensive judgment may be performed on them based on the adjusted comprehensive judgment condition, so as to determine the validity of the first voice information.
For the specific process of validity recognition of the first voice information, refer to the description of fig. 3; it is not repeated here.
When the first voice information is valid, the device starts to perform semantic understanding on it. Specifically, the processor in the device may call the natural language understanding model in the memory to perform semantic understanding on the first voice information and obtain its specific meaning. After the device understands the meaning of the first voice information, it performs the corresponding operation based on that meaning so as to provide the user with the required service. In other words, the meaning of the first voice information serves as the control instruction according to which the device executes the corresponding operation.
The following describes, for each of the different influencing factors of voice information validity, the process of adjusting the decision condition used in the validity recognition of the first voice information. It should be noted that the decision condition may include one or more of the selection condition, the judgment threshold, and the comprehensive judgment condition in the invalid rejection model, and the adjustment processes described below may be applied to any one or more of them.
Before introducing the adjustment process, first, the related concepts involved in the adjustment process are described:
Sensitivity of the decision condition: the sensitivity refers to how relaxed or strict the decision condition is. The lower the sensitivity, the stricter the decision condition; the higher the sensitivity, the more relaxed the decision condition.
For example, consider the selection condition used to choose the pre-judgment model. In general, the inference module predicts the probability that the voice information is valid and thus performs fuzzy matching, while the rule matching module performs pattern matching with a yes/no outcome and is comparatively strict. Therefore, when choosing the pre-judgment model: if the probability that the voice information acquired by the device is valid is high, the inference module or the rule matching module may be selected for pre-judgment, or, if the accuracy of valid voice recognition is to be improved, the inference module may be selected. If the probability that the acquired voice information is valid is small, the rule matching module may be selected for pre-judgment in order to effectively avoid false triggering by invalid information.
For example, assume the selection condition is: if the device's listening duration is less than 10 seconds, select the inference module for pre-judgment; if it is longer than 20 seconds, select the rule matching module; if it is between 10 and 20 seconds, select both. If the device needs to better filter invalid information and reduce false triggers, it may adjust the selection condition in a stricter direction, i.e., reduce its sensitivity, for example to: listening duration less than 5 seconds, select the inference module; longer than 10 seconds, select the rule matching module; between 5 and 10 seconds, select both. Conversely, if the device wants to better recognize valid voice information, it may adjust the selection condition in a more relaxed direction, i.e., raise its sensitivity, for example to: listening duration less than 15 seconds, select the inference module; longer than 25 seconds, select the rule matching module; between 15 and 25 seconds, select both.
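A minimal Python sketch of this selection condition follows, with the boundaries as adjustable parameters so that shifting them makes the condition stricter or more relaxed. The function name and the module labels are assumptions; the 10 s / 20 s defaults follow the example above.

```python
# Sketch of the selection condition: which pre-judgment module(s) to run,
# given how long the device has been continuously listening.

def select_modules(listen_seconds: float,
                   lower: float = 10.0, upper: float = 20.0) -> set[str]:
    if listen_seconds < lower:
        return {"inference"}                  # fuzzy, more permissive
    if listen_seconds > upper:
        return {"rule_matching"}              # strict yes/no matching
    return {"inference", "rule_matching"}     # in between: run both

# Lowering sensitivity (stricter):      select_modules(t, lower=5.0,  upper=10.0)
# Raising sensitivity (more relaxed):   select_modules(t, lower=15.0, upper=25.0)
```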
For the judgment threshold of the inference module, assume the standard judgment threshold is 70%, i.e., the voice information is determined to be valid when the inference module predicts that the probability of its being valid is greater than 70%. When the judgment threshold is adjusted to 80%, the decision condition is adjusted in a stricter direction: the predicted probability must now exceed 80% for the voice information to be determined valid, so the sensitivity of the decision condition is reduced. Conversely, when the judgment threshold is adjusted to 60%, the predicted probability only needs to exceed 60% for the voice information to be determined valid, so the sensitivity of the decision condition is raised.
For the comprehensive judgment condition, assume it is: valid voice information includes no fewer than 3 characters. If it is adjusted to require no fewer than 5 characters, the requirement on the voice information becomes stricter, so the sensitivity of the comprehensive judgment condition is reduced. If it is adjusted to require no fewer than 2 characters, the requirement is relaxed, so the sensitivity of the comprehensive judgment condition is raised.
Negatively correlated sensitivity adjustment: when the value of the influencing factor increases, the sensitivity decreases, and the larger the increase, the lower the sensitivity; when the value decreases, the sensitivity increases, and the larger the decrease, the higher the sensitivity.
Positively correlated sensitivity adjustment: when the value of the influencing factor increases, the sensitivity increases, and the larger the increase, the higher the sensitivity; when the value decreases, the sensitivity decreases, and the larger the decrease, the lower the sensitivity.
It should be noted that how far the sensitivity is adjusted up or down may be set according to the actual situation, which is not limited by the present application. Furthermore, the adjustment of the sensitivity of the decision condition is bounded; for example, the judgment threshold can be adjusted to at most 100% and at least 0. The adjustment range of the sensitivity of the decision condition is determined according to the actual situation and is not limited by this scheme.
First, the process of adjusting the decision condition is described based on the influencing factor of the environmental condition in which the first voice information is generated. Illustratively, the environmental conditions in which the first voice information is generated include one or more of the following: the number of people speaking within a second preset duration before the device acquires the first voice information (hereinafter, the number of speakers), the number of people within a preset range when the first voice information is generated (hereinafter, the number of surrounding people), the confidence of the first voice information, the signal-to-noise ratio of the first voice information, and so on. The number of speakers specifically refers to the number of different voiceprints included in the first voice information; since each person's voiceprint is different, the number of speakers can be represented by the number of voiceprints.
Referring to fig. 4, fig. 4 illustrates how the above decision conditions are adjusted, taking the several environmental influencing factors listed above as examples.
In the process of acquiring the first voice information, the device can obtain the number of surrounding people and the number of speakers. Specifically, the device may call the computer vision model in the memory to drive the camera to capture pictures or video of the surrounding environment, then analyze them to obtain the number of surrounding people, and obtain the number of speakers by analyzing whose mouths are moving in the video within the second preset duration. The number of surrounding people includes the number of speakers. The second preset duration may be, for example, 5 seconds, 10 seconds, or 1 minute, which is not limited by the present application.
Alternatively, the device can identify the voiceprint features in the voice signal received within the second preset duration by calling a voiceprint recognition model in the memory; the number of distinct voiceprint features identified is the number of speakers, as sketched below. Optionally, the voiceprint recognition model may be a dynamically monitored model so as to adapt flexibly to voiceprint recognition under different conditions.
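As a rough sketch of counting distinct voiceprints (the patent does not describe the model internals), the following assumes each utterance segment has already been turned into a voiceprint embedding vector by some recognition model; the greedy grouping and the 0.7 cosine cut-off are illustrative assumptions.

```python
# Sketch: count speakers by grouping voiceprint embeddings. A new
# embedding starts a new speaker unless it is close enough (cosine
# similarity) to a representative embedding of an existing speaker.

import numpy as np

def count_speakers(embeddings: list[np.ndarray], threshold: float = 0.7) -> int:
    representatives: list[np.ndarray] = []
    for e in embeddings:
        e = e / np.linalg.norm(e)   # normalize so the dot product is cosine
        if not any(float(r @ e) >= threshold for r in representatives):
            representatives.append(e)
    return len(representatives)
```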
After the device obtains the number of surrounding people (assume m) and the number of speakers (assume n, where m and n are non-negative integers), it first judges whether the number of speakers n is 0. If n is 0, the first voice information includes no human speech, and the corresponding decision condition does not need to be adjusted.
If the number of speakers n is not 0, the first voice information includes human speech. Further, the device judges whether the number of surrounding people m is greater than 1, and if m is not greater than 1, whether m is 1.
If m is 1, there is only one person in the surrounding environment, and the first voice information uttered by that person is, with high probability, a voice control instruction issued to the device; the sensitivity of the decision condition can therefore be raised so that the validity of the first voice information is better recognized.
Alternatively, if m is 1, the currently acquired first voice information may be treated by default as a voice control instruction to the device, i.e., as valid information. The sensitivity of the decision condition may then be tuned to the highest, or the invalid rejection model may skip further validity judgment and directly output an indication that the first voice information is valid.
If m is not 1 (i.e., m is 0 even though speech was detected), the detection may be wrong, and the sensitivity of the decision condition cannot be adjusted from this information; therefore, the decision condition is not adjusted.
In the case where the number of speakers n is not 0 and the number of surrounding people m is greater than 1, the first voice information is, with high probability, chat content and may be invalid for the device. The device may then lower the sensitivity of the decision condition based on the number of surrounding people: the larger m is, the lower the sensitivity is adjusted. This is because the more people there are around, the greater the probability that the first voice information is chat; a stricter decision condition is therefore needed to recognize its validity, so that invalid voice information does not falsely trigger related service operations and waste the device's resources. A sketch of this flow follows.
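The following Python sketch mirrors the fig. 4 flow just described, expressed as an adjustment of the judgment threshold (lower threshold = higher sensitivity). The step size and bounds are illustrative assumptions.

```python
# Sketch of the fig. 4 flow: adjust the decision threshold from the
# number of speakers n and the number of surrounding people m.

def adjust_for_environment(threshold: float, n_speakers: int,
                           n_people: int, step: float = 0.05) -> float:
    if n_speakers == 0:
        return threshold                # no human speech: leave as-is
    if n_people == 1:
        # One person present: very likely a control instruction, so relax
        # the condition (raise sensitivity = lower the threshold).
        return max(0.0, threshold - step)
    if n_people > 1:
        # Likely chat between people: the more people, the stricter the
        # condition (lower sensitivity = higher threshold).
        return min(1.0, threshold + step * (n_people - 1))
    return threshold                    # m is 0: possible detection error
```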
In addition, after the device acquires the first voice information, it can call an automatic speech recognition model in the memory to calculate the confidence of the first voice information, calculate its signal-to-noise ratio using the channel information, or calculate both, and then adjust the sensitivity of the decision condition based on the confidence and/or the signal-to-noise ratio.
Specifically, the sensitivity of the decision condition can be adjusted in negative correlation with the confidence and/or the signal-to-noise ratio. The higher the confidence, the greater the probability that the first voice information is correctly recognized; the higher the signal-to-noise ratio, the better the quality of the collected first voice information. In that case, even a strict decision condition can recognize the validity of the first voice information well while effectively filtering out invalid chat.
Conversely, if the confidence is lower, the probability that the first voice information is correctly recognized is smaller; if the signal-to-noise ratio is lower, the quality of the collected first voice information is poorer and the recognition of the voice content may be wrong. In order to improve the robustness of the device's voice interaction, the sensitivity of the decision condition can be appropriately raised and the condition relaxed, so that the validity of the first voice information can still be recognized.
For example, the device may set a confidence threshold and/or a signal-to-noise-ratio threshold for the voice information. If the confidence of the first voice information is greater than the confidence threshold and/or the signal-to-noise ratio is greater than the signal-to-noise-ratio threshold, then the higher the confidence and/or signal-to-noise ratio, the lower the sensitivity of the decision condition is adjusted. If the confidence is smaller than the confidence threshold and/or the signal-to-noise ratio is smaller than the signal-to-noise-ratio threshold, then the lower the confidence and/or signal-to-noise ratio, the higher the sensitivity is adjusted. The confidence threshold may be, for example, 50% or 60%, and the signal-to-noise-ratio threshold may be, for example, 50 dB or 60 dB; neither is limited by the present application.
Alternatively, in one possible implementation, the device need not set a confidence threshold and/or a signal-to-noise-ratio threshold, but may instead define, for each confidence and/or signal-to-noise-ratio range, how the decision condition is adjusted. For example, taking the decision condition to be the judgment threshold of the inference model and assuming an initial judgment threshold of 70%: in the confidence range 0-30%, the sensitivity can be raised and the judgment threshold adjusted to 50%; in the range 31%-60%, the judgment threshold can be set to 60%; in the range 61%-70%, the original 70% threshold can be kept without adjustment; and in the range 71%-100%, the sensitivity can be lowered and the judgment threshold set to 80%.
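The band boundaries and thresholds below are taken directly from the example above; the function itself is only an illustrative sketch of such a range-based mapping.

```python
# Sketch of the range-based mapping: each confidence band maps to an
# adjusted judgment threshold (initial threshold 70%).

def threshold_from_confidence(confidence: float) -> float:
    bands = [
        (0.30, 0.50),   # confidence 0-30%: raise sensitivity, threshold 50%
        (0.60, 0.60),   # 31%-60%: threshold 60%
        (0.70, 0.70),   # 61%-70%: keep the original 70%
        (1.00, 0.80),   # 71%-100%: lower sensitivity, threshold 80%
    ]
    for upper, thresh in bands:
        if confidence <= upper:
            return thresh
    return 0.70  # fallback for out-of-range input
```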
It should be noted that, for the above influencing factors (the number of speakers n, the number of surrounding people m, the confidence, and the signal-to-noise ratio), the device may adjust the sensitivity of the decision condition based on any one of them individually, or comprehensively based on any number of them. For example, a weight may be configured for each of the influencing factors and the sensitivity of the decision condition adjusted in a weighted manner. For instance, for the adjustment of the judgment threshold, suppose the three influencing factors of surrounding-people count m, confidence, and signal-to-noise ratio are combined, their weights are w1, w2, and w3, and the adjusted judgment thresholds computed from each factor alone are a1, a2, and a3; the judgment threshold determined by combining the three factors is then a1·w1 + a2·w2 + a3·w3. This weighted synthesis method is only an example: in practical implementation, the largest or smallest adjustment among the influencing factors may instead be taken as the final result, and the present scheme does not limit the specific synthesis calculation.
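A minimal sketch of this weighted synthesis follows; the example weights and thresholds in the comment are invented purely for arithmetic illustration.

```python
# Sketch of the weighted synthesis: each influencing factor proposes an
# adjusted threshold a_i with weight w_i; the final threshold is the
# weighted sum a1*w1 + a2*w2 + a3*w3.

def combine_adjustments(adjusted: list[float], weights: list[float]) -> float:
    assert len(adjusted) == len(weights)
    return sum(a * w for a, w in zip(adjusted, weights))

# e.g. thresholds proposed by surrounding-people count, confidence, and
# signal-to-noise ratio, with weights summing to 1:
# combine_adjustments([0.75, 0.60, 0.70], [0.5, 0.3, 0.2])  ->  0.695
```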
Referring to fig. 5, fig. 5 schematically shows how the sensitivity of the decision condition is adjusted based on three influencing factors: the duration for which the device has continuously listened before acquiring the first voice information (hereinafter t1), the first time interval between the device acquiring the first voice information and the last valid voice information (hereinafter Δt1), and the second time interval between the device acquiring the first voice information and the last invalid voice information (hereinafter Δt2).
Specifically, after the device acquires the first voice information, it may obtain the duration t1 for which it has continuously listened when acquiring the first voice information, the first time interval Δt1 between the first voice information and the last valid voice information, and the second time interval Δt2 between the first voice information and the last invalid voice information. t1, Δt1, and Δt2 can be obtained, for example, by timer counting and calculation.
After obtaining t1, the device may adjust the sensitivity of the decision condition in negative correlation with t1, i.e., the longer the continuous listening duration t1, the lower the sensitivity is adjusted. This is because when the device is woken up, it enters a new continuous listening phase; in general, voice information acquired early in this phase is very likely a valid user instruction, so high sensitivity should be maintained, whereas as time passes the acquired voice information is more likely to be conversation between users, so the sensitivity needs to be reduced to cut false triggering. The device can therefore adjust the sensitivity of the decision condition in negative correlation with the continuous listening duration.
To illustrate the negatively correlated adjustment based on t1 with an example: assume the decision condition is the judgment threshold applied to the inference module's output. At the start of continuous listening, the judgment threshold may be 60%, a relatively loose condition with relatively high sensitivity. As t1 grows, each time t1 increases by a unit interval (for example, 5 seconds), the judgment threshold is increased by a preset increment, for example 1%. That is, as t1 increases, the judgment threshold becomes larger, the condition stricter, and the sensitivity gradually lower. This is only an example; the present application does not limit the specific negative-correlation adjustment method.
After obtaining the first time interval Δt1, the device may judge whether Δt1 is greater than a first time-interval threshold T1. If Δt1 is greater than T1, the sensitivity of the decision condition is not adjusted. This is because when Δt1 is greater than T1, the first time interval Δt1 can be considered to overlap the continuous listening duration t1, whose effect on the sensitivity is already covered by the adjustment based on t1, so no additional adjustment according to Δt1 is needed.
If Δt1 is less than T1, the sensitivity of the decision condition is adjusted in negative correlation with Δt1. This is because, within the T1 window, the longer the interval since the device acquired valid voice information, the greater the probability that the newly acquired voice information is invalid chat; therefore, to reduce false triggering, the device may adjust the sensitivity downward as Δt1 grows.
After obtaining the second time interval Δt2, the device may judge whether Δt2 is greater than a second time-interval threshold T2. If Δt2 is greater than T2, the sensitivity of the decision condition is not adjusted. This is because when Δt2 is greater than T2, the second time interval Δt2 can be considered to overlap the continuous listening duration t1, whose effect on the sensitivity is already covered by the adjustment based on t1, so no additional adjustment according to Δt2 is needed.
If Δt2 is less than T2, the sensitivity of the decision condition is adjusted in negative correlation with Δt2. This is because, within the T2 window, the longer the interval since the device acquired invalid voice information, the greater the probability that the newly acquired voice information is also invalid chat; therefore, to reduce false triggering, the device may adjust the sensitivity downward as Δt2 grows.
In addition, for the first time interval Δt1 and the second time interval Δt2 obtained above, the device may compare whether Δt1 is smaller than Δt2, and if so, adjust the sensitivity of the decision condition upward. This is because when the voice information immediately preceding the first voice information was valid, the first voice information is more likely to be a supplement or a correction to it, i.e., more likely to be valid itself; the device may then adjust the decision condition in a relaxed direction, i.e., raise the sensitivity, to better recognize the validity of the first voice information. A sketch of the whole fig. 5 flow follows.
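The following Python sketch pulls the fig. 5 steps together as one threshold adjustment. The 60% starting threshold and the +1% per 5 seconds increment follow the example above; T1, T2, and the remaining step sizes are illustrative assumptions.

```python
# Sketch of the fig. 5 flow: adjust the judgment threshold from the
# continuous-listening duration t1 and the intervals dt1 / dt2 since the
# last valid / invalid voice information (all in seconds).

def adjust_for_timing(t1: float, dt1: float, dt2: float,
                      T1: float = 60.0, T2: float = 60.0,
                      base: float = 0.60, step: float = 0.01) -> float:
    # Negative correlation with t1: +1% per elapsed 5 s of listening.
    threshold = base + step * int(t1 // 5)
    # dt1 / dt2 only matter when they do not overlap the listening phase.
    if dt1 < T1:
        threshold += step * int(dt1 // 5)   # negative correlation with dt1
    if dt2 < T2:
        threshold += step * int(dt2 // 5)   # negative correlation with dt2
    if dt1 < dt2:
        threshold -= 5 * step               # last utterance was valid: relax
    return min(1.0, max(0.0, threshold))    # threshold stays within [0, 1]
```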
The adjustment flow shown in fig. 5 is one implementation example of the present application. By dynamically adjusting the sensitivity of the decision condition in real time according to the continuous listening duration and the time intervals since the last valid and invalid voice information, voice information with identical content may be judged against different thresholds in different listening periods; valid voice is thus recognized better, false triggering by invalid voice is reduced, and the user's voice interaction experience is improved.
It should be noted that, for the influencing factors shown in fig. 5, the device may adjust the sensitivity of the decision condition based on any one of them individually, or comprehensively based on any number of them.
Referring to fig. 6A and 6B, fig. 6A and 6B schematically show how the sensitivity of the decision condition is adjusted based on the duty ratios of valid voice information and invalid voice information within the first preset duration before the device acquires the first voice information.
The first preset duration may be the duration for which the device continuously listens before acquiring the first voice information, or it may be any preconfigured duration ending when the device acquires the first voice information; this is not limited by the application.
The duty ratio of valid voice information within the first preset duration refers to the proportion of valid voice information among all voice information acquired by the device within that duration. Alternatively, the duty ratio may be the reciprocal of the number of pieces of invalid voice information acquired between the last reception of a valid voice control instruction and the acquisition of the first voice information; if that number is 0, the duty ratio of valid voice information is 1.
The duty ratio of invalid voice information within the first preset duration refers to the proportion of invalid voice information among all voice information acquired by the device within that duration. Alternatively, the duty ratio may be the reciprocal of the number of pieces of valid voice information acquired between the last reception of invalid voice information and the acquisition of the first voice information; if that number is 0, the duty ratio of invalid voice information is 1.
In a specific embodiment, after the device acquires the first voice information, it obtains the duty ratio of valid voice information (f1 for short) and the duty ratio of invalid voice information (f2 for short) within the first preset duration and may compare f1 with f2 (see fig. 6A). If f1 is greater than f2, more valid voice information was obtained within the first preset duration and the user is interacting frequently with the device; the sensitivity of the decision condition can then be adjusted in positive correlation with the parameter (f1 - f2). That is, the larger the duty ratio of valid voice information, the greater the probability that the first voice information is valid, and the higher the sensitivity is adjusted, so that the validity of acquired voice information is better recognized and the chance of missing valid voice information is reduced.
In a possible embodiment, the device may also adjust the sensitivity of the decision condition based on f1 and f2 separately: for example, the larger the f1 duty ratio, the higher the sensitivity is adjusted, and the larger the f2 duty ratio, the lower the sensitivity is adjusted, and so on.
In fig. 6A, if f1 is not greater than f2, the device may adjust the sensitivity of the decision condition according to the change rate of f1 and the change rate of f2.
For example, construct a coordinate system with the number of voice information acquisitions on the horizontal axis (or the continuous listening time on the horizontal axis) and f1 on the vertical axis. In this coordinate system, the slope of the line connecting the value of f1 at the previous acquisition of voice information with the value of f1 at the most recent acquisition is the change rate of f1. For ease of understanding, see fig. 6C, which assumes that 6 pieces of voice information were received before the first voice information was acquired and shows, for each acquisition, the duty ratio of valid voice information after the validity judgment. In fig. 6C, after the device acquires the first voice information, the change rate of f1 obtained is k = -10%.
Similarly, construct a coordinate system with the number of voice information acquisitions on the horizontal axis (or the continuous listening time on the horizontal axis) and f2 on the vertical axis; the slope of the line connecting the value of f2 at the previous acquisition with the value of f2 at the most recent acquisition is the change rate of f2. For ease of understanding, see fig. 6D, which likewise assumes that 6 pieces of voice information were received before the first voice information was acquired and shows the duty ratio of invalid voice information after each validity judgment. In fig. 6D, after the device acquires the first voice information, the change rate of f2 obtained is k = 10%.
Based on the above, when f1 is not greater than f2, the voice interaction between the user and the device is decreasing; in order to reduce false triggering by invalid voice, the device may adjust the sensitivity of the decision condition in positive correlation with the change rate of f1. That is, the larger the change rate of f1, the greater the probability that the first voice information is valid, the higher the sensitivity is adjusted, and the looser the decision condition; the smaller the change rate of f1, the smaller that probability, the lower the sensitivity, and the stricter the decision condition. For example, fig. 6C shows several change rates of f1: k = -50%, k = 16.6%, k = 8.3%, k = -15%, and k = -10%, which ordered from small to large are: -50% < -15% < -10% < 8.3% < 16.6%. Assuming the adjusted decision condition is the judgment threshold of the inference module's output and the threshold before adjustment is 70%, the adjusted judgment thresholds corresponding to these five change rates of f1, from small to large, are 85%, 80%, 78%, 68%, and 65%. The lower the judgment threshold, the higher the sensitivity; that is, raising the sensitivity lowers the judgment threshold, and lowering the sensitivity raises it.
Also, when f1 is not greater than f2, the device can adjust the sensitivity of the decision condition in negative correlation with the change rate of f2. That is, the smaller the change rate of f2, the more the duty ratio of valid voice information is rising, i.e., the greater the probability that the first voice information is valid, so the higher the sensitivity is adjusted and the looser the decision condition; the larger the change rate of f2, the smaller the duty ratio of valid voice information, i.e., the smaller the probability that the first voice information is valid, so the lower the sensitivity and the stricter the decision condition. For example, fig. 6D shows several change rates of f2: k = 50%, k = -16.6%, k = -8.3%, k = 15%, and k = 10%, which ordered from small to large are: -16.6% < -8.3% < 10% < 15% < 50%. Assuming the adjusted decision condition is the judgment threshold of the inference module's output and the threshold before adjustment is 70%, the adjusted judgment thresholds corresponding to these five change rates of f2, from small to large, are 65%, 68%, 78%, 80%, and 85%.
Alternatively, after acquiring the first voice information and obtaining the duty ratio of valid voice information (f1) and the duty ratio of invalid voice information (f2) within the first preset duration, the device need not compare f1 with f2; it may directly adjust the sensitivity of the decision condition in positive correlation with the parameter (f1 - f2), in positive correlation with the change rate of f1, and/or in negative correlation with the change rate of f2 (see fig. 6B). The specific adjustment process is as described above for fig. 6A and is not repeated here. A sketch of this duty-ratio-based adjustment follows.
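The following Python sketch covers the figs. 6A-6D logic under the proportion-based definition of the duty ratios. The `gain` step size is an illustrative assumption; the change rate is the slope between consecutive acquisitions with unit spacing on the horizontal axis, as in figs. 6C and 6D.

```python
# Sketch of the duty-ratio adjustment. `history` is the sequence of
# validity judgments (True = valid) within the first preset duration.

def duty_ratios(history: list[bool]) -> tuple[float, float]:
    f1 = sum(history) / len(history)     # share of valid voice information
    return f1, 1.0 - f1                  # (f1, f2)

def change_rate(ratios: list[float]) -> float:
    # Slope of the duty ratio between the last two acquisitions.
    return ratios[-1] - ratios[-2] if len(ratios) >= 2 else 0.0

def adjust_for_duty(threshold: float, f1: float, f2: float,
                    k1: float, k2: float, gain: float = 0.1) -> float:
    if f1 > f2:
        # Frequent interaction: positive correlation with (f1 - f2),
        # so a larger gap relaxes the condition (lowers the threshold).
        threshold -= gain * (f1 - f2)
    else:
        threshold -= gain * k1           # positive correlation with f1's rate
        threshold += gain * k2           # negative correlation with f2's rate
    return min(1.0, max(0.0, threshold))
```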
It should be noted that, for the influencing factors shown in fig. 6A or fig. 6B, the device may adjust the sensitivity of the decision condition based on any one of them individually, or comprehensively based on any number of them.
Referring to fig. 7, fig. 7 illustrates how the sensitivity of the decision condition is adjusted based on the following influencing factors: the first association degree between the semantics of the first voice information and those of the valid voice information last acquired by the device, the second association degree between the semantics of the first voice information and those of the invalid voice information last acquired by the device, the third association degree between the first voice information and the valid voice information last acquired by the device, and the state of the voice dialogue between the device and the user when the first voice information is acquired.
In a specific embodiment, after the device acquires the first voice information, it may retrieve the valid voice information acquired last time (the latest historical valid voice information for short) and analyze the degree of association between the semantics of the first voice information and those of the latest historical valid voice information (the first association degree for short). Specifically, the first voice information may be semantically understood by invoking the natural language understanding model in the memory.
If the semantics of the two pieces of voice information are not associated, i.e., the first association degree is zero, the sensitivity of the decision condition is not adjusted. If their semantics are associated, for example the semantics are the same, have an inheritance relationship (e.g., the latest historical valid voice information is "turn on the air conditioner" and the first voice information is "a little higher temperature"), a progressive relationship (e.g., the latest historical valid voice information is "a little higher temperature" and the first voice information is "a little higher still"), or an opposing relationship (e.g., the latest historical valid voice information is "turn on the air conditioner" and the first voice information is "turn it off"), the device can calculate a specific first association degree and then adjust the sensitivity of the decision condition in positive correlation with it.
For example, if the first association degree is greater than a certain threshold, the probability that the first voice information is valid voice information is greater, and the larger the first association degree, the higher the sensitivity is adjusted; conversely, if the first association degree is smaller than the threshold, the probability that the first voice information is valid is smaller, and the smaller the first association degree, the lower the sensitivity is adjusted.
Alternatively, in one possible implementation, the device need not set a threshold for the first association degree, but may instead define, for each range of the first association degree, how the decision condition is adjusted. For example, taking the decision condition to be the judgment threshold of the inference model and assuming an initial judgment threshold of 70%: in the first-association-degree range 0-30%, the sensitivity may be lowered and the judgment threshold set to 80%; in the range 31%-60%, the judgment threshold may be set to 75%; in the range 61%-70%, the original 70% threshold may be kept without adjustment; and in the range 71%-100%, the sensitivity may be raised and the judgment threshold set to 60%.
In a possible implementation, when the first association degree is determined to be 100%, the sensitivity may be tuned to the highest, or the invalid rejection model may skip further validity judgment and directly output an indication that the first voice information is valid.
In a specific embodiment, after the device acquires the first voice information, it may retrieve the invalid voice information acquired last time (the latest historical invalid voice information for short) and analyze the degree of association between the semantics of the first voice information and those of the latest historical invalid voice information (the second association degree for short). If the semantics of the two pieces of voice information are not associated, i.e., the second association degree is zero, the sensitivity of the decision condition is not adjusted. If their semantics are associated, for example the semantics are the same, have an inheritance relationship (e.g., the latest historical invalid voice information is "we could go on Sunday" and the first voice information is "Saturday works too"), a progressive relationship (e.g., the latest historical invalid voice information is "getting up at six in the morning is early" and the first voice information is "I can get up even earlier"), or an opposing relationship (e.g., the latest historical invalid voice information is "let's go then" and the first voice information is "no"), the device can calculate a specific second association degree and then adjust the sensitivity of the decision condition in negative correlation with it.
For example, if the second association degree is greater than a certain threshold, the probability that the first voice information is invalid voice information is greater, and the larger the second association degree, the lower the sensitivity is adjusted; conversely, if the second association degree is smaller than the threshold, the probability that the first voice information is invalid is smaller, and the smaller the second association degree, the higher the sensitivity is adjusted.
Alternatively, in one possible implementation, the device need not set a threshold for the second association degree, but may instead define, for each range of the second association degree, how the decision condition is adjusted. For example, taking the decision condition to be the judgment threshold of the inference model and assuming an initial judgment threshold of 70%: in the second-association-degree range 0-30%, the sensitivity may be raised and the judgment threshold set to 60%; in the range 31%-60%, the judgment threshold may be set to 65%; in the range 61%-70%, the original 70% threshold may be kept without adjustment; and in the range 71%-100%, the sensitivity may be lowered and the judgment threshold set to 80%.
In a possible implementation, when the second association degree is determined to be 100%, the sensitivity may be tuned to the lowest, or the invalid rejection model may skip further validity judgment and directly output an indication that the first voice information is invalid.
In a specific embodiment, besides the first association degree between the semantics of the first voice information and those of the valid voice information last acquired by the device, the device may adjust the decision condition based on the third association degree between the first voice information and the valid voice information last acquired by the device. The third association degree refers to the degree of association between the content of the first voice information and the content of the last valid voice information, whereas the first association degree refers to the degree of association between the semantics of the two. For ease of understanding the first and third association degrees, refer to figs. 8A and 8B.
Referring first to fig. 8A, assume that "help me play music" is the valid voice information last acquired by the device, and "I like listening to singer A's songs" is the first voice information. To obtain the first association degree of the two pieces of voice information, their semantic information is first obtained through the natural language understanding model and then input into a semantic association inference model for processing, which outputs the first association degree of the two pieces of semantic information. The semantic association inference model is, for example, a pre-trained neural network model or machine learning model.
Referring to fig. 8B, similarly assume that "help me play music" is the valid voice information last acquired by the device, and "I like listening to singer A's songs" is the first voice information. To obtain the third association degree of the two pieces of voice information, both can be structurally parsed through the natural language understanding model. Specifically, structural parsing of "help me play music" yields: the domain described by this voice information is music, and the intent is to play music. Structural parsing of "I like listening to singer A's songs" yields: the domain described by this voice information is music, and the singer is singer A. After the structured information of the two pieces of voice information is obtained, it is input into a relevance judgment model for processing, which outputs the third association degree of the two pieces of voice information. The relevance judgment model may be, for example, a dialog state tracking (DST) model.
The first association degree of "help me play music" and "I like listening to singer A's songs" output in fig. 8A may be zero, i.e., their semantics are not associated; the third association degree of the same two pieces of voice information output in fig. 8B may be 100%, i.e., the two pieces of voice information are associated.
In one possible embodiment, the third association degree between the first voice information and the valid voice information last acquired by the device, obtained in the manner of fig. 8B, may be exactly 0 or 100%: the third association degree is 0 if the relevance judgment model outputs an indication of "unrelated", and 100% if it outputs an indication of "related".
In another possible implementation, the third association degree obtained in the manner of fig. 8B may be a specific percentage (for example, 60% or 90%) or a similarity score, and whether the two are associated can then be determined by comparison with a preset threshold.
After obtaining the third association degree between the first voice information and the valid voice information last acquired by the device, the device may adjust the sensitivity of the decision condition in positive correlation with the third association degree; for the specific positive-correlation adjustment, refer to the adjustment based on the first association degree, which is not repeated here. In addition, when the third association degree is zero, i.e., the first voice information is unrelated to the last valid voice information, the sensitivity of the decision condition is not adjusted. A sketch of this relevance-based adjustment follows.
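As a rough sketch of the fig. 8B relevance check (the patent does not disclose the DST model itself), the following assumes both utterances have already been parsed into structured frames of domain and slots; the frame layout, the binary scoring, and the `gain` step are illustrative assumptions.

```python
# Sketch: derive the third association degree from structured frames and
# adjust the judgment threshold in positive correlation with it.

def third_association(frame_now: dict, frame_last: dict) -> float:
    if frame_now.get("domain") != frame_last.get("domain"):
        return 0.0                      # different domains: unrelated
    return 1.0                          # same domain: related (fig. 8B case)

def adjust_for_association(threshold: float, degree: float,
                           gain: float = 0.1) -> float:
    # Positive correlation: the more related to the last valid voice
    # information, the more relaxed (lower) the threshold. Zero degree
    # leaves the threshold unchanged.
    return max(0.0, threshold - gain * degree)

# e.g. "help me play music"          -> {"domain": "music", "intent": "play"}
#      "I like singer A's songs"     -> {"domain": "music", "singer": "A"}
#      third_association(...) == 1.0, so the threshold is relaxed.
```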
In a specific embodiment, after the device acquires the first voice information, it may obtain the state of the voice dialogue between the device and the user up to the moment the first voice information is acquired; the state may be, for example, the device selecting, inquiring, or judging based on the user's voice control instruction, or chatting with the user. Specifically, the device may learn the state through dialog state tracking (DST) techniques. If the device is in a state of voice dialogue with the user, a long interactive session is in progress between them, and the device may raise the sensitivity of the decision condition according to this ongoing session state. If there is no such dialogue state, the user is not in a long interactive session with the device, and the device may leave the sensitivity of the decision condition unadjusted with respect to this factor.
It should be noted that, for the influencing factors shown in fig. 7, the device may adjust the sensitivity of the decision condition based on any one of them individually, or comprehensively based on any number of them.
Referring to fig. 9, fig. 9 illustrates how the sensitivity of the decision condition is adjusted based on two influencing factors: the first similarity between the acoustic features of the first voice information and those of historical valid voice information, and the second similarity between the acoustic features of the first voice information and those of historical invalid voice information. Illustratively, the acoustic features include features of the speech such as pitch and/or speaking speed.
In a specific embodiment, after the device acquires the first voice information, it extracts the acoustic features of the first voice information by calling an acoustic model stored in the memory, then compares them with the acoustic features of historical valid voice information (one or more pieces) to obtain the similarity between the two (the first similarity for short). If the similarity between the acoustic features of the first voice information and those of the historical valid voice information is zero, the device may leave the sensitivity of the decision condition unadjusted with respect to the first similarity. If the similarity to the acoustic features of one or more pieces of historical valid voice information is not zero, the sensitivity of the decision condition may be adjusted in positive correlation with the similarity (which may be, for example, the largest of the obtained similarities, or their average): the higher the similarity, the higher the sensitivity is adjusted.
In a possible implementation, when the similarity between the acoustic features of the first voice information and those of one or more pieces of historical valid voice information is greater than a certain threshold (for example, any value between 60% and 100%), indicating that the acoustic features are similar, the device may adjust the sensitivity of the decision condition to a preset value. For example, taking the judgment threshold as an example and assuming the original judgment threshold is 70%, the judgment threshold is adjusted to 60% whenever that similarity exceeds the threshold.
In a specific embodiment, after the device acquires the first voice information, the acoustic features of the first voice information are extracted by invoking an acoustic model stored in a memory, and the extracted acoustic features are then compared with the acoustic features of historical invalid voice information (which may be one or more pieces of historical invalid voice information) to obtain a similarity (referred to as the second similarity) between the acoustic features of the first voice information and those of the historical invalid voice information. If this similarity is zero, the device may not adjust the sensitivity of the decision condition according to the second similarity. If the similarity between the acoustic features of the first voice information and those of one or more pieces of historical invalid voice information is not zero, the sensitivity of the decision condition may be adjusted in negative correlation with the similarity (which may be, for example, the largest of the obtained similarities, or their average, etc.): the greater the similarity, the lower the sensitivity is adjusted.
In a possible implementation, in the case that the similarity between the acoustic features of the first voice information and those of one or more pieces of historical invalid voice information is greater than a certain threshold (which may be, for example, any value between 60% and 100%), indicating that the acoustic features of the first voice information are similar to those of historical invalid voice information, the device may adjust the sensitivity of the decision condition to a preset value. For example, taking the above-mentioned judgment threshold as an example, assuming that the original judgment threshold is 70%: as long as the similarity between the acoustic features of the first voice information and those of one or more pieces of historical invalid voice information is greater than the certain threshold, the judgment threshold is adjusted to 75%.
It should be noted that, for the several influencing factors shown in fig. 9, the device may adjust the sensitivity of the decision condition individually based on any one of them, or may adjust it comprehensively based on any number of these influencing factors.
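To make the two similarity-based adjustments concrete, the following is a minimal sketch. It assumes cosine similarity over fixed-length acoustic feature vectors, takes the largest similarity over each history (the text equally allows an average), and reuses the preset values from the examples above (threshold lowered to 60%, raised to 75%). Letting the invalid-side adjustment win when both gates fire is an assumption of the sketch, not something stated above; all function names are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Similarity of two acoustic feature vectors (pitch, speaking rate, ...).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def adjust_threshold_by_acoustics(base_threshold: float,
                                  features: np.ndarray,
                                  valid_history: list,
                                  invalid_history: list,
                                  gate: float = 0.6) -> float:
    # First similarity: best match against historical valid voice information.
    first_sim = max((cosine_similarity(features, h) for h in valid_history), default=0.0)
    # Second similarity: best match against historical invalid voice information.
    second_sim = max((cosine_similarity(features, h) for h in invalid_history), default=0.0)
    threshold = base_threshold   # e.g. the original 70%
    if first_sim > gate:
        threshold = 0.60         # preset value: lower threshold, higher sensitivity
    if second_sim > gate:
        threshold = 0.75         # preset value: higher threshold, lower sensitivity
    return threshold
```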
In one possible implementation, the device may receive a user-entered instruction and adaptively adjust the sensitivity of the decision condition based on that instruction. The instruction may be, for example, a specific decision-condition sensitivity specified by the user, or an instruction to turn off or cancel voice-information validity recognition, and so on. In this way, the embodiment of the application can adaptively adjust the sensitivity of the decision condition according to the user's preference, thereby better meeting user requirements and improving the user experience.
In one possible implementation, the adjustment of the sensitivity of the decision condition may be performed by another device or apparatus (for example, a server corresponding to the device) based on the one or more influencing factors and then sent to the device; after receiving the adjusted decision condition, the device may directly decide the validity of the first voice information based on it.
Referring to fig. 10, fig. 10 shows a voice information processing method provided by the present application, which includes, but is not limited to, the following steps:
S1001, acquiring first voice information.
The specific implementation of this step may be referred to the description in step S201 in fig. 2, and will not be repeated here.
S1002, executing the operation indicated by the first voice information in the case that the first voice information is determined to be a valid voice control instruction based on a decision condition, where the decision condition is obtained by adjustment based on the environmental condition in which the first voice information is generated.
In a specific embodiment, after the device acquires the first voice information, the decision condition for determining whether the first voice information is a valid voice command may be adaptively adjusted based on the environmental condition in which the first voice information is generated. The specific implementation of adjusting the decision condition based on this environmental condition may be referred to the corresponding description in fig. 4, which is not repeated herein.
After the adjustment is completed, the device uses the adjusted decision condition to determine whether the first voice information is valid. If the first voice information is valid, the device starts semantic understanding of the first voice information; specifically, the processor in the device may invoke the natural language understanding model in the memory to perform semantic understanding and obtain the specific meaning of the first voice information. After the device understands the meaning of the first voice information, it performs a corresponding operation based on that meaning so as to provide the user with the required service. In other words, the meaning of the first voice information is the control instruction according to which the device executes the corresponding operation.
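Read as pseudocode, the flow just described can be sketched as follows; every name on `device` is a hypothetical stand-in for the components described above (transceiver, decision condition, natural language understanding model), not an actual API.

```python
def process_voice(device, audio):
    info = device.acquire(audio)                       # S1001: acquire first voice information
    env = device.sense_environment()                   # speaker count, confidence, SNR, ...
    condition = device.adjust_decision_condition(env)  # adapt the decision condition
    if device.is_valid(info, condition):               # validity decision with adjusted condition
        meaning = device.nlu(info)                     # invoke the natural language understanding model
        device.execute(meaning)                        # perform the operation the meaning indicates
```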
In one possible implementation, the device may receive a decision-condition sensitivity specified and entered by the user, and then adaptively adjust, based on that sensitivity, the decision condition for determining whether the first voice information is a valid voice command, so that the user-specified decision sensitivity is achieved when the adjusted decision condition is used to judge the validity of voice information. After the device adjusts the decision condition based on the user-specified sensitivity, the adjusted decision condition is used to determine whether the first voice information is valid. If the first voice information is valid, the device performs semantic understanding on the first voice information to obtain its meaning, and performs a corresponding operation based on that meaning so as to provide the required service to the user. The meaning of the first voice information is the control instruction according to which the device executes the corresponding operation.
In a possible implementation, in the case that the first voice information is determined to be a valid voice control instruction based on the decision condition, the specific implementation of performing the operation indicated by the first voice information may be referred to the description in step S203 in fig. 2, which is not repeated herein.
Optionally, the environmental condition in which the first voice information is generated includes one or more of the following: the number of speakers within a second preset duration up until the device acquires the first voice information, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
The more speakers there are within a period of time and/or the more people there are nearby when the voice information is generated, the higher the probability that the voice information received by the device is chit-chat and therefore invalid. In addition, the higher the confidence and/or the signal-to-noise ratio of the voice information, the higher the probability that the device can correctly recognize its sentences, which also affects the recognition of its validity. Therefore, adaptively adjusting the decision condition for voice-information validity based on one or more of these factors allows the validity of the voice information to be judged better, improves the accuracy of validity judgment, and reduces the false-triggering rate of invalid signals.
In a specific embodiment, in the case that the environmental condition indicates that the probability of the first voice information being valid is greater than the probability of it being invalid, the sensitivity of the decision condition is increased; in the case that the environmental condition indicates that the probability of the first voice information being valid is smaller than the probability of it being invalid, the sensitivity of the decision condition is decreased. Specific implementations may be referred to the corresponding descriptions in fig. 4 and are not repeated here.
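A minimal sketch of this raise/lower rule, assuming a simple vote over the environmental factors listed above; the cutoffs (confidence 0.8, SNR 20 dB, more than one speaker or bystander) and the 0.1 step are illustrative assumptions, not values from the application.

```python
def adjust_sensitivity_for_environment(sensitivity: float,
                                       num_speakers_recent: int,
                                       num_people_nearby: int,
                                       confidence: float,
                                       snr_db: float) -> float:
    # Evidence that the utterance is a directed, recognizable command.
    valid_votes = int(confidence > 0.8) + int(snr_db > 20.0)
    # Evidence that the utterance is chit-chat among several people.
    invalid_votes = int(num_speakers_recent > 1) + int(num_people_nearby > 1)
    if valid_votes > invalid_votes:
        return min(1.0, sensitivity + 0.1)   # valid more probable: raise sensitivity
    if valid_votes < invalid_votes:
        return max(0.0, sensitivity - 0.1)   # invalid more probable: lower sensitivity
    return sensitivity
```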
Because the environmental condition under which voice information is generated strongly influences whether that information is a valid voice control instruction (the same or similar voice information may be a valid instruction under one environmental condition but not under another), the embodiment of the application adaptively adjusts the decision condition for judging validity according to the voice information received under different environmental conditions. This allows the validity of voice information to be judged better under different environmental conditions, improves the accuracy of validity judgment, and reduces the false-triggering rate of invalid signals.
In one possible implementation, that the decision condition is obtained by adjustment based on the environmental condition in which the first voice information is generated includes: the decision condition is adjusted based on the environmental condition and the continuous listening duration of the device.
In a specific embodiment, the device may adaptively adjust the sensitivity of the above decision condition according to the environmental condition in which the first voice information is generated and the duration for which the device has continuously listened for voice information. The specific implementation of adjusting the decision condition based on the environmental condition in which the first voice information is generated may be referred to the corresponding description in fig. 4, which is not repeated herein.
Optionally, the longer the continuous listening duration of the device, the lower the sensitivity of the decision condition is adjusted. The specific implementation of adjusting the decision condition based on the device's continuous listening duration may be referred to the corresponding description in fig. 5 and will not be repeated here.
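One way to realize "the longer the listening, the lower the sensitivity" is a monotone decay; the exponential form and the 60-second half-life here are assumptions, and any decreasing function would fit the description.

```python
def sensitivity_after_listening(base_sensitivity: float,
                                listening_seconds: float,
                                half_life_s: float = 60.0) -> float:
    # Halve the sensitivity for every half_life_s of continuous listening.
    return base_sensitivity * 0.5 ** (listening_seconds / half_life_s)
```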
Alternatively, in a specific implementation, the device may assign a weight to each of the above environmental condition and listening duration and comprehensively adjust the sensitivity of the decision condition in a weighted manner. For example, for the adjustment of the above judgment threshold, suppose the two influencing factors (the environmental condition and the listening duration) are combined, their weights are w4 and w5, and the adjusted judgment thresholds computed from the two factors individually are a4 and a5; then the adjusted judgment threshold determined by combining the two factors is (a4×w4+a5×w5). It should be noted that this weighted combination is only an example; in practical implementation, the largest or smallest adjustment among multiple influencing factors may instead be taken as the final result, and this solution does not limit the specific combination procedure.
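The weighted combination (a4×w4+a5×w5) translates directly into code; the example weights and per-factor thresholds below are made up for illustration.

```python
def combine_adjusted_thresholds(pairs):
    # pairs: (per-factor adjusted threshold, weight); weights assumed to sum to 1.
    return sum(a * w for a, w in pairs)

# Environment alone suggests a4 = 0.65 with w4 = 0.4; listening duration
# alone suggests a5 = 0.75 with w5 = 0.6.
threshold = combine_adjusted_thresholds([(0.65, 0.4), (0.75, 0.6)])  # -> 0.71
```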
Because the longer the device continuously listens for speech, the higher the probability that the voice information it hears is invalid, adaptively adjusting the decision condition for judging validity by combining the environmental condition when the voice information is generated with the device's continuous listening duration allows the validity of voice information to be judged still better, improves the accuracy of validity judgment, and reduces the false-triggering rate of invalid signals.
In one possible implementation, that the decision condition is adjusted based on the environmental condition and the continuous listening duration of the device includes: the decision condition is adjusted based on the environmental condition, the continuous listening duration, and the condition of the historical voice information.
Optionally, the condition of the historical voice information includes one or more of the following: a first time interval between when the first voice information is acquired and when valid voice information was last acquired; a second time interval between when the first voice information is acquired and when invalid voice information was last acquired; the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired; a first degree of association between the semantics of the first voice information and those of the last acquired valid voice information; a second degree of association between the semantics of the first voice information and those of the last acquired invalid voice information; a third degree of association between the first voice information and the valid voice information last acquired by the device; the state of the voice dialogue between the device and the user up until the first voice information is acquired; a first similarity between the acoustic features of the first voice information and those of historical valid voice information; and a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
Optionally, the sensitivity of the decision condition is adjusted to be lower as the first time interval is longer.
Optionally, the sensitivity of the decision condition is adjusted to be lower as the second time interval is longer.
Optionally, in the case that the first time interval is smaller than the second time interval, the sensitivity of the decision condition is adjusted to be high.
Optionally, in the case that the proportion of valid voice information is greater than the proportion of invalid voice information, the sensitivity of the decision condition is increased.
In the case that the proportion of valid voice information is smaller than the proportion of invalid voice information: if the proportion of valid voice information is trending upward, the sensitivity of the decision condition is increased; if the proportion of valid voice information is trending downward, the sensitivity of the decision condition is decreased.
Optionally, in the presence of a state of voice dialogue between the device and the user, the sensitivity of the decision condition is increased (a sketch combining these optional rules follows this list).
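The optional rules above can be condensed into a single adjustment pass, sketched below; the field names and the fixed 0.05 step are hypothetical, and each rule only nudges the sensitivity in the direction stated above.

```python
from dataclasses import dataclass

@dataclass
class History:
    seconds_since_valid: float       # first time interval
    seconds_since_invalid: float     # second time interval
    valid_ratio: float               # proportion of valid voice information
    invalid_ratio: float             # proportion of invalid voice information
    valid_ratio_trending_up: bool
    in_dialog: bool                  # state of voice dialogue with the user

def adjust_for_history(sensitivity: float, h: History) -> float:
    step = 0.05
    # Longer first/second intervals lower the sensitivity (per-minute decay).
    sensitivity -= step * (h.seconds_since_valid / 60.0)
    sensitivity -= step * (h.seconds_since_invalid / 60.0)
    if h.seconds_since_valid < h.seconds_since_invalid:
        sensitivity += step          # valid speech was heard more recently
    if h.valid_ratio > h.invalid_ratio:
        sensitivity += step
    else:                            # valid proportion is the smaller one
        sensitivity += step if h.valid_ratio_trending_up else -step
    if h.in_dialog:                  # an ongoing voice dialogue exists
        sensitivity += step
    return max(0.0, min(1.0, sensitivity))
```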
In this embodiment, the device may adaptively adjust the sensitivity of the above decision condition by combining the environmental condition in which the first voice information is generated, the duration for which the device has continuously listened for voice information, and the historical voice information heard by the device. The specific implementation of adjusting the decision condition based on the environmental condition may be referred to the corresponding description in fig. 4; that based on the device's continuous listening duration may be referred to the corresponding description in fig. 5; and that based on the historical voice information heard by the device may be referred to the corresponding descriptions in fig. 5, 6A, 6B, 7 or 9. None of these are repeated here.
Optionally, in this embodiment, the sensitivity of the decision condition is adjusted by combining the above environmental condition, listening duration, and historical voice information. The combination may use the weighted comprehensive adjustment method described above, or may take the largest or smallest adjustment among the multiple influencing factors as the final result; this solution does not limit the specific combination procedure.
The historical voice information can also help judge the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to historically acquired valid voice information, the probability that it is a valid voice instruction is higher; conversely, if it is highly similar to historically acquired invalid voice information, the probability that it is an invalid voice instruction is higher. Therefore, in the embodiment of the application, besides the environmental condition in which the voice information is generated and the device's listening duration, the historical voice information is also combined to adaptively adjust the decision condition for judging validity, so that the validity of voice information can be judged still better, the accuracy of validity judgment is improved, and the false-triggering rate of invalid signals is reduced.
In one possible implementation, that the decision condition is obtained by adjustment based on the environmental condition in which the first voice information is generated includes: the decision condition is adjusted based on the environmental condition and the condition of the historical voice information.
In this embodiment, the device may adaptively adjust the sensitivity of the above decision condition by combining the environmental condition in which the first voice information is generated and the historical voice information heard by the device. The specific implementation of adjusting the decision condition based on the environmental condition may be referred to the corresponding description in fig. 4, and that based on the historical voice information heard by the device may be referred to the corresponding descriptions in fig. 5, 6A, 6B, 7 or 9; neither is repeated here.
Optionally, in this embodiment, the sensitivity of the decision condition is adjusted by combining the environmental condition and the historical voice information. The combination may use the weighted comprehensive adjustment method described above, or may take the largest or smallest adjustment among the multiple influencing factors as the final result; this solution does not limit the specific combination procedure.
Based on the foregoing description, the embodiment of the application combines the environmental condition in which the voice information is generated with the historical voice information to adaptively adjust the decision condition for judging the validity of voice information, so that the validity of voice information can be judged still better, the accuracy of validity judgment is improved, and the false-triggering rate of invalid signals is reduced.
In one possible embodiment, the present application provides another voice information processing method, which includes: acquiring first voice information; and executing the operation indicated by the first voice information under the condition that the first voice information is determined to be a valid voice control instruction based on a decision condition, wherein the decision condition is adjusted based on the continuous listening duration of the equipment.
In a specific embodiment, the specific implementation of acquiring the first voice information may be referred to the description in step S201 in fig. 2, which is not repeated here. The specific implementation of performing the operation indicated by the first voice information, in the case that the first voice information is determined to be a valid voice control instruction based on the decision condition, may be referred to the description in step S203 in fig. 2 and is not repeated here. The specific implementation of adjusting the above decision condition based on the device's continuous listening duration may be referred to the corresponding description in fig. 5 and is not repeated here.
In the application, because the longer the device continuously listens for speech, the higher the probability that the heard voice information is invalid, the decision condition for judging the validity of voice information can be adaptively adjusted according to the device's continuous listening duration. This allows the validity of voice information to be judged better, improves the accuracy of validity judgment, and reduces the false-triggering rate of invalid signals.
In one possible embodiment, the present application provides another voice information processing method, which includes: acquiring first voice information; and executing the operation indicated by the first voice information under the condition that the first voice information is determined to be a valid voice control instruction based on a judgment condition, wherein the judgment condition is obtained by adjusting based on the historical voice information.
In a specific embodiment, the specific implementation of acquiring the first voice information may be referred to the description in step S201 in fig. 2, which is not repeated here. The specific implementation of performing the operation indicated by the first voice information, in the case that the first voice information is determined to be a valid voice control instruction based on the decision condition, may be referred to the description in step S203 in fig. 2 and is not repeated here. The specific implementation of adjusting the decision condition based on the historical voice information heard by the device may be referred to the corresponding descriptions in fig. 5, 6A, 6B, 7 or 9 and is not repeated here.
The historical voice information can also help judge the validity of the currently acquired voice information. For example, if the currently acquired voice information is highly similar to historically acquired valid voice information, the probability that it is a valid voice instruction is higher; conversely, if it is highly similar to historically acquired invalid voice information, the probability that it is an invalid voice instruction is higher. Therefore, in the application, the decision condition for judging the validity of voice information is adaptively adjusted according to the historical voice information, so that validity can be judged better, the accuracy of validity judgment is improved, and the false-triggering rate of invalid signals is reduced.
In order to facilitate the overall understanding of the voice information processing method provided by the present application, reference is made to a flowchart shown in fig. 11, for example. In fig. 11, first, the voice interaction system of the device is awakened, and then the system begins to listen to the user's voice. After the system acquires the voice information of the user, the voice information is input into the invalid rejection model to identify the validity of the voice information. And if the voice information is identified to be effective, carrying out semantic understanding on the voice information, and carrying out instruction analysis and execution based on the understood semantics.
After semantic understanding, the voice interaction system judges whether to continue listening to the user's voice; if so, it performs the listening operation, and if not, it performs the operation of ending listening. For example, whether to continue listening may be determined according to a preset listening period: if the preset listening period has not yet elapsed, listening may continue; otherwise, listening ends.
If the voice information identified by the invalid rejection model is invalid, the system judges whether to continue to listen to the voice of the user, and if so, the system performs the operation of listening to the voice. If it is determined that the listening is not continued, an operation of ending the listening is performed.
In a possible implementation, in the flow shown in fig. 11, after the voice information is determined to be valid, the two steps of judging whether to continue listening to the user's voice and performing semantic understanding may be executed simultaneously, or the system may first judge whether to continue listening and then perform semantic understanding.
In addition, after semantic understanding of the voice information, the understood semantics may be fed back to the process of recognizing the validity of voice information; for example, they may be input to the invalid rejection model for adjusting the sensitivity of the decision condition.
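The fig. 11 loop might look as follows in code; every name on `system` is a hypothetical placeholder for the modules in the figure (wake-up, listening, invalid rejection model, semantic understanding, execution).

```python
def interaction_loop(system):
    system.wait_for_wakeup()                      # wake the voice interaction system
    deadline = system.now() + system.preset_listening_period
    while system.now() < deadline:                # "continue listening?" check
        info = system.listen()                    # acquire the user's voice information
        if system.invalid_rejection_model.is_valid(info):
            meaning = system.understand(info)     # semantic understanding
            system.parse_and_execute(meaning)     # instruction analysis and execution
            system.feed_back(meaning)             # return semantics to the rejection model
    system.end_listening()
```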
In addition, in the embodiment of the voice information processing method provided by the present application, the description is mainly given by taking the judgment condition in the invalid rejection model as an example, but in practical application, the judgment condition of the validity of the voice information may not be limited to the judgment condition in the invalid rejection model. Any scheme for adjusting the judgment condition of the validity of the voice information based on one or more of the influencing factors of the validity recognition of the voice information is within the protection scope of the present application.
In summary, the voice information processing method provided by the application starts from one or more factors that influence the judgment of voice-information validity and adjusts, in real time, the sensitivity of the decision condition used by the device to judge the validity of acquired voice information. The device can thus flexibly and effectively judge the validity of voice information across different scenarios and different user states, which improves the accuracy of validity recognition, reduces the false-triggering rate of invalid voice information, saves the computing resources wasted by false triggering, and improves the user experience during voice interaction.
The foregoing mainly describes the voice information processing method provided by the embodiment of the present application. It will be appreciated that each device, in order to implement the corresponding functions described above, includes corresponding hardware structures and/or software modules for performing each function. The elements and steps of the examples described in connection with the embodiments disclosed herein may be implemented in hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer-software-driven hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application can divide the functional modules of the device according to the method example, for example, each functional module can be divided corresponding to each function, and two or more functions can be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, the division of the modules in the embodiment of the present application is illustrative, and is merely a logic function division, and there may be another division manner in actual implementation.
Fig. 12 shows a schematic diagram of a possible logic structure of an apparatus, which may be the above-described device, or may be a chip in the device, or may be a processing system in the device, or the like, in the case where respective functional blocks are divided with corresponding respective functions. The apparatus 1200 includes an acquisition unit 1201, an adjustment unit 1202, a semantic understanding unit 1203, and an execution unit 1204. Wherein:
An acquisition unit 1201 is configured to acquire first voice information. The acquisition unit 1201 may be implemented by a communication interface or transceiver, and may perform the operations described in step 201 shown in fig. 2.
An adjusting unit 1202, configured to adjust a decision condition based on an influencing factor of the validity of the first voice information, where the decision condition is one or more decision conditions in a validity decision model of the first voice information, and the validity indicates whether the first voice information is a valid voice control instruction for the device that acquires the first voice information. The adjusting unit 1202 may be implemented by a processor and may perform the operations described in step 202 shown in fig. 2.
The semantic understanding unit 1203 is configured to perform semantic understanding on the first voice information when it is determined that the first voice information is valid based on the adjusted decision condition. The semantic understanding unit 1203 may be implemented by a processor and may perform the semantic understanding operations described in step 203 shown in fig. 2.
An execution unit 1204, configured to execute an instruction of the first voice information. The execution unit 1204 may be implemented by a processor, and may perform the execution operations described in step 203 shown in fig. 2.
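To make the module division of fig. 12 concrete, here is a compact sketch with stub method bodies; the interfaces are assumptions for illustration, not the application's actual API.

```python
class Apparatus1200:
    """Functional-module division of fig. 12 (illustrative stubs)."""
    def __init__(self, transceiver, processor):
        self.transceiver = transceiver   # backs the acquisition unit 1201
        self.processor = processor       # backs units 1202-1204

    def acquire(self):                                 # acquisition unit 1201
        return self.transceiver.receive()

    def adjust(self, condition, influencing_factors):  # adjusting unit 1202
        return self.processor.adjust_condition(condition, influencing_factors)

    def understand(self, info):                        # semantic understanding unit 1203
        return self.processor.run_nlu(info)

    def execute(self, meaning):                        # execution unit 1204
        return self.processor.dispatch(meaning)
```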
In a possible embodiment, the adjusting unit 1202 is specifically configured to:
In the case that analysis based on the influencing factor shows that the probability of the first voice information being valid is greater than the probability of it being invalid, the sensitivity of the decision condition is increased, where the higher the sensitivity of the decision condition, the higher the probability that the first voice information is determined to be valid through the decision condition;
in the case that analysis based on the influencing factor shows that the probability of the first voice information being valid is smaller than the probability of it being invalid, the sensitivity of the decision condition is decreased, where the lower the sensitivity of the decision condition, the lower the probability that the first voice information is determined to be valid through the decision condition.
In a possible implementation, the decision condition includes a selection condition of a pre-judging module for the validity of the first voice information in the validity decision model, and the pre-judging module includes a rule matching module and a reasoning module.
In a possible implementation manner, the judging condition includes a judging threshold value of an inference module for pre-judging the validity of the first voice information in the validity judging model.
In a possible implementation manner, the judging condition includes a comprehensive judging condition of a decision module in the validity judging model; the comprehensive judging condition is a judging condition for determining whether the first voice signal is effective or not based on a pre-judging result; the pre-judging result is a pre-judging result of the pre-judging module in the validity judging model on the validity of the first voice information.
In one possible embodiment, the influencing factor is one or more of the following:
the environmental condition of the first voice information when generated;
a duration of listening of the apparatus 1200;
a first time interval between when the first voice information is acquired and when valid voice information was last acquired;
a second time interval between when the first voice information is acquired and when invalid voice information was last acquired;
the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
a first degree of association between the semantics of the first voice information and those of the last acquired valid voice information;
a second degree of association between the semantics of the first voice information and those of the last acquired invalid voice information;
a third degree of association between the first voice information and the valid voice information last acquired by the apparatus 1200;
the state of the voice dialogue between the apparatus 1200 and the user up until the first voice information is acquired;
a first similarity between the acoustic features of the first voice information and those of historical valid voice information;
a second similarity between the acoustic features of the first voice information and those of historical invalid voice information.
In one possible implementation, the environmental condition in which the first voice information is generated includes one or more of the following:
the number of speakers within a second preset duration up until the apparatus 1200 acquires the first voice information, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
The specific operation and beneficial effects of each unit in the apparatus 1200 shown in fig. 12 may be referred to the corresponding description in the above method embodiments, and will not be repeated here.
Fig. 13 shows a schematic diagram of a possible logic structure of an apparatus, which may be the above-described device, or may be a chip in the device, or may be a processing system in the device, or the like, in the case where respective functional blocks are divided with corresponding respective functions. The apparatus 1300 includes an acquisition unit 1301 and an execution unit 1302. Wherein:
an acquiring unit 1301 is configured to acquire first voice information. The acquisition unit 1301 may be implemented by a communication interface or a transceiver, and may perform the operation described in step S1001 shown in fig. 10.
The execution unit 1302 is configured to execute the operation indicated by the first voice information when the first voice information is determined to be a valid voice control instruction based on a decision condition, where the decision condition is adjusted based on an environmental condition in which the first voice information is generated. The execution unit 1302 may be implemented by a processor, and may execute the operations described in step S1002 shown in fig. 10.
The specific operation and beneficial effects of each unit in the apparatus 1300 shown in fig. 13 may be referred to the corresponding description in the above method embodiments, and will not be repeated here.
Fig. 14 is a schematic diagram of a possible hardware structure of the apparatus provided by the present application, where the apparatus may be an apparatus in the method described in the foregoing embodiment. The apparatus 1400 includes: a processor 1401, a memory 1402 and a communication interface 1403. The processor 1401, the communication interface 1403, and the memory 1402 may be connected to each other or connected to each other through a bus 1404.
By way of example, memory 1402 is used to store computer programs and data for device 1400 and memory 1402 may include, but is not limited to, random access memory (random access memory, RAM), read-only memory (ROM), erasable programmable read-only memory (erasable programmable read only memory, EPROM), or portable read-only memory (compact disc read-only memory, CD-ROM), etc.
In the case of implementing the embodiment shown in fig. 12, software or program code required to perform the functions of all or part of the units in fig. 12 is stored in the memory 1402.
In the case of implementing the embodiment of fig. 12, if software or program code required for the functions of only part of the units is stored in the memory 1402, the processor 1401 may, in addition to calling the program code in the memory 1402 to implement part of the functions, cooperate with other components (such as the communication interface 1403) to perform the other functions (such as receiving or transmitting data) described in the embodiment of fig. 12.
The number of communication interfaces 1403 may be multiple to enable the device 1400 to communicate, e.g., receive or transmit data or signals, etc.
By way of example, the processor 1401 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may also be a combination that performs a computing function, for example a combination comprising one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The processor 1401 may be configured to read the program stored in the memory 1402 and perform the following operations:
Acquiring first voice information; adjusting a decision condition based on an influencing factor of the validity of the first voice information, where the decision condition is one or more decision conditions in a validity decision model of the first voice information, and the validity indicates whether the first voice information is a valid voice control instruction for the device 1400 that acquires it; and, in the case that the first voice information is determined to be valid based on the adjusted decision condition, performing semantic understanding on the first voice information and executing its instruction.
In a possible implementation, adjusting the decision condition based on an influencing factor of the validity of the first voice information includes:
in the case that analysis based on the influencing factor shows that the probability of the first voice information being valid is greater than the probability of it being invalid, increasing the sensitivity of the decision condition, where the higher the sensitivity of the decision condition, the higher the probability that the first voice information is determined to be valid through the decision condition;
in the case that analysis based on the influencing factor shows that the probability of the first voice information being valid is smaller than the probability of it being invalid, decreasing the sensitivity of the decision condition, where the lower the sensitivity of the decision condition, the lower the probability that the first voice information is determined to be valid through the decision condition.
The specific operation and beneficial effects of each unit in the apparatus 1400 shown in fig. 14 may be referred to the corresponding description in the above method embodiments, and will not be repeated here.
Fig. 15 is a schematic structural diagram of another voice information processing apparatus according to an embodiment of the present application. The apparatus may be a device in the foregoing embodiments, or a chip in the device, or a processing system in the device, etc., and may implement the voice information processing method of the present application and its various alternative embodiments. As shown in fig. 15, the voice information processing apparatus 1500 includes: a processor 1501 and an interface circuit 1502 coupled to the processor 1501. It should be appreciated that although only one processor and one interface circuit are shown in fig. 15, the voice information processing apparatus 1500 may include other numbers of processors and interface circuits.
Wherein interface circuit 1502 is used to communicate with other components of apparatus 1500, such as a memory or other processor. The processor 1501 is used to interact with other components through the interface circuit 1502. The interface circuit 1502 may be an input/output interface of the processor 1501.
For example, the processor 1501 reads computer programs or instructions in a memory coupled thereto through the interface circuit 1502 and decodes and executes the computer programs or instructions. It should be understood that these computer programs or instructions may include the various functional programs in the methods described above. When the corresponding functional program is decoded and executed by the processor 1501, the voice information processing apparatus 1500 can be caused to implement the scheme in the voice information processing method provided by the embodiment of the present application.
Alternatively, these functional programs are stored in a memory external to the speech information processing apparatus 1500. When the function program is decoded and executed by the processor 1501, a part or the whole of the function program is temporarily stored in the internal memory.
Alternatively, these functional programs are stored in a memory inside the voice information processing apparatus 1500. When the functional program is stored in the memory inside the voice information processing apparatus 1500, the voice information processing apparatus 1500 may be provided in the device of the embodiment of the present application.
Alternatively, part of the contents of these functional programs are stored in a memory external to the speech information processing apparatus 1500, and the other part of the contents of these functional programs are stored in a memory internal to the speech information processing apparatus 1500.
It should be understood that the apparatuses or devices shown in any of fig. 1, 12, 13, 14 and 15 may be combined with each other, and the design details related to each of them and to each alternative embodiment may be referred to each other; reference may also be made to the voice information processing method shown in fig. 2 or fig. 10 and the design details related to each of its alternative embodiments. The description is not repeated here.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program that is executed by a processor to implement the operations performed by the device in any of the above embodiments and their possible implementations.
Embodiments of the present application also provide a computer program product; when the computer program product is read and executed by a computer, the operations performed by the device in any of the above embodiments and their possible implementations are performed.
Embodiments of the present application also provide a computer program which, when executed on a computer, causes the computer to implement the operations performed by the device in any of the above embodiments and their possible implementations.
In summary, the application provides a voice information processing method and device, which can improve the accuracy of effective voice recognition and reduce the false triggering rate of ineffective voice in different intelligent voice interaction scenes.
The terms "first," "second," and the like in this disclosure are used for distinguishing between similar elements or items having substantially the same function and function, and it should be understood that there is no logical or chronological dependency between the terms "first," "second," and "n," and that there is no limitation on the amount and order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another element. For example, a first image may be referred to as a second image, and similarly, a second image may be referred to as a first image, without departing from the scope of the various described examples. The first image and the second image may both be images, and in some cases may be separate and distinct images.
It should also be understood that, in the embodiments of the present application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.
It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should be further appreciated that reference throughout this specification to "one embodiment," "an embodiment," "one possible implementation" means that a particular feature, structure, or characteristic described in connection with the embodiment or implementation is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment," "one possible implementation" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (24)

1. A method for processing voice information, the method comprising:
Acquiring first voice information;
Executing the operation indicated by the first voice information in the case that the first voice information is determined to be a valid voice control instruction based on a decision condition, wherein the decision condition is obtained by adjustment based on the environmental condition in which the first voice information is generated and the condition of historical voice information;
The condition of the historical voice information comprises one or more of the following:
a first time interval between when the first voice information is acquired and when valid voice information was last acquired;
a second time interval between when the first voice information is acquired and when invalid voice information was last acquired;
the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
a first degree of association between the semantics of the first voice information and those of the last acquired valid voice information;
a second degree of association between the semantics of the first voice information and those of the last acquired invalid voice information;
the state of the voice dialogue between the device and the user up until the first voice information is acquired.
2. The method of claim 1, wherein the environmental conditions in which the first voice information is generated include one or more of:
the number of speakers within a second preset duration up until the first voice information is acquired, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
3. The method of claim 1, wherein the decision condition being adjusted based on the environmental condition in which the first voice information is generated and the condition of the historical voice information comprises: the decision condition is obtained by adjustment based on the environmental condition, a continuous listening duration, and the condition of the historical voice information.
4. The method of claim 1, wherein
in the case that the environmental condition indicates that the probability of the first voice information being valid is greater than the probability of it being invalid, the sensitivity of the decision condition is increased;
in the case that the environmental condition indicates that the probability of the first voice information being valid is smaller than the probability of it being invalid, the sensitivity of the decision condition is decreased.
5. The method of claim 3, wherein the longer the continuous listening duration, the lower the sensitivity of the decision condition is adjusted.
6. The method according to any one of claims 1 to 5, wherein the condition of the historical speech information includes a first time interval between when the first speech information was acquired and when valid speech information was last acquired;
the longer the first time interval, the lower the sensitivity of the decision condition is adjusted.
7. The method of any one of claims 1 to 5, wherein the history of speech information comprises a second time interval between when the first speech information was acquired and when invalid speech information was last acquired;
the longer the second time interval, the lower the sensitivity of the decision condition is adjusted.
8. The method of any one of claims 1 to 5, wherein the history of speech information comprises a first time interval between when the first speech information was acquired and when valid speech information was last acquired, and a second time interval between when the first speech information was acquired and when invalid speech information was last acquired;
in case the first time interval is smaller than the second time interval, the sensitivity of the decision condition is adjusted to be high.
9. The method of any one of claims 1 to 5, wherein the condition of the historical voice information includes the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
in the case that the proportion of valid voice information is greater than the proportion of invalid voice information, the sensitivity of the decision condition is increased;
in the case that the proportion of valid voice information is smaller than the proportion of invalid voice information: if the proportion of valid voice information is trending upward, the sensitivity of the decision condition is increased; if the proportion of valid voice information is trending downward, the sensitivity of the decision condition is decreased.
10. The method of any one of claims 1 to 5, wherein the condition of the historical voice information includes the state of the voice dialogue between the device and the user up until the first voice information is acquired;
In the presence of a state of the device in speech dialogue with the user, the sensitivity of the decision condition is increased.
11. A speech information processing apparatus, characterized in that the apparatus comprises:
The acquisition unit is used for acquiring the first voice information;
The execution unit is configured to execute the operation indicated by the first voice information in the case that the first voice information is determined to be a valid voice control instruction based on a decision condition, wherein the decision condition is obtained by adjustment based on the environmental condition in which the first voice information is generated and the condition of historical voice information;
The condition of the historical voice information comprises one or more of the following:
a first time interval between when the first voice information is acquired and when valid voice information was last acquired;
a second time interval between when the first voice information is acquired and when invalid voice information was last acquired;
the proportions of valid voice information and invalid voice information within a first preset duration before the first voice information is acquired;
a first degree of association between the semantics of the first voice information and those of the last acquired valid voice information;
a second degree of association between the semantics of the first voice information and those of the last acquired invalid voice information;
the state of the voice dialogue between the device and the user up until the first voice information is acquired.
12. The apparatus of claim 11, wherein the environmental conditions in which the first voice information is generated include one or more of:
the number of speakers within a second preset duration up until the first voice information is acquired, the number of people within a preset range when the first voice information is generated, the confidence of the first voice information, or the signal-to-noise ratio of the first voice information.
13. The apparatus of claim 11, wherein the decision condition being adjusted based on the environmental condition in which the first voice information is generated and the condition of the historical voice information comprises: the decision condition is obtained by adjustment based on the environmental condition, a continuous listening duration, and the condition of the historical voice information.
14. The apparatus of claim 11, wherein
in the case that the environmental condition indicates that the probability of the first voice information being valid is greater than the probability of it being invalid, the sensitivity of the decision condition is increased;
in the case that the environmental condition indicates that the probability of the first voice information being valid is smaller than the probability of it being invalid, the sensitivity of the decision condition is decreased.
15. The apparatus of claim 13, wherein the longer the continuous listening duration, the lower the sensitivity of the decision condition is adjusted.
16. The apparatus according to any one of claims 11 to 15, wherein the condition of the historical speech information comprises a first time interval between when the first speech information was acquired and when valid speech information was last acquired;
the longer the first time interval, the lower the sensitivity of the decision condition is adjusted.
17. The apparatus according to any one of claims 11 to 15, wherein the condition of the historical voice information comprises the second time interval between the time when the first voice information is acquired and the time when invalid voice information was last acquired; and
the longer the second time interval, the lower the sensitivity of the judgment condition is adjusted.
18. The apparatus according to any one of claims 11 to 15, wherein the condition of the historical voice information comprises both the first time interval between the time when the first voice information is acquired and the time when valid voice information was last acquired and the second time interval between the time when the first voice information is acquired and the time when invalid voice information was last acquired; and
when the first time interval is smaller than the second time interval, the sensitivity of the judgment condition is increased.
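The interval rules of claims 16 to 18 could then be sketched as below; the decay constants and the 0.1 bonus are again illustrative assumptions.

```python
def adjust_for_intervals(sensitivity: float,
                         first_interval_s: float,
                         second_interval_s: float) -> float:
    """One possible reading of the interval rules of claims 16-18."""
    # Claim 16: the longer since the last valid instruction, the lower the sensitivity.
    sensitivity -= 0.02 * (first_interval_s / 60.0)
    # Claim 17: the longer since the last invalid utterance, the lower the sensitivity.
    sensitivity -= 0.02 * (second_interval_s / 60.0)
    # Claim 18: a valid instruction more recent than an invalid one raises sensitivity.
    if first_interval_s < second_interval_s:
        sensitivity += 0.1
    return max(0.0, min(1.0, sensitivity))
```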
19. The apparatus according to any one of claims 11 to 15, wherein the condition of the historical voice information comprises the proportions of valid voice information and invalid voice information within the first preset duration before the first voice information is acquired;
when the proportion of valid voice information is greater than the proportion of invalid voice information, the sensitivity of the judgment condition is increased; and
when the proportion of valid voice information is less than the proportion of invalid voice information: if the proportion of valid voice information shows a rising trend, the sensitivity of the judgment condition is increased; if the proportion of valid voice information shows a falling trend, the sensitivity of the judgment condition is decreased.
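One hedged reading of claim 19's proportion rules, with an invented 0.1 step:

```python
def adjust_for_proportions(sensitivity: float, valid_ratio: float,
                           invalid_ratio: float, valid_ratio_rising: bool) -> float:
    """Proportion rules of claim 19: the dominant class of recent utterances,
    and otherwise the trend of the valid proportion, decides the direction."""
    if valid_ratio > invalid_ratio:
        sensitivity += 0.1
    elif valid_ratio < invalid_ratio:
        sensitivity += 0.1 if valid_ratio_rising else -0.1
    return max(0.0, min(1.0, sensitivity))
```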
20. The apparatus according to any one of claims 11 to 15, wherein the condition of the historical voice information comprises the state of the voice dialogue between the device and the user up to the time when the first voice information is acquired; and
when the device is in a state of voice dialogue with the user, the sensitivity of the judgment condition is increased.
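Finally, a sketch of claim 20's dialogue-state rule, with the same invented step size:

```python
def adjust_for_dialogue(sensitivity: float, in_dialogue: bool) -> float:
    """Claim 20: an ongoing voice dialogue with the user raises sensitivity,
    since a follow-up utterance is more likely directed at the device."""
    return min(1.0, sensitivity + 0.1) if in_dialogue else sensitivity
```

Chained in some order (the order is another assumption), these adjusters would produce the sensitivity value that `is_valid_instruction` consumes.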
21. An electronic device, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so that the electronic device performs the method of any one of claims 1 to 10.
22. A chip system, wherein the chip system is applied to an electronic device; the chip system comprises an interface circuit and a processor; the interface circuit and the processor are interconnected through a line; the interface circuit is configured to receive a signal from a memory of the electronic device and send the signal to the processor, the signal comprising computer instructions stored in the memory; and when the processor executes the computer instructions, the chip system performs the method of any one of claims 1 to 10.
23. A computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program, when executed by a processor, implements the method of any one of claims 1 to 10.
24. A computer program product, wherein when the computer program product is executed by a processor, the method of any one of claims 1 to 10 is performed.
CN202180001492.4A 2021-04-20 2021-04-20 Voice information processing method and equipment Active CN113330513B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/088522 WO2022222045A1 (en) 2021-04-20 2021-04-20 Speech information processing method, and device

Publications (2)

Publication Number Publication Date
CN113330513A (en) 2021-08-31
CN113330513B (en) 2024-08-27

Family

ID=77427019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180001492.4A Active CN113330513B (en) 2021-04-20 2021-04-20 Voice information processing method and equipment

Country Status (2)

Country Link
CN (1) CN113330513B (en)
WO (1) WO2022222045A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376513B (en) * 2022-10-19 2023-05-12 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116483960B (en) * 2023-03-30 2024-01-02 阿波罗智联(北京)科技有限公司 Dialogue identification method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622770A (en) * 2017-09-30 2018-01-23 百度在线网络技术(北京)有限公司 voice awakening method and device
CN108320742A (en) * 2018-01-31 2018-07-24 广东美的制冷设备有限公司 Voice interactive method, smart machine and storage medium
CN110148405A (en) * 2019-04-10 2019-08-20 北京梧桐车联科技有限责任公司 Phonetic order processing method and processing device, electronic equipment and storage medium
CN110556107A (en) * 2019-08-23 2019-12-10 宁波奥克斯电气股份有限公司 control method and system capable of automatically adjusting voice recognition sensitivity, air conditioner and readable storage medium
CN111580773A (en) * 2020-04-15 2020-08-25 北京小米松果电子有限公司 Information processing method, device and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102044243B (en) * 2009-10-15 2012-08-29 华为技术有限公司 Method and device for voice activity detection (VAD) and encoder
CN103578468B (en) * 2012-08-01 2017-06-27 联想(北京)有限公司 The method of adjustment and electronic equipment of a kind of confidence coefficient threshold of voice recognition
CN103065631B (en) * 2013-01-24 2015-07-29 华为终端有限公司 A kind of method of speech recognition, device
KR101698369B1 (en) * 2015-11-24 2017-01-20 주식회사 인텔로이드 Method and apparatus for information providing using user speech signal
US9818404B2 (en) * 2015-12-22 2017-11-14 Intel Corporation Environmental noise detection for dialog systems
CN109326289B (en) * 2018-11-30 2021-10-22 深圳创维数字技术有限公司 Wake-up-free voice interaction method, device, equipment and storage medium
CN110211605A (en) * 2019-05-24 2019-09-06 珠海多士科技有限公司 Smart machine speech sensitivity adjusting method, device, equipment and storage medium
CN110782891B (en) * 2019-10-10 2022-02-18 珠海格力电器股份有限公司 Audio processing method and device, computing equipment and storage medium
CN110718223B (en) * 2019-10-28 2021-02-12 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for voice interaction control

Also Published As

Publication number Publication date
WO2022222045A1 (en) 2022-10-27
CN113330513A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
KR102535338B1 (en) Speaker diarization using speaker embedding(s) and trained generative model
CN110428810B (en) Voice wake-up recognition method and device and electronic equipment
CN111508474B (en) Voice interruption method, electronic equipment and storage device
CN105009204B (en) Speech recognition power management
CN112368769B (en) End-to-end stream keyword detection
CN111880856B (en) Voice wakeup method and device, electronic equipment and storage medium
CN110444193A (en) The recognition methods of voice keyword and device
KR20190123362A (en) Method and Apparatus for Analyzing Voice Dialogue Using Artificial Intelligence
CN110534099A (en) Voice wakes up processing method, device, storage medium and electronic equipment
CN113330513B (en) Voice information processing method and equipment
CN110047481A (en) Method for voice recognition and device
US11361764B1 (en) Device naming-indicator generation
JP2021089438A (en) Selective adaptation and utilization of noise reduction technique in invocation phrase detection
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
JP7516571B2 (en) Hotword threshold auto-tuning
CN109686368B (en) Voice wake-up response processing method and device, electronic equipment and storage medium
CN111710337A (en) Voice data processing method and device, computer readable medium and electronic equipment
CN111192590A (en) Voice wake-up method, device, equipment and storage medium
CN110853669B (en) Audio identification method, device and equipment
CN111145748A (en) Audio recognition confidence determining method, device, equipment and storage medium
KR20190001435A (en) Electronic device for performing operation corresponding to voice input
CN112669818B (en) Voice wake-up method and device, readable storage medium and electronic equipment
CN111862943A (en) Speech recognition method and apparatus, electronic device, and storage medium
CN116705033A (en) System on chip for wireless intelligent audio equipment and wireless processing method
CN114495981A (en) Method, device, equipment, storage medium and product for judging voice endpoint

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant