CN117116250A - Voice interaction rejection method and device - Google Patents

Voice interaction rejection method and device

Info

Publication number
CN117116250A
Authority
CN
China
Prior art keywords
audio information
preset
current audio
policy
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211737105.6A
Other languages
Chinese (zh)
Inventor
盛佳琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen TCL New Technology Co Ltd
Original Assignee
Shenzhen TCL New Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen TCL New Technology Co Ltd
Priority to CN202211737105.6A
Publication of CN117116250A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/26 Speech to text systems
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides a voice interaction rejection method and device. The voice interaction rejection method comprises the following steps: acquiring current audio information received by a target device; acquiring a first preset policy library matched with the target device; judging whether a recognition policy matching the current audio information exists in the first preset policy library; if no recognition policy matching the current audio information exists in the first preset policy library, acquiring current multimodal feature information of the target device; and determining a rejection result according to the current multimodal feature information. The application improves rejection accuracy, and by combining the faster policy-matching path with the more accurate collaborative inference over multiple modal features, it balances rejection accuracy against recognition efficiency.

Description

Voice interaction rejection method and device
Technical Field
The application relates to the technical field of voice interaction, and in particular to a voice interaction rejection method and device.
Background
A rejection algorithm determines whether an utterance is addressed to the voice assistant and therefore requires a response. Existing rejection still operates at the text stage, rejecting through text analysis against pre-established policies. As speech technology evolves, more and more users are beginning to use voice assistants. In automotive and home settings, the voice assistant is constrained by a relatively noisy environment, so noise, background music, the device's own echo, and chatting are erroneously recognized as commands to the voice assistant. For such utterances, the voice assistant either replies "I didn't catch that, please say it again" or directly performs some preset operation such as playing a video or music. Such inappropriate reactions lead to a very poor user experience, causing users to lose confidence in the voice assistant and switch the function off.
At present, rejection mostly stays at the text stage, rejecting through text analysis against pre-established policies. When an utterance has only one or two words or its semantics are incomplete (e.g. "I want to watch", "play"), its meaning cannot be recognized and no operation is performed. However, because the scenarios of a spoken dialogue system are complex and diverse, plain text is insufficient to capture the current environment and speaker, and rejection cannot be performed accurately.
That is, the voice interaction rejection methods in the prior art are inaccurate.
Disclosure of Invention
The application provides a voice interaction rejection method and device, and aims to solve the problem that voice interaction rejection methods in the prior art are inaccurate.
In a first aspect, the present application provides a voice interaction rejection method, where the voice interaction rejection method includes:
acquiring current audio information received by a target device;
acquiring a first preset policy library matched with the target device;
judging whether a recognition policy matching the current audio information exists in the first preset policy library;
if no recognition policy matching the current audio information exists in the first preset policy library, acquiring current multimodal feature information of the target device;
and determining a rejection result according to the current multimodal feature information.
Optionally, if no recognition policy matching the current audio information exists in the first preset policy library, acquiring the current multimodal feature information of the target device includes:
if no recognition policy matching the current audio information exists in the first preset policy library, acquiring a second preset policy library, where the second preset policy library is determined according to the first preset policy libraries of a plurality of different users;
judging whether a recognition policy matching the current audio information exists in the second preset policy library;
and if no recognition policy matching the current audio information exists in the second preset policy library, acquiring the current multimodal feature information of the target device.
Optionally, before the step of acquiring the current audio information received by the target device, the method includes:
acquiring an initialization policy library of the target device;
acquiring the user's feedback information on the initialization policy library according to a first preset period;
and updating the initialization policy library of the target device according to the user's feedback information on the initialization policy library to obtain the first preset policy library.
Optionally, before the step of acquiring the current audio information received by the target device, the method includes:
acquiring a plurality of first preset policy libraries corresponding to a plurality of different users according to a second preset period;
acquiring, for each recognition policy, the ratio of the number of users whose first preset policy libraries contain it to the total number of users;
and updating the recognition policies whose user ratio exceeds a preset ratio into the second preset policy library.
Optionally, the multimodal features include temporal features and non-temporal features, and determining the rejection result according to the current multimodal feature information includes:
fusing the temporal features in time order to obtain a fused temporal feature;
inputting the fused temporal feature into a preset vector representation model to obtain a fused temporal feature vector;
inputting the fused temporal feature vector and the feature vector of each non-temporal feature into a fully connected layer to obtain a fully connected feature vector;
and determining the rejection result according to the fully connected feature vector.
Optionally, determining the rejection result according to the fully connected feature vector includes:
normalizing the fully connected feature vector to obtain a rejection probability;
if the rejection probability is higher than a preset threshold, not responding to the current audio information; and if the rejection probability is not higher than the preset threshold, responding to the current audio information.
Optionally, the temporal features include audio sequence features of the speaker, text features of the speaker, body posture features of the speaker, and distance features of the speaker.
Optionally, the non-temporal features include a signal-to-noise ratio feature of the current audio information, a volume feature of the current audio information, a speech rate feature of the current audio information, a sound energy feature of the current audio information, an intent feature of the current audio information, a contextual intent feature of the current audio information, a near/far-field feature of the current audio information, and the device type of the target device.
Optionally, the non-temporal features further include a scene matching feature of the current audio information, and acquiring the current multimodal feature information of the target device includes:
acquiring display screen interface information of the target device at the time the target device acquires the current audio information;
determining the display interface scene type according to the display screen interface information;
acquiring the audio scene type of the current audio information;
and determining the scene matching feature according to the degree of match between the audio scene type and the display interface scene type.
Optionally, the recognition policies in the first preset policy library include a plurality of pass policies and a plurality of rejection policies, and judging whether a recognition policy matching the current audio information exists in the first preset policy library includes:
performing speech recognition on the current audio information to obtain current text features;
acquiring the preset audio information and preset text features in a recognition policy;
calculating a first similarity between the current audio information and the preset audio information in the recognition policy;
calculating a second similarity between the current text features and the preset text features in the recognition policy;
and determining a recognition policy whose first similarity is greater than a first similarity threshold and whose second similarity is greater than a second similarity threshold as the recognition policy matching the current audio information.
Optionally, the voice interaction rejection method further includes:
if a recognition policy matching the current audio information exists in the first preset policy library, acquiring the policy type of the recognition policy matching the current audio information;
if the policy type is the pass policy type, responding to the current audio information; and if the policy type is the rejection policy type, not responding to the current audio information.
In a second aspect, the present application provides a voice interaction rejection device, including:
a first acquisition unit, configured to acquire current audio information received by a target device;
a second acquisition unit, configured to acquire a first preset policy library matched with the target device;
a judging unit, configured to judge whether a recognition policy matching the current audio information exists in the first preset policy library;
a third acquisition unit, configured to acquire current multimodal feature information of the target device if no recognition policy matching the current audio information exists in the first preset policy library;
and a determining unit, configured to determine a rejection result according to the current multimodal feature information.
Optionally, the third acquisition unit is configured to:
if no recognition policy matching the current audio information exists in the first preset policy library, acquire a second preset policy library, where the second preset policy library is determined according to the first preset policy libraries of a plurality of different users;
judge whether a recognition policy matching the current audio information exists in the second preset policy library;
and if no recognition policy matching the current audio information exists in the second preset policy library, acquire the current multimodal feature information of the target device.
Optionally, the first acquisition unit is configured to:
acquire an initialization policy library of the target device;
acquire the user's feedback information on the initialization policy library according to a first preset period;
and update the initialization policy library of the target device according to the user's feedback information on the initialization policy library to obtain the first preset policy library.
Optionally, the first acquisition unit is configured to:
acquire a plurality of first preset policy libraries corresponding to a plurality of different users according to a second preset period;
acquire, for each recognition policy, the ratio of the number of users whose first preset policy libraries contain it to the total number of users;
and update the recognition policies whose user ratio exceeds a preset ratio into the second preset policy library.
Optionally, the multimodal features include temporal features and non-temporal features, and the determining unit is configured to:
fuse the temporal features in time order to obtain a fused temporal feature;
input the fused temporal feature into a preset vector representation model to obtain a fused temporal feature vector;
input the fused temporal feature vector and the feature vector of each non-temporal feature into a fully connected layer to obtain a fully connected feature vector;
and determine the rejection result according to the fully connected feature vector.
Optionally, the determining unit is configured to:
normalize the fully connected feature vector to obtain a rejection probability;
if the rejection probability is higher than a preset threshold, not respond to the current audio information; and if the rejection probability is not higher than the preset threshold, respond to the current audio information.
Optionally, the temporal features include audio sequence features of the speaker, text features of the speaker, body posture features of the speaker, and distance features of the speaker.
Optionally, the non-temporal features include a signal-to-noise ratio feature of the current audio information, a volume feature of the current audio information, a speech rate feature of the current audio information, a sound energy feature of the current audio information, an intent feature of the current audio information, a contextual intent feature of the current audio information, a near/far-field feature of the current audio information, and the device type of the target device.
Optionally, the non-temporal features further include a scene matching feature of the current audio information, and the first acquisition unit is configured to:
acquire display screen interface information of the target device at the time the target device acquires the current audio information;
determine the display interface scene type according to the display screen interface information;
acquire the audio scene type of the current audio information;
and determine the scene matching feature according to the degree of match between the audio scene type and the display interface scene type.
Optionally, the recognition policies in the first preset policy library include a plurality of pass policies and a plurality of rejection policies, and the judging unit is configured to:
perform speech recognition on the current audio information to obtain current text features;
acquire the preset audio information and preset text features in a recognition policy;
calculate a first similarity between the current audio information and the preset audio information in the recognition policy;
calculate a second similarity between the current text features and the preset text features in the recognition policy;
and determine a recognition policy whose first similarity is greater than a first similarity threshold and whose second similarity is greater than a second similarity threshold as the recognition policy matching the current audio information.
Optionally, the determining unit is configured to:
if a recognition policy matching the current audio information exists in the first preset policy library, acquire the policy type of the recognition policy matching the current audio information;
if the policy type is the pass policy type, respond to the current audio information; and if the policy type is the rejection policy type, not respond to the current audio information.
In a third aspect, the present application provides a smart device, including:
one or more processors;
a memory; and
one or more applications, where the one or more applications are stored in the memory and configured to be executed by the processor to implement the voice interaction rejection method of any one of the first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the voice interaction rejection method of any one of the first aspects.
The application provides a voice interaction rejection method and device. The voice interaction rejection method includes: acquiring current audio information received by a target device; acquiring a first preset policy library matched with the target device; judging whether a recognition policy matching the current audio information exists in the first preset policy library; if no recognition policy matching the current audio information exists in the first preset policy library, acquiring current multimodal feature information of the target device; and determining a rejection result according to the current multimodal feature information. When the target device's current audio information is acquired, policy matching first checks whether a matching recognition policy exists; only when none exists is the multimodal feature information of the target device acquired, and collaborative inference over the features of multiple modalities improves rejection accuracy. At the same time, combining the faster policy-matching path with the more accurate multimodal collaborative inference balances rejection accuracy against recognition efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; for a person skilled in the art, other drawings may be obtained from them without inventive effort.
Fig. 1 is a schematic view of a scenario of a voice interaction rejection system provided in an embodiment of the present application;
FIG. 2 is a flow chart of one embodiment of the voice interaction rejection method provided in an embodiment of the present application;
FIG. 3 is a flow chart of determining the rejection result according to the current multimodal feature information in the voice interaction rejection method provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an embodiment of the voice interaction rejection device provided in an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an embodiment of a smart device provided in an embodiment of the present application.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of protection of the present application.
In the description of the present application, it should be understood that terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" indicate orientations or positional relationships based on the drawings, are merely for convenience and simplicity of description, and do not indicate or imply that the referenced apparatus or element must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present application. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the present application, "a plurality" means two or more, unless explicitly defined otherwise.
In the present application, the word "exemplary" is used to mean "serving as an example, instance, or illustration". Any embodiment described as "exemplary" in this application is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the application. In the following description, details are set forth for purposes of explanation. It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known structures and processes are not described in detail so as not to obscure the description of the application with unnecessary detail. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Embodiments of the application provide a voice interaction rejection method and device, each of which is described in detail below.
Referring to fig. 1, fig. 1 is a schematic view of a scenario of a voice interaction rejection system provided in an embodiment of the present application. The voice interaction rejection system may include a smart device 100, with a voice interaction rejection device integrated in the smart device 100.
In the embodiment of the present application, the smart device 100 may be a general-purpose computer device or a special-purpose computer device. In a specific implementation, the smart device 100 may be a desktop computer, a portable computer, a network server, a handheld computer (Personal Digital Assistant, PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, or the like; this embodiment does not limit the type of the smart device 100.
It will be understood by those skilled in the art that the application environment shown in fig. 1 is only one application scenario of the present application and does not limit its application scenarios. Other application environments may include more or fewer smart devices than shown in fig. 1; for example, only one smart device is shown in fig. 1, but the voice interaction rejection system may further include one or more other smart devices capable of processing data, which is not limited here.
In addition, as shown in fig. 1, the voice interaction rejection system may further include a memory 200 for storing data.
It should be noted that the scenario of the voice interaction rejection system shown in fig. 1 is only an example. The voice interaction rejection system and scenario described in the embodiments of the present application are intended to explain the technical solutions of the embodiments more clearly and do not limit the technical solutions provided. As a person of ordinary skill in the art will appreciate, with the evolution of voice interaction rejection systems and the appearance of new service scenarios, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
First, an embodiment of the application provides a voice interaction rejection method, including: acquiring current audio information received by a target device; acquiring a first preset policy library matched with the target device; judging whether a recognition policy matching the current audio information exists in the first preset policy library; if no recognition policy matching the current audio information exists in the first preset policy library, acquiring current multimodal feature information of the target device; and determining a rejection result according to the current multimodal feature information.
As shown in fig. 2, fig. 2 is a flow chart of an embodiment of the voice interaction rejection method provided in an embodiment of the present application. The voice interaction rejection method includes steps S201 to S205 as follows:
S201, acquiring current audio information received by the target device.
In the embodiment of the application, the target device may be a smart television, an integrated stove, a smart car, or the like. The target device records audio of a preset duration through a microphone or sound pickup to obtain the current audio information received by the target device. The preset duration may be 5 s, 3 s, etc., set according to the specific situation.
S202, acquiring a first preset policy library matched with the target device.
The first preset policy library may include a plurality of recognition policies; the plurality of recognition policies may include a plurality of rejection policies and a plurality of pass policies, and each recognition policy in the first preset policy library may be set according to human experience.
In a specific embodiment, the device number of the target device is obtained, and the corresponding first preset policy library is obtained according to the device number. Specifically, when the user uses the target device for the first time, the user registers by voice or face, and the device is bound to the user through the device number. The rejection policies may include rejection policy 1, rejection policy 2, rejection policy 3, and rejection policy 4.
Rejection policy 1: if a single word other than one on the whitelist appears in the preset audio information, reject. The whitelist includes words such as "okay", "good", and "right".
Rejection policy 2: if the text length of the preset audio information exceeds 20 Chinese characters, reject.
Rejection policy 3: if the preset audio information is a meaningless stop word, reject. For example: "because", "this", "here".
Rejection policy 4: if the preset audio information is a person appellation, reject. For example: "dad", "mom", "brother", "sister".
The pass policies may include pass policy 1 and pass policy 2.
Pass policy 1: if the preset audio information contains an action verb such as "I want to watch", "play", "I want to hear", or "search", do not reject.
Pass policy 2: if the preset audio information contains one of the user's high-frequency phrases, such as the name of a frequently requested program, do not reject.
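As a concrete illustration, the following Python sketch shows how such hand-written pass and rejection policies could be evaluated in order. All word lists are illustrative stand-ins based on the examples above, not the patent's actual configuration, and the character-length check is a rough English analogue of the 20-Chinese-character limit.

```python
# Illustrative policy word lists; these are assumptions for the sketch,
# not the exact configuration described in the embodiment.
SINGLE_WORD_WHITELIST = {"okay", "good", "right"}
STOP_WORDS = {"because", "this", "here"}
PERSON_NAMES = {"dad", "mom", "brother", "sister"}
ACTION_VERBS = ("i want to watch", "play", "i want to hear", "search")

def match_policy(text: str) -> str | None:
    """Return 'pass', 'reject', or None when no policy matches."""
    t = text.strip().lower()
    if any(t.startswith(v) for v in ACTION_VERBS):
        return "pass"        # pass policy 1: explicit action verb
    words = t.split()
    if len(words) == 1 and words[0] not in SINGLE_WORD_WHITELIST:
        return "reject"      # rejection policy 1: stray single word
    if len(t) > 20:
        return "reject"      # rejection policy 2: over-long utterance
    if t in STOP_WORDS:
        return "reject"      # rejection policy 3: meaningless stop word
    if t in PERSON_NAMES:
        return "reject"      # rejection policy 4: bare person appellation
    return None              # no policy matched; fall through
```

A None result corresponds to the case handled in S204 below, where the method falls back to multimodal inference.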
S203, judging whether a recognition policy matching the current audio information exists in the first preset policy library.
In a specific embodiment, a first similarity between the current audio information and the preset audio information in each recognition policy is calculated. If the first similarity between the current audio information and the preset audio information in a recognition policy is greater than a first similarity threshold, it is determined that a recognition policy matching the current audio information exists in the first preset policy library; if no first similarity is greater than the first similarity threshold, it is determined that no recognition policy matching the current audio information exists in the first preset policy library. The first similarity threshold may be 80%, 90%, etc., set according to the specific situation.
In another specific embodiment, judging whether a recognition policy matching the current audio information exists in the first preset policy library includes:
(1) performing speech recognition on the current audio information to obtain current text features;
(2) acquiring the preset audio information and preset text features in a recognition policy;
(3) calculating a first similarity between the current audio information and the preset audio information in the recognition policy;
(4) calculating a second similarity between the current text features and the preset text features in the recognition policy;
(5) determining a recognition policy whose first similarity is greater than a first similarity threshold and whose second similarity is greater than a second similarity threshold as the recognition policy matching the current audio information.
The first and second similarity thresholds may each be 80%, 90%, etc., set according to the specific situation.
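One possible realization of this two-threshold match, assuming the current audio and text have already been embedded as fixed-length vectors (the embedding steps themselves are outside this sketch, and cosine similarity is an assumed choice of measure):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def find_matching_policy(audio_vec, text_vec, policies,
                         first_thresh=0.9, second_thresh=0.9):
    """Return the first recognition policy whose preset audio AND preset
    text are both similar enough to the current utterance, else None."""
    for policy in policies:
        s1 = cosine(audio_vec, policy["audio_vec"])  # first similarity
        s2 = cosine(text_vec, policy["text_vec"])    # second similarity
        if s1 > first_thresh and s2 > second_thresh:
            return policy
    return None
```

The 0.9 defaults mirror the 90% example; in practice both thresholds would be tuned per deployment.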
Further, in order to dynamically update the user-personalized first preset policy library, before acquiring the current audio information received by the target device, the method may include:
(1) acquiring an initialization policy library of the target device.
The initialization policy library is a factory-set policy library consisting of a plurality of recognition policies determined from human experience.
(2) acquiring the user's feedback information on the initialization policy library according to a first preset period.
The first preset period may be 1 hour, 2 hours, etc., set according to the specific situation. The feedback information may include operational actions the user takes in response to how the target device handled the audio information, such as shutting down the target device or repeating the audio information. For example, if the target device responds to a user's preset audio information and the user then issues an instruction to shut down the target device, the preset audio information is added to the rejection policies; if the target device does not respond to a user's preset audio information and the user repeats it, the repetition indicates that the user is re-emphasizing a genuine request, and the preset audio information is added to the pass policies.
(3) updating the initialization policy library of the target device according to the user's feedback information on the initialization policy library to obtain the first preset policy library.
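A minimal sketch of this feedback-driven update, assuming each feedback event records the utterance, whether the device responded, and the user's follow-up action (this event schema is an assumption made for illustration):

```python
def apply_feedback(policy_library, feedback_events):
    """Fold one preset period's feedback into the device's policy library.
    `policy_library` is assumed to be a dict with 'pass' and 'reject' sets."""
    for utterance, device_responded, user_action in feedback_events:
        if device_responded and user_action == "shutdown":
            # The device answered but the user shut it down:
            # treat the utterance as one to reject in future.
            policy_library["reject"].add(utterance)
        elif not device_responded and user_action == "repeat":
            # The device stayed silent but the user repeated the command,
            # re-emphasizing a genuine request: let it pass in future.
            policy_library["pass"].add(utterance)
    return policy_library
```

A production library would also store the preset audio and text features used for similarity matching, not just the raw utterance.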
Further, in order to dynamically update the general second preset policy library, before acquiring the current audio information received by the target device, the method may include:
(1) acquiring a plurality of first preset policy libraries corresponding to a plurality of different users according to a second preset period.
(2) acquiring, for each recognition policy, the ratio of the number of users whose first preset policy libraries contain it to the total number of users.
For example, if a recognition policy A appears C times across the plurality of first preset policy libraries and the total number of users of the plurality of first preset policy libraries is B, the user ratio of recognition policy A is C/B. A user ratio above the preset ratio indicates that recognition policy A is used by many users and can serve as a general policy for all users.
(3) updating the recognition policies whose user ratio exceeds the preset ratio into the second preset policy library.
The preset ratio may be 80%, 90%, etc., set according to the specific situation. Recognition policies whose user ratio exceeds the preset ratio are those used by many users, and they are updated into the second preset policy library as general policies.
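The promotion rule can be written compactly; the 0.8 default mirrors the 80% example, and representing each user's library as a set of policies is an assumption for this sketch:

```python
from collections import Counter

def build_second_policy_library(user_libraries, preset_ratio=0.8):
    """Promote recognition policies used by more than `preset_ratio`
    of all users into the shared second preset policy library."""
    total_users = len(user_libraries)
    # Count each policy at most once per user.
    counts = Counter(p for lib in user_libraries for p in set(lib))
    return {p for p, n in counts.items() if n / total_users > preset_ratio}
```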
S204, if no recognition policy matching the current audio information exists in the first preset policy library, acquiring the current multimodal feature information of the target device.
Multimodal means collaborative inference over multiple heterogeneous modalities of data. Multimodal data analysis goes hand in hand with the inherent demands of higher-level cognitive intelligence. In the field of artificial intelligence, it usually refers to combining sensory information such as images, text, and speech, which helps the system understand the outside world more accurately. Available methods include two-stream convolutional neural networks and the like.
If no recognition policy matching the current audio information exists in the first preset policy library, the current multimodal feature information of the target device is acquired. The current multimodal feature information is feature information obtained in a number of different ways. It may include the audio sequence features of each speaker obtained through an audio module such as a microphone, text features obtained through speech recognition, and image features obtained through image processing. Further, the image features include body posture features of the speaker and distance features of the speaker, both obtained through image processing.
Specifically, the current audio information is converted into a single-channel audio file at a preset sample rate. A window of preset length is then selected, and a series of audio windows is extracted by hopping once every preset interval. A short-time Fourier transform is used to compute the spectrum of each audio window, and a mel spectrogram is then computed through a mel filter bank. Finally, the acquired mel spectrograms are assembled into the audio sequence features. Related indicators of the current audio information are also computed, such as the signal-to-noise ratio feature, volume feature, speech rate feature, and sound energy feature of the current audio information.
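A sketch of this audio pipeline using librosa; the 16 kHz sample rate, 25 ms window, 10 ms hop, and 80 mel bands are illustrative choices, since the embodiment leaves the preset values open:

```python
import librosa
import numpy as np

def audio_sequence_features(path: str) -> np.ndarray:
    """Mono conversion -> hopping windows -> STFT -> mel filter bank."""
    y, sr = librosa.load(path, sr=16000, mono=True)  # single-channel audio
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=400,        # 25 ms analysis window at 16 kHz
        hop_length=160,   # 10 ms hop between windows
        n_mels=80,        # mel filter bank size
    )
    return librosa.power_to_db(mel)  # (n_mels, n_frames) sequence feature
```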
Further, speech recognition (ASR) is performed on the current audio information; the recognized text is segmented into words, and the words are converted into a text vector through word2vec or the like. The text vector is fed into an intent recognition (NLU) model to recognize the intent, and the intent is converted into an intent feature with one-hot encoding. The intent of the previous dialogue turn is converted into a contextual intent feature with one-hot encoding. word2vec is a family of models used to generate word vectors. These models are shallow two-layer neural networks trained to reconstruct word contexts in text. The network represents words and guesses the words at adjacent positions; under the bag-of-words assumption in word2vec, word order is unimportant. After training, a word2vec model can map each word to a vector representing word-to-word relationships; this vector is the hidden layer of the neural network.
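A sketch of the text path with gensim's word2vec; averaging the word vectors and the intent label set below are assumptions made for illustration:

```python
import numpy as np
from gensim.models import Word2Vec

INTENTS = ["play_video", "play_music", "search", "chat", "other"]  # assumed label set

def text_feature(tokens: list[str], w2v: Word2Vec) -> np.ndarray:
    """Average the word2vec embeddings of the segmented words."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def intent_feature(intent: str) -> np.ndarray:
    """One-hot encode an NLU intent (used for both the current
    utterance's intent and the previous turn's contextual intent)."""
    v = np.zeros(len(INTENTS), dtype=np.float32)
    v[INTENTS.index(intent)] = 1.0
    return v
```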
Further, some high-end devices are currently equipped with a camera, which the user can allow to be turned on after signing a privacy agreement in order to detect the behavior of people watching the television, with body posture detection and distance detection functions. Video information recorded simultaneously with the current audio information is therefore retrieved from the target device and analyzed to obtain the body posture features of the speaker and the distance features of the speaker, which can be input into the model as auxiliary image features. When the person is facing away, is not facing the device, or does not appear in the video, or when the distance is large, there is a higher probability that the utterance is not addressed to the voice assistant.
In the embodiment of the application, if a recognition policy matching the current audio information exists in the first preset policy library, the policy type of the matching recognition policy is acquired. If the policy type is the pass policy type, the current audio information is responded to; if the policy type is the rejection policy type, the current audio information is not responded to. Performing rejection quickly by matching first improves rejection efficiency.
Further, in order to improve matching accuracy, if no recognition policy matching the current audio information exists in the first preset policy library, acquiring the current multimodal feature information of the target device may include:
(1) if no recognition policy matching the current audio information exists in the first preset policy library, acquiring a second preset policy library, where the second preset policy library is determined according to the first preset policy libraries of a plurality of different users.
The second preset policy library holds a plurality of recognition policies and is determined according to the first preset policy libraries of a plurality of different users. The first preset policy library is tied to the target device and determined from the user's feedback behavior, so it can meet the user's personalized needs; the second preset policy library, determined from the first preset policy libraries of a plurality of different users, can meet users' general needs.
(2) judging whether a recognition policy matching the current audio information exists in the second preset policy library.
For the step of judging whether a recognition policy matching the current audio information exists in the second preset policy library, refer to the step of judging whether one exists in the first preset policy library; it is not repeated here.
(3) if no recognition policy matching the current audio information exists in the second preset policy library, acquiring the current multimodal feature information of the target device.
If no recognition policy matching the current audio information exists in the second preset policy library, two rounds of matching have failed to find a recognition policy, so the current multimodal feature information of the target device is acquired for further recognition.
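Putting the two matching rounds and the multimodal fallback together, the overall decision path looks roughly like this (the `match` and `kind` interfaces are assumed names, not the patent's API):

```python
def rejection_decision(audio, first_library, second_library, multimodal_decide):
    """Fast policy matching over the personalized and shared libraries,
    falling back to multimodal collaborative inference only when
    neither library matches."""
    for library in (first_library, second_library):
        policy = library.match(audio)        # returns a matching policy or None
        if policy is not None:
            return "respond" if policy.kind == "pass" else "reject"
    # Neither round matched: gather multimodal features and run the
    # slower but more accurate collaborative inference model.
    return multimodal_decide(audio)
```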
S205, determining a rejection result according to the current multimodal feature information.
In the embodiment of the application, the rejection result may be a rejection probability, or rejection versus non-rejection. If the rejection result is rejection, the current audio information is not responded to; if the rejection result is non-rejection, the current audio information is responded to. The rejection probability represents the probability of not responding to the current audio information.
In a specific embodiment, the multimodal feature information is input into a preset classification model to obtain the rejection result. The preset classification model may be a CNN or the like.
Further, the multimodal features include temporal features and non-temporal features. Referring to fig. 3, determining the rejection result according to the current multimodal feature information includes S301-S304:
S301, fusing the temporal features in time order to obtain a fused temporal feature.
S302, inputting the fused temporal feature into a preset vector representation model to obtain a fused temporal feature vector.
In the implementation of the present application, the preset vector representation model may be a BERT model. BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based bidirectional encoder designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. A pre-trained BERT model can therefore be fine-tuned with only one extra output layer to produce state-of-the-art models for various natural language processing tasks. The preset vector representation model may of course also be word2vec or the like.
S303, inputting the fused temporal feature vector and the feature vector of each non-temporal feature into a fully connected layer to obtain a fully connected feature vector.
S304, determining the rejection result according to the fully connected feature vector.
Specifically, the fully connected feature vector is normalized to obtain the rejection probability. If the rejection probability is higher than a preset threshold, the rejection result is rejection and the current audio information is not responded to; if the rejection probability is not higher than the preset threshold, the rejection result is non-rejection and the current audio information is responded to. The preset threshold may be 80%, 90%, etc. Specifically, the fully connected feature vector is input into a sigmoid layer to obtain the rejection result.
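The following PyTorch sketch shows one way S301-S304 could fit together. A small Transformer encoder stands in for the pre-trained BERT-style vector representation model, and all dimensions and the 0.8 threshold are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RejectionModel(nn.Module):
    def __init__(self, seq_dim=256, non_seq_dim=32, hidden=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=seq_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(seq_dim + non_seq_dim, hidden)  # fully connected layer
        self.out = nn.Linear(hidden, 1)

    def forward(self, seq_feats, non_seq_feats):
        # seq_feats: (batch, time, seq_dim), the temporal features already
        # fused in time order (S301); non_seq_feats: (batch, non_seq_dim).
        h = self.encoder(seq_feats).mean(dim=1)        # S302: sequence vector
        fused = torch.cat([h, non_seq_feats], dim=-1)  # S303: concatenation
        z = torch.relu(self.fc(fused))
        return torch.sigmoid(self.out(z)).squeeze(-1)  # S304: rejection probability

model = RejectionModel()
prob = model(torch.randn(1, 10, 256), torch.randn(1, 32))
respond = bool(prob.item() <= 0.8)  # respond unless probability exceeds threshold
```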
In the embodiment of the application, the temporal features include audio sequence features of the speaker, text features of the speaker, body posture features of the speaker, and distance features of the speaker. The non-temporal features include a signal-to-noise ratio feature of the current audio information, a volume feature of the current audio information, a speech rate feature of the current audio information, a sound energy feature of the current audio information, an intent feature of the current audio information, a contextual intent feature of the current audio information, a near/far-field feature of the current audio information, and the device type of the target device.
In the near field, the dialogue takes place at very close range, for example by pressing a button on the remote control, so the probability of false triggering and misrecognition is low. In the far field, the user speaks directly to the device from a relatively large distance, so the probability of false wake-up and misrecognition is clearly higher than for near-field devices. Sound waves are longitudinal waves, i.e. waves in which the particles of the medium move along the direction of propagation. A sound wave is a vibration wave: after the sound source vibrates, the medium around it vibrates, and the sound wave spreads through the surrounding medium as a spherical wave. According to the distance between the sound source and the microphone array, sound field models can be divided into two types: the near-field model and the far-field model. The near-field model treats sound waves as spherical waves and takes into account the amplitude differences between the signals received by the microphone array elements; the far-field model treats sound waves as plane waves, ignores the amplitude differences between the array elements' received signals, and approximates the received signals as related by simple time delays. The far-field model is obviously a simplification of the actual model that greatly reduces processing difficulty, and general speech enhancement methods are based on it. There is no absolute standard for dividing the near-field and far-field models; generally, when the distance between the sound source and the reference point at the center of the microphone array is much greater than the signal wavelength, the case is considered far field; otherwise, it is near field.
The device type of the target device may be a television, refrigerator, washing machine, air conditioner, integrated stove, or the like. A device such as an integrated stove sits in a noisy environment and is more easily falsely woken, so its probability of rejection is higher.
In the embodiment of the present application, the non-temporal features further include a scene matching feature of the current audio information, and acquiring the current multimodal feature information of the target device includes: acquiring display screen interface information of the target device at the time the target device acquires the current audio information; determining the display interface scene type according to the display screen interface information; acquiring the audio scene type of the current audio information; and determining the scene matching feature according to the degree of match between the audio scene type and the display interface scene type.
Specifically, the display screen interface information may include the application running in the foreground on the display screen. A mapping between applications and display interface scene types is established in advance, and the display interface scene type is determined according to the application type. For example, for a game application the corresponding display interface scene type is the game scene type; for a karaoke application it is the singing scene type. The current audio is input into a pre-trained deep learning model to obtain the audio scene type of the current audio information. Audio scene types include the game scene type, the singing scene type, and so on. It is then judged whether the audio scene type is the same as the display interface scene type: if they are the same, the scene matching feature is "matched"; if they are different, the scene matching feature is "not matched". For example, if the application running in the foreground on the user's display screen is a photography application and the user says "eggplant" (the Chinese equivalent of saying "cheese" for a photo) as the current audio information, saying "eggplant" on the photography application's page fits the product definition, the scene matching feature is "matched", and no rejection is performed. In other scenarios, however, "eggplant" is a false wake-up and needs to be rejected.
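A sketch of the scene matching feature; the application-to-scene mapping and the `audio_scene` input (which would come from the pre-trained deep learning classifier) are assumptions made for illustration:

```python
# Assumed mapping from foreground application type to display interface
# scene type, established in advance as described above.
APP_TO_SCENE = {
    "game_app": "game",
    "karaoke_app": "singing",
    "camera_app": "photography",
}

def scene_matching_feature(foreground_app: str, audio_scene: str) -> float:
    """Return 1.0 when the audio scene type agrees with the display
    interface scene type (e.g. 'eggplant' spoken while a photography
    app is in the foreground), else 0.0."""
    display_scene = APP_TO_SCENE.get(foreground_app, "unknown")
    return 1.0 if display_scene == audio_scene else 0.0
```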
The application provides a voice interaction rejection method and device. The voice interaction rejection method includes: acquiring current audio information received by a target device; acquiring a first preset policy library matched with the target device; judging whether a recognition policy matching the current audio information exists in the first preset policy library; if no recognition policy matching the current audio information exists in the first preset policy library, acquiring current multimodal feature information of the target device; and determining a rejection result according to the current multimodal feature information. When the target device's current audio information is acquired, policy matching first checks whether a matching recognition policy exists; only when none exists is the multimodal feature information of the target device acquired, and collaborative inference over the features of multiple modalities improves rejection accuracy. At the same time, combining the faster policy-matching path with the more accurate multimodal collaborative inference balances rejection accuracy against recognition efficiency.
In order to better implement the voice interaction rejection method in the embodiments of the present application, on the basis of the voice interaction rejection method, an embodiment of the present application further provides a voice interaction rejection device integrated in a smart device. As shown in fig. 4, the voice interaction rejection device 400 includes:
a first acquisition unit 401, configured to acquire current audio information received by a target device;
a second acquisition unit 402, configured to acquire a first preset policy library matched with the target device;
a judging unit 403, configured to judge whether a recognition policy matching the current audio information exists in the first preset policy library;
a third acquisition unit 404, configured to acquire current multimodal feature information of the target device if no recognition policy matching the current audio information exists in the first preset policy library;
a determining unit 405, configured to determine a rejection result according to the current multimodal feature information.
Optionally, the third acquisition unit 404 is configured to:
if no recognition policy matching the current audio information exists in the first preset policy library, acquire a second preset policy library, where the second preset policy library is determined according to the first preset policy libraries of a plurality of different users;
judge whether a recognition policy matching the current audio information exists in the second preset policy library;
and if no recognition policy matching the current audio information exists in the second preset policy library, acquire the current multimodal feature information of the target device.
Optionally, the first acquisition unit 401 is configured to:
acquire an initialization policy library of the target device;
acquire the user's feedback information on the initialization policy library according to a first preset period;
and update the initialization policy library of the target device according to the user's feedback information on the initialization policy library to obtain the first preset policy library.
Optionally, the first acquisition unit 401 is configured to:
acquire a plurality of first preset policy libraries corresponding to a plurality of different users according to a second preset period;
acquire, for each recognition policy, the ratio of the number of users whose first preset policy libraries contain it to the total number of users;
and update the recognition policies whose user ratio exceeds a preset ratio into the second preset policy library.
Optionally, the multi-modal features include time-series features and non-time-series features, and the determining unit 405 is configured to:
fuse the time-series features in temporal order to obtain a time-series fusion feature;
input the time-series fusion feature into a preset vector characterization model to obtain a time-series fusion feature vector;
input the time-series fusion feature vector and the feature vector of each non-time-series feature into a fully connected layer to obtain a fully connected feature vector; and
determine the rejection result according to the fully connected feature vector (see the sketch below).
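For illustration, a hypothetical PyTorch sketch of this pipeline follows. A GRU stands in for the preset vector characterization model, and all dimensions are arbitrary placeholders rather than values taken from the embodiment:

    import torch
    import torch.nn as nn

    class MultiModalRejector(nn.Module):
        def __init__(self, seq_dim=64, hidden_dim=128, static_dim=32):
            super().__init__()
            # Stand-in for the preset vector characterization model.
            self.encoder = nn.GRU(seq_dim, hidden_dim, batch_first=True)
            # Fully connected layer over the temporal and non-temporal vectors.
            self.fc = nn.Linear(hidden_dim + static_dim, 1)

        def forward(self, seq_feats, static_feats):
            # seq_feats: (batch, time, seq_dim), the time-series features
            # already concatenated frame by frame in temporal order.
            # static_feats: (batch, static_dim), the non-time-series vector.
            _, h = self.encoder(seq_feats)
            fused = torch.cat([h[-1], static_feats], dim=-1)
            # Sigmoid normalizes the score into a rejection probability.
            return torch.sigmoid(self.fc(fused))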
Optionally, the determining unit 405 is configured to:
normalize the fully connected feature vector to obtain a rejection probability; and
not respond to the current audio information if the rejection probability is higher than a preset threshold, or respond to the current audio information if the rejection probability is not higher than the preset threshold (see the sketch below).
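The decision rule itself reduces to a single comparison; the default threshold here is an arbitrary placeholder, not a value from the embodiment:

    def apply_threshold(rejection_prob: float, threshold: float = 0.5) -> bool:
        # True means refuse to respond to the current audio information.
        return rejection_prob > threshold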
Optionally, the time-series features include audio sequence features of the speaker, text features of the speaker, body posture features of the speaker, and distance features of the speaker.
Optionally, the non-time-series features include a signal-to-noise ratio feature of the current audio information, a volume feature of the current audio information, a speech rate feature of the current audio information, a sound energy value feature of the current audio information, an intent feature of the current audio information, a contextual intent feature of the current audio information, a far/near-field feature of the current audio information, and the device type of the target device (an illustrative grouping of these features follows below).
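Purely as an illustrative grouping of the features enumerated above; the field names and types are hypothetical:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class TimeSeriesFeatures:          # frame-aligned sequences
        audio_frames: List[list]       # speaker audio sequence features
        text_tokens: List[str]         # speaker text features
        pose_frames: List[list]        # speaker body posture features
        distances: List[float]         # speaker distance features

    @dataclass
    class StaticFeatures:              # one value per utterance
        snr: float                     # signal-to-noise ratio feature
        volume: float
        speech_rate: float
        energy: float                  # sound energy value feature
        intent: str
        context_intent: str
        far_field: bool                # far/near-field feature
        device_type: str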
Optionally, the non-time-series features further include a scene matching feature of the current audio information, and the first obtaining unit 401 is configured to:
obtain display screen interface information of the target device at the moment the target device collects the current audio information;
determine the scene type of the display interface according to the display screen interface information;
obtain the audio scene type of the current audio information; and
determine the scene matching feature according to the degree of match between the audio scene type and the scene type of the display interface (see the sketch below).
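A toy sketch of this matching step; the affinity table and scene labels are invented for illustration and are not drawn from the embodiment:

    SCENE_AFFINITY = {
        ("music", "music_player"): 1.0,
        ("video", "video_player"): 1.0,
        ("music", "video_player"): 0.5,
    }

    def scene_match_feature(audio_scene: str, ui_scene: str) -> float:
        # Full credit for an exact match, otherwise look up a partial affinity.
        if audio_scene == ui_scene:
            return 1.0
        return SCENE_AFFINITY.get((audio_scene, ui_scene), 0.0)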
Optionally, the recognition policies in the first preset policy library include a plurality of pass policies and a plurality of reject policies, and the judging unit 403 is configured to:
perform speech recognition on the current audio information to obtain current text features;
obtain preset audio information and preset text features in a recognition policy;
calculate a first similarity between the current audio information and the preset audio information in the recognition policy;
calculate a second similarity between the current text features and the preset text features in the recognition policy; and
determine a recognition policy whose first similarity is greater than a first similarity threshold and whose second similarity is greater than a second similarity threshold as the recognition policy matching the current audio information (see the sketch below).
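Assuming the audio and text sides are both embedded as vectors, a cosine-similarity matcher compatible with the earlier lookup sketch could look like this; the thresholds are illustrative:

    import numpy as np

    def dual_similarity_matcher(policy, audio_vec, text_vec,
                                t_audio=0.8, t_text=0.8):
        # Both the first (audio) and second (text) similarities must clear
        # their thresholds for the policy to count as a match.
        def cos(a, b):
            a = np.asarray(a, dtype=float)
            b = np.asarray(b, dtype=float)
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        return (cos(audio_vec, policy.audio_template) > t_audio
                and cos(text_vec, policy.text_template) > t_text)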
Optionally, the determining unit 405 is configured to:
obtain the policy type of the matching recognition policy if a recognition policy matching the current audio information exists in the first preset policy library; and
respond to the current audio information if the policy type is the pass type, or not respond to the current audio information if the policy type is the reject type.
An embodiment of the present application further provides a smart device integrating any voice interaction rejection apparatus provided by the embodiments of the present application. The smart device includes:
one or more processors;
a memory; and
one or more application programs, where the one or more application programs are stored in the memory and configured to be executed by the processor to perform the steps of the voice interaction rejection method in any of the method embodiments described above.
Fig. 5 shows a schematic structural diagram of a smart device according to an embodiment of the present application. Specifically:
the smart device may include a processor 501 having one or more processing cores, a memory 502 comprising one or more computer-readable storage media, a power supply 503, an input unit 504, and other components. Those skilled in the art will appreciate that the smart device structure shown in the figure does not limit the smart device; the smart device may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:
the processor 501 is the control center of the smart device. It connects the parts of the entire smart device through various interfaces and lines and, by running or executing software programs and/or modules stored in the memory 502 and invoking data stored in the memory 502, performs the functions of the smart device and processes its data, thereby monitoring the smart device as a whole. Optionally, the processor 501 may include one or more processing cores. The processor 501 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor. Preferably, the processor 501 may integrate an application processor, which mainly handles the operating system, user interfaces, application programs, and the like, with a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 501.
The memory 502 may be configured to store software programs and modules; the processor 501 runs the software programs and modules stored in the memory 502 to execute various functional applications and process data. The memory 502 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like, and the data storage area may store data created according to the use of the smart device. In addition, the memory 502 may include high-speed random access memory and may further include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 502 may further include a memory controller to provide the processor 501 with access to the memory 502.
The smart device further includes a power supply 503 that supplies power to its components. Preferably, the power supply 503 may be logically connected to the processor 501 through a power management system, so that charging, discharging, and power consumption management are handled by the power management system. The power supply 503 may further include one or more of a direct-current or alternating-current power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other components.
The smart device may further include an input unit 504, which may be configured to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
Although not shown, the smart device may further include a display unit and the like, which are not described here. Specifically, in this embodiment, the processor 501 of the smart device loads the executable files corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application programs stored in the memory 502 to implement the following functions:
acquiring current audio information received by a target device; acquiring a first preset policy library matched with the target device; determining whether a recognition policy matching the current audio information exists in the first preset policy library; if no recognition policy matching the current audio information exists in the first preset policy library, acquiring current multi-modal feature information of the target device; and determining a rejection result according to the current multi-modal feature information.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be completed by instructions, or by instructions controlling related hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer-readable storage medium, which may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like. A computer program is stored thereon and is loaded by a processor to perform the steps of any voice interaction rejection method provided by the embodiments of the present application. For example, the computer program loaded by the processor may perform the following steps:
acquiring current audio information received by a target device; acquiring a first preset policy library matched with the target device; determining whether a recognition policy matching the current audio information exists in the first preset policy library; if no recognition policy matching the current audio information exists in the first preset policy library, acquiring current multi-modal feature information of the target device; and determining a rejection result according to the current multi-modal feature information.
Each of the foregoing embodiments has its own emphasis in its description; for a part not detailed in one embodiment, reference may be made to the detailed descriptions of the other embodiments, which are not repeated here.
In specific implementation, each of the above units or structures may be implemented as an independent entity, or may be combined arbitrarily and implemented as one or several entities; for the specific implementation of each unit or structure, reference may be made to the foregoing method embodiments, which are not repeated here.
For the specific implementation of each of the above operations, reference may likewise be made to the foregoing embodiments.
The voice interaction rejection method and apparatus provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method and its core ideas. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope in accordance with the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (14)

1. A voice interaction rejection method, characterized by comprising:
acquiring current audio information received by a target device;
acquiring a first preset policy library matched with the target device;
determining whether a recognition policy matching the current audio information exists in the first preset policy library;
if no recognition policy matching the current audio information exists in the first preset policy library, acquiring current multi-modal feature information of the target device; and
determining a rejection result according to the current multi-modal feature information.
2. The voice interaction rejection method according to claim 1, wherein the acquiring current multi-modal feature information of the target device if no recognition policy matching the current audio information exists in the first preset policy library comprises:
if no recognition policy matching the current audio information exists in the first preset policy library, acquiring a second preset policy library, where the second preset policy library is determined from the first preset policy libraries of a plurality of different users;
determining whether a recognition policy matching the current audio information exists in the second preset policy library; and
if no recognition policy matching the current audio information exists in the second preset policy library, acquiring the current multi-modal feature information of the target device.
3. The voice interaction rejection method according to claim 1, wherein before the acquiring current audio information received by the target device, the method comprises:
acquiring an initialization policy library of the target device;
acquiring user feedback on the initialization policy library according to a first preset period; and
updating the initialization policy library of the target device according to the user feedback to obtain the first preset policy library.
4. The voice interaction rejection method according to claim 2, wherein before the acquiring current audio information received by the target device, the method comprises:
acquiring a plurality of first preset policy libraries corresponding to a plurality of different users according to a second preset period;
for each recognition policy, acquiring the user ratio, that is, the proportion of users whose first preset policy libraries contain that policy out of the total number of users; and
adding the recognition policies whose user ratio exceeds a preset ratio to the second preset policy library.
5. The voice interaction rejection method according to claim 1, wherein the multi-modal features include time-series features and non-time-series features, and the determining a rejection result according to the current multi-modal feature information comprises:
fusing the time-series features in temporal order to obtain a time-series fusion feature;
inputting the time-series fusion feature into a preset vector characterization model to obtain a time-series fusion feature vector;
inputting the time-series fusion feature vector and the feature vector of each non-time-series feature into a fully connected layer to obtain a fully connected feature vector; and
determining the rejection result according to the fully connected feature vector.
6. The voice interaction rejection method according to claim 5, wherein the determining the rejection result according to the fully connected feature vector comprises:
normalizing the fully connected feature vector to obtain a rejection probability; and
if the rejection probability is higher than a preset threshold, not responding to the current audio information; if the rejection probability is not higher than the preset threshold, responding to the current audio information.
7. The voice interaction rejection method according to claim 5, wherein the time-series features include audio sequence features of the speaker, text features of the speaker, body posture features of the speaker, and distance features of the speaker.
8. The voice interaction rejection method according to claim 5, wherein the non-time-series features include a signal-to-noise ratio feature of the current audio information, a volume feature of the current audio information, a speech rate feature of the current audio information, a sound energy value feature of the current audio information, an intent feature of the current audio information, a contextual intent feature of the current audio information, a far/near-field feature of the current audio information, and the device type of the target device.
9. The voice interaction rejection method according to claim 8, wherein the non-time-series features further include a scene matching feature of the current audio information, and the acquiring current multi-modal feature information of the target device comprises:
acquiring display screen interface information of the target device at the moment the target device collects the current audio information;
determining the scene type of the display interface according to the display screen interface information;
acquiring the audio scene type of the current audio information; and
determining the scene matching feature according to the degree of match between the audio scene type and the scene type of the display interface.
10. The voice interaction rejection method according to claim 1, wherein the recognition policies in the first preset policy library include a plurality of pass policies and a plurality of reject policies, and the determining whether a recognition policy matching the current audio information exists in the first preset policy library comprises:
performing speech recognition on the current audio information to obtain current text features;
acquiring preset audio information and preset text features in a recognition policy;
calculating a first similarity between the current audio information and the preset audio information in the recognition policy;
calculating a second similarity between the current text features and the preset text features in the recognition policy; and
determining a recognition policy whose first similarity is greater than a first similarity threshold and whose second similarity is greater than a second similarity threshold as the recognition policy matching the current audio information.
11. The voice interaction rejection method according to claim 10, further comprising:
if a recognition policy matching the current audio information exists in the first preset policy library, acquiring the policy type of the matching recognition policy; and
if the policy type is the pass type, responding to the current audio information; if the policy type is the reject type, not responding to the current audio information.
12. A voice interaction rejection apparatus, characterized by comprising:
a first obtaining unit, configured to obtain current audio information received by a target device;
a second obtaining unit, configured to obtain a first preset policy library matched with the target device;
a judging unit, configured to determine whether a recognition policy matching the current audio information exists in the first preset policy library;
a third obtaining unit, configured to obtain current multi-modal feature information of the target device if no recognition policy matching the current audio information exists in the first preset policy library; and
a determining unit, configured to determine a rejection result according to the current multi-modal feature information.
13. A smart device, characterized by comprising:
one or more processors;
a memory; and
one or more application programs, where the one or more application programs are stored in the memory and configured to be executed by the processor to implement the voice interaction rejection method of any one of claims 1 to 11.
14. A computer-readable storage medium on which a computer program is stored, the computer program being loaded by a processor to perform the steps of the voice interaction rejection method of any one of claims 1 to 11.
CN202211737105.6A 2022-12-30 2022-12-30 Voice interaction refusing method and device Pending CN117116250A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211737105.6A CN117116250A (en) 2022-12-30 2022-12-30 Voice interaction refusing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211737105.6A CN117116250A (en) 2022-12-30 2022-12-30 Voice interaction refusing method and device

Publications (1)

Publication Number Publication Date
CN117116250A true CN117116250A (en) 2023-11-24

Family

ID=88793526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211737105.6A Pending CN117116250A (en) 2022-12-30 2022-12-30 Voice interaction refusing method and device

Country Status (1)

Country Link
CN (1) CN117116250A (en)

Similar Documents

Publication Publication Date Title
CN111699528B (en) Electronic device and method for executing functions of electronic device
CN110288987B (en) System for processing sound data and method of controlling the same
WO2019242414A1 (en) Voice processing method and apparatus, storage medium, and electronic device
KR20210070213A (en) Voice user interface
US20240005918A1 (en) System For Recognizing and Responding to Environmental Noises
US11862170B2 (en) Sensitive data control
US11514890B2 (en) Method for user voice input processing and electronic device supporting same
US11527237B1 (en) User-system dialog expansion
US20200135212A1 (en) Speech recognition method and apparatus in environment including plurality of apparatuses
CN113611318A (en) Audio data enhancement method and related equipment
KR20210042523A (en) An electronic apparatus and Method for controlling the electronic apparatus thereof
CN110580897B (en) Audio verification method and device, storage medium and electronic equipment
CN114333774B (en) Speech recognition method, device, computer equipment and storage medium
US11862178B2 (en) Electronic device for supporting artificial intelligence agent services to talk to users
US20240321264A1 (en) Automatic speech recognition
KR102396147B1 (en) Electronic device for performing an operation using voice commands and the method of the same
CN116959438A (en) Method for waking up device, electronic device and storage medium
US12114075B1 (en) Object selection in computer vision
CN112420043A (en) Intelligent awakening method and device based on voice, electronic equipment and storage medium
WO2019242415A1 (en) Position prompt method, device, storage medium and electronic device
WO2023006033A1 (en) Speech interaction method, electronic device, and medium
CN117116250A (en) Voice interaction refusing method and device
WO2023040658A1 (en) Speech interaction method and electronic device
CN115579012A (en) Voice recognition method, voice recognition device, storage medium and electronic equipment
US20240212681A1 (en) Voice recognition device having barge-in function and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination