CN113393834A - Control method and device - Google Patents


Info

Publication number
CN113393834A
CN113393834A
Authority
CN
China
Prior art keywords
control
voice
sound box
wake
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010167783.8A
Other languages
Chinese (zh)
Other versions
CN113393834B (en)
Inventor
张平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010167783.8A
Publication of CN113393834A
Application granted
Publication of CN113393834B
Status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/24: Speech recognition using non-acoustical features
    • G10L 15/25: Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L 2015/088: Word spotting
    • G10L 2015/225: Feedback of the input speech
    • G10L 2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
    • G10L 2015/227: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of the speaker; Human-factor methodology

Abstract

The embodiments of the application provide a control method and a control apparatus. In these embodiments, control speech for controlling a smart speaker is collected; whether the initiator of the control speech is a wake-free user is determined; and, when the initiator of the control speech is a wake-free user, the smart speaker is controlled based on the control speech. The smart speaker thus supports wake-free users: a wake-free user can voice-control the smart speaker by speaking only the control speech, without speaking the wake-up word. Because no wake-up word needs to be spoken, interaction between the user and the smart speaker is simple and convenient, interaction efficiency can be improved, and user experience can be improved.

Description

Control method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a control method and apparatus.
Background
With the continuous development of technology, users can control devices through voice interaction in daily life. For example, a user may speak a voice command directly to a device, and the device responds to the command.
In the prior art, a user often has to use the expression pattern "wake-up word + voice command": every time the user interacts with the device, the user must first speak the wake-up word that wakes the device, so that the device knows the user is speaking to it.
However, having to speak the wake-up word makes the interaction between the user and the device cumbersome, reduces interaction efficiency, and thus degrades user experience.
Disclosure of Invention
In order to improve interaction efficiency and thereby improve user experience, the present application provides a control method and a control apparatus.
In a first aspect, the present application shows a control method, including:
collecting control speech for controlling a smart speaker;
determining whether the initiator of the control speech is a wake-free user;
and, when the initiator is a wake-free user, controlling the smart speaker based on the control speech.
In an optional implementation manner, the determining whether the initiator of the control speech is a wake-free user includes:
recognizing the voiceprint feature of the control speech;
and, when the voiceprint feature is the voiceprint feature of a wake-free user, determining that the initiator is a wake-free user.
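The voiceprint check above can be sketched as a similarity comparison between fixed-length speaker embeddings. The sketch below is illustrative only: the function names, the idea of enrolled embeddings, and the 0.7 threshold are assumptions for this sketch, not part of the disclosed embodiments; in practice the embeddings would come from a separately trained speaker-encoder model.

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_wake_free_user(control_vec, enrolled_vecs, threshold=0.7):
    # The voiceprint of the control speech matches a wake-free user when it
    # is sufficiently similar to any enrolled wake-free user's voiceprint.
    return any(cosine(control_vec, e) >= threshold for e in enrolled_vecs)
```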
In an optional implementation manner, the determining whether the initiator of the control speech is a wake-free user includes:
determining the position of a wake-free device communicatively connected to the smart speaker;
determining, according to the position, the relative direction of the wake-free device with respect to the smart speaker;
determining the source direction of the control speech;
and, when the relative direction is the same as the source direction, determining that the initiator is a wake-free user.
In an optional implementation manner, the determining whether the initiator of the control speech is a wake-free user includes:
determining the source direction of the control speech;
acquiring an image, in the source direction, that includes the initiator;
identifying facial features of the initiator in the image;
and, when the facial features are those of a wake-free user, determining that the initiator is a wake-free user.
In an optional implementation manner, the smart speaker includes at least two voice collection devices;
the collecting control speech for controlling the smart speaker includes:
collecting, based on the at least two voice collection devices respectively, control speech for controlling the smart speaker;
the determining the source direction of the control speech includes:
determining phase information of the control speech collected by each of the at least two voice collection devices;
and determining the source direction based on the phase information.
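The phase-based direction estimate can be illustrated with the simplest two-microphone case: the inter-microphone time delay (the time-domain counterpart of the phase information) determines the angle of arrival. This is a sketch under simplifying assumptions (far-field source, known microphone spacing); the constants and names are illustrative, not from the disclosure.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, at roughly 20 degrees Celsius

def source_direction(delay_seconds, mic_spacing_m):
    # For a far-field source, the same wavefront reaches the two microphones
    # offset by delay = spacing * sin(angle) / c; invert that for the angle
    # (in degrees from the array's broadside direction).
    ratio = SPEED_OF_SOUND * delay_seconds / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp numerical noise into asin's domain
    return math.degrees(math.asin(ratio))
```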
In an optional implementation manner, the determining whether the initiator of the control speech is a wake-free user includes:
determining the relative direction of the initiator with respect to the smart speaker;
acquiring the most recently determined historical direction of the wake-free user relative to the smart speaker;
and, when the difference between the relative direction and the historical direction is smaller than a preset difference, determining that the initiator is a wake-free user.
In an optional implementation manner, the controlling the smart speaker based on the control speech includes:
performing speech recognition on the control speech to obtain a control text corresponding to the control speech;
determining a control intention of the control speech based at least on the control text;
determining the intention field in which the control intention lies;
and, when the intention field is an intention field supported by the smart speaker, controlling the smart speaker based on the control intention.
In an alternative implementation, the determining the control intention of the control speech based on at least the control text includes:
and inputting the control text into an intention prediction model to obtain the control intention output by the intention prediction model.
In an alternative implementation, the intention prediction model is trained by:
obtaining a sample data set, where the sample data set includes sample control texts annotated with sample control intentions;
constructing a network structure of the intention prediction model;
and training the network parameters in the intention prediction model with the sample data set until the network parameters converge, to obtain the intention prediction model.
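The training procedure above can be illustrated with a deliberately tiny stand-in for the intention prediction model: a softmax classifier over bag-of-words features, trained by gradient descent on (control text, control intention) pairs. Everything here is an assumption for illustration only (English whitespace tokens in place of a Chinese word-segmentation layer, the vocabulary, the learning rate); it is not the disclosed network structure.

```python
import math

def featurize(text, vocab):
    # bag-of-words counts over a fixed vocabulary (a real system would
    # tokenize with a word-segmentation layer instead of str.split)
    words = text.split()
    return [float(words.count(w)) for w in vocab]

def train_intent_model(samples, vocab, intents, epochs=200, lr=0.5):
    # samples: list of (control_text, control_intention) pairs, i.e. the
    # "sample control texts annotated with sample control intentions"
    weights = {intent: [0.0] * len(vocab) for intent in intents}
    for _ in range(epochs):
        for text, label in samples:
            x = featurize(text, vocab)
            scores = {i: sum(w * xi for w, xi in zip(weights[i], x)) for i in intents}
            m = max(scores.values())
            exps = {i: math.exp(s - m) for i, s in scores.items()}
            z = sum(exps.values())
            for intent in intents:
                # gradient of the cross-entropy loss for this sample
                grad = exps[intent] / z - (1.0 if intent == label else 0.0)
                weights[intent] = [w - lr * grad * xi
                                   for w, xi in zip(weights[intent], x)]
    return weights

def predict_intent(weights, text, vocab):
    x = featurize(text, vocab)
    return max(weights, key=lambda i: sum(w * xi for w, xi in zip(weights[i], x)))
```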
In an alternative implementation, the network structure of the intention prediction model includes at least:
a word segmentation layer, an encoding layer, a bidirectional recurrent neural network, an aggregation layer, and a fully connected layer;
the word segmentation layer is used for segmenting the control text to obtain a plurality of words;
the encoding layer is used for converting the plurality of words into feature vectors respectively;
the bidirectional recurrent neural network is used for performing feature supplementation on the multiple feature vectors respectively, based on the dependency relationship between at least two adjacent feature vectors among the multiple feature vectors;
the aggregation layer is used for aggregating the multiple feature vectors after feature supplementation is completed, to obtain an aggregation vector;
and the fully connected layer is used for predicting the control intention from the aggregation vector.
In an alternative implementation, the bidirectional recurrent neural network includes a forward long short-term memory (LSTM) network and a backward LSTM network;
the forward LSTM network comprises a plurality of LSTM models connected in sequence;
the backward LSTM network includes a plurality of LSTM models connected sequentially;
the connection order among the plurality of LSTM models included in the forward LSTM network is opposite to the connection order among the plurality of LSTM models included in the backward LSTM network.
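The role of the two opposite connection orders can be shown with a much-simplified recurrence (an exponential moving average standing in for the LSTM cell): the forward pass gives each position context from its left, the backward pass gives it context from its right, and the two results are concatenated per position. This is a sketch of the idea only, not the LSTM gating itself; the 0.5 mixing factor is an arbitrary illustrative choice.

```python
def recurrent_pass(vectors, order):
    # one direction of a (much simplified) recurrent pass: each output mixes
    # the current feature vector with the running state, so every position
    # carries context from the positions visited before it
    state = [0.0] * len(vectors[0])
    outputs = {}
    for i in order:
        state = [0.5 * s + 0.5 * x for s, x in zip(state, vectors[i])]
        outputs[i] = list(state)
    return [outputs[i] for i in range(len(vectors))]

def bidirectional_features(vectors):
    forward = recurrent_pass(vectors, range(len(vectors)))
    backward = recurrent_pass(vectors, reversed(range(len(vectors))))
    # concatenate per position: left context plus right context
    return [f + b for f, b in zip(forward, backward)]
```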
In an alternative implementation, the determining the control intention of the control speech based on at least the control text includes:
determining the service scenario in which the smart speaker is currently located;
and determining the control intention based on the service scenario and the control text.
In an alternative implementation, the determining the intention field in which the control intention lies includes:
searching, in the correspondence between control intentions and intention fields, for the intention field corresponding to the control intention.
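The correspondence lookup, combined with the supported-field check from the first aspect, amounts to a simple table lookup followed by a membership test. The intents and fields below are invented examples, not values from the disclosure.

```python
# illustrative correspondence between control intentions and intention fields;
# a real device would configure this table per product
INTENT_TO_FIELD = {
    "play_music": "music",
    "query_weather": "weather",
    "book_flight": "travel",
}

# intention fields this hypothetical smart speaker supports
SUPPORTED_FIELDS = {"music", "weather"}

def should_execute(intent):
    # control the speaker only when the intent maps to a supported field
    field = INTENT_TO_FIELD.get(intent)
    return field is not None and field in SUPPORTED_FIELDS
```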
In an optional implementation manner, there are multiple control voices, respectively uttered by multiple initiators, and at least two of the initiators are wake-free users;
the controlling the smart speaker based on the control speech includes:
determining the priorities of the at least two wake-free users;
and controlling the smart speaker based on the control speech uttered by the wake-free user with the highest priority.
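When several wake-free users speak at once, the selection above reduces to picking the control speech of the highest-priority initiator. A minimal sketch, with the priority mapping as an assumed input (the disclosure does not specify how priorities are assigned):

```python
def select_control_voice(voices, priorities):
    # voices: list of (initiator, control_text) pairs, all from wake-free users
    # priorities: mapping initiator -> priority rank (higher wins)
    initiator, text = max(voices, key=lambda v: priorities.get(v[0], 0))
    return text
```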
In a second aspect, the present application shows a control method applied to a smart speaker, including:
collecting control speech for controlling the smart speaker;
determining whether the control speech is wake-free control speech;
and, when the control speech is wake-free control speech, controlling the smart speaker based on the control speech.
In an optional implementation manner, the determining whether the control speech is wake-free control speech includes:
performing speech recognition on the control speech to obtain a control text corresponding to the control speech;
judging whether the control text carries a wake-free keyword;
and, when the control text carries a wake-free keyword, determining that the control speech is wake-free control speech.
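The keyword check above can be sketched as a substring test against a configured keyword set. The keyword list here is an invented example; the disclosure does not enumerate which keywords are wake-free.

```python
# illustrative wake-free keywords; a real device would configure its own set
WAKE_FREE_KEYWORDS = {"pause", "stop", "next", "volume up", "volume down"}

def is_wake_free_speech(control_text):
    # the control speech is wake-free when its recognized text carries
    # any configured wake-free keyword
    text = control_text.lower()
    return any(kw in text for kw in WAKE_FREE_KEYWORDS)
```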
In a third aspect, the present application shows a control method applied to a smart speaker, including:
collecting control speech for controlling the smart speaker;
acquiring the capture time at which the smart speaker collected the control speech;
and, when the capture time falls within a wake-free period, controlling the smart speaker based on the control speech.
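The capture-time check in the third aspect can be sketched as a time-window test. The default daytime window below is purely an example; the disclosure does not fix which hours are wake-free.

```python
from datetime import time

def in_wake_free_period(capture_time, start=time(8, 0), end=time(22, 0)):
    # True when the capture time falls inside the configured wake-free period;
    # the second branch handles a window that crosses midnight
    if start <= end:
        return start <= capture_time <= end
    return capture_time >= start or capture_time <= end
```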
In a fourth aspect, the present application shows a control method applied to a smart speaker, including:
collecting control speech for controlling the smart speaker;
determining the position of the smart speaker;
and, when the position lies within a wake-free area, controlling the smart speaker based on the control speech.
In a fifth aspect, the present application shows a control apparatus applied to a smart speaker, including:
a first collection module, configured to collect control speech for controlling the smart speaker;
a first determining module, configured to determine whether the initiator of the control speech is a wake-free user;
and a first control module, configured to control the smart speaker based on the control speech when the initiator is a wake-free user.
In an optional implementation manner, the first determining module includes:
a first recognition unit, configured to recognize the voiceprint feature of the control speech;
and a first determining unit, configured to determine that the initiator is a wake-free user when the voiceprint feature is the voiceprint feature of a wake-free user.
In an optional implementation manner, the first determining module includes:
a second determining unit, configured to determine the position of a wake-free device communicatively connected to the smart speaker;
a third determining unit, configured to determine, according to the position, the relative direction of the wake-free device with respect to the smart speaker;
a fourth determining unit, configured to determine the source direction of the control speech;
and a fifth determining unit, configured to determine that the initiator is a wake-free user when the relative direction is the same as the source direction.
In an optional implementation manner, the first determining module includes:
a fourth determining unit, configured to determine the source direction of the control speech;
a collection unit, configured to collect an image, in the source direction, that includes the initiator;
a second recognition unit, configured to identify facial features of the initiator in the image;
and a sixth determining unit, configured to determine that the initiator is a wake-free user when the facial features are those of a wake-free user.
In an optional implementation manner, the smart speaker includes at least two voice collection devices;
the first collection module is specifically configured to:
collect, based on the at least two voice collection devices respectively, control speech for controlling the smart speaker;
the fourth determining unit includes:
a first determining subunit, configured to determine phase information of the control speech collected by each of the at least two voice collection devices;
and a second determining subunit, configured to determine the source direction based on the phase information.
In an optional implementation manner, the first determining module includes:
a seventh determining unit, configured to determine the relative direction of the initiator with respect to the smart speaker;
an acquiring unit, configured to acquire the most recently determined historical direction of the wake-free user relative to the smart speaker;
and an eighth determining unit, configured to determine that the initiator is a wake-free user when the difference between the relative direction and the historical direction is smaller than a preset difference.
In an optional implementation manner, the first control module includes:
a third recognition unit, configured to perform speech recognition on the control speech to obtain a control text corresponding to the control speech;
a ninth determining unit, configured to determine a control intention of the control speech based at least on the control text;
a tenth determining unit, configured to determine the intention field in which the control intention lies;
and a first control unit, configured to control the smart speaker based on the control intention when the intention field is an intention field supported by the smart speaker.
In an optional implementation manner, the ninth determining unit includes:
an input subunit, configured to input the control text into an intention prediction model to obtain the control intention output by the intention prediction model.
In an optional implementation manner, the ninth determining unit further includes:
an acquiring subunit, configured to acquire a sample data set, where the sample data set includes sample control texts annotated with sample control intentions;
a constructing subunit, configured to construct a network structure of the intention prediction model;
and a third determining subunit, configured to train the network parameters in the intention prediction model with the sample data set until the network parameters converge, to obtain the intention prediction model.
In an alternative implementation, the network structure of the intention prediction model includes at least:
a word segmentation layer, an encoding layer, a bidirectional recurrent neural network, an aggregation layer, and a fully connected layer;
the word segmentation layer is used for segmenting the control text to obtain a plurality of words;
the encoding layer is used for converting the plurality of words into feature vectors respectively;
the bidirectional recurrent neural network is used for performing feature supplementation on the multiple feature vectors respectively, based on the dependency relationship between at least two adjacent feature vectors among the multiple feature vectors;
the aggregation layer is used for aggregating the multiple feature vectors after feature supplementation is completed, to obtain an aggregation vector;
and the fully connected layer is used for predicting the control intention from the aggregation vector.
In an alternative implementation, the bidirectional recurrent neural network includes a forward long short-term memory (LSTM) network and a backward LSTM network;
the forward LSTM network comprises a plurality of LSTM models connected in sequence;
the backward LSTM network includes a plurality of LSTM models connected sequentially;
the connection order among the plurality of LSTM models included in the forward LSTM network is opposite to the connection order among the plurality of LSTM models included in the backward LSTM network.
In an optional implementation manner, the ninth determining unit includes:
a fourth determining subunit, configured to determine the service scenario in which the smart speaker is currently located;
and a fifth determining subunit, configured to determine the control intention based on the service scenario and the control text.
In an optional implementation manner, the tenth determining unit is specifically configured to: search, in the correspondence between control intentions and intention fields, for the intention field corresponding to the control intention.
In an optional implementation manner, there are multiple control voices, respectively uttered by multiple initiators, and at least two of the initiators are wake-free users;
the first control module includes:
an eleventh determining unit, configured to determine the priorities of the at least two wake-free users;
and a second control unit, configured to control the smart speaker based on the control speech uttered by the wake-free user with the highest priority.
In a sixth aspect, the present application shows a control apparatus applied to a smart speaker, including:
a second collection module, configured to collect control speech for controlling the smart speaker;
a second determining module, configured to determine whether the control speech is wake-free control speech;
and a second control module, configured to control the smart speaker based on the control speech when the control speech is wake-free control speech.
In an optional implementation manner, the second determining module includes:
a fourth recognition unit, configured to perform speech recognition on the control speech to obtain a control text corresponding to the control speech;
a judging unit, configured to judge whether the control text carries a wake-free keyword;
and a twelfth determining unit, configured to determine that the control speech is wake-free control speech when the control text carries a wake-free keyword.
In a seventh aspect, the present application shows a control apparatus applied to a smart speaker, including:
a third collection module, configured to collect control speech for controlling the smart speaker;
an acquiring module, configured to acquire the capture time at which the smart speaker collected the control speech;
and a third control module, configured to control the smart speaker based on the control speech when the capture time falls within a wake-free period.
In an eighth aspect, the present application shows a control apparatus applied to a smart speaker, including:
a fourth collection module, configured to collect control speech for controlling the smart speaker;
a third determining module, configured to determine the position of the smart speaker;
and a fourth control module, configured to control the smart speaker based on the control speech when the position lies within a wake-free area.
In a ninth aspect, the present application shows a smart speaker, including:
a processor; and
a memory having executable code stored thereon which, when executed, causes the processor to perform the control method of the first, second, third, or fourth aspect.
In a tenth aspect, the present application shows one or more machine-readable media having executable code stored thereon which, when executed, causes a processor to perform the control method of the first, second, third, or fourth aspect.
Compared with the prior art, the embodiments of the application have the following advantages:
In the embodiments of the application, control speech for controlling the smart speaker is collected; whether the initiator of the control speech is a wake-free user is determined; and, when the initiator of the control speech is a wake-free user, the smart speaker is controlled based on the control speech. The smart speaker thus supports wake-free users: a wake-free user can voice-control the smart speaker by speaking only the control speech, without speaking the wake-up word. Because no wake-up word needs to be spoken, interaction between the user and the smart speaker is simple and convenient, interaction efficiency can be improved, and user experience can be improved.
Drawings
Fig. 1 is a flowchart illustrating a control method according to an exemplary embodiment of the present application.
Fig. 2 is a flowchart illustrating a method for determining a wake-free user according to an exemplary embodiment of the present application.
Fig. 3 is a flowchart illustrating a method for determining a wake-free user according to an exemplary embodiment of the present application.
Fig. 4 is a schematic diagram of a scenario shown in an exemplary embodiment of the present application.
Fig. 5 is a flowchart illustrating a method for determining a wake-free user according to an exemplary embodiment of the present application.
Fig. 6 is a flowchart illustrating a method for determining a wake-free user according to an exemplary embodiment of the present application.
Fig. 7 is a flowchart illustrating a method for controlling a smart speaker according to an exemplary embodiment of the present application.
FIG. 8 is a schematic diagram illustrating a network architecture of an intent prediction model according to an exemplary embodiment of the present application.
Fig. 9 is a flowchart illustrating a method for determining control intent according to an exemplary embodiment of the present application.
Fig. 10 is a flowchart illustrating a method for controlling a smart speaker according to an exemplary embodiment of the present application.
Fig. 11 is a flowchart illustrating a control method according to an exemplary embodiment of the present application.
Fig. 12 is a flowchart illustrating a control method according to an exemplary embodiment of the present application.
Fig. 13 is a flowchart illustrating a control method according to an exemplary embodiment of the present application.
Fig. 14 is a block diagram of a control device according to an exemplary embodiment of the present application.
Fig. 15 is a block diagram of a control device according to an exemplary embodiment of the present application.
Fig. 16 is a block diagram illustrating a control apparatus according to an exemplary embodiment of the present application.
Fig. 17 is a block diagram of a control device according to an exemplary embodiment of the present application.
Fig. 18 is a schematic structural diagram of an apparatus according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
Referring to fig. 1, a flowchart of a control method according to the present application is shown. The method is applied to a smart speaker and may include:
In step S101, control speech for controlling the smart speaker is collected;
In this application, when the user needs to control the smart speaker by voice, the user can speak, to the smart speaker, control speech for controlling it, and the smart speaker collects, through a voice collection device, the control speech the user speaks.
The voice collection device includes a microphone or the like.
In this application, the control speech includes a control instruction for controlling the smart speaker, for example an instruction such as "play songs by Zhang San" or "query today's temperature", and the control speech need not include the smart speaker's wake-up word.
This application aims to let the user control the smart speaker by speaking the control speech alone, without speaking the wake-up word. However, a user sometimes speaks not to control the smart speaker but in a normal conversation with other people. If the smart speaker collected such speech and ran the procedure of this application on it, system resources of the smart speaker, such as central processing unit (CPU) resources, memory resources, and electric power, would be wasted.
Therefore, to avoid wasting the smart speaker's system resources, in another embodiment of the present application a voice capture area may be configured on the smart speaker in advance. For example, the voice capture area is a circular area of a specific radius centred on the smart speaker's location, the specific radius being 1 m, 2 m, or 3 m, which this application does not limit. The smart speaker collects speech uttered inside the voice capture area and does not collect speech uttered outside it. When the user needs to control the smart speaker by voice, the user can move into the voice capture area and then speak the control speech for controlling the smart speaker; the smart speaker collects, through the voice collection device, the control speech the user speaks inside the voice capture area, and then performs step S102.
When the user does not need to control the smart speaker by voice, for example when the user wants to talk normally with other people, the user can leave the voice capture area and then talk normally. Because the user is then outside the voice capture area, the smart speaker does not collect the speech of the normal conversation, so its system resources are not needlessly wasted.
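The voice capture area described above reduces to a distance test against the configured radius. A minimal sketch, using the 2 m radius from the examples above and assuming the sound source's position can be estimated:

```python
import math

def within_capture_area(speaker_pos, source_pos, radius_m=2.0):
    # True when the estimated sound source lies inside the circular voice
    # capture area centred on the smart speaker (1 m / 2 m / 3 m in the
    # examples above); positions are (x, y) coordinates in metres
    dx = source_pos[0] - speaker_pos[0]
    dy = source_pos[1] - speaker_pos[1]
    return math.hypot(dx, dy) <= radius_m
```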
In step S102, it is determined whether the originator of the control voice is a wake-free user;
In this application, from the smart speaker's perspective, the users who input control speech to it are either wake-free users or non-wake-free users.
The smart speaker supports a wake-free user in voice-controlling it based on the control speech alone, without speaking the wake-up word.
The smart speaker does not support this for a non-wake-free user: a non-wake-free user can voice-control the smart speaker based on the control speech only by speaking both the wake-up word and the control speech.
The step can be referred to the following embodiments shown in fig. 2 to 6, and will not be described in detail here.
If the initiator of the control voice is a wake-up-free user, in step S103, the smart speaker is controlled based on the control voice.
In one embodiment of the present application, controlling the smart speaker based on the control voice includes: playing voice information through the loudspeaker, the voice information being a response to the control voice.
For example, if the control voice input by the user is "what is the current temperature", the smart speaker can look up the current temperature according to the control voice. If the temperature found is 20 °C to 25 °C, the loudspeaker can then play the voice information "the current temperature is 20 °C to 25 °C", so that the user learns the current temperature.
In another embodiment of the present application, the smart speaker may also conduct multiple rounds of dialog with the user, and for this the smart speaker may adopt a full-duplex mode: for example, it may capture one control voice through the microphone while playing, through the loudspeaker, voice information in response to another control voice. That is, the smart speaker continues to receive sound while it is speaking.
In another embodiment of the present application, during multiple rounds of dialog between a wake-free user and the smart speaker, if a non-wake-free user speaks, the smart speaker captures that voice, determines that it was spoken by a non-wake-free user, and therefore does not respond to it.
The specific control method for controlling the smart speaker based on the control voice may refer to the embodiments shown in fig. 7 to fig. 10, and will not be described in detail herein.
When the initiator of the control voice is a non-wake-free user, it may be detected whether a wake-up word input by the initiator was received before the control voice. If no wake-up word was received, the control voice may be left unprocessed, for example discarded. Alternatively, the smart speaker may prompt the initiator that they are not a wake-free user and can control the smart speaker with a control voice only after speaking the wake-up word, so that the initiator knows the wake-up word for waking the smart speaker must be spoken first.
In the embodiments of the present application, a control voice for controlling the smart speaker is captured; whether the initiator of the control voice is a wake-free user is determined; and when the initiator is a wake-free user, the smart speaker is controlled based on the control voice. In this way, the smart speaker supports wake-free users, who can voice-control the smart speaker by speaking only the control voice, without speaking the wake-up word. Because the wake-up word need not be spoken, the interaction between the user and the smart speaker is simpler and more convenient, which improves interaction efficiency and user experience.
In one embodiment of the present application, referring to fig. 2, step S102 includes:
in step S201, a voiceprint feature of the control speech is recognized;
the present application does not limit the recognition method for recognizing the voiceprint feature of the control speech, and any recognition method is within the scope of the present application.
In step S202, in the case where the voiceprint feature matches the voiceprint feature of a wake-free user, it is determined that the initiator is a wake-free user.
In this application, the owner of the smart speaker can configure wake-free users on the smart speaker. For example, the owner can set himself or herself as a wake-free user of the smart speaker: the owner inputs a wake-free setting instruction to the smart speaker; upon receiving the instruction, the smart speaker uses its voice capture device to capture the owner's voice, recognizes the voiceprint feature of that voice, and stores the voiceprint feature in the smart speaker as the voiceprint feature of a wake-free user.
The owner of the smart speaker may also set other wake-free users for the smart speaker, for example users the owner trusts, such as the owner's family members. The setting method is the same as above and is not described in detail here.
In this application, the voiceprint features of different users usually differ, so whether the initiator of the control voice is a wake-free user can be determined accurately from the voiceprint feature.
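The voiceprint comparison of steps S201–S202 can be sketched as follows, under the assumption (not stated in the application) that voiceprint features are fixed-length embeddings compared by cosine similarity against the stored enrollment set; the function names and the 0.8 threshold are hypothetical:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_wake_free(voiceprint, enrolled_prints, threshold=0.8):
    """Compare the embedding of the control voice against the stored
    voiceprint features of the enrolled wake-free users."""
    return any(cosine_similarity(voiceprint, p) >= threshold
               for p in enrolled_prints)

enrolled = [[0.9, 0.1, 0.3]]                      # stored wake-free voiceprint
print(is_wake_free([0.88, 0.12, 0.31], enrolled))  # True: close match
print(is_wake_free([0.1, 0.9, 0.2], enrolled))     # False: different speaker
```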
In one embodiment of the present application, referring to fig. 3, step S102 includes:
in step S301, determining the location of a wake-free device in communication connection with the smart speaker;
in this application, the owner of the smart speaker can configure a wake-free device on the smart speaker. For example, the owner can set his or her own mobile phone as the wake-free device of the smart speaker: the owner inputs a wake-free setting instruction carrying the device identifier of the phone to the smart speaker; upon receiving the instruction, the smart speaker extracts the device identifier of the owner's phone from it and stores the identifier in the smart speaker.

The owner may also set other wake-free devices for the smart speaker, for example devices of users the owner trusts, such as the mobile phone of a family member. The setting method is the same as above and is not described in detail here.
Before the user needs to control the smart speaker by voice, the user can establish a communication connection between his or her device and the smart speaker. Once the connection is established, the smart speaker can obtain the device identifier of the user's device over the connection and check whether it matches a stored wake-free device identifier; if it does, the device is determined to be a wake-free device.
The hardware in this application includes the smart speaker and the wake-free device, and may of course include other devices, for example a home router. The smart speaker and the wake-free device are each in communication connection with the router, so the smart speaker, the wake-free device, and the router are pairwise connected. Among these three devices, the position of the wake-free device can then be determined by triangulation; the specific determination method is not described in detail here.
In step S302, determining the relative direction of the wake-free device with respect to the smart speaker according to the position;
in this application, the position of the smart speaker can likewise be determined by triangulation among the smart speaker, the wake-free device, and the router (the specific method is not described in detail here), and the relative direction of the wake-free device with respect to the smart speaker can then be determined from the position of the wake-free device and the position of the smart speaker.
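A minimal sketch of step S302, assuming the triangulated positions are 2-D coordinates and the relative direction is expressed as a direction angle; the `bearing_deg` helper is hypothetical:

```python
import math

def bearing_deg(from_pos, to_pos):
    """Direction angle (degrees, measured from the x-axis) of `to_pos`
    as seen from `from_pos`, normalized to [0, 360)."""
    return math.degrees(math.atan2(to_pos[1] - from_pos[1],
                                   to_pos[0] - from_pos[0])) % 360.0

speaker = (0.0, 0.0)   # hypothetical triangulated position of the smart speaker
device = (1.0, 1.0)    # hypothetical triangulated position of the wake-free device
print(bearing_deg(speaker, device))  # 45.0
```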
In step S303, determining the source direction of the control voice;
in one embodiment of the present application, the smart speaker may include at least two voice collecting devices;
in this way, when the smart speaker captures the control voice for controlling it, the control voice can be captured separately by each of the at least two voice capture devices. The control voices captured by the devices are uttered by the same initiator and carry the same control content; however, because each voice capture device is at a different distance from the initiator, the phase information of the control voice captured by each device differs.
In this way, when the source direction of the control voice is determined, the phase information of the control voice respectively acquired by at least two voice acquisition devices can be determined, and then the source direction of the control voice is determined based on the phase information.
For example, the collecting time of the control voice collected by at least two voice collecting devices may be determined, then the time difference between the collecting times of the control voice collected by at least two voice collecting devices may be determined, and then the source direction of the control voice may be determined according to the time difference.
Referring to fig. 4, take 2 voice capture devices on the smart speaker as an example, denoted A and B respectively, and assume that the initiator is located at position S, so the control voice is also uttered from position S.
Assume the capture time of the control voice at voice capture device A is T1 and at voice capture device B is T2. Because the distance between A and position S in fig. 4 is greater than the distance between B and S, T1 is greater than T2. A perpendicular BM can be dropped from B onto the line segment AS, with the foot M dividing AS into two segments. Since the control voice propagates through space as spherical waves rather than plane waves, the distance from S to M is the same as the distance from S to B. Thus the length of segment AM equals the product of the speed of sound and the time difference, namely the difference between the time at which the control voice reaches device A and the time at which it reaches device B.
Because the distance between voice capture devices A and B is known, the angle A can be determined from that distance and the length of segment AM, and the source direction of the control voice can thereby be determined.
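The time-difference geometry of fig. 4 can be sketched as follows: with AM = c·(T1 − T2) and cos(∠A) = AM/AB, the source direction relative to the microphone baseline follows directly. The function name and sample values are hypothetical:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximately, in air at ~20 °C

def arrival_angle_deg(t1, t2, mic_distance_m):
    """Angle A in fig. 4: direction of the source relative to the
    baseline AB of the two voice capture devices.

    t1, t2 are the capture times at devices A and B.
    AM = c * (t1 - t2) and cos(A) = AM / AB."""
    am = SPEED_OF_SOUND * (t1 - t2)  # path difference = length of segment AM
    return math.degrees(math.acos(am / mic_distance_m))

# Source off to one side: the wave reaches B 0.2 ms before A,
# with the two microphones 10 cm apart.
print(round(arrival_angle_deg(0.0102, 0.0100, 0.10), 1))  # ≈ 46.7 degrees
```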
In step S304, it is determined that the initiator of the control speech is a wake-free user if the relative direction is the same as the source direction.
In this application, in the common case the user holds his or her own device or, even when not holding it, is usually close to it. Therefore the direction of the user relative to the smart speaker and the direction of the user's device relative to the smart speaker are usually the same.

Hence, when the relative direction is the same as the source direction, the initiator of the control voice is usually the owner of the wake-free device or a user authorized by the owner, so it can be determined that the initiator of the control voice is a wake-free user.
In another embodiment of the present application, referring to fig. 5, step S102 includes:
in step S401, determining the source direction of the control voice;
the step may specifically refer to the description of step S303, and is not described in detail here.
In step S402, an image including the originator in the direction of the source is acquired;
in this application, at least one image capture device, for example a camera, is provided on the smart speaker, and images in any direction can be captured based on the at least one image capture device.
In the present application, because the control voice is uttered by the initiator, the direction of the initiator of the control voice relative to the smart speaker and the direction of the source of the control voice are the same. Therefore, the captured image in the direction of the source also includes an image of the originator of the control speech.
In step S403, a facial feature of the originator of the control voice in the image is recognized;
in the present application, the facial feature of the originator in the image may be identified by any identification method, and the specific identification method is not limited in the present application, and any identification method is within the scope of the present application.
In step S404, in the case where the facial feature matches the facial feature of a wake-free user, it is determined that the initiator is a wake-free user.
In this application, the owner of the smart speaker can configure wake-free users on the smart speaker. For example, the owner can set himself or herself as a wake-free user: the owner inputs a wake-free setting instruction to the smart speaker; upon receiving the instruction, the smart speaker uses the image capture device to photograph the owner's face, extracts the facial features from the facial image, and stores them in the smart speaker as the facial features of a wake-free user.

The owner may also set other wake-free users for the smart speaker, for example users the owner trusts, such as the owner's family members. The setting method is the same as above and is not described in detail here.
In this application, the facial features of different users usually differ, so whether the initiator of the control voice is a wake-free user can be determined accurately from the facial features.
In another embodiment of the present application, referring to fig. 6, step S102 includes:
in step S501, the relative direction of the initiator of the control voice with respect to the smart speaker is determined;
in the present application, because the control voice is uttered by the initiator, the direction of the initiator of the control voice relative to the smart speaker and the direction of the source of the control voice are the same.
That is, the source direction of the control voice is the same as the relative direction of the initiator of the control voice relative to the smart sound box.
Therefore, after the source direction of the control voice is determined, the relative direction of the control voice initiator relative to the intelligent sound box can be obtained.
The specific manner of determining the source direction of the control voice can be referred to the description of step S303, and is not described in detail here.
In step S502, obtaining the most recently determined historical direction of the wake-free user relative to the smart speaker;
in this application, during multiple rounds of dialog between a user and the smart speaker, each time the smart speaker determines from a control voice that the user is a wake-free user and determines the user's direction relative to the smart speaker, it stores that direction in the smart speaker as the most recently determined historical direction of the wake-free user relative to the smart speaker.

Therefore, in this step, the smart speaker can obtain the most recently stored historical direction of the wake-free user relative to the smart speaker and use it as the most recently determined historical direction.
In step S503, in a case that the difference between the relative direction and the historical direction is smaller than a preset difference, it is determined that the initiator of the control voice is a wake-free user.
In this application, during multiple rounds of dialog between a wake-free user and the smart speaker, the wake-free user inputs a control voice for each round. Usually the wake-free user does not move far during the dialog; that is, the user's position does not change much.

Therefore, if the difference between the historical direction of the wake-free user relative to the smart speaker and the relative direction of the initiator of the control voice is small, the control voice is usually one uttered by the wake-free user in an intermediate or final round of the ongoing dialog, and it can be determined that the initiator of the control voice is the wake-free user.
In this application, a direction may be expressed as a direction angle, and the difference between two directions is the difference between their direction angles; for example, the difference between the relative direction and the historical direction is the difference between the two direction angles. The preset difference is an angle value, for example 30°, 40°, or 50°, which this application does not limit; the preset difference may be obtained from historical statistics.
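A minimal sketch of the comparison in step S503, assuming directions are angles in degrees with 360° wraparound and a hypothetical preset difference of 30°:

```python
def angle_diff_deg(a, b):
    """Smallest absolute difference between two direction angles."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def same_speaker(relative_dir, history_dir, preset_diff=30.0):
    """Treat the initiator as the previously identified wake-free user
    when the direction has changed by less than the preset difference."""
    return angle_diff_deg(relative_dir, history_dir) < preset_diff

print(same_speaker(350.0, 10.0))   # True: only 20 degrees apart across 0
print(same_speaker(90.0, 200.0))   # False: 110 degrees apart
```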
In another embodiment of the present application, referring to fig. 7, step S103 includes:
in step S601, performing voice recognition on the control voice to obtain a control text corresponding to the control voice;
in this step, the control speech may be subjected to speech recognition through any speech recognition algorithm to obtain a control text corresponding to the control speech.
In step S602, a control intention of the control speech is determined based on at least the control text;
in this application, the control intention reflects what the user wants the smart speaker to do or to answer, for example, controlling the smart speaker to play one of Zhang San's songs, asking from which school Li Si graduated, or asking about today's weather.
In the present application, the control intention of the control speech may be determined based on the control text by means of an intention prediction model.
The intention prediction model can be trained in advance, and the specific training mode comprises the following steps:
11) acquiring a sample data set, wherein the sample data set comprises a sample control text marked with a sample control intention;
12) constructing a network structure of an intention prediction model;
referring to fig. 8, the intention prediction model network structure at least includes:
the system comprises a word segmentation layer, an encoding layer, a bidirectional recurrent neural network, a polymerization layer and a full connection layer.
The control text may be input into a segmentation layer, which is used to segment the control text to obtain a plurality of words and to input the plurality of words into the encoding layer.
The coding layer is used for converting a plurality of words into feature vectors respectively and inputting the feature vectors into the bidirectional recurrent neural network respectively.
The bidirectional recurrent neural network is used for respectively performing feature supplementation on the plurality of feature vectors based on the dependency relationship between at least two adjacent feature vectors in the plurality of feature vectors, and inputting the plurality of feature vectors after feature supplementation into the aggregation layer.
The aggregation layer is used for aggregating a plurality of feature vectors after feature supplement is completed to obtain an aggregation vector, and inputting the aggregation vector into the full-connection layer.
The fully-connected layer is used to predict the control intent from the aggregated vector.
The bidirectional recurrent neural network comprises a forward LSTM (Long Short-Term Memory) network and a backward LSTM network, wherein the forward LSTM network comprises a plurality of sequentially connected LSTM models, the backward LSTM network comprises a plurality of sequentially connected LSTM models, and the connection sequence of the plurality of LSTM models in the forward LSTM network is opposite to that of the plurality of LSTM models in the backward LSTM network.
13) And training the network parameters in the intention prediction model by using the sample data set until the network parameters are converged to obtain the intention prediction model.
When training the intention prediction model, the parameters are generally optimized according to the gradient values of the model and the output value of its loss function until the network parameters converge. However, if the gradients of the model vanish during training, the network parameters cannot be optimized accurately from the loss value alone, which hinders normal training.

Therefore, to avoid this, the recurrent neural network may use LSTM (Long Short-Term Memory) models as its network units; the LSTM model can be used to avoid vanishing gradients during training of the intention prediction model. The loss function may be, for example, the mean squared error.
In this way, when the control intention of the control speech is determined based on the control text by means of the intention prediction model, the control text can be input into the intention prediction model, and the control intention of the control speech output by the intention prediction model is obtained.
For example, the control text is entered into the participle layer; the word segmentation layer segments the control text to obtain a plurality of words and phrases, and inputs the plurality of words and phrases into the coding layer; the coding layer is used for converting a plurality of vocabularies into feature vectors respectively; for example, the encoding layer performs one-hot encoding on each vocabulary, obtains a feature vector of each vocabulary, and inputs the feature vectors into the bidirectional recurrent neural network.
When the plurality of feature vectors are input into the bidirectional recurrent neural network, any feature vector is input both into its corresponding LSTM model in the forward LSTM network and into its corresponding LSTM model in the backward LSTM network. For example, the position order of the feature vector among the feature vectors of the words included in the control text may be determined; the LSTM model at that position order among the LSTM models of the forward LSTM network is taken as the forward LSTM model corresponding to the feature vector, and the LSTM model at that position order among the LSTM models of the backward LSTM network is taken as the backward LSTM model corresponding to the feature vector. The vector output by the feature vector's forward LSTM model and the vector output by its backward LSTM model may then be aggregated into one vector as the output vector corresponding to the feature vector.
The above operation is also performed for each of the other feature vectors.
Thus, the output vector corresponding to each feature vector is obtained.
The output vectors corresponding to each feature vector can be aggregated into one vector to be used as the output vector for obtaining the bidirectional recurrent neural network.
For example, assuming the feature vector of a certain word in the control text is the nth among the feature vectors of the words included in the control text, that feature vector is input into the nth LSTM model of the forward LSTM network and into the nth-from-last LSTM model of the backward LSTM network.
In this way, the intention prediction model can use both the content before a position and the content after it; that is, it can better exploit the order, position, and dependency relationships among the words in the control text, so the model obtains more information, the prediction is more complete, and the control intention determined later is more accurate.
In one example, referring to FIG. 9, the bidirectional recurrent neural network includes a forward LSTM network and a backward LSTM network.
The forward LSTM network includes LSTM model 1, LSTM model 2, LSTM model 3, and LSTM model 4, in order LSTM model 1 precedes LSTM model 2, LSTM model 2 precedes LSTM model 3, and LSTM model 3 precedes LSTM model 4.
The backward LSTM network includes LSTM model 5, LSTM model 6, LSTM model 7, and LSTM model 8, in order LSTM model 5 precedes LSTM model 6, LSTM model 6 precedes LSTM model 7, and LSTM model 7 precedes LSTM model 8.
Assume the control text contains word 1, word 2, word 3, and word 4, with word 1 preceding word 2, word 2 preceding word 3, and word 3 preceding word 4.
Feature vectors of respective words may be acquired, for example, feature vector 1 of word 1, feature vector 2 of word 2, feature vector 3 of word 3, and feature vector 4 of word 4 are acquired.
Feature vector 1 may then be input into LSTM model 1, the output of LSTM model 1 with feature vector 2 into LSTM model 2, the output of LSTM model 2 with feature vector 3 into LSTM model 3, and the output of LSTM model 3 with feature vector 4 into LSTM model 4.
Further, feature vector 4 may be input into LSTM model 5, the output of LSTM model 5 and feature vector 3 may be input into LSTM model 6, the output of LSTM model 6 and feature vector 2 may be input into LSTM model 7, and the output of LSTM model 7 and feature vector 1 may be input into LSTM model 8.
In this way, the LSTM model 1 and the LSTM model 8 each output one vector, and the vectors output by the LSTM model 1 and the LSTM model 8 can be aggregated into one vector as the output vector 1 corresponding to the feature vector 1.
The LSTM model 2 and the LSTM model 7 may output a vector respectively, and the vectors output by the LSTM model 2 and the LSTM model 7 may be aggregated into a vector as the output vector 2 corresponding to the feature vector 2.
The LSTM model 3 and the LSTM model 6 may output a vector respectively, and the vectors output by the LSTM model 3 and the LSTM model 6 may be aggregated into a vector as the output vector 3 corresponding to the feature vector 3.
The LSTM model 4 and the LSTM model 5 may output a vector respectively, and the vectors output by the LSTM model 4 and the LSTM model 5 may be aggregated into a vector as the output vector 4 corresponding to the feature vector 4.
Then, the output vector 1 corresponding to the feature vector 1, the output vector 2 corresponding to the feature vector 2, the output vector 3 corresponding to the feature vector 3, and the output vector 4 corresponding to the feature vector 4 may be aggregated (Concat) to obtain an aggregated vector. And inputs the aggregate vector into the fully-connected layer. The fully-connected layer predicts a control intent of the control speech based on the aggregate vector.
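The pairing and Concat walkthrough above can be illustrated with a toy bidirectional scan. This is not an LSTM: each cell is replaced by a running element-wise sum, so that the pairing structure (forward cell i combined with backward cell i, then per-position outputs concatenated into one aggregate vector) is easy to see:

```python
def bidirectional_scan(vectors):
    """Toy stand-in for the bidirectional network of fig. 9.

    Forward 'cell' i has seen vectors 1..i (like LSTM models 1-4);
    backward 'cell' i has seen vectors n..i (like LSTM models 5-8).
    The per-position outputs are concatenated into one aggregate
    vector, as the aggregation layer (Concat) does."""
    n, dim = len(vectors), len(vectors[0])
    fwd, state = [], [0.0] * dim
    for v in vectors:                       # forward pass
        state = [s + x for s, x in zip(state, v)]
        fwd.append(state)
    bwd, state = [None] * n, [0.0] * dim
    for i in range(n - 1, -1, -1):          # backward pass
        state = [s + x for s, x in zip(state, vectors[i])]
        bwd[i] = state
    # output vector i = forward output i ++ backward output i
    outputs = [f + b for f, b in zip(fwd, bwd)]
    return [x for out in outputs for x in out]   # Concat

agg = bidirectional_scan([[1.0], [2.0], [3.0], [4.0]])
print(agg)  # [1.0, 10.0, 3.0, 9.0, 6.0, 7.0, 10.0, 4.0]
```

Note how position 1 pairs the shortest forward context (1.0) with the longest backward context (10.0), mirroring the LSTM model 1 / LSTM model 8 pairing in the example.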
In one embodiment of the present application, in a possible scenario, the accuracy of the control intention determined from the control text alone is low. For example, suppose Zhang San is a public figure, and users want the various types of things Zhang San recommends, such as pleasant music, good restaurants, good movies, or heavily discounted shops.
Suppose the user wants to search for "the pleasant music recommended by Zhang San" through the smart speaker, but the control voice the user inputs may be imprecise; for example, the control text corresponding to the control voice is "search for the goods recommended by Zhang San", which does not tell the smart speaker what type of goods to search for. In general, the result the smart speaker finds for the user based on this control voice will not necessarily be precisely "the pleasant music recommended by Zhang San"; it may be another type of item recommended by Zhang San, for example a restaurant, a movie, or a heavily discounted shop recommended by Zhang San. As a result, the smart speaker may not provide the service the user originally wanted, so the accuracy of the service provided to the user is low.
Therefore, in this case, in order to improve the accuracy of the service provided by the smart speaker to the user, when the control intention of the control voice is determined based on the control text, the current business scenario of the smart speaker can be determined first; the control intention is then determined based on the business scenario and the control text.
For example, the smart speaker often provides services within a business scenario. Suppose the user speaks "start music player" to the smart speaker so that the smart speaker starts its music player; the smart speaker thereby enters the business scenario of playing music. Suppose the user then needs to search for good-listening music recommended by Zhang San through the smart speaker, and the control voice the user inputs is again imprecise: the corresponding control text is "search for goods recommended by Zhang San", which by itself does not tell the smart speaker what type of goods to search for.
However, at this time the smart speaker is already in the business scenario of playing music. Therefore, based on this business scenario, the smart speaker can determine that the "search for goods recommended by Zhang San" that the user requested is actually "search for good-listening music recommended by Zhang San".
In this way, the possibility that the service provided by the smart speaker is the service the user originally wanted to obtain is increased, which improves the accuracy of the service provided by the smart speaker to the user.
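The refinement above can be sketched as a scenario-conditioned lookup; the scenario names, intent labels, keyword test, and refinement table are all illustrative assumptions, not the patent's actual model:

```python
def refine_intent(control_text: str, scenario: str) -> str:
    """Resolve an under-specified control text to a concrete intent
    by consulting the smart speaker's current business scenario."""
    # Generic request detected from the control text alone
    # (a crude keyword test stands in for real intent prediction).
    if "recommended" in control_text and "goods" in control_text:
        generic = "search_recommended_goods"
    else:
        return "unknown"

    # Scenario-specific refinements of the generic intent.
    refinements = {
        "playing_music": "search_recommended_music",
        "browsing_movies": "search_recommended_movies",
    }
    return refinements.get(scenario, generic)

# While the speaker is in the music-playing scenario, the vague
# request resolves to a music search.
print(refine_intent("search for goods recommended by Zhang San", "playing_music"))
```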
In step S603, an intention field in which the control intention is located is determined;
the control fields supported by the intelligent sound box comprise a music field, a weather field, a calling field, a shopping field, a route searching field and the like. The control field supported by the smart sound box can be determined by a manufacturer of the smart sound box or an owner of the smart sound box.
In the case where the control intention of the control voice of the user is in the control field supported by the smart speaker, the user may control the smart speaker based on the control voice.
Under the condition that the control intention of the control voice of the user is not in the control field supported by the intelligent sound box, the user cannot control the intelligent sound box based on the control voice.
For example, suppose the smart speaker does not support the medical field. If a wake-up-free user speaks the control voice "give me an injection" to the smart speaker, the smart speaker cannot respond to the instruction, that is, it cannot give the user an injection.
Therefore, after the control intention of the control voice is obtained, it is necessary to determine the intention field in which the control intention lies, and then to execute step S604.
For any intention field supported by the smart speaker, the control intentions that belong to that intention field and can control the smart speaker can be compiled in advance, and each such control intention can be stored, together with that intention field, as an entry in a correspondence between control intentions and intention fields. The same is done for each of the other intention fields supported by the smart speaker.
Therefore, in this step, the intention field corresponding to the control intention can be looked up in the correspondence between control intentions and intention fields.
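Steps S603 and S604 can be sketched with a simple lookup table; the intents, fields, and supported-field set below are hypothetical examples:

```python
# Pre-compiled correspondence between control intentions and the
# intention fields they belong to (illustrative entries only).
INTENT_TO_FIELD = {
    "play_song": "music",
    "query_weather": "weather",
    "dial_number": "calling",
    "give_injection": "medical",
}

# Intention fields this smart speaker supports, as configured by
# the manufacturer or owner.
SUPPORTED_FIELDS = {"music", "weather", "calling", "shopping", "route_search"}

def can_control(intent: str) -> bool:
    """Step S603: look up the field; step S604: check support."""
    field = INTENT_TO_FIELD.get(intent)
    return field in SUPPORTED_FIELDS

print(can_control("play_song"))       # music field is supported
print(can_control("give_injection"))  # medical field is not supported
```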
In step S604, in a case where the intention field is an intention field supported by the smart speaker, the smart speaker is controlled based on the control intention.
In another embodiment, in a case where the intention field is not an intention field supported by the smart speaker, the process may end.
In one scenario of the present application, a plurality of wake-up-free users are located around the smart speaker, and these wake-up-free users may need to control the smart speaker based on control voices at the same time, each inputting their own control voice to the smart speaker. In this case, the smart speaker receives a plurality of control voices, sent respectively by a plurality of initiators, at least two of whom are wake-up-free users;
in this case, referring to fig. 10, step S103 includes:
in step S701, the priorities of the at least two wake-up-free users are determined;
in the present application, among the plurality of wake-up-free users of the smart speaker, the owner of the smart speaker may set the priorities of the wake-up-free users in the smart speaker in advance. For example, the voiceprint features of the wake-up-free users are sorted in the smart speaker: the closer to the front of the sorting a voiceprint feature is, the higher the priority of the corresponding wake-up-free user, and the closer to the rear, the lower the priority.
In this way, for any collected control voice whose initiator is determined to be a wake-up-free user, the priority of that wake-up-free user can be determined, for example, according to the position of the voiceprint feature of the control voice among the plurality of voiceprint features sorted from high to low priority. The same is done for each other collected control voice, thereby obtaining a priority for each wake-up-free user.
Of course, facial features and the like may be used in addition to the voiceprint features, which are not limited in this application.
In step S702, the smart speaker is controlled based on the control voice sent by the wake-up-free user with the highest priority.
Further, the control voices sent by the other wake-up-free users may be discarded, and the like.
Through the present application, in a case where a plurality of wake-up-free users of the smart speaker need to control the smart speaker based on their respective control voices at the same time, it can be ensured that the wake-up-free user with the highest priority can smoothly control the smart speaker through a control voice, avoiding a situation in which interference from the other wake-up-free users prevents the high-priority wake-up-free user from controlling the smart speaker through a control voice.
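Steps S701 and S702 can be sketched as follows, assuming priorities follow an owner-configured ordering of voiceprint identifiers; all names are illustrative:

```python
# Owner-configured ordering: earlier position means higher priority
# (hypothetical voiceprint identifiers).
VOICEPRINT_PRIORITY = ["owner_voiceprint", "spouse_voiceprint", "child_voiceprint"]

def select_control_voice(voices):
    """Pick the control voice whose wake-up-free speaker ranks highest.

    voices: list of (voiceprint_id, control_text) pairs; voices from
    speakers not in the priority list are ignored.
    """
    ranked = [v for v in voices if v[0] in VOICEPRINT_PRIORITY]
    if not ranked:
        return None
    # Lower index in the configured ordering means higher priority.
    return min(ranked, key=lambda v: VOICEPRINT_PRIORITY.index(v[0]))

voices = [
    ("child_voiceprint", "play cartoons"),
    ("owner_voiceprint", "play the news"),
]
# The owner's voice wins; the child's control voice is discarded.
print(select_control_voice(voices))
```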
Referring to fig. 11, a flow chart of a control method according to the present application is shown, the method is applied to a smart speaker, and the method may include:
in step S801, a control voice for controlling the smart speaker is collected;
the step can be referred to as step S101, and is not described in detail here.
In step S802, it is determined whether the control voice is a wake-up free control voice;
this step can be realized by the following process, including:
8021. carrying out voice recognition on the control voice to obtain a control text corresponding to the control voice;
in this step, the control speech may be subjected to speech recognition through any speech recognition algorithm to obtain a control text corresponding to the control speech.
8022. Judging whether the control text carries a wake-up free keyword or not;
in the present application, the owner of the smart speaker may set wake-up-free keywords in the smart speaker. For example, the owner may set wake-up-free keywords for dialing an emergency call in an emergency, such as "dial 110", "dial 120", and "dial 119". Specifically, the owner may input a wake-up-free setting instruction into the smart speaker; after receiving the wake-up-free setting instruction, the smart speaker collects the voice of the owner using the voice collecting device, identifies the voiceprint feature of the owner's voice, and stores that voiceprint feature as the voiceprint feature of a wake-up-free user. The smart speaker also stores the wake-up-free keywords set by the owner.
8023. And under the condition that the control text carries the wake-up-free keyword, determining the control voice as the wake-up-free control voice.
If the control voice is the wake-up free control voice, the smart speaker is controlled based on the control voice in step S803.
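Steps 8021–8023 reduce, once speech recognition has produced a control text, to a keyword check; a minimal sketch using the emergency keywords mentioned above as an assumed owner configuration:

```python
# Owner-configured wake-up-free keywords (illustrative set).
WAKE_FREE_KEYWORDS = {"dial 110", "dial 120", "dial 119"}

def is_wake_free_control(control_text: str) -> bool:
    """Step 8022/8023: the control voice is a wake-up-free control
    voice if its recognized text carries any wake-up-free keyword."""
    return any(kw in control_text for kw in WAKE_FREE_KEYWORDS)

print(is_wake_free_control("please dial 120 now"))  # carries a keyword
print(is_wake_free_control("play some music"))      # no keyword present
```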
In the embodiment shown in fig. 1, the smart speaker is controlled based on the control voice in a case where the initiator of the control voice is a wake-up-free user. In the embodiment of the present application, the identity of the initiator is not limited: regardless of whether the initiator of the control voice is a wake-up-free user, the smart speaker can be controlled based on the control voice as long as the control voice is a wake-up-free control voice. This ensures that, in some special situations, a user who is not a wake-up-free user can also control the smart speaker based on a control voice. For example, in an emergency, the user needs to make an emergency call, such as 110, 119, or 120; the user can then control the smart speaker based on the control voice without inputting a wake-up word to the smart speaker, which improves the efficiency of controlling the smart speaker.
In the embodiment of the present application, a control voice for controlling the smart speaker is collected; it is determined whether the control voice is a wake-up-free control voice; and in a case where the control voice is a wake-up-free control voice, the smart speaker is controlled based on the control voice. Through the present application, the smart speaker supports voice control based on a control voice without the user speaking a wake-up word, provided the control voice is a wake-up-free control voice. Since no wake-up word needs to be spoken, the interaction process between the user and the smart speaker is simple and convenient, which can improve interaction efficiency and user experience.
Referring to fig. 12, a flow chart of a control method according to the present application is shown, the method is applied to a smart speaker, and the method may include:
in step S901, a control voice for controlling the smart speaker is collected;
the step can be referred to as step S101, and is not described in detail here.
In step S902, acquiring a collecting time when the smart speaker collects the control voice;
in one scenario, the smart speaker can provide services for a large number of users. For example, the staff of some organizations may place a smart speaker in the hall of the organization to facilitate the staff's work, so that the staff can rely on the smart speaker to obtain services. However, these organizations also serve a large number of customers at specific times, for example, providing route navigation and shopping guidance inside a shopping mall for many customers.
However, most customers are not the owner of the smart speaker and may not even know its brand, so they usually do not know the wake-up word of the smart speaker. If a wake-up word were required to control the smart speaker by voice, these customers would be unable to do so.
Therefore, in this case, in order to allow a large number of customers to control the smart speaker by voice, in the present application the smart speaker is supported, during a certain period of time, in responding to a user's control voice without a wake-up word.
In the present application, the owner of the smart speaker may set a wake-up-free time in the smart speaker. For example, the owner may set the hours during which the organization is open to the public as the wake-up-free time, and the smart speaker stores the wake-up-free time set by the owner.
In step S903, if the collection time is the wake-up free time, the smart speaker is controlled based on the control voice.
In the embodiment shown in fig. 1, the smart speaker is controlled based on the control voice in a case where the initiator of the control voice is a wake-up-free user. In the embodiment of the present application, the identity of the initiator need not be limited: regardless of whether the initiator of the control voice is a wake-up-free user, the smart speaker may be controlled based on the control voice as long as the collection time at which the smart speaker collects the control voice falls within the wake-up-free time.
Through the present application, it can be ensured that, in some special situations, a user who is not a wake-up-free user can also control the smart speaker based on a control voice, without needing to input a wake-up word to the smart speaker, so that a large number of customers can obtain route navigation inside a shopping mall, shopping guidance inside the shopping mall, and the like from the smart speaker.
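Step S903 can be sketched as a time-window check; the opening hours below are an assumed example of an owner-configured wake-up-free period:

```python
from datetime import time

# Owner-configured wake-up-free period, e.g. the organization's
# assumed opening hours (illustrative values).
WAKE_FREE_START = time(9, 0)
WAKE_FREE_END = time(18, 0)

def is_wake_free_time(collected_at: time) -> bool:
    """Return True if the collection time falls within the
    wake-up-free period, so the speaker may respond without a
    wake-up word."""
    return WAKE_FREE_START <= collected_at <= WAKE_FREE_END

print(is_wake_free_time(time(10, 30)))  # within the configured hours
print(is_wake_free_time(time(22, 0)))   # outside the configured hours
```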
In the embodiment of the present application, a control voice for controlling the smart speaker is collected; the collection time at which the smart speaker collects the control voice is acquired; and in a case where the collection time falls within the wake-up-free time, the smart speaker is controlled based on the control voice. Through the present application, the smart speaker supports voice control based on a control voice without the user speaking a wake-up word. Since no wake-up word needs to be spoken, the interaction process between the user and the smart speaker is simple and convenient, which can improve interaction efficiency and user experience.
Referring to fig. 13, a flow chart of a control method according to the present application is shown, the method is applied to a smart speaker, and the method may include:
in step S1001, a control voice for controlling the smart speaker is collected;
the step can be referred to as step S101, and is not described in detail here.
In step S1002, determining a position of the smart speaker;
in one scenario, the smart speaker can provide services for a large number of users. For example, to conveniently provide route navigation and shopping guidance inside a shopping mall for many customers, the staff of the mall may temporarily place a smart speaker at the mall entrance, so that after entering the mall from the entrance, customers can obtain route navigation inside the mall, shopping guidance inside the mall, and the like from the smart speaker.
However, most customers are not the owner of the smart speaker and may not even know its brand, so they usually do not know the wake-up word of the smart speaker. If a wake-up word were required to control the smart speaker by voice, these customers would be unable to do so.
Therefore, in this case, in order to allow a large number of customers to control the smart speaker by voice, in the present application the smart speaker is supported, at a specific location, in responding to a user's control voice without a wake-up word.
In the present application, the owner of the smart speaker may set wake-up-free areas in the smart speaker. For example, the owner may set the area around a mall entrance, the area around a railway-station entrance, the area around a bus stop, the area around an airport entrance, and the like as wake-up-free areas, and the smart speaker stores the wake-up-free areas set by the owner.
In step S1003, when the position is located in the wake-up exempt area, the smart speaker is controlled based on the control voice.
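Step S1003 can be sketched as a point-in-area test; the rectangular areas and coordinates below are illustrative assumptions standing in for owner-configured wake-up-free areas:

```python
# Owner-configured wake-up-free areas as axis-aligned rectangles
# (min_x, min_y, max_x, max_y) in a local coordinate frame
# (illustrative values).
WAKE_FREE_AREAS = [
    (0.0, 0.0, 10.0, 5.0),   # e.g. the mall entrance
    (50.0, 0.0, 60.0, 8.0),  # e.g. the railway-station entrance
]

def in_wake_free_area(x: float, y: float) -> bool:
    """Return True if the smart speaker's position lies inside any
    wake-up-free area, so it may respond without a wake-up word."""
    return any(x0 <= x <= x1 and y0 <= y <= y1
               for x0, y0, x1, y1 in WAKE_FREE_AREAS)

print(in_wake_free_area(3.0, 2.0))    # inside the mall-entrance area
print(in_wake_free_area(30.0, 30.0))  # outside every wake-up-free area
```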
In the embodiment shown in fig. 1, the smart speaker is controlled based on the control voice in a case where the initiator of the control voice is a wake-up-free user. In the embodiment of the present application, the identity of the initiator is not limited: regardless of whether the initiator of the control voice is a wake-up-free user, the smart speaker can be controlled based on the control voice as long as the smart speaker is located in a wake-up-free area. Through the present application, it can be ensured that, in some special situations, a user who is not a wake-up-free user can also control the smart speaker based on a control voice. For example, for a smart speaker at a mall entrance, customers need to obtain route navigation inside the mall, shopping guidance inside the mall, and the like; such users can control the smart speaker based on a control voice without inputting a wake-up word, so that customers can obtain these services from the smart speaker.
In the embodiment of the present application, a control voice for controlling the smart speaker is collected; the position of the smart speaker is determined; and in a case where the position is in a wake-up-free area, the smart speaker is controlled based on the control voice. Through the present application, the smart speaker supports voice control based on a control voice without the user speaking a wake-up word. Since no wake-up word needs to be spoken, the interaction process between the user and the smart speaker is simple and convenient, which can improve interaction efficiency and user experience.
Referring to fig. 14, a block diagram of a control device according to an embodiment of the present disclosure is shown, which may specifically include the following modules:
the first acquisition module 11 is used for acquiring control voice for controlling the intelligent sound box;
a first determining module 12, configured to determine whether a sponsor of the control voice is a wake-exempt user;
and the first control module 13 is configured to control the smart speaker based on the control voice in a case where the initiator is a wake-up-free user.
In an optional implementation manner, the first determining module includes:
a first recognition unit configured to recognize a voiceprint feature of the control voice;
a first determining unit, configured to determine that the initiator is a wake-up-free user in a case where the voiceprint feature is a voiceprint feature of a wake-up-free user.
In an optional implementation manner, the first determining module includes:
the second determination unit is used for determining the position of the wake-up-free device in communication connection with the intelligent sound box;
a third determining unit, configured to determine, according to the position, a relative direction of the wake-up-free device with respect to the smart speaker;
a fourth determining unit, configured to determine a source direction of the control voice;
a fifth determining unit, configured to determine that the initiator is an awake-exempt user when the relative direction is the same as the source direction.
In an optional implementation manner, the first determining module includes:
a fourth determining unit, configured to determine a source direction of the control voice;
the acquisition unit is used for acquiring an image which is positioned in a source direction and comprises the initiator;
a second identifying unit configured to identify a facial feature of the originator in the image;
a sixth determining unit, configured to determine that the initiator is a wake-up-free user in a case where the facial feature is a facial feature of a wake-up-free user.
In an optional implementation manner, the smart sound box includes at least two voice collecting devices;
the first acquisition module is specifically configured to:
respectively collecting control voices for controlling the smart speaker based on the at least two voice collecting devices;
the fourth determination unit includes:
the first determining subunit is used for determining phase information of control voices acquired by at least two voice acquisition devices respectively;
a second determining subunit for determining the source direction based on the phase information.
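The two subunits above (phase information, then source direction) can be sketched with a cross-correlation time-delay estimate for a two-microphone array; the microphone spacing, sample rate, and sign convention are assumptions, and a practical implementation would use a more robust estimator such as GCC-PHAT:

```python
import numpy as np

SAMPLE_RATE = 16000     # Hz, assumed
MIC_SPACING = 0.1       # metres between the two microphones, assumed
SPEED_OF_SOUND = 343.0  # m/s

def source_angle(sig_a: np.ndarray, sig_b: np.ndarray) -> float:
    """Estimate the arrival angle in degrees (0 = broadside) from
    the relative delay between the two microphone signals."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)  # delay in samples
    delay = lag / SAMPLE_RATE                      # delay in seconds
    # Clamp to the physically possible range before taking arcsin.
    s = np.clip(delay * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))

# Simulate mic B hearing the same pulse 2 samples later than mic A.
pulse = np.zeros(64); pulse[20] = 1.0
delayed = np.zeros(64); delayed[22] = 1.0
angle = source_angle(pulse, delayed)
print(round(angle, 1))
```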
In an optional implementation manner, the first determining module includes:
a seventh determining unit, configured to determine a relative direction of the initiator with respect to the smart sound box;
the obtaining unit is used for obtaining the historical direction of the awakening-free user relative to the intelligent sound box which is determined for the last time;
an eighth determining unit, configured to determine that the initiator is a wake-up-free user in a case where the difference between the relative direction and the historical direction is smaller than a preset difference.
In an optional implementation, the first control module includes:
the third recognition unit is used for carrying out voice recognition on the control voice to obtain a control text corresponding to the control voice;
a ninth determining unit configured to determine a control intention of the control voice based on at least the control text;
a tenth determination unit configured to determine an intention field in which the control intention is located;
and the first control unit is used for controlling the intelligent sound box based on the control intention under the condition that the intention field is the intention field supported by the intelligent sound box.
In an optional implementation manner, the ninth determining unit includes:
and the input subunit is used for inputting the control text into an intention prediction model to obtain the control intention output by the intention prediction model.
In an optional implementation manner, the ninth determining unit further includes:
the acquisition subunit is used for acquiring a sample data set, wherein the sample data set comprises a sample control text marked with a sample control intention;
a construction subunit, configured to construct a network structure of the intent prediction model;
and the third determining subunit is used for training the network parameters in the intention prediction model by using the sample data set until the network parameters are converged to obtain the intention prediction model.
In an alternative implementation, the network structure of the intention prediction model includes at least:
the system comprises a word segmentation layer, an encoding layer, a bidirectional recurrent neural network, a polymerization layer and a full connection layer;
the word segmentation layer is used for segmenting the control text to obtain a plurality of words;
the coding layer is used for converting a plurality of vocabularies into feature vectors respectively;
the bidirectional recurrent neural network is used for respectively performing feature supplementation on the plurality of feature vectors based on the dependency relationship between at least two adjacent feature vectors in the plurality of feature vectors;
the aggregation layer is used for aggregating a plurality of feature vectors after feature supplement is completed to obtain an aggregation vector;
the fully-connected layer is to predict a control intent from the aggregated vector.
In an alternative implementation, the bidirectional recurrent neural network includes a forward long short-term memory (LSTM) network and a backward LSTM network;
the forward LSTM network comprises a plurality of LSTM models connected in sequence;
the backward LSTM network includes a plurality of LSTM models connected sequentially;
the connection order among the plurality of LSTM models included in the forward LSTM network is opposite to the connection order among the plurality of LSTM models included in the backward LSTM network.
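The opposed forward/backward connection order described above can be illustrated with a toy recurrence; the single tanh update below is a stand-in for a full LSTM cell, and all dimensions and weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 4
W_x = rng.standard_normal((DIM, DIM)) * 0.1  # input weights (placeholder)
W_h = rng.standard_normal((DIM, DIM)) * 0.1  # recurrent weights (placeholder)

def run_chain(vectors):
    """Run the simplified recurrent chain over vectors in order."""
    h = np.zeros(DIM)
    states = []
    for x in vectors:
        h = np.tanh(W_x @ x + W_h @ h)  # simplified cell, not a real LSTM
        states.append(h)
    return states

features = [rng.standard_normal(DIM) for _ in range(3)]

forward = run_chain(features)               # left-to-right connection order
backward = run_chain(features[::-1])[::-1]  # opposite connection order,
                                            # re-aligned to the positions

# Each position's supplemented vector joins its forward and backward
# hidden states, so it carries context from both directions.
supplemented = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(supplemented[0].shape)
```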
In an optional implementation manner, the ninth determining unit includes:
the fourth determining subunit is configured to determine a service scene where the smart sound box is currently located;
a fifth determining subunit, configured to determine the control intention based on the service scenario and the control text.
In an optional implementation manner, the tenth determining unit is specifically configured to: and searching an intention field corresponding to the control intention in the corresponding relation between the control intention and the intention field.
In an optional implementation manner, there are a plurality of control voices, sent respectively by a plurality of initiators, and at least two of the plurality of initiators are wake-up-free users;
the first control module includes:
an eleventh determining unit, configured to determine priorities of at least two wake-exempt users;
and the second control unit is configured to control the smart speaker based on the control voice sent by the wake-up-free user with the highest priority.
In the embodiment of the present application, a control voice for controlling the smart speaker is collected; it is determined whether the initiator of the control voice is a wake-up-free user; and in a case where the initiator is a wake-up-free user, the smart speaker is controlled based on the control voice. Through the present application, the smart speaker supports voice control based on a control voice without the user speaking a wake-up word, provided the initiator is a wake-up-free user. Since no wake-up word needs to be spoken, the interaction process between the user and the smart speaker is simple and convenient, which can improve interaction efficiency and user experience.
Referring to fig. 15, a block diagram of a control device according to an embodiment of the present disclosure is shown, which may specifically include the following modules:
the second acquisition module 21 is configured to acquire a control voice for controlling the smart sound box;
a second determining module 22, configured to determine whether the control voice is a wake-up-free control voice;
and the second control module 23 is configured to control the smart sound box based on the control voice under the condition that the control voice is the wake-up-free control voice.
In an optional implementation manner, the second determining module includes:
the fourth recognition unit is used for carrying out voice recognition on the control voice to obtain a control text corresponding to the control voice;
the judging unit is used for judging whether the control text carries the wake-up-free keyword or not;
a twelfth determining unit, configured to determine that the control speech is the wake-up free control speech when the control text carries the wake-up free keyword.
In the embodiment of the present application, the identity of the initiator is not limited: regardless of whether the initiator of the control voice is a wake-up-free user, the smart speaker can be controlled based on the control voice as long as the control voice is a wake-up-free control voice. This ensures that, in some special situations, a user who is not a wake-up-free user can also control the smart speaker based on a control voice. For example, in an emergency, the user needs to make an emergency call, such as 110, 119, or 120; the user can then control the smart speaker based on the control voice without inputting a wake-up word to the smart speaker, which improves the efficiency of controlling the smart speaker.
In the embodiment of the present application, a control voice for controlling the smart speaker is collected; it is determined whether the control voice is a wake-up-free control voice; and in a case where the control voice is a wake-up-free control voice, the smart speaker is controlled based on the control voice. Through the present application, the smart speaker supports voice control based on a control voice without the user speaking a wake-up word, provided the control voice is a wake-up-free control voice. Since no wake-up word needs to be spoken, the interaction process between the user and the smart speaker is simple and convenient, which can improve interaction efficiency and user experience.
Referring to fig. 16, a block diagram of a control device according to an embodiment of the present disclosure is shown, which may specifically include the following modules:
the third acquisition module 31 is configured to acquire a control voice for controlling the smart sound box;
the acquisition module 32 is configured to acquire an acquisition time of the smart sound box when the control voice is acquired;
and the third control module 33 is configured to control the smart sound box based on the control voice under the condition that the collection time is the wake-up-free time.
In the embodiment of the present application, the identity of the initiator need not be limited: regardless of whether the initiator of the control voice is a wake-up-free user, the smart speaker may be controlled based on the control voice as long as the collection time at which the smart speaker collects the control voice falls within the wake-up-free time.
Through the present application, it can be ensured that, in some special situations, a user who is not a wake-up-free user can also control the smart speaker based on a control voice, without needing to input a wake-up word to the smart speaker, so that a large number of customers can obtain route navigation inside a shopping mall, shopping guidance inside the shopping mall, and the like from the smart speaker.
In this embodiment of the application, a control voice for controlling the smart speaker is collected; the collection time at which the smart speaker collects the control voice is obtained; and, in a case that the collection time is a wake-exempt time, the smart speaker is controlled based on the control voice. In this way, the user can voice-control the smart speaker without speaking a wake-up word, which makes the interaction simple and convenient, improves interaction efficiency, and improves user experience.
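The wake-exempt-time decision described in this embodiment can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the window boundaries and all function names are assumptions:

```python
from datetime import time

# Hypothetical wake-exempt window: between these times the speaker
# accepts control voices without a wake-up word (e.g. mall hours).
WAKE_EXEMPT_START = time(9, 0)
WAKE_EXEMPT_END = time(21, 0)

def is_wake_exempt_time(collected_at: time) -> bool:
    """True if the collection time falls inside the wake-exempt window."""
    return WAKE_EXEMPT_START <= collected_at <= WAKE_EXEMPT_END

def should_act_on(collected_at: time, has_wake_word: bool) -> bool:
    """The speaker acts when woken normally OR during a wake-exempt time."""
    return has_wake_word or is_wake_exempt_time(collected_at)
```

Outside the window the speaker falls back to requiring the wake-up word, which matches the embodiment's behavior for non-wake-exempt times.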
Referring to fig. 17, a block diagram of a control apparatus according to an embodiment of the present disclosure is shown, which may specifically include the following modules:
a fourth collection module 41, configured to collect a control voice for controlling the smart speaker;
a third determining module 42, configured to determine a position of the smart speaker;
and a fourth control module 43, configured to control the smart speaker based on the control voice in a case that the position is located in a wake-exempt area.
In this embodiment of the application, the identity of the initiator is not limited: regardless of whether the initiator of the control voice is a wake-exempt user, the smart speaker can be controlled based on the control voice as long as the smart speaker is located in a wake-exempt area. This ensures that a non-wake-exempt user can still control the smart speaker by voice in certain special situations. For example, for a smart speaker placed at the entrance of a mall, customers need route navigation and shopping guidance inside the mall; because no wake-up word has to be spoken to the speaker, customers can obtain those in-mall navigation and shopping-guidance services from it directly.
In this embodiment of the application, a control voice for controlling the smart speaker is collected; the position of the smart speaker is determined; and, in a case that the position is located in a wake-exempt area, the smart speaker is controlled based on the control voice. In this way, the user can voice-control the smart speaker without speaking a wake-up word, which makes the interaction simple and convenient, improves interaction efficiency, and improves user experience.
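The wake-exempt-area test of this embodiment reduces to a point-in-region check. A minimal sketch, where the zone coordinates and names are illustrative assumptions:

```python
# Hypothetical wake-exempt zones, each an axis-aligned rectangle in
# floor coordinates (x_min, y_min, x_max, y_max), e.g. a mall entrance.
WAKE_EXEMPT_AREAS = [
    (0.0, 0.0, 5.0, 3.0),   # mall entrance lobby
]

def in_wake_exempt_area(x: float, y: float) -> bool:
    """True if the speaker's position lies inside any wake-exempt area."""
    return any(x0 <= x <= x1 and y0 <= y <= y1
               for (x0, y0, x1, y1) in WAKE_EXEMPT_AREAS)
```

A speaker deployed elsewhere simply gets an empty or different area list, so the same code covers both wake-exempt and normal deployments.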
The present application further provides a non-volatile readable storage medium storing one or more modules (programs). When the one or more modules are applied to a device, the device is caused to execute the instructions of the method steps in this application.
Embodiments of the present application provide one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an electronic device to perform the method described in one or more of the above embodiments. In this embodiment of the application, the electronic device includes a server, a gateway, and a sub-device, where the sub-device is, for example, an Internet of Things (IoT) device.
Embodiments of the present disclosure may be implemented as an apparatus in a desired configuration using any suitable hardware, firmware, software, or any combination thereof. The apparatus may include electronic devices such as servers (or server clusters) and terminal devices such as IoT devices.
Fig. 18 schematically illustrates an example apparatus 1300 that can be used to implement various embodiments described herein.
For one embodiment, fig. 18 illustrates an example apparatus 1300 having one or more processors 1302, a control module (chipset) 1304 coupled to at least one of the processor(s) 1302, memory 1306 coupled to the control module 1304, non-volatile memory (NVM)/storage 1308 coupled to the control module 1304, one or more input/output devices 1310 coupled to the control module 1304, and a network interface 1312 coupled to the control module 1304.
Processor 1302 may include one or more single-core or multi-core processors, and processor 1302 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 1300 can be a server device such as a gateway described in the embodiments of the present application.
In some embodiments, apparatus 1300 may include one or more computer-readable media (e.g., memory 1306 or NVM/storage 1308) having instructions 1314 and one or more processors 1302, which in combination with the one or more computer-readable media, are configured to execute instructions 1314 to implement modules to perform actions described in this disclosure.
For one embodiment, control module 1304 may include any suitable interface controllers to provide any suitable interface to at least one of the processor(s) 1302 and/or any suitable device or component in communication with control module 1304.
The control module 1304 may include a memory controller module to provide an interface to the memory 1306. The memory controller module may be a hardware module, a software module, and/or a firmware module.
Memory 1306 may be used, for example, to load and store data and/or instructions 1314 for device 1300. For one embodiment, memory 1306 may comprise any suitable volatile memory, such as suitable DRAM. In some embodiments, the memory 1306 may comprise a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, control module 1304 may include one or more input/output controllers to provide an interface to NVM/storage 1308 and input/output device(s) 1310.
For example, NVM/storage 1308 may be used to store data and/or instructions 1314. NVM/storage 1308 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 1308 may include storage resources that are physically part of the device on which apparatus 1300 is installed, or it may be accessible by the device and need not be part of the device. For example, NVM/storage 1308 may be accessible over a network via input/output device(s) 1310.
Input/output device(s) 1310 may provide an interface for apparatus 1300 to communicate with any other suitable device; input/output device(s) 1310 may include communication components, audio components, sensor components, and so forth. The network interface 1312 may provide an interface for the apparatus 1300 to communicate over one or more networks. The apparatus 1300 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, for example by accessing a standards-based wireless network such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof.
For one embodiment, at least one of the processor(s) 1302 may be packaged together with logic for one or more controllers (e.g., memory controller modules) of the control module 1304. For one embodiment, at least one of the processor(s) 1302 may be packaged together with logic for one or more controllers of the control module 1304 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 1302 may be integrated on the same die with logic for one or more controller(s) of the control module 1304. For one embodiment, at least one of the processor(s) 1302 may be integrated on the same die with logic of one or more controllers of the control module 1304 to form a system on chip (SoC).
In various embodiments, apparatus 1300 may be, but is not limited to being: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), among other terminal devices. In various embodiments, apparatus 1300 may have more or fewer components and/or different architectures. For example, in some embodiments, device 1300 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
An embodiment of the present application provides a server, including: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the server to perform the method described in one or more embodiments of the present application.
For the apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiment description.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," and any other variations thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.
The control method and apparatus provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (38)

1. A control method, applied to a smart speaker, comprising:
collecting a control voice for controlling the smart speaker;
determining whether an initiator of the control voice is a wake-exempt user;
and controlling the smart speaker based on the control voice in a case that the initiator is a wake-exempt user.
2. The method of claim 1, wherein the determining whether the initiator of the control voice is a wake-exempt user comprises:
recognizing a voiceprint feature of the control voice;
and determining that the initiator is a wake-exempt user in a case that the voiceprint feature is a voiceprint feature of a wake-exempt user.
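Claim 2's voiceprint test could, for example, compare an embedding of the control voice against embeddings enrolled for wake-exempt users. The following Python sketch assumes voiceprints are plain vectors compared by cosine similarity; the threshold value and all names are illustrative assumptions, not from the patent:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical enrolled voiceprint embeddings of wake-exempt users.
ENROLLED = {"alice": [0.9, 0.1, 0.2]}
THRESHOLD = 0.85  # assumed decision threshold

def is_wake_exempt_speaker(embedding) -> bool:
    """Match the control voice's voiceprint against all enrolled users."""
    return any(cosine_similarity(embedding, ref) >= THRESHOLD
               for ref in ENROLLED.values())
```

In practice the embeddings would come from a speaker-verification model; the matching step itself stays this simple.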
3. The method of claim 1, wherein the determining whether the initiator of the control voice is a wake-exempt user comprises:
determining a position of a wake-exempt device communicatively connected to the smart speaker;
determining, according to the position, a relative direction of the wake-exempt device with respect to the smart speaker;
determining a source direction of the control voice;
and determining that the initiator is a wake-exempt user in a case that the relative direction is the same as the source direction.
4. The method of claim 1, wherein the determining whether the initiator of the control voice is a wake-exempt user comprises:
determining a source direction of the control voice;
acquiring an image, located in the source direction, that includes the initiator;
identifying a facial feature of the initiator in the image;
and determining that the initiator is a wake-exempt user in a case that the facial feature is a facial feature of a wake-exempt user.
5. The method of claim 3 or 4, wherein the smart speaker comprises at least two voice collection devices;
the collecting a control voice for controlling the smart speaker comprises:
respectively collecting, by the at least two voice collection devices, control voices for controlling the smart speaker;
and the determining a source direction of the control voice comprises:
determining phase information of the control voices respectively collected by the at least two voice collection devices;
and determining the source direction based on the phase information.
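The phase-based source-direction estimate of claim 5 is commonly realized as a time-difference-of-arrival computation between two microphones. A minimal far-field sketch; the speed of sound, microphone spacing, and function name are assumptions, not the patent's method:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, at roughly 20 degrees C
MIC_SPACING = 0.1       # m, assumed distance between the two microphones

def direction_from_delay(delay_s: float) -> float:
    """Estimate the source bearing (degrees from the array's broadside)
    from the inter-microphone arrival-time difference.

    Far-field model: path difference = speed_of_sound * delay.
    The ratio is clamped so measurement noise cannot push asin out of range.
    """
    ratio = max(-1.0, min(1.0, SPEED_OF_SOUND * delay_s / MIC_SPACING))
    return math.degrees(math.asin(ratio))
```

A delay of zero means the source is broadside to the pair; the maximum delay (spacing divided by the speed of sound) corresponds to a source on the array axis.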
6. The method of claim 1, wherein the determining whether the initiator of the control voice is a wake-exempt user comprises:
determining a relative direction of the initiator with respect to the smart speaker;
obtaining a most recently determined historical direction of a wake-exempt user with respect to the smart speaker;
and determining that the initiator is a wake-exempt user in a case that a difference between the relative direction and the historical direction is smaller than a preset difference.
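The direction comparison in claim 6 needs an angular difference that handles wrap-around at 0/360 degrees. An illustrative sketch, with the 15-degree threshold an assumption:

```python
def angle_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

MAX_DRIFT_DEG = 15.0  # assumed "preset difference"

def matches_last_known_direction(current_deg: float, last_deg: float) -> bool:
    """True if the initiator is roughly where the wake-exempt user last was."""
    return angle_difference(current_deg, last_deg) < MAX_DRIFT_DEG
```

Without the modular arithmetic, 355 degrees and 5 degrees would look 350 degrees apart instead of 10, so a user standing near the 0-degree axis would never match.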
7. The method of claim 1, wherein the controlling the smart speaker based on the control voice comprises:
performing voice recognition on the control voice to obtain a control text corresponding to the control voice;
determining a control intent of the control voice based at least on the control text;
determining an intent domain in which the control intent is located;
and controlling the smart speaker based on the control intent in a case that the intent domain is an intent domain supported by the smart speaker.
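The intent-domain gate of claim 7 can be sketched as a lookup followed by a membership test. The domains, intents, and the mapping below are hypothetical, purely to show the control flow:

```python
# Assumed set of intent domains this particular speaker supports.
SUPPORTED_DOMAINS = {"music", "navigation", "weather"}

# Hypothetical correspondence between control intents and intent domains.
INTENT_TO_DOMAIN = {
    "play_song": "music",
    "route_to_store": "navigation",
    "book_flight": "travel",   # a domain this speaker does NOT support
}

def should_execute(intent: str) -> bool:
    """Execute only when the intent's domain is supported by the speaker."""
    return INTENT_TO_DOMAIN.get(intent) in SUPPORTED_DOMAINS
```

Unknown intents map to `None`, which is never in the supported set, so they are rejected along with intents from unsupported domains.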
8. The method of claim 7, the determining a control intent of the control speech based at least on the control text, comprising:
and inputting the control text into an intention prediction model to obtain the control intention output by the intention prediction model.
9. The method of claim 8, wherein training the intent prediction model comprises:
obtaining a sample data set, wherein the sample data set comprises a sample control text marked with a sample control intention;
constructing a network structure of an intention prediction model;
and training network parameters in an intention prediction model by using the sample data set until the network parameters are converged to obtain the intention prediction model.
10. The method of claim 9, wherein the network structure of the intent prediction model comprises at least:
a word segmentation layer, an encoding layer, a bidirectional recurrent neural network, an aggregation layer, and a fully connected layer;
the word segmentation layer is configured to segment the control text to obtain a plurality of words;
the encoding layer is configured to convert the plurality of words into feature vectors respectively;
the bidirectional recurrent neural network is configured to perform feature supplementation on the plurality of feature vectors respectively, based on a dependency relationship between at least two adjacent feature vectors among the plurality of feature vectors;
the aggregation layer is configured to aggregate the plurality of feature vectors after feature supplementation to obtain an aggregated vector;
and the fully connected layer is configured to predict the control intent from the aggregated vector.
11. The method of claim 10, wherein the bidirectional recurrent neural network comprises a forward Long Short Term Memory (LSTM) network and a backward LSTM network;
the forward LSTM network comprises a plurality of LSTM models connected in sequence;
the backward LSTM network includes a plurality of LSTM models connected sequentially;
the connection order among the plurality of LSTM models included in the forward LSTM network is opposite to the connection order among the plurality of LSTM models included in the backward LSTM network.
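The layer stack of claims 10 and 11 (segmentation, encoding, bidirectional context, aggregation, classification) can be illustrated with a toy, dependency-free sketch. The tiny vocabulary, the neighbour-summing stand-in for the forward and backward LSTM passes, and the trivial classifier are all simplifications for illustration, not the claimed model:

```python
# Toy embedding table standing in for the encoding layer.
VOCAB = {"play": [1.0, 0.0], "music": [0.0, 1.0]}

def segment(text):
    """Word-segmentation layer: split the control text into words."""
    return text.split()

def encode(tokens):
    """Encoding layer: map each word to a feature vector."""
    return [VOCAB.get(t, [0.0, 0.0]) for t in tokens]

def contextualize(vectors):
    """Stand-in for the forward+backward recurrent passes: supplement each
    vector with information from its left and right neighbours."""
    out = []
    for i, v in enumerate(vectors):
        left = vectors[i - 1] if i > 0 else [0.0, 0.0]
        right = vectors[i + 1] if i < len(vectors) - 1 else [0.0, 0.0]
        out.append([a + b + c for a, b, c in zip(v, left, right)])
    return out

def aggregate(vectors):
    """Aggregation layer: mean-pool the supplemented vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def predict_intent(text):
    """Stand-in for the fully connected layer: a trivial decision rule."""
    agg = aggregate(contextualize(encode(segment(text))))
    return "play_music" if agg[0] + agg[1] > 0 else "unknown"
```

A real implementation would replace `contextualize` with forward and backward LSTM networks (with opposite connection orders, per claim 11) and `predict_intent` with a trained fully connected layer over the intent classes.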
12. The method of claim 7, wherein the determining a control intent of the control voice based at least on the control text comprises:
determining a service scenario in which the smart speaker is currently located;
and determining the control intent based on the service scenario and the control text.
13. The method of claim 7, wherein the determining the intent domain in which the control intent is located comprises:
searching, in a correspondence between control intents and intent domains, for the intent domain corresponding to the control intent.
14. The method of claim 1, wherein there are a plurality of control voices, respectively uttered by a plurality of initiators, and at least two of the plurality of initiators are wake-exempt users;
the controlling the smart speaker based on the control voice comprises:
determining priorities of the at least two wake-exempt users;
and controlling the smart speaker based on the control voice uttered by the wake-exempt user with the higher priority.
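Claim 14's arbitration between several simultaneous wake-exempt speakers can be sketched as a priority lookup over the collected voices; the roles and priority values are assumptions:

```python
# Hypothetical priority table: lower number = higher priority.
PRIORITIES = {"admin": 0, "family": 1, "guest": 2}

def select_control_voice(voices):
    """From several simultaneous wake-exempt speakers, pick the control
    voice of the highest-priority one.

    `voices` is a list of (user_role, utterance) pairs; unknown roles
    sort last.
    """
    return min(voices, key=lambda v: PRIORITIES.get(v[0], 99))[1]
```

When two commands conflict (one user says "stop", another says "play"), only the higher-priority command is executed.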
15. A control method, applied to a smart speaker, comprising:
collecting a control voice for controlling the smart speaker;
determining whether the control voice is a wake-exempt control voice;
and controlling the smart speaker based on the control voice in a case that the control voice is a wake-exempt control voice.
16. The method of claim 15, wherein the determining whether the control voice is a wake-exempt control voice comprises:
performing voice recognition on the control voice to obtain a control text corresponding to the control voice;
determining whether the control text carries a wake-exempt keyword;
and determining that the control voice is a wake-exempt control voice in a case that the control text carries a wake-exempt keyword.
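The keyword test of claim 16 might, for instance, look like the following; the keyword set is hypothetical:

```python
# Assumed wake-exempt keywords: control texts containing any of these
# are acted on even without the wake-up word.
WAKE_EXEMPT_KEYWORDS = {"emergency", "help", "navigation"}

def is_wake_exempt_utterance(control_text: str) -> bool:
    """True if the recognized control text carries a wake-exempt keyword."""
    words = control_text.lower().split()
    return any(k in words for k in WAKE_EXEMPT_KEYWORDS)
```

Matching whole words (rather than substrings) avoids, e.g., "helpless" triggering on "help".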
17. A control method, applied to a smart speaker, comprising:
collecting a control voice for controlling the smart speaker;
obtaining a collection time at which the smart speaker collects the control voice;
and controlling the smart speaker based on the control voice in a case that the collection time is a wake-exempt time.
18. A control method, applied to a smart speaker, comprising:
collecting a control voice for controlling the smart speaker;
determining a position of the smart speaker;
and controlling the smart speaker based on the control voice in a case that the position is located in a wake-exempt area.
19. A control apparatus, applied to a smart speaker, comprising:
a first collection module, configured to collect a control voice for controlling the smart speaker;
a first determining module, configured to determine whether an initiator of the control voice is a wake-exempt user;
and a first control module, configured to control the smart speaker based on the control voice in a case that the initiator is a wake-exempt user.
20. The apparatus of claim 19, wherein the first determining module comprises:
a first recognition unit, configured to recognize a voiceprint feature of the control voice;
and a first determining unit, configured to determine that the initiator is a wake-exempt user in a case that the voiceprint feature is a voiceprint feature of a wake-exempt user.
21. The apparatus of claim 19, wherein the first determining module comprises:
a second determining unit, configured to determine a position of a wake-exempt device communicatively connected to the smart speaker;
a third determining unit, configured to determine, according to the position, a relative direction of the wake-exempt device with respect to the smart speaker;
a fourth determining unit, configured to determine a source direction of the control voice;
and a fifth determining unit, configured to determine that the initiator is a wake-exempt user in a case that the relative direction is the same as the source direction.
22. The apparatus of claim 19, wherein the first determining module comprises:
a fourth determining unit, configured to determine a source direction of the control voice;
an acquisition unit, configured to acquire an image, located in the source direction, that includes the initiator;
a second identification unit, configured to identify a facial feature of the initiator in the image;
and a sixth determining unit, configured to determine that the initiator is a wake-exempt user in a case that the facial feature is a facial feature of a wake-exempt user.
23. The apparatus of claim 21 or 22, wherein the smart speaker comprises at least two voice collection devices;
the first collection module is specifically configured to:
respectively collect, by the at least two voice collection devices, control voices for controlling the smart speaker;
and the fourth determining unit comprises:
a first determining subunit, configured to determine phase information of the control voices respectively collected by the at least two voice collection devices;
and a second determining subunit, configured to determine the source direction based on the phase information.
24. The apparatus of claim 19, wherein the first determining module comprises:
a seventh determining unit, configured to determine a relative direction of the initiator with respect to the smart speaker;
an obtaining unit, configured to obtain a most recently determined historical direction of a wake-exempt user with respect to the smart speaker;
and an eighth determining unit, configured to determine that the initiator is a wake-exempt user in a case that a difference between the relative direction and the historical direction is smaller than a preset difference.
25. The apparatus of claim 19, wherein the first control module comprises:
a third recognition unit, configured to perform voice recognition on the control voice to obtain a control text corresponding to the control voice;
a ninth determining unit, configured to determine a control intent of the control voice based at least on the control text;
a tenth determining unit, configured to determine an intent domain in which the control intent is located;
and a first control unit, configured to control the smart speaker based on the control intent in a case that the intent domain is an intent domain supported by the smart speaker.
26. The apparatus of claim 25, wherein the ninth determining unit comprises:
an input subunit, configured to input the control text into an intention prediction model to obtain the control intent output by the intention prediction model.
27. The apparatus of claim 26, wherein the ninth determining unit further comprises:
an acquisition subunit, configured to acquire a sample data set, where the sample data set includes sample control texts annotated with sample control intents;
a construction subunit, configured to construct a network structure of the intention prediction model;
and a third determining subunit, configured to train network parameters in the intention prediction model using the sample data set until the network parameters converge, to obtain the intention prediction model.
28. The apparatus of claim 27, wherein the network structure of the intention prediction model comprises at least:
a word segmentation layer, an encoding layer, a bidirectional recurrent neural network, an aggregation layer, and a fully connected layer;
the word segmentation layer is configured to segment the control text to obtain a plurality of words;
the encoding layer is configured to convert the plurality of words into feature vectors respectively;
the bidirectional recurrent neural network is configured to perform feature supplementation on the plurality of feature vectors respectively, based on a dependency relationship between at least two adjacent feature vectors among the plurality of feature vectors;
the aggregation layer is configured to aggregate the plurality of feature vectors after feature supplementation to obtain an aggregated vector;
and the fully connected layer is configured to predict the control intent from the aggregated vector.
29. The apparatus of claim 28, wherein the bidirectional recurrent neural network comprises a forward Long Short Term Memory (LSTM) network and a backward LSTM network;
the forward LSTM network comprises a plurality of LSTM models connected in sequence;
the backward LSTM network includes a plurality of LSTM models connected sequentially;
the connection order among the plurality of LSTM models included in the forward LSTM network is opposite to the connection order among the plurality of LSTM models included in the backward LSTM network.
30. The apparatus of claim 25, wherein the ninth determining unit comprises:
a fourth determining subunit, configured to determine a service scenario in which the smart speaker is currently located;
and a fifth determining subunit, configured to determine the control intent based on the service scenario and the control text.
31. The apparatus of claim 25, the tenth determining unit is specifically configured to: and searching an intention field corresponding to the control intention in the corresponding relation between the control intention and the intention field.
32. The apparatus of claim 19, wherein there are a plurality of control voices, respectively uttered by a plurality of initiators, and at least two of the plurality of initiators are wake-exempt users;
the first control module comprises:
an eleventh determining unit, configured to determine priorities of the at least two wake-exempt users;
and a second control unit, configured to control the smart speaker based on the control voice uttered by the wake-exempt user with the higher priority.
33. A control apparatus, applied to a smart speaker, comprising:
a second collection module, configured to collect a control voice for controlling the smart speaker;
a second determining module, configured to determine whether the control voice is a wake-exempt control voice;
and a second control module, configured to control the smart speaker based on the control voice in a case that the control voice is a wake-exempt control voice.
34. The apparatus of claim 33, wherein the second determining module comprises:
a fourth recognition unit, configured to perform voice recognition on the control voice to obtain a control text corresponding to the control voice;
a judging unit, configured to determine whether the control text carries a wake-exempt keyword;
and a twelfth determining unit, configured to determine that the control voice is a wake-exempt control voice in a case that the control text carries a wake-exempt keyword.
35. A control apparatus, applied to a smart speaker, comprising:
a third collection module, configured to collect a control voice for controlling the smart speaker;
an obtaining module, configured to obtain a collection time at which the smart speaker collects the control voice;
and a third control module, configured to control the smart speaker based on the control voice in a case that the collection time is a wake-exempt time.
36. A control apparatus, applied to a smart speaker, comprising:
a fourth collection module, configured to collect a control voice for controlling the smart speaker;
a third determining module, configured to determine a position of the smart speaker;
and a fourth control module, configured to control the smart speaker based on the control voice in a case that the position is located in a wake-exempt area.
37. A smart speaker, comprising:
a processor; and
a memory having executable code stored thereon which, when executed, causes the processor to perform the control method as recited in one or more of claims 1-18.
38. One or more machine-readable media having executable code stored thereon which, when executed, causes a processor to perform the control method as recited in one or more of claims 1-18.
CN202010167783.8A 2020-03-11 2020-03-11 Control method and device Active CN113393834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010167783.8A CN113393834B (en) 2020-03-11 2020-03-11 Control method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010167783.8A CN113393834B (en) 2020-03-11 2020-03-11 Control method and device

Publications (2)

Publication Number Publication Date
CN113393834A true CN113393834A (en) 2021-09-14
CN113393834B CN113393834B (en) 2024-04-16

Family

ID=77615423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010167783.8A Active CN113393834B (en) 2020-03-11 2020-03-11 Control method and device

Country Status (1)

Country Link
CN (1) CN113393834B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816192A (en) * 2020-07-07 2020-10-23 云知声智能科技股份有限公司 Voice equipment and control method, device and equipment thereof

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106328132A (en) * 2016-08-15 2017-01-11 歌尔股份有限公司 Voice interaction control method and device for intelligent equipment
KR20170100722A (en) * 2016-02-26 2017-09-05 주식회사 서연전자 Smart key system having a function to recognize sounds of users
CN107315561A (en) * 2017-06-30 2017-11-03 联想(北京)有限公司 Data processing method and electronic device
CN108737933A (en) * 2018-05-30 2018-11-02 上海与德科技有限公司 Dialogue method and device based on a smart sound box, and electronic device
CN108762104A (en) * 2018-05-17 2018-11-06 江西午诺科技有限公司 Speaker control method and device, readable storage medium, and mobile terminal
CN108958810A (en) * 2018-02-09 2018-12-07 北京猎户星空科技有限公司 Voiceprint-based user identification method, device and equipment
CN108962260A (en) * 2018-06-25 2018-12-07 福来宝电子(深圳)有限公司 Multi-user voice command recognition method, system and storage medium
CN109104664A (en) * 2018-06-25 2018-12-28 福来宝电子(深圳)有限公司 Control method and system for a smart sound box, smart sound box, and storage medium
WO2019007245A1 (en) * 2017-07-04 2019-01-10 阿里巴巴集团控股有限公司 Processing method, control method and recognition method, and apparatus and electronic device therefor
CN109286875A (en) * 2018-09-29 2019-01-29 百度在线网络技术(北京)有限公司 Method, apparatus, electronic device and storage medium for directional sound pickup
CN109308908A (en) * 2017-07-27 2019-02-05 深圳市冠旭电子股份有限公司 Voice interaction method and device
CN109326289A (en) * 2018-11-30 2019-02-12 深圳创维数字技术有限公司 Wake-up-free voice interaction method, device, equipment and storage medium
CN109545206A (en) * 2018-10-29 2019-03-29 百度在线网络技术(北京)有限公司 Voice interaction processing method and device for a smart device, and smart device
CN109637548A (en) * 2018-12-29 2019-04-16 苏州思必驰信息科技有限公司 Voice interaction method and device based on voiceprint recognition
CN110491387A (en) * 2019-08-23 2019-11-22 三星电子(中国)研发中心 Interactive service method and system based on multiple terminals
WO2019223102A1 (en) * 2018-05-22 2019-11-28 平安科技(深圳)有限公司 Method and apparatus for checking validity of identity, terminal device and medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李四维; 程贵锋; 何双旺; 张笛: "Research on voice assistant capability evaluation and trend analysis", Guangdong Communication Technology, no. 12 *
陈东升: "Baidu AI black tech: the Xiaodu Zaijia smart speaker", Computer & Network, no. 22, 26 November 2018 (2018-11-26) *


Also Published As

Publication number Publication date
CN113393834B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
US10546578B2 (en) Method and device for transmitting and receiving audio data
KR102458805B1 (en) Multi-user authentication on a device
US10043520B2 (en) Multilevel speech recognition for candidate application group using first and second speech commands
US9064495B1 (en) Measurement of user perceived latency in a cloud based speech application
CN114830228A (en) Account associated with a device
US20160049152A1 (en) System and method for hybrid processing in a natural language voice services environment
WO2020238209A1 (en) Audio processing method, system and related device
CN110097870B (en) Voice processing method, device, equipment and storage medium
CN107408386A (en) Electronic installation is controlled based on voice direction
CN108831508A (en) Voice activity detection method, device and equipment
CN103038765A (en) Method and apparatus for adapting a context model
CN110070859B (en) Voice recognition method and device
US11443730B2 (en) Initiating synthesized speech output from a voice-controlled device
US20150317998A1 (en) Method and apparatus for recognizing speech, and method and apparatus for generating noise-speech recognition model
US10923101B2 (en) Pausing synthesized speech output from a voice-controlled device
EP3308379B1 (en) Motion adaptive speech processing
KR20160106075A (en) Method and device for identifying a piece of music in an audio stream
US20160027435A1 (en) Method for training an automatic speech recognition system
US11626104B2 (en) User speech profile management
WO2022206602A1 (en) Speech wakeup method and apparatus, and storage medium and system
US10950221B2 (en) Keyword confirmation method and apparatus
CN110992942A (en) Voice recognition method and device for voice recognition
US20240013784A1 (en) Speaker recognition adaptation
CN113362812A (en) Voice recognition method and device and electronic equipment
CN113779208A (en) Method and device for man-machine conversation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40058138

Country of ref document: HK

GR01 Patent grant