CN113393834B - Control method and device - Google Patents


Info

Publication number
CN113393834B
Authority
CN
China
Prior art keywords
control
wake
voice
sound box
intelligent sound
Prior art date
Legal status
Active
Application number
CN202010167783.8A
Other languages
Chinese (zh)
Other versions
CN113393834A (en)
Inventor
张平 (Zhang Ping)
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN202010167783.8A
Publication of CN113393834A
Application granted
Publication of CN113393834B
Status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/08 — Speech classification or search
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/24 — Speech recognition using non-acoustical features
    • G10L 15/25 — Speech recognition using non-acoustical features: using position of the lips, movement of the lips or face analysis
    • G10L 2015/088 — Word spotting
    • G10L 2015/225 — Feedback of the input speech
    • G10L 2015/226 — Man-machine dialogue using non-speech characteristics
    • G10L 2015/227 — Man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology

Abstract

The embodiments of the present application provide a control method and device. In the embodiments, a control voice for controlling a smart speaker is collected; it is determined whether the initiator of the control voice is a wake-up-free user; and when the initiator is a wake-up-free user, the smart speaker is controlled based on the control voice. Through the present application, the smart speaker allows a wake-up-free user to control it by voice simply by speaking the control voice, without speaking a wake-up word. Because no wake-up word needs to be uttered, the interaction between the user and the smart speaker is simpler and more convenient, which improves interaction efficiency and user experience.

Description

Control method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a control method and apparatus.
Background
With the continuous development of technology, users can control devices through voice interaction in daily life. For example, a user may speak a voice command directly into the device, which the device may respond to.
In the prior art, a user typically has to use the "wake-up word + voice instruction" pattern: every time the user interacts with the device, the user must first speak a wake-up word so that the device knows it is being addressed.
However, speaking the wake-up word makes the interaction between the user and the device more cumbersome and reduces interaction efficiency, thereby degrading the user experience.
Disclosure of Invention
In order to improve interaction efficiency and thereby user experience, the present application discloses a control method and device.
In a first aspect, the present application shows a control method, including:
collecting control voice for controlling the intelligent sound box;
determining whether the initiator of the control voice is a wake-up-free user;
and when the initiator is a wake-up-free user, controlling the intelligent sound box based on the control voice.
In an optional implementation manner, the determining whether the initiator of the control voice is a wake-free user includes:
identifying voiceprint features of the control speech;
and determining that the initiator is the wake-free user under the condition that the voiceprint characteristic is the voiceprint characteristic of the wake-free user.
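By way of illustration only (this sketch is not part of the disclosed embodiments; the enrolled vectors, user names and threshold are hypothetical stand-ins for the output of a speaker-verification model), the voiceprint-based determination above could look like:

```python
import math

# Hypothetical pre-enrolled voiceprint embeddings of wake-up-free users.
ENROLLED_VOICEPRINTS = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
}

MATCH_THRESHOLD = 0.95  # assumed similarity threshold

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_wake_free_user(voiceprint):
    """True if the voiceprint matches any enrolled wake-up-free user."""
    return any(
        cosine_similarity(voiceprint, enrolled) >= MATCH_THRESHOLD
        for enrolled in ENROLLED_VOICEPRINTS.values()
    )
```

A voiceprint identical to an enrolled one matches; an unrelated vector does not.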
In an optional implementation manner, the determining whether the initiator of the control voice is a wake-free user includes:
determining the position of wake-up-free equipment in communication connection with the intelligent sound box;
determining the relative direction of the wake-up-free equipment relative to the intelligent sound box according to the position;
determining a source direction of the control speech;
and determining that the initiator is free of waking up the user under the condition that the relative direction is the same as the source direction.
In an optional implementation manner, the determining whether the initiator of the control voice is a wake-free user includes:
determining a source direction of the control speech;
collecting an image that includes the initiator located in the source direction;
identifying facial features of the initiator in the image;
and determining that the initiator is a wake-up-free user when the facial features are facial features of a wake-up-free user.
In an alternative implementation, the smart speaker includes at least two voice capture devices;
the collecting of the control voice for controlling the intelligent sound box includes:
collecting, based on the at least two voice collection devices respectively, control voices for controlling the intelligent sound box;
the determining the source direction of the control voice comprises:
determining phase information of control voices respectively acquired by at least two voice acquisition devices;
the source direction is determined based on the phase information.
In an optional implementation manner, the determining whether the initiator of the control voice is a wake-free user includes:
determining the relative direction of the initiator relative to the intelligent sound box;
acquiring the history direction of the last determined wake-up-free user relative to the intelligent sound box;
and determining that the initiator is a wake-free user under the condition that the difference between the relative direction and the historical direction is smaller than a preset difference.
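The comparison with the historical direction can be sketched as follows, taking care of angular wraparound; the tolerance is an assumed value standing in for the "preset difference" (this is illustrative only, not the disclosed implementation):

```python
def angular_difference(a, b):
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

MAX_DIFF_DEG = 15.0  # assumed preset difference

def matches_last_wake_free_direction(current_dir, history_dir):
    """True if the initiator is roughly where the last wake-up-free user was."""
    return angular_difference(current_dir, history_dir) < MAX_DIFF_DEG
```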
In an optional implementation manner, the controlling the smart speaker based on the control voice includes:
performing voice recognition on the control voice to obtain a control text corresponding to the control voice;
determining a control intent of the control speech based at least on the control text;
determining the intention field of the control intention;
and controlling the intelligent sound box based on the control intention under the condition that the intention field is the intention field supported by the intelligent sound box.
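The domain-gated control flow above (recognize text, predict the intent, act only when the intent's domain is supported) can be sketched as follows; the recognizer and intent model are stubbed out with hypothetical rules, and the domain names are illustrative:

```python
SUPPORTED_DOMAINS = {"music", "weather"}  # assumed supported intent domains

def predict_intent(control_text):
    # Stand-in for the intent prediction model: returns (intent, domain).
    if "song" in control_text:
        return ("play_music", "music")
    if "temperature" in control_text:
        return ("query_weather", "weather")
    return ("unknown", "other")

def handle_control_text(control_text):
    """Control the device only if the intent's domain is supported."""
    intent, domain = predict_intent(control_text)
    if domain in SUPPORTED_DOMAINS:
        return f"executing {intent}"
    return "ignored"
```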
In an alternative implementation, the determining the control intent of the control speech based at least on the control text includes:
and inputting the control text into an intention prediction model to obtain the control intention output by the intention prediction model.
In an alternative implementation, the means for training the intent prediction model includes:
acquiring a sample data set, wherein the sample data set comprises sample control text marked with sample control intention;
constructing a network structure of an intention prediction model;
training network parameters in the intention prediction model by using the sample data set until the network parameters are converged, so as to obtain the intention prediction model.
In an alternative implementation, the network structure of the intent prediction model includes at least:
the system comprises a word segmentation layer, a coding layer, a bidirectional cyclic neural network, a polymerization layer and a full connection layer;
the word segmentation layer is used for segmenting the control text to obtain a plurality of words;
the coding layer is used for respectively converting a plurality of vocabularies into feature vectors;
the bidirectional recurrent neural network is used for supplementing features for each of the plurality of feature vectors based on the dependency relationship between at least two adjacent feature vectors among the plurality of feature vectors;
the aggregation layer is used for aggregating a plurality of feature vectors with the feature supplemented to obtain an aggregation vector;
the full connection layer is used for predicting control intention according to the aggregate vector.
In an alternative implementation, the bidirectional recurrent neural network includes a forward long short-term memory (LSTM) network and a backward LSTM network;
the forward LSTM network comprises a plurality of LSTM models which are connected in sequence;
the backward LSTM network comprises a plurality of LSTM models which are sequentially connected;
the connection order between the plurality of LSTM models included in the forward LSTM network is opposite to the connection order between the plurality of LSTM models included in the backward LSTM network.
In an alternative implementation, the determining the control intent of the control speech based at least on the control text includes:
determining a current service scene of the intelligent sound box;
the control intent is determined based on the business scenario and the control text.
In an alternative implementation manner, the determining the intention field in which the control intention is located includes:
in the correspondence between the control intention and the intention field, the intention field corresponding to the control intention is searched.
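A minimal sketch of this correspondence lookup; the table entries are illustrative, not from the disclosure:

```python
# Correspondence between control intents and intent domains.
INTENT_TO_DOMAIN = {
    "play_music": "music",
    "pause_music": "music",
    "query_weather": "weather",
}

def lookup_domain(intent):
    """Look up the intent domain for a control intent; None if absent."""
    return INTENT_TO_DOMAIN.get(intent)
```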
In an alternative implementation, there are multiple control voices, issued respectively by multiple initiators, at least two of whom are wake-up-free users;
the controlling the intelligent sound box based on the control voice comprises the following steps:
determining priorities of at least two wake-up-free users;
and controlling the intelligent sound box based on the control voice issued by the wake-up-free user with the higher priority.
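A minimal sketch of the priority-based selection, assuming a hypothetical priority table (the disclosure does not specify how priorities are assigned):

```python
# Illustrative priorities for wake-up-free users; larger = higher priority.
USER_PRIORITY = {"parent": 2, "child": 1}

def select_control_voice(voices):
    """voices: list of (initiator, control_text) pairs from wake-up-free users.

    Returns the control text of the highest-priority initiator.
    """
    initiator, text = max(voices, key=lambda v: USER_PRIORITY.get(v[0], 0))
    return text
```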
In a second aspect, the present application shows a control method applied to an intelligent sound box, including:
collecting control voice for controlling the intelligent sound box;
determining whether the control voice is a wake-up-free control voice;
and under the condition that the control voice is wake-up-free control voice, controlling the intelligent sound box based on the control voice.
In an alternative implementation, the determining whether the control speech is a wake-up-free control speech includes:
performing voice recognition on the control voice to obtain a control text corresponding to the control voice;
judging whether the control text carries a wake-up-free keyword or not;
and under the condition that the control text carries the wake-up-free keyword, determining the control voice as the wake-up-free control voice.
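A minimal sketch of the wake-up-free keyword check on the recognized control text; the keyword set is hypothetical:

```python
WAKE_FREE_KEYWORDS = {"please", "now"}  # assumed configured keywords

def is_wake_free_control_text(control_text):
    """True if the recognized text carries any wake-up-free keyword."""
    words = control_text.lower().split()
    return any(keyword in words for keyword in WAKE_FREE_KEYWORDS)
```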
In a third aspect, the present application shows a control method applied to an intelligent sound box, including:
collecting control voice for controlling the intelligent sound box;
acquiring the acquisition time of the intelligent sound box when the control voice is acquired;
and under the condition that the acquisition time is the wake-up-free time, controlling the intelligent sound box based on the control voice.
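A minimal sketch of the wake-up-free time check, assuming a hypothetical 8:00–22:00 window (the disclosure does not fix the window):

```python
import datetime

# Assumed wake-up-free window of local time.
WAKE_FREE_START = datetime.time(8, 0)
WAKE_FREE_END = datetime.time(22, 0)

def is_wake_free_time(collection_time):
    """True if the collection time falls inside the wake-up-free window."""
    return WAKE_FREE_START <= collection_time <= WAKE_FREE_END
```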
In a fourth aspect, the present application shows a control method applied to an intelligent sound box, including:
collecting control voice for controlling the intelligent sound box;
determining the position of the intelligent sound box;
and controlling the intelligent sound box based on the control voice under the condition that the position is in the wake-up-free area.
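A minimal sketch of the wake-up-free area check, assuming a hypothetical circular area around a configured point (the shape and size of the area are not fixed by the disclosure):

```python
import math

# Assumed wake-up-free area: a circle of radius 3 m around a configured point.
AREA_CENTER = (0.0, 0.0)
AREA_RADIUS_M = 3.0

def in_wake_free_area(position):
    """True if the smart speaker's (x, y) position lies in the area."""
    dx = position[0] - AREA_CENTER[0]
    dy = position[1] - AREA_CENTER[1]
    return math.hypot(dx, dy) <= AREA_RADIUS_M
```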
In a fifth aspect, the present application shows a control device applied to an intelligent sound box, including:
the first acquisition module is used for acquiring control voice for controlling the intelligent sound box;
the first determining module is used for determining whether the initiator of the control voice is a wake-up-free user;
the first control module is used for controlling the intelligent sound box based on the control voice when the initiator is a wake-up-free user.
In an alternative implementation, the first determining module includes:
the first recognition unit is used for recognizing voiceprint features of the control voice;
and the first determining unit is used for determining that the initiator is the wake-up-free user under the condition that the voiceprint characteristic is the voiceprint characteristic of the wake-up-free user.
In an alternative implementation, the first determining module includes:
the second determining unit is used for determining the position of the wake-up-free equipment in communication connection with the intelligent sound box;
the third determining unit is used for determining the relative direction of the wake-up-free equipment relative to the intelligent sound box according to the position;
a fourth determining unit configured to determine a source direction of the control voice;
and a fifth determining unit, configured to determine that the initiator is a wake-up-free user when the relative direction is the same as the source direction.
In an alternative implementation, the first determining module includes:
a fourth determining unit configured to determine a source direction of the control voice;
the acquisition unit is used for acquiring an image including the initiator in the source direction;
a second recognition unit configured to recognize a facial feature of an initiator in the image;
a sixth determining unit, configured to determine that the initiator is a wake-up-free user if the facial features are facial features of a wake-up-free user.
In an alternative implementation, the smart speaker includes at least two voice capture devices;
the first acquisition module is specifically configured to:
collecting, based on the at least two voice collection devices respectively, control voices for controlling the intelligent sound box;
The fourth determination unit includes:
the first determining subunit is used for determining phase information of the control voices respectively acquired by the at least two voice acquisition devices;
a second determination subunit for determining the source direction based on the phase information.
In an alternative implementation, the first determining module includes:
a seventh determining unit, configured to determine a relative direction of the initiator with respect to the intelligent sound box;
the acquisition unit is used for acquiring the history direction of the last determined wake-up-free user relative to the intelligent sound box;
an eighth determining unit, configured to determine that the initiator is a wake-up-free user if the difference between the relative direction and the historical direction is smaller than a preset difference.
In an alternative implementation, the first control module includes:
the third recognition unit is used for carrying out voice recognition on the control voice to obtain a control text corresponding to the control voice;
a ninth determination unit configured to determine a control intention of the control voice based at least on the control text;
a tenth determination unit configured to determine an intention field in which the control intention is located;
And the first control unit is used for controlling the intelligent sound box based on the control intention when the intention field is the intention field supported by the intelligent sound box.
In an alternative implementation, the ninth determining unit includes:
and the input subunit is used for inputting the control text into an intention prediction model to obtain the control intention output by the intention prediction model.
In an optional implementation manner, the ninth determining unit further includes:
the acquisition subunit is used for acquiring a sample data set, wherein the sample data set comprises sample control texts marked with sample control intents;
a building subunit, configured to build a network structure of the intent prediction model;
and the third determining subunit is used for training the network parameters in the intention prediction model by using the sample data set until the network parameters are converged to obtain the intention prediction model.
In an alternative implementation, the network structure of the intent prediction model includes at least:
the system comprises a word segmentation layer, a coding layer, a bidirectional cyclic neural network, a polymerization layer and a full connection layer;
the word segmentation layer is used for segmenting the control text to obtain a plurality of words;
The coding layer is used for respectively converting a plurality of vocabularies into feature vectors;
the bidirectional recurrent neural network is used for supplementing features for each of the plurality of feature vectors based on the dependency relationship between at least two adjacent feature vectors among the plurality of feature vectors;
the aggregation layer is used for aggregating a plurality of feature vectors with the feature supplemented to obtain an aggregation vector;
the full connection layer is used for predicting control intention according to the aggregate vector.
In an alternative implementation, the bidirectional recurrent neural network includes a forward long short-term memory (LSTM) network and a backward LSTM network;
the forward LSTM network comprises a plurality of LSTM models which are connected in sequence;
the backward LSTM network comprises a plurality of LSTM models which are sequentially connected;
the connection order between the plurality of LSTM models included in the forward LSTM network is opposite to the connection order between the plurality of LSTM models included in the backward LSTM network.
In an alternative implementation, the ninth determining unit includes:
a fourth determining subunit, configured to determine a service scenario where the intelligent sound box is currently located;
and a fifth determining subunit, configured to determine the control intent based on the service scenario and the control text.
In an alternative implementation manner, the tenth determining unit is specifically configured to: in the correspondence between the control intention and the intention field, the intention field corresponding to the control intention is searched.
In an alternative implementation, there are multiple control voices, issued respectively by multiple initiators, at least two of whom are wake-up-free users;
the first control module includes:
an eleventh determining unit, configured to determine priorities of at least two wake-up-free users;
and the second control unit is used for controlling the intelligent sound box based on the control voice issued by the wake-up-free user with the higher priority.
In a sixth aspect, the present application shows a control device applied to an intelligent sound box, including:
the second acquisition module is used for acquiring control voice for controlling the intelligent sound box;
the second determining module is used for determining whether the control voice is a wake-up-free control voice or not;
and the second control module is used for controlling the intelligent sound box based on the control voice under the condition that the control voice is the wake-up-free control voice.
In an alternative implementation, the second determining module includes:
A fourth recognition unit, configured to perform voice recognition on the control voice to obtain a control text corresponding to the control voice;
the judging unit is used for judging whether the control text carries the awakening-free keywords or not;
and the twelfth determining unit is used for determining that the control voice is the wake-up-free control voice under the condition that the wake-up-free keyword is carried in the control text.
In a seventh aspect, the present application shows a control device applied to an intelligent sound box, including:
the third acquisition module is used for acquiring control voice for controlling the intelligent sound box;
the acquisition module is used for acquiring the acquisition time when the intelligent sound box acquires the control voice;
and the third control module is used for controlling the intelligent sound box based on the control voice under the condition that the acquisition time is the wake-up-free time.
In an eighth aspect, the present application shows a control device applied to an intelligent sound box, including:
the fourth acquisition module is used for acquiring control voice for controlling the intelligent sound box;
the third determining module is used for determining the position of the intelligent sound box;
and the fourth control module is used for controlling the intelligent sound box based on the control voice under the condition that the position is positioned in the wake-up-free area.
In a ninth aspect, the present application shows a smart speaker, including:
a processor; and
a memory having executable code stored thereon, which when executed causes the processor to perform the control method of the first, second, third or fourth aspect.
In a tenth aspect, the present application shows one or more machine-readable media having stored thereon executable code which, when executed, causes a processor to perform the control method of the first, second, third or fourth aspect.
Compared with the prior art, the embodiment of the application has the following advantages:
in the embodiments of the present application, a control voice for controlling the smart speaker is collected; it is determined whether the initiator of the control voice is a wake-up-free user; and when the initiator is a wake-up-free user, the smart speaker is controlled based on the control voice. Through the present application, the smart speaker allows a wake-up-free user to control it by voice simply by speaking the control voice, without speaking a wake-up word. Because no wake-up word needs to be uttered, the interaction between the user and the smart speaker is simpler and more convenient, which improves interaction efficiency and user experience.
Drawings
Fig. 1 is a flow chart illustrating a control method according to an exemplary embodiment of the present application.
Fig. 2 is a flow chart illustrating a method for determining a wake-free user according to an exemplary embodiment of the present application.
Fig. 3 is a flow chart illustrating a method for determining a wake-free user according to an exemplary embodiment of the present application.
Fig. 4 is a schematic view of a scenario illustrated in an exemplary embodiment of the present application.
Fig. 5 is a flow chart illustrating a method for determining a wake-free user according to an exemplary embodiment of the present application.
Fig. 6 is a flow chart illustrating a method of determining a wake-free user according to an exemplary embodiment of the present application.
Fig. 7 is a flowchart illustrating a method for controlling a smart speaker according to an exemplary embodiment of the present application.
FIG. 8 is a schematic diagram of a network architecture of an intent prediction model, as illustrated in an exemplary embodiment of the present application.
Fig. 9 is a flow chart illustrating a method for determining a control intent according to an exemplary embodiment of the present application.
Fig. 10 is a flowchart illustrating a method for controlling a smart speaker according to an exemplary embodiment of the present application.
Fig. 11 is a flow chart illustrating a control method according to an exemplary embodiment of the present application.
Fig. 12 is a flow chart illustrating a control method according to an exemplary embodiment of the present application.
Fig. 13 is a flow chart illustrating a control method according to an exemplary embodiment of the present application.
Fig. 14 is a block diagram showing a control apparatus according to an exemplary embodiment of the present application.
Fig. 15 is a block diagram showing a control apparatus according to an exemplary embodiment of the present application.
Fig. 16 is a block diagram showing a control apparatus according to an exemplary embodiment of the present application.
Fig. 17 is a block diagram showing a control apparatus according to an exemplary embodiment of the present application.
Fig. 18 is a schematic structural view of an apparatus according to an embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the present application may be more readily understood, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to fig. 1, a flow chart of a control method of the present application is shown, where the method is applied to an intelligent sound box, and the method may include:
in step S101, a control voice for controlling the smart speaker is collected;
in the present application, when the user needs to control the smart speaker by voice, the user can speak, to the smart speaker, a control voice for controlling the smart speaker, and the smart speaker can collect, based on a voice collection device, the control voice spoken by the user.
Wherein the voice acquisition device comprises a microphone and the like.
In the present application, the control voice includes a control instruction for controlling the smart speaker, for example an instruction such as "play Zhang San's songs" or "query today's temperature", and the control voice may not include a wake-up word of the smart speaker.
The present application aims to let a user control the smart speaker by speaking a control voice without speaking a wake-up word. Sometimes, however, the speech a user utters is not intended to control the smart speaker, for example an ordinary conversation with another person. If the smart speaker collected such speech and executed the flow of the present application on it, system resources of the smart speaker, such as CPU (Central Processing Unit) resources, memory resources and electric energy, would be wasted.
Therefore, in order to avoid wasting system resources of the smart speaker, in another embodiment of the present application a voice collection area may be set in the smart speaker in advance, for example a circular area centered on the smart speaker's location with a specific radius such as 1 m, 2 m or 3 m, which is not limited by the present application. The smart speaker collects speech emitted from within the voice collection area but not speech emitted from outside it. Thus, when a user needs to control the smart speaker by voice, the user can move into the voice collection area and then speak the control voice; the smart speaker collects, based on the voice collection device, the control voice spoken by the user located in the voice collection area, and then step S102 is executed.
When the user does not need to control the smart speaker by voice, for example when the user needs to talk normally with other people, the user can first leave the voice collection area. Because the user is then located outside the voice collection area, the smart speaker will not collect the speech the user utters in the conversation, so the system resources of the smart speaker are not wasted.
In step S102, determining whether the initiator of the control voice is a wake-up-free user;
in the present application, from the smart speaker's point of view, users who input control voice fall into wake-up-free users and non-wake-up-free users.
For a wake-up-free user, the smart speaker supports voice control based on the control voice alone, without a wake-up word being spoken.
For a non-wake-up-free user, the smart speaker does not support control by the control voice alone; a non-wake-up-free user can control the smart speaker by voice only by speaking the wake-up word and then speaking the control voice.
The present step may be specifically referred to the embodiments shown in fig. 2 to 6, and will not be described in detail herein.
In the case where the initiation of the control voice is a wake-up-free user, in step S103, the smart speaker is controlled based on the control voice.
In one embodiment of the present application, controlling the smart speaker based on the control voice includes: playing, through the speaker, voice information in response to the control voice.
For example, if the control voice input by the user is "what is the temperature today", the smart speaker can search for today's temperature according to this control voice. If the retrieved temperature is 20 °C to 25 °C, the corresponding voice information can be played through the speaker, so that the user learns that today's temperature is 20 °C to 25 °C.
In another embodiment of the present application, the smart speaker may also carry out multiple rounds of dialogue with the user, in which case it may operate in full-duplex mode: for example, it can collect one control voice through the microphone while playing, through the speaker, voice information in response to another control voice. That is, the smart speaker keeps receiving sound even while it is speaking.
In another embodiment of the present application, during the multiple rounds of dialogue between a wake-up-free user and the smart speaker, if a non-wake-up-free user speaks, the smart speaker collects the voice spoken by the non-wake-up-free user, determines from that voice that it was spoken by a non-wake-up-free user, and therefore does not respond to it.
The specific manner of controlling the smart speaker based on the control voice can be seen in the embodiments shown in fig. 7 to 10, and will not be described in detail herein.
In the case where the initiator of the control voice is a non-wake-up-free user, it may be detected whether a wake-up word input by the initiator was received before the control voice. If no wake-up word was received, the control voice may be left unprocessed, e.g., discarded. Alternatively, the smart speaker may prompt the initiator that he or she is not a wake-up-free user and must first speak the wake-up word for waking the smart speaker in order to control it through control voice, so that the initiator knows the wake-up word is required.
In the embodiments of the application, control voice for controlling the smart speaker is collected; it is determined whether the initiator of the control voice is a wake-up-free user; and in the case where the initiator is a wake-up-free user, the smart speaker is controlled based on the control voice. In this way, the smart speaker allows a wake-up-free user to control it by voice without speaking a wake-up word. Because no wake-up word needs to be spoken, the interaction between the user and the smart speaker becomes simpler and more convenient, which improves both interaction efficiency and user experience.
In one embodiment of the present application, referring to fig. 2, step S102 includes:
in step S201, identifying a voiceprint feature of the control speech;
the recognition method used to recognize the voiceprint features of the control voice is not limited in the present application; any recognition method falls within the protection scope of the application.
In step S202, in the case where the voiceprint feature is the voiceprint feature of a wake-up-free user, it is determined that the initiator is a wake-up-free user.
In the present application, the owner of the smart speaker can set up wake-up-free users on the smart speaker. For example, the owner can set himself as a wake-up-free user: the owner inputs a wake-up-free setting instruction into the smart speaker; upon receiving the instruction, the smart speaker uses the voice collection device to collect the owner's voice, recognizes the voiceprint features of that voice, and stores those voiceprint features in the smart speaker as the voiceprint features of a wake-up-free user.
The owner can also set other wake-up-free users for the smart speaker, for example setting users trusted by the owner as wake-up-free users, in the same manner as the owner sets himself as a wake-up-free user; see the detailed description of the setting method above.
In the present application, the voiceprint features of different users usually differ, so whether the initiator of the control voice is a wake-up-free user can be accurately determined from the voiceprint features.
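As one hedged illustration of step S202 (the application does not fix a matching algorithm), a voiceprint could be represented as an embedding vector and compared with the enrolled wake-up-free voiceprints by cosine similarity; the embedding values and the 0.8 threshold below are invented for the sketch:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def is_wake_free_user(voiceprint, enrolled_prints, threshold=0.8):
    """True if the voiceprint matches any enrolled wake-up-free voiceprint."""
    return any(cosine_similarity(voiceprint, ref) >= threshold
               for ref in enrolled_prints)

enrolled = [[0.9, 0.1, 0.4]]  # stored when the owner enrolls (illustrative)
print(is_wake_free_user([0.88, 0.12, 0.41], enrolled))  # close match -> True
print(is_wake_free_user([-0.5, 0.9, -0.1], enrolled))   # mismatch -> False
```
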
In one embodiment of the present application, referring to fig. 3, step S102 includes:
in step S301, determining a position of a wake-up free device communicatively connected to the smart speaker;
in the present application, the owner of the smart speaker can set up a wake-up-free device on the smart speaker. For example, the owner can set his own mobile phone as a wake-up-free device: the owner inputs a wake-up-free setting instruction carrying the device identifier of the mobile phone into the smart speaker; upon receiving the instruction, the smart speaker extracts the device identifier from it and stores the identifier in the smart speaker as the identifier of a wake-up-free device.
The owner can also set other wake-up-free devices for the smart speaker, for example setting the devices of users trusted by the owner as wake-up-free devices, in the same manner as described above.
Before the user needs to control the smart speaker by voice, the user can establish a communication connection between his device and the smart speaker. After the connection is established, the smart speaker can obtain the device identifier of the user's device over the connection, judge whether it matches a stored wake-up-free device identifier, and, if so, determine that the device is a wake-up-free device.
The hardware devices in the present application include the smart speaker and the wake-up-free device, and may of course also include other devices, such as a home router. The smart speaker and the wake-up-free device are each communicatively connected to the router, i.e., the smart speaker, the wake-up-free device, and the router are communicatively connected with one another, so the position of the wake-up-free device can be determined among them by triangulation; the specific determination method is not described in detail herein.
In step S302, determining a relative direction of the wake-up-free device with respect to the smart speaker according to the position;
in the present application, the position of the smart speaker can likewise be determined by triangulation among the smart speaker, the wake-up-free device, and the router (the specific determination method is not described in detail herein), and the relative direction of the wake-up-free device with respect to the smart speaker can then be determined from the position of the wake-up-free device and the position of the smart speaker.
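The triangulation step is not spelled out in the application; one minimal way to realize it, assuming range measurements to three devices at known coordinates (e.g. the router, the smart speaker, and one more reference point, all invented here) are available, is 2-D trilateration:

```python
import math

def trilaterate(anchors, distances):
    """Estimate a 2-D position from distances to three known anchors.

    Linearizes the three circle equations by subtracting the first from
    the other two, then solves the resulting 2x2 system by Cramer's rule.
    """
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = distances
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1**2 - d2**2 - x1**2 + x2**2 - y1**2 + y2**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = d1**2 - d3**2 - x1**2 + x3**2 - y1**2 + y3**2
    det = a1 * b2 - a2 * b1
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

# Router at (0, 0), smart speaker at (4, 0), extra reference at (0, 3);
# the device to locate is actually at (1, 1), so the ranges are known.
pos = trilaterate([(0, 0), (4, 0), (0, 3)],
                  [math.sqrt(2), math.sqrt(10), math.sqrt(5)])
print(pos)  # ≈ (1.0, 1.0)
```
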
In step S303, determining a source direction of the control speech;
in one embodiment of the present application, the smart speaker may include at least two voice capture devices;
thus, when collecting control voice for controlling the smart speaker, the smart speaker can collect the control voice separately through each of the at least two voice collection devices. The control voices collected by the devices are uttered by the same initiator and carry the same control content; however, because each voice collection device is at a different distance from the initiator, the phase information of the control voice collected by each device differs.
Thus, when determining the source direction of the control voice, the phase information of the control voice collected by each of the at least two voice collection devices can be determined, and the source direction can then be determined from that phase information.
For example, the collection time of the control voice at each of the at least two voice collection devices may be determined, then the time difference between those collection times, and the source direction of the control voice may then be determined from the time difference.
Referring to fig. 4, take the smart speaker including two voice collection devices, A and B, as an example. If the initiator is located at position S, the control voice is also uttered from position S.
Assume the collection time of the control voice at voice collection device A is T1 and at voice collection device B is T2. Since in fig. 4 the distance between A and S is greater than the distance between B and S, T1 is greater than T2. A perpendicular to segment AS can be drawn from B, giving perpendicular BM, where point M divides segment AS into two parts. Because position S is far from the devices, the spherical wavefront of the control voice can be treated near the devices as approximately a plane wave, so the propagation distance from S to point M is approximately the same as from S to point B. The length of segment AM is therefore the product of the speed of sound and the time difference, where the time difference is between the moment the control voice reaches device A and the moment it reaches device B.
Since the distance between voice collection devices A and B is known, the angle at A can be determined from that distance and the length of segment AM (in the right triangle ABM, cos∠A = AM / AB), and the source direction of the control voice can thereby be determined.
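The geometry above can be sketched numerically: under the far-field approximation, the angle at A follows from cos∠A = AM / AB, with AM equal to the speed of sound times the time difference. The microphone spacing and speed-of-sound constant below are illustrative assumptions:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumed)

def doa_angle(delta_t, mic_spacing):
    """Far-field direction-of-arrival angle at device A, in degrees.

    AM = SPEED_OF_SOUND * delta_t is the extra path to the farther
    device; cos(angle A) = AM / AB in the right triangle ABM.
    """
    am = SPEED_OF_SOUND * delta_t
    ratio = max(-1.0, min(1.0, am / mic_spacing))  # guard against rounding
    return math.degrees(math.acos(ratio))

# Devices 10 cm apart; the wavefront reaches B 0.1458 ms before A,
# i.e. AM = 343 * 1.458e-4 ≈ 0.05 m, so the angle at A is about 60°.
print(round(doa_angle(1.458e-4, 0.10)))  # ~60
```
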
In step S304, in the case that the relative direction is the same as the source direction, it is determined that the initiator of the control voice is a wake-free user.
In the present application, in general, a user carries his own device, or, even when not carrying it, is usually close to it; therefore the direction of the user relative to the smart speaker is usually the same as the direction of the user's device relative to the smart speaker.
Thus, when the relative direction is the same as the source direction, the initiator of the control voice is usually the owner of the wake-up-free device or a user authorized by the owner, and it can therefore be determined that the initiator of the control voice is a wake-up-free user.
In yet another embodiment of the present application, referring to fig. 5, step S102 includes:
in step S401, determining a source direction of the control voice;
this step is specifically described with reference to step S303, and will not be described in detail herein.
In step S402, an image including an initiator located in the source direction is acquired;
in the application, at least one image acquisition device, such as a camera, is arranged on the intelligent sound box, and images in any direction can be acquired based on the at least one image acquisition device.
In this application, because the control voice is sent by the initiator, the direction of the initiator of the control voice relative to the smart speaker and the source direction of the control voice are the same. Therefore, the collected image located in the direction of the source also includes the image of the initiator of the control voice.
In step S403, the facial features of the initiator of the control voice in the image are identified.
In the present application, the facial features of the initiator in the image may be identified by any identification method; the specific method is not limited, and any identification method falls within the protection scope of the application.
In step S404, in the case where the facial features are the facial features of a wake-up-free user, it is determined that the initiator is a wake-up-free user.
In the present application, the owner of the smart speaker can set up wake-up-free users on the smart speaker. For example, the owner can set himself as a wake-up-free user: the owner inputs a wake-up-free setting instruction into the smart speaker; upon receiving the instruction, the smart speaker uses the image collection device to capture an image of the owner's face, extracts the facial features of that image, and stores those facial features in the smart speaker as the facial features of a wake-up-free user.
The owner can also set other wake-up-free users for the smart speaker, for example setting users trusted by the owner as wake-up-free users, in the same manner as the owner sets himself as a wake-up-free user; see the detailed description of the setting method above.
In the present application, the facial features of different users usually differ, so whether the initiator of the control voice is a wake-up-free user can be accurately determined from the facial features.
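As a hedged illustration of step S404 (the application does not fix a face-matching algorithm), facial features could be compared as embedding vectors under a Euclidean-distance threshold; the embedding values and the 0.6 threshold are invented for the sketch:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def matches_wake_free_face(face_feature, enrolled_faces, max_dist=0.6):
    """True if the face embedding is close enough to an enrolled one."""
    return any(euclidean(face_feature, ref) <= max_dist
               for ref in enrolled_faces)

enrolled = [[0.1, 0.8, 0.3]]  # stored when the owner enrolls (illustrative)
print(matches_wake_free_face([0.15, 0.78, 0.33], enrolled))  # True
print(matches_wake_free_face([0.9, 0.1, 0.9], enrolled))     # False
```
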
In yet another embodiment of the present application, referring to fig. 6, step S102 includes:
in step S501, determining a relative direction of the initiator of the control voice with respect to the smart speaker;
in this application, because the control voice is sent by the initiator, the direction of the initiator of the control voice relative to the smart speaker and the source direction of the control voice are the same.
That is, the source direction of the control voice is the same as the relative direction of the initiator of the control voice with respect to the intelligent speaker.
Therefore, after the source direction of the control voice is determined, the relative direction of the control voice initiator relative to the intelligent sound box can be obtained.
The specific manner of determining the source direction of the control voice can be referred to the description of step S303, which is not described in detail herein.
In step S502, the most recently determined historical direction of the wake-up-free user relative to the smart speaker is obtained.
In the present application, during multiple rounds of dialogue between a user and the smart speaker, each time the smart speaker determines from a control voice that the user is a wake-up-free user and determines that user's direction relative to the smart speaker, it can store that direction in the smart speaker as the most recently determined historical direction of the wake-up-free user relative to the smart speaker.
Thus, in this step, the smart speaker can obtain the most recently stored historical direction of the wake-up-free user relative to the smart speaker, i.e., the most recently determined one.
In step S503, in the case that the difference between the relative direction and the historical direction is smaller than the preset difference, it is determined that the initiator of the control voice is a wake-up-free user.
In the present application, during multiple rounds of dialogue between a wake-up-free user and the smart speaker, the wake-up-free user inputs a control voice to the smart speaker in each round, and in general does not move far during the dialogue; that is, the position of the wake-up-free user does not change greatly across the rounds.
Therefore, if the difference between the most recently determined historical direction of the wake-up-free user relative to the smart speaker and the relative direction of the initiator of the control voice is small, the control voice was usually uttered by the wake-up-free user in an intermediate or final round of the dialogue with the smart speaker, so it can be determined that the initiator of the control voice is the wake-up-free user.
In the present application, a direction may be expressed as a direction angle, and the difference between two directions is the difference between their direction angles; for example, the difference between the relative direction and the historical direction is the difference between the two corresponding direction angles. The preset difference is an angle value, for example 30°, 40°, or 50°; the present application is not limited thereto, and the preset difference may be obtained from historical statistics.
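The direction comparison of step S503 can be sketched as follows; note that direction angles wrap around at 360°, so the smallest angular difference is used. The 30° preset difference is one of the example values above:

```python
def angle_diff(a, b):
    """Smallest absolute difference between two direction angles, degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def same_speaker_by_direction(current_dir, history_dir, preset_diff=30.0):
    """Treat the initiator as the wake-up-free user when the new source
    direction stays within the preset difference of the historical one."""
    return angle_diff(current_dir, history_dir) < preset_diff

print(same_speaker_by_direction(350.0, 10.0))  # 20° apart -> True
print(same_speaker_by_direction(80.0, 200.0))  # 120° apart -> False
```
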
In yet another embodiment of the present application, referring to fig. 7, step S103 includes:
in step S601, performing speech recognition on the control speech to obtain a control text corresponding to the control speech;
in this step, the control voice may be subjected to voice recognition by any voice recognition algorithm to obtain a control text corresponding to the control voice.
In step S602, determining a control intention of the control speech based at least on the control text;
in the present application, the control intention indicates what the user needs the smart speaker to do or what the user is asking it, e.g., controlling the smart speaker to play a song by Zhang San, asking which school Li Si graduated from, asking about today's weather, etc.
In the present application, the control intention of the control speech may be determined based on the control text by means of an intention prediction model.
The intention prediction model can be trained in advance, and specific training modes comprise:
11) acquiring a sample data set, wherein the sample data set includes sample control texts annotated with sample control intentions;
12) constructing the network structure of the intention prediction model;
wherein, referring to fig. 8, the network structure of the intent prediction model at least includes:
the system comprises a word segmentation layer, a coding layer, a bidirectional cyclic neural network, a polymerization layer and a full connection layer.
The control text may be input to a word segmentation layer, which is configured to segment the control text to obtain a plurality of words, and input the plurality of words to a coding layer.
The coding layer is used for converting the plurality of vocabularies into feature vectors respectively and inputting the plurality of feature vectors into the bidirectional cyclic neural network respectively.
The bidirectional recurrent neural network is used to supplement the features of each of the feature vectors based on the dependency relationships between adjacent feature vectors, and to input the feature-supplemented feature vectors into the aggregation layer.
The aggregation layer is used to aggregate the feature-supplemented feature vectors into an aggregate vector and to input the aggregate vector into the fully connected layer.
The fully connected layer is used for predicting control intents according to the aggregate vector.
The bidirectional cyclic neural network comprises a forward LSTM (Long Short-Term Memory) network and a backward LSTM network, wherein the forward LSTM network comprises a plurality of LSTM models which are sequentially connected, the backward LSTM network comprises a plurality of LSTM models which are sequentially connected, and the connection sequence among the plurality of LSTM models which are included in the forward LSTM network is opposite to the connection sequence among the plurality of LSTM models which are included in the backward LSTM network.
13) training the network parameters of the intention prediction model with the sample data set until the network parameters converge, yielding the trained intention prediction model.
In the process of training the intention prediction model, the parameters of the model are usually optimized according to the model's gradient values and the output values of its loss function until the network parameters converge. However, if the gradient values vanish during training, the network parameters cannot be accurately optimized from the loss function values alone, which disrupts normal training.
Therefore, to avoid this, the recurrent units of the bidirectional recurrent neural network may be LSTM (Long Short-Term Memory) models, which mitigate gradient vanishing. The loss function may be, for example, the mean squared error. Using LSTM models as the network units of the recurrent neural network prevents the gradient values of the intention prediction model from vanishing during training.
In this way, when the control intention of the control voice is determined based on the control text by means of the intention prediction model, the control text can be input into the intention prediction model, and the control intention of the control voice output by the intention prediction model can be obtained.
For example, the control text is input into the word segmentation layer; the word segmentation layer segments the control text into a plurality of words and inputs them into the coding layer; the coding layer converts each word into a feature vector, for example by one-hot encoding each word, and inputs the feature vectors into the bidirectional recurrent neural network.
When the feature vectors are input into the bidirectional recurrent neural network, any given feature vector can be input both into its corresponding LSTM model in the forward LSTM network and into its corresponding LSTM model in the backward LSTM network. For example, the positional order of the feature vector among the feature vectors of the words of the control text may be determined; the LSTM model corresponding to that positional order among the LSTM models of the forward LSTM network is taken as the feature vector's corresponding forward model, and the LSTM model corresponding to that positional order among the LSTM models of the backward LSTM network is taken as its corresponding backward model. The vector output by the feature vector's corresponding model in the forward LSTM network and the vector output by its corresponding model in the backward LSTM network can then be aggregated into one vector, which serves as the output vector corresponding to that feature vector.
The above operation is performed similarly for each of the other feature vectors.
Thus, an output vector corresponding to each feature vector is obtained.
The output vector corresponding to each feature vector can be aggregated into a vector, and the vector can be used as the output vector for obtaining the bidirectional recurrent neural network.
For example, assuming the feature vector of a certain word occupies the nth position among the feature vectors of the words of the control text, that feature vector is input into the nth LSTM model of the forward LSTM network and into the corresponding LSTM model of the backward LSTM network (which, because the backward network's connection order is reversed, is the nth model counted from the opposite end).
In this way, the intention prediction model can use not only the forward context of each position but also the backward context; that is, it can better exploit the ordering, positional, and dependency relations among the words of the control text. The model therefore has more information to draw on and its predictions are more complete; for example, the control intention determined later can be more accurate.
In one example, referring to fig. 9, a bidirectional recurrent neural network includes a forward LSTM network and a backward LSTM network.
The forward LSTM network comprises an LSTM model 1, an LSTM model 2, an LSTM model 3 and an LSTM model 4, wherein the LSTM model 1 is positioned before the LSTM model 2, the LSTM model 2 is positioned before the LSTM model 3, and the LSTM model 3 is positioned before the LSTM model 4 in sequence.
The backward LSTM network comprises an LSTM model 5, an LSTM model 6, an LSTM model 7 and an LSTM model 8, wherein the LSTM model 5 is positioned before the LSTM model 6, the LSTM model 6 is positioned before the LSTM model 7, and the LSTM model 7 is positioned before the LSTM model 8 in sequence.
Assume the control text contains vocabulary 1, vocabulary 2, vocabulary 3, and vocabulary 4, where vocabulary 1 precedes vocabulary 2, vocabulary 2 precedes vocabulary 3, and vocabulary 3 precedes vocabulary 4.
Feature vectors of respective words may be obtained, for example, feature vector 1 of word 1, feature vector 2 of word 2, feature vector 3 of word 3, and feature vector 4 of word 4.
Feature vector 1 may then be input into LSTM model 1, the output of LSTM model 1 and feature vector 2 into LSTM model 2, the output of LSTM model 2 and feature vector 3 into LSTM model 3, and the output of LSTM model 3 and feature vector 4 into LSTM model 4.
Also, the feature vector 4 may be input into the LSTM model 5, the output of the LSTM model 5 and the feature vector 3 may be input into the LSTM model 6, the output of the LSTM model 6 and the feature vector 2 may be input into the LSTM model 7, and the output of the LSTM model 7 and the feature vector 1 may be input into the LSTM model 8.
Thus, the LSTM model 1 and the LSTM model 8 each output a vector, and these two output vectors can be aggregated into one vector, which is the output vector 1 corresponding to the feature vector 1.
Likewise, the LSTM model 2 and the LSTM model 7 each output a vector, and these two output vectors can be aggregated into one vector, which is the output vector 2 corresponding to the feature vector 2.
The LSTM model 3 and the LSTM model 6 each output a vector, and these two output vectors can be aggregated into one vector, which is the output vector 3 corresponding to the feature vector 3.
The LSTM model 4 and the LSTM model 5 each output a vector, and these two output vectors can be aggregated into one vector, which is the output vector 4 corresponding to the feature vector 4.
Then, the output vector 1 corresponding to the feature vector 1, the output vector 2 corresponding to the feature vector 2, the output vector 3 corresponding to the feature vector 3, and the output vector 4 corresponding to the feature vector 4 may be aggregated (Concat) to obtain an aggregate vector. And inputs the aggregate vector into the full connection layer. The full connection layer predicts the control intention of the control voice according to the aggregate vector.
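The fig. 9 wiring can be sketched as follows, with a toy recurrent cell standing in for a trained LSTM (real gates and weights are omitted, so this illustrates only the data flow, not the learned behavior):

```python
def toy_cell(state, x):
    # stand-in for one LSTM step: mix the previous state with the new input
    return [0.5 * s + 0.5 * v for s, v in zip(state, x)]

def bidirectional_pass(feature_vectors):
    dim = len(feature_vectors[0])
    fwd_out, state = [], [0.0] * dim
    for x in feature_vectors:            # forward chain: LSTM models 1..4
        state = toy_cell(state, x)
        fwd_out.append(state)
    bwd_out, state = [], [0.0] * dim
    for x in reversed(feature_vectors):  # backward chain: LSTM models 5..8
        state = toy_cell(state, x)
        bwd_out.append(state)
    bwd_out.reverse()                    # align backward outputs by position
    aggregate = []                       # concat per-position pairs, then all
    for f, b in zip(fwd_out, bwd_out):
        aggregate.extend(f + b)
    return aggregate

# Four one-dimensional feature vectors, as in the fig. 9 walkthrough.
agg = bidirectional_pass([[1.0], [2.0], [3.0], [4.0]])
print(len(agg))  # 4 positions x (forward + backward) = 8 values
```

The aggregate vector would then go to the fully connected layer for intention prediction.
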
In one embodiment of the present application, in one possible scenario, the accuracy of a control intention determined from the control text alone may be low. For example, suppose Zhang San is a public figure who often recommends various types of items, e.g., good music, good restaurants, good movies, stores with big discounts, and the like.
Suppose the user needs to search, through the smart speaker, for "good music recommended by Zhang San", but the control voice the user inputs to the smart speaker is imprecise; for example, the control text corresponding to that control voice is "search for items recommended by Zhang San", which does not tell the smart speaker what type of items to search for. In this case, the items the smart speaker searches for based on the user's control voice are not necessarily the "good music recommended by Zhang San"; it may instead search for other types of items recommended by Zhang San, such as good restaurants, good movies, stores with big discounts, etc. The smart speaker would then provide the user a service the user did not intend to obtain, lowering the accuracy of the service it provides.
Therefore, in this case, in order to improve the accuracy of the services the smart speaker provides to users, when the control intention of the control voice is determined based on the control text, the current business scenario of the smart speaker can also be determined; the control intention is then determined based on both the business scenario and the control text.
For example, suppose the smart speaker is already providing a service in some business scenario: the user says "turn on the music player" to the smart speaker, the smart speaker turns on its music player, and the smart speaker thereby enters the business scenario of playing music. If the user then needs to search for the good music recommended by Zhang San through the smart speaker, the control voice may still be imprecise; for example, the corresponding control text may again be "search for goods recommended by Zhang San", which does not tell the smart speaker what type of goods to search for.
However, the smart speaker is now in the business scenario of playing music, so based on this scenario it can determine that when the user asks it to search for the goods recommended by Zhang San, the user actually needs it to search for the good music recommended by Zhang San.
This increases the likelihood that the service the smart speaker provides is the service the user originally wanted, improving the accuracy of the services the smart speaker provides to users.
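The scenario-based disambiguation described above can be sketched as follows; the scenario names, the keyword match, and the intent tuples are hypothetical illustrations of the idea, not the patent's implementation:

```python
def resolve_intent(control_text, business_scenario):
    """Refine an ambiguous control intention using the current business scenario.

    The scenario-to-category mapping and keyword test are illustrative.
    """
    scenario_to_category = {
        "playing_music": "music",
        "watching_movie": "movie",
    }
    if "recommended goods" in control_text:
        category = scenario_to_category.get(business_scenario)
        if category is not None:
            # e.g. in the music scenario, "recommended goods" means recommended music
            return ("search", category)
        return ("search", "unspecified")  # ambiguous without a known scenario
    return ("unknown", None)

intent = resolve_intent("search for Zhang San recommended goods", "playing_music")
```

With the speaker in the music-playing scenario, the ambiguous text resolves to a music search; without a scenario, the ambiguity remains.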
In step S603, determining an intention field in which the control intention is located;
the control field supported by the intelligent sound box comprises a music field, a weather field, a telephone making field, a shopping field, a route searching field and the like. The control field supported by the intelligent sound box can be determined by the manufacturer of the intelligent sound box or the owner of the intelligent sound box.
In the case where the control intention of the user's control voice falls within a control field supported by the smart speaker, the user can control the smart speaker based on the control voice.
In the case where the control intention of the user's control voice does not fall within a control field supported by the smart speaker, the user cannot control the smart speaker based on the control voice.
For example, the smart speaker does not support the medical field; if a wake-up-free user speaks the control voice "give me an injection" to the smart speaker, the smart speaker cannot respond to the user's instruction, that is, it cannot give the user an injection.
Therefore, after obtaining the control intention of the control voice, it is necessary to determine the intention field in which the control intention is located, and then step S604 is performed.
For any intention field supported by the smart speaker, the control intentions that belong to that field and can control the smart speaker can be counted in advance; each such control intention can then be stored together with the intention field as an entry in a correspondence between control intentions and intention fields. The same applies to every other intention field supported by the smart speaker.
Therefore, in this step, the intention field corresponding to the control intention can be looked up in the correspondence between control intentions and intention fields.
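The correspondence lookup of steps S603-S604 might look like the following minimal sketch; all table entries and names are illustrative:

```python
# Pre-built correspondence between control intentions and intention fields
# (entries are illustrative, not from the disclosure).
INTENT_TO_FIELD = {
    "play_song": "music",
    "query_weather": "weather",
    "dial_number": "phone_call",
    "buy_item": "shopping",
}
SUPPORTED_FIELDS = {"music", "weather", "phone_call", "shopping", "route_search"}

def field_of(control_intent):
    """Step S603: look up the intention field in which the control intention lies."""
    return INTENT_TO_FIELD.get(control_intent)

def can_control(control_intent):
    """Step S604: the speaker is controlled only if the field is supported."""
    return field_of(control_intent) in SUPPORTED_FIELDS
```

An unsupported intention such as a medical request simply finds no supported field, and the process ends.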
In step S604, if the intention field is an intention field supported by the smart speaker, the smart speaker is controlled based on the control intention.
In another embodiment, the process may be ended in the event that the intent domain is not an intent domain supported by the smart speakers.
In one scenario of the present application, multiple wake-up-free users are around the smart speaker. If these users need to control the smart speaker based on control voices at the same time, they can each input their own control voice to the smart speaker. In this case, the smart speaker receives multiple control voices issued respectively by multiple initiators, and at least two of those initiators are wake-up-free users.
In this case, referring to fig. 10, step S103 includes:
in step S701, determining priorities of at least two wake-up free users;
In the present application, the owner of the smart speaker may set, in advance, the priority of each of the smart speaker's wake-up-free users. For example, the voiceprint features of the wake-up-free users may be ordered in the smart speaker: the earlier a voiceprint feature appears in the order, the higher the priority of the corresponding wake-up-free user; the later it appears, the lower the priority.
Thus, for any collected control voice, once it is determined that the control voice was initiated by a wake-up-free user, the priority of that user can be determined, for example, from the position of the control voice's voiceprint feature among the voiceprint features ordered from high priority to low. The same applies to every other collected control voice, yielding the priority of each wake-up-free user.
Of course, facial features and the like may be used in addition to voiceprint features, which are not limited in this application.
In step S702, the smart speaker is controlled based on the control voice issued by the wake-up-free user with the highest priority.
The control voices issued by the other wake-up-free users can then be discarded.
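The priority-based selection above can be sketched as follows, assuming voiceprint IDs stand in for voiceprint features and the priority order is a pre-set list (both assumptions for illustration):

```python
def pick_control_voice(control_voices, priority_order):
    """Among simultaneous control voices, keep the one issued by the
    highest-priority wake-up-free user and discard the rest.

    control_voices: (voiceprint_id, text) pairs;
    priority_order: voiceprint IDs sorted from highest to lowest priority.
    """
    rank = {vp: i for i, vp in enumerate(priority_order)}
    eligible = [cv for cv in control_voices if cv[0] in rank]
    if not eligible:
        return None  # no wake-up-free user among the initiators
    return min(eligible, key=lambda cv: rank[cv[0]])

chosen = pick_control_voice(
    [("vp_guest", "play rock"), ("vp_owner", "play jazz")],
    ["vp_owner", "vp_guest"],
)
```

Here the owner outranks the guest, so only the owner's control voice is acted on.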
Through the present application, when multiple wake-up-free users of the smart speaker need to control it based on their respective control voices, the wake-up-free user with the highest priority is guaranteed to control the smart speaker smoothly through his or her control voice, avoiding the situation in which that user is prevented from doing so by interference from other wake-up-free users.
Referring to fig. 11, a flow chart of a control method of the present application is shown, where the method is applied to an intelligent sound box, and the method may include:
in step S801, a control voice for controlling the smart speaker is collected;
this step is specifically referred to step S101, and will not be described in detail herein.
In step S802, it is determined whether the control speech is a wake-up-free control speech;
the method can be realized by the following steps:
8021. performing voice recognition on the control voice to obtain a control text corresponding to the control voice;
in this step, the control voice may be subjected to voice recognition by any voice recognition algorithm to obtain a control text corresponding to the control voice.
8022. Judging whether the control text carries a wake-up-free keyword or not;
In the present application, the owner of the smart speaker can set wake-up-free keywords in the smart speaker; for example, the owner can set emergency keywords such as "dial 110", "dial 120", and "dial 119". The owner can input a wake-up-free setting instruction into the smart speaker; upon receiving the instruction, the smart speaker collects the owner's voice with its voice collection device, recognizes the voiceprint feature of that voice, and stores the voiceprint feature in the smart speaker as the voiceprint feature of a wake-up-free user. The smart speaker can also store the wake-up-free keywords set by the owner.
8023. And under the condition that the control text carries the wake-up-free keyword, determining the control voice as the wake-up-free control voice.
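Steps 8022-8023 amount to a keyword check over the recognized control text; a minimal sketch, assuming a simple substring match and an owner-set keyword list:

```python
# Wake-up-free keywords set by the owner (illustrative values from the text).
WAKE_FREE_KEYWORDS = {"dial 110", "dial 119", "dial 120"}

def is_wake_free_control(control_text):
    """A control voice is a wake-up-free control voice if its recognized
    text carries any of the wake-up-free keywords."""
    return any(kw in control_text for kw in WAKE_FREE_KEYWORDS)
```

A real system would likely normalize the text first (casing, punctuation), which is omitted here.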
If the control voice is the wake-up-free control voice, in step S803, the smart speaker is controlled based on the control voice.
In the embodiment shown in fig. 1, the smart speaker is controlled based on the control voice when the initiator of the control voice is a wake-up-free user. In this embodiment of the application, the identity of the initiator is not limited: regardless of whether the initiator is a wake-up-free user, the smart speaker can be controlled based on the control voice as long as the control voice is a wake-up-free control voice. This ensures that a user can control the smart speaker based on a control voice in special situations, for example, in an emergency in which the user needs to dial an emergency number such as 110, 119, or 120. Non-wake-up-free users are thus supported in controlling the smart speaker based on control voices without first inputting a wake-up word, improving the efficiency with which users control the smart speaker.
In this embodiment of the application, a control voice for controlling the smart speaker is collected; whether the control voice is a wake-up-free control voice is determined; and in the case that the control voice is a wake-up-free control voice, the smart speaker is controlled based on the control voice. Through the present application, voice control of the smart speaker can be realized based on a control voice spoken without a wake-up word, as long as that control voice is a wake-up-free control voice. Because the wake-up word need not be uttered, the interaction between the user and the smart speaker is simpler and more convenient, which improves interaction efficiency and user experience.
Referring to fig. 12, a flow chart of a control method of the present application is shown, where the method is applied to a smart speaker, and the method may include:
in step S901, a control voice for controlling the smart speaker is collected;
this step is specifically referred to step S101, and will not be described in detail herein.
In step S902, acquiring an acquisition time when the intelligent sound box acquires the control voice;
In one scenario, the smart speaker may serve a wide variety of users. For example, the staff of some institution may set up the smart speaker in the institution's lobby, so that the staff can conveniently rely on it in their work. However, such institutions also serve the general public at specific times, for example, providing route guidance inside a mall, shopping guidance inside a mall, and the like.
However, most customers may not be owners of the smart speaker and may not know its brand, so they may not know its wake-up word; if a wake-up word were required to control the smart speaker by voice, most customers would be unable to do so.
Thus, in this case, in order to allow customers to control the smart speaker by voice, the present application supports the smart speaker responding to a user's control voice without a wake-up word during a specific time period.
In the present application, the owner of the smart speaker can set a wake-up-free time in the smart speaker; for example, the owner can set the institution's public opening hours as the wake-up-free time. The smart speaker can store the wake-up-free time set by the owner.
In step S903, when the acquisition time is the wake-up-free time, the smart speaker is controlled based on the control voice.
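Step S903's time check can be sketched with the standard library's `datetime.time`; the opening-hours window used here is an illustrative assumption:

```python
from datetime import time

# Wake-up-free period set by the owner, e.g. the institution's opening hours
# (illustrative values).
WAKE_FREE_START = time(9, 0)
WAKE_FREE_END = time(21, 0)

def in_wake_free_time(collection_time):
    """Respond without a wake-up word only if the time at which the control
    voice was collected falls inside the wake-up-free period."""
    return WAKE_FREE_START <= collection_time <= WAKE_FREE_END
```

A window that crosses midnight would need a slightly different comparison, which is omitted for brevity.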
In the embodiment shown in fig. 1, the smart speaker is controlled based on the control voice when the initiator of the control voice is a wake-up-free user. In this embodiment of the application, the identity of the initiator is not limited: regardless of whether the initiator is a wake-up-free user, the smart speaker can be controlled based on the control voice as long as the collection time at which the smart speaker collected the control voice is the wake-up-free time.
This ensures that non-wake-up-free users can control the smart speaker in some special situations; non-wake-up-free users are thus supported in controlling the smart speaker based on control voices without first inputting a wake-up word, so that customers can obtain route guidance inside a mall, shopping guidance inside a mall, and the like from the smart speaker.
In this embodiment of the application, a control voice for controlling the smart speaker is collected; the collection time at which the smart speaker collected the control voice is acquired; and in the case that the collection time is the wake-up-free time, the smart speaker is controlled based on the control voice. Through the present application, voice control of the smart speaker can be realized based on a control voice spoken without a wake-up word, as long as the collection time is the wake-up-free time. Because the wake-up word need not be uttered, the interaction between the user and the smart speaker is simpler and more convenient, which improves interaction efficiency and user experience.
Referring to fig. 13, a flow chart of a control method of the present application is shown, where the method is applied to a smart speaker, and the method may include:
In step S1001, a control voice for controlling the smart speaker is collected;
this step is specifically referred to step S101, and will not be described in detail herein.
In step S1002, determining a position of the smart speaker;
In one scenario, the smart speaker may serve a wide range of users. For example, to conveniently provide customers with route guidance and shopping guidance inside a mall, a mall worker may temporarily place the smart speaker at the mall entrance, so that customers entering through the entrance can rely on the smart speaker to obtain those services.
However, most customers may not be owners of the smart speaker and may not know its brand, so they may not know its wake-up word; if a wake-up word were required to control the smart speaker by voice, most customers would be unable to do so.
Therefore, in this case, in order to enable customers to control the smart speaker by voice, the present application supports the smart speaker responding to a user's control voice without a wake-up word at a specific location.
In the present application, the owner of the smart speaker can set wake-up-free areas in the smart speaker; for example, the owner can set the area at a mall entrance, the area at a railway station entrance, the area of a bus station, the area at an airport entrance, and the like as wake-up-free areas. The smart speaker can store the wake-up-free areas set by the owner.
In step S1003, if the position is in the wake-up free area, the smart speaker is controlled based on the control voice.
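Step S1003's area check can be sketched as a point-in-region test; representing wake-up-free areas as axis-aligned boxes in a local coordinate frame is an illustrative assumption, since the patent does not specify how areas are encoded:

```python
def in_wake_free_area(position, areas):
    """Check whether the speaker's position lies inside any wake-up-free area.
    Each area is an axis-aligned box (x1, y1, x2, y2) -- an illustrative
    representation, not the patent's."""
    x, y = position
    return any(x1 <= x <= x2 and y1 <= y <= y2 for x1, y1, x2, y2 in areas)

MALL_ENTRANCE = (0.0, 0.0, 10.0, 5.0)  # illustrative wake-up-free area
```

A deployment using GPS coordinates would substitute a geodesic containment test, but the control flow is the same.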
In the embodiment shown in fig. 1, the smart speaker is controlled based on the control voice when the initiator of the control voice is a wake-up-free user. In this embodiment of the application, the identity of the initiator is not limited: regardless of whether the initiator is a wake-up-free user, the smart speaker can be controlled based on the control voice as long as the smart speaker is located in a wake-up-free area. This ensures that non-wake-up-free users can control the smart speaker based on control voices in some special situations, for example, when the smart speaker is set up at a mall entrance and customers need route guidance inside the mall, shopping guidance inside the mall, and the like. Such users can control the smart speaker without first inputting a wake-up word, so that customers can obtain those services from the smart speaker.
In this embodiment of the application, a control voice for controlling the smart speaker is collected; the position of the smart speaker is determined; and in the case that the position is in a wake-up-free area, the smart speaker is controlled based on the control voice. Through the present application, voice control of the smart speaker can be realized based on a control voice spoken without a wake-up word, as long as the smart speaker is in a wake-up-free area. Because the wake-up word need not be uttered, the interaction between the user and the smart speaker is simpler and more convenient, which improves interaction efficiency and user experience.
Referring to fig. 14, a block diagram of an embodiment of a control device of the present application is shown, and may specifically include the following modules:
the first acquisition module 11 is used for acquiring control voice for controlling the intelligent sound box;
a first determining module 12, configured to determine whether the initiator of the control voice is a wake-up-free user;
the first control module 13 is configured to control the smart speaker based on the control voice in a case that the initiator is free of waking up the user.
In an alternative implementation, the first determining module includes:
The first recognition unit is used for recognizing voiceprint features of the control voice;
and the first determining unit is used for determining that the initiator is the wake-up-free user under the condition that the voiceprint characteristic is the voiceprint characteristic of the wake-up-free user.
In an alternative implementation, the first determining module includes:
the second determining unit is used for determining the position of the wake-up-free equipment in communication connection with the intelligent sound box;
the third determining unit is used for determining the relative direction of the wake-up-free equipment relative to the intelligent sound box according to the position;
a fourth determining unit configured to determine a source direction of the control voice;
and a fifth determining unit, configured to determine that the initiator is a wake-up-free user when the relative direction is the same as the source direction.
In an alternative implementation, the first determining module includes:
a fourth determining unit configured to determine a source direction of the control voice;
the acquisition unit is used for acquiring an image including the initiator in the source direction;
a second recognition unit configured to recognize a facial feature of an initiator in the image;
a sixth determining unit, configured to determine that the initiator is a wake-free user if the facial feature is a wake-free facial feature of the user.
In an alternative implementation, the smart speaker includes at least two voice capture devices;
the first acquisition module is specifically configured to:
collecting, based on the at least two voice collection devices respectively, control voices for controlling the smart speaker;
the fourth determination unit includes:
the first determining subunit is used for determining phase information of the control voices respectively acquired by the at least two voice acquisition devices;
a second determination subunit for determining the source direction based on the phase information.
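The patent does not give a formula for deriving the source direction from the phase information; a common far-field approach for a two-microphone array estimates the angle from the inter-microphone time delay as θ = arcsin(c·Δt/d). A sketch under that assumption:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def source_direction(delay_s, mic_spacing_m):
    """Estimate the source direction (degrees from broadside) from the time
    delay between two microphones, assuming a far-field source.
    delay_s: inter-microphone delay derived from the phase difference."""
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.asin(ratio))
```

A zero delay means the source is directly in front of the array; a delay equal to the inter-microphone travel time means it is fully to one side.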
In an alternative implementation, the first determining module includes:
a seventh determining unit, configured to determine a relative direction of the initiator with respect to the intelligent sound box;
the acquisition unit is used for acquiring the history direction of the last determined wake-up-free user relative to the intelligent sound box;
an eighth determining unit, configured to determine that the initiator is free from waking up the user if a difference between the relative direction and the historical direction is smaller than a preset difference.
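The eighth determining unit's comparison can be sketched as a shortest-angular-difference test; the 15-degree threshold is an illustrative stand-in for the preset difference:

```python
def likely_wake_free(relative_dir_deg, historical_dir_deg, threshold_deg=15.0):
    """Treat the initiator as a wake-up-free user when the current source
    direction is close to the last known direction of a wake-up-free user
    relative to the smart speaker."""
    diff = abs(relative_dir_deg - historical_dir_deg) % 360.0
    diff = min(diff, 360.0 - diff)  # shortest angular difference, handles wrap-around
    return diff < threshold_deg
```

The modular arithmetic matters: directions of 352 and 5 degrees are only 13 degrees apart, not 347.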
In an alternative implementation, the first control module includes:
the third recognition unit is used for carrying out voice recognition on the control voice to obtain a control text corresponding to the control voice;
A ninth determination unit configured to determine a control intention of the control voice based at least on the control text;
a tenth determination unit configured to determine an intention field in which the control intention is located;
and the first control unit is used for controlling the intelligent sound box based on the control intention when the intention field is the intention field supported by the intelligent sound box.
In an alternative implementation, the ninth determining unit includes:
and the input subunit is used for inputting the control text into an intention prediction model to obtain the control intention output by the intention prediction model.
In an optional implementation manner, the ninth determining unit further includes:
the acquisition subunit is used for acquiring a sample data set, wherein the sample data set comprises sample control texts marked with sample control intents;
a building subunit, configured to build a network structure of the intent prediction model;
and the third determining subunit is used for training the network parameters in the intention prediction model by using the sample data set until the network parameters are converged to obtain the intention prediction model.
In an alternative implementation, the network structure of the intent prediction model includes at least:
a word segmentation layer, a coding layer, a bidirectional recurrent neural network, an aggregation layer, and a fully connected layer;
the word segmentation layer is used for segmenting the control text to obtain a plurality of words;
the coding layer is used for respectively converting a plurality of vocabularies into feature vectors;
the bidirectional recurrent neural network is used for supplementing the features of the feature vectors based on the dependency relationships between at least two adjacent feature vectors among the plurality of feature vectors;
the aggregation layer is used for aggregating a plurality of feature vectors with the feature supplemented to obtain an aggregation vector;
the full connection layer is used for predicting control intention according to the aggregate vector.
In an alternative implementation, the bidirectional recurrent neural network includes a forward long short-term memory (LSTM) network and a backward LSTM network;
the forward LSTM network comprises a plurality of LSTM models which are connected in sequence;
the backward LSTM network comprises a plurality of LSTM models which are sequentially connected;
the connection order between the plurality of LSTM models included in the forward LSTM network is opposite to the connection order between the plurality of LSTM models included in the backward LSTM network.
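The bidirectional structure can be sketched at the shape level in plain Python; real learned LSTM cells are replaced here by simple running averages purely to show how the forward pass, the oppositely ordered backward pass, and the per-position join fit together — this is not an LSTM implementation:

```python
def running_context(vectors):
    """Stand-in for one LSTM direction: each position gets the mean of all
    vectors seen so far (a real LSTM would learn this dependency)."""
    out, acc = [], [0.0] * len(vectors[0])
    for i, v in enumerate(vectors, 1):
        acc = [a + x for a, x in zip(acc, v)]
        out.append([a / i for a in acc])
    return out

def bidirectional_features(vectors):
    """The forward and backward passes run in opposite connection orders;
    their outputs are joined per position to supplement each feature vector."""
    fwd = running_context(vectors)
    bwd = list(reversed(running_context(list(reversed(vectors)))))
    return [f + b for f, b in zip(fwd, bwd)]

# Four 2-dimensional feature vectors (one per vocabulary word, as in the example).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
supplemented = bidirectional_features(feats)      # each now 4-dimensional
aggregate = [x for v in supplemented for x in v]  # Concat -> one 16-dim vector
```

Each supplemented vector carries context from both earlier and later words, which is what the opposite connection orders of the two LSTM networks provide.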
In an alternative implementation, the ninth determining unit includes:
A fourth determining subunit, configured to determine a service scenario where the intelligent sound box is currently located;
and a fifth determining subunit, configured to determine the control intent based on the service scenario and the control text.
In an alternative implementation manner, the tenth determining unit is specifically configured to: in the correspondence between the control intention and the intention field, the intention field corresponding to the control intention is searched.
In an alternative implementation, there are multiple control voices, issued respectively by multiple initiators, and at least two of the initiators are wake-up-free users;
the first control module includes:
an eleventh determining unit, configured to determine priorities of at least two wake-up-free users;
and the second control unit is used for controlling the intelligent sound box based on the control voice sent by the wake-up-free user with high priority.
In this embodiment of the application, a control voice for controlling the smart speaker is collected; whether the initiator of the control voice is a wake-up-free user is determined; and in the case that the initiator is a wake-up-free user, the smart speaker is controlled based on the control voice. Through the present application, the smart speaker supports wake-up-free users in realizing voice control of the smart speaker based on a control voice spoken without a wake-up word. Because the wake-up word need not be uttered, the interaction between the user and the smart speaker is simpler and more convenient, which improves interaction efficiency and user experience.
Referring to fig. 15, a block diagram of an embodiment of a control device of the present application is shown, and may specifically include the following modules:
the second acquisition module 21 is used for acquiring control voice for controlling the intelligent sound box;
a second determining module 22, configured to determine whether the control speech is a wake-up-free control speech;
and the second control module 23 is used for controlling the intelligent sound box based on the control voice under the condition that the control voice is the wake-up-free control voice.
In an alternative implementation, the second determining module includes:
a fourth recognition unit, configured to perform voice recognition on the control voice to obtain a control text corresponding to the control voice;
the judging unit is used for judging whether the control text carries the awakening-free keywords or not;
and the twelfth determining unit is used for determining that the control voice is the wake-up-free control voice under the condition that the wake-up-free keyword is carried in the control text.
In this embodiment of the application, the identity of the initiator is not limited: regardless of whether the initiator is a wake-up-free user, the smart speaker can be controlled based on the control voice as long as the control voice is a wake-up-free control voice. This ensures that a user can control the smart speaker based on a control voice in special situations, for example, in an emergency in which the user needs to dial an emergency number such as 110, 119, or 120. Non-wake-up-free users are thus supported in controlling the smart speaker based on control voices without first inputting a wake-up word, improving the efficiency with which users control the smart speaker.
In this embodiment of the application, a control voice for controlling the smart speaker is collected; whether the control voice is a wake-up-free control voice is determined; and in the case that the control voice is a wake-up-free control voice, the smart speaker is controlled based on the control voice. Through the present application, voice control of the smart speaker can be realized based on a control voice spoken without a wake-up word, as long as that control voice is a wake-up-free control voice. Because the wake-up word need not be uttered, the interaction between the user and the smart speaker is simpler and more convenient, which improves interaction efficiency and user experience.
Referring to fig. 16, a block diagram of an embodiment of a control device of the present application is shown, and may specifically include the following modules:
the third collection module 31 is configured to collect control voice for controlling the intelligent sound box;
the acquisition module 32 is configured to acquire an acquisition time when the intelligent sound box acquires the control voice;
and the third control module 33 is configured to control the intelligent sound box based on the control voice when the acquisition time is a wake-up-free time.
In this embodiment of the application, the identity of the initiator is not limited: regardless of whether the initiator is a wake-up-free user, the smart speaker can be controlled based on the control voice as long as the collection time at which the smart speaker collected the control voice is the wake-up-free time.
This ensures that non-wake-up-free users can control the smart speaker in some special situations; non-wake-up-free users are thus supported in controlling the smart speaker based on control voices without first inputting a wake-up word, so that customers can obtain route guidance inside a mall, shopping guidance inside a mall, and the like from the smart speaker.
In the embodiment of the application, control voice for controlling the intelligent sound box is collected; the collection time at which the intelligent sound box collects the control voice is acquired; and under the condition that the collection time is a wake-up-free time, the intelligent sound box is controlled based on the control voice. Through the method and the device, a user can control the intelligent sound box by voice without first uttering a wake-up word, provided the collection time is a wake-up-free time. Because the wake-up word need not be uttered, the interaction process between the user and the intelligent sound box is simpler and more convenient, so that the interaction efficiency can be improved and the user experience can be improved.
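The wake-up-free-time check described above can be sketched as a simple time-window test. The concrete window (chosen here to resemble mall opening hours) and the function name are illustrative assumptions, not specified by the patent:

```python
# Illustrative sketch: wake-up-free-time check for a collected control voice.
from datetime import time

# Assumed wake-up-free window, e.g. mall opening hours (hypothetical values).
WAKE_FREE_START = time(9, 0)
WAKE_FREE_END = time(21, 0)

def is_wake_free_time(collection_time: time) -> bool:
    """True when the control voice was collected inside the wake-up-free window."""
    return WAKE_FREE_START <= collection_time <= WAKE_FREE_END
```

When the check returns True, the speaker would be controlled based on the control voice regardless of whether the initiator is a wake-up-free user.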
Referring to fig. 17, a block diagram of an embodiment of a control device of the present application is shown, and may specifically include the following modules:
a fourth collection module 41, configured to collect control voice for controlling the intelligent sound box;
a third determining module 42, configured to determine a location of the smart sound box;
and the fourth control module 43 is configured to control the smart speaker based on the control voice when the location is in the wake-up-free area.
In the embodiment of the application, the identity of the initiator is not limited: no matter whether the initiator of the control voice is a wake-up-free user or not, the intelligent sound box can be controlled based on the control voice under the condition that the intelligent sound box is located in a wake-up-free area. Thus, even a non-wake-up-free user can control the intelligent sound box based on control voice under some special conditions; for example, when the intelligent sound box is arranged at the entrance of a mall and customers need services such as in-mall route navigation and in-mall shopping guidance, a non-wake-up-free user can control the intelligent sound box by control voice without inputting a wake-up word to the intelligent sound box, so that customers can obtain such services from the intelligent sound box.
In the embodiment of the application, control voice for controlling the intelligent sound box is collected; the position of the intelligent sound box is determined; and under the condition that the position is in a wake-up-free area, the intelligent sound box is controlled based on the control voice. Through the method and the device, a user can control the intelligent sound box by voice without first uttering a wake-up word, provided the position is in a wake-up-free area. Because the wake-up word need not be uttered, the interaction process between the user and the intelligent sound box is simpler and more convenient, so that the interaction efficiency can be improved and the user experience can be improved.
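The wake-up-free-area check can be sketched as a geofence test against the speaker's position. The circular area, its coordinates and radius, and the helper names below are illustrative assumptions; the patent does not prescribe how the area is represented:

```python
# Illustrative sketch: geofence test for the wake-up-free area.
import math

# Assumed circular wake-up-free area around a mall entrance (hypothetical values).
AREA_CENTER = (31.2304, 121.4737)   # (latitude, longitude) in degrees
AREA_RADIUS_M = 50.0                # radius in metres

def _haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def in_wake_free_area(position):
    """True when the speaker position lies within the assumed wake-up-free area."""
    return _haversine_m(position, AREA_CENTER) <= AREA_RADIUS_M
```

A polygonal area or a venue-identifier lookup would work equally well; the circle is used here only to keep the sketch short.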
The embodiment of the application also provides a non-volatile readable storage medium storing one or more modules (programs); when the one or more modules are applied to a device, the device may be caused to execute instructions of each method step in the embodiment of the application.
Embodiments of the present application provide one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an electronic device to perform a method as described in one or more of the above embodiments. In this embodiment of the present application, the electronic device includes a server, a gateway, and a sub-device; the sub-device is, for example, an Internet of Things device.
Embodiments of the present disclosure may be implemented as an apparatus for performing a desired configuration using any suitable hardware, firmware, software, or any combination thereof, which may include a server (cluster), a terminal device, such as an IoT device, or the like.
Fig. 18 schematically illustrates an example apparatus 1300 that may be used to implement various embodiments described herein.
For one embodiment, fig. 18 illustrates an example apparatus 1300 having one or more processors 1302, a control module (chipset) 1304 coupled to at least one of the processor(s) 1302, a memory 1306 coupled to the control module 1304, a non-volatile memory (NVM)/storage 1308 coupled to the control module 1304, one or more input/output devices 1310 coupled to the control module 1304, and a network interface 1312 coupled to the control module 1304.
The processor 1302 may include one or more single-core or multi-core processors, and the processor 1302 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 1300 can be a server device such as a gateway as described in embodiments of the present application.
In some embodiments, the apparatus 1300 may include one or more computer-readable media (e.g., memory 1306 or NVM/storage 1308) having instructions 1314 and one or more processors 1302 combined with the one or more computer-readable media configured to execute the instructions 1314 to implement the modules to perform actions described in this disclosure.
For one embodiment, the control module 1304 may include any suitable interface controller to provide any suitable interface to at least one of the processor(s) 1302 and/or any suitable device or component in communication with the control module 1304.
The control module 1304 may include a memory controller module to provide an interface to the memory 1306. The memory controller modules may be hardware modules, software modules, and/or firmware modules.
Memory 1306 may be used to load and store data and/or instructions 1314 for device 1300, for example. For one embodiment, memory 1306 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, memory 1306 may include double data rate type four synchronous dynamic random access memory (DDR 4 SDRAM).
For one embodiment, the control module 1304 may include one or more input/output controllers to provide interfaces to the NVM/storage 1308 and the input/output device(s) 1310.
For example, NVM/storage 1308 may be used to store data and/or instructions 1314. NVM/storage 1308 may include any suitable nonvolatile memory (e.g., flash memory) and/or may include any suitable nonvolatile storage device(s) (e.g., hard disk drive(s) (HDD), compact disk drive(s) (CD) and/or digital versatile disk drive (s)).
NVM/storage 1308 may include storage resources that are physically part of the device on which apparatus 1300 is installed, or may be accessible by the device without necessarily being part of the device. For example, NVM/storage 1308 may be accessed over a network via input/output device(s) 1310.
Input/output device(s) 1310 may provide an interface for apparatus 1300 to communicate with any other suitable device; input/output device(s) 1310 may include communication components, audio components, sensor components, and the like. The network interface 1312 may provide an interface for the device 1300 to communicate over one or more networks, and the device 1300 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, for example accessing a wireless network based on a communication standard such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof.
For one embodiment, at least one of the processor(s) 1302 may be packaged together with logic of one or more controllers (e.g., memory controller modules) of the control module 1304. For one embodiment, at least one of the processor(s) 1302 may be packaged together with logic of one or more controllers of the control module 1304 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 1302 may be integrated on the same mold as logic of one or more controllers of the control module 1304. For one embodiment, at least one of the processor(s) 1302 may be integrated on the same die with logic of one or more controllers of the control module 1304 to form a system on chip (SoC).
In various embodiments, apparatus 1300 may be, but is not limited to being: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), among other terminal devices. In various embodiments, the apparatus 1300 may have more or fewer components and/or different architectures. For example, in some embodiments, apparatus 1300 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and a speaker.
The embodiment of the application provides a server, which comprises: one or more processors; and one or more machine-readable media having instructions stored thereon, which when executed by the one or more processors, cause the server to perform the inter-device communication method as described in one or more of the embodiments of the present application.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, those skilled in the art may make additional variations and modifications to those embodiments once they learn of the basic inventive concept. It is therefore intended that the appended claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the present application.
Finally, it is further noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in a process, method, article, or terminal device comprising the element.
The foregoing has described in detail the control method and apparatus provided herein. Specific examples have been used to illustrate the principles and embodiments of the present application; the above examples are provided only to assist in understanding the method and core ideas of the present application. Meanwhile, those skilled in the art may make modifications to the specific embodiments and the application scope in accordance with the ideas of the present application; in view of the above, the content of this description should not be construed as limiting the present application.

Claims (36)

1. A control method is applied to an intelligent sound box and comprises the following steps:
collecting control voice for controlling the intelligent sound box;
determining whether the initiator of the control voice is a wake-up-free user;
under the condition that the initiator is a wake-up-free user, controlling the intelligent sound box based on the control voice;
the control voice is multiple, and the control voice is respectively sent by multiple sponsors; and at least two wake-up-free users in a plurality of sponsors;
the controlling the intelligent sound box based on the control voice comprises the following steps:
determining priorities of at least two wake-up-free users;
and controlling the intelligent sound box based on the control voice sent by the high-priority wake-up-free user.
2. The method of claim 1, the determining whether the initiator of the control voice is a wake-up-free user comprising:
identifying voiceprint features of the control speech;
and determining that the initiator is the wake-free user under the condition that the voiceprint characteristic is the voiceprint characteristic of the wake-free user.
3. The method of claim 1, the determining whether the initiator of the control voice is a wake-up-free user comprising:
determining the position of wake-up-free equipment in communication connection with the intelligent sound box;
Determining the relative direction of the wake-up-free equipment relative to the intelligent sound box according to the position;
determining a source direction of the control speech;
and determining that the initiator is a wake-up-free user under the condition that the relative direction is the same as the source direction.
4. The method of claim 1, the determining whether the initiator of the control voice is a wake-up-free user comprising:
determining a source direction of the control speech;
collecting an image including the initiator located in the source direction;
identifying facial features of the initiator in the image;
and determining that the initiator is a wake-up-free user under the condition that the facial features are facial features of a wake-up-free user.
5. The method of claim 3 or 4, the smart speaker comprising at least two voice capture devices;
the collecting control voice for controlling the intelligent sound box comprises:
collecting, based on the at least two voice collection devices respectively, control voices for controlling the intelligent sound box;
the determining the source direction of the control voice comprises:
determining phase information of control voices respectively acquired by at least two voice acquisition devices;
The source direction is determined based on the phase information.
6. The method of claim 1, the determining whether the initiator of the control voice is a wake-up-free user comprising:
determining the relative direction of the initiator relative to the intelligent sound box;
acquiring the history direction of the last determined wake-up-free user relative to the intelligent sound box;
and determining that the initiator is a wake-free user under the condition that the difference between the relative direction and the historical direction is smaller than a preset difference.
7. The method of claim 1, the controlling the smart speaker based on the control voice, comprising:
performing voice recognition on the control voice to obtain a control text corresponding to the control voice;
determining a control intent of the control speech based at least on the control text;
determining the intention field of the control intention;
and controlling the intelligent sound box based on the control intention under the condition that the intention field is the intention field supported by the intelligent sound box.
8. The method of claim 7, the determining a control intent of the control speech based at least on the control text, comprising:
And inputting the control text into an intention prediction model to obtain the control intention output by the intention prediction model.
9. The method of claim 8, the training of the intent prediction model comprising:
acquiring a sample data set, wherein the sample data set comprises sample control text marked with sample control intention;
constructing a network structure of an intention prediction model;
training network parameters in the intention prediction model by using the sample data set until the network parameters are converged, so as to obtain the intention prediction model.
10. The method of claim 9, the network structure of the intent prediction model comprising at least:
the system comprises a word segmentation layer, a coding layer, a bidirectional cyclic neural network, a polymerization layer and a full connection layer;
the word segmentation layer is used for segmenting the control text to obtain a plurality of words;
the coding layer is used for respectively converting a plurality of vocabularies into feature vectors;
the bidirectional cyclic neural network is used for respectively supplementing features of the plurality of feature vectors based on the dependency relationship between at least two adjacent feature vectors in the plurality of feature vectors;
the aggregation layer is used for aggregating a plurality of feature vectors with the feature supplemented to obtain an aggregation vector;
The full connection layer is used for predicting control intention according to the aggregate vector.
11. The method of claim 10, the bidirectional recurrent neural network comprising a forward long short term memory network LSTM network and a backward LSTM network;
the forward LSTM network comprises a plurality of LSTM models which are connected in sequence;
the backward LSTM network comprises a plurality of LSTM models which are sequentially connected;
the connection order between the plurality of LSTM models included in the forward LSTM network is opposite to the connection order between the plurality of LSTM models included in the backward LSTM network.
12. The method of claim 7, the determining a control intent of the control speech based at least on the control text, comprising:
determining a current service scene of the intelligent sound box;
the control intent is determined based on the business scenario and the control text.
13. The method of claim 7, the determining an intent domain in which the control intent is located comprising:
searching, in a correspondence between control intentions and intention fields, for the intention field corresponding to the control intention.
14. A control method is applied to an intelligent sound box and comprises the following steps:
collecting control voice for controlling the intelligent sound box;
Determining whether the control voice is a wake-up-free control voice;
under the condition that the control voice is wake-up-free control voice, controlling the intelligent sound box based on the control voice;
the control voice is multiple, and the plurality of control voices are respectively uttered by a plurality of initiators; and at least two of the plurality of initiators are wake-up-free users;
the controlling the intelligent sound box based on the control voice comprises the following steps:
determining priorities of at least two wake-up-free users;
and controlling the intelligent sound box based on the control voice sent by the high-priority wake-up-free user.
15. The method of claim 14, the determining whether the control voice is a wake-up-free control voice comprising:
performing voice recognition on the control voice to obtain a control text corresponding to the control voice;
judging whether the control text carries a wake-up-free keyword or not;
and under the condition that the control text carries the wake-up-free keyword, determining the control voice as the wake-up-free control voice.
16. A control method is applied to an intelligent sound box and comprises the following steps:
collecting control voice for controlling the intelligent sound box;
acquiring the acquisition time of the intelligent sound box when the control voice is acquired;
Under the condition that the acquisition time is the wake-up-free time, controlling the intelligent sound box based on the control voice;
the control voice is multiple, and the plurality of control voices are respectively uttered by a plurality of initiators; and at least two of the plurality of initiators are wake-up-free users;
the controlling the intelligent sound box based on the control voice comprises the following steps:
determining priorities of at least two wake-up-free users;
and controlling the intelligent sound box based on the control voice sent by the high-priority wake-up-free user.
17. A control method is applied to an intelligent sound box and comprises the following steps:
collecting control voice for controlling the intelligent sound box;
determining the position of the intelligent sound box;
controlling the intelligent sound box based on the control voice under the condition that the position is in the wake-up-free area;
the control voice is multiple, and the plurality of control voices are respectively uttered by a plurality of initiators; and at least two of the plurality of initiators are wake-up-free users;
the controlling the intelligent sound box based on the control voice comprises the following steps:
determining priorities of at least two wake-up-free users;
and controlling the intelligent sound box based on the control voice sent by the high-priority wake-up-free user.
18. A control device applied to an intelligent sound box, comprising:
the first acquisition module is used for acquiring control voice for controlling the intelligent sound box;
the first determining module is used for determining whether the initiator of the control voice is a wake-up-free user;
the first control module is used for controlling the intelligent sound box based on the control voice under the condition that the initiator is a wake-up-free user;
the control voice is multiple, and the plurality of control voices are respectively uttered by a plurality of initiators; and at least two of the plurality of initiators are wake-up-free users;
the first control module includes:
an eleventh determining unit, configured to determine priorities of at least two wake-up-free users;
and the second control unit is used for controlling the intelligent sound box based on the control voice sent by the wake-up-free user with high priority.
19. The apparatus of claim 18, the first determination module comprising:
the first recognition unit is used for recognizing voiceprint features of the control voice;
and the first determining unit is used for determining that the initiator is the wake-up-free user under the condition that the voiceprint characteristic is the voiceprint characteristic of the wake-up-free user.
20. The apparatus of claim 18, the first determination module comprising:
The second determining unit is used for determining the position of the wake-up-free equipment in communication connection with the intelligent sound box;
the third determining unit is used for determining the relative direction of the wake-up-free equipment relative to the intelligent sound box according to the position;
a fourth determining unit configured to determine a source direction of the control voice;
and a fifth determining unit, configured to determine that the initiator is a wake-up-free user when the relative direction is the same as the source direction.
21. The apparatus of claim 18, the first determination module comprising:
a fourth determining unit configured to determine a source direction of the control voice;
the acquisition unit is used for acquiring an image including the initiator in the source direction;
a second recognition unit configured to recognize a facial feature of an initiator in the image;
a sixth determining unit, configured to determine that the initiator is a wake-up-free user if the facial features are facial features of a wake-up-free user.
22. The apparatus of claim 20 or 21, the smart speaker comprising at least two voice capture devices;
the first acquisition module is specifically configured to:
collect, based on the at least two voice collection devices respectively, control voices for controlling the intelligent sound box;
The fourth determination unit includes:
the first determining subunit is used for determining phase information of the control voices respectively acquired by the at least two voice acquisition devices;
a second determination subunit for determining the source direction based on the phase information.
23. The apparatus of claim 18, the first determination module comprising:
a seventh determining unit, configured to determine a relative direction of the initiator with respect to the intelligent sound box;
the acquisition unit is used for acquiring the history direction of the last determined wake-up-free user relative to the intelligent sound box;
an eighth determining unit, configured to determine that the initiator is a wake-up-free user if a difference between the relative direction and the historical direction is smaller than a preset difference.
24. The apparatus of claim 18, the first control module comprising:
the third recognition unit is used for carrying out voice recognition on the control voice to obtain a control text corresponding to the control voice;
a ninth determination unit configured to determine a control intention of the control voice based at least on the control text;
a tenth determination unit configured to determine an intention field in which the control intention is located;
And the first control unit is used for controlling the intelligent sound box based on the control intention when the intention field is the intention field supported by the intelligent sound box.
25. The apparatus of claim 24, the ninth determination unit comprising:
and the input subunit is used for inputting the control text into an intention prediction model to obtain the control intention output by the intention prediction model.
26. The apparatus of claim 25, the ninth determination unit further comprising:
the acquisition subunit is used for acquiring a sample data set, wherein the sample data set comprises sample control texts marked with sample control intents;
a building subunit, configured to build a network structure of the intent prediction model;
and the third determining subunit is used for training the network parameters in the intention prediction model by using the sample data set until the network parameters are converged to obtain the intention prediction model.
27. The apparatus of claim 26, the network structure of the intent prediction model comprising at least:
the system comprises a word segmentation layer, a coding layer, a bidirectional cyclic neural network, a polymerization layer and a full connection layer;
the word segmentation layer is used for segmenting the control text to obtain a plurality of words;
The coding layer is used for respectively converting a plurality of vocabularies into feature vectors;
the bidirectional cyclic neural network is used for respectively supplementing features of the plurality of feature vectors based on the dependency relationship between at least two adjacent feature vectors in the plurality of feature vectors;
the aggregation layer is used for aggregating a plurality of feature vectors with the feature supplemented to obtain an aggregation vector;
the full connection layer is used for predicting control intention according to the aggregate vector.
28. The apparatus of claim 27, the bidirectional recurrent neural network comprising a forward long short term memory network LSTM network and a backward LSTM network;
the forward LSTM network comprises a plurality of LSTM models which are connected in sequence;
the backward LSTM network comprises a plurality of LSTM models which are sequentially connected;
the connection order between the plurality of LSTM models included in the forward LSTM network is opposite to the connection order between the plurality of LSTM models included in the backward LSTM network.
29. The apparatus of claim 24, the ninth determination unit comprising:
a fourth determining subunit, configured to determine a service scenario where the intelligent sound box is currently located;
and a fifth determining subunit, configured to determine the control intent based on the service scenario and the control text.
30. The apparatus of claim 24, the tenth determining unit is specifically configured to: search, in a correspondence between control intentions and intention fields, for the intention field corresponding to the control intention.
31. A control device applied to an intelligent sound box, comprising:
the second acquisition module is used for acquiring control voice for controlling the intelligent sound box;
the second determining module is used for determining whether the control voice is a wake-up-free control voice or not;
the second control module is used for controlling the intelligent sound box based on the control voice under the condition that the control voice is the wake-up-free control voice;
the control voice is multiple, and the plurality of control voices are respectively uttered by a plurality of initiators; and at least two of the plurality of initiators are wake-up-free users;
the second control module includes:
an eleventh determining unit, configured to determine priorities of at least two wake-up-free users;
and the second control unit is used for controlling the intelligent sound box based on the control voice sent by the wake-up-free user with high priority.
32. The apparatus of claim 31, the second determination module comprising:
a fourth recognition unit, configured to perform voice recognition on the control voice to obtain a control text corresponding to the control voice;
The judging unit is used for judging whether the control text carries the awakening-free keywords or not;
and the twelfth determining unit is used for determining that the control voice is the wake-up-free control voice under the condition that the wake-up-free keyword is carried in the control text.
33. A control device applied to an intelligent sound box, comprising:
a third acquisition module, configured to acquire a control voice for controlling the intelligent sound box;
an acquiring module, configured to acquire the acquisition time at which the intelligent sound box acquired the control voice;
a third control module, configured to control the intelligent sound box based on the control voice when the acquisition time falls within a wake-up-free period;
wherein there are multiple control voices, each uttered by a different initiator, and at least two of the initiators are wake-up-free users;
the third control module comprises:
an eleventh determining unit, configured to determine the priorities of the at least two wake-up-free users;
and a second control unit, configured to control the intelligent sound box based on the control voice uttered by the wake-up-free user with the highest priority.
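Claim 33 makes the wake-up-free decision temporal: a command is obeyed without a wake word only when it was captured inside a configured wake-up-free period. A minimal sketch, assuming an illustrative 07:00–22:00 window (the patent specifies no concrete hours):

```python
# Sketch of claim 33: accept wake-word-free commands only when the voice was
# captured inside a configured wake-up-free time window.
# The 07:00-22:00 window is an illustrative assumption.

from datetime import time

WAKE_FREE_START = time(7, 0)
WAKE_FREE_END = time(22, 0)

def in_wake_free_period(capture: time) -> bool:
    """Claim 33: check whether the acquisition time falls in the window.
    Also handles windows that wrap past midnight (start > end)."""
    if WAKE_FREE_START <= WAKE_FREE_END:
        return WAKE_FREE_START <= capture <= WAKE_FREE_END
    return capture >= WAKE_FREE_START or capture <= WAKE_FREE_END
```

The midnight-wrap branch matters for a window like 22:00–07:00, where the two boundary comparisons must be OR-ed rather than chained.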
34. A control device applied to an intelligent sound box, comprising:
a fourth acquisition module, configured to acquire a control voice for controlling the intelligent sound box;
a third determining module, configured to determine the location of the intelligent sound box;
a fourth control module, configured to control the intelligent sound box based on the control voice when the location is within a wake-up-free area;
wherein there are multiple control voices, each uttered by a different initiator, and at least two of the initiators are wake-up-free users;
the fourth control module comprises:
an eleventh determining unit, configured to determine the priorities of the at least two wake-up-free users;
and a second control unit, configured to control the intelligent sound box based on the control voice uttered by the wake-up-free user with the highest priority.
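Claim 34 makes the same decision spatial: the speaker accepts wake-word-free commands only when its position lies inside a configured wake-up-free area. The sketch below models the area as a circle around a reference point; the coordinate system, center, and radius are assumptions for illustration, since the patent does not fix a geometry.

```python
# Sketch of claim 34: accept wake-word-free commands only when the speaker's
# position lies inside a configured wake-up-free area, modeled here as a
# circle. Center coordinates and radius are illustrative assumptions.

import math

AREA_CENTER = (0.0, 0.0)   # hypothetical reference point, in meters
AREA_RADIUS_M = 5.0        # hypothetical radius of the wake-up-free area

def in_wake_free_area(x: float, y: float) -> bool:
    """Claim 34: check whether the speaker's position is inside the area."""
    return math.dist((x, y), AREA_CENTER) <= AREA_RADIUS_M
```

A deployment could equally define the area as a named room reported by an indoor-positioning system; the circle is just the simplest concrete test.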
35. An intelligent sound box, comprising:
a processor; and
a memory having executable code stored thereon that, when executed, causes the processor to perform the control method of one or more of claims 1-17.
36. One or more machine-readable media having executable code stored thereon that, when executed, causes a processor to perform the control method of one or more of claims 1-17.
CN202010167783.8A 2020-03-11 2020-03-11 Control method and device Active CN113393834B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010167783.8A CN113393834B (en) 2020-03-11 2020-03-11 Control method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010167783.8A CN113393834B (en) 2020-03-11 2020-03-11 Control method and device

Publications (2)

Publication Number Publication Date
CN113393834A CN113393834A (en) 2021-09-14
CN113393834B true CN113393834B (en) 2024-04-16

Family

ID=77615423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010167783.8A Active CN113393834B (en) 2020-03-11 2020-03-11 Control method and device

Country Status (1)

Country Link
CN (1) CN113393834B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816192A (en) * 2020-07-07 2020-10-23 云知声智能科技股份有限公司 Voice equipment and control method, device and equipment thereof

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106328132A (en) * 2016-08-15 2017-01-11 歌尔股份有限公司 Voice interaction control method and device for intelligent equipment
KR20170100722A (en) * 2016-02-26 2017-09-05 Seoyon Electronics Co., Ltd. Smart key system with a function to recognize users' voices
CN107315561A (en) * 2017-06-30 2017-11-03 联想(北京)有限公司 Data processing method and electronic equipment
CN108737933A (en) * 2018-05-30 2018-11-02 上海与德科技有限公司 Dialogue method, device and electronic equipment based on intelligent sound box
CN108762104A (en) * 2018-05-17 2018-11-06 江西午诺科技有限公司 Speaker control method, device, readable storage medium and mobile terminal
CN108958810A (en) * 2018-02-09 2018-12-07 北京猎户星空科技有限公司 Voiceprint-based user identification method, device and equipment
CN108962260A (en) * 2018-06-25 2018-12-07 福来宝电子(深圳)有限公司 Multi-person voice command recognition method, system and storage medium
CN109104664A (en) * 2018-06-25 2018-12-28 福来宝电子(深圳)有限公司 Control method and system for intelligent sound box, intelligent sound box and storage medium
WO2019007245A1 (en) * 2017-07-04 2019-01-10 阿里巴巴集团控股有限公司 Processing method, control method and recognition method, and apparatus and electronic device therefor
CN109286875A (en) * 2018-09-29 2019-01-29 百度在线网络技术(北京)有限公司 Method, apparatus, electronic device and storage medium for directional sound pickup
CN109308908A (en) * 2017-07-27 2019-02-05 深圳市冠旭电子股份有限公司 Voice interaction method and device
CN109326289A (en) * 2018-11-30 2019-02-12 深圳创维数字技术有限公司 Wake-up-free voice interaction method, device, equipment and storage medium
CN109545206A (en) * 2018-10-29 2019-03-29 百度在线网络技术(北京)有限公司 Voice interaction processing method and device for smart device, and smart device
CN109637548A (en) * 2018-12-29 2019-04-16 苏州思必驰信息科技有限公司 Voice interaction method and device based on voiceprint recognition
CN110491387A (en) * 2019-08-23 2019-11-22 三星电子(中国)研发中心 Interactive service method and system based on multiple terminals
WO2019223102A1 (en) * 2018-05-22 2019-11-28 平安科技(深圳)有限公司 Method and apparatus for checking validity of identity, terminal device and medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Baidu AI Black Technology: the Xiaodu At Home smart speaker; Chen Dongsheng; Computer & Network; 2018-11-26 (No. 22); full text *
Research on voice assistant capability evaluation and trend analysis; Li Siwei; Cheng Guifeng; He Shuangwang; Zhang Di; Guangdong Communication Technology (No. 12); full text *

Also Published As

Publication number Publication date
CN113393834A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
KR102458805B1 (en) Multi-user authentication on a device
US11915699B2 (en) Account association with device
CN110140168B (en) Contextual hotwords
US11734326B2 (en) Profile disambiguation
CN110097870B (en) Voice processing method, device, equipment and storage medium
WO2016028628A2 (en) System and method for speech validation
US11443730B2 (en) Initiating synthesized speech output from a voice-controlled device
CN108831508A (en) Voice activity detection method, device and equipment
US11258671B1 (en) Functionality management for devices
US11626104B2 (en) User speech profile management
WO2022206602A1 (en) Speech wakeup method and apparatus, and storage medium and system
US20240013784A1 (en) Speaker recognition adaptation
CN110992942A (en) Voice recognition method and device for voice recognition
CN113362812A (en) Voice recognition method and device and electronic equipment
US20230064756A1 (en) Streaming End-to-End Speech Recognition Method, Apparatus and Electronic Device
CN113362813A (en) Voice recognition method and device and electronic equipment
CN113393834B (en) Control method and device
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN112242143B (en) Voice interaction method and device, terminal equipment and storage medium
US10529324B1 (en) Geographical based voice transcription
CN112259076A (en) Voice interaction method and device, electronic equipment and computer readable storage medium
US20210249033A1 (en) Speech processing method, information device, and computer program product
US11783805B1 (en) Voice user interface notification ordering
US11646035B1 (en) Dialog management system
US11461779B1 (en) Multi-speechlet response

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40058138

Country of ref document: HK

GR01 Patent grant