CN110335607B - Voice instruction execution method and device and electronic equipment


Info

Publication number
CN110335607B
Authority
CN
China
Prior art keywords
sound source, source object, directional, voice, determining
Prior art date
Legal status
Active
Application number
CN201910766037.8A
Other languages
Chinese (zh)
Other versions
CN110335607A (en)
Inventor
杜国威
Current Assignee
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Anyun Century Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Anyun Century Technology Co Ltd
Priority to CN201910766037.8A
Publication of CN110335607A
Application granted
Publication of CN110335607B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26: Speech to text systems
    • G10L2015/223: Execution procedure of a spoken command
    • G10L2015/226: Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a method for executing a voice instruction, which comprises the following steps: when voice commands sent by at least two sound source objects are received simultaneously, the positions of the at least two sound source objects are determined; acquiring images of at least two sound source objects according to the positions of the at least two sound source objects to obtain at least one frame of sound source image; determining a target sound source object based on a first directivity characteristic when the at least one frame of sound source image has the first directivity characteristic, the first directivity characteristic being used for indicating the position of the target sound source object; and executing the voice command sent by the target sound source object. The invention also discloses an execution device of the voice instruction and electronic equipment.

Description

Voice instruction execution method and device and electronic equipment
Technical Field
The invention relates to the technical field of intelligent electronic equipment, in particular to a method and a device for executing a voice instruction and electronic equipment.
Background
With the rapid development of artificial intelligence, intelligent voice devices such as smart speakers and intelligent speech recognition robots are constantly emerging. Because an intelligent voice device can intelligently recognize a user's voice commands without manual operation, it greatly facilitates the user's control of smart devices, and intelligent speech recognition technology has accordingly seen great development.
In the related art, intelligent speech recognition mainly relies on a speech front-end processing module: before feature extraction, the original speech is processed by the front-end processing module, so that noise and the variation introduced by different speakers are partially eliminated and various noise interferences are suppressed, making the speech to be recognized cleaner and better able to reflect the essential features of the speech.
However, when multiple persons issue voice commands simultaneously, an intelligent voice device in the related art cannot accurately determine which voice command should be executed.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for executing a voice command, and an electronic device, so as to solve the problem that an intelligent voice device in the related art cannot accurately determine which voice command should be executed when multiple people issue voice commands simultaneously.
To achieve the above object, according to a first aspect of the present invention, there is provided a method for executing a voice command, including:
when voice commands sent by at least two sound source objects are received simultaneously, determining the positions of the at least two sound source objects;
acquiring images of the at least two sound source objects according to the positions of the at least two sound source objects to obtain at least one frame of sound source image;
determining a target sound source object based on a first directivity characteristic when the at least one frame of sound source image has the first directivity characteristic, the first directivity characteristic being used for indicating a position of the target sound source object;
and executing the voice instruction sent by the target sound source object.
In one alternative, the first directional characteristic comprises a first directional gesture of at least one sound source object;
the determining a target sound source object based on the first directivity characteristic includes:
and determining the target sound source object according to the position pointed by the first directional gesture.
In an alternative, the determining the target sound source object according to the position pointed by the first directional gesture includes:
acquiring the number of first directional gestures pointing to the position of each sound source object in the case that a first directional characteristic points to a plurality of sound source objects;
acquiring a first position with the largest number of the first directional gestures;
determining the sound source object at the first position as the target sound source object.
In an alternative, the determining the target sound source object according to the position pointed by the first directional gesture includes:
acquiring multi-frame sound source images within a preset time length, and respectively acquiring the number of frames in which the first directional gesture points to the position of each sound source object;
acquiring a second position, the same position at which the first directional gesture points in the largest number of frames;
determining the sound source object at the second position as the target sound source object.
In an alternative, the determining the target sound source object according to the position pointed by the first directional gesture includes:
acquiring, in sound source images of a preset number of frames, the moving distance of the first directional gesture pointing to the position of the sound source object;
acquiring a third position pointed by the first directional gesture with the largest moving distance;
determining the sound source object at the third position as the target sound source object.
In an optional manner, after determining the positions of at least two sound source objects when receiving voice commands issued by the at least two sound source objects at the same time, the method further includes:
and converting the voice instruction into text information and displaying the text information.
In an optional mode, after the voice command is converted into text information and displayed, the method further includes:
and adding sound source object identification to the text information.
In an optional manner, before determining the positions of at least two sound source objects when receiving voice commands issued by the at least two sound source objects at the same time, the method further includes:
acquiring an environment image of the surrounding environment;
and when the environment image has a second directional characteristic, executing the step of receiving the voice instruction.
In an optional manner, the second directional characteristic comprises a second directional gesture pointing at the execution subject.
In an optional manner, the second directional characteristic comprises a third directional gesture pointing away from the execution subject;
after receiving voice commands sent by at least two sound source objects and determining the positions of the at least two sound source objects, the method further comprises the following steps:
and masking the voice instruction sent by the sound source object outside the preset range of the indication direction of the third directional gesture.
According to a second aspect of the present invention, there is provided an apparatus for executing a voice command, comprising:
the receiving module is used for determining the positions of at least two sound source objects when voice instructions sent by the at least two sound source objects are received simultaneously;
the acquisition module is used for acquiring images of the at least two sound source objects according to the positions of the at least two sound source objects and acquiring at least one frame of sound source image;
a determining module, configured to determine a target sound source object based on a first directivity characteristic when the at least one frame of sound source image has the first directivity characteristic, where the first directivity characteristic is used to indicate a position of the target sound source object;
and the execution module is used for executing the voice instruction sent by the target sound source object.
In one alternative, the first directional characteristic comprises a first directional gesture of at least one sound source object;
the determining module is specifically configured to determine the target sound source object according to a position pointed by the first directional gesture.
In an optional manner, the obtaining module is further configured to obtain, in a case where a first directional characteristic points to a plurality of sound source objects, the number of first directional gestures pointing to the position of each of the sound source objects;
the obtaining module is further configured to obtain a first position to which the largest number of first directional gestures point;
the determining module is further configured to determine the sound source object at the first position as the target sound source object.
In an optional mode, the obtaining module is further configured to obtain multiple frames of sound source images within a preset time length; respectively acquiring the frame number of the first directional gesture pointing to the position of each sound source object;
the acquisition module is further configured to acquire a second position, the same position at which the first directional gesture points in the largest number of frames;
the determining module is further configured to determine the sound source object at the second position as the target sound source object.
In an optional manner, the obtaining module is further configured to obtain, in the sound source image with a preset number of frames, a moving distance when the first directional gesture points to the position of the sound source object;
the obtaining module is further configured to obtain a third position pointed by the first directional gesture with the largest moving distance;
the determining module is further configured to determine the sound source object at the third position as the target sound source object.
In an alternative, the apparatus further comprises:
and the display module is used for converting the voice instruction into text information and displaying the text information after the positions of the at least two sound source objects are determined when voice instructions sent by the at least two sound source objects are received simultaneously.
In an alternative, the apparatus further comprises:
and the adding module is used for adding a sound source object identifier to the text information after the voice instruction is converted into text information and displayed.
In an alternative, the apparatus further comprises:
the system comprises an image acquisition module, a voice recognition module and a voice recognition module, wherein the image acquisition module is used for acquiring an environment image of the surrounding environment before determining the positions of at least two sound source objects when receiving voice instructions sent by the at least two sound source objects at the same time;
the execution module is used for executing the step of receiving the voice instruction when the environment image has a second directivity characteristic.
In an optional manner, the second directional characteristic comprises a second directional gesture pointing at the execution subject.
In an optional manner, the second directional characteristic comprises a third directional gesture pointing away from the execution subject;
the device further comprises:
and the masking module is used for masking the voice instruction sent by the sound source object outside the preset range of the indication direction of the third directional gesture.
According to a third aspect of the present invention, there is provided an electronic apparatus comprising:
the device comprises a memory, a processor and a communication bus, wherein the memory is in communication connection with the processor through the communication bus;
the memory has stored therein computer-executable instructions for execution by the processor to perform the method provided in any of the alternatives of the first aspect of the invention.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, which when executed, implement the method provided in any alternative of the first aspect of the present invention.
The invention provides a voice instruction execution method, a voice instruction execution device, electronic equipment and a computer readable storage medium; the execution method of the voice instruction comprises the following steps: when voice commands sent by at least two sound source objects are received simultaneously, the positions of the at least two sound source objects are determined; acquiring images of at least two sound source objects according to the positions of the at least two sound source objects to obtain at least one frame of sound source image; determining a target sound source object based on a first directivity characteristic when the at least one frame of sound source image has the first directivity characteristic, the first directivity characteristic being used for indicating the position of the target sound source object; and executing the voice command sent by the target sound source object. In this way, when voice instructions sent by a plurality of sound source objects are received, the target sound source object is determined according to the position of the sound source object specified by the first directivity characteristic in the sound source image by positioning the sound source object and collecting at least one frame of sound source image; then, executing a voice instruction sent by the target sound source object; the problem that the voice instruction to be executed cannot be accurately determined when a plurality of sound sources send the voice instruction at the same time is solved, and the accuracy of executing the voice instruction to be executed in a multi-person environment is improved; meanwhile, when a plurality of people send out voice commands, the electronic equipment does not need to be awakened again after one voice command is executed, and the execution efficiency of the voice commands is improved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings.
Fig. 1 is a schematic application scenario diagram of a method for executing a voice instruction according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating an implementation of a method for executing voice commands according to an embodiment of the present application;
FIG. 3 is a flowchart of an implementation of a method for executing a voice command according to another embodiment of the present application;
fig. 4A is a schematic view of a specific application scenario of a method for executing a voice instruction according to an embodiment of the present application;
fig. 4B is a schematic view of another specific application scenario of the method for executing a voice instruction according to the embodiment of the present application;
fig. 4C is a schematic view of another specific application scenario of the method for executing a voice instruction according to the embodiment of the present application;
fig. 4D is a schematic diagram of another specific application scenario of the method for executing a voice instruction according to the embodiment of the present application;
fig. 4E is a schematic view of another specific application scenario of the method for executing a voice instruction according to the embodiment of the present application;
fig. 4F is a schematic view of another specific application scenario of the method for executing a voice instruction according to the embodiment of the present application;
fig. 4G is a schematic diagram of another specific application scenario of the method for executing a voice instruction according to the embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for executing a voice command according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the description of the embodiments of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Fig. 1 is a schematic application scenario diagram of a method for executing a voice instruction according to an embodiment of the present application.
Referring to fig. 1, the voice instruction execution method provided in an embodiment of the present application is applied to an electronic device. Specifically, in an embodiment of the present application, the electronic device may be a smart speaker, a smart phone with a voice control function, a notebook computer, a tablet computer, or the like. It should be noted that fig. 1 illustrates a smart sound box as an example and does not limit the specific form of the electronic device. Referring to fig. 1, during the use of smart sound box 11, there may be a plurality of sound source objects: for example, first sound source object 12 and second sound source object 13 may issue voice commands to smart sound box 11 simultaneously, or one after another. In the related art, the original voice is processed so that the influence of noise and of different speakers is partially eliminated and various interferences are suppressed. Because this suppression strategy is applied to all other sounds, when a plurality of persons speak (for example, the two persons shown in fig. 1) and first sound source object 12 speaks before second sound source object 13, smart sound box 11 can only respond to and execute the voice command issued by first sound source object 12; no matter what voice command second sound source object 13 issues, or how many times it is repeated, smart sound box 11 cannot respond to it, and second sound source object 13 can only wait for smart sound box 11 to finish executing the voice command of first sound source object 12, wake up smart sound box 11 again, and issue the voice command again. In another possible scenario, first sound source object 12 and second sound source object 13 issue voice commands simultaneously, and smart sound box 11 then has difficulty determining which sound source object's voice command should be executed. It should be noted that the multi-person scenario shown in fig. 1 is only one application scenario used to exemplify the voice instruction execution method provided in the embodiment of the present application; it is understood that the method is also applicable to a single-person scenario.
Fig. 2 is a flowchart illustrating an implementation of a method for executing a voice command according to an embodiment of the present application.
Referring to fig. 2, the voice instruction execution method provided in an embodiment of the present application may specifically be used for electronic devices such as a smart speaker, a smart phone, a notebook computer, a personal digital computer, or a tablet computer; of course, the voice instruction execution method provided in the embodiment of the present application may also be used for other electronic devices having a voice control function, which are not listed one by one in this embodiment. The method comprises the following steps:
step 201, when voice commands sent by at least two sound source objects are received simultaneously, positions of the at least two sound source objects are determined.
Specifically, in this embodiment, the simultaneous reception of voice commands issued by at least two sound source objects may be: voice commands uttered at the same time by at least two sound source objects, for example the first sound source object 12 and the second sound source object 13 shown in fig. 1; it can also be: while the voice commands of some of the at least two sound source objects are not yet completed, voice commands issued by other sound source objects are received. For example, first sound source object 12 shown in fig. 1 is issuing a voice command to smart sound box 11 when second sound source object 13 also issues a voice command to smart sound box 11; in some specific scenarios, for example, the first sound source object 12 issues the voice command "help me play the music of Zhou Jielun" while, at the same time, the second sound source object 13 issues the voice command "help me play the music of Wang Fei".
Specifically, in the present embodiment, determining the positions of the at least two sound source objects may be done by locating the sound source objects that issue the voice commands according to the sound source localization principle. Specifically, when receiving voice commands from the first sound source object 12 and the second sound source object 13 at the same time, the smart sound box 11 shown in fig. 1 detects the positions of the sound sources through its microphone array and determines the positions of the first sound source object 12 and the second sound source object 13 in space; smart sound box 11 then forms two different beams according to those positions so as to separately acquire the sounds coming from the first sound source object 12 and the second sound source object 13.
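The patent does not commit to a concrete localization algorithm beyond "the sound source localization principle". As a minimal illustrative sketch (not the patented method itself), the bearing of a source can be estimated from the time difference of arrival between two microphones using GCC-PHAT; the function names, the two-microphone geometry, and the 343 m/s speed of sound below are assumptions for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for room-temperature air

def gcc_phat(sig, ref, fs, max_tau):
    """Estimate the delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = sig.size + ref.size
    spec = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    spec /= np.abs(spec) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(spec, n=n)
    max_shift = max(1, int(fs * max_tau))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def bearing_from_pair(sig, ref, fs, mic_spacing):
    """Direction of arrival (degrees off broadside) for one microphone pair."""
    tau = gcc_phat(sig, ref, fs, mic_spacing / SPEED_OF_SOUND)
    # clip so rounding noise cannot push the argument outside arcsin's domain
    s = np.clip(tau * SPEED_OF_SOUND / mic_spacing, -1.0, 1.0)
    return np.degrees(np.arcsin(s))
```

In a real array, several such pairwise bearings would be fused, and the same geometry would then steer one beam per located source, matching the two beams described above.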
Step 202, acquiring images of at least two sound source objects according to the positions of the at least two sound source objects, and acquiring at least one frame of sound source image.
Specifically, in the present embodiment, the smart sound box 11 is a sound box having a camera function, and after the smart sound box 11 determines the positions of the first sound source object 12 and the second sound source object 13 according to the sound source localization principle, the camera of the smart sound box 11 performs image acquisition on the first sound source object 12 and the second sound source object 13, respectively. In some optional embodiments, the camera function of the smart sound box 11 may always be active, and after the smart sound box 11 determines the positions of the first sound source object 12 and the second sound source object 13, the image acquisition step is performed directly; in other optional embodiments, the camera function of the smart sound box 11 may be in a sleep state, and after the smart sound box 11 determines the positions of the first sound source object 12 and the second sound source object 13, the camera function of the smart sound box 11 is awakened and the image acquisition step is performed. Specifically, in this embodiment, the smart sound box 11 may continuously collect images of the first sound source object 12 and the second sound source object 13, acquiring at least one frame of sound source image in the process; specifically, the sound source image may include images of the first sound source object 12 and the second sound source object 13, and may also include an image of the surrounding environment.
Step 203, when at least one frame of sound source image has a first directivity characteristic, determining a target sound source object based on the first directivity characteristic, wherein the first directivity characteristic is used for indicating the position of the target sound source object.
Specifically, in the present embodiment, a feature recognition technique may be used to recognize features in the at least one frame of sound source image. In the present embodiment, the first directivity characteristic may be a directional operation of the first sound source object 12 or the second sound source object 13, for example a pointing gesture made with an extended index finger (illustrated in the original specification by an inline gesture figure). In this embodiment, when the sound source image has the first directivity characteristic, the position to which the first directivity characteristic points may also be identified; specifically, the position at which the first directivity characteristic is directed may be the position of the first sound source object 12 or of the second sound source object 13. Thus, from the position at which the first directivity characteristic is directed, it is determined whether the first directivity characteristic points to the first sound source object 12 or to the second sound source object 13.
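The embodiment leaves the concrete recognition method open ("a feature recognition technique"). One illustrative way to decide which sound source a pointing gesture indicates, assuming a pose estimator has already returned wrist and index-fingertip pixel coordinates, is to pick the located source closest to the pointing ray; every name and threshold below is a hypothetical illustration, not the patent's method.

```python
import numpy as np

def pointed_source(wrist, fingertip, source_positions, max_dist=80.0):
    """Return the index of the sound source closest to the pointing ray,
    or None if no source lies within `max_dist` pixels of the ray."""
    origin = np.asarray(fingertip, dtype=float)
    direction = origin - np.asarray(wrist, dtype=float)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return None                      # degenerate detection, no direction
    direction /= norm
    best, best_d = None, max_dist
    for i, pos in enumerate(source_positions):
        rel = np.asarray(pos, dtype=float) - origin
        t = rel @ direction              # projection onto the pointing ray
        if t <= 0.0:
            continue                     # source lies behind the fingertip
        d = np.linalg.norm(rel - t * direction)   # distance from the ray
        if d < best_d:
            best, best_d = i, d
    return best
```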
Step 204, executing the voice command sent by the target sound source object.
Specifically, as shown in fig. 1, when the sound source objects are two sound source objects, that is, a first sound source object 12 and a second sound source object 13, if the first directivity characteristic points at the position of the first sound source object 12, it is determined that the first sound source object 12 is the target sound source object; at this time, smart sound box 11 executes the voice instruction issued by first sound source object 12; if the first directivity characteristic points to the position of the second sound source object 13, it is determined that the second sound source object 13 is the target sound source object; at this time, smart sound box 11 executes the voice instruction issued by second sound source object 13.
The method for executing the voice instruction provided by the embodiment of the application comprises the following steps: when voice commands sent by at least two sound source objects are received simultaneously, the positions of the at least two sound source objects are determined; acquiring images of at least two sound source objects according to the positions of the at least two sound source objects to obtain at least one frame of sound source image; determining a target sound source object based on a first directivity characteristic when the at least one frame of sound source image has the first directivity characteristic, the first directivity characteristic being used for indicating the position of the target sound source object; and executing the voice command sent by the target sound source object. In this way, when voice instructions sent by a plurality of sound source objects are received, the target sound source object is determined according to the position of the sound source object specified by the first directivity characteristic in the sound source image by positioning the sound source object and collecting at least one frame of sound source image; then, executing a voice instruction sent by the target sound source object; the problem that the voice instruction to be executed cannot be accurately determined when a plurality of sound sources send the voice instruction at the same time is solved, and the accuracy of executing the voice instruction to be executed in a multi-person environment is improved; meanwhile, when a plurality of people send out voice commands, the electronic equipment does not need to be awakened again after one voice command is executed, and the execution efficiency of the voice commands is improved.
Referring to fig. 1 and fig. 3 to 4G, another embodiment of the present application provides a method for executing a voice command, including the following steps:
step 301, when receiving voice commands sent by at least two sound source objects at the same time, determining the positions of the at least two sound source objects.
Specifically, in the present embodiment, the simultaneous reception of voice commands issued by at least two sound source objects may be voice commands uttered at the same time by at least two sound source objects, for example the first sound source object 12 and the second sound source object 13 shown in fig. 1; it may also be that, while the voice commands of some of the at least two sound source objects are not yet completed, voice commands issued by other sound source objects are received; for example, fig. 1 shows first sound source object 12 issuing a voice command to smart sound box 11 while second sound source object 13 in turn issues a voice command to smart sound box 11.
Specifically, in the present embodiment, determining the positions of the at least two sound source objects may be done by locating the sound source objects that issue the voice commands according to the sound source localization principle. Specifically, when receiving voice commands from the first sound source object 12 and the second sound source object 13 at the same time, the smart sound box 11 shown in fig. 1 detects the positions of the sound sources through its microphone array and determines the positions of the first sound source object 12 and the second sound source object 13 in space; smart sound box 11 then forms two different beams according to those positions so as to separately acquire the sounds coming from the first sound source object 12 and the second sound source object 13.
In some optional implementations, after step 301, the method for executing a voice instruction provided in this embodiment of the present application further includes:
and converting the voice instruction into text information and displaying the text information.
Specifically, in this embodiment, the vocabulary content of the voice command issued by the sound source object can be converted into computer-readable input, such as a binary code or a character sequence, by using Automatic Speech Recognition (ASR) technology, and the converted input is then rendered as specific text information displayed on the display screen of the smart sound box 11. For example, "please help me play the music of Zhou Jielun" or "please help me play the music of Wang Fei" is displayed on the display screen of the smart sound box 11.
In some optional implementations, the method for executing a voice instruction provided in an embodiment of the present application further includes:
and adding sound source object identification to the text information.
Specifically, taking the case of two sound source objects as an example, after the voice commands issued by the sound source objects are converted into text information, a sound source object identifier may be added to the text information based on the order in which the voice commands of the two sound source objects were received; for example, "Object 1: please help me play the music of Zhou Jielun" and "Object 2: please help me play the music of Wang Fei". It should be noted that the sound source object identifiers "Object 1" and "Object 2" in the present embodiment are only exemplary and do not limit the specific identifiers; for example, the sound source object identifiers may also be "Object A" and "Object B"; in some alternative embodiments, they may also be "Instruction 1", "Instruction 2", or the like.
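A minimal sketch of this transcribe-and-label behaviour is given below; `asr_transcribe` stands in for whichever ASR engine the device actually uses and is a hypothetical name, as is the per-beam audio input.

```python
def label_transcripts(beams, asr_transcribe):
    """Transcribe each per-source beam and prefix a sound source object
    identifier in arrival order, e.g. 'Object 1: ...'."""
    lines = []
    for i, audio in enumerate(beams, start=1):   # beams ordered by arrival
        text = asr_transcribe(audio)             # hypothetical ASR call
        lines.append(f"Object {i}: {text}")
    return lines
```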
Step 302, acquiring images of at least two sound source objects according to the positions of the at least two sound source objects, and acquiring at least one frame of sound source image.
Specifically, in the present embodiment, the smart sound box 11 is an electronic device having a camera function, and after the smart sound box 11 determines the positions of the first sound source object 12 and the second sound source object 13 according to the sound source localization principle, the camera of the smart sound box 11 performs image capturing on the first sound source object 12 and the second sound source object 13, respectively. In some optional embodiments, the camera function of the smart sound box 11 may be in a state of being always activated, and after the smart sound box 11 determines the positions of the first sound source object 12 and the second sound source object 13, the image acquisition step is directly performed; in other alternative embodiments, the camera function of the smart sound box 11 may be in an off state, and after the smart sound box 11 determines the positions of the first sound source object 12 and the second sound source object 13, the camera function of the smart sound box 11 is started, and the image capturing step is performed. Specifically, in this embodiment, the smart sound box 11 may continuously collect the images of the first sound source object 12 and the second sound source object 13, and in the continuous collection, at least one frame of sound source image is collected; specifically, the sound source image may include images of the first sound source object 12 and the second sound source object 13, and may also include an image of the surrounding environment.
As an illustration, in some specific scenarios, at least one frame of sound source image collected by the smart sound box 11 may be an image as shown in fig. 4A, and fig. 4A illustrates an example in which the first sound source object 12 and the second sound source object 13 are in the same frame of sound source image. It is to be understood that in some specific scenarios, the first sound source object 12 and the second sound source object 13 may also be in sound source images of different frames.
Step 303, when at least one frame of sound source image has the first directional characteristic, determining a target sound source object based on the position pointed to by the first directional gesture, wherein the first directional characteristic comprises a first directional gesture.
Specifically, in the present embodiment, a feature recognition technique may be used to recognize features in the at least one frame of sound source image. In the present embodiment, the first directivity characteristic may be a directional operation of the first sound source object 12 or the second sound source object 13, for example a pointing gesture made with an extended index finger (illustrated in the original specification by an inline gesture figure). In this embodiment, when the sound source image has the first directivity characteristic, the position to which the first directivity characteristic points may also be identified; specifically, the position at which the first directivity characteristic is directed may be the position of the first sound source object 12 or of the second sound source object 13. Thus, from the position at which the first directivity characteristic is directed, it is determined whether the first directivity characteristic points to the first sound source object 12 or to the second sound source object 13.
Specifically, as an example, the first directional gesture may be the gesture shown in fig. 4A in which the first sound source object 12 points to the second sound source object 13; in some specific scenarios, the first directional gesture may also be the gesture shown in fig. 4A in which the second sound source object 13 points to itself. Of course, it should be noted that fig. 4A only illustrates hand gestures by way of example; in some specific application scenarios, the first directional gesture may also be a directional motion of an article held in the hand of a sound source object. In a specific application, for example, as shown in fig. 4A, the position pointed to by the first directional gesture of the first sound source object 12 is the position of the second sound source object 13, and the position pointed to by the first directional gesture of the second sound source object 13 is also the position of the second sound source object 13, so the second sound source object 13 is determined to be the target sound source object.
In some optional embodiments, step 303, determining the target sound source object according to the position pointed by the first directional gesture includes:
referring to fig. 4B, in the case where the first directivity characteristic points to a plurality of sound source objects, the number of first directivity gestures pointing to the position of each sound source object is acquired.
By way of illustration, in some specific scenarios, the number of sound source objects may exceed two sound source objects; for example, in the case of a large number of people at some family gathering; at this time, in the sound source image collected by the smart sound box, the first directional gestures of the multiple sound source objects may respectively point to the positions of different sound source objects, for example, as shown in fig. 4B, the first directional finger of one part of the sound source objects points to the position of the first sound source object 12, and the first directional gesture of another part of the sound source objects points to the position of the second sound source object 13; at this time, the number of first directional gestures pointing to the position of each sound source object is acquired; for example, the number of the first directional gestures acquired at the position of the first sound source object 12 in fig. 4B is "2"; the number of first directional gestures directed at the position of the second sound source object 13 is "6". It is understood that the number of the first directional gestures in fig. 4B is only an illustration and is not a limitation on the specific number.
And acquiring a first position with the largest number of first directional gestures.
Specifically, referring to fig. 4B, since the number of first directional gestures pointing to the position of the first sound source object 12 is "2" and the number pointing to the position of the second sound source object 13 is "6", the first position with the largest number of first directional gestures is the position of the second sound source object 13.
A sound source object at the first position is determined as a target sound source object.
Specifically, as shown in fig. 4B, the second sound source object 13 is determined as the target sound source object.
In this embodiment, the target sound source object is determined according to the number of first directional gestures pointing to the position of each sound source object, which improves the accuracy of determining the target sound source object and thus effectively improves the accuracy of executing the voice command.
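In code, this majority vote reduces to tallying which source each detected gesture points at in one frame; a sketch reusing the hypothetical `pointed_source` helper from the earlier snippet:

```python
from collections import Counter

def target_by_vote(gestures, source_positions):
    """Pick the sound source pointed at by the most first directional
    gestures; `gestures` is a list of (wrist, fingertip) detections."""
    votes = Counter()
    for wrist, fingertip in gestures:
        idx = pointed_source(wrist, fingertip, source_positions)
        if idx is not None:
            votes[idx] += 1
    if not votes:
        return None                       # no gesture resolved to a source
    return votes.most_common(1)[0][0]     # the first position's source
```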
In some optional embodiments, step 303, determining the target sound source object according to the position pointed by the first directional gesture includes:
Acquiring multi-frame sound source images within a preset time length, and respectively acquiring the number of frames in which the first directional gesture points to the position of each sound source object.
Specifically, in this embodiment, the preset time length may be configured in the factory settings of the smart sound box 11; the preset time length may also be defined by the user according to the specific situation during the use of the smart sound box 11. In the present embodiment, the preset time length may be, for example, 2 s, 3 s, or 5 s; the preset time length is not particularly limited in this embodiment. Referring to fig. 4C, in this embodiment, the number of sound source image frames acquired within the preset time length may be, for example, 5 or 10; the number of frames that can be acquired within the preset time length depends on the specific parameters of the camera of the smart sound box 11, which is not limited in this embodiment. In the present embodiment, acquisition of 5 frames of sound source images is taken as an example: for the first directional gesture indicated by the dotted line in fig. 4C, across the 5 frames of sound source images the gesture points to the position of the second sound source object 13 in the first two frames and to the position of the first sound source object 12 in the last three frames.
And acquiring a second position, the same position at which the first directional gesture points in the largest number of frames.
Specifically, taking the acquisition of 5 frames of sound source images as an example: the first directional gesture indicated by the dotted line in fig. 4C points to the position of the second sound source object 13 in the first two frames and to the position of the first sound source object 12 in the last three frames. The number of frames in which the first directional gesture points to the position of the first sound source object 12 is therefore the largest, i.e. the position of the first sound source object 12 is the second position.
The sound source object at the second position is determined as the target sound source object.
Specifically, referring to fig. 4C, the first sound source object 12 is determined as the target sound source object.
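A sketch of this frame-count variant, again using the hypothetical `pointed_source` helper: over the frames captured within the preset time length, count the frames in which the gesture points at each position and keep the position with the most frames.

```python
from collections import Counter

def target_by_frames(frames, source_positions):
    """`frames` is one gesture's (wrist, fingertip) detections across the
    sound source images captured within the preset time length."""
    frame_counts = Counter()
    for wrist, fingertip in frames:
        idx = pointed_source(wrist, fingertip, source_positions)
        if idx is not None:
            frame_counts[idx] += 1        # one more frame at source idx
    if not frame_counts:
        return None
    return frame_counts.most_common(1)[0][0]   # the second position's source
```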
In some optional embodiments, referring to fig. 4D, step 303, determining a target sound source object according to a position pointed by the first directional gesture includes:
and acquiring the moving distance of the first directional gesture when the first directional gesture points to the position of the sound source object in the sound source image with the preset number of frames.
Specifically, as shown in fig. 4D, in the present embodiment the moving distance as the first directional gesture points to a sound source object may be the moving distance as the first sound source object 12 points to the position of another sound source object, for example L2 in fig. 4D; of course, it may also be the moving distance as the second sound source object 13 points to its own position, for example L1 in fig. 4D. In some optional manners, the preset number of frames may be, for example, 5 or 8; the preset number of frames may be configured in the factory settings of the smart sound box 11, or set by the user according to the specific situation. Specifically, in each frame of sound source image collected by the smart sound box 11 the first directional gesture has a definite position, and across a plurality of consecutive frames the position change of the first directional gesture is also continuous. Therefore, by acquiring, over the preset number of frames, the moving distance of each first directional gesture pointing to a sound source object position, the speed with which each first directional gesture points to the different sound source object positions can be determined. For example, as shown in fig. 4D, within sound source images of the same number of frames, the moving distance L2 of the first directional gesture of the first sound source object 12 is significantly smaller than the moving distance L1 of the first directional gesture of the second sound source object 13; at this time, it is determined that the first directional gesture of the second sound source object 13 is faster and the first directional gesture of the first sound source object 12 is slower.
And acquiring a third position pointed by the first directional gesture with the largest moving distance.
Specifically, for example, the position pointed by the first directional gesture of the second sound source object 13 shown in fig. 4D is the third position.
The sound source object at the third position is determined as the target sound source object.
Specifically, the second sound source object 13 is determined as the target sound source object.
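A sketch of the movement-distance variant: accumulate each tracked gesture's fingertip displacement across the preset number of frames, then take the source pointed at by the gesture that moved farthest; gesture tracking and `pointed_source` are the same hypothetical helpers as above.

```python
import numpy as np

def target_by_distance(tracks, source_positions):
    """`tracks` maps a gesture id to its (wrist, fingertip) detections
    across the preset number of frames."""
    best_id, best_dist = None, -1.0
    for gid, dets in tracks.items():
        tips = np.array([tip for _, tip in dets], dtype=float)
        # summed frame-to-frame fingertip displacement (path length)
        dist = float(np.sum(np.linalg.norm(np.diff(tips, axis=0), axis=1)))
        if dist > best_dist:
            best_id, best_dist = gid, dist
    if best_id is None:
        return None
    wrist, fingertip = tracks[best_id][-1]    # final pointing pose
    return pointed_source(wrist, fingertip, source_positions)  # third position
```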
Step 304, executing the voice command sent by the target sound source object.
Specifically, as shown in fig. 4A, taking the sound source objects as the first sound source object 12 and the second sound source object 13 as an example, the first directional gesture of the first sound source object 12 is directed to the position of the second sound source object 13, and the first directional gesture of the second sound source object 13 is also directed to the position of the second sound source object 13, at this time, the smart sound box 11 executes the voice command issued by the second sound source object 13.
In some optional implementations, before step 301, the method for executing a voice instruction provided in this embodiment of the present application further includes:
an environmental image of the surrounding environment is acquired.
Specifically, in this embodiment, the camera of the smart sound box 11 is kept in the started state and continuously collects the environment image around the smart sound box 11.
When the environment image has the second directional characteristic, step 301 is executed.
Specifically, referring to fig. 4E, in an embodiment of the present disclosure, the second directional characteristic may be a directional gesture of an object in the environment image acquired by the smart sound box. In some alternatives, the second directional characteristic may be a second directional gesture shown in fig. 4E directed to smart sound box 11. When the camera of the smart sound box 11 acquires that the object in the environment has the second directional gesture pointing to the smart sound box 11, the smart sound box 11 starts a voice instruction receiving function and receives a voice instruction sent by the sound source object.
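Continuing the earlier hypothetical helpers, the wake-up check can be sketched as testing whether any detected gesture's pointing ray passes near the device's own position in the environment image; `device_position` and the gesture detections are assumed inputs, not part of the patent.

```python
def should_wake(gestures, device_position):
    """True when any second directional gesture points at the device's
    position in the environment image (wakes the voice receiver)."""
    return any(
        pointed_source(wrist, fingertip, [device_position]) == 0
        for wrist, fingertip in gestures
    )
```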
In this embodiment, the smart sound box is awakened and activated by the second directional gesture of the sound source object. This avoids the problem of untimely activation caused by inaccurate pronunciation when the smart sound box is activated by voice, and improves the wake-up efficiency of the smart sound box.
In some alternative embodiments, the second directional characteristic includes a third directional gesture pointing away from the execution subject.
In some specific scenarios, referring to fig. 4F, the third directional gesture may be a directional gesture pointing to a direction other than the smart sound box.
After step 301, the method for executing a voice instruction provided in the embodiment of the present application further includes:
and masking the voice instruction sent by the sound source object outside the preset range of the indication direction of the third directional gesture.
Specifically, referring to fig. 4F, in some specific scenarios the preset range may be a preset included-angle range around the direction indicated by the third directional gesture; in this embodiment the preset included angle may be, for example, 20°, 30°, or 45°, and the specific range of the preset included angle is not limited in this embodiment. For example, in fig. 4F, when there are multiple sound source objects, the smart sound box 11 receives the sounds emitted by the multiple sound source objects, and when it determines that the third directional gesture exists in the collected image, it masks voice commands outside the preset range of the direction indicated by the third directional gesture; for example, the voice commands issued by the sound source objects shown in dotted lines in fig. 4F are masked, and the smart sound box 11 only processes and executes voice commands within the preset range of the direction indicated by the third directional gesture. Specifically, in the present embodiment, masking voice commands outside the preset range of the direction indicated by the third directional gesture may be achieved by applying beam-forming signal processing to the output of each microphone in the microphone array, so as to form spatial directivity that enhances the target sound while suppressing noise and other voice commands outside the preset range of the direction indicated by the third directional gesture.
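A sketch of the angular filtering this masking implies: given the bearing indicated by the third directional gesture and the bearings of the located sources, only commands inside the preset included angle are kept. The ±30° default is just one of the example values above.

```python
def mask_outside_range(sources, indicated_bearing, half_angle=30.0):
    """Keep only voice commands from sources whose bearing lies within the
    preset included angle of the third directional gesture's direction.
    `sources` is a list of (bearing_degrees, voice_command) pairs."""
    kept = []
    for bearing, command in sources:
        # smallest signed angular difference, wrapped into [-180, 180)
        diff = (bearing - indicated_bearing + 180.0) % 360.0 - 180.0
        if abs(diff) <= half_angle:
            kept.append(command)          # inside the range: processed
        # outside the range: masked, never passed to execution
    return kept
```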
In some optional embodiments, referring to fig. 4G, the method for executing a voice command provided in this embodiment of the present application is also applicable to a single-person scenario; for example, when the single sound source object shown in fig. 4G points to itself, the smart sound box is woken up and receives the voice command issued by that sound source object.
It should be noted that this embodiment has the same or similar beneficial effects as the other embodiments of the present application, and details are not described in this embodiment.
Fig. 5 is a schematic structural diagram of an apparatus for executing a voice command according to an embodiment of the present application.
Referring to fig. 5, an apparatus 50 for executing a voice command provided in an embodiment of the present application includes:
a receiving module 51, configured to determine positions of at least two sound source objects when receiving voice commands issued by the at least two sound source objects at the same time;
an obtaining module 52, configured to perform image acquisition on the at least two sound source objects according to positions of the at least two sound source objects, and obtain at least one frame of sound source image;
a determining module 53, configured to determine a target sound source object based on a first directivity characteristic when the at least one frame of sound source image has the first directivity characteristic, where the first directivity characteristic is used to indicate a position of the target sound source object;
and the execution module 54 is used for executing the voice command sent by the target sound source object.
In some alternative embodiments, the first directional characteristic comprises a first directional gesture of the at least one sound source object;
the determining module 53 is specifically configured to determine the target sound source object according to the position pointed by the first directional gesture.
In some optional embodiments, the obtaining module 52 is further configured to obtain, in a case where the first directivity characteristic points to a plurality of sound source objects, the number of first directional gestures pointing to the position of each sound source object;
the obtaining module 52 is further configured to obtain a first position where the number of the first directional gestures is the largest;
the determining module 53 is further configured to determine the sound source object at the first position as the target sound source object.
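A minimal sketch of this majority-vote selection, assuming a gesture detector has already mapped every first directional gesture found in the frame to the sound source position it points at:

```python
from collections import Counter


def pick_first_position(gesture_targets):
    """gesture_targets: one entry per detected first directional gesture,
    giving the sound source position that gesture points at. Returns the
    position pointed at by the most gestures; ties are broken arbitrarily
    in this sketch."""
    votes = Counter(gesture_targets)
    return votes.most_common(1)[0][0] if votes else None
```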
In some optional embodiments, the obtaining module 52 is further configured to acquire multiple frames of sound source images within a preset time length, and to acquire, for each sound source object, the number of frames in which the first directional gesture points to that object's position;
the obtaining module 52 is further configured to acquire a second position, namely the position at which the first directional gesture points for the largest number of frames;
the determining module 53 is further configured to determine the sound source object at the second position as the target sound source object.
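Selection by frame count admits an equally small sketch over the frames captured within the preset time length; per_frame_target below is an assumed per-frame detection result (the pointed-at position, or None when no gesture is visible in that frame):

```python
from collections import Counter


def pick_second_position(per_frame_target):
    """per_frame_target: for each captured frame, the position the first
    directional gesture points at, or None. Returns the position held for
    the largest number of frames, i.e. the most persistent pointing."""
    counts = Counter(pos for pos in per_frame_target if pos is not None)
    return counts.most_common(1)[0][0] if counts else None
```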
In some optional embodiments, the obtaining module 52 is further configured to acquire, over a preset number of frames of sound source images, the moving distance of each first directional gesture while it points to the position of a sound source object;
the obtaining module 52 is further configured to obtain a third position pointed by the first directional gesture with the largest moving distance;
and a determining module 53, further configured to determine the sound source object at the third position as the target sound source object.
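The moving-distance variant can be sketched as below, assuming a tracker has already produced a fingertip trajectory for each candidate gesture over the preset number of frames; the tracks structure is an illustrative assumption:

```python
import numpy as np


def pick_third_position(tracks):
    """tracks: mapping {pointed-at position: [(x, y) fingertip coordinates,
    one per frame]}. Returns the position whose pointing gesture travelled
    farthest, taking the largest hand movement as the most deliberate
    designation."""
    def path_length(points):
        pts = np.asarray(points, dtype=float)
        # Sum of frame-to-frame Euclidean steps along the trajectory.
        return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())

    return max(tracks, key=lambda pos: path_length(tracks[pos])) if tracks else None
```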
In some optional implementations, the apparatus 50 for executing a voice instruction provided in this embodiment of the present application further includes:
and the display module 55 is configured to convert the voice instructions into text information and display the text information after the positions of the at least two sound source objects have been determined upon simultaneously receiving their voice instructions.
In some optional implementations, the apparatus 50 for executing a voice instruction provided in this embodiment of the present application further includes:
and the adding module 56 is used for adding a sound source object identifier to the text information after the voice instruction has been converted into text information and displayed.
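Taken together, the display module 55 and the adding module 56 amount to a labelled transcript. A minimal sketch, with recognize and show as stand-ins for a speech-to-text engine and the device display:

```python
def show_labelled_transcript(commands, recognize, show):
    """commands: list of (source_id, audio) pairs received at the same
    time. Each command is transcribed and displayed with its sound source
    object identifier prepended, so users can tell the utterances apart."""
    for source_id, audio in commands:
        show(f"[source {source_id}] {recognize(audio)}")
```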
In some optional implementations, the apparatus 50 for executing a voice instruction provided in this embodiment of the present application further includes:
an image collecting module 57, configured to collect an environmental image of a surrounding environment before determining positions of at least two sound source objects when receiving voice commands issued by the at least two sound source objects at the same time;
and the execution module 54 is configured to execute the step of receiving the voice instruction when the environment image has the second directivity characteristic.
In some alternative embodiments, the second directional characteristic includes a second directional gesture directed to the performing subject.
In some alternative embodiments, the second directional characteristic includes a third directional gesture pointing outside the performing subject;
the apparatus 50 for executing a voice instruction provided in the embodiment of the present application further includes:
and the masking module 58 is used for masking the voice instruction sent by the sound source object outside the preset range of the indication direction of the third directional gesture.
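A sketch of how the second directivity characteristic gates reception follows, assuming a hypothetical vision routine pointing_azimuth that returns the gesture's azimuth in the device frame (or None when no gesture is present); the angular tolerance is illustrative. The out-of-window masking itself was sketched after the discussion of fig. 4F.

```python
def gate_reception(env_image, pointing_azimuth,
                   device_azimuth_deg=0.0, tolerance_deg=15.0):
    """Decide whether to start receiving voice commands.

    Returns (receive, mask_center): mask_center is None for a second
    directional gesture (aimed at the device itself) and the gesture
    azimuth for a third directional gesture (aimed elsewhere), in which
    case sources outside the window around it should be masked."""
    azimuth = pointing_azimuth(env_image)
    if azimuth is None:
        return False, None          # no directional gesture: stay idle
    if abs(azimuth - device_azimuth_deg) <= tolerance_deg:
        return True, None           # second directional gesture
    return True, azimuth            # third directional gesture
```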
It should be noted that the device embodiment and the method embodiment of the present application have the same or similar beneficial effects, and are not described in detail in this embodiment.
Fig. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Referring to fig. 6, an electronic device 60 provided in an embodiment of the present application includes:
the device comprises a memory 61, a processor 62 and a communication bus 63, wherein the memory 61 is communicatively connected to the processor 62 through the communication bus 63;
the memory 61 stores computer-executable instructions, and the processor 62 is configured to execute the computer-executable instructions to implement the method for executing the voice instruction provided in any optional implementation manner of the embodiment of the present application.
It should be noted that the device embodiment and the method embodiment of the present application have the same or similar beneficial effects, and are not described in detail in this embodiment.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of a method, apparatus, and electronic device for executing voice instructions according to embodiments of the present invention. The present invention may also be embodied as devices or device programs (e.g., computer programs and computer program products) for performing some or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.

Claims (18)

1. A method for executing a voice command, comprising:
when voice commands sent by at least two sound source objects are received simultaneously, determining the positions of the at least two sound source objects;
acquiring images of the at least two sound source objects according to the positions of the at least two sound source objects to obtain at least one frame of sound source image;
determining a target sound source object based on a first directivity characteristic when the at least one frame of sound source image has the first directivity characteristic, the first directivity characteristic being used for indicating a position of the target sound source object;
executing a voice instruction sent by the target sound source object;
the first directional characteristic comprises a first directional gesture of at least one sound source object;
the determining a target sound source object based on the first directivity characteristic includes:
determining the target sound source object according to the position pointed by the first directional gesture;
the determining the target sound source object according to the position pointed by the first directional gesture includes:
acquiring the number of first directional gestures pointing to the position of each sound source object in the case that a first directional characteristic points to a plurality of sound source objects;
acquiring a first position with the largest number of the first directional gestures;
determining the sound source object at the first position as the target sound source object.
2. The method of claim 1, wherein determining the target sound source object according to the location at which the first directional gesture is directed comprises:
acquiring multi-frame sound source images within a preset time length; respectively acquiring, for each sound source object, the number of frames in which the first directional gesture points to the position of that sound source object;
acquiring a second position, namely the position at which the first directional gesture points for the largest number of frames;
determining the sound source object at the second position as the target sound source object.
3. The method of claim 1, wherein determining the target sound source object according to the location at which the first directional gesture is directed comprises:
acquiring a moving distance of the first directional gesture pointing to the position of the sound source object in the sound source image with preset frame numbers;
acquiring a third position pointed by the first directional gesture with the largest moving distance;
determining the sound source object at the third position as the target sound source object.
4. The method according to claim 1, wherein after determining the positions of at least two sound source objects upon simultaneously receiving the voice commands issued by the at least two sound source objects, the method further comprises:
and converting the voice instruction into text information and displaying the text information.
5. The method of claim 4, wherein after converting the voice command into text information and displaying, the method further comprises:
and adding sound source object identification to the text information.
6. The method according to any one of claims 1 to 5, wherein before determining the positions of at least two sound source objects upon simultaneously receiving the voice commands issued by the at least two sound source objects, the method further comprises:
acquiring an environment image of the surrounding environment;
and when the environment image has a second directional characteristic, executing the step of receiving the voice instruction.
7. The method of claim 6, wherein the second directional characteristic comprises a second directional gesture directed at the performing subject.
8. The method of claim 6, wherein the second directional characteristic comprises a third directional gesture directed outside the performing subject;
after receiving voice commands sent by at least two sound source objects and determining the positions of the at least two sound source objects, the method further comprises the following steps:
and masking the voice instruction sent by the sound source object outside the preset range of the indication direction of the third directional gesture.
9. An apparatus for executing a voice command, comprising:
the receiving module is used for determining the positions of at least two sound source objects when voice instructions sent by the at least two sound source objects are received simultaneously;
the acquisition module is used for acquiring images of the at least two sound source objects according to the positions of the at least two sound source objects and acquiring at least one frame of sound source image;
a determining module, configured to determine a target sound source object based on a first directivity characteristic when the at least one frame of sound source image has the first directivity characteristic, where the first directivity characteristic is used to indicate a position of the target sound source object;
the execution module is used for executing a voice instruction sent by the target sound source object;
the first directional characteristic comprises a first directional gesture of at least one sound source object;
the determining module is specifically configured to determine the target sound source object according to a position pointed by the first directional gesture;
the acquisition module is further configured to obtain the number of first directional gestures pointing to the position of each sound source object when the first directivity characteristic points to a plurality of sound source objects;
the acquisition module is further configured to obtain a first position where the number of first directional gestures is the largest;
the determining module is further configured to determine the sound source object at the first position as the target sound source object.
10. The apparatus of claim 9, wherein
the acquisition module is further configured to acquire multi-frame sound source images within a preset time length, and to acquire, for each sound source object, the number of frames in which the first directional gesture points to that object's position;
the acquisition module is further configured to acquire a second position, namely the position at which the first directional gesture points for the largest number of frames;
the determining module is further configured to determine the sound source object at the second position as the target sound source object.
11. The apparatus of claim 9, wherein
the acquisition module is further configured to acquire, over a preset number of frames of sound source images, the moving distance of the first directional gesture while it points to the position of a sound source object;
the acquisition module is further configured to acquire a third position pointed at by the first directional gesture with the largest moving distance;
the determining module is further configured to determine the sound source object at the third position as the target sound source object.
12. The apparatus of claim 9, further comprising:
and the display module is used for converting the voice instructions into text information and displaying the text information after the positions of the at least two sound source objects are determined when voice instructions issued by the at least two sound source objects are received simultaneously.
13. The apparatus of claim 12, further comprising:
and the adding module is used for adding a sound source object identifier to the text information after the voice instruction has been converted into text information and displayed.
14. The apparatus of any one of claims 9-13, further comprising:
the system comprises an image acquisition module, a voice recognition module and a voice recognition module, wherein the image acquisition module is used for acquiring an environment image of the surrounding environment before determining the positions of at least two sound source objects when receiving voice instructions sent by the at least two sound source objects at the same time;
the execution module is used for executing the step of receiving the voice instruction when the environment image has a second directive characteristic.
15. The apparatus of claim 14, wherein the second directional characteristic comprises a second directional gesture directed at the performing subject.
16. The apparatus of claim 14, wherein the second directional characteristic comprises a third directional gesture directed outside the performing subject;
the device further comprises:
and the masking module is used for masking the voice instruction sent by the sound source object outside the preset range of the indication direction of the third directional gesture.
17. An electronic device, comprising:
the device comprises a memory, a processor and a communication bus, wherein the memory is in communication connection with the processor through the communication bus;
the memory has stored therein computer-executable instructions for execution by the processor to implement the method of any one of claims 1-8.
18. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed, perform the method of any one of claims 1-8.
CN201910766037.8A 2019-08-19 2019-08-19 Voice instruction execution method and device and electronic equipment Active CN110335607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910766037.8A CN110335607B (en) 2019-08-19 2019-08-19 Voice instruction execution method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN110335607A CN110335607A (en) 2019-10-15
CN110335607B true CN110335607B (en) 2021-07-27

Family

ID=68149849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910766037.8A Active CN110335607B (en) 2019-08-19 2019-08-19 Voice instruction execution method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110335607B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103096017B (en) * 2011-10-31 2016-07-06 鸿富锦精密工业(深圳)有限公司 Computer operating power control method and system
KR101353936B1 (en) * 2012-03-26 2014-01-27 서강대학교산학협력단 Speech recognition apparatus and method for speech recognition
CN105205454A (en) * 2015-08-27 2015-12-30 深圳市国华识别科技开发有限公司 System and method for capturing target object automatically
CN105280183B (en) * 2015-09-10 2017-06-20 百度在线网络技术(北京)有限公司 voice interactive method and system
JP6494062B2 (en) * 2016-08-29 2019-04-03 Groove X株式会社 Autonomous robot that recognizes the direction of the sound source
CN108702458B (en) * 2017-11-30 2021-07-30 深圳市大疆创新科技有限公司 Shooting method and device
CN109598198A (en) * 2018-10-31 2019-04-09 深圳市商汤科技有限公司 The method, apparatus of gesture moving direction, medium, program and equipment for identification
CN109506568B (en) * 2018-12-29 2021-06-18 思必驰科技股份有限公司 Sound source positioning method and device based on image recognition and voice recognition


Similar Documents

Publication Publication Date Title
US10529360B2 (en) Speech enhancement method and apparatus for same
CN107464564B (en) Voice interaction method, device and equipment
US11031005B2 (en) Continuous topic detection and adaption in audio environments
US10438588B2 (en) Simultaneous multi-user audio signal recognition and processing for far field audio
EP3274988A1 (en) Controlling electronic device based on direction of speech
WO2020048431A1 (en) Voice processing method, electronic device and display device
US11656837B2 (en) Electronic device for controlling sound and operation method therefor
CN108986833A (en) Sound pick-up method, system, electronic equipment and storage medium based on microphone array
WO2020020375A1 (en) Voice processing method and apparatus, electronic device, and readable storage medium
US11636867B2 (en) Electronic device supporting improved speech recognition
CN108877787A (en) Audio recognition method, device, server and storage medium
CN113053368A (en) Speech enhancement method, electronic device, and storage medium
CN110968353A (en) Central processing unit awakening method and device, voice processor and user equipment
CN111863005B (en) Sound signal acquisition method and device, storage medium and electronic equipment
CN114120984A (en) Voice interaction method, electronic device and storage medium
CN111326152A (en) Voice control method and device
CN111863020A (en) Voice signal processing method, device, equipment and storage medium
CN113593572A (en) Method and apparatus for performing sound zone localization in spatial region, device and medium
CN110611861A (en) Directional sound production control method and device, sound production equipment, medium and electronic equipment
CN110364159B (en) Voice instruction execution method and device and electronic equipment
CN114220420A (en) Multi-modal voice wake-up method, device and computer-readable storage medium
CN110335607B (en) Voice instruction execution method and device and electronic equipment
CN109688512B (en) Pickup method and device
CN112740219A (en) Method and device for generating gesture recognition model, storage medium and electronic equipment
CN111103807A (en) Control method and device for household terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240424

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Country or region after: China

Address before: 100028 1104, 11 / F, building 1, 1 Zuojiazhuang front street, Chaoyang District, Beijing

Patentee before: BEIJING ANYUNSHIJI TECHNOLOGY Co.,Ltd.

Country or region before: China