CN115148202A - Voice instruction processing method and device, storage medium and electronic device
- Publication number: CN115148202A (application number CN202210609766.4A)
- Authority: CN (China)
- Prior art keywords: voice, target, signal, acquisition, voice signal
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/22 — Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/02 — Speech recognition; feature extraction for speech recognition; selection of recognition unit
- G10L15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress-induced speech
- G10L15/24 — Speech recognition using non-acoustical features
- G10L25/21 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being power information
- H04L12/282 — Home automation networks; controlling appliance services of a home automation network by calling their functionalities based on user interaction within the home
- G10L2015/223 — Execution procedure of a spoken command
Abstract
The application provides a voice instruction processing method and apparatus, a storage medium, and an electronic apparatus, relating to the technical field of smart homes. The method includes: acquiring a plurality of voice signals collected by a plurality of acquisition components on a smart device, where each voice signal in the plurality of voice signals is a voice signal which is collected by one acquisition component in the plurality of acquisition components and corresponds to a voice control instruction issued by a target object; selecting a target voice signal from the plurality of voice signals according to the signal characteristics of the plurality of voice signals; and sending the energy value of the target voice signal to a server, so that the server determines, according to the energy value of the target voice signal, whether the smart device responds to the voice control instruction. The method and apparatus solve the problem in the related art that voice instructions are executed with poor accuracy because the energy value of the voice signal determined by the device is inaccurate.
Description
Technical Field
The present application relates to the field of communications, and in particular, to a method and an apparatus for processing a voice command, a storage medium, and an electronic apparatus.
Background
At present, there may be multiple voice devices (i.e., smart devices with a voice acquisition function) in a home, and after a user issues a voice instruction, a distributed voice control scheme may be used to select a responding device from the voice devices that received the voice instruction. A distributed voice control scheme allows the device nearest to the user to respond, according to the user's position, thereby achieving nearby voice control.
A voice device may be provided with a plurality of voice acquisition components (sound pickup components, such as microphones), and voice instructions may be collected through these components. After multi-channel voice signals have been collected by the plurality of voice acquisition components, the energy value of the voice signal collected by one predetermined, fixed voice acquisition component (i.e., one fixed channel) is transmitted to the cloud, and the cloud makes a decision according to the energy values sent by the plurality of voice devices (devices with smaller energy are not woken up; the device with the largest energy is woken up).
However, the energy value of the voice signal collected by a voice acquisition component may differ depending on where the voice device is placed. For example, when a voice device is placed close to a wall, if the voice signal collected by the acquisition component near the wall is selected to calculate the device's energy value, the calculated energy value is inaccurate because of wall reflections and similar effects; the wake-up device may then be selected incorrectly, so the voice instruction issued by the user cannot be executed accurately.
Therefore, voice instruction processing methods in the related art suffer from poor accuracy in executing voice instructions because the energy value of the voice signal determined by the device is inaccurate.
Disclosure of Invention
The embodiment of the application provides a method and a device for processing a voice command, a storage medium and an electronic device, so as to at least solve the problem that the method for processing the voice command in the related art has poor accuracy of executing the voice command due to inaccurate energy value of a voice signal determined by equipment.
According to an aspect of the embodiments of the present application, there is provided a method for processing a voice instruction, including: acquiring a plurality of voice signals acquired by a plurality of acquisition components on intelligent equipment, wherein each voice signal in the plurality of voice signals is a voice signal which is acquired by one acquisition component in the plurality of acquisition components and corresponds to a voice control instruction sent by a target object; selecting a target voice signal from the voice signals according to the signal characteristics of the voice signals; and sending the energy value of the target voice signal to a server so that the server determines whether the intelligent equipment responds to the voice control instruction or not according to the energy value of the target voice signal.
In an exemplary embodiment, the selecting a target speech signal from the plurality of speech signals according to the signal characteristics of the plurality of speech signals includes: selecting a target acquisition component closest to the target object from the plurality of acquisition components according to the signal characteristics of the plurality of voice signals; determining a voice signal acquired by the target acquisition component from among the plurality of voice signals as the target voice signal; or, determining the speech signal with the largest energy value in the plurality of speech signals as the target speech signal.
In an exemplary embodiment, the selecting, from the plurality of acquisition components, a target acquisition component closest to the target object according to the signal characteristics of the plurality of voice signals includes: determining object angle information of the target object according to signal characteristics of the voice signals, wherein the object angle information is used for describing a relative angle between the target object and the intelligent equipment; and selecting the target acquisition component from the plurality of acquisition components according to the object angle information, wherein the target acquisition component is the acquisition component which is closest to the target object after the target object is projected to the plane where the plurality of acquisition components are located according to the relative angle.
In an exemplary embodiment, before the extracting the target speech signal from the plurality of speech signals according to the signal features of the plurality of speech signals, the method further includes: performing acoustic echo cancellation operation on the plurality of voice signals to obtain the plurality of processed voice signals; and performing signal feature extraction on each processed voice signal in the plurality of voice signals to obtain the signal features of the plurality of voice signals.
In an exemplary embodiment, before acquiring the plurality of voice signals acquired by the plurality of acquisition components on the smart device, the method further comprises: acquiring an image of the target object through an image acquisition component on the intelligent equipment to obtain a target acquisition image; carrying out object identification on the target acquisition image to obtain object position information of the target object, wherein the object position information is used for representing the relative position of the target object and the intelligent equipment; and adjusting the acquisition angles of the plurality of acquisition components for acquiring the voice signals according to the object position information.
In an exemplary embodiment, before said transmitting the energy value of the target speech signal to the server, the method further comprises: determining the average value of the amplitude values of a plurality of sampling points in the target voice signal as the energy value of the target voice signal; or, determining the square sum of the amplitude values of a plurality of sampling points in the target speech signal as the energy value of the target speech signal.
In an exemplary embodiment, after the transmitting the energy value of the target speech signal to a server, the method further comprises: receiving an energy value of a voice signal sent by each device in a plurality of devices, wherein the plurality of devices comprise the intelligent device, and the voice signal to which the energy value of the voice signal sent by each device belongs is a voice signal corresponding to a voice control instruction sent by the target object; determining a target device from the plurality of devices according to the energy value of the voice signal sent by each device, wherein the target device is a device used for executing the voice control instruction in the plurality of devices; and controlling the target equipment to execute equipment operation matched with the voice control instruction.
According to another aspect of the embodiments of the present application, there is also provided a processing apparatus for a voice instruction, including: an acquisition unit, configured to acquire a plurality of voice signals collected by a plurality of acquisition components on a smart device, where each voice signal in the plurality of voice signals is a voice signal which is collected by one acquisition component in the plurality of acquisition components and corresponds to a voice control instruction issued by a target object; a selecting unit, configured to select a target voice signal from the plurality of voice signals according to the signal characteristics of the plurality of voice signals; and a sending unit, configured to send the energy value of the target voice signal to a server, so that the server determines, according to the energy value of the target voice signal, whether the smart device responds to the voice control instruction.
In an exemplary embodiment, the selecting unit includes: the selecting module is used for selecting a target acquisition component which is closest to the target object from the plurality of acquisition components according to the signal characteristics of the plurality of voice signals; a first determining module, configured to determine, as the target speech signal, a speech signal acquired by the target acquisition component from among the plurality of speech signals; or, the second determining module is configured to determine, as the target speech signal, a speech signal with a largest energy value in the plurality of speech signals.
In one exemplary embodiment, the selecting module includes: the determining submodule is used for determining object angle information of the target object according to the signal characteristics of the voice signals, wherein the object angle information is used for describing the relative angle between the target object and the intelligent equipment; and the selecting submodule is used for selecting the target acquisition component from the plurality of acquisition components according to the object angle information, wherein the target acquisition component is the acquisition component which is closest to the target object after the target object is projected to the plane where the plurality of acquisition components are located according to the relative angle.
In one exemplary embodiment, the apparatus further comprises: an execution unit, configured to perform an acoustic echo cancellation operation on the multiple voice signals before the target voice signal is selected from the multiple voice signals according to the signal features of the multiple voice signals, so as to obtain the processed multiple voice signals; and the extraction unit is used for extracting the signal characteristics of each processed voice signal in the plurality of voice signals to obtain the signal characteristics of the plurality of voice signals.
In one exemplary embodiment, the apparatus further comprises: the acquisition unit is used for acquiring images of the target object through the image acquisition component on the intelligent equipment before acquiring the voice signals acquired by the acquisition components on the intelligent equipment to obtain a target acquisition image; the identification unit is used for carrying out object identification on the target acquisition image to obtain object position information of the target object, wherein the object position information is used for representing the relative position of the target object and the intelligent equipment; and the adjusting unit is used for adjusting the acquisition angles of the plurality of acquisition components for acquiring the voice signals according to the object position information.
In one exemplary embodiment, the apparatus further comprises: a first determining unit, configured to determine an average value of amplitude values of a plurality of sampling points in the target speech signal as an energy value of the target speech signal before the energy value of the target speech signal is sent to a server; or, the second determining unit is configured to determine a sum of squares of amplitude values of a plurality of sampling points in the target speech signal as an energy value of the target speech signal.
In one exemplary embodiment, the apparatus further comprises: a receiving unit, configured to receive an energy value of a voice signal sent by each device in a plurality of devices after the energy value of the target voice signal is sent to a server, where the plurality of devices include the smart device, and a voice signal to which the energy value of the voice signal sent by each device belongs is a voice signal corresponding to a voice control instruction sent by the target object; a third determining unit, configured to determine a target device from the multiple devices according to an energy value of a voice signal sent by each device, where the target device is a device in the multiple devices that is used to execute the voice control instruction; and the control unit is used for controlling the target equipment to execute equipment operation matched with the voice control instruction.
According to another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium, in which a computer program is stored, where the computer program is configured to execute the processing method of the above voice command when running.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the processing method of the voice instruction through the computer program.
In the embodiments of the present application, a voice signal used to represent the distance between the voice device and the user is selected from a plurality of voice signals based on the signal characteristics of the voice signals collected by the individual voice acquisition components. Specifically, a plurality of voice signals collected by a plurality of acquisition components on the smart device are acquired, where each voice signal in the plurality of voice signals is a voice signal which is collected by one acquisition component in the plurality of acquisition components and corresponds to a voice control instruction issued by a target object; a target voice signal is selected from the plurality of voice signals according to the signal characteristics of the plurality of voice signals; and the energy value of the target voice signal is sent to a server, so that the server determines, according to the energy value of the target voice signal, whether the smart device responds to the voice control instruction. Compared with always using the voice signal collected by a fixed acquisition component, dynamically selecting, based on the signal characteristics of the currently collected voice signals, the voice signal that best represents the distance between the voice device and the user makes the reported energy value more accurate, thereby solving the problem in the related art that voice instructions are executed with poor accuracy because the energy value of the voice signal determined by the device is inaccurate, and improving the accuracy of executing voice instructions.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below; it is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a diagram illustrating a hardware environment for an alternative method of processing voice commands according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating an alternative method for processing voice commands according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative method of processing voice commands according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative method of processing voice commands in accordance with embodiments of the present application;
FIG. 5 is a block diagram of an alternative apparatus for processing voice commands according to an embodiment of the present application;
fig. 6 is a block diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the accompanying drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to one aspect of the embodiments of the present application, a method for processing a voice instruction is provided. Optionally, the voice instruction processing method is widely applicable to whole-house intelligent digital control application scenarios such as the smart home, the smart home device ecosystem, and the intelligent house ecosystem. Alternatively, in this embodiment, the voice instruction processing method may be applied to a hardware environment formed by the terminal 102 and the server 104 as shown in fig. 1. As shown in fig. 1, the server 104 is connected to the terminal 102 through a network and may be configured to provide a service (e.g., an application service) for the terminal or for a client installed on the terminal; a cloud computing and/or edge computing service may be configured on the server or independently of the server, so as to provide a data computing service for the server 104.
The network may include, but is not limited to, at least one of: a wired network, a wireless network. The wired network may include, but is not limited to, at least one of: a wide area network, a metropolitan area network, a local area network; the wireless network may include, but is not limited to, at least one of: WIFI (Wireless Fidelity), Bluetooth. The terminal 102 may be, but is not limited to, a PC, a mobile phone, a tablet computer, a smart air conditioner, a smart range hood, a smart refrigerator, a smart oven, a smart stove, a smart washing machine, a smart water heater, smart laundry equipment, a smart dishwasher, a smart projection device, a smart TV, a smart clothes-drying rack, a smart curtain, smart audio-visual equipment, a smart socket, a smart sound system, a smart speaker, smart fresh-air equipment, smart kitchen-and-bathroom equipment, smart bathroom equipment, a sweeping robot, a window-cleaning robot, a mopping robot, a smart air purifier, a smart steamer, a smart microwave oven, a smart kitchen water heater, a smart purifier, a smart water dispenser, a smart lock, and the like.
The voice instruction processing method in the embodiments of the present application may be executed by the server 104, by the terminal 102, or by both the server 104 and the terminal 102 together. The voice instruction processing method executed by the terminal 102 according to the embodiments of the present application may also be executed by a client installed on the terminal.
Taking the method for processing the voice command in the embodiment executed by the terminal 102 as an example, fig. 2 is a schematic flow chart of an optional method for processing the voice command according to the embodiment of the present application, and as shown in fig. 2, the flow of the method may include the following steps:
step S202, acquiring a plurality of voice signals acquired by a plurality of acquisition components on the intelligent device, wherein each voice signal in the plurality of voice signals is a voice signal acquired by one acquisition component in the plurality of acquisition components and corresponding to a voice control instruction sent by a target object.
The processing method of the voice command in this embodiment may be applied to a scenario in which a voice control command sent by a target object is processed by an intelligent device (i.e., the terminal 102). The intelligent device can be an intelligent home device or a terminal device. The smart home devices may be smart home devices located in a user's home, and may be electronic devices equipped with smart chips, such as a smart television, a smart refrigerator, and a smart water heater, and compared with conventional home devices, the smart home devices are added with a computing module, a network interface, an input/output device, and the like, so that the smart home devices in this embodiment have functions of intelligent analysis and intelligent service.
Optionally, the target object may be an object that establishes a connection relationship with the smart home device, or an object that is located in the same location area as the smart home device, and may be used to represent a specific user, or may be used to represent a user that is located in the same location area as the smart home device, which is not limited in this embodiment. For example, the target object may be a user located in the same room as the smart refrigerator.
In this embodiment, for the target smart device, the target smart device may obtain a voice control instruction issued by the target object, and optionally, the target smart device may obtain a voice signal acquired by each of a plurality of acquisition components (a plurality of voice acquisition components, for example, a microphone array) on the target smart device, so as to obtain a plurality of voice signals, where each of the plurality of voice signals is a voice signal acquired by one of the plurality of acquisition components and corresponding to the voice control instruction issued by the target object.
The acquisition component can be a voice acquisition component arranged in the intelligent device. Alternatively, the acquisition component may be a microphone component arranged in the smart device, for example, a voice control instruction sent by the user may be acquired by a plurality of microphones arranged in the smart device.
Optionally, in order to better acquire the voice control instruction sent by the target object through the multiple acquisition components, before acquiring the multiple voice signals acquired by the multiple acquisition components on the smart device, an acquisition angle at which the multiple acquisition components acquire the voice signals may be adjusted, which is not limited in this embodiment.
It should be noted that the multiple collecting components may be arranged circumferentially in the intelligent device, or may be arranged in other arrangement manners, which is not limited in this embodiment. In addition, the number of the plurality of collecting components in the intelligent device may be 4, 6, or another number, which is not limited in this embodiment.
For example, a microphone array may be disposed on the smart home device, and a multichannel microphone signal may be acquired by the microphone array.
In step S204, a target speech signal is selected from the plurality of speech signals according to the signal characteristics of the plurality of speech signals.
In this embodiment, after acquiring the plurality of voice signals, a target voice signal may be selected from the plurality of voice signals, where the target voice signal is a voice signal representing the target smart device or a voice signal representing the distance between the target smart device and the target object. Optionally, the feature extraction operation may be performed on each of the multiple voice signals to obtain a signal feature of each voice signal, and then the target voice signal is selected from the multiple voice signals according to the signal feature of each voice signal, which is not limited in this embodiment.
Alternatively, the process of selecting the target speech signal from the plurality of speech signals according to the signal characteristics of the plurality of speech signals may be: according to a plurality of energy values corresponding to a plurality of voice signals, a target voice signal is selected from the plurality of voice signals, or a target acquisition component closest to a target object is selected from the plurality of acquisition components according to signal characteristics of the plurality of voice signals, and then the voice signal acquired by the target acquisition component is determined as the target voice signal, or other modes of selecting the target voice signal are available, which is not limited in this embodiment.
It should be noted that before performing a feature extraction operation on a plurality of voice signals to obtain signal features of the plurality of voice signals, a preprocessing operation may be performed on the plurality of voice signals to improve the accuracy of signal feature extraction of the plurality of voice signals, where the preprocessing operation may be an Acoustic Echo Cancellation (AEC) operation performed on the plurality of voice signals or a filtering operation performed on the plurality of voice signals to filter out an interference signal in the plurality of voice signals, which is not limited in this embodiment.
Step S206, the energy value of the target voice signal is sent to the server, so that the server determines whether the intelligent device responds to the voice control command or not according to the energy value of the target voice signal.
In this embodiment, after the target speech signal is selected from the plurality of speech signals, the energy value of the target speech signal may be sent to the server, so that the server determines whether the smart device responds to the speech control command according to the energy value of the target speech signal.
Alternatively, the server may be a server that establishes a connection relationship with the target smart device, that is, data may be exchanged between the target smart device and the server. The server, after receiving the energy value of the target voice signal transmitted by the target smart device, may compare the energy value of the target voice signal with the energy values of the received voice signals transmitted by the other devices, and determine whether the voice control command is responded to by the target smart device (i.e., which smart device responds to the voice control command) according to the result of the comparison. This is not limited in this embodiment.
It should be noted that, because the energy value of the target voice signal risks being intercepted or stolen while being sent to the server, the energy value of the target voice signal may be encrypted before being sent to the server. This is not limited in this embodiment.
Through the steps S202 to S206, acquiring a plurality of voice signals acquired by a plurality of acquisition components on the smart device, wherein each voice signal in the plurality of voice signals is a voice signal acquired by one acquisition component in the plurality of acquisition components and corresponding to a voice control instruction sent by a target object; selecting a target voice signal from the voice signals according to the signal characteristics of the voice signals; the energy value of the target voice signal is sent to the server, so that the server determines whether the intelligent device responds to the voice control command according to the energy value of the target voice signal, the problem that the accuracy of executing the voice command is poor due to the fact that the energy value of the voice signal determined by the device is inaccurate in a voice command processing method in the related art is solved, and the accuracy of executing the voice command is improved.
In an exemplary embodiment, the manner of selecting the target speech signal from the plurality of speech signals may be various and may include, but is not limited to, at least one of the following: the target voice signal is directly selected from the voice signals based on the voice signal characteristics, the target acquisition component is selected from the acquisition components based on the voice signal characteristics, and then the voice signal acquired by the target acquisition component is determined as the target voice signal.
As an optional implementation manner, selecting the target voice signal from the plurality of voice signals according to the signal characteristics of the plurality of voice signals includes:
s11, selecting a target acquisition component closest to a target object from the plurality of acquisition components according to the signal characteristics of the plurality of voice signals;
and S12, determining the voice signal acquired by the target acquisition component in the plurality of voice signals as a target voice signal.
The target acquisition component may be obtained by selecting, from the plurality of acquisition components, the acquisition component closest to the target object according to the signal characteristics of the plurality of voice signals. For example, a target microphone (an example of the target acquisition component) may be obtained by selecting, using a DOA (Direction of Arrival) algorithm based on the voice information collected by the plurality of microphones, the microphone closest to the target object from among the plurality of microphones.
As another optional implementation manner, selecting the target voice signal from the plurality of voice signals according to the signal characteristics of the plurality of voice signals includes:
and S13, determining the voice signal with the largest energy value in the plurality of voice signals as the target voice signal.
The voice signal with the largest energy value among the plurality of voice signals may be determined as the target voice signal. For example, among a plurality of microphones A, B, C and D, if the energy value of the voice signal collected by microphone A is 40, that collected by microphone B is 30, that collected by microphone C is 45, and that collected by microphone D is 35, the voice signal collected by microphone C is determined as the target voice signal.
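As a purely illustrative sketch of this max-energy selection (the function names and the sample data are assumptions introduced here, not part of the original description), the selection could be written as follows:

```python
from typing import List, Sequence

def frame_energy(samples: Sequence[float]) -> float:
    # Sum of the squared amplitude values over the frame, one of the
    # energy definitions given later in this description.
    return sum(x * x for x in samples)

def select_max_energy_channel(channels: List[Sequence[float]]) -> int:
    # Return the index of the channel whose collected voice signal has the
    # largest energy value; that channel's signal is the target voice signal.
    energies = [frame_energy(ch) for ch in channels]
    return max(range(len(channels)), key=lambda i: energies[i])

# Toy example with four microphones A, B, C, D:
mics = [[0.1, -0.2, 0.3], [0.05, 0.1, -0.1], [0.4, -0.5, 0.2], [0.2, 0.1, 0.1]]
print(select_max_energy_channel(mics))  # -> 2, i.e. microphone C
```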
According to the embodiment, the voice signal acquired by the acquisition component closest to the target object in the acquisition components or the voice signal with the largest energy value is determined as the target voice signal, so that the flexibility and the accuracy of voice signal selection can be improved.
In an exemplary embodiment, selecting a target acquisition component closest to a target object from a plurality of acquisition components according to signal characteristics of a plurality of voice signals includes:
s21, determining object angle information of a target object according to the signal characteristics of the voice signals, wherein the object angle information is used for describing the relative angle between the target object and the intelligent equipment;
and S22, selecting a target acquisition component from the plurality of acquisition components according to the object angle information, wherein the target acquisition component is the acquisition component closest to the target object after the target object is projected, according to the relative angle, onto the plane where the plurality of acquisition components are located.
In this embodiment, a target acquisition unit closest to the target object may be selected from the plurality of acquisition units based on signal characteristics of the plurality of voice signals. Optionally, the above-mentioned process of selecting the target acquisition component closest to the target object from the plurality of acquisition components according to the signal characteristics of the plurality of voice signals may be: the method comprises the steps of determining object angle information of a target object according to signal characteristics of a plurality of voice signals, and selecting a target acquisition component from a plurality of acquisition components according to the object angle information. For example, the angle information of the user sound source may be calculated by a DOA algorithm according to the collected microphone signals of the multiple channels, and then the channel of the microphone signal closest to the user sound source may be found according to the calculated angle information.
Optionally, the object angle information is used to describe a relative angle between the target object and the smart device, and the target collection component is a collection component that projects the target object to a plane where the collection components are located according to the relative angle and is closest to the target object. For example, the relative angle between the user object and the smart device may be determined, and then the user object may be projected to the plane where the plurality of microphones are located according to the relative angle.
Optionally, the process of determining the object angle information of the target object may be: firstly, a coordinate system is established by taking the position of the target intelligent equipment as an original point, and then the relative angle between the target object and the intelligent equipment is determined according to the position of the target object in the coordinate system. This is not limited in this embodiment.
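The projection-based selection can be sketched as follows, assuming the sound-source angle has already been estimated (e.g., by a DOA algorithm) and that the acquisition components form a circular array whose positions in the device plane are known; the function name and the array geometry are illustrative assumptions, not the prescribed implementation:

```python
import math
from typing import List, Tuple

def nearest_mic_by_angle(mic_positions: List[Tuple[float, float]],
                         source_angle_deg: float) -> int:
    # Treat the estimated source angle as a unit direction vector in the
    # microphone plane, then pick the microphone whose position has the
    # largest projection onto that direction, i.e. the one closest to the
    # target object after projection onto the plane.
    theta = math.radians(source_angle_deg)
    direction = (math.cos(theta), math.sin(theta))
    scores = [px * direction[0] + py * direction[1] for px, py in mic_positions]
    return max(range(len(mic_positions)), key=lambda i: scores[i])

# Four microphones on a 5 cm circle at 0, 90, 180 and 270 degrees:
mics = [(0.05, 0.0), (0.0, 0.05), (-0.05, 0.0), (0.0, -0.05)]
print(nearest_mic_by_angle(mics, 80.0))  # -> 1, the microphone facing ~90 degrees
```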
With this embodiment, the object angle information of the target object is first determined according to the signal characteristics of the plurality of voice signals, and the target acquisition component is then selected from the plurality of acquisition components according to the object angle information, which improves the accuracy of voice signal selection.
In an exemplary embodiment, before the target speech signal is selected from the plurality of speech signals according to signal characteristics of the plurality of speech signals, the method further includes:
s31, performing acoustic echo cancellation operation on the voice signals to obtain a plurality of processed voice signals;
and S32, extracting the signal characteristics of each of the processed voice signals to obtain the signal characteristics of the voice signals.
In this embodiment, since there may be an interference signal in the speech signal acquired by the acquisition component, the accuracy of the signal feature of the extracted speech signal may be affected. Optionally, before the target speech signal is selected from the plurality of speech signals according to the signal features of the plurality of speech signals, the plurality of speech signals may be denoised, and then the signal features of the processed speech signals may be extracted.
Optionally, the process of denoising the plurality of speech signals may be: and performing acoustic echo cancellation operation on the plurality of voice signals to obtain a plurality of processed voice signals. After obtaining the processed plurality of speech signals, signal feature extraction may be performed on each of the processed plurality of speech signals to obtain signal features of the plurality of speech signals.
Alternatively, the signal characteristic of the plurality of signals may be an energy value of the plurality of signals. When the signal feature is an energy value, the above-mentioned process of extracting the signal feature of each of the processed multiple speech signals to obtain the signal features of the multiple speech signals may be: and determining the average value or the square sum of the preset elements of the plurality of sampling points in each of the plurality of processed voice signals as the energy value (namely, the signal characteristic) of the voice signal. The preset element may be an amplitude value of the sampling point, an energy value of the sampling point, or another element of the sampling point, which is not limited in this embodiment.
With this embodiment, extracting the signal characteristics of the plurality of voice signals after performing the acoustic echo cancellation operation on them improves the accuracy of the signal characteristic extraction.
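Purely as an illustration of such a denoising stage, the sketch below implements a simple normalized-LMS (NLMS) echo canceller; it assumes the device has access to its own playback (reference) signal, which the description does not specify, so both the algorithm choice and the parameter values are assumptions:

```python
from typing import List, Sequence

def nlms_echo_cancel(mic: Sequence[float], ref: Sequence[float],
                     taps: int = 64, mu: float = 0.1,
                     eps: float = 1e-8) -> List[float]:
    # Adaptively estimate the echo of the reference (playback) signal that is
    # present in the microphone signal and subtract it, returning the residual.
    w = [0.0] * taps            # adaptive filter coefficients
    buf = [0.0] * taps          # most recent reference samples, newest first
    out = []
    for n in range(len(mic)):
        buf = [ref[n]] + buf[:-1]
        echo_est = sum(wi * xi for wi, xi in zip(w, buf))
        e = mic[n] - echo_est   # echo-cancelled sample
        norm = sum(xi * xi for xi in buf) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]
        out.append(e)
    return out
```

Each of the multi-channel voice signals would pass through such a stage before its signal characteristics (e.g., the energy value) are extracted.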
In an exemplary embodiment, before acquiring the plurality of voice signals acquired by the plurality of acquisition components on the smart device, the method further includes:
s41, carrying out image acquisition on a target object through an image acquisition component on the intelligent equipment to obtain a target acquisition image;
s42, carrying out object identification on the target collected image to obtain object position information of the target object, wherein the object position information is used for representing the relative position of the target object and the intelligent equipment;
and S43, adjusting the acquisition angles of the voice signals acquired by the plurality of acquisition components according to the position information of the object.
In this embodiment, in order to better perform voice acquisition on a voice control instruction sent by a target object through an acquisition component, before acquiring a plurality of voice signals acquired by a plurality of acquisition components on a smart device, the acquisition angles of the plurality of acquisition components may be adjusted, for example, adjusted to a direction in which the target object is located. Because it is impossible to accurately determine when the target object will issue the voice control command, the acquisition angles of the plurality of acquisition components can be adjusted in real time or periodically according to the position of the target object.
In each adjustment of the acquisition angle, image acquisition may be performed on the target object through an image acquisition component (e.g., a camera) on the smart device to obtain a target acquisition image, and object recognition is then performed on the target acquisition image to obtain object position information of the target object, where the object position information is used to indicate the relative position between the target object and the smart device; finally, the acquisition angles at which the plurality of acquisition components collect voice signals are adjusted according to the object position information.
Alternatively, after adjusting the collection angles at which the plurality of collection parts collect the voice signals, the angles at which the collection parts collect the voice signals may be directed toward the target object. In addition, the relative position of the target object and the target intelligent device can be detected by other perception sensors (for example, a human body sensor), and the acquisition angles of the voice signal acquisition of the plurality of acquisition components are adjusted based on the detected relative position.
Because the target object may be in a moving state (e.g., a user may be moving around the smart device), the relative position of the target object and the target smart device may be constantly changing. Therefore, the collection angles of the plurality of collection components for collecting the voice signals can be continuously adjusted. Optionally, the target object may be periodically subjected to image acquisition to obtain a target acquisition image, and the acquisition angles of the plurality of acquisition components for acquiring the voice signals are adjusted according to the periodically acquired target acquisition image.
Optionally, in order to reduce resource consumption, the acquisition angles of the multiple acquisition components for acquiring the voice signals may be adjusted only when a difference between the object position information of the target object determined in the next period and the object position information of the target object determined in the previous period is greater than a preset threshold, which is not limited in this embodiment.
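A rough sketch of this periodic adjustment logic is given below; the detection callback, the steering interface, the field-of-view value and the 10-degree threshold are all illustrative assumptions rather than parameters stated in this description:

```python
import time

ANGLE_THRESHOLD_DEG = 10.0   # assumed re-steering threshold between periods

def object_angle_from_image(bbox_center_x: float, image_width: float,
                            horizontal_fov_deg: float = 60.0) -> float:
    # Map the detected object's horizontal position in the image to a
    # relative angle with respect to the smart device (simple pinhole-style
    # approximation assuming a known horizontal field of view).
    offset = (bbox_center_x / image_width) - 0.5
    return offset * horizontal_fov_deg

def adjustment_loop(detect_object, steer_microphones, period_s: float = 1.0):
    # detect_object() -> (bbox_center_x, image_width) or None if no object.
    # steer_microphones(angle_deg) points the acquisition components at the object.
    last_angle = None
    while True:
        detection = detect_object()
        if detection is not None:
            angle = object_angle_from_image(*detection)
            # Only re-steer when the object has moved enough, to save resources.
            if last_angle is None or abs(angle - last_angle) > ANGLE_THRESHOLD_DEG:
                steer_microphones(angle)
                last_angle = angle
        time.sleep(period_s)
```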
It should be noted that the image collecting component may be a camera on the intelligent device, an infrared area array projector on the intelligent device, or other image collecting components, which is not limited in this embodiment.
With this embodiment, detecting the relative position of the user and the smart device, and adjusting, according to that relative position, the angle at which the acquisition components on the smart device collect voice signals, improves the signal quality of the acquired voice signals.
In an exemplary embodiment, before transmitting the energy value of the target speech signal to the server, the method further includes:
s51, determining the average value of amplitude values of a plurality of sampling points in the target voice signal as an energy value of the target voice signal; or,
and S52, determining the square sum of the amplitude values of the plurality of sampling points in the target voice signal as the energy value of the target voice signal.
In this embodiment, the energy value of the target speech signal may be determined before transmitting the energy value to the server. Alternatively, the energy value of the target speech signal may be determined from the amplitude values of a plurality of sample points in the target speech signal.
As an alternative implementation, an average value of the amplitude values of a plurality of sampling points in the target voice signal may be determined as the energy value of the target voice signal. For example, if the amplitude values of the sampling points E1, F1 and G1 are 2, 3 and 4, respectively, the energy value of the target voice signal is 3 (i.e., (2+3+4)/3).
As another alternative, the sum of the squares of the amplitude values of a plurality of sampling points in the target voice signal may be determined as the energy value of the target voice signal, and the number of sampling points may be predetermined. For example, if the amplitude values of the sampling points E2, F2 and G2 are 2, 3 and 4, respectively, the energy value of the target voice signal is 29 (i.e., 2^2 + 3^2 + 4^2 = 29).
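A small sketch reproducing both definitions (the helper names are introduced here for illustration only):

```python
from typing import Sequence

def energy_mean_amplitude(samples: Sequence[float]) -> float:
    # Average of the amplitude values of the sampling points.
    return sum(samples) / len(samples)

def energy_sum_of_squares(samples: Sequence[float]) -> float:
    # Sum of the squared amplitude values of the sampling points.
    return sum(x * x for x in samples)

amplitudes = [2.0, 3.0, 4.0]
print(energy_mean_amplitude(amplitudes))   # 3.0, as in the first example
print(energy_sum_of_squares(amplitudes))   # 29.0, as in the second example
```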
By the embodiment, the average value or the square sum of the amplitude values of the plurality of sampling points in the voice signal is determined as the energy value of the voice signal, so that the convenience of determining the energy value of the voice signal can be improved.
In an exemplary embodiment, after transmitting the energy value of the target speech signal to the server, the method further includes:
s61, receiving an energy value of a voice signal sent by each device in a plurality of devices, wherein the plurality of devices comprise intelligent devices, and the voice signal to which the energy value of the voice signal sent by each device belongs is a voice signal corresponding to a voice control instruction sent by a target object;
s62, determining target equipment from the multiple equipment according to the energy value of the voice signal sent by each equipment, wherein the target equipment is the equipment used for executing the voice control instruction in the multiple equipment;
and S63, controlling the target equipment to execute equipment operation matched with the voice control instruction.
In this embodiment, after the energy value of the target voice signal is sent to the server, the server may further receive the energy values of voice signals sent by other smart devices, i.e., the energy values of voice signals sent by a plurality of devices, where the voice signal to which the energy value sent by each device belongs is a voice signal corresponding to the voice control instruction issued by the target object, and the server determines accordingly whether the voice control instruction is to be responded to by the smart device.
Optionally, the server may receive an energy value of a voice signal sent by each of a plurality of devices, where the plurality of devices include an intelligent device, and a voice signal to which the energy value of the voice signal sent by each device belongs is a voice signal corresponding to a voice control instruction sent by the target object.
After receiving the energy value of the voice signal transmitted by each device, the server may determine a target device from the multiple devices according to the energy value of the voice signal transmitted by each device, where the target device is a device for executing a voice control instruction in the multiple devices.
Optionally, the process of determining the target device from the plurality of devices according to the energy value of the voice signal sent by each device may be: determining, as the target device, the device whose sent voice signal has the largest energy value among the plurality of devices. For example, if the server receives an energy value of 40 for the voice signal sent by device H, 38 for the voice signal sent by device I, and 45 for the voice signal sent by device J, device J is determined as the target device.
Alternatively, after the target device is determined, the target device may be controlled to perform a device operation matching the voice control instruction. For example, when the voice control instruction is a device wake-up instruction, the target device may be woken up; when the voice control instruction is used for controlling the target device to execute a specific device operation, the device operation instruction may be sent to the target device, and the device operation instruction may carry an operation parameter of the specific device operation. After receiving the device operation instruction, the target device may perform a specific device operation according to the operation parameter in the device operation instruction.
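The server-side decision described above could look roughly like the following sketch; the message format, the command strings and the function names are assumptions made for illustration, not the patent's prescribed interface:

```python
from typing import Dict

def pick_target_device(reported_energies: Dict[str, float]) -> str:
    # The device that reported the largest energy value for the same voice
    # control instruction is chosen to respond.
    return max(reported_energies, key=lambda d: reported_energies[d])

def decide_and_dispatch(reported_energies: Dict[str, float], send_command) -> None:
    target = pick_target_device(reported_energies)
    for device_id in reported_energies:
        # Only the target device is asked to execute the instruction
        # (e.g. wake up); the other devices are told not to respond.
        send_command(device_id, "execute" if device_id == target else "ignore")

# Example matching the description: devices H, I and J report 40, 38 and 45.
print(pick_target_device({"H": 40.0, "I": 38.0, "J": 45.0}))  # -> "J"
```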
With this embodiment, selecting the device that responds to the voice operation instruction based on the energy values sent by the plurality of devices improves the accuracy with which the voice control instruction is executed, and thus improves the user experience.
The following explains a processing method of a voice instruction in the embodiment of the present application with an alternative example. In this optional example, the target smart device is a smart home device, the collection component is a microphone, and the server is a cloud (i.e., a cloud server).
A voice instruction processing method in the related art is roughly as shown in fig. 3: a fixed channel of the multi-channel microphone array is selected for calculating the device's energy value; the signal collected on that channel is processed by AEC and its energy value is calculated; the calculated energy value is transmitted to the cloud; after receiving the energy values uploaded by each device, the cloud compares the energy values of the different devices, makes a decision, and then issues the decision result to the device side.
However, when a fixed channel of the multi-channel microphone array is selected for calculating the device's energy value, the calculated energy value may vary greatly with where the user places the device. For example, when the device is placed close to a wall, if the microphone signal near the wall is selected to calculate the device's energy value, the calculated energy value is inaccurate because of wall reflections and similar effects; as a result of the inaccurate wake-up, device B may be woken up even though device A is actually the one closest to the user.
In order to solve the problem that the energy value calculation of the equipment is inaccurate, the optional example provides a method for improving the awakening accuracy of the voice equipment.
As shown in fig. 4, the flow of the processing method of the voice instruction in this alternative example may include the following steps:
step S402, acquiring a multichannel microphone signal.
Step S404, performing acoustic echo cancellation processing on the acquired multi-channel microphone signal.
Step S406, calculating the angle information of the user sound source according to the processed multi-channel microphone signals.
In step S408, the channel of the microphone signal closest to the user sound source is found according to the calculated angle information of the user sound source.
In step S410, energy calculation is performed using the microphone signal closest to the user's sound source.
The calculation of the audio energy may be done by taking the sum of the squares of the sample points or the like.
In step S412, the energy value calculated by the device is uploaded to the cloud.
And step S414, the cloud makes a decision according to the energy values reported by the multiple devices.
In step S416, the device with the largest energy value is awakened.
Through this embodiment, the influence of echo and of wall reflection caused by the device placement position (for example, when the device microphone is placed against a wall or is partially blocked) can be avoided, the response accuracy of the device is improved, the problem of inaccurate wake-up is avoided, and the user experience and control accuracy are improved. In particular, the problem of inaccurate nearby wake-up among multiple devices belonging to the same household user can be effectively solved.
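The following is a minimal, illustrative sketch of steps S402 to S412 on the device side. It assumes each microphone channel is available as a NumPy array and that helper functions for echo cancellation and sound-source angle estimation exist (for example, an NLMS echo canceller and a GCC-PHAT style direction-of-arrival estimator); these helpers and all names are assumptions, since the disclosure does not fix concrete algorithms:

```python
import numpy as np

def device_side_energy(mic_signals, mic_angles, acoustic_echo_cancel, estimate_source_angle):
    """Sketch of steps S402-S412: AEC, source-angle estimation, channel selection, energy.

    mic_signals: list of 1-D numpy arrays, one per microphone channel (step S402).
    mic_angles: list of mounting angles (degrees) of the microphones on the device.
    acoustic_echo_cancel, estimate_source_angle: assumed helper callables (not specified
        in the text).
    """
    # Step S404: acoustic echo cancellation on every channel.
    cleaned = [acoustic_echo_cancel(sig) for sig in mic_signals]

    # Step S406: estimate the angle of the user's sound source from the cleaned channels.
    source_angle = estimate_source_angle(cleaned)

    # Step S408: pick the channel whose mounting angle is closest to the source angle,
    # wrapping the angular difference into [-180, 180] degrees.
    diffs = [abs((a - source_angle + 180.0) % 360.0 - 180.0) for a in mic_angles]
    nearest = int(np.argmin(diffs))

    # Step S410: energy of the selected channel as the sum of squares of its sample points.
    energy = float(np.sum(np.asarray(cleaned[nearest], dtype=np.float64) ** 2))

    # Step S412: this energy value would then be uploaded to the cloud.
    return energy
```

Steps S414 and S416 then take place on the cloud side and correspond to the server-side selection illustrated earlier.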
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method according to the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, an optical disk) and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the methods according to the embodiments of the present application.
According to another aspect of the embodiment of the present application, there is also provided a processing apparatus of a voice instruction for implementing the processing method of a voice instruction. Fig. 5 is a block diagram of an alternative voice instruction processing apparatus according to an embodiment of the present application, and as shown in fig. 5, the apparatus may include:
an obtaining unit 502, configured to obtain a plurality of voice signals acquired by a plurality of acquisition components on an intelligent device, where each voice signal in the plurality of voice signals is a voice signal which is acquired by one acquisition component in the plurality of acquisition components and corresponds to a voice control instruction sent by a target object;
a selecting unit 504, connected to the obtaining unit 502, configured to select a target speech signal from the multiple speech signals according to signal characteristics of the multiple speech signals;
and the sending unit 506 is connected with the selecting unit 504 and is used for sending the energy value of the target voice signal to the server so that the server can determine whether the intelligent device responds to the voice control command according to the energy value of the target voice signal.
It should be noted that the obtaining unit 502 in this embodiment may be configured to execute the step S202, the selecting unit 504 in this embodiment may be configured to execute the step S204, and the sending unit 506 in this embodiment may be configured to execute the step S206.
Through the above modules, a plurality of voice signals acquired by a plurality of acquisition components on the intelligent device are obtained, where each voice signal in the plurality of voice signals is a voice signal which is acquired by one acquisition component in the plurality of acquisition components and corresponds to a voice control instruction sent by a target object; a target voice signal is selected from the plurality of voice signals according to the signal characteristics of the plurality of voice signals; and the energy value of the target voice signal is sent to the server so that the server determines whether the intelligent device responds to the voice control instruction according to the energy value of the target voice signal. This solves the problem in the related art that the voice instruction is executed with poor accuracy because the energy value of the voice signal determined by the device is inaccurate, and improves the accuracy of executing the voice instruction.
In one exemplary embodiment, the selecting unit includes:
a selecting module, used for selecting a target acquisition component closest to the target object from the plurality of acquisition components according to the signal characteristics of the plurality of voice signals;
the first determining module is used for determining the voice signal acquired by the target acquisition component in the plurality of voice signals as a target voice signal.
In one exemplary embodiment, the selecting unit includes:
and the second determining module is used for determining the voice signal with the largest energy value in the plurality of voice signals as the target voice signal.
In one exemplary embodiment, the selecting module includes:
the determining submodule is used for determining object angle information of the target object according to the signal characteristics of the voice signals, wherein the object angle information is used for describing the relative angle between the target object and the intelligent equipment;
and the selecting submodule is used for selecting a target acquisition component from the plurality of acquisition components according to the angle information of the object, wherein the target acquisition component is the acquisition component which projects the target object to the plane where the plurality of acquisition components are located according to the relative angle and is closest to the target object.
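As an illustrative sketch of this projection-based selection (the component coordinates, the projection distance, and all names are assumptions introduced only for this example):

```python
import math

def select_component_by_angle(component_positions, relative_angle_deg, distance=1.0):
    """Choose the acquisition component nearest to the target object's projection.

    component_positions: list of (x, y) coordinates of the acquisition components
        in the plane where they are mounted (assumed layout).
    relative_angle_deg: relative angle between the target object and the intelligent
        device, i.e. the object angle information described above.
    distance: assumed distance used when projecting the object onto the component plane.
    Returns the index of the target acquisition component.
    """
    # Project the target object onto the plane of the acquisition components.
    theta = math.radians(relative_angle_deg)
    px, py = distance * math.cos(theta), distance * math.sin(theta)

    # The target acquisition component is the one closest to that projected point.
    dists = [math.hypot(px - x, py - y) for (x, y) in component_positions]
    return dists.index(min(dists))

# Example: four microphones on a 4 cm square; an object at 90 degrees picks index 1.
mics = [(0.02, 0.0), (0.0, 0.02), (-0.02, 0.0), (0.0, -0.02)]
print(select_component_by_angle(mics, 90.0))  # prints 1
```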
In an exemplary embodiment, the apparatus further includes:
the execution unit is used for executing acoustic echo cancellation operation on the voice signals before target voice signals are selected from the voice signals according to the signal characteristics of the voice signals to obtain processed voice signals;
and the extraction unit is used for extracting the signal characteristics of each of the processed voice signals to obtain the signal characteristics of the voice signals.
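The disclosure does not specify a particular echo-cancellation algorithm. As one common choice, a normalized LMS (NLMS) adaptive filter could be used when the loudspeaker (far-end) reference signal is available; the following is a minimal sketch under that assumption, with all names chosen for this example only:

```python
import numpy as np

def nlms_echo_cancel(mic, reference, filter_len=256, mu=0.5, eps=1e-8):
    """Minimal NLMS acoustic echo canceller (illustrative only).

    mic: microphone samples containing near-end speech plus echo (1-D array).
    reference: loudspeaker/far-end reference samples causing the echo (same length).
    Returns the echo-suppressed microphone signal.
    """
    mic = np.asarray(mic, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    w = np.zeros(filter_len)                    # adaptive filter estimating the echo path
    out = np.zeros_like(mic)
    padded = np.concatenate([np.zeros(filter_len - 1), reference])
    for n in range(len(mic)):
        x = padded[n:n + filter_len][::-1]      # most recent reference samples first
        echo_est = np.dot(w, x)                 # estimated echo component
        e = mic[n] - echo_est                   # error = echo-suppressed sample
        w += (mu / (np.dot(x, x) + eps)) * e * x  # normalized LMS weight update
        out[n] = e
    return out
```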
In an exemplary embodiment, the apparatus further comprises:
the acquisition unit is used for acquiring images of a target object through an image acquisition component on the intelligent equipment before acquiring a plurality of voice signals acquired by a plurality of acquisition components on the intelligent equipment to obtain a target acquisition image;
the identification unit is used for carrying out object identification on the target acquisition image to obtain object position information of the target object, wherein the object position information is used for representing the relative position of the target object and the intelligent equipment;
and the adjusting unit is used for adjusting the acquisition angles of the plurality of acquisition components for acquiring the voice signals according to the position information of the object.
In an exemplary embodiment, the apparatus further includes:
the device comprises a first determining unit, a second determining unit and a processing unit, wherein the first determining unit is used for determining the average value of amplitude values of a plurality of sampling points in a target voice signal as the energy value of the target voice signal before the energy value of the target voice signal is sent to a server; or,
and a second determining unit for determining the sum of squares of the amplitude values of the plurality of sampling points in the target speech signal as the energy value of the target speech signal.
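Both energy definitions mentioned above can be written compactly. The sketch below interprets the "amplitude value" of a sampling point as its absolute sample value, which is an assumption made for this example:

```python
import numpy as np

def energy_mean_amplitude(samples):
    """Energy as the average of the absolute amplitude values of the sampling points."""
    return float(np.mean(np.abs(np.asarray(samples, dtype=np.float64))))

def energy_sum_of_squares(samples):
    """Energy as the sum of squares of the amplitude values of the sampling points."""
    return float(np.sum(np.asarray(samples, dtype=np.float64) ** 2))
```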
In an exemplary embodiment, the apparatus further includes:
the device comprises a receiving unit, a processing unit and a processing unit, wherein the receiving unit is used for receiving the energy value of the voice signal sent by each device in a plurality of devices after the energy value of the target voice signal is sent to a server, the plurality of devices comprise intelligent devices, and the voice signal to which the energy value of the voice signal sent by each device belongs is the voice signal corresponding to the voice control instruction sent by the target object;
a third determining unit, configured to determine a target device from the multiple devices according to an energy value of a voice signal transmitted by each device, where the target device is a device used for executing a voice control instruction in the multiple devices;
and the control unit is used for controlling the target equipment to execute equipment operation matched with the voice control instruction.
It should be noted here that the modules described above implement the same examples and application scenarios as their corresponding steps, but are not limited to the disclosure of the above embodiments. It should be noted that the modules described above as a part of the apparatus may be operated in a hardware environment as shown in fig. 1, and may be implemented by software, or may be implemented by hardware, where the hardware environment includes a network environment.
According to still another aspect of an embodiment of the present application, there is also provided a storage medium. Optionally, in this embodiment, the storage medium may be configured to execute a program code of any one of the voice instruction processing methods in this embodiment of the present application.
Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in a network shown in the embodiment.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
the method comprises the following steps of S1, acquiring a plurality of voice signals acquired by a plurality of acquisition components on intelligent equipment, wherein each voice signal in the plurality of voice signals is a voice signal which is acquired by one acquisition component in the plurality of acquisition components and corresponds to a voice control instruction sent by a target object;
s2, selecting a target voice signal from the voice signals according to the signal characteristics of the voice signals;
and S3, sending the energy value of the target voice signal to a server, and determining whether the intelligent equipment responds to the voice control command or not by the server according to the energy value of the target voice signal.
Optionally, the specific example in this embodiment may refer to the example described in the above embodiment, which is not described again in this embodiment.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program code, such as a USB flash drive, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disk.
According to another aspect of the embodiments of the present application, there is also provided an electronic device for implementing the method for processing a voice command, where the electronic device may be a server, a terminal, or a combination thereof.
Fig. 6 is a block diagram of an alternative electronic device according to an embodiment of the present invention, as shown in fig. 6, including a processor 602, a communication interface 604, a memory 606, and a communication bus 608, where the processor 602, the communication interface 604, and the memory 606 communicate with each other through the communication bus 608, where,
a memory 606 for storing computer programs;
the processor 602, when executing the computer program stored in the memory 606, implements the following steps:
the method comprises the following steps of S1, acquiring a plurality of voice signals acquired by a plurality of acquisition components on intelligent equipment, wherein each voice signal in the plurality of voice signals is a voice signal which is acquired by one acquisition component in the plurality of acquisition components and corresponds to a voice control instruction sent by a target object;
s2, selecting a target voice signal from the voice signals according to the signal characteristics of the voice signals;
and S3, sending the energy value of the target voice signal to a server, and determining whether the intelligent equipment responds to the voice control instruction or not by the server according to the energy value of the target voice signal.
Alternatively, in this embodiment, the communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus. The communication interface is used for communication between the electronic device and other equipment.
The memory may include RAM, or may include non-volatile memory (non-volatile memory), such as at least one disk memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
As an example, the memory 606 may include, but is not limited to, the obtaining unit 502, the selecting unit 504, and the sending unit 506 of the above-mentioned voice instruction processing apparatus. In addition, other module units of the above-mentioned apparatus may also be included, but are not limited to these, and are not described in detail in this example.
The processor may be a general-purpose processor, and may include but is not limited to: a CPU (Central Processing Unit), an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
It can be understood by those skilled in the art that the structure shown in fig. 6 is only an illustration, and the device implementing the processing method of the voice instruction may be a terminal device, and the terminal device may be a terminal device such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palm computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 6 is a diagram illustrating a structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in FIG. 6, or have a different configuration than shown in FIG. 6.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solutions of the present application, which are essential or part of the technical solutions contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be implemented in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, and may also be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution provided in the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.
Claims (10)
1. A method for processing a voice command, comprising:
acquiring a plurality of voice signals acquired by a plurality of acquisition components on intelligent equipment, wherein each voice signal in the plurality of voice signals is acquired by one acquisition component in the plurality of acquisition components and corresponds to a voice control instruction sent by a target object;
selecting a target voice signal from the voice signals according to the signal characteristics of the voice signals;
and sending the energy value of the target voice signal to a server so that the server determines whether the intelligent equipment responds to the voice control instruction or not according to the energy value of the target voice signal.
2. The method according to claim 1, wherein said selecting a target speech signal from the plurality of speech signals according to the signal characteristics of the plurality of speech signals comprises:
selecting a target acquisition component closest to the target object from the plurality of acquisition components according to the signal characteristics of the plurality of voice signals; determining a voice signal acquired by the target acquisition component from the plurality of voice signals as the target voice signal; or,
and determining the voice signal with the largest energy value in the plurality of voice signals as the target voice signal.
3. The method according to claim 2, wherein selecting a target acquisition component closest to the target object from the plurality of acquisition components according to the signal characteristics of the plurality of voice signals comprises:
determining object angle information of the target object according to signal characteristics of the voice signals, wherein the object angle information is used for describing a relative angle between the target object and the intelligent equipment;
and selecting the target acquisition component from the plurality of acquisition components according to the object angle information, wherein the target acquisition component is the acquisition component which is closest to the target object after the target object is projected to the plane where the plurality of acquisition components are located according to the relative angle.
4. The method according to claim 1, wherein before said selecting a target speech signal from the plurality of speech signals according to the signal characteristics of the plurality of speech signals, the method further comprises:
performing acoustic echo cancellation operation on the plurality of voice signals to obtain the plurality of processed voice signals;
and performing signal feature extraction on each processed voice signal in the plurality of voice signals to obtain the signal features of the plurality of voice signals.
5. The method of any of claims 1 to 4, wherein prior to said acquiring a plurality of speech signals acquired by a plurality of acquisition components on a smart device, the method further comprises:
acquiring an image of the target object through an image acquisition component on the intelligent equipment to obtain a target acquisition image;
performing object identification on the target acquisition image to obtain object position information of the target object, wherein the object position information is used for representing the relative position of the target object and the intelligent equipment;
and adjusting the acquisition angles of the plurality of acquisition components for acquiring the voice signals according to the object position information.
6. The method according to any of claims 1 to 4, wherein prior to said transmitting the energy value of the target speech signal to a server, the method further comprises:
determining the average value of amplitude values of a plurality of sampling points in the target voice signal as an energy value of the target voice signal; or,
and determining the sum of squares of amplitude values of a plurality of sampling points in the target voice signal as an energy value of the target voice signal.
7. The method according to any of claims 1 to 4, wherein after said transmitting the energy value of the target speech signal to a server, the method further comprises:
receiving an energy value of a voice signal sent by each device in a plurality of devices, wherein the plurality of devices comprise the intelligent device, and the voice signal to which the energy value of the voice signal sent by each device belongs is a voice signal corresponding to a voice control instruction sent by the target object;
determining a target device from the plurality of devices according to the energy value of the voice signal sent by each device, wherein the target device is a device used for executing the voice control instruction in the plurality of devices;
and controlling the target equipment to execute equipment operation matched with the voice control instruction.
8. An apparatus for processing a voice command, comprising:
the intelligent device comprises an acquisition unit, a processing unit and a control unit, wherein the acquisition unit is used for acquiring a plurality of voice signals acquired by a plurality of acquisition components on the intelligent device, and each voice signal in the plurality of voice signals is a voice signal which is acquired by one acquisition component in the plurality of acquisition components and corresponds to a voice control instruction sent by a target object;
the selecting unit is used for selecting a target voice signal from the voice signals according to the signal characteristics of the voice signals;
and the sending unit is used for sending the energy value of the target voice signal to a server so as to determine whether the intelligent equipment responds to the voice control instruction or not according to the energy value of the target voice signal by the server.
9. A computer-readable storage medium, comprising a stored program, wherein the program when executed performs the method of any of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210609766.4A CN115148202A (en) | 2022-05-31 | 2022-05-31 | Voice instruction processing method and device, storage medium and electronic device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115148202A true CN115148202A (en) | 2022-10-04 |
Family
ID=83407021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210609766.4A Pending CN115148202A (en) | 2022-05-31 | 2022-05-31 | Voice instruction processing method and device, storage medium and electronic device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115148202A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108461084A (en) * | 2018-03-01 | 2018-08-28 | 广东美的制冷设备有限公司 | Speech recognition system control method, control device and computer readable storage medium |
CN109994112A (en) * | 2019-03-12 | 2019-07-09 | 广东美的制冷设备有限公司 | Control method, server, speech recognition apparatus and the medium of speech recognition apparatus |
WO2020014899A1 (en) * | 2018-07-18 | 2020-01-23 | 深圳魔耳智能声学科技有限公司 | Voice control method, central control device, and storage medium |
CN111261159A (en) * | 2020-01-19 | 2020-06-09 | 百度在线网络技术(北京)有限公司 | Information indication method and device |
CN112992140A (en) * | 2021-02-18 | 2021-06-18 | 珠海格力电器股份有限公司 | Control method, device and equipment of intelligent equipment and storage medium |
US20210407505A1 (en) * | 2020-06-28 | 2021-12-30 | Beijing Xiaomi Pinecone Electronics Co., Ltd. | Device control method and apparatus |
CN114203176A (en) * | 2021-11-29 | 2022-03-18 | 云知声智能科技股份有限公司 | Control method and device of intelligent equipment, storage medium and electronic device |
CN114446297A (en) * | 2022-01-24 | 2022-05-06 | 珠海格力电器股份有限公司 | Control method and control device for household equipment, storage medium and electronic device |
Similar Documents
Publication | Title
---|---
CN106910500B (en) | Method and device for voice control of device with microphone array
CN107135443A (en) | A kind of signal processing method and electronic equipment
EP3792913A1 (en) | Auxiliary speech control method and device and air conditioner
CN110493690B (en) | Sound collection method and device
CN110767225B (en) | Voice interaction method, device and system
CN106970535B (en) | Control method and electronic equipment
CN204116902U (en) | To the voice-operated Voice command end of household electrical appliance and control terminal
CN110364161A (en) | Method, electronic equipment, medium and the system of voice responsive signal
CN112037789A (en) | Equipment awakening method and device, storage medium and electronic device
CN104216351A (en) | Household appliance voice control method and system
CN107004426A (en) | The method and mobile terminal of the sound of admission video recording object
CN104185116A (en) | Automatic acoustic radiation mode determining method
CN109754823A (en) | A kind of voice activity detection method, mobile terminal
CN111883186B (en) | Recording device, voice acquisition method and device, storage medium and electronic device
CN104900237B (en) | A kind of methods, devices and systems for audio-frequency information progress noise reduction process
CN111973076A (en) | Room attribute identification method and device, sweeping robot and storage medium
US9733714B2 (en) | Computing system with command-sense mechanism and method of operation thereof
CN112104965B (en) | Sound amplification method and sound amplification system
CN115148202A (en) | Voice instruction processing method and device, storage medium and electronic device
CN108415572B (en) | Module control method and device applied to mobile terminal and storage medium
CN111103807A (en) | Control method and device for household terminal equipment
CN112867141A (en) | Positioning control method, Bluetooth service node and electronic equipment
CN111836090B (en) | Control method, device, equipment and storage medium
CN112438660A (en) | Voice positioning method based on sweeping robot and related device
EP2891957B1 (en) | Computing system with command-sense mechanism and method of operation thereof
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination