CN116896488A - Voice control method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN116896488A
CN116896488A (application CN202310848085.8A)
Authority
CN
China
Prior art keywords
voice
signal
intelligent
home network
equipment
Prior art date
Legal status
Pending
Application number
CN202310848085.8A
Other languages
Chinese (zh)
Inventor
鲁勇
刘玉洁
崔潇潇
张新科
Current Assignee
Beijing Intengine Technology Co Ltd
Original Assignee
Beijing Intengine Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Intengine Technology Co Ltd filed Critical Beijing Intengine Technology Co Ltd
Priority to CN202310848085.8A
Publication of CN116896488A
Legal status: Pending


Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00 - Data switching networks
    • H04L 12/28 - Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L 12/2803 - Home automation networks
    • H04L 12/2816 - Controlling appliance services of a home automation network by calling their functionalities
    • H04L 12/282 - Controlling appliance services of a home automation network by calling their functionalities based on user interaction within the home
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Automation & Control Theory (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Selective Calling Equipment (AREA)

Abstract

Embodiments of the application disclose a voice control method and device, a storage medium, and an electronic device. The method comprises the following steps: receiving a voice wake-up signal through a smart device in a smart home network; extracting a directional word from the voice wake-up signal; determining the target device corresponding to the directional word; waking up the target device based on the smart home network; receiving a voice command signal through a smart device in the smart home network; and controlling the target device according to the voice command signal. With the scheme provided by the embodiments, the target device can be determined among multiple smart devices in the smart home network according to the directional word in the voice signal, and then woken up and controlled, which effectively improves the recognition accuracy of voice signals and the usability of the smart devices.

Description

Voice control method and device, storage medium and electronic equipment
Technical Field
The application relates to the technical field of audio data processing, and in particular to a voice control method and device, a storage medium, and an electronic device.
Background
In recent years, with the popularization of smart speakers, voice assistants, and similar products, speech recognition has gained increasingly wide acceptance, and its applications keep expanding into everyday scenarios such as controlling devices and searching for content by voice. The continued maturation of speech recognition technology has greatly promoted the development and popularization of voice-based smart home control systems; a large number of smart home control systems that use smart speakers or other voice collectors as the control interface have appeared on the market, bringing great convenience to users' daily lives.
However, in current products, when a user wants to control a device by voice, the user must state the device's name precisely before the operation can be performed, which places high demands on the user. Moreover, when several identical devices are present, voice control is prone to misoperation. For example, if the user has air conditioners in both the living room and the bedroom, the user must state precisely which air conditioner to control before any follow-up operation; an instruction such as "turn on the air conditioner" may leave the system unable to distinguish which device should be controlled, causing misoperation and reducing usability.
Disclosure of Invention
Embodiments of the application provide a voice control method and device, a storage medium, and an electronic device, which can determine the target device among multiple smart devices in a smart home network according to a directional word in a voice signal, then wake up and control the target device, effectively improving the recognition accuracy of voice signals and the usability of the smart devices.
An embodiment of the application provides a voice control method, comprising the following steps:
receiving a voice wake-up signal through a smart device in a smart home network;
extracting a directional word from the voice wake-up signal, determining the target device corresponding to the directional word, and waking up the target device based on the smart home network;
receiving a voice command signal through a smart device in the smart home network;
and controlling the target device according to the voice command signal.
In an embodiment, extracting the directional word from the voice wake-up signal comprises:
acquiring voice features of the voice wake-up signal;
inputting the voice features into a pre-trained acoustic model and outputting the acoustic posterior probabilities corresponding to the voice wake-up signal;
and decoding the voice wake-up signal through a decoding graph and the acoustic posterior probabilities to identify the directional word in the voice wake-up signal.
In an embodiment, acquiring the voice features of the voice wake-up signal comprises:
performing pre-emphasis, framing, and windowing operations on the received voice wake-up signal to obtain a processed time-domain signal;
performing a Fourier transform on the time-domain signal frame by frame to convert it into a frequency-domain signal;
and performing Mel filtering on the frequency-domain signal to obtain Mel features, which are used as the voice features.
In an embodiment, the training process of the acoustic model comprises:
constructing a phoneme-sequence-difference minimization loss value based on the prediction results output by the acoustic model for the training samples;
calculating a loss function from the phoneme-sequence-difference minimization loss value and a cross-entropy loss value;
and training the acoustic model based on the loss function.
In an embodiment, the process of constructing the decoding graph comprises:
constructing a project vocabulary from the preset directional words, and training on the project vocabulary with a statistical language model tool to obtain a first language model transducer;
constructing a pronunciation dictionary from the preset directional words and their corresponding phonemes;
and composing the first language model transducer with the pronunciation dictionary transducer and performing minimization to generate a forward decoding graph.
In an embodiment, after generating the forward decoding graph, the method further comprises:
generating a reverse-order project vocabulary from the project vocabulary and a corresponding second language model transducer;
generating a reverse-order pronunciation dictionary from the pronunciation dictionary;
and composing the second language model transducer with the reverse-order pronunciation dictionary transducer and performing minimization to generate a backward decoding graph.
In an embodiment, waking up the target device based on the smart home network comprises:
generating a control instruction from the device information of the target device;
and broadcasting the control instruction to the smart home network to find and wake up the target device.
In an embodiment, before receiving the voice wake-up signal through the smart device in the smart home network, the method further comprises:
connecting at least one smart device to a public network;
controlling the at least one smart device to broadcast its own device information and to receive the broadcast information of other smart devices;
selecting a central device from the at least one smart device according to the number of broadcasts each smart device has received and the signal strength;
and establishing the smart home network based on the central device, and controlling the other smart devices to join it.
The application also provides a voice control method, comprising the following steps:
receiving a voice wake-up signal through a smart device in a smart home network;
waking up all smart devices in the smart home network according to the voice wake-up signal;
receiving a voice command signal through a smart device in the smart home network;
and extracting a directional word from the voice command signal, determining the target device corresponding to the directional word, and controlling the target device based on the smart home network.
An embodiment of the application also provides a voice control device, comprising:
a first receiving module for receiving a voice wake-up signal through a smart device in a smart home network;
a wake-up module for extracting a directional word from the voice wake-up signal, determining the target device corresponding to the directional word, and waking up the target device based on the smart home network;
a second receiving module for receiving a voice command signal through a smart device in the smart home network;
and a control module for controlling the target device according to the voice command signal.
Embodiments of the application also provide a storage medium storing a computer program adapted to be loaded by a processor to perform the steps of the voice control method according to any of the above embodiments.
The embodiment of the application also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the steps in the voice control method according to any embodiment by calling the computer program stored in the memory.
With the voice control method and device, storage medium, and electronic device described above, a smart device in the smart home network can receive a voice wake-up signal; a directional word is extracted from the voice wake-up signal, the target device corresponding to the directional word is determined, and the target device is woken up based on the smart home network; a smart device in the smart home network then receives a voice command signal, and the target device is controlled according to it. With this scheme, the target device can be determined among multiple smart devices in the smart home network according to the directional word in the voice signal, then woken up and controlled, which effectively improves the recognition accuracy of voice signals and the usability of the smart devices.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic system diagram of a voice control apparatus according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a voice control method according to an embodiment of the present application;
fig. 3 is a schematic flow chart of another voice control method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a voice control device according to an embodiment of the present application;
fig. 5 is another schematic structural diagram of a voice control apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
Embodiments of the application provide a voice control method and device, a storage medium, and an electronic device. Specifically, the voice control method of the embodiments may be executed by an electronic device or a server, where the electronic device may be a terminal. The terminal can be a smartphone, tablet computer, notebook computer, touch screen, game console, personal computer (PC), personal digital assistant (PDA), smart home appliance, or similar device, and may also include a client, which may be a media playback client, an instant messaging client, or the like.
For example, when the voice control method runs on the electronic device, the electronic device may receive an audio signal through the smart devices located in a distributed network, identify the command word in the audio signal, and determine the candidate devices associated with the command word; if there are at least two candidate devices, it may calculate the energy of the audio signal as received by the different smart devices in the distributed network, determine the target device among the candidates according to those energy values, and wake up the target device to realize audio control. The electronic device may be any of the smart devices in the distributed network.
Referring to fig. 1, fig. 1 is a schematic system diagram of a voice control apparatus according to an embodiment of the application. The system may include at least one smart device 1000, and the smart devices 1000 may be connected through a distributed network. The smart device 1000 may be a terminal device with computing hardware capable of supporting and executing multimedia software products. The network may be wireless or wired, such as a wireless local area network (WLAN), a local area network (LAN), or a cellular network (2G, 3G, 4G, 5G, etc.). In addition, the different devices 1000 may be connected to other embedded platforms, or to a server, a personal computer, and so on, using their own Bluetooth or hotspot networks. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms.
The embodiment of the application provides a voice control method that can be used by an electronic device; the description below takes the method as executed by an electronic device as an example. The electronic device includes a microphone for receiving the voice uttered by a user and converting it into an audio signal, so that subsequent device control can be realized according to the audio signal.
Referring to fig. 2, the specific flow of the method may be as follows:
step 101, receiving a voice wake-up signal through an intelligent device in the intelligent home network.
In an embodiment, the smart device may be a smart home device pre-connected to the smart home network, such as a smart lamp, smart television, smart air conditioner, smart curtain, smart water heater, or smart washing machine. The smart device may include at least one microphone for receiving the voice wake-up signal and the voice command signal uttered by the user.
In an embodiment, to further improve the recognition rate, a noise reduction operation may be performed after the audio signal is received, for example separating the human voice from the environmental sound in the audio signal to obtain voice-only audio. In one implementation, the audio signal may be fed into an existing voice separation model to separate the voice audio from the environmental audio; the model may be a deep neural network trained with PIT (Permutation Invariant Training). In another implementation, a separation tool is used to separate the voice audio from the environmental audio, for example by performing voice extraction according to the spectral or frequency characteristics of the audio data.
Step 102: extract the directional word from the voice wake-up signal, determine the target device corresponding to the directional word, and wake up the target device based on the smart home network.
In an embodiment, a directional word may be preset for each smart device in the smart home network; directional words may include, but are not limited to, the location of the smart device, a nickname, and so on. As an example, several air conditioners in a home may be named "bedroom Xiaobai", "living room Xiaobai", and "dining room Xiaobai", where "bedroom", "living room", and "dining room" are the directional words.
In particular, the voice features of the received voice wake-up signal may be extracted; the voice features may be Fbank (filter bank) features or MFCC (Mel-frequency cepstral coefficient) features. For example, preprocessing operations such as pre-emphasis, framing, and windowing may be performed on the received voice signal; the framed time-domain signal is then Fourier-transformed frame by frame into a frequency-domain signal; finally, Mel filtering is applied to the frequency-domain signal to obtain Mel features, which serve as the voice features. That is, the step of acquiring the voice features of the voice wake-up signal may comprise: performing pre-emphasis, framing, and windowing on the received voice wake-up signal to obtain a processed time-domain signal; performing a Fourier transform on the time-domain signal frame by frame to convert it into a frequency-domain signal; and performing Mel filtering on the frequency-domain signal to obtain Mel features, which are used as the voice features.
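The pre-emphasis, framing, windowing, per-frame FFT, and Mel-filtering pipeline described above can be sketched as follows. This is a minimal NumPy illustration of standard log Fbank extraction, not the patent's actual implementation; the frame length, frame shift, FFT size, and filter count are assumed typical values.

```python
import numpy as np

def fbank_features(signal, sr=16000, frame_len=0.025, frame_shift=0.010,
                   n_fft=512, n_mels=40, pre_emph=0.97):
    """Log Mel filter-bank features: pre-emphasis, framing, Hamming
    windowing, per-frame FFT, then Mel filtering."""
    # Pre-emphasis: boost the high-frequency part of the signal
    signal = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Framing into overlapping frames
    flen, fshift = int(sr * frame_len), int(sr * frame_shift)
    n_frames = 1 + max(0, (len(signal) - flen) // fshift)
    frames = np.stack([signal[i * fshift:i * fshift + flen]
                       for i in range(n_frames)])
    # Windowing each frame
    frames = frames * np.hamming(flen)
    # Frame-by-frame FFT -> power spectrum (frequency-domain signal)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular Mel filter bank
    def hz_to_mel(hz): return 2595 * np.log10(1 + hz / 700)
    def mel_to_hz(mel): return 700 * (10 ** (mel / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    feats = power @ fbank.T
    return np.log(feats + 1e-10)

feats = fbank_features(np.random.randn(16000))  # one second of audio
print(feats.shape)  # (frames, n_mels)
```

With the parameters above, one second of 16 kHz audio yields 98 frames of 40-dimensional features, which would then be fed to the acoustic model.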
In another embodiment, the step of extracting the directional word from the voice wake-up signal may be performed in the cloud. For example, after receiving the voice wake-up signal, the smart device may determine whether it currently has a network connection; if so, it uploads the preprocessed audio signal to a cloud speech recognition platform, the platform processes the submitted audio signal, and the recognized text result is then converted into a command word for the candidate smart device.
In an embodiment, after the directional word is obtained, the target device in the smart home network can be determined and woken up. Specifically, the target device may be searched for in the smart home network according to its device information, and woken up once the corresponding device is found. For example, when smart voice air conditioners exist at several positions in a home and the user wants to start the bedroom air conditioner, the user can wake it up by uttering the voice signal "bedroom air conditioner". The other devices remain in the non-woken state; it should be noted that in this state a device's overall power consumption is low and its voice processing capability is relatively weak. That is, the step of waking up the target device based on the smart home network may comprise: generating a control instruction from the device information of the target device, and broadcasting the control instruction to the smart home network to find and wake up the target device.
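As a rough in-memory illustration of the wake-up step just described, the sketch below broadcasts a control instruction built from the target's device info to every node, and only the matching device transitions to the awake state. The class names, message fields, and device names are hypothetical; the patent does not specify the wire protocol.

```python
class SmartDevice:
    def __init__(self, name):
        self.name = name
        self.awake = False  # non-woken: low power, limited voice processing

    def on_broadcast(self, instruction):
        # Each node checks whether the broadcast instruction targets itself
        if instruction["target"] == self.name:
            self.awake = True

class SmartHomeNetwork:
    def __init__(self, devices):
        self.devices = devices

    def broadcast(self, instruction):
        # Deliver the control instruction to every node in the network
        for d in self.devices:
            d.on_broadcast(instruction)

    def wake_target(self, directional_word, appliance):
        # Build the control instruction from the target's device information
        self.broadcast({"target": f"{directional_word} {appliance}"})

net = SmartHomeNetwork([SmartDevice("bedroom air conditioner"),
                        SmartDevice("living room air conditioner")])
net.wake_target("bedroom", "air conditioner")
print([d.name for d in net.devices if d.awake])  # only the bedroom unit wakes
```

The directional word selects exactly one device; all others stay in the low-power non-woken state, matching the behaviour described above.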
Step 103: receive a voice command signal through a smart device in the smart home network.
Step 104: control the target device according to the voice command signal.
In an embodiment, after the target device in the smart home network is woken up, it can be further controlled. Specifically, the user speaks a voice command signal to any smart device in the smart home network, and the device that receives the signal automatically forwards it to the woken target device over the smart home network. For example, after waking up the bedroom air conditioner, the user only needs to say "turn on cooling mode" to the smart television or smart speaker in the smart home network to trigger the corresponding function. Throughout this process the other smart devices stay in the non-woken state, which effectively reduces power consumption.
In an embodiment, after the target device is woken up and voice control is carried out according to the voice command signal, a prompt message may also be generated to notify the user; for example, once the bedroom air conditioner has been turned on, the voice signal "the air conditioner is on" may be played through a built-in speaker.
As can be seen from the above, the voice control method provided by the embodiment of the application receives a voice wake-up signal through a smart device in the smart home network, extracts the directional word from the voice wake-up signal, determines the target device corresponding to the directional word, wakes up the target device based on the smart home network, receives a voice command signal through a smart device in the smart home network, and controls the target device according to the voice command signal. With this scheme, the target device can be determined among multiple smart devices in the smart home network according to the directional word in the voice signal, then woken up and controlled, which effectively improves the recognition accuracy of voice signals and the usability of the smart devices.
Fig. 3 is a schematic flow chart of a voice control method according to an embodiment of the application. The specific flow of the method can be as follows:
Step 201: connect at least one smart device to a public network, control the at least one smart device to broadcast its own device information, and receive the broadcast information of the other smart devices.
In an embodiment, the condition for multiple smart devices to join the same network is that they hold the same network key, so the purpose of network configuration is to give the devices a common key. Therefore, in this embodiment, after the voice chip in a smart device recognizes the command word for starting networking, the at least one smart device may be set to a public network key (preset at the factory), and devices sharing the public network key may communicate with each other.
The smart devices that have entered the public network can continuously broadcast their own device information (such as the MAC address) according to a preset device discovery protocol, and receive device discovery packets from the other smart devices, caching them in a device discovery list.
Step 202: select a central device from the at least one smart device according to the number of broadcasts received by each smart device and the signal strength.
Specifically, this embodiment can elect the smart device at the most central position, based on the number of smart devices discovered and the accumulated receive sensitivity (or signal strength, for example RSSI), to serve as the optimal node of the network.
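The election rule just described (discovery count plus accumulated signal strength) can be sketched as follows. The exact scoring formula is not given in the text, so the weighting below, and the device names and RSSI values, are assumptions for illustration.

```python
def elect_central_device(discovery_lists):
    """discovery_lists maps device id -> list of (peer_id, rssi_dbm)
    entries cached in that device's discovery list."""
    def score(entries):
        # More discovered peers, and stronger (less negative) accumulated
        # RSSI, give a higher score; the 1/100 weight is an assumption.
        return len(entries) + sum(rssi for _, rssi in entries) / 100.0
    return max(discovery_lists, key=lambda dev: score(discovery_lists[dev]))

# Hypothetical discovery lists: the speaker hears every peer, and strongly
discovery = {
    "speaker": [("tv", -40), ("lamp", -45), ("ac", -50)],
    "tv":      [("speaker", -40), ("lamp", -70)],
    "lamp":    [("speaker", -45)],
    "ac":      [("speaker", -50)],
}
print(elect_central_device(discovery))  # the best-connected node wins
```

Here the speaker is elected because it has both the longest discovery list and the strongest accumulated signal, making it the natural hub of the private network formed in the next step.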
Step 203: establish the smart home network based on the central device, and control the other smart devices to join it.
In an embodiment, the optimal node elected in step 202 generates a private network key according to a certain rule and broadcasts it over radio frequency to the other smart devices in the public network. It should be noted that a smart device that has obtained the private network key may forward it at least once, so that the key reaches every device. Finally, the smart devices holding the private network key exit the public network and enter the private smart home network, completing automatic networking.
Step 204: receive a voice wake-up signal through a smart device in the smart home network.
In an embodiment, the smart device may be a smart home device connected to the smart home network, such as a smart lamp, smart television, smart air conditioner, smart curtain, smart water heater, or smart washing machine. The smart device may include at least one microphone for receiving the voice signal uttered by the user.
Step 205: acquire the voice features of the voice wake-up signal, input them into the pre-trained acoustic model, and output the acoustic posterior probabilities corresponding to the voice wake-up signal.
In an embodiment, the acoustic model may use a TDNN (time-delay neural network) or DCNN (deep convolutional neural network) structure. The output unit of the acoustic model may be a transition-id, a phoneme, a word, and so on, to facilitate subsequent decoding; preferably, the application uses phonemes as the output of the acoustic model.
In an embodiment, the training process of the acoustic model may include: constructing a phoneme-sequence-difference minimization loss value based on the prediction results the acoustic model outputs for the training samples, calculating a loss function from that value and a cross-entropy loss value, and training the acoustic model based on the loss function.
Specifically, after directional words are added to the voice wake-up signal, a large number of prefixed/suffixed command words are produced, such as "bedroom Xiaobai", "living room Xiaobai", and "dining room Xiaobai". Because the prefix/suffix parts are identical, the differing part makes up a smaller proportion of each command word, which makes recognition very difficult. To address this, the application proposes a multi-loss method: a phoneme sequence difference minimization loss (MPSD, Minimize Phoneme Sequential Differences) is introduced on top of the conventional cross-entropy loss, with the aim of sufficiently improving the distinguishability between target command words and confusable command words. The loss function L is computed as follows:
L = α * L_CE + (1 - α) * L_MPSD
where L_CE is the classical cross-entropy loss, which is not described in detail here.
L_MPSD is the phoneme sequence difference minimization loss, implemented using CTC as the discriminant function.
In its calculation, y_i denotes the frame-level network output for the i-th training sample, i.e., the acoustic model's prediction; the ground-truth phoneme text sequence label of the i-th training sample has a length equal to the number of phonemes it contains; the j-th mixed phoneme sequence generated from that label by online augmentation likewise has a length equal to the number of phonemes it contains; N denotes the total number of training samples in the current batch; M denotes the total number of augmented sequences generated by online augmentation of the phoneme text of training sample i; and L_CTC denotes the CTC loss function of the output sequence.
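The weighted combination L = α·L_CE + (1-α)·L_MPSD can be sketched as below. The cross-entropy term is standard; the MPSD term is passed in as a precomputed value here, because the patent computes it with a CTC discriminant over augmented phoneme sequences whose exact formula is not reproduced in this text, so this sketch illustrates only the weighting, not the full loss.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean frame-level cross-entropy over one utterance.
    probs: (frames, classes) posteriors; labels: (frames,) class indices."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-10)))

def multi_loss(probs, labels, mpsd_value, alpha=0.7):
    """L = alpha * L_CE + (1 - alpha) * L_MPSD, returning the combined
    loss and the CE component. alpha is an assumed hyperparameter."""
    l_ce = cross_entropy(probs, labels)
    return alpha * l_ce + (1 - alpha) * mpsd_value, l_ce

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=10)   # 10 frames, 5 phoneme classes
labels = rng.integers(0, 5, size=10)
total, l_ce = multi_loss(probs, labels, mpsd_value=1.2, alpha=0.7)
print(total > 0)
```

In training, the MPSD value would be produced by the CTC discriminant over the ground-truth and augmented phoneme sequences described above, and the gradient of the combined loss would drive the acoustic model to separate confusable command words.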
In an embodiment, the mixed phoneme sequences can be generated by online augmentation of the phoneme text. For directional command words, the application may augment only the directional part, such as "bedroom" or "living room"; for easily confused directional words, it is recommended to preferentially augment the confusable parts, such as the characters for "guest" (客) and "meal" (餐) in "living room" (客厅) and "dining room" (餐厅), respectively.
Taking "bedroom Xiaobai" (卧室小白: w o_4, sh i_4, x iao_3, b ai_2) as an example, the augmentation methods include, but are not limited to, the following modes and any combination thereof:
Insertion mode: a random number of phonemes is inserted at randomly selected positions in the original phoneme sequence; the application takes insertion in syllable units as an example. The augmented phoneme sequence is shown below:
w o_4, sh i_4, g e_4, x iao_3, b ai_2
Deletion mode: positions in the original phoneme sequence are randomly selected and the corresponding phonemes are deleted; the application takes deletion in syllable units as an example. The augmented phoneme sequence is shown below:
sh i_4, x iao_3, b ai_2
Substitution mode: a random number of positions in the original phoneme sequence are randomly selected and the corresponding phonemes are substituted. The augmented phoneme sequence is shown below:
w o_4, sh ao_4, x iao_3, b ai_2
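The three augmentation modes above can be sketched as follows, operating on syllable-level phoneme units (the toy phone inventory and the seeded random generator are illustrative assumptions; a real system would draw from the full lexicon):

```python
import random

# Toy inventory of syllable-level phoneme units (illustrative assumption).
PHONES = ["w o_4", "sh i_4", "g e_4", "sh ao_4", "x iao_3", "b ai_2"]

def insert_phone(seq, rng):
    """Insertion mode: add a random syllable at a random position."""
    pos = rng.randrange(len(seq) + 1)
    return seq[:pos] + [rng.choice(PHONES)] + seq[pos:]

def delete_phone(seq, rng):
    """Deletion mode: remove the syllable at a random position."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + seq[pos + 1:]

def substitute_phone(seq, rng):
    """Substitution mode: replace the syllable at a random position."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + [rng.choice(PHONES)] + seq[pos + 1:]

rng = random.Random(0)                                # seeded for reproducibility
original = ["w o_4", "sh i_4", "x iao_3", "b ai_2"]   # "bedroom Xiaobai"
```

Any combination of the three modes can be chained on the same sequence to produce further mixed phoneme sequences online during training.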
Step 206, decoding the voice wake-up signal through the decoding graph and the acoustic posterior probability to identify the directional word in the voice wake-up signal.
In order to further improve the decoding accuracy of voice instructions after directional words are added, particularly voice instructions in which the directional word is at the head, the application provides a bidirectional decoding scheme. To guarantee the voice decoding speed, the forward decoding graph adopts a simpler language model, recommended to be (but not limited to) a bigram language model, so that most negative samples are filtered out quickly during forward decoding. The reverse decoding graph adopts a more accurate language model, recommended to be (but not limited to) a trigram language model. Reverse decoding effectively suppresses the decoding-forgetting problem of the forward decoding pass, thereby further improving decoding accuracy.
In an embodiment, the process of constructing the decoding graph may include: constructing a project word list according to the preset directional words; training on the project word list with a statistical language model tool to obtain a first language model converter; constructing a pronunciation dictionary according to the preset directional words and their corresponding phonemes; and composing the first language model converter with the pronunciation dictionary and performing minimization to generate the forward decoding graph.
After generating the forward decoding graph, the method may further include: generating a reverse-order project word list and a corresponding second language model converter from the project word list; generating a reverse-order pronunciation dictionary from the pronunciation dictionary; and composing the second language model converter with the reverse-order pronunciation dictionary and performing minimization to generate the reverse decoding graph.
Specifically, the language model converter described above is used to describe the likelihood of word-to-word combinations. It is recommended to construct it with a statistical language model tool (e.g., SRILM), which obtains the language model converter by training on a text corpus. The traditional construction method often requires a large amount of text corpus; the present application instead recommends training on the project word list. For example, the project word list may include: bedroom Xiaobai, living room Xiaobai, and so on. A pronunciation dictionary implementing the phoneme-to-word conversion is then constructed, such as shown in the following table:
Then, a composition operation is performed on the first language model converter and the pronunciation dictionary, after which determinization is performed to reduce redundancy in the decoding graph, and the complexity of the decoding graph is optimized through a minimization operation, finally yielding the forward decoding graph.
The construction flow of the reverse decoding graph is the same as that of the forward decoding graph, except that the project word list and the pronunciation dictionary need to be processed in reverse order. For example, a corresponding reverse-order word list can be generated from the project word list by reversing the characters of each entry, e.g., 卧室小白 (bedroom Xiaobai) becomes 白小室卧. The reverse-order pronunciation dictionary may also be:
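In code, the reverse-order word list and reverse-order pronunciation dictionary can be generated from their forward counterparts as follows (the sample entries are illustrative; the patent's dictionary tables are not reproduced here):

```python
def reverse_vocab(vocab):
    """Reverse the character order of every word-list entry."""
    return [word[::-1] for word in vocab]

def reverse_lexicon(lexicon):
    """Reverse each entry's word and its phoneme sequence, yielding the
    reverse-order pronunciation dictionary."""
    return {word[::-1]: phones[::-1] for word, phones in lexicon.items()}

# Illustrative forward word list and pronunciation dictionary.
vocab = ["卧室小白", "客厅小白"]   # bedroom Xiaobai, living room Xiaobai
lexicon = {"卧室小白": ["w o_4", "sh i_4", "x iao_3", "b ai_2"]}
```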
the decoding of the forward decoding graph and the decoding of the backward decoding graph can be performed later, and specifically, a mode of Lattice decoding can be adopted when the decoding of the forward decoding graph is performed. The decoding is propagated through tokens (Token) and processed frame by frame. Firstly, initializing a Token at a starting node of a decoding diagram; the Token is then propagated to the transfer arcs of the active node for each active node with which the originating node is associated. If the label of the input of the transfer arc is not 0, the number of frames corresponding to the activated Token is increased by 1, i.e. the next frame is moved. Each Token records the cumulative total cost of the current node, i.e. the previous optimal cost plus the decoding graph cost corresponding to the transition arc plus the acoustic cost.
Since the selection of the decoding path is determined by its final accumulated cost, a smaller accumulated cost indicates a better decoding path. The accumulated costs of the decoding paths are therefore sorted in ascending order, and the N (N ≥ 2) smallest accumulated costs are taken as the first speech recognition result. The inverse of the accumulated cost is taken as the first speech recognition confidence. If the first speech recognition confidence is greater than a preset threshold, the second-pass recognition is enabled; otherwise, the first recognition result is output as the final recognition result, namely that no command word is recognized in the current voice.
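The token propagation and first-pass N-best selection described above can be sketched on a toy decoding graph. This is a simplified assumption-laden sketch: states, arcs, and costs are invented, every arc is assumed to consume one frame, and a real decoder would emit lattices and handle epsilon (label 0) arcs, which this sketch omits.

```python
def token_passing(arcs, start, finals, acoustic_frames, beam=8):
    """Propagate tokens frame by frame; each token accumulates
    previous cost + decoding-graph (arc) cost + acoustic cost.
    arcs: {state: [(next_state, label, graph_cost), ...]}
    acoustic_frames: one {label: acoustic_cost} dict per frame."""
    tokens = [(0.0, start, [])]
    for frame in acoustic_frames:
        expanded = []
        for cost, state, path in tokens:
            for nxt, label, g_cost in arcs.get(state, []):
                expanded.append((cost + g_cost + frame[label], nxt, path + [label]))
        tokens = sorted(expanded)[:beam]          # simple beam pruning
    return sorted((c, p) for c, s, p in tokens if s in finals)

def n_best(results, n=2):
    """Keep the N (N >= 2) smallest accumulated costs as the first result."""
    return results[:n]

def confidence(cost):
    """Hypothetical inversion of the accumulated cost into a confidence."""
    return 1.0 / (1.0 + cost)
```

For example, on a two-arc graph where "wo shi" is acoustically cheaper than "ke shi", the lowest-cost final token carries the path ["wo", "shi"].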
When decoding on the reverse decoding graph, the inputs to the decoder are the acoustic costs in reverse order and the reverse decoding graph; the decoding process is the same as that of the forward decoding graph and likewise adopts Lattice decoding. The path corresponding to the minimum accumulated cost of reverse decoding is the final decoding path, and the inverse of that minimum accumulated cost serves as the final speech recognition confidence, namely the second speech recognition confidence. The recognition confidence of the reverse decoding is compared with a preset threshold, and the final recognition result is output.
Step 207, determining a target device corresponding to the directional word, and waking up the target device based on the intelligent home network.
And step 208, receiving a voice command signal through intelligent equipment in the intelligent home network, and controlling the target equipment according to the voice command signal.
All the above technical solutions may be combined to form an optional embodiment of the present application, and will not be described in detail herein.
As can be seen from the foregoing, the voice control method provided by the embodiment of the present application may: access at least one intelligent device to a public network; control the at least one intelligent device to broadcast its own device information and receive the broadcast information of other intelligent devices; select a central device from the at least one intelligent device according to the number of broadcasts received by each intelligent device and the signal strength; establish an intelligent home network based on the central device and control the other intelligent devices to join the intelligent home network; receive a voice wake-up signal through an intelligent device in the intelligent home network; obtain the voice features of the voice wake-up signal, input the voice features into a pre-trained acoustic model, and output the acoustic posterior probability corresponding to the voice wake-up signal; decode the voice wake-up signal through the decoding graph and the acoustic posterior probability to identify the directional word in the voice wake-up signal; determine the target device corresponding to the directional word and wake up the target device based on the intelligent home network; and receive a voice command signal through an intelligent device in the intelligent home network and control the target device according to the voice command signal. According to the scheme provided by the embodiment of the application, the target device can be determined from a plurality of intelligent devices in the intelligent home network according to the directional word in the voice signal, so that the target device is woken up and controlled, which effectively improves the recognition accuracy of the voice signal and the use efficiency of the intelligent devices.
The embodiment of the application provides a voice control method, which comprises the following specific processes:
Step 301, receiving a voice wake-up signal through an intelligent device in an intelligent home network.
Step 302, waking up all intelligent devices in the intelligent home network according to the voice wake-up signal.
Step 303, receiving a voice command signal through an intelligent device in the intelligent home network.
Step 304, extracting directional words in the voice command signal, determining target equipment corresponding to the directional words, and controlling the target equipment based on the intelligent home network.
In this embodiment, a plurality of intelligent devices in the intelligent home network can be set to share one wake-up word, and different directional words are then added to the voice commands used to control the different intelligent devices. For the step of identifying the directional words, reference can be made to the above embodiment, and further description is omitted.
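Under this shared-wake-word embodiment, routing a command reduces to a lookup from the directional word to the target device. The following sketch uses illustrative device names and directional words; a real deployment would populate the mapping from the smart home network's device registry.

```python
# Illustrative mapping from directional word to device identifier (assumed).
DEVICES = {"bedroom": "bedroom-speaker", "living room": "living-room-tv"}

def route_command(command):
    """Return (target_device, command) for the first directional word found
    in the voice command, or None when no directional word matches."""
    for word, device in DEVICES.items():
        if word in command:
            return device, command
    return None
```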
In order to facilitate better implementation of the voice control method according to the embodiment of the present application, the embodiment of the present application further provides a voice control device. Referring to fig. 4, fig. 4 is a schematic structural diagram of a voice control device according to an embodiment of the application. The voice control apparatus may include:
A first receiving module 301, configured to receive a voice wake-up signal through an intelligent device in an intelligent home network;
the wake-up module 302 is configured to extract a directional word in the voice wake-up signal, determine a target device corresponding to the directional word, and wake-up the target device based on the smart home network;
a second receiving module 303, configured to receive a voice command signal through an intelligent device in the intelligent home network;
and the control module 304 is configured to control the target device according to the voice command signal.
In an embodiment, please further refer to fig. 5, fig. 5 is a schematic diagram of another structure of a voice control device according to an embodiment of the present application. The wake-up module 302 may specifically include:
an acquisition submodule 3021 for acquiring the voice characteristics of the voice wake-up signal;
a processing sub-module 3022, configured to input the above voice features into a pre-trained acoustic model, and output phoneme information corresponding to the voice wake-up signal;
and the recognition submodule 3023 is used for decoding the phoneme information through a decoding diagram so as to recognize directional words in the voice wake-up signal.
In an embodiment, the voice control apparatus further comprises:
The networking module 305 is configured to access at least one intelligent device to a public network before the first receiving module 301 receives the voice wake-up signal, control the at least one intelligent device to broadcast its own device information and receive broadcast information of other intelligent devices, select a central device from the at least one intelligent device according to the number of broadcasts received by the intelligent device and the signal strength, establish an intelligent home network based on the central device, and control other intelligent devices to join the intelligent home network.
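The hub-election rule used by the networking module can be sketched as follows (the field names are assumptions): the device that received the most peer broadcasts becomes the central device, with signal strength as the tie-breaker.

```python
def select_central_device(devices):
    """devices: list of dicts with 'id', 'broadcasts_received' (count of
    peer broadcasts heard) and 'rssi' (mean received signal strength, dBm).
    Returns the id of the device chosen as the smart home network hub."""
    best = max(devices, key=lambda d: (d["broadcasts_received"], d["rssi"]))
    return best["id"]
```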
All the above technical solutions may be combined to form an optional embodiment of the present application, and will not be described in detail herein.
As can be seen from the above, the voice control device provided by the embodiment of the application receives the voice wake-up signal through the intelligent device in the intelligent home network, extracts the directional word in the voice wake-up signal, determines the target device corresponding to the directional word, wakes up the target device based on the intelligent home network, receives the voice command signal through the intelligent device in the intelligent home network, and controls the target device according to the voice command signal. According to the scheme provided by the embodiment of the application, the target equipment can be determined from a plurality of intelligent equipment in the intelligent home network according to the directional words in the voice signal, so that the target equipment is awakened and controlled, and the recognition accuracy of the voice signal and the use efficiency of the intelligent equipment are effectively improved.
Correspondingly, the embodiment of the application also provides an electronic device, which may be a terminal or a server, wherein the terminal may be a terminal device such as a smart phone, a tablet computer, a notebook computer, a touch screen, a game machine, a personal computer (PC, Personal Computer), or a personal digital assistant (Personal Digital Assistant, PDA). Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic device 400 includes a processor 401 having one or more processing cores, a memory 402 of one or more computer-readable storage media, and a computer program stored on the memory 402 and executable on the processor. The processor 401 is electrically connected to the memory 402. It will be appreciated by those skilled in the art that the electronic device structure shown in the figures does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
The processor 401 is a control center of the electronic device 400, connects various parts of the entire electronic device 400 using various interfaces and lines, and performs various functions of the electronic device 400 and processes data by running or loading software programs and/or modules stored in the memory 402, and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device 400.
In the embodiment of the present application, the processor 401 in the electronic device 400 loads the instructions corresponding to the processes of one or more application programs into the memory 402 according to the following steps, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions:
receiving a voice wake-up signal through intelligent equipment in an intelligent home network;
extracting directional words in the voice wake-up signal, determining target equipment corresponding to the directional words, and waking up the target equipment based on the intelligent home network;
receiving a voice command signal through intelligent equipment in an intelligent home network;
and controlling the target equipment according to the voice command signal.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Optionally, as shown in fig. 6, the electronic device 400 further includes: a touch display 403, a radio frequency circuit 404, an audio circuit 405, an input unit 406, and a power supply 407. The processor 401 is electrically connected to the touch display 403, the radio frequency circuit 404, the audio circuit 405, the input unit 406, and the power supply 407, respectively. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 6 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The touch display 403 may be used to display a graphical user interface and receive operation instructions generated by a user acting on the graphical user interface. The touch display 403 may include a display panel and a touch panel. The display panel may be used to display information entered by or provided to the user, as well as various graphical user interfaces of the electronic device, which may be composed of graphics, text, icons, video, and any combination thereof. Alternatively, the display panel may be configured in the form of a liquid crystal display (LCD, Liquid Crystal Display), an organic light-emitting diode (OLED, Organic Light-Emitting Diode), or the like. The touch panel may be used to collect touch operations of the user on or near it (such as operations by the user on or near the touch panel using a finger, a stylus, or any other suitable object or accessory) and generate corresponding operation instructions, which trigger the execution of corresponding programs. Alternatively, the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the touch point coordinates to the processor 401, and it can also receive and execute commands sent from the processor 401. The touch panel may overlay the display panel; upon detecting a touch operation on or near it, the touch panel passes the operation to the processor 401 to determine the type of touch event, and the processor 401 then provides a corresponding visual output on the display panel according to the type of touch event.
In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display 403 to realize the input and output functions. In some embodiments, however, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions respectively. That is, the touch display 403 may also implement an input function as part of the input unit 406.
In an embodiment of the present application, the graphical user interface is generated on the touch display 403 by the processor 401 executing an application program. The touch display 403 is used for presenting a graphical user interface and receiving an operation instruction generated by a user acting on the graphical user interface.
The radio frequency circuitry 404 may be used to transceive radio frequency signals to establish wireless communication with a network device or other electronic device via wireless communication.
The audio circuit 405 may be used to provide an audio interface between the user and the electronic device through a speaker and a microphone. On one hand, the audio circuit 405 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts collected sound signals into electrical signals, which are received by the audio circuit 405 and converted into audio data. The audio data are then processed by the processor 401 and sent via the radio frequency circuit 404 to, for example, another electronic device, or output to the memory 402 for further processing. The audio circuit 405 may also include an earphone jack to provide communication between peripheral earphones and the electronic device.
The input unit 406 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 407 is used to power the various components of the electronic device 400. Alternatively, the power supply 407 may be logically connected to the processor 401 through a power management system, so as to implement functions of managing charging, discharging, and power consumption management through the power management system. The power supply 407 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown in fig. 6, the electronic device 400 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described herein.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
As can be seen from the above, the electronic device provided in this embodiment may receive the voice wake-up signal through the intelligent device in the intelligent home network, extract the directional word in the voice wake-up signal, determine the target device corresponding to the directional word, wake-up the target device based on the intelligent home network, receive the voice command signal through the intelligent device in the intelligent home network, and control the target device according to the voice command signal. According to the scheme provided by the embodiment of the application, the target equipment can be determined from a plurality of intelligent equipment in the intelligent home network according to the directional words in the voice signal, so that the target equipment is awakened and controlled, and the recognition accuracy of the voice signal and the use efficiency of the intelligent equipment are effectively improved.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions or by controlling associated hardware, which may be stored in a storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a storage medium in which a plurality of computer programs are stored, the computer programs being capable of being loaded by a processor to perform the steps of any of the voice control methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:
receiving a voice wake-up signal through intelligent equipment in an intelligent home network;
extracting directional words in the voice wake-up signal, determining target equipment corresponding to the directional words, and waking up the target equipment based on the intelligent home network;
receiving a voice command signal through intelligent equipment in an intelligent home network;
and controlling the target equipment according to the voice command signal.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Wherein the storage medium may include: read Only Memory (ROM), random access Memory (RAM, random Access Memory), magnetic or optical disk, and the like.
The steps of any voice control method provided by the embodiment of the present application can be executed by the computer program stored in the storage medium, so that the beneficial effects of any voice control method provided by the embodiment of the present application can be achieved, and detailed descriptions of the foregoing embodiments are omitted.
The foregoing describes in detail a voice control method, apparatus, storage medium and electronic device provided by the embodiments of the present application, and specific examples are applied to illustrate the principles and embodiments of the present application, where the foregoing examples are only used to help understand the method and core idea of the present application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in light of the ideas of the present application, the present description should not be construed as limiting the present application.

Claims (12)

1. A voice control method, comprising:
receiving a voice wake-up signal through intelligent equipment in an intelligent home network;
extracting directional words in the voice wake-up signal, determining target equipment corresponding to the directional words, and waking up the target equipment based on the intelligent home network;
Receiving a voice command signal through intelligent equipment in an intelligent home network;
and controlling the target equipment according to the voice command signal.
2. The voice control method of claim 1, wherein the extracting directional words in the voice wake-up signal comprises:
acquiring voice characteristics of the voice wake-up signal;
inputting the voice characteristics into a pre-trained acoustic model, and outputting acoustic posterior probability corresponding to the voice wake-up signal;
and decoding the voice wake-up signal through the decoding diagram and the acoustic posterior probability to identify directional words in the voice wake-up signal.
3. The voice control method of claim 2, wherein the obtaining the voice characteristic of the voice wake-up signal comprises:
performing pre-emphasis, framing and windowing processing operations on the received voice wake-up signal to obtain a processed time domain signal;
performing Fourier transform on the time domain signals frame by frame to convert the time domain signals into frequency domain signals;
and carrying out Mel filtering processing on the frequency domain signal to obtain Mel characteristics and taking the Mel characteristics as voice characteristics.
4. The speech control method of claim 2, wherein the training process of the acoustic model comprises:
constructing a phoneme sequence difference minimization loss value based on a prediction result output by a training sample through the acoustic model;
calculating a loss function according to the phoneme sequence difference minimization loss value and the cross entropy loss value;
training the acoustic model based on the loss function.
5. The voice control method according to claim 2, wherein the constructing process of the decoding graph includes:
constructing a project word list according to preset directional words, and training the project word list through a statistical language model tool to obtain a first language model converter;
constructing a pronunciation dictionary according to the preset directive words and the corresponding phonemes;
and compounding the first language model converter and the pronunciation dictionary, and performing minimization processing to generate a forward decoding diagram.
6. The voice control method of claim 5, wherein after generating the forward decoding map, the method further comprises:
generating an inverted sequence project vocabulary according to the project vocabulary and a corresponding second language model converter;
Generating an inverted pronunciation dictionary according to the pronunciation dictionary;
and compounding the second language model converter and the reverse order pronunciation dictionary, and performing minimization processing to generate a reverse decoding diagram.
7. The voice control method of claim 1, wherein the waking up the target device based on the smart home network comprises:
generating a control instruction according to the equipment information of the target equipment;
and broadcasting the control instruction to the intelligent home network to find and wake up the target equipment.
8. The voice control method of any of claims 1-7, wherein prior to receiving a voice wake-up signal by a smart device in a smart home network, the method further comprises:
accessing at least one intelligent device to a public network;
controlling the at least one intelligent device to broadcast own device information and receiving broadcast information of other intelligent devices;
selecting a central device from the at least one intelligent device according to the number of broadcasts received by the intelligent device and the signal strength;
and establishing an intelligent home network based on the central equipment, and controlling other intelligent equipment to join the intelligent home network.
9. A voice control method, comprising:
receiving a voice wake-up signal through intelligent equipment in an intelligent home network;
waking up all intelligent devices in the intelligent home network according to the voice wake-up signal;
receiving a voice command signal through intelligent equipment in an intelligent home network;
and extracting directional words in the voice command signal, determining target equipment corresponding to the directional words, and controlling the target equipment based on the intelligent home network.
10. A voice control apparatus, comprising:
the first receiving module is used for receiving a voice wake-up signal through intelligent equipment in the intelligent home network;
the wake-up module is used for extracting directional words in the voice wake-up signal, determining target equipment corresponding to the directional words, and waking up the target equipment based on the intelligent home network;
the second receiving module is used for receiving the voice command signal through intelligent equipment in the intelligent home network;
and the control module is used for controlling the target equipment according to the voice command signal.
11. A storage medium storing a computer program adapted to be loaded by a processor to perform the steps of the speech control method according to any one of claims 1-9.
12. An electronic device comprising a memory in which a computer program is stored and a processor that performs the steps in the speech control method according to any one of claims 1-9 by calling the computer program stored in the memory.
CN202310848085.8A 2023-07-11 2023-07-11 Voice control method and device, storage medium and electronic equipment Pending CN116896488A (en)

Publication number: CN116896488A; publication date: 2023-10-17; family ID: 88314449.

Similar Documents

Publication Publication Date Title
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN103093755B (en) Based on terminal and mutual network household electric appliance control method and the system of internet voice
CN102543071B (en) Voice recognition system and method used for mobile equipment
US9466286B1 (en) Transitioning an electronic device between device states
CN112201246B (en) Intelligent control method and device based on voice, electronic equipment and storage medium
CN111223497A (en) Nearby wake-up method and device for terminal, computing equipment and storage medium
CN102831892B (en) Toy control method and system based on internet voice interaction
CN102855874A (en) Method and system for controlling household appliance on basis of voice interaction of internet
CN108922525B (en) Voice processing method, device, storage medium and electronic equipment
CN102855872A (en) Method and system for controlling household appliance on basis of voice interaction between terminal and internet
WO2019233228A1 (en) Electronic device and device control method
CN111312235A (en) Voice interaction method, device and system
US11830482B2 (en) Method and apparatus for speech interaction, and computer storage medium
CN109240107A (en) A kind of control method of electrical equipment, device, electrical equipment and medium
CN109949808A (en) The speech recognition appliance control system and method for compatible mandarin and dialect
CN110706707B (en) Method, apparatus, device and computer-readable storage medium for voice interaction
TW202022849A (en) Voice data identification method, apparatus and system
CN109712623A (en) Sound control method, device and computer readable storage medium
WO2023109129A1 (en) Speech data processing method and apparatus
CN114333774B (en) Speech recognition method, device, computer equipment and storage medium
CN116582382B (en) Intelligent device control method and device, storage medium and electronic device
CN111654782B (en) Intelligent sound box and signal processing method
CN113160815A (en) Intelligent control method, device and equipment for voice awakening and storage medium
CN112233676A (en) Intelligent device awakening method and device, electronic device and storage medium
CN112420043A (en) Intelligent awakening method and device based on voice, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination