CN116229966A - Method and system for controlling intelligent electrical appliance based on voice - Google Patents

Method and system for controlling intelligent electrical appliance based on voice

Info

Publication number
CN116229966A
CN116229966A (application CN202310054485.1A)
Authority
CN
China
Prior art keywords
voice
sound
layer
recognition model
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310054485.1A
Other languages
Chinese (zh)
Inventor
董立伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Hengyan Electronics Co ltd
Original Assignee
Foshan Hengyan Electronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Hengyan Electronics Co ltd filed Critical Foshan Hengyan Electronics Co ltd
Priority to CN202310054485.1A priority Critical patent/CN116229966A/en
Publication of CN116229966A publication Critical patent/CN116229966A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Abstract

The invention relates to the technical field of voice control and discloses a method and a system for controlling an intelligent electrical appliance based on voice. A voice recognition model based on a CNN model is established, comprising a convolution layer, a batch normalization layer, a ReLU activation function layer, a maximum pooling layer, a fully connected layer and a classification layer connected in sequence; the model captures sound data represented as spectrograms and is trained on them. Collected sound data are then input into the trained sound recognition model to determine the voice content to be recognized, and that content is transmitted through a server to the interconnection gateway of the intelligent appliance, realizing voice control of the appliance. The invention reduces the development cost of the embedded device, saves resources, and improves both the accuracy and the running speed of voice recognition.

Description

Method and system for controlling intelligent electrical appliance based on voice
Technical Field
The application relates to the technical field of voice control, in particular to a method and a system for controlling an intelligent electrical appliance based on voice.
Background
With the continued development of China's economy, living standards keep rising, and the transition from traditional home living systems to intelligent ones is an inevitable trend. However, intelligent electrical appliance systems still have shortcomings that need to be addressed. On the one hand, current systems lack a transition path for traditional household appliances, which are difficult to put on a network; consumers who choose an intelligent appliance system therefore have to replace all their appliances at great cost, which raises the barrier to adoption and slows the development of such systems. On the other hand, for a future-oriented intelligent appliance model, improving the recall rate and response speed of intelligent voice control is the key point, but existing voice recognition systems cannot meet these requirements and need forward-looking optimization.
Therefore, a method is needed to reduce the development cost of the embedded device, save resources, and improve the accuracy and the operation speed of voice recognition.
Disclosure of Invention
The application embodiment provides a method and a system for controlling an intelligent electrical appliance based on voice, which reduce the development cost of embedded equipment, save resources and improve the accuracy and the operation speed of voice recognition.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
in a first aspect, a method for controlling a smart appliance based on voice is provided, the method comprising the steps of:
step S1, a sound acquisition module in an intelligent electrical appliance acquires an original sound signal, a controller of a control module in the intelligent electrical appliance pre-processes the original sound signal, generates a processed sound signal, and stores the processed sound signal as a model training set to a server;
s2, extracting characteristic data of the processed sound signals, and representing a sound data set by using a spectrogram;
step S3, establishing a voice recognition model based on a CNN model in a language recognition unit of the server, capturing voice data in a spectrogram by using the voice recognition model, training the voice recognition model by using the model training set, and presetting the maximum number of iterations and the learning rate;
the voice recognition model comprises a convolution layer, a batch normalization layer, a ReLU activation function layer, a maximum pooling layer, a fully connected layer and a classification layer which are connected in sequence;
s4, inputting the voice data set into the voice recognition model in a data cube structure to train the voice recognition model, and generating a trained voice recognition model;
s5, the sound collection module collects live sound data in real time to generate a corresponding sound data set represented by a spectrogram, and the sound data set is input into the trained sound recognition model to determine the voice content to be recognized;
and S6, transmitting the voice content to be recognized to an interconnection gateway of the intelligent electric appliance through a server to realize voice control of the intelligent electric appliance.
In a possible implementation manner, the method for performing preprocessing in step S1 includes a sound denoising process, a sound pre-emphasis process, a sound windowing framing process, and a sound endpoint detection process.
In one possible implementation manner, the step S2 includes:
and extracting the characteristic data of the processed sound signal by using a linear prediction coefficient algorithm.
In one possible implementation manner, the step S3 includes:
after the sound data set is input into the sound recognition model, local information of the sound data set is extracted and the sound data are down-sampled using the convolution layer, the batch normalization layer and the ReLU activation function layer; deep features of the sound data set are then mined using the maximum pooling layer, the fully connected layer and the classification layer. After the feature information of the sound data set is aggregated, a loss function computes the loss between the predicted points and the real points of the sound data set in the spectrogram; the loss is iteratively decayed using a learning-rate reduction method to optimize the weight parameters of the model. Training stops when the iteration count reaches the preset maximum, yielding the trained sound recognition model.
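The training control flow described here (iterate, compute a loss, decay the learning rate, stop at the preset maximum iteration count) can be sketched with a toy one-parameter model. This is purely illustrative and not the patent's implementation: the quadratic loss stands in for the CNN's cross-entropy loss, and every numeric value is an assumption.

```python
# Toy sketch of the training control flow in step S3: iterate, compute a
# loss, decay the learning rate, stop at the preset maximum iteration count.
# The quadratic "loss" stands in for the CNN's cross-entropy loss.

def train(max_iters=50, lr=0.3, decay=0.95):
    w = 4.0                        # a single "weight parameter" to optimize
    target = 1.0                   # the "real point"; the prediction is w itself
    losses = []
    for _ in range(max_iters):
        loss = (w - target) ** 2   # stand-in for the loss function
        losses.append(loss)
        grad = 2 * (w - target)
        w -= lr * grad             # optimize the weight parameter
        lr *= decay                # learning-rate reduction each iteration
    return w, losses

w, losses = train()
print(round(w, 6), losses[0])
```

The loss shrinks monotonically and training halts purely on the iteration count, exactly the stopping rule the text describes.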
In one possible implementation, the loss function comprises a cross entropy loss function.
In one possible implementation manner, the step S5 includes:
and inputting the corresponding sound data set into a trained sound recognition model, comparing and matching the sound data set with sample parameters in a server sample library through deep mining analysis, and determining the voice content to be recognized according to the matching similarity.
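The matching stage of step S5 can be sketched as a nearest-neighbour search over a server-side sample library. The cosine similarity measure, the threshold, and the command names below are illustrative assumptions; the patent only says matching is done "according to the matching similarity".

```python
# Hedged sketch of step S5's matching stage: compare a feature vector from
# live audio against sample parameters in a (hypothetical) sample library
# and pick the command with the highest similarity.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_command(features, sample_library, threshold=0.8):
    """Return (best command, similarity), or (None, similarity) below threshold."""
    best_cmd, best_sim = None, -1.0
    for cmd, sample in sample_library.items():
        sim = cosine_similarity(features, sample)
        if sim > best_sim:
            best_cmd, best_sim = cmd, sim
    return (best_cmd, best_sim) if best_sim >= threshold else (None, best_sim)

library = {"light_on": [0.9, 0.1, 0.2], "fan_off": [0.1, 0.8, 0.5]}
print(match_command([0.88, 0.15, 0.18], library))
```

The threshold keeps near-miss audio from triggering the wrong appliance command.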
In one possible implementation manner, the step S6 includes:
the voice content to be recognized is identified using an attention mechanism model, realizing instruction conversion.
In a second aspect, the present invention further provides a system for controlling an intelligent electrical appliance based on voice, including a sound collection module, a server and an interconnection gateway, wherein:
the intelligent electric appliance comprises a sound acquisition module, a server and a model training set, wherein the sound acquisition module is used for acquiring an original sound signal, and a controller of the control module in the intelligent electric appliance preprocesses the original sound signal to generate a processed sound signal and stores the processed sound signal as the model training set to the server; the method comprises the steps of acquiring live sound data in real time to generate a corresponding sound data set represented by a spectrogram, and determining voice content to be recognized;
a server for storing the feature data of the processed sound signal;
the server comprises a language recognition unit, which is used for establishing a voice recognition model based on a CNN model, capturing voice data in a spectrogram with the voice recognition model, training it with the model training set, and presetting the maximum number of iterations and the learning rate; the voice recognition model comprises a convolution layer, a batch normalization layer, a ReLU activation function layer, a maximum pooling layer, a fully connected layer and a classification layer connected in sequence; the voice data set is input into the voice recognition model in a data cube structure to train it and generate the trained voice recognition model;
and the interconnection gateway is used for receiving the voice content to be recognized and transmitted by the server and realizing voice control of the intelligent electrical appliance.
In a third aspect, the present invention also provides an electronic device comprising a processor and a memory; the electronic device runs the system for controlling an intelligent electrical appliance based on voice according to the second aspect.
In a fourth aspect, the present invention also provides a computer-readable storage medium comprising instructions; when the instructions are executed on the electronic device described in the third aspect, the electronic device is caused to perform the method described in the first aspect.
drawings
Fig. 1 is a schematic structural diagram of a sound collection module in a method and a system for controlling an intelligent electrical appliance based on voice according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a system in a method and a system for controlling an intelligent electrical appliance based on voice according to an embodiment of the present application.
The invention can improve the accuracy and stability of the intelligent electrical appliance voice control system identification.
The invention can improve the response speed of the intelligent electrical appliance voice control system.
The invention reduces the development cost of the embedded equipment, saves resources, and improves the accuracy and the operation speed of voice recognition.
Detailed Description
It should be noted that the terms "first," "second," and the like in the embodiments of the present application are used to distinguish between features of the same type, and are not to be construed as indicating relative importance, quantity, or order.
The terms "exemplary" or "such as" and the like, as used in connection with embodiments of the present application, are intended to be exemplary, or descriptive. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
The terms "coupled" and "connected" in connection with embodiments of the present application are to be construed broadly, and may refer, for example, to a physical direct connection, or to an indirect connection via electronic devices, such as, for example, a connection via electrical resistance, inductance, capacitance, or other electronic devices.
Example 1:
in the method for controlling an intelligent electrical appliance based on voice, note that sound in reality is a continuous signal, yet most sound is stored as a discrete digital signal, as in CD and MP3 audio formats: sound is produced as a continuous signal but collected as discrete samples, and the collection density is described by the sampling rate.
From the network structure, the special network structure of the CNN model enables the CNN model to extract local information of the input voice features, and the invariance of the CNN model to the translation of the input features in frequency and time domains is enhanced through the pooling layer downsampling operation, so that the robustness of the model is greatly enhanced. The CNN model serves as a deep model that can effectively model the spatial distribution of speech feature data.
As shown in fig. 1, C is a convolution layer, BN is a batch normalization layer, ReLU is a ReLU activation function layer, P is a maximum pooling layer, FC is a fully connected layer, and softmax is a classification layer. This embodiment establishes a voice recognition model based on a CNN model, comprising the convolution layer, batch normalization layer, ReLU activation function layer, maximum pooling layer, fully connected layer and classification layer connected in sequence. After the sound data set is input into the model, local information is extracted and the sound data are down-sampled using the convolution, batch normalization and ReLU layers; deep features are mined using the maximum pooling, fully connected and classification layers. Once the feature information is aggregated, a loss function computes the loss between the predicted points and the real points of the sound data set in the spectrogram; the loss is iteratively decayed with a learning-rate reduction method to optimize the weight parameters, and training stops when the iteration count reaches the preset maximum, yielding the trained sound recognition model.
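As an illustrative aside (not taken from the patent), the way a square spectrogram shrinks through this Conv-BN-ReLU-Pool stack can be traced with simple arithmetic. The kernel size, stride, padding, input size and channel count below are hypothetical assumptions chosen only to show the shape bookkeeping:

```python
# Illustrative sketch: spatial dimensions of a spectrogram through the
# Conv -> BN -> ReLU -> MaxPool -> FC -> softmax stack described above.

def conv2d_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def cnn_shape_trace(spectrogram_size=64):
    s = spectrogram_size
    trace = [("input", s)]
    s = conv2d_out(s, kernel=3, padding=1)   # convolution (BN/ReLU keep shape)
    trace.append(("conv+bn+relu", s))
    s = conv2d_out(s, kernel=2, stride=2)    # maximum pooling halves the map
    trace.append(("maxpool", s))
    # the fully connected layer flattens; softmax then classifies
    trace.append(("fc_input_features", s * s * 16))  # assuming 16 channels
    return trace

print(cnn_shape_trace())
```

The pooling step is exactly the down-sampling that gives the model its translation invariance in time and frequency.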
Example 2:
the present embodiment is further optimized on the basis of embodiment 1. The pre-emphasis processing compensates or enhances the high-frequency component of the input speech signal. In the spectrum of a sound signal, the energy of the low-frequency part is generally higher than that of the high-frequency part, and above about 800 Hz the high-frequency end attenuates at roughly 6 dB per octave. To reduce the influence of the vocal organs and articulation during sound production, the high-frequency and low-frequency energies are brought to similar amplitudes so that the signal spectrum stays flat across the whole band from low to high frequency; pre-emphasis is therefore a necessary preprocessing step. Meanwhile, because the noise in the signal is unchanged while the high-frequency energy is increased, the signal-to-noise ratio can also be improved.
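Pre-emphasis is typically implemented as a first-order high-pass filter, y[n] = x[n] - alpha * x[n-1]. A minimal sketch follows; the coefficient alpha = 0.97 is a common choice but an assumption here, since the patent does not fix a value:

```python
# Pre-emphasis sketch: boost the high-frequency component with the filter
# y[n] = x[n] - alpha * x[n-1]; low frequencies are attenuated far more.
import math

def pre_emphasis(signal, alpha=0.97):
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

# a slow (low-frequency) tone loses far more energy than a fast one
slow = [math.sin(0.05 * n) for n in range(400)]
fast = [math.sin(1.5 * n) for n in range(400)]
energy = lambda x: sum(v * v for v in x)
print(energy(pre_emphasis(slow)) / energy(slow),
      energy(pre_emphasis(fast)) / energy(fast))
```

The printed ratios show the flattening effect: the low-frequency tone is heavily attenuated while the high-frequency tone is slightly amplified, which is why the noise-unchanged, high-frequency-boosted signal gains signal-to-noise ratio.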
The sound windowing and framing process divides the complete voice signal into short segments by time period, each segment being one frame, so that computation only needs to be performed per frame; this is the framing process. Framing is implemented by multiplying the speech signal by a movable window function of fixed length, which is called windowing.
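The framing-plus-windowing step can be sketched as follows. The frame length (400 samples, i.e. 25 ms at 16 kHz), the hop size, and the choice of a Hamming window are illustrative assumptions; the patent only specifies that a movable fixed-length window is multiplied onto the signal:

```python
# Sketch of windowed framing: split the signal into fixed-length overlapping
# frames and multiply each frame by a Hamming window.
import math

def hamming(n_points):
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n_points - 1))
            for i in range(n_points)]

def frame_signal(signal, frame_len=400, hop=160):
    win = hamming(frame_len)
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, win)])
        start += hop
    return frames

frames = frame_signal([1.0] * 1600)
print(len(frames), len(frames[0]))
```

Overlapping hops keep information near frame boundaries, and the window tapers each frame so later per-frame analysis is not distorted by abrupt edges.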
The sound endpoint detection process detects speech segments using two functions of the speech signal, the short-time average energy and the short-time average zero-crossing rate. Chinese contains unvoiced and voiced sounds: voiced sounds contain vowels, which carry most of the energy, while unvoiced sounds include consonants with high-frequency content. Voiced segments can therefore be detected with the short-time average energy and unvoiced segments with the short-time average zero-crossing rate, so that whole syllables of Chinese characters can be located.
After denoising, the high-frequency component is boosted by pre-emphasis; windowing and framing then facilitate digital processing; finally, the effective speech segments are detected so that only they are processed, reducing the data volume. An improved endpoint detection algorithm further reduces missed and false detections, giving the method good robustness.
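The energy/zero-crossing test described above can be sketched per frame. The thresholds and the toy frames below are illustrative assumptions; a real detector would calibrate thresholds from background noise:

```python
# Sketch of endpoint detection: high short-time energy marks voiced speech,
# a high zero-crossing rate marks unvoiced speech, otherwise silence.
import math

def short_time_energy(frame):
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

def classify_frame(frame, energy_thresh=0.1, zcr_thresh=0.3):
    if short_time_energy(frame) > energy_thresh:
        return "voiced"
    if zero_crossing_rate(frame) > zcr_thresh:
        return "unvoiced"
    return "silence"

voiced = [math.sin(0.2 * n) for n in range(160)]           # strong, low frequency
unvoiced = [0.05 * math.sin(2.5 * n) for n in range(160)]  # weak, high frequency
silence = [0.0] * 160
print(classify_frame(voiced), classify_frame(unvoiced), classify_frame(silence))
```

Frames classed as silence are dropped before recognition, which is how the pipeline reduces the data volume.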
Other portions of this embodiment are the same as those of embodiment 1, and thus will not be described in detail.
Example 3:
the present embodiment is further optimized on the basis of embodiment 1 or 2. The loss function includes a cross entropy loss function. Because the voice recognition model is built on a CNN neural network, its raw output is a vector, not a probability distribution; the softmax activation function is therefore needed to normalize the vector into a probability distribution before the loss is computed with the cross entropy loss function. Since the cross entropy loss function describes the difference between two probability distributions, the softmax activation function and the cross entropy loss function are ultimately used in combination.
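The softmax-plus-cross-entropy combination described above can be shown with a small worked example. The logit values are arbitrary illustrative numbers:

```python
# Worked sketch: softmax turns the CNN's raw output vector into a probability
# distribution; cross-entropy measures its distance from the true label.
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    # distance from the one-hot distribution concentrated on true_index
    return -math.log(probs[true_index])

logits = [2.0, 0.5, -1.0]                # raw CNN outputs, not probabilities
probs = softmax(logits)
print([round(p, 3) for p in probs], round(cross_entropy(probs, 0), 3))
```

The loss is small when the softmax puts most mass on the true class and grows without bound as that probability approaches zero, which is what makes the pairing effective for training.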
Example 4:
the basic principle of the present embodiment is that, in a speech signal, adjacent samples are linearly related: the value at a given time can be predicted from a linear combination of several preceding sample values, and by making the prediction approximate the actual samples as closely as possible, a uniquely determined set of coefficient values, the characteristic parameters of the signal, is obtained. The advantage of the linear prediction coefficient algorithm is that the estimated characteristic parameters are accurate and can describe both the time-domain and frequency-domain characteristics of the voice signal.
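The linear prediction idea described above can be sketched with the standard Levinson-Durbin recursion over autocorrelations. This is a generic LPC sketch, not the patent's specific procedure; the test signal and model order are illustrative assumptions:

```python
# LPC sketch: each sample is approximated by a linear combination of the
# preceding samples; Levinson-Durbin solves for the prediction coefficients
# from the signal's autocorrelation values.
import math

def autocorr(x, lag):
    return sum(x[n] * x[n - lag] for n in range(lag, len(x)))

def lpc(x, order):
    r = [autocorr(x, k) for k in range(order + 1)]
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err                     # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1 - k * k)                 # remaining prediction error
    return a, err

# toy "voiced" signal: a decaying sinusoid, which is exactly 2nd-order predictable
x = [0.9 ** n * math.sin(0.3 * n) for n in range(200)]
coeffs, residual = lpc(x, order=2)
print([round(c, 4) for c in coeffs])
```

For this signal the recursion recovers coefficients close to the theoretical values (-2 * 0.9 * cos(0.3) and 0.81), and the resulting coefficient set is exactly the kind of characteristic parameter vector step S2 extracts.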
Other portions of this embodiment are the same as any of embodiments 1 to 3 described above, and thus will not be described again.
Example 5:
the present embodiment is further optimized on the basis of any one of the above embodiments 1 to 4. As shown in fig. 2, this embodiment provides a system for controlling an intelligent electrical appliance based on voice, comprising a sound collection module, a server and an interconnection gateway. After the voice content to be recognized has been transmitted through the server to the interconnection gateway of the intelligent appliance, the server side passes the recognition result to the hardware platform of the intelligent appliance for execution.
The controller of the intelligent electrical appliance is connected to the network through the appliance's interconnection gateway. After the server recognizes the voice command, it packages the command information and transmits it using the TCP/IP protocol; on receiving the data packet, the interconnection gateway parses out the useful information and sends the specific control command to the controller, completing the remote voice control function. By offloading the voice recognition task to the server, the interconnection gateway of the intelligent appliance only needs to upload data, execute commands and parse the protocol. The invention thus reduces the development cost of the embedded device, saves resources, and improves the accuracy and running speed of voice recognition.
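The server-to-gateway exchange can be sketched as a length-prefixed payload suitable for a TCP stream. The JSON field names and the 4-byte framing scheme are assumptions for illustration; the patent only specifies that command information is packaged and carried over TCP/IP:

```python
# Hedged sketch: the server packages a recognized command as a
# length-prefixed JSON payload for a TCP stream; the gateway parses it back.
import json
import struct

def pack_command(device, action):
    payload = json.dumps({"device": device, "action": action}).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload  # 4-byte big-endian length prefix

def unpack_command(data):
    (length,) = struct.unpack(">I", data[:4])
    return json.loads(data[4:4 + length].decode("utf-8"))

packet = pack_command("air_conditioner", "power_on")
print(unpack_command(packet))
```

The length prefix lets the gateway find message boundaries in the byte stream, so its job really is reduced to uploading data, parsing the protocol, and executing commands.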
Other portions of this embodiment are the same as any of embodiments 1 to 4 described above, and thus will not be described again.
Example 6:
the invention also provides an electronic device, which comprises a processor and a memory; the electronic device runs the system for controlling an intelligent electrical appliance based on voice described in the above embodiments.
Example 7:
the present invention also provides a computer-readable storage medium comprising instructions; when the instructions are executed on the electronic device described in the above embodiment, the electronic device is caused to perform the method described in the above embodiment. In the alternative, the computer readable storage medium may be a memory.
The processor referred to in the embodiments of the present application may be a chip. For example, it may be a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), a microcontroller unit (MCU), a programmable logic device (PLD), or other integrated chip.
The memory to which embodiments of the present application relate may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synclink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system, apparatus and module may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, e.g., the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, indirect coupling or communication connection of devices or modules, electrical, mechanical, or other form.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physically separate, i.e., may be located in one device, or may be distributed over multiple devices. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated in one device, or each module may exist alone physically, or two or more modules may be integrated in one device.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (Digital Subscriber Line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device including one or more servers, data centers, etc. that can be integrated with the medium. The usable medium may be a magnetic medium (e.g., a floppy Disk, a hard Disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for controlling an intelligent electrical appliance based on voice, characterized by comprising the following steps:
step S1, a sound acquisition module in an intelligent electrical appliance acquires an original sound signal, a controller of a control module in the intelligent electrical appliance pre-processes the original sound signal, generates a processed sound signal, and stores the processed sound signal as a model training set to a server;
s2, extracting characteristic data of the processed sound signals, and representing a sound data set by using a spectrogram;
step S3, establishing a voice recognition model based on a CNN model in a language recognition unit of the server, capturing voice data in a spectrogram by using the voice recognition model, training the voice recognition model by using the model training set, and presetting the maximum number of iterations and the learning rate;
the voice recognition model comprises a convolution layer, a batch normalization layer, a ReLU activation function layer, a maximum pooling layer, a fully connected layer and a classification layer which are connected in sequence;
s4, inputting the voice data set into the voice recognition model in a data cube structure to train the voice recognition model, and generating a trained voice recognition model;
s5, the sound collection module collects live sound data in real time to generate a corresponding sound data set represented by a spectrogram, and the sound data set is input into the trained sound recognition model to determine the voice content to be recognized;
and S6, transmitting the voice content to be recognized to an interconnection gateway of the intelligent electric appliance through a server to realize voice control of the intelligent electric appliance.
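The claims do not prescribe how the spectrogram of step S2 is computed. A minimal illustrative sketch (not part of the application; frame length, hop size, and the Hamming window are assumptions) of turning a signal into a time-frequency magnitude matrix:

```python
import math

def frame_signal(x, frame_len=64, hop=32):
    # Split the signal into overlapping frames (the framing of the pre-processing stage).
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def dft_magnitude(frame):
    # Magnitude of the discrete Fourier transform of one frame (naive O(n^2) DFT).
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(frame[t] * math.cos(-2 * math.pi * k * t / n) for t in range(n))
        im = sum(frame[t] * math.sin(-2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

def spectrogram(x, frame_len=64, hop=32):
    # One row of spectral magnitudes per windowed frame.
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * t / (frame_len - 1))
               for t in range(frame_len)]
    return [dft_magnitude([s * w for s, w in zip(f, hamming)])
            for f in frame_signal(x, frame_len, hop)]

# A pure test tone that falls exactly on DFT bin 8 of a 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(tone)
```

For such a tone, every frame's magnitude peak lands on bin 8, which is the kind of stable time-frequency pattern the CNN of step S3 is trained to capture.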
2. The method for controlling an intelligent electrical appliance based on voice according to claim 1, wherein the pre-processing in step S1 comprises sound denoising, sound pre-emphasis, sound windowing and framing, and sound endpoint detection.
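Claim 2 names the pre-processing stages without fixing their parameters. A minimal sketch of two of them, pre-emphasis and energy-based endpoint detection (the coefficient 0.97, the frame length, and the energy threshold are illustrative assumptions, not values from the application):

```python
def pre_emphasis(x, alpha=0.97):
    # y[t] = x[t] - alpha * x[t-1]: boosts high frequencies before framing.
    return [x[0]] + [x[t] - alpha * x[t - 1] for t in range(1, len(x))]

def energy_endpoints(x, frame_len=160, threshold=0.01):
    # Short-time energy endpoint detection: the speech segment is the span of
    # frames whose mean energy exceeds the threshold.
    frames = [x[i:i + frame_len] for i in range(0, len(x), frame_len)]
    active = [i for i, f in enumerate(frames)
              if sum(s * s for s in f) / len(f) > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```

On a signal that is silent, then active, then silent again, the detector returns the sample range of the active middle section.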
3. The method of claim 1, wherein the step S2 includes:
extracting the feature data of the processed sound signal by using a linear prediction coefficient algorithm.
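Claim 3 specifies linear prediction coefficients as the feature data. The standard way to obtain them is the Levinson-Durbin recursion on the signal's autocorrelation sequence; a minimal sketch (the order and test signal are illustrative assumptions):

```python
def autocorrelation(x, max_lag):
    # r[k] = sum_t x[t] * x[t+k], for lags 0..max_lag.
    return [sum(x[t] * x[t + k] for t in range(len(x) - k))
            for k in range(max_lag + 1)]

def lpc(x, order):
    # Levinson-Durbin recursion: solves the normal equations of linear
    # prediction and returns coefficients a[1..order].
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)
    e = r[0]  # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= (1 - k * k)
    return a[1:]
```

For a decaying exponential x[t] = 0.9^t, which satisfies x[t] = 0.9 * x[t-1] exactly, a first-order LPC analysis recovers a coefficient close to 0.9.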
4. The method of claim 1, wherein the step S3 includes:
after the sound data set is input into the sound recognition model, local information of the sound data set is extracted by the convolution layer, the batch normalization layer and the ReLU activation function layer; the sound data are down-sampled by the maximum pooling layer; the sound data set is deeply mined by the fully connected layer and the classification layer; after the feature information of the sound data set is aggregated, the loss value between the predicted points of the sound data set and the real points of the sound data set in the spectrogram is calculated with a loss function; the loss value is continuously reduced over iterations by a learning rate decay method, optimizing the weight parameters of the sound recognition model, until the number of iterations equals the preset maximum number of iterations, whereupon training stops and the trained sound recognition model is generated.
5. The method of claim 4, wherein the loss function comprises a cross entropy loss function.
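Claims 4 and 5 describe the training loop: cross-entropy loss, learning-rate decay, and a stop at the maximum iteration count. A minimal sketch of that loop on a toy linear classifier (the model, data, and hyper-parameters are illustrative assumptions; the application's model is a CNN):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(probs, label):
    # The loss of claim 5: negative log-probability of the true class.
    return -math.log(probs[label])

def train(xs, ys, n_classes, max_iters=300, lr0=0.5, decay=0.99):
    dim = len(xs[0])
    w = [[0.0] * dim for _ in range(n_classes)]
    lr = lr0
    for _ in range(max_iters):           # stop when iterations hit the maximum
        for x, y in zip(xs, ys):
            logits = [sum(wc[d] * x[d] for d in range(dim)) for wc in w]
            p = softmax(logits)
            for c in range(n_classes):   # gradient of cross-entropy w.r.t. logits
                g = p[c] - (1.0 if c == y else 0.0)
                for d in range(dim):
                    w[c][d] -= lr * g * x[d]
        lr *= decay                      # the learning-rate decay of claim 4
    return w
```

After training on two linearly separable samples, the loss on each sample drops well below its initial value of ln 2.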
6. The method of claim 1, wherein the step S5 comprises:
inputting the corresponding sound data set into the trained sound recognition model, comparing and matching it, through deep mining analysis, with sample parameters in a sample library on the server, and determining the voice content to be recognized according to the matching similarity.
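Claim 6 matches extracted features against a sample library by similarity. One common realization is cosine similarity with an acceptance threshold; a minimal sketch (the library entries, feature vectors, and threshold 0.8 are illustrative assumptions):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_match(features, sample_library, threshold=0.8):
    # Compare the extracted features with every library sample; return the
    # command whose similarity is highest, provided it clears the threshold.
    name, ref = max(sample_library.items(),
                    key=lambda kv: cosine_similarity(features, kv[1]))
    score = cosine_similarity(features, ref)
    return (name, score) if score >= threshold else (None, score)
```

A feature vector close to a stored sample is accepted; one roughly equidistant from all samples falls below the threshold and is rejected.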
7. The method of claim 1, wherein the step S6 includes:
identifying the voice content to be recognized by using an attention mechanism model, thereby realizing instruction conversion.
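Claim 7 invokes an attention mechanism model without detailing it. The standard building block of such models is scaled dot-product attention; a minimal sketch (the query, key, and value vectors are illustrative assumptions):

```python
import math

def _softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def attention(queries, keys, values):
    # Scaled dot-product attention: each query scores every key, the scores
    # are normalized by softmax, and the output is the weighted sum of values.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = _softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

A query aligned with the first key attends almost entirely to the first value vector, which is how the model focuses on the command-bearing part of an utterance.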
8. A system for controlling an intelligent electrical appliance based on voice, applied to an intelligent electrical appliance, characterized by comprising a sound acquisition module, a server and an interconnection gateway, wherein:
the sound acquisition module is used for acquiring an original sound signal, which a controller of the control module in the intelligent electrical appliance pre-processes to generate a processed sound signal stored on the server as a model training set; the sound acquisition module is further used for acquiring live sound data in real time to generate a corresponding sound data set represented by a spectrogram, from which the voice content to be recognized is determined;
the server is used for storing the feature data of the processed sound signal;
the server comprises a language recognition unit, wherein the language recognition unit is used for establishing a sound recognition model based on a CNN model, capturing sound data in the spectrogram with the sound recognition model, training the sound recognition model with the model training set, and presetting the maximum number of iterations and the learning rate; the sound recognition model comprises a convolution layer, a batch normalization layer, a ReLU activation function layer, a maximum pooling layer, a fully connected layer and a classification layer connected in sequence; the sound data set is input into the sound recognition model in a data cube structure to train it, generating a trained sound recognition model;
the interconnection gateway is used for receiving the voice content to be recognized transmitted by the server, thereby realizing voice control of the intelligent electrical appliance.
9. An electronic device, comprising a processor and a memory; wherein the processor runs the system for controlling an intelligent electrical appliance based on voice as claimed in claim 8.
10. A computer-readable storage medium comprising instructions which, when executed on the electronic device as claimed in claim 9, cause the electronic device to perform the method as claimed in any one of claims 1-7.
CN202310054485.1A 2023-02-03 2023-02-03 Method and system for controlling intelligent electrical appliance based on voice Pending CN116229966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310054485.1A CN116229966A (en) 2023-02-03 2023-02-03 Method and system for controlling intelligent electrical appliance based on voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310054485.1A CN116229966A (en) 2023-02-03 2023-02-03 Method and system for controlling intelligent electrical appliance based on voice

Publications (1)

Publication Number Publication Date
CN116229966A true CN116229966A (en) 2023-06-06

Family

ID=86583735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310054485.1A Pending CN116229966A (en) 2023-02-03 2023-02-03 Method and system for controlling intelligent electrical appliance based on voice

Country Status (1)

Country Link
CN (1) CN116229966A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10726858B2 (en) * 2018-06-22 2020-07-28 Intel Corporation Neural network for speech denoising trained with deep feature losses
CN113921000A (en) * 2021-08-25 2022-01-11 哈尔滨工业大学 Online instruction word voice recognition method and system in noise environment


Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
WO2021043015A1 (en) Speech recognition method and apparatus, and neural network training method and apparatus
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN111325095B (en) Intelligent detection method and system for equipment health state based on acoustic wave signals
CN106847281A (en) Intelligent household voice control system and method based on voice fuzzy identification technology
WO2021189642A1 (en) Method and device for signal processing, computer device, and storage medium
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
CN105989836A (en) Voice acquisition method, device and terminal equipment
CN113823264A (en) Speech recognition method, speech recognition device, computer-readable storage medium and computer equipment
CN113035202B (en) Identity recognition method and device
CN113488060B (en) Voiceprint recognition method and system based on variation information bottleneck
CN111540342A (en) Energy threshold adjusting method, device, equipment and medium
CN112183107A (en) Audio processing method and device
CN113129900A (en) Voiceprint extraction model construction method, voiceprint identification method and related equipment
CN114859269A (en) Cable fault diagnosis method based on voiceprint recognition technology
CN115376526A (en) Power equipment fault detection method and system based on voiceprint recognition
CN116913258B (en) Speech signal recognition method, device, electronic equipment and computer readable medium
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN116229966A (en) Method and system for controlling intelligent electrical appliance based on voice
CN116741159A (en) Audio classification and model training method and device, electronic equipment and storage medium
CN104240705A (en) Intelligent voice-recognition locking system for safe box
CN113327633A (en) Method and device for detecting noisy speech endpoint based on deep neural network model
CN106782550A (en) A kind of automatic speech recognition system based on dsp chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination