CN113808587B - Voice instruction autonomous recognition algorithm - Google Patents

Voice instruction autonomous recognition algorithm

Info

Publication number
CN113808587B
CN113808587B
Authority
CN
China
Prior art keywords
instruction
fluctuation
audio
voice
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111364061.2A
Other languages
Chinese (zh)
Other versions
CN113808587A (en)
Inventor
付俊生
陶阳
靳凯丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Lonrec Electric Technology Co ltd
Original Assignee
Nanjing Lonrec Electric Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Lonrec Electric Technology Co ltd filed Critical Nanjing Lonrec Electric Technology Co ltd
Priority to CN202111364061.2A
Publication of CN113808587A
Application granted
Publication of CN113808587B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use, for comparison or discrimination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The invention discloses a voice instruction autonomous recognition algorithm, belonging to the technical field of voice instruction recognition, which solves the problem that existing voice instructions require complex operation. The algorithm comprises the following steps: S1: set up a voice instruction library, in which the user adds instructions and their corresponding operations; S2: collect audio through a microphone and filter out noise; S3: perform voice recognition on the audio, match the recognized result against the voice instruction library, and form an instruction set; S4: after a complete instruction set has been collected, encode it according to each instruction; S5: after encoding is finished, execute the corresponding program through the encoding and display the corresponding response to the user. With a single instruction the user can complete, in one pass, all the user-defined operations under that instruction, which effectively simplifies operation and improves the convenience of the device.

Description

Voice instruction autonomous recognition algorithm
Technical Field
The invention relates to the technical field of voice instruction recognition, in particular to a voice instruction autonomous recognition algorithm.
Background
Getting a machine to understand spoken language has long been a research goal. Voice recognition technology enables a machine to convert voice signals into corresponding text or commands through a process of recognition and understanding, and it has by now been widely applied in many fields.
The patent with application number CN201410470891.7 discloses a speech recognition intelligent LED bulb, characterized in that: a microphone for receiving sound is integrated in the lamp and continuously receives external sound signals; an indicator lamp turns on when a sound signal is received and turns off when there is none. An intelligent voice recognition system integrated in the lamp processes the external sound signals received by the microphone in real time, and an internal recognition algorithm recognizes the lamp on/off control instructions spoken by the user: when the user speaks the "light on" voice command, the LED lamp panel lights up; when the user speaks the "light off" voice command, the LED lamp panel turns off. There is no need to press a switch by hand to start or stop the lamp, and no remote control equipment is required; the lamp is controlled simply by speaking a voice command. It also adds a voice-controlled brightness function that traditional lamps lack.
However, conventional speech recognition has the following problems: relatively simple instructions are realized through complex user operations, with one instruction corresponding to one conventional operation, so operation is cumbersome; the response speed for the high-frequency instructions a user commonly issues is low; and existing voice recognition devices stay in an always-on state, which increases risk in dangerous situations such as water leakage and fire.
Disclosure of Invention
The invention aims to provide a voice instruction autonomous recognition algorithm that is programmed through user-defined voice instructions: with a single instruction the user can complete, in one pass, all the user-defined operations under that instruction, which effectively simplifies operation, improves the convenience of the device, and solves the problems described in the background art.
To achieve this aim, the invention provides the following technical scheme: a voice instruction autonomous recognition algorithm, comprising the following steps:
s1: setting a voice instruction library, and adding instructions and corresponding operations thereof in the voice instruction library by a user;
s2: collecting audio through a microphone, converting the audio into an audio waveform diagram, correcting the fluctuation amplitude between adjacent fluctuation points, and converting the audio waveform diagram back into audio to realize noise filtering;
s3: performing voice recognition on the audio, matching the recognized result against the voice instruction library to form an instruction set; if an instruction does not exist, repeating the operation of S1 to add the instruction and its corresponding operation to the instruction library;
s4: after a complete instruction set is collected, coding is carried out according to each instruction;
s5: after the coding is finished, executing a corresponding program through the coding, and further displaying a corresponding response to the user;
in S2, let I(t) denote the audio waveform diagram received at time t, and let I(x_t, y_t) denote the fluctuation point at coordinates (x_t, y_t) in the audio waveform diagram. Let L(x_t, y_t) denote the local neighborhood of the fluctuation point I(x_t, y_t), where L(x_t, y_t) is a local region of size 2n ± 1 centered on the fluctuation point I(x_t, y_t) and n is a given positive integer. Let S(x_t, y_t) denote the set of fluctuation points in the local neighborhood L(x_t, y_t) that are similar to the fluctuation point I(x_t, y_t), set a smoothness threshold H(t), and let I(a_t, b_t) denote the fluctuation point at coordinates (a_t, b_t) in the local neighborhood L(x_t, y_t);
when the fluctuation point I(a_t, b_t) satisfies [h(a_t, b_t) − h(x_t, y_t)] ≤ H(t), the fluctuation point I(a_t, b_t) is added to the set S(x_t, y_t);
when the fluctuation point I(a_t, b_t) satisfies [h(a_t, b_t) − h(x_t, y_t)] > H(t), the fluctuation point I(a_t, b_t) is not added to the set S(x_t, y_t);
where h(a_t, b_t) denotes the smoothed value of the fluctuation point I(a_t, b_t), and h(x_t, y_t) denotes the smoothed value of the fluctuation point I(x_t, y_t);
let s(x_t, y_t) denote the similarity detection coefficient of the fluctuation point I(x_t, y_t) in the local neighborhood L(x_t, y_t), where s(x_t, y_t) equals N_S(x_t, y_t) divided by N_L(x_t, y_t); here N_S(x_t, y_t) denotes the number of fluctuation points in the set S(x_t, y_t), and N_L(x_t, y_t) denotes the number of fluctuation points in the local neighborhood L(x_t, y_t);
let s1(x_t, y_t) denote the median of the similarity detection coefficients of the points in the local neighborhood L(x_t, y_t). When the fluctuation point I(x_t, y_t) satisfies s(x_t, y_t) ≥ s1(x_t, y_t), the fluctuation point I(x_t, y_t) is judged to have normal fluctuation amplitude and its amplitude value h(x_t, y_t) is left unchanged; when the fluctuation point I(x_t, y_t) satisfies s(x_t, y_t) < s1(x_t, y_t), it is judged to be a noise fluctuation point, and its amplitude value h(x_t, y_t) is corrected as follows:
s(x_t, y_t) = I(x_t, y_t) / I(a_t, b_t)    (1)
wherein
I(x_t, y_t) ∈ S(x_t, y_t)    (2)
I(a_t, b_t) ∈ S(x_t, y_t)    (3)
In formula (1), s(x_t, y_t) denotes the similarity detection coefficient of the fluctuation point I(x_t, y_t) in the local neighborhood L(x_t, y_t), I(x_t, y_t) denotes the fluctuation point at coordinates (x_t, y_t) in the local neighborhood L(x_t, y_t), and I(a_t, b_t) denotes the fluctuation point at coordinates (a_t, b_t) in the local neighborhood L(x_t, y_t); in formulas (2) and (3), S(x_t, y_t) denotes the set of fluctuation points in the local neighborhood L(x_t, y_t);
I(x_t, y_t) / I(a_t, b_t) = (x_t − a_t)(y_t − b_t)    (4)
then
s(x_t, y_t) = (x_t − a_t)(y_t − b_t)    (5)
Let the fluctuation point before correction be I(x_t, y_t), the corrected fluctuation point be I(x̄_t, ȳ_t), and let I(a_t, b_t) denote a fluctuation point in the neighborhood of I(x_t, y_t) that does not need correction;
I(x̄_t, ȳ_t) = I(a_t, b_t) / s(x_t, y_t)    (6)
x̄_t = a_t / s(x_t, y_t)    (7)
ȳ_t = b_t / s(x_t, y_t)    (8)
I(a_t, b_t) ∈ S(x_t, y_t)    (9)
where, in formulas (7) and (8), s(x_t, y_t) denotes the similarity detection coefficient of the fluctuation point I(x_t, y_t) in the local neighborhood L(x_t, y_t), x̄_t is the X-axis coordinate value of the corrected fluctuation point, and ȳ_t is the Y-axis coordinate value of the corrected fluctuation point.
Further, the voice instruction library in S1 contains multiple groups of instruction data packets. Each group of instruction data packets holds either a text field, an instruction and an operation, or an audio segment, an instruction and an operation, wherein the text fields are words and phrases the user commonly uses, obtained through big-data collection, the operations are operations the user commonly performs, and the audio segments are water-leakage sounds and fire alarm sounds obtained through big-data collection.
Furthermore, the text field, instruction and operation in each group of instruction data packets correspond one to one, as do the audio segment, instruction and operation, and one audio clip can trigger multiple groups of instruction data packets simultaneously.
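By way of example, the instruction data packets described above could be held in a structure like the following; the field names and the `match` helper are hypothetical, chosen only to show the one-to-one packet layout and the one-audio-to-many-packets triggering.

```python
from dataclasses import dataclass

@dataclass
class InstructionPacket:
    """One group in the voice instruction library; field names are illustrative."""
    instruction: str
    operation: str                     # one instruction <-> one operation per packet
    trigger_text: str | None = None    # common word/phrase from big data, or
    trigger_audio: str | None = None   # id of an audio segment (e.g. an alarm sound)

library = [
    InstructionPacket("LIGHT_ON", "light.on", trigger_text="turn on the light"),
    InstructionPacket("FIRE_WARN", "warn.user", trigger_audio="fire_alarm"),
    InstructionPacket("POWER_OFF", "power.off", trigger_audio="fire_alarm"),
]

def match(text=None, audio=None):
    """Return every packet the input triggers; one audio clip may hit several."""
    return [p for p in library
            if (text is not None and p.trigger_text == text)
            or (audio is not None and p.trigger_audio == audio)]

print([p.instruction for p in match(audio="fire_alarm")])  # two packets fire at once
```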
Furthermore, the microphone in S2 collects audio in two modes: the first sets the microphone to record audio at a fixed frequency; the second sets a wake-up instruction, after which the microphone records voice once woken.
Further, the step S3 includes the following steps:
s301: converting the denoised audio data into an audio signal;
s302: performing voice feature recognition on the audio signal; if no voice feature exists, comparing the audio signal with the audio segments in the voice instruction library to obtain the matched instruction and operation; if a voice feature exists, performing S303;
s303: extracting voice features and performing voice recognition on them according to a voice recognition algorithm to obtain text data;
s304: screening and analyzing the text data, segmenting it into words, and matching it against the voice instruction library to obtain the matched instruction and operation.
Further, the voice recognition algorithm is a DTW (dynamic time warping) algorithm.
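The description names DTW without giving an implementation, so the following is a generic dynamic time warping distance plus a nearest-template lookup, one plausible way to compare an utterance's features against library templates; `recognize` and the template dictionary are illustrative assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two feature sequences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]                  # treat a scalar sequence as (n, 1)
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[n, m]

def recognize(features, templates):
    """Return the library template name with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(features, templates[name]))
```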
Further, the instruction encoding method in S4 is as follows (a code sketch of this flow is given after the steps):
s401: acquiring audio and entering a programming mode;
s402: starting to collect the voice commands issued by the user, and proceeding to the next step after the user has stopped speaking for a specified pause time;
s403: prompting the user whether there are more instructions; if so, repeating S402, otherwise executing S404;
s404: converting the collected voice into text, displaying the text on a screen, and giving a prompt; if the text is correct, proceeding to the next step;
s405: analyzing the grammar structure and language entities of the segmented text instruction with the trained grammar-structure and language-entity deep neural network models respectively, then searching the analyzed text instruction for specific instruction elements according to the instruction library, extracting them, and collecting them into an instruction set;
s406: playing a prompt asking for the shortcut voice that will trigger the instruction set;
s407: collecting the user's voice and playing a confirmation prompt; if confirmed, storing it in the user-defined instruction set library, otherwise repeating S405;
s408: prompting the user that the programming mode is finished.
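As a sketch of the S401 to S408 dialogue flow under stated assumptions: the four callbacks below stand in for the recording, confirmation, NLP-parsing and storage machinery the patent leaves unspecified, and all names are invented for illustration.

```python
def programming_mode(listen, confirm, parse, store):
    """Sketch of the S401-S408 programming-mode dialogue.

    Hypothetical callbacks:
      listen()   -> transcribed text of one spoken command, or None on silence
      confirm(q) -> bool answer to a yes/no prompt
      parse(ts)  -> instruction set extracted from the texts (the S405 models)
      store(k,v) -> save shortcut -> instruction set in the custom library
    """
    texts = []
    while True:
        text = listen()                        # S402: one command, ended by a pause
        if text is None:
            break
        print(text)                            # S404: show the transcription
        while not confirm("Is this instruction correct?"):
            text = input("Enter the corrected field: ") or text   # S404 fix-up
        texts.append(text)
        if not confirm("Are there more instructions?"):           # S403
            break
    instruction_set = parse(texts)             # S405: build the instruction set
    shortcut = listen()                        # S406/S407: shortcut trigger phrase
    if shortcut and confirm("Confirm this shortcut?"):
        store(shortcut, instruction_set)       # save to the user-defined library
    print("Programming mode finished")         # S408
```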
Further, if the instruction is wrong in S404, the user is prompted as to which character should be replaced; the user manually inputs the changed field or re-records the audio until the instruction is correct, and after the user confirms it, the next step proceeds.
Further, the instruction set and its corresponding audio are stored after the instruction set has been executed in S405, and when the microphone later collects the same audio it directly triggers the instruction set to run again.
Compared with the prior art, the invention has the beneficial effects that:
1. the voice instruction autonomous recognition algorithm provided by the invention is programmed through user-defined voice instructions: with a single instruction the user can complete, in one pass, all the user-defined operations under that instruction, which effectively simplifies operation and improves the convenience of the device;
2. the voice instruction autonomous recognition algorithm collects, through big data, the high-frequency words and phrases a user employs, together with water-leakage sounds, fire alarm sounds and the like, uses them as comparison data, and arranges them into an instruction library, which improves the efficiency of manual entry;
3. the voice instruction autonomous recognition algorithm provided by the invention collects indoor ambient sound at a fixed frequency, analyzes and processes it to judge whether the audio contains water-leakage sounds, fire alarm sounds or other alarm sounds, sends a warning to the user upon recognition, and then reduces the harm electrical equipment could cause in a dangerous situation by operations such as turning off the electrical equipment, switching off the power supply and shutting down automatically.
Drawings
FIG. 1 is an overall flow chart of the voice command autonomous recognition algorithm of the present invention;
FIG. 2 is a schematic diagram of the operation of the voice command autonomous recognition algorithm of the present invention;
FIG. 3 is a diagram of a voice command library structure of the voice command autonomous recognition algorithm of the present invention;
FIG. 4 is a flow chart of a speech recognition method of the voice command autonomous recognition algorithm of the present invention;
FIG. 5 is a flowchart of an instruction encoding method of the speech instruction autonomous recognition algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 to 3, an autonomous voice command recognition algorithm includes the following steps:
s1: setting a voice instruction library, and adding instructions and corresponding operations thereof in the voice instruction library by a user;
the voice instruction library contains multiple groups of instruction data packets, each holding either a text field, an instruction and an operation, or an audio segment, an instruction and an operation, wherein the text fields are words and phrases the user commonly uses, obtained through big-data collection, the operations are the user's common operations, and the audio segments are water-leakage sounds and fire alarm sounds obtained through big-data collection, which improves the efficiency of manual entry; the text field, instruction and operation in each group of instruction data packets correspond one to one, as do the audio segment, instruction and operation, and one audio clip can trigger multiple groups of instruction data packets simultaneously;
s2: collecting audio through a microphone and converting the audio into an audio waveform diagram. Let I(t) denote the audio waveform diagram received at time t and let I(x_t, y_t) denote the fluctuation point at coordinates (x_t, y_t) in the audio waveform diagram; the fluctuation amplitude between adjacent fluctuation points is corrected and the audio waveform diagram is converted back into audio to realize noise filtering. The microphone collects audio in two modes. The first sets the microphone to record audio at a fixed frequency: indoor ambient sound is collected at a fixed frequency, analyzed and processed to judge whether the audio contains water-leakage sounds, fire alarm sounds or other alarm sounds; upon recognition a warning is sent to the user, and the harm electrical equipment could cause in a dangerous situation is then reduced by operations such as turning off the electrical equipment, switching off the power supply and shutting down automatically. The second sets a wake-up program instruction, after which the microphone records voice once woken;
let L(x_t, y_t) denote the local neighborhood of the fluctuation point I(x_t, y_t), where L(x_t, y_t) is a local region of size 2n ± 1 centered on the fluctuation point I(x_t, y_t) and n is a given positive integer. Let S(x_t, y_t) denote the set of fluctuation points in the local neighborhood L(x_t, y_t) that are similar to the fluctuation point I(x_t, y_t), set a smoothness threshold H(t), and let I(a_t, b_t) denote the fluctuation point at coordinates (a_t, b_t) in the local neighborhood L(x_t, y_t);
when the fluctuation point I(a_t, b_t) satisfies [h(a_t, b_t) − h(x_t, y_t)] ≤ H(t), the fluctuation point I(a_t, b_t) is added to the set S(x_t, y_t);
when the fluctuation point I(a_t, b_t) satisfies [h(a_t, b_t) − h(x_t, y_t)] > H(t), the fluctuation point I(a_t, b_t) is not added to the set S(x_t, y_t);
where h(a_t, b_t) denotes the smoothed value of the fluctuation point I(a_t, b_t), and h(x_t, y_t) denotes the smoothed value of the fluctuation point I(x_t, y_t);
let s(x_t, y_t) denote the similarity detection coefficient of the fluctuation point I(x_t, y_t) in the local neighborhood L(x_t, y_t), where s(x_t, y_t) equals N_S(x_t, y_t) divided by N_L(x_t, y_t); here N_S(x_t, y_t) denotes the number of fluctuation points in the set S(x_t, y_t), and N_L(x_t, y_t) denotes the number of fluctuation points in the local neighborhood L(x_t, y_t);
let s1(x_t, y_t) denote the median of the similarity detection coefficients of the points in the local neighborhood L(x_t, y_t). When the fluctuation point I(x_t, y_t) satisfies s(x_t, y_t) ≥ s1(x_t, y_t), the fluctuation point I(x_t, y_t) is judged to have normal fluctuation amplitude and its amplitude value h(x_t, y_t) is left unchanged; when the fluctuation point I(x_t, y_t) satisfies s(x_t, y_t) < s1(x_t, y_t), it is judged to be a noise fluctuation point, and its amplitude value h(x_t, y_t) is corrected as follows:
s(x_t, y_t) = I(x_t, y_t) / I(a_t, b_t)    (1)
wherein
I(x_t, y_t) ∈ S(x_t, y_t)    (2)
I(a_t, b_t) ∈ S(x_t, y_t)    (3)
In formula (1), s(x_t, y_t) denotes the similarity detection coefficient of the fluctuation point I(x_t, y_t) in the local neighborhood L(x_t, y_t), I(x_t, y_t) denotes the fluctuation point at coordinates (x_t, y_t) in the local neighborhood L(x_t, y_t), and I(a_t, b_t) denotes the fluctuation point at coordinates (a_t, b_t) in the local neighborhood L(x_t, y_t); in formulas (2) and (3), S(x_t, y_t) denotes the set of fluctuation points in the local neighborhood L(x_t, y_t);
I(x_t, y_t) / I(a_t, b_t) = (x_t − a_t)(y_t − b_t)    (4)
then
s(x_t, y_t) = (x_t − a_t)(y_t − b_t)    (5)
Let the fluctuation point before correction be I(x_t, y_t), the corrected fluctuation point be I(x̄_t, ȳ_t), and let I(a_t, b_t) denote a fluctuation point in the neighborhood of I(x_t, y_t) that does not need correction;
I(x̄_t, ȳ_t) = I(a_t, b_t) / s(x_t, y_t)    (6)
x̄_t = a_t / s(x_t, y_t)    (7)
ȳ_t = b_t / s(x_t, y_t)    (8)
I(a_t, b_t) ∈ S(x_t, y_t)    (9)
where, in formulas (7) and (8), s(x_t, y_t) denotes the similarity detection coefficient of the fluctuation point I(x_t, y_t) in the local neighborhood L(x_t, y_t), x̄_t is the X-axis coordinate value of the corrected fluctuation point, and ȳ_t is the Y-axis coordinate value of the corrected fluctuation point;
with noise thus eliminated, noise interference in the voice recognition process is small, and errors are reduced when recognizing water-leakage sounds and fire alarm sounds;
s3: performing voice recognition on the audio, matching the recognized result against the voice instruction library to form an instruction set; if an instruction does not exist, repeating the operation of S1 to add the instruction and its corresponding operation to the instruction library;
s4: after a complete instruction set is collected, coding is carried out according to each instruction;
s5: after encoding is completed, the corresponding program is executed through the encoding and the corresponding response is displayed to the user, making operation more concise. The arrangement of encoded programs mainly solves the problem of complex operation in intelligent devices, so that most users can use intelligent devices more conveniently in daily life.
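Tying the steps together, here is a minimal end-to-end sketch of S1 to S5; every callback name is a hypothetical stand-in for a component described above, not an interface the patent defines.

```python
def autonomous_recognition(record, denoise, recognize, library, execute):
    """End-to-end sketch of S1-S5; all five callbacks are hypothetical.

    record()     -> raw audio from the microphone                    (S2)
    denoise(a)   -> audio after the waveform correction shown above  (S2)
    recognize(a) -> transcribed text, e.g. by DTW template matching  (S3)
    library      -> dict mapping recognized text -> encoded program  (S1/S4)
    execute(p)   -> run the encoded program and show a response      (S5)
    """
    text = recognize(denoise(record()))
    program = library.get(text)
    if program is None:
        # instruction not in the library: fall back to S1 so the user
        # can add the instruction and its corresponding operation
        print(f"'{text}' is not in the instruction library yet")
    else:
        execute(program)
```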
Referring to fig. 4, the speech recognition in S3 includes the following steps:
s301: converting the denoised audio data into an audio signal;
s302: performing voice feature recognition on the audio signal; if no voice feature exists, comparing the audio signal with the audio segments in the voice instruction library to obtain the matched instruction and operation; if a voice feature exists, performing S303;
s303: extracting voice features and performing voice recognition on them according to a voice recognition algorithm to obtain text data, the voice recognition algorithm here being a DTW (dynamic time warping) algorithm;
s304: screening and analyzing the text data, segmenting it into words, and matching it against the voice instruction library to obtain the matched instruction and operation.
Referring to fig. 5, the instruction encoding method in S4 is as follows:
s401: acquiring audio and entering the programming mode. The acquired audio can be recorded actively, or recorded after the microphone is started by a wake-up program instruction. For example, applied to an intelligent floor-sweeping robot whose default name is "wisdom": when the sweeping robot receives the "wisdom" instruction, the operation corresponding to the instruction is a spoken reply (e.g. "I'm here"). One instruction can also perform multiple operations; for example, "I want to sleep" corresponds to first turning off the lamp, then running the air conditioner in sleep mode, drawing the curtains, and so on. A single instruction thus realizes multiple operations, reducing complex user operations, raising the happiness index and making life more efficient (a sketch of this one-instruction-to-many-operations mapping is given after these steps);
s402: starting to collect the voice commands issued by the user, and proceeding to the next step after the user has stopped speaking for a specified pause time;
s403: prompting the user whether there are more instructions; if so, repeating S402, otherwise executing S404;
s404: converting the collected voice into text, displaying the text on a screen, and giving a prompt; if the text is correct, proceeding to the next step; if the instruction is wrong, prompting the user as to which character should be replaced, where the user can manually input the changed field or re-record the audio until the instruction is correct, and proceeding after the user confirms;
s405: analyzing the grammar structure and language entities of the segmented text instruction with the trained grammar-structure and language-entity deep neural network models respectively, then searching the analyzed text instruction for specific instruction elements according to the instruction library, extracting them, and collecting them into an instruction set; the instruction set and its corresponding audio are stored after the instruction set has been executed, and when the microphone later collects the same audio it directly triggers the instruction set to run again;
s406: playing a prompt asking for the shortcut voice that will trigger the instruction set;
s407: collecting the user's voice and playing a confirmation prompt; if confirmed, storing it in the user-defined instruction set library, otherwise repeating S405;
s408: prompting the user that the programming mode is finished.
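To make the one-instruction-to-many-operations mapping concrete, here is a minimal sketch; the "I want to sleep" phrase and its three operations come from the example above, while the function names are invented for illustration.

```python
# Hypothetical device actions; a real system would call actual device APIs.
def lamp_off():        print("lamp off")
def ac_sleep_mode():   print("air conditioner -> sleep mode")
def close_curtains():  print("curtains closed")

# One user-defined instruction expands into an ordered list of operations.
custom_instructions = {
    "I want to sleep": [lamp_off, ac_sleep_mode, close_curtains],
}

def execute(phrase):
    for operation in custom_instructions.get(phrase, []):
        operation()                    # operations run in the stored order

execute("I want to sleep")             # one instruction, three operations
```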
In summary: the voice instruction autonomous recognition algorithm provided by the invention is programmed through user-defined voice instructions, and with a single instruction the user can complete, in one pass, all the user-defined operations under that instruction, which effectively simplifies operation and improves the convenience of the device; high-frequency words and phrases used by the user, together with water-leakage sounds, fire alarm sounds and the like, are collected through big data, used as comparison data and arranged into an instruction library, which improves the efficiency of manual entry; indoor ambient sound is collected at a fixed frequency, analyzed and processed to judge whether the audio contains water-leakage sounds, fire alarm sounds or other alarm sounds, a warning is sent to the user upon recognition, and the harm electrical equipment could cause in a dangerous situation is reduced by operations such as turning off the electrical equipment, switching off the power supply and shutting down automatically.
The above description is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent replacement or change that a person skilled in the art could readily conceive from the technical solutions and inventive concept of the present invention, within the technical scope disclosed herein, shall fall within the protection scope of the present invention.

Claims (9)

1. A voice instruction autonomous recognition algorithm, characterized by comprising the following steps:
s1: setting a voice instruction library, and adding instructions and corresponding operations thereof in the voice instruction library by a user;
s2: collecting audio through a microphone, converting the audio into an audio waveform diagram, correcting the fluctuation amplitude between adjacent fluctuation points, and converting the audio waveform diagram back into audio to realize noise filtering;
s3: performing voice recognition on the audio, matching the recognized result against the voice instruction library to form an instruction set; if an instruction does not exist, repeating the operation of S1 to add the instruction and its corresponding operation to the instruction library;
s4: after a complete instruction set is collected, coding is carried out according to each instruction;
s5: after the coding is finished, executing a corresponding program through the coding, and further displaying a corresponding response to the user;
in S2, let I(t) denote the audio waveform diagram received at time t, and let I(x_t, y_t) denote the fluctuation point at coordinates (x_t, y_t) in the audio waveform diagram. Let L(x_t, y_t) denote the local neighborhood of the fluctuation point I(x_t, y_t), where L(x_t, y_t) is a local region of size 2n ± 1 centered on the fluctuation point I(x_t, y_t) and n is a given positive integer. Let S(x_t, y_t) denote the set of fluctuation points in the local neighborhood L(x_t, y_t) that are similar to the fluctuation point I(x_t, y_t), set a smoothness threshold H(t), and let I(a_t, b_t) denote the fluctuation point at coordinates (a_t, b_t) in the local neighborhood L(x_t, y_t);
when the fluctuation point I(a_t, b_t) satisfies [h(a_t, b_t) − h(x_t, y_t)] ≤ H(t), the fluctuation point I(a_t, b_t) is added to the set S(x_t, y_t);
when the fluctuation point I(a_t, b_t) satisfies [h(a_t, b_t) − h(x_t, y_t)] > H(t), the fluctuation point I(a_t, b_t) is not added to the set S(x_t, y_t);
where h(a_t, b_t) denotes the smoothed value of the fluctuation point I(a_t, b_t), and h(x_t, y_t) denotes the smoothed value of the fluctuation point I(x_t, y_t);
let s(x_t, y_t) denote the similarity detection coefficient of the fluctuation point I(x_t, y_t) in the local neighborhood L(x_t, y_t), where s(x_t, y_t) equals N_S(x_t, y_t) divided by N_L(x_t, y_t); here N_S(x_t, y_t) denotes the number of fluctuation points in the set S(x_t, y_t), and N_L(x_t, y_t) denotes the number of fluctuation points in the local neighborhood L(x_t, y_t);
let s1(x_t, y_t) denote the median of the similarity detection coefficients of the points in the local neighborhood L(x_t, y_t). When the fluctuation point I(x_t, y_t) satisfies s(x_t, y_t) ≥ s1(x_t, y_t), the fluctuation point I(x_t, y_t) is judged to have normal fluctuation amplitude and its amplitude value h(x_t, y_t) is left unchanged; when the fluctuation point I(x_t, y_t) satisfies s(x_t, y_t) < s1(x_t, y_t), it is judged to be a noise fluctuation point, and its amplitude value h(x_t, y_t) is corrected as follows:
s(x_t, y_t) = I(x_t, y_t) / I(a_t, b_t)    (1)
wherein
I(x_t, y_t) ∈ S(x_t, y_t)    (2)
I(a_t, b_t) ∈ S(x_t, y_t)    (3)
In formula (1), s(x_t, y_t) denotes the similarity detection coefficient of the fluctuation point I(x_t, y_t) in the local neighborhood L(x_t, y_t), I(x_t, y_t) denotes the fluctuation point at coordinates (x_t, y_t) in the local neighborhood L(x_t, y_t), and I(a_t, b_t) denotes the fluctuation point at coordinates (a_t, b_t) in the local neighborhood L(x_t, y_t); in formulas (2) and (3), S(x_t, y_t) denotes the set of fluctuation points in the local neighborhood L(x_t, y_t);
I(x_t, y_t) / I(a_t, b_t) = (x_t − a_t)(y_t − b_t)    (4)
then
s(x_t, y_t) = (x_t − a_t)(y_t − b_t)    (5)
Let the fluctuation point before correction be I(x_t, y_t), the corrected fluctuation point be I(x̄_t, ȳ_t), and let I(a_t, b_t) denote a fluctuation point in the neighborhood of I(x_t, y_t) that does not need correction;
I(x̄_t, ȳ_t) = I(a_t, b_t) / s(x_t, y_t)    (6)
x̄_t = a_t / s(x_t, y_t)    (7)
ȳ_t = b_t / s(x_t, y_t)    (8)
I(a_t, b_t) ∈ S(x_t, y_t)    (9)
where, in formulas (7) and (8), s(x_t, y_t) denotes the similarity detection coefficient of the fluctuation point I(x_t, y_t) in the local neighborhood L(x_t, y_t), x̄_t is the X-axis coordinate value of the corrected fluctuation point, and ȳ_t is the Y-axis coordinate value of the corrected fluctuation point.
2. The voice instruction autonomous recognition algorithm according to claim 1, wherein the voice instruction library in S1 contains multiple groups of instruction data packets, each group holding either a text field, an instruction and an operation, or an audio segment, an instruction and an operation, wherein the text fields are words and phrases the user commonly uses, obtained through big-data collection, the operations are operations the user commonly performs, and the audio segments are water-leakage sounds and fire alarm sounds obtained through big-data collection.
3. The voice instruction autonomous recognition algorithm according to claim 2, wherein the text field, instruction and operation in each group of instruction data packets correspond one to one, as do the audio segment, instruction and operation, and one audio clip can simultaneously trigger multiple groups of instruction data packets.
4. The voice instruction autonomous recognition algorithm according to claim 1, wherein the microphone in S2 collects audio in two modes: the first sets the microphone to record audio at a fixed frequency, and the second sets a wake-up instruction, after which the microphone records audio once woken.
5. The voice instruction autonomous recognition algorithm according to claim 2, wherein said S3 comprises the following steps:
s301: converting the denoised audio data into an audio signal;
s302: performing voice feature recognition on the audio signal; if no voice feature exists, comparing the audio signal with the audio segments in the voice instruction library to obtain the matched instruction and operation; if a voice feature exists, performing S303;
s303: extracting voice features and performing voice recognition on them according to a voice recognition algorithm to obtain text data;
s304: screening and analyzing the text data, segmenting it into words, and matching it against the voice instruction library to obtain the matched instruction and operation.
6. The voice instruction autonomous recognition algorithm according to claim 1, wherein the voice recognition algorithm is a DTW (dynamic time warping) algorithm.
7. The voice instruction autonomous recognition algorithm according to claim 1, wherein the instruction encoding method in S4 is as follows:
s401: acquiring audio and entering a programming mode;
s402: starting to collect the voice commands issued by the user, and proceeding to the next step after the user has stopped speaking for a specified pause time;
s403: prompting the user whether there are more instructions; if so, repeating S402, otherwise executing S404;
s404: converting the collected voice into text, displaying the text on a screen, and giving a prompt; if the text is correct, proceeding to the next step;
s405: analyzing the grammar structure and language entities of the segmented text instruction with the trained grammar-structure and language-entity deep neural network models respectively, then searching the analyzed text instruction for specific instruction elements according to the instruction library, extracting them, and collecting them into an instruction set;
s406: playing a prompt asking for the shortcut voice that will trigger the instruction set;
s407: collecting the user's voice and playing a confirmation prompt; if confirmed, storing it in the user-defined instruction set library, otherwise repeating S405;
s408: prompting the user that the programming mode is finished.
8. The voice instruction autonomous recognition algorithm according to claim 7, wherein if the instruction is wrong in S404, the user is prompted as to which character should be replaced; the user manually inputs the changed field or re-records the audio until the instruction is correct, and the next step proceeds after the user confirms.
9. The voice instruction autonomous recognition algorithm according to claim 7, wherein the instruction set and its corresponding audio are stored after the instruction set has been executed in S405, and when the microphone later collects the same audio it directly triggers the instruction set to run again.
CN202111364061.2A 2021-11-17 2021-11-17 Voice instruction autonomous recognition algorithm Active CN113808587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111364061.2A CN113808587B (en) 2021-11-17 2021-11-17 Voice instruction autonomous recognition algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111364061.2A CN113808587B (en) 2021-11-17 2021-11-17 Voice instruction autonomous recognition algorithm

Publications (2)

Publication Number Publication Date
CN113808587A (en) 2021-12-17
CN113808587B (en) 2022-04-12

Family

ID=78898659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111364061.2A Active CN113808587B (en) 2021-11-17 2021-11-17 Voice instruction autonomous recognition algorithm

Country Status (1)

Country Link
CN (1) CN113808587B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902193A (en) * 2012-12-30 2014-07-02 青岛海尔软件有限公司 System and method for operating computers to change slides by aid of voice
CN105407588A (en) * 2014-09-16 2016-03-16 何庆沐 Speech recognition intelligent LED bulb
CN106887227A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of voice awakening method and system
CN105744326A (en) * 2016-02-03 2016-07-06 广东长虹电子有限公司 Editable voice intelligent control method and system for television
CN108074565A (en) * 2016-11-11 2018-05-25 上海诺悦智能科技有限公司 Phonetic order redirects the method and system performed with detailed instructions
CN107734193A (en) * 2017-11-22 2018-02-23 深圳悉罗机器人有限公司 Smart machine system and smart machine control method
CN109758716B (en) * 2019-03-26 2020-12-01 林叶蓁 Rope skipping counting method based on sound information
CN113128228A (en) * 2021-04-07 2021-07-16 北京大学深圳研究院 Voice instruction recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113808587A (en) 2021-12-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: An Autonomous Speech Instruction Recognition Algorithm

Granted publication date: 20220412

Pledgee: Bank of Nanjing Co.,Ltd. Nanjing Chengnan sub branch

Pledgor: NANJING LONREC ELECTRIC TECHNOLOGY CO.,LTD.

Registration number: Y2024980008006