WO2021042969A1

WO2021042969A1 - Construction apparatus and construction method for self-learning speech recognition system

Info

Publication number: WO2021042969A1
Application number: PCT/CN2020/109393
Authority: WO
Inventors: 樊茂
Original assignee: 晶晨半导体(上海)股份有限公司
Priority date: 2019-09-05
Filing date: 2020-08-14
Publication date: 2021-03-11
Also published as: CN110610710A; CN110610710B

Abstract

A construction apparatus and construction method for a self-learning speech recognition system. The construction apparatus is applied to a speech recognition system; the speech recognition system comprises a microphone and a speech recognition module using the construction apparatus; the microphone is connected to the speech recognition module; the construction apparatus comprises an analysis unit (1) configured to analyze an output signal of the microphone to obtain a plurality of signal parameters, and a recognition unit (2) connected to the analysis unit (1) and configured to determine, according to the signal parameters, whether the output signal is a preset activation speech. A wake up operation is implemented by means of an activation speech, so that a power supply module, an ADC, and a CPU in a standby process are enabled to sleep, thereby reducing energy consumption in the standby process.

Description

Construction device and construction method of self-learning speech recognition system

Technical field

The present invention relates to the technical field of voice recognition, in particular to a construction device and construction method of a self-learning speech recognition system.

Background technique

With the rapid development of computer application technology, the application of speech or other types of voice recognition technology is becoming more and more extensive, and the demand for voice recognition is also increasing. The current ultra-high-definition smart TVs and smart speakers still need to retain the voice wake-up function during the standby period, so the voice recognition system still needs to work, that is, the power module, ADC (Analog-to-Digital Converter, analog-to-digital converter), CPU ( The Central Processing Unit (Central Processing Unit) is still in working mode, which consumes a lot of energy during standby.

Summary of the invention

In view of the above-mentioned problems in the prior art, a construction device of a self-learning speech recognition system aimed at reducing energy consumption during standby is provided.

The specific technical solutions are as follows:

A construction device for a self-learning speech recognition system is applied to a speech recognition system. The speech recognition system includes a microphone and a speech recognition module to which the construction device is applied. The microphone and the speech recognition module are connected. The construction device includes:

The analysis unit is used to analyze the output signal of the microphone to obtain multiple signal parameters;

The recognition unit is connected with the analysis unit, and judges whether the output signal is a preset activation voice according to the signal parameters.

Preferably, the device for constructing a self-learning speech recognition system, wherein the output signal is a waveform signal.

Preferably, the device for constructing a self-learning speech recognition system, wherein the analysis unit sequentially saves each type of signal parameter into a corresponding sequence, and outputs the signal parameter of each sequence to the recognition unit.

Preferably, a device for constructing a self-learning speech recognition system, wherein:

The recognition unit is a neural network, which includes:

The first calculation unit is configured to output the first output parameter according to the signal parameters of the multiple sequences;

The second calculation unit is configured to output the second output parameter according to the signal parameters of the multiple sequences;

The third calculation unit is configured to output the third output parameter according to the signal parameter of the corresponding sequence;

The fourth calculation unit is configured to output the fourth output parameter according to the signal parameter of the corresponding sequence;

The hidden layer includes multiple first nodes, each of the first nodes is connected to the first calculation unit, the second calculation unit, the third calculation unit, and the fourth calculation unit, and each first node is set with a feature that activates the voice Information, the first node receives and judges whether the first output parameter, the second output parameter, the third output parameter, and the fourth output parameter conform to the corresponding characteristic information, and outputs the judgment result;

The output layer includes a plurality of second nodes, each second node is connected to each first node, and each second node is set with a corresponding activation voice, and it is determined whether the output signal matches the activation voice according to the judgment result.

Preferably, the device for constructing a self-learning speech recognition system, wherein the types of signal parameters include wave troughs, wave crests, and the interval time between adjacent wave troughs and wave crests.

Preferably, the device for constructing a self-learning speech recognition system, wherein the first output parameter is an envelope value; and/or

The second output parameter is the number of wave edges formed by adjacent wave troughs and wave crests; and/or

The third output parameter is the difference between two adjacent troughs; and/or

The fourth output parameter is the difference between two adjacent peaks.

The first calculation unit calculates the envelope value through the trough, the crest and the interval time; and/or

The second calculation unit calculates the number of wave edges formed by adjacent wave troughs and wave crests through calculation of wave troughs and wave crests; and/or

The third calculation unit calculates the difference between two adjacent wave troughs through wave troughs; and/or

The fourth calculating unit obtains the difference between two adjacent wave crests by calculating the wave crests.

It also includes a method for constructing a self-learning voice recognition system, which is applied to a voice recognition system. The voice recognition system includes a microphone and a voice recognition module to which a building device is applied, and the microphone is connected to the voice recognition module. The construction method includes the following steps:

Step S1, analyzing the output signal of the microphone to obtain multiple signal parameters;

Step S2, judging whether the output signal is a preset activation voice according to the signal parameters.

In step S2, a neural network is provided, and the neural network determines whether the output signal is a preset activation voice.

Preferably, the device for constructing a self-learning speech recognition system, wherein the neural network includes:

The first calculation unit is used to output the envelope value according to the trough, the crest and the interval time;

The second calculation unit is used to output the number of wave edges composed of adjacent wave troughs and wave crests according to wave troughs and wave crests;

The third calculation unit is used to output the difference between two adjacent wave troughs according to the wave trough;

The fourth calculation unit is configured to output the difference between two adjacent wave crests according to the wave crests;

The hidden layer includes multiple first nodes, each of the first nodes is connected to the first calculation unit, the second calculation unit, the third calculation unit, and the fourth calculation unit, and each first node is set with a feature that activates the voice Information, the first node receives and judges whether the envelope value, the number of wave edges, the difference between two adjacent wave troughs, and the difference between two adjacent wave crests conform to the corresponding characteristic information, and outputs the judgment result;

The output layer includes a plurality of second nodes, each second node is connected to each first node, and each second node is set with a corresponding activation voice, and judging whether the output signal conforms to the activation voice according to the judgment result;

Step S2 includes the following steps:

Step S21, calculating the envelope value through the trough, crest and interval time; and

Calculate through troughs and crests to obtain the number of wave edges formed by adjacent troughs and crests; and

Calculate the difference between two adjacent troughs through troughs; and

Calculate the difference between two adjacent wave crests through wave crests;

Step S22, each first node receives and judges whether the envelope value, the number of wave edges, the difference between two adjacent wave troughs, and the difference between two adjacent wave crests conform to the characteristic information, and output the judgment result;

In step S23, each second node judges whether the output signal conforms to the activation voice according to the judgment result, and outputs the judgment result.

The above technical solution has the following advantages or beneficial effects: the wake-up work is performed by activating the voice, so that the power module, ADC, and CPU sleep during the standby process, so as to reduce the energy consumption during the standby process.

Description of the drawings

With reference to the attached drawings, the embodiments of the present invention are described more fully. However, the attached drawings are only for illustration and illustration, and do not constitute a limitation to the scope of the present invention.

Figure 1 is a schematic structural diagram of an embodiment of a device for constructing a self-learning speech recognition system according to the present invention;

2 is a schematic diagram of the structure of the neural network of the embodiment of the construction device of the self-learning speech recognition system of the present invention;

3 is a flowchart of an embodiment of a method for constructing a self-learning speech recognition system according to the present invention;

Fig. 4 is a flowchart of step S2 of an embodiment of the method for constructing a self-learning speech recognition system of the present invention.

detailed description

The technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

It should be noted that the embodiments of the present invention and the features in the embodiments can be combined with each other if there is no conflict.

The present invention will be further described below in conjunction with the drawings and specific embodiments, but it is not a limitation of the present invention.

The present invention includes a construction device for a self-learning speech recognition system, which is applied to a speech recognition system. The speech recognition system includes a microphone and a speech recognition module to which the construction device is applied. The microphone and the speech recognition module are connected, as shown in FIG. The device includes:

In the above embodiment, the recognition unit recognizes whether the signal parameter in the analysis unit is a preset activation voice, and the activation voice is used to wake up, so that the power module, ADC, and CPU sleep during the standby process to reduce standby Energy consumption in the process.

Among them, the preset activation voice can be a preset number, of which the preset number can be 2, 3, or 4. Since the preset activation voice is obtained according to the first node in the hidden layer, the preset activation Do not speak too much to reduce energy consumption.

Further, in the above embodiment, the output signal is a waveform signal. In this way, multiple signal parameters can be obtained in the waveform signal. For example, the types of the aforementioned signal parameters may include troughs, crests, and the interval time between adjacent troughs and crests.

Further, in the foregoing embodiment, the analysis unit may sequentially save the signal parameters of each type in the corresponding sequence, and output the signal parameters of each sequence to the identification unit.

For example, the sequence where the wave trough is located can be {drop ₁ , drop ₂ ,...drop _n }, where drop is used to represent the wave trough;

The sequence where the wave crest is located can be {rise ₁ , rise ₂ ,...rise _n }, where rise is used to represent the wave crest;

The sequence of the interval time can be {T ₁ , T ₂ ,...T _n }, where T is used to represent the interval time.

Further, in the foregoing embodiment, the recognition unit may be a neural network. As shown in FIG. 2, the neural network includes:

The first calculation unit 10 is configured to output the first output parameter according to the signal parameters of the multiple sequences;

The second calculation unit 20 is configured to output the second output parameter according to the signal parameters of the multiple sequences;

The third calculation unit 30 is configured to output the third output parameter according to the signal parameter of the corresponding sequence;

The fourth calculation unit 40 is configured to output the fourth output parameter according to the signal parameter of the corresponding sequence;

The hidden layer includes a plurality of first nodes 50, and each first node 50 is connected to a first calculation unit 10, a second calculation unit 20, a third calculation unit 30, and a fourth calculation unit 40, and each first node 50 is connected to the first calculation unit 10, the second calculation unit 20, the third calculation unit 30, and the fourth calculation unit 40. Setting a feature information of an activated voice, the first node 50 receives and judges whether the first output parameter, the second output parameter, the third output parameter, and the fourth output parameter conform to the corresponding feature information, and outputs the judgment result;

The output layer includes a plurality of second nodes 60, each second node 60 is connected to each first node 50, and each second node 60 is set with a corresponding activation voice, and it is determined whether the output signal matches the activation voice according to the judgment result.

Among them, the number of the aforementioned hidden layers can be self-set according to the needs of the user.

In the above neural network, each node can be a filter.

Further, as a preferred embodiment, the first output parameter is an envelope value;

The second output parameter is the number of wave edges formed by adjacent wave troughs and wave crests;

The third output parameter is the difference between two adjacent troughs;

The fourth output parameter is the difference between two adjacent peaks. And the first calculation unit 10 calculates the envelope value through the wave trough, the wave crest and the interval time;

The second calculation unit 20 calculates the number of wave edges formed by adjacent wave valleys and wave crests through calculation of wave valleys and wave crests;

The third calculation unit 30 calculates the difference between two adjacent wave troughs by calculating the wave trough;

The fourth calculating unit 40 obtains the difference between two adjacent wave crests by calculating the wave crests.

Among them, it should be noted that the difference between the two adjacent wave troughs is the difference between the previous wave trough in the wave trough sequence minus the next wave trough; the difference between the two adjacent wave crests is the previous wave peak minus the wave peak sequence in the wave sequence. Go to the difference of the last wave crest.

Further, the neural network can set a plurality of preset activation voices for training, and input the signal parameters in the output signals corresponding to the preset activation voices into the neural network. The first calculation unit 10 in the neural network is based on the troughs. , The wave crest and interval time are calculated to obtain the envelope value, the second calculating unit 20 calculates the number of adjacent wave troughs and wave crests according to the wave trough and the wave crest, and the third calculating unit 30 calculates according to the wave trough to obtain the adjacent wave For the difference between the two troughs, the fourth calculation unit 40 calculates the difference between the two adjacent wave crests according to the wave crests. Each first node 50 in the hidden layer receives and judges the envelope value, the number of wave edges, and the adjacent wave crests. Whether the difference between the two troughs and the two adjacent peaks match the characteristic information, and output the judgment result. Each second node 60 in the output layer judges whether the output signal conforms to the active voice according to the judgment result, and outputs the judgment As a result, when the output signal is the corresponding activated voice, the signal parameters in the output signal corresponding to the preset activation voice are repeatedly input for training; when the output signal is not the corresponding activated voice, the first node 50 corresponding to the judgment result is adjusted Weight, continue to input the signal parameters in the output signal for training, until the output layer judges that the output signal is the corresponding active voice, input the signal parameters in the output signal corresponding to other preset active voices for training, so as to achieve prediction Obtain the activation voice corresponding to the output signal.

The judgment result can be represented by a logical value. For example, the logical value of the preset activation voice in the corresponding second node 60 is 1010101010, and the signal parameters in the output signal corresponding to the preset activation voice are input into the neural network. Each first node 50 in the hidden layer receives and determines whether the output parameter matches the characteristic information. When the output parameter matches the characteristic information, the logical value corresponding to the output judgment result is 1; when the output parameter does not match the characteristic information, the output judgment The logical value corresponding to the result is 0; the second node 60 of the output layer judges whether the output signal conforms to the preset activation voice according to the received judgment result. When the judgment result is the corresponding logical value 1010101010, the second node 60 outputs the judgment result The corresponding logical value is 1, to indicate that the output signal conforms to the above-mentioned preset activation voice; when the judgment result is not the corresponding logical value 1010101010, the second node 60 outputs the logical value corresponding to the judgment result as 0, which means that the output signal is not The activation voice conforms to the above preset.

It also includes a method for constructing a self-learning speech recognition system, which is used in a speech recognition system. The speech recognition system includes a microphone and a speech recognition module with a construction device. The microphone and the speech recognition module are connected, as shown in Figure 4, the construction method It includes the following steps:

In the above embodiment, by analyzing whether the signal parameter is a preset activation voice, and waking up through the activation voice, the power module, ADC, and CPU sleep during the standby process to reduce energy consumption during the standby process.

Further, in the above embodiment, in step S2, a neural network is provided, and the neural network determines whether the output signal is a preset activation voice.

Neural networks include:

The first calculation unit 10 is configured to output the envelope value according to the trough, the crest and the interval time;

The second calculation unit 20 is configured to output the number of wave edges composed of adjacent wave troughs and wave crests according to wave troughs and wave crests;

The third calculation unit 30 is configured to output the difference between two adjacent wave troughs according to the wave trough;

The fourth calculation unit 40 is configured to output the difference between two adjacent wave crests according to the wave crests;

The hidden layer includes a plurality of first nodes 50, and each first node 50 is connected to a first calculation unit 10, a second calculation unit 20, a third calculation unit 30, and a fourth calculation unit 40, and each first node 50 is connected to the first calculation unit 10, the second calculation unit 20, the third calculation unit 30, and the fourth calculation unit 40. Set a feature information of an activated voice, the first node 50 receives and determines whether the envelope value, the number of wave edges, the difference between two adjacent wave troughs, and the difference between two adjacent wave crests conform to the corresponding feature information, and Output the judgment result;

The output layer includes a plurality of second nodes 60, each second node 60 is connected to each first node 50, and each second node 60 is set with a corresponding activation voice, and judging whether the output signal conforms to the activation voice according to the judgment result;

Step S2 includes the following steps:

Calculate the difference between two adjacent troughs through troughs; and

Calculate the difference between two adjacent wave crests through wave crests;

Step S22, each first node 50 receives and judges whether the envelope value, the number of wave edges, the difference between two adjacent wave troughs, and the difference between two adjacent wave crests conform to the characteristic information, and output the judgment result;

In step S23, each second node 60 judges whether the output signal conforms to the active voice according to the judgment result, and outputs the judgment result.

The above are only preferred embodiments of the present invention, and do not therefore limit the implementation and protection scope of the present invention. For those skilled in the art, they should be aware of equivalent replacements and equivalents made by using the description and illustrations of the present invention. All solutions obtained by obvious changes should be included in the protection scope of the present invention.

Claims

A device for constructing a self-learning voice recognition system is applied to the voice recognition system, the voice recognition system includes a microphone and a voice recognition module to which the device is applied, the microphone and the voice recognition module are connected, It is characterized in that the construction device includes:

An analysis unit for analyzing the output signal of the microphone to obtain multiple signal parameters;

A recognition unit is connected to the analysis unit, and judges whether the output signal is a preset activation voice according to the signal parameter.
The device for constructing a self-learning speech recognition system according to claim 1, wherein the output signal is a waveform signal.
The device for constructing a self-learning speech recognition system according to claim 1, wherein the analysis unit sequentially saves each type of signal parameter into a corresponding sequence, and stores the signal of each sequence The parameters are output to the identification unit.
The device for constructing a self-learning speech recognition system according to claim 3, wherein the recognition unit is a neural network, and the neural network comprises:

The first calculation unit is configured to output the first output parameter according to the signal parameters of the multiple sequences;

The second calculation unit is configured to output the second output parameter according to the signal parameters of the multiple sequences;

The third calculation unit is configured to output the third output parameter according to the signal parameter of the corresponding sequence;

The fourth calculation unit is configured to output the fourth output parameter according to the signal parameter of the corresponding sequence;

The hidden layer includes a plurality of first nodes, and each of the first nodes is connected to the first calculation unit, the second calculation unit, the third calculation unit, and the fourth calculation unit. The first node sets a feature information of the activated voice, and the first node receives and judges the first output parameter, the second output parameter, the third output parameter, and the fourth output Whether the parameter conforms to the corresponding characteristic information, and output the judgment result;

The output layer includes a plurality of second nodes, each of the second nodes is connected to each of the first nodes, and each of the second nodes is set with a corresponding activation voice, and the output is judged according to the judgment result Whether the signal conforms to the activation voice.
The device for constructing a self-learning speech recognition system according to claim 3, wherein the type of the signal parameter includes a wave trough, a wave crest, and the interval time between the adjacent wave trough and the wave crest.
The device for constructing a self-learning speech recognition system according to claim 5, wherein:

The first output parameter is an envelope value; and/or

The second output parameter is the number of wave edges formed by the adjacent wave troughs and wave crests; and/or

The third output parameter is the difference between two adjacent troughs; and/or

The fourth output parameter is the difference between two adjacent wave crests.
The device for constructing a self-learning speech recognition system according to claim 6, wherein:

The first calculation unit calculates the envelope value through the wave trough, the wave crest, and the interval time; and/or

The second calculation unit calculates the wave trough and the wave crest to obtain the number of the wave edges formed by the adjacent wave trough and the wave crest; and/or

The third calculation unit calculates the difference between the two adjacent wave troughs through the wave trough; and/or

The fourth calculation unit calculates the difference between the two adjacent wave crests through calculation of the wave crests.
A method for constructing a self-learning speech recognition system, applied to the speech recognition system, the speech recognition system including a microphone and a speech recognition module to which the construction device is applied, the microphone and the speech recognition module are connected, It is characterized in that the construction method includes the following steps:

Step S1, analyzing the output signal of the microphone to obtain multiple signal parameters;

Step S2, judging whether the output signal is a preset activation voice according to the signal parameter.
8. The method for constructing a self-learning speech recognition system according to claim 8, wherein in the step S2, a neural network is provided, and the neural network determines whether the output signal is a preset activation voice.
9. The method for constructing a self-learning speech recognition system according to claim 9, wherein the neural network comprises:

The first calculation unit is used to output the envelope value according to the trough, the crest and the interval time;

A second calculation unit, configured to output the number of wave edges formed by the adjacent wave trough and the wave crest according to the wave trough and the wave crest;

A third calculation unit, configured to output the difference between two adjacent wave troughs according to the wave trough;

A fourth calculation unit, configured to output the difference between two adjacent wave crests according to the wave crest;

The hidden layer includes a plurality of first nodes, and each of the first nodes is connected to the first calculation unit, the second calculation unit, the third calculation unit, and the fourth calculation unit. The first node sets a feature information of the activated voice, and the first node receives and judges the envelope value, the number of the wave edges, the difference between two adjacent wave troughs, and the adjacent Whether the difference between the two wave crests accords with the corresponding characteristic information, and output the judgment result;

The output layer includes a plurality of second nodes, each of the second nodes is connected to each of the first nodes, and each of the second nodes is set with a corresponding activation voice, and the output signal is determined according to the determination result Whether it meets the activation voice;

The step S2 includes the following steps:

Step S21, calculating the envelope value through the wave trough, the wave crest, and the interval time; and

Calculating the number of the wave edges formed by the adjacent wave troughs and wave crests by calculating the wave troughs and the wave crests; and

Calculate the difference between two adjacent wave troughs through the wave trough; and

Calculate the difference between two adjacent wave crests through the wave crest;

Step S22, each of the first nodes receives and determines whether the envelope value, the number of the wave edges, the difference between the two adjacent wave troughs, and the difference between the two adjacent wave crests conform to The characteristic information and output the judgment result;

Step S23, each of the second nodes judges whether the output signal conforms to the activation voice according to the judgment result, and outputs the judgment result.