CN110610710A - Construction device and construction method of self-learning voice recognition system - Google Patents


Info

Publication number
CN110610710A
CN110610710A
Authority
CN
China
Prior art keywords
wave
output
speech recognition
recognition system
adjacent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910838612.0A
Other languages
Chinese (zh)
Other versions
CN110610710B (en)
Inventor
樊茂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Amlogic Shanghai Co Ltd
Amlogic Inc
Original Assignee
Amlogic Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Amlogic Shanghai Co Ltd filed Critical Amlogic Shanghai Co Ltd
Priority to CN201910838612.0A priority Critical patent/CN110610710B/en
Publication of CN110610710A publication Critical patent/CN110610710A/en
Priority to PCT/CN2020/109393 priority patent/WO2021042969A1/en
Application granted granted Critical
Publication of CN110610710B publication Critical patent/CN110610710B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G10L17/18: Artificial neural networks; Connectionist approaches
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G10L17/24: Interactive procedures; Man-machine interfaces, the user being prompted to utter a password or a predefined phrase


Abstract

The invention provides a construction device and a construction method for a self-learning voice recognition system. The construction device is applied to a voice recognition system comprising a microphone and a voice recognition module in which the construction device is applied, the microphone being connected with the voice recognition module. The construction device comprises an analysis unit for analyzing the output signal of the microphone to obtain a plurality of signal parameters, and a recognition unit, connected with the analysis unit, that judges according to the signal parameters whether the output signal is a preset activated voice. Waking up with an activated voice allows the power module, the ADC, and the CPU to sleep during standby, reducing standby energy consumption.

Description

Construction device and construction method of self-learning voice recognition system
Technical Field
The invention relates to the technical field of voice recognition, in particular to a device and a method for constructing a self-learning voice recognition system.
Background
With the rapid development of computer application technology, voice recognition technology is applied more and more widely, and the demand for voice recognition keeps growing. In current ultra-high-definition smart televisions and smart speakers, the voice wake-up function must remain available during standby, so the voice recognition system must keep working; that is, the power module, the ADC (Analog-to-Digital Converter), and the CPU (Central Processing Unit) all remain in working mode, consuming a large amount of energy during standby.
Disclosure of Invention
In view of the above problems in the prior art, a device for constructing a self-learning speech recognition system is provided to reduce energy consumption during standby.
The specific technical scheme is as follows:
A construction device for a self-learning speech recognition system is applied to a speech recognition system; the speech recognition system comprises a microphone and a speech recognition module in which the construction device is applied, the microphone being connected with the speech recognition module, wherein the construction device comprises:
the analysis unit is used for analyzing the output signal of the microphone to obtain a plurality of signal parameters;
and the recognition unit is connected with the analysis unit and judges whether the output signal is a preset activated voice or not according to the signal parameter.
Preferably, in the construction device of the self-learning speech recognition system, the output signal is a waveform signal.
Preferably, in the construction device of the self-learning speech recognition system, the analysis unit sequentially stores each type of signal parameter into a corresponding sequence and outputs the signal parameters of each sequence to the recognition unit.
Preferably, in the construction device of the self-learning speech recognition system,
the identification unit is a neural network, and the neural network comprises:
a first calculation unit for outputting a first output parameter according to the signal parameters of the plurality of sequences;
a second calculation unit for outputting a second output parameter according to the signal parameters of the plurality of sequences;
a third calculation unit for outputting a third output parameter according to the signal parameter of the corresponding sequence;
a fourth calculating unit, configured to output a fourth output parameter according to the signal parameter of the corresponding sequence;
the hidden layer comprises a plurality of first nodes, each first node is connected with the first computing unit, the second computing unit, the third computing unit and the fourth computing unit, each first node is provided with one piece of feature information for activating voice, and the first nodes receive and judge whether the first output parameters, the second output parameters, the third output parameters and the fourth output parameters accord with the corresponding feature information or not and output the judgment result;
and the output layer comprises a plurality of second nodes, each second node is connected with each first node, each second node is provided with a corresponding activated voice, and whether the output signal conforms to the activated voice is judged according to the judgment result.
Preferably, in the construction device of the self-learning speech recognition system, the types of the signal parameters include a trough, a peak, and the interval time between adjacent troughs and peaks.
Preferably, the construction device of the self-learning speech recognition system, wherein the first output parameter is an envelope value; and/or
The second output parameter is the number of wave edges formed by adjacent wave troughs and wave crests; and/or
The third output parameter is the difference between two adjacent wave troughs; and/or
The fourth output parameter is the difference between two adjacent peaks.
Preferably, in the construction device of the self-learning speech recognition system,
the first calculating unit calculates to obtain an envelope value through a trough, a crest and interval time; and/or
The second calculating unit calculates the number of wave edges formed by adjacent wave troughs and wave peaks through the wave troughs and the wave peaks; and/or
The third calculating unit calculates through the wave trough to obtain the difference between two adjacent wave troughs; and/or
And the fourth calculating unit calculates the difference between two adjacent wave crests through the wave crests.
The invention also provides a construction method of a self-learning voice recognition system, applied to the voice recognition system; the voice recognition system comprises a microphone and a voice recognition module with the construction device, the microphone being connected with the voice recognition module, wherein the construction method comprises the following steps:
step S1, analyzing the output signal of the microphone to obtain a plurality of signal parameters;
and step S2, judging whether the output signal is a preset activated voice according to the signal parameter.
Preferably, in the construction method of the self-learning speech recognition system,
in step S2, a neural network is provided, and whether the output signal is a preset active voice is determined by the neural network.
Preferably, in the construction method of the self-learning speech recognition system, the neural network comprises:
the first calculating unit is used for outputting an envelope value according to the wave trough, the wave crest and the interval time;
the second calculating unit is used for outputting the number of wave edges formed by adjacent wave troughs and wave peaks according to the wave troughs and the wave peaks;
the third calculating unit is used for outputting the difference between two adjacent wave troughs according to the wave troughs;
the fourth calculating unit is used for outputting the difference between two adjacent wave crests according to the wave crests;
the hidden layer comprises a plurality of first nodes, each first node is connected with the first computing unit, the second computing unit, the third computing unit and the fourth computing unit, each first node is provided with one piece of feature information for activating voice, the first nodes receive and judge whether the envelope value, the number of wave edges, the difference between two adjacent wave troughs and the difference between two adjacent wave crests conform to the corresponding feature information or not, and output the judgment result;
the output layer comprises a plurality of second nodes, each second node is connected with each first node, each second node is provided with corresponding activated voice, and whether the output signal accords with the activated voice is judged according to the judgment result;
step S2 includes the following steps:
step S21, calculating through a trough, a crest and interval time to obtain an envelope value; and
calculating through the wave troughs and the wave peaks to obtain the number of wave edges formed by the adjacent wave troughs and the adjacent wave peaks; and
calculating through the wave troughs to obtain the difference between two adjacent wave troughs; and
calculating through the wave crests to obtain the difference between two adjacent wave crests;
step S22, each first node receives and judges whether the envelope value, the number of wave edges, the difference between two adjacent wave troughs and the difference between two adjacent wave crests accord with the characteristic information or not, and outputs the judgment result;
in step S23, each second node determines whether the output signal corresponds to the activated voice according to the determination result, and outputs the determination result.
The technical scheme has the following advantages or beneficial effects: waking up with an activated voice allows the power module, the ADC, and the CPU to sleep during standby, thereby reducing standby energy consumption.
Drawings
Embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings. The drawings are, however, to be regarded as illustrative and explanatory only and are not restrictive of the scope of the invention.
FIG. 1 is a schematic structural diagram of an embodiment of a device for constructing a self-learning speech recognition system according to the present invention;
FIG. 2 is a schematic structural diagram of a neural network according to an embodiment of the apparatus for constructing the self-learning speech recognition system of the present invention;
FIG. 3 is a flow chart of an embodiment of a method of constructing a self-learning speech recognition system of the present invention;
FIG. 4 is a flowchart illustrating step S2 of an embodiment of a method for constructing a self-learning speech recognition system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
The invention includes a construction device of a self-learning speech recognition system, which is applied to the speech recognition system, the speech recognition system comprises a microphone and a speech recognition module with the construction device, the microphone is connected with the speech recognition module, as shown in figure 1, the construction device comprises:
the analysis unit is used for analyzing the output signal of the microphone to obtain a plurality of signal parameters;
and the recognition unit is connected with the analysis unit and judges whether the output signal is a preset activated voice or not according to the signal parameter.
In the above embodiment, the recognition unit recognizes whether the signal parameters from the analysis unit correspond to a preset activated voice, and the wake-up operation is performed by the activated voice, so that the power module, the ADC, and the CPU sleep during standby and standby energy consumption is reduced.
A preset number of activated voices can be set, wherein the preset number can be 2, 3, or 4; the preset activated voices are obtained according to the first nodes in the hidden layer, so that the number of preset activated voices is kept small and energy consumption is reduced.
Further, in the above-described embodiment, the output signal is a waveform signal. Thus, a plurality of signal parameters can be obtained in the waveform signal, and for example, the types of the signal parameters may include a trough, a peak, and a time interval between adjacent troughs and peaks.
Further, in the above-described embodiment, the analysis unit may sequentially save each type of signal parameter into a corresponding sequence, and output the signal parameter of each sequence to the recognition unit.
For example, the sequence of troughs may be {drop1, drop2, ..., dropn}, where drop denotes a trough;
the sequence of peaks may be {rise1, rise2, ..., risen}, where rise denotes a peak;
the sequence of interval times may be {T1, T2, ..., Tn}, where T denotes the interval time.
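As an illustrative sketch (not part of the patent text), the analysis unit's extraction of these three sequences from a sampled waveform could look like the following, assuming simple local-extremum detection and interval times measured in samples:

```python
from typing import List, Tuple

def extract_signal_parameters(samples: List[float]) -> Tuple[List[float], List[float], List[int]]:
    """Scan a sampled waveform and collect troughs (local minima), peaks
    (local maxima), and the interval times between adjacent extrema,
    each stored in its own sequence as the analysis unit does."""
    extrema = []  # (index, value, kind) of each local extremum, in order
    for i in range(1, len(samples) - 1):
        prev, cur, nxt = samples[i - 1], samples[i], samples[i + 1]
        if cur < prev and cur < nxt:
            extrema.append((i, cur, "drop"))   # trough
        elif cur > prev and cur > nxt:
            extrema.append((i, cur, "rise"))   # peak

    drops = [v for _, v, kind in extrema if kind == "drop"]
    rises = [v for _, v, kind in extrema if kind == "rise"]
    # Interval time (in samples) between each pair of adjacent extrema.
    intervals = [extrema[k + 1][0] - extrema[k][0] for k in range(len(extrema) - 1)]
    return drops, rises, intervals
```

The three returned lists correspond to the {drop}, {rise}, and {T} sequences above.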
Further, in the above embodiment, the identification unit may be a neural network, as shown in fig. 2, the neural network includes:
a first calculation unit 10 for outputting a first output parameter based on the signal parameters of the plurality of sequences;
a second calculation unit 20 for outputting a second output parameter based on the signal parameters of the plurality of sequences;
a third calculating unit 30 for outputting a third output parameter according to the signal parameter of the corresponding sequence;
a fourth calculating unit 40 for outputting a fourth output parameter according to the signal parameter of the corresponding sequence;
the hidden layer comprises a plurality of first nodes 50, each first node 50 is connected with the first computing unit 10, the second computing unit 20, the third computing unit 30 and the fourth computing unit 40, each first node 50 is provided with one piece of feature information for activating voice, and the first nodes 50 receive and judge whether the first output parameters, the second output parameters, the third output parameters and the fourth output parameters accord with the corresponding feature information or not and output the judgment result;
and the output layer comprises a plurality of second nodes 60, each second node 60 is connected with each first node 50, each second node 60 is provided with a corresponding activated voice, and whether the output signal accords with the activated voice is judged according to the judgment result.
The number of the hidden layers can be set according to the requirements of users.
In the neural network described above, each node may be a filter.
Further, as a preferred embodiment, the first output parameter is an envelope value;
the second output parameter is the number of wave edges formed by adjacent wave troughs and wave crests;
the third output parameter is the difference between two adjacent wave troughs;
the fourth output parameter is the difference between two adjacent peaks. The first calculating unit 10 calculates the trough, the peak and the interval time to obtain an envelope value;
the second calculating unit 20 calculates the number of wave edges formed by adjacent wave troughs and wave peaks through the wave troughs and the wave peaks;
the third calculating unit 30 calculates the difference between two adjacent wave troughs through the wave trough;
the fourth calculating unit 40 calculates the difference between two adjacent peaks by using the peaks.
Wherein, it should be noted that, the difference between the two adjacent troughs is the difference obtained by subtracting the next trough from the previous trough in the trough sequence; the difference between the two adjacent peaks is the difference of the former peak minus the latter peak in the peak sequence.
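The four calculation units can be sketched as follows. The patent gives no concrete formulas for the envelope value or the wave-edge count, so the mean peak-to-trough swing per unit interval time and the alternating-extrema edge count used here are assumptions for illustration only; the trough and peak differences follow the previous-minus-next rule stated above.

```python
def compute_output_parameters(drops, rises, intervals):
    """Sketch of the four calculation units operating on the trough,
    peak, and interval-time sequences."""
    # First output parameter: envelope value. Hypothetical formula:
    # total peak-to-trough swing divided by the total interval time.
    swings = [r - d for r, d in zip(rises, drops)]
    envelope = sum(swings) / max(sum(intervals), 1)

    # Second: number of wave edges formed by adjacent troughs and
    # peaks, assuming the extrema strictly alternate.
    edge_count = max(len(drops) + len(rises) - 1, 0)

    # Third: difference between adjacent troughs, previous minus next.
    drop_diffs = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]
    # Fourth: difference between adjacent peaks, previous minus next.
    rise_diffs = [rises[i] - rises[i + 1] for i in range(len(rises) - 1)]
    return envelope, edge_count, drop_diffs, rise_diffs
```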
Further, the neural network may be trained with a plurality of preset activated voices. The signal parameters of the output signal corresponding to a preset activated voice are input into the neural network: the first calculating unit 10 calculates the envelope value from the troughs, peaks, and interval times; the second calculating unit 20 calculates the number of wave edges formed by adjacent troughs and peaks; the third calculating unit 30 calculates the difference between two adjacent troughs; and the fourth calculating unit 40 calculates the difference between two adjacent peaks. Each first node 50 in the hidden layer receives these values, judges whether the envelope value, the number of wave edges, the difference between two adjacent troughs, and the difference between two adjacent peaks conform to its feature information, and outputs the judgment result; each second node 60 in the output layer then judges from these results whether the output signal conforms to its activated voice. When the output signal is judged to be the corresponding activated voice, the judgment result is output, and the signal parameters of the output signal corresponding to the preset activated voice are input repeatedly for training. When the output signal is not judged to be the corresponding activated voice, the weights of the first nodes 50 corresponding to the judgment result are adjusted and the signal parameters of the output signal are input again for training; once the output layer judges the output signal to be the corresponding activated voice, the signal parameters of the output signals corresponding to the other preset activated voices are input for training, so that the activated voice corresponding to an output signal can be predicted.
The judgment result may be represented by a logical value. For example, suppose the logical value of a preset activated voice in the corresponding second node 60 is 1010101010. The signal parameters of the output signal corresponding to the preset activated voice are input into the neural network, and each first node 50 in the hidden layer receives the output parameters and judges whether they conform to its feature information: when they conform, the node outputs a logical value of 1; when they do not, it outputs 0. The second node 60 of the output layer then judges from the received judgment results whether the output signal conforms to the preset activated voice: when the combined result equals the logical value 1010101010, the second node 60 outputs 1, indicating that the output signal conforms to the preset activated voice; when it does not, the second node 60 outputs 0, indicating that the output signal does not conform to the preset activated voice.
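The logical-value matching described above can be sketched as follows. This is illustrative only; the predicates standing in for the first nodes' feature information are hypothetical placeholders:

```python
def hidden_layer_judgements(params, feature_checks):
    """Each first node holds one piece of feature information, modeled
    here as a predicate; it outputs 1 when the received output
    parameters conform to it and 0 otherwise."""
    return "".join("1" if check(params) else "0" for check in feature_checks)

def second_node_output(judgements, expected="1010101010"):
    """A second node compares the concatenated judgment results with
    the logical value stored for its activated voice and outputs 1 on
    an exact match, 0 otherwise."""
    return 1 if judgements == expected else 0
```

With ten nodes whose judgments spell 1010101010, the second node outputs 1; any other pattern yields 0.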
The method for constructing the self-learning voice recognition system is applied to the voice recognition system; the voice recognition system comprises a microphone and a voice recognition module with the construction device, the microphone being connected with the voice recognition module. As shown in fig. 3, the construction method comprises the following steps:
step S1, analyzing the output signal of the microphone to obtain a plurality of signal parameters;
and step S2, judging whether the output signal is a preset activated voice according to the signal parameter.
In the above embodiment, whether the signal parameter is the preset activated voice or not is analyzed, and the wake-up operation is performed by the activated voice, so that the power module, the ADC and the CPU are put to sleep in the standby process, and the energy consumption in the standby process is reduced.
Further, in the above embodiment, in step S2, a neural network is provided, and whether the output signal is a preset active voice is determined by the neural network.
The neural network includes:
a first calculating unit 10 for outputting an envelope value according to a valley, a peak and an interval time;
the second calculating unit 20 is configured to output the number of wave edges formed by adjacent wave troughs and wave peaks according to the wave troughs and wave peaks;
a third calculating unit 30 for outputting the difference between two adjacent troughs according to the trough;
a fourth calculating unit 40, configured to output a difference between two adjacent peaks according to the peak;
the hidden layer comprises a plurality of first nodes 50, each first node 50 is connected with the first calculating unit 10, the second calculating unit 20, the third calculating unit 30 and the fourth calculating unit 40, each first node 50 is provided with one piece of feature information for activating voice, and the first nodes 50 receive and judge whether the envelope value, the number of wave edges, the difference between two adjacent wave troughs and the difference between two adjacent wave crests accord with the corresponding feature information or not and output the judgment result;
an output layer including a plurality of second nodes 60, each second node 60 being connected to each first node 50, each second node 60 setting a corresponding activated voice, and determining whether an output signal corresponds to the activated voice according to a determination result;
step S2 includes the following steps:
step S21, calculating through a trough, a crest and interval time to obtain an envelope value; and
calculating through the wave troughs and the wave peaks to obtain the number of wave edges formed by the adjacent wave troughs and the adjacent wave peaks; and
calculating through the wave troughs to obtain the difference between two adjacent wave troughs; and
calculating through the wave crests to obtain the difference between two adjacent wave crests;
step S22, each first node 50 receives and determines whether the envelope value, the number of wave edges, the difference between two adjacent wave troughs and the difference between two adjacent wave crests match the characteristic information, and outputs the determination result;
in step S23, each second node 60 determines whether the output signal corresponds to the activated voice according to the determination result, and outputs the determination result.
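The self-learning loop described in the embodiment, adjusting first-node weights until the output layer accepts the training signal, can be sketched as a perceptron-style update. The threshold units and the learning rule below are assumptions, since the patent does not fix either:

```python
def node_fires(weights, params, threshold=0.0):
    """A first node modeled as a threshold unit over the output parameters."""
    return 1 if sum(w * p for w, p in zip(weights, params)) > threshold else 0

def train_activated_voice(nodes, params, target_bits, lr=0.1, epochs=200):
    """nodes: one weight vector per first node; target_bits: the logical
    value the second node expects for this activated voice. On each pass,
    any node whose judgment disagrees with its target bit has its weights
    nudged (perceptron-style, an assumed rule); training stops once the
    output layer recognizes the signal."""
    for _ in range(epochs):
        outputs = [node_fires(w, params) for w in nodes]
        if "".join(map(str, outputs)) == target_bits:
            return True  # output layer judges: activated voice recognized
        for w, out, want in zip(nodes, outputs, target_bits):
            err = int(want) - out
            for i, p in enumerate(params):
                w[i] += lr * err * p
    return False
```

After training succeeds, feeding the same signal parameters through the nodes reproduces the target logical value, which is what lets the trained system later predict the activated voice for a new output signal.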
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims (10)

1. A constructing apparatus of a self-learning speech recognition system, applied to the speech recognition system, the speech recognition system comprising a microphone and a speech recognition module applied with the constructing apparatus, the microphone and the speech recognition module being connected, wherein the constructing apparatus comprises:
the analysis unit is used for analyzing the output signal of the microphone to obtain a plurality of signal parameters;
and the recognition unit is connected with the analysis unit and judges whether the output signal is a preset activated voice or not according to the signal parameter.
2. The apparatus for constructing a self-learning speech recognition system of claim 1, wherein the output signal is a waveform signal.
3. The apparatus for constructing a self-learning speech recognition system according to claim 1, wherein the analysis unit sequentially stores the signal parameters of each type into a corresponding sequence, and outputs the signal parameters of each of the sequences to the recognition unit.
4. The self-learning speech recognition system construction apparatus of claim 3,
the identification unit is a neural network, and the neural network comprises:
a first calculation unit for outputting a first output parameter according to the signal parameters of the plurality of sequences;
a second calculation unit for outputting a second output parameter according to the signal parameters of the plurality of sequences;
a third calculation unit for outputting a third output parameter according to the signal parameter of the corresponding sequence;
a fourth calculating unit, configured to output a fourth output parameter according to the signal parameter of the corresponding sequence;
a hidden layer, including a plurality of first nodes, where each first node is connected to the first computing unit, the second computing unit, the third computing unit, and the fourth computing unit, and each first node sets a piece of feature information of the activated voice, receives and determines whether the first output parameter, the second output parameter, the third output parameter, and the fourth output parameter conform to the corresponding feature information, and outputs a determination result;
and the output layer comprises a plurality of second nodes, each second node is connected with each first node, each second node is provided with a corresponding activated voice, and whether the output signal accords with the activated voice is judged according to the judgment result.
5. The apparatus for constructing a self-learning speech recognition system according to claim 3, wherein the types of the signal parameters include a trough, a peak, and a time interval between the adjacent trough and the peak.
6. The self-learning speech recognition system construction apparatus of claim 5,
the first output parameter is an envelope value; and/or
The second output parameter is the number of wave edges formed by the adjacent wave troughs and the adjacent wave crests; and/or
The third output parameter is the difference between two adjacent wave troughs; and/or
The fourth output parameter is the difference between two adjacent peaks.
7. The self-learning speech recognition system construction apparatus of claim 6,
the first calculating unit calculates the envelope value through the trough, the peak and the interval time; and/or
The second calculating unit calculates the number of the wave edges formed by the adjacent wave troughs and the adjacent wave peaks through the wave troughs and the wave peaks; and/or
The third calculating unit calculates the difference between two adjacent wave troughs through the wave troughs; and/or
The fourth calculating unit calculates the difference between two adjacent wave crests through the wave crests.
8. A construction method of a self-learning speech recognition system, which is applied to the speech recognition system, wherein the speech recognition system comprises a microphone and a speech recognition module applied with the construction device, and the microphone is connected with the speech recognition module, and the construction method comprises the following steps:
step S1, analyzing the output signal of the microphone to obtain a plurality of signal parameters;
and step S2, judging whether the output signal is a preset activated voice according to the signal parameter.
9. The method for constructing a self-learning speech recognition system of claim 8, wherein in step S2, a neural network is provided, and whether the output signal is a preset activated voice is determined by the neural network.
10. The method for constructing a self-learning speech recognition system of claim 9, wherein the neural network comprises:
a first calculating unit, configured to output an envelope value according to the troughs, the peaks and the interval time;
a second calculating unit, configured to output, according to the troughs and the peaks, the number of wave edges formed by adjacent troughs and peaks;
a third calculating unit, configured to output the difference between two adjacent troughs according to the troughs;
a fourth calculating unit, configured to output the difference between two adjacent peaks according to the peaks;
a hidden layer comprising a plurality of first nodes, wherein each first node is connected to the first, second, third and fourth calculating units and is provided with a piece of feature information of an activated voice; each first node receives the envelope value, the number of wave edges, the difference between two adjacent troughs and the difference between two adjacent peaks, determines whether they match the corresponding feature information, and outputs a determination result; and
an output layer comprising a plurality of second nodes, wherein each second node is connected to each first node, is provided with a corresponding activated voice, and determines, according to the determination results, whether the output signal matches that activated voice;
step S2 comprises the following steps:
step S21, calculating the envelope value from the troughs, the peaks and the interval time; and
calculating, from the troughs and the peaks, the number of wave edges formed by adjacent troughs and peaks; and
calculating, from the troughs, the difference between two adjacent troughs; and
calculating, from the peaks, the difference between two adjacent peaks;
step S22, each first node receiving the envelope value, the number of wave edges, the difference between two adjacent troughs and the difference between two adjacent peaks, determining whether they match its feature information, and outputting a determination result;
step S23, each second node determining, according to the determination results, whether the output signal matches its activated voice, and outputting the result.
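Steps S22 and S23 describe a two-layer match-and-vote structure. The sketch below assumes binary matching of the extracted parameter vector against each first node's stored feature information within a hypothetical tolerance, and assumes a second node fires only when all of its associated first nodes report a match; the claims do not specify either comparison rule:

```python
def hidden_layer(features, node_templates, tol=0.5):
    """Step S22 (sketch): each first node compares the extracted feature
    vector against its stored feature information and outputs 1 (match)
    or 0 (no match). `tol` is a hypothetical matching tolerance."""
    results = []
    for template in node_templates:
        matched = all(abs(f - t) <= tol for f, t in zip(features, template))
        results.append(1 if matched else 0)
    return results

def output_layer(results, required):
    """Step S23 (sketch): a second node declares its activated voice
    recognized when all first nodes it depends on report a match
    (assumed decision rule)."""
    return all(results[i] == 1 for i in required)
```

For example, a feature vector close to the first node's template but far from the second's yields the determination results `[1, 0]`, so only a second node that depends solely on the first node would report its activated voice as recognized.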
CN201910838612.0A 2019-09-05 2019-09-05 Construction device and construction method of self-learning voice recognition system Active CN110610710B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910838612.0A CN110610710B (en) 2019-09-05 2019-09-05 Construction device and construction method of self-learning voice recognition system
PCT/CN2020/109393 WO2021042969A1 (en) 2019-09-05 2020-08-14 Construction apparatus and construction method for self-learning speech recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910838612.0A CN110610710B (en) 2019-09-05 2019-09-05 Construction device and construction method of self-learning voice recognition system

Publications (2)

Publication Number Publication Date
CN110610710A true CN110610710A (en) 2019-12-24
CN110610710B CN110610710B (en) 2022-04-01

Family

ID=68892341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910838612.0A Active CN110610710B (en) 2019-09-05 2019-09-05 Construction device and construction method of self-learning voice recognition system

Country Status (2)

Country Link
CN (1) CN110610710B (en)
WO (1) WO2021042969A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021042969A1 (en) * 2019-09-05 2021-03-11 晶晨半导体(上海)股份有限公司 Construction apparatus and construction method for self-learning speech recognition system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150301796A1 (en) * 2014-04-17 2015-10-22 Qualcomm Incorporated Speaker verification
CN105632486A (en) * 2015-12-23 2016-06-01 北京奇虎科技有限公司 Voice wake-up method and device of intelligent hardware
CN105741838A (en) * 2016-01-20 2016-07-06 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device
CN107102713A (en) * 2016-02-19 2017-08-29 北京君正集成电路股份有限公司 It is a kind of to reduce the method and device of power consumption
US20180254041A1 (en) * 2016-04-11 2018-09-06 Sonde Health, Inc. System and method for activation of voice interactive services based on user state
CN108538305A (en) * 2018-04-20 2018-09-14 百度在线网络技术(北京)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN108922553A (en) * 2018-07-19 2018-11-30 苏州思必驰信息科技有限公司 Wave arrival direction estimating method and system for sound-box device
CN109166571A (en) * 2018-08-06 2019-01-08 广东美的厨房电器制造有限公司 Wake-up word training method, device and the household appliance of household appliance
CN109360585A (en) * 2018-12-19 2019-02-19 晶晨半导体(上海)股份有限公司 A kind of voice-activation detecting method
US20190214002A1 (en) * 2018-01-09 2019-07-11 Lg Electronics Inc. Electronic device and method of controlling the same

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11189273B2 (en) * 2017-06-29 2021-11-30 Amazon Technologies, Inc. Hands free always on near field wakeword solution
CN110610710B (en) * 2019-09-05 2022-04-01 晶晨半导体(上海)股份有限公司 Construction device and construction method of self-learning voice recognition system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JWU-SHENG HU, ET AL.: "Wake-up-word detection by estimating formants from spatial eigenspace information", 2012 IEEE International Conference on Mechatronics and Automation *
李燕诚 et al.: "Voice activity detection algorithm based on likelihood ratio test", Computer Engineering *


Also Published As

Publication number Publication date
CN110610710B (en) 2022-04-01
WO2021042969A1 (en) 2021-03-11

Similar Documents

Publication Publication Date Title
CN111223497B (en) Nearby wake-up method and device for terminal, computing equipment and storage medium
CN110473536B (en) Awakening method and device and intelligent device
CN111880856B (en) Voice wakeup method and device, electronic equipment and storage medium
CN106448663A (en) Voice wakeup method and voice interaction device
CN107767863A Voice wake-up method, system and intelligent terminal
CN105741838A (en) Voice wakeup method and voice wakeup device
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
CN110767231A (en) Voice control equipment awakening word identification method and device based on time delay neural network
US11295761B2 (en) Method for constructing voice detection model and voice endpoint detection system
WO2021098153A1 (en) Method, system, and electronic apparatus for detecting change of target user, and storage medium
CN111508493B (en) Voice wake-up method and device, electronic equipment and storage medium
CN104282307A (en) Method, device and terminal for awakening voice control system
CN110610710B (en) Construction device and construction method of self-learning voice recognition system
CN111312222A (en) Awakening and voice recognition model training method and device
CN104751227A (en) Method and system for constructing deep neural network
CN111429901A (en) IoT chip-oriented multi-stage voice intelligent awakening method and system
CN111179944B (en) Voice awakening and age detection method and device and computer readable storage medium
CN106612367A (en) Speech wake method based on microphone and mobile terminal
CN113782009A (en) Voice awakening system based on Savitzky-Golay filter smoothing method
CN111223489B (en) Specific keyword identification method and system based on Attention mechanism
WO2023010861A1 (en) Wake-up method, apparatus, device, and computer storage medium
CN111091819A (en) Voice recognition device and method, voice interaction system and method
CN109961804B (en) Intelligent equipment satisfaction evaluation method and device and storage medium
CN111192588A (en) System awakening method and device
CN204856459U Keyword voice wake-up system capable of distinguishing sound source position, and mobile terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant