CN112331204A - Intelligent voice recognition method, device and storage medium - Google Patents



Publication number
CN112331204A
CN112331204A (application CN202011327097.9A)
Authority
CN
China
Prior art keywords
audio, audio signal, signal, external, sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011327097.9A
Other languages
Chinese (zh)
Other versions
CN112331204B (en)
Inventor
彭泽令
梁志强
匡勇建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Jieli Technology Co Ltd
Original Assignee
Zhuhai Jieli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Jieli Technology Co Ltd
Priority to CN202011327097.9A
Publication of CN112331204A
Application granted
Publication of CN112331204B
Legal status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0272 — Voice signal separating
    • G10L2015/223 — Execution procedure of a spoken command
    • G10L2021/02082 — Noise filtering, the noise being echo or reverberation of the speech

Abstract

The application relates to an intelligent voice recognition method, device, apparatus, and storage medium. The intelligent voice recognition method comprises the following steps: controlling a data transmission unit to transmit an audio signal to be played to external sound equipment for playing, the external sound equipment generating a sound audio signal from the audio signal to be played; acquiring an external audio signal, where the external audio signal comprises the sound audio signal and a voice instruction signal collected by an audio collecting unit; locating the sound audio signal in the external audio signal and then separating it to obtain the voice instruction signal; and performing voice recognition on the voice instruction signal. This intelligent voice recognition device eliminates the influence of time delay on echo cancellation accuracy, improves the accuracy of voice recognition, and also improves adaptability to external sound equipment.

Description

Intelligent voice recognition method, device and storage medium
Technical Field
The present application relates to the field of intelligent device technologies, and in particular, to an intelligent speech recognition method, an intelligent speech recognition device, and a storage medium.
Background
With the development of intelligent device technology, intelligent voice recognition products have emerged, such as smart speakers whose voice recognition algorithms require echo cancellation processing. In this process, the audio data played by the speaker must be re-sampled (looped back), and the re-sampled data is used as a reference signal for the echo cancellation processing.
In the prior art, the audio loopback path of a smart speaker mainly converts the played audio into a digital signal through a sampling circuit and an ADC, and then feeds the digital signal to the main chip via I2S; a sampling circuit and an ADC conversion module are therefore required. However, when an intelligent voice recognition function needs to be added to an existing non-intelligent product, that product has no sampling circuit or ADC conversion module, so the intelligent voice function cannot be realized on such non-intelligent voice products.
For a non-intelligent speaker, adding an intelligent voice recognition function requires an external intelligent voice recognition device. However, the intelligent voice recognition device and the speaker are physically separate, and the relative position between them is not fixed, so in actual use, factors such as distance, environmental interference, or wireless interference introduce random delay into the transmission of audio signals. The adaptive algorithm used in echo cancellation places a high requirement on the stability of this delay: once the delay is too large or too random, the adaptive algorithm's computation becomes huge and convergence becomes difficult, which degrades the final speech recognition accuracy.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an intelligent speech recognition method, apparatus, device, and storage medium that can reduce the delay and its randomness.
An intelligent speech recognition method, the method comprising:
controlling the data transmission unit to transmit an audio signal to be played to external sound equipment for playing; the external sound equipment is used for generating sound audio signals according to the audio signals to be played;
acquiring an external audio signal; the external audio signals comprise sound audio signals and voice instruction signals collected by the audio collecting unit;
positioning a sound audio signal in the external audio signal and then separating to obtain a voice instruction signal;
and carrying out voice recognition on the voice instruction signal.
In one embodiment, the step of locating an acoustic audio signal in the external audio signal and then separating the acoustic audio signal to obtain a voice instruction signal includes:
extracting a plurality of delay periods of a maximum delay interval from the external audio signal;
segmenting each delay period according to the length of a preset first fixed frame, and calculating binary position information of each delay period;
and locating the initial position of the sound audio signal in the external audio signal according to the bit error rate between the binary position information of each delay period and the binary position information of the audio signal to be played, compared against a preset bit error rate threshold, and extracting the sound audio signal.
In one embodiment, the method further comprises:
dividing original audio data into a plurality of audio segments according to a preset second fixed frame length;
embedding a plurality of pieces of check information one by one in sequence into the high-frequency component coefficients of the audio segments by using a discrete wavelet transform algorithm to obtain the audio signal to be played; the number of pieces of check information is the same as the number of audio segments.
In one embodiment, the step of locating an acoustic audio signal in the external audio signal and then separating the acoustic audio signal to obtain a voice instruction signal includes:
extracting check information in the external audio signal by adopting a discrete wavelet inverse transformation algorithm;
positioning the position of the sound audio signal in the external audio signal according to the verification information;
extracting a target signal in the external audio signal; the target signal is a signal segment corresponding to the sound audio signal;
and carrying out audio compensation on the target signal according to a preset audio amplitude compensation value to obtain the audio signal of the sound equipment.
In one embodiment, the step of obtaining the audio amplitude compensation value comprises:
acquiring a loopback signal; the loopback signal is obtained by re-sampling the audio signal to be played transmitted by the data transmission unit;
extracting a sound audio signal in the external audio signal;
calculating an audio amplitude compensation value at the current volume according to the loopback signal and the sound audio signal;
and updating and storing the audio amplitude compensation value of the current volume.
In one embodiment, the step of obtaining the audio amplitude compensation value further comprises:
sending a volume level adjusting instruction to external sound equipment; the volume level adjusting instruction is used for indicating the external sound equipment to adjust the volume level;
and calculating and updating the audio amplitude compensation values corresponding to the stored different volume levels until all the volume levels are traversed.
An intelligent voice recognition device applying the above intelligent voice recognition method comprises:
the data transmission unit is used for transmitting the audio signal to be played to external sound equipment; the external sound equipment is used for generating sound audio signals according to the audio signals to be played;
the audio acquisition unit is used for acquiring external audio signals; the external audio signal comprises the sound audio signal and a voice instruction signal;
the main control unit is used for controlling the data transmission unit to transmit an audio signal to be played to the external sound equipment for playing, acquiring the external audio signal fed back by the audio acquisition unit, separating the sound audio signal in the external audio signal to obtain a voice instruction signal, and performing voice recognition on the voice instruction signal.
An intelligent speech recognition device comprising:
the data transmission control module is used for controlling the data transmission unit to transmit the audio signal to be played to the external sound equipment for playing; the external sound equipment is used for generating sound audio signals according to the audio signals to be played;
the external audio signal acquisition module is used for acquiring an external audio signal; the external audio signals comprise sound audio signals and voice instruction signals collected by the audio collecting unit;
the signal separation module is used for separating and obtaining a voice instruction signal after positioning a sound audio signal in the external audio signal;
and the voice recognition module is used for carrying out voice recognition on the voice instruction signal.
An intelligent speech recognition device comprising a memory storing a computer program and a processor implementing the steps of the method described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
According to the intelligent voice recognition method, device, and storage medium, the data transmission unit is controlled to transmit the audio signal to be played to the external sound equipment for playing; the audio acquisition unit acquires the external audio signal; and the sound audio signal is located in the external audio signal and then separated from the voice instruction signal. This eliminates the influence of time delay on echo cancellation accuracy, improves the accuracy of voice recognition, and also improves adaptability to external sound equipment.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the conventional technologies of the present application, the drawings used in the description of the embodiments or the conventional technologies are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an intelligent speech recognition device according to an embodiment;
FIG. 2 is a schematic diagram of an embodiment of an intelligent speech recognition device and an external audio device;
FIG. 3 is a schematic diagram of another embodiment of an intelligent speech recognition device and an external audio device;
FIG. 4 is a flow diagram illustrating an intelligent speech recognition method, according to one embodiment;
FIG. 5 is a flowchart illustrating the steps of separating the audio command signal after locating the audio signal in the external audio signal according to an embodiment;
FIG. 6 is a flow chart illustrating an intelligent speech recognition method according to another embodiment;
FIG. 7 is a schematic flowchart illustrating the steps of separating the audio command signal after locating the audio signal in the external audio signal according to another embodiment;
FIG. 8 is a flowchart illustrating the step of obtaining an audio amplitude compensation value according to one embodiment;
fig. 9 is a block diagram of the control device of the intelligent speech recognition device in one embodiment.
Detailed Description
To facilitate an understanding of the present application, the present application will now be described more fully with reference to the accompanying drawings. Embodiments of the present application are set forth in the accompanying drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
It will be understood that the terms "first," "second," and the like as used herein may be used herein to describe various signals, but these signals are not limited by these terms. These terms are only used to distinguish between different signals.
It will be understood that when an element is referred to as being "connected" to another element, it can be directly connected to the other element or be connected to the other element through intervening elements. Further, "connection" in the following embodiments is understood to mean "electrical connection", "communication connection", or the like, if there is a transfer of electrical signals or data between the connected objects.
As used herein, the singular forms "a", "an" and "the" may include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises/comprising," "includes" or "including," etc., specify the presence of stated features, integers, steps, operations, components, parts, or combinations thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, components, parts, or combinations thereof.
In one embodiment, as shown in fig. 1, there is provided an intelligent speech recognition device 100 comprising:
a data transmission unit 101 for transmitting an audio signal to be played to the external sound device 200; the external sound equipment 200 is used for generating sound audio signals according to the audio signals to be played;
an audio acquisition unit 102, configured to acquire an external audio signal; the external audio signal comprises a sound audio signal and a voice instruction signal;
the main control unit 103 is configured to control the data transmission unit 101 to transmit an audio signal to be played to the external sound device 200 for playing, and is further configured to obtain the external audio signal fed back by the audio acquisition unit 102, separate the sound audio signal from the external audio signal to obtain a voice instruction signal, and perform voice recognition on the voice instruction signal.
In one embodiment, as shown in fig. 2, the data transmission unit 101 includes:
the DAC module 1011 is configured to transmit the original audio signal or the modulated signal to the power amplifier unit 201 of the external audio device 200.
A Digital to analog converter (DAC) is a device that converts a Digital signal into an analog signal (in the form of current, voltage, or charge). In many digital systems (e.g., computers), signals are stored and transmitted digitally, and digital-to-analog converters can convert such signals to analog signals so that they can be recognized by the outside world (human or other non-digital system). For the external audio device 200 without the wireless communication function, the DAC module 1011 may convert the signal into an analog signal, transmit the analog signal to the power amplifier 201 of the external audio device 200 for amplification, and play the audio through the audio output unit 202.
In one embodiment, as shown in fig. 3, the data transmission unit 101 includes:
a communication unit 1012 for transmitting the original audio signal or the modulated signal to the communication module 203 of the external acoustic device 200.
For an external audio device 200 with a wireless communication function (e.g., a Bluetooth speaker), because its CPU unit 204 has digital-to-analog conversion capability, the digital signal can be transmitted directly through the communication unit 1012 to the communication module 203 of the external audio device 200. After the external audio device 200 receives the original audio signal or the modulated signal in digital form, the CPU unit 204 converts it into an analog signal and sends it to the audio output unit 202 to play the audio.
In one embodiment, as shown in fig. 4, an intelligent speech recognition method is provided, which can be applied to the above intelligent speech recognition device, and is described by taking the application to a main control unit as an example, the method includes:
step S100, controlling a data transmission unit to transmit an audio signal to be played to external sound equipment for playing; the external sound equipment is used for generating sound audio signals according to the audio signals to be played;
step S200, obtaining an external audio signal; the external audio signals comprise sound audio signals and voice instruction signals collected by the audio collecting unit;
s300, positioning a sound audio signal in the external audio signal and separating to obtain a voice instruction signal;
step S400, voice recognition is carried out on the voice command signal.
The audio signal to be played is an audio signal that needs to be sent to the external sound device 200 for playing; the external sound device 200 plays it, thereby generating a sound audio signal. In one embodiment, the external audio signal comprises the audio signal played by the external sound device 200; in another embodiment, the external audio signal also includes other sound signals within the current capture range, such as a voice command signal. Because factors such as distance, environmental interference, or wireless interference introduce random delay into audio transmission, the sound audio signal in the external audio signal is first located before voice recognition: once its starting position is determined, the offset training can be carried out more accurately, and the voice command signal can then be separated out and subjected to voice recognition.
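The control flow of steps S100 to S400 can be sketched roughly as follows. This is a minimal illustration only: all function names and the simple sample-wise subtraction used for separation are hypothetical placeholders, not the patent's actual implementation, which locates the speaker signal via binary position information or embedded check information before cancelling it.

```python
# Hypothetical sketch of the S100-S400 pipeline described above.
# All helper names (transmit, capture, locate_speaker_signal,
# recognize) are illustrative placeholders, not terms from the patent.

def intelligent_speech_recognition(audio_to_play, transmit, capture,
                                   locate_speaker_signal, recognize):
    # S100: send the audio to the external speaker for playback
    transmit(audio_to_play)
    # S200: capture the external audio (speaker playback + voice command)
    external = capture()
    # S300: locate the speaker's signal inside the capture, then
    # subtract it to isolate the voice-command residue
    start = locate_speaker_signal(external, audio_to_play)
    n = len(audio_to_play)
    command = [e - p for e, p in
               zip(external[start:start + n], audio_to_play)]
    # S400: run speech recognition on the separated command signal
    return recognize(command)
```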
According to the intelligent voice recognition method, the data transmission unit is controlled to transmit the audio signal to be played to the external sound equipment for playing; the audio acquisition unit acquires the external audio signal; and the voice instruction signal is separated out after the sound audio signal is located in the external audio signal. This eliminates the influence of time delay on echo cancellation accuracy and improves the method's adaptability to different sound equipment.
In one embodiment, as shown in fig. 5, the step of locating the acoustic audio signal in the external audio signal and then separating to obtain the voice command signal includes:
step S310, extracting a plurality of delay periods of the maximum delay interval from the external audio signal;
step S320, segmenting each delay period according to a preset first fixed frame length, and calculating binary position information of each delay period;
and step S330, locating the initial position of the sound audio signal in the external audio signal according to the bit error rate between the binary position information of each delay period and the binary position information of the audio signal to be played (compared against a preset threshold), and extracting the sound audio signal.
An audio signal can generally be divided into voiced segments, unvoiced segments, and silent segments. A voiced segment corresponds to sound produced by vocal cord vibration and is characterized by large short-time energy, large short-time average amplitude, and a low short-time zero-crossing rate. A silent segment is mainly background noise, with the lowest average energy and a low zero-crossing rate. An unvoiced segment corresponds to fricative, impact, or plosive sounds produced by air in the oral cavity; its average energy lies between that of the voiced and silent segments, and its zero-crossing rate is high. The energy of an audio signal is concentrated mainly in the voiced segments, and playing the re-sampled audio changes the voiced segments only to a limited extent, since otherwise the content of the audio signal would be lost. Based on this principle, the target audio can be located according to the short-time energy of the audio signal:
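The voiced/unvoiced/silent distinction described above can be illustrated with short-time energy and zero-crossing rate. This is a sketch under assumed thresholds; the threshold values are arbitrary examples, not taken from the patent.

```python
# Illustrative classifier for the voiced / unvoiced / silent
# distinction, using short-time energy and zero-crossing rate.
# The thresholds are arbitrary example values, not from the patent.

def short_time_energy(frame):
    return sum(x * x for x in frame)

def zero_crossing_rate(frame):
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / max(len(frame) - 1, 1)

def classify_frame(frame, energy_hi=1.0, energy_lo=0.1, zcr_hi=0.3):
    e, z = short_time_energy(frame), zero_crossing_rate(frame)
    if e >= energy_hi and z < zcr_hi:
        return "voiced"      # high energy, low zero-crossing rate
    if e <= energy_lo:
        return "silence"     # lowest energy
    return "unvoiced"        # intermediate energy, high zero-crossing rate
```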
First, uniformly divide the audio signal S to be played into L segments of a fixed frame length (e.g., 3 or 5 frames; the smaller the fixed frame length, the more segments there are and the smaller the delay processing error), and calculate the short-time energy of each segment according to the following formula:
E_L = \sum_{m=1}^{M} [X_L(m)]^2

where X_L(m) represents the amplitude of the m-th sampling point of the L-th segment of the audio signal; M is the total number of sampling points of the current segment; E_L represents the short-time energy of the L-th segment; and m is the independent variable, taking values from 1 to M.
Then compare the short-time energies E_L and E_{L-1} of two adjacent segments: if E_L > E_{L-1}, represent the current segment by binary 1, otherwise by 0. Continue this over all segments (the binary information of the first segment defaults to 0; from the second segment onward, each segment's short-time energy is compared with that of the previous segment; traversing all segments yields L-bit binary position information, such as 0XXX...). In this way, the binary position information of the audio data S to be played is obtained.
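The segmentation and energy-comparison procedure above can be sketched as follows, with each segment's short-time energy computed as the sum of squared amplitudes and the first bit defaulting to 0 as described:

```python
# Sketch of the binary position information described above: split the
# signal into fixed-length segments, compute each segment's short-time
# energy E_L = sum of squared amplitudes, then emit 1 when a segment's
# energy exceeds the previous segment's, 0 otherwise (first bit = 0).

def binary_position_info(signal, seg_len):
    segments = [signal[i:i + seg_len]
                for i in range(0, len(signal) - seg_len + 1, seg_len)]
    energies = [sum(x * x for x in seg) for seg in segments]
    bits = [0]  # binary info of the first segment defaults to 0
    for prev, cur in zip(energies, energies[1:]):
        bits.append(1 if cur > prev else 0)
    return bits
```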
Because of the delay, the initial position of the sound audio signal within the external audio signal collected by the audio acquisition unit is unknown; segmenting by the first fixed frame length from different starting positions may yield different amplitude energies for each delay period. To find the position information corresponding to the sound audio signal, a number of delay periods spanning the maximum delay interval are extracted from the external audio signal, and each delay period is segmented according to the preset first fixed frame length; that is, the external audio signal is segmented with delays of n·t (n = 0, 1, 2, ...). The binary position information of each delay period is calculated in the manner described above and compared with the binary position information of the audio signal to be played. The comparison of binary position information can compute the audio position information through a variety of logical relationships, such as logical AND, exclusive OR, and so on.
Illustratively, XNOR (exclusive NOR) is adopted for the binary position information comparison. If the binary position information of a certain delay period is 1001011010 and the binary position information of the audio signal to be played is 1001001010, then the XNOR result is 1111101111; each 0 in the XNOR result marks an erroneous symbol and each 1 a correct symbol. The bit error rate is then calculated from the XNOR result as the number of erroneous symbols divided by the total number of transmitted symbols. The initial position of the sound audio signal in the external audio signal is located according to this error rate (when the error rate is smaller than a preset threshold, the initial position of that delay period is taken to coincide with the initial position of the audio signal to be played), and the sound audio signal is extracted so that the voice command signal can be separated out. In one embodiment, the preset bit error rate threshold is 10%.
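The XNOR comparison and bit-error-rate check can be sketched as follows, reproducing the worked example from the text (one differing symbol out of ten gives a 10% error rate). The `locate_start` helper name and the strict less-than comparison against the threshold are assumptions for illustration.

```python
# Sketch of the XNOR comparison and bit-error-rate check described
# above. The helper names are illustrative, not from the patent.

def bit_error_rate(bits_a, bits_b):
    # XNOR: 1 where the bits agree, 0 where they differ
    xnor = [1 if a == b else 0 for a, b in zip(bits_a, bits_b)]
    errors = xnor.count(0)  # each 0 marks an erroneous symbol
    return errors / len(xnor)

def aligns_with_reference(delay_period_bits, reference_bits,
                          threshold=0.10):
    """True if this delay period aligns with the reference, i.e. the
    bit error rate is below the preset threshold (10% in the text)."""
    return bit_error_rate(delay_period_bits, reference_bits) < threshold
```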
This method of acquiring binary position information is used to locate the sound audio signal. Each comparison is made only against the previous segment's voiced-sound energy, so it is unaffected by the playback device and its volume, accumulates no error, and depends only on the played sound source and the external sound. Meanwhile, the durations of the audio signal to be played and the external audio signal are basically stable. Therefore, locating the audio signal this way solves the problem of audio misalignment caused by random delay. Furthermore, when the audio signal to be played and the sound audio signal are already aligned, if the bit error rate between the two pieces of binary position information W and Wi still exceeds the set threshold, it can be judged that there are other loud external sounds in addition to the sound audio signal, and the voice recognition algorithm is therefore started to recognize the voice.
In one embodiment, as shown in fig. 6, the intelligent speech recognition method further includes:
step S500, dividing original audio data into a plurality of audio segments according to a preset second fixed frame length;
step S600, embedding a plurality of pieces of check information one by one in sequence into the high-frequency component coefficients of the audio segments by using a discrete wavelet transform algorithm to obtain the audio signal to be played; the number of pieces of check information is the same as the number of audio segments.
Because the human voice lies at the low-frequency end while the played audio file mixes low, middle, and high frequencies, the interference of the voice with the played audio signal is mainly in the low-frequency range. In addition, because the human auditory system is insensitive to small changes in some frequency components of an audio signal, the check information can be hidden by adjusting the multi-level middle- and high-frequency wavelet coefficients obtained by wavelet transform of each audio segment, thereby changing the energy relationship between the front and rear parts of the segment.
The decomposition formulas of the audio discrete wavelet transform are:

Low frequency band: A_j(k) = \sum_m h(m - 2k) A_{j-1}(m)

High frequency band: D_j(k) = \sum_m g(m - 2k) A_{j-1}(m)

The corresponding reconstruction formula is:

A_{j-1}(m) = \sum_k h(m - 2k) A_j(k) + \sum_k g(m - 2k) D_j(k)

where j and k are the scale factor and translation factor respectively, taking only integer values; m is the independent variable; and h and g are the low-pass and high-pass decomposition filters. A_j(k) and D_j(k) are the wavelet coefficients obtained by projecting x(n) onto V_j and W_j respectively, W_j being the orthogonal complement of V_j. A_j(k) is the low-frequency coefficient reflecting the smooth structure, called the approximation component; D_j(k) is the high-frequency coefficient reflecting the fine structure, called the detail component. The process of reconstructing the signal is the inverse discrete wavelet transform.
For example, if the maximum delay is 2 s and each small segment is 20 ms, the maximum delay must be decomposed into L = 100 segments, and by the binary representation at least 7 binary digits are needed (2^7 = 128 ≥ 100);
the original audio data is segmented according to a preset second fixed frame length, and each audio segment is numbered in sequence to obtain position check information W (for example, the check information W corresponding to the 100 th audio segment is 1100100). And embedding the verification information W into a high-frequency component coefficient of the original audio data by adopting a Discrete Wavelet Transform (DWT) algorithm so as to obtain an audio signal to be played containing the position verification information.
In one embodiment, as shown in fig. 7, the step of locating the sound audio signal in the external audio signal and then separating out the voice instruction signal includes:
step S340, extracting the check information from the external audio signal by adopting an inverse discrete wavelet transform algorithm;
step S350, locating the position of the sound audio signal in the external audio signal according to the check information;
step S360, extracting a target signal from the external audio signal; the target signal is the signal segment corresponding to the sound audio signal;
step S370, performing audio compensation on the target signal according to a preset audio amplitude compensation value to obtain the sound audio signal.
The audio signal to be played is played, and the external audio signal containing the check information is collected by the audio collecting unit. An Inverse Discrete Wavelet Transform (IDWT) is applied to the external audio signal to extract the check information W1 and the carried audio file S'. By comparing the position check information W with W1, the position of the signal segment corresponding to the sound audio signal (i.e. the target signal) can be located, which solves the misalignment between the original audio and the played audio caused by random delay. To eliminate the influence of the collection distance on the sound audio signal within the external audio signal, audio compensation is applied to the target signal according to a preset audio amplitude compensation value to obtain the sound audio signal, which is then compared and cancelled against the corresponding original audio data, achieving accurate extraction and recognition of the human voice.
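A minimal sketch of the locating step, assuming the extracted code W1 is compared against the reference codes by bit error rate (the function names and the 0.2 threshold are assumptions, not the patent's values):

```python
def bit_error_rate(a, b):
    """Fraction of differing bits between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def locate_segment(w1, reference_codes, max_ber=0.2):
    """Return the segment index whose reference check code best matches
    the extracted code W1, or None if even the best match is too noisy."""
    best = min(reference_codes, key=lambda i: bit_error_rate(w1, reference_codes[i]))
    return best if bit_error_rate(w1, reference_codes[best]) <= max_ber else None
```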
In one embodiment, as shown in fig. 8, the step of obtaining the audio amplitude compensation value comprises:
step S371, acquiring a loopback signal; the loopback signal is a copy of the audio signal to be played, tapped from the data transmission unit;
step S372, extracting the sound audio signal from the external audio signal;
step S373, calculating the audio amplitude compensation value at the current volume according to the loopback signal and the sound audio signal;
step S374, updating and storing the audio amplitude compensation value for the current volume.
The sound audio signal is extracted from the external audio signal, and an adaptive algorithm is used to calculate the audio amplitude compensation value by combining it with the loopback signal (the tapped copy of the audio signal to be played), that is, from the filter coefficients relating the sound audio signal to the loopback signal. During echo cancellation, the loopback signal is compensated according to the audio amplitude compensation value for the corresponding volume and then cancelled against the sound audio signal, so that a cleaner voice signal is extracted.
The calculated audio amplitude compensation value may differ at different volumes. To ensure the correction accuracy of the loopback signal, each calculated audio amplitude compensation value is mapped to the volume of the external sound equipment. After the audio amplitude compensation value at the current volume is calculated, if a value already exists for that volume, it is overwritten with the newly calculated one; otherwise, the newly calculated value is stored directly.
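As a rough illustration of one way such a compensation value could be computed (a least-squares gain, standing in for the patent's adaptive filter; the function names are illustrative):

```python
def amplitude_compensation(loopback, captured):
    """Least-squares gain g minimizing sum((captured - g*loopback)^2):
    g = <loopback, captured> / <loopback, loopback>."""
    num = sum(r * c for r, c in zip(loopback, captured))
    den = sum(r * r for r in loopback)
    return num / den

def cancel_echo(loopback, captured, gain):
    """Subtract the gain-compensated loopback from the captured signal."""
    return [c - gain * r for r, c in zip(loopback, captured)]
```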
In one embodiment, the step of obtaining the audio amplitude compensation value further comprises:
step S375, sending a volume level adjustment instruction to the external sound device; the volume level adjusting instruction is used for indicating the external sound equipment to adjust the volume level;
step S376, calculating and updating the stored audio amplitude compensation values corresponding to the different volume levels until all volume levels have been traversed.
After the audio amplitude compensation value corresponding to the current volume level is calculated and updated, step S375 is executed again: the external sound equipment is controlled to adjust its volume level, and the audio amplitude compensation value corresponding to the adjusted level is calculated, until the compensation values of all volume levels have been obtained. In this way each volume level has its own compensation value. For example, at the 1% volume level an audio amplitude compensation value A1 is calculated and updated or stored; the external sound equipment is then adjusted to the 2% volume level and a value A2 is calculated and updated or stored; this repeats until the compensation values for the 1% to 100% volume levels have all been calculated.
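The traversal described above can be sketched as follows, with `set_volume` standing in for the volume level adjustment instruction and `measure_compensation` for the per-level calculation (both are hypothetical callbacks, not the patent's API):

```python
def calibrate_all_levels(set_volume, measure_compensation, levels=range(1, 101)):
    """Step through every volume level of the external speaker, measuring
    and storing the compensation value for each (A1, A2, ... in the text)."""
    table = {}
    for level in levels:
        set_volume(level)                      # send the adjustment instruction
        table[level] = measure_compensation(level)
    return table
```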
In one embodiment, the audio amplitude compensation value at each volume level is recalculated every preset period time.
The audio amplitude compensation value changes with the environment, for example when the distance between the intelligent voice recognition device and the external sound equipment changes or the ambient noise changes. If it is not updated periodically, it can no longer compensate the loopback signal accurately, which degrades the accuracy of the extracted voice information.
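One way to realize the periodic recalculation is a per-level cache whose entries expire after the preset period (the class and parameter names are illustrative):

```python
import time

class CompensationTable:
    """Per-volume compensation values that expire after a preset period,
    so environmental changes (distance, noise) are tracked over time."""
    def __init__(self, period_s, clock=time.monotonic):
        self.period_s = period_s
        self.clock = clock
        self._entries = {}                 # level -> (value, timestamp)

    def get(self, level, recompute):
        """Return the cached value for this level, recomputing it when
        missing or older than the preset period."""
        entry = self._entries.get(level)
        if entry is None or self.clock() - entry[1] >= self.period_s:
            entry = (recompute(level), self.clock())
            self._entries[level] = entry
        return entry[0]
```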
It should be understood that although the steps in the flowcharts of fig. 4-8 are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 4-8 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and not necessarily in sequence; they may be performed in turn or alternately with other steps or with sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided an intelligent speech recognition apparatus, including:
the data transmission control module 310 is configured to control the data transmission unit to transmit the audio signal to be played to an external audio device for playing; the external sound equipment is used for generating sound audio signals according to the audio signals to be played;
an external audio signal obtaining module 320, configured to obtain an external audio signal; the external audio signals comprise sound audio signals and voice instruction signals collected by the audio collecting unit;
the signal separation module 330 is configured to locate the sound audio signal in the external audio signal and then separate out the voice instruction signal;
and the voice recognition module 340 is configured to perform voice recognition on the voice instruction signal.
In one embodiment, the signal separation module 330 includes:
the delay period extraction unit is used for extracting a plurality of delay periods within the maximum delay interval from the external audio signal;
the first segmentation unit is used for segmenting each delay period according to a preset first fixed frame length and calculating the binary position information of each delay period;
and the first positioning unit is used for locating the start position of the sound audio signal in the external audio signal and extracting the sound audio signal, according to the bit error rate between the binary position information of each delay period and the preset binary position information of the audio signal to be played.
In one embodiment, the intelligent speech recognition device further comprises:
the second segmentation module is used for dividing the original audio data into a plurality of audio segments according to the length of a preset second fixed frame;
and the check information embedding module is used for embedding a plurality of pieces of check information, one by one in sequence, into the high-frequency component coefficients of the audio segments by adopting a discrete wavelet transform algorithm to obtain the audio signal to be played; the number of pieces of check information is the same as the number of audio segments.
In one embodiment, the signal separation module 330 includes:
the check information extraction unit is used for extracting the check information from the external audio signal by adopting an inverse discrete wavelet transform algorithm;
the second positioning unit is used for positioning the position of the sound audio signal in the external audio signal according to the verification information;
a first extraction unit for extracting a target signal in the external audio signal; the target signal is a signal segment corresponding to the sound audio signal;
and the audio compensation unit is used for carrying out audio compensation on the target signal according to a preset audio amplitude compensation value to obtain the sound audio signal.
In one embodiment, the intelligent speech recognition device further comprises:
the loopback module is used for acquiring a loopback signal; the loopback signal is a copy of the audio signal to be played, tapped from the data transmission unit;
the second extraction module is used for extracting the sound audio signal from the external audio signal;
the audio amplitude compensation value calculation module is used for calculating the audio amplitude compensation value at the current volume according to the loopback signal and the sound audio signal;
and the updating module is used for updating and storing the audio amplitude compensation value for the current volume.
In one embodiment, the intelligent speech recognition device further comprises:
the instruction sending module is used for sending a volume level adjusting instruction to external sound equipment; the volume level adjusting instruction is used for indicating the external sound equipment to adjust the volume level;
and the traversal calculation module is used for calculating and updating the stored audio amplitude compensation values corresponding to the different volume levels until all volume levels have been traversed.
For specific limitations of the intelligent speech recognition apparatus, reference may be made to the above limitations of the intelligent speech recognition method, which are not repeated here. The modules in the intelligent speech recognition apparatus can be realized wholly or partially by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or independent of, a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules. It should be noted that the division into modules in the embodiments of the present application is schematic and is only one kind of logical function division; other division manners are possible in actual implementation.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
step S100, controlling a data transmission unit to transmit an audio signal to be played to external sound equipment for playing; the external sound equipment is used for generating sound audio signals according to the audio signals to be played;
step S200, obtaining an external audio signal; the external audio signals comprise sound audio signals and voice instruction signals collected by the audio collecting unit;
step S300, locating the sound audio signal in the external audio signal and then separating out the voice instruction signal;
step S400, voice recognition is carried out on the voice command signal.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
step S107, sending a volume level adjusting instruction to external sound equipment; the volume level adjusting instruction is used for indicating the external sound equipment to adjust the volume level;
and step S108, calculating and updating the audio amplitude compensation values corresponding to the stored different volume levels until all the volume levels are traversed.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
step S310, extracting a plurality of delay periods within the maximum delay interval from the external audio signal;
step S320, segmenting each delay period according to a preset first fixed frame length, and calculating binary position information of each delay period;
and step S330, locating the start position of the sound audio signal in the external audio signal and extracting the sound audio signal, according to the bit error rate between the binary position information of each delay period and the preset binary position information of the audio signal to be played.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
step S500, dividing original audio data into a plurality of audio segments according to a preset second fixed frame length;
step S600, embedding a plurality of pieces of check information, one by one in sequence, into the high-frequency component coefficients of the audio segments by adopting a discrete wavelet transform algorithm to obtain the audio signal to be played; the number of pieces of check information is the same as the number of audio segments.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
step S340, extracting the check information from the external audio signal by adopting an inverse discrete wavelet transform algorithm;
step S350, locating the position of the sound audio signal in the external audio signal according to the check information;
step S360, extracting a target signal from the external audio signal; the target signal is the signal segment corresponding to the sound audio signal;
step S370, performing audio compensation on the target signal according to a preset audio amplitude compensation value to obtain the sound audio signal.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
step S371, acquiring a loopback signal; the loopback signal is a copy of the audio signal to be played, tapped from the data transmission unit;
step S372, extracting the sound audio signal from the external audio signal;
step S373, calculating the audio amplitude compensation value at the current volume according to the loopback signal and the sound audio signal;
step S374, updating and storing the audio amplitude compensation value for the current volume.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
step S375, sending a volume level adjustment instruction to the external sound device; the volume level adjusting instruction is used for indicating the external sound equipment to adjust the volume level;
step S376, calculating and updating the stored audio amplitude compensation values corresponding to the different volume levels until all volume levels have been traversed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor performs the steps of:
step S100, controlling a data transmission unit to transmit an audio signal to be played to external sound equipment for playing; the external sound equipment is used for generating sound audio signals according to the audio signals to be played;
step S200, obtaining an external audio signal; the external audio signals comprise sound audio signals and voice instruction signals collected by the audio collecting unit;
step S300, locating the sound audio signal in the external audio signal and then separating out the voice instruction signal;
step S400, voice recognition is carried out on the voice command signal.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
step S107, sending a volume level adjusting instruction to external sound equipment; the volume level adjusting instruction is used for indicating the external sound equipment to adjust the volume level;
and step S108, calculating and updating the audio amplitude compensation values corresponding to the stored different volume levels until all the volume levels are traversed.
In one embodiment, the computer program when executed by the processor further performs the steps of:
step S310, extracting a plurality of delay periods within the maximum delay interval from the external audio signal;
step S320, segmenting each delay period according to a preset first fixed frame length, and calculating binary position information of each delay period;
and step S330, locating the start position of the sound audio signal in the external audio signal and extracting the sound audio signal, according to the bit error rate between the binary position information of each delay period and the preset binary position information of the audio signal to be played.
In one embodiment, the computer program when executed by the processor further performs the steps of:
step S500, dividing original audio data into a plurality of audio segments according to a preset second fixed frame length;
step S600, embedding a plurality of pieces of check information, one by one in sequence, into the high-frequency component coefficients of the audio segments by adopting a discrete wavelet transform algorithm to obtain the audio signal to be played; the number of pieces of check information is the same as the number of audio segments.
In one embodiment, the computer program when executed by the processor further performs the steps of:
step S340, extracting the check information from the external audio signal by adopting an inverse discrete wavelet transform algorithm;
step S350, locating the position of the sound audio signal in the external audio signal according to the check information;
step S360, extracting a target signal from the external audio signal; the target signal is the signal segment corresponding to the sound audio signal;
step S370, performing audio compensation on the target signal according to a preset audio amplitude compensation value to obtain the sound audio signal.
In one embodiment, the computer program when executed by the processor further performs the steps of:
step S371, acquiring a loopback signal; the loopback signal is a copy of the audio signal to be played, tapped from the data transmission unit;
step S372, extracting the sound audio signal from the external audio signal;
step S373, calculating the audio amplitude compensation value at the current volume according to the loopback signal and the sound audio signal;
step S374, updating and storing the audio amplitude compensation value for the current volume.
In one embodiment, the computer program when executed by the processor further performs the steps of:
step S375, sending a volume level adjustment instruction to the external sound device; the volume level adjusting instruction is used for indicating the external sound equipment to adjust the volume level;
step S376, calculating and updating the stored audio amplitude compensation values corresponding to the different volume levels until all volume levels have been traversed.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
In the description herein, references to "some embodiments," "other embodiments," "desired embodiments," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such references do not necessarily all refer to the same embodiment or example.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. An intelligent speech recognition method, characterized in that the method comprises:
controlling the data transmission unit to transmit an audio signal to be played to external sound equipment for playing; the external sound equipment is used for generating sound audio signals according to the audio signals to be played;
acquiring an external audio signal; the external audio signals comprise sound audio signals and voice instruction signals collected by the audio collecting unit;
positioning a sound audio signal in the external audio signal and then separating to obtain a voice instruction signal;
and carrying out voice recognition on the voice instruction signal.
2. The intelligent voice recognition method according to claim 1, wherein the step of locating the sound audio signal in the external audio signal and then separating out the voice instruction signal comprises:
extracting a plurality of delay periods within the maximum delay interval from the external audio signal;
segmenting each delay period according to the length of a preset first fixed frame, and calculating binary position information of each delay period;
and locating the start position of the sound audio signal in the external audio signal and extracting the sound audio signal, according to the bit error rate between the binary position information of each delay period and the preset binary position information of the audio signal to be played.
3. The intelligent speech recognition method of claim 1, further comprising:
dividing original audio data into a plurality of audio segments according to a preset second fixed frame length;
embedding a plurality of pieces of check information, one by one in sequence, into the high-frequency component coefficients of the audio segments by adopting a discrete wavelet transform algorithm to obtain an audio signal to be played; and the number of pieces of check information is the same as the number of the audio segments.
4. The intelligent voice recognition method according to claim 3, wherein the step of locating the sound audio signal in the external audio signal and then separating out the voice instruction signal comprises:
extracting the check information from the external audio signal by adopting an inverse discrete wavelet transform algorithm;
locating the position of the sound audio signal in the external audio signal according to the check information;
extracting a target signal from the external audio signal; the target signal is the signal segment corresponding to the sound audio signal;
and performing audio compensation on the target signal according to a preset audio amplitude compensation value to obtain the sound audio signal.
5. The intelligent speech recognition method of claim 4, wherein the step of obtaining the audio amplitude compensation value comprises:
acquiring a loopback signal; the loopback signal is a copy of the audio signal to be played, tapped from the data transmission unit;
extracting the sound audio signal from the external audio signal;
calculating an audio amplitude compensation value at the current volume according to the loopback signal and the sound audio signal;
and updating and storing the audio amplitude compensation value for the current volume.
6. The intelligent speech recognition method of claim 5, wherein the step of obtaining the audio amplitude compensation value further comprises:
sending a volume level adjusting instruction to external sound equipment; the volume level adjusting instruction is used for indicating the external sound equipment to adjust the volume level;
and calculating and updating the stored audio amplitude compensation values corresponding to the different volume levels until all volume levels have been traversed.
7. An intelligent speech recognition device, wherein the intelligent speech recognition method according to any one of claims 1 to 6 is applied, the device comprising:
the data transmission unit is used for transmitting the audio signal to be played to external sound equipment; the external sound equipment is used for generating sound audio signals according to the audio signals to be played;
the audio acquisition unit is used for acquiring external audio signals; the external audio signal comprises the sound audio signal and a voice instruction signal;
the main control unit is used for controlling the data transmission unit to transmit an audio signal to be played to the external sound equipment for playing, acquiring the external audio signal fed back by the audio acquisition unit, separating the sound audio signal in the external audio signal to obtain a voice instruction signal, and performing voice recognition on the voice instruction signal.
8. An intelligent speech recognition device, comprising:
the data transmission control module is used for controlling the data transmission unit to transmit the audio signal to be played to the external sound equipment for playing; the external sound equipment is used for generating sound audio signals according to the audio signals to be played;
the external audio signal acquisition module is used for acquiring an external audio signal; the external audio signals comprise sound audio signals and voice instruction signals collected by the audio collecting unit;
the signal separation module is used for separating and obtaining a voice instruction signal after positioning a sound audio signal in the external audio signal;
and the voice recognition module is used for carrying out voice recognition on the voice instruction signal.
9. An intelligent speech recognition device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202011327097.9A 2020-11-24 2020-11-24 Intelligent voice recognition method, device, apparatus and storage medium Active CN112331204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011327097.9A CN112331204B (en) 2020-11-24 2020-11-24 Intelligent voice recognition method, device, apparatus and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011327097.9A CN112331204B (en) 2020-11-24 2020-11-24 Intelligent voice recognition method, device, apparatus and storage medium

Publications (2)

Publication Number Publication Date
CN112331204A true CN112331204A (en) 2021-02-05
CN112331204B CN112331204B (en) 2024-02-20

Family

ID=74321136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011327097.9A Active CN112331204B (en) 2020-11-24 2020-11-24 Intelligent voice recognition method, device, apparatus and storage medium

Country Status (1)

Country Link
CN (1) CN112331204B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735487A (en) * 2021-03-29 2021-04-30 智道网联科技(北京)有限公司 Voice data processing method and device and electronic equipment
CN113593540A (en) * 2021-07-28 2021-11-02 展讯半导体(成都)有限公司 Voice processing method, device and equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5412734A (en) * 1993-09-13 1995-05-02 Thomasson; Samuel L. Apparatus and method for reducing acoustic feedback
US6252967B1 (en) * 1999-01-21 2001-06-26 Acoustic Technologies, Inc. Reducing acoustic feedback with digital modulation
CN101123830A (en) * 2006-08-09 2008-02-13 索尼株式会社 Device, method and program for processing audio frequency signal
CN103442427A (en) * 2013-09-11 2013-12-11 湖南创智数码科技股份有限公司 Data synchronization method, device and system as well as echo cancellation method and system
CN108198551A (en) * 2018-01-15 2018-06-22 深圳前海黑鲸科技有限公司 The processing method and processing device of echo cancellor delay
CN108289267A (en) * 2018-04-14 2018-07-17 北京智网时代科技有限公司 Eliminate echo cancelling device, method, speaker, the voice frequency sender of TV interference
US20190254013A1 (en) * 2018-06-07 2019-08-15 Intel Corporation Full bandwidth uplink transmission for unlicensed narrowband internet of things
CN110782887A (en) * 2019-03-11 2020-02-11 北京嘀嘀无限科技发展有限公司 Voice signal processing method, system, device, equipment and computer storage medium
WO2020198713A1 (en) * 2019-03-27 2020-10-01 Apple Inc. Assistance information indication for rat and interface selection for new radio vehicle-to-everything (v2x)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5412734A (en) * 1993-09-13 1995-05-02 Thomasson; Samuel L. Apparatus and method for reducing acoustic feedback
US5649019A (en) * 1993-09-13 1997-07-15 Thomasson; Samuel L. Digital apparatus for reducing acoustic feedback
US6252967B1 (en) * 1999-01-21 2001-06-26 Acoustic Technologies, Inc. Reducing acoustic feedback with digital modulation
CN101123830A (en) * 2006-08-09 2008-02-13 索尼株式会社 Device, method and program for processing audio frequency signal
CN103442427A (en) * 2013-09-11 2013-12-11 湖南创智数码科技股份有限公司 Data synchronization method, device and system as well as echo cancellation method and system
CN108198551A (en) * 2018-01-15 2018-06-22 深圳前海黑鲸科技有限公司 Echo cancellation delay processing method and device
CN108289267A (en) * 2018-04-14 2018-07-17 北京智网时代科技有限公司 Echo cancellation device and method, speaker, and audio transmitter for eliminating TV interference
US20190254013A1 (en) * 2018-06-07 2019-08-15 Intel Corporation Full bandwidth uplink transmission for unlicensed narrowband internet of things
CN110782887A (en) * 2019-03-11 2020-02-11 Beijing Didi Infinity Technology Development Co., Ltd. Voice signal processing method, system, device, equipment and computer storage medium
WO2020198713A1 (en) * 2019-03-27 2020-10-01 Apple Inc. Assistance information indication for rat and interface selection for new radio vehicle-to-everything (v2x)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A. Abbate et al.: "Signal detection and noise suppression using a wavelet transform signal processor: application to ultrasonic flaw detection", IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control *
Abraham Gabay et al.: "Spectral interpolation coder for impulse noise cancellation over a binary symetric channel", 2000 10th European Signal Processing Conference *
Wang Yalou: "Echo cancellation scheme for an intelligent access control system based on WebRTC", China Master's Theses Full-text Database *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735487A (en) * 2021-03-29 2021-04-30 智道网联科技(北京)有限公司 Voice data processing method and device and electronic equipment
CN112735487B (en) * 2021-03-29 2021-07-09 智道网联科技(北京)有限公司 Voice data processing method and device and electronic equipment
CN113593540A (en) * 2021-07-28 2021-11-02 展讯半导体(成都)有限公司 Voice processing method, device and equipment
CN113593540B (en) * 2021-07-28 2023-08-11 展讯半导体(成都)有限公司 Voice processing method, device and equipment

Also Published As

Publication number Publication date
CN112331204B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
CN110600018B (en) Voice recognition method and device and neural network training method and device
CN111247585B (en) Voice conversion method, device, equipment and storage medium
JP4842583B2 (en) Method and apparatus for multisensory speech enhancement
KR20180127171A (en) Apparatus and method for student-teacher transfer learning network using knowledge bridge
CN1236421C (en) Speech recognition system
EP1403855A1 (en) Noise suppressor
CN112331204B (en) Intelligent voice recognition method, device and storage medium
CN112786064B (en) End-to-end joint enhancement method for bone-conducted and air-conducted speech
JPS634200B2 (en)
US5828993A (en) Apparatus and method of coding and decoding vocal sound data based on phoneme
US6381572B1 (en) Method of modifying feature parameter for speech recognition, method of speech recognition and speech recognition apparatus
KR101010585B1 (en) Method for generating a vector codebook, method and device for compressing data, and distributed speech recognition system
JPH0576040B2 (en)
US20230009771A1 (en) Method and apparatus for data augmentation
EP0465639A4 (en) Time series association learning
CN113327589A (en) Voice activity detection method based on attitude sensor
CN116982111A (en) Audio characteristic compensation method, audio identification method and related products
CN115881157A (en) Audio signal processing method and related equipment
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
JP2000347679A (en) Audio encoder, and audio coding method
CN117174100B (en) Bone conduction voice generation method, electronic equipment and storage medium
CN114863940B (en) Model training method for voice quality conversion, and method, device and medium for improving voice quality
CN115116448B (en) Voice extraction method, neural network model training method, device and storage medium
CN112992177B (en) Training method, device, equipment and storage medium of voice style migration model
CN114999508B (en) Universal voice enhancement method and device by utilizing multi-source auxiliary information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 519000 No. 333, Kexing Road, Xiangzhou District, Zhuhai City, Guangdong Province

Applicant after: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.

Address before: Floor 1-107, building 904, ShiJiHua Road, Zhuhai City, Guangdong Province

Applicant before: ZHUHAI JIELI TECHNOLOGY Co.,Ltd.

GR01 Patent grant