CN116825109B - Processing method, device, equipment and medium for voice command misrecognition - Google Patents
Abstract
The application belongs to the technical field of speech recognition, and discloses a processing method, device, equipment and medium for voice command misrecognition, wherein the method comprises the following steps: comparing the energy characteristic of the time interval to which the command word belongs with the energy characteristic of a first preset time before the time interval; judging, based on the comparison result, whether the recognition of the command word is valid; if the recognition is valid, calculating the recognition starting point of the command word according to the output value of the neural network; judging whether the number of valid outputs in a second preset time before the time point of the starting point exceeds a preset threshold value; and if the number of valid outputs does not exceed the preset threshold value, the recognition of the current command word is valid and the current command word is executed. The application improves the accuracy and reliability of command word recognition, ensures that the system responds correctly to the user's commands, improves the user's interaction experience and operating efficiency with the device, and reduces unnecessary false triggering and misrecognition.
Description
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a medium for processing speech command misrecognition.
Background
Command word recognition is a form of speech recognition widely applied in the smart home field, for example in smart voice speakers, smart voice earphones, smart voice lamps and smart voice fans. Although the correct recognition rate of command words has improved markedly with the development of deep learning technology and basically meets user needs, many misrecognition situations remain: the user has not actually spoken the command word, yet the device misrecognizes one and then responds.
Therefore, how to effectively reduce the misrecognition rate and improve the user experience in the speech recognition process is a problem that urgently needs to be solved.
Disclosure of Invention
The application mainly aims to provide a processing method, a device, equipment and a medium for voice command misrecognition, which aim to solve the technical problem of how to effectively reduce the misrecognition rate and improve the user experience in the voice recognition process.
In order to achieve the above object, a first aspect of the present application provides a method for processing voice command misrecognition, the method comprising: comparing the energy characteristic of the time interval to which the command word belongs with the energy characteristic of a first preset time before the time interval; judging whether the recognition of the command word is valid based on the comparison result; if the recognition is valid, calculating the recognition starting point of the command word according to the output value of the neural network; judging whether the number of valid outputs in a second preset time before the time point of the starting point exceeds a preset threshold value; and if the number of valid outputs does not exceed the preset threshold value, the recognition of the current command word is valid, and the current command word is executed.
Further, the step of comparing with the energy characteristic of the first preset time before the time interval includes: setting the energy characteristics of the time interval and of the first preset time to P0 and P1 respectively; if P1 > th × P0, judging that voice exists in the first preset time; if P1 < th × P0, judging that no voice exists in the first preset time, wherein th is a first preset threshold value.
Further, after the step of judging whether the recognition of the command word is valid based on the comparison result, the method includes: if voice exists in the first preset time, judging that the recognition of the command word is invalid; and if no voice exists in the first preset time, judging that the recognition of the command word is valid.
Further, the step of calculating the recognition starting point of the command word according to the output value of the neural network includes: setting the moment at which recognition of the command word completes as t, wherein the phoneme length recognized for the command word is word_len; if the feature maximum in the range t - 2 × word_len to t - word_len is greater than a second preset threshold th2 and the column at the feature maximum is an invalid column, taking the position corresponding to the feature maximum as the starting point.
Further, the step of determining whether the number of valid outputs in the second preset time before the time point where the start point is located exceeds a preset threshold value includes: obtaining a blank value in the phoneme probability in the second preset time; if the blank value is greater than a third preset threshold th3, judging that the output is invalid; identifying the number of effective outputs in the second preset time; judging whether the number of the effective outputs in the second preset time exceeds a preset threshold value or not; and if the number of the valid outputs exceeds a preset threshold, judging that the identification of the current command word is invalid.
Further, before the step of determining the energy characteristic of the time interval to which the command word belongs, the method includes: respectively acquiring the power spectrum of each frame of audio in that time interval and in the first preset time before it, and calculating the respective average values as the corresponding energy characteristics.
Further, after the step of obtaining the power spectrum of each frame of audio at the first preset time in the time interval to which the command word belongs and before the time interval, the method includes: and carrying out windowing smoothing on the power spectrum of each frame of audio frequency, and updating the power spectrum.
The application also provides a processing device for voice command misrecognition, which comprises:
the feature comparison module is used for comparing the energy features in the time interval to which the command word belongs with the energy features of the first preset time before the time interval; the effective judging module is used for judging whether the recognition of the command word is effective or not based on the comparison result; the starting point calculation module is used for calculating the recognition starting point of the command word according to the output value of the neural network if the recognition is effective; the effective output judging module is used for judging whether the number of effective outputs in a second preset time before the time point where the starting point is located exceeds a preset threshold value; and the final recognition effective module is used for executing the current command word if the number of the effective outputs does not exceed the preset threshold value and the recognition of the current command word is effective.
The application also provides a computer device comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the steps of the above processing method for voice command misrecognition.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of processing speech command misrecognition as described in any of the above.
The beneficial effects are that:
the application can combine the use habit of the person when using voice control in the voice recognition process, carry out a multi-layer judgment mechanism for the command word, and judge the misjudgment in the dialogue by comparing the energy characteristics of the time interval where the command word is located with the energy characteristics of the previous time range; calculating and identifying a starting point according to the output of the neural network; the number of effective outputs before the starting point is counted, whether the threshold value is exceeded or not is judged, and whether more non-noise voices exist before the starting point or not is obtained, so that accuracy and reliability of command word recognition are improved, correct response of a system to a user command is ensured, interactive experience and operation efficiency of the user and equipment are improved, and unnecessary false triggering and false recognition are reduced.
Drawings
FIG. 1 is a flow chart of a method for handling speech command misrecognition according to an embodiment of the application;
FIG. 2 is a schematic block diagram of a processing device for speech command misrecognition according to an embodiment of the application;
FIG. 3 is a block diagram of a computer device according to one embodiment of the application;
the achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, modules, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, modules, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to fig. 1, an embodiment of the present application provides a processing method for speech command misrecognition, including the following steps S1 to S5:
s1: and comparing the energy characteristics in the time interval to which the command word belongs with the energy characteristics of the first preset time before the time interval.
Calculating the energy characteristic of the command word's time interval: the average of the power spectrum of each frame of audio within the interval is taken as the energy characteristic; the fbank calculation can be used to obtain the power spectrum of each frame, and the frame power spectra are averaged to give the energy characteristic P0. Then the energy characteristic of the immediately preceding period (the first preset time) is calculated: starting from the beginning of the command word's time interval, a period of time is stepped back, and the average power spectrum of each frame of audio in that period is calculated as the energy characteristic, again using the fbank calculation and averaging the frame power spectra to give the energy characteristic P1. In addition, to suppress interference, the power spectrum of each frame may preferably be windowed and smoothed to obtain new power spectrum characteristics before the averages are calculated.
S2: and judging whether the recognition of the command word is effective or not based on the comparison result.
P1 is compared with the threshold th × P0; if P1 is smaller than this threshold, no voice is considered present in the first preset time and the recognition is considered valid, wherein the threshold th is preferably set to 0.5.
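To make steps S1 and S2 concrete, the comparison can be sketched in Python as below. This is an illustrative sketch, not part of the patent: the frame-level power spectra are assumed to come from an fbank front end, the array shapes and example values are invented, and only th = 0.5 follows the preferred value given in the text.

```python
import numpy as np

def energy_feature(power_spectra):
    # Mean of the per-frame power spectra over a window of frames.
    # power_spectra: shape (n_frames, n_bins), e.g. fbank power spectra.
    return float(np.mean(power_spectra))

def speech_before_command(p0, p1, th=0.5):
    # Step S2: voice is judged present before the command word when
    # P1 > th * P0; th = 0.5 is the preferred value from the text.
    return p1 > th * p0

# A loud command word preceded by near-silence: recognition stays valid.
cmd_frames = np.full((50, 40), 4.0)    # frames inside the command interval
prev_frames = np.full((30, 40), 1.0)   # frames of the first preset time
p0, p1 = energy_feature(cmd_frames), energy_feature(prev_frames)
print(speech_before_command(p0, p1))   # prints False: no preceding speech
```

Because the rule compares a ratio rather than absolute energy, the same th works for near-field and far-field audio, as the description notes.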
S3: if the recognition is effective, calculating a recognition starting point of the command word according to the output value of the neural network.
Setting the moment at which the command word finishes being recognized as t, and assuming the phoneme length of the recognition result is word_len, an initial starting point can be obtained by subtracting word_len from t, and a search is made in the time range t - 2 × word_len to t - word_len for the feature maximum that determines the start boundary. A point is taken as the starting point only if it meets the following three conditions: (1) the current column is an invalid column, that is, the probability value of the corresponding blank phoneme is greater than a certain threshold, which may be denoted th3; if the blank probability value of the current column exceeds th3, the column is considered an invalid column; (2) the point is a feature maximum: the feature of a point is defined as the number of valid phoneme columns to its right (l_right) plus the number of invalid phoneme columns to its left (l_left), where l_right may be taken as the number of CTC output columns corresponding to 0.5 seconds of audio and l_left as half the phoneme length; the condition is met if the feature value reaches the global maximum, i.e. the feature maximum over the search range; (3) the feature maximum is greater than a second preset threshold th2, which may be obtained by subtracting a preset adjustment parameter delta (typically set to 3) from the sum of the left and right column counts, i.e. th2 = l_right + l_left - delta. If a point satisfying all three conditions is found in the time range t - 2 × word_len to t - word_len, that point is taken as the starting point.
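The three-condition search can be sketched as follows. This is an interpretive sketch of the start-point search, not the patent's implementation: `blank_prob` stands for the per-column blank probability of the CTC output, and the default values of l_right, th3 and delta (25 columns, 0.5 and 3) are illustrative, with l_right meant to correspond to roughly 0.5 seconds of audio as the text suggests.

```python
import numpy as np

def find_start_point(blank_prob, t, word_len, l_right=25, l_left=None,
                     th3=0.5, delta=3):
    # Search the range t - 2*word_len .. t - word_len for the start point.
    # blank_prob[i]: blank probability of CTC column i; a column is
    # "invalid" when its blank probability exceeds th3.
    if l_left is None:
        l_left = word_len // 2              # half the phoneme length
    invalid = blank_prob > th3
    th2 = l_right + l_left - delta          # condition (3) threshold

    best_i, best_feat = None, -1
    for i in range(max(t - 2 * word_len, l_left), t - word_len + 1):
        if not invalid[i]:                  # condition (1): invalid column
            continue
        right = int(np.count_nonzero(~invalid[i + 1:i + 1 + l_right]))
        left = int(np.count_nonzero(invalid[max(i - l_left, 0):i]))
        feat = right + left                 # feature of column i
        if feat > best_feat:                # condition (2): global maximum
            best_i, best_feat = i, feat
    return best_i if best_feat > th2 else None  # condition (3)
```

For example, with 65 blank-dominated columns followed by 35 speech columns, t = 100 and word_len = 20, the search settles on column 64, the last invalid column before the speech begins; if no column clears th2, it returns None and the start-point decision does not hold.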
S4: judging whether the number of effective outputs in a second preset time before the time point of the starting point exceeds a preset threshold value.
The phoneme probabilities within the second preset time are obtained and the blank value of each column is checked; the number of valid outputs is determined from the blank values, and it is judged whether the number of valid outputs within the second preset time exceeds the preset threshold value, thereby judging whether the recognition of the current command word is valid. If the number of valid outputs exceeds the preset threshold value, the recognition of the current command word is judged invalid.
S5: if the number of the valid outputs does not exceed the preset threshold, the identification of the current command word is valid, and the current command word is executed.
When the system judges that the effective output number in the second preset time does not exceed the preset threshold value, the voice recognition system is stated to successfully recognize the command word, and the recognition result is reliable, so that the operation or instruction related to the current command word can be executed.
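Steps S4 and S5 amount to counting non-blank columns in a window before the start point. A minimal sketch, with th3 and the valid-output threshold (`max_valid`) as illustrative placeholders rather than values fixed by the method:

```python
import numpy as np

def command_recognition_valid(blank_prob, start, window, th3=0.5, max_valid=2):
    # Steps S4-S5: count non-blank ("valid") CTC columns in the `window`
    # columns before `start`; accept the command word only when that
    # count does not exceed max_valid.
    segment = blank_prob[max(start - window, 0):start]
    n_valid = int(np.count_nonzero(segment <= th3))  # blank <= th3 => valid
    return n_valid <= max_valid
```

A window that is mostly blank yields True (the command word is executed); a window containing a stretch of speech yields False (the recognition is discarded as part of a conversation).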
This embodiment can reuse the existing calculation flow without additional computation or memory. The energy characteristic is computed as a relative ratio rather than against an absolute value, so the method suits different scenarios such as near field and far field. Speech is judged from the proportion of phonemes in the CTC output, which is calculated by the neural network, so the discrimination is clearer and the recognition more accurate. The implementation is simple and efficient, greatly reduces the frequency of misrecognition, has good real-time performance, and the strictness of recognition can be adjusted flexibly through parameter configuration.
In an embodiment, the step of comparing the energy characteristic of the first preset time before the time interval includes:
s11: and setting the energy characteristics of the time interval and the first preset time as p0 and p1 respectively.
S12: if P1>thAnd p0, judging that the voice exists in the first preset time.
S13: if P1<thAnd p0, judging that no voice exists in the first preset time, wherein th is a first preset threshold value.
In this embodiment, the energy characteristic of the time interval is set to P0 and that of the first preset time to P1. If P1 is greater than th × P0, it is judged that voice exists in the first preset time; if P1 is smaller than th × P0, it is judged that no voice exists in the first preset time, wherein th is the preset first threshold value. If voice exists, the currently recognized command word was most likely the same word carried within a conversation, not a control instruction.
In an embodiment, the step of determining whether the recognition of the command word is valid based on the comparison result includes:
s21: and if the voice exists in the first preset time, judging that the recognition of the command word is invalid.
S22: and if no voice exists in the first preset time, judging that the recognition of the command word is effective.
In this embodiment, if voice exists in the first preset time, that is, the energy characteristic of the preceding period is higher than th times the energy characteristic of the current time interval, the recognition of the command word is judged invalid; if no voice exists in the first preset time, that is, the preceding energy characteristic is lower than th times the current one, the recognition of the command word is judged valid. For example, while controlling a smart product, a user might say "yesterday I opened the refrigerator and closed it again", and the command word "open the refrigerator" is recognized. The energy characteristic of the continuous period before the recognized interval is then obtained; because speech such as "yesterday" precedes the command word, this energy characteristic is higher than th times the energy characteristic of the current interval, so voice is judged present: it is continuous speech rather than noise, the false triggering of the refrigerator is avoided, and judgment accuracy is improved.
In one embodiment, the step of calculating the recognition start point of the command word according to the output value of the neural network includes:
s31: and setting the moment of finishing the recognition of the command word as t, and setting the phoneme length of the recognition of the command word as words_len.
S32: if at t-2And if the characteristic maximum value between the words_len and the t-words_len is larger than a second preset threshold th2 and the characteristic maximum value is an invalid column, taking the position corresponding to the characteristic maximum value as the starting point.
In the present embodiment, the moment at which recognition of the command word completes is set to t, and assuming the phoneme length of the recognition result is word_len, an initial starting point can be obtained by subtracting word_len from t; a search is made in the time range t - 2 × word_len to t - word_len for the feature maximum that determines the start boundary. A point is taken as the starting point only if it meets the following three conditions: (1) the current column is an invalid column, that is, the probability value of the corresponding blank phoneme is greater than a certain threshold, which may be denoted th3; if the blank probability value of the current column exceeds th3, the column is considered an invalid column; (2) the point is a feature maximum: the feature of a point is defined as the number of valid phoneme columns to its right (l_right) plus the number of invalid phoneme columns to its left (l_left), where l_right may be taken as the number of CTC output columns corresponding to 0.5 seconds of audio and l_left as half the phoneme length; the condition is met if the feature value reaches the global maximum, i.e. the feature maximum over the search range; (3) the feature maximum is greater than the second preset threshold th2 = l_right + l_left - delta, where delta is a preset adjustment parameter, typically set to 3. If no boundary satisfying all three conditions is found in the time range t - 2 × word_len to t - word_len, it is considered that sound features still exist before the candidate starting point, and the starting-point decision does not hold.
In an embodiment, the step of determining whether the number of valid outputs in the second preset time before the point in time at which the start point is located exceeds a preset threshold value includes:
s41: and obtaining a blank value in the phoneme probability in the second preset time.
S42: if the blank value is greater than the third preset threshold th3, the invalid output is judged.
S43: and identifying the number of effective outputs in the second preset time.
S44: and judging whether the number of the effective outputs in the second preset time exceeds a preset threshold value.
S45: and if the number of the valid outputs exceeds a preset threshold, judging that the identification of the current command word is invalid.
In this embodiment, the blank value of the phoneme probabilities within the second preset time is obtained. If a column's blank value is greater than the third preset threshold th3, that output is judged invalid; th3 is the preset threshold used to decide whether a phoneme column counts as a valid or invalid output. The number of valid outputs within the second preset time is then counted and compared with the preset threshold value; if it exceeds the threshold, the recognition of the current command word is judged invalid. In other words, the second preset time before the time point of the starting point is checked for whether the number of valid outputs exceeds the preset threshold, and the validity of the command word recognition is determined accordingly.
In one embodiment, before the step of determining the energy characteristic in the time interval to which the command word belongs, the method includes:
s51: and respectively acquiring the power spectrum of each frame of audio at the first preset time in the time interval to which the command word belongs and before the time interval, and calculating the respective average value as the corresponding energy characteristic.
In this embodiment, the power spectrum of each frame of audio is acquired separately for the time interval to which the command word belongs and for the first preset time before it, and the respective averages are calculated as the corresponding energy characteristics. Concretely, the audio of the command word's time interval is cut into frames, the power spectrum of each frame is calculated, and the average of these power spectra gives the energy characteristic of the interval. For comparison, the audio of the first preset time before the interval undergoes the same processing of framing, per-frame power spectrum calculation and averaging. This yields the energy characteristics both within the interval and before it, providing the data basis for the subsequent processing based on energy characteristics.
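Under the assumption of a plain periodogram front end (the text only names the fbank calculation without fixing parameters), the per-frame power spectra could be computed as follows; the 400-sample frame, 160-sample hop and 512-point FFT are typical 16 kHz values chosen purely for illustration:

```python
import numpy as np

def frame_power_spectra(audio, frame_len=400, hop=160, n_fft=512):
    # Cut the audio into frames and compute each frame's power spectrum
    # (periodogram) via a zero-padded real FFT.
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop)
    spectra = np.empty((n_frames, n_fft // 2 + 1))
    for i in range(n_frames):
        frame = audio[i * hop:i * hop + frame_len]
        spectra[i] = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
    return spectra

# The energy characteristic of an interval is then the mean over its frames:
audio = np.random.default_rng(0).standard_normal(16000)   # 1 s at 16 kHz
p0 = float(frame_power_spectra(audio).mean())
```

The same function is applied to the audio of the first preset time to obtain P1, after which the two averages are compared as in steps S1 and S2.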
In an embodiment, after the step of obtaining the power spectrum of each frame of audio at the first preset time in the time interval to which the command word belongs and before the time interval, the step includes:
s61: and carrying out windowing smoothing on the power spectrum of each frame of audio frequency, and updating the power spectrum.
In this embodiment, window smoothing is applied to the power spectrum of each frame of audio. In general, this can be implemented by multiplying each frame's power spectrum by a window function (such as a Hanning or Hamming window), which reduces fluctuation and makes the power spectrum smoother and more continuous. Window smoothing reduces noise and abrupt changes in the power spectrum, so that more stable and reliable features can be extracted, improving the accuracy and reliability of the subsequent processing steps.
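Read literally, the windowing described here multiplies each frame's power spectrum by a window function. A minimal sketch under that interpretation, using a Hann window applied across the frequency bins (this per-bin reading is an assumption of the sketch, not fixed by the text):

```python
import numpy as np

def smooth_power_spectrum(power_spec):
    # Multiply each frame's power spectrum by a Hann window (a Hamming
    # window would work the same way), damping the band-edge bins.
    window = np.hanning(power_spec.shape[-1])
    return power_spec * window

spec = np.ones((3, 8))            # 3 frames, 8 frequency bins
smoothed = smooth_power_spectrum(spec)
print(smoothed.shape)             # prints (3, 8)
```

The smoothed spectra then replace the raw ones before the averages of S1 are taken.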
Referring to fig. 2, the embodiment of the application further provides a processing device for voice command misrecognition, which includes:
the feature comparison module 100 is configured to compare the energy feature in the time interval to which the command word belongs with the energy feature of the first preset time before the time interval;
the validity judging module 200 is configured to judge whether the recognition of the command word is valid based on the comparison result;
the starting point calculation module 300 is configured to calculate, if the recognition is valid, a recognition starting point of the command word according to an output value of the neural network;
an effective output judging module 400, configured to judge whether the number of effective outputs in a second preset time before the time point where the starting point is located exceeds a preset threshold;
and a final recognition valid module 500, configured to execute the current command word if the number of valid outputs does not exceed the preset threshold, and the recognition of the current command word is valid.
As described above, the processing apparatus for speech command misrecognition can realize the processing method for speech command misrecognition.
In one embodiment, the feature comparison module 100 includes:
and the energy characteristic acquisition unit is used for setting the energy characteristics of the time interval and the first preset time to be p0 and p1 respectively.
A first comparing unit, configured to judge that voice exists in the first preset time if P1 > th × P0.
A second comparing unit, configured to judge that no voice exists in the first preset time if P1 < th × P0, wherein th is a first preset threshold value.
In one embodiment, the validity judging module 200 includes:
and the first voice presence judging unit is used for judging that the recognition of the command word is invalid if the voice exists in the first preset time.
And the second voice presence judging unit is used for judging that the identification of the command word is effective if the voice does not exist in the first preset time.
In one embodiment, the starting point calculating module 300 includes:
the assignment unit is used for setting the current time of completing the recognition of the command word as t, and the length of a phoneme recognized by the command word is word_len;
condition compliance determination for if at t-2And if the characteristic maximum value between the words_len and the t-words_len is larger than a second preset threshold th2 and the characteristic maximum value is an invalid column, taking the position corresponding to the characteristic maximum value as the starting point.
In one embodiment, the effective output determining module 400 includes:
a blank value obtaining unit, configured to obtain a blank value in the phoneme probability in the second preset time;
an effective output judging unit, configured to judge that the output is invalid if the blank value is greater than a third preset threshold th 3;
the number judging unit is used for identifying the number of effective outputs in the second preset time;
the number exceeding judging unit is used for judging whether the number of the effective outputs in the second preset time exceeds a preset threshold value;
and the invalidation judging unit is used for judging that the recognition of the current command word is invalid if the number of the valid outputs exceeds a preset threshold value.
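Taken together, the units above count the non-blank (effective) outputs in the window before the start point and reject the command word when the count exceeds the threshold. A sketch under assumed names (blank_probs, max_valid are illustrative):

```python
def recognition_valid(blank_probs, start, window, th3, max_valid):
    """Count effective outputs (columns whose blank probability does
    not exceed th3) in the second preset window before the start point;
    the command-word recognition stands only if that count does not
    exceed max_valid."""
    segment = blank_probs[max(0, start - window):start]
    n_valid = sum(1 for b in segment if b <= th3)
    return n_valid <= max_valid
```
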
In one embodiment, the pre-calculation module comprises:
the power spectrum acquisition unit is used for respectively acquiring the power spectrum of each audio frame within the time interval to which the command word belongs and within the first preset time before that interval, and for calculating the respective average values as the corresponding energy characteristics.
In an embodiment, the pre-calculation module further comprises:
and the power spectrum updating unit is used for carrying out windowed smoothing on the power spectrum of each audio frame and updating the power spectrum accordingly.
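A sketch of the two pre-calculation units, assuming framed audio as a 2-D array (frames × samples) and a hypothetical 3-point moving-average smoothing window; the text does not specify the window shape or length:

```python
import numpy as np

def energy_feature(frames, win=3):
    """Power spectrum per frame (|FFT|^2), moving-average smoothing
    along the time (frame) axis, then the overall mean as the scalar
    energy feature of the segment."""
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    kernel = np.ones(win) / win
    smoothed = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, power)
    return float(smoothed.mean())
```

Calling this once on the command-word interval and once on the first preset time before it yields the P0 and P1 values compared by the feature comparison module.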
Referring to fig. 3, an embodiment of the present application further provides a computer device, whose internal structure may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing work items and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. Further, the computer device may also be provided with an input device, a display screen, and the like.
When executed by the processor, the computer program implements the processing method for voice command misrecognition, comprising the following steps: comparing the energy characteristics in the time interval to which the command word belongs with the energy characteristics of the first preset time before the time interval; judging whether the recognition of the command word is effective based on the comparison result; if the recognition is effective, calculating the recognition starting point of the command word according to the output value of the neural network; judging whether the number of effective outputs in a second preset time before the time point of the starting point exceeds a preset threshold value; and if the number of effective outputs does not exceed the preset threshold, the recognition of the current command word is effective, and the current command word is executed. It will be appreciated by those skilled in the art that the architecture shown in fig. 3 is merely a block diagram of the portion of the architecture relevant to the present arrangement and is not intended to limit the computer devices to which the present arrangement is applicable.
An embodiment of the present application also provides a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the method for processing voice command misrecognition, comprising the following steps: comparing the energy characteristics in the time interval to which the command word belongs with the energy characteristics of the first preset time before the time interval; judging whether the recognition of the command word is effective based on the comparison result; if the recognition is effective, calculating the recognition starting point of the command word according to the output value of the neural network; judging whether the number of effective outputs in a second preset time before the time point of the starting point exceeds a preset threshold value; and if the number of effective outputs does not exceed the preset threshold, the recognition of the current command word is effective, and the current command word is executed. It is understood that the computer-readable storage medium in this embodiment may be a volatile or a non-volatile readable storage medium.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored on a non-transitory computer-readable storage medium which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided by the present application and used in embodiments may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, apparatus, article, or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application or directly or indirectly applied to other related technical fields are included in the scope of the application.
Claims (9)
1. A method for processing speech command misrecognition, said method comprising:
comparing the energy characteristics in the time interval to which the command word belongs with the energy characteristics of the first preset time before the time interval;
judging whether the recognition of the command word is effective or not based on a comparison result;
if the recognition is effective, calculating the recognition starting point of the command word according to the output value of the neural network;
judging whether the number of effective outputs in a second preset time before the time point of the starting point exceeds a preset threshold value;
if the number of the effective outputs does not exceed the preset threshold, the recognition of the current command word is effective, and the current command word is executed;
the step of calculating the recognition starting point of the command word according to the output value of the neural network comprises the following steps:
setting the current time of finishing the recognition of the command word as t, wherein the length of a phoneme recognized by the command word is word_len;
if, within the range from t - 2·word_len to t - word_len, the feature maximum value is larger than a second preset threshold th2 and falls on an invalid column, taking the position corresponding to the feature maximum value as the starting point;
if the blank probability value of the current column exceeds a third preset threshold th3, taking the current column as an invalid column;
the feature value at the current time is the number of valid phoneme columns to its right (l_right) plus the number of invalid phoneme columns to its left (l_left), wherein if the feature value reaches the global maximum, it is the feature maximum of the current time.
2. The method of claim 1, wherein the step of comparing the energy characteristics in the time interval to which the command word belongs with the energy characteristics of the first preset time before the time interval comprises:
denoting the energy characteristics of the time interval and of the first preset time as P0 and P1, respectively;
if P1 > th·P0, judging that human voice exists in the first preset time;
if P1 < th·P0, judging that no human voice exists in the first preset time, where th is a first preset threshold value.
3. The method according to claim 2, wherein after the step of judging whether the recognition of the command word is effective based on the comparison result, the method comprises:
if the voice exists in the first preset time, judging that the recognition of the command word is invalid;
and if no voice exists in the first preset time, judging that the recognition of the command word is effective.
4. The method for processing voice command misrecognition according to claim 1, wherein the step of determining whether the number of valid outputs in a second preset time period before the point in time at which the start point is located exceeds a preset threshold value comprises:
obtaining a blank value in the phoneme probability in the second preset time;
if the blank value is greater than a third preset threshold th3, judging that the output is invalid;
identifying the number of effective outputs in the second preset time;
judging whether the number of the effective outputs in the second preset time exceeds a preset threshold value or not;
and if the number of the valid outputs exceeds a preset threshold, judging that the identification of the current command word is invalid.
5. The method for processing voice command misrecognition according to claim 1, wherein before the step of determining the energy characteristic in the time interval to which the command word belongs, the method comprises:
respectively acquiring the power spectrum of each audio frame within the time interval to which the command word belongs and within the first preset time before that interval, and calculating the respective average values as the corresponding energy characteristics.
6. The method for processing voice command misrecognition according to claim 5, wherein after the step of respectively acquiring the power spectrum of each audio frame within the time interval to which the command word belongs and within the first preset time before that interval, the method comprises:
carrying out windowed smoothing on the power spectrum of each audio frame, and updating the power spectrum.
7. A processing apparatus for speech command misrecognition, comprising:
the feature comparison module is used for comparing the energy features in the time interval to which the command word belongs with the energy features of the first preset time before the time interval;
the effective judging module is used for judging whether the recognition of the command word is effective or not based on the comparison result;
the starting point calculation module is used for calculating the recognition starting point of the command word according to the output value of the neural network if the recognition is effective;
the effective output judging module is used for judging whether the number of effective outputs in a second preset time before the time point where the starting point is located exceeds a preset threshold value;
the final recognition effective module is used for executing the current command word if the number of the effective outputs does not exceed the preset threshold value and the recognition of the current command word is effective;
the assignment unit is used for setting the current time at which recognition of the command word is completed as t, where word_len is the phoneme length recognized for the command word;
the condition compliance determination unit is used for taking the position corresponding to the feature maximum value as the starting point if, within the range from t - 2·word_len to t - word_len, the feature maximum value is greater than a second preset threshold th2 and falls on an invalid column;
if the blank probability value of the current column exceeds a third preset threshold th3, the current column is taken as an invalid column;
the feature value at the current time is the number of valid phoneme columns to its right (l_right) plus the number of invalid phoneme columns to its left (l_left), wherein if the feature value reaches the global maximum, it is the feature maximum of the current time.
8. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, carries out the steps of the method of processing a speech command misrecognition according to any one of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the processing method of speech command misrecognition according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311098860.9A CN116825109B (en) | 2023-08-30 | 2023-08-30 | Processing method, device, equipment and medium for voice command misrecognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116825109A CN116825109A (en) | 2023-09-29 |
CN116825109B true CN116825109B (en) | 2023-12-08 |
Family
ID=88139535
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311098860.9A Active CN116825109B (en) | 2023-08-30 | 2023-08-30 | Processing method, device, equipment and medium for voice command misrecognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116825109B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118098206A (en) * | 2024-04-18 | 2024-05-28 | 深圳市友杰智新科技有限公司 | Command word score calculating method, device, equipment and medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008170789A (en) * | 2007-01-12 | 2008-07-24 | Raytron:Kk | Voice section detection apparatus and voice section detection method |
CN110428853A (en) * | 2019-08-30 | 2019-11-08 | 北京太极华保科技股份有限公司 | Voice activity detection method, Voice activity detection device and electronic equipment |
CN111899737A (en) * | 2020-07-28 | 2020-11-06 | 上海喜日电子科技有限公司 | Audio data processing method, device, server and storage medium |
CN113506575A (en) * | 2021-09-09 | 2021-10-15 | 深圳市友杰智新科技有限公司 | Processing method and device for streaming voice recognition and computer equipment |
CN113744722A (en) * | 2021-09-13 | 2021-12-03 | 上海交通大学宁波人工智能研究院 | Off-line speech recognition matching device and method for limited sentence library |
CN114203161A (en) * | 2021-12-30 | 2022-03-18 | 深圳市慧鲤科技有限公司 | Speech recognition method, apparatus, device and storage medium |
WO2022168102A1 (en) * | 2021-02-08 | 2022-08-11 | Rambam Med-Tech Ltd. | Machine-learning-based speech production correction |
CN115132196A (en) * | 2022-05-18 | 2022-09-30 | 腾讯科技(深圳)有限公司 | Voice instruction recognition method and device, electronic equipment and storage medium |
CN115249479A (en) * | 2022-01-24 | 2022-10-28 | 长江大学 | BRNN-based power grid dispatching complex speech recognition method, system and terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108564940B (en) | Speech recognition method, server and computer-readable storage medium | |
US9754584B2 (en) | User specified keyword spotting using neural network feature extractor | |
CN110415699B (en) | Voice wake-up judgment method and device and electronic equipment | |
CN116825109B (en) | Processing method, device, equipment and medium for voice command misrecognition | |
US8930196B2 (en) | System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands | |
US20170256270A1 (en) | Voice Recognition Accuracy in High Noise Conditions | |
US8972264B2 (en) | Method and apparatus for utterance verification | |
CN110767231A (en) | Voice control equipment awakening word identification method and device based on time delay neural network | |
WO2021098153A1 (en) | Method, system, and electronic apparatus for detecting change of target user, and storage medium | |
EP2877992A1 (en) | Feature normalization inputs to front end processing for automatic speech recognition | |
CN113506575B (en) | Processing method and device for streaming voice recognition and computer equipment | |
CN116825108B (en) | Voice command word recognition method, device, equipment and medium | |
WO2022121188A1 (en) | Keyword detection method and apparatus, device and storage medium | |
CN112669818B (en) | Voice wake-up method and device, readable storage medium and electronic equipment | |
GB2576960A (en) | Speaker recognition | |
CN115101063B (en) | Low-computation-power voice recognition method, device, equipment and medium | |
CN111145748B (en) | Audio recognition confidence determining method, device, equipment and storage medium | |
CN112802498A (en) | Voice detection method and device, computer equipment and storage medium | |
CN115831100A (en) | Voice command word recognition method, device, equipment and storage medium | |
CN112189232A (en) | Audio processing method and device | |
CN115910049A (en) | Voice control method and system based on voiceprint, electronic device and storage medium | |
CN116415591A (en) | Equipment control method and device based on user intention recognition | |
CN112908310A (en) | Voice instruction recognition method and system in intelligent electric appliance | |
CN113113001A (en) | Human voice activation detection method and device, computer equipment and storage medium | |
CN115273832B (en) | Training method of wake optimization model, wake optimization method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||