CN112542169A - Voice recognition processing method and device

Voice recognition processing method and device

Info

Publication number
CN112542169A
Authority
CN
China
Prior art keywords
voice
instruction
similarity
speech
recognition
Prior art date
Legal status
Pending
Application number
CN202011560150.XA
Other languages
Chinese (zh)
Inventor
陈姿 (Chen Zi)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011560150.XA
Publication of CN112542169A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L2015/221 Announcement of recognition results
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The application provides a voice recognition processing method, a voice recognition processing apparatus, an electronic device, and a computer-readable storage medium, relating to artificial-intelligence-based speech recognition processing technology. The method includes: for any two voice instructions received in succession among a plurality of voice instructions, namely a first voice instruction and a second voice instruction, performing the following processing: determining the similarity between the first voice instruction and the second voice instruction; and when the similarity exceeds a similarity threshold, determining that the first voice instruction is a misrecognized instruction. Through the method and apparatus, the accuracy of speech recognition can be improved.

Description

Voice recognition processing method and device
Technical Field
The present application relates to speech processing technologies, and in particular, to a speech recognition processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. The key technologies of speech processing technology (Speech Technology) are automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction.
In the related art, speech recognition errors are usually reported manually by the user, for example, by the user choosing to report an error; however, this approach is inefficient, and many users may be unwilling to spend the time reporting errors or may not know how to do so, with the result that many erroneous cases cannot be counted and corrected.
Therefore, in the related art, an effective scheme for intelligently recognizing a voice recognition error is lacking.
Disclosure of Invention
The embodiment of the application provides a voice recognition processing method and device, electronic equipment and a computer readable storage medium, which can improve the accuracy of voice recognition.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a voice recognition processing method, which comprises the following steps:
for any two voice instructions received in succession among a plurality of voice instructions, namely a first voice instruction and a second voice instruction, executing the following processing:
determining the similarity of the first voice instruction and the second voice instruction;
and when the similarity exceeds a similarity threshold, determining that the first voice instruction is a misrecognized instruction.
An embodiment of the present application provides a speech recognition processing apparatus, including:
the processing module is configured to execute the following processing for any two voice instructions received in succession among a plurality of voice instructions, namely a first voice instruction and a second voice instruction: determining the similarity between the first voice instruction and the second voice instruction; and when the similarity exceeds a similarity threshold, determining that the first voice instruction is a misrecognized instruction.
In the foregoing solution, the speech recognition processing apparatus provided in the embodiment of the present application further includes: an updating module configured to combine the misrecognized instruction and the corresponding speech recognition result into a negative sample; acquire the speech recognition model updated based on the negative sample; and recognize newly received speech recognition instructions based on the updated speech recognition model so as to output the corresponding speech recognition results.
In the above scheme, the update module is further configured to upload the negative sample to a state database of a blockchain network for storage; and invoke an intelligent contract in the blockchain network to cause the intelligent contract to perform the following processing: generating a credential of a virtual resource according to the number of uploaded negative samples, wherein the credential of the virtual resource is used to request usage rights of the updated speech recognition model; and obtaining the updated speech recognition model based on the virtual resource.
In the above scheme, the processing module is further configured to determine the receiving time interval between the first voice instruction and the second voice instruction, and when the receiving time interval is smaller than an interval threshold and the similarity exceeds the similarity threshold, determine that the first voice instruction is a misrecognized instruction.
In the above scheme, the processing module is further configured to determine the short-time energy of the first voice instruction and map it to a first vector for cosine similarity; determine the short-time energy of the second voice instruction and map it to a second vector for cosine similarity; and take the cosine similarity between the first vector and the second vector as the similarity between the first voice instruction and the second voice instruction.
In the above scheme, the processing module is further configured to sample the first voice instruction to obtain a plurality of first voice sampling points; perform windowing and framing processing on the plurality of first voice sampling points to obtain a plurality of voice frames; take the sum of the squares of the amplitudes corresponding to the first voice sampling points in each voice frame as the short-time energy of that voice frame; take the short-time energy of each voice frame in the first voice instruction as a component of the short-time energy of the first voice instruction; and combine the components to obtain the short-time energy of the first voice instruction.
In the above scheme, the processing module is further configured to obtain the short-time energy threshold; and when the short-time energy of the voice frame is smaller than the short-time energy threshold value, determining the voice frame to be a mute frame, and removing the voice frame.
In the above scheme, the processing module is further configured to perform pre-emphasis processing on the plurality of first voice sampling points to obtain pre-emphasized first voice sampling points, and to filter the pre-emphasized first voice sampling points to obtain the first voice sampling points used for windowing and framing.
In the above scheme, the processing module is further configured to obtain the maximum absolute value of the amplitudes corresponding to the plurality of first voice sampling points; determine a number of shift digits according to the maximum absolute value; and, for each first voice sampling point, shift the decimal point of the corresponding amplitude to the left by the number of shift digits to obtain the first voice sampling points used for windowing and framing processing.
An embodiment of the present application provides a speech recognition processing apparatus, including:
a memory for storing executable instructions;
and the processor is used for realizing the voice recognition processing method provided by the embodiment of the application when the processor executes the executable instructions stored in the memory.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions and is used for implementing the speech recognition processing method provided by the embodiment of the present application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
by comparing the similarity between the first voice instruction and the second voice instruction, misrecognized instructions can be determined intelligently, which improves the efficiency of error reporting, allows misrecognized instructions to be counted effectively and corrected subsequently, and improves the accuracy of speech recognition.
Drawings
FIG. 1A is a block diagram of an architecture of a speech recognition processing system 100 according to an embodiment of the present application;
fig. 1B is a schematic application diagram of a block chain-based speech recognition processing method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a server 500 provided in an embodiment of the present application;
fig. 3A is a schematic flowchart of a speech recognition processing method according to an embodiment of the present application;
fig. 3B is a schematic flowchart of a speech recognition processing method according to an embodiment of the present application;
fig. 3C is a schematic flowchart of a speech recognition processing method according to an embodiment of the present application;
fig. 4 is a schematic flowchart illustrating a method for implementing a speech recognition processing by a speech recognition client according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a voice recognition processing method cooperatively executed by a terminal and a server according to an embodiment of the present application;
fig. 6 is a schematic flowchart of determining a speech similarity according to an embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second", and "third" are used only to distinguish similar objects and do not denote a particular order; it should be understood that, where permitted, a specific order or sequence may be interchanged so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms and expressions referred to in the embodiments are explained; the following explanations apply to these terms and expressions.
1) Cosine similarity: evaluates the similarity of two vectors by calculating the cosine of the angle between them. The most common application is calculating text similarity: two texts are each turned into a vector according to their words, and the cosine of the angle between the two vectors indicates how statistically similar the two texts are. This has proven to be a very effective method.
2) Short-time energy: short-time energy effectively reflects the magnitude of the signal amplitude and can be used for voiced/unvoiced decisions. The energy of a speech signal changes significantly over time, and short-time energy analysis provides a suitable description of how these amplitude changes unfold.
3) WAV: a sound file format; a standard digital audio file format developed by Microsoft for Windows, capable of recording various mono or stereo sound information.
4) Virtual resource: information resources obtained by means of a database and programming, such as the number of days for which use of the updated speech recognition model may be requested, or the duration for which the updated speech recognition model may be used.
5) Blockchain network: the set of nodes that incorporate new blocks into a blockchain by consensus.
6) Intelligent Contracts (Smart Contracts), also known as chain codes (chaincodes) or application codes, are programs deployed in nodes of a blockchain network, and the nodes execute the intelligent Contracts called in received transactions to perform operations of updating or querying key-value data of a state database.
7) Consensus (Consensus), a process in a blockchain network, is used to agree on a transaction in a block between the nodes involved, the agreed block to be appended to the end of the blockchain and used to update the state database.
At the user level, after a speech recognition error the user needs to upload the same voice many times, which greatly degrades the user experience. In the related art, speech recognition errors are usually reported manually by the user, for example, by the user choosing to report an error; however, this approach is inefficient and the feedback lags severely. Moreover, many users may be unwilling to spend the time reporting errors or may not know how to do so, with the result that many erroneous cases cannot be counted and corrected.
In view of the foregoing technical problems, embodiments of the present application provide a speech recognition processing method, a speech recognition processing apparatus, an electronic device, and a computer-readable storage medium, which can improve accuracy of speech recognition, and an exemplary application of the speech recognition processing method provided by embodiments of the present application is described below.
An exemplary application system architecture in which a terminal and servers cooperate to implement the speech recognition processing method provided by the embodiment of the present application is described below. Referring to fig. 1A, fig. 1A is an architecture schematic diagram of a speech recognition processing system 100 provided by an embodiment of the present application. The speech recognition processing system includes the terminal 400, connected to the voice similarity detection server 300 and the voice recognition server 200 through a network. The terminal and the servers cooperate as follows: the terminal 400 receives a voice instruction and calls the voice recognition model of the voice recognition server 200 to perform speech recognition, and uploads the voice instruction to the voice similarity detection server 300 to determine misrecognized instructions; the voice recognition server 200 updates the voice recognition model based on negative samples (i.e., retrains the voice recognition model). The speech recognition, the determination of misrecognized instructions, and the updating of the speech recognition model are all completed by the servers. It should be noted that, in some embodiments, the voice recognition server 200 and the voice similarity detection server 300 may also be implemented as the same server. These components are described separately below.
The terminal 400 is configured to receive, in sequence, the voice instructions uploaded by the user; upload them to the voice similarity detection server 300 for similarity detection so as to determine the misrecognized instructions among them; and upload them to the voice recognition model in the voice recognition server 200, so that the voice instructions are recognized and the speech recognition results corresponding one-to-one to the voice instructions are output in the human-computer interaction interface of the client.
The voice similarity detection server 300 is configured to determine at least one misrecognized instruction among the multiple voice instructions according to the voice instructions uploaded by the terminal 400, combine each misrecognized instruction and the corresponding speech recognition result into a negative sample, and report the negative sample to the voice recognition server 200.
The speech recognition server 200 is configured to update the speech recognition model based on the negative sample sent by the speech similarity detection server 300; and continuously recognizing the voice recognition instruction newly uploaded by the user based on the updated voice recognition model so as to output a corresponding voice recognition result in the human-computer interaction interface of the client of the terminal 400.
In some embodiments, the speech recognition processing method provided by the embodiment of the present application may also be implemented by the terminal 400 alone, that is, offline speech recognition is performed by a speech recognition model integrated locally in the terminal, and the speech recognition, the determination of misrecognized instructions, and the updating of the speech recognition model are all performed locally in the terminal. The terminal 400 receives a plurality of voice instructions sent by the user, calls the local speech recognition model to recognize them, and outputs the speech recognition results corresponding one-to-one to the voice instructions in the human-computer interaction interface of the client; the terminal 400 automatically determines at least one misrecognized instruction among the plurality of voice instructions and combines each misrecognized instruction and the corresponding speech recognition result into a negative sample; training of the speech recognition model is completed locally on the terminal based on the negative samples to obtain the updated model, and speech recognition instructions newly uploaded by the user continue to be recognized with the updated model so that the corresponding results are output in the human-computer interaction interface of the client.
In other embodiments, the terminal may also receive a voice instruction and call a local voice recognition model to perform voice recognition on the voice instruction, determine an erroneous recognition instruction according to the voice instruction locally at the terminal to form a negative sample, send the negative sample to the voice recognition server 200, and the voice recognition server 200 updates the voice recognition model based on the negative sample; the terminal may obtain the updated speech recognition model from the speech recognition server 200 when accessing the network to update the local speech recognition model.
In the embodiments of the present application, misrecognized instructions and the corresponding speech recognition results are intelligently reported based on the voice instructions received from the user, and the speech recognition model is updated based on them, so that voice instructions newly sent by the user are recognized with the updated speech recognition model, improving the accuracy of speech recognition and the user experience.
The embodiments of the present application may be implemented by means of cloud technology (Cloud Technology), a hosting technology that unifies hardware, software, network, and other resources in a wide area network or a local area network to implement the computation, storage, processing, and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and so on applied on the basis of the cloud computing business model; it can form a resource pool that is used on demand, which is flexible and convenient. Cloud computing technology will become an important support, since the background services of a technical network system require a large amount of computing and storage resources.
As an example, the voice recognition server 200 and the voice similarity detection server 300 may be two independent physical servers, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be cloud servers providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal 400, the voice recognition server 200, and the voice similarity detection server 300 may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited thereto.
An exemplary application of the embodiments of the present application to a blockchain-based network is described below. Referring to fig. 1B, fig. 1B is an application schematic diagram of a block chain based speech recognition processing method provided in the embodiment of the present application, and includes a block chain network 600 (the block chain network 600 includes a node 610-1 and a node 610-2 is exemplarily shown), a speech recognition server 200, a speech similarity detection server 300, and a terminal 400, which are respectively described below.
The voice recognition server 200, the voice similarity detection server 300, and the terminal 400 may all join the blockchain network 600 to become (be mapped to) nodes of it. Fig. 1B exemplarily shows that the voice recognition server 200 is mapped to the node 610-1 of the blockchain network 600 and the voice similarity detection server 300 is mapped to the node 610-2; each node (e.g., node 610-1 and node 610-2) has a consensus function and an accounting function (i.e., a function of maintaining a state database, such as a key-value database).
The state database of each node (e.g., node 610-1 to node 610-2) has the virtual resources of the terminal 400 recorded therein, so that the terminal 400 can determine whether it can request an updated speech recognition model each time it queries the virtual resources recorded in the state database.
In some embodiments, a user initiates a voice instruction in the voice recognition human-machine interface of a client (e.g., an input method or an intelligent voice assistant) of the terminal 400. The client sends the voice instruction (including the terminal identifier) to the node 610-2 mapped by the voice similarity detection server 300 and the node 610-1 mapped by the voice recognition server 200 in the blockchain network 600. The credential of the virtual resource recorded in the state database is queried according to the terminal identifier, and whether an updated speech recognition model can be requested is determined according to that credential; the intelligent contract, integrated with the speech recognition processing logic of the embodiment of the present application, is called according to the voice instruction to query the state database in node 610-1 and determine the updated speech recognition model to use for recognition; the voice instruction is then recognized based on the updated speech recognition model to obtain the corresponding speech recognition result.
The node 610-1 sends the speech recognition result to the node 610-2 mapped by the voice similarity detection server 300 for consensus. After consensus passes, the speech recognition result is signed with the digital signatures of the voice similarity detection server 300 and the voice recognition server 200 and returned; once the client verifies the digital signatures, the speech recognition result is considered reliable, and the result recognized by the updated speech recognition model is output on the human-computer interaction interface of the client.
Meanwhile, the voice similarity detection server 300 determines negative samples from the voice instructions received from the user and uploads them to the blockchain network, and an intelligent contract in the blockchain generates a credential of the virtual resource according to the number of negative samples uploaded by the terminal 400, so that the terminal 400 can determine from the credential whether it has the usage rights to request the updated speech recognition model. The speech recognition server 200 obtains negative samples from the blockchain network 600 to train the speech recognition model and obtain an updated speech recognition model, so that the terminal 400 can request to call the updated model for speech recognition.
In the embodiment of the present application, the blockchain network includes the voice recognition server and the voice similarity detection server, and the consensus mechanism between nodes ensures the reliability of the credential of the virtual resource and the credibility of the determined speech recognition result. Furthermore, determining the usage rights according to the number of negative samples uploaded by the terminal incentivizes terminal users to upload misrecognized instructions, thereby improving the accuracy of speech recognition.
Next, a structure of an electronic device for implementing a speech recognition processing method provided in an embodiment of the present application is described, and as described above, the electronic device provided in the embodiment of the present application may be a server 500 that integrates functions of the speech recognition server 200 in fig. 1A and the speech similarity detection server 300 in fig. 1A. Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 500 provided in the embodiment of the present application, taking an electronic device as the server 500 as an example, the server 500 shown in fig. 2 includes: at least one processor 510, memory 550, and at least one network interface 520. The various components in server 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in fig. 2.
The processor 510 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor or any conventional processor.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
The memory 550 may comprise volatile memory or nonvolatile memory, and may also comprise both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating to other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;
in some embodiments, the speech recognition processing device provided by the embodiments of the present application may be implemented in software, and fig. 2 shows a speech recognition processing device 555 stored in a memory 550, which may be software in the form of programs and plug-ins, and includes the following software modules: a processing module 5551 and an update module 5552, which are logical and thus can be arbitrarily combined or further split depending on the functionality implemented. The functions of the respective modules will be explained below.
Next, the speech recognition processing method provided in the embodiment of the present application is described, taking implementation by the terminal 400 in fig. 1A as an example. Referring to fig. 3A, fig. 3A is a schematic flowchart of a speech recognition processing method according to an embodiment of the present application, and the method will be described with reference to the steps shown in fig. 3A.
In step 101, for any two voice instructions received in succession among the plurality of voice instructions, namely a first voice instruction and a second voice instruction, the following processing is executed: the similarity between the first voice instruction and the second voice instruction is determined.
In some embodiments, determining the similarity between the first voice instruction and the second voice instruction may be implemented as follows: determining the short-time energy of the first voice instruction and mapping it to a first vector for cosine similarity; determining the short-time energy of the second voice instruction and mapping it to a second vector for cosine similarity; and taking the cosine similarity between the first vector and the second vector as the similarity between the first voice instruction and the second voice instruction.
For example, the short-time energy of the first voice instruction is taken as the first vector $A$ in the cosine similarity, and the short-time energy of the second voice instruction is taken as the second vector $B$; the speech similarity is then calculated as formula (1):

$$\text{similarity} = \cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}} \qquad (1)$$
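As an illustration only (not part of the original disclosure), a minimal Python sketch of formula (1); the function and variable names and the example energy values are assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two short-time energy vectors (formula (1)).

    Assumes the two vectors have the same length; in practice the shorter
    instruction would be padded or truncated first.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero energy vector as dissimilar
    return dot / (norm_a * norm_b)

# Hypothetical short-time energy vectors of two successive voice instructions.
energy_first = [0.12, 0.80, 0.64, 0.05]
energy_second = [0.10, 0.78, 0.66, 0.07]
print(cosine_similarity(energy_first, energy_second))  # near 1.0 -> likely the same utterance
```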
as an example, determining the short-time energy of the first voice instruction may be accomplished as follows: sampling the first voice instruction to obtain a plurality of first voice sampling points; performing windowing and framing processing on the plurality of first voice sampling points to obtain a plurality of voice frames; taking the sum of the squares of the amplitudes corresponding to the first voice sampling points in each voice frame as the short-time energy of that voice frame; taking the short-time energy of each voice frame in the first voice instruction as a component of the short-time energy of the first voice instruction; and combining the components to obtain the short-time energy of the first voice instruction.
It should be noted that sampling the first voice instruction may be implemented as follows: the first voice instruction is sampled at a sampling frequency to obtain a plurality of first voice sampling points. The sampling process converts a continuous analog audio signal (i.e., the first voice instruction) into a digital speech signal; the plurality of first voice sampling points constitute the digital speech signal. The sampling frequency is selected according to the bandwidth of the first voice instruction, so that frequency-domain aliasing distortion of the signal is avoided.
Windowing and framing are performed by dividing the digital speech signal into short segments (i.e., speech frames) and weighting the digital speech signal with a movable finite-length window (i.e., a window function). Framing may use an overlapping segmentation method so that frames transition smoothly and maintain continuity. The window function may be chosen as a rectangular window, a Hamming window, etc.
For example, let $x(n)$ be the digital speech signal, $w(n)$ the window function, and $N$ the width of the window. The short-time energy of a frame is calculated as formula (2):

$$E = \sum_{n=0}^{N-1} \left[ x(n)\, w(n) \right]^{2} \qquad (2)$$
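A minimal Python sketch of frame-by-frame short-time energy per formula (2); the Hamming window choice and the frame and hop sizes are illustrative assumptions (the patent does not fix these values):

```python
import numpy as np

def short_time_energy(x, frame_len=256, hop=128):
    """Per-frame short-time energy of digital speech signal x (formula (2)).

    hop < frame_len gives overlapping frames, so adjacent frames transition
    smoothly; each frame is weighted by a Hamming window before squaring.
    """
    window = np.hamming(frame_len)
    energies = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        energies.append(np.sum(frame ** 2))
    return np.array(energies)

# Example on a synthetic signal standing in for sampled voice-instruction audio.
signal = np.sin(np.linspace(0.0, 100.0, 4000))
print(short_time_energy(signal)[:5])
```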
in some embodiments, after the step of performing windowing and framing on the plurality of first speech samples to obtain the plurality of speech frames, the following steps may be further performed: acquiring a short-time energy threshold; and when the short-time energy of the voice frame is less than the short-time energy threshold value, determining the voice frame as a mute frame, and removing the voice frame.
It should be noted that the short-time energy threshold may be a short-time energy average value, and the mute frame may be a noise frame with a small volume (e.g., background noise of an environment where the speaker is located).
In the embodiment of the application, for each voice frame, when its short-time energy is smaller than the short-time energy threshold, the frame is judged to be a mute frame or a noise frame and removed; this effectively removes mute frames and low-volume noise frames from the digital speech signal.
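Continuing the sketch above (an illustration, not the patent's code), silence and noise frames can be dropped by thresholding the per-frame energies, here using the mean energy as the short-time energy threshold as the text suggests:

```python
import numpy as np

# energies: per-frame short-time energy, e.g. from short_time_energy() above.
energies = np.array([0.90, 0.01, 0.75, 0.02, 0.80])
threshold = energies.mean()      # threshold may be the average short-time energy
voiced = energies >= threshold   # frames below the threshold are mute/noise frames
print(np.nonzero(voiced)[0])     # indices of the speech frames that are kept
```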
In step 102, when the similarity exceeds the similarity threshold, the first voice instruction is determined to be a misrecognition instruction.
For example, the terminal receives two voice instructions in sequence (a first voice instruction and a second voice instruction). When the similarity between the first voice instruction received earlier and the second voice instruction received later exceeds the similarity threshold, this indicates that the two are the same utterance; the first voice instruction is therefore determined to be a misrecognized instruction, that is, the terminal recognized the first voice instruction incorrectly, which caused the terminal user to input the same voice instruction again. In the embodiment of the application, the terminal can thus intelligently determine whether a received voice instruction has been misrecognized.
In some embodiments, determining the similarity between any two successively received voice instructions may also be implemented as follows: determining the similarity between a third voice instruction and a preset voice instruction, where the operation of determining the preset voice instruction is performed each time a voice instruction is received and may be implemented as follows: when the similarity between a fourth voice instruction and a fifth voice instruction is less than or equal to the similarity threshold, determining the fifth voice instruction to be the preset voice instruction.
In the embodiment of the application, if a plurality of consecutively received voice instructions are all the same utterance, each of them is compared for similarity against the preset voice instruction (i.e., the fifth voice instruction); the terminal only needs to store the preset voice instruction and does not need to update it for every similarity comparison, saving the terminal's computing resources.
In some embodiments, determining at least one misrecognized instruction among the plurality of voice instructions may be accomplished as follows: for any two voice instructions received in succession, namely a first voice instruction and a second voice instruction, performing the following processing: determining the similarity between the first voice instruction and the second voice instruction; determining the receiving time interval between the first voice instruction and the second voice instruction; and when the receiving time interval is smaller than an interval threshold and the similarity between the first voice instruction and the second voice instruction exceeds the similarity threshold, determining that the first voice instruction is a misrecognized instruction. Alternatively, when the similarity between the first voice instruction and the second voice instruction exceeds the similarity threshold, the first voice instruction is determined to be a misrecognized instruction.
It should be noted that the receiving time interval between the first voice instruction and the second voice instruction needs to be smaller than the interval threshold; if the interval is too long, the same voice instruction may simply have been sent twice in two different usage scenarios, for example, saying "come home for dinner" at six o'clock every evening. In that case, because the receiving time interval is greater than the interval threshold, the same utterance spoken on the following day will not be determined to be a misrecognized voice instruction.
In the embodiment of the application, misrecognized instructions can be determined and reported intelligently by comparing the similarity between the first voice instruction and the second voice instruction. Since treating identical voice instructions from discontinuous session scenarios as repetitions would make the determination of misrecognized instructions inaccurate, detecting the time interval between voice instructions improves the accuracy of that determination, as shown in the sketch below.
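A sketch of the two-condition variant, reusing cosine_similarity from the earlier sketch; the threshold values and the (time, energy-vector) representation are illustrative assumptions:

```python
def is_misrecognized(first, second, sim_threshold=0.9, interval_threshold=10.0):
    """Decide whether `first` is a misrecognized instruction.

    `first` and `second` are (receive_time_seconds, energy_vector) pairs for
    two successively received voice instructions; `second` was received later.
    """
    t_first, v_first = first
    t_second, v_second = second
    within_interval = (t_second - t_first) < interval_threshold  # same session
    similar = cosine_similarity(v_first, v_second) > sim_threshold
    return within_interval and similar
```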
In some embodiments, after the step of sampling the first voice instruction to obtain a plurality of first voice sampling points, the following steps may also be performed: pre-emphasizing the plurality of first voice sampling points to obtain pre-emphasized first voice sampling points; and filtering the pre-emphasized first voice sampling points to obtain the first voice sampling points used for windowing and framing.
It should be noted that pre-emphasis of the plurality of voice sampling points (the digital speech signal) is implemented by a first-order finite impulse response (FIR) high-pass digital filter with transfer function $H(z) = 1 - a z^{-1}$, where $a$ is the pre-emphasis coefficient and typically $a \in (0.9, 1.0)$. For example, if the digital speech signal at time $n$ is $x(n)$, the output after pre-emphasis is $y(n) = x(n) - 0.98\,x(n-1)$, so that the amplitude of the high-frequency part of the digital speech signal is increased.
Pre-emphasizing the high-frequency part of the voice sampling points eliminates the influence of lip radiation (the effect of the vocal cords and lips during phonation) and increases the high-frequency resolution of the speech.
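A minimal sketch of the pre-emphasis filter $y(n) = x(n) - a\,x(n-1)$; passing the first sample through unchanged is a common convention, not something the patent specifies:

```python
import numpy as np

def pre_emphasize(x, a=0.98):
    """First-order FIR high-pass pre-emphasis, H(z) = 1 - a*z^{-1}, a in (0.9, 1.0)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.empty_like(x)
    y[0] = x[0]                # convention: the first sample has no predecessor
    y[1:] = x[1:] - a * x[:-1]
    return y

print(pre_emphasize([1.0, 1.0, 2.0, 0.5]))  # -> [1.0, 0.02, 1.02, -1.46]
```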
In some embodiments, after the step of sampling the first voice instruction to obtain a plurality of first voice sampling points, the following steps may also be performed: obtaining the maximum absolute value of the amplitudes corresponding to the plurality of first voice sampling points; determining a number of shift digits according to the maximum absolute value; and, for each first voice sampling point, shifting the decimal point of the corresponding amplitude to the left by the number of shift digits to obtain the first voice sampling points used for windowing and framing.
In some examples, the digital speech signal is normalized by shifting the position of the decimal point. The normalized amplitude corresponding to a first voice sampling point is calculated as formula (3):

$$x'(n) = \frac{x(n)}{10^{k}} \qquad (3)$$

where $x(n)$ is the amplitude corresponding to the sampling point and $k$ is the number of shift digits determined from the maximum absolute amplitude.
in other embodiments, after the step of sampling the first voice instruction to obtain a plurality of first voice sampling points, normalization may also be performed as follows: the amplitudes corresponding to the first voice sampling points are projected into a specified interval $[\min, \max]$, using the minimum value $x_{\min}$ and the maximum value $x_{\max}$ of those amplitudes. The normalized amplitude is calculated as formula (4):

$$x'(n) = \min + \frac{x(n) - x_{\min}}{x_{\max} - x_{\min}}\,(\max - \min) \qquad (4)$$
in other embodiments, after the step of sampling the first voice instruction to obtain a plurality of first voice sampling points, normalization may also be performed as follows: the amplitudes corresponding to the first voice sampling points are converted into a normal-distribution form, making the results easy to compare. The normalized amplitude is calculated as formula (5):

$$x'(n) = \frac{x(n) - \mu}{\sigma} \qquad (5)$$

where $\mu$ and $\sigma$ are the mean and the standard deviation of the amplitudes.
the embodiment of the application thus makes the amplitudes corresponding to the first voice sampling points conform to a given rule before windowing and framing, which satisfies normalization requirements, reduces computation cost, and facilitates data mining.
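Minimal Python sketches of the three normalizations in formulas (3) to (5); the derivation of the shift count k in formula (3) and the edge-case handling are assumptions, and all three assume a non-constant, nonzero signal:

```python
import numpy as np

def normalize_decimal_shift(x):
    """Formula (3): shift the decimal point left by k digits, with k derived
    from the maximum absolute amplitude (this derivation of k is an assumption)."""
    k = int(np.ceil(np.log10(np.max(np.abs(x)))))
    return x / (10.0 ** k)

def normalize_min_max(x, lo=0.0, hi=1.0):
    """Formula (4): project the amplitudes into the specified interval [lo, hi]."""
    return lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

def normalize_z_score(x):
    """Formula (5): convert the amplitudes to zero-mean, unit-variance form."""
    return (x - x.mean()) / x.std()

amplitudes = np.array([120.0, -340.0, 55.0, 980.0])
print(normalize_decimal_shift(amplitudes))
print(normalize_min_max(amplitudes))
print(normalize_z_score(amplitudes))
```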
In some embodiments, referring to fig. 3B, fig. 3B is a schematic flowchart of a speech recognition processing method provided in an embodiment of the present application, and after step 102 shown in fig. 3A, steps 103 to 105 may also be executed, which will be described in conjunction with the steps.
In step 103, the misrecognized instruction and the corresponding speech recognition result are combined into a negative sample.
In some embodiments, a voice instruction manually marked as erroneous may also be used as a misrecognized instruction; for example, when the first speech recognition result is output on the human-computer interaction interface of the client, a correct-or-incorrect feedback entry is additionally displayed (e.g., "the first speech recognition result is incorrect") to collect whether the first voice instruction was correctly recognized or misrecognized.
In some embodiments, whether the first voice instruction was correctly recognized or misrecognized may also be determined through a specific voice instruction. For example, after the human-computer interaction interface of the client outputs the recognition result of the first voice instruction, if a specific preset voice instruction indicating that the speech recognition result is wrong is received, such as "re-recognize" or "recognition error", the first speech recognition result corresponding to the first voice instruction is no longer output, the first voice instruction is treated as a misrecognized instruction, and the newly received voice instruction is re-recognized so as to output the corresponding speech recognition result.
In the embodiment of the application, a manual feedback entry is added, and combining intelligent reporting with manual reporting avoids missed error reports; reporting errors through a specific voice instruction gives the user an additional reporting channel, improves the user experience, stops the output of erroneous recognition results in time, and reduces computing resources.
In step 104, the speech recognition model updated based on the negative samples is obtained.
In some embodiments, the speech recognition model is retrained based on the negative samples, and the retrained model is the updated speech recognition model.
In some examples, when the speech recognition model is retrained on negative samples, the weight of each negative sample may also be determined according to the degree of standardization of the misrecognized instruction. That is, the degree of standardization of the misrecognized instruction (e.g., how closely it approximates standard Mandarin pronunciation) is determined; the negative-sample weight corresponding to the misrecognized instruction is looked up from the correspondence between degrees of standardization of voice instructions and sample weights; and the speech recognition model is retrained on the weighted negative samples. In this way, the degree to which each negative sample influences the update training of the speech recognition model is controlled: the degree of standardization of a voice instruction is positively correlated with its sample weight, i.e., the higher the degree of standardization, the greater its influence on model training.
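An illustrative lookup of negative-sample weights from the degree of standardization (the table values and the [0, 1] scale are assumptions; the patent only states the positive correlation):

```python
# (minimum degree of standardization, sample weight), highest tier first.
STANDARDIZATION_TO_WEIGHT = [(0.9, 1.0), (0.7, 0.6), (0.0, 0.3)]

def negative_sample_weight(standardization):
    """The higher the degree of standardization, the greater the weight."""
    for floor, weight in STANDARDIZATION_TO_WEIGHT:
        if standardization >= floor:
            return weight
    return 0.0

print(negative_sample_weight(0.95))  # -> 1.0
print(negative_sample_weight(0.50))  # -> 0.3
```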
In some embodiments, the speech recognition model may also be updated based on positive and negative samples: when the similarity between the first voice instruction and the second voice instruction is less than or equal to the similarity threshold, the first voice instruction is determined to be a correctly recognized instruction; the correctly recognized instruction and the corresponding speech recognition result are combined into a positive sample; the misrecognized instruction and the corresponding speech recognition result are combined into a negative sample; and the speech recognition model is retrained on the positive and negative samples, the retrained model being the updated speech recognition model.
In some embodiments, when the terminal acquires the speech recognition model updated on the negative samples, it may also acquire usage rights for the updated model (e.g., a number of uses or a duration of use) according to the number of negative samples it produced. That is, obtaining the updated speech recognition model can be realized as follows: determining the number of negative samples produced by the terminal, determining from that number the usage rights with which the terminal may request the updated speech recognition model, and deciding based on those usage rights whether to return the updated model. The number of negative samples produced is positively correlated with the usage rights: the more negative samples produced, the greater the usage rights.
In the embodiment of the application, the speech recognition model is updated with the intelligently reported misrecognized instructions and the corresponding speech recognition results, which improves the error-correction capability of the model and thus the accuracy of speech recognition.
In step 105, a newly received speech recognition instruction is recognized based on the updated speech recognition model to output a corresponding speech recognition result.
In some embodiments, speech features are extracted from the newly received speech recognition instruction; the updated speech recognition model maps the speech features to probabilities over a plurality of speech recognition results; and the speech recognition result with the maximum probability is taken as the recognition result of the newly received instruction. The first voice instruction, the second voice instruction, and so on are all speech recognition instructions.
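A sketch of the argmax step described above; `model` is assumed to map a feature vector to a dict of {recognition result: probability}, which is an illustrative interface, not the patent's:

```python
def recognize(features, model):
    """Return the speech recognition result with the maximum probability."""
    probabilities = model(features)                   # {result_text: probability}
    return max(probabilities, key=probabilities.get)  # argmax over the results

# Hypothetical usage with a stand-in model.
fake_model = lambda feats: {"turn on the light": 0.91, "turn off the light": 0.09}
print(recognize([0.1, 0.2], fake_model))  # -> "turn on the light"
```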
In some embodiments, a plurality of voice instructions are received in sequence, and the speech recognition model is called to recognize them so as to output the speech recognition results corresponding one-to-one to the voice instructions; at least one misrecognized instruction among the plurality of voice instructions is determined; the misrecognized instruction and the corresponding speech recognition result are combined into a negative sample; the speech recognition model updated based on the negative sample is acquired; and newly received speech recognition instructions continue to be recognized based on the updated model so as to output the corresponding speech recognition results.
In some embodiments, the terminal receives the voice command, and may also recognize the voice command by calling a voice recognition model integrated in the terminal, so as to output a voice recognition result corresponding to the voice command in a human-computer interaction interface of the terminal.
According to the method and the apparatus, misrecognized instructions are determined automatically among the received voice instructions, and newly received speech recognition instructions are recognized with the speech recognition model updated on the misrecognized instructions and their corresponding speech recognition results, so that corresponding results are output and misrecognized instructions together with their erroneous recognition results can be reported intelligently. Updating the speech recognition model on the misrecognized instructions and their recognition results improves the model's error-correction capability, and performing speech recognition with the updated model improves recognition accuracy.
In the embodiment of the application, the speech recognition is carried out based on the updated speech recognition model, and the accuracy of the speech recognition is improved.
In some embodiments, referring to fig. 3C, fig. 3C is a flowchart of a speech recognition processing method provided in this embodiment, and step 104 shown in fig. 3A can also be implemented by executing step 1041 to step 1043, which will be described with reference to each step.
In step 1041, the negative samples are uploaded to the state database of the blockchain network for storage.
In step 1042, the intelligent contract in the blockchain network is invoked, so that the intelligent contract performs the following processing: generating the credential of the virtual resource according to the number of uploaded negative samples, where the credential of the virtual resource is used to request usage rights of the updated speech recognition model.
In step 1043, an updated speech recognition model is obtained based on the virtual resource.
In some embodiments, the terminal uploads the determined negative samples to the state database of the blockchain network for storage; the blockchain network generates the credential of the virtual resource according to the number of negative samples uploaded by the terminal, so that the updated speech recognition model can be obtained based on the virtual resource. When the terminal requests the speech recognition model updated on the negative samples, the credential of the virtual resource recorded in the state database of the blockchain network is queried to determine whether to return the updated model.
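An in-memory simulation of the credential logic only (a real deployment would run this as a smart contract in the blockchain network's runtime; all names and the one-sample-per-use rate are assumptions):

```python
class CredentialContract:
    """Tracks uploaded negative samples and derives virtual-resource credentials."""

    def __init__(self):
        self.uploaded = {}  # terminal_id -> count of uploaded negative samples

    def upload_negative_sample(self, terminal_id):
        self.uploaded[terminal_id] = self.uploaded.get(terminal_id, 0) + 1

    def virtual_resource(self, terminal_id, samples_per_use=1):
        """Credential grows with the number of uploaded negative samples."""
        return self.uploaded.get(terminal_id, 0) // samples_per_use

    def may_request_updated_model(self, terminal_id):
        return self.virtual_resource(terminal_id) > 0

contract = CredentialContract()
contract.upload_negative_sample("terminal-400")
print(contract.may_request_updated_model("terminal-400"))  # -> True
```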
In the embodiment of the application, the usage rights are determined according to the number of negative samples uploaded by the terminal, which incentivizes terminal users to upload misrecognized instructions and thereby improves the accuracy of speech recognition.
Next, an exemplary application of the embodiment of the present application in a practical application scenario is described, taking a voice recognition client as an example. The client receives a plurality of voice instructions sent by the user and calls the speech recognition model to recognize them, outputting the speech recognition results corresponding one-to-one to the voice instructions; it detects the similarity between the voice instructions to determine the misrecognized instructions among them, and updates the speech recognition model based on the misrecognized instructions and the corresponding speech recognition results, improving the model's error-correction capability. Based on the updated speech recognition model, voice instructions newly uploaded by the user continue to be recognized, so that the corresponding speech recognition results are output in the human-computer interaction interface of the terminal's client, improving the accuracy of speech recognition. This solves the problem that erroneous speech recognitions cannot be counted and corrected, reduces manual testing costs, and improves the user experience. The speech recognition processing method implemented by a speech recognition client is described below; referring to fig. 4, fig. 4 is a schematic flowchart of a method in which a speech recognition client implements speech recognition processing according to an embodiment of the present application. The details are as follows:
Step 401: the client receives the voice input by the user and outputs a voice recognition result. The voice input by the user is the first voice instruction, and the client is the voice recognition client in the terminal. A voice recognition model is called to perform voice recognition on the voice input by the user, and the corresponding voice recognition result is determined and output in the human-computer interaction interface of the voice recognition client.
Step 402: the voice similarity detection server compares the similarity between the voice input by the user and the latest stored voice. If the similarity between the voice input by the user and the latest voice exceeds the similarity threshold, step 403 is executed; if the similarity does not exceed the similarity threshold, step 404 is executed.
Step 403: the voice similarity detection server reports the misrecognition instruction to the voice recognition server. When the similarity between the voice input by the user and the latest voice exceeds the similarity threshold, the latest voice is determined to be misrecognized (namely, a misrecognition instruction) and is reported to the voice recognition server.
Step 404: the voice similarity detection server stores the latest voice. When the similarity between the voice input by the user and the latest voice is less than or equal to the similarity threshold, the voice input by the user is taken as the new latest voice and stored for similarity comparison with the voice input by the user next time.
Step 405: the voice recognition server performs model training based on the misrecognized voice. The misrecognized voice and its corresponding voice recognition result are combined into a negative sample, and the voice recognition model is retrained based on the negative sample to obtain an updated voice recognition model.
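The client-side reporting logic of steps 401 to 405 can be sketched as follows; this is a minimal Python sketch in which recognize, similarity, and report stand in for the voice recognition model, the similarity measure of formula (2), and the reporting channel to the voice recognition server, and the threshold value is an assumption.

    # Sketch of the detection loop in steps 401-405.
    # recognize, similarity, and report are placeholders; the threshold is assumed.

    SIMILARITY_THRESHOLD = 0.9  # assumed value; the embodiment does not fix one

    last_voice, last_result = None, None

    def handle_voice(voice, recognize, similarity, report):
        """Step 401: recognize the voice; steps 402-404: compare it with the last voice."""
        global last_voice, last_result
        result = recognize(voice)
        if last_voice is not None and similarity(voice, last_voice) > SIMILARITY_THRESHOLD:
            # Step 403: the previous voice is treated as misrecognized and reported
            # together with its recognition result (a negative sample for step 405).
            report(last_voice, last_result)
        else:
            # Step 404: store the new voice for comparison with the next input.
            last_voice, last_result = voice, result
        return result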
In some embodiments, the voice recognition processing method provided in the embodiment of the present application may be cooperatively executed by a terminal and a server, referring to fig. 5, where fig. 5 is a schematic flowchart of the voice recognition processing method provided in the embodiment of the present application cooperatively executed by the terminal and the server. The following is a detailed description:
Step 501: the client receives the voice input by the user. The voice is received through a medium such as a remote controller, a mobile phone applet, or a far-field microphone.
Step 502: the client uploads the voice to the voice recognition server. The client uploads the received user voice to the voice recognition server for voice recognition, so that a voice recognition result is returned.
Step 503: the client uploads the voice to the voice similarity detection server. The client uploads the received user voice to the voice similarity detection server to detect misrecognized voice.
It should be noted that the execution order of step 502 and step 503 may be swapped, and the two steps may also be performed in parallel.
Step 504: the voice similarity detection server detects the similarity between the voice input by the user and the latest voice. The voice similarity detection server compares the received voice input by the user with the latest voice stored in the server, the latest voice being the previously stored voice instruction.
Step 505: the voice recognition server recognizes the voice and returns the voice recognition result. The voice recognition server performs voice recognition on the voice uploaded by the client.
Step 506: the client displays the voice recognition result. The client receives the voice recognition result returned by the voice recognition server and displays it on the human-computer interaction interface of the client.
For example, the client receives voice 1 from the user; the voice recognition server recognizes voice 1 based on the voice recognition model and returns voice recognition result 1 corresponding to voice 1, which is displayed on the client. Meanwhile, the voice similarity detection server performs similarity detection on voice 1; since the similarity between voice 1 and the latest voice stored in the server is smaller than the similarity threshold, voice 1 is judged to be a new voice and is stored as the latest voice. The client then receives voice 2 from the user; the voice recognition server recognizes voice 2 based on the voice recognition model and returns voice recognition result 2, which is displayed on the client. Meanwhile, the voice similarity detection server performs similarity detection on voice 2 and detects that the similarity between voice 2 and the latest voice is greater than the similarity threshold; voice 1 and voice 2 are therefore determined to be the same voice, voice 1 is judged to be misrecognized, and voice 1 together with its corresponding voice recognition result 1 is reported to the voice recognition server. The voice recognition server retrains the voice recognition model with voice 1 and voice recognition result 1 as a negative sample. The client then receives voice 3 from the user; the voice recognition server recognizes voice 3 based on the retrained voice recognition model and returns voice recognition result 3, which is displayed on the client. Meanwhile, the voice similarity detection server performs similarity detection on voice 3 and detects that the similarity between voice 3 and the latest voice is smaller than the similarity threshold; voice 3 is therefore judged to be a new voice and is stored as the latest voice, for similarity detection against the next voice newly input by the user so as to determine misrecognized voice.
The process of determining the voice similarity is described in detail below. Referring to fig. 6, fig. 6 is a schematic flowchart of determining the voice similarity according to an embodiment of the present application. The complete process is as follows:
Step 601: acquire the raw audio data. The raw audio data is the first voice instruction.
Step 602: acquire the audio signal. The audio signal (e.g., wav audio data) is a digital voice signal in which a signed number is read every 16 bits as one sampling value. The raw audio data is then normalized: the maximum value max_value (the maximum of the absolute values) of all sampling values is obtained, and all sampling values are divided by max_value. Because the amplitudes of the sampling values are widely distributed, normalization converts the signal into a uniform standard form, adjusting the amplitudes of all sampling values into [-1, 1].
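A minimal Python sketch of this step, assuming the raw data is 16-bit signed little-endian PCM (the function name is illustrative):

    import numpy as np

    def load_and_normalize(pcm_bytes):
        """Step 602: read a signed number every 16 bits as one sampling value,
        then scale all amplitudes into [-1, 1] by the maximum absolute value."""
        samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float64)
        max_value = np.max(np.abs(samples))  # max_value over all sampling values
        return samples / max_value if max_value > 0 else samples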
Step 603: perform high-pass filtering on the audio signal. Low-frequency interference is filtered out; in the audio processing, a first-order high-pass filter is applied in the form of pre-emphasis to filter out 50 Hz low-frequency interference.
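The pre-emphasis form of a first-order high-pass filter can be sketched as below; the coefficient 0.97 is a common default and is an assumption here, since the embodiment only states that 50 Hz low-frequency interference is removed.

    import numpy as np

    def pre_emphasis(samples, alpha=0.97):
        """Step 603: first-order high-pass filtering, y[n] = x[n] - alpha * x[n-1],
        which attenuates low-frequency interference such as 50 Hz hum."""
        filtered = np.copy(samples)
        filtered[1:] = samples[1:] - alpha * samples[:-1]
        return filtered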
Step 604: the short-time energy of the audio signal is calculated. The calculation formula of the short-time energy is shown in formula (1).
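Formula (1) is not reproduced in this excerpt; the sketch below assumes the conventional definition, consistent with claim 6: the signal is windowed and framed, and the short-time energy of each frame is the square sum of the amplitudes of its sampling points. The frame length, hop length, and window choice are assumptions.

    import numpy as np

    def short_time_energy(samples, frame_len=256, hop=128):
        """Step 604: window and frame the signal, then take the square sum of
        the amplitudes in each frame as that frame's short-time energy."""
        window = np.hamming(frame_len)  # window choice is an assumption
        energies = []
        for start in range(0, len(samples) - frame_len + 1, hop):
            frame = samples[start:start + frame_len] * window
            energies.append(np.sum(frame ** 2))
        return np.array(energies)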
Step 605: intercept the valid data of the audio signal. The head and tail mute segments are removed to obtain the effective energy interval.
Step 606: calculate the voice similarity. After the effective short-time energy of the user's voice is obtained, the cosine distance is calculated to obtain the voice similarity between the voice input by the user and the latest voice. The calculation of the voice similarity is shown in formula (2).
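Steps 605 and 606 can be sketched together as follows. Formula (2) is not reproduced in this excerpt, so the cosine similarity below is the conventional definition; the mute threshold and the truncation of the two energy vectors to a common length are assumptions.

    import numpy as np

    def trim_silence(energies, threshold=1e-4):
        """Step 605: keep only the effective energy interval by removing the
        head and tail mute segments (the threshold value is an assumption)."""
        keep = np.nonzero(energies >= threshold)[0]
        return energies[keep[0]:keep[-1] + 1] if keep.size else energies

    def cosine_similarity(e1, e2):
        """Step 606: cosine similarity between two effective short-time energy
        vectors, truncated to a common length for comparison."""
        n = min(len(e1), len(e2))
        a, b = e1[:n], e2[:n]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0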
According to the embodiment of the present application, the misrecognized voice and its corresponding voice recognition result are intelligently reported based on the received voice input by the user, and the voice recognition model is updated based on the misrecognized voice and the corresponding result, so that voice newly input by the user is recognized with the updated model. By automatically detecting misrecognized voice, the accuracy of voice recognition is improved intelligently, manual participation is reduced, and the voice user experience is improved.
Continuing with the exemplary structure of the speech recognition processing device 555 provided by the embodiments of the present application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules of the speech recognition processing device 555 stored in the memory 550 may include: a processing module 5551, configured to execute the following processing for any two voice instructions received in succession among a plurality of voice instructions, namely a first voice instruction and a second voice instruction: determining the similarity of the first voice instruction and the second voice instruction; and when the similarity exceeds a similarity threshold, determining that the first voice instruction is a misrecognition instruction.
In some embodiments, a speech recognition processing apparatus provided in an embodiment of the present application further includes: an update module 5552, configured to combine the misrecognition instruction and the corresponding speech recognition result into a negative sample; acquiring a voice recognition model updated based on the negative sample; and recognizing the newly received voice recognition instruction based on the updated voice recognition model so as to output a corresponding voice recognition result.
In some embodiments, the update module 5552 is further configured to upload the negative samples to a state database of a blockchain network for storage; invoke a smart contract in the blockchain network to cause the smart contract to perform the following: generating a credential of the virtual resource according to the number of uploaded negative samples, wherein the credential of the virtual resource is used to request usage rights of the updated speech recognition model; and obtain the updated speech recognition model based on the virtual resource.
In some embodiments, the processing module 5551 is further configured to determine a receiving time interval of the first voice instruction and the second voice instruction; when the receiving time interval is smaller than the interval threshold and the similarity exceeds the similarity threshold, determining that the first voice instruction is a misrecognition instruction.
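This interval gating can be sketched as below; a minimal Python sketch in which both threshold values are assumptions, since the embodiment does not fix them.

    INTERVAL_THRESHOLD = 5.0    # seconds; assumed value
    SIMILARITY_THRESHOLD = 0.9  # assumed value

    def is_misrecognition(first_time, second_time, similarity):
        """Flag the first voice instruction only when the two instructions arrive
        close together and sound alike, so that identical instructions from
        separate sessions are not treated as misrecognitions."""
        return (second_time - first_time) < INTERVAL_THRESHOLD and similarity > SIMILARITY_THRESHOLD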
In some embodiments, the processing module 5551 is further configured to determine a short-time energy of the first voice instruction, and map the short-time energy of the first voice instruction to a first vector in cosine similarity; determining the short-time energy of the second voice instruction, and mapping the short-time energy of the second voice instruction to be a second vector in the cosine similarity; and taking the similarity between a first vector in the cosine similarity and a second vector in the cosine similarity as the similarity between the first voice instruction and the second voice instruction.
In some embodiments, the processing module 5551 is further configured to sample the first voice instruction to obtain a plurality of first voice sampling points; perform windowing and framing processing on the plurality of first voice sampling points to obtain a plurality of voice frames; take the square sum of the amplitudes corresponding to the first voice sampling points in each voice frame as the short-time energy of that voice frame; take the short-time energy of each voice frame in the first voice instruction as a component of the short-time energy of the first voice instruction; and integrate the components to obtain the short-time energy of the first voice instruction.
In some embodiments, the processing module 5551 is further configured to obtain the short-time energy threshold; and when the short-time energy of the voice frame is smaller than the short-time energy threshold value, determining the voice frame to be a mute frame, and removing the voice frame.
In some embodiments, the processing module 5551 is further configured to perform pre-emphasis processing on the plurality of first voice sampling points to obtain pre-emphasized first voice sampling points, and to filter the pre-emphasized first voice sampling points to obtain the first voice sampling points for the windowing and framing processing.
In some embodiments, the processing module 5551 is further configured to obtain the maximum absolute value of the amplitudes corresponding to the plurality of first voice sampling points; determine the number of moving bits according to the maximum absolute value; and, for each first voice sampling point, move the decimal point of the corresponding amplitude to the left by the number of moving bits to obtain the first voice sampling points for the windowing and framing processing.
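A minimal Python sketch of this decimal-shift normalization, assuming "number of moving bits" means the count of integer digits of the maximum absolute amplitude (the function name is illustrative); dividing by 10**k then bounds every sampling value within (-1, 1).

    import math

    def shift_normalize(samples):
        """Move the decimal point of each amplitude left by the number of moving
        bits determined from the maximum absolute value (divide by 10**k)."""
        max_abs = max(abs(s) for s in samples)
        if max_abs == 0:
            return list(samples)
        k = int(math.floor(math.log10(max_abs))) + 1  # number of integer digits
        return [s / (10 ** k) for s in samples]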
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the speech recognition processing method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions, which when executed by a processor, will cause the processor to perform a method provided by embodiments of the present application, for example, a speech recognition processing method as shown in fig. 3A, 3B, and 3C.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiment of the present application, a newly received voice recognition instruction is recognized based on the voice recognition model updated with the misrecognition instruction and its corresponding voice recognition result, so that the misrecognition instruction and the corresponding erroneous voice recognition result can be reported intelligently. Considering that identical voice instructions in discontinuous session scenarios can make the determination of misrecognition instructions inaccurate, the time interval between voice instructions is detected to improve the accuracy of that determination. A manual feedback entry is added, and combining intelligent reporting with manual reporting avoids omissions in error reporting. Errors can also be reported through a specific voice instruction, which adds a user reporting channel, improves user experience, stops the output of erroneous recognition results in time, and reduces computing resources. Usage rights are determined according to the number of negative samples uploaded by the terminal, incentivizing terminal users to upload misrecognition instructions and thereby improving the accuracy of voice recognition. The voice recognition model is updated through the intelligently reported misrecognition instructions and their corresponding voice recognition results, improving the error correction capability of the model. Voice input by the user is then recognized based on the updated model; by automatically detecting misrecognized voice, the accuracy of voice recognition is improved intelligently, manual participation is reduced, and the voice user experience is improved.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. A speech recognition processing method, comprising:
for any two voice instructions received in succession among a plurality of voice instructions, namely a first voice instruction and a second voice instruction, executing the following processing:
determining the similarity of the first voice instruction and the second voice instruction;
and when the similarity exceeds a similarity threshold, determining that the first voice instruction is a misrecognition instruction.
2. The method of claim 1, further comprising:
combining the error recognition instruction and the corresponding voice recognition result into a negative sample;
acquiring a voice recognition model updated based on the negative sample;
and recognizing the newly received voice recognition instruction based on the updated voice recognition model so as to output a corresponding voice recognition result.
3. The method of claim 2, wherein obtaining the updated speech recognition model based on the negative examples comprises:
uploading the negative sample to a state database of a blockchain network for storage;
invoking a smart contract in the blockchain network to cause the smart contract to perform the following:
generating a credential of a virtual resource according to the number of uploaded negative samples;
wherein the credential of the virtual resource is used to request usage rights of the updated speech recognition model;
obtaining the updated speech recognition model based on the virtual resource.
4. The method according to any one of claims 1 to 3, wherein the determining that the first voice instruction is a misrecognition instruction when the similarity exceeds a similarity threshold comprises:
determining a receiving time interval of the first voice instruction and the second voice instruction;
when the receiving time interval is smaller than the interval threshold and the similarity exceeds the similarity threshold, determining that the first voice instruction is a misrecognition instruction.
5. The method of claim 1, wherein determining the similarity of the first voice instruction and the second voice instruction comprises:
determining the short-time energy of the first voice instruction, and mapping the short-time energy of the first voice instruction to be a first vector in cosine similarity;
determining the short-time energy of the second voice instruction, and mapping the short-time energy of the second voice instruction to be a second vector in the cosine similarity;
and taking the similarity between a first vector in the cosine similarity and a second vector in the cosine similarity as the similarity between the first voice instruction and the second voice instruction.
6. The method of claim 5, wherein said determining the short-time energy of the first voice instruction comprises:
sampling the first voice instruction to obtain a plurality of first voice sampling points;
performing windowing and framing processing on the plurality of first voice sampling points to obtain a plurality of voice frames;
taking the square sum of the amplitude values corresponding to the first voice sampling points in each voice frame as the short-time energy of each voice frame;
taking the short-time energy of each voice frame in the first voice instruction as a component of the short-time energy of the first voice instruction;
and integrating the components to obtain the short-time energy of the first voice instruction.
7. The method of claim 6, wherein after the windowing and framing processing is performed on the plurality of first voice sampling points to obtain the plurality of voice frames, the method further comprises:
acquiring a short-time energy threshold;
and when the short-time energy of a voice frame is smaller than the short-time energy threshold, determining that the voice frame is a mute frame and removing the voice frame.
8. The method of claim 6, wherein after said sampling the first voice instruction to obtain a plurality of first voice sampling points, the method further comprises:
pre-emphasis processing is carried out on the plurality of first voice sampling points to obtain pre-emphasized first voice sampling points;
and filtering the pre-emphasized first voice sampling points to obtain the first voice sampling points for the windowing and framing processing.
9. The method of claim 6, wherein after said sampling the first voice instruction to obtain a plurality of first voice sampling points, the method further comprises:
acquiring the maximum absolute value of the amplitude corresponding to the plurality of first voice sampling points;
determining the number of moving bits according to the maximum absolute value;
and for each first voice sampling point, moving the decimal point of the corresponding amplitude to the left by the number of moving bits to obtain the first voice sampling points for the windowing and framing processing.
10. A speech recognition processing apparatus, comprising:
a processing module, configured to execute the following processing for any two voice instructions received in succession among a plurality of voice instructions, namely a first voice instruction and a second voice instruction:
determining the similarity of the first voice instruction and the second voice instruction;
and when the similarity exceeds a similarity threshold, determining that the first voice instruction is a misrecognition instruction.
CN202011560150.XA 2020-12-25 2020-12-25 Voice recognition processing method and device Pending CN112542169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011560150.XA CN112542169A (en) 2020-12-25 2020-12-25 Voice recognition processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011560150.XA CN112542169A (en) 2020-12-25 2020-12-25 Voice recognition processing method and device

Publications (1)

Publication Number Publication Date
CN112542169A true CN112542169A (en) 2021-03-23

Family

ID=75017405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011560150.XA Pending CN112542169A (en) 2020-12-25 2020-12-25 Voice recognition processing method and device

Country Status (1)

Country Link
CN (1) CN112542169A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101221760A (en) * 2008-01-30 2008-07-16 中国科学院计算技术研究所 Audio matching method and system
CN111368262A (en) * 2018-12-26 2020-07-03 谈建中 Artificial intelligence model protection and loose coupling distributed training method based on block chain
US20200320349A1 (en) * 2019-04-02 2020-10-08 General Electric Company Transaction management of machine learning algorithm updates
US10873456B1 (en) * 2019-05-07 2020-12-22 LedgerDomain, LLC Neural network classifiers for block chain data structures
CN110556127A (en) * 2019-09-24 2019-12-10 北京声智科技有限公司 method, device, equipment and medium for detecting voice recognition result
CN110942772A (en) * 2019-11-21 2020-03-31 新华三大数据技术有限公司 Voice sample collection method and device
CN110990871A (en) * 2019-11-29 2020-04-10 腾讯云计算(北京)有限责任公司 Machine learning model training method, prediction method and device based on artificial intelligence

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205802A (en) * 2021-05-10 2021-08-03 芜湖美的厨卫电器制造有限公司 Updating method of voice recognition model, household appliance and server
CN113205802B (en) * 2021-05-10 2022-11-04 芜湖美的厨卫电器制造有限公司 Updating method of voice recognition model, household appliance and server
CN113568797A (en) * 2021-08-04 2021-10-29 北京百度网讯科技有限公司 Test method and device of intelligent interaction system, electronic equipment and medium
CN113568797B (en) * 2021-08-04 2023-06-23 北京百度网讯科技有限公司 Testing method and device of intelligent interaction system, electronic equipment and medium
CN117789706A (en) * 2024-02-27 2024-03-29 富迪科技(南京)有限公司 Audio information content identification method
CN117789706B (en) * 2024-02-27 2024-05-03 富迪科技(南京)有限公司 Audio information content identification method

Similar Documents

Publication Publication Date Title
CN110223705B (en) Voice conversion method, device, equipment and readable storage medium
CN112435684B (en) Voice separation method and device, computer equipment and storage medium
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN110473568B (en) Scene recognition method and device, storage medium and electronic equipment
WO2021115176A1 (en) Speech recognition method and related device
CN112542169A (en) Voice recognition processing method and device
CN111261151A (en) Voice processing method and device, electronic equipment and storage medium
CN112989108A (en) Language detection method and device based on artificial intelligence and electronic equipment
US11133022B2 (en) Method and device for audio recognition using sample audio and a voting matrix
CN115602165A (en) Digital staff intelligent system based on financial system
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN113220828B (en) Method, device, computer equipment and storage medium for processing intention recognition model
EP3843090B1 (en) Method and apparatus for outputting analysis abnormality information in spoken language understanding
CN115240696B (en) Speech recognition method and readable storage medium
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium
CN115050372A (en) Audio segment clustering method and device, electronic equipment and medium
CN115312040A (en) Voice wake-up method and device, electronic equipment and computer readable storage medium
CN116982111A (en) Audio characteristic compensation method, audio identification method and related products
CN112382296A (en) Method and device for voiceprint remote control of wireless audio equipment
CN112489678A (en) Scene recognition method and device based on channel characteristics
CN114420105A (en) Training method and device of voice recognition model, server and storage medium
CN113674745A (en) Voice recognition method and device
CN113223487B (en) Information identification method and device, electronic equipment and storage medium
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN117765954A (en) Audio processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40040442; Country of ref document: HK)