CN114446283A - Voice processing method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN114446283A (application CN202210147479.6A)
- Authority
- CN
- China
- Prior art keywords
- byte sequence
- voice
- decision
- target
- decision tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/10 — Speech classification or search using distance or distortion measures between unknown speech and reference templates
- G10L15/142 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/148 — Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
- G10L25/24 — Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
- G10L2015/025 — Phonemes, fenemes or fenones being the recognition units
- G10L2015/0631 — Creating reference templates; Clustering
Abstract
The invention relates to the technical field of artificial intelligence, and provides a voice processing method and device, an electronic device and a storage medium. Characteristic parameters of a voice input by a user are obtained, and a target recognition model recognizes the characteristic parameters to obtain a recognition result. The recognition result is processed based on Fourier transform to obtain a first byte sequence, and the decision items in a preset decision tree are processed in the same way to obtain second byte sequences. Using a Hamming distance algorithm, a target decision item matching the recognition result is obtained from the decision tree based on the first byte sequence and the second byte sequences. Whether the target decision item is correct is then verified, and a corresponding operation is performed according to the verification result. By means of voice recognition technology, the decision item meeting the user's requirements is thus searched intelligently according to the user's voice data and the decision tree, improving the efficiency of voice recognition and processing.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a voice processing method and device, electronic equipment and a storage medium.
Background
The traditional speech recognition technology used by intelligent voice services is phoneme-based: speech is recognized by modeling phoneme sequences. However, this approach cannot effectively express the relationship between phoneme sequences over a long context, and describing the relationships among more phoneme sequences requires higher-order modeling, so the computational cost grows exponentially and the vector space model is not robust enough. In addition, the problems that traditional intelligent voice customer service can solve are not specific enough and too narrow, so its feedback efficiency is low, customers' demands are hard to satisfy, and users' problems are not solved effectively.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a voice processing method, device, electronic device and storage medium, which can improve the efficiency of voice processing.
A first aspect of the present invention provides a method of speech processing, the method comprising:
responding to the operation of inputting voice by a user, and acquiring the characteristic parameters of the voice;
recognizing the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result;
processing the recognition result based on Fourier transform to obtain a first byte sequence, and processing a preset decision item in a decision tree based on Fourier transform to obtain a second byte sequence;
acquiring a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance algorithm;
and checking whether the target decision item is correct or not, and executing corresponding operation according to a result obtained by checking.
According to an optional embodiment of the present invention, the recognizing the feature parameters by using a pre-trained target recognition model, and obtaining a recognition result includes:
inputting mel frequency cepstrum coefficients of the voice signal into the target recognition model;
and obtaining a character string sequence output by the target recognition model.
According to an optional embodiment of the present invention, the processing the recognition result based on fourier transform to obtain a first byte sequence, and processing a decision item in a preset decision tree based on fourier transform to obtain a second byte sequence includes:
acquiring a first acoustic waveform corresponding to the character string sequence of the recognition result and a second acoustic waveform corresponding to each decision item in the decision tree;
and transforming the first acoustic waveform into the first byte sequence and the second acoustic waveform into the second byte sequence based on Fourier transform.
According to an optional embodiment of the present invention, the obtaining, by using a hamming distance algorithm, a target decision item matching the recognition result from the decision tree based on the first byte sequence and the second byte sequence includes:
calculating a first SimHash value of the first byte sequence and a second SimHash value of the second byte sequence by the Hamming distance algorithm;
calculating a similarity between the first byte sequence and the second byte sequence based on the first SimHash value and the second SimHash value;
and acquiring, from the decision tree, the decision item whose Hamming distance to the recognition result is smallest, i.e., whose similarity is highest, as the target decision item.
According to an alternative embodiment of the present invention, said calculating a first SimHash value of said first sequence of bytes by said hamming distance algorithm comprises:
performing word segmentation on the first byte sequence to obtain a plurality of feature vectors of the first byte sequence;
setting a preset weight for each feature vector;
calculating a Hash value of each feature vector through a Hash function;
weighting all the feature vectors based on the Hash values to obtain weighting results;
accumulating the weighting results of all the feature vectors to obtain an accumulation result;
and reducing the dimension of the accumulation result to obtain the first SimHash value.
According to an optional embodiment of the present invention, performing corresponding operations according to the result obtained by the verification includes:
when the verification result is that the target decision item is correct, executing the operation corresponding to the target decision item;
when the verification result is that the target decision item is incorrect, receiving the voice input by the user again, sending the two voice inputs of the user to a human customer service agent, acquiring a first operation of the human customer service agent, and providing the user with a processing method that meets the user's requirements based on the first operation, wherein the first operation comprises: selecting the decision item in the decision tree that meets the user's requirements.
According to an alternative embodiment of the invention, the method further comprises:
when it is determined that the target decision item is incorrect and no decision item in the decision tree meets the user's requirements, acquiring a second operation of the human customer service agent, and updating the voice library and the decision tree according to the second operation, wherein the second operation comprises: entering the two voice inputs.
A second aspect of the present invention provides a speech processing apparatus, comprising:
the acquisition module is used for responding to the operation of inputting voice by a user and acquiring the characteristic parameters of the voice;
the recognition module is used for recognizing the characteristic parameters by utilizing a pre-trained target recognition model to obtain a recognition result;
the processing module is used for processing the recognition result based on Fourier transform to obtain a first byte sequence and processing a decision item in a preset decision tree based on Fourier transform to obtain a second byte sequence;
the matching module is used for acquiring a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance algorithm;
and the checking module is used for checking whether the target decision item is correct or not and executing corresponding operation according to a result obtained by checking.
A third aspect of the invention provides an electronic device comprising a processor and a memory, the processor being configured to implement the speech processing method when executing a computer program stored in the memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech processing method.
To sum up, with the voice processing method and device, electronic device and storage medium according to the present invention, the characteristic parameters of the voice are first obtained in response to the user's voice input, and the characteristic parameters are then recognized by the pre-trained target recognition model to obtain the recognition result. The recognition result is processed based on Fourier transform to obtain the first byte sequence, and the decision items in the preset decision tree are processed to obtain the second byte sequences; the target decision item matching the recognition result is then obtained from the decision tree based on the first and second byte sequences by using the Hamming distance algorithm. Finally, whether the target decision item is correct is verified, and the corresponding operation is performed according to the verification result. The decision item is thus searched intelligently from the pre-constructed decision tree according to the client's voice data by means of voice recognition technology, the correctness of the found decision item is confirmed with the client by combining artificial intelligence with human customer service, and the corresponding operation is provided to the client accordingly. In the background system trained through human-assisted machine learning, a machine first analyzes the historical data and learns from the results; the results are then corrected through human intervention, and machine learning intelligently adjusts the weights of the relevant feature values and continues training, thereby improving the efficiency and accuracy of voice processing in intelligent voice customer service.
Drawings
Fig. 1 is a flowchart of a speech processing method according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a speech processing apparatus according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The voice processing method provided by the embodiment of the invention is executed by the electronic equipment, and correspondingly, the voice processing device runs in the electronic equipment.
Example one
Fig. 1 is a flowchart of a speech processing method according to an embodiment of the present invention. The speech processing method specifically comprises the following steps, and the sequence of the steps in the flowchart can be changed and some steps can be omitted according to different requirements.
And S11, responding to the operation of the user for recording the voice, and acquiring the characteristic parameters of the voice.
In response to the operation of a user recording voice, the voice is first preprocessed to obtain a voice signal; the Mel Frequency Cepstral Coefficients (MFCC) of the voice signal are then acquired and used as the characteristic parameters of the voice signal.
Wherein the pre-processing may include: pre-emphasis, windowing and frame division processing, end point detection and noise reduction processing.
The pre-emphasis processing comprises: emphasizing the high-frequency portion of the voice data based on the difference between the signal characteristics and the noise characteristics of the voice data. Pre-emphasis increases the high-frequency resolution of the voice data.
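By way of illustration, a minimal sketch of such a pre-emphasis filter (the coefficient 0.97 is a conventional choice assumed here, not a value given in this embodiment):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Emphasize the high-frequency part of the signal: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```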
The windowing and framing process comprises: windowing and framing the voice data to obtain a plurality of short-time analysis windows of the voice data. The voice data is framed by weighting it with a movable window of finite length and processed with a window function to form a windowed voice signal, where the window function includes a Hamming window or a rectangular window.
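A minimal sketch of windowing and framing with a Hamming window; the 400-sample frame and 160-sample hop (25 ms / 10 ms at 16 kHz) are illustrative assumptions:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split the signal into overlapping frames and apply a Hamming window.

    Assumes len(signal) >= frame_len. Returns an array of shape
    (n_frames, frame_len), each row being one short-time analysis window.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)
```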
Endpoint detection comprises: acquiring the starting point and the end point of the voice data and taking them as the two endpoints of the voice data. Correct and effective endpoint detection not only reduces the amount of computation, shortens the processing time, eliminates the noise interference of silent segments and improves the accuracy of voice recognition, but also extracts the starting point of the keyword to be recognized and separates the voice data from background noise and silence, yielding a voice signal suitable for voice recognition and subsequent operations.
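As a toy illustration of endpoint detection on the windowed frames (a single relative energy threshold is assumed here; practical detectors typically combine dual thresholds with zero-crossing rates):

```python
import numpy as np

def detect_endpoints(frames: np.ndarray, ratio: float = 0.1) -> tuple:
    """Return the indices of the first and last frame whose short-time
    energy exceeds `ratio` times the peak frame energy."""
    energy = np.sum(frames ** 2, axis=1)
    voiced = np.where(energy > ratio * energy.max())[0]
    return (voiced[0], voiced[-1]) if voiced.size else (0, len(frames) - 1)
```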
The voice noise reduction processing comprises: processing the voice data with a noise reduction algorithm such as adaptive filtering, spectral subtraction or Wiener filtering, so as to improve the signal-to-noise ratio of the voice data.
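A minimal sketch of the spectral subtraction variant, assuming some leading noise-only frames are available for estimating the noise spectrum:

```python
import numpy as np

def spectral_subtraction(frames: np.ndarray, noise_frames: np.ndarray) -> np.ndarray:
    """Subtract the average noise magnitude spectrum from each frame,
    keep the noisy phase, and floor negative magnitudes at zero."""
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)
    spec = np.fft.rfft(frames, axis=1)
    clean_mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=frames.shape[1], axis=1)
```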
In an optional embodiment, the obtaining the mel-frequency cepstrum coefficients of the speech signal comprises:
acquiring a plurality of short-time analysis windows of the voice signal;
carrying out Fourier transform on each short-time analysis window to obtain a corresponding frequency spectrum;
obtaining a mel frequency spectrum of the frequency spectrum by using a mel filter bank;
and carrying out cepstrum analysis on the Mel frequency spectrum to obtain the Mel frequency cepstrum coefficient. Wherein the cepstrum analysis comprises taking a logarithm and performing an inverse transform, the inverse transform comprising a discrete cosine transform.
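A sketch of this MFCC pipeline over the windowed frames; the mel filter bank construction is delegated to librosa as an assumed convenience, and the sampling rate and coefficient order are illustrative:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_frames(frames: np.ndarray, sr: int = 16000,
                     n_mels: int = 40, n_mfcc: int = 13) -> np.ndarray:
    """FFT per frame -> mel filter bank -> logarithm -> DCT as the inverse
    transform of the cepstrum analysis. Returns an (n_frames, n_mfcc) array."""
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2   # frequency spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(power @ mel_fb.T + 1e-10)                  # mel spectrum, then log
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]
```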
In an alternative embodiment, the mel filter bank includes, for example, 40 triangular filters. In order to balance the spectrum and improve the signal-to-noise ratio (SNR), the mel filter bank may be normalized to obtain a mean normalized mel filter bank, thereby obtaining a normalized MFCC.
In an alternative embodiment, the human ear is frequency-selective, allowing only signals of certain frequencies to pass. On the frequency axis, the Mel filter bank has many densely distributed filters in the low-frequency region and fewer, sparsely distributed filters in the high-frequency region; it can therefore simulate the nonlinear perception of sound by the human ear, discriminate better at lower frequencies, and improve the accuracy of distinguishing low-frequency signals.
And S12, recognizing the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result.
In an optional embodiment, the training process of the target recognition model includes:
constructing an initial recognition model based on a continuous hidden Markov model, and setting the initial parameter values of the initial recognition model, wherein the initial parameter values can be set by dividing the states equally or estimated from experience;
setting the maximum iteration times and the convergence threshold value of the target identification model;
performing a segmentation operation on the voice training samples in a preset voice library based on the Viterbi algorithm, where the voice training sample set is O = (o1, o2, …, oA), and o1 to oA are the 1st to A-th voice training samples;
and updating the model parameters obtained in each iteration by using an iterative algorithm (for example, the Baum-Welch algorithm), cyclically training on the voice training samples until the maximum number of iterations or the convergence threshold is met to obtain the optimal model parameters, and obtaining the target recognition model Y = (π, M, N) from the optimal model parameters, where π is the probability distribution at the initial time, M is the state transition probability matrix, and N is the probability density vector of the observation process. In an optional embodiment, feature extraction is performed on a plurality of voice signals in the voice library (including the voice training samples) to obtain a voice feature set (including the voice training sample set); the voice feature parameters of each voice signal change over time, generating a feature vector for each voice signal. The voice feature vector extracted from the i-th voice signal is oi = (oi1, oi2, …, oiK), n = 1, 2, …, K, where K is the MFCC coefficient order.
In an optional embodiment, a preset proportion of the voice feature parameters in the voice feature set is obtained, and the initial recognition model is constructed by using the obtained voice feature parameters. The preset proportion is set according to the specific situation; for example, when the preset proportion is set to 1%, the initial recognition model is constructed using the extracted 1% of the voice feature parameters. The initial recognition model may be established based on a Continuous Hidden Markov Model (CHMM).
In an optional implementation manner, the initial recognition model is trained based on an iterative algorithm until an optimal model parameter is obtained, and a model corresponding to the optimal model parameter is used as the target recognition model.
The initial recognition model is trained with the iterative algorithm using the voice feature parameters outside the preset proportion to obtain the optimal model parameters, and the target recognition model is obtained from the optimal model parameters. The iterative algorithm comprises the Baum-Welch algorithm, or the Baum-Welch algorithm improved by a K-means algorithm so as to improve the accuracy of the model.
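A sketch of this training loop, using hmmlearn's GaussianHMM as an assumed stand-in for the CHMM; the state count, iteration limit and convergence threshold are illustrative values, not parameters from this embodiment:

```python
import numpy as np
from hmmlearn import hmm

def train_target_model(mfcc_samples):
    """Baum-Welch (EM) re-estimation over the training samples o1..oA,
    stopping at the iteration limit or the convergence threshold."""
    X = np.vstack(mfcc_samples)                # concatenated MFCC frames
    lengths = [len(s) for s in mfcc_samples]   # one length per training sample
    model = hmm.GaussianHMM(n_components=5,    # initial states, e.g. divided equally
                            covariance_type="diag",
                            n_iter=100,        # maximum number of iterations
                            tol=1e-4)          # convergence threshold
    model.fit(X, lengths)
    # model.startprob_ corresponds to pi, model.transmat_ to M,
    # and the Gaussian emission parameters to N.
    return model
```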
In an optional embodiment, the recognizing the feature of the speech signal by using a pre-trained target recognition model to obtain a recognition result includes:
inputting mel frequency cepstrum coefficients of the voice signal into the target recognition model;
and obtaining a character string sequence output by the target recognition model.
And S13, processing the recognition result based on Fourier transform to obtain a first byte sequence, and processing a decision item in a preset decision tree to obtain a second byte sequence.
A monophone-based speech recognition system can already perform the basic task of large-vocabulary continuous speech recognition, but it suffers from the following drawbacks: the number of modeling units is small, so fine modeling is difficult and a good recognition rate is hard to achieve; and the context of each phoneme's pronunciation is ignored, although the phonemes in a sentence or a word are not pronounced in isolation but co-articulated as a whole, resulting in low recognition accuracy. When the context of the phonemes is taken into account, i.e., in a triphone-based speech recognition system, the number of parameters becomes too large, the training data becomes too sparse, and untrained triphones and their probabilities cannot be described. Introducing a decision tree effectively solves these problems.
In an optional embodiment, the preset decision tree building process includes:
a monophone-based speech recognition system obtains the feature set corresponding to each state of a monophone and the state alignment sequence of the monophone;
obtaining the feature sets corresponding to the states of triphones and the state alignment sequences of the triphones based on the feature sets corresponding to the states of the monophone, wherein a triphone comprises the previous phoneme and the next phoneme of the monophone;
determining a problem set of a decision tree according to the similarity and the position of the phonemes, wherein the problem set comprises a plurality of problems;
determining a root node of the decision tree according to the problem set;
calculating the likelihood gain of all problems in the problem set, selecting the current node to be split and the problem with the maximum likelihood gain to split the node, obtaining child nodes and distributing the corresponding problems to the child nodes;
classifying the similar triphones into the same nodes according to the state alignment sequences of the triphones;
and performing recursive splitting on the nodes of the decision tree by using a likelihood gain criterion until the splitting reaches a preset node number or the likelihood gain is lower than a preset gain threshold value.
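A schematic sketch of this recursive splitting, with the likelihood of a set of tied triphone states abstracted as a callable `log_likelihood` and each question as a predicate over a state's phone context; all names are hypothetical, and the node-count stopping criterion is elided for brevity:

```python
def split_node(states, questions, log_likelihood, min_gain=200.0):
    """Greedily split a node on the question with the largest likelihood
    gain; stop and tie the states when no question gains at least min_gain."""
    best_gain, best = 0.0, None
    for q in questions:
        yes = [s for s in states if q(s)]
        no = [s for s in states if not q(s)]
        if not yes or not no:
            continue
        gain = log_likelihood(yes) + log_likelihood(no) - log_likelihood(states)
        if gain > best_gain:
            best_gain, best = gain, (q, yes, no)
    if best is None or best_gain < min_gain:
        return states                        # leaf: these triphone states are tied
    q, yes, no = best
    return {"question": q,
            "yes": split_node(yes, questions, log_likelihood, min_gain),
            "no": split_node(no, questions, log_likelihood, min_gain)}
```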
In an optional embodiment, an item in each node is used as the decision item, and the item in each node is a question corresponding to each node.
In an optional embodiment, the processing the recognition result based on fourier transform to obtain a first byte sequence, and the processing the decision item in the preset decision tree based on fourier transform to obtain a second byte sequence includes:
acquiring a first acoustic waveform corresponding to the character string sequence of the recognition result and a second acoustic waveform corresponding to each decision item in the decision tree;
and transforming the first acoustic waveform into the first byte sequence and the second acoustic waveform into the second byte sequence based on Fourier transform.
For example, where the acoustic waveform of the recognition result goes upward it is converted into a 1, and otherwise into a 0, yielding the first byte sequence; the second byte sequence is obtained in the same way. The first byte sequence and/or the second byte sequence may be 64 bits long.
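One plausible reading of this step, sketched below: resample the waveform at 65 points and emit a 1 wherever the curve rises between consecutive points, giving a 64-bit sequence (the resampling strategy is an assumption):

```python
import numpy as np

def waveform_to_bits(waveform: np.ndarray, n_bits: int = 64) -> int:
    """Convert a waveform into an n_bits-long bit sequence: 1 where the
    curve goes upward between sampled points, 0 otherwise."""
    idx = np.linspace(0, len(waveform) - 1, n_bits + 1).astype(int)
    samples = waveform[idx]
    bits = (np.diff(samples) > 0).astype(int)
    return int("".join(map(str, bits)), 2)   # pack into a 64-bit value
```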
And S14, acquiring a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance (SimHash) algorithm.
In an alternative embodiment, the obtaining, by using a hamming distance algorithm, a target decision item matching the recognition result from the decision tree based on the first byte sequence and the second byte sequence includes:
calculating a first SimHash value of the first byte sequence and a second SimHash value of the second byte sequence by using a SimHash algorithm;
calculating a similarity between the first byte sequence and the second byte sequence based on the first SimHash value and the second SimHash value;
and acquiring, from the decision tree, the decision item whose Hamming distance to the recognition result is smallest, i.e., whose similarity is highest, as the target decision item.
The similarity between the recognized SimHash sequence and each decision SimHash sequence is calculated via the Hamming distance between the first SimHash value and the second SimHash value, where the Hamming distance is the number of positions at which two strings of equal length differ.
In an alternative embodiment, the SimHash algorithm comprises: word segmentation, hashing, weighting, merging, and dimension reduction.
In an alternative embodiment, said calculating a first SimHash value of said first sequence of bytes by a SimHash algorithm comprises:
performing word segmentation on the first byte sequence to obtain a plurality of feature vectors of the first byte sequence;
setting a preset weight for each feature vector;
calculating a Hash value of each feature vector through a Hash function;
weighting all the feature vectors based on the Hash values to obtain weighting results;
accumulating the weighting results of all the feature vectors to obtain an accumulation result;
and reducing the dimension of the accumulation result to obtain the first SimHash value.
A weight is set for each feature vector, for example the number of occurrences of that feature vector (e.g., 5), so that feature vectors occurring more often are given larger weights. The Hash value is an n-bit signature consisting of the binary digits 0 and 1. All feature vectors are weighted based on their Hash values, the weighting result of each feature vector being W = Hash × weight: where a Hash bit is 1, the weight is added; where it is 0, the weight is subtracted. The accumulation result is then dimension-reduced to obtain the first SimHash value: for each bit of the accumulated n-bit signature, the bit is set to 1 if the accumulated value is greater than 0, and to 0 otherwise.
Similarly, the second SimHash value of the second byte sequence is calculated; the Hamming distance between the two SimHash values is then computed, and their similarity is compared according to the Hamming distance. The smaller the Hamming distance, the more similar the recognition result is to the decision item.
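A sketch of the whole SimHash/Hamming-distance matching; the tokenizer (whitespace split), the occurrence-count weights and the use of MD5's low 64 bits as the hash function are simplifying assumptions:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Word segmentation -> hashing -> weighting -> accumulation -> dimension reduction."""
    acc = [0] * bits
    tokens = text.split()
    for tok in set(tokens):
        weight = tokens.count(tok)           # more occurrences -> larger weight
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):                # add the weight where the hash bit is 1,
            acc[i] += weight if (h >> i) & 1 else -weight   # subtract it where it is 0
    return sum(1 << i for i, v in enumerate(acc) if v > 0)   # >0 -> 1, else 0

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")             # number of differing bit positions

def match(recognized: str, decision_items):
    """Pick the decision item whose SimHash is closest to the recognition result."""
    target = simhash(recognized)
    return min(decision_items, key=lambda d: hamming_distance(simhash(d), target))
```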
And S15, checking whether the target decision item is correct, and executing corresponding operation according to the result obtained by checking.
In an optional embodiment, the user may be prompted by a voice query to respond to the correctness of the target decision item by selecting and pressing different keys, and the correctness of the target decision item is verified in response to the user's key-press operations.
The verification result comprises: the target decision item is correct, or the target decision item is incorrect.
In an optional embodiment, the performing, according to the result obtained by the verification, a corresponding operation includes:
when the verification result is that the target decision item is correct, executing the operation corresponding to the target decision item;
and when the verification result is that the target decision item is incorrect, receiving the voice input by the user again, sending the two voice inputs of the user to a human customer service agent, acquiring a first operation of the human customer service agent, and providing the user with a processing method that meets the user's requirements based on the first operation.
The first operation may include: selecting the decision item in the decision tree that meets the user's requirements.
In an optional embodiment, when the verification result is that the target decision item is correct, the operation corresponding to the target decision item is a processing method that meets the user's requirements. When the verification result is that the target decision item is incorrect, the user is reminded to record the voice again.
In an optional embodiment, the method further comprises:
and when it is determined that the target decision item is incorrect and no decision item in the decision tree meets the user's requirements, acquiring a second operation of the human customer service agent, and updating the voice library and the decision tree according to the second operation.
The second operation may include: entering the two voice inputs.
In an optional embodiment, updating the voice library and the decision tree according to the second operation includes: adding the two voice inputs to the voice library; and adding the two voice inputs to the question set of the decision tree and updating the decision items corresponding to the two voice inputs in the decision tree.
In an optional embodiment, the updating process performed on the speech library and the decision tree is also a process of continuously optimizing a machine learning algorithm, which helps to continuously improve the efficiency and accuracy of speech recognition.
Example two
Fig. 2 is a structural diagram of a speech processing apparatus according to a second embodiment of the present invention.
In some embodiments, the speech processing apparatus 20 may include a plurality of functional modules composed of computer program segments. The computer programs of the various program segments in the speech processing apparatus 20 may be stored in a memory of an electronic device and executed by at least one processor to perform the functions of speech processing (described in detail in fig. 1).
In this embodiment, the speech processing apparatus 20 may be divided into a plurality of functional modules according to the functions performed by the speech processing apparatus. The functional module may include: the system comprises an acquisition module 201, a recognition module 202, a processing module 203, a matching module 204 and a verification module 205. The module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function and is stored in memory. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The obtaining module 201 is configured to obtain a characteristic parameter of a voice in response to an operation of a user to enter the voice.
The preprocessing (pre-emphasis, windowing and framing, endpoint detection and noise reduction) and the extraction of the Mel frequency cepstrum coefficients as characteristic parameters are the same as those described for step S11 in Example one, and are not repeated here.
The recognition module 202 is configured to recognize the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result.
The training of the target recognition model and the recognition of the voice signal features are the same as those described for step S12 in Example one, and are not repeated here.
The processing module 203 is configured to process the recognition result based on Fourier transform to obtain a first byte sequence, and to process a decision item in a preset decision tree based on Fourier transform to obtain a second byte sequence. The construction of the preset decision tree and the conversion of the recognition result and the decision items into byte sequences are the same as those described for step S13 in Example one, and are not repeated here.
The matching module 204 is configured to obtain a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by using a hamming distance (SimHash) algorithm.
The calculation of the SimHash values, the Hamming distances and the similarities, and the selection of the target decision item are the same as those described for step S14 in Example one, and are not repeated here.
The checking module 205 is configured to check whether the target decision item is correct, and execute a corresponding operation according to a result obtained by the checking.
The verification of the target decision item, the corresponding operations, and the updating of the voice library and the decision tree by the checking module 205 are the same as those described for step S15 in Example one, and are not repeated here.
EXAMPLE III
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the above-described speech processing method embodiments, such as S11-S15 shown in fig. 1:
s11, responding to the operation of the user for recording voice, and acquiring the characteristic parameters of the voice;
s12, recognizing the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result;
s13, processing the recognition result based on Fourier transform to obtain a first byte sequence, and processing a preset decision item in a decision tree based on Fourier transform to obtain a second byte sequence;
s14, obtaining a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance algorithm;
and S15, checking whether the target decision item is correct, and executing corresponding operation according to the result obtained by checking.
Alternatively, when executed by the processor, the computer program implements the functions of the modules/units in the above device embodiments, for example, modules 201 to 205 in Fig. 2:
the obtaining module 201 is configured to obtain a characteristic parameter of a voice in response to an operation of a user to enter the voice;
the recognition module 202 is configured to recognize the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result;
the processing module 203 is configured to process the recognition result based on Fourier transform to obtain a first byte sequence, and to process a decision item in a preset decision tree based on Fourier transform to obtain a second byte sequence;
the matching module 204 is configured to obtain, by using a Hamming distance algorithm, a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence;
the checking module 205 is configured to check whether the target decision item is correct, and execute a corresponding operation according to a result obtained by the checking.
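The SimHash construction behind the processing and matching modules (word segmentation, preset weights, per-feature hashing, weighting, accumulation, and dimensionality reduction, as spelled out in claim 5 below) can be sketched as follows; whitespace tokenization and MD5 hashing are assumptions made for illustration:

```python
# Sketch of a weighted SimHash fingerprint plus Hamming-distance comparison.
import hashlib

def simhash(text, weights=None, bits=64):
    tokens = text.split()                        # word segmentation -> feature vectors
    weights = weights or {}                      # preset weight per feature (default 1)
    acc = [0] * bits                             # accumulation vector
    for tok in tokens:
        w = weights.get(tok, 1)
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")    # Hash value of the feature
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w  # weighting per bit
    # dimensionality reduction: keep only the sign of each accumulated component
    return sum(1 << i for i in range(bits) if acc[i] > 0)

def hamming_distance(a, b):
    """Smaller distance means the two fingerprints are more similar."""
    return bin(a ^ b).count("1")

# Example: the decision item with the smallest distance is the best match.
query = simhash("transfer money to savings")
candidates = {"transfer to savings": simhash("transfer to savings account"),
              "report lost card": simhash("report a lost bank card")}
best = min(candidates, key=lambda k: hamming_distance(query, candidates[k]))
```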
Example Four
Fig. 3 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. In the preferred embodiment of the present invention, the electronic device 3 comprises a memory 31, at least one processor 32, at least one communication bus 33 and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the electronic device shown in fig. 3 does not limit the embodiments of the present invention; the configuration may be a bus type or a star type, and the electronic device 3 may include more or fewer hardware or software components than those shown, or a different arrangement of components.
In some embodiments, the electronic device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware thereof includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 3 may also include a client device, which includes, but is not limited to, any electronic product that can interact with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, and the like.
It should be noted that the electronic device 3 is only an example; other existing or future electronic products that can be adapted to the present invention should also fall within the scope of protection of the present invention and are incorporated herein by reference.
In some embodiments, the memory 31 stores a computer program which, when executed by the at least one processor 32, implements all or part of the steps of the speech processing method described above. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
A blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
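To make the idea of cryptographically linked blocks concrete, here is a toy Python sketch; it is a teaching illustration only, not any production blockchain platform or API:

```python
# Toy hash-linked chain: each block stores the previous block's hash, so
# tampering with any block invalidates every later link.
import hashlib, json, time

def make_block(transactions, prev_hash):
    block = {"time": time.time(), "tx": transactions, "prev": prev_hash}
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

def verify_chain(chain):
    """Check that every block references the previous block's hash."""
    return all(chain[i]["prev"] == chain[i - 1]["hash"]
               for i in range(1, len(chain)))

genesis = make_block(["init"], prev_hash="0" * 64)
chain = [genesis, make_block(["tx1", "tx2"], genesis["hash"])]
assert verify_chain(chain)
```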
In some embodiments, the at least one processor 32 is the control unit of the electronic device 3: it connects the various components of the electronic device 3 through various interfaces and lines, and executes the various functions and processes the data of the electronic device 3 by running or executing the programs or modules stored in the memory 31 and calling the data stored in the memory 31. For example, the at least one processor 32, when executing the computer program stored in the memory, implements all or part of the steps of the speech processing method described in the embodiments of the present invention, or implements all or part of the functions of the speech processing apparatus. The at least one processor 32 may consist of a single integrated circuit, for example a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital signal processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 33 is arranged to enable connection and communication between the memory 31, the at least one processor 32, and other components.
Although not shown, the electronic device 3 may further include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so that functions such as managing charging, discharging, and power consumption are implemented through the power management device. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and the like. The electronic device 3 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, an electronic device, or a network device) or a processor to execute parts of the methods according to the embodiments of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from its spirit or essential attributes. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description; all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other elements, and the singular does not exclude the plural. A plurality of units or means recited in the specification may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not imply any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them; although the present invention is described in detail with reference to preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from their spirit and scope.
Claims (10)
1. A method of speech processing, the method comprising:
responding to the operation of inputting voice by a user, and acquiring the characteristic parameters of the voice;
recognizing the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result;
processing the recognition result based on Fourier transform to obtain a first byte sequence, and processing a preset decision item in a decision tree based on Fourier transform to obtain a second byte sequence;
acquiring a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance algorithm;
and checking whether the target decision item is correct, and executing a corresponding operation according to the result obtained by checking.
2. The speech processing method of claim 1, wherein the recognizing the feature parameters by using a pre-trained target recognition model to obtain a recognition result comprises:
inputting Mel-frequency cepstral coefficients of the voice signal into the target recognition model;
and obtaining a character string sequence output by the target recognition model.
3. The speech processing method of claim 2, wherein the processing the recognition result based on Fourier transform to obtain a first byte sequence and the processing the decision item in the preset decision tree based on Fourier transform to obtain a second byte sequence comprises:
acquiring a first acoustic waveform corresponding to the character string sequence of the recognition result, and a second acoustic waveform corresponding to each decision item in the decision tree;
transforming the first acoustic waveform into the first byte sequence and the second acoustic waveform into the second byte sequence based on a Fourier transform.
4. The speech processing method of claim 1, wherein the obtaining a target decision item from the decision tree that matches the recognition result based on the first byte sequence and the second byte sequence using a hamming distance algorithm comprises:
calculating a first SimHash value of the first byte sequence and a second SimHash value of the second byte sequence by the Hamming distance algorithm;
calculating a Hamming distance between the first SimHash value and the second SimHash value as a measure of the similarity between the first byte sequence and the second byte sequence;
and acquiring, from the decision tree, the decision item whose second byte sequence has the smallest Hamming distance to the first byte sequence (that is, the highest similarity) as the target decision item.
5. The speech processing method of claim 4, wherein the calculating a first SimHash value of the first byte sequence by the Hamming distance algorithm comprises:
performing word segmentation on the first byte sequence to obtain a plurality of feature vectors of the first byte sequence;
setting a preset weight for each feature vector;
calculating a Hash value of each feature vector through a Hash function;
weighting the Hash value of each feature vector based on its weight to obtain a weighted result;
accumulating the weighted results of all the feature vectors to obtain an accumulated result;
and reducing the dimension of the accumulated result to obtain the first SimHash value.
6. The speech processing method of claim 1, wherein performing the corresponding operation according to the verification result comprises:
when the verification result indicates that the target decision item is correct, executing the operation corresponding to the target decision item;
when the verification result indicates that the target decision item is incorrect, receiving the voice input by the user again, sending the two voice inputs of the user to a human customer service agent, acquiring a first operation of the human customer service agent, and providing the user, based on the first operation, with a processing method that meets the user's requirement, wherein the first operation comprises: selecting, from the decision tree, the decision item that meets the user's requirement.
7. The speech processing method of claim 6, wherein the method further comprises:
when the target decision item is determined to be incorrect and no decision item in the decision tree meets the user's requirement, acquiring a second operation of the human customer service agent, and updating the voice library and the decision tree according to the second operation, wherein the second operation comprises: entering the two voice inputs.
8. A speech processing apparatus, characterized in that the apparatus comprises:
the obtaining module is used for responding to the operation of inputting voice by a user and acquiring the characteristic parameters of the voice;
the recognition module is used for recognizing the characteristic parameters by utilizing a pre-trained target recognition model to obtain a recognition result;
the processing module is used for processing the recognition result based on Fourier transform to obtain a first byte sequence and processing a preset decision item in a decision tree based on Fourier transform to obtain a second byte sequence;
the matching module is used for acquiring a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance algorithm;
and the checking module is used for checking whether the target decision item is correct or not and executing corresponding operation according to a result obtained by checking.
9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the processor being configured to implement the speech processing method according to any of claims 1 to 7 when executing a computer program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the speech processing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210147479.6A CN114446283A (en) | 2022-02-17 | 2022-02-17 | Voice processing method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210147479.6A CN114446283A (en) | 2022-02-17 | 2022-02-17 | Voice processing method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114446283A true CN114446283A (en) | 2022-05-06 |
Family
ID=81374462
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210147479.6A Pending CN114446283A (en) | 2022-02-17 | 2022-02-17 | Voice processing method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114446283A (en) |
- 2022-02-17: CN application CN202210147479.6A filed; published as CN114446283A (en); status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109817246B (en) | Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium | |
CN107680582B (en) | Acoustic model training method, voice recognition method, device, equipment and medium | |
CN111276131B (en) | Multi-class acoustic feature integration method and system based on deep neural network | |
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
CN111712874B (en) | Method, system, device and storage medium for determining sound characteristics | |
CN107610707B (en) | A kind of method for recognizing sound-groove and device | |
CN112562691B (en) | Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium | |
CN109087648B (en) | Counter voice monitoring method and device, computer equipment and storage medium | |
JP5853029B2 (en) | Passphrase modeling device and method for speaker verification, and speaker verification system | |
WO2019037205A1 (en) | Voice fraud identifying method and apparatus, terminal device, and storage medium | |
KR100800367B1 (en) | Sensor based speech recognizer selection, adaptation and combination | |
CN107731233B (en) | Voiceprint recognition method based on RNN | |
CN102509547B (en) | Method and system for voiceprint recognition based on vector quantization based | |
US5638486A (en) | Method and system for continuous speech recognition using voting techniques | |
US5596679A (en) | Method and system for identifying spoken sounds in continuous speech by comparing classifier outputs | |
CN112885336B (en) | Training and recognition method and device of voice recognition system and electronic equipment | |
JPWO2019102884A1 (en) | Label generators, model learning devices, emotion recognition devices, their methods, programs, and recording media | |
CN112259089B (en) | Speech recognition method and device | |
CN109036471B (en) | Voice endpoint detection method and device | |
KR20200104019A (en) | Machine learning based voice data analysis method, device and program | |
CN111933148A (en) | Age identification method and device based on convolutional neural network and terminal | |
JP3014177B2 (en) | Speaker adaptive speech recognition device | |
CN113793615A (en) | Speaker recognition method, model training method, device, equipment and storage medium | |
JP3920749B2 (en) | Acoustic model creation method for speech recognition, apparatus thereof, program thereof and recording medium thereof, speech recognition apparatus using acoustic model | |
CN115631748A (en) | Emotion recognition method and device based on voice conversation, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||