CN114446283A - Voice processing method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN114446283A (application CN202210147479.6A)
- Authority
- CN
- China
- Prior art keywords
- byte sequence
- voice
- decision
- target
- decision tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/10 — Speech classification or search using distance or distortion measures between unknown speech and reference templates
- G10L15/142 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/148 — Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
- G10L25/24 — Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
- G10L2015/025 — Phonemes, fenemes or fenones being the recognition units
- G10L2015/0631 — Creating reference templates; Clustering
Abstract
The invention relates to the technical field of artificial intelligence, and provides a voice processing method and device, an electronic device and a storage medium. Characteristic parameters of a voice input by a user are obtained, and a target recognition model recognizes the characteristic parameters to obtain a recognition result. The recognition result is processed based on Fourier transform to obtain a first byte sequence, and the decision items in a preset decision tree are processed in the same way to obtain second byte sequences. Using a Hamming distance algorithm, a target decision item matching the recognition result is obtained from the decision tree based on the first byte sequence and the second byte sequences. Whether the target decision item is correct is then verified, and a corresponding operation is performed according to the verification result. By means of voice recognition technology, the decision item meeting the user's requirements is thus searched intelligently according to the user's voice data and the decision tree, improving the efficiency of voice recognition and processing.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a voice processing method and device, electronic equipment and a storage medium.
Background
The traditional speech recognition technology used by intelligent voice services is phoneme-based: speech is recognized by modeling phoneme sequences. However, this approach cannot effectively express the relationship between phoneme sequences over a long context, and describing the relationships among more phoneme sequences requires higher-order modeling, so the computational cost grows exponentially and the vector space model is not robust enough. In addition, the problems that traditional intelligent voice customer service can solve are not specific enough and too narrow, so its feedback efficiency is low, customers' demands are hard to satisfy, and users' problems are not solved effectively.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a voice processing method, device, electronic device and storage medium, which can improve the efficiency of voice processing.
A first aspect of the present invention provides a method of speech processing, the method comprising:
responding to the operation of inputting voice by a user, and acquiring the characteristic parameters of the voice;
recognizing the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result;
processing the recognition result based on Fourier transform to obtain a first byte sequence, and processing a preset decision item in a decision tree based on Fourier transform to obtain a second byte sequence;
acquiring a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance algorithm;
and checking whether the target decision item is correct or not, and executing corresponding operation according to a result obtained by checking.
According to an optional embodiment of the present invention, the recognizing the feature parameters by using a pre-trained target recognition model, and obtaining a recognition result includes:
inputting mel frequency cepstrum coefficients of the voice signal into the target recognition model;
and obtaining a character string sequence output by the target recognition model.
According to an optional embodiment of the present invention, the processing the recognition result based on fourier transform to obtain a first byte sequence, and processing a decision item in a preset decision tree based on fourier transform to obtain a second byte sequence includes:
acquiring a first acoustic waveform corresponding to the character string sequence of the recognition result and a second acoustic waveform corresponding to each decision item in the decision tree;
and transforming the first acoustic waveform into the first byte sequence and the second acoustic waveform into the second byte sequence based on Fourier transform.
According to an optional embodiment of the present invention, the obtaining, by using a hamming distance algorithm, a target decision item matching the recognition result from the decision tree based on the first byte sequence and the second byte sequence includes:
calculating a first SimHash value of the first byte sequence and a second SimHash value of the second byte sequence by the Hamming distance algorithm;
calculating a similarity between the first byte sequence and the second byte sequence based on the first SimHash value and the second SimHash value;
and acquiring, from the decision tree, the decision item whose Hamming distance to the recognition result is smallest, i.e., whose similarity is highest, as the target decision item.
According to an alternative embodiment of the present invention, said calculating a first SimHash value of said first sequence of bytes by said hamming distance algorithm comprises:
performing word segmentation on the first byte sequence to obtain a plurality of feature vectors of the first byte sequence;
setting a preset weight for each feature vector;
calculating a Hash value of each feature vector through a Hash function;
weighting all the feature vectors based on the Hash values to obtain weighting results;
accumulating the weighting results of all the feature vectors to obtain an accumulation result;
and reducing the dimension of the accumulation result to obtain the first SimHash value.
According to an optional embodiment of the present invention, performing corresponding operations according to the result obtained by the verification includes:
when the verification result is that the target decision item is correct, executing the operation corresponding to the target decision item;
when the verification result is that the target decision item is incorrect, receiving the voice input by the user again, sending the two voice inputs of the user to a human customer service agent, acquiring a first operation of the human customer service agent, and providing the user with a processing method that meets the user's requirements based on the first operation, wherein the first operation comprises: selecting the decision item in the decision tree that meets the user's requirements.
According to an alternative embodiment of the invention, the method further comprises:
when it is determined that the target decision item is incorrect and no decision item in the decision tree meets the user's requirements, acquiring a second operation of the human customer service agent, and updating the voice library and the decision tree according to the second operation, wherein the second operation comprises: entering the two voice inputs.
A second aspect of the present invention provides a speech processing apparatus, comprising:
the acquisition module is used for responding to the operation of inputting voice by a user and acquiring the characteristic parameters of the voice;
the recognition module is used for recognizing the characteristic parameters by utilizing a pre-trained target recognition model to obtain a recognition result;
the processing module is used for processing the recognition result based on Fourier transform to obtain a first byte sequence and processing a decision item in a preset decision tree based on Fourier transform to obtain a second byte sequence;
the matching module is used for acquiring a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance algorithm;
and the checking module is used for checking whether the target decision item is correct or not and executing corresponding operation according to a result obtained by checking.
A third aspect of the invention provides an electronic device comprising a processor and a memory, the processor being configured to implement the speech processing method when executing a computer program stored in the memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech processing method.
To sum up, with the voice processing method and device, electronic device and storage medium according to the present invention, the characteristic parameters of the voice are first obtained in response to the user's voice input, and the characteristic parameters are then recognized by the pre-trained target recognition model to obtain the recognition result. The recognition result is processed based on Fourier transform to obtain the first byte sequence, and the decision items in the preset decision tree are processed to obtain the second byte sequences; the target decision item matching the recognition result is then obtained from the decision tree based on the first and second byte sequences by using the Hamming distance algorithm. Finally, whether the target decision item is correct is verified, and the corresponding operation is performed according to the verification result. The decision item is thus searched intelligently from the pre-constructed decision tree according to the client's voice data by means of voice recognition technology, the correctness of the found decision item is confirmed with the client by combining artificial intelligence with human customer service, and the corresponding operation is provided to the client accordingly. In the background system trained through human-assisted machine learning, a machine first analyzes the historical data and learns from the results; the results are then corrected through human intervention, and machine learning intelligently adjusts the weights of the relevant feature values and continues training, thereby improving the efficiency and accuracy of voice processing in intelligent voice customer service.
Drawings
Fig. 1 is a flowchart of a speech processing method according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a speech processing apparatus according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The voice processing method provided by the embodiment of the invention is executed by the electronic equipment, and correspondingly, the voice processing device runs in the electronic equipment.
Example one
Fig. 1 is a flowchart of a speech processing method according to an embodiment of the present invention. The speech processing method specifically comprises the following steps, and the sequence of the steps in the flowchart can be changed and some steps can be omitted according to different requirements.
And S11, responding to the operation of the user for recording the voice, and acquiring the characteristic parameters of the voice.
In response to the operation of a user recording voice, the voice is first preprocessed to obtain a voice signal; the Mel Frequency Cepstral Coefficients (MFCC) of the voice signal are then acquired and used as the characteristic parameters of the voice signal.
Wherein the pre-processing may include: pre-emphasis, windowing and frame division processing, end point detection and noise reduction processing.
The pre-emphasis processing comprises: emphasizing the high-frequency portion of the voice data based on the difference between the signal characteristics and the noise characteristics of the voice data. Pre-emphasis increases the high-frequency resolution of the voice data.
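By way of illustration, a minimal sketch of such a pre-emphasis filter (the coefficient 0.97 is a conventional choice assumed here, not a value given in this embodiment):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Emphasize the high-frequency part of the signal: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```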
The windowing and framing process comprises: windowing and framing the voice data to obtain a plurality of short-time analysis windows of the voice data. The voice data is framed by weighting it with a movable window of finite length and processed with a window function to form a windowed voice signal, where the window function includes a Hamming window or a rectangular window.
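A minimal sketch of windowing and framing with a Hamming window; the 400-sample frame and 160-sample hop (25 ms / 10 ms at 16 kHz) are illustrative assumptions:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split the signal into overlapping frames and apply a Hamming window.

    Assumes len(signal) >= frame_len. Returns an array of shape
    (n_frames, frame_len), each row being one short-time analysis window.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)
```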
Endpoint detection comprises: acquiring the starting point and the end point of the voice data and taking them as the two endpoints of the voice data. Correct and effective endpoint detection not only reduces the amount of computation, shortens the processing time, eliminates the noise interference of silent segments and improves the accuracy of voice recognition, but also extracts the starting point of the keyword to be recognized and separates the voice data from background noise and silence, yielding a voice signal suitable for voice recognition and subsequent operations.
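As a toy illustration of endpoint detection on the windowed frames (a single relative energy threshold is assumed here; practical detectors typically combine dual thresholds with zero-crossing rates):

```python
import numpy as np

def detect_endpoints(frames: np.ndarray, ratio: float = 0.1) -> tuple:
    """Return the indices of the first and last frame whose short-time
    energy exceeds `ratio` times the peak frame energy."""
    energy = np.sum(frames ** 2, axis=1)
    voiced = np.where(energy > ratio * energy.max())[0]
    return (voiced[0], voiced[-1]) if voiced.size else (0, len(frames) - 1)
```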
The voice noise reduction processing comprises: processing the voice data with a noise reduction algorithm such as adaptive filtering, spectral subtraction or Wiener filtering, so as to improve the signal-to-noise ratio of the voice data.
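A minimal sketch of the spectral subtraction variant, assuming some leading noise-only frames are available for estimating the noise spectrum:

```python
import numpy as np

def spectral_subtraction(frames: np.ndarray, noise_frames: np.ndarray) -> np.ndarray:
    """Subtract the average noise magnitude spectrum from each frame,
    keep the noisy phase, and floor negative magnitudes at zero."""
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)
    spec = np.fft.rfft(frames, axis=1)
    clean_mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=frames.shape[1], axis=1)
```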
In an optional embodiment, the obtaining the mel-frequency cepstrum coefficients of the speech signal comprises:
acquiring a plurality of short-time analysis windows of the voice signal;
carrying out Fourier transform on each short-time analysis window to obtain a corresponding frequency spectrum;
obtaining a mel frequency spectrum of the frequency spectrum by using a mel filter bank;
and carrying out cepstrum analysis on the Mel frequency spectrum to obtain the Mel frequency cepstrum coefficient. Wherein the cepstrum analysis comprises taking a logarithm and performing an inverse transform, the inverse transform comprising a discrete cosine transform.
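A sketch of this MFCC pipeline over the windowed frames; the mel filter bank construction is delegated to librosa as an assumed convenience, and the sampling rate and coefficient order are illustrative:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_frames(frames: np.ndarray, sr: int = 16000,
                     n_mels: int = 40, n_mfcc: int = 13) -> np.ndarray:
    """FFT per frame -> mel filter bank -> logarithm -> DCT as the inverse
    transform of the cepstrum analysis. Returns an (n_frames, n_mfcc) array."""
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2   # frequency spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(power @ mel_fb.T + 1e-10)                  # mel spectrum, then log
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_mfcc]
```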
In an alternative embodiment, the mel filter bank includes, for example, 40 triangular filters. In order to balance the spectrum and improve the signal-to-noise ratio (SNR), the mel filter bank may be normalized to obtain a mean normalized mel filter bank, thereby obtaining a normalized MFCC.
In an alternative embodiment, the human ear is frequency-selective, allowing only signals of certain frequencies to pass. On the frequency axis, the Mel filter bank has many densely distributed filters in the low-frequency region and fewer, sparsely distributed filters in the high-frequency region; it can therefore simulate the nonlinear perception of sound by the human ear, discriminate better at lower frequencies, and improve the accuracy of distinguishing low-frequency signals.
And S12, recognizing the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result.
In an optional embodiment, the training process of the target recognition model includes:
constructing an initial recognition model based on a continuous hidden Markov model, and setting the initial parameter values of the initial recognition model, wherein the initial parameter values can be set by dividing the states equally or estimated from experience;
setting the maximum iteration times and the convergence threshold value of the target identification model;
performing a segmentation operation on the voice training samples in a preset voice library based on the Viterbi algorithm, where the voice training sample set is O = (o1, o2, …, oA), and o1 to oA are the 1st to A-th voice training samples;
and updating the model parameters obtained in each iteration by using an iterative algorithm (for example, the Baum-Welch algorithm), cyclically training on the voice training samples until the maximum number of iterations or the convergence threshold is met to obtain the optimal model parameters, and obtaining the target recognition model Y = (π, M, N) from the optimal model parameters, where π is the probability distribution at the initial time, M is the state transition probability matrix, and N is the probability density vector of the observation process. In an optional embodiment, feature extraction is performed on a plurality of voice signals in the voice library (including the voice training samples) to obtain a voice feature set (including the voice training sample set); the voice feature parameters of each voice signal change over time, generating a feature vector for each voice signal. The voice feature vector extracted from the i-th voice signal is oi = (oi1, oi2, …, oiK), n = 1, 2, …, K, where K is the MFCC coefficient order.
In an optional embodiment, a preset proportion of the voice feature parameters in the voice feature set is obtained, and the initial recognition model is constructed by using the obtained voice feature parameters. The preset proportion is set according to the specific situation; for example, when the preset proportion is set to 1%, the initial recognition model is constructed using the extracted 1% of the voice feature parameters. The initial recognition model may be established based on a Continuous Hidden Markov Model (CHMM).
In an optional implementation manner, the initial recognition model is trained based on an iterative algorithm until an optimal model parameter is obtained, and a model corresponding to the optimal model parameter is used as the target recognition model.
The initial recognition model is trained with the iterative algorithm using the voice feature parameters outside the preset proportion to obtain the optimal model parameters, and the target recognition model is obtained from the optimal model parameters. The iterative algorithm comprises the Baum-Welch algorithm, or the Baum-Welch algorithm improved by a K-means algorithm so as to improve the accuracy of the model.
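A sketch of this training loop, using hmmlearn's GaussianHMM as an assumed stand-in for the CHMM; the state count, iteration limit and convergence threshold are illustrative values, not parameters from this embodiment:

```python
import numpy as np
from hmmlearn import hmm

def train_target_model(mfcc_samples):
    """Baum-Welch (EM) re-estimation over the training samples o1..oA,
    stopping at the iteration limit or the convergence threshold."""
    X = np.vstack(mfcc_samples)                # concatenated MFCC frames
    lengths = [len(s) for s in mfcc_samples]   # one length per training sample
    model = hmm.GaussianHMM(n_components=5,    # initial states, e.g. divided equally
                            covariance_type="diag",
                            n_iter=100,        # maximum number of iterations
                            tol=1e-4)          # convergence threshold
    model.fit(X, lengths)
    # model.startprob_ corresponds to pi, model.transmat_ to M,
    # and the Gaussian emission parameters to N.
    return model
```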
In an optional embodiment, the recognizing the feature of the speech signal by using a pre-trained target recognition model to obtain a recognition result includes:
inputting mel frequency cepstrum coefficients of the voice signal into the target recognition model;
and obtaining a character string sequence output by the target recognition model.
And S13, processing the recognition result based on Fourier transform to obtain a first byte sequence, and processing a decision item in a preset decision tree to obtain a second byte sequence.
A monophone-based speech recognition system can already perform the basic task of large-vocabulary continuous speech recognition, but it suffers from the following drawbacks: the number of modeling units is small, so fine modeling is difficult and a good recognition rate is hard to achieve; and the context of each phoneme's pronunciation is ignored, although the phonemes in a sentence or a word are not pronounced in isolation but co-articulated as a whole, resulting in low recognition accuracy. When the context of the phonemes is taken into account, i.e., in a triphone-based speech recognition system, the number of parameters becomes too large, the training data becomes too sparse, and untrained triphones and their probabilities cannot be described. Introducing a decision tree effectively solves these problems.
In an optional embodiment, the preset decision tree building process includes:
a monophone-based speech recognition system obtains the feature set corresponding to each state of a monophone and the state alignment sequence of the monophone;
obtaining the feature sets corresponding to the states of triphones and the state alignment sequences of the triphones based on the feature sets corresponding to the states of the monophone, wherein a triphone comprises the previous phoneme and the next phoneme of the monophone;
determining a problem set of a decision tree according to the similarity and the position of the phonemes, wherein the problem set comprises a plurality of problems;
determining a root node of the decision tree according to the problem set;
calculating the likelihood gain of all problems in the problem set, selecting the current node to be split and the problem with the maximum likelihood gain to split the node, obtaining child nodes and distributing the corresponding problems to the child nodes;
classifying the similar triphones into the same nodes according to the state alignment sequences of the triphones;
and performing recursive splitting on the nodes of the decision tree by using a likelihood gain criterion until the splitting reaches a preset node number or the likelihood gain is lower than a preset gain threshold value.
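A schematic sketch of this recursive splitting, with the likelihood of a set of tied triphone states abstracted as a callable `log_likelihood` and each question as a predicate over a state's phone context; all names are hypothetical, and the node-count stopping criterion is elided for brevity:

```python
def split_node(states, questions, log_likelihood, min_gain=200.0):
    """Greedily split a node on the question with the largest likelihood
    gain; stop and tie the states when no question gains at least min_gain."""
    best_gain, best = 0.0, None
    for q in questions:
        yes = [s for s in states if q(s)]
        no = [s for s in states if not q(s)]
        if not yes or not no:
            continue
        gain = log_likelihood(yes) + log_likelihood(no) - log_likelihood(states)
        if gain > best_gain:
            best_gain, best = gain, (q, yes, no)
    if best is None or best_gain < min_gain:
        return states                        # leaf: these triphone states are tied
    q, yes, no = best
    return {"question": q,
            "yes": split_node(yes, questions, log_likelihood, min_gain),
            "no": split_node(no, questions, log_likelihood, min_gain)}
```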
In an optional embodiment, an item in each node is used as the decision item, and the item in each node is a question corresponding to each node.
In an optional embodiment, the processing the recognition result based on fourier transform to obtain a first byte sequence, and the processing the decision item in the preset decision tree based on fourier transform to obtain a second byte sequence includes:
acquiring a first acoustic waveform corresponding to the character string sequence of the recognition result and a second acoustic waveform corresponding to each decision item in the decision tree;
and transforming the first acoustic waveform into the first byte sequence and the second acoustic waveform into the second byte sequence based on Fourier transform.
For example, where the acoustic waveform of the recognition result goes upward it is converted into a 1, and otherwise into a 0, yielding the first byte sequence; the second byte sequence is obtained in the same way. The first byte sequence and/or the second byte sequence may be 64 bits long.
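One plausible reading of this step, sketched below: resample the waveform at 65 points and emit a 1 wherever the curve rises between consecutive points, giving a 64-bit sequence (the resampling strategy is an assumption):

```python
import numpy as np

def waveform_to_bits(waveform: np.ndarray, n_bits: int = 64) -> int:
    """Convert a waveform into an n_bits-long bit sequence: 1 where the
    curve goes upward between sampled points, 0 otherwise."""
    idx = np.linspace(0, len(waveform) - 1, n_bits + 1).astype(int)
    samples = waveform[idx]
    bits = (np.diff(samples) > 0).astype(int)
    return int("".join(map(str, bits)), 2)   # pack into a 64-bit value
```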
And S14, acquiring a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance (SimHash) algorithm.
In an alternative embodiment, the obtaining, by using a hamming distance algorithm, a target decision item matching the recognition result from the decision tree based on the first byte sequence and the second byte sequence includes:
calculating a first SimHash value of the first byte sequence and a second SimHash value of the second byte sequence by using a SimHash algorithm;
calculating a similarity between the first byte sequence and the second byte sequence based on the first SimHash value and the second SimHash value;
and acquiring, from the decision tree, the decision item whose Hamming distance to the recognition result is smallest, i.e., whose similarity is highest, as the target decision item.
The similarity between the recognized SimHash sequence and each decision SimHash sequence is calculated via the Hamming distance between the first SimHash value and the second SimHash value, where the Hamming distance is the number of positions at which two strings of equal length differ.
In an alternative embodiment, the SimHash algorithm comprises: word segmentation, hashing, weighting, merging, and dimension reduction.
In an alternative embodiment, said calculating a first SimHash value of said first sequence of bytes by a SimHash algorithm comprises:
performing word segmentation on the first byte sequence to obtain a plurality of feature vectors of the first byte sequence;
setting a preset weight for each feature vector;
calculating a Hash value of each feature vector through a Hash function;
weighting all the feature vectors based on the Hash values to obtain weighting results;
accumulating the weighting results of all the feature vectors to obtain an accumulation result;
and reducing the dimension of the accumulation result to obtain the first SimHash value.
A weight is set for each feature vector, for example the number of occurrences of that feature vector (e.g., 5), so that feature vectors occurring more often are given larger weights. The Hash value is an n-bit signature consisting of the binary digits 0 and 1. All feature vectors are weighted based on their Hash values, the weighting result of each feature vector being W = Hash × weight: where a Hash bit is 1, the weight is added; where it is 0, the weight is subtracted. The accumulation result is then dimension-reduced to obtain the first SimHash value: for each bit of the accumulated n-bit signature, the bit is set to 1 if the accumulated value is greater than 0, and to 0 otherwise.
Similarly, the second SimHash value of the second byte sequence is calculated; the Hamming distance between the two SimHash values is then computed, and their similarity is compared according to the Hamming distance. The smaller the Hamming distance, the more similar the recognition result is to the decision item.
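A sketch of the whole SimHash/Hamming-distance matching; the tokenizer (whitespace split), the occurrence-count weights and the use of MD5's low 64 bits as the hash function are simplifying assumptions:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Word segmentation -> hashing -> weighting -> accumulation -> dimension reduction."""
    acc = [0] * bits
    tokens = text.split()
    for tok in set(tokens):
        weight = tokens.count(tok)           # more occurrences -> larger weight
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):                # add the weight where the hash bit is 1,
            acc[i] += weight if (h >> i) & 1 else -weight   # subtract it where it is 0
    return sum(1 << i for i, v in enumerate(acc) if v > 0)   # >0 -> 1, else 0

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")             # number of differing bit positions

def match(recognized: str, decision_items):
    """Pick the decision item whose SimHash is closest to the recognition result."""
    target = simhash(recognized)
    return min(decision_items, key=lambda d: hamming_distance(simhash(d), target))
```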
And S15, checking whether the target decision item is correct, and executing corresponding operation according to the result obtained by checking.
In an optional embodiment, the user may be prompted by a voice query to respond to the correctness of the target decision item by selecting and pressing different keys, and the correctness of the target decision item is verified in response to the user's key-press operations.
The verification result comprises: the target decision item is correct, or the target decision item is incorrect.
In an optional embodiment, the performing, according to the result obtained by the verification, a corresponding operation includes:
when the verification result is that the target decision item is correct, executing the operation corresponding to the target decision item;
and when the verification result is that the target decision item is incorrect, receiving the voice input by the user again, sending the two voice inputs of the user to a human customer service agent, acquiring a first operation of the human customer service agent, and providing the user with a processing method that meets the user's requirements based on the first operation.
The first operation may include: selecting the decision item in the decision tree that meets the user's requirements.
In an optional embodiment, when the verification result is that the target decision item is correct, the operation corresponding to the target decision item is a processing method that meets the user's requirements. When the verification result is that the target decision item is incorrect, the user is reminded to record the voice again.
In an optional embodiment, the method further comprises:
and when it is determined that the target decision item is incorrect and no decision item in the decision tree meets the user's requirements, acquiring a second operation of the human customer service agent, and updating the voice library and the decision tree according to the second operation.
The second operation may include: entering the two voice inputs.
In an optional embodiment, updating the voice library and the decision tree according to the second operation includes: adding the two voice inputs to the voice library; and adding the two voice inputs to the question set of the decision tree and updating the decision items corresponding to the two voice inputs in the decision tree.
In an optional embodiment, the updating process performed on the speech library and the decision tree is also a process of continuously optimizing a machine learning algorithm, which helps to continuously improve the efficiency and accuracy of speech recognition.
Example two
Fig. 2 is a structural diagram of a speech processing apparatus according to a second embodiment of the present invention.
In some embodiments, the speech processing apparatus 20 may include a plurality of functional modules composed of computer program segments. The computer programs of the various program segments in the speech processing apparatus 20 may be stored in a memory of an electronic device and executed by at least one processor to perform the functions of speech processing (described in detail in fig. 1).
In this embodiment, the speech processing apparatus 20 may be divided into a plurality of functional modules according to the functions performed by the speech processing apparatus. The functional module may include: the system comprises an acquisition module 201, a recognition module 202, a processing module 203, a matching module 204 and a verification module 205. The module referred to herein is a series of computer program segments capable of being executed by at least one processor and capable of performing a fixed function and is stored in memory. In the present embodiment, the functions of the modules will be described in detail in the following embodiments.
The obtaining module 201 is configured to obtain a characteristic parameter of a voice in response to an operation of a user to enter the voice.
The preprocessing (pre-emphasis, windowing and framing, endpoint detection and noise reduction) and the extraction of the Mel frequency cepstrum coefficients as characteristic parameters are the same as those described for step S11 in Example one, and are not repeated here.
The recognition module 202 is configured to recognize the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result.
The training of the target recognition model and the recognition of the voice signal features are the same as those described for step S12 in Example one, and are not repeated here.
The processing module 203 is configured to process the recognition result based on Fourier transform to obtain a first byte sequence, and to process a decision item in a preset decision tree based on Fourier transform to obtain a second byte sequence. The construction of the preset decision tree and the conversion of the recognition result and the decision items into byte sequences are the same as those described for step S13 in Example one, and are not repeated here.
The matching module 204 is configured to obtain a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by using a hamming distance (SimHash) algorithm.
The calculation of the SimHash values, the Hamming distances and the similarities, and the selection of the target decision item are the same as those described for step S14 in Example one, and are not repeated here.
The checking module 205 is configured to check whether the target decision item is correct, and execute a corresponding operation according to a result obtained by the checking.
The verification of the target decision item, the corresponding operations, and the updating of the voice library and the decision tree by the checking module 205 are the same as those described for step S15 in Example one, and are not repeated here.
EXAMPLE III
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the above-described speech processing method embodiments, such as S11-S15 shown in fig. 1:
s11, responding to the operation of the user for recording voice, and acquiring the characteristic parameters of the voice;
s12, recognizing the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result;
s13, processing the recognition result based on Fourier transform to obtain a first byte sequence, and processing a preset decision item in a decision tree based on Fourier transform to obtain a second byte sequence;
s14, obtaining a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance algorithm;
and S15, checking whether the target decision item is correct, and executing corresponding operation according to the result obtained by checking.
Alternatively, when executed by the processor, the computer program implements the functions of the modules/units in the above device embodiments, for example, modules 201 to 205 in Fig. 2:
the obtaining module 201 is configured to obtain a characteristic parameter of a voice in response to an operation of a user to enter the voice;
the recognition module 202 is configured to recognize the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result;
the processing module 203 is configured to process the recognition result based on Fourier transform to obtain a first byte sequence, and to process a decision item in a preset decision tree based on Fourier transform to obtain a second byte sequence;
the matching module 204 is configured to obtain, by using a Hamming distance algorithm, a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence;
the checking module 205 is configured to check whether the target decision item is correct, and execute a corresponding operation according to a result obtained by the checking.
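The SimHash construction behind the processing and matching modules (word segmentation, preset weights, per-feature hashing, weighting, accumulation, and dimensionality reduction, as spelled out in claim 5 below) can be sketched as follows; whitespace tokenization and MD5 hashing are assumptions made for illustration:

```python
# Sketch of a weighted SimHash fingerprint plus Hamming-distance comparison.
import hashlib

def simhash(text, weights=None, bits=64):
    tokens = text.split()                        # word segmentation -> feature vectors
    weights = weights or {}                      # preset weight per feature (default 1)
    acc = [0] * bits                             # accumulation vector
    for tok in tokens:
        w = weights.get(tok, 1)
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")    # Hash value of the feature
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w  # weighting per bit
    # dimensionality reduction: keep only the sign of each accumulated component
    return sum(1 << i for i in range(bits) if acc[i] > 0)

def hamming_distance(a, b):
    """Smaller distance means the two fingerprints are more similar."""
    return bin(a ^ b).count("1")

# Example: the decision item with the smallest distance is the best match.
query = simhash("transfer money to savings")
candidates = {"transfer to savings": simhash("transfer to savings account"),
              "report lost card": simhash("report a lost bank card")}
best = min(candidates, key=lambda k: hamming_distance(query, candidates[k]))
```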
Example Four
Fig. 3 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. In the preferred embodiment of the present invention, the electronic device 3 comprises a memory 31, at least one processor 32, at least one communication bus 33 and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the electronic device shown in fig. 3 does not limit the embodiments of the present invention; the configuration may be a bus type or a star type, and the electronic device 3 may include more or fewer hardware or software components than those shown, or a different arrangement of components.
In some embodiments, the electronic device 3 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware thereof includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 3 may also include a client device, which includes, but is not limited to, any electronic product that can interact with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, and the like.
It should be noted that the electronic device 3 is only an example; other existing or future electronic products that can be adapted to the present invention should also fall within the scope of protection of the present invention and are incorporated herein by reference.
In some embodiments, the memory 31 stores a computer program which, when executed by the at least one processor 32, implements all or part of the steps of the speech processing method described above. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
A blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
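To make the idea of cryptographically linked blocks concrete, here is a toy Python sketch; it is a teaching illustration only, not any production blockchain platform or API:

```python
# Toy hash-linked chain: each block stores the previous block's hash, so
# tampering with any block invalidates every later link.
import hashlib, json, time

def make_block(transactions, prev_hash):
    block = {"time": time.time(), "tx": transactions, "prev": prev_hash}
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

def verify_chain(chain):
    """Check that every block references the previous block's hash."""
    return all(chain[i]["prev"] == chain[i - 1]["hash"]
               for i in range(1, len(chain)))

genesis = make_block(["init"], prev_hash="0" * 64)
chain = [genesis, make_block(["tx1", "tx2"], genesis["hash"])]
assert verify_chain(chain)
```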
In some embodiments, the at least one processor 32 is the control unit of the electronic device 3: it connects the various components of the electronic device 3 through various interfaces and lines, and executes the various functions and processes the data of the electronic device 3 by running or executing the programs or modules stored in the memory 31 and calling the data stored in the memory 31. For example, the at least one processor 32, when executing the computer program stored in the memory, implements all or part of the steps of the speech processing method described in the embodiments of the present invention, or implements all or part of the functions of the speech processing apparatus. The at least one processor 32 may consist of a single integrated circuit, for example a single packaged integrated circuit, or of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital signal processing chips, graphics processors, and combinations of various control chips.
In some embodiments, the at least one communication bus 33 is arranged to enable connection and communication between the memory 31, the at least one processor 32, and other components.
Although not shown, the electronic device 3 may further include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so that functions such as managing charging, discharging, and power consumption are implemented through the power management device. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and the like. The electronic device 3 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, an electronic device, or a network device) or a processor to execute parts of the methods according to the embodiments of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from its spirit or essential attributes. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description; all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other elements, and the singular does not exclude the plural. A plurality of units or means recited in the specification may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names and do not imply any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them; although the present invention is described in detail with reference to preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from their spirit and scope.
Claims (10)
1. A method of speech processing, the method comprising:
responding to the operation of inputting voice by a user, and acquiring the characteristic parameters of the voice;
recognizing the characteristic parameters by using a pre-trained target recognition model to obtain a recognition result;
processing the recognition result based on Fourier transform to obtain a first byte sequence, and processing a preset decision item in a decision tree based on Fourier transform to obtain a second byte sequence;
acquiring a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance algorithm;
and checking whether the target decision item is correct, and executing a corresponding operation according to the result obtained by checking.
2. The speech processing method of claim 1, wherein the recognizing the feature parameters by using a pre-trained target recognition model to obtain a recognition result comprises:
inputting Mel-frequency cepstral coefficients of the voice signal into the target recognition model;
and obtaining a character string sequence output by the target recognition model.
3. The speech processing method of claim 2, wherein the processing the recognition result based on Fourier transform to obtain a first byte sequence and the processing the decision item in the preset decision tree based on Fourier transform to obtain a second byte sequence comprises:
acquiring a first acoustic waveform corresponding to the character string sequence of the recognition result, and a second acoustic waveform corresponding to each decision item in the decision tree;
transforming the first acoustic waveform into the first byte sequence and the second acoustic waveform into the second byte sequence based on a Fourier transform.
4. The speech processing method of claim 1, wherein the obtaining a target decision item from the decision tree that matches the recognition result based on the first byte sequence and the second byte sequence using a hamming distance algorithm comprises:
calculating a first SimHash value of the first byte sequence and a second SimHash value of the second byte sequence by the Hamming distance algorithm;
calculating a Hamming distance between the first SimHash value and the second SimHash value as a measure of the similarity between the first byte sequence and the second byte sequence;
and acquiring, from the decision tree, the decision item whose second byte sequence has the smallest Hamming distance to the first byte sequence (that is, the highest similarity) as the target decision item.
5. The speech processing method of claim 4, wherein the calculating a first SimHash value of the first byte sequence by the Hamming distance algorithm comprises:
performing word segmentation on the first byte sequence to obtain a plurality of feature vectors of the first byte sequence;
setting a preset weight for each feature vector;
calculating a Hash value of each feature vector through a Hash function;
weighting the Hash value of each feature vector based on its weight to obtain a weighted result;
accumulating the weighted results of all the feature vectors to obtain an accumulated result;
and reducing the dimension of the accumulated result to obtain the first SimHash value.
6. The speech processing method of claim 1, wherein performing the corresponding operation according to the verification result comprises:
when the verification result indicates that the target decision item is correct, executing the operation corresponding to the target decision item;
when the verification result indicates that the target decision item is incorrect, receiving the voice input by the user again, sending the two voice inputs of the user to a human customer service agent, acquiring a first operation of the human customer service agent, and providing the user, based on the first operation, with a processing method that meets the user's requirement, wherein the first operation comprises: selecting, from the decision tree, the decision item that meets the user's requirement.
7. The speech processing method of claim 6, wherein the method further comprises:
when the target decision item is determined to be incorrect and no decision item in the decision tree meets the user's requirement, acquiring a second operation of the human customer service agent, and updating the voice library and the decision tree according to the second operation, wherein the second operation comprises: entering the two voice inputs.
8. A speech processing apparatus, characterized in that the apparatus comprises:
the obtaining module is used for responding to the operation of inputting voice by a user and acquiring the characteristic parameters of the voice;
the recognition module is used for recognizing the characteristic parameters by utilizing a pre-trained target recognition model to obtain a recognition result;
the processing module is used for processing the recognition result based on Fourier transform to obtain a first byte sequence and processing a preset decision item in a decision tree based on Fourier transform to obtain a second byte sequence;
the matching module is used for acquiring a target decision item matched with the recognition result from the decision tree based on the first byte sequence and the second byte sequence by utilizing a Hamming distance algorithm;
and the checking module is used for checking whether the target decision item is correct or not and executing corresponding operation according to a result obtained by checking.
9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the processor being configured to implement the speech processing method according to any of claims 1 to 7 when executing a computer program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the speech processing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210147479.6A CN114446283A (en) | 2022-02-17 | 2022-02-17 | Voice processing method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210147479.6A CN114446283A (en) | 2022-02-17 | 2022-02-17 | Voice processing method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114446283A true CN114446283A (en) | 2022-05-06 |
Family
ID=81374462
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210147479.6A Pending CN114446283A (en) | 2022-02-17 | 2022-02-17 | Voice processing method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114446283A (en) |
- 2022-02-17: CN application CN202210147479.6A filed; published as CN114446283A (en); status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109817246B (en) | Emotion recognition model training method, emotion recognition device, emotion recognition equipment and storage medium | |
CN107680582B (en) | Acoustic model training method, voice recognition method, device, equipment and medium | |
CN111276131B (en) | Multi-class acoustic feature integration method and system based on deep neural network | |
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
CN111712874B (en) | Method, system, device and storage medium for determining sound characteristics | |
CN107610707B (en) | A kind of method for recognizing sound-groove and device | |
CN112562691B (en) | Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium | |
CN109087648B (en) | Counter voice monitoring method and device, computer equipment and storage medium | |
JP5853029B2 (en) | Passphrase modeling device and method for speaker verification, and speaker verification system | |
WO2019037205A1 (en) | Voice fraud identifying method and apparatus, terminal device, and storage medium | |
KR100800367B1 (en) | Sensor based speech recognizer selection, adaptation and combination | |
CN107731233B (en) | Voiceprint recognition method based on RNN | |
CN102509547B (en) | Method and system for voiceprint recognition based on vector quantization based | |
US5638486A (en) | Method and system for continuous speech recognition using voting techniques | |
US5596679A (en) | Method and system for identifying spoken sounds in continuous speech by comparing classifier outputs | |
CN112885336B (en) | Training and recognition method and device of voice recognition system and electronic equipment | |
JPWO2019102884A1 (en) | Label generators, model learning devices, emotion recognition devices, their methods, programs, and recording media | |
CN112259089B (en) | Speech recognition method and device | |
CN109036471B (en) | Voice endpoint detection method and device | |
KR20200104019A (en) | Machine learning based voice data analysis method, device and program | |
CN111933148A (en) | Age identification method and device based on convolutional neural network and terminal | |
JP3014177B2 (en) | Speaker adaptive speech recognition device | |
CN113793615A (en) | Speaker recognition method, model training method, device, equipment and storage medium | |
JP3920749B2 (en) | Acoustic model creation method for speech recognition, apparatus thereof, program thereof and recording medium thereof, speech recognition apparatus using acoustic model | |
CN115631748A (en) | Emotion recognition method and device based on voice conversation, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||