WO2020140609A1

WO2020140609A1 - Voice recognition method and device and computer readable storage medium

Info

Publication number: WO2020140609A1
Application number: PCT/CN2019/116979
Authority: WO
Inventors: 贾雪丽; 程宁; 王健宗
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-01-04
Filing date: 2019-11-11
Publication date: 2020-07-09
Also published as: CN109545226A; CN109545226B

Abstract

A voice recognition method and device and a computer readable storage medium. The method comprises: obtaining a first digital voice signal to be detected, wherein the first digital voice signal is composed of digital ciphers, and the digital ciphers are composed of multiple digits (S101); performing preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, wherein each second digital voice signal is determined by a digit (S102); processing the second digital voice signals according to a preset signal processing method, determining logarithmic Mel power spectra corresponding to the second digital voice signals, and extracting target feature information of the second digital voice signals from the logarithmic Mel power spectra (S103); recognizing the target feature information of the second digital voice signals on the basis of a neural network model so as to obtain target digits corresponding to the second digital voice signals (S104); and determining a target digital cipher corresponding to the first digital voice signal according to the target digits corresponding to the second digital voice signals (S105). The method improves the performance and validity of voice recognition.

Description

Voice recognition method, device and computer readable storage medium

This application claims priority to Chinese patent application No. 201910014557.3, filed on January 4, 2019 and entitled "A speech recognition method, device and computer-readable storage medium", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of voice recognition technology, and in particular, to a voice recognition method, device, and computer-readable storage medium.

Background technique

The identity-vector (i-vector) based speaker recognition system is a classic method for solving text-independent speaker recognition problems. In recent years, however, deep learning has drawn increasing attention in this field. Deep learning methods for acoustic problems fall into two categories: (1) a deep neural network (DNN) connected to a hidden Markov model (HMM) to train Baum-Welch statistics; and (2) a training method that combines bottleneck features with Mel-frequency cepstral coefficient (MFCC) features. Since text-dependent problems largely build on text-independent ones, a DNN can also be used to solve text-dependent speaker recognition problems. However, using a DNN to extract features for direct speaker discrimination performs poorly. Improving the performance and effectiveness of speaker recognition systems has therefore become a focus of research.

Summary of the invention

Embodiments of the present application provide a voice recognition method, device, and computer-readable storage medium, which can improve the performance and effectiveness of a voice recognition system.

In a first aspect, an embodiment of the present application provides a voice recognition method. The method includes:

Acquiring a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple numbers;

Performing preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, wherein each second digital voice signal is determined by a number;

Processing each of the second digital voice signals according to a preset signal processing method, determining a log-mel power spectrum corresponding to each of the second digital voice signals, and extracting target feature information of each of the second digital voice signals from the log-mel power spectrum;

Identifying target feature information of each of the second digital voice signals based on a neural network model to obtain target numbers corresponding to each of the second digital voice signals;

According to the target number corresponding to each of the second digital voice signals, a target digital password corresponding to the first digital voice signal is determined.

In a second aspect, an embodiment of the present application provides a voice recognition device including a unit for performing the method of the first aspect.

In a third aspect, an embodiment of the present application provides another voice recognition device, including a processor, an input device, an output device, and a memory, where the processor, input device, output device, and memory are connected to each other. The memory is used to store a computer program that supports the voice recognition device in performing the above method; the computer program includes program instructions, and the processor is configured to call the program instructions to perform the method of the first aspect.

According to a fourth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program, where the computer program includes program instructions which, when executed by a processor, cause the processor to implement the method of the first aspect described above.

In the embodiment of the present application, the second digital voice signal is obtained by dividing the first digital voice signal, and the target digital password is determined according to the target number obtained by processing the second digital voice signal, so as to effectively recognize the text-independent voice signal.

Brief description of the drawings

FIG. 1 is a schematic flowchart of a voice recognition method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of another voice recognition method provided by an embodiment of the present application;

FIG. 3 is a schematic block diagram of a voice recognition device provided by an embodiment of the present application;

FIG. 4 is a schematic block diagram of another voice recognition device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, but not all the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.

The voice recognition method provided by the embodiments of the present application may be executed by a voice recognition device. In some embodiments, the voice recognition device may be provided on smart terminals such as mobile phones, computers, tablets, and smart watches. In some embodiments, the voice recognition device may be installed on a smart terminal. In some embodiments, the voice recognition device may be spatially independent from the smart terminal. In some embodiments, the voice recognition device may be a component of the smart terminal, that is, the smart terminal includes a voice recognition device.

In one embodiment, the voice recognition device may acquire the first digital voice signal to be detected. In some embodiments, the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits. After acquiring the first digital voice signal, the voice recognition device may perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals; in some embodiments, each second digital voice signal is determined by one digit. After obtaining the second digital voice signals, the voice recognition device may process each of them by a preset signal processing method to obtain the log-mel power spectrum corresponding to each second digital voice signal, and extract target feature information of each second digital voice signal from the log-mel power spectrum. The voice recognition device may recognize the target feature information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal, and determine the target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal. The voice recognition method according to the embodiment of the present application will be schematically described below.

For details, please refer to FIG. 1, which is a schematic flowchart of a voice recognition method provided by an embodiment of the present application. As shown in FIG. 1, the method may be performed by a voice recognition device; the voice recognition device is as described above and will not be described again here. Specifically, the method in the embodiment of the present application includes the following steps.

S101: Acquire a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits.

In the embodiment of the present application, the voice recognition device may acquire the first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits. In some embodiments, the digital password is composed of any one or more digits from 0 to 9; for example, the digital password may be a string of digits from 0 to 9 spoken by the speaker as a voice signal.

S102: Perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, where each second digital voice signal is determined by a number.

In the embodiment of the present application, the voice recognition device may perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals. In some embodiments, each of the second digital voice signals is determined by a number.

In one embodiment, the voice recognition device may perform preset segmentation processing on the first digital voice signal through an HMM to divide the first digital voice signal into a plurality of mutually independent second digital voice signals. In some embodiments, the voice recognition device may use the HMM-based segmentation method to split the first digital voice signal, which is composed of a digital password, into second digital voice signals that each correspond to an independent digit, so that the neural network model can recognize the second digital voice signals.

In one embodiment, when performing preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, the voice recognition device may record the order in which each second digital voice signal obtained by the segmentation is arranged in the first digital voice signal. Once the target number corresponding to each second digital voice signal has been identified, the order of the target numbers can be determined from the recorded order, and the target digital password corresponding to the first digital voice signal can be formed according to the order of the target numbers.
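The source does not specify the HMM topology or its emission model. As a rough illustration only, the sketch below assumes a two-state HMM (silence and speech) with Gaussian emissions over per-frame log energy, decoded with the Viterbi algorithm; each contiguous run of speech frames is returned together with its position in the original signal. The function names, frame sizes, and emission parameters are all hypothetical.

```python
import numpy as np

def frame_log_energy(signal, frame_len=400, hop=160):
    """Log energy per frame (assumed 25 ms frames, 10 ms hop at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def viterbi_two_state(obs, means, stds, stay_prob=0.9):
    """Most likely silence (0) / speech (1) state path under Gaussian emissions."""
    log_trans = np.log(np.array([[stay_prob, 1 - stay_prob],
                                 [1 - stay_prob, stay_prob]]))
    # Log-likelihood of every observation under each state's Gaussian.
    log_emit = (-0.5 * ((obs[:, None] - means) / stds) ** 2
                - np.log(stds * np.sqrt(2 * np.pi)))
    n = len(obs)
    score = np.zeros((n, 2))
    back = np.zeros((n, 2), dtype=int)
    score[0] = np.log(0.5) + log_emit[0]
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_trans  # cand[previous, current]
        back[t] = np.argmax(cand, axis=0)
        score[t] = np.max(cand, axis=0) + log_emit[t]
    path = np.zeros(n, dtype=int)
    path[-1] = np.argmax(score[-1])
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

def segment_digits(signal, frame_len=400, hop=160):
    """Split a signal into speech segments, returned as (order, samples) pairs."""
    energy = frame_log_energy(signal, frame_len, hop)
    # Crude emission parameters (assumption): low and high energy percentiles
    # stand in for the silence and speech means.
    means = np.array([np.percentile(energy, 10), np.percentile(energy, 90)])
    stds = np.full(2, max(energy.std(), 1e-3))
    path = viterbi_two_state(energy, means, stds)
    segments, start = [], None
    for t, state in enumerate(np.append(path, 0)):  # trailing 0 closes an open run
        if state == 1 and start is None:
            start = t
        elif state == 0 and start is not None:
            segments.append(signal[start * hop:(t - 1) * hop + frame_len])
            start = None
    return list(enumerate(segments))
```

Keeping the order index next to each segment is what later allows the recognized digits to be reassembled in their spoken order.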

S103: Process each second digital voice signal according to a preset signal processing method, determine a log-mel power spectrum corresponding to each second digital voice signal, and extract the target feature information of each second digital voice signal from the log-mel power spectrum.

In the embodiment of the present application, the voice recognition device may process each of the divided second digital voice signals according to a preset signal processing method to obtain a log-mel power spectrum corresponding to each of the second digital voice signals, and extract target feature information of each second digital voice signal from the log-mel power spectrum.

In an embodiment, when processing each of the divided second digital voice signals according to the preset signal processing method to obtain the log-mel power spectrum corresponding to each second digital voice signal, the voice recognition device may: frame and window each second digital voice signal to obtain the voice frames corresponding to that signal; perform a fast Fourier transform on those voice frames to obtain the spectral signal of the voice frames; and convert the spectral signal of the voice frames into log-mel spectral power to obtain the log-mel power spectrum corresponding to each second digital voice signal.
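The framing, windowing, FFT, and mel conversion chain can be sketched with NumPy alone. The sample rate, frame length, hop, FFT size, Hamming window, triangular filterbank, and the particular hertz-to-mel formula are assumptions chosen for illustration; the source only fixes the sequence of steps.

```python
import numpy as np

def hz_to_mel(f):
    # One common hertz-to-mel formula (assumed; the source does not name one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m:m + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_power_spectrum(signal, sample_rate=16000, frame_len=400, hop=160,
                           n_fft=512, n_mels=64):
    """Return the log-mel power spectrum as an array of shape (n_frames, n_mels)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)          # windowing
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)  # FFT of each frame
    power = np.abs(spectrum) ** 2                    # squared magnitude
    mel_power = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_power + 1e-10)                 # small offset avoids log(0)
```

For a one-second signal at 16 kHz these defaults give 98 frames of 64 mel bands each.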

In some embodiments, the log-mel power spectrum refers to the logarithm of the power values on the mel scale. In some embodiments, the mel scale is a non-linear frequency scale based on the human ear's perceptual judgment of equally spaced pitch changes (that is, a frequency in hertz can be converted to mels by a formula).
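The source does not state which conversion formula is used. One widely used form, given here as an assumption, maps a frequency f in hertz to a mel value m and back as follows:

```latex
m = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
f = 700 \left(10^{\,m/2595} - 1\right)
```

Under this form, 1000 Hz corresponds to approximately 1000 mel, and equal steps in m cover progressively wider ranges of f as frequency rises.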

In one embodiment, when extracting the target feature information of each second digital voice signal from the log-mel power spectrum, the voice recognition device may normalize the log-mel power spectrum of the second digital voice signal corresponding to each digit to obtain the feature information of that second digital voice signal. Here, the normalization process refers to normalizing the log-mel power spectrum features of the second digital voice signal corresponding to each digit, which facilitates subsequent processing by the neural network model and accelerates convergence. In some embodiments, the voice recognition device may use the log-mel power spectrum corresponding to the second digital voice signal as the input feature; in some embodiments, the input feature spans 64 frequency bands in the frequency domain and 96 frames in the time domain (equal to the longest pronunciation time of a digit).
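The source does not say which normalization is applied or how shorter digits are brought to 96 frames. The sketch below assumes zero-mean, unit-variance normalization over the whole spectrum, followed by truncation or zero-padding along the time axis; `to_input_feature` is a hypothetical helper name.

```python
import numpy as np

def to_input_feature(log_mel, n_frames=96, eps=1e-8):
    """Normalize a (frames, bands) log-mel array and fit it to a fixed frame count."""
    normalized = (log_mel - log_mel.mean()) / (log_mel.std() + eps)
    frames, bands = normalized.shape
    if frames >= n_frames:
        return normalized[:n_frames]
    # Zero-padding is an assumption; after normalization, zero equals the mean.
    padding = np.zeros((n_frames - frames, bands))
    return np.vstack([normalized, padding])
```

With 64 mel bands the result is always a 96 x 64 matrix, matching the input size described above.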

In one embodiment, when converting the spectral signal of the voice frame corresponding to each of the second digital voice signals into log-mel spectral power, the voice recognition device may take the absolute value of the spectral signal of the voice frame corresponding to each second digital voice signal to obtain the power spectrum of that voice frame, and then convert the power spectrum of the voice frame corresponding to each second digital voice signal into a log-mel power spectrum.

In one embodiment, when converting the spectral signal of the voice frame corresponding to each of the second digital voice signals into log-mel spectral power, the voice recognition device may take the square value of the spectral signal of the voice frame corresponding to each second digital voice signal to obtain the power spectrum of that voice frame, and then convert the power spectrum of the voice frame corresponding to each second digital voice signal into a log-mel power spectrum.

S104: Identify target feature information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal.

In the embodiment of the present application, the voice recognition device may recognize target feature information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal.

In some embodiments, the neural network model may be composed of a preset convolutional neural network, wherein the preset convolutional neural network has an MFM-CNN structure in which a Max-Feature-Map (MFM) activation function is applied to the feature map obtained by the convolution layer. In its standard form, the MFM activation function is as follows:

$$y_{ij}^{k} = \max\left(x_{ij}^{k},\; x_{ij}^{k+N/2}\right)$$

where x is the input tensor of the MFM layer, of size W x H x N; y is the output tensor, of size W x H x N/2; i and j index the frequency and time domains; and k is the channel index.

Here, the convolutional layer extracts features, the MFM layer serves as the activation function layer that performs a nonlinear transformation, and the pooling layer provides translation invariance and reduces the number of parameters. The input to the network is the log-mel power spectrum; that is, the input is a matrix, which is fed into the neural network for training.

In one embodiment, the MFM-CNN structure is composed of multiple layer groups. The first group receives the log-mel power spectrum as the input feature, and each group consists of a convolution layer followed by an MFM layer and a pooling layer. These layer groups are stacked together and followed by a fully connected layer to generate an embedding layer.
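Assuming the MFM layer follows the standard Max-Feature-Map definition (an element-wise maximum over the two halves of the channel axis), the activation and pooling stages of one layer group can be sketched in NumPy. The 2x2 pooling size is an assumption, and the convolution itself is omitted; both function names are illustrative.

```python
import numpy as np

def max_feature_map(x):
    """MFM: element-wise max over the two halves of the channel axis.

    x has shape (W, H, N) with N even; the result has shape (W, H, N // 2).
    """
    n = x.shape[-1]
    if n % 2 != 0:
        raise ValueError("channel count must be even")
    return np.maximum(x[..., : n // 2], x[..., n // 2:])

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling over the first two axes."""
    w, h, c = x.shape
    x = x[: w - w % 2, : h - h % 2]  # drop an odd trailing row or column
    return x.reshape(w // 2, 2, h // 2, 2, c).max(axis=(1, 3))
```

MFM halves the channel count while pooling halves each spatial dimension, so a 96 x 64 x 32 convolution output would shrink to 48 x 32 x 16 after one group.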

S105: Determine a target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal.

In the embodiment of the present application, the voice recognition device may determine the target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal.

In one embodiment, the voice recognition device may determine the order of the target digits according to the recorded order of each second digital voice signal in the first digital voice signal, and form the target digital password corresponding to the first digital voice signal according to the order of the target digits.
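Reassembling the password from the recorded order amounts to a sort followed by a join. In this minimal sketch, the `(order_index, digit)` pair representation and the function name are assumptions made for illustration.

```python
def assemble_password(recognized):
    """Join recognized digits in their recorded order.

    `recognized` is an iterable of (order_index, digit) pairs in any order.
    """
    return "".join(str(digit) for _, digit in sorted(recognized))
```

For example, `assemble_password([(2, 7), (0, 3), (1, 0)])` yields `"307"`.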

In the embodiment of the present application, the voice recognition device may divide the first digital voice signal to obtain second digital voice signals, and recognize the second digital voice signals to obtain a target digital password, so as to effectively recognize text-independent voice signals.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of another voice recognition method provided by an embodiment of the present application. As shown in FIG. 2, the method may be performed by a voice recognition device; the voice recognition device is as described above and will not be described again here. The embodiment of FIG. 2 differs from the embodiment described in FIG. 1 above in that it mainly describes the detailed implementation process. Specifically, the method in the embodiment of the present application includes the following steps.

S201: Obtain a training sample set, where the training sample set includes target feature information of a sample digital voice signal, and each sample digital voice signal is determined by a number.

In the embodiment of the present application, the voice recognition device may acquire a training sample set, where the training sample set includes target feature information of sample digital voice signals, and each sample digital voice signal is determined by a number.

S202: Train and optimize the initial neural network model based on target feature information of each sample digital voice signal in the training sample set to obtain the neural network model.

In the embodiment of the present application, the voice recognition device may generate an initial neural network model according to a preset neural network algorithm, use the target feature information of the sample digital voice signals as input data, train and optimize the initial neural network model based on the target feature information of each sample digital voice signal in the training sample set, and output the number corresponding to the target feature information, thereby obtaining the neural network model. The explanation of the neural network model is as described above, and will not be repeated here.

S203: Acquire a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits.

In the embodiment of the present application, the voice recognition device may acquire the first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits. The explanation of the digital password is as described above, and will not be repeated here.

S204: Perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, where each second digital voice signal is determined by a number.

In the embodiment of the present application, the voice recognition device may perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, and each of the second digital voice signals is determined by one digit. The specific embodiments are as described above, and will not be repeated here.

S205: Process each second digital voice signal according to a preset signal processing method, determine a log-mel power spectrum corresponding to each second digital voice signal, and extract target feature information of each second digital voice signal from the log-mel power spectrum.

In the embodiment of the present application, the voice recognition device may process each of the divided second digital voice signals according to a preset signal processing method to obtain a log-mel power spectrum corresponding to each of the second digital voice signals, and extract target feature information of each of the second digital voice signals from the log-mel power spectrum. Specific embodiments are as described above, and will not be repeated here.

S206: Calculate the similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set.

In the embodiment of the present application, when recognizing the target feature information of each second digital voice signal based on the neural network model, the voice recognition device may calculate the similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set, so that the target sample digital voice signal can subsequently be determined according to the similarity.

S207: Obtain target feature information of at least one sample digital voice signal with a similarity greater than a preset similarity threshold, and determine the target sample with the largest similarity from the target feature information of the at least one sample digital voice signal Target feature information of digital voice signals.

In the embodiment of the present application, after calculating the similarity between the second digital voice signal and each sample digital voice signal in the training sample set, the voice recognition device may acquire the target feature information of at least one sample digital voice signal whose similarity is greater than a preset similarity threshold, and determine, from the target feature information of the at least one sample digital voice signal, the target feature information of the target sample digital voice signal with the greatest similarity.

S208: Determine the target number corresponding to the target feature information of the target sample digital voice signal according to the preset correspondence between the target feature information of the sample digital voice signal and the number.

In the embodiment of the present application, after determining the target sample digital voice signal with the highest similarity, the voice recognition device may determine, according to the preset correspondence between the target feature information of the sample digital voice signals and the numbers, the target number corresponding to the target feature information of the target sample digital voice signal.

In one embodiment, the voice recognition device may also determine the similarity between the first digital voice signal to be detected and the target digital voice signal by calculating the cosine similarity between them, and determine the number whose cosine similarity is greater than a preset threshold as the target number corresponding to the target digital voice signal.
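Steps S206 to S208 can be sketched as a nearest-sample lookup. The sketch assumes cosine similarity as the similarity measure (the source names it only for one embodiment), flattened feature matrices, and a hypothetical threshold value; `recognize_digit` and its parameters are illustrative names.

```python
import numpy as np

def recognize_digit(feature, sample_features, sample_digits, threshold=0.5):
    """Return the digit of the most similar sample above `threshold`, else None."""
    query = feature.ravel()
    samples = sample_features.reshape(len(sample_features), -1)
    # Cosine similarity between the query and every sample (S206).
    sims = samples @ query / (
        np.linalg.norm(samples, axis=1) * np.linalg.norm(query) + 1e-12)
    # Keep only samples above the preset similarity threshold (S207).
    candidates = np.flatnonzero(sims > threshold)
    if candidates.size == 0:
        return None
    best = candidates[np.argmax(sims[candidates])]
    # Map the best sample to its digit via the preset correspondence (S208).
    return sample_digits[best]
```

Returning `None` when no sample clears the threshold lets the caller count unrecognized digits separately from recognized ones.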

In one embodiment, after determining the target number corresponding to the target digital voice signal, the voice recognition device may obtain the number of target numbers corresponding to the target digital voice signal and the number of digits corresponding to the first digital voice signal to be detected, and calculate the ratio between the two, so that the probability that the first digital voice signal to be detected is successfully detected can be determined according to this quantity ratio. The voice recognition device may check whether the probability is less than a preset threshold and, if so, select a sample training set similar to the first digital voice signal to adjust the training of the neural network model. This optimizes the neural network model in real time and further improves the performance and effectiveness of voice signal recognition.
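The ratio check can be sketched as follows. The sketch assumes that the quantity ratio is used directly as the success probability and that unrecognized digits are represented as `None`; the threshold value and the function name are hypothetical.

```python
def needs_retraining(recognized_digits, expected_count, threshold=0.8):
    """Return (ratio, retrain_flag) for one detected password.

    `recognized_digits` holds one entry per segment, with None for a segment
    that produced no target number.
    """
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    recognized = sum(d is not None for d in recognized_digits)
    ratio = recognized / expected_count
    return ratio, ratio < threshold
```

A password with three of four digits recognized gives a ratio of 0.75, which falls below the assumed threshold of 0.8 and would therefore trigger a training adjustment.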

S209: Determine a target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal.

An embodiment of the present application further provides a voice recognition device, which includes units for performing the method described in any one of the foregoing. Specifically, referring to FIG. 3, FIG. 3 is a schematic block diagram of a voice recognition device according to an embodiment of the present application. The voice recognition device of this embodiment includes: an acquisition unit 301, a segmentation processing unit 302, a preprocessing unit 303, a recognition unit 304, and a determination unit 305.

The acquisition unit 301 is configured to acquire a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple numbers;

The segmentation processing unit 302 is configured to perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, wherein each second digital voice signal is determined by a number;

The pre-processing unit 303 is configured to process each of the second digital voice signals according to a preset signal processing method, determine a log-mel power spectrum corresponding to each of the second digital voice signals, and extract target feature information of each of the second digital voice signals from the log-mel power spectrum;

The recognition unit 304 is configured to recognize target feature information of each of the second digital voice signals based on a neural network model to obtain target numbers corresponding to each of the second digital voice signals;

The determining unit 305 is configured to determine the target digital password corresponding to the first digital voice signal according to the target number corresponding to each of the second digital voice signals.
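One way to picture the unit decomposition is as a pipeline of interchangeable callables. The class below is only an illustrative sketch: the field names, type signatures, and the choice to model units as plain functions are assumptions, not part of the described device.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class VoiceRecognitionDevice:
    acquire: Callable[[], Sequence[float]]                       # unit 301
    segment: Callable[[Sequence[float]], List[Sequence[float]]]  # unit 302
    preprocess: Callable[[Sequence[float]], object]              # unit 303
    recognize: Callable[[object], int]                           # unit 304

    def run(self) -> str:
        signal = self.acquire()
        segments = self.segment(signal)  # list order is the recorded order
        features = [self.preprocess(s) for s in segments]
        digits = [self.recognize(f) for f in features]
        return "".join(str(d) for d in digits)  # role of unit 305
```

Each field can be filled with any implementation that matches its signature, which keeps the segmentation, feature extraction, and recognition stages independently replaceable.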

Further, before the acquiring unit 301 acquires the first digital voice signal to be detected, it is also used to:

Acquiring a training sample set, the training sample set includes target feature information of a sample digital voice signal, wherein each sample digital voice signal is determined by a number;

Generate the initial neural network model according to the preset neural network algorithm;

The initial neural network model is trained and optimized based on the target feature information of each sample digital speech signal in the training sample set to obtain the neural network model.

Further, when the recognition unit 304 recognizes the target feature information of each second digital voice signal based on the neural network model, and obtains the target number corresponding to each second digital voice signal, it is specifically used to:

Calculating the similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set;

Acquiring target feature information of at least one sample digital voice signal whose similarity is greater than a preset similarity threshold;

Determining the target feature information of the target sample digital voice signal with the largest similarity from the target feature information of the at least one sample digital voice signal;

The target number corresponding to the target feature information of the target sample digital voice signal is determined according to the correspondence between the target feature information of the sample digital voice signal and the number.

Further, after the determining unit 305 determines the target number corresponding to the target feature information of the target sample digital voice signal, it is also used to:

Acquiring the number of target digits corresponding to the target feature information of the target sample digital voice signal, and acquiring the number of digits corresponding to the first digital voice signal to be detected;

Calculating a quantity ratio between the number of target digits and the number of digits corresponding to the first digital voice signal;

Determine the probability that the first digital voice signal to be detected is successfully detected according to the quantity ratio;

Determine whether the probability is less than a preset threshold;

If the judgment result is yes, a sample training set similar to the first digital speech signal is selected to perform training adjustment on the neural network model.

Further, when the pre-processing unit 303 processes each of the second digital voice signals according to a preset signal processing method to determine the log-mel power spectrum corresponding to each of the second digital voice signals, it is specifically used for:

Performing fast Fourier transform on the speech frame corresponding to each of the second digital speech signals to obtain the spectrum signal of the speech frame corresponding to each of the second digital speech signals;

Converting the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log mel spectrum power to obtain the log mel power spectrum corresponding to each of the second digital speech signals.

Further, when the pre-processing unit 303 converts the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log-mel spectrum power, it is specifically used for:

Taking the absolute value of the spectrum signal of the speech frame corresponding to each of the second digital speech signals to obtain the power spectrum of the speech frame corresponding to each of the second digital speech signals; or

Taking the square value of the spectrum signal of the speech frame corresponding to each of the second digital speech signals to obtain the power spectrum of the speech frame corresponding to each of the second digital speech signals;

Converting the power spectrum of the speech frame corresponding to each of the second digital speech signals into a log-mel power spectrum.
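
The two alternatives for obtaining the power spectrum (absolute value or square value of the spectrum signal) can be contrasted in a short sketch. It assumes that the spectrum signal is a list of complex FFT bins and that "square value" means the squared magnitude of each bin; the function names and the sample bins are hypothetical.

```python
def magnitude_spectrum(spectrum):
    """Absolute value of each complex spectrum bin."""
    return [abs(x) for x in spectrum]

def squared_spectrum(spectrum):
    """Squared magnitude of each complex spectrum bin (assumed meaning of 'square value')."""
    return [abs(x) ** 2 for x in spectrum]

bins = [3 + 4j, 0 + 1j, -2 + 0j]
print(magnitude_spectrum(bins))  # [5.0, 1.0, 2.0]
print(squared_spectrum(bins))    # [25.0, 1.0, 4.0]
```

After the logarithm is taken, the two variants differ only by a factor of two per bin, so either can feed the subsequent log-mel conversion.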

Further, when the segmentation processing unit 302 performs preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, it is specifically used for:

Pre-segmenting the first digital voice signal through a hidden Markov model (HMM) to divide the first digital voice signal into a plurality of mutually independent second digital voice signals, and recording the order in which the second digital voice signals obtained by the division are arranged in the first digital voice signal.
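
One way to picture HMM-based pre-segmentation is a two-state model (silence and speech) decoded with the Viterbi algorithm over per-frame log energies. The source does not describe the structure of its HMM, so the two states, the Gaussian emission parameters, the transition probability, and the function names below are all assumptions made for illustration.

```python
import math

def gaussian_log_pdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)

def viterbi_speech_states(log_energies, stay_prob=0.9,
                          silence=(-8.0, 2.0), speech=(0.0, 2.0)):
    """Most likely silence (0) / speech (1) state per frame for a 2-state HMM."""
    if not log_energies:
        return []
    params = [silence, speech]  # assumed (mean, std) of log energy per state
    log_stay, log_switch = math.log(stay_prob), math.log(1.0 - stay_prob)
    scores = [math.log(0.5) + gaussian_log_pdf(log_energies[0], *params[s])
              for s in (0, 1)]
    back = []
    for x in log_energies[1:]:
        new_scores, pointers = [], []
        for s in (0, 1):
            options = [scores[p] + (log_stay if p == s else log_switch)
                       for p in (0, 1)]
            best_prev = 0 if options[0] >= options[1] else 1
            new_scores.append(options[best_prev] + gaussian_log_pdf(x, *params[s]))
            pointers.append(best_prev)
        scores = new_scores
        back.append(pointers)
    # Backtrack from the best final state.
    state = 0 if scores[0] >= scores[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def segments_in_order(states):
    """Return (order_index, start_frame, end_frame_exclusive) for each speech run."""
    segments, start = [], None
    for i, s in enumerate(states + [0]):
        if s == 1 and start is None:
            start = i
        elif s == 0 and start is not None:
            segments.append((len(segments), start, i))
            start = None
    return segments

energies = [-8.0, -8.0, 0.0, 0.0, 0.0, -8.0, -8.0, 0.0, 0.0, -8.0]
states = viterbi_speech_states(energies)
print(segments_in_order(states))
```

The `order_index` stored with each segment plays the role of the recorded arrangement order, so the recognized numbers can later be put back in sequence.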

Further, when the determining unit 305 determines the target digital password corresponding to the first digital voice signal according to the target number corresponding to each of the second digital voice signals, it is specifically used to:

According to the recorded order in which the second digital voice signals are arranged in the first digital voice signal, determining the order in which the target numbers corresponding to the second digital voice signals are arranged;

According to the order in which the target numbers are arranged, a target digital password corresponding to the first digital voice signal is determined.
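
Reassembling the password from the recorded order can be sketched in a few lines. The `(order_index, digit)` pair layout and the function name are assumptions used only for illustration.

```python
def assemble_password(recognized):
    """Join recognized digits according to their recorded position.

    recognized is a hypothetical list of (order_index, digit) pairs, where
    order_index is the position of the segment in the first digital voice signal.
    """
    ordered = sorted(recognized, key=lambda item: item[0])
    return "".join(str(digit) for _, digit in ordered)

print(assemble_password([(2, 9), (0, 4), (1, 1)]))  # 419
```

Sorting by the recorded index makes the result independent of the order in which the individual segments happened to be recognized.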

Further, when the pre-processing unit 303 extracts target feature information of each second digital voice signal from the log-mel power spectrum, it is specifically used to:

Performing normalization processing on the log-mel power spectrum corresponding to each of the second digital voice signals to obtain target feature information of each of the second digital voice signals, where the normalization processing refers to normalizing the log-mel power spectrum features of each of the second digital voice signals.
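
The source does not say which normalization is applied. As one plausible reading, the sketch below normalizes each mel band to zero mean and unit variance across the frames of a segment; this choice, the list-of-frames layout, and the function name are assumptions.

```python
import math

def normalize_log_mel(log_mel):
    """Per-band mean/variance normalization of a log-mel power spectrum.

    log_mel is a list of frames, each a list of mel-band values.
    """
    num_frames = len(log_mel)
    num_bands = len(log_mel[0])
    normalized = [[0.0] * num_bands for _ in range(num_frames)]
    for b in range(num_bands):
        column = [frame[b] for frame in log_mel]
        mean = sum(column) / num_frames
        var = sum((v - mean) ** 2 for v in column) / num_frames
        # Fall back to 1.0 so that a constant band does not divide by zero.
        std = math.sqrt(var) or 1.0
        for t in range(num_frames):
            normalized[t][b] = (column[t] - mean) / std
    return normalized

print(normalize_log_mel([[1.0, 5.0], [3.0, 5.0]]))  # [[-1.0, 0.0], [1.0, 0.0]]
```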

In the embodiment of the present application, the voice recognition device may divide the first digital voice signal to obtain second digital voice signals, and recognize the second digital voice signals to obtain a target digital password, thereby effectively recognizing text-independent voice signals.

Referring to FIG. 4, FIG. 4 is a schematic block diagram of another voice recognition device provided by an embodiment of the present application. As shown in the figure, the voice recognition device in this embodiment may include: one or more processors 401; one or more input devices 402, one or more output devices 403, and a memory 404. The processor 401, the input device 402, the output device 403, and the memory 404 are connected via a bus 405. The memory 404 is used to store a computer program, and the computer program includes program instructions, and the processor 401 is used to execute the program instructions stored in the memory 404. The processor 401 is configured to call the program instructions to execute:

Further, before the processor 401 obtains the first digital voice signal to be detected, it is also used to:

Further, when the processor 401 recognizes the target feature information of each second digital voice signal based on the neural network model, and obtains the target number corresponding to each second digital voice signal, it is specifically used to:

Further, after the processor 401 determines the target number corresponding to the target feature information of the target sample digital voice signal, it is also used to:

Determine whether the probability is less than a preset threshold;

Further, when the processor 401 processes each of the second digital voice signals according to a preset signal processing method and determines a log-mel power spectrum corresponding to each of the second digital voice signals, it is specifically used for:

Further, when the processor 401 converts the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log mel spectrum power, it is specifically used to:

Further, when the processor 401 performs preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, it is specifically used for:

Further, when the processor 401 determines the target digital password corresponding to the first digital voice signal according to the target number corresponding to each of the second digital voice signals, it is specifically used to:

Further, when the processor 401 extracts target feature information of each second digital voice signal from the log mel power spectrum, it is specifically used to:

Performing normalization processing on the log-mel power spectrum corresponding to each of the second digital voice signals to obtain target feature information of each of the second digital voice signals, where the normalization processing refers to normalizing the log-mel power spectrum features of each of the second digital voice signals.

It should be understood that in the embodiments of the present application, the processor 401 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The input device 402 may include a touch panel, a microphone, and the like, and the output device 403 may include a display (LCD, etc.), a speaker, and the like.

The memory 404 may include a read-only memory and a random access memory, and provide instructions and data to the processor 401. A portion of the memory 404 may also include non-volatile random access memory. For example, the memory 404 may also store device type information.

In a specific implementation, the processor 401, the input device 402, and the output device 403 described in the embodiments of the present application may execute the implementation described in the embodiment of the voice recognition method corresponding to FIG. 1 or FIG. 2, and may also execute the implementation of the voice recognition device described in FIG. 3 or FIG. 4 of the embodiments of the present application; details are not described herein again.

An embodiment of the present application also provides a computer-readable storage medium that stores a computer program. When the computer program is executed by a processor, it implements the voice recognition method described in the embodiment corresponding to FIG. 1 or FIG. 2, and may also implement the voice recognition device of the embodiment corresponding to FIG. 3 or FIG. 4 of the present application; details are not described herein again. In some embodiments, the computer-readable storage medium may also be a non-volatile computer-readable storage medium, which is not specifically limited in the embodiments of the present application.

The computer-readable storage medium may be an internal storage unit of the voice recognition device described in any of the foregoing embodiments, such as a hard disk or a memory of the voice recognition device. The computer-readable storage medium may also be an external storage device of the voice recognition device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the voice recognition device. Further, the computer-readable storage medium may also include both an internal storage unit of the voice recognition device and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the voice recognition device. The computer-readable storage medium can also be used to temporarily store data that has been or will be output.

Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Professional technicians can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

The above are only some implementations of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall be covered by the scope of protection of this application.

Claims

A voice recognition method, characterized in that it includes:

Acquiring a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple numbers;

Performing preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, wherein each second digital voice signal is determined by a number;

Processing each of the second digital voice signals according to a preset signal processing method, determining a log mel power spectrum corresponding to each of the second digital voice signals, and extracting target feature information of each of the second digital voice signals from the log mel power spectrum;

Identifying target feature information of each of the second digital voice signals based on a neural network model to obtain target numbers corresponding to each of the second digital voice signals;

According to the target number corresponding to each of the second digital voice signals, a target digital password corresponding to the first digital voice signal is determined.
The method according to claim 1, wherein before acquiring the first digital voice signal to be detected, the method further comprises:

Acquiring a training sample set, the training sample set includes target feature information of a sample digital voice signal, wherein each sample digital voice signal is determined by a number;

Generate the initial neural network model according to the preset neural network algorithm;

The initial neural network model is trained and optimized based on the target feature information of each sample digital speech signal in the training sample set to obtain the neural network model.
The method according to claim 2, wherein the identifying target feature information of each of the second digital voice signals based on the neural network model to obtain a target number corresponding to each of the second digital voice signals includes:

Calculating the similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set;

Acquiring target feature information of at least one sample digital voice signal whose similarity is greater than a preset similarity threshold;

Determining the target feature information of the target sample digital voice signal with the largest similarity from the target feature information of the at least one sample digital voice signal;

The target number corresponding to the target feature information of the target sample digital voice signal is determined according to the correspondence between the target feature information of the sample digital voice signal and the number.
The method according to claim 3, wherein after determining the target number corresponding to the target characteristic information of the target sample digital voice signal, further comprising:

Acquiring the number of target digits corresponding to the target feature information of the target sample digital voice signal, and acquiring the number of digits corresponding to the first digital voice signal to be detected;

Calculating a quantity ratio between the number of target digits and the number of digits corresponding to the first digital voice signal;

Determine the probability that the first digital voice signal to be detected is successfully detected according to the quantity ratio;

Determine whether the probability is less than a preset threshold;

If the judgment result is yes, a sample training set similar to the first digital speech signal is selected to perform training adjustment on the neural network model.
The method according to claim 1, wherein the processing each of the second digital voice signals according to a preset signal processing method to determine a log mel power spectrum corresponding to each of the second digital voice signals includes:

Performing frame windowing on each of the second digital voice signals to obtain a voice frame corresponding to each of the second digital voice signals;

Performing fast Fourier transform on the speech frame corresponding to each of the second digital speech signals to obtain the spectrum signal of the speech frame corresponding to each of the second digital speech signals;

Converting the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log mel spectrum power to obtain the log mel power spectrum corresponding to each of the second digital speech signals.
The method according to claim 5, wherein the converting the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log mel spectrum power includes:

Taking the absolute value of the spectrum signal of the speech frame corresponding to each of the second digital speech signals to obtain the power spectrum of the speech frame corresponding to each of the second digital speech signals;

Converting the power spectrum of the speech frame corresponding to each of the second digital speech signals into a log-mel power spectrum.
The method according to claim 5, wherein the converting the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log mel spectrum power includes:

Taking the square value of the spectrum signal of the speech frame corresponding to each of the second digital speech signals to obtain the power spectrum of the speech frame corresponding to each of the second digital speech signals;

Converting the power spectrum of the speech frame corresponding to each of the second digital speech signals into a log-mel power spectrum.
The method according to claim 1, wherein the preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals includes:

Pre-segmenting the first digital voice signal through a hidden Markov model (HMM) to divide the first digital voice signal into a plurality of mutually independent second digital voice signals, and recording the order in which the second digital voice signals obtained by the division are arranged in the first digital voice signal.
The method according to claim 8, wherein the determining the target digital password corresponding to the first digital voice signal according to the target number corresponding to each of the second digital voice signals includes:

According to the recorded order in which the second digital voice signals are arranged in the first digital voice signal, determining the order in which the target numbers corresponding to the second digital voice signals are arranged;

According to the order in which the target numbers are arranged, a target digital password corresponding to the first digital voice signal is determined.
The method according to claim 1, wherein the extracting target feature information of each of the second digital voice signals from the log-mel power spectrum includes:

Performing normalization processing on the log-mel power spectrum corresponding to each of the second digital voice signals to obtain target feature information of each of the second digital voice signals, where the normalization processing refers to normalizing the log-mel power spectrum features of each of the second digital voice signals.
A voice recognition device, characterized in that it includes:

An obtaining unit, configured to obtain a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple numbers;

A division processing unit, configured to perform preset division processing on the first digital voice signal to obtain multiple second digital voice signals, wherein each second digital voice signal is determined by a number;

A preprocessing unit, configured to process each of the second digital voice signals according to a preset signal processing method, determine a log-mel power spectrum corresponding to each of the second digital voice signals, and extract target feature information of each of the second digital voice signals from the log-mel power spectrum;

A recognition unit, configured to recognize target feature information of each of the second digital voice signals based on a neural network model to obtain a target number corresponding to each of the second digital voice signals;

A determining unit, configured to determine the target digital password corresponding to the first digital voice signal according to the target number corresponding to each of the second digital voice signals.
The device according to claim 11, wherein before the acquiring unit acquires the first digital voice signal to be detected, it is further used to:

Acquiring a training sample set, the training sample set includes target feature information of a sample digital voice signal, wherein each sample digital voice signal is determined by a number;

Generate the initial neural network model according to the preset neural network algorithm;

The initial neural network model is trained and optimized based on the target feature information of each sample digital speech signal in the training sample set to obtain the neural network model.
The device according to claim 12, wherein when the recognition unit recognizes target feature information of each of the second digital voice signals based on a neural network model and obtains the target number corresponding to each of the second digital voice signals, it is specifically used for:

Calculating the similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set;

Acquiring target feature information of at least one sample digital voice signal whose similarity is greater than a preset similarity threshold;

Determining the target feature information of the target sample digital voice signal with the largest similarity from the target feature information of the at least one sample digital voice signal;

The target number corresponding to the target feature information of the target sample digital voice signal is determined according to the correspondence between the target feature information of the sample digital voice signal and the number.
The device according to claim 13, wherein after the recognition unit determines the target number corresponding to the target feature information of the target sample digital voice signal, it is further used to:

Acquiring the number of target digits corresponding to the target feature information of the target sample digital voice signal, and acquiring the number of digits corresponding to the first digital voice signal to be detected;

Calculating a quantity ratio between the number of target digits and the number of digits corresponding to the first digital voice signal;

Determine the probability that the first digital voice signal to be detected is successfully detected according to the quantity ratio;

Determine whether the probability is less than a preset threshold;

If the judgment result is yes, a sample training set similar to the first digital speech signal is selected to perform training adjustment on the neural network model.
The device according to claim 11, wherein when the pre-processing unit processes each of the second digital voice signals according to a preset signal processing method and determines the log mel power spectrum corresponding to each of the second digital voice signals, it is specifically used for:

Performing frame windowing processing on each of the second digital voice signals to obtain a voice frame corresponding to each of the second digital voice signals;

Performing fast Fourier transform on the speech frame corresponding to each of the second digital speech signals to obtain the spectrum signal of the speech frame corresponding to each of the second digital speech signals;

Converting the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log mel spectrum power to obtain the log mel power spectrum corresponding to each of the second digital speech signals.
The device according to claim 15, wherein when the pre-processing unit converts the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log-mel spectrum power, it is specifically used for:

Taking the absolute value of the spectrum signal of the speech frame corresponding to each of the second digital speech signals to obtain the power spectrum of the speech frame corresponding to each of the second digital speech signals;

Converting the power spectrum of the speech frame corresponding to each of the second digital speech signals into a log-mel power spectrum.
The device according to claim 15, wherein when the pre-processing unit converts the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log-mel spectrum power, it is specifically used for:

Taking the square value of the spectrum signal of the speech frame corresponding to each of the second digital speech signals to obtain the power spectrum of the speech frame corresponding to each of the second digital speech signals;

Converting the power spectrum of the speech frame corresponding to each of the second digital speech signals into a log-mel power spectrum.
The device according to claim 11, wherein when the division processing unit performs preset division processing on the first digital voice signal to obtain multiple second digital voice signals, it is specifically used for:

Pre-segmenting the first digital voice signal through a hidden Markov model (HMM) to divide the first digital voice signal into a plurality of mutually independent second digital voice signals, and recording the order in which the second digital voice signals obtained by the division are arranged in the first digital voice signal.
A voice recognition device, characterized by comprising a processor, an input device, an output device and a memory, wherein the processor, the input device, the output device and the memory are connected to each other, the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions and execute:

Acquiring a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple numbers;

Performing preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, wherein each second digital voice signal is determined by a number;

Processing each of the second digital voice signals according to a preset signal processing method, determining a log mel power spectrum corresponding to each of the second digital voice signals, and extracting target feature information of each of the second digital voice signals from the log mel power spectrum;

Identifying target feature information of each of the second digital voice signals based on a neural network model to obtain target numbers corresponding to each of the second digital voice signals;

According to the target number corresponding to each of the second digital voice signals, a target digital password corresponding to the first digital voice signal is determined.
A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program includes program instructions which, when executed by a processor, cause the processor to execute the method according to any one of claims 1-10.