WO2020140609A1 - Voice recognition method and device and computer readable storage medium - Google Patents

Voice recognition method and device and computer readable storage medium

Info

Publication number
WO2020140609A1
Authority
WO
WIPO (PCT)
Prior art keywords
digital voice
digital
voice signal
target
signals
Application number
PCT/CN2019/116979
Other languages
French (fr)
Chinese (zh)
Inventor
贾雪丽
程宁
王健宗
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G10L15/144 Training of HMMs
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • the present application relates to the field of voice recognition technology, and in particular, to a voice recognition method, device, and computer-readable storage medium.
  • (1) A deep neural network (DNN) is connected to a hidden Markov model (HMM) to train the statistics of the Baum-Welch parameters; (2) a training method that combines bottleneck features and Mel Frequency Cepstral Coefficient (MFCC) features.
  • HMM Hidden Markov Model
  • MFCC Mel Frequency Cepstral Coefficient
  • Embodiments of the present application provide a voice recognition method, device, and computer-readable storage medium, which can improve the performance and effectiveness of a voice recognition system.
  • an embodiment of the present application provides a voice recognition method.
  • the method includes:
  • each second digital voice signal is determined by a number
  • a target digital password corresponding to the first digital voice signal is determined.
  • an embodiment of the present application provides a voice recognition device including a unit for performing the method of the first aspect.
  • an embodiment of the present application provides another voice recognition device, including a processor, an input device, an output device, and a memory, where the processor, input device, output device, and memory are connected to each other. The memory is used to store a computer program that supports the voice recognition device in performing the above method; the computer program includes program instructions, and the processor is configured to call the program instructions to perform the method of the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium that stores a computer program, where the computer program includes program instructions which, when executed by a processor, cause the processor to implement the method of the first aspect described above.
  • the second digital voice signal is obtained by dividing the first digital voice signal, and the target digital password is determined according to the target number obtained by processing the second digital voice signal, so as to effectively recognize the text-independent voice signal.
  • FIG. 1 is a schematic flowchart of a voice recognition method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of another voice recognition method provided by an embodiment of the present application.
  • FIG. 3 is a schematic block diagram of a voice recognition device provided by an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of another voice recognition device provided by an embodiment of the present application.
  • the voice recognition method provided by the embodiments of the present application may be executed by a voice recognition device. In some embodiments, the voice recognition device may be provided on a smart terminal such as a mobile phone, computer, tablet, or smart watch. In some embodiments, the voice recognition device may be installed on a smart terminal. In some embodiments, the voice recognition device may be spatially independent of the smart terminal. In some embodiments, the voice recognition device may be a component of the smart terminal, that is, the smart terminal includes the voice recognition device.
  • the voice recognition device may acquire the first digital voice signal to be detected.
  • the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits.
  • the voice recognition device may perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals; in some embodiments, each second digital voice signal is determined by one digit.
  • the voice recognition device may process each of the divided second digital voice signals by a preset signal processing method to obtain the log-mel power spectrum corresponding to each second digital voice signal, and extract target feature information of each second digital voice signal from the log-mel power spectrum.
  • the voice recognition device may recognize the target feature information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal, and determine the target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal.
  • the speech recognition method according to the embodiment of the present application will be schematically described below.
  • FIG. 1 is a schematic flowchart of a voice recognition method provided by an embodiment of the present application. As shown in FIG. 1, the method may be performed by a voice recognition device; the voice recognition device is as described above and is not described again here. Specifically, the method in the embodiment of the present application includes the following steps.
  • S101 Acquire a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits.
  • the voice recognition device may acquire the first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits.
  • the digital password is composed of any one or more digits from 0 to 9; for example, the digital password may be a string of digits from 0 to 9 spoken by the speaker as a voice signal.
  • S102 Perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, where each second digital voice signal is determined by a number.
  • the voice recognition device may perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals; in some embodiments, each of the second digital voice signals is determined by one digit.
  • the voice recognition device may perform the preset segmentation processing on the first digital voice signal through an HMM, so as to divide the first digital voice signal into a plurality of mutually independent second digital voice signals.
  • the voice recognition device may use the HMM-based segmentation method to split the first digital voice signal, which is composed of a digital password, into second digital voice signals composed of mutually independent digits, so that the neural network model can recognize each second digital voice signal.
  • the voice recognition device, when performing preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, may record the order in which each second digital voice signal obtained by splitting is arranged in the first digital voice signal. After the target number corresponding to each second digital voice signal is subsequently identified, the recorded order of each second digital voice signal in the first digital voice signal determines the order of the target numbers, and the target digital password corresponding to the first digital voice signal is formed according to that order.
  • S103 Process each second digital voice signal according to a preset signal processing method, determine the log-mel power spectrum corresponding to each second digital voice signal, and extract the target feature information of each second digital voice signal from the log-mel power spectrum.
  • the voice recognition device may process each of the divided second digital voice signals according to a preset signal processing method to obtain the log-mel power spectrum corresponding to each second digital voice signal, and extract target feature information of each second digital voice signal from the log-mel power spectrum.
  • the voice recognition device, when processing each of the divided second digital voice signals according to the preset signal processing method to obtain the corresponding log-mel power spectrum, may frame and window each divided second digital voice signal to obtain the speech frames corresponding to each second digital voice signal; perform a fast Fourier transform on the speech frames corresponding to each second digital voice signal to obtain the spectral signal of those speech frames; and convert the spectral signal of the speech frames corresponding to each second digital voice signal into log-mel spectral power to obtain the log-mel power spectrum corresponding to each second digital voice signal.
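  • The framing, windowing, and Fourier-transform steps above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the Hamming window, the frame length, and the hop size are all assumptions, and a naive DFT stands in for the fast Fourier transform (it yields the same values, only more slowly).

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping frames of frame_len samples.

    Only complete frames are returned; trailing samples are dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


def hamming_window(frame):
    """Apply a symmetric Hamming window (an assumed window choice)."""
    n = len(frame)
    return [
        s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]


def dft_half(frame):
    """Naive DFT returning the non-negative frequency bins 0..n/2."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def spectra_of_signal(samples, frame_len=8, hop=4):
    """Frame, window, and transform one segmented digit signal."""
    return [dft_half(hamming_window(f)) for f in frame_signal(samples, frame_len, hop)]
```

  • A call such as `spectra_of_signal(digit_samples)` returns one complex spectrum per speech frame; `digit_samples` is a hypothetical list of audio samples for a single segmented digit.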
  • the log-mel power spectrum refers to the power value in the mel scale.
  • the mel scale is a nonlinear frequency scale based on the human ear's sensory judgment of equidistant pitch changes (that is, a frequency in hertz can be converted into mels by a formula).
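  • The hertz-to-mel conversion mentioned above is not spelled out in the text. A minimal sketch, assuming the widely used formula m = 2595 · log10(1 + f / 700), is given below; the helper `mel_band_edges` and the 0 to 8000 Hz range in the usage note are illustrative assumptions, not values from the application.

```python
import math


def hz_to_mel(freq_hz):
    """Convert hertz to mels using m = 2595 * log10(1 + f / 700) (assumed formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min_hz, f_max_hz):
    """Return n_bands + 2 edge frequencies (Hz) spaced evenly on the mel scale.

    Consecutive triples of edges would define the triangular filters of a
    mel filterbank (hypothetical helper for illustration).
    """
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]
```

  • For example, `mel_band_edges(64, 0.0, 8000.0)` yields the 66 edge frequencies of a 64-band filterbank: the edges are dense at low frequencies and sparse at high frequencies, which reflects the nonlinearity of the scale.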
  • the voice recognition device, when extracting the target feature information of each second digital voice signal from the log-mel power spectrum, may normalize the log-mel power spectrum of the second digital voice signal corresponding to each digit to obtain the feature information of the second digital voice signal corresponding to each digit.
  • the normalization process refers to normalizing the log-mel power spectrum characteristics of the second digital speech signal corresponding to each digit, so as to facilitate subsequent processing of the neural network model and accelerate convergence.
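  • The text does not state which normalization is used. The sketch below assumes zero-mean, unit-variance normalization over the whole log-mel matrix, which is one common choice; the function name and the small `eps` guard are assumptions.

```python
import math


def normalize_log_mel(spec, eps=1e-8):
    """Normalize a log-mel matrix (list of rows) to zero mean and unit variance.

    The statistics are computed over every value in the matrix (assumed scheme);
    eps avoids division by zero for a constant input.
    """
    values = [v for row in spec for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [[(v - mean) / (std + eps) for v in row] for row in spec]
```

  • Bringing every digit's features onto the same scale in this way is what lets the network treat loud and quiet recordings alike, which is consistent with the stated goal of accelerating convergence.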
  • the voice recognition device may use the log-mel power spectrum corresponding to the second digital voice signal as an input feature; in some embodiments, the frequency-domain length of the input feature is 64 bands and the time-domain length is 96 frames (equal to the longest digit pronunciation time).
  • the voice recognition device, when converting the spectral signal of the speech frame corresponding to each second digital voice signal into log-mel spectral power, may take the absolute value of the spectral signal of the corresponding speech frame to obtain the power spectrum of the speech frame corresponding to each second digital voice signal, and then convert that power spectrum into a log-mel power spectrum.
  • the voice recognition device, when converting the spectral signal of the speech frame corresponding to each second digital voice signal into log-mel spectral power, may take the squared value of the spectral signal of the corresponding speech frame to obtain the power spectrum of the speech frame corresponding to each second digital voice signal, and then convert that power spectrum into a log-mel power spectrum.
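  • The two variants above (absolute value versus squared value) can be compared in a short sketch. The function names, the use of the natural logarithm, the `eps` floor, and the representation of the mel filterbank as a list of weight rows are assumptions for illustration.

```python
import math


def power_spectrum(spectrum, squared=True):
    """Turn a complex spectrum into a power spectrum.

    squared=True uses |X|^2; squared=False uses |X|, matching the two
    embodiments described in the text.
    """
    return [abs(c) ** 2 if squared else abs(c) for c in spectrum]


def log_mel_power(power, filterbank, eps=1e-10):
    """Apply mel filter weights to a power spectrum and take the logarithm.

    filterbank is a list of rows, one per mel band, each holding one weight
    per frequency bin (hypothetical layout); eps keeps log() defined at zero.
    """
    return [
        math.log(sum(w * p for w, p in zip(weights, power)) + eps)
        for weights in filterbank
    ]
```

  • For a single bin with value 3 + 4j, the absolute-value variant gives 5.0 and the squared variant gives 25.0; after the logarithm the two differ only by a factor of two, so either can serve as the network input as long as training and recognition use the same one.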
  • S104 Identify target feature information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal.
  • the voice recognition device may recognize target feature information of each second digital voice signal based on a neural network model to obtain a target number corresponding to each second digital voice signal.
  • the neural network model may be composed of a preset convolutional neural network, wherein the preset convolutional neural network has an MFM-CNN structure, and the MFM activation function is applied to the feature map obtained by the convolution layer. The activation function of the MFM is as follows:
  • x is the input tensor of the MFM layer, of size W x H x N; y is the output tensor; i and j are in the time domain; and k represents the index of the channel.
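  • The formula itself is missing from this text. As a hedged illustration only, the sketch below uses the Max-Feature-Map operation as it is commonly defined in the literature, where the N input channels are split into two halves and the element-wise maximum is taken, so that y[i][j][k] = max(x[i][j][k], x[i][j][k + N/2]). Whether the application uses exactly this form is an assumption, as are the function name and the nested-list layout.

```python
def max_feature_map(x):
    """Max-Feature-Map over the channel axis of a nested list x[i][j][k].

    Assumed definition: the N channels are split into two halves and the
    output keeps the element-wise maximum, giving N / 2 output channels.
    """
    out = []
    for row in x:
        out_row = []
        for channels in row:
            n = len(channels)
            if n % 2 != 0:
                raise ValueError("MFM needs an even number of channels")
            half = n // 2
            out_row.append(
                [max(channels[k], channels[k + half]) for k in range(half)]
            )
        out.append(out_row)
    return out
```

  • Under this definition the layer acts as a competitive activation: for each position it passes the stronger of two candidate feature maps and halves the channel count, instead of thresholding as ReLU does.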
  • the convolutional layer is used to extract features
  • the MFM layer is used as an activation function layer for nonlinear transformation
  • the pooling layer provides translation invariance and reduces the number of parameters
  • the input to the network is the log-mel power spectrum
  • the input is equivalent to a matrix, which is fed to the neural network for training.
  • the MFM-CNN structure is composed of multiple layer groups. The log-mel power spectrum is received as the input feature at the start of the network, and each group consists of a convolution layer followed by an MFM layer and a pooling layer. These layer groups are stacked together and then connected to a fully connected layer to generate an embedding layer.
  • S105 Determine a target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal.
  • the voice recognition device may determine the target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal.
  • the voice recognition device may determine the order of the target numbers according to the recorded order of each second digital voice signal in the first digital voice signal, and form the target digital password corresponding to the first digital voice signal according to the order of the target numbers.
  • the voice recognition device may divide the first digital voice signal to obtain second digital voice signals, and recognize the second digital voice signals to obtain a target digital password, so as to effectively recognize text-independent voice signals.
  • FIG. 2 is a schematic flowchart of another voice recognition method provided by an embodiment of the present application. As shown in FIG. 2, the method may be performed by a voice recognition device; the voice recognition device is as described above and is not described again here.
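  • Reassembling the password from the recorded segment order can be sketched as below. The pair layout `(position, digit)` and the function name are assumptions; the text only states that the order recorded during segmentation determines the order of the target numbers.

```python
def assemble_password(recognized):
    """Build the target digital password from (position, digit) pairs.

    position is the index of the segment within the first digital voice
    signal, recorded at segmentation time; digit is the recognized number.
    """
    ordered = sorted(recognized, key=lambda pair: pair[0])
    return "".join(str(digit) for _, digit in ordered)
```

  • For example, `assemble_password([(2, "7"), (0, "4"), (1, "1")])` yields `"417"` regardless of the order in which the segments were recognized.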
  • the difference between the embodiment of the present application and the embodiment described in FIG. 1 above is that the embodiment of the present application mainly describes the detailed implementation process of the embodiment of the present application. Specifically, the method in the embodiment of the present application includes the following steps.
  • S201 Obtain a training sample set, where the training sample set includes target feature information of a sample digital voice signal, and each sample digital voice signal is determined by a number.
  • the voice recognition device may acquire a training sample set, where the training sample set includes target feature information of sample digital voice signals, and each sample digital voice signal is determined by a number.
  • the voice recognition device may generate an initial neural network model according to a preset neural network algorithm, use the target feature information of the sample digital voice signals as input data, train and optimize the initial neural network model based on the target feature information of each sample digital voice signal in the training sample set, and output the number corresponding to the target feature information, thereby obtaining the neural network model.
  • the explanation of the neural network model is as described above, and will not be repeated here.
  • S203 Acquire a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits.
  • the voice recognition device may acquire the first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits.
  • the explanation of the digital password is as described above, and will not be repeated here.
  • S204 Perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, where each second digital voice signal is determined by a number.
  • the voice recognition device may perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, and each of the second digital voice signals is determined by one digit.
  • the specific embodiments are as described above, and will not be repeated here.
  • S205 Process each second digital voice signal according to a preset signal processing method, determine the log-mel power spectrum corresponding to each second digital voice signal, and extract the target feature information of each second digital voice signal from the log-mel power spectrum.
  • the voice recognition device may process each of the divided second digital voice signals according to a preset signal processing method to obtain the log-mel power spectrum corresponding to each second digital voice signal, and extract target feature information of each second digital voice signal from the log-mel power spectrum. Specific embodiments are as described above and are not repeated here.
  • S206 Calculate the similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set.
  • the voice recognition device, when recognizing the target feature information of each second digital voice signal based on the neural network model, may calculate the similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set, so that the target digital voice signal can subsequently be determined according to the similarity.
  • S207 Obtain target feature information of at least one sample digital voice signal with a similarity greater than a preset similarity threshold, and determine the target sample with the largest similarity from the target feature information of the at least one sample digital voice signal Target feature information of digital voice signals.
  • the voice recognition device may acquire the target feature information of at least one sample digital voice signal whose similarity is greater than a preset similarity threshold, and determine, from the target feature information of the at least one sample digital voice signal, the target feature information of the target sample digital voice signal with the greatest similarity.
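  • Steps S207 and S208 amount to a thresholded best-match lookup, sketched below. The function names are hypothetical, and the similarity scores and the sample-to-digit correspondence are assumed to be precomputed, parallel lists.

```python
def best_match(similarities, threshold):
    """Return the index of the most similar sample above threshold, or None."""
    candidates = [(s, i) for i, s in enumerate(similarities) if s > threshold]
    if not candidates:
        return None
    return max(candidates)[1]


def lookup_digit(similarities, sample_digits, threshold):
    """Map the best-matching sample to its digit via the preset correspondence.

    sample_digits[i] is the digit associated with training sample i
    (assumed representation of the correspondence).
    """
    index = best_match(similarities, threshold)
    return None if index is None else sample_digits[index]
```

  • With scores `[0.2, 0.91, 0.85]`, digits `["3", "7", "1"]`, and a threshold of 0.8, two samples pass the threshold and the one scoring 0.91 wins, so the target number is `"7"`; if no sample passes, the sketch returns `None` rather than guessing.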
  • S208 Determine the target number corresponding to the target feature information of the target sample digital voice signal according to the preset correspondence between the target feature information of the sample digital voice signal and the number.
  • the voice recognition device may determine the target number corresponding to the target feature information of the target sample digital voice signal according to the preset correspondence between the target feature information of the sample digital voice signals and numbers.
  • the voice recognition device may also determine the similarity between the first digital voice signal to be detected and the target digital voice signal by calculating the cosine similarity between them, and determine the number whose cosine similarity is greater than a preset threshold as the target number corresponding to the target digital voice signal.
  • the voice recognition device may obtain the number of target numbers corresponding to the target digital voice signal, obtain the number of digits corresponding to the first digital voice signal to be detected, and calculate the ratio between the number of target numbers and the number of digits corresponding to the first digital voice signal, so that the probability that the first digital voice signal to be detected is detected successfully can be determined according to the ratio.
  • the voice recognition device may detect whether the probability is less than a preset threshold, and if the probability is less than the preset threshold, it may select a training sample set similar to the first digital voice signal to perform training adjustment on the neural network model, so as to optimize the neural network model in real time and further improve the performance and effectiveness of voice signal recognition.
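  • Cosine similarity between two feature vectors can be computed as in the sketch below. Treating the features as flat lists of floats and returning 0.0 for a zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)
```

  • Because the measure depends only on direction, a quiet and a loud utterance of the same digit (features differing by a scale factor) score close to 1, which suits comparison against a preset threshold.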
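  • The ratio check and the retraining trigger can be sketched as follows. Representing an unrecognized digit as `None`, the function names, and the default threshold of 0.8 are all illustrative assumptions; the text only specifies a ratio compared against a preset threshold.

```python
def detection_probability(target_digits, expected_count):
    """Ratio of recognized target digits to the digits in the spoken password.

    target_digits holds one entry per segment; None marks a segment for
    which no target number was found (assumed representation).
    """
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    recognized = sum(1 for d in target_digits if d is not None)
    return recognized / expected_count


def needs_retraining(probability, threshold=0.8):
    """True when the probability falls below the preset threshold."""
    return probability < threshold
```

  • For a four-digit password in which three digits were recognized, the ratio is 0.75; with the assumed threshold of 0.8 this would trigger a training adjustment on samples similar to the input signal.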
  • S209 Determine a target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal.
  • the voice recognition device may determine the target digital password corresponding to the first digital voice signal according to the target number corresponding to each second digital voice signal.
  • the voice recognition device may divide the first digital voice signal to obtain second digital voice signals, and recognize the second digital voice signals to obtain a target digital password, so as to effectively recognize text-independent voice signals.
  • FIG. 3 is a schematic block diagram of a voice recognition device according to an embodiment of the present application.
  • the voice recognition device of this embodiment includes: an acquisition unit 301, a segmentation processing unit 302, a preprocessing unit 303, a recognition unit 304, and a determination unit 305.
  • the obtaining unit 301 is configured to obtain a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple numbers;
  • a division processing unit 302 configured to perform preset division processing on the first digital voice signal to obtain multiple second digital voice signals, wherein each second digital voice signal is determined by a number;
  • the pre-processing unit 303 is configured to process each of the second digital voice signals according to a preset signal processing method, determine the log-mel power spectrum corresponding to each of the second digital voice signals, and extract target feature information of each of the second digital voice signals from the log-mel power spectrum;
  • the recognition unit 304 is configured to recognize target feature information of each of the second digital voice signals based on a neural network model to obtain target numbers corresponding to each of the second digital voice signals;
  • the determining unit 305 is configured to determine the target digital password corresponding to the first digital voice signal according to the target number corresponding to each of the second digital voice signals.
  • the acquiring unit 301 acquires the first digital voice signal to be detected, it is also used to:
  • the training sample set includes target feature information of a sample digital voice signal, wherein each sample digital voice signal is determined by a number;
  • the initial neural network model is trained and optimized based on the target feature information of each sample digital speech signal in the training sample set to obtain the neural network model.
  • the recognition unit 304 recognizes the target feature information of each second digital voice signal based on the neural network model, and obtains the target number corresponding to each second digital voice signal, it is specifically used to:
  • the target number corresponding to the target feature information of the target sample digital voice signal is determined according to the correspondence between the target feature information of the sample digital voice signal and the number.
  • the determining unit 305 determines the target number corresponding to the target feature information of the target sample digital voice signal, it is also used to:
  • a sample training set similar to the first digital speech signal is selected to perform training adjustment on the neural network model.
  • the pre-processing unit 303 processes each of the second digital voice signals according to a preset signal processing method to determine the log-mel power spectrum corresponding to each of the second digital voice signals, it is specifically used for:
  • the pre-processing unit 303 converts the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log-mel spectrum power, it is specifically used for:
  • the pre-processing unit 303 converts the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log-mel spectrum power, it is specifically used for:
  • segmentation processing unit 302 performs preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, it is specifically used for:
  • the determining unit 305 determines the target digital password corresponding to the first digital voice signal according to the target number corresponding to each of the second digital voice signals, it is specifically used to:
  • a target digital password corresponding to the first digital voice signal is determined.
  • the pre-processing unit 303 extracts target feature information of each second digital voice signal from the log-mel power spectrum, it is specifically used to:
  • the voice recognition device may divide the first digital voice signal to obtain second digital voice signals, and recognize the second digital voice signals to obtain a target digital password, so as to effectively recognize text-independent voice signals.
  • FIG. 4 is a schematic block diagram of another voice recognition device provided by an embodiment of the present application.
  • the voice recognition device in this embodiment may include: one or more processors 401; one or more input devices 402, one or more output devices 403, and a memory 404.
  • the processor 401, the input device 402, the output device 403, and the memory 404 are connected via a bus 405.
  • the memory 404 is used to store a computer program, and the computer program includes program instructions, and the processor 401 is used to execute the program instructions stored in the memory 404.
  • the processor 401 is configured to call the program instructions to execute:
  • each second digital voice signal is determined by a number
  • a target digital password corresponding to the first digital voice signal is determined.
  • the processor 401 obtains the first digital voice signal to be detected, it is also used to:
  • the training sample set includes target feature information of a sample digital voice signal, wherein each sample digital voice signal is determined by a number;
  • the initial neural network model is trained and optimized based on the target feature information of each sample digital speech signal in the training sample set to obtain the neural network model.
  • the processor 401 recognizes the target feature information of each second digital voice signal based on the neural network model, and obtains the target number corresponding to each second digital voice signal, it is specifically used to:
  • the target number corresponding to the target feature information of the target sample digital voice signal is determined according to the correspondence between the target feature information of the sample digital voice signal and the number.
  • the processor 401 determines the target number corresponding to the target feature information of the target sample digital voice signal, it is also used to:
  • a sample training set similar to the first digital speech signal is selected to perform training adjustment on the neural network model.
  • the processor 401 processes each of the second digital voice signals according to a preset signal processing method and determines the log-mel power spectrum corresponding to each of the second digital voice signals, it is specifically used for:
  • the processor 401 converts the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log mel spectrum power, it is specifically used to:
  • the processor 401 converts the spectrum signal of the speech frame corresponding to each of the second digital speech signals into log mel spectrum power, it is specifically used to:
  • the processor 401 performs preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, it is specifically used to:
  • the processor 401 determines the target digital password corresponding to the first digital voice signal according to the target number corresponding to each of the second digital voice signals, it is specifically used to:
  • a target digital password corresponding to the first digital voice signal is determined.
  • the processor 401 extracts target feature information of each second digital voice signal from the log mel power spectrum, it is specifically used to:
  • perform normalization processing on the log-mel power spectrum corresponding to each of the second digital voice signals to obtain target feature information of each of the second digital voice signals, where the normalization processing refers to normalizing the log-mel power spectrum features of each of the second digital voice signals.
  • the voice recognition device may segment the first digital voice signal to obtain second digital voice signals, and recognize the second digital voice signals to obtain a target digital password, so as to effectively recognize text-independent voice signals.
  • the processor 401 may be a central processing unit (Central Processing Unit, CPU); the processor may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the input device 402 may include a touch panel, a microphone, and the like
  • the output device 403 may include a display (LCD, etc.), a speaker, and the like.
  • the memory 404 may include a read-only memory and a random access memory, and provide instructions and data to the processor 401. A portion of the memory 404 may also include non-volatile random access memory. For example, the memory 404 may also store device type information.
  • the processor 401, the input device 402, and the output device 403 described in the embodiments of the present application may perform the method described in FIG. 1 or FIG. 2 of the voice recognition method provided in the embodiments of the present application.
  • the implementation manner may also be the implementation manner of the voice recognition device described in FIG. 3 or FIG. 4 in the embodiment of the present application, and details are not described herein again.
  • An embodiment of the present application also provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, it implements the voice recognition method described in the embodiment corresponding to FIG. 1 or FIG. 2, and may also implement the voice recognition device of the embodiment corresponding to FIG. 3 or FIG. 4 of the present application, and details are not described herein again.
  • the computer-readable storage medium may also be a non-volatile computer-readable storage medium, which is not specifically limited in this embodiment of the present invention.
  • the computer-readable storage medium may be an internal storage unit of the voice recognition device described in any of the foregoing embodiments, such as a hard disk or a memory of the voice recognition device.
  • the computer-readable storage medium may also be an external storage device of the voice recognition device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the voice recognition device.
  • the computer-readable storage medium may also include both an internal storage unit of the voice recognition device and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the voice recognition device.
  • the computer-readable storage medium can also be used to temporarily store data that has been or will be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice recognition method and device and a computer readable storage medium. The method comprises: obtaining a first digital voice signal to be detected, wherein the first digital voice signal is composed of digital ciphers, and the digital ciphers are composed of multiple digits (S101); performing preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, wherein each second digital voice signal is determined by a digit (S102); processing the second digital voice signals according to a preset signal processing method, determining logarithmic Mel power spectra corresponding to the second digital voice signals, and extracting target feature information of the second digital voice signals from the logarithmic Mel power spectra (S103); recognizing the target feature information of the second digital voice signals on the basis of a neural network model so as to obtain target digits corresponding to the second digital voice signals (S104); and determining a target digital cipher corresponding to the first digital voice signal according to the target digits corresponding to the second digital voice signals (S105). The method improves the performance and validity of voice recognition.

Description

Voice recognition method, device and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on January 4, 2019, with application number 201910014557.3 and entitled "Voice recognition method, device and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of voice recognition technology, and in particular to a voice recognition method, a device, and a computer-readable storage medium.
Background
Speaker recognition systems based on the identity vector (Identity-Vector, I-vector) are the classic approach to the text-independent speaker recognition problem, but in recent years this field has received increasing attention from deep learning. Deep learning techniques for acoustic problems can be divided into two categories: (1) attaching a deep neural network (Deep Neural Network, DNN) after a hidden Markov model (Hidden Markov Model, HMM) to train the Baum-Welch statistics; (2) training methods that combine bottleneck features with Mel Frequency Cepstral Coefficient (MFCC) features. Since the text-dependent problem is largely built on the text-independent problem, a DNN can also be used to solve the text-dependent speaker recognition problem. However, using a DNN to extract features for direct speaker discrimination performs poorly. How to improve the performance and effectiveness of speaker recognition systems has therefore become a research focus.
Summary of the invention
Embodiments of the present application provide a voice recognition method, a device, and a computer-readable storage medium, which can improve the performance and effectiveness of a voice recognition system.
In a first aspect, an embodiment of the present application provides a voice recognition method. The method includes:
acquiring a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits;
performing preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, wherein each second digital voice signal is determined by one digit;
processing each of the second digital voice signals according to a preset signal processing method, determining a log-mel power spectrum corresponding to each of the second digital voice signals, and extracting target feature information of each of the second digital voice signals from the log-mel power spectrum;
recognizing the target feature information of each of the second digital voice signals based on a neural network model to obtain a target digit corresponding to each of the second digital voice signals;
determining a target digital password corresponding to the first digital voice signal according to the target digit corresponding to each of the second digital voice signals.
In a second aspect, an embodiment of the present application provides a voice recognition device, which includes units for performing the method of the first aspect.
In a third aspect, an embodiment of the present application provides another voice recognition device, including a processor, an input device, an output device, and a memory, which are connected to one another, wherein the memory is used to store a computer program that supports the voice recognition device in performing the above method, the computer program includes program instructions, and the processor is configured to call the program instructions to perform the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program, where the computer program includes program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect.
In the embodiments of the present application, the first digital voice signal is segmented to obtain second digital voice signals, and the target digital password is determined from the target digits obtained by processing the second digital voice signals, so as to effectively recognize text-independent voice signals.
Brief description of the drawings
FIG. 1 is a schematic flowchart of a voice recognition method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of another voice recognition method provided by an embodiment of the present application;
FIG. 3 is a schematic block diagram of a voice recognition device provided by an embodiment of the present application;
FIG. 4 is a schematic block diagram of another voice recognition device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.
The voice recognition method provided by the embodiments of the present application may be executed by a voice recognition device. In some embodiments, the voice recognition device may be provided on a smart terminal such as a mobile phone, computer, tablet, or smart watch. In some embodiments, the voice recognition device may be installed on the smart terminal; in some embodiments, the voice recognition device may be spatially independent of the smart terminal; in some embodiments, the voice recognition device may be a component of the smart terminal, that is, the smart terminal includes the voice recognition device.
In one embodiment, the voice recognition device may acquire a first digital voice signal to be detected. In some embodiments, the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits. After acquiring the first digital voice signal, the voice recognition device may perform preset segmentation processing on it to obtain multiple second digital voice signals; in some embodiments, each second digital voice signal is determined by one digit. After obtaining the second digital voice signals, the voice recognition device may process each of them according to a preset signal processing method to obtain a log-mel power spectrum corresponding to each second digital voice signal, and extract target feature information of each second digital voice signal from the log-mel power spectrum. The voice recognition device may recognize the target feature information of each second digital voice signal based on a neural network model to obtain a target digit corresponding to each second digital voice signal, and determine the target digital password corresponding to the first digital voice signal according to the target digit corresponding to each second digital voice signal. The voice recognition method of the embodiments of the present application is described schematically below.
Refer to FIG. 1, which is a schematic flowchart of a voice recognition method provided by an embodiment of the present application. As shown in FIG. 1, the method may be performed by a voice recognition device; the voice recognition device is as explained above and is not described again here. Specifically, the method in this embodiment includes the following steps.
S101: Acquire a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits.
In this embodiment, the voice recognition device may acquire the first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits. In some embodiments, the digital password is composed of any one or more digits from 0 to 9; for example, the digital password may be a string of digits from 0 to 9 spoken by the speaker and used as the voice signal.
S102: Perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, where each second digital voice signal is determined by one digit.
In this embodiment, the voice recognition device may perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals. In some embodiments, each of the second digital voice signals is determined by one digit.
In one embodiment, the voice recognition device may perform the preset segmentation processing on the first digital voice signal through an HMM, so as to divide the first digital voice signal into multiple second digital voice signals, each corresponding to one independent digit. In some embodiments, the voice recognition device uses the HMM-based segmentation method to divide the first digital voice signal, which is composed of a digital password, into second digital voice signals composed of mutually independent digits, so that the neural network model can recognize the second digital voice signals.
In one embodiment, when the voice recognition device performs the preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, it may record the order in which each second digital voice signal appears in the first digital voice signal. After the target digit corresponding to each second digital voice signal is subsequently recognized, the order of the target digits can be determined from the recorded order, and the target digits can be assembled in that order into the target digital password corresponding to the first digital voice signal.
S103: Process each second digital voice signal according to a preset signal processing method, determine a log-mel power spectrum corresponding to each second digital voice signal, and extract target feature information of each second digital voice signal from the log-mel power spectrum.
In this embodiment, the voice recognition device may process each of the second digital voice signals obtained by segmentation according to a preset signal processing method to obtain a log-mel power spectrum corresponding to each of the second digital voice signals, and extract target feature information of each second digital voice signal from the log-mel power spectrum.
In one embodiment, when the voice recognition device processes each of the segmented second digital voice signals according to the preset signal processing method to obtain the corresponding log-mel power spectrum, it may perform framing and windowing on each second digital voice signal to obtain the voice frames corresponding to each second digital voice signal; perform a fast Fourier transform on those voice frames to obtain the spectrum signal of the voice frames corresponding to each second digital voice signal; and convert the spectrum signal of the voice frames corresponding to each second digital voice signal into log-mel spectral power, so as to obtain the log-mel power spectrum corresponding to each second digital voice signal.
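The framing, windowing, FFT and log-mel steps described above might be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation of the application: the sample rate (16 kHz), frame length (400 samples), hop (160 samples), FFT size (512), the Hamming window, the triangular filterbank and the Hz-to-mel formula are all assumed, and every function name is hypothetical. Only the choice of 64 mel bands echoes the feature size given in the description.

```python
import numpy as np

def hz_to_mel(f_hz):
    # One widely used Hz-to-mel formula (an assumption; the text names none).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def frame_and_window(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:  # pad very short segments to one full frame
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_power_spectrum(signal, sample_rate=16000, frame_len=400,
                           hop_len=160, n_fft=512, n_mels=64):
    frames = frame_and_window(signal, frame_len, hop_len)   # framing + windowing
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)         # FFT of each frame
    power = np.abs(spectrum) ** 2                           # power spectrum
    mel_power = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_power + 1e-10).T                      # (n_mels, n_frames)
```

The small constant added before the logarithm only guards against taking the log of zero in silent frames; it is a numerical convenience rather than part of the described method.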
In some embodiments, the log-mel power spectrum refers to the power values on the mel scale. In some embodiments, the mel scale is a non-linear frequency scale based on the human ear's perceptual judgment of equally spaced pitch changes (that is, hertz can be converted into mel by a formula).
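The text does not state which hertz-to-mel formula is used. One commonly used convention, given here only as an assumption, maps a frequency $f$ in hertz to $m$ in mel as follows:

```latex
m = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
f = 700 \left(10^{\,m/2595} - 1\right)
```

Under this convention, $f = 1000\ \text{Hz}$ gives $m = 2595 \log_{10}(1 + 1000/700) \approx 1000\ \text{mel}$, while equal steps in mel correspond to progressively wider steps in hertz at higher frequencies.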
In one embodiment, when extracting the target feature information of each second digital voice signal from the log-mel power spectrum, the voice recognition device may normalize the log-mel power spectrum of the second digital voice signal corresponding to each digit to obtain the feature information of that second digital voice signal. Here, the normalization processing refers to normalizing the log-mel power spectrum features of the second digital voice signal corresponding to each digit, which benefits subsequent processing by the neural network model and can accelerate convergence. In some embodiments, the voice recognition device may use the log-mel power spectrum corresponding to the second digital voice signal as the input feature; in some embodiments, the input feature has a frequency-domain length of 64 bands and a time-domain length of 96 frames (equal to the longest digit pronunciation time).
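As a rough illustration of the normalization and the fixed 64 x 96 input size, the sketch below standardizes one log-mel matrix to zero mean and unit variance and then pads or crops it along the time axis. The choice of mean/variance normalization and of zero padding for digits shorter than 96 frames are assumptions, since the description specifies neither; the function name is hypothetical.

```python
import numpy as np

def to_fixed_size_feature(log_mel, n_frames=96):
    """Normalize a (n_mels, T) log-mel matrix and pad or crop it to n_frames columns."""
    log_mel = np.asarray(log_mel, dtype=float)
    # Assumed normalization: zero mean, unit variance over the whole segment.
    normalized = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
    n_mels, t = normalized.shape
    if t >= n_frames:
        return normalized[:, :n_frames]        # crop overly long segments
    padded = np.zeros((n_mels, n_frames))      # assumed zero padding
    padded[:, :t] = normalized
    return padded
```

Applied to the log-mel matrix of one segmented digit, this yields a 64 x 96 array when 64 mel bands are used.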
In one embodiment, when converting the spectrum signal of the voice frames corresponding to each second digital voice signal into log-mel spectral power, the voice recognition device may take the absolute value of the spectrum signal of the voice frames corresponding to each second digital voice signal to obtain the power spectrum of those voice frames, and then convert that power spectrum into the log-mel power spectrum.
In one embodiment, when converting the spectrum signal of the voice frames corresponding to each second digital voice signal into log-mel spectral power, the voice recognition device may take the square of the spectrum signal of the voice frames corresponding to each second digital voice signal to obtain the power spectrum of those voice frames, and then convert that power spectrum into the log-mel power spectrum.
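The two variants above differ only in how the complex FFT output is turned into a real-valued spectrum. A minimal sketch, with a hypothetical function name and `mode` parameter, and reading "square" as the squared magnitude of the complex spectrum (an assumption):

```python
import numpy as np

def spectrum_to_power(spectrum, mode="square"):
    """Turn a complex FFT spectrum into a real spectrum by absolute value or by squaring."""
    magnitude = np.abs(spectrum)
    if mode == "abs":
        return magnitude           # absolute-value variant
    if mode == "square":
        return magnitude ** 2      # squared-magnitude variant
    raise ValueError("mode must be 'abs' or 'square'")
```

Either result can then be passed through a mel filterbank and a logarithm.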
S104: Recognize the target feature information of each second digital voice signal based on a neural network model to obtain a target digit corresponding to each second digital voice signal.
In this embodiment, the voice recognition device may recognize the target feature information of each second digital voice signal based on a neural network model to obtain the target digit corresponding to each second digital voice signal.
In some embodiments, the neural network model may be composed of a preset convolutional neural network, wherein the preset convolutional neural network has an MFM-CNN structure. The MFM activation function acts on the feature maps obtained from the convolutional layer and is defined as follows:

$$y_{ij}^{k} = \max\left(x_{ij}^{k},\; x_{ij}^{k+N/2}\right)$$

for every position $(i, j)$ and every channel $k = 1, \dots, N/2$, where $x$ is the input tensor of the MFM layer with size $W \times H \times N$, $y$ is the output tensor with size $W \times H \times N/2$, $i$ and $j$ index the frequency and time domains, and $k$ is the channel index.
The convolutional layer is used to extract features, the MFM layer serves as the activation function layer that performs the non-linear transformation, and the pooling layer provides translation invariance and reduces the number of parameters. The input to the network is the log-mel power spectrum, which is in effect a matrix that is fed into the neural network for training.
In one embodiment, the MFM-CNN structure is composed of many layer groups. After the log-mel power spectrum is accepted as the input feature, each group consists of a convolutional layer followed by an MFM layer and a pooling layer. These layer groups are stacked together and then connected by a fully connected layer to generate the embedding layer.
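The MFM activation and the pooling step of one layer group might look like the sketch below. It assumes the standard Max-Feature-Map definition, in which the channel axis is split into two halves and the element-wise maximum is taken, and it assumes 2 x 2 max pooling, which the description does not specify. The convolution itself is omitted, and both function names are hypothetical.

```python
import numpy as np

def mfm(x):
    """Max-Feature-Map: (W, H, N) -> (W, H, N // 2) by a max over the two channel halves."""
    n_channels = x.shape[2]
    if n_channels % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = n_channels // 2
    return np.maximum(x[:, :, :half], x[:, :, half:])

def max_pool_2x2(x):
    """Non-overlapping 2 x 2 max pooling over the first two axes (assumed pooling type)."""
    w, h, c = x.shape
    x = x[: w - w % 2, : h - h % 2]            # drop an odd trailing row or column
    return x.reshape(w // 2, 2, h // 2, 2, c).max(axis=(1, 3))
```

Because MFM keeps only the larger of each channel pair, it halves the channel count while acting as the non-linearity, which is why no separate activation such as ReLU appears in the layer group.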
S105: Determine the target digital password corresponding to the first digital voice signal according to the target digit corresponding to each second digital voice signal.
In this embodiment, the voice recognition device may determine the target digital password corresponding to the first digital voice signal according to the target digit corresponding to each second digital voice signal.
In one embodiment, the voice recognition device may determine the order of the target digits according to the recorded order in which each second digital voice signal appears in the first digital voice signal, and assemble the target digits in that order into the target digital password corresponding to the first digital voice signal.
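Reassembling the password from the recorded segment order can be sketched in a few lines. The data layout, a list of (position, digit) pairs, and the function name are assumptions made for illustration.

```python
def assemble_password(recognized):
    """Join recognized digits into a password string, ordered by recorded segment position.

    `recognized` is an iterable of (position, digit) pairs, where position is the
    index of the segment within the original utterance.
    """
    ordered = sorted(recognized, key=lambda pair: pair[0])
    return "".join(str(digit) for _, digit in ordered)
```

Keeping the position alongside each digit means the segments can be recognized in any order, for example in a batch, without losing the original sequence.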
In this embodiment, the voice recognition device may segment the first digital voice signal to obtain second digital voice signals, and recognize the second digital voice signals to obtain the target digital password, so as to effectively recognize text-independent voice signals.
Refer to FIG. 2, which is a schematic flowchart of another voice recognition method provided by an embodiment of the present application. As shown in FIG. 2, the method may be performed by a voice recognition device; the voice recognition device is as explained above and is not described again here. This embodiment differs from the embodiment described in FIG. 1 in that it mainly illustrates the detailed implementation process. Specifically, the method in this embodiment includes the following steps.
S201: Obtain a training sample set, where the training sample set includes target feature information of sample digital voice signals, and each sample digital voice signal is determined by one digit.
In this embodiment, the voice recognition device may obtain a training sample set, where the training sample set includes target feature information of sample digital voice signals, and each sample digital voice signal is determined by one digit.
S202: Train and optimize an initial neural network model based on the target feature information of each sample digital voice signal in the training sample set to obtain the neural network model.
In this embodiment, the voice recognition device may generate an initial neural network model according to a preset neural network algorithm, take the target feature information of the sample digital voice signals as input data, and train and optimize the initial neural network model based on the target feature information of each sample digital voice signal in the training sample set so that it outputs the digit corresponding to the target feature information, thereby obtaining the neural network model. The neural network model is as explained above and is not described again here.
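To make the train-then-predict loop concrete without reproducing a full MFM-CNN, the sketch below swaps in a much simpler stand-in: a softmax (multinomial logistic regression) classifier trained by batch gradient descent on flattened features. It is not the network described in this application; the learning rate, epoch count and function names are all assumptions.

```python
import numpy as np

def train_softmax(features, labels, n_classes=10, lr=0.5, epochs=200, seed=0):
    """Fit a softmax classifier (stand-in for the MFM-CNN) on flattened features."""
    rng = np.random.default_rng(seed)
    x = np.asarray(features, dtype=float).reshape(len(features), -1)
    onehot = np.eye(n_classes)[np.asarray(labels)]
    w = rng.normal(0.0, 0.01, size=(x.shape[1], n_classes))
    b = np.zeros(n_classes)
    for _ in range(epochs):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(x)              # gradient of mean cross-entropy
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b

def predict_digit(features, w, b):
    """Return the most likely class index for each feature matrix."""
    x = np.asarray(features, dtype=float).reshape(len(features), -1)
    return np.argmax(x @ w + b, axis=1)
```

In the described method the features would be the normalized log-mel matrices of the sample digital voice signals and the labels would be the digits 0 to 9.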
S203:获取待检测的第一数字语音信号,其中,所述第一数字语音信号是由数字密码组成的,所述数字密码由多个数字组成。S203: Acquire a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits.
本申请实施例中,语音识别设备可以获取待检测的第一数字语音信号,其中,所述第一数字语音信号是由数字密码组成的,所述数字密码由多个数字组成。其中,所述数字密码的解释如前所述,此处不再赘述。In the embodiment of the present application, the voice recognition device may acquire the first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of multiple digits. The explanation of the digital password is as described above, and will not be repeated here.
S204: Perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, where each second digital voice signal is determined by one digit.
In this embodiment of the present application, the voice recognition device may perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, each of which is determined by one digit. The specific implementation is as described above and is not described again here.
S205: Process each second digital voice signal according to a preset signal processing method to determine the log-mel power spectrum corresponding to each second digital voice signal, and extract the target feature information of each second digital voice signal from that log-mel power spectrum.
In this embodiment of the present application, the voice recognition device may process each second digital voice signal obtained by segmentation according to the preset signal processing method to obtain the log-mel power spectrum corresponding to each second digital voice signal, and extract the target feature information of each second digital voice signal from the log-mel power spectrum. The specific implementation is as described above and is not described again here.
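The feature-extraction step S205 can be sketched as follows. This is a minimal illustration only, assuming a single speech frame, a naive DFT standing in for the fast Fourier transform, and a triangular mel filterbank; the function names, the mel-scale formula, the filter count, and the sample rate are all assumptions made for the example and are not values given in this application.

```python
import cmath
import math


def hz_to_mel(hz):
    # A commonly used mel-scale formula (assumed; the application does not fix one).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    # Naive DFT standing in for the FFT; keeps the non-negative-frequency bins.
    n = len(frame)
    bins = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        bins.append(abs(x_k) ** 2)  # squared magnitude of the spectrum bin
    return bins


def mel_filterbank(num_filters, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    num_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(num_bins)]
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))  # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel_power_spectrum(frame, sample_rate=8000, num_filters=8, eps=1e-10):
    # Power spectrum -> mel filterbank energies -> natural logarithm.
    power = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    return [math.log(sum(w * p for w, p in zip(row, power)) + eps) for row in bank]


# Example input: a 64-sample frame of a 1 kHz tone sampled at 8 kHz.
frame = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
features = log_mel_power_spectrum(frame)
```

In this sketch `features` holds one log-mel value per filter for the frame; a practical system would use an FFT routine and stack such vectors over all frames of a segment.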
S206: Calculate the similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set.
In this embodiment of the present application, when recognizing the target feature information of each second digital voice signal based on the neural network model, the voice recognition device may calculate the similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set, so that the target digital voice signal can subsequently be determined according to the similarity.
S207: Obtain the target feature information of at least one sample digital voice signal whose similarity is greater than a preset similarity threshold, and from the target feature information of the at least one sample digital voice signal, determine the target feature information of the target sample digital voice signal with the greatest similarity.
In this embodiment of the present application, after calculating the similarity between the second digital voice signal and each sample digital voice signal in the training sample set, the voice recognition device may obtain the target feature information of at least one sample digital voice signal whose similarity is greater than the preset similarity threshold, and from the target feature information of the at least one sample digital voice signal, determine the target feature information of the target sample digital voice signal with the greatest similarity.
S208: Determine the target digit corresponding to the target feature information of the target sample digital voice signal according to a preset correspondence between the target feature information of sample digital voice signals and digits.
In this embodiment of the present application, after determining the target digital voice signal with the greatest similarity, the voice recognition device may further determine the target digit corresponding to the target feature information of the target sample digital voice signal according to the preset correspondence between the target feature information of sample digital voice signals and digits.
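Steps S206 to S208 can be sketched as a thresholded nearest-sample lookup. The sketch below is illustrative only: it assumes cosine similarity as the similarity measure, a list of `(features, digit)` pairs as the stored correspondence, and a threshold of 0.8; the function names and these values are hypothetical and are not specified in this application.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two feature vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_digit(features, samples, threshold=0.8):
    """samples: list of (sample_features, digit) pairs (hypothetical layout)."""
    # S206: similarity against every sample in the training sample set.
    scored = [(cosine_similarity(features, f), d) for f, d in samples]
    # S207: keep samples above the threshold, then pick the most similar one.
    candidates = [(s, d) for s, d in scored if s > threshold]
    if not candidates:
        return None  # no sample is similar enough
    _, best_digit = max(candidates, key=lambda c: c[0])
    # S208: the stored feature-to-digit correspondence yields the target digit.
    return best_digit


samples = [([1.0, 0.0, 0.2], 3), ([0.1, 1.0, 0.0], 7), ([0.9, 0.1, 0.3], 3)]
matched = match_digit([1.0, 0.05, 0.25], samples)
unmatched = match_digit([0.0, 0.0, 1.0], samples)
```

Returning `None` when no sample clears the threshold is a design choice of the sketch; it lets a caller count how many segments were actually recognized.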
In one embodiment, the voice recognition device may also determine the similarity between the first digital voice signal to be detected and the target digital voice signal by calculating their cosine similarity, and determine the digit whose cosine similarity is greater than a preset threshold as the target digit corresponding to the target digital voice signal.
In one embodiment, after determining the target digit corresponding to the target digital voice signal, the voice recognition device may obtain the number of target digits corresponding to the target digital voice signal and the number of digits corresponding to the first digital voice signal to be detected, and calculate the ratio between the number of target digits and the number of digits corresponding to the first digital voice signal, so that the probability that the first digital voice signal to be detected is successfully detected can be determined from the ratio. The voice recognition device may check whether the probability is less than a preset threshold; if it is, the device may select a training sample set similar to the first digital voice signal to train and adjust the neural network model, so that the neural network model is trained and optimized in real time and the performance and effectiveness of voice signal recognition are further improved.
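The ratio check described above can be sketched as follows. This is a minimal illustration under stated assumptions: unrecognized segments are represented by `None`, the ratio itself is used directly as the success probability, and the threshold of 0.8 is a placeholder; none of these details are fixed by this application.

```python
def detection_success_probability(recognized_digits, expected_digit_count):
    # recognized_digits: one entry per segment; None marks a segment for
    # which no target digit was determined (an assumed convention).
    if expected_digit_count <= 0:
        raise ValueError("expected_digit_count must be positive")
    recognized = sum(1 for d in recognized_digits if d is not None)
    # Ratio of recognized target digits to digits in the first voice signal.
    return recognized / expected_digit_count


def needs_retraining(recognized_digits, expected_digit_count, threshold=0.8):
    # True when the success probability falls below the preset threshold,
    # which is the condition for further training the model.
    probability = detection_success_probability(recognized_digits, expected_digit_count)
    return probability < threshold
```

For example, recognizing three of four digits gives a ratio of 0.75, which is below the assumed threshold and would trigger the training adjustment.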
S209: Determine the target digital password corresponding to the first digital voice signal according to the target digit corresponding to each second digital voice signal.
In this embodiment of the present application, the voice recognition device may determine the target digital password corresponding to the first digital voice signal according to the target digit corresponding to each second digital voice signal.
In this embodiment of the present application, the voice recognition device may segment the first digital voice signal to obtain the second digital voice signals and recognize the second digital voice signals to obtain the target digital password, so as to effectively recognize text-independent voice signals.
An embodiment of the present application further provides a voice recognition device that includes units for performing the method described in any of the foregoing. Specifically, refer to FIG. 3, which is a schematic block diagram of a voice recognition device provided by an embodiment of the present application. The voice recognition device of this embodiment includes an acquisition unit 301, a segmentation processing unit 302, a preprocessing unit 303, a recognition unit 304, and a determination unit 305.
The acquisition unit 301 is configured to acquire a first digital voice signal to be detected, where the first digital voice signal consists of a digital password and the digital password consists of multiple digits;
The segmentation processing unit 302 is configured to perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, where each second digital voice signal is determined by one digit;
The preprocessing unit 303 is configured to process each second digital voice signal according to a preset signal processing method, determine the log-mel power spectrum corresponding to each second digital voice signal, and extract the target feature information of each second digital voice signal from the log-mel power spectrum;
The recognition unit 304 is configured to recognize the target feature information of each second digital voice signal based on a neural network model to obtain the target digit corresponding to each second digital voice signal;
The determination unit 305 is configured to determine the target digital password corresponding to the first digital voice signal according to the target digit corresponding to each second digital voice signal.
Further, before acquiring the first digital voice signal to be detected, the acquisition unit 301 is further configured to:
acquire a training sample set, where the training sample set includes target feature information of sample digital voice signals and each sample digital voice signal is determined by one digit;
generate an initial neural network model according to a preset neural network algorithm;
train and optimize the initial neural network model based on the target feature information of each sample digital voice signal in the training sample set, to obtain the neural network model.
Further, when recognizing the target feature information of each second digital voice signal based on the neural network model to obtain the target digit corresponding to each second digital voice signal, the recognition unit 304 is specifically configured to:
calculate the similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set;
obtain the target feature information of at least one sample digital voice signal whose similarity is greater than a preset similarity threshold;
determine, from the target feature information of the at least one sample digital voice signal, the target feature information of the target sample digital voice signal with the greatest similarity;
determine the target digit corresponding to the target feature information of the target sample digital voice signal according to a preset correspondence between the target feature information of sample digital voice signals and digits.
Further, after determining the target digit corresponding to the target feature information of the target sample digital voice signal, the determination unit 305 is further configured to:
obtain the number of target digits corresponding to the target feature information of the target sample digital voice signal, and obtain the number of digits corresponding to the first digital voice signal to be detected;
calculate the ratio between the number of target digits and the number of digits corresponding to the first digital voice signal;
determine, according to the ratio, the probability that the first digital voice signal to be detected is successfully detected;
judge whether the probability is less than a preset threshold;
if so, select a training sample set similar to the first digital voice signal to train and adjust the neural network model.
Further, when processing each second digital voice signal according to the preset signal processing method to determine the log-mel power spectrum corresponding to each second digital voice signal, the preprocessing unit 303 is specifically configured to:
perform a fast Fourier transform on the speech frames corresponding to each second digital voice signal to obtain the spectrum signal of the speech frames corresponding to each second digital voice signal;
convert the spectrum signal of the speech frames corresponding to each second digital voice signal into log-mel spectral power, so as to obtain the log-mel power spectrum corresponding to each second digital voice signal.
Further, when converting the spectrum signal of the speech frames corresponding to each second digital voice signal into log-mel spectral power, the preprocessing unit 303 is specifically configured to:
take the absolute value of the spectrum signal of the speech frames corresponding to each second digital voice signal to obtain the power spectrum of the speech frames corresponding to each second digital voice signal;
convert the power spectrum of the speech frames corresponding to each second digital voice signal into a log-mel power spectrum.
Further, when converting the spectrum signal of the speech frames corresponding to each second digital voice signal into log-mel spectral power, the preprocessing unit 303 is specifically configured to:
take the square of the spectrum signal of the speech frames corresponding to each second digital voice signal to obtain the power spectrum of the speech frames corresponding to each second digital voice signal;
convert the power spectrum of the speech frames corresponding to each second digital voice signal into a log-mel power spectrum.
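The two alternatives described above for obtaining the power spectrum differ only in what is done to each complex spectrum bin. A minimal sketch, assuming the spectrum signal is a list of complex FFT bins and reading "square" as the squared magnitude (the function names are hypothetical):

```python
def power_spectrum_abs(spectrum):
    # First alternative: absolute value (magnitude) of each complex bin.
    return [abs(x) for x in spectrum]


def power_spectrum_square(spectrum):
    # Second alternative: squared magnitude of each complex bin.
    return [abs(x) ** 2 for x in spectrum]


bins = [3 + 4j, 1j, 0j]
by_abs = power_spectrum_abs(bins)
by_square = power_spectrum_square(bins)
```

Either result would then be passed through the mel filterbank and logarithm; the squared form gives larger contrast between strong and weak bins before the logarithm is applied.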
Further, when performing preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, the segmentation processing unit 302 is specifically configured to:
perform preset segmentation processing on the first digital voice signal by means of a hidden Markov model (HMM), so as to divide the first digital voice signal into multiple mutually independent second digital voice signals, and record the order in which each second digital voice signal obtained by the division is arranged in the first digital voice signal.
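The application does not detail the HMM used for segmentation. As a toy illustration only, the sketch below assumes a two-state HMM (silence and speech) over binary per-frame energy observations and decodes it with the Viterbi algorithm; the state set, the probabilities `stay` and `emit_match`, and the function name are all assumptions made for the example.

```python
import math


def viterbi_segments(observations, stay=0.9, emit_match=0.9):
    """observations: per-frame flags, 1 = high energy, 0 = low energy.

    Returns (start, end) frame ranges, end exclusive, decoded as speech,
    listed in the order in which they occur in the signal.
    """
    if not observations:
        return []
    states = (0, 1)  # 0 = silence, 1 = speech
    trans = {(i, j): math.log(stay if i == j else 1.0 - stay)
             for i in states for j in states}
    emit = {(s, o): math.log(emit_match if s == o else 1.0 - emit_match)
            for s in states for o in (0, 1)}
    # Initialisation with a uniform prior over the two states.
    score = [math.log(0.5) + emit[(s, observations[0])] for s in states]
    back = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in states:
            prev = max(states, key=lambda p: score[p] + trans[(p, s)])
            new_score.append(score[prev] + trans[(prev, s)] + emit[(s, obs)])
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    # Backtrace from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    # Collect runs of the speech state; list position records segment order.
    segments, start = [], None
    for i, s in enumerate(path):
        if s == 1 and start is None:
            start = i
        elif s == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(path)))
    return segments


clean = viterbi_segments([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0])
glitch = viterbi_segments([0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0])
```

In the second call the single low-energy frame inside the speech run is absorbed into one segment, because one mismatched emission costs less than two extra state transitions under the assumed probabilities.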
Further, when determining the target digital password corresponding to the first digital voice signal according to the target digit corresponding to each second digital voice signal, the determination unit 305 is specifically configured to:
determine the order of the target digits corresponding to the second digital voice signals according to the recorded order in which each second digital voice signal is arranged in the first digital voice signal;
determine the target digital password corresponding to the first digital voice signal according to the order of the target digits.
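Assembling the password from the recorded order can be sketched as a sort followed by concatenation. The `(position, digit)` pair layout and the function name are assumptions made for illustration:

```python
def assemble_password(segment_results):
    """segment_results: iterable of (position, digit) pairs in any order."""
    # Restore the order in which the segments appeared in the first signal.
    ordered = sorted(segment_results, key=lambda item: item[0])
    # Concatenate the target digits in that order to form the password.
    return "".join(str(digit) for _, digit in ordered)
```

For instance, the pairs `(2, 7)`, `(0, 3)`, `(1, 1)` yield the password `"317"`.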
Further, when extracting the target feature information of each second digital voice signal from the log-mel power spectrum, the preprocessing unit 303 is specifically configured to:
normalize the log-mel power spectrum corresponding to each second digital voice signal to obtain the target feature information of each second digital voice signal, where the normalization refers to normalizing the log-mel power spectrum features of each second digital voice signal.
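This passage does not state which normalization is applied. One common choice, shown here purely as an assumption, is mean and variance normalization of the log-mel feature vector; the function name and the `eps` guard are hypothetical.

```python
import math


def normalize(features, eps=1e-8):
    # Zero-mean, unit-variance scaling of a log-mel feature vector.
    mean = sum(features) / len(features)
    variance = sum((x - mean) ** 2 for x in features) / len(features)
    std = math.sqrt(variance)
    # eps avoids division by zero when all features are equal.
    return [(x - mean) / (std + eps) for x in features]


normalized = normalize([1.0, 2.0, 3.0])
```

Normalizing in this way makes features from loud and quiet recordings comparable before they are passed to the model.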
In this embodiment of the present application, the voice recognition device may segment the first digital voice signal to obtain the second digital voice signals and recognize the second digital voice signals to obtain the target digital password, so as to effectively recognize text-independent voice signals.
Refer to FIG. 4, which is a schematic block diagram of another voice recognition device provided by an embodiment of the present application. As shown in the figure, the voice recognition device in this embodiment may include one or more processors 401, one or more input devices 402, one or more output devices 403, and a memory 404. The processor 401, the input device 402, the output device 403, and the memory 404 are connected via a bus 405. The memory 404 is used to store a computer program that includes program instructions, and the processor 401 is used to execute the program instructions stored in the memory 404. The processor 401 is configured to call the program instructions to:
acquire a first digital voice signal to be detected, where the first digital voice signal consists of a digital password and the digital password consists of multiple digits;
perform preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, where each second digital voice signal is determined by one digit;
process each second digital voice signal according to a preset signal processing method, determine the log-mel power spectrum corresponding to each second digital voice signal, and extract the target feature information of each second digital voice signal from the log-mel power spectrum;
recognize the target feature information of each second digital voice signal based on a neural network model to obtain the target digit corresponding to each second digital voice signal;
determine the target digital password corresponding to the first digital voice signal according to the target digit corresponding to each second digital voice signal.
Further, before acquiring the first digital voice signal to be detected, the processor 401 is further configured to:
acquire a training sample set, where the training sample set includes target feature information of sample digital voice signals and each sample digital voice signal is determined by one digit;
generate an initial neural network model according to a preset neural network algorithm;
train and optimize the initial neural network model based on the target feature information of each sample digital voice signal in the training sample set, to obtain the neural network model.
Further, when recognizing the target feature information of each second digital voice signal based on the neural network model to obtain the target digit corresponding to each second digital voice signal, the processor 401 is specifically configured to:
calculate the similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set;
obtain the target feature information of at least one sample digital voice signal whose similarity is greater than a preset similarity threshold;
determine, from the target feature information of the at least one sample digital voice signal, the target feature information of the target sample digital voice signal with the greatest similarity;
determine the target digit corresponding to the target feature information of the target sample digital voice signal according to a preset correspondence between the target feature information of sample digital voice signals and digits.
Further, after determining the target digit corresponding to the target feature information of the target sample digital voice signal, the processor 401 is further configured to:
obtain the number of target digits corresponding to the target feature information of the target sample digital voice signal, and obtain the number of digits corresponding to the first digital voice signal to be detected;
calculate the ratio between the number of target digits and the number of digits corresponding to the first digital voice signal;
determine, according to the ratio, the probability that the first digital voice signal to be detected is successfully detected;
judge whether the probability is less than a preset threshold;
if so, select a training sample set similar to the first digital voice signal to train and adjust the neural network model.
Further, when processing each second digital voice signal according to the preset signal processing method to determine the log-mel power spectrum corresponding to each second digital voice signal, the processor 401 is specifically configured to:
perform a fast Fourier transform on the speech frames corresponding to each second digital voice signal to obtain the spectrum signal of the speech frames corresponding to each second digital voice signal;
convert the spectrum signal of the speech frames corresponding to each second digital voice signal into log-mel spectral power, so as to obtain the log-mel power spectrum corresponding to each second digital voice signal.
Further, when converting the spectrum signal of the speech frames corresponding to each second digital voice signal into log-mel spectral power, the processor 401 is specifically configured to:
take the absolute value of the spectrum signal of the speech frames corresponding to each second digital voice signal to obtain the power spectrum of the speech frames corresponding to each second digital voice signal;
convert the power spectrum of the speech frames corresponding to each second digital voice signal into a log-mel power spectrum.
Further, when converting the spectrum signal of the speech frames corresponding to each second digital voice signal into log-mel spectral power, the processor 401 is specifically configured to:
take the square of the spectrum signal of the speech frames corresponding to each second digital voice signal to obtain the power spectrum of the speech frames corresponding to each second digital voice signal;
convert the power spectrum of the speech frames corresponding to each second digital voice signal into a log-mel power spectrum.
Further, when performing preset segmentation processing on the first digital voice signal to obtain multiple second digital voice signals, the processor 401 is specifically configured to:
perform preset segmentation processing on the first digital voice signal by means of a hidden Markov model (HMM), so as to divide the first digital voice signal into multiple mutually independent second digital voice signals, and record the order in which each second digital voice signal obtained by the division is arranged in the first digital voice signal.
Further, when determining the target digital password corresponding to the first digital voice signal according to the target digit corresponding to each second digital voice signal, the processor 401 is specifically configured to:
determine the order of the target digits corresponding to the second digital voice signals according to the recorded order in which each second digital voice signal is arranged in the first digital voice signal;
determine the target digital password corresponding to the first digital voice signal according to the order of the target digits.
Further, when extracting the target feature information of each second digital voice signal from the log-mel power spectrum, the processor 401 is specifically configured to:
normalize the log-mel power spectrum corresponding to each second digital voice signal to obtain the target feature information of each second digital voice signal, where the normalization refers to normalizing the log-mel power spectrum features of each second digital voice signal.
In this embodiment of the present application, the voice recognition device may segment the first digital voice signal to obtain the second digital voice signals and recognize the second digital voice signals to obtain the target digital password, so as to effectively recognize text-independent voice signals.
It should be understood that, in the embodiments of the present application, the processor 401 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
The input device 402 may include a touch panel, a microphone, and the like, and the output device 403 may include a display (such as an LCD), a speaker, and the like.
The memory 404 may include read-only memory and random access memory, and provides instructions and data to the processor 401. A portion of the memory 404 may also include non-volatile random access memory. For example, the memory 404 may also store device type information.
In a specific implementation, the processor 401, the input device 402, and the output device 403 described in the embodiments of the present application may carry out the implementations described in the method embodiments of FIG. 1 or FIG. 2 of the voice recognition method provided by the embodiments of the present application, and may also carry out the implementations of the voice recognition device described in FIG. 3 or FIG. 4 of the embodiments of the present application; details are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium that stores a computer program. When executed by a processor, the computer program implements the voice recognition method described in the embodiment corresponding to FIG. 1 or FIG. 2, and may also implement the voice recognition device of the embodiment corresponding to FIG. 3 or FIG. 4 of the present application; details are not repeated here. In some embodiments, the computer-readable storage medium may also be a non-volatile computer-readable storage medium, which is not specifically limited in the embodiments of the present invention.
The computer-readable storage medium may be an internal storage unit of the voice recognition device described in any of the foregoing embodiments, such as a hard disk or memory of the voice recognition device. The computer-readable storage medium may also be an external storage device of the voice recognition device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the voice recognition device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the voice recognition device. The computer-readable storage medium is used to store the computer program and other programs and data required by the voice recognition device, and may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are executed in hardware or in software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present application.
The above describes only some implementations of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and all such modifications or replacements shall fall within the scope of protection of the present application.

Claims (20)

  1. A voice recognition method, comprising:
    acquiring a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of a plurality of digits;
    performing preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, wherein each second digital voice signal is determined by one digit;
    processing each of the second digital voice signals according to a preset signal processing method to determine a log-mel power spectrum corresponding to each of the second digital voice signals, and extracting target feature information of each of the second digital voice signals from the log-mel power spectrum;
    recognizing the target feature information of each of the second digital voice signals based on a neural network model to obtain a target digit corresponding to each of the second digital voice signals; and
    determining, according to the target digit corresponding to each of the second digital voice signals, a target digital password corresponding to the first digital voice signal.
  2. The method according to claim 1, wherein before acquiring the first digital voice signal to be detected, the method further comprises:
    acquiring a training sample set, the training sample set including target feature information of sample digital voice signals, wherein each sample digital voice signal is determined by one digit;
    generating an initial neural network model according to a preset neural network algorithm; and
    training and optimizing the initial neural network model based on the target feature information of each sample digital voice signal in the training sample set to obtain the neural network model.
  3. The method according to claim 2, wherein recognizing the target feature information of each of the second digital voice signals based on the neural network model to obtain the target digit corresponding to each of the second digital voice signals comprises:
    calculating a similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set;
    acquiring the target feature information of at least one sample digital voice signal whose similarity is greater than a preset similarity threshold;
    determining, from the target feature information of the at least one sample digital voice signal, the target feature information of a target sample digital voice signal having the greatest similarity; and
    determining the target digit corresponding to the target feature information of the target sample digital voice signal according to a preset correspondence between the target feature information of sample digital voice signals and digits.
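
The matching steps of claim 3 can be illustrated with a minimal sketch. The claim does not name a similarity measure, so cosine similarity is an assumption here, as are the function names, the default threshold of 0.8, and the representation of the training set as a list of `(features, digit)` pairs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_digit(features, samples, threshold=0.8):
    """Return the digit of the most similar sample above the threshold, else None.

    `samples` is a hypothetical list of (sample_features, digit) pairs standing
    in for the preset correspondence between sample features and digits.
    """
    candidates = []
    for sample_features, digit in samples:
        similarity = cosine_similarity(features, sample_features)
        # Keep only samples whose similarity exceeds the preset threshold.
        if similarity > threshold:
            candidates.append((similarity, digit))
    if not candidates:
        return None
    # Among the retained samples, pick the one with the greatest similarity.
    _, best_digit = max(candidates, key=lambda c: c[0])
    return best_digit
```

Returning `None` when no sample clears the threshold is a design choice of this sketch; the claim itself does not say what happens in that case.
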
  4. The method according to claim 3, wherein after determining the target digit corresponding to the target feature information of the target sample digital voice signal, the method further comprises:
    acquiring the number of target digits corresponding to the target feature information of the target sample digital voice signal, and acquiring the number of digits corresponding to the first digital voice signal to be detected;
    calculating a ratio between the number of target digits and the number of digits corresponding to the first digital voice signal;
    determining, according to the ratio, a probability that the first digital voice signal to be detected is successfully detected;
    judging whether the probability is less than a preset threshold; and
    if so, selecting a sample training set similar to the first digital voice signal to train and adjust the neural network model.
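
The check in claim 4 can be sketched as follows. The claim does not state how the probability is derived from the ratio, so this sketch assumes the probability simply equals the ratio; the function names, the use of `None` for an unrecognized segment, and the default threshold of 0.9 are also assumptions.

```python
def detection_probability(recognized_digits, expected_count):
    """Ratio of recognized target digits to the digits in the first signal.

    `recognized_digits` is a hypothetical list with one entry per segment,
    where None marks a segment for which no target digit was determined.
    """
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    recognized_count = sum(1 for digit in recognized_digits if digit is not None)
    return recognized_count / expected_count


def needs_retraining(recognized_digits, expected_count, threshold=0.9):
    """True when the detection probability falls below the preset threshold."""
    return detection_probability(recognized_digits, expected_count) < threshold
```

For a four-digit password with one unrecognized segment the ratio is 3/4, which falls below the assumed threshold and would trigger the training adjustment.
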
  5. The method according to claim 1, wherein processing each of the second digital voice signals according to the preset signal processing method to determine the log-mel power spectrum corresponding to each of the second digital voice signals comprises:
    performing framing and windowing on each of the second digital voice signals to obtain speech frames corresponding to each of the second digital voice signals;
    performing a fast Fourier transform on the speech frames corresponding to each of the second digital voice signals to obtain spectrum signals of the speech frames corresponding to each of the second digital voice signals; and
    converting the spectrum signals of the speech frames corresponding to each of the second digital voice signals into log-mel spectral power, so as to obtain the log-mel power spectrum corresponding to each of the second digital voice signals.
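
The framing, windowing, transform, and log-mel steps of claim 5 can be sketched in plain Python. This is an illustration under stated assumptions, not the applicant's implementation: a Hamming window, a naive DFT standing in for the fast Fourier transform (same result, slower), the squared-magnitude variant of claim 7 for the power spectrum, the common 2595·log10(1 + f/700) mel formula, and hypothetical frame length, hop, and filter count.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply a Hamming window."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    return [[signal[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Squared magnitude of the DFT for bins 0..N/2 (naive DFT, assumption)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers equally spaced on the mel scale."""
    mel_max = hz_to_mel(sample_rate / 2)
    points = [mel_to_hz(mel_max * i / (n_filters + 1))
              for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, center, hi = points[m - 1], points[m], points[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= center:
                row.append((f - lo) / (center - lo))   # rising edge
            elif center < f < hi:
                row.append((hi - f) / (hi - center))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel_power_spectrum(signal, sample_rate, frame_len=64, hop=32, n_filters=8):
    """Return one list of log-mel band energies per frame."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    result = []
    for frame in frame_signal(signal, frame_len, hop):
        power = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
        # A small floor avoids log(0) for empty bands.
        result.append([math.log(e + 1e-10) for e in energies])
    return result
```

In practice a library FFT would replace the naive DFT; the small frame length is chosen only to keep the sketch fast.
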
  6. The method according to claim 5, wherein converting the spectrum signals of the speech frames corresponding to each of the second digital voice signals into log-mel spectral power comprises:
    taking the absolute value of the spectrum signals of the speech frames corresponding to each of the second digital voice signals to obtain a power spectrum of the speech frames corresponding to each of the second digital voice signals; and
    converting the power spectrum of the speech frames corresponding to each of the second digital voice signals into the log-mel power spectrum.
  7. The method according to claim 5, wherein converting the spectrum signals of the speech frames corresponding to each of the second digital voice signals into log-mel spectral power comprises:
    taking the squared value of the spectrum signals of the speech frames corresponding to each of the second digital voice signals to obtain a power spectrum of the speech frames corresponding to each of the second digital voice signals; and
    converting the power spectrum of the speech frames corresponding to each of the second digital voice signals into the log-mel power spectrum.
  8. The method according to claim 1, wherein performing the preset segmentation processing on the first digital voice signal to obtain the plurality of second digital voice signals comprises:
    performing the preset segmentation processing on the first digital voice signal by means of a hidden Markov model (HMM), so as to divide the first digital voice signal into a plurality of mutually independent second digital voice signals, and recording the order in which each second digital voice signal obtained by the segmentation is arranged in the first digital voice signal.
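
The claim does not describe the structure of the HMM, so the following is only one possible reading: a two-state (silence/speech) HMM decoded with the Viterbi algorithm over per-frame log-energies, with each contiguous speech run treated as one second digital voice signal. The Gaussian emission parameters, the self-transition probability, and all function names are assumptions made for illustration.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log-density of a 1-D Gaussian, used as the emission model (assumed)."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def viterbi_speech_states(energies, silence=(0.0, 1.0), speech=(5.0, 1.0), stay=0.9):
    """Most likely state path: 0 = silence, 1 = speech."""
    if not energies:
        return []
    params = [silence, speech]
    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)
    scores = [math.log(0.5) + gaussian_log_pdf(energies[0], *params[s]) for s in (0, 1)]
    backpointers = []
    for x in energies[1:]:
        new_scores, pointers = [], []
        for s in (0, 1):
            # Score of reaching state s from each previous state p.
            candidates = [scores[p] + (log_stay if p == s else log_switch) for p in (0, 1)]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            new_scores.append(candidates[best_prev] + gaussian_log_pdf(x, *params[s]))
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = 0 if scores[0] >= scores[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def segment_by_hmm(energies):
    """Return (order, start_frame, end_frame) for each speech run; end is exclusive."""
    segments, start = [], None
    for i, state in enumerate(viterbi_speech_states(energies)):
        if state == 1 and start is None:
            start = i
        elif state == 0 and start is not None:
            segments.append((len(segments), start, i))
            start = None
    if start is not None:
        segments.append((len(segments), start, len(energies)))
    return segments
```

The first element of each tuple records the position of the segment within the original signal, which is the ordering information the claim says is recorded.
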
  9. The method according to claim 8, wherein determining, according to the target digit corresponding to each of the second digital voice signals, the target digital password corresponding to the first digital voice signal comprises:
    determining, according to the recorded order in which each of the second digital voice signals is arranged in the first digital voice signal, the order in which the target digits corresponding to the second digital voice signals are arranged; and
    determining the target digital password corresponding to the first digital voice signal according to the order in which the target digits are arranged.
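
The reassembly in claim 9 amounts to sorting the recognized digits by the recorded segment order and concatenating them. A minimal sketch, assuming each result is held as a hypothetical `(position, digit)` pair:

```python
def assemble_password(recognized):
    """Join recognized digits in the order their segments appeared.

    `recognized` is a hypothetical list of (position, digit) pairs, where
    position is the recorded order of the segment in the first signal.
    """
    ordered = sorted(recognized, key=lambda item: item[0])
    return "".join(str(digit) for _, digit in ordered)
```

For example, `assemble_password([(2, 7), (0, 1), (1, 9)])` yields the string `"197"`, regardless of the order in which the segments were recognized.
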
  10. The method according to claim 1, wherein extracting the target feature information of each of the second digital voice signals from the log-mel power spectrum comprises:
    normalizing the log-mel power spectrum corresponding to each of the second digital voice signals to obtain the target feature information of each of the second digital voice signals, wherein the normalizing refers to normalizing the log-mel power spectrum features of each of the second digital voice signals.
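
Claim 10 does not say which normalization is used. The sketch below assumes per-band mean and variance normalization across the frames of one segment, which is a common choice for spectral features; the function name and the list-of-lists layout (frames by mel bands) are likewise assumptions.

```python
import math


def normalize_log_mel(log_mel):
    """Zero-mean, unit-variance normalization of each mel band across frames."""
    n_frames, n_bands = len(log_mel), len(log_mel[0])
    normalized = [[0.0] * n_bands for _ in range(n_frames)]
    for b in range(n_bands):
        column = [frame[b] for frame in log_mel]
        mean = sum(column) / n_frames
        variance = sum((v - mean) ** 2 for v in column) / n_frames
        # A constant band has zero variance; fall back to 1.0 to avoid dividing by zero.
        std = math.sqrt(variance) or 1.0
        for t in range(n_frames):
            normalized[t][b] = (log_mel[t][b] - mean) / std
    return normalized
```

Normalizing each band separately removes level differences between recordings while keeping the shape of the spectrum over time.
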
  11. A voice recognition device, comprising:
    an acquiring unit, configured to acquire a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of a plurality of digits;
    a segmentation processing unit, configured to perform preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, wherein each second digital voice signal is determined by one digit;
    a preprocessing unit, configured to process each of the second digital voice signals according to a preset signal processing method to determine a log-mel power spectrum corresponding to each of the second digital voice signals, and to extract target feature information of each of the second digital voice signals from the log-mel power spectrum;
    a recognition unit, configured to recognize the target feature information of each of the second digital voice signals based on a neural network model to obtain a target digit corresponding to each of the second digital voice signals; and
    a determining unit, configured to determine, according to the target digit corresponding to each of the second digital voice signals, a target digital password corresponding to the first digital voice signal.
  12. The device according to claim 11, wherein before acquiring the first digital voice signal to be detected, the acquiring unit is further configured to:
    acquire a training sample set, the training sample set including target feature information of sample digital voice signals, wherein each sample digital voice signal is determined by one digit;
    generate an initial neural network model according to a preset neural network algorithm; and
    train and optimize the initial neural network model based on the target feature information of each sample digital voice signal in the training sample set to obtain the neural network model.
  13. The device according to claim 12, wherein when recognizing the target feature information of each of the second digital voice signals based on the neural network model to obtain the target digit corresponding to each of the second digital voice signals, the recognition unit is specifically configured to:
    calculate a similarity between the target feature information of the second digital voice signal and the target feature information of each sample digital voice signal in the training sample set;
    acquire the target feature information of at least one sample digital voice signal whose similarity is greater than a preset similarity threshold;
    determine, from the target feature information of the at least one sample digital voice signal, the target feature information of a target sample digital voice signal having the greatest similarity; and
    determine the target digit corresponding to the target feature information of the target sample digital voice signal according to a preset correspondence between the target feature information of sample digital voice signals and digits.
  14. The device according to claim 13, wherein after determining the target digit corresponding to the target feature information of the target sample digital voice signal, the recognition unit is further configured to:
    acquire the number of target digits corresponding to the target feature information of the target sample digital voice signal, and acquire the number of digits corresponding to the first digital voice signal to be detected;
    calculate a ratio between the number of target digits and the number of digits corresponding to the first digital voice signal;
    determine, according to the ratio, a probability that the first digital voice signal to be detected is successfully detected;
    judge whether the probability is less than a preset threshold; and
    if so, select a sample training set similar to the first digital voice signal to train and adjust the neural network model.
  15. The device according to claim 11, wherein when processing each of the second digital voice signals according to the preset signal processing method to determine the log-mel power spectrum corresponding to each of the second digital voice signals, the preprocessing unit is specifically configured to:
    perform framing and windowing on each of the second digital voice signals to obtain speech frames corresponding to each of the second digital voice signals;
    perform a fast Fourier transform on the speech frames corresponding to each of the second digital voice signals to obtain spectrum signals of the speech frames corresponding to each of the second digital voice signals; and
    convert the spectrum signals of the speech frames corresponding to each of the second digital voice signals into log-mel spectral power, so as to obtain the log-mel power spectrum corresponding to each of the second digital voice signals.
  16. The device according to claim 15, wherein when converting the spectrum signals of the speech frames corresponding to each of the second digital voice signals into log-mel spectral power, the preprocessing unit is specifically configured to:
    take the absolute value of the spectrum signals of the speech frames corresponding to each of the second digital voice signals to obtain a power spectrum of the speech frames corresponding to each of the second digital voice signals; and
    convert the power spectrum of the speech frames corresponding to each of the second digital voice signals into the log-mel power spectrum.
  17. The device according to claim 15, wherein when converting the spectrum signals of the speech frames corresponding to each of the second digital voice signals into log-mel spectral power, the preprocessing unit is specifically configured to:
    take the squared value of the spectrum signals of the speech frames corresponding to each of the second digital voice signals to obtain a power spectrum of the speech frames corresponding to each of the second digital voice signals; and
    convert the power spectrum of the speech frames corresponding to each of the second digital voice signals into the log-mel power spectrum.
  18. The device according to claim 11, wherein when performing the preset segmentation processing on the first digital voice signal to obtain the plurality of second digital voice signals, the segmentation processing unit is specifically configured to:
    perform the preset segmentation processing on the first digital voice signal by means of a hidden Markov model (HMM), so as to divide the first digital voice signal into a plurality of mutually independent second digital voice signals, and record the order in which each second digital voice signal obtained by the segmentation is arranged in the first digital voice signal.
  19. A voice recognition device, comprising a processor, an input device, an output device, and a memory, wherein the processor, the input device, the output device, and the memory are connected to one another, the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to perform:
    acquiring a first digital voice signal to be detected, wherein the first digital voice signal is composed of a digital password, and the digital password is composed of a plurality of digits;
    performing preset segmentation processing on the first digital voice signal to obtain a plurality of second digital voice signals, wherein each second digital voice signal is determined by one digit;
    processing each of the second digital voice signals according to a preset signal processing method to determine a log-mel power spectrum corresponding to each of the second digital voice signals, and extracting target feature information of each of the second digital voice signals from the log-mel power spectrum;
    recognizing the target feature information of each of the second digital voice signals based on a neural network model to obtain a target digit corresponding to each of the second digital voice signals; and
    determining, according to the target digit corresponding to each of the second digital voice signals, a target digital password corresponding to the first digital voice signal.
  20. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1-10.
PCT/CN2019/116979 2019-01-04 2019-11-11 Voice recognition method and device and computer readable storage medium WO2020140609A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910014557.3 2019-01-04
CN201910014557.3A CN109545226B (en) 2019-01-04 2019-01-04 Voice recognition method, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2020140609A1 true WO2020140609A1 (en) 2020-07-09

Family

ID=65834261

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116979 WO2020140609A1 (en) 2019-01-04 2019-11-11 Voice recognition method and device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN109545226B (en)
WO (1) WO2020140609A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109545226B (en) * 2019-01-04 2022-11-22 平安科技(深圳)有限公司 Voice recognition method, device and computer readable storage medium
CN110415685A (en) * 2019-08-20 2019-11-05 河海大学 A kind of audio recognition method
CN112802484B (en) * 2021-04-12 2021-06-18 四川大学 Panda sound event detection method and system under mixed audio frequency

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170358306A1 (en) * 2016-06-13 2017-12-14 Alibaba Group Holding Limited Neural network-based voiceprint information extraction method and apparatus
CN108922561A (en) * 2018-06-04 2018-11-30 平安科技(深圳)有限公司 Speech differentiation method, apparatus, computer equipment and storage medium
CN109074822A (en) * 2017-10-24 2018-12-21 深圳和而泰智能控制股份有限公司 Specific sound recognition methods, equipment and storage medium
CN109545226A (en) * 2019-01-04 2019-03-29 平安科技(深圳)有限公司 A kind of audio recognition method, equipment and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07261792A (en) * 1994-03-18 1995-10-13 Fujitsu Ltd Voice recognition device by numerical value with digit
JPH09311693A (en) * 1996-05-21 1997-12-02 Matsushita Electric Ind Co Ltd Speech recognition apparatus
CN103903612B (en) * 2014-03-26 2017-02-22 浙江工业大学 Method for performing real-time digital speech recognition
CN107221333B (en) * 2016-03-21 2019-11-08 中兴通讯股份有限公司 A kind of identity authentication method and device
CN107068154A (en) * 2017-03-13 2017-08-18 平安科技(深圳)有限公司 The method and system of authentication based on Application on Voiceprint Recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LUO MEI ; JIANG LISHA ; LUO LIANLING: "Application of BP Neural Network in Chinese Digit Speech Recognition", GUANGXI PHYSICS, vol. 33, no. 3, 30 September 2012 (2012-09-30), pages 26 - 28, XP009521861, ISSN: 1003-7551 *

Also Published As

Publication number Publication date
CN109545226A (en) 2019-03-29
CN109545226B (en) 2022-11-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19907842

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19907842

Country of ref document: EP

Kind code of ref document: A1