CN113409775B - Keyword recognition method and device, storage medium and computer equipment - Google Patents

Keyword recognition method and device, storage medium and computer equipment

Info

Publication number
CN113409775B
CN113409775B (application CN202110714828.3A)
Authority
CN
China
Prior art keywords
convolutional
convolution
voice signal
network
network unit
Prior art date
Legal status
Active
Application number
CN202110714828.3A
Other languages
Chinese (zh)
Other versions
CN113409775A (en)
Inventor
王嘉欣
胡伯承
Current Assignee
Spreadtrum Communications Shanghai Co Ltd
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd
Priority to CN202110714828.3A
Publication of CN113409775A
Application granted
Publication of CN113409775B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A keyword recognition method and apparatus, a storage medium, and computer equipment are provided. The method comprises: acquiring an original speech signal and framing it to obtain a target speech signal, where the target speech signal comprises a plurality of speech frames arranged in time order and each speech frame is represented by time-domain and frequency-domain features; and inputting the target speech signal into a keyword recognition network for keyword recognition, to obtain the keywords contained in the target speech signal. The keyword recognition network comprises a plurality of convolutional network units connected in series and a classifier; the output features of each convolutional network unit serve as the input features of the next, and each convolutional network unit comprises one or more convolutional layers. The classifier classifies the output features of the last convolutional network unit to obtain the keywords contained in the target speech signal. The method reduces the number of model parameters while maintaining recognition accuracy.

Description

Keyword recognition method and device, storage medium and computer equipment
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a keyword recognition method and apparatus, a storage medium, and computer equipment.
Background
With the development of artificial intelligence, more and more AI algorithms are being applied to keyword recognition. Keyword spotting (KWS, also called keyword recognition) is an important human-computer interaction method and has been applied in various electronic products (such as computers and mobile phones). Because some of these products have little memory, they call for keyword recognition methods with few model parameters, so as to reduce the memory used while the algorithm runs. However, reducing the number of model parameters typically lowers recognition accuracy; that is, current keyword recognition methods cannot balance recognition accuracy against model size.
Compared with conventional algorithms, neural networks can improve keyword recognition accuracy while reducing the model's parameter count. For example, depthwise convolution networks have been used for keyword recognition to reduce the number of parameters. However, although depthwise convolution reduces the amount of computation to some extent, such networks have many channels, so the overall parameter count remains large; they are therefore still unsuitable for electronic products with strict memory limits, and their recognition accuracy also leaves room for improvement.
Disclosure of Invention
The technical problem solved by the invention is to provide a keyword recognition method that reduces the number of model parameters while maintaining recognition accuracy.
In order to solve the above problem, an embodiment of the present invention provides a keyword recognition method, the method comprising: acquiring an original speech signal and processing it to obtain a target speech signal, where the target speech signal comprises a plurality of speech frames arranged in time order and each speech frame is represented by time-domain and frequency-domain features; and inputting the target speech signal into a keyword recognition network for keyword recognition, to obtain the keywords contained in the target speech signal. The keyword recognition network comprises a plurality of convolutional network units connected in series and a classifier; the output features of each convolutional network unit serve as the input features of the next, and each convolutional network unit comprises one or more convolutional layers. The classifier classifies the output features of the last convolutional network unit to obtain the keywords contained in the target speech signal.
Optionally, some or all of the convolutional network units adopt a residual network structure.
Optionally, each convolutional network unit comprises, in feature-flow order, a first convolutional layer, a second convolutional layer, and a third convolutional layer, where the first and second convolutional layers have a first number of channels, the third convolutional layer has a second number of channels, and the first number may be the same as or different from the second number.
Optionally, across the series-connected convolutional network units, the first number increases gradually in feature-flow order; and/or the second number increases gradually in feature-flow order.
Optionally, for each convolution network unit, the convolution kernel of the first convolution layer is 1 × 1, the convolution kernel of the second convolution layer is 3 × 1, and the convolution kernel of the third convolution layer is 1 × 1.
Optionally, the keyword recognition network is implemented based on MobileNetV2.
Optionally, the number of the convolutional network units is 6.
Optionally, processing the original speech signal to obtain the target speech signal comprises: extracting the Mel-frequency cepstral coefficients (MFCCs) of the original speech signal, and performing a temporal convolution with each frame's MFCC treated as a time series to obtain the target speech signal, where the MFCC feature dimension serves as the channel count of the temporal convolution.
An embodiment of the present invention further provides a keyword recognition apparatus, comprising: a target speech signal acquisition module, configured to acquire an original speech signal and process it to obtain a target speech signal, where the target speech signal comprises a plurality of speech frames arranged in time order and each speech frame is represented by time-domain and frequency-domain features; and a keyword recognition module, configured to input the target speech signal into a keyword recognition network for keyword recognition, to obtain the keywords contained in the target speech signal. The keyword recognition network comprises a plurality of convolutional network units connected in series and a classifier; the output features of each convolutional network unit serve as the input features of the next, and each convolutional network unit comprises one or more convolutional layers. The classifier classifies the output features of the last convolutional network unit to obtain the keywords contained in the target speech signal.
An embodiment of the present invention further provides a storage medium storing a computer program which, when executed by a processor, performs the steps of any of the methods described above.
An embodiment of the present invention further provides a computer device comprising the above keyword recognition apparatus, or comprising a memory and a processor, the memory storing a computer program executable on the processor, where the processor performs the steps of any of the keyword recognition methods described above when executing the computer program.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
an embodiment of the present invention provides a keyword recognition method comprising: acquiring an original speech signal and processing it to obtain a target speech signal, where the target speech signal comprises a plurality of speech frames arranged in time order and each speech frame is represented by time-domain and frequency-domain features; and inputting the target speech signal into a keyword recognition network for keyword recognition, to obtain the keywords contained in the target speech signal. The keyword recognition network comprises a plurality of convolutional network units connected in series and a classifier; the output features of each convolutional network unit serve as the input features of the next, and each convolutional network unit comprises one or more convolutional layers. The classifier classifies the output features of the last convolutional network unit to obtain the keywords contained in the target speech signal.
Compared with the prior art, in the solution of the embodiment of the present invention, because the low-level features (i.e., the output features of the preceding convolutional network unit) always participate in forming the high-level features of the next layer (i.e., the output features of the following convolutional network unit), the receptive field over the audio features of the input target speech signal is enlarged, which improves keyword recognition accuracy. In addition, the method shrinks the feature map through the multi-layer convolutional network, greatly reducing the number of model parameters. In summary, the embodiment of the present invention provides a novel keyword recognition network that meets the accuracy requirements of practical applications while greatly reducing the model parameter count.
Furthermore, in the solution of the embodiment of the present invention, some or all of the convolutional network units adopt a residual network structure; that is, a residual network is introduced into this low-computation keyword recognition network, which effectively improves recognition accuracy.
Furthermore, each convolutional network unit may comprise three convolutional layers (a first, a second, and a third convolutional layer), which increases the depth of the keyword recognition network, further reduces the number of model parameters, and effectively reduces the network's memory footprint on the device.
Further, the MFCCs of the original speech signal undergo a temporal convolution in which each frame's MFCC is treated as a time series rather than as a grayscale image. The input then becomes t × 1 × f, where t is the time dimension (so the convolution kernel is t × 1) and the MFCC feature dimension f (the frequency-domain features) serves as the channel count. The smaller convolution kernel effectively reduces the parameter count, and changing the channel count from 1 to f lets the feature dimension f propagate downward, enlarging the features' receptive field.
Drawings
Fig. 1 is a schematic flowchart of a keyword recognition method according to an embodiment of the present invention;
FIG. 2 is a diagram of a keyword recognition network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a single convolutional network unit, according to an embodiment of the present invention;
FIG. 4 is a flowchart of a typical MFCC extraction process;
fig. 5 is a schematic structural diagram of a keyword recognition apparatus according to an embodiment of the present invention.
Detailed Description
As noted in the background section, existing keyword recognition methods cannot balance recognition accuracy against model parameter count.
In order to solve the above problem, an embodiment of the present invention provides a keyword recognition method, the method comprising: acquiring an original speech signal and framing it to obtain a target speech signal, where the target speech signal comprises a plurality of speech frames arranged in time order and each speech frame is represented by time-domain and frequency-domain features; and inputting the target speech signal into a keyword recognition network for keyword recognition, to obtain the keywords contained in the target speech signal. The keyword recognition network comprises a plurality of convolutional network units connected in series and a classifier; the output features of each convolutional network unit serve as the input features of the next, and each convolutional network unit comprises one or more convolutional layers. The classifier classifies the output features of the last convolutional network unit to obtain the keywords contained in the target speech signal. In this way, the number of model parameters can be reduced while recognition accuracy is maintained.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, embodiments accompanying figures are described in detail below.
Referring to fig. 1, fig. 1 is a schematic flowchart of a keyword recognition method according to an embodiment of the present invention. The method is executed on the terminal side, and the terminal may be a mobile phone, a computer, a smart watch, or another device. The keyword recognition method may specifically include the following steps S101 and S102, detailed below.
Step S101: acquire an original speech signal and frame it to obtain a target speech signal, where the target speech signal comprises a plurality of speech frames arranged in time order and each speech frame is represented by time-domain and frequency-domain features.
The original speech signal is an audio signal, which may be collected by a microphone or acquired from a memory or from another terminal. In a typical application scenario, the terminal converses with a user automatically; for example, the terminal is an intelligent robot or a mobile phone with a voice assistant, and the original speech signal is obtained by collecting the sounds of the external environment through the terminal's microphone.
Framing the original speech signal to obtain the target speech signal includes: segmenting the original speech signal for feature-parameter analysis, each segment being called a speech frame, where the frame length may be 10 to 30 milliseconds (ms) and can be adjusted as needed. The framed original speech signal is then represented along the two dimensions of time-domain and frequency-domain features to obtain the target speech signal.
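As a rough sketch of the framing step (an illustration, not part of the patent text; the 16 kHz sampling rate, 20 ms frame length, and 10 ms hop are example values consistent with the 10 to 30 ms range above):

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames (a ragged tail is dropped)."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# 1 second of audio at 16 kHz, 20 ms frames with a 10 ms hop
sr = 16000
x = np.zeros(sr)
frames = frame_signal(x, frame_len=int(0.020 * sr), hop_len=int(0.010 * sr))
print(frames.shape)  # (99, 320)
```

Each row of `frames` is one speech frame; the time-domain and frequency-domain representation described above is then computed per frame.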
Step S102: input the target speech signal into a keyword recognition network for keyword recognition, to obtain the keywords contained in the target speech signal. The keyword recognition network comprises a plurality of convolutional network units connected in series and a classifier; the output features of each convolutional network unit serve as the input features of the next, and each convolutional network unit comprises one or more convolutional layers. The classifier classifies the output features of the last convolutional network unit to obtain the keywords contained in the target speech signal.
The keyword recognition network recognizes the keywords in the input target speech signal. Referring to fig. 2, fig. 2 is a schematic diagram of a keyword recognition network according to an embodiment of the present invention. The keyword recognition network may include a plurality of convolutional network units connected in series, denoted by numeral 20 in fig. 2. In a specific embodiment, the number of series-connected convolutional network units 20 is 6; experiments show that with six convolutional network units in series, the recognition accuracy of the keyword recognition network is high, and adding further units has little effect on accuracy. The six convolutional network units comprise a first convolutional network unit 201, a second convolutional network unit 202, a third convolutional network unit 203, a fourth convolutional network unit 204, a fifth convolutional network unit 205, and a sixth convolutional network unit 206. The arrows in fig. 2 indicate the feature flow between the convolutional network units: the output features of each unit serve as the input features of the next, i.e., the output features of the first convolutional network unit 201 are the input features of the second convolutional network unit 202, the output features of the second convolutional network unit 202 are the input features of the third convolutional network unit 203, and so on. Each convolutional network unit comprises one or more convolutional layers, and the convolution kernel and channel count of each convolutional layer may be the same or different.
With continued reference to fig. 2, the keyword recognition network further includes a classifier 21, which recognizes the keywords in the target speech signal from the output features of the convolutional network units 20 according to trained classification logic. The classifier 21 may be implemented based on a Softmax (multinomial logistic regression) classifier or a Support Vector Machine (SVM) model.
Optionally, a convolutional layer 22 with a 3 × 1 kernel and C output channels (denoted "Conv 3 × 1, C" in fig. 2, where C is a positive integer) may be placed between the input of the keyword recognition network (which receives the target speech signal) and the convolutional network units 20 to enhance the input speech features.
Optionally, a Global Average Pooling (GAP) layer 23 may be placed between the convolutional network units 20 and the classifier 21; it performs dimension reduction directly and greatly reduces the model parameters of the keyword recognition network.
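To illustrate why global average pooling removes parameters, the following sketch (not from the patent) collapses a (channels, time) feature map to one value per channel, so the classifier that follows only needs one weight row per channel instead of one per channel-time position:

```python
import numpy as np

def global_average_pool(feat):
    """Collapse the time axis of a (channels, time) feature map to (channels,)."""
    return feat.mean(axis=-1)

feat = np.arange(12.0).reshape(3, 4)   # 3 channels, 4 time steps
pooled = global_average_pool(feat)
print(pooled)  # [1.5 5.5 9.5]
```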
Optionally, the keyword recognition network is implemented based on MobileNetV2. MobileNetV2 is similar in character to MobileNetV1, mainly using depthwise convolution and pointwise convolution to reduce the parameter count; in addition, MobileNetV2 adds residual connections (inverted residuals), which effectively mitigate network degradation as the network deepens and backpropagation continues. MobileNetV2 thus strikes a balance between accuracy and parameter count.
In the above keyword recognition method, because the low-level features (i.e., the output features of the preceding convolutional network unit) always participate in forming the high-level features of the next layer (i.e., the output features of the following convolutional network unit), the receptive field over the audio features of the input target speech signal is enlarged, which improves keyword recognition accuracy. In addition, the method shrinks the feature map through the multi-layer convolutional network, greatly reducing the number of model parameters. In summary, the embodiment of the present invention provides a novel keyword recognition network that meets the accuracy requirements of practical applications while greatly reducing the model parameter count.
In one embodiment, some or all of the convolutional network units adopt a Residual Network (ResNet) structure. That is, the input features of each convolutional network unit serve as a reference, and a residual function learned during training participates in generating the unit's output features.
Introducing a residual network into this low-computation keyword recognition network in this way effectively improves recognition accuracy.
In one embodiment, each convolutional network unit comprises, in feature-flow order, a first convolutional layer, a second convolutional layer, and a third convolutional layer, where the first and second convolutional layers have a first number of channels, the third convolutional layer has a second number of channels, and the first number may be the same as or different from the second number.
Specifically, the first, second, and third convolutional layers are connected in series; in feature-flow order, the output features of the first convolutional layer serve as the input features of the second, and the output features of the second serve as the input features of the third.
Optionally, across the series-connected convolutional network units, the first number increases gradually in feature-flow order; and/or the second number increases gradually in feature-flow order.
In one embodiment, referring to fig. 3, fig. 3 is a schematic diagram of a single convolutional network unit according to an embodiment of the present invention, with the unit's structure shown within the dashed box. Here the first number and the second number of each convolutional network unit are the same, and this number increases across the series-connected units. Taking fig. 2 as an example, the first numbers (i.e., the second numbers) of the six convolutional network units 20, in order from the first convolutional network unit 201 to the sixth convolutional network unit 206, are 8, 12, 14, 16, 24, and 48.
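To see how these small channel widths keep the model tiny, the following back-of-the-envelope tally (an illustration, not a figure from the patent; that the front convolution outputs 8 channels and that biases are ignored are our assumptions) counts the weights of the three convolutional layers in each of the six units:

```python
# Channel widths per unit, taken from the embodiment above
widths = [8, 12, 14, 16, 24, 48]

def unit_weights(c_prev, c):
    # 1x1 conv (c_prev -> c), 3x1 conv (c -> c), 1x1 conv (c -> c); biases ignored
    return c_prev * c * 1 + c * c * 3 + c * c * 1

c_prev, total = 8, 0   # assume the front 3x1 conv outputs 8 channels
for c in widths:
    total += unit_weights(c_prev, c)
    c_prev = c
print(total)  # 16248
```

Roughly sixteen thousand weights for all six units is orders of magnitude below typical image-style CNNs, which is consistent with the memory claims above.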
Optionally, for each convolution network unit, the convolution kernel of the first convolution layer is 1 × 1, the convolution kernel of the second convolution layer is 3 × 1, and the convolution kernel of the third convolution layer is 1 × 1.
Continuing with the example of fig. 3, the convolution kernel of the first convolutional layer 301 is 1 × 1 with C_in channels (denoted "Conv 1 × 1, C_in" in fig. 3); the convolution kernel of the second convolutional layer 302 is 3 × 1 with C_in channels (denoted "Conv 3 × 1, C_in" in fig. 3); and the convolution kernel of the third convolutional layer 303 is 1 × 1 with C_out channels (denoted "Conv 1 × 1, C_out" in fig. 3). Optionally, an activation function is applied between adjacent convolutional layers; the activation function may be the Rectified Linear Unit (ReLU). Further, a Batch Normalization (BN) layer may be included between adjacent convolutional layers. Here C_in denotes the number of input channels of each convolutional network unit, i.e., the first number, and C_out denotes the number of output channels of each convolutional network unit, i.e., the second number; both are positive integers.
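A minimal sketch of one such unit's forward pass, assuming 'same' zero padding, ReLU between the layers, and a residual skip applied when the input and output shapes match (BN omitted for brevity; this is an illustration under those assumptions, not the patented implementation):

```python
import numpy as np

def conv1d(x, w):
    """x: (C_in, T); w: (C_out, C_in, k); 'same' zero padding, k odd."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # gather the k shifted views, shape (C_in, T, k), then contract channels and taps
    cols = np.stack([xp[:, i:i + x.shape[1]] for i in range(k)], axis=-1)
    return np.einsum('oik,itk->ot', w, cols)

def relu(x):
    return np.maximum(x, 0.0)

def conv_unit(x, w1, w2, w3):
    """1x1 conv -> ReLU -> 3x1 conv -> ReLU -> 1x1 conv, plus a residual skip."""
    y = conv1d(relu(conv1d(relu(conv1d(x, w1)), w2)), w3)
    return y + x if y.shape == x.shape else y

rng = np.random.default_rng(0)
C, T = 8, 16
x = rng.standard_normal((C, T))
w1 = rng.standard_normal((C, C, 1)) * 0.1   # first layer, kernel 1x1
w2 = rng.standard_normal((C, C, 3)) * 0.1   # second layer, kernel 3x1
w3 = rng.standard_normal((C, C, 1)) * 0.1   # third layer, kernel 1x1
out = conv_unit(x, w1, w2, w3)
print(out.shape)  # (8, 16)
```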
In this embodiment, each convolutional network unit may comprise three convolutional layers (the first, second, and third convolutional layers), which increases the depth of the keyword recognition network, further reduces the number of model parameters, and effectively reduces the network's memory footprint on the device. Experiments show that the device memory used by the keyword recognition network of the embodiment of the present invention is only about half that of a conventional network.
In one embodiment, processing the original speech signal to obtain the target speech signal in step S101 of fig. 1 may include: extracting the Mel-frequency cepstral coefficients (MFCCs) of the original speech signal, and performing a temporal convolution with each frame's MFCC treated as a time series to obtain the target speech signal, where the MFCC feature dimension serves as the input channel count of the keyword recognition network.
Those skilled in the art will appreciate that convolutional neural networks (CNNs) are commonly used to recognize speech signals, and that CNNs can extract high-level features from low-level ones; however, currently mainstream small-kernel CNNs have difficulty capturing high-frequency and low-frequency information simultaneously. Embodiments of the present invention therefore feed the MFCCs of the original speech signal into a temporal convolution.
Note that a conventional convolution treats its input like a grayscale image: applied to the MFCCs of an original speech signal, the input would be represented as t × f × 1, where t is the time dimension (corresponding to the time-domain extent of each MFCC frame), f is the feature dimension (corresponding to the frequency-domain features of the MFCC), and 1 is the channel count.
In the temporal convolution here, each frame's MFCC is instead treated as a time series rather than a grayscale image. The input then becomes t × 1 × f, i.e., the convolution kernel is t × 1 and the MFCC feature dimension f (the frequency-domain features) serves as the channel count. The smaller convolution kernel effectively reduces the parameter count, and changing the channel count from 1 to f lets the feature dimension f propagate downward, enlarging the features' receptive field.
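The saving carries over to the later layers: a 3 × 1 kernel over f channels has a third of the weights of a 3 × 3 image-style kernel with the same channel counts, because the kernel no longer has to sweep the frequency axis. The channel counts below (40 in, 64 out) are illustrative assumptions, not values from the patent:

```python
# Illustrative weight counts, biases ignored
c_in, c_out = 40, 64

# Image-style 2-D conv: a 3x3 kernel sweeps both the time and frequency axes
w_2d = 3 * 3 * c_in * c_out

# Series-style 1-D conv: frequency lives in the channels, so a 3x1 kernel suffices
w_1d = 3 * 1 * c_in * c_out

print(w_2d, w_1d, w_2d // w_1d)  # 23040 7680 3
```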
In one embodiment, an upgraded keyword recognition network is provided, which may comprise three parts: MFCC extraction, a temporal convolutional network (TCN), and the keyword recognition network structure described with reference to figs. 2 and 3. The temporal convolution enlarges the feature receptive field of the upgraded keyword recognition network while reducing computation.
Referring to fig. 4, fig. 4 is a flowchart of a typical MFCC extraction process; MFCC extraction obtains the time-domain and frequency-domain features of the original speech signal. Steps S401 to S407 are executed in sequence on the original speech signal:
step S401, pre-emphasis processing; i.e. passing the original speech signal through a high pass filter. The pre-emphasis is to boost the high frequency part to flatten the spectrum of the original speech signal, and to maintain the spectrum in the whole frequency band from low frequency to high frequency, so that the spectrum can be obtained with the same signal-to-noise ratio.
Step S402, framing; the N sampling points of the output signal of step S401 are grouped into one observation unit, which is called a frame. Typically, N has a value of 256 or 512, covering a time of about 20 to 30 ms. In order to avoid excessive signal variation between two adjacent frames, an overlap region is formed between two adjacent frames, the overlap region includes M sampling points, and M is usually about 1/2 or 1/3 of N. Generally, a sampling frequency of a voice signal employed in voice recognition is 8 kilohertz (KHz) or 16KHz. For example, if the frame length is 256 samples, the corresponding time length is 256/8000 × 1000=32 milliseconds (ms) in the case of 8 KHz.
Step S403, adding a Window (Hamming Window); multiplying each voice frame after framing in step S402 by Hamming window to increase continuity of left and right ends of the frame. The window size (denoted window size) may be 30ms and the window step size (window size) may be 10ms.
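Steps S401 to S403 can be sketched as follows (a simplified illustration; the 0.97 pre-emphasis coefficient, 8 kHz rate, N = 256, and 50% overlap are common textbook choices, not values mandated by the patent):

```python
import numpy as np

def preemphasize(x, alpha=0.97):          # step S401: first-order high-pass filter
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, n=256, hop=128):  # steps S402 (framing) and S403 (Hamming)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[i * hop : i * hop + n] for i in range(n_frames)])
    return frames * np.hamming(n)

x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # 1 s of a 440 Hz tone at 8 kHz
frames = frame_and_window(preemphasize(x))
print(frames.shape)  # (61, 256)
```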
Step S404, performing a Fast Fourier Transform (FFT) to obtain the spectrum of each speech frame.
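Steps S403 and S404 together can be sketched as follows; the 512-point FFT size is an illustrative assumption (any size at least the frame length works):

```python
import numpy as np

# One 256-sample speech frame (random data stands in for real speech here).
frame = np.random.default_rng(0).standard_normal(256)

# Step S403: taper the frame edges toward zero with a Hamming window,
# improving continuity between adjacent frames.
windowed = frame * np.hamming(256)

# Step S404: one-sided FFT, then the periodogram power spectrum.
NFFT = 512                               # assumed FFT size
spectrum = np.fft.rfft(windowed, NFFT)   # NFFT // 2 + 1 = 257 frequency bins
power = (np.abs(spectrum) ** 2) / NFFT
```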
Step S405, passing through a Mel filter bank; that is, the spectrum is passed through a set of Mel-scale triangular filters, which smooths the spectrum and eliminates the effect of harmonics, thereby highlighting the formants of the original speech signal. A speech recognition system using MFCCs as features is therefore unaffected by pitch differences in the input speech, and the amount of computation is also reduced.
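The Mel-scale triangular filter bank of step S405 can be constructed as below; the filter count (40), FFT size (512), and sampling rate (8 kHz) are illustrative assumptions consistent with the examples in this description, not values fixed by the patent:

```python
import numpy as np

def mel_filterbank(n_filters: int, nfft: int, sr: int) -> np.ndarray:
    """Triangular filters with centers spaced evenly on the Mel scale."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # n_filters + 2 points: each filter spans [left, center, right].
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)

    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):          # rising edge of the triangle
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling edge of the triangle
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

fb = mel_filterbank(n_filters=40, nfft=512, sr=8000)
```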
Step S406, performing a logarithmic operation; the logarithmic energy of each filter output is calculated.
Step S407, performing a Discrete Cosine Transform (DCT); the above logarithmic energies are fed into the DCT to obtain the MFCCs. The number of DCT features may take the value of 40.
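Steps S406 and S407 can be sketched as a logarithm followed by a type-II DCT; the 40 filter-bank energies below are synthetic stand-ins, and the plain matrix-form DCT is one common (unnormalized) convention:

```python
import numpy as np

def dct2(x: np.ndarray, n_out: int) -> np.ndarray:
    """Unnormalized type-II DCT of a 1-D vector, keeping n_out coefficients."""
    N = len(x)
    n = np.arange(N)
    basis = np.cos(np.pi * np.outer(np.arange(n_out), (2 * n + 1)) / (2 * N))
    return basis @ x

# Hypothetical filter-bank energies for one frame (40 Mel filters).
energies = np.abs(np.random.default_rng(1).standard_normal(40)) + 1e-3
log_energies = np.log(energies)        # step S406: logarithmic energy
mfcc = dct2(log_energies, n_out=40)    # step S407: DCT -> 40 MFCC features
```

For a constant input, all DCT coefficients except the first vanish, which is why the DCT compacts the slowly varying log-spectral envelope into a few low-order coefficients.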
Referring to fig. 5, an embodiment of the invention further provides a keyword recognition apparatus 50, including: a target speech signal obtaining module 501, configured to obtain an original speech signal and perform framing processing on the original speech signal to obtain a target speech signal, where the target speech signal includes multiple speech frames arranged in time order, and each speech frame is represented based on time-domain and frequency-domain features; and a keyword recognition module 502, configured to input the target speech signal into a keyword recognition network for keyword recognition, so as to obtain the keywords contained in the target speech signal. The keyword recognition network includes a plurality of series-connected convolutional network units and a classifier, where the output feature of each convolutional network unit is the input feature of the next, and each convolutional network unit includes one or more convolutional layers; the classifier is configured to classify the output features of the last convolutional network unit to obtain the keywords contained in the target speech signal.
Optionally, some or all of the convolutional network units adopt a residual network structure.
Optionally, each convolutional network unit includes, in feature flow order, a first convolutional layer, a second convolutional layer, and a third convolutional layer, where the number of channels of the first and second convolutional layers is a first number, the number of channels of the third convolutional layer is a second number, and the first number is the same as or different from the second number.
Optionally, the first number of the plurality of convolutional network units connected in series gradually increases according to the feature flow order; and/or the second number of the plurality of series-connected convolutional network units is gradually increased according to the characteristic flow sequence.
Optionally, for each convolution network unit, the convolution kernel of the first convolution layer is 1 × 1, the convolution kernel of the second convolution layer is 3 × 1, and the convolution kernel of the third convolution layer is 1 × 1.
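The 1 × 1 / 3 × 1 / 1 × 1 structure above can be sketched in plain NumPy to show how the channel counts and the temporal kernel interact: the 1 × 1 convolutions act per-frame on channels, and only the 3 × 1 convolution mixes information across time. The layer widths (16 → 64 → 64 → 24) and random weights below are illustrative assumptions, not the patent's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """1x1 convolution over (time, channels): a per-frame linear projection."""
    return x @ w                       # w: (c_in, c_out)

def conv3x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """3x1 convolution along the time axis with 'same' zero padding."""
    t, _ = x.shape
    xp = np.pad(x, ((1, 1), (0, 0)))   # pad one frame at each end
    return np.stack([(xp[i:i + 3, :, None] * w).sum(axis=(0, 1))
                     for i in range(t)])  # w: (3, c_in, c_out)

# Hypothetical unit: expand (1x1) -> mix in time (3x1) -> project (1x1).
x = rng.standard_normal((100, 16))                # 100 frames, 16 channels in
h = conv1x1(x, rng.standard_normal((16, 64)))     # first layer: first number 64
h = conv3x1(h, rng.standard_normal((3, 64, 64)))  # second layer: first number 64
y = conv1x1(h, rng.standard_normal((64, 24)))     # third layer: second number 24
```

Note that the time dimension is preserved (100 frames in, 100 out) while the channel count changes, matching the role of the channel numbers described above.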
Optionally, the keyword recognition network is implemented based on MobileNetV 2.
Optionally, the number of the convolutional network units is 6.
In one embodiment, the keyword recognition module 502 may include: an MFCC extraction unit, configured to extract the Mel-frequency cepstral coefficients (MFCCs) of the original speech signal; and a time convolution unit, configured to perform time convolution with each frame's MFCCs as a time series, to obtain the target speech signal; where the dimension of the MFCC is the number of input channels of the keyword recognition network.
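The time convolution step, which treats the per-frame MFCC vectors as a time series, can be illustrated as a depthwise temporal convolution in which each MFCC dimension is one channel filtered independently along the frame axis. The 3-tap kernel, 50-frame input, and 'valid' convolution mode below are illustrative assumptions:

```python
import numpy as np

def temporal_conv(mfcc: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Depthwise temporal convolution: each MFCC dimension (channel) is
    filtered independently along the frame (time) axis, 'valid' mode."""
    _, d = mfcc.shape
    return np.stack([np.convolve(mfcc[:, j], kernel, mode="valid")
                     for j in range(d)], axis=1)

mfcc = np.random.default_rng(2).standard_normal((50, 40))  # 50 frames x 40 MFCCs
out = temporal_conv(mfcc, np.array([0.25, 0.5, 0.25]))     # assumed 3-tap kernel
```

Each output frame summarizes three input frames, which is how the time convolution enlarges the receptive field before the convolutional network units.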
For more details of the working principle and the working mode of the keyword recognition apparatus 50, reference may be made to the description of the keyword recognition method in fig. 1 to 5, which is not repeated here.
In a specific implementation, the keyword recognition apparatus 50 may correspond to a chip having a keyword recognition function in a terminal, or to a chip having a data processing function, such as a System-on-a-Chip (SoC); or to a chip module including a chip with a keyword recognition function in a terminal; or to a chip module having a chip with a data processing function; or to a terminal.
An embodiment of the present invention further provides a storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the keyword recognition method of any one of fig. 1 to fig. 4 are performed. The storage medium may be a computer-readable storage medium, and may include, for example, non-volatile or non-transitory memory, and may further include an optical disc, a mechanical hard disk, a solid-state drive, and the like.
The embodiment of the invention also provides computer equipment. The computer equipment may include a memory and a processor, the memory having stored thereon a computer program operable on the processor; when the processor executes the computer program, the steps of the keyword recognition method of fig. 1 to fig. 4 are performed.
Specifically, in the embodiment of the present invention, the processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It will also be appreciated that the memory in the embodiments of the present application may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this document indicates an "or" relationship between the objects before and after it.
The term "plurality" appearing in the embodiments of the present application means two or more.
The descriptions "first", "second", etc. appearing in the embodiments of the present application are only used to illustrate and distinguish objects; they do not denote a particular order, do not limit the number of devices in the embodiments of the present application, and do not constitute any limitation on the embodiments of the present application.
The term "connect" in the embodiments of the present application refers to various connection manners, such as direct connection or indirect connection, to implement communication between devices, which is not limited in this embodiment of the present application.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A method for keyword recognition, the method comprising:
acquiring an original voice signal, and processing the original voice signal to obtain a target voice signal, wherein the target voice signal comprises a plurality of voice frames which are arranged according to a time sequence, and the voice frames are represented on the basis of time domain characteristics and frequency domain characteristics;
inputting the target voice signal into a keyword recognition network for keyword recognition to obtain keywords contained in the target voice signal;
the keyword identification network comprises a plurality of convolutional network units and classifiers which are connected in series, the output characteristic of the former convolutional network unit is the input characteristic of the latter convolutional network unit, and each convolutional network unit comprises one or more convolutional layers;
the classifier is used for classifying the output characteristics of the last convolution network unit to obtain key words contained in the target voice signal;
between the input end of the keyword recognition network and the plurality of series-connected convolutional network units, the keyword recognition network further comprises a convolutional layer with a 3 × 1 convolution kernel;
each convolutional network unit comprises, in feature flow order, a first convolutional layer, a second convolutional layer, and a third convolutional layer;
for each convolutional network unit, the convolution kernel of the first convolutional layer is 1 × 1, the convolution kernel of the second convolutional layer is 3 × 1, and the convolution kernel of the third convolutional layer is 1 × 1.
2. The method of claim 1, wherein some or all of the convolutional network units use a residual network structure.
3. The method of claim 1 or 2, wherein the number of channels of the first convolutional layer and the second convolutional layer is a first number, and the number of channels of the third convolutional layer is a second number, the first number being the same as or different from the second number.
4. The method of claim 3, wherein the first number of the plurality of series-connected convolutional network units increases gradually in feature flow order; and/or
the second number of the plurality of series-connected convolutional network units increases gradually in feature flow order.
5. The method of claim 1, wherein the keyword recognition network is implemented based on MobileNetV 2.
6. The method of claim 1 or 2, wherein the number of convolutional network elements is 6.
7. The method according to claim 1 or 2, wherein said processing the original speech signal to obtain the target speech signal comprises:
extracting Mel-frequency cepstral coefficients (MFCCs) of the original speech signal;
performing time convolution by taking each frame of MFCC as a time sequence to obtain the target voice signal;
and the dimension of the MFCC is the number of input channels of the keyword recognition network.
8. An apparatus for keyword recognition, the apparatus comprising:
the target voice signal acquisition module is used for acquiring an original voice signal and processing the original voice signal to obtain a target voice signal, wherein the target voice signal comprises a plurality of voice frames which are arranged according to a time sequence, and the voice frames are represented on the basis of time domain characteristics and frequency domain characteristics;
the keyword identification module is used for inputting the target voice signal into a keyword identification network for keyword identification to obtain keywords contained in the target voice signal;
the keyword identification network comprises a plurality of convolutional network units and classifiers which are connected in series, the output characteristic of the previous convolutional network unit is the input characteristic of the next convolutional network unit, and each convolutional network unit comprises one or more convolutional layers;
the classifier is used for classifying the output characteristics of the last convolution network unit to obtain key words contained in the target voice signal;
between the input end of the keyword recognition network and the plurality of series-connected convolutional network units, the keyword recognition network further comprises a convolutional layer with a 3 × 1 convolution kernel;
each convolutional network unit comprises, in feature flow order, a first convolutional layer, a second convolutional layer, and a third convolutional layer;
for each convolution network element, the convolution kernel of the first convolution layer is 1 × 1, the convolution kernel of the second convolution layer is 3 × 1, and the convolution kernel of the third convolution layer is 1 × 1.
9. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, performing the steps of the method according to any one of claims 1 to 7.
10. A computer device comprising an apparatus as claimed in claim 8, or comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, wherein the processor, when executing the computer program, performs the steps of the method of any one of claims 1 to 7.
CN202110714828.3A 2021-06-25 2021-06-25 Keyword recognition method and device, storage medium and computer equipment Active CN113409775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110714828.3A CN113409775B (en) 2021-06-25 2021-06-25 Keyword recognition method and device, storage medium and computer equipment


Publications (2)

Publication Number Publication Date
CN113409775A CN113409775A (en) 2021-09-17
CN113409775B true CN113409775B (en) 2023-01-10

Family

ID=77679454


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111276125B (en) * 2020-02-11 2023-04-07 华南师范大学 Lightweight speech keyword recognition method facing edge calculation
CN111667835A (en) * 2020-06-01 2020-09-15 马上消费金融股份有限公司 Voice recognition method, living body detection method, model training method and device


Similar Documents

Publication Publication Date Title
CN110120224B (en) Method and device for constructing bird sound recognition model, computer equipment and storage medium
KR102213013B1 (en) Frequency-based audio analysis using neural networks
WO2019232829A1 (en) Voiceprint recognition method and apparatus, computer device and storage medium
WO2019232845A1 (en) Voice data processing method and apparatus, and computer device, and storage medium
Birnbaum et al. Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations.
WO2021189642A1 (en) Method and device for signal processing, computer device, and storage medium
CN110931023B (en) Gender identification method, system, mobile terminal and storage medium
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
WO2022141868A1 (en) Method and apparatus for extracting speech features, terminal, and storage medium
CN108922561A (en) Speech differentiation method, apparatus, computer equipment and storage medium
Luo et al. Group communication with context codec for lightweight source separation
WO2021127982A1 (en) Speech emotion recognition method, smart device, and computer-readable storage medium
CN111179910A (en) Speed of speech recognition method and apparatus, server, computer readable storage medium
Bavu et al. TimeScaleNet: A multiresolution approach for raw audio recognition using learnable biquadratic IIR filters and residual networks of depthwise-separable one-dimensional atrous convolutions
CN110648669B (en) Multi-frequency shunt voiceprint recognition method, device and system and computer readable storage medium
CN110648655B (en) Voice recognition method, device, system and storage medium
CN113470688B (en) Voice data separation method, device, equipment and storage medium
CN114627895A (en) Acoustic scene classification model training method and device, intelligent terminal and storage medium
Pan et al. An efficient hybrid learning algorithm for neural network–based speech recognition systems on FPGA chip
CN113409775B (en) Keyword recognition method and device, storage medium and computer equipment
CN114743561A (en) Voice separation device and method, storage medium and computer equipment
CN113869212A (en) Multi-modal in-vivo detection method and device, computer equipment and storage medium
Sunil Kumar et al. Phoneme recognition using zerocrossing interval distribution of speech patterns and ANN
CN115035897B (en) Keyword detection method and system
Alex et al. Performance analysis of SOFM based reduced complexity feature extraction methods with back propagation neural network for multilingual digit recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant