CN114842838A - Audio recognition method, device, electronic apparatus, medium, and program product - Google Patents


Info

Publication number
CN114842838A
Authority
CN
China
Prior art keywords
audio
data
audio data
audio recognition
model
Prior art date
Legal status (assumed; not a legal conclusion)
Pending
Application number
CN202210415556.1A
Other languages
Chinese (zh)
Inventor
唐剑
夏立超
赵东宇
刘宁
张法朝
奉飞飞
Current Assignee (listed assignees may be inaccurate)
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Original Assignee
Midea Group Co Ltd
Midea Group Shanghai Co Ltd
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Midea Group Co Ltd and Midea Group Shanghai Co Ltd
Priority to CN202210415556.1A
Publication of CN114842838A
Legal status: Pending


Classifications

    • G10L15/16 — Speech classification or search using artificial neural networks
    • G06N3/045 — Neural network architectures; combinations of networks
    • G06N3/08 — Neural network learning methods
    • G10L15/063 — Training of speech recognition systems (creation of reference templates; adaptation to the characteristics of the speaker's voice)
    • G10L2015/0631 — Creating reference templates; clustering
    • G10L2015/0635 — Updating or merging of old and new templates; mean values; weighting


Abstract

The application relates to the technical field of audio processing, and provides an audio recognition method, an audio recognition device, an electronic device, a medium, and a program product. The method comprises: quantizing audio data to be recognized; inputting the quantized audio data to be recognized and fixed-point historical audio data into an audio recognition model to obtain audio recognition result data output by the model together with updated fixed-point historical audio data; and dequantizing the audio recognition result data to obtain an audio recognition result. The updated fixed-point historical audio data is produced by a memory module of the audio recognition model using a depthwise separable convolution. Because quantization and dequantization are performed outside the model, the method eliminates the redundant computation these operations would otherwise cause inside the audio recognition model and improves its inference speed; and because the memory module obtains the fixed-point historical audio data with a depthwise separable convolution, the number of operators and the latency are effectively reduced, further improving the model's inference efficiency.

Description

Audio recognition method, device, electronic apparatus, medium, and program product
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to an audio recognition method, apparatus, electronic device, medium, and program product.
Background
Existing audio recognition methods mainly use deep learning models, which may be sequence models such as a Recurrent Neural Network (RNN). To run full-integer inference for audio recognition, such a model must quantize both its weights and its activation values. Specifically, during quantized forward inference the input data is of type float32 (floating-point data), so the model must first quantize the input; to keep the input and output data types consistent, the model must also dequantize the data to be output back to float32. These in-model quantization and dequantization operations introduce redundant computation.
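The quantize/dequantize round trip described above can be sketched as follows. This is a minimal illustration of full-integer (int8) affine quantization as commonly used in deep-learning inference; the function names, the symmetric scale, and the zero point of 0 are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map float32 values to int8: q = round(x / scale) + zero_point."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Inverse mapping back to floating point: x ~ (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
scale, zero_point = 1.0 / 127.0, 0          # symmetric int8 quantization
q = quantize(x, scale, zero_point)           # integer arithmetic happens here
x_hat = dequantize(q, scale, zero_point)     # back to float32 for the caller
print(np.max(np.abs(x - x_hat)))             # error bounded by the scale
```

When this pair of operations sits inside the model graph, every inference pays for both conversions; the method below moves them outside the model.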
Disclosure of Invention
The present application is directed to solving at least one of the problems in the prior art.
To this end, an audio recognition method is proposed in which the operations of quantizing the audio data to be recognized and dequantizing the audio recognition result data output by the audio recognition model are placed outside the model. This eliminates the redundant computation that quantization and dequantization would otherwise cause inside the audio recognition model and improves its inference speed. In addition, the fixed-point historical audio data is obtained by a memory module of the audio recognition model using a depthwise separable convolution, which effectively improves parallelism, reduces the number of operators, lowers latency, and improves the inference efficiency of the audio recognition model.
The application provides an audio recognition device.
The application provides an electronic device.
The present application proposes a non-transitory computer-readable storage medium.
The present application proposes a computer program product.
The audio recognition method according to the embodiment of the first aspect of the application comprises the following steps:
quantizing audio data to be recognized;
inputting the quantized audio data to be recognized and fixed-point historical audio data into an audio recognition model, and obtaining audio recognition result data output by the audio recognition model together with updated fixed-point historical audio data;
dequantizing the audio recognition result data output by the audio recognition model to obtain an audio recognition result;
wherein the audio recognition model is trained on audio data samples and the audio recognition results corresponding to those samples; the updated fixed-point historical audio data is obtained by a memory module of the audio recognition model, which uses a depthwise separable convolution to convolve the quantized audio data to be recognized with the fixed-point historical audio data; and the updated fixed-point historical audio data is used for the model's next audio recognition.
According to this audio recognition method, placing the quantization of the audio data to be recognized and the dequantization of the audio recognition result data outside the audio recognition model eliminates the redundant computation these operations would otherwise cause inside the model and improves its inference speed. During recognition, the model uses the fixed-point historical audio data directly, with no quantization or dequantization, and the updated fixed-point historical audio data produced by the memory module's depthwise separable convolution can be used directly for the model's next recognition, effectively improving both the inference efficiency and the accuracy of the audio recognition model.
According to an embodiment of the present application, obtaining the updated fixed-point historical audio data by convolving the quantized audio data to be recognized and the fixed-point historical audio data with a depthwise separable convolution, via the memory module of the audio recognition model, is specifically:
replacing a for-loop operator with a depthwise separable convolution operator in the memory module of the audio recognition model, and convolving the quantized audio data to be recognized with the fixed-point historical audio data to obtain the updated fixed-point historical audio data.
According to this audio recognition method, the for-loop operator of the memory module is equivalently replaced by a highly parallel depthwise separable convolution operator, which eliminates a large number of operators and for loops, reduces matrix dimensionality and memory usage, lowers latency, and improves the inference efficiency of the audio recognition model.
According to an embodiment of the application, the for-loop operator comprises any one, or any combination, of the following: multidimensional slicing operators, multiplication operators, and addition operators.
According to this audio recognition method, equivalently replacing the memory module's for-loop operator with a depthwise separable convolution operator for the multiply-add computation greatly reduces operator usage, simplifies the model's computation, and improves the accuracy of the audio recognition model.
According to an embodiment of the present application, obtaining the updated fixed-point historical audio data by using a depthwise separable convolution operator in place of a for-loop operator, via the memory module of the audio recognition model, to convolve the quantized audio data to be recognized with the fixed-point historical audio data, comprises:
converting the input data and the output data of the memory module's depthwise separable convolution to a unified data format.
According to this audio recognition method, converting the input and output data of the memory module's depthwise separable convolution to a unified data format improves the memory module's data processing efficiency and reduces the error rate caused by inconsistent data formats.
According to an embodiment of the present application, the unified data format conversion of the input and output data of the memory module's depthwise separable convolution is specifically:
uniformly converting the data formats of the input data and the output data of the memory module's depthwise separable convolution into the NHWC data format.
According to this audio recognition method, since NHWC is a data format commonly used in deep learning, uniformly converting the input and output data of the memory module's depthwise separable convolution into the NHWC data format effectively improves the generality of the memory module.
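The NHWC conversion mentioned above amounts to an axis reordering. The sketch below (with illustrative dimensions not taken from the patent) shows a channels-first NCHW tensor being rearranged into the NHWC layout (batch, height, width, channels):

```python
import numpy as np

def to_nhwc(x_nchw: np.ndarray) -> np.ndarray:
    """Reorder axes from (N, C, H, W) to (N, H, W, C)."""
    return np.transpose(x_nchw, (0, 2, 3, 1))

# Hypothetical int8 feature tensor: batch 1, 64 channels, height 1, 40 frames wide.
x = np.zeros((1, 64, 1, 40), dtype=np.int8)
y = to_nhwc(x)
print(y.shape)  # (1, 1, 40, 64)
```

A transpose like this is cheap relative to the convolution itself, which is one reason a single unified layout at the module boundary is attractive.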
According to an embodiment of the present application, training the audio recognition model based on the audio data samples and their corresponding audio recognition results comprises:
acquiring audio data samples and the audio recognition result corresponding to each sample;
training an initial audio recognition model on the audio data samples and their corresponding audio recognition results to obtain the audio recognition model.
According to this audio recognition method, a model dedicated to audio recognition is obtained from the audio data samples and their corresponding audio recognition results, so that the model can perform deep learning on those samples and results, improving its accuracy.
The audio recognition device according to the embodiment of the second aspect of the application comprises:
a quantization module configured to quantize audio data to be recognized;
an audio recognition module configured to input the quantized audio data to be recognized and fixed-point historical audio data into an audio recognition model, and obtain audio recognition result data output by the audio recognition model together with updated fixed-point historical audio data;
a dequantization module configured to dequantize the audio recognition result data output by the audio recognition model to obtain an audio recognition result;
wherein the audio recognition model is trained on audio data samples and the audio recognition results corresponding to those samples; the updated fixed-point historical audio data is obtained by a memory module of the audio recognition model, which uses a depthwise separable convolution to convolve the quantized audio data to be recognized with the fixed-point historical audio data; and the updated fixed-point historical audio data is used for the model's next audio recognition.
An electronic device according to an embodiment of the third aspect of the present application includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements any of the above audio recognition methods when executing the program.
A non-transitory computer-readable storage medium according to an embodiment of the fourth aspect of the present application, having stored thereon a computer program that, when executed by a processor, implements the audio recognition method of any of the above.
A computer program product according to an embodiment of the fifth aspect of the present application comprises a computer program which, when executed by a processor, implements the audio recognition method of any of the above.
One or more technical solutions in the embodiments of the present application have at least one of the following technical effects: quantizing the audio data to be recognized and dequantizing the audio recognition result data outside the audio recognition model eliminates the redundant computation these operations would otherwise cause inside the model and improves its inference speed; the model uses the fixed-point historical audio data directly during recognition, with no quantization or dequantization, and the updated fixed-point historical audio data produced by the memory module's depthwise separable convolution can be used directly for the model's next recognition, effectively improving both the inference efficiency and the accuracy of the audio recognition model.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing the embodiments are briefly introduced below. The drawings described here represent only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an audio recognition method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating an application of an audio recognition model of an audio recognition method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of equivalently replacing the for-loop operator of the memory module in the audio recognition model with a depthwise separable convolution for the multiply-add computation, in the audio recognition method provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an audio recognition apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application.
Fig. 1 is a schematic flowchart of an audio recognition method according to an embodiment of the present application.
Referring to fig. 1, an audio recognition method provided in an embodiment of the present application includes:
S110, quantizing audio data to be recognized;
S120, inputting the quantized audio data to be recognized and fixed-point historical audio data into an audio recognition model, and obtaining audio recognition result data output by the audio recognition model together with updated fixed-point historical audio data;
S130, dequantizing the audio recognition result data output by the audio recognition model to obtain an audio recognition result;
wherein the audio recognition model is trained on audio data samples and the audio recognition results corresponding to those samples; the updated fixed-point historical audio data is obtained by a memory module of the audio recognition model, which uses a depthwise separable convolution to convolve the quantized audio data to be recognized with the fixed-point historical audio data; and the updated fixed-point historical audio data is used for the model's next audio recognition.
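The S110–S130 flow can be sketched as below. Everything here is a hedged illustration: the scale value, the shapes, and the stub standing in for the int8 model are assumptions, and the stub's output is a placeholder rather than a real recognition result. The point is the data flow — quantization and dequantization happen outside the model, and the fixed-point history state is carried between calls without ever being dequantized.

```python
import numpy as np

SCALE = 1.0 / 127.0  # assumed quantization scale, zero point 0

def int8_model(x_q: np.ndarray, history_q: np.ndarray):
    """Stub int8 model: returns result data and updated int8 history.
    A real model would run quantized inference; here the history is
    simply shifted to show the state hand-off between calls."""
    new_history = np.concatenate([history_q[x_q.shape[0]:], x_q])
    return x_q, new_history  # placeholder recognition output

history_q = np.zeros(8, dtype=np.int8)  # initialized fixed-point history
for _ in range(3):                      # repeated inferences reuse history_q
    x = np.random.uniform(-1, 1, 4).astype(np.float32)
    x_q = np.clip(np.round(x / SCALE), -128, 127).astype(np.int8)  # S110
    y_q, history_q = int8_model(x_q, history_q)                    # S120
    y = y_q.astype(np.float32) * SCALE                             # S130
```

Note that `history_q` stays int8 across all three passes: only the per-call input and output cross the float/int boundary.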
It should be noted that the audio recognition method provided by the present application may be executed by any terminal-side device, such as an audio recognition system in which the audio recognition model has been preset in advance.
It should be noted that the audio recognition model may be obtained by pre-training based on the audio data sample and the audio recognition result corresponding to the audio data sample, and the specific training process may include:
acquiring an audio data sample and an audio identification result corresponding to the audio data sample;
and training an initial audio recognition model according to the audio data sample and an audio recognition result corresponding to the audio data sample to obtain the audio recognition model.
During model training, the audio recognition model can perform deep learning according to the audio data samples and audio recognition results corresponding to the audio data samples, so that the precision of the audio recognition model and the accuracy of model reasoning are improved.
Alternatively, the audio recognition model may be any audio recognition model in the prior art that is used for implementing audio recognition and has a memory module, and is not limited herein.
In step S110, the terminal-side device quantizes the audio data to be recognized.
It should be noted that quantization means converting the floating-point arithmetic of a neural network's forward pass into integer arithmetic in order to accelerate computation.
Referring to fig. 2, fig. 2 is a schematic diagram of applying the audio recognition model in the audio recognition method provided by an embodiment of the present application, where x1 denotes the audio data to be recognized, x2 denotes the fixed-point historical audio data, y1 denotes the audio recognition result data, y2 denotes the updated fixed-point historical audio data, and t denotes the index of the audio recognition pass. The audio data to be recognized is generally float32 floating-point data. Quantizing it converts the float32 floating-point data into int8 fixed-point data, mapping the values from a high-precision representation to a low-precision one; this quantization effectively improves the computational efficiency of the terminal-side device performing audio recognition. If quantization and dequantization were performed inside the audio recognition model, they would easily cause redundant model computation. Instead, the terminal-side device first quantizes the audio data to be recognized and then inputs the quantized audio data (which is fixed-point data) into the audio recognition model, which both avoids redundant computation inside the model and preserves the speed and efficiency of on-device audio recognition.
In step S120, the terminal-side device inputs the quantized audio data to be recognized and the fixed-point historical audio data into an audio recognition model, and obtains audio recognition result data and updated fixed-point historical audio data output by the audio recognition model.
It should be noted that, referring to fig. 2, before the model's first audio recognition the terminal-side device initializes the historical audio data so that it becomes the fixed-point historical audio data corresponding to the floating-point value 0. After quantizing the audio data to be recognized, the terminal-side device may first concatenate the quantized audio data to be recognized with the initialized fixed-point historical audio data and use the result as the input of the memory module of the audio recognition model. The memory module then convolves the quantized audio data to be recognized and the fixed-point historical audio data with a depthwise separable convolution, extracting the intermediate features of the audio data to be recognized and producing the updated fixed-point historical audio data. Other layers of the audio recognition model (e.g. recognition layers) then perform audio recognition based on those intermediate features; the specific audio recognition algorithm may be any existing algorithm and is not limited here.
Further, the updated fixed-point historical audio data may be supplied to the memory module of the audio recognition model for the next extraction of intermediate features of audio data to be recognized. It should be noted that, in the prior art, the memory module uses a for loop to slice its input into (memory_range + 1) groups of data of size (1 × input frame number (frames) × input channel number (channels)), multiplies each slice by one of (memory_range + 1) parameter groups of size 1 × 128, and sums the products. In the embodiment of the application, the memory module can instead equivalently replace this for-loop operator with a highly parallel depthwise separable convolution operator to perform the multiply-add computation and extract the intermediate features of the audio data to be recognized for subsequent model inference.
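The equivalence this replacement relies on can be checked numerically. The sketch below (with assumed small dimensions; the patent's 1 × 128 parameter groups are simplified to one tap set per channel) computes the memory module's output once with the for-loop slice/multiply/add formulation and once as a per-channel depthwise convolution along the time axis:

```python
import numpy as np

memory_range, frames, channels = 3, 5, 4
x = np.random.randn(memory_range + frames, channels)  # history + new frames
w = np.random.randn(memory_range + 1, channels)       # one tap set per channel

# For-loop formulation: (memory_range + 1) slices, multiplied and accumulated.
out_loop = np.zeros((frames, channels))
for k in range(memory_range + 1):
    out_loop += w[k] * x[k : k + frames]              # slice * weight, add

# Depthwise formulation: each channel correlated independently with its own
# length-(memory_range + 1) kernel -- one parallel operator on an accelerator.
out_dw = np.stack(
    [np.correlate(x[:, c], w[:, c], mode="valid") for c in range(channels)],
    axis=1,
)
print(np.allclose(out_loop, out_dw))  # True
```

The Python list over channels here is only for illustration; a real DepthwiseConv2D kernel processes all channels in one highly parallel operator, which is what removes the operator count and latency of the loop.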
In step S130, the terminal-side device performs inverse quantization on the audio recognition result data output by the audio recognition model to obtain an audio recognition result.
It should be noted that the dequantization operation is the inverse of the quantization operation. The terminal-side device dequantizes the fixed-point audio recognition result data output by the audio recognition model into floating-point data, keeping it consistent with the data type of the original audio data to be recognized.
It should be noted that the audio data to be recognized may be audio feature data, and the audio recognition result data may be audio recognition probability data inferred by the model using any existing audio recognition algorithm. For example, the model may match the audio feature data to be recognized against audio data samples and obtain the audio recognition probability data corresponding to the matched sample; after dequantization, the terminal-side device may derive the audio recognition result from that probability data using an existing audio recognition algorithm.
A typical deep learning model performs the quantization/dequantization conversion inside the model via a quantization tool, but for an audio recognition model that requires temporal information this introduces redundant computation and slows inference. Moreover, the audio recognition model contains a memory module with a for loop and a large number of arithmetic operators, which further slow model inference and waste resources. According to the audio recognition method provided by the embodiment of the present application, placing the quantization of the audio data to be recognized and the dequantization of the audio recognition result data outside the audio recognition model eliminates the redundant computation these operations would otherwise cause inside the model and increases its inference speed. The model uses the fixed-point historical audio data directly during recognition, with no quantization or dequantization, and can directly reuse the updated fixed-point historical audio data produced by the memory module's depthwise separable convolution for the next recognition, effectively improving both the inference efficiency and the accuracy of the audio recognition model.
Further, according to the audio recognition method provided by the embodiment of the present application, obtaining the updated fixed-point historical audio data by convolving the quantized audio data to be recognized and the fixed-point historical audio data with a depthwise separable convolution, via the memory module of the audio recognition model, may specifically be:
replacing a for-loop operator with a depthwise separable convolution operator in the memory module of the audio recognition model, and convolving the quantized audio data to be recognized with the fixed-point historical audio data to obtain the updated fixed-point historical audio data.
It should be noted that the for-loop operator may comprise any one, or any combination, of the following: multidimensional slicing operators, multiplication operators, and addition operators.
Referring to fig. 3, fig. 3 is a schematic diagram of equivalently replacing the for-loop operator of the memory module in the audio recognition model with a depthwise separable convolution for the multiply-add computation. Specifically, in the prior art the memory module uses a for loop to slice its input into (memory_range + 1) groups of data of size (1 × input frame number (frames) × input channel number (channels)), multiplies each slice by one of (memory_range + 1) parameter groups of size 1 × 128, and sums the products. In the audio recognition method provided by the embodiment of the application, equivalently replacing this for-loop operator with a highly parallel depthwise separable convolution operator for the multiply-add computation eliminates a large number of operators and for loops, reduces matrix dimensionality and memory usage, lowers latency, and improves the inference efficiency of the audio recognition model.
Further, in the audio recognition method provided by an embodiment of the present application, the input shape (input_shape) of the memory module's depthwise separable convolution is 1 × 1 × (memory_range + input frame number) × input channel number, its output shape (output_shape) is 1 × output frame number × output channel number, its convolution kernel size (kernel_size) is 1 × (memory_range + 1), and the number of channels of the convolution kernel is the same as the number of input channels.
Specifically, the large number of StridedSlice operators (multidimensional slicing operators), Mul operators (multiplication operators) and Add operators (addition operators) in the for loop of the memory module may be replaced with a depthwise separable convolution (e.g., DepthwiseConv2D). The input_shape of DepthwiseConv2D is 1 × 1 × (memory range + input frame number) × input channel number; one dimension is expanded relative to the input of the original memory module, and the data is reshaped (reshape, data format adjustment) into the NHWC data format. The output_shape is 1 × output frame number × output channel number, and the extra dimension is removed again to ensure the output is the same as that of the original module. The kernel size of DepthwiseConv2D is set to 1 × (memory range + 1), and the number of channels of the convolution kernel is the same as the number of input channels.
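As a sanity check on this equivalence, the depthwise convolution (one 1 × (memory_range + 1) filter per channel, with no cross-channel mixing) can be compared against the StridedSlice/Mul/Add loop. The following NumPy sketch uses made-up shapes and random weights purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
memory_range, frames, channels = 4, 8, 16

# NHWC input with the expanded dimension: N=1, H=1, W=time, C=channels.
x = rng.standard_normal((1, 1, memory_range + frames, channels))
# One 1 x (memory_range + 1) kernel per channel.
w = rng.standard_normal((memory_range + 1, channels))

# Depthwise convolution along the time axis, channel by channel.
# np.convolve flips its kernel, so reverse w to get cross-correlation.
dw = np.stack(
    [np.convolve(x[0, 0, :, c], w[::-1, c], mode="valid") for c in range(channels)],
    axis=-1,
)[None, ...]  # back to shape (1, frames, channels)

# Reference: the original for loop of slices, multiplies and adds.
ref = sum(x[0, 0, k:k + frames, :] * w[k] for k in range(memory_range + 1))

assert np.allclose(dw, ref[None, ...])
```

The two computations produce identical results; the difference is that a framework's DepthwiseConv2D executes all windows and channels in parallel instead of materialising (memory_range + 1) intermediate slices.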
When the memory module is applied to the audio recognition model, the terminal-side device initializes the initial historical audio data before the audio recognition model performs audio recognition for the first time, so that the initial historical audio data becomes the fixed-point historical audio data corresponding to the floating-point value 0. Thereafter, in each audio recognition inference, the audio recognition model can directly use the updated fixed-point historical audio data obtained in the previous audio recognition, without any quantization or inverse quantization operation on the fixed-point historical audio data. More specifically, the terminal-side device or the audio recognition model may first splice the fixed-point historical audio data (1 × memory range × input channel number) and the quantized audio data to be recognized (1 × input frame number × input channel number) to obtain spliced data (1 × (memory range + input frame number) × input channel number) as the input of the memory module; after the operation of the memory module, the intermediate features of the audio data to be recognized are obtained for the subsequent audio recognition.
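The splicing and the fixed-point state carry might look like the following sketch. The int8 format, the zero point, and the rule that the carried history is simply the last memory_range frames of the spliced input are illustrative assumptions, not details stated by the patent:

```python
import numpy as np

memory_range, frames, channels = 4, 8, 16
zero_point = np.int8(0)  # assumed: the fixed-point value that dequantizes to 0.0

# Before the first inference, initialize the history so it corresponds
# to floating-point 0 -- no quantize/dequantize is needed afterwards.
history = np.full((1, memory_range, channels), zero_point, dtype=np.int8)

def splice_input(quantized_frames, history):
    """Concatenate fixed-point history with the new quantized frames.

    quantized_frames: (1, frames, channels), already int8.
    Returns the memory-module input of shape
    (1, memory_range + frames, channels) and the updated history
    carried, still in fixed point, to the next inference.
    """
    spliced = np.concatenate([history, quantized_frames], axis=1)
    return spliced, spliced[:, -memory_range:, :]

q = np.ones((1, frames, channels), dtype=np.int8)
spliced, history = splice_input(q, history)
```

Because the history never leaves the fixed-point domain, each inference saves one quantize and one dequantize pass over the state tensor.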
According to the audio recognition method provided by the embodiment of the present application, the depthwise separable convolution parameters are set according to the memory module of the audio recognition model, and the for-loop operator is equivalently replaced by the depthwise separable convolution operator. This optimizes latency without reducing the amount of computation, reduces memory occupation, effectively guarantees the smooth operation of the audio recognition model, and improves the audio recognition efficiency.
Further, according to an audio recognition method provided by the embodiment of the present application, replacing, by the memory module of the audio recognition model, the for-loop operator with a depthwise separable convolution operator and performing convolution on the quantized audio data to be recognized and the fixed-point historical audio data to obtain the updated fixed-point historical audio data may include:
and carrying out unified data format conversion on the input data and the output data of the depth separable convolution of the memory module.
Performing unified data format conversion on the input data and the output data of the depthwise separable convolution of the memory module helps to improve the data processing efficiency of the memory module and to reduce the data processing error rate caused by inconsistent data formats.
It should be noted that, specifically, the data formats of the input data and the output data of the depthwise separable convolution of the memory module may be uniformly converted into the NHWC data format.
The NHWC data format is commonly used in the field of deep learning; uniformly converting the data formats of the input data and the output data of the memory module's depthwise separable convolution into the NHWC data format can effectively improve the universality of the memory module.
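The dimension expansion into NHWC before the convolution, and the reduction back afterwards, can be sketched as follows (the concrete shapes are illustrative assumptions):

```python
import numpy as np

memory_range, frames, channels = 4, 8, 16

x = np.zeros((1, memory_range + frames, channels))   # original module input
# Expand one dimension so the tensor is NHWC: N=1, H=1, W=time, C=channels,
# which is the layout DepthwiseConv2D expects.
x_nhwc = x[:, None, :, :]

y_nhwc = np.zeros((1, 1, frames, channels))          # DepthwiseConv2D output
# Remove the extra dimension so the module's output shape matches
# the original 1 x frames x channels.
y = y_nhwc[:, 0, :, :]
```

Both operations are pure reshapes; no data is copied or recomputed, so the format conversion adds essentially no cost.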
The following describes the audio recognition apparatus provided in the embodiments of the present application, and the audio recognition apparatus described below and the audio recognition method described above may be referred to correspondingly.
Fig. 4 is a schematic structural diagram of an audio recognition apparatus according to an embodiment of the present application.
Referring to fig. 4, an audio identification apparatus provided in an embodiment of the present application may include:
a quantization module 210 to: quantizing the audio data to be identified;
an audio recognition module 220 to: inputting quantized audio data to be identified and fixed point historical audio data into an audio identification model, and obtaining audio identification result data output by the audio identification model and updated fixed point historical audio data;
an inverse quantization module 230 to: carrying out inverse quantization on the audio recognition result data output by the audio recognition model to obtain an audio recognition result;
the audio recognition model is trained based on an audio data sample and an audio recognition result corresponding to the audio data sample; the updated fixed-point historical audio data is obtained by the memory module of the audio recognition model convolving, with a depthwise separable convolution, the quantized audio data to be recognized and the fixed-point historical audio data; and the updated fixed-point historical audio data is used for the next audio recognition of the audio recognition model.
It should be noted that the updated fixed point historical audio data is obtained by convolving quantized audio data to be recognized and fixed point historical audio data by using a depth separable convolution through a memory module of the audio recognition model, and specifically includes:
and replacing, by the memory module of the audio recognition model, the for-loop operator with a depthwise separable convolution operator, and convolving the quantized audio data to be recognized with the fixed-point historical audio data to obtain the updated fixed-point historical audio data.
It should be noted that, the for loop operator includes any one of the following items or any combination thereof: multidimensional slicing operators, multiplications and additions.
It should be noted that, the obtaining, by the memory module of the audio recognition model, the updated fixed-point historical audio data by performing convolution according to the quantized audio data to be recognized and the fixed-point historical audio data by using a depth separable convolution operator instead of a for-loop operator includes:
and carrying out unified data format conversion on the input data and the output data of the depth separable convolution of the memory module.
It should be noted that, the unified data format conversion is performed on the input data and the output data of the depth separable convolution of the memory module, specifically:
and uniformly converting the data formats of the input data and the output data of the depth separable convolution of the memory module into NHWC data formats.
It should be noted that the training of the audio recognition model based on the audio data samples and the audio recognition results corresponding to the audio data samples includes:
acquiring an audio data sample and an audio identification result corresponding to the audio data sample;
training an initial audio recognition model according to the audio data sample and an audio recognition result corresponding to the audio data sample to obtain the audio recognition model.
Fig. 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 5, the electronic device may include: a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may call logic instructions in the memory 830 to perform the following method:
quantizing the audio data to be identified;
inputting quantized audio data to be identified and fixed point historical audio data into an audio identification model, and obtaining audio identification result data output by the audio identification model and updated fixed point historical audio data;
carrying out inverse quantization on the audio recognition result data output by the audio recognition model to obtain an audio recognition result;
the audio recognition model is trained based on an audio data sample and an audio recognition result corresponding to the audio data sample; the updated fixed-point historical audio data is obtained by the memory module of the audio recognition model convolving, with a depthwise separable convolution, the quantized audio data to be recognized and the fixed-point historical audio data; and the updated fixed-point historical audio data is used for the next audio recognition of the audio recognition model.
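Taken together, the method the processor executes amounts to a quantize, fixed-point inference, dequantize loop. The following is a minimal Python sketch; the affine int8 parameters and the `model` callable are stand-ins for the actual quantization scheme and trained model, which the patent does not specify:

```python
import numpy as np

scale, zero_point = 0.05, 0  # assumed affine int8 quantization parameters

def quantize(x):
    """Float -> int8 via an affine mapping q = round(x / scale) + zero_point."""
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

def dequantize(q):
    """int8 -> float, the inverse of quantize up to rounding error."""
    return (q.astype(np.float32) - zero_point) * scale

def recognize(audio, history, model):
    """One audio recognition: quantize the input, run the fixed-point
    model, dequantize its result, and carry the fixed-point history
    forward without any extra quantize/dequantize round trip."""
    result_q, new_history = model(quantize(audio), history)
    return dequantize(result_q), new_history
```

Here `model` abstracts the quantized audio recognition model, including its memory module; only the float input and the float result cross the fixed-point boundary.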
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as independent products. Based on such understanding, the technical solution of the present application, or the portion thereof that substantially contributes over the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Further, the present application discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above method embodiments, for example, comprising:
quantizing the audio data to be identified;
inputting quantized audio data to be identified and fixed point historical audio data into an audio identification model, and obtaining audio identification result data output by the audio identification model and updated fixed point historical audio data;
carrying out inverse quantization on the audio recognition result data output by the audio recognition model to obtain an audio recognition result;
the audio recognition model is trained based on an audio data sample and an audio recognition result corresponding to the audio data sample; the updated fixed-point historical audio data is obtained by the memory module of the audio recognition model convolving, with a depthwise separable convolution, the quantized audio data to be recognized and the fixed-point historical audio data; and the updated fixed-point historical audio data is used for the next audio recognition of the audio recognition model.
In another aspect, an embodiment of the present application further provides a non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the audio recognition method provided in the foregoing embodiments, for example, comprising:
quantizing the audio data to be identified;
inputting quantized audio data to be identified and fixed point historical audio data into an audio identification model, and obtaining audio identification result data output by the audio identification model and updated fixed point historical audio data;
carrying out inverse quantization on the audio recognition result data output by the audio recognition model to obtain an audio recognition result;
the audio recognition model is trained based on an audio data sample and an audio recognition result corresponding to the audio data sample; the updated fixed-point historical audio data is obtained by the memory module of the audio recognition model convolving, with a depthwise separable convolution, the quantized audio data to be recognized and the fixed-point historical audio data; and the updated fixed-point historical audio data is used for the next audio recognition of the audio recognition model.
The above-described embodiments of the apparatus are merely illustrative; units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement the solution without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
The above embodiments are merely illustrative of the present application and are not intended to limit the present application. Although the present application has been described in detail with reference to the embodiments, it should be understood by those skilled in the art that various combinations, modifications or equivalents may be made to the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application, and the technical solutions of the present application should be covered by the claims of the present application.

Claims (10)

1. An audio recognition method, comprising:
quantizing the audio data to be identified;
inputting quantized audio data to be identified and fixed point historical audio data into an audio identification model, and obtaining audio identification result data output by the audio identification model and updated fixed point historical audio data;
carrying out inverse quantization on the audio recognition result data output by the audio recognition model to obtain an audio recognition result;
the audio recognition model is obtained through training based on an audio data sample and an audio recognition result corresponding to the audio data sample; the updated fixed-point historical audio data is obtained by the memory module of the audio recognition model convolving, with a depthwise separable convolution, the quantized audio data to be recognized and the fixed-point historical audio data; and the updated fixed-point historical audio data is used for the next audio recognition of the audio recognition model.
2. The audio identification method according to claim 1, wherein the updated fixed-point historical audio data is obtained by convolving quantized audio data to be identified and fixed-point historical audio data with a deep separable convolution through a memory module of the audio identification model, specifically:
and replacing, by the memory module of the audio recognition model, the for-loop operator with a depthwise separable convolution operator, and performing convolution on the quantized audio data to be recognized and the fixed-point historical audio data to obtain the updated fixed-point historical audio data.
3. The audio recognition method of claim 2, wherein the for-loop operator comprises any one of the following, or any combination thereof: multidimensional slicing operators, multiplication operators and addition operators.
4. The audio identification method according to claim 2, wherein the obtaining, by the memory module of the audio identification model, the updated fixed-point historical audio data by performing convolution on the quantized audio data to be identified and the fixed-point historical audio data using a depth-separable convolution operator instead of a for-loop operator comprises:
and carrying out unified data format conversion on the input data and the output data of the depth separable convolution of the memory module.
5. The audio recognition method of claim 4, wherein the unified data format conversion is performed on the input data and the output data of the deep separable convolution of the memory module, specifically:
and uniformly converting the data formats of the input data and the output data of the depth separable convolution of the memory module into NHWC data formats.
6. The audio recognition method of any one of claims 1 to 5, wherein the training of the audio recognition model based on the audio data samples and the audio recognition results corresponding to the audio data samples comprises:
acquiring an audio data sample and an audio identification result corresponding to the audio data sample;
and training an initial audio recognition model according to the audio data sample and an audio recognition result corresponding to the audio data sample to obtain the audio recognition model.
7. An audio recognition apparatus, comprising:
a quantization module to: quantizing the audio data to be identified;
an audio recognition module to: inputting quantized audio data to be identified and fixed point historical audio data into an audio identification model, and obtaining audio identification result data output by the audio identification model and updated fixed point historical audio data;
an inverse quantization module to: carrying out inverse quantization on the audio recognition result data output by the audio recognition model to obtain an audio recognition result;
the audio recognition model is obtained based on an audio data sample and an audio recognition result corresponding to the audio data sample through training, the updated fixed-point historical audio data is obtained by utilizing depth separable convolution to convolve quantized audio data to be recognized and the fixed-point historical audio data through a memory module of the audio recognition model, and the updated fixed-point historical audio data is used for next audio recognition of the audio recognition model.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the audio recognition method of any of claims 1 to 6 when executing the program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when being executed by a processor, implementing the audio recognition method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the audio recognition method of any of claims 1 to 6 when executed by a processor.
CN202210415556.1A 2022-04-18 2022-04-18 Audio recognition method, device, electronic apparatus, medium, and program product Pending CN114842838A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210415556.1A CN114842838A (en) 2022-04-18 2022-04-18 Audio recognition method, device, electronic apparatus, medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210415556.1A CN114842838A (en) 2022-04-18 2022-04-18 Audio recognition method, device, electronic apparatus, medium, and program product

Publications (1)

Publication Number Publication Date
CN114842838A 2022-08-02

Family

ID=82565559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210415556.1A Pending CN114842838A (en) 2022-04-18 2022-04-18 Audio recognition method, device, electronic apparatus, medium, and program product

Country Status (1)

Country Link
CN (1) CN114842838A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665656A (en) * 2023-07-24 2023-08-29 美智纵横科技有限责任公司 Speech recognition model generation method, speech recognition method, device and chip
CN116665656B (en) * 2023-07-24 2023-10-10 美智纵横科技有限责任公司 Speech recognition model generation method, speech recognition method, device and chip

Similar Documents

Publication Publication Date Title
US11562201B2 (en) Neural network layer processing with normalization and transformation of data
JP2019071080A (en) Batch normalization layer
CN110929865B (en) Network quantification method, service processing method and related product
CN114207625A (en) System-aware selective quantization for performance-optimized distributed deep learning
CN109165736B (en) Information processing method and device applied to convolutional neural network
EP3931763A1 (en) Deriving a concordant software neural network layer from a quantized firmware neural network layer
CN116976306A (en) Multi-model collaboration method based on large-scale language model
CN110781686A (en) Statement similarity calculation method and device and computer equipment
CN109616093A (en) End-to-end phoneme synthesizing method, device, equipment and storage medium
US11651198B2 (en) Data processing method and apparatus for neural network
CN110795235B (en) Method and system for deep learning and cooperation of mobile web
US11544521B2 (en) Neural network layer processing with scaled quantization
CN111326168A (en) Voice separation method and device, electronic equipment and storage medium
CN114842838A (en) Audio recognition method, device, electronic apparatus, medium, and program product
CN113348472A (en) Convolutional neural network with soft kernel selection
CN116629375A (en) Model processing method and system
CN116312502A (en) End-to-end stream type voice recognition method and device based on sequential sampling blocking mechanism
CN114358280A (en) Data processing method and device, electronic equipment and computer readable storage medium
TWI819005B (en) Computing unit, method, computer program, machine-readable storage element and product for performing multiplication operations
CN111614358A (en) Method, system, device and storage medium for feature extraction based on sub-channel quantization
CN112598020A (en) Target identification method and system
CN112541438A (en) Text recognition method and device
CN113158774B (en) Hand segmentation method, device, storage medium and equipment
CN115658307B (en) Intelligent load processing method and system based on compressed data direct calculation
CN117808083B (en) Distributed training communication method, device, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination