CN115148220A - Audio detection system and audio detection method - Google Patents

Audio detection system and audio detection method

Info

Publication number
CN115148220A
CN115148220A
Authority
CN
China
Prior art keywords
audio
audio data
audio detection
original
format conversion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110352178.2A
Other languages
Chinese (zh)
Inventor
刘锴
宋宁
徐庆嵩
杜金凤
詹宁斯·格兰特
Current Assignee
Gowin Semiconductor Corp
Original Assignee
Gowin Semiconductor Corp
Priority date
Filing date
Publication date
Application filed by Gowin Semiconductor Corp filed Critical Gowin Semiconductor Corp
Priority to CN202110352178.2A
Publication of CN115148220A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16: Vocoder architecture
    • G10L19/173: Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding

Abstract

The embodiments of the present application disclose an audio detection system and an audio detection method. The audio detection system comprises a micro control unit (MCU), a programmable logic device, and a shared memory. The shared memory is configured to store original audio data. The MCU is configured to acquire the original audio data from the shared memory and perform format conversion on it according to a preset conversion rule. The programmable logic device is configured to collect the original audio data and store it in the shared memory, and to detect the format-converted audio data according to a pre-trained AI audio detection model to determine an audio detection result. The embodiments of the present application have the advantages of low power consumption, low latency, low cost, and easy expansion, and are suitable for use in edge devices.

Description

Audio detection system and audio detection method
Technical Field
The embodiment of the application relates to the field of artificial intelligence, in particular to an audio detection system and an audio detection method.
Background
With the development and wide application of AI (Artificial Intelligence) technology, AI computation in different scenarios poses more and more challenges. AI computing applications have gradually expanded from the cloud to edge devices.
At present, there are three general approaches to audio detection:
The first is to analyze and process audio sample data using complex audio processing algorithms to calculate the content of the audio data.
The second is to infer the content of audio data by means of powerful hardware AI computation capability, based on dedicated hardware such as an AI server or an AI processor.
The third is to infer and predict the content of the audio data with an embedded AI algorithm on a high-end edge device chip.
The first two approaches are not suitable for edge devices, and the third usually requires an expensive high-end chip, whose cost is unsuitable for edge devices that are meant to be small and cheap.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein and is not intended to limit the scope of the appended claims.
An embodiment of the present disclosure provides an audio detection system, comprising:
a micro control unit (MCU), a programmable logic device, and a shared memory;
the shared memory is configured to store original audio data;
the micro control unit MCU is configured to acquire the original audio data from the shared memory and perform format conversion on the original audio data according to a preset conversion rule;
the programmable logic device is configured to collect the original audio data and store the original audio data in the shared memory, and to detect the format-converted audio data according to a pre-trained AI audio detection model to determine an audio detection result.
The embodiment of the present disclosure further provides an audio detection method applied to the above audio detection system, comprising:
the programmable logic device collects original audio data and stores the data in a shared memory;
the micro control unit MCU acquires the original audio data from the shared memory and performs format conversion on the original audio data according to a preset conversion rule;
and the programmable logic device detects the format-converted audio data according to a pre-trained AI audio detection model and determines an audio detection result.
In the embodiments of the present application, the MCU and the programmable logic device cooperate to jointly perform voice detection with an AI model, so that the respective advantages of the two are fully utilized. Detection of the collected audio data can be achieved with only a small amount of logic resources and limited data computing capacity, giving the system the advantages of low power consumption, low latency, low cost, high performance, and easy expansion, which makes it suitable for use in edge devices.
Other aspects will be apparent upon reading and understanding the attached drawings and detailed description.
Drawings
The accompanying drawings are included to provide an understanding of the present disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the examples serve to explain the principles of the disclosure and not to limit the disclosure.
FIG. 1 is a schematic diagram of an audio detection system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of another audio detection system in an embodiment of the present application;
FIG. 3 is a schematic diagram of another audio detection system in an embodiment of the present application;
FIG. 4 is a flowchart illustrating an audio detection method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an exemplary overall audio detection system;
FIG. 6 is a schematic diagram of the structure of an exemplary SoC for audio detection;
FIG. 7 is a schematic flow chart of audio format conversion in an example;
fig. 8 is a schematic flow chart of AI audio detection model inference in an example.
Detailed Description
The present application describes embodiments, but the description is illustrative rather than limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described herein. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with, or instead of, any other feature or element in any other embodiment, unless expressly limited otherwise.
The present application includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements disclosed in the present application may also be combined with any conventional features or elements to form a unique inventive concept as defined by the appended claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the appended claims. Thus, it should be understood that any of the features shown and/or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not to be restricted except in light of the attached claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.
Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Accordingly, the particular order of the steps set forth in the specification should not be construed as limitations on the claims appended hereto. Further, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present application.
An embodiment of the present application provides an audio detection system, as shown in fig. 1, including: a micro control unit MCU11, a programmable logic device 12, and a shared memory 13;
the MCU11 is configured to acquire original audio data from the shared memory 13 and perform format conversion on the original audio data according to a preset conversion rule;
the programmable logic device 12 is configured to collect the original audio data and store it in the shared memory 13, and is further configured to detect the format-converted audio data according to a pre-trained AI audio detection model and determine an audio detection result;
the shared memory 13 is configured to store the raw audio data collected by the programmable logic device 12.
In some exemplary embodiments, as shown in fig. 2, the programmable logic device 12 includes: an audio acquisition module 1201 and an AI audio detection model inference module 1202;
the audio capture module 1201 is configured to capture the raw audio data input from a microphone device. The audio acquisition module 1201 receives a sound signal from the microphone device through an input port thereof, acquires original audio data, and stores the original audio data in the shared memory 13;
the AI audio detection model inference module 1202 is configured to sequentially execute the operational modes corresponding to the pre-trained AI audio detection models according to the format-converted audio data to determine the audio detection result.
In some exemplary embodiments, the MCU11 includes an audio format conversion module 1101, the audio format conversion module 1101 is disposed in a core of the MCU;
the audio format conversion module 1101 is configured to convert the original audio data into a spectrogram according to a preset conversion rule, and the spectrogram is used as the audio data after the format conversion.
In some exemplary embodiments, the raw audio data collected by the audio acquisition module 1201 cannot be directly input to the AI audio detection model inference module 1202; it first needs to be converted into a spectrogram format and then input to the AI audio detection model inference module 1202. This format conversion step is completed by the audio format conversion module 1101 in the MCU, which computes the conversion using the data processing capability of the MCU core.
In some exemplary embodiments, the audio format conversion module 1101 converts the raw audio data into a spectrogram according to a preset conversion rule as follows:
corresponding original audio data segments are sequentially acquired from the original audio data according to a first preset duration, and each segment is processed by the following steps:
performing a fast Fourier transform calculation on the original audio data segment to determine a first preset number of transformed audio data;
and calculating average values of the transformed audio data to determine a second preset number of averaged audio data, which are stored into the spectrogram at the position corresponding to the original audio data segment.
In some exemplary embodiments, the first predetermined duration is 30 milliseconds, the first predetermined number is 256, and the second predetermined number is 43. Other values may be preset by those skilled in the art according to application needs, and are not limited to the examples of the embodiments of the present disclosure.
In some exemplary embodiments, the audio format conversion module 1101 converts the raw audio data into the spectrogram according to a preset conversion rule as follows:
the read original audio data is divided into segments of a preset fixed duration; each segment undergoes a fast Fourier transform followed by average-value calculation, and the output audio data is stored into the spectrogram; all original audio data segments are processed in this manner until none remain. For example, if each segment spans 30 milliseconds, the fast Fourier transform calculation outputs 256 audio data, and the subsequent average-value calculation outputs 43 audio data, which are stored in the spectrogram.
The target format produced by the audio data conversion is determined by the data input requirement of the selected AI audio detection model. Different AI audio detection models are used for inference prediction after training, and if the input audio data formats they require differ, the output of the audio format conversion module 1101 is adjusted correspondingly; it is not limited to the spectrogram illustrated in the embodiments of the present disclosure.
In some exemplary embodiments, the programmable logic device 12 is a field programmable gate array (FPGA); owing to the programmable nature of the FPGA, the system is easily extensible. The audio acquisition module 1201 and the AI audio detection model inference module 1202 are both disposed in the core of the FPGA.
In some exemplary embodiments, the shared memory 13 is connected to the MCU core through a system bus, and the MCU core may read raw audio data from the shared memory 13 in real time, load it into the data memory of the MCU core, and input it to the audio format conversion module 1101 to perform audio format conversion. The shared memory 13 is shared by the MCU core and the FPGA core, and both cores can directly access it and read and write data in real time.
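As a software analogy of this dual-core access pattern, Python's `multiprocessing.shared_memory` can stand in for the on-chip shared memory. This is purely an illustration of the shared-buffer idea, not the SoC's actual bus mechanism:

```python
import numpy as np
from multiprocessing import shared_memory

# Software analogy (an assumption for illustration): one shared region
# that a producer ("FPGA core") writes and a consumer ("MCU core")
# reads, each through its own handle.
shm = shared_memory.SharedMemory(create=True, size=16000 * 2)

# "FPGA core" side: write 16-bit raw audio samples into the buffer
fpga_view = np.ndarray((16000,), dtype=np.int16, buffer=shm.buf)
fpga_view[:] = (1000 * np.sin(np.linspace(0, 100, 16000))).astype(np.int16)

# "MCU core" side: attach to the same region by name and read it
mcu_side = shared_memory.SharedMemory(name=shm.name)
mcu_view = np.ndarray((16000,), dtype=np.int16, buffer=mcu_side.buf)
raw_audio = mcu_view.copy()   # load into the MCU side's own data memory

same = np.array_equal(raw_audio, fpga_view)
print("MCU read matches FPGA write:", same)   # True

# Release the shared region
mcu_side.close()
shm.close()
shm.unlink()
```

On the real SoC the two cores reach the same physical memory over a system bus and a parallel bus; here the named attachment plays that role.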
In some exemplary embodiments, the operations corresponding to the pre-trained AI audio detection model include: a depthwise convolution operation, a fully-connected operation, and a softmax operation. That is, the AI audio detection model inference module 1202 sequentially performs the depthwise convolution, fully-connected, and softmax operations on the format-converted audio data, carrying out computational inference based on the pre-trained AI audio detection model to predict the content of the audio data and thereby complete audio detection.
In some exemplary embodiments, the AI audio detection model is trained in a device or system other than the audio detection system. The pre-trained AI audio detection model need not come from the cloud; for example, it may be trained or downloaded by another device and then input to the audio detection system, or stored at a designated location for the audio detection system to read by itself. The AI audio detection model is an AI model that has been trained on a large amount of sample audio data in the cloud or in other external equipment (or systems) and can be accurately used for audio detection.
In some exemplary embodiments, the AI audio detection model may include: several layers of operators such as Reshape, DepthWiseConv2D, FullyConnected, and SoftMax, an audio input data layer, and a detection conclusion output data layer (outputting the inference and prediction result). In some exemplary embodiments, TensorFlow is used in the cloud to train the AI audio detection model on a large amount of sample audio data, so as to obtain a trained AI audio detection model. Here, the sample audio data is audio data annotated with audio features.
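A sketch of what the cloud-side TensorFlow training setup might look like, using the layer types named above. All shapes, the class count, and the training configuration are assumptions for illustration; the patent does not specify them:

```python
import tensorflow as tf

# Illustrative dimensions only: a 49-segment x 43-bin spectrogram and
# 4 output classes are assumptions, not values from the patent.
N_SEGMENTS, N_BINS, N_CLASSES = 49, 43, 4

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_SEGMENTS * N_BINS,)),
    # Reshape: flat spectrogram -> 2-D "image" with one channel
    tf.keras.layers.Reshape((N_SEGMENTS, N_BINS, 1)),
    # DepthWiseConv2D: one filter per input channel
    tf.keras.layers.DepthwiseConv2D(kernel_size=(8, 8), strides=2,
                                    activation="relu"),
    tf.keras.layers.Flatten(),
    # FullyConnected: map features to per-class scores
    tf.keras.layers.Dense(N_CLASSES),
    # SoftMax: scores -> probability distribution over classes
    tf.keras.layers.Softmax(),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(spectrograms, labels, ...) would perform the cloud-side
# training on sample audio data annotated with audio features.
pred = model(tf.zeros((1, N_SEGMENTS * N_BINS)))
print(pred.shape)   # (1, 4)
```

After training, the weights of these layers are what gets downloaded into the SoC for on-chip inference.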
In the audio detection scheme provided in the embodiment of the present disclosure, the AI audio detection model may also adopt other AI models in the related art, and is not limited to the models illustrated in the embodiment of the present disclosure. According to the description of the embodiment of the present disclosure, when different AI audio detection models are selected, the training mode, the sample audio data, and/or the operation mode to be executed in the AI audio detection model inference module of the programmable logic device 12 may be adjusted correspondingly.
In some exemplary embodiments, the programmable logic device 12 is further configured to update the pre-trained AI audio detection model. As the detection function or performance is improved and upgraded, the AI audio detection model can be continuously learned/trained with more or newer sample audio data in an external device or cloud system, and further optimized to improve detection accuracy. The retrained AI audio detection model can then be updated into the programmable logic device 12 to implement a function/performance upgrade of the audio detection system.
In some exemplary embodiments, as shown in fig. 3, the MCU11 further includes a detection result obtaining module 1102 configured to obtain the audio detection result determined by the programmable logic device 12. The audio detection result may be stored; or, provided to an application in the system-on-chip; or, output to an external system.
In some exemplary embodiments, the detection result obtaining module 1102 obtains the result from the AI audio detection model inference module 1202; alternatively, the AI audio detection model inference module 1202 may store the result in the shared memory 13, from which the detection result obtaining module 1102 then obtains it.
In some exemplary embodiments, the detection result obtaining module 1102 is further configured to output the audio detection result to an external system, or provide an interface for the external system to obtain.
In some exemplary embodiments, the shared memory 13 is connected to the core of the MCU through a bus system.
In some exemplary embodiments, the MCU is a Cortex-M series processor.
In some exemplary embodiments, the programmable logic device 12 is a low-end or mid-range FPGA.
It can be seen that the audio detection system provided by the embodiments of the disclosure can be implemented on a lightweight System on Chip (SoC) built from a low-end or mid-range FPGA and a Cortex-M series processor, with only a small amount of logic resources and limited data calculation capability. It has the advantages of low power consumption, low latency, low cost, and easy expansion, and is suitable for use in edge mobile devices.
In some exemplary embodiments, the low-end and mid-range FPGAs are low-power, low-cost FPGA products that contain a small set of necessary logic resources. In some exemplary embodiments, the low-end FPGA may be a Gowin Semiconductor GW1NSR-4C series FPGA product.
An embodiment of the present application further provides an audio detection method, which is applied to the audio detection system according to any of the above embodiments, where the method is shown in fig. 4, and includes:
step 401, a programmable logic device collects original audio data and stores the data in a shared memory;
step 402, the MCU acquires the original audio data from the shared memory and performs format conversion on the original audio data according to a preset conversion rule;
step 403, the programmable logic device detects the format-converted audio data according to the pre-trained AI audio detection model and determines an audio detection result.
In some exemplary embodiments, step 403 comprises:
and sequentially executing operation modes corresponding to the pre-trained AI audio detection model according to the audio data after format conversion so as to determine the audio detection result.
In some exemplary embodiments, the operations corresponding to the pre-trained AI audio detection model include: a depthwise convolution operation, a fully-connected operation, and a softmax operation.
In some exemplary embodiments, step 402 comprises: converting the original audio data into a spectrogram according to a preset conversion rule, which serves as the format-converted audio data.
In some exemplary embodiments, the converting step correspondingly includes:
sequentially acquiring corresponding original audio data segments from the original audio data according to a first preset duration, and processing each segment by the following steps:
performing a fast Fourier transform calculation on the original audio data segment to determine a first preset number of transformed audio data;
and calculating average values of the transformed audio data to determine a second preset number of averaged audio data, which are stored into the spectrogram at the position corresponding to the original audio data segment.
In some exemplary embodiments, the pre-trained AI audio detection model is trained in a device or system other than the audio detection system.
In some exemplary embodiments, the method further comprises: updating the pre-trained AI audio detection model.
In some exemplary embodiments, other method implementation details can be found in the previous embodiments.
The above embodiments disclosed herein are illustrated below by way of an example.
This example is a voice detection system implemented on a SoC combining a lightweight MCU with a low-end, low-power FPGA, which can infer and predict the content of audio data.
In this example, the overall process of voice detection is shown in fig. 5: an AI audio detection model is trained in the cloud on a large amount of sample audio data to obtain a model for audio detection (i.e., the trained AI audio detection model). The trained AI audio detection model is downloaded to the system on chip and used when audio detection is needed.
When audio detection is performed, the data flow path is as shown in fig. 5:
Audio signals are input through a microphone device; the audio acquisition module acquires the audio signals input from the microphone device to obtain original audio data; the audio format conversion module acquires the original audio data and converts it into the target format; the AI audio detection model inference module then performs inference prediction on the converted audio data according to the pre-trained AI audio detection model to determine a detection result, which is further output to other applications or external systems.
In this example, the structure of the audio detection system (SoC) is shown in fig. 6. The SoC includes an MCU core, an FPGA core, and a shared memory, and acquires audio from a microphone device through an audio acquisition module. The MCU core is connected with the shared memory through a system bus, and the FPGA core is connected with the shared memory through a parallel bus. The MCU core contains the audio format conversion module. The FPGA core contains the audio acquisition module and the AI audio detection model inference module.
The shared memory in the chip is shared by the MCU core and the FPGA core, and both cores can directly access it and read and write data in real time.
Three modules in this example are described separately below:
(1) Audio acquisition module
This module is used for collecting audio data; it is located in the FPGA core and implemented with FPGA logic resources.
When the audio system is started, the audio acquisition module acquires original audio data, which is input through the FPGA port and stored into the on-chip shared memory. Meanwhile, the shared memory is connected with the MCU core through a system bus, and the MCU core can read the original audio data from the shared memory in real time, load it into the data memory of the MCU core, and input it to the audio format conversion module to perform audio format conversion.
(2) Audio format conversion module
The audio data read from the shared memory by the MCU core is the original audio data collected by the audio acquisition module. It cannot be directly used as input to the AI audio detection model inference module; it must first undergo audio format conversion into a spectrogram, which is then input into the AI audio detection model inference module.
The audio format conversion module is located in the MCU core and computes the format conversion using the data processing capability of the MCU core. The original audio data read by the MCU core is taken as one segment per fixed duration; each segment is processed by fast Fourier transform followed by average-value calculation, and the output audio data is stored into the spectrogram. All the original audio data is processed in this manner until it is exhausted.
In some exemplary embodiments, as shown in fig. 7, every 30 milliseconds of audio is taken as one segment, and the fast Fourier transform calculation outputs 256 audio data; the subsequent average-value calculation then outputs 43 audio data, which are stored in the spectrogram. All converted audio data are sequentially stored in the spectrogram. The audio data in the spectrogram are input to the AI audio detection model inference module, which executes the inference and prediction of AI audio detection. The duration of each audio segment, the number of audio data obtained by the fast Fourier transform, and the number of audio data obtained by average-value calculation are all preset and may be adjusted according to application requirements; the disclosure is not limited to the illustrated example.
(3) AI audio detection model reasoning module
The spectrogram audio data output by the audio format conversion module serves as the input of the AI audio detection model inference module. The AI audio detection model inference module is located in the FPGA core; the convolution operations are implemented with FPGA logic resources, and the inference and prediction of the AI audio detection model are accelerated by the powerful hardware parallel processing capability of the FPGA.
The AI audio detection model inference module includes the depthwise convolution operation, the fully-connected operation, and the softmax operation, in one-to-one correspondence with the layers of the AI audio detection model; it is used to compute all operations in the AI audio detection model, and the calculation process is shown in fig. 8.
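The three operations can be written out explicitly. The NumPy sketch below uses made-up shapes and random weights purely to show the dataflow; on the SoC, real weights come from the cloud-trained model, and loops like these are what the FPGA logic parallelises:

```python
import numpy as np

def softmax(scores):
    """Softmax: exponentiate and normalise scores into probabilities."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def depthwise_conv2d(img, kernel):
    """Minimal single-channel, valid-padding depthwise convolution."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((49, 43))     # assumed input shape
kernel = rng.standard_normal((8, 8))            # assumed conv kernel
fc_weights = rng.standard_normal((42 * 36, 4))  # 4 hypothetical classes

features = depthwise_conv2d(spectrogram, kernel)   # depthwise conv
scores = features.reshape(-1) @ fc_weights         # fully-connected
probs = softmax(scores)                            # softmax
detected = int(np.argmax(probs))  # index of the predicted audio class
print(round(probs.sum(), 6))      # 1.0 (a probability distribution)
```

The argmax over the softmax output is the audio detection result that the MCU's detection result obtaining module would then read out.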
In the cloud, through machine learning, the AI audio detection model learns from a large amount of audio data and is trained into an AI model that can be accurately used for audio detection. The AI audio detection model inference module computes and reasons over the input spectrogram audio data based on the trained model and predicts the content of the audio data, thereby completing audio detection.
It can be seen that the lightweight AI audio detection system provided by this example uses a low-cost MCU + FPGA SoC chip with very few logic resources as its carrier. The system has the characteristics of low power consumption, low latency, low cost, and high performance, is suitable for edge mobile device applications, expands the application range of AI, and reduces the complexity of AI audio detection.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

Claims (10)

1. An audio detection system, comprising:
a micro control unit (MCU), a programmable logic device, and a shared memory;
wherein the shared memory is configured to store original audio data;
the micro control unit (MCU) is configured to acquire the original audio data from the shared memory and perform format conversion on the original audio data according to a preset conversion rule; and
the programmable logic device is configured to collect the original audio data and store it in the shared memory, and to detect the format-converted audio data according to a pre-trained AI audio detection model to determine an audio detection result.
2. The audio detection system of claim 1, wherein
the programmable logic device comprises an audio acquisition module and an AI audio detection model inference module;
the audio acquisition module is configured to acquire the original audio data input from a microphone device; and
the AI audio detection model inference module is configured to sequentially execute the operations of the pre-trained AI audio detection model on the format-converted audio data to determine the audio detection result.
3. The audio detection system of claim 1 or 2, wherein
the programmable logic device is a field programmable gate array (FPGA); and
the audio acquisition module and the AI audio detection model inference module are both implemented in the core of the FPGA.
4. The audio detection system of claim 2, wherein
the operations of the pre-trained AI audio detection model comprise: a depthwise convolution operation, a fully connected operation, and a softmax operation.
5. The audio detection system of claim 1 or 2, wherein
the MCU comprises an audio format conversion module;
the audio format conversion module is implemented in the core of the MCU; and
the audio format conversion module is configured to convert the original audio data into a spectrogram according to the preset conversion rule, the spectrogram serving as the format-converted audio data.
6. The audio detection system of claim 5, wherein
converting the original audio data into a spectrogram according to the preset conversion rule comprises:
sequentially extracting original audio data segments of a first preset duration from the original audio data, and processing each segment as follows:
performing a fast Fourier transform on the segment to determine a first preset number of transformed audio data; and
averaging the transformed audio data to determine a second preset number of averaged audio data, and storing the averaged audio data into the portion of the spectrogram corresponding to the segment.
7. The audio detection system of claim 1 or 2, wherein
the pre-trained AI audio detection model is trained on a device or system other than the audio detection system.
8. The audio detection system of claim 1 or 2, wherein
the programmable logic device is further configured to update the pre-trained AI audio detection model.
9. The audio detection system of claim 1 or 2, wherein
the shared memory is connected to the core of the MCU through a bus system.
10. An audio detection method applied to the audio detection system of any one of claims 1 to 9, the method comprising:
collecting, by the programmable logic device, original audio data and storing it in the shared memory;
acquiring, by the micro control unit (MCU), the original audio data from the shared memory and performing format conversion on it according to a preset conversion rule; and
detecting, by the programmable logic device, the format-converted audio data according to a pre-trained AI audio detection model to determine an audio detection result.
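The pipeline described by claims 4, 6, and 10 (segment the raw audio, FFT each segment, average the spectrum down to a small number of bins, then run a depthwise convolution, a fully connected layer, and a softmax) can be sketched in NumPy as follows. All concrete sizes here (256-sample segments, 16 frequency bins, length-3 depthwise kernels, 2 output classes) and the random weights are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def audio_to_spectrogram(audio, frame_len=256, n_bins=16):
    """Claim 6 sketch: slice raw audio into fixed-length segments, FFT each
    segment, then average groups of FFT magnitudes down to n_bins values
    that form one row of the spectrogram."""
    n_frames = len(audio) // frame_len
    spec = np.empty((n_frames, n_bins))
    for i in range(n_frames):
        segment = audio[i * frame_len:(i + 1) * frame_len]
        # one-sided magnitude spectrum (first preset number of values)
        mag = np.abs(np.fft.rfft(segment))[:frame_len // 2]
        # average adjacent frequency bins down to n_bins averaged values
        spec[i] = mag.reshape(n_bins, -1).mean(axis=1)
    return spec

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def detect(spec, dw_kernels, fc_weights, fc_bias):
    """Claim 4 sketch: depthwise convolution (one 1-D kernel per frequency
    bin, applied along time), then a fully connected layer, then softmax."""
    conv = np.stack([np.convolve(spec[:, c], dw_kernels[c], mode="valid")
                     for c in range(spec.shape[1])], axis=1)
    logits = conv.reshape(-1) @ fc_weights + fc_bias
    return softmax(logits)

# Toy end-to-end run with random audio and random weights (illustrative only).
rng = np.random.default_rng(0)
audio = rng.standard_normal(256 * 8)          # 8 segments of raw audio
spec = audio_to_spectrogram(audio)            # shape (8, 16)
dw_kernels = rng.standard_normal((16, 3))     # one length-3 kernel per bin
fc_w = rng.standard_normal((6 * 16, 2))       # "valid" conv leaves 6 steps
probs = detect(spec, dw_kernels, fc_w, np.zeros(2))
# probs is a length-2 vector of class probabilities summing to 1
```

In the patent's system the spectrogram step would run on the MCU core and the inference step in the FPGA fabric; the sketch only shows the arithmetic each side performs.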
CN202110352178.2A 2021-03-31 2021-03-31 Audio detection system and audio detection method Pending CN115148220A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110352178.2A CN115148220A (en) 2021-03-31 2021-03-31 Audio detection system and audio detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110352178.2A CN115148220A (en) 2021-03-31 2021-03-31 Audio detection system and audio detection method

Publications (1)

Publication Number Publication Date
CN115148220A true CN115148220A (en) 2022-10-04

Family

ID=83405195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110352178.2A Pending CN115148220A (en) 2021-03-31 2021-03-31 Audio detection system and audio detection method

Country Status (1)

Country Link
CN (1) CN115148220A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20150047803A (en) * 2013-10-25 2015-05-06 삼성전자주식회사 Artificial intelligence audio apparatus and operation method thereof
CN104616664A (en) * 2015-02-02 2015-05-13 合肥工业大学 Method for recognizing audio based on spectrogram significance test
CN106504768A (en) * 2016-10-21 2017-03-15 百度在线网络技术(北京)有限公司 Phone testing audio frequency classification method and device based on artificial intelligence
CN109658923A (en) * 2018-10-19 2019-04-19 平安科技(深圳)有限公司 Voice quality detecting method, equipment, storage medium and device based on artificial intelligence
CN109994127A (en) * 2019-04-16 2019-07-09 腾讯音乐娱乐科技(深圳)有限公司 Audio-frequency detection, device, electronic equipment and storage medium
CN110266894A (en) * 2019-06-18 2019-09-20 浙江百应科技有限公司 A kind of call method and system of automatic busy tone detecting
CN111105788A (en) * 2019-12-20 2020-05-05 北京三快在线科技有限公司 Sensitive word score detection method and device, electronic equipment and storage medium
US10645216B1 (en) * 2019-03-26 2020-05-05 Ribbon Communications Operating Company, Inc. Methods and apparatus for identification and optimization of artificial intelligence calls
US20200293875A1 (en) * 2019-03-12 2020-09-17 International Business Machines Corporation Generative Adversarial Network Based Audio Restoration

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Alejandro Morán et al.: "Hardware-Optimized Reservoir Computing System for Edge Intelligence Applications", Cognitive Computation, 28 February 2021 (2021-02-28), pages 1461-1469 *
XU Cheng et al.: Introduction to Embedded Systems (《嵌入式系统导论》), China Railway Publishing House, 31 January 2011, page 16 *
ZHU Lifang: "Research on Security Risks and Control of Artificial Intelligence Technology in Applications", Telecom Engineering Technics and Standardization, no. 12, 15 December 2019 (2019-12-15) *
LI Yong; FAN Xue; YANG Hongbo: "Application of Spectrograms in Mandarin Chinese Tone Recognition", Information & Communications, no. 07, 15 July 2017 (2017-07-15) *
BI Chunyan; CHEN Yingying: "Design of an Audio Signal Integrity Detection System in an Artificial Intelligence Environment", Modern Electronics Technique, no. 08, 15 April 2020 (2020-04-15) *

Similar Documents

Publication Publication Date Title
CN110347873B (en) Video classification method and device, electronic equipment and storage medium
CN111401516B (en) Searching method for neural network channel parameters and related equipment
US9020871B2 (en) Automated classification pipeline tuning under mobile device resource constraints
WO2022027937A1 (en) Neural network compression method, apparatus and device, and storage medium
CN106709588B (en) Prediction model construction method and device and real-time prediction method and device
CN111582323B (en) Transmission line channel detection method, device and medium
Meyer et al. Efficient convolutional neural network for audio event detection
WO2018228399A1 (en) Computing device and method
Gope et al. Ternary hybrid neural-tree networks for highly constrained iot applications
CN116304720B (en) Cost model training method and device, storage medium and electronic equipment
CN110797031A (en) Voice change detection method, system, mobile terminal and storage medium
CN114443891A (en) Encoder generation method, fingerprint extraction method, medium, and electronic device
CN116012681A (en) Method and system for diagnosing motor faults of pipeline robot based on sound vibration signal fusion
Sailesh et al. A novel framework for deployment of CNN models using post-training quantization on microcontroller
WO2022246986A1 (en) Data processing method, apparatus and device, and computer-readable storage medium
CN114358274A (en) Method and apparatus for training neural network for image recognition
CN110070891B (en) Song identification method and device and storage medium
CN117527495A (en) Modulation mode identification method and device for wireless communication signals
CN115148220A (en) Audio detection system and audio detection method
CN115758237A (en) Bearing fault classification method and system based on intelligent inspection robot
CN113033397A (en) Target tracking method, device, equipment, medium and program product
CN111354372B (en) Audio scene classification method and system based on front-end and back-end combined training
US20240105211A1 (en) Weakly-supervised sound event detection method and system based on adaptive hierarchical pooling
CN112348162B (en) Method and device for generating a recognition model
KR102626550B1 (en) Deep learning-based environmental sound classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination