CN115862643A

CN115862643A - Voiceprint processor and voiceprint verification execution method

Info

Publication number: CN115862643A
Application number: CN202211506470.6A
Authority: CN
Inventors: 陆芳; 卜智勇; 杨大全; 赵峰
Original assignee: White Box Shanghai Microelectronics Technology Co ltd
Current assignee: White Box Shanghai Microelectronics Technology Co ltd
Priority date: 2022-11-29
Filing date: 2022-11-29
Publication date: 2023-03-28

Abstract

The invention relates to a voiceprint processor and a voiceprint verification execution method, wherein the voiceprint processor comprises a configuration manager, a reconfigurable audio processing engine and a main controller, the main controller is bidirectionally interconnected with the configuration manager through a configuration bus, the main controller is bidirectionally interconnected with the reconfigurable audio processing engine through an external interface, and the configuration manager is bidirectionally interconnected with the reconfigurable audio processing engine; the reconfigurable audio processing engine has the functions of vector dot product, tensor product, convolution and fast Fourier transform. The voiceprint processor has small area, low power consumption and complete functions, and can dynamically configure the layout and the wiring of different algorithms, so that the utilization rate of the PE processing unit is high.

Description

Voiceprint processor and voiceprint verification execution method

Technical Field

The invention relates to the technical field of voice recognition, in particular to a voiceprint processor and a voiceprint verification execution method.

Background

Voiceprint recognition is a computer technology that recognizes who a speaker is from the speaker's audio. It comprises two main tasks: speaker identification and speaker verification. The former determines from the audio who the speaker is; the latter determines from the audio whether the speaker is a specific person.

Voiceprint recognition has a wide range of applications and is used in many embedded scenarios such as security, smart homes and autonomous driving. At present, voiceprint recognition applications face two main difficulties: the large amount of computation and the real-time requirements of the application. Choosing a hardware platform for the embedded system design is therefore an important issue.

To meet the performance requirements of the application, voiceprint recognition is mostly run on the following embedded platforms:

1. General-purpose processor (GPP)

2. Digital signal processor (DSP)

3. Application Specific Integrated Circuits (ASIC)

4. Field Programmable Gate Array (FPGA)

GPPs and DSPs are flexible because they are programmable, but they are computationally inefficient. To serve many application fields, a GPP is designed to process many kinds of data, such as audio, images and video, but it lacks parallelism and is computationally inefficient.

Although a DSP is faster than a GPP, it still cannot meet the real-time requirements of voiceprint recognition. In addition, neither GPPs nor DSPs fully solve the problems of power consumption and area, which are important targets in embedded system applications.

ASICs and FPGAs are computationally efficient in embedded voiceprint recognition applications, but they also have limitations. An ASIC offers high computing performance, strong resistance to electromagnetic interference, low cost and high integration, but it is difficult to test once fabricated, so a detailed physical simulation is required before production. Moreover, because an ASIC cannot be customized after fabrication, it is hard to apply to deep-learning-based voiceprint recognition, where algorithms are replaced rapidly.

Owing to its fine-grained parallel computing capability, the FPGA is another high-performance solution for embedded voiceprint recognition, and its development cost is lower than that of an ASIC. However, although an FPGA is a good choice for saving energy and area, embedded voiceprint recognition is more concerned with whether real-time performance can be achieved.

In addition, although voiceprint recognition runs efficiently on a GPU, the high power consumption of the GPU makes it difficult to apply to embedded systems, and therefore this patent does not discuss performance comparison with GPUs.

In summary, using reconfigurability to minimize production and development costs is an increasingly important requirement. A CGRA (coarse-grained reconfigurable architecture) balances performance and flexibility. A standard CGRA includes an array of processing elements that execute instructions statically and a customizable interconnect. It features high data parallelism, reconfigurable processing units, low power consumption and high memory bandwidth utilization, which makes it well suited as an embedded processor when algorithms change rapidly.

Disclosure of Invention

The invention aims to solve the technical problem of providing a voiceprint processor and a voiceprint verification execution method, wherein the voiceprint processor is small in area, low in power consumption, complete in function and capable of dynamically configuring layout and wiring of different algorithms, so that the utilization rate of a PE processing unit is high.

The technical scheme adopted by the invention for solving the technical problems is as follows: providing a voiceprint processor, which comprises a configuration manager, a reconfigurable audio processing engine and a main controller, wherein the main controller is bidirectionally interconnected with the configuration manager through a configuration bus, the main controller is bidirectionally interconnected with the reconfigurable audio processing engine through an external interface, and the configuration manager is bidirectionally interconnected with the reconfigurable audio processing engine;

the reconfigurable audio processing engine has the functions of vector dot product, tensor product, convolution and fast Fourier transform.

The reconfigurable audio processing engine comprises a second storage unit, a PE array, a third storage unit and a fourth storage unit which are sequentially connected, and the fourth storage unit is connected with the second storage unit.

The configuration manager comprises a configuration decoder, a configuration flow controller and a first storage unit which are sequentially connected, the configuration decoder is used for decoding a control instruction stream from the main controller, and the configuration flow controller is used for writing a reconstruction mode of the PE array into the first storage unit according to the decoded instruction stream and a time sequence.

The PE array comprises m multiplied by n PE processing units arranged in an array, and two adjacent PE processing units are interconnected to form a grid structure; and the PE processing unit executes processing according to the task configured by the configuration manager.

In the PE array, PE processing units in two adjacent columns, other than those located in the same row, are connected through bridging units to implement the fast Fourier transform function.

The first storage unit, the second storage unit, the third storage unit and the fourth storage unit are all static random access memories.

The technical scheme adopted by the invention for solving the technical problems is as follows: the voiceprint verification execution method of the voiceprint processor comprises the following steps:

(1) The configuration manager configures the PE processing units in the 1st to (n-1)th columns of the PE array for multiply-add operation, and these PE processing units transmit data only horizontally; the PE processing units in the nth column are configured for accumulation and transmit data only vertically; the PE processing unit in the nth column and mth row of the PE array serves as the output;

(2) The configuration manager configures each PE processing unit in the PE array for the multiply-add function and arranges the PE processing units of two adjacent columns into a plurality of butterfly operation units; the PE processing units in the nth column of the PE array serve as the output;

(31) The configuration manager configures the PE processing units in the 1st to (n-1)th columns of the PE array for multiply-add operation, and these PE processing units transmit data only horizontally; the PE processing units in the nth column are configured for accumulation and transmit data only vertically; the PE processing unit in the nth column and mth row of the PE array serves as the output;

(32) The configuration manager configures the PE processing units in the 1st column of the PE array for the function of comparing and taking the larger value and configures the PE processing units in the 2nd to nth columns for the data transmission function; the PE processing units in the 1st to nth columns transmit data only horizontally; the PE processing units in the nth column of the PE array serve as the output;

(33) The configuration manager configures each PE processing unit in the PE array for the data transmission function, and the PE processing units in the 1st to nth columns transmit data only horizontally; the PE processing units in the nth column of the PE array serve as the output;

(4) A Euclidean distance measurement is performed between the output of step (33) and the data to be verified in the database to realize voiceprint verification.
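As an illustration only, the sequence of PE array configurations in steps (1) to (33) can be written down as a small table that a configuration manager could step through in time order. The sketch below is a software model under stated assumptions: the class `ColumnConfig`, the function names and the string labels ("mac", "acc", "max", "pass", "butterfly") are hypothetical and do not come from the patent, which does not specify an encoding for reconfiguration modes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ColumnConfig:
    """Hypothetical description of how one column of PEs is configured."""
    function: str   # "mac" (multiply-add), "acc", "max" or "pass"
    direction: str  # "horizontal", "vertical" or "butterfly"


def convolution_config(n: int) -> List[ColumnConfig]:
    # Steps (1) and (31): columns 1..n-1 multiply-add and pass data
    # horizontally; column n accumulates and passes data vertically.
    return [ColumnConfig("mac", "horizontal")] * (n - 1) + [ColumnConfig("acc", "vertical")]


def fft_config(n: int) -> List[ColumnConfig]:
    # Step (2): every PE is multiply-add; adjacent columns form butterflies.
    return [ColumnConfig("mac", "butterfly")] * n


def max_pool_config(n: int) -> List[ColumnConfig]:
    # Step (32): column 1 keeps the larger value, columns 2..n pass data on.
    return [ColumnConfig("max", "horizontal")] + [ColumnConfig("pass", "horizontal")] * (n - 1)


def pass_through_config(n: int) -> List[ColumnConfig]:
    # Step (33): every PE only passes data horizontally.
    return [ColumnConfig("pass", "horizontal")] * n


def verification_schedule(n: int) -> List[Tuple[str, List[ColumnConfig]]]:
    """Time-ordered (step label, per-column configuration) pairs.

    Step (4) is omitted because the text does not state a PE configuration for it.
    """
    return [
        ("1", convolution_config(n)),
        ("2", fft_config(n)),
        ("31", convolution_config(n)),
        ("32", max_pool_config(n)),
        ("33", pass_through_config(n)),
    ]
```

Calling `verification_schedule(4)` yields five entries, one per reconfiguration, each listing the function and data direction of the four columns.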

Advantageous effects

Owing to the above technical scheme, the invention has the following advantages and positive effects compared with the prior art. The reconfigurable audio processing engine is dynamically reconfigurable: the main controller controls the configuration manager, the configuration manager configures the layout and wiring of the PE array of the reconfigurable audio processing engine, and the reconfiguration takes place at run time, so that the algorithms run efficiently. The reconfigurable audio processing engine is highly agile: the PE array switches flexibly among different operations, and the processing unit array can be dynamically reconfigured on a time scale of ten nanoseconds, so the engine suits different operations. The configuration manager can configure different layouts and wiring for the PE array and thereby implement mathematical operations such as vector dot product, tensor product, convolution and fast Fourier transform. The configuration flow controller in the configuration manager can configure the layout and wiring of the PE array in time sequence, so that during operation the PE array is dynamically configured for different mathematical operations, which realizes the different algorithms of the voiceprint recognition process in hardware. The voiceprint processor is small in area, low in power consumption and complete in function, and because it can dynamically configure the layout and wiring for different algorithms, the utilization rate of the processing units is high. Compared with conventional ASICs, GPUs, FPGAs, DSPs and general-purpose processors, the invention has obvious advantages.

Drawings

FIG. 1 is a general architecture diagram of a voiceprint processor according to an embodiment of the invention;

FIG. 2 is a flowchart illustrating the complete implementation of a conventional voiceprint authentication method;

FIG. 3 is a flowchart of the voiceprint authentication method of an embodiment of the invention;

FIG. 4 is a convolution data flow diagram of a reconfigurable audio processing engine of an embodiment of the present invention;

FIG. 5 is a schematic diagram of a parallel data flow configuration in the SRAM2 during convolution operation according to the embodiment of the present invention;

FIG. 6 is a data flow diagram of a reconfigurable audio processing engine implementing FFT in accordance with an embodiment of the present invention;

FIG. 7 is an FFT butterfly diagram of an embodiment of the invention;

FIG. 8 is a max pooling data flow diagram of the reconfigurable audio processing engine according to an embodiment of the present invention;

FIG. 9 is a nonlinear activation operation data flow diagram of the reconfigurable audio processing engine according to an embodiment of the present invention.

Detailed Description

The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.

The present invention relates to a voiceprint processor, and in the embodiment, referring to fig. 1, the voiceprint processor includes a main controller, a configuration manager, a reconfigurable audio processing engine, four Static Random Access Memories (SRAMs), an external interface, and a configuration bus, and the voiceprint processor obtains input data through an upper system bus and the external interface.

The configuration manager is responsible for laying out and wiring the PE processing units of the reconfigurable audio processing engine, so that the reconfigurable audio processing engine has the dynamic reconfigurable capability during the operation of the algorithm.

The main controller is a general-purpose processor (including but not limited to an ARM7 or RISC-V core) that is responsible for receiving input data from the external interface and controlling the configuration manager.

The configuration decoder in the configuration manager is responsible for decoding the control instruction stream from the main controller.

The configuration flow controller in the configuration manager is responsible for writing the reconfiguration mode of the PE array into SRAM1 in time sequence at run time.

The reconfigurable audio processing engine consists of a PE array and three SRAMs (SRAM2, SRAM3 and SRAM4). The PE array has m rows and n columns, where m and n are integers. The PE processing units are interconnected four ways, with their neighbors in the vertical and horizontal directions. The PE array reads data from SRAM2 and writes output data to SRAM3, and SRAM4 stores the intermediate variables of the computation.
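The four-way mesh interconnection can be illustrated with a short helper that lists the PEs directly connected to a given PE. This is a minimal sketch: the function name `neighbours` and the zero-based (row, column) coordinates are assumptions made for illustration, and the bridging units used for the FFT are not modeled.

```python
from typing import List, Tuple


def neighbours(row: int, col: int, m: int, n: int) -> List[Tuple[int, int]]:
    """Return the PEs wired to PE (row, col) in an m x n four-way mesh.

    Coordinates are zero-based; PEs on the edge of the array simply have
    fewer neighbours.
    """
    candidates = [
        (row - 1, col),  # up
        (row + 1, col),  # down
        (row, col - 1),  # left
        (row, col + 1),  # right
    ]
    return [(r, c) for r, c in candidates if 0 <= r < m and 0 <= c < n]
```

For an 8 × 4 array, a corner PE such as (0, 0) has two neighbours and an interior PE such as (3, 2) has four.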

The reconfigurable audio processing engine can implement functions such as vector dot product, tensor product, convolution and fast Fourier transform. To support the fast Fourier transform, the PE array uses bridging units to connect PE processing units in two adjacent columns that are not located in the same row.

The main controller and the configuration manager are connected through a configuration bus.

Referring to fig. 2, a generalized voiceprint recognition process includes ten steps. This embodiment can configure ten sets of data flow graphs, dynamically reconfigure the engine during the voiceprint recognition computation, and implement all the algorithms of the ten steps in hardware.

For brevity, and to illustrate the dynamic reconfiguration, parallel processing and high computing-unit utilization of this embodiment, the voiceprint verification method shown in fig. 3 is used as an example. It mainly comprises audio pre-emphasis, fast Fourier transform, feature extraction and similarity calculation, which are described in detail below.

In the first step, pre-emphasis is applied to an audio segment 10-25 ms long. Pre-emphasis is mainly performed with an FIR filter, whose main operation is a one-dimensional convolution. Accelerating convolution is the main feature of the PE array in the reconfigurable audio processing engine; both the multi-channel two-dimensional convolution of a neural network and the one-dimensional convolution of audio can be implemented with the data flow graph shown in fig. 4.
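To make the link between pre-emphasis and one-dimensional convolution concrete, the sketch below writes a generic FIR filter as a convolution and then expresses pre-emphasis as a two-tap filter. The first-order form y[t] = x[t] - alpha * x[t-1] and the default coefficient alpha = 0.97 are common choices assumed here; the patent does not give the filter taps.

```python
from typing import List, Sequence


def fir_filter(samples: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Causal FIR filter: y[t] = sum_k taps[k] * x[t - k], with x[t] = 0 for t < 0."""
    out = []
    for t in range(len(samples)):
        acc = 0.0
        for k, w in enumerate(taps):
            if t - k >= 0:
                acc += w * samples[t - k]  # one multiply-add per tap
        out.append(acc)
    return out


def pre_emphasis(samples: Sequence[float], alpha: float = 0.97) -> List[float]:
    """Pre-emphasis as a two-tap FIR filter (alpha is an assumed value)."""
    return fir_filter(samples, [1.0, -alpha])
```

Each output sample is a short chain of multiply-add operations, which is the operation the PE array is configured to perform in the convolution data flow.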

Specifically, for an m × n PE array of the reconfigurable audio processing engine, when a convolution is computed the configuration manager configures all PE processing units in columns 1 to n-1 as multiply-add and the PE processing units in the last column (the nth column) as accumulate. In the layout and wiring, the PE processing units outside the last column transmit data only horizontally, the PE processing units in the last column transmit data only vertically, and the lowermost PE processing unit in the last column sends the output result to SRAM3, as detailed in fig. 4 and fig. 5. In fig. 5, the parallel data stream in SRAM2 moves one cell to the right in each clock cycle (allowing the PE processing units in that row to read in data), and SRAM4 stores the intermediate variables.
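One possible reading of this configuration can be modeled in software as follows. In this sketch each row of multiply-add PEs passes a running partial sum to the right, and the last column adds the row results from top to bottom, so the bottom PE of the last column emits one output value. The function name `pe_array_output`, the argument layout (m rows with n-1 entries each) and the assumption that every PE already holds its weight and input are illustrative; the clock-by-clock data scheduling of fig. 5 is not modeled.

```python
from typing import Sequence


def pe_array_output(weights: Sequence[Sequence[float]],
                    inputs: Sequence[Sequence[float]]) -> float:
    """Model one output of the convolution data flow.

    weights[i][j] and inputs[i][j] belong to the multiply-add PE in row i,
    column j (columns 1..n-1 of the array). The last column only accumulates.
    """
    row_results = []
    for w_row, x_row in zip(weights, inputs):
        partial = 0.0
        for w, x in zip(w_row, x_row):
            partial += w * x          # multiply-add PE, result passed to the right
        row_results.append(partial)   # arrives at the accumulate PE of this row

    total = 0.0
    for value in row_results:
        total += value                # accumulate PEs pass the sum downwards
    return total                      # bottom PE of the last column writes to SRAM3
```

With two rows and two multiply-add columns, weights `[[1, 2], [3, 4]]` and all-ones inputs give 10, which is the dot product of the weights with the input patch.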

When configuring data, the configuration manager assigns the convolution kernel values to the PE processing units in fig. 4: if the convolution kernel size is 3 × 3, the 9 values of the kernel are allocated in sequence to 9 PE processing units in the same column. At the same time, as shown in fig. 5, the data to be convolved is arranged in SRAM2 in the proper order so that the parallel data stream shifts one cell to the right every clock cycle (that is, the PE array of the reconfigurable audio processing engine reads in data once per cycle). Note that, as fig. 5 shows, not every PE processing unit has to read in data in every clock cycle.

As can be seen from fig. 4, for a reconfigurable audio processing engine of size m × n with m = 9, n convolution operations of size 3 × 3 can be performed simultaneously, because the PE processing units in each column implement the operation of one convolution kernel. In the subsequent feature extraction with ResNet50, some convolutional layers perform 64, 256, 512 or 1024 convolution operations on the input tensor, and the parallel operation of this embodiment can then greatly improve computational efficiency.

The above describes how the reconfigurable audio processing engine accelerates convolution. With convolution accelerated, the first step of the voiceprint verification method, audio pre-emphasis, is accelerated as well.
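As a rough worked example of this parallelism, the number of passes over the array needed for a layer can be estimated by dividing the number of kernels by the number of kernels handled at once. The sketch assumes one 3 × 3 kernel per column (m = 9) and, purely for illustration, n = 4 columns; it ignores data loading and reconfiguration time, and the function name `passes_needed` is hypothetical.

```python
def passes_needed(num_kernels: int, n_columns: int) -> int:
    """Ceiling of num_kernels / n_columns, using integer arithmetic only."""
    if num_kernels < 0 or n_columns <= 0:
        raise ValueError("num_kernels must be >= 0 and n_columns must be > 0")
    return -(-num_kernels // n_columns)


# Layer widths named in the text, evaluated for an assumed n = 4.
for kernels in (64, 256, 512, 1024):
    print(kernels, "kernels ->", passes_needed(kernels, 4), "passes")
```

Under these assumptions a layer with 64 kernels needs 16 passes and a layer with 1024 kernels needs 256, compared with 64 and 1024 passes if kernels were processed one at a time.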

In the second step, after the pre-emphasized audio signal is obtained, a fast Fourier transform (FFT) is performed. This embodiment can accelerate the FFT calculation; the layout and wiring are shown in fig. 6, where the PE array size is 8 × 4 (m = 8, n = 4).

For an eight-component signal, the hardware-accelerated FFT configuration is shown in fig. 6, where each PE processing unit is configured as multiply-add; the configuration manager configures the data flow graph of the reconfigurable audio processing engine into the form of fig. 6 before the FFT operation. To illustrate the calculation briefly, this embodiment takes a single small butterfly unit with inputs A and B. The data is multiplied by a weight as it passes along each line, so the outputs are C = 0.5 × (A + B × Wn) and D = 0.5 × (A - B × Wn).
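The butterfly above can be checked with a small software model. The sketch below builds a recursive radix-2 decimation-in-time FFT out of that butterfly; because every stage multiplies by 0.5, the result for a length-N input is the discrete Fourier transform divided by N. The twiddle factor convention W = exp(-2πik/N), the recursive structure and the function names are assumptions for illustration, since the text only names the weight Wn; the input length must be a power of two.

```python
import cmath
from typing import List, Sequence, Tuple


def butterfly(a: complex, b: complex, w: complex) -> Tuple[complex, complex]:
    """Scaled butterfly from the text: C = 0.5*(A + B*W), D = 0.5*(A - B*W)."""
    return 0.5 * (a + b * w), 0.5 * (a - b * w)


def scaled_fft(x: Sequence[complex]) -> List[complex]:
    """Radix-2 FFT built from scaled butterflies; returns DFT(x) / len(x)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2 != 0:
        raise ValueError("input length must be a power of two")
    even = scaled_fft(x[0::2])
    odd = scaled_fft(x[1::2])
    out: List[complex] = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # assumed twiddle convention
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out
```

For an eight-component input the recursion has three butterfly stages, so the overall scaling is 0.5³ = 1/8; a constant input of ones therefore gives 1 in bin 0 and approximately 0 elsewhere.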

A spectrogram can be obtained after FFT of the audio signal. The spectrogram has time on the abscissa and frequency on the ordinate, and the color of each pixel (intensity value on the RGB channel) represents the intensity of the energy. Since the spectrogram can comprehensively reflect the information of the sound, obtaining the spectrogram means that the voiceprint recognition can be performed by using an image processing method.
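A spectrogram of this kind can be sketched as a list of per-frame energy spectra, indexed first by time and then by frequency. The code below is illustrative only: it uses a direct DFT rather than the accelerated FFT, applies no window function, and the frame length and hop size are hypothetical parameters that the text does not specify.

```python
import cmath
from typing import List, Sequence


def frame_energy(frame: Sequence[float]) -> List[float]:
    """Energy |X[k]|^2 for frequency bins 0..N/2 of one frame (direct DFT)."""
    n = len(frame)
    energies = []
    for k in range(n // 2 + 1):
        xk = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        energies.append(abs(xk) ** 2)
    return energies


def spectrogram(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Return spec[time_index][frequency_index] for overlapping frames."""
    starts = range(0, len(samples) - frame_len + 1, hop)
    return [frame_energy(samples[s:s + frame_len]) for s in starts]
```

Rendering `spec` with time on the horizontal axis and frequency on the vertical axis gives the image that is passed on to the feature extraction step.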

Third, the spectrogram is input into ResNet50, and a convolutional neural network extracts the feature information of the audio. The operations of ResNet50 include convolutional layers with three kernel sizes (1 × 1, 3 × 3 and 7 × 7), max pooling and nonlinear activation layers. The layout and wiring for the convolutional layers is given in fig. 4; fig. 8 and fig. 9 show the data flow diagrams for the max pooling operation and the nonlinear activation operation. The configuration manager applies all of these configurations in time sequence.

For max pooling, the PE processing units in the first column of this embodiment only need to be configured as "compare and take the larger value", with each of them receiving 4 inputs in succession; all subsequent PE processing units are configured for the "data transfer" function (that is, the "pass-through" function).
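The behavior of one such chain of PEs can be modeled as below: the first PE keeps the larger value as the inputs of one pooling window arrive one after another, and the remaining PEs forward the result unchanged. The function names and the `n_columns` parameter are illustrative; a window of 4 inputs corresponds to 2 × 2 pooling, which is an assumption consistent with the 4 inputs mentioned in the text.

```python
from typing import Sequence


def pass_through(value: float) -> float:
    """A PE configured for data transfer forwards its input unchanged."""
    return value


def max_pool_row(window: Sequence[float], n_columns: int) -> float:
    """Model one row of the max pooling data flow for one pooling window."""
    if not window:
        raise ValueError("pooling window must not be empty")
    # First PE: compare each arriving input with the value held so far.
    held = window[0]
    for value in window[1:]:
        held = value if value > held else held
    # Remaining n_columns - 1 PEs: pass the result along to the output.
    for _ in range(n_columns - 1):
        held = pass_through(held)
    return held
```

For example, `max_pool_row([0.2, 0.9, 0.4, 0.1], 4)` returns 0.9.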

For the nonlinear activation operation, ResNet50 uses the rectified linear unit (ReLU) as the activation function, defined by the following formula:

f(x) = max(0, x)

where x is the input to the function, here the operand of a PE processing unit in this embodiment, and max outputs the larger of its two arguments. Because the neural network normalizes all data to the range 0 to 1 during operation, no operand is less than 0. Therefore, as shown in fig. 9, the data flow graph for the nonlinear activation of ResNet50 only needs horizontal wiring, with all PE processing units configured for the "data transfer" function (that is, the "pass-through" function).
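The reasoning that ReLU reduces to pass-through can be checked directly: for any operand that is already non-negative, max(0, x) returns x itself. The sketch below demonstrates this under the premise stated in the text that all operands lie in the range 0 to 1; for a negative operand the two functions differ, so the simplification depends on that premise. The function names are illustrative.

```python
def relu(x: float) -> float:
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def pass_through(x: float) -> float:
    """A PE configured for data transfer forwards its operand unchanged."""
    return x


# For operands in [0, 1] the two functions agree, so no comparison is needed.
operands = [0.0, 0.25, 0.5, 1.0]
agree = all(relu(v) == pass_through(v) for v in operands)
```

Here `agree` is true, whereas `relu(-0.5)` is 0.0 and `pass_through(-0.5)` is -0.5.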

Fourth, the similarity is calculated. After ResNet50 extracts a feature vector, downstream tasks can be completed using that vector. For simplicity, the downstream task chosen here is voiceprint verification (speaker verification), and the feature vector extracted in the previous step is assumed to have 256 dimensions.

Given a 256-dimensional feature vector, the voiceprint verification task only requires retrieving from the database the feature vector labeled as the speaker to be verified and calculating the Euclidean distance between the two vectors: if the distance is smaller than a threshold, the result is "yes"; otherwise it is "no". Depending on the array size of the reconfigurable audio processing engine (the PE array size is m × n in this embodiment), the calculations for n vector components can be performed in parallel.

After the similarity result is calculated, it only needs to be compared with a given threshold (a scalar comparison), and the result "yes" or "no" is then output.

In summary, when the voiceprint verification task is carried out with this method, the data flow graph must be switched dynamically while the algorithm runs, which fully demonstrates the dynamic reconfigurability of the voiceprint verification execution method. Each data flow graph also demonstrates the data-parallel processing capability and the high utilization rate of the processing units (and hence low power consumption).
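The similarity step can be sketched in a few lines. The function names, the strict "smaller than" comparison (taken from the wording of the text) and the example threshold are illustrative; the patent does not state a threshold value, and in practice it would be tuned on enrollment data.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify(probe: Sequence[float], enrolled: Sequence[float], threshold: float) -> bool:
    """Return True ("yes") if the distance is smaller than the threshold."""
    return euclidean_distance(probe, enrolled) < threshold
```

For instance, `verify([0.0, 0.0], [3.0, 4.0], threshold=6.0)` returns True because the distance is 5.0, while a threshold of 5.0 returns False. The same code applies unchanged to 256-dimensional vectors.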

The above is the whole process of using the invention to complete the voiceprint verification task.

The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims

1. A voiceprint processor, characterized by comprising a configuration manager, a reconfigurable audio processing engine and a main controller, wherein the main controller is bidirectionally interconnected with the configuration manager through a configuration bus, the main controller is bidirectionally interconnected with the reconfigurable audio processing engine through an external interface, and the configuration manager is bidirectionally interconnected with the reconfigurable audio processing engine; the reconfigurable audio processing engine has the functions of vector dot product, tensor product, convolution and fast Fourier transform.

2. The voiceprint processor of claim 1 wherein the reconfigurable audio processing engine comprises a second storage unit, a PE array, a third storage unit and a fourth storage unit connected in series, the fourth storage unit being connected to the second storage unit.

3. The voiceprint processor according to claim 2, wherein the configuration manager includes a configuration decoder, a configuration flow controller and a first storage unit, which are connected in sequence, the configuration decoder is configured to decode a control instruction stream from the main controller, and the configuration flow controller is configured to write a reconfiguration mode of the PE array to the first storage unit in time sequence according to the decoded instruction stream.

4. The voiceprint processor according to claim 3 wherein said PE array comprises m x n PE processing units arranged in an array, adjacent two PE processing units being interconnected to form a grid structure; and the PE processing unit executes processing according to the task configured by the configuration manager.

5. The voiceprint processor of claim 4, wherein the PE array connects, through bridging units, PE processing units in two adjacent columns other than those located in the same row, so as to implement the fast Fourier transform function.

6. The voiceprint processor of claim 3 wherein the first memory unit, the second memory unit, the third memory unit, and the fourth memory unit are all static random access memories.

7. A method of voiceprint authentication execution using a voiceprint processor as claimed in any one of claims 1 to 6, comprising the steps of:

(31) The configuration manager configures the PE processing units in the 1st to (n-1)th columns of the PE array for multiply-add operation, and these PE processing units transmit data only horizontally; the PE processing units in the nth column are configured for accumulation and transmit data only vertically; the PE processing unit in the nth column and mth row of the PE array serves as the output;

(32) The configuration manager configures the PE processing units in the 1st column of the PE array for the function of comparing and taking the larger value and configures the PE processing units in the 2nd to nth columns for the data transmission function; the PE processing units in the 1st to nth columns transmit data only horizontally; the PE processing units in the nth column of the PE array serve as the output;

(33) The configuration manager configures each PE processing unit in the PE array for the data transmission function, and the PE processing units in the 1st to nth columns transmit data only horizontally; the PE processing units in the nth column of the PE array serve as the output;