CN115328661A

CN115328661A - Computing power balance execution method and chip based on voice and image characteristics

Info

Publication number: CN115328661A
Application number: CN202211100689.6A
Authority: CN
Inventors: 王嘉诚; 张少仲; 张栩
Original assignee: Zhongcheng Hualong Computer Technology Co Ltd
Current assignee: Zhongcheng Hualong Computer Technology Co Ltd
Priority date: 2022-09-09
Filing date: 2022-09-09
Publication date: 2022-11-11
Anticipated expiration: 2042-09-09
Also published as: CN115328661B

Abstract

The invention discloses a computing power balance execution method and a chip based on voice and image characteristics, which relate to the technical field of load balance and comprise the following steps: receiving data to be processed, wherein the data to be processed comprises voice data and image data; preprocessing data to be processed, wherein the preprocessing comprises A/D conversion and fast Fourier transform, and generating a first characteristic diagram; extracting the frequency bandwidth data of the first characteristic diagram, judging the type of the corresponding input signal, and transmitting the frequency bandwidth data to the first task pool or the second task pool for next processing; and respectively processing the corresponding tasks and outputting processing results. By preprocessing the data signals and transmitting different signals to corresponding neural networks for further processing, the computation of each neural network can be designed differently, and the processing efficiency of the neural network module in the chip is improved; in the occasion with higher requirement on the real-time property of signal processing, the time delay of voice signal and image signal processing can be effectively reduced, and the running speed is improved.

Description

Computing power balance execution method and chip based on voice and image characteristics

Technical Field

The invention relates to the technical field of load balancing, in particular to a computing power balancing execution method and a chip based on voice and image characteristics.

Background

Speech recognition and image recognition have many similarities in the design of classifiers in pattern recognition, and typical classifiers such as neural networks, SVMs (Support Vector machines) and Deep Learning can be used in the recognition of the two, so the main difference lies in the difference of feature extraction algorithms.

The voice sampling frequency is very high relative to the sampling duration, and in consideration of the vibration characteristics and the persistence of sound waves, the voice signal is a dense signal in a sound area, so that the local information in a section of voice signal or the information amount of some adjacent sampling points is very small, so that a window mode is usually adopted in voice feature extraction to analyze the statistical characteristics of the signal in the window, and a feature extraction algorithm based on frequency domain transformation is very common. The spatial frequency of the image is not high relative to the size of the image, a large number of smooth areas exist in the image, the distribution of the features is relatively sparse, and the value of the local features is more important. SIFT (scale invariant feature transform), HOG (histogram of gradient directions), sparse coding, and the like, which have been widely used in recent years, are based on this idea. The invariance of the aspects of rotation, scaling, illumination and the like is emphasized more by the image characteristics, and compared with a voice signal, the mode of the image characteristics is more complex in general, and the separation difficulty of redundant information is larger.

For a chip which needs to process voice information and image information simultaneously, due to the fact that different feature extraction algorithms are adopted by voice recognition and image recognition, the computation amount of feature extraction is different, therefore, different recognition tasks are reasonably distributed to corresponding neural network processing units, and the operation efficiency of a neural network can be effectively improved.

In the prior art, computation scheduling based on an AI chip mainly calculates computation resources based on operation information corresponding to a plurality of operators and computation information of operator execution equipment. Or presetting the computation amount of the computation task, and if the new computation task is judged to meet the preset target computation power, distributing the task to the computation power target. However, since instructions and tasks processed by different neural network processing modules are different, computational resources cannot be calculated simply according to computational information of operators, and the load balancing scheme is not suitable for task scheduling among different neural network processing modules, and therefore, a task scheduling and load balancing execution method suitable for different neural network processing modules needs to be specially designed.

Disclosure of Invention

The invention provides a computing power balance execution method and a chip based on voice and image characteristics, and aims to solve the problem of task scheduling between neural network modules needing to process voice information and image information simultaneously in the prior art.

In order to solve the technical problems, the specific scheme of the invention is as follows:

a computing power balance execution method based on voice and image characteristics comprises the following steps:

receiving data to be processed, wherein the data to be processed comprises voice data and image data;

preprocessing the data to be processed, wherein the preprocessing comprises A/D conversion and fast Fourier transform to generate a first characteristic diagram;

extracting the frequency bandwidth data of the first characteristic diagram, judging the type of the corresponding input signal, and transmitting the frequency bandwidth data to the first task pool or the second task pool for next processing;

and the first task pool and the second task pool respectively carry out the processing of the corresponding tasks and output processing results for subsequent tasks.

Preferably, the a/D conversion includes sampling, quantization and encoding, and converts an input analog signal into a digital signal.

Preferably, the fast fourier transform is implemented by a hardware circuit, and a pipeline-based fast fourier transform method is adopted.

Preferably, the first task pool performs processing on the voice data, performs recognition processing on the input first feature map by training a neural network model, and outputs a processing result.

Preferably, the second task pool performs image data processing, performs recognition processing on the input first feature map by training a neural network model, and outputs a processing result.

Preferably, the first task pool is a speech neural network model, and the specific steps of feature extraction include:

end point detection, namely dividing the beginning and the end of a sentence by distinguishing signals of voiced segments, unvoiced segments and voiced segments to obtain an effective voice sequence;

framing and windowing, segmenting the emphasized voice sequence according to a set time interval, and filtering signals by using a band-pass filter to reduce the error of the signals and obtain a frame sequence depending on time;

pre-emphasis, namely increasing the high-frequency energy of the effective voice sequence, and improving the signal-to-noise ratio to obtain an emphasized voice sequence;

the fast Fourier transform is used for inputting the frame sequence into a fast Fourier transform hardware circuit and converting a time domain image into a frequency spectrum of each frame;

extracting a characteristic vector, namely extracting the characteristic vector of the frequency spectrum by using a perceptual linear prediction technology to generate a voice characteristic parameter;

and (4) neural network recognition, namely inputting the voice characteristic parameters into a neural network model and outputting a voice recognition result.

Preferably, the second task pool is an image neural network model, and the extracted features include:

the histogram feature of the directional gradient, divide the picture into the small connected region at first, then gather the gradient and direction of the edge of every pixel in the connected region, form the histogram feature of the directional gradient;

the local binary pattern feature is used for describing texture information of a picture area, a detection window is divided into 16x16 cells, one pixel in each cell is compared with 8 surrounding pixels, a histogram of each cell is calculated, and finally the obtained statistical histograms of each cell are connected to form the local binary pattern feature;

harr characteristics are used for representing the human face in the image, and in the image with the human face, the Harr characteristics are extracted and used for detecting the human face.

Preferably, the speech neural network comprises a convolutional neural network and a cyclic neural network, and the convolutional neural network comprises a first convolutional layer, a pooling layer and a second convolutional layer which are connected in sequence:

the first convolution layer is 128 filters with the size of 1 × 9, the transverse step is set to 2, and the channel is set to 1;

the pooling layer is the largest pooling layer with the size of 1 × 3, and the step length is set to be 1;

the second convolutional layer is 256 filters with the size of 1 × 4, the transverse step size is set to 1, and the channel is set to 64;

preferably, the recurrent neural network employs a long-short term memory structure and a neural network-based time-series class classification for speech recognition.

A computing power balance execution chip based on voice and image features comprises a general-purpose processor and a neural network processor, wherein the neural network processor is used for executing the computing power balance execution method based on the voice and image features.

Compared with the prior art, the invention has the following technical effects:

1. through preprocessing the data signals, different signal characteristics are distinguished, different signals are transmitted to corresponding neural networks for further processing, the computation of each neural network can be designed differently, and the processing efficiency of the neural network modules in the chip is improved.

2. Different neural network models can be designed according to different occasions, and the time delay of voice signal and image signal processing can be effectively reduced, the waiting time of subsequent steps is reduced, and the running speed of the chip is improved in the occasions with higher requirements on the real-time performance of the voice signal and image signal processing.

Drawings

FIG. 1 is a flow chart of a method for performing computational power equalization based on speech and image features according to the present invention;

FIG. 2 is a flow chart of the preprocessing steps of a method for performing the computational power equalization based on speech and image features according to the present invention;

FIG. 3 is a flow chart of the radix-2 FFT method of the algorithm balance execution method based on the voice and image characteristics.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail and completely with reference to the accompanying drawings.

As shown in fig. 1, a computing power balance execution method based on voice and image features includes the following steps:

the data to be processed is preprocessed, as shown in fig. 2, the preprocessing includes a/D conversion and Fast Fourier Transform (FFT), and a first feature map is generated. The a/D conversion includes sampling, quantization and encoding, and converts an input analog signal into a digital signal. For a voice signal, discretizing a continuously changing voice in time on a line of a plane is called a voice signal sample, and for an image signal, discretizing a continuously changing image in time on a plane of a space is called an image signal sample; the quantization adopts optimal quantization and vector quantization.

Extracting the frequency bandwidth data of the first characteristic diagram, judging the type of the corresponding input signal, wherein the frequency bandwidth of the image signal can reach 6.5MHz, and the frequency bandwidth of the voice signal can only be between 10Hz and 20kHz, so that the voice signal and the image signal can be obviously distinguished according to the frequency bandwidth data; transmitting the voice signal characteristic diagram to a first task pool, and carrying out next processing on the image signal characteristic diagram in a second task pool; and the first task pool and the second task pool respectively carry out the processing of the corresponding tasks and output processing results for subsequent tasks.

The fast Fourier transform is realized by a hardware circuit, and a fast Fourier transform method based on a production line is adopted. The FFT is a combination of a number of decimal points, and the FFT decomposes the transform once to reduce the amount of computation, and in general, the FFT has a radix 2 and a length N =2l, and when the length of the sequence to be transformed is not an integer power of 2, the FFT having the radix 2 is still taken, and zero padding is performed on the last bit to extend the length to an integer power of 2, that is, the radix-2 FFT method. In this embodiment, a radix-2 FFT method is specifically adopted, as shown in fig. 3, which is a flow chart of the radix-2 FFT method of the present invention, each stage of operation needs N/2 (N represents the number of input points of the stage) radix-2 butterflies to complete, and each butterfly factor operation includes 1 complex multiplication and 2 complex additions. 4096-Point FFT operation requires log altogether ₂ (4096) The calculation of the butterfly factors is a core part of FFT calculation, the radix-2 butterfly factors of each level adopt a method of addressing calculation, so-called addressing calculation, namely, data output by the previous level is firstly sent into a data storage RAM, when butterfly calculation is carried out, a butterfly factor unit reads out the data from the data storage RAM, an intermediate result obtained through calculation is still stored into the same data storage RAM, input data is covered until all the calculation of the butterfly factors of the level is finished, and a final result is output to the next level.

And the first task pool processes voice data, identifies the input first characteristic diagram through training a neural network model, and outputs a processing result. And the second task pool is used for processing the image data, recognizing the input first characteristic diagram through training a neural network model, and outputting a processing result.

The first task pool is a speech neural network model, and the specific steps of feature extraction comprise:

end point detection by distinguishing voiced segments, unvoiced segments, and voiced soundsThe signals of the segments are divided into the beginning and the end of the sentence, and an effective voice sequence is obtained. The time domain analysis is carried out on the speech signal, the original speech information can be obviously distinguished to include a vocal section, a silent section and a voiced section, and the endpoint detection finds the starting point and the ending point of the speech signal by distinguishing the signals of the different sections. The endpoint detection method in this embodiment adopts a double-threshold method, judges the endpoint of the voice by calculating the voice energy, and presets the threshold energy of the double gates

Then respectively calculating the voice energy of each time

If, if

If yes, generating a threshold sequence as 1; if it is

If yes, the generated threshold sequence is 0; thus obtaining a threshold sequence, and multiplying the threshold sequence by the original voice sequence to obtain an effective voice sequence.

The endpoint detection method is preferably a double-threshold method, wherein the calculation formula of the voice energy is as follows:

wherein the content of the first and second substances,

in order to detect the speech energy of the point,

is as follows

The phonetic generalized decibel value of a point,

is the number of all detection points.

Pre-emphasis, increasing the high-frequency energy of the effective voice sequence, improving the signal-to-noise ratio and obtaining an emphasized voice sequence. The voice information is often mingled with various other voice information in the environment, and due to the characteristic of human pronunciation, most of the voice information is often concentrated in a low frequency band after frequency conversion, so that the low frequency energy is too high, the high frequency energy is too low, and the high frequency voice information is difficult to effectively extract. The pre-emphasis adds high-frequency signals in advance, and after the high-frequency signals are overlapped with the original voice signals, the energy of the high-frequency band is equivalent to that of the low-frequency band, so that the subsequent recognition efficiency is obviously improved.

Framing and windowing, the emphasized speech sequence is segmented at set time intervals, and then the signal is filtered using a band-pass filter to reduce the error of the signal, resulting in a time-dependent frame sequence 2. A speech signal is unstable as a whole, but locally, the speech signal can be assumed to be stationary for a short time (the speech signal can be considered to be approximately unchanged to the pronunciation of a phoneme within 10-30ms, and is generally 25 ms), so that the whole speech signal needs to be framed. In this embodiment, a hamming window is used for windowing, only the middle data is shown due to the hamming window, and the data information on both sides is lost, so that there is an overlapping portion between adjacent windows, the window length in this embodiment is 25ms, and the step length is 10ms, that is, the last 15ms of each window and the first 15ms of the subsequent adjacent window are overlapping portions.

Fast Fourier transform, inputting the frame sequence into a fast Fourier transform hardware circuit, and converting the time domain image into the frequency spectrum of each frame;

extracting a feature vector, namely extracting the feature vector of a frequency spectrum by using a Perceptual Linear Prediction (PLP) technology to generate a voice feature parameter;

The second task pool is an image neural network model, and the extracted features comprise:

and (3) a Histogram of Oriented Gradients (HOG) feature, which divides the image into small connected regions, and collects the gradient and the direction of the edge of each pixel point in the connected regions to form the HOG feature. Firstly, graying an input image; then, carrying out color space standardization on the input image by adopting a Gamma correction method, aiming at adjusting the contrast of the image, reducing the influence caused by local shadow and illumination change of the image and simultaneously inhibiting the interference of noise; calculating the gradient of each pixel of the image, aiming at capturing contour information and further weakening the interference of illumination; dividing the image into small cells, counting a gradient histogram of each cell, namely the characteristic of each cell, forming each cell into a block, connecting the characteristics of all the cells in the block in series to obtain the HOG characteristic of the block, and connecting the HOG characteristics of all the cells in the image in series to obtain the HOG characteristic of the image.

A Local Binary Pattern (LBP) feature, which is used to describe texture information of a picture region, and divides a detection window into 16 × 16 cells; comparing one pixel in each cell with 8 surrounding pixels, if the surrounding pixel value is greater than the central pixel value, marking the position of the pixel point as 1, otherwise, marking the position as 0, so that 8 points in a 3x3 neighborhood can generate 8-bit binary numbers through comparison, and obtaining the LBP value of the central pixel point of the window; then calculate the histogram for each cell, i.e. the frequency of occurrence of each digit (assuming a decimal number LBP value); then, normalizing the histogram; and finally, connecting the obtained statistical histograms of all the cells into a feature vector, namely an LBP texture feature vector of the whole graph.

Harr characteristics are used for representing the human faces in the images, and in the images with the human faces, the Harr characteristics are extracted and used for detecting the human faces. In this embodiment, an integral graph is used to calculate the Haar feature, where the integral graph stores the sum of pixels in a rectangular region formed by an image from a starting point to each point as an element of an array in a memory, and when the sum of pixels in a certain region is to be calculated, the element of the array may be directly indexed without recalculating the sum of pixels in the region, thereby speeding up the calculation (this is called a dynamic programming algorithm). The integral map can use the same time (constant time) to compute different features at multiple scales, thus greatly improving the detection speed.

The speech neural network model comprises a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), wherein the convolutional neural network comprises a first convolutional layer, a pooling layer and a second convolutional layer which are connected in sequence: the first convolution layer is 128 filters with the size of 1 × 9, the transverse step is set to 2, and the channel is set to 1; the pooling layer is the largest pooling layer with the size of 1 × 3, and the step length is set to be 1; the second convolutional layer is 256 filters of size 1 × 4 with a lateral step size set to 1 and channel set to 64.

The recurrent neural network splits the input data according to the time parameters, encapsulates the split data into two matrixes in sequence, and records the time sequence characteristics of the voice sequence by using the LSTM. In the embodiment, a unidirectional LSTM (bidirectional Long short term memory) structure is used for learning the voice time sequence characteristics, and an LSTM unit is used for replacing a hidden layer in a bidirectional RNN, so that the characteristics of the current voice can be expanded into a whole sequence picture by using information of two directions of the past time and the future time, the effective learning of the whole voice time sequence characteristics is realized, and the final prediction result is more accurate. And respectively carrying out forward propagation on the former matrix and backward propagation on the latter matrix by using the BilSTM node, and outputting a voice recognition result. The number of nodes of the BilSTM is preferably 2048, wherein 1024 nodes are connected with only one matrix for forward propagation; another 1024 nodes connect to another matrix for back propagation.

The execution chip comprises a general processor and a neural network processor, and the neural network processor is used for executing the computational power balance execution method based on the voice and image characteristics.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the inventive concept of the present invention, and these changes and modifications are all within the scope of the present invention.

Claims

1. A computing power balance execution method based on voice and image characteristics is characterized by comprising the following steps:

extracting the frequency bandwidth data of the first characteristic diagram, judging the type of the corresponding input signal, transmitting the voice signal to a first task pool, and transmitting the image signal to a second task pool for next processing;

the first task pool and the second task pool respectively process corresponding tasks, and output processing results for subsequent tasks; the first task pool adopts a trained voice neural network model, the voice neural network model comprises a convolutional neural network and a cyclic neural network, and the cyclic neural network records the time sequence characteristics of a voice sequence through bidirectional long-term and short-term memory; and the second task pool adopts a trained image neural network model, and the image neural network model extracts HOG characteristics, LBP characteristics and Harr characteristics of the image.

2. The method of claim 1, wherein the a/D conversion comprises sampling, quantization and encoding, and the input analog signal is converted into a digital signal.

3. The method of claim 1, wherein the fast fourier transform is implemented in a hardware circuit and adopts a pipeline-based fast fourier transform method.

4. The method as claimed in claim 1, wherein the first task pool processes the speech data, performs recognition processing on the input first feature map by training a neural network model, and outputs a processing result.

5. The method as claimed in claim 1, wherein the second task pool processes image data, performs recognition processing on the input first feature map by training a neural network model, and outputs a processing result.

6. The method for performing computational power equalization based on speech and image features according to claim 4, wherein the first task pool is a speech neural network model, and the specific steps of feature extraction include:

7. The method of claim 5, wherein the second task pool is an image neural network model, and the extracted features comprise:

8. The method of claim 6, wherein the speech neural network comprises a convolutional neural network and a cyclic neural network, and the convolutional neural network comprises a first convolutional layer, a pooling layer and a second convolutional layer connected in sequence:

the second convolutional layer is 256 filters of size 1 × 4 with a lateral step size set to 1 and channel set to 64.

9. The method of claim 8, wherein the recurrent neural network employs long-short term memory structure and neural network-based time-series class classification for speech recognition.

10. A computational power equalization execution chip based on speech and image features, characterized in that the execution chip comprises a general purpose processor and a neural network processor for performing the method of any of claims 1-9.