CN115328661A - Computing power balance execution method and chip based on voice and image characteristics - Google Patents

Computing power balance execution method and chip based on voice and image characteristics

Info

Publication number
CN115328661A
CN115328661A (application CN202211100689.6A)
Authority
CN
China
Prior art keywords
neural network
voice
image
data
task pool
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211100689.6A
Other languages
Chinese (zh)
Other versions
CN115328661B (en)
Inventor
王嘉诚
张少仲
张栩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongcheng Hualong Computer Technology Co Ltd
Original Assignee
Zhongcheng Hualong Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongcheng Hualong Computer Technology Co Ltd filed Critical Zhongcheng Hualong Computer Technology Co Ltd
Priority to CN202211100689.6A priority Critical patent/CN115328661B/en
Publication of CN115328661A publication Critical patent/CN115328661A/en
Application granted granted Critical
Publication of CN115328661B publication Critical patent/CN115328661B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/467Encoded features or binary features, e.g. local binary patterns [LBP]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/50Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/93Discriminating between voiced and unvoiced parts of speech signals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5011Pool
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a computing power balance execution method and chip based on voice and image features, relating to the technical field of load balancing. The method comprises the following steps: receiving data to be processed, the data comprising voice data and image data; preprocessing the data, the preprocessing comprising A/D conversion and fast Fourier transform, to generate a first feature map; extracting the frequency bandwidth data of the first feature map, judging the type of the corresponding input signal, and transmitting the feature map to a first task pool or a second task pool for further processing; and processing the corresponding tasks in each pool and outputting the processing results. By preprocessing the data signals and routing different signals to their corresponding neural networks for further processing, the computing power of each neural network can be designed independently, which improves the processing efficiency of the neural network modules in the chip. Where the real-time requirements of signal processing are high, the method effectively reduces the latency of voice- and image-signal processing and increases running speed.

Description

Computing power balance execution method and chip based on voice and image characteristics
Technical Field
The invention relates to the technical field of load balancing, in particular to a computing power balancing execution method and a chip based on voice and image characteristics.
Background
Speech recognition and image recognition have much in common in the design of pattern-recognition classifiers: typical classifiers such as neural networks, SVMs (Support Vector Machines) and deep learning models can be used for both, so the main difference lies in the feature extraction algorithms.
The voice sampling frequency is very high relative to the sampling duration. Given the vibration characteristics and persistence of sound waves, a voice signal is dense within voiced regions, so the information carried by local details or by a few adjacent sampling points is very small; voice feature extraction therefore usually analyzes the statistical characteristics of the signal within a sliding window, and feature extraction algorithms based on frequency-domain transforms are very common. The spatial frequency of an image, by contrast, is low relative to its size: images contain large smooth regions, the feature distribution is relatively sparse, and local features carry more value. SIFT (scale-invariant feature transform), HOG (histogram of oriented gradients) and sparse coding, all widely used in recent years, are based on this idea. Image features place more emphasis on invariance to rotation, scaling and illumination; compared with voice signals, image feature patterns are generally more complex and their redundant information is harder to separate.
For a chip that must process voice information and image information simultaneously, speech recognition and image recognition use different feature extraction algorithms with different computational loads. Reasonably distributing the different recognition tasks to the corresponding neural network processing units can therefore effectively improve the operating efficiency of the neural networks.
In the prior art, computing power scheduling for AI chips mainly computes resource usage from the operation information of a set of operators and the computing power information of the devices executing them; alternatively, the computation amount of a task is preset, and a new task is assigned to a computing power target if it is judged to fit that target's preset capacity. However, because different neural network processing modules handle different instructions and tasks, computational resources cannot be estimated simply from operator information, and such load balancing schemes are unsuitable for scheduling tasks across different neural network processing modules. A task scheduling and load balancing execution method designed specifically for different neural network processing modules is therefore needed.
Disclosure of Invention
The invention provides a computing power balance execution method and chip based on voice and image features, aiming to solve the prior-art problem of task scheduling between neural network modules that must process voice information and image information simultaneously.
In order to solve the technical problems, the specific scheme of the invention is as follows:
a computing power balance execution method based on voice and image characteristics comprises the following steps:
receiving data to be processed, wherein the data to be processed comprises voice data and image data;
preprocessing the data to be processed, wherein the preprocessing comprises A/D conversion and fast Fourier transform, to generate a first feature map;
extracting the frequency bandwidth data of the first feature map, judging the type of the corresponding input signal, and transmitting the feature map to the first task pool or the second task pool for the next processing;
and the first task pool and the second task pool respectively process their corresponding tasks and output the processing results for subsequent tasks.
Preferably, the a/D conversion includes sampling, quantization and encoding, and converts an input analog signal into a digital signal.
Preferably, the fast fourier transform is implemented by a hardware circuit, and a pipeline-based fast fourier transform method is adopted.
Preferably, the first task pool processes the voice data, performs recognition on the input first feature map through a trained neural network model, and outputs a processing result.
Preferably, the second task pool processes the image data, performs recognition on the input first feature map through a trained neural network model, and outputs a processing result.
Preferably, the first task pool is a speech neural network model, and the specific steps of feature extraction include:
end point detection, namely dividing the beginning and the end of a sentence by distinguishing the signals of voiced segments, unvoiced segments and silent segments to obtain an effective voice sequence;
pre-emphasis, namely increasing the high-frequency energy of the effective voice sequence and improving the signal-to-noise ratio to obtain an emphasized voice sequence;
framing and windowing, namely segmenting the emphasized voice sequence at a set time interval and filtering the signal with a band-pass filter to reduce signal error, obtaining a time-dependent frame sequence;
the fast Fourier transform, namely inputting the frame sequence into a fast Fourier transform hardware circuit and converting the time-domain signal of each frame into its frequency spectrum;
feature vector extraction, namely extracting the feature vector of the frequency spectrum using a perceptual linear prediction technique to generate voice feature parameters;
neural network recognition, namely inputting the voice feature parameters into a neural network model and outputting a voice recognition result.
Preferably, the second task pool is an image neural network model, and the extracted features include:
a histogram of oriented gradients feature: the picture is first divided into small connected regions, and the gradient magnitude and direction at each pixel within each region are then collected to form the histogram of oriented gradients feature;
a local binary pattern feature, used to describe the texture information of a picture region: the detection window is divided into 16x16 cells, each pixel in a cell is compared with its 8 surrounding pixels, a histogram is computed for each cell, and the statistical histograms of all cells are finally concatenated to form the local binary pattern feature;
a Haar feature, used to represent a human face in the image: in an image containing a face, the Haar feature is extracted and used for face detection.
Preferably, the speech neural network comprises a convolutional neural network and a recurrent neural network, the convolutional neural network comprising a first convolutional layer, a pooling layer and a second convolutional layer connected in sequence:
the first convolutional layer has 128 filters of size 1 × 9, with the horizontal stride set to 2 and the input channel set to 1;
the pooling layer is a max-pooling layer of size 1 × 3 with a stride of 1;
the second convolutional layer has 256 filters of size 1 × 4, with the horizontal stride set to 1 and the input channel set to 64;
preferably, the recurrent neural network employs a long short-term memory structure and neural-network-based temporal sequence classification for speech recognition.
A computing power balance execution chip based on voice and image features comprises a general-purpose processor and a neural network processor, wherein the neural network processor is used for executing the computing power balance execution method based on the voice and image features.
Compared with the prior art, the invention has the following technical effects:
1. through preprocessing the data signals, different signal characteristics are distinguished, different signals are transmitted to corresponding neural networks for further processing, the computation of each neural network can be designed differently, and the processing efficiency of the neural network modules in the chip is improved.
2. Different neural network models can be designed for different application scenarios. Where the real-time requirements of voice- and image-signal processing are high, the method effectively reduces processing latency, shortens the waiting time of subsequent steps, and increases the running speed of the chip.
Drawings
FIG. 1 is a flow chart of a method for performing computational power equalization based on speech and image features according to the present invention;
FIG. 2 is a flow chart of the preprocessing steps of a method for performing the computational power equalization based on speech and image features according to the present invention;
FIG. 3 is a flow chart of the radix-2 FFT method of the computing power balance execution method based on voice and image features.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail and completely with reference to the accompanying drawings.
As shown in fig. 1, a computing power balance execution method based on voice and image features includes the following steps:
receiving data to be processed, wherein the data to be processed comprises voice data and image data;
the data to be processed is preprocessed, as shown in fig. 2, the preprocessing includes a/D conversion and Fast Fourier Transform (FFT), and a first feature map is generated. The a/D conversion includes sampling, quantization and encoding, and converts an input analog signal into a digital signal. For a voice signal, discretizing a continuously changing voice in time on a line of a plane is called a voice signal sample, and for an image signal, discretizing a continuously changing image in time on a plane of a space is called an image signal sample; the quantization adopts optimal quantization and vector quantization.
The frequency bandwidth data of the first feature map is extracted to judge the type of the corresponding input signal: the bandwidth of an image signal can reach 6.5 MHz, while a voice signal only spans roughly 10 Hz to 20 kHz, so voice and image signals can be clearly distinguished by their bandwidth. The voice-signal feature map is transmitted to the first task pool, and the image-signal feature map to the second task pool, for the next stage of processing; the first and second task pools then process their corresponding tasks and output the processing results for subsequent tasks.
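A minimal sketch of this bandwidth-based dispatch, assuming a simple energy cutoff for the occupied bandwidth and a hypothetical 100 kHz decision threshold that sits between the two bandwidth ranges named above:

```python
import numpy as np

VOICE_POOL, IMAGE_POOL = [], []   # stand-ins for the two task pools

def dispatch(samples, fs):
    """Route a digitized signal by its occupied bandwidth: compute the FFT
    magnitude (the 'first feature map'), find the highest frequency still
    carrying significant energy, and send the map to the matching pool."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs)
    significant = freqs[spectrum > 0.01 * spectrum.max()]
    bandwidth = significant.max() if significant.size else 0.0
    # 100 kHz lies between speech (~20 kHz) and image (~6.5 MHz) bandwidths
    (VOICE_POOL if bandwidth < 100e3 else IMAGE_POOL).append(spectrum)
    return bandwidth
```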
The fast Fourier transform is implemented by a hardware circuit using a pipeline-based FFT method. The FFT reduces the amount of computation by recursively decomposing a large transform into smaller ones. In general a radix-2 FFT is used, with length N = 2^L; when the length of the sequence to be transformed is not an integer power of 2, a radix-2 FFT is still used and the sequence is zero-padded at the end to extend its length to an integer power of 2 - this is the radix-2 FFT method. This embodiment adopts the radix-2 FFT method, whose flow chart is shown in fig. 3. Each stage of the computation requires N/2 radix-2 butterflies (N being the number of input points of that stage), and each butterfly involves 1 complex multiplication and 2 complex additions; a 4096-point FFT therefore requires log2(4096) = 12 stages in total. Butterfly computation is the core of the FFT. Each stage's radix-2 butterflies use addressed in-place computation: the output of the previous stage is first written into a data RAM; during butterfly computation the butterfly unit reads data from this RAM, and the intermediate results are written back into the same RAM, overwriting the input data, until all butterflies of the stage are finished and the final result is output to the next stage.
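For illustration, a software sketch of the radix-2 scheme just described: zero-padding to a power of two, log2(N) stages of N/2 butterflies, and in-place updates that overwrite each stage's input, mirroring the single-RAM addressing scheme (a reference model, not the hardware circuit itself):

```python
import numpy as np

def radix2_fft(x):
    """Iterative in-place radix-2 FFT: inputs shorter than a power of two
    are zero-padded at the end, and each butterfly stage overwrites its
    own inputs in the same buffer, like the data-RAM scheme above."""
    n = 1
    while n < len(x):
        n *= 2
    a = np.zeros(n, dtype=complex)
    a[:len(x)] = x                              # zero-pad to 2**L
    # bit-reversal permutation so the stages can run in natural order
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:                          # log2(n) stages
        w_m = np.exp(-2j * np.pi / length)
        for start in range(0, n, length):
            w = 1.0
            for k in range(length // 2):        # N/2 butterflies per stage
                u = a[start + k]
                t = w * a[start + k + length // 2]   # 1 complex multiplication
                a[start + k] = u + t                 # 2 complex additions
                a[start + k + length // 2] = u - t
                w *= w_m
        length *= 2
    return a

assert np.allclose(radix2_fft(np.arange(8.0)), np.fft.fft(np.arange(8.0)))
```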
The first task pool processes the voice data: it recognizes the input first feature map through a trained neural network model and outputs a processing result. The second task pool processes the image data in the same way, recognizing the input first feature map through a trained neural network model and outputting a processing result.
The first task pool is a speech neural network model, and the specific steps of feature extraction comprise:
End point detection divides the beginning and end of a sentence by distinguishing the signals of voiced segments, unvoiced segments and silent segments, yielding an effective voice sequence. Time-domain analysis of a speech signal clearly distinguishes the voiced, silent and unvoiced sections of the original speech, and endpoint detection finds the start and end points of the speech signal by telling these sections apart. The endpoint detection method in this embodiment is a double-threshold method, which judges the endpoints by computing the speech energy: a gate threshold energy $E_0$ is preset, the speech energy $E$ at each time is computed, and a threshold-sequence value of 1 is generated if $E \ge E_0$ and 0 if $E < E_0$. Multiplying the resulting threshold sequence by the original voice sequence gives the effective voice sequence.

The speech energy is computed as

$$E = \frac{1}{N} \sum_{k=1}^{N} D_k$$

where $E$ is the speech energy of the detection point, $D_k$ is the generalized decibel value of the $k$-th point, and $N$ is the number of detection points.
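A minimal sketch of the double-threshold endpoint detection, assuming per-frame mean energy as the energy measure and an arbitrary gate threshold e0 in place of the patent's preset value:

```python
import numpy as np

def endpoint_detect(speech, frame_len=400, e0=0.1):
    """Double-threshold endpoint detection sketch: compute the energy of
    each detection frame, build a 0/1 threshold sequence, and multiply it
    with the original sequence to keep only the effective voice."""
    n_frames = len(speech) // frame_len
    mask = np.zeros_like(speech)
    for i in range(n_frames):
        frame = speech[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)            # E for this detection point
        if energy >= e0:                        # threshold sequence: 1 / 0
            mask[i * frame_len:(i + 1) * frame_len] = 1
    return speech * mask                        # effective voice sequence
```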
Pre-emphasis, increasing the high-frequency energy of the effective voice sequence, improving the signal-to-noise ratio and obtaining an emphasized voice sequence. The voice information is often mingled with various other voice information in the environment, and due to the characteristic of human pronunciation, most of the voice information is often concentrated in a low frequency band after frequency conversion, so that the low frequency energy is too high, the high frequency energy is too low, and the high frequency voice information is difficult to effectively extract. The pre-emphasis adds high-frequency signals in advance, and after the high-frequency signals are overlapped with the original voice signals, the energy of the high-frequency band is equivalent to that of the low-frequency band, so that the subsequent recognition efficiency is obviously improved.
Framing and windowing segments the emphasized speech sequence at a set time interval and then filters the signal with a band-pass filter to reduce signal error, yielding a time-dependent frame sequence. A speech signal is non-stationary as a whole, but locally it can be assumed short-time stationary (the pronunciation of a phoneme can be considered approximately unchanged within 10-30 ms, typically 25 ms), so the whole speech signal must be framed. This embodiment windows each frame with a Hamming window; because the Hamming window emphasizes the middle of the frame and attenuates the data at both edges, adjacent windows must overlap. The window length in this embodiment is 25 ms and the step is 10 ms, i.e., the last 15 ms of each window overlaps the first 15 ms of the next window.
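A sketch of the framing and Hamming windowing with the 25 ms window and 10 ms step stated above (the band-pass filtering step is omitted); the 16 kHz sampling rate is an assumption:

```python
import numpy as np

def frame_and_window(x, fs=16000, win_ms=25, step_ms=10):
    """Split a pre-emphasized sequence into overlapping Hamming-windowed
    frames: 25 ms window, 10 ms step, so adjacent frames share 15 ms."""
    win, step = int(fs * win_ms / 1000), int(fs * step_ms / 1000)
    n_frames = 1 + (len(x) - win) // step
    window = np.hamming(win)
    return np.stack([x[i * step:i * step + win] * window
                     for i in range(n_frames)])
```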
Fast Fourier transform: the frame sequence is input into the fast Fourier transform hardware circuit, and the time-domain signal of each frame is converted into its frequency spectrum;
Feature vector extraction: the feature vector of the spectrum is extracted using Perceptual Linear Prediction (PLP) to generate the voice feature parameters;
Neural network recognition: the voice feature parameters are input into the neural network model and the speech recognition result is output.
The second task pool is an image neural network model, and the extracted features comprise:
and (3) a Histogram of Oriented Gradients (HOG) feature, which divides the image into small connected regions, and collects the gradient and the direction of the edge of each pixel point in the connected regions to form the HOG feature. Firstly, graying an input image; then, carrying out color space standardization on the input image by adopting a Gamma correction method, aiming at adjusting the contrast of the image, reducing the influence caused by local shadow and illumination change of the image and simultaneously inhibiting the interference of noise; calculating the gradient of each pixel of the image, aiming at capturing contour information and further weakening the interference of illumination; dividing the image into small cells, counting a gradient histogram of each cell, namely the characteristic of each cell, forming each cell into a block, connecting the characteristics of all the cells in the block in series to obtain the HOG characteristic of the block, and connecting the HOG characteristics of all the cells in the image in series to obtain the HOG characteristic of the image.
Local Binary Pattern (LBP) feature, used to describe the texture information of a picture region: the detection window is divided into 16 × 16 cells. Each pixel in a cell is compared with its 8 surrounding pixels; if a surrounding pixel value is greater than the central pixel value, that position is marked 1, otherwise 0. The 8 points in the 3x3 neighbourhood thus produce an 8-bit binary number by comparison, which is the LBP value of the window's central pixel. A histogram is then computed for each cell, i.e., the frequency of occurrence of each LBP value (taken as a decimal number), and normalized. Finally, the statistical histograms of all cells are concatenated into one feature vector, the LBP texture feature vector of the whole image.
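A from-scratch sketch of the basic 3x3 LBP computation and per-cell histogram concatenation described above; the grayscale input and non-overlapping 16-pixel cells are assumptions:

```python
import numpy as np

def lbp_histogram(gray, cell=16):
    """Compare each pixel with its 8 neighbours to form an 8-bit LBP code,
    then build, normalize and concatenate per-cell histograms into the
    texture feature vector of the whole image."""
    h, w = gray.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= gray[1:-1, 1:-1]).astype(np.uint8) << bit
    hists = []
    for y in range(0, codes.shape[0] - cell + 1, cell):
        for x in range(0, codes.shape[1] - cell + 1, cell):
            hist, _ = np.histogram(codes[y:y + cell, x:x + cell],
                                   bins=256, range=(0, 256))
            hists.append(hist / hist.sum())        # normalized histogram
    return np.concatenate(hists)
```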
Haar features are used to represent human faces in images: in an image containing a face, Haar features are extracted and used for face detection. This embodiment computes the Haar features with an integral image, which stores in memory, as elements of an array, the pixel sums of the rectangular regions from the image origin to every point; when the pixel sum of some region is needed, the array elements can be indexed directly without recomputing the region's pixel sum, which speeds up the computation (a dynamic programming technique). The integral image can compute different features at multiple scales in the same constant time, greatly improving detection speed.
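A sketch of the integral image and its constant-time rectangle-sum lookup, applied to a simple two-rectangle Haar-like edge feature (the 24x24 window size is an assumption):

```python
import numpy as np

def integral_image(img):
    """Integral image: entry (y, x) accumulates the pixel sum of the
    rectangle from the image origin to that point."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, height, width):
    """Pixel sum of any rectangle in 4 array lookups (constant time),
    using a zero guard row/column so edge rectangles work too."""
    p = np.pad(ii, ((1, 0), (1, 0)))
    b, r = top + height, left + width
    return p[b, r] - p[top, r] - p[b, left] + p[top, left]

# a two-rectangle Haar-like edge feature: left half minus right half
img = np.random.rand(24, 24)
ii = integral_image(img)
haar_edge = rect_sum(ii, 0, 0, 24, 12) - rect_sum(ii, 0, 12, 24, 12)
```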
The speech neural network model comprises a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). The convolutional neural network comprises a first convolutional layer, a pooling layer and a second convolutional layer connected in sequence: the first convolutional layer has 128 filters of size 1 × 9, with the horizontal stride set to 2 and the input channel set to 1; the pooling layer is a max-pooling layer of size 1 × 3 with a stride of 1; the second convolutional layer has 256 filters of size 1 × 4, with the horizontal stride set to 1 and the input channel set to 64.
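A PyTorch sketch of this convolutional front end. Note that the patent's stated input channel of 64 for the second layer does not compose with the 128 filters of the first layer; the sketch wires the second layer's input to 128 channels so the stack runs, and the input spectrogram dimensions are assumptions:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 128, kernel_size=(1, 9), stride=(1, 2)),   # 128 filters, 1x9
    nn.MaxPool2d(kernel_size=(1, 3), stride=1),             # 1x3 max pooling
    nn.Conv2d(128, 256, kernel_size=(1, 4), stride=(1, 1)), # 256 filters, 1x4
)

# e.g. a batch of 8 spectrogram-like inputs: 1 channel, 40 bands, 200 frames
out = cnn(torch.randn(8, 1, 40, 200))
print(out.shape)  # torch.Size([8, 256, 40, 91]) under these assumed dims
```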
The recurrent neural network splits the input data according to the time parameter, packs the split data into two matrices in order, and records the temporal features of the voice sequence with an LSTM. This embodiment learns the voice temporal features with a bidirectional LSTM (BiLSTM, bidirectional long short-term memory) structure, replacing the hidden layer of a bidirectional RNN with LSTM units; the features of the current speech can thus be expanded into a whole sequence picture using information from both the past and the future, achieving effective learning of the whole voice temporal feature and making the final prediction more accurate. The BiLSTM nodes propagate the first matrix forward and the second matrix backward, and output the speech recognition result. The number of BiLSTM nodes is preferably 2048, of which 1024 nodes connect to one matrix for forward propagation and the other 1024 nodes connect to the other matrix for backward propagation.
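A PyTorch sketch of the recurrent part: a bidirectional LSTM whose 1024 forward plus 1024 backward hidden units match the 2048 BiLSTM nodes above; the 256-dimensional per-frame input (matching the CNN's filter count) is an assumption:

```python
import torch
import torch.nn as nn

# 1024 hidden units per direction = 2048 BiLSTM nodes in total
bilstm = nn.LSTM(input_size=256, hidden_size=1024,
                 batch_first=True, bidirectional=True)

frames = torch.randn(8, 91, 256)        # (batch, time steps, features)
outputs, _ = bilstm(frames)             # forward+backward states concatenated
print(outputs.shape)                    # torch.Size([8, 91, 2048])
```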
The execution chip comprises a general processor and a neural network processor, and the neural network processor is used for executing the computational power balance execution method based on the voice and image characteristics.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the inventive concept of the present invention, and these changes and modifications are all within the scope of the present invention.

Claims (10)

1. A computing power balance execution method based on voice and image characteristics is characterized by comprising the following steps:
receiving data to be processed, wherein the data to be processed comprises voice data and image data;
preprocessing the data to be processed, wherein the preprocessing comprises A/D conversion and fast Fourier transform to generate a first feature map;
extracting the frequency bandwidth data of the first feature map, judging the type of the corresponding input signal, transmitting the voice signal to a first task pool, and transmitting the image signal to a second task pool for the next processing;
the first task pool and the second task pool respectively process their corresponding tasks and output processing results for subsequent tasks; the first task pool adopts a trained voice neural network model, the voice neural network model comprising a convolutional neural network and a recurrent neural network, the recurrent neural network recording the temporal features of the voice sequence through bidirectional long short-term memory; and the second task pool adopts a trained image neural network model, the image neural network model extracting the HOG feature, LBP feature and Haar feature of the image.
2. The method of claim 1, wherein the a/D conversion comprises sampling, quantization and encoding, and the input analog signal is converted into a digital signal.
3. The method of claim 1, wherein the fast fourier transform is implemented in a hardware circuit and adopts a pipeline-based fast fourier transform method.
4. The method as claimed in claim 1, wherein the first task pool processes the voice data, performs recognition on the input first feature map through a trained neural network model, and outputs a processing result.
5. The method as claimed in claim 1, wherein the second task pool processes the image data, performs recognition on the input first feature map through a trained neural network model, and outputs a processing result.
6. The method for performing computational power equalization based on speech and image features according to claim 4, wherein the first task pool is a speech neural network model, and the specific steps of feature extraction include:
end point detection, namely dividing the beginning and the end of a sentence by distinguishing the signals of voiced segments, unvoiced segments and silent segments to obtain an effective voice sequence;
pre-emphasis, namely increasing the high-frequency energy of the effective voice sequence, and improving the signal-to-noise ratio to obtain an emphasized voice sequence;
framing and windowing, segmenting the emphasized voice sequence according to a set time interval, and filtering signals by using a band-pass filter to reduce the error of the signals and obtain a frame sequence depending on time;
the fast Fourier transform, namely inputting the frame sequence into a fast Fourier transform hardware circuit and converting the time-domain signal of each frame into its frequency spectrum;
feature vector extraction, namely extracting the feature vector of the frequency spectrum using a perceptual linear prediction technique to generate voice feature parameters;
neural network recognition, namely inputting the voice feature parameters into a neural network model and outputting a voice recognition result.
7. The method of claim 5, wherein the second task pool is an image neural network model, and the extracted features comprise:
a histogram of oriented gradients feature: the picture is first divided into small connected regions, and the gradient magnitude and direction at each pixel within each region are then collected to form the histogram of oriented gradients feature;
a local binary pattern feature, used to describe the texture information of a picture region: the detection window is divided into 16x16 cells, each pixel in a cell is compared with its 8 surrounding pixels, a histogram is computed for each cell, and the statistical histograms of all cells are finally concatenated to form the local binary pattern feature;
a Haar feature, used to represent a human face in the image: in an image containing a face, the Haar feature is extracted and used for face detection.
8. The method of claim 6, wherein the speech neural network comprises a convolutional neural network and a recurrent neural network, the convolutional neural network comprising a first convolutional layer, a pooling layer and a second convolutional layer connected in sequence:
the first convolutional layer has 128 filters of size 1 × 9, with the horizontal stride set to 2 and the input channel set to 1;
the pooling layer is a max-pooling layer of size 1 × 3 with a stride of 1;
the second convolutional layer has 256 filters of size 1 × 4, with the horizontal stride set to 1 and the input channel set to 64.
9. The method of claim 8, wherein the recurrent neural network employs a long short-term memory structure and neural-network-based temporal sequence classification for speech recognition.
10. A computational power equalization execution chip based on speech and image features, characterized in that the execution chip comprises a general purpose processor and a neural network processor for performing the method of any of claims 1-9.
CN202211100689.6A 2022-09-09 2022-09-09 Computing power balance execution method and chip based on voice and image characteristics Active CN115328661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211100689.6A CN115328661B (en) 2022-09-09 2022-09-09 Computing power balance execution method and chip based on voice and image characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211100689.6A CN115328661B (en) 2022-09-09 2022-09-09 Computing power balance execution method and chip based on voice and image characteristics

Publications (2)

Publication Number Publication Date
CN115328661A true CN115328661A (en) 2022-11-11
CN115328661B CN115328661B (en) 2023-07-18

Family

ID=83929117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211100689.6A Active CN115328661B (en) 2022-09-09 2022-09-09 Computing power balance execution method and chip based on voice and image characteristics

Country Status (1)

Country Link
CN (1) CN115328661B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863408A (en) * 2023-09-04 2023-10-10 成都智慧城市信息技术有限公司 Parallel acceleration and dynamic scheduling implementation method based on monitoring camera AI algorithm

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572308A (en) * 2015-02-10 2015-04-29 飞狐信息技术(天津)有限公司 Computing resource distributing method, distributed type computing method and distributed type computing device
CN106681840A (en) * 2016-12-30 2017-05-17 郑州云海信息技术有限公司 Tasking scheduling method and device for cloud operating system
CN109471727A (en) * 2018-10-29 2019-03-15 北京金山云网络技术有限公司 A kind of task processing method, apparatus and system
CN109840287A (en) * 2019-01-31 2019-06-04 中科人工智能创新技术研究院(青岛)有限公司 A kind of cross-module state information retrieval method neural network based and device
CN110263653A (en) * 2019-05-23 2019-09-20 广东鼎义互联科技股份有限公司 A kind of scene analysis system and method based on depth learning technology
WO2019234291A1 (en) * 2018-06-08 2019-12-12 Nokia Technologies Oy An apparatus, a method and a computer program for selecting a neural network
CN111984407A (en) * 2020-08-07 2020-11-24 苏州浪潮智能科技有限公司 Data block read-write performance optimization method, system, terminal and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572308A (en) * 2015-02-10 2015-04-29 飞狐信息技术(天津)有限公司 Computing resource distributing method, distributed type computing method and distributed type computing device
CN106681840A (en) * 2016-12-30 2017-05-17 郑州云海信息技术有限公司 Tasking scheduling method and device for cloud operating system
WO2019234291A1 (en) * 2018-06-08 2019-12-12 Nokia Technologies Oy An apparatus, a method and a computer program for selecting a neural network
CN109471727A (en) * 2018-10-29 2019-03-15 北京金山云网络技术有限公司 A kind of task processing method, apparatus and system
CN109840287A (en) * 2019-01-31 2019-06-04 中科人工智能创新技术研究院(青岛)有限公司 A kind of cross-module state information retrieval method neural network based and device
CN110263653A (en) * 2019-05-23 2019-09-20 广东鼎义互联科技股份有限公司 A kind of scene analysis system and method based on depth learning technology
CN111984407A (en) * 2020-08-07 2020-11-24 苏州浪潮智能科技有限公司 Data block read-write performance optimization method, system, terminal and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ZHENGWEI HUANG: "Speech Emotion Recognition Using CNN" *
RU LITAO; YANG JIANGANG; SU YI: "Multi-feature fusion identity recognition system based on neural networks", no. 02 *
ZHENG WANRONG; XIE LINGYUN: "A survey of cross-modal processing methods for sound and image", Journal of Communication University of China (Natural Science Edition), no. 04 *
CHEN ZHAOYUN: "Research on deep learning task scheduling based on a small-scale GPU cluster platform", no. 01 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863408A (en) * 2023-09-04 2023-10-10 成都智慧城市信息技术有限公司 Parallel acceleration and dynamic scheduling implementation method based on monitoring camera AI algorithm
CN116863408B (en) * 2023-09-04 2023-11-21 成都智慧城市信息技术有限公司 Parallel acceleration and dynamic scheduling implementation method based on monitoring camera AI algorithm

Also Published As

Publication number Publication date
CN115328661B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN111292764B (en) Identification system and identification method
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
CN110136744B (en) Audio fingerprint generation method, equipment and storage medium
CN110718228B (en) Voice separation method and device, electronic equipment and computer readable storage medium
CN111627419B (en) Sound generation method based on underwater target and environmental information characteristics
CN109919295B (en) Embedded audio event detection method based on lightweight convolutional neural network
CN110796027A (en) Sound scene recognition method based on compact convolution neural network model
CN115328661B (en) Computing power balance execution method and chip based on voice and image characteristics
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN114332500A (en) Image processing model training method and device, computer equipment and storage medium
CN115132201A (en) Lip language identification method, computer device and storage medium
CN116705059B (en) Audio semi-supervised automatic clustering method, device, equipment and medium
CN113870896A (en) Motion sound false judgment method and device based on time-frequency graph and convolutional neural network
CN115331678A (en) Generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficient
CN115050350A (en) Label checking method and related device, electronic equipment and storage medium
CN112735436A (en) Voiceprint recognition method and voiceprint recognition system
CN117292437B (en) Lip language identification method, device, chip and terminal
CN117238298B (en) Method and system for identifying and positioning animals based on sound event
CN117636908B (en) Digital mine production management and control system
CN117786101A (en) Data auditing method and device, computer equipment and storage medium
CN115472147A (en) Language identification method and device
CN115762472A (en) Voice rhythm identification method, system, equipment and storage medium
CN117316167A (en) Rolling mill state identification method, device and equipment combining deep learning and clustering
CN114648777A (en) Pedestrian re-identification method, pedestrian re-identification training method and device
Gao et al. Environmental Sound Classification Using CNN Based on Mel-spectogram

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant