CN115328661A - Computing power balance execution method and chip based on voice and image characteristics - Google Patents
- Publication number
- CN115328661A (application CN202211100689.6A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- voice
- image
- data
- task pool
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5044—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/467—Encoded features or binary features, e.g. local binary patterns [LBP]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/50—Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5011—Pool
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a computing power balance execution method and chip based on voice and image characteristics, relating to the technical field of load balancing. The method comprises the following steps: receiving data to be processed, the data comprising voice data and image data; preprocessing the data to be processed, the preprocessing comprising A/D conversion and fast Fourier transform, to generate a first feature map; extracting the frequency bandwidth data of the first feature map, judging the type of the corresponding input signal, and transmitting the feature map to a first task pool or a second task pool for further processing; each task pool then processes its corresponding tasks and outputs the processing results. By preprocessing the data signals and routing different signals to corresponding neural networks for further processing, the computing power of each neural network can be designed separately, improving the processing efficiency of the neural network modules in the chip. Where the real-time requirements of signal processing are high, the latency of voice and image signal processing can be effectively reduced and the running speed improved.
Description
Technical Field
The invention relates to the technical field of load balancing, in particular to a computing power balancing execution method and a chip based on voice and image characteristics.
Background
Speech recognition and image recognition have many similarities in classifier design for pattern recognition: typical classifiers such as neural networks, SVMs (support vector machines) and deep learning models can be used for both. The main difference therefore lies in the feature extraction algorithms.
Relative to the sampling duration, the speech sampling frequency is very high; given the vibratory nature and persistence of sound waves, a speech signal is dense within voiced regions, so the information carried by a local stretch of signal, or by a few adjacent samples, is very small. Speech feature extraction therefore usually applies a sliding window and analyzes the statistical characteristics of the signal within that window, and feature extraction algorithms based on frequency-domain transforms are very common. The spatial frequency of an image, by contrast, is low relative to its size: images contain large smooth regions, features are distributed relatively sparsely, and local features carry most of the value. SIFT (scale-invariant feature transform), HOG (histogram of oriented gradients) and sparse coding, all widely used in recent years, are based on this idea. Image features place more emphasis on invariance to rotation, scaling, illumination and the like; compared with speech signals, image feature patterns are generally more complex, and redundant information is harder to separate.
For a chip that must process voice information and image information simultaneously, speech recognition and image recognition adopt different feature extraction algorithms with different computational loads. Reasonably distributing the different recognition tasks to corresponding neural network processing units can therefore effectively improve the operating efficiency of the neural networks.
In the prior art, computing power scheduling for AI chips mainly estimates computing resources from the operation information of a set of operators and the computing capability of the devices executing them; alternatively, the computation amount of each task is preset, and a new task is dispatched to a computing power target if it is judged to fit that target's preset capacity. However, because different neural network processing modules handle different instructions and tasks, computing resources cannot be estimated simply from operator information, and such load balancing schemes are not suited to task scheduling across different neural network processing modules. A task scheduling and load balancing execution method tailored to different neural network processing modules therefore needs to be specially designed.
Disclosure of Invention
The invention provides a computing power balance execution method and a chip based on voice and image characteristics, and aims to solve the problem of task scheduling between neural network modules needing to process voice information and image information simultaneously in the prior art.
In order to solve the technical problems, the specific scheme of the invention is as follows:
a computing power balance execution method based on voice and image characteristics comprises the following steps:
receiving data to be processed, wherein the data to be processed comprises voice data and image data;
preprocessing the data to be processed, wherein the preprocessing comprises A/D conversion and fast Fourier transform, generating a first feature map;
extracting the frequency bandwidth data of the first feature map, judging the type of the corresponding input signal, and transmitting the feature map to the first task pool or the second task pool for the next processing;
and the first task pool and the second task pool respectively process their corresponding tasks and output processing results for subsequent tasks.
Preferably, the A/D conversion includes sampling, quantization and encoding, and converts an input analog signal into a digital signal.
Preferably, the fast fourier transform is implemented by a hardware circuit, and a pipeline-based fast fourier transform method is adopted.
Preferably, the first task pool processes the voice data, recognizing the input first feature map with a trained neural network model and outputting a processing result.
Preferably, the second task pool processes the image data, recognizing the input first feature map with a trained neural network model and outputting a processing result.
Preferably, the first task pool is a speech neural network model, and the specific steps of feature extraction include:
end point detection, namely dividing the beginning and the end of a sentence by distinguishing signals of voiced segments, unvoiced segments and silent segments to obtain an effective voice sequence;
pre-emphasis, namely increasing the high-frequency energy of the effective voice sequence and improving the signal-to-noise ratio to obtain an emphasized voice sequence;
framing and windowing, namely segmenting the emphasized voice sequence at set time intervals and filtering the signal with a band-pass filter to reduce error, obtaining a time-dependent frame sequence;
fast Fourier transform, namely inputting the frame sequence into a fast Fourier transform hardware circuit and converting the time-domain signal into a frequency spectrum for each frame;
extracting a characteristic vector, namely extracting the characteristic vector of the frequency spectrum by using a perceptual linear prediction technology to generate a voice characteristic parameter;
neural network recognition, namely inputting the voice characteristic parameters into a neural network model and outputting a voice recognition result.
Preferably, the second task pool is an image neural network model, and the extracted features include:
the histogram of oriented gradients feature: the picture is first divided into small connected regions, and then the gradient magnitude and direction at each pixel in each connected region are collected to form the histogram of oriented gradients feature;
the local binary pattern feature, used to describe the texture information of a picture region: the detection window is divided into cells of 16×16 pixels, each pixel in a cell is compared with its 8 surrounding pixels, the histogram of each cell is computed, and the statistical histograms of all cells are then concatenated to form the local binary pattern feature;
harr characteristics are used for representing the human face in the image, and in the image with the human face, the Harr characteristics are extracted and used for detecting the human face.
Preferably, the speech neural network comprises a convolutional neural network and a cyclic neural network, and the convolutional neural network comprises a first convolutional layer, a pooling layer and a second convolutional layer which are connected in sequence:
the first convolution layer is 128 filters with the size of 1 × 9, the transverse step is set to 2, and the channel is set to 1;
the pooling layer is the largest pooling layer with the size of 1 × 3, and the step length is set to be 1;
the second convolutional layer is 256 filters with the size of 1 × 4, the transverse step size is set to 1, and the channel is set to 64;
Preferably, the recurrent neural network employs a long short-term memory structure and connectionist temporal classification (CTC) for speech recognition.
A computing power balance execution chip based on voice and image features comprises a general-purpose processor and a neural network processor, wherein the neural network processor is used for executing the computing power balance execution method based on the voice and image features.
Compared with the prior art, the invention has the following technical effects:
1. By preprocessing the data signals, distinguishing different signal characteristics, and transmitting each signal to its corresponding neural network for further processing, the computing power of each neural network can be designed separately, improving the processing efficiency of the neural network modules in the chip.
2. Different neural network models can be designed for different scenarios; where the real-time requirements of voice and image signal processing are high, the processing latency can be effectively reduced, the waiting time of subsequent steps shortened, and the running speed of the chip improved.
Drawings
FIG. 1 is a flow chart of a method for performing computational power equalization based on speech and image features according to the present invention;
FIG. 2 is a flow chart of the preprocessing steps of a method for performing the computational power equalization based on speech and image features according to the present invention;
FIG. 3 is a flow chart of the radix-2 FFT method of the computing power balance execution method based on voice and image features.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail and completely with reference to the accompanying drawings.
As shown in fig. 1, a computing power balance execution method based on voice and image features includes the following steps:
receiving data to be processed, wherein the data to be processed comprises voice data and image data;
the data to be processed is preprocessed, as shown in fig. 2, the preprocessing includes a/D conversion and Fast Fourier Transform (FFT), and a first feature map is generated. The a/D conversion includes sampling, quantization and encoding, and converts an input analog signal into a digital signal. For a voice signal, discretizing a continuously changing voice in time on a line of a plane is called a voice signal sample, and for an image signal, discretizing a continuously changing image in time on a plane of a space is called an image signal sample; the quantization adopts optimal quantization and vector quantization.
The frequency bandwidth data of the first feature map is extracted and the type of the corresponding input signal is judged: the bandwidth of an image signal can reach 6.5 MHz, while the bandwidth of a voice signal lies only between 10 Hz and 20 kHz, so voice and image signals can be clearly distinguished by their bandwidth. The voice signal feature map is transmitted to the first task pool, and the image signal feature map to the second task pool, for the next stage of processing; the first and second task pools each process their corresponding tasks and output processing results for subsequent tasks.
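The bandwidth-based routing rule can be sketched in software as below. The function names, the single 20 kHz threshold, and the 99%-energy occupied-bandwidth estimator are illustrative assumptions, not part of the claims; the chip itself would implement this decision in hardware.

```python
import numpy as np

def classify_by_bandwidth(bandwidth_hz):
    """Route a preprocessed feature map to a task pool by signal bandwidth.

    Speech occupies roughly 10 Hz - 20 kHz while an image/video signal can
    span up to about 6.5 MHz, so a single threshold between the two ranges
    separates them reliably.
    """
    SPEECH_MAX_HZ = 20_000          # upper edge of the speech band
    if bandwidth_hz <= SPEECH_MAX_HZ:
        return "first_task_pool"    # speech neural network
    return "second_task_pool"       # image neural network

def estimate_bandwidth(spectrum, sample_rate, energy_fraction=0.99):
    """Estimate occupied bandwidth as the frequency below which a given
    fraction of the total spectral energy lies (one common simple estimator)."""
    power = np.abs(spectrum[: len(spectrum) // 2]) ** 2
    cumulative = np.cumsum(power) / power.sum()
    idx = np.searchsorted(cumulative, energy_fraction)
    return idx * sample_rate / len(spectrum)
```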
The fast Fourier transform is implemented by a hardware circuit using a pipeline-based FFT method. The FFT decomposes a long transform into a combination of several small-point FFTs, each decomposition reducing the amount of computation. In general a radix-2 FFT of length N = 2^L is used; when the length of the sequence to be transformed is not an integer power of 2, the radix-2 FFT is still used and zeros are padded at the tail to extend the length to an integer power of 2 — this is the radix-2 FFT method. This embodiment specifically adopts the radix-2 FFT method; fig. 3 shows its flow chart. Each stage of the operation requires N/2 radix-2 butterflies (N being the number of input points of that stage), and each butterfly operation comprises 1 complex multiplication and 2 complex additions, so a 4096-point FFT requires log2(4096) = 12 stages in total. The butterfly computation is the core of the FFT; the radix-2 butterflies of each stage use in-place addressed computation: the data output by the previous stage is first written into a data storage RAM, the butterfly unit reads the data from this RAM during computation, the intermediate results are written back into the same RAM, overwriting the input data, until all butterflies of the stage are finished, and the final result is output to the next stage.
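The zero-padding rule and the stage structure described above can be sketched as a recursive radix-2 decimation-in-time FFT. This is a software model of the algorithm only (the patent's version runs in a pipelined hardware circuit with in-place RAM addressing); the function names are illustrative.

```python
import numpy as np

def next_pow2(n):
    """Smallest power of two >= n (the radix-2 FFT length rule)."""
    p = 1
    while p < n:
        p *= 2
    return p

def radix2_fft(x):
    """Radix-2 decimation-in-time FFT; zero-pads when len(x) is not 2**l.

    Each of the log2(N) stages performs N/2 butterflies, and every
    butterfly costs 1 complex multiply and 2 complex adds, as stated in
    the text (a 4096-point FFT therefore has log2(4096) = 12 stages).
    """
    x = np.asarray(x, dtype=complex)
    n = next_pow2(len(x))
    x = np.concatenate([x, np.zeros(n - len(x), dtype=complex)])
    if n == 1:
        return x
    even = radix2_fft(x[0::2])          # even-index subsequence
    odd = radix2_fft(x[1::2])           # odd-index subsequence
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    # Butterfly: X[k] = E[k] + w^k O[k], X[k+n/2] = E[k] - w^k O[k]
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])
```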
The first task pool processes the voice data, recognizing the input first feature map with a trained neural network model and outputting the processing result. The second task pool processes the image data, likewise recognizing the input first feature map with a trained neural network model and outputting the processing result.
The first task pool is a speech neural network model, and the specific steps of feature extraction comprise:
End point detection divides the beginning and end of a sentence by distinguishing the signals of voiced segments, unvoiced segments and silent segments, obtaining an effective voice sequence. Time-domain analysis of the speech signal clearly separates the original speech into voiced, silent and unvoiced sections, and endpoint detection finds the starting point and ending point of the speech signal by distinguishing the signals of these different sections. The endpoint detection method in this embodiment is a double-threshold method, which judges the endpoints of the speech by computing the speech energy: a threshold energy is preset for the double gates, the speech energy at each instant is then computed, and a threshold sequence is generated whose value is 1 where the energy exceeds the threshold and 0 otherwise; multiplying this threshold sequence by the original voice sequence yields the effective voice sequence.
The endpoint detection method is preferably a double-threshold method, wherein the speech energy is calculated as

E = (1/N) · Σ(i = 1..N) d(i)

where E is the speech energy of the detection point, d(i) is the speech generalized decibel value of the i-th point, and N is the number of all detection points.
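The threshold-sequence decision described above can be sketched as follows. This is a simplified single-gate illustration (the embodiment uses a double-threshold method with two gates); the function name, per-frame mean-square energy, and threshold value are illustrative assumptions.

```python
import numpy as np

def endpoint_mask(frames, threshold):
    """Simplified endpoint detection: per-frame short-time energy vs. a
    preset threshold.

    `frames` is a 2-D array (num_frames, frame_len). Frames whose energy
    exceeds the threshold get a 1 in the threshold sequence, others a 0;
    multiplying the mask into the signal keeps only the effective speech.
    """
    energy = (frames.astype(float) ** 2).mean(axis=1)   # energy per frame
    return (energy > threshold).astype(int)
```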
Pre-emphasis increases the high-frequency energy of the effective voice sequence, improving the signal-to-noise ratio and yielding an emphasized voice sequence. Speech is often mixed with other sounds in the environment, and owing to the characteristics of human pronunciation most of the speech energy is concentrated in the low band after frequency transformation, leaving the low-frequency energy too high and the high-frequency energy too low, so high-frequency speech information is difficult to extract effectively. Pre-emphasis boosts the high-frequency signal in advance; after superposition with the original speech signal, the high-band energy becomes comparable to the low-band energy, which markedly improves subsequent recognition efficiency.
Framing and windowing segment the emphasized speech sequence at set time intervals, after which a band-pass filter filters the signal to reduce error, yielding a time-dependent frame sequence. A speech signal is non-stationary as a whole, but locally it can be assumed stationary for a short time (the pronunciation of a phoneme can be considered approximately unchanged within 10-30 ms, typically 25 ms), so the whole speech signal must be framed. This embodiment uses a Hamming window for windowing; because the Hamming window emphasizes the central samples and attenuates the data at both edges, adjacent windows must overlap. The window length in this embodiment is 25 ms with a step of 10 ms, i.e. the last 15 ms of each window overlaps the first 15 ms of the next window.
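The 25 ms / 10 ms framing-and-windowing scheme above can be sketched as follows (the band-pass filtering step is omitted for brevity; the function name and parameters are illustrative):

```python
import numpy as np

def frame_signal(x, sample_rate, win_ms=25, hop_ms=10):
    """Split a 1-D speech signal into overlapping Hamming-windowed frames.

    With a 25 ms window and a 10 ms hop, consecutive windows overlap by
    15 ms, matching the scheme described in the text.
    """
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    assert len(x) >= win, "signal shorter than one window"
    n_frames = 1 + (len(x) - win) // hop
    window = np.hamming(win)
    # Each frame is a windowed slice starting at multiples of the hop size.
    return np.stack([x[i * hop : i * hop + win] * window
                     for i in range(n_frames)])
```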
Fast Fourier transform: the frame sequence is input into the fast Fourier transform hardware circuit, and the time-domain signal is converted into a frequency spectrum for each frame;
extracting a feature vector, namely extracting the feature vector of a frequency spectrum by using a Perceptual Linear Prediction (PLP) technology to generate a voice feature parameter;
and (4) neural network recognition, namely inputting the voice characteristic parameters into a neural network model and outputting a voice recognition result.
The second task pool is an image neural network model, and the extracted features comprise:
and (3) a Histogram of Oriented Gradients (HOG) feature, which divides the image into small connected regions, and collects the gradient and the direction of the edge of each pixel point in the connected regions to form the HOG feature. Firstly, graying an input image; then, carrying out color space standardization on the input image by adopting a Gamma correction method, aiming at adjusting the contrast of the image, reducing the influence caused by local shadow and illumination change of the image and simultaneously inhibiting the interference of noise; calculating the gradient of each pixel of the image, aiming at capturing contour information and further weakening the interference of illumination; dividing the image into small cells, counting a gradient histogram of each cell, namely the characteristic of each cell, forming each cell into a block, connecting the characteristics of all the cells in the block in series to obtain the HOG characteristic of the block, and connecting the HOG characteristics of all the cells in the image in series to obtain the HOG characteristic of the image.
The local binary pattern (LBP) feature describes the texture information of a picture region. The detection window is divided into cells of 16×16 pixels. Each pixel in a cell is compared with its 8 surrounding pixels: if a surrounding pixel value is greater than the central pixel value, that position is marked 1, otherwise 0, so the 8 points in a 3×3 neighborhood yield an 8-bit binary number through comparison, the LBP value of the window's central pixel. A histogram is then computed for each cell, i.e. the frequency of occurrence of each LBP value (treated as a decimal number), and the histogram is normalized. Finally the statistical histograms of all cells are concatenated into one feature vector, the LBP texture feature vector of the whole image.
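The single-pixel LBP comparison above can be sketched as follows. The bit ordering (clockwise from the top-left neighbor) is a convention, not mandated by the text, and the strict greater-than comparison follows the description above; the function name is illustrative.

```python
import numpy as np

def lbp_value(patch):
    """8-bit LBP code for the center of a 3x3 patch: each neighbor greater
    than the center contributes a 1 bit; bit order is clockwise from the
    top-left neighbor (a convention)."""
    center = patch[1, 1]
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum((1 << i) for i, v in enumerate(neighbors) if v > center)
```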
Haar features represent the human faces in images; in images containing a face, Haar features are extracted and used for face detection. This embodiment computes the Haar features with an integral image: the integral image stores, as elements of an in-memory array, the sums of the pixels in the rectangular regions from the image origin to each point; when the pixel sum of some region is needed, the array elements can be indexed directly without recomputing the region's pixel sum, which speeds up the computation (a dynamic programming technique). The integral image can compute different features at multiple scales in the same constant time, greatly improving detection speed.
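The integral-image trick above amounts to a summed-area table with four lookups per region, sketched below (function names are illustrative; a real Haar detector would combine several such region sums per feature):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border: ii[y, x] = sum of img[:y, :x]."""
    return np.pad(img.cumsum(0).cumsum(1), ((1, 0), (1, 0)))

def region_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] in O(1) via four table lookups,
    the trick that makes multi-scale Haar feature evaluation fast."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]
```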
The speech neural network model comprises a convolutional neural network (CNN) and a recurrent neural network (RNN); the convolutional neural network comprises a first convolutional layer, a pooling layer and a second convolutional layer connected in sequence: the first convolutional layer has 128 filters of size 1×9 with horizontal stride 2 and channels set to 1; the pooling layer is a max pooling layer of size 1×3 with stride 1; the second convolutional layer has 256 filters of size 1×4 with horizontal stride 1 and channels set to 64.
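The layer sizes above determine the feature-map width at each stage via the standard convolution output-size formula. The sketch below applies it to an assumed input row of 400 samples (the input length is not given in the text and is purely illustrative):

```python
def conv_out(width, kernel, stride, pad=0):
    """Standard (no-dilation) output-width formula for conv/pooling layers:
    floor((width + 2*pad - kernel) / stride) + 1."""
    return (width + 2 * pad - kernel) // stride + 1

# Width of a hypothetical 400-sample feature row after the three layers:
w = conv_out(400, kernel=9, stride=2)   # first conv, 1x9 filters, stride 2
w = conv_out(w, kernel=3, stride=1)     # 1x3 max pooling, stride 1
w = conv_out(w, kernel=4, stride=1)     # second conv, 1x4 filters, stride 1
```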
The recurrent neural network splits the input data by its time parameter and packs the split data, in order, into two matrices, using LSTM to record the temporal features of the speech sequence. This embodiment learns the speech temporal features with a bidirectional LSTM (BiLSTM, bidirectional long short-term memory) structure, replacing the hidden layers of a bidirectional RNN with LSTM units, so that information from both the past and the future can expand the features of the current speech into a picture of the whole sequence, effectively learning the temporal features of the entire utterance and making the final prediction more accurate. BiLSTM nodes propagate the first matrix forward and the second matrix backward, and output the speech recognition result. The number of BiLSTM nodes is preferably 2048, of which 1024 nodes connect only to one matrix for forward propagation and the other 1024 connect to the other matrix for backward propagation.
The execution chip comprises a general processor and a neural network processor, and the neural network processor is used for executing the computational power balance execution method based on the voice and image characteristics.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the inventive concept of the present invention, and these changes and modifications are all within the scope of the present invention.
Claims (10)
1. A computing power balance execution method based on voice and image characteristics is characterized by comprising the following steps:
receiving data to be processed, wherein the data to be processed comprises voice data and image data;
preprocessing the data to be processed, wherein the preprocessing comprises A/D conversion and fast Fourier transform, generating a first feature map;
extracting the frequency bandwidth data of the first feature map, judging the type of the corresponding input signal, transmitting the voice signal to a first task pool and the image signal to a second task pool for the next processing;
the first task pool and the second task pool respectively process their corresponding tasks and output processing results for subsequent tasks; the first task pool adopts a trained speech neural network model comprising a convolutional neural network and a recurrent neural network, the recurrent neural network recording the temporal features of the speech sequence through bidirectional long short-term memory; and the second task pool adopts a trained image neural network model that extracts the HOG features, LBP features and Haar features of the image.
2. The method of claim 1, wherein the a/D conversion comprises sampling, quantization and encoding, and the input analog signal is converted into a digital signal.
3. The method of claim 1, wherein the fast fourier transform is implemented in a hardware circuit and adopts a pipeline-based fast fourier transform method.
4. The method as claimed in claim 1, wherein the first task pool processes the voice data, performs recognition on the input first feature map with a trained neural network model, and outputs a processing result.
5. The method as claimed in claim 1, wherein the second task pool processes the image data, performs recognition on the input first feature map with a trained neural network model, and outputs a processing result.
6. The method for performing computing power equalization based on voice and image features according to claim 4, wherein the first task pool uses the speech neural network model, and feature extraction comprises the following steps:
endpoint detection: dividing the beginning and end of a sentence by distinguishing silent, unvoiced and voiced segments to obtain a valid speech sequence;
pre-emphasis: increasing the high-frequency energy of the valid speech sequence to improve the signal-to-noise ratio, obtaining an emphasized speech sequence;
framing and windowing: segmenting the emphasized speech sequence at a set time interval and filtering the signal with a band-pass filter to reduce signal error, obtaining a time-dependent frame sequence;
fast Fourier transform: inputting the frame sequence into the fast Fourier transform hardware circuit to convert the time-domain signal into a per-frame spectrum;
feature vector extraction: extracting feature vectors from the spectrum using perceptual linear prediction (PLP) to generate speech feature parameters;
neural network recognition: inputting the speech feature parameters into the neural network model and outputting a speech recognition result.
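The pre-emphasis and framing/windowing steps of claim 6 can be sketched as follows. The filter coefficient 0.97, the Hamming window, and the frame sizes are common defaults assumed for illustration; the patent fixes none of these values, and the claim's band-pass filtering is approximated here by windowing alone.

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """Boost high-frequency energy: y[n] = x[n] - alpha * x[n-1].
    alpha = 0.97 is a conventional choice, not taken from the patent."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_and_window(signal, frame_len, hop):
    """Split the emphasized sequence into overlapping fixed-length
    frames and apply a Hamming window to each frame."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([
            s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1)))
            for i, s in enumerate(frame)
        ])
    return frames
```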
7. The method of claim 5, wherein the second task pool uses the image neural network model, and the extracted features comprise:
histogram of oriented gradients (HOG) features: the picture is first divided into small connected regions, then the gradient magnitude and direction at each pixel in each connected region are accumulated to form the histogram of oriented gradients feature;
local binary pattern (LBP) features, used to describe the texture of an image region: the detection window is divided into 16x16 cells, each pixel in a cell is compared with its 8 surrounding pixels, a histogram is computed for each cell, and the per-cell statistical histograms are finally concatenated to form the local binary pattern feature;
Haar features, used to represent a human face in the image: in an image containing a face, Haar features are extracted and used for face detection.
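Of the three descriptors in claim 7, the per-pixel LBP comparison is the simplest to show in code. This is a minimal sketch of the 8-neighbour comparison only (the cell partitioning and histogram concatenation of the claim are omitted); the neighbour ordering is an arbitrary assumption.

```python
def lbp_code(img, r, c):
    """8-neighbour local binary pattern code for pixel (r, c):
    each neighbour whose intensity is >= the centre contributes one bit."""
    centre = img[r][c]
    # Neighbours in clockwise order starting at the top-left (assumed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code
```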
8. The method of claim 6, wherein the speech neural network comprises a convolutional neural network and a recurrent neural network, the convolutional neural network comprising a first convolutional layer, a pooling layer and a second convolutional layer connected in sequence:
the first convolutional layer has 128 filters of size 1 × 9, with horizontal stride set to 2 and channel count set to 1;
the pooling layer is a max-pooling layer of size 1 × 3 with stride set to 1;
the second convolutional layer has 256 filters of size 1 × 4, with horizontal stride set to 1 and channel count set to 64.
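The feature-length produced by the stack in claim 8 can be traced with the standard valid-convolution output formula. The patent does not state a padding scheme, so 'valid' (no padding) is assumed here, and the input length is illustrative.

```python
def conv1d_out(length, kernel, stride):
    """Output length of a 'valid' (unpadded) 1-D convolution or pooling
    layer: floor((length - kernel) / stride) + 1."""
    return (length - kernel) // stride + 1

def speech_cnn_shape(input_len):
    """Trace the feature length through the stack of claim 8:
    conv 1x9 stride 2 -> max-pool 1x3 stride 1 -> conv 1x4 stride 1."""
    l1 = conv1d_out(input_len, 9, 2)   # first conv layer, 128 filters
    l2 = conv1d_out(l1, 3, 1)          # max-pooling layer
    l3 = conv1d_out(l2, 4, 1)          # second conv layer, 256 filters
    return l3
```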
9. The method of claim 8, wherein the recurrent neural network employs a long short-term memory (LSTM) structure and connectionist temporal classification (CTC) for speech recognition.
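CTC, named in claim 9, decodes frame-level network outputs by collapsing repeats and removing blanks. This sketch shows only that collapsing rule on an already-chosen label path (greedy decoding); the blank index 0 is an assumption.

```python
def ctc_collapse(path, blank=0):
    """Map a frame-level label path to an output sequence the CTC way:
    merge consecutive repeats, then drop blank labels."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```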
10. A computing power balance execution chip based on voice and image features, characterized in that the execution chip comprises a general-purpose processor and a neural network processor, the neural network processor being configured to perform the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211100689.6A CN115328661B (en) | 2022-09-09 | 2022-09-09 | Computing power balance execution method and chip based on voice and image characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211100689.6A CN115328661B (en) | 2022-09-09 | 2022-09-09 | Computing power balance execution method and chip based on voice and image characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115328661A true CN115328661A (en) | 2022-11-11 |
CN115328661B CN115328661B (en) | 2023-07-18 |
Family
ID=83929117
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211100689.6A Active CN115328661B (en) | 2022-09-09 | 2022-09-09 | Computing power balance execution method and chip based on voice and image characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115328661B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116863408A (en) * | 2023-09-04 | 2023-10-10 | 成都智慧城市信息技术有限公司 | Parallel acceleration and dynamic scheduling implementation method based on monitoring camera AI algorithm |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104572308A (en) * | 2015-02-10 | 2015-04-29 | 飞狐信息技术(天津)有限公司 | Computing resource distributing method, distributed type computing method and distributed type computing device |
CN106681840A (en) * | 2016-12-30 | 2017-05-17 | 郑州云海信息技术有限公司 | Tasking scheduling method and device for cloud operating system |
CN109471727A (en) * | 2018-10-29 | 2019-03-15 | 北京金山云网络技术有限公司 | A kind of task processing method, apparatus and system |
CN109840287A (en) * | 2019-01-31 | 2019-06-04 | 中科人工智能创新技术研究院(青岛)有限公司 | A kind of cross-module state information retrieval method neural network based and device |
CN110263653A (en) * | 2019-05-23 | 2019-09-20 | 广东鼎义互联科技股份有限公司 | A kind of scene analysis system and method based on depth learning technology |
WO2019234291A1 (en) * | 2018-06-08 | 2019-12-12 | Nokia Technologies Oy | An apparatus, a method and a computer program for selecting a neural network |
CN111984407A (en) * | 2020-08-07 | 2020-11-24 | 苏州浪潮智能科技有限公司 | Data block read-write performance optimization method, system, terminal and storage medium |
2022
- 2022-09-09 CN CN202211100689.6A patent/CN115328661B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104572308A (en) * | 2015-02-10 | 2015-04-29 | 飞狐信息技术(天津)有限公司 | Computing resource distributing method, distributed type computing method and distributed type computing device |
CN106681840A (en) * | 2016-12-30 | 2017-05-17 | 郑州云海信息技术有限公司 | Tasking scheduling method and device for cloud operating system |
WO2019234291A1 (en) * | 2018-06-08 | 2019-12-12 | Nokia Technologies Oy | An apparatus, a method and a computer program for selecting a neural network |
CN109471727A (en) * | 2018-10-29 | 2019-03-15 | 北京金山云网络技术有限公司 | A kind of task processing method, apparatus and system |
CN109840287A (en) * | 2019-01-31 | 2019-06-04 | 中科人工智能创新技术研究院(青岛)有限公司 | A kind of cross-module state information retrieval method neural network based and device |
CN110263653A (en) * | 2019-05-23 | 2019-09-20 | 广东鼎义互联科技股份有限公司 | A kind of scene analysis system and method based on depth learning technology |
CN111984407A (en) * | 2020-08-07 | 2020-11-24 | 苏州浪潮智能科技有限公司 | Data block read-write performance optimization method, system, terminal and storage medium |
Non-Patent Citations (4)
Title |
---|
ZHENGWEI HUANG: "Speech Emotion Recognition Using CNN" * |
茹黎涛, 杨建刚, 苏奕: "Multi-feature fusion identity recognition system based on neural networks", no. 02 *
郑婉蓉; 谢凌云: "A survey of cross-modal sound-image processing methods", Journal of Communication University of China (Natural Science Edition), no. 04 *
陈照云: "Research on deep learning task scheduling on a small-scale GPU cluster platform", no. 01 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116863408A (en) * | 2023-09-04 | 2023-10-10 | 成都智慧城市信息技术有限公司 | Parallel acceleration and dynamic scheduling implementation method based on monitoring camera AI algorithm |
CN116863408B (en) * | 2023-09-04 | 2023-11-21 | 成都智慧城市信息技术有限公司 | Parallel acceleration and dynamic scheduling implementation method based on monitoring camera AI algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN115328661B (en) | 2023-07-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111292764B (en) | Identification system and identification method | |
CN107393526B (en) | Voice silence detection method, device, computer equipment and storage medium | |
CN110136744B (en) | Audio fingerprint generation method, equipment and storage medium | |
CN110718228B (en) | Voice separation method and device, electronic equipment and computer readable storage medium | |
CN111627419B (en) | Sound generation method based on underwater target and environmental information characteristics | |
CN109919295B (en) | Embedded audio event detection method based on lightweight convolutional neural network | |
CN110796027A (en) | Sound scene recognition method based on compact convolution neural network model | |
CN115328661B (en) | Computing power balance execution method and chip based on voice and image characteristics | |
CN114694255B (en) | Sentence-level lip language recognition method based on channel attention and time convolution network | |
CN114332500A (en) | Image processing model training method and device, computer equipment and storage medium | |
CN115132201A (en) | Lip language identification method, computer device and storage medium | |
CN116705059B (en) | Audio semi-supervised automatic clustering method, device, equipment and medium | |
CN113870896A (en) | Motion sound false judgment method and device based on time-frequency graph and convolutional neural network | |
CN115331678A (en) | Generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficient | |
CN115050350A (en) | Label checking method and related device, electronic equipment and storage medium | |
CN112735436A (en) | Voiceprint recognition method and voiceprint recognition system | |
CN117292437B (en) | Lip language identification method, device, chip and terminal | |
CN117238298B (en) | Method and system for identifying and positioning animals based on sound event | |
CN117636908B (en) | Digital mine production management and control system | |
CN117786101A (en) | Data auditing method and device, computer equipment and storage medium | |
CN115472147A (en) | Language identification method and device | |
CN115762472A (en) | Voice rhythm identification method, system, equipment and storage medium | |
CN117316167A (en) | Rolling mill state identification method, device and equipment combining deep learning and clustering | |
CN114648777A (en) | Pedestrian re-identification method, pedestrian re-identification training method and device | |
Gao et al. | Environmental Sound Classification Using CNN Based on Mel-spectogram |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||