CN110176248B - Road voice recognition method, system, computer device and readable storage medium - Google Patents
- Publication number
- CN110176248B (application CN201910436946.5A)
- Authority
- CN
- China
- Prior art keywords
- sound
- sample
- data
- road
- data sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention discloses a road sound recognition method, system, computer device and readable storage medium. The method comprises the following steps: acquiring data samples and sample categories of road sound; sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data samples to obtain the logarithmic Mel features of the data samples; inputting the logarithmic Mel features, together with the sample categories, into a convolutional recurrent network model for training until the model meets a preset training end condition; and recognizing and classifying sound data to be processed using the trained model. By taking the logarithmic Mel features of road sound data samples as the input of a convolutional recurrent network model, the method trains a model capable of recognizing multiple sounds in a complex traffic scene, which helps improve the accuracy of road traffic incident detection.
Description
Technical Field
The invention relates to the technical field of sound recognition, and in particular to a road sound recognition method, a road sound recognition system, a computer device and a readable storage medium.
Background
With the development of science and technology, intelligent transportation has gradually become an important means of road monitoring, and traffic incident detection technology has been developed to make road monitoring more intelligent. Traditional road traffic incident detection relies mainly on video detection. Video detection, however, is directional and struggles to identify events correctly under severe weather, poor lighting or lens contamination, so its detection accuracy cannot be guaranteed.
The inventors have found that road sounds carry a great deal of event information: many sounds, such as car horns, engine noise and vehicle collisions, may be present on the road simultaneously, and if road sounds could be recognized effectively, the accuracy of traffic incident detection would be greatly improved. The inventors have also found that most existing sound recognition research is limited to recognizing only the most prominent event at a given moment, such as laughter or applause, losing all other event information; this clearly does not meet the requirements of sound recognition in a complex scene such as a road.
Disclosure of Invention
In view of the above, the present invention provides a road sound recognition method, system, computer device and readable storage medium that train a model capable of recognizing multiple sounds on a road; the invention is suited to sound recognition in complex road traffic scenes and helps improve the accuracy of road traffic incident detection.
In a first aspect, the present invention provides a road sound recognition method, comprising:
acquiring data samples and sample categories of road sound;
sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data samples to obtain the logarithmic Mel features of the data samples;
inputting the logarithmic Mel features, together with the sample categories, into a convolutional recurrent network model for training until the model meets a preset training end condition; and
recognizing and classifying sound data to be processed using the trained convolutional recurrent network model.
In a second aspect, the present invention provides a road sound recognition system, comprising:
a sample acquisition module for acquiring data samples and sample categories of road sound;
a feature extraction module for sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data samples to obtain the logarithmic Mel features of the data samples;
a training module for inputting the logarithmic Mel features, together with the sample categories, into a convolutional recurrent network model for training until the model meets a preset training end condition; and
a classification module for recognizing and classifying sound data to be processed using the trained convolutional recurrent network model.
In a third aspect, the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring data samples and sample categories of road sound;
sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data samples to obtain the logarithmic Mel features of the data samples;
inputting the logarithmic Mel features, together with the sample categories, into a convolutional recurrent network model for training until the model meets a preset training end condition; and
recognizing and classifying sound data to be processed using the trained convolutional recurrent network model.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring data samples and sample categories of road sound;
sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data samples to obtain the logarithmic Mel features of the data samples;
inputting the logarithmic Mel features, together with the sample categories, into a convolutional recurrent network model for training until the model meets a preset training end condition; and
recognizing and classifying sound data to be processed using the trained convolutional recurrent network model.
In the road sound recognition method, system, computer device and readable storage medium described above, the method acquires data samples and sample categories of road sound; sequentially performs time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data samples to obtain their logarithmic Mel features; inputs the logarithmic Mel features, together with the sample categories, into a convolutional recurrent network model for training until the model meets a preset training end condition; and recognizes and classifies sound data to be processed using the trained model. By taking the logarithmic Mel features of road sound data samples as the input of a convolutional recurrent network model, the method trains a model capable of recognizing multiple sounds in a complex traffic scene, which helps improve the accuracy of road traffic incident detection.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flowchart of a road sound recognition method according to a first embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a road sound recognition system according to a second embodiment of the present invention;
FIG. 3 is an internal structure diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Please refer to FIG. 1, a flowchart of a road sound recognition method according to the first embodiment of the present invention. The first embodiment of the present invention provides a road sound recognition method comprising the following steps:
S1, acquiring data samples and sample categories of road sound;
S2, sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data samples to obtain the logarithmic Mel features of the data samples;
S3, inputting the logarithmic Mel features, together with the sample categories, into a convolutional recurrent network model for training until the model meets a preset training end condition; and
S4, recognizing and classifying sound data to be processed using the trained convolutional recurrent network model.
In one embodiment, acquiring the data samples of road sound comprises:
acquiring an original data sample of road sound;
and performing data enhancement on the original data sample by utilizing a random mixing technology to obtain a data sample of road sound.
Data enhancement is performed on the original data samples to increase the number of samples; more samples make the trained model more robust and stable.
In one embodiment, performing data enhancement on the original data samples using the random mixing technique to obtain data samples of road sound specifically comprises:
resampling the original data samples of road sound to obtain a plurality of sound segments; and
selecting two or more sound segments, pairing them in arbitrary ratios, and mixing them to obtain data samples of road sound.
In one specific embodiment, raw data samples of road sound may be obtained from, but are not limited to, the AudioSet dataset and real-time recordings made on the road with a microphone.
In a specific embodiment, the obtained raw data samples of road sound are processed as follows to obtain road sound data samples:
The traffic sound signal is resampled at a sampling rate of 16 kHz, with sample values encoded by 16-bit pulse-code modulation, to obtain a plurality of sound segments.
To accommodate sound segments of different durations, 10-second time windows are used as input: traffic sound signals shorter than 10 s are zero-padded and signals longer than 10 s are truncated, ensuring a duration of exactly 10 s.
The sound segments are labeled with one-hot codes.
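The padding/truncation and one-hot labeling steps above can be sketched as follows; this is a rough illustration (the 16 kHz rate and 10 s window come from this embodiment, and the function names are illustrative, not from the patent):

```python
import numpy as np

SR = 16000          # sampling rate from this embodiment (16 kHz)
CLIP_SECONDS = 10   # fixed 10-second time window

def fix_length(signal, sr=SR, seconds=CLIP_SECONDS):
    """Zero-pad signals shorter than the window, truncate longer ones."""
    target = sr * seconds
    if len(signal) < target:
        return np.pad(signal, (0, target - len(signal)))
    return signal[:target]

def one_hot(label_index, num_classes):
    """One-hot code for a sound segment's class label."""
    code = np.zeros(num_classes, dtype=np.float32)
    code[label_index] = 1.0
    return code
```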
Two sound segments y1 and y2, with labels l1 and l2 respectively, are randomly selected from the processed sound segments. A random ratio r is selected, and y1 and y2 are mixed to obtain the mixed sound mix_r(y1, y2), as given by formulas (1) and (2).
In formulas (1) and (2), G1 and G2 are the sound pressure levels of y1 and y2 respectively. The sound pressure level G of a sound file is obtained by A-weighting: a 0.1 s window is created, the time-series weighted sound levels {g1, g2, ..., gt} are computed, and G = max{g1, g2, ..., gt}. The mixed sound mix_r(y1, y2) is assigned a label derived from l1, l2 and the ratio r.
In this embodiment, the random mixing technique performs data enhancement on the original sample data, expanding the training data and improving the robustness and generalization capability of the model.
In one embodiment, step S2 comprises:
sequentially performing non-recursive filtering, slicing and Hamming-window windowing on the data samples to obtain first processed samples;
converting the frequency information in the first processed samples into Mel frequency information, sequentially inputting the Mel frequency information into a plurality of triangular filters, and outputting second processed samples;
performing a logarithmic operation on the second processed samples and stacking the results along the time axis to obtain third processed samples; and
sequentially performing first-order and second-order derivation on the third processed samples, and taking the third processed samples together with their first-order and second-order derivatives as the logarithmic Mel features of the data samples.
In a specific embodiment, the step S2 includes:
1) A first-order non-recursive filter is used to pre-emphasize the high-frequency components of the sound signal. The expression of the first-order non-recursive filter is:
H(z) = 1 - αz⁻¹ (3)
In formula (3), α is the pre-emphasis coefficient and H(z) is the filter response. Preferably, α = 0.97.
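The pre-emphasis filter of formula (3) amounts to the difference equation y[n] = x[n] − α·x[n−1]; a one-line numpy sketch (passing the first sample through unchanged is a common convention, assumed here):

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """Apply H(z) = 1 - alpha * z^-1, i.e. y[n] = x[n] - alpha * x[n-1];
    the first sample is passed through unchanged."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```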
2) Local information of a sound segment is extracted with a frame length of 25 ms and a frame shift of 10 ms. To reduce spectral leakage, each frame is windowed with a Hamming window function; the windowed output of each frame is:
O_t = x_t(n) * w(n) (4)
w(n) = 0.54 - 0.46·cos(2πn/(winlen - 1)), 0 ≤ n ≤ winlen - 1 (5)
In formulas (4) and (5), * denotes element-wise multiplication, x_t(n) is the sound signal of the t-th frame, w(n) is the window function, O_t is the windowed output of the t-th frame, and winlen is the window length. A 512-point fast Fourier transform is then applied to each frame to obtain the corresponding spectrum X_n(k), where n is the frame index and k is the frequency.
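The framing, windowing and FFT steps can be sketched as follows, using the 25 ms frame length, 10 ms shift and 512-point FFT from this embodiment (reading the original "frame overlapping of 10 ms" as the frame shift is an assumption):

```python
import numpy as np

def frame_spectra(x, sr=16000, frame_ms=25, shift_ms=10, n_fft=512):
    """Split x into Hamming-windowed frames and return the magnitude
    spectrum |X_n(k)| of each frame via a 512-point FFT."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 160 samples at 16 kHz
    w = np.hamming(frame_len)
    frames = [x[i:i + frame_len] * w
              for i in range(0, len(x) - frame_len + 1, shift)]
    return np.abs(np.fft.rfft(np.stack(frames), n=n_fft, axis=1))
```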
3) The signal frequency is converted to the Mel frequency; the conversion formula is:
Mel(f) = 2595·lg(1 + f/700) (6)
In formula (6), f is the signal frequency and Mel(f) is the Mel frequency corresponding to f.
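Formula (6) and its inverse in code form (the inverse is the standard relation, used in practice when placing the triangular filters' centre frequencies; it is implied rather than stated in the text):

```python
import math

def hz_to_mel(f):
    """Mel(f) = 2595 * lg(1 + f / 700), formula (6)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of formula (6)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```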
4) L triangular filters are configured on the Mel frequency scale; the output of each triangular filter is given by formulas (7) and (8).
In formulas (7) and (8), W_l(k) is the coefficient of the l-th filter, |X_n(k)| is the amplitude spectrum of the n-th frame signal, and h(l), c(l) and o(l) are the upper, center and lower frequencies of the l-th filter, respectively. Preferably, L = 64.
5) A logarithmic operation is applied to the filter outputs Y(l), and log Y(l), l = 1, 2, ..., L, are stacked along the time axis to obtain a static logarithmic Mel two-dimensional time-frequency feature. The first and second derivatives of this static feature are then computed, and the static feature together with its first and second derivatives forms a 3-channel logarithmic Mel feature, used as the input sample of the convolutional recurrent network model.
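Step 5) — stacking the static log-Mel feature with its first and second derivatives into a 3-channel input — can be sketched like this; a simple first-order difference stands in for the derivative (common toolkits instead use a regression window, so this is an approximation):

```python
import numpy as np

def delta(feat):
    """First-order difference along the time axis (axis 0),
    padded so the output keeps the input's shape."""
    return np.diff(feat, axis=0, prepend=feat[:1])

def three_channel_logmel(log_mel):
    """Stack static log-Mel, delta and delta-delta as 3 channels."""
    d1 = delta(log_mel)
    d2 = delta(d1)
    return np.stack([log_mel, d1, d2], axis=-1)   # (time, mel_bins, 3)
```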
For a sound sample, the raw sound data is high-dimensional, expensive to train on and prone to overfitting, so feature extraction is needed; using the extracted features as input samples of the convolutional recurrent network model both improves accuracy and reduces the complexity of upstream data processing. In this embodiment, the logarithmic Mel features of the data samples are extracted. Logarithmic Mel features can be computed frame by frame, capture the instantaneous dynamics of the sound source, and map frequency responses in a way similar to human auditory perception, so they stay close to the original data samples; when the convolutional recurrent network model is trained on them, differences between sample categories are better reflected and road sounds can be recognized and classified more accurately.
In one embodiment, the convolutional recurrent network model comprises a gated convolutional network layer, a recurrent network layer, a time-distributed fully connected layer and a classification output layer arranged in sequence; the gated convolutional network has 4 layers and the recurrent network has 2 layers.
In selecting the number of gated convolutional layers and recurrent layers, the inventors found through repeated tests that a convolutional recurrent network model with 4 gated convolutional layers and 2 recurrent layers performs best on road sound recognition.
In a specific embodiment, the process of building the convolutional recurrent network model comprises:
1) The three-channel logarithmic Mel features are taken as the input samples of the convolutional recurrent network model. The input samples are divided into 10 equally sized subsets, denoted S1, S2, ..., S10; each Si (i = 1, 2, ..., 10) is used in turn as the test set, with the remaining 9 subsets as the training set.
2) The convolutional recurrent network model is built in software, comprising a gated convolutional network layer, a recurrent network layer, a time-distributed fully connected layer and a classification output layer arranged in sequence.
3) The input samples are fed into the convolutional recurrent network model and supervised learning is performed to obtain the parameters of each layer of the trained model. During training, the convolution kernels and weights are randomly initialized using a random distribution function, and the learning rate is adaptively and dynamically adjusted: the initial learning rate is set to 0.01 and the minimum learning rate to 10⁻⁹, and if the test-set accuracy does not change for 20 training epochs, the learning rate is reduced by a factor of 10. The model is trained by back-propagation using a binary cross-entropy loss function and an adaptive moment estimation optimizer, and training stops when there is no change for 50 training epochs or the limit error of the cost function falls below 0.01.
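The learning-rate schedule and stopping rule just described (initial rate 0.01, floor 10⁻⁹, divide by 10 after 20 epochs without test-set improvement, stop after 50) can be sketched framework-independently; the class name and the exact "no improvement" criterion are illustrative assumptions:

```python
class PlateauSchedule:
    """Reduce-on-plateau learning-rate schedule with early stopping,
    following the training procedure described above."""

    def __init__(self, lr=0.01, min_lr=1e-9, reduce_after=20, stop_after=50):
        self.lr = lr
        self.min_lr = min_lr
        self.reduce_after = reduce_after
        self.stop_after = stop_after
        self.best = float("-inf")
        self.stale = 0

    def step(self, test_accuracy):
        """Record one epoch's test accuracy; return True when training
        should stop (no improvement for `stop_after` epochs)."""
        if test_accuracy > self.best:
            self.best = test_accuracy
            self.stale = 0
            return False
        self.stale += 1
        if self.stale % self.reduce_after == 0:
            self.lr = max(self.lr / 10.0, self.min_lr)
        return self.stale >= self.stop_after
```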
4) The convolutional recurrent network model is tested as follows: the test-set samples are input into the trained model, the model's output is compared with the sample categories corresponding to the test-set samples, and the accuracy is calculated to evaluate the model.
Further, the training principle of the convolutional recurrent network model is as follows. The convolutional layers in the gated convolutional network act as feature extractors; the numbers of convolution kernels in the 4 gated convolutional layers are 64, 128 and 128 in sequence, each of size 5×5, and higher-level feature maps corresponding to the number of kernels are computed from the convolution responses and a GLU excitation function. A batch normalization layer is introduced after each convolutional layer to reduce internal covariate shift and accelerate training. Max pooling is used to reduce the dimensionality of the feature data and provide more frequency invariance; to preserve the temporal integrity of sound events, pooling is performed only along the frequency axis, with the first three pooling layers of size 1×2 and the last of size 1×4. The output vectors of the fourth gated convolutional block pass through a time-distributed fully connected layer; the feature maps are stacked along the frequency direction and input to the bidirectional GRU recurrent network layer, whose update-gate and reset-gate units learn the contextual information of the features. The output vectors then pass through a time-distributed fully connected layer with 500 nodes and a rectified linear (ReLU) excitation function into a sigmoid classifier, yielding per-frame recognition results for 9 target events. After weighted averaging, the recognition results for the target events are binarized one by one with a set of thresholds, achieving recognition and classification of road sounds.
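The GLU excitation mentioned above gates one convolution branch with the sigmoid of another; a minimal numpy rendering (the two-branch split is the standard GLU formulation and is assumed here, since the patent does not spell it out):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(linear_branch, gate_branch):
    """Gated linear unit: GLU(a, b) = a * sigmoid(b); the gate branch
    modulates the linear branch element-wise."""
    return linear_branch * sigmoid(gate_branch)
```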
In one embodiment, the sample categories include at least two of: an alarm sound, a horn sound, a vehicle running sound, a braking sound, an explosion sound, a call-for-help sound, a door-closing sound, a collision sound, and a rain sound.
In one embodiment, step S4 is specifically:
recognizing the logarithmic Mel features of the sound data to be processed using the trained convolutional recurrent network model, thereby classifying the sound data to be processed.
In the road sound recognition method, data samples and sample categories of road sound are acquired; time-frequency decomposition, frequency conversion, logarithmic operation and derivation are performed sequentially on the data samples to obtain their logarithmic Mel features; the logarithmic Mel features, together with the sample categories, are input into a convolutional recurrent network model for training until the model meets a preset training end condition; and sound data to be processed is recognized and classified using the trained model. By taking the logarithmic Mel features of road sound data samples as the input of a convolutional recurrent network model, the method trains a model capable of recognizing multiple sounds in a complex traffic scene, which helps improve the accuracy of road traffic incident detection.
Please refer to FIG. 2, a schematic structural diagram of a road sound recognition system according to the second embodiment. The second embodiment of the present invention provides a road sound recognition system, comprising:
the sample acquisition module 1, for acquiring data samples and sample categories of road sound;
the feature extraction module 2, for sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data samples to obtain the logarithmic Mel features of the data samples;
the training module 3, for inputting the logarithmic Mel features, together with the sample categories, into a convolutional recurrent network model for training until the model meets a preset training end condition; and
the classification module 4, for recognizing and classifying sound data to be processed using the trained convolutional recurrent network model.
In one embodiment, the sample acquisition module 1 comprises:
an original sample acquisition unit for acquiring original data samples of road sound; and
a data enhancement unit for performing data enhancement on the original data samples using a random mixing technique to obtain data samples of road sound.
In one embodiment, the data enhancement unit specifically comprises:
a resampling subunit for resampling the original data samples of road sound to obtain a plurality of sound segments; and
a mixing subunit for selecting two or more sound segments, pairing them in arbitrary ratios, and mixing them to obtain data samples of road sound.
In one embodiment, the feature extraction module 2 comprises:
a data processing unit for sequentially performing non-recursive filtering, slicing and Hamming-window windowing on the data samples to obtain first processed samples;
a frequency conversion unit for converting the frequency information in the first processed samples into Mel frequency information, sequentially inputting the Mel frequency information into a plurality of triangular filters, and outputting second processed samples;
a logarithmic operation unit for performing a logarithmic operation on the second processed samples and stacking the results along the time axis to obtain third processed samples; and
a derivation unit for sequentially performing first-order and second-order derivation on the third processed samples, and taking the third processed samples together with their first-order and second-order derivatives as the logarithmic Mel features of the data samples.
In one embodiment, the convolutional recurrent network model comprises a gated convolutional network layer, a recurrent network layer, a time-distributed fully connected layer and a classification output layer arranged in sequence; the gated convolutional network has 4 layers and the recurrent network has 2 layers.
In one embodiment, the sample categories include at least two of: an alarm sound, a horn sound, a vehicle running sound, a braking sound, an explosion sound, a call-for-help sound, a door-closing sound, a collision sound, and a rain sound.
In one embodiment, the classification module 4 is specifically configured to:
recognize the logarithmic Mel features of the sound data to be processed using the trained convolutional recurrent network model, thereby classifying the sound data to be processed.
It should be noted that the road sound recognition system provided in this embodiment of the present invention executes all the method steps of the road sound recognition method of the first embodiment; its working principle and beneficial effects correspond one-to-one with those of the method and are therefore not repeated here.
Please refer to fig. 3, which is an internal structure diagram of a computer device according to a third embodiment. The third embodiment of the present invention provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; the processor implements the steps of the road sound recognition method of the first embodiment when executing the computer program.
The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor is configured to provide computing and control capabilities, the memory includes a nonvolatile storage medium and an internal memory, the nonvolatile storage medium stores an operating system, a computer program, and a database, the internal memory provides an environment for the operating system and the computer program in the nonvolatile storage medium to run, and when the computer program is executed by the processor, the road sound recognition method according to the first embodiment is implemented.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
A fourth embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the road sound identification method of the first embodiment described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
1. A road voice recognition method, comprising:
acquiring a data sample and a sample category of road sound, and specifically comprising the following steps:
acquiring an original data sample of road sound;
carrying out data enhancement on the original data sample by utilizing a random mixing technology to obtain a data sample of road sound;
the data enhancement of the original data sample by using the random mixing technology to obtain the data sample of the road sound specifically comprises:
resampling the original data sample of the road sound to obtain a plurality of sound segments;
selecting two or more sound segments, pairing them at an arbitrary ratio, and mixing them to obtain a data sample of the road sound and a label of the data sample, wherein the concrete steps of obtaining the data sample of the mixed sound and the label of the data sample of the mixed sound are as follows:
randomly selecting two sound segments y1 and y2 from the processed sound segments, with labels l1 and l2 respectively;
selecting a random ratio r, mixing y1 and y2 to obtain the mixed sound mix_r(y1, y2), and labeling the mixed sound mix_r(y1, y2) with the mixed label; the formula is as follows:
wherein G1 and G2 are the sound pressure levels of the sound segments y1 and y2, respectively;
sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel characteristics of the data sample;
inputting the logarithmic Mel features into a convolution cycle network model according to the sample class for training until the convolution cycle network model meets a preset training end condition; the convolution cycle network model comprises a gated convolutional network layer, a cyclic network layer, a time-distributed fully connected layer and a classification output layer arranged in sequence; the gated convolutional network layer uses a GLU function as an excitation function and has 4 layers, the convolution kernels in the 4 gated convolutional network layers are 64, 128 and 128 in sequence with a kernel size of 5×5; the cyclic network layer is a bidirectional GRU network with 2 layers; and the classification output layer adopts a sigmoid classifier;
and identifying and classifying the sound data to be processed by utilizing the trained convolution cycle network model.
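The random-mixing augmentation in the claim can be sketched as follows. The patent's own mixing formula is not reproduced in this text, so the sketch uses the common "between-class" mixing recipe, which blends two clips by a random ratio r after compensating for their sound pressure levels G1 and G2, and mixes the labels with the same ratio; treat the exact weighting below as an assumption:

```python
import numpy as np

def sound_pressure_db(y):
    """RMS sound pressure level of a segment, in dB (assumed definition)."""
    return 20.0 * np.log10(np.sqrt(np.mean(y ** 2)) + 1e-12)

def mix_segments(y1, y2, l1, l2, r):
    """Mix two sound segments at random ratio r (0 < r < 1), compensating
    for their loudness difference, and mix their labels with the same r.
    This is the standard between-class mixing recipe, not the patent's
    verbatim formula."""
    g1, g2 = sound_pressure_db(y1), sound_pressure_db(y2)
    # weight that realises ratio r after loudness compensation
    p = 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0) * (1.0 - r) / r)
    # normalise so the mix keeps roughly unit energy
    mixed = (p * y1 + (1.0 - p) * y2) / np.sqrt(p ** 2 + (1.0 - p) ** 2)
    label = r * np.asarray(l1, dtype=float) + (1.0 - r) * np.asarray(l2, dtype=float)
    return mixed, label

rng = np.random.default_rng(1)
y1, y2 = rng.standard_normal(1000), 0.5 * rng.standard_normal(1000)
mixed, label = mix_segments(y1, y2, [1, 0], [0, 1], 0.3)
```

Each mixed clip thus carries a soft two-class label, which is what allows the sigmoid output layer to be trained on overlapping road-sound events.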
2. The method of claim 1, wherein the sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data samples to obtain logarithmic mel features of the data samples comprises:
sequentially carrying out non-recursive filtering, slicing, and Hamming-window windowing processing on the data sample to obtain a first data processing sample;
converting the frequency information in the first data processing sample into Mel frequency information, sequentially inputting the Mel frequency information into a plurality of triangular filters, and outputting a second data processing sample;
carrying out a logarithmic operation on the second data processing sample, and stacking the results of the logarithmic operation along a time axis to obtain a third data processing sample;
and sequentially carrying out first-order derivation and second-order derivation on the third data processing sample, and taking the third data processing sample, its first-order derivative and its second-order derivative as the logarithmic Mel feature of the data sample.
3. The road sound identification method according to claim 1, wherein the sample categories include at least two of an alarm sound, a whistling sound, a vehicle running sound, a brake sound, an explosion sound, a person calling for help sound, a door closing sound, a collision sound, and a rain sound.
4. The road voice recognition method according to claim 1, wherein the voice data to be processed is recognized and classified by using the trained convolutional recurrent network model, specifically:
and identifying the logarithmic Mel characteristics of the sound data to be processed by using the trained convolution cycle network model so as to realize the classification of the sound data to be processed.
5. A road voice recognition system, comprising:
the sample acquisition module is used for acquiring data samples and sample categories of road sounds, and specifically comprises the following steps:
acquiring an original data sample of road sound;
carrying out data enhancement on the original data sample by utilizing a random mixing technology to obtain a data sample of road sound;
the data enhancement of the original data sample by using the random mixing technology to obtain the data sample of the road sound specifically comprises:
resampling the original data sample of the road sound to obtain a plurality of sound segments;
selecting two or more sound segments, pairing them at an arbitrary ratio, and mixing them to obtain a data sample of the road sound and a label of the data sample, wherein the concrete steps of obtaining the data sample of the mixed sound and the label of the data sample of the mixed sound are as follows:
randomly selecting two sound segments y1 and y2 from the processed sound segments, with labels l1 and l2 respectively;
selecting a random ratio r, mixing y1 and y2 to obtain the mixed sound mix_r(y1, y2), and labeling the mixed sound mix_r(y1, y2) with the mixed label; the formula is as follows:
wherein G1 and G2 are the sound pressure levels of the sound segments y1 and y2, respectively;
the characteristic extraction module is used for sequentially carrying out time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel characteristics of the data sample;
the training module is used for inputting the logarithmic Mel features into a convolution cycle network model according to the sample classes for training until the convolution cycle network model meets a preset training end condition; the convolution cycle network model comprises a gated convolutional network layer, a cyclic network layer, a time-distributed fully connected layer and a classification output layer arranged in sequence; the gated convolutional network layer uses a GLU function as an excitation function and has 4 layers, the convolution kernels in the 4 gated convolutional network layers are 64, 128 and 128 in sequence with a kernel size of 5×5; the cyclic network layer is a bidirectional GRU network with 2 layers; and the classification output layer adopts a sigmoid classifier;
and the classification module is used for identifying and classifying the sound data to be processed by utilizing the trained convolution cycle network model.
6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 4 are implemented when the computer program is executed by the processor.
7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910436946.5A CN110176248B (en) | 2019-05-23 | 2019-05-23 | Road voice recognition method, system, computer device and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110176248A CN110176248A (en) | 2019-08-27 |
CN110176248B true CN110176248B (en) | 2020-12-22 |
Family
ID=67691960
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910436946.5A Active CN110176248B (en) | 2019-05-23 | 2019-05-23 | Road voice recognition method, system, computer device and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110176248B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110718235B (en) * | 2019-09-20 | 2022-07-01 | 精锐视觉智能科技(深圳)有限公司 | Abnormal sound detection method, electronic device and storage medium |
CN113362851A (en) * | 2020-03-06 | 2021-09-07 | 上海其高电子科技有限公司 | Traffic scene sound classification method and system based on deep learning |
CN111445926B (en) * | 2020-04-01 | 2023-01-03 | 杭州叙简科技股份有限公司 | Rural road traffic accident warning condition identification method based on sound |
CN111785300B (en) * | 2020-06-12 | 2021-05-25 | 北京快鱼电子股份公司 | Crying detection method and system based on deep neural network |
CN112309405A (en) * | 2020-10-29 | 2021-02-02 | 平安科技(深圳)有限公司 | Method and device for detecting multiple sound events, computer equipment and storage medium |
CN112767961B (en) * | 2021-02-07 | 2022-06-03 | 哈尔滨琦音科技有限公司 | Accent correction method based on cloud computing |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109087655A (en) * | 2018-07-30 | 2018-12-25 | 桂林电子科技大学 | A kind of monitoring of traffic route sound and exceptional sound recognition system |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3895657B2 (en) * | 2002-10-01 | 2007-03-22 | 三菱電機エンジニアリング株式会社 | Accident sound detection circuit |
CN100507971C (en) * | 2007-10-31 | 2009-07-01 | 北京航空航天大学 | Independent component analysis based automobile sound identification method |
CN101980336B (en) * | 2010-10-18 | 2012-01-11 | 福州星网视易信息系统有限公司 | Hidden Markov model-based vehicle sound identification method |
KR101768145B1 (en) * | 2016-04-21 | 2017-08-14 | 현대자동차주식회사 | Method for providing sound detection information, apparatus detecting sound around vehicle, and vehicle including the same |
US10276187B2 (en) * | 2016-10-19 | 2019-04-30 | Ford Global Technologies, Llc | Vehicle ambient audio classification via neural network machine learning |
CN106846803B (en) * | 2017-02-08 | 2023-06-23 | 广西交通科学研究院有限公司 | Traffic event detection device and method based on audio frequency |
CN106910495A (en) * | 2017-04-26 | 2017-06-30 | 中国科学院微电子研究所 | A kind of audio classification system and method for being applied to abnormal sound detection |
CN107545890A (en) * | 2017-08-31 | 2018-01-05 | 桂林电子科技大学 | A kind of sound event recognition method |
US10629081B2 (en) * | 2017-11-02 | 2020-04-21 | Ford Global Technologies, Llc | Accelerometer-based external sound monitoring for backup assistance in a vehicle |
CN108231067A (en) * | 2018-01-13 | 2018-06-29 | 福州大学 | Sound scenery recognition methods based on convolutional neural networks and random forest classification |
CN109087635A (en) * | 2018-08-30 | 2018-12-25 | 湖北工业大学 | A kind of speech-sound intelligent classification method and system |
CN109346103B (en) * | 2018-10-30 | 2023-03-28 | 交通运输部公路科学研究所 | Audio detection method for road tunnel traffic incident |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: 530001 No. 6, Gaoxin 2nd Road, Xixiangtang District, Nanning City, Guangxi Zhuang Autonomous Region; Applicant after: Guangxi Jiaoke Group Co.,Ltd. Address before: 530000 No. 6, Gaoxin 2nd Road, Xixiangtang, Nanning, Guangxi Zhuang Autonomous Region; Applicant before: GUANGXI TRANSPORTATION RESEARCH & CONSULTING Co.,Ltd. |
| GR01 | Patent grant | |