CN110176248B - Road voice recognition method, system, computer device and readable storage medium - Google Patents

Info

Publication number
CN110176248B
Authority
CN
China
Prior art keywords
sound
sample
data
road
data sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910436946.5A
Other languages
Chinese (zh)
Other versions
CN110176248A (en)
Inventor
黎恒
徐韶华
唐文娟
韦泽贤
陈静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Jiaoke Group Co Ltd
Original Assignee
Guangxi Jiaoke Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Jiaoke Group Co Ltd
Priority to CN201910436946.5A
Publication of CN110176248A
Application granted
Publication of CN110176248B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a road voice recognition method, a system, a computer device and a readable storage medium. The method comprises the following steps: acquiring a data sample and a sample category of road sound; sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation (differentiation) on the data sample to obtain logarithmic Mel features of the data sample; inputting the logarithmic Mel features into a convolution cycle network model (that is, a convolutional recurrent neural network) according to the sample category for training until the model meets a preset training end condition; and identifying and classifying the sound data to be processed by using the trained convolution cycle network model. The method takes the logarithmic Mel features of road sound data samples as the input of the convolution cycle network model and trains a model that can identify the various sounds in a complex traffic scene, which helps improve the accuracy of road traffic incident detection.

Description

Road voice recognition method, system, computer device and readable storage medium
Technical Field
The invention relates to the technical field of voice recognition, in particular to a road voice recognition method, a road voice recognition system, a computer device and a readable storage medium.
Background
With the development of science and technology, intelligent traffic gradually becomes an important means for road monitoring, and meanwhile, in order to make road monitoring more intelligent, a traffic incident detection technology is developed. The traditional road traffic incident detection technology mainly depends on video detection, however, the video detection has directionality, correct identification is difficult to complete under the conditions of severe weather, poor lighting conditions, lens pollution and the like, and the detection accuracy rate is not guaranteed.
The inventors have found that road sounds carry a great deal of event information and that several sounds, such as car horns, engine noise and vehicle collisions, may be present on the road at the same time; if road sounds can be recognized effectively, the accuracy of traffic incident detection will be greatly improved. The inventors have also found that most existing research on sound recognition is limited to recognizing only the most prominent event among simultaneous sounds, such as laughter or applause, so that the other event information is lost. This clearly does not meet the needs of sound recognition in a complex scene such as a road.
Disclosure of Invention
In view of this, the present invention provides a road sound recognition method, a system, a computer device and a readable storage medium that train a model capable of recognizing the various sounds on a road, are suitable for sound recognition in complex road traffic scenes, and help improve the accuracy of road traffic incident detection.
In a first aspect, the present invention provides a road voice recognition method, including:
acquiring a data sample and a sample category of road sound;
sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel characteristics of the data sample;
inputting the logarithmic Mel features into a convolution cycle network model according to the sample class for training until the convolution cycle network model meets a preset training end condition;
and identifying and classifying the sound data to be processed by utilizing the trained convolution cycle network model.
In a second aspect, the present invention provides a road voice recognition system, comprising:
the sample acquisition module is used for acquiring a data sample and a sample category of road sound;
the characteristic extraction module is used for sequentially carrying out time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel characteristics of the data sample;
the training module is used for inputting the logarithmic Mel features into a convolution cycle network model according to the sample types for training until the convolution cycle network model meets a preset training end condition;
and the classification module is used for identifying and classifying the sound data to be processed by utilizing the trained convolution cycle network model.
In a third aspect, the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring a data sample and a sample category of road sound;
sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel characteristics of the data sample;
inputting the logarithmic Mel features into a convolution cycle network model according to the sample class for training until the convolution cycle network model meets a preset training end condition;
and identifying and classifying the sound data to be processed by utilizing the trained convolution cycle network model.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a data sample and a sample category of road sound;
sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel characteristics of the data sample;
inputting the logarithmic Mel features into a convolution cycle network model according to the sample class for training until the convolution cycle network model meets a preset training end condition;
and identifying and classifying the sound data to be processed by utilizing the trained convolution cycle network model.
The road sound identification method, the system, the computer equipment and the readable storage medium are characterized in that the method comprises the steps of obtaining data samples and sample types of road sounds; sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel characteristics of the data sample; inputting the logarithmic Mel features into a convolution cycle network model according to the sample class for training until the convolution cycle network model meets a preset training end condition; and identifying and classifying the sound data to be processed by utilizing the trained convolution cycle network model. The method takes the logarithmic Mel characteristics of the road sound data samples as the input of the convolution cycle network model, trains out a model which can be used for identifying various sounds in a complex traffic scene, and is beneficial to improving the accuracy of road traffic incident detection.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flowchart illustrating a road voice recognition method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a road voice recognition system according to a second embodiment of the present invention;
FIG. 3 is an internal structural view of a computer device in a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Please refer to fig. 1, which is a flowchart illustrating a road voice recognition method according to a first embodiment of the present invention. A first embodiment of the present invention provides a road voice recognition method, including the steps of:
S1, acquiring a data sample and a sample category of the road sound;
S2, sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel characteristics of the data sample;
S3, inputting the logarithmic Mel features into a convolution cycle network model according to the sample types for training until the convolution cycle network model meets a preset training end condition;
and S4, recognizing and classifying the sound data to be processed by using the trained convolution cycle network model.
In one embodiment, the obtaining the data sample of the road sound comprises:
acquiring an original data sample of road sound;
and performing data enhancement on the original data sample by utilizing a random mixing technology to obtain a data sample of road sound.
The data enhancement is carried out on the original data samples to increase the number of samples, and more samples can enable the trained model to be more robust and stable.
In an embodiment, the performing data enhancement on the original data sample by using a random mixing technique to obtain a data sample of road sound specifically includes:
resampling the original data sample of the road sound to obtain a plurality of sound segments;
and selecting two or more sound segments and mixing them in an arbitrary ratio to obtain a data sample of the road sound.
In one particular embodiment, raw data samples of road sound may be obtained from, but are not limited to, the AudioSet dataset and real-time recordings made on the road with a microphone.
In a specific embodiment, the obtained raw data sample of the road sound is processed as follows to obtain a road sound data sample:
The traffic sound signal is resampled at a sampling rate of 16 kHz, and the sample values are encoded with 16-bit pulse-code modulation (PCM), yielding a plurality of sound segments.
To accommodate sound segments of different durations, a 10-second time window is used as input: traffic sound signals shorter than 10 s are padded with zeros, and those longer than 10 s are truncated, so that every segment is exactly 10 s long.
The sound segments are labeled with one-hot codes.
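The length normalization and one-hot labeling described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and padding at the end of the clip and truncating its tail are assumptions, since the text does not say where the zeros are added or which part is cut.

```python
import numpy as np

SAMPLE_RATE = 16_000                 # 16 kHz, as stated in the text
CLIP_SECONDS = 10                    # fixed 10-second input window
CLIP_SAMPLES = SAMPLE_RATE * CLIP_SECONDS


def fix_length(signal, target=CLIP_SAMPLES):
    """Zero-pad short clips and truncate long clips to exactly `target` samples.

    Assumption: zeros are appended at the end and the tail of long clips is cut.
    """
    signal = np.asarray(signal, dtype=np.float32)
    if len(signal) >= target:
        return signal[:target]
    return np.pad(signal, (0, target - len(signal)))


def one_hot(class_index, num_classes):
    """Return a one-hot label vector for a single sound category."""
    label = np.zeros(num_classes, dtype=np.float32)
    label[class_index] = 1.0
    return label
```

For example, `fix_length(clip)` returns a 160,000-sample array for any input clip, and `one_hot(2, 9)` marks the third of nine categories.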
Two sound segments y_1 and y_2, with labels l_1 and l_2 respectively, are randomly selected from the processed sound segments. A random ratio is selected:
Figure GDA0002628938580000061
y_1 and y_2 are then mixed to obtain the mixed sound mix_r(y_1, y_2) according to the following formulas:
Figure GDA0002628938580000062
Figure GDA0002628938580000063
In formulas (1) and (2), G_1 and G_2 are the sound pressure levels of y_1 and y_2, respectively. The sound pressure level G of a sound file is obtained by A-weighting: 0.1 s windows are created, the A-weighted sound level of each window in the time sequence is calculated as {g_1, g_2, ..., g_t}, and G = max{g_1, g_2, ..., g_t}. The label of the mixed sound mix_r(y_1, y_2) is:
Figure GDA0002628938580000064
In the embodiment, the random mixing technology is adopted to perform data enhancement on the original sample data, the training data is expanded, and the robustness and the generalization capability of the model are improved.
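A rough sketch of level-aware random mixing is given below. Formulas (1) and (2) and the label formula are not reproduced in this text, so everything in the sketch is an assumption: it uses a commonly used between-class mixing rule (a mixing coefficient p derived from the level difference G_1 - G_2 and the ratio r, with energy normalization), a label of r·l_1 + (1 - r)·l_2, and an unweighted RMS level in dB in place of the A-weighted level. The patent's actual formulas may differ.

```python
import numpy as np


def peak_window_level_db(y, sample_rate=16_000, window_s=0.1, eps=1e-12):
    """Maximum RMS level (dB) over consecutive 0.1 s windows.

    Assumption: plain RMS is used; the A-weighting mentioned in the text is omitted.
    """
    y = np.asarray(y, dtype=float)
    win = int(sample_rate * window_s)
    n_windows = max(1, len(y) // win)
    levels = []
    for i in range(n_windows):
        chunk = y[i * win:(i + 1) * win]
        rms = np.sqrt(np.mean(np.square(chunk)))
        levels.append(20.0 * np.log10(rms + eps))
    return max(levels)


def random_mix(y1, y2, l1, l2, r, sample_rate=16_000):
    """Mix two equal-length clips with ratio r in (0, 1).

    Assumed mixing rule (not taken from the source text):
        p   = 1 / (1 + 10^((G1 - G2) / 20) * (1 - r) / r)
        mix = (p * y1 + (1 - p) * y2) / sqrt(p^2 + (1 - p)^2)
        lab = r * l1 + (1 - r) * l2
    """
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    g1 = peak_window_level_db(y1, sample_rate)
    g2 = peak_window_level_db(y2, sample_rate)
    p = 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0) * (1.0 - r) / r)
    mixed = (p * y1 + (1.0 - p) * y2) / np.sqrt(p ** 2 + (1.0 - p) ** 2)
    label = r * np.asarray(l1, dtype=float) + (1.0 - r) * np.asarray(l2, dtype=float)
    return mixed, label
```

In use, r would be drawn at random for each pair, for instance with `np.random.default_rng().uniform(0.0, 1.0)`, so that each pass over the data produces new mixtures.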
In one embodiment, the step S2 includes:
sequentially carrying out non-recursive filter filtering processing, slicing and Hamming window windowing processing on the data samples to obtain first data processing samples;
converting the frequency information in the first data processing sample into Mel frequency information, sequentially inputting the Mel frequency information into a plurality of triangular filters, and outputting a second data processing sample;
carrying out a logarithmic operation on the second data processing sample, and stacking the results of the logarithmic operation along a time axis to obtain a third data processing sample;
and sequentially carrying out first-order derivation and second-order derivation on the third data processing sample, and taking the third data processing sample, the first-order derivation of the third data processing sample and the second-order derivation of the third data processing sample as the logarithmic Mel characteristic of the data sample.
In a specific embodiment, the step S2 includes:
1) A first-order non-recursive filter is used to pre-emphasize the high-frequency components of the sound signal. The expression for the first-order non-recursive filter is:
H(z) = 1 - αz^(-1) (3)
In formula (3), α is the pre-emphasis coefficient and H(z) is the filter response. Preferably, α is 0.97.
2) Local information of a sound segment is extracted with a frame length of 25 ms and a frame overlap of 10 ms. To reduce spectral leakage, each frame is windowed with a Hamming window function; the output of each frame after windowing is:
O_t = x_t(n) * w(n) (4)
Figure GDA0002628938580000071
In formulas (4) and (5), * is the convolution operator, x_t(n) is the sound signal of the t-th frame, w(n) is the window function, O_t is the windowed output signal of the t-th frame, and winlen is the window length. A 512-point fast Fourier transform is then applied to each frame to obtain the corresponding spectrum X_n(k), where n is the frame number and k is the frequency.
3) The frequency of the signal is converted to the Mel frequency using the following formula:
Mel(f) = 2595 lg(1 + f/700) (6)
in the formula (6), f is the signal frequency, and Mel (f) is the Mel frequency corresponding to f.
4) Configuring L triangular filters on the Mel frequency, wherein the output of each triangular filter is as follows:
Figure GDA0002628938580000072
Figure GDA0002628938580000081
In formulas (7) and (8), W_l(k) is the coefficient of the l-th filter, |X_n(k)| is the amplitude spectrum of the n-th frame signal, and h(l), c(l) and o(l) are the upper, center and lower frequencies of the l-th filter, respectively. Preferably, L = 64.
5) A logarithmic operation is performed on the filter output Y(l), and log Y(l), l = 1, 2, ..., L, is stacked along the time axis to obtain the static logarithmic Mel two-dimensional time-frequency feature. The first and second derivatives of the static feature are then computed, and the static feature together with its first derivative and second derivative forms a 3-channel logarithmic Mel feature, which is used as the input sample of the convolution cycle network model.
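Steps 1) to 5) can be sketched end to end in NumPy as below. This is an illustrative approximation rather than the patented implementation, and it rests on several assumptions: the Hamming window is applied by element-wise multiplication (the usual practice), "frame overlap of 10 ms" is read as a hop of 15 ms, the triangular filters are placed at equal spacing on the Mel scale, the natural logarithm is used, and the derivatives are simple finite differences along the time axis. All function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    """Formula (6): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of formula (6)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_filters=64, n_fft=512, sample_rate=16_000):
    """Triangular filters with edges equally spaced on the Mel scale (assumed layout)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for l in range(1, n_filters + 1):
        lo, center, hi = bins[l - 1], bins[l], bins[l + 1]
        for k in range(lo, center):          # rising edge of the triangle
            fbank[l - 1, k] = (k - lo) / (center - lo)
        for k in range(center, hi):          # falling edge of the triangle
            fbank[l - 1, k] = (hi - k) / (hi - center)
    return fbank


def log_mel_features(signal, sample_rate=16_000, alpha=0.97,
                     frame_ms=25, overlap_ms=10, n_fft=512, n_filters=64):
    """Return a (3, n_frames, n_filters) array: static log-Mel plus two derivatives."""
    signal = np.asarray(signal, dtype=float)
    # 1) Pre-emphasis, H(z) = 1 - alpha * z^-1.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2) Framing (assumed hop = frame length - overlap), Hamming window, 512-point FFT.
    frame_len = sample_rate * frame_ms // 1000
    hop = frame_len - sample_rate * overlap_ms // 1000
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # 3) and 4) Mel-spaced triangular filters applied to the amplitude spectrum.
    mel_energy = magnitude @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # 5) Logarithm (small offset avoids log(0)), then finite-difference derivatives.
    static = np.log(mel_energy + 1e-10)
    delta1 = np.gradient(static, axis=0)
    delta2 = np.gradient(delta1, axis=0)
    return np.stack([static, delta1, delta2], axis=0)
```

Under these assumptions a 10-second clip at 16 kHz gives 666 frames, so the network input has shape (3, 666, 64); a 10 ms hop instead of a 10 ms overlap would give a different frame count.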
For a sound sample, the original sound data has high dimensionality and high training complexity, and is easy to overfit, so that feature extraction is needed, and the extracted features are used as input samples of a convolution cycle network model, so that not only can the precision be improved, but also the complexity of early data processing can be reduced. In the embodiment, the logarithmic mel features of the data samples are extracted, and the logarithmic mel features can be used for calculating the sound frame by frame, capturing the instantaneous dynamic features of the sound source and mapping the frequency response similar to human auditory perception, so that the logarithmic mel features are closer to the original data samples, and when a convolution cycle network model is used for training, the difference between different sample classes can be better reflected, and the road sound can be more accurately identified and classified.
In one embodiment, the convolution cyclic network model comprises a gate control convolution network layer, a cyclic network layer, a time distribution type full connection layer and a classification output layer which are arranged in sequence; the number of layers of the gated convolutional network layer is 4, and the number of layers of the cyclic network layer is 2.
In selecting the number of gated convolutional network layers and cyclic network layers, the inventors found through repeated experiments that a convolution cycle network model with 4 gated convolutional network layers and 2 cyclic network layers performs best for road sound identification.
In a specific embodiment, the process of building the convolution cyclic network model includes:
1) The three-channel logarithmic Mel features are taken as the input samples of the convolution cycle network model. The input samples are divided into 10 subsets of equal size, denoted S_1, S_2, ..., S_10; S_i (i = 1, 2, ..., 10) is taken as the test set and the remaining 9 subsets as the training set.
2) The convolution cycle network model is built in software; it comprises a gated convolutional network layer, a cyclic network layer, a time-distributed fully connected layer and a classification output layer arranged in sequence.
3) The input samples are fed into the convolution cycle network model, and supervised learning is performed to obtain the parameters of each layer of the trained model. During training, the convolution kernels and weights are randomly initialized with a random distribution function, and the learning rate is adjusted adaptively: its initial value is 0.01, its minimum value is 10^-9, and it is reduced by a factor of 10 whenever the test-set accuracy is unchanged for 20 training epochs. The model is trained by back propagation with a binary cross-entropy loss function and an adaptive moment estimation (Adam) optimizer, and training stops when there is no change for 50 training epochs or the limit error of the cost function is less than 0.01.
4) The convolution cycle network model is tested, and the test method comprises the following steps: and inputting the samples of the test set into the trained convolution cycle network model, comparing the output of the convolution cycle network model with the sample categories corresponding to the samples of the test set, calculating the accuracy and evaluating the convolution cycle network model.
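The data split and the learning-rate rule in the steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: "unchanged" is interpreted as "no improvement in test-set accuracy", each subset S_i is used as the test set in turn, and the names `ten_fold_splits` and `PlateauSchedule` are hypothetical rather than taken from the source or from any library.

```python
import random


def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices), using each subset S_i as the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, folds[i]


class PlateauSchedule:
    """Learning-rate reduction and stopping rule with the values given in the text."""

    def __init__(self, lr=0.01, factor=0.1, patience=20,
                 min_lr=1e-9, stop_patience=50, loss_tol=0.01):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.stop_patience = stop_patience
        self.loss_tol = loss_tol
        self.best = float("-inf")
        self.since_best = 0

    def step(self, test_accuracy, loss):
        """Update after one epoch; return True when training should stop."""
        if test_accuracy > self.best:
            self.best = test_accuracy
            self.since_best = 0
        else:
            self.since_best += 1
            # Assumption: every 20 epochs without improvement divides the rate by 10.
            if self.since_best % self.patience == 0:
                self.lr = max(self.lr * self.factor, self.min_lr)
        return self.since_best >= self.stop_patience or loss < self.loss_tol
```

A training loop would call `step` once per epoch with the current test accuracy and loss, pass `schedule.lr` to the optimizer, and break when `step` returns True.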
Further, the training principle of the convolution cycle network model is as follows. The convolutional layers in the gated convolutional network layers act as feature extractors. The numbers of convolution kernels in the 4 gated convolutional network layers are 64, 128 and 128 in sequence, and the kernel size is 5 × 5. The convolution response is passed through a GLU activation function to obtain as many higher-level feature maps as there are convolution kernels. A batch normalization layer is introduced after the convolutional layers to reduce internal covariate shift and accelerate training. Max pooling reduces the dimensionality of the feature data and provides more frequency invariance; to preserve the temporal integrity of sound events, pooling is performed only along the frequency axis, with a size of 1 × 2 for the first three pooling layers and 1 × 4 for the last. The output vector of the fourth gated convolutional network block passes through a time-distributed fully connected layer; the feature maps are stacked along the frequency direction and input to the bidirectional GRU cyclic network layers, which learn the context information of the features through their update-gate and reset-gate units. The output vector then passes through a time-distributed fully connected layer with 500 nodes and a rectified linear unit (ReLU) activation function and is input to a sigmoid classifier, giving the recognition results of the 9 target events in each frame. After weighted averaging, the recognition results of the target events are binarized one by one with a set of thresholds, realizing the recognition and classification of road sounds.
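Three of the building blocks named above (GLU gating, frequency-only max pooling, and per-class thresholding) can be illustrated in NumPy as below. This is a sketch of the individual operations, not of the full trained network: the function names are hypothetical, the GLU is written in its usual form (a linear path multiplied by a sigmoid gate), and a plain mean over frames stands in for the weighted averaging mentioned in the text, whose weights are not given.

```python
import numpy as np


def glu(linear, gate):
    """Gated linear unit: the linear path scaled by a sigmoid gate."""
    return linear * (1.0 / (1.0 + np.exp(-gate)))


def max_pool_freq(x, pool):
    """Max-pool along the last (frequency) axis only; the time axis is untouched."""
    *lead, n_freq = x.shape
    usable = (n_freq // pool) * pool
    return x[..., :usable].reshape(*lead, n_freq // pool, pool).max(axis=-1)


def binarize(frame_probs, thresholds):
    """Average frame-wise probabilities over time, then threshold each class.

    Assumption: equal weights are used in place of the unspecified weighted average.
    frame_probs has shape (n_frames, n_classes); thresholds has shape (n_classes,).
    """
    clip_probs = frame_probs.mean(axis=0)
    return (clip_probs >= thresholds).astype(int)


# Frequency axis after the four pooling layers (1x2, 1x2, 1x2, 1x4) on 64 Mel bands.
features = np.zeros((666, 64))        # (time frames, Mel bands), illustrative size
for pool in (2, 2, 2, 4):
    features = max_pool_freq(features, pool)
# features.shape is now (666, 2): time resolution is preserved, frequency is reduced.
```

Pooling only along frequency keeps one output per input frame, which is what allows the recurrent layers and the sigmoid classifier to produce a result for every frame.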
In one embodiment, the sample categories include at least two of an alarm sound, a whistling sound, a vehicle running sound, a brake sound, an explosion sound, a person calling for help sound, a door closing sound, a collision sound, and a rain sound.
In an embodiment, the step S4 is specifically:
and identifying the logarithmic Mel characteristics of the sound data to be processed by using the trained convolution cycle network model so as to realize the classification of the sound data to be processed.
The road sound identification method comprises the steps of obtaining a data sample and a sample category of road sound; sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel characteristics of the data sample; inputting the logarithmic Mel features into a convolution cycle network model according to the sample class for training until the convolution cycle network model meets a preset training end condition; and identifying and classifying the sound data to be processed by utilizing the trained convolution cycle network model. The method takes the logarithmic Mel characteristics of the road sound data samples as the input of the convolution cycle network model, trains out a model which can be used for identifying various sounds in a complex traffic scene, and is beneficial to improving the accuracy of road traffic incident detection.
Please refer to fig. 2, which is a schematic structural diagram of a road voice recognition system according to an embodiment. A second embodiment of the present invention provides a road voice recognition system, including:
a sample acquisition module 1, used for acquiring a data sample and a sample category of road sound;
a feature extraction module 2, used for sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel features of the data sample;
a training module 3, used for inputting the logarithmic Mel features into a convolution cycle network model according to the sample category for training until the convolution cycle network model meets a preset training end condition; and
a classification module 4, used for identifying and classifying the sound data to be processed by utilizing the trained convolution cycle network model.
In one embodiment, the sample acquiring module 1 comprises:
an original sample acquisition unit, used for acquiring an original data sample of road sound;
and the data enhancement unit is used for carrying out data enhancement on the original data sample by utilizing a random mixing technology to obtain a data sample of the road sound.
In one embodiment, the data enhancement unit specifically includes:
the resampling subunit is used for resampling the original data sample of the road sound to obtain a plurality of sound segments;
and the sound mixing subunit is used for selecting two or more sound segments and mixing them in an arbitrary ratio to obtain a data sample of the road sound.
In one embodiment, the feature extraction module 2 includes:
the data processing unit is used for sequentially carrying out non-recursive filter filtering processing, slicing and Hamming window windowing processing on the data samples to obtain first data processing samples;
the frequency conversion unit is used for converting the frequency information in the first data processing sample into Mel frequency information, sequentially inputting the Mel frequency information into a plurality of triangular filters and outputting a second data processing sample;
the logarithm arithmetic unit is used for carrying out a logarithmic operation on the second data processing sample and stacking the results of the logarithmic operation along a time axis to obtain a third data processing sample;
and the derivation unit is used for sequentially carrying out first derivation and second derivation on the third data processing sample, and taking the third data processing sample, the first derivative of the third data processing sample and the second derivative of the third data processing sample as the logarithmic Mel characteristic of the data sample.
In one embodiment, the convolution cyclic network model comprises a gate control convolution network layer, a cyclic network layer, a time distribution type full connection layer and a classification output layer which are arranged in sequence; the number of layers of the gated convolutional network layer is 4, and the number of layers of the cyclic network layer is 2.
In one embodiment, the sample categories include at least two of an alarm sound, a whistling sound, a vehicle running sound, a brake sound, an explosion sound, a person calling for help sound, a door closing sound, a collision sound, and a rain sound.
In an embodiment, the classification module 4 is specifically configured to:
and identifying the logarithmic Mel characteristics of the sound data to be processed by using the trained convolution cycle network model so as to realize the classification of the sound data to be processed.
It should be noted that, the road voice recognition system provided in the embodiment of the present invention is used for executing all the method steps of the road voice recognition method in the first embodiment, and the working principles and beneficial effects of the two are in one-to-one correspondence, so that the details are not repeated.
Please refer to fig. 3, which is an internal structure diagram of a computer apparatus according to a third embodiment. A third embodiment of the present invention provides a computer apparatus including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the road sound recognition method of the first embodiment described above when executing the computer program.
The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor is configured to provide computing and control capabilities, the memory includes a nonvolatile storage medium and an internal memory, the nonvolatile storage medium stores an operating system, a computer program, and a database, the internal memory provides an environment for the operating system and the computer program in the nonvolatile storage medium to run, and when the computer program is executed by the processor, the road sound recognition method according to the first embodiment is implemented.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
A fourth embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the road sound identification method of the first embodiment described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A road voice recognition method, comprising:
acquiring a data sample and a sample category of road sound, and specifically comprising the following steps:
acquiring an original data sample of road sound;
carrying out data enhancement on the original data sample by utilizing a random mixing technology to obtain a data sample of road sound;
the data enhancement of the original data sample by using the random mixing technology to obtain the data sample of the road sound specifically comprises:
resampling the original data sample of the road sound to obtain a plurality of sound segments;
selecting more than two sound segments, respectively matching the sound segments in any ratio and mixing the sound segments to obtain a data sample of the road sound and a label of the data sample, wherein the concrete steps of obtaining the data sample of the mixed sound and the label of the data sample of the mixed sound are as follows:
randomly selecting two sound segments $y_1, y_2$ from the processed sound segments, with labels $l_1, l_2$ respectively;
selecting a random ratio [formula image FDA0002758614290000011];
mixing $y_1, y_2$ to obtain the mixed sound $\mathrm{mix}_r(y_1, y_2)$ and the label of the mixed sound $\mathrm{mix}_r(y_1, y_2)$ [formula image FDA0002758614290000012];
the formulas are as follows:
[formula images FDA0002758614290000013, FDA0002758614290000014, FDA0002758614290000015]
wherein $G_1, G_2$ are the sound pressure levels of the sound segments $y_1, y_2$, respectively;
sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel characteristics of the data sample;
inputting the logarithmic Mel features into a convolutional recurrent network model according to the sample category for training until the convolutional recurrent network model meets a preset training end condition; the convolutional recurrent network model comprises a gated convolutional network layer, a recurrent network layer, a time-distributed fully connected layer and a classification output layer arranged in sequence; the gated convolutional network layer uses a GLU function as the activation function and has 4 layers, the numbers of convolution kernels in the 4 gated convolutional network layers are 64, 128 and 128 in sequence, and the kernel size is 5×5; the recurrent network layer is a bidirectional GRU recurrent network with 2 layers; the classification output layer adopts a sigmoid classifier;
and identifying and classifying the sound data to be processed by utilizing the trained convolutional recurrent network model.
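The random mixing step of claim 1 can be illustrated with a short sketch. The claim's formulas survive here only as image placeholders, so everything numeric below is an assumption: the mixing weight `p`, the normalisation, and the label rule follow a commonly used between-class mixing scheme, and the sound pressure levels G1, G2 are approximated by an RMS level in dB. The function names `sound_level_db` and `mix_segments` are hypothetical and do not come from the patent.

```python
import numpy as np

def sound_level_db(y, eps=1e-12):
    """Rough level in dB from RMS amplitude (assumed stand-in for G1, G2)."""
    rms = np.sqrt(np.mean(np.square(y)))
    return 20.0 * np.log10(max(rms, eps))

def mix_segments(y1, l1, y2, l2, r):
    """Mix two equal-length segments with a ratio r in (0, 1).

    Assumed formulas (not confirmed by the source text):
      p     = 1 / (1 + 10**((G1 - G2) / 20) * (1 - r) / r)
      mix   = (p * y1 + (1 - p) * y2) / sqrt(p**2 + (1 - p)**2)
      label = r * l1 + (1 - r) * l2
    """
    g1, g2 = sound_level_db(y1), sound_level_db(y2)
    p = 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0) * (1.0 - r) / r)
    mixed = (p * y1 + (1.0 - p) * y2) / np.sqrt(p ** 2 + (1.0 - p) ** 2)
    label = r * np.asarray(l1, dtype=float) + (1.0 - r) * np.asarray(l2, dtype=float)
    return mixed, label

# Two synthetic one-second segments at 16 kHz with one-hot labels.
rng = np.random.default_rng(0)
y1 = rng.standard_normal(16000)
y2 = 0.1 * rng.standard_normal(16000)
l1 = np.array([1.0, 0.0])
l2 = np.array([0.0, 1.0])
r = rng.uniform(0.05, 0.95)  # random ratio, kept away from 0 and 1
mixed, label = mix_segments(y1, l1, y2, l2, r)
```

Under these assumptions the level correction makes the perceived proportion of the two sounds track `r` even when one segment is much louder than the other, and the soft label carries the same ratio.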
2. The method of claim 1, wherein the sequentially performing time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data samples to obtain logarithmic mel features of the data samples comprises:
sequentially carrying out non-recursive filter filtering processing, slicing and Hamming window windowing processing on the data sample to obtain a first data processing sample;
converting the frequency information in the first data processing sample into Mel frequency information, sequentially inputting the Mel frequency information into a plurality of triangular filters, and outputting a second data processing sample;
carrying out a logarithmic operation on the second data processing sample, and stacking the results of the logarithmic operation along a time axis to obtain a third data processing sample;
and sequentially carrying out first-order derivation and second-order derivation on the third data processing sample, and taking the third data processing sample, the first-order derivative of the third data processing sample and the second-order derivative of the third data processing sample as the logarithmic Mel features of the data sample.
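The feature pipeline of claim 2 can be sketched with NumPy alone. This is a minimal illustration, not the patented implementation: the sample rate, frame length, hop size, number of Mel bands and pre-emphasis coefficient are all assumed values, the non-recursive filter is assumed to be a first-order pre-emphasis filter, and the derivatives are approximated with `np.gradient`. The function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters with centres spaced evenly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of the triangle
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling edge of the triangle
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_features(y, sr=16000, n_fft=512, hop=256, n_mels=40, pre=0.97):
    # Non-recursive (FIR) filter, assumed here to be pre-emphasis.
    y = np.append(y[0], y[1:] - pre * y[:-1])
    # Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(y) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(n_fft)
    # Time-frequency decomposition: power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # Mel conversion through triangular filters, then the logarithm;
    # rows are already stacked along the time axis.
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # First- and second-order differences along the time axis.
    d1 = np.gradient(log_mel, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.stack([log_mel, d1, d2], axis=0)

# One second of a 1 kHz tone as a stand-in for a road sound segment.
t = np.arange(16000) / 16000.0
feats = log_mel_features(np.sin(2 * np.pi * 1000.0 * t))
print(feats.shape)  # (3, 61, 40)
```

Stacking the log-Mel map with its two differences as three channels gives an image-like input, which suits the 2-D convolutions of the gated convolutional layers.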
3. The road sound identification method according to claim 1, wherein the sample categories include at least two of an alarm sound, a whistling sound, a vehicle running sound, a brake sound, an explosion sound, a person calling for help sound, a door closing sound, a collision sound, and a rain sound.
4. The road voice recognition method according to claim 1, wherein the voice data to be processed is recognized and classified by using the trained convolutional recurrent network model, specifically:
and identifying the logarithmic Mel features of the sound data to be processed by using the trained convolutional recurrent network model, so as to classify the sound data to be processed.
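Two building blocks named in claim 1, the GLU activation and the sigmoid output, can be shown in isolation. The sketch below is only an illustration under stated assumptions: it omits the convolution weights, the bidirectional GRU and the time-distributed layer, the 0.5 decision threshold is an assumed value, and the helper names `glu` and `classify` are hypothetical. The class names are taken from claim 3.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, axis=0):
    """Gated linear unit: split the channels in half and gate one half
    with the sigmoid of the other half."""
    a, b = np.split(x, 2, axis=axis)
    return a * sigmoid(b)

def classify(logits, class_names, threshold=0.5):
    """Independent sigmoid score per class; every class whose score
    reaches the (assumed) threshold is reported."""
    probs = sigmoid(np.asarray(logits, dtype=float))
    return [name for name, p in zip(class_names, probs) if p >= threshold]

names = ["alarm sound", "whistling sound", "brake sound", "collision sound"]
print(classify([2.0, -1.0, 0.3, -3.0], names))  # ['alarm sound', 'brake sound']
```

Because each sigmoid score is independent, more than one category can be reported for the same segment, which fits road recordings in which several sounds overlap.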
5. A road voice recognition system, comprising:
the sample acquisition module is used for acquiring data samples and sample categories of road sounds, and specifically comprises the following steps:
acquiring an original data sample of road sound;
carrying out data enhancement on the original data sample by utilizing a random mixing technology to obtain a data sample of road sound;
the data enhancement of the original data sample by using the random mixing technology to obtain the data sample of the road sound specifically comprises:
resampling the original data sample of the road sound to obtain a plurality of sound segments;
selecting more than two sound segments, respectively matching the sound segments in any ratio and mixing the sound segments to obtain a data sample of the road sound and a label of the data sample, wherein the concrete steps of obtaining the data sample of the mixed sound and the label of the data sample of the mixed sound are as follows:
randomly selecting two sound segments $y_1, y_2$ from the processed sound segments, with labels $l_1, l_2$ respectively;
selecting a random ratio [formula image FDA0002758614290000031];
mixing $y_1, y_2$ to obtain the mixed sound $\mathrm{mix}_r(y_1, y_2)$ and the label of the mixed sound $\mathrm{mix}_r(y_1, y_2)$ [formula image FDA0002758614290000032];
the formulas are as follows:
[formula images FDA0002758614290000033, FDA0002758614290000041, FDA0002758614290000042]
wherein $G_1, G_2$ are the sound pressure levels of the sound segments $y_1, y_2$, respectively;
the characteristic extraction module is used for sequentially carrying out time-frequency decomposition, frequency conversion, logarithmic operation and derivation on the data sample to obtain logarithmic Mel characteristics of the data sample;
the training module is used for inputting the logarithmic Mel features into a convolutional recurrent network model according to the sample category for training until the convolutional recurrent network model meets a preset training end condition; the convolutional recurrent network model comprises a gated convolutional network layer, a recurrent network layer, a time-distributed fully connected layer and a classification output layer arranged in sequence; the gated convolutional network layer uses a GLU function as the activation function and has 4 layers, the numbers of convolution kernels in the 4 gated convolutional network layers are 64, 128 and 128 in sequence, and the kernel size is 5×5; the recurrent network layer is a bidirectional GRU recurrent network with 2 layers; the classification output layer adopts a sigmoid classifier;
and the classification module is used for identifying and classifying the sound data to be processed by utilizing the trained convolutional recurrent network model.
6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 4 are implemented when the computer program is executed by the processor.
7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.
CN201910436946.5A 2019-05-23 2019-05-23 Road voice recognition method, system, computer device and readable storage medium Active CN110176248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910436946.5A CN110176248B (en) 2019-05-23 2019-05-23 Road voice recognition method, system, computer device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910436946.5A CN110176248B (en) 2019-05-23 2019-05-23 Road voice recognition method, system, computer device and readable storage medium

Publications (2)

Publication Number Publication Date
CN110176248A CN110176248A (en) 2019-08-27
CN110176248B true CN110176248B (en) 2020-12-22

Family

ID=67691960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910436946.5A Active CN110176248B (en) 2019-05-23 2019-05-23 Road voice recognition method, system, computer device and readable storage medium

Country Status (1)

Country Link
CN (1) CN110176248B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110718235B (en) * 2019-09-20 2022-07-01 精锐视觉智能科技(深圳)有限公司 Abnormal sound detection method, electronic device and storage medium
CN113362851A (en) * 2020-03-06 2021-09-07 上海其高电子科技有限公司 Traffic scene sound classification method and system based on deep learning
CN111445926B (en) * 2020-04-01 2023-01-03 杭州叙简科技股份有限公司 Rural road traffic accident warning condition identification method based on sound
CN111785300B (en) * 2020-06-12 2021-05-25 北京快鱼电子股份公司 Crying detection method and system based on deep neural network
CN112309405A (en) * 2020-10-29 2021-02-02 平安科技(深圳)有限公司 Method and device for detecting multiple sound events, computer equipment and storage medium
CN112767961B (en) * 2021-02-07 2022-06-03 哈尔滨琦音科技有限公司 Accent correction method based on cloud computing

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087655A (en) * 2018-07-30 2018-12-25 桂林电子科技大学 A kind of monitoring of traffic route sound and exceptional sound recognition system

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3895657B2 (en) * 2002-10-01 2007-03-22 三菱電機エンジニアリング株式会社 Accident sound detection circuit
CN100507971C (en) * 2007-10-31 2009-07-01 北京航空航天大学 Independent component analysis based automobile sound identification method
CN101980336B (en) * 2010-10-18 2012-01-11 福州星网视易信息系统有限公司 Hidden Markov model-based vehicle sound identification method
KR101768145B1 (en) * 2016-04-21 2017-08-14 현대자동차주식회사 Method for providing sound detection information, apparatus detecting sound around vehicle, and vehicle including the same
US10276187B2 (en) * 2016-10-19 2019-04-30 Ford Global Technologies, Llc Vehicle ambient audio classification via neural network machine learning
CN106846803B (en) * 2017-02-08 2023-06-23 广西交通科学研究院有限公司 Traffic event detection device and method based on audio frequency
CN106910495A (en) * 2017-04-26 2017-06-30 中国科学院微电子研究所 A kind of audio classification system and method for being applied to abnormal sound detection
CN107545890A (en) * 2017-08-31 2018-01-05 桂林电子科技大学 A kind of sound event recognition method
US10629081B2 (en) * 2017-11-02 2020-04-21 Ford Global Technologies, Llc Accelerometer-based external sound monitoring for backup assistance in a vehicle
CN108231067A (en) * 2018-01-13 2018-06-29 福州大学 Sound scenery recognition methods based on convolutional neural networks and random forest classification
CN109087635A (en) * 2018-08-30 2018-12-25 湖北工业大学 A kind of speech-sound intelligent classification method and system
CN109346103B (en) * 2018-10-30 2023-03-28 交通运输部公路科学研究所 Audio detection method for road tunnel traffic incident

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109087655A (en) * 2018-07-30 2018-12-25 桂林电子科技大学 A kind of monitoring of traffic route sound and exceptional sound recognition system

Also Published As

Publication number Publication date
CN110176248A (en) 2019-08-27

Similar Documents

Publication Publication Date Title
CN110176248B (en) Road voice recognition method, system, computer device and readable storage medium
Knight et al. Recommendations for acoustic recognizer performance assessment with application to five common automated signal recognition programs
CN108877775B (en) Voice data processing method and device, computer equipment and storage medium
US10460747B2 (en) Frequency based audio analysis using neural networks
CN109065027B (en) Voice distinguishing model training method and device, computer equipment and storage medium
Zhang et al. Automatic bird vocalization identification based on fusion of spectral pattern and texture features
CN111986699B (en) Sound event detection method based on full convolution network
CN111341294B (en) Method for converting text into voice with specified style
CN113205820B (en) Method for generating voice coder for voice event detection
CN112767927A (en) Method, device, terminal and storage medium for extracting voice features
US20200090040A1 (en) Apparatus for processing a signal
Xie et al. Application of image processing techniques for frog call classification
Naranjo-Alcazar et al. On the performance of residual block design alternatives in convolutional neural networks for end-to-end audio classification
CN114863938A (en) Bird language identification method and system based on attention residual error and feature fusion
CN115034254A (en) Nuclide identification method based on HHT (Hilbert-Huang transform) frequency band energy features and convolutional neural network
CN110458071A (en) A kind of fiber-optic vibration signal characteristic abstraction and classification method based on DWT-DFPA-GBDT
Chaves et al. Katydids acoustic classification on verification approach based on MFCC and HMM
Xie et al. Acoustic feature extraction using perceptual wavelet packet decomposition for frog call classification
CN115497564A (en) Antigen identification model establishing method and antigen identification method
CN115547347A (en) Whale acoustic signal identification method and system based on multi-scale time-frequency feature extraction
CN114201993A (en) Three-branch attention feature fusion method and system for detecting ultrasonic defects
Mamutova et al. DEVELOPING A SPEECH EMOTION RECOGNITION SYSTEM USING CNN ENCODERS WITH ATTENTION FOCUS
CN113160823A (en) Voice awakening method and device based on pulse neural network and electronic equipment
CN110458219A (en) A kind of Φ-OTDR vibration signal recognizer based on STFT-CNN-RVFL
CN110689875A (en) Language identification method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 530001 No. 6, Gaoxin 2nd Road, Xixiangtang District, Nanning City, Guangxi Zhuang Autonomous Region

Applicant after: Guangxi Jiaoke Group Co.,Ltd.

Address before: 530000 No. 6, Gaoxin 2nd Road, Xixiangtang District, Nanning City, Guangxi Zhuang Autonomous Region

Applicant before: GUANGXI TRANSPORTATION RESEARCH & CONSULTING Co.,Ltd.

GR01 Patent grant