CN114220458A - Sound identification method and device based on array hydrophone


Info

Publication number: CN114220458A
Authority: CN (China)
Prior art keywords: target, classification model, audio signals, sound, target audio
Legal status: Granted
Application number: CN202111356702.XA
Other languages: Chinese (zh)
Other versions: CN114220458B (en)
Inventors: Shen Heping (申和平), Zhang Haibing (张海兵), Feng Quansheng (冯权胜)
Current assignee: Wuhan Puhui Ocean Photoelectric Technology Co., Ltd.
Original assignee: Wuhan Puhui Ocean Photoelectric Technology Co., Ltd.
Application filed by Wuhan Puhui Ocean Photoelectric Technology Co., Ltd.
Priority to CN202111356702.XA; granted and published as CN114220458B
Current legal status: Active

Classifications

    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L25/18 Extracted parameters: spectral information of each sub-band
    • G10L25/21 Extracted parameters: power information
    • G10L25/24 Extracted parameters: the cepstrum
    • G10L25/30 Analysis technique: using neural networks
    • H04R1/44 Details of transducers: special adaptations for subaqueous use, e.g. for hydrophones


Abstract

The invention provides a sound identification method and device based on an array hydrophone. The method comprises the following steps: collecting a target sound with the multiple elements of the array hydrophone to obtain a target audio signal for each element; extracting multiple features from each element's target audio signal and fusing them into a single fusion feature; inputting the fusion feature of each element's target audio signal into a classification model to obtain the prediction category of the target sound for that element; and fusing the prediction categories corresponding to the target audio signals of the multiple elements to obtain the final category of the target sound. By classifying on fused features and multi-element signals, the method effectively improves the accuracy and robustness of the classification result; learning the multiple target audio signals synchronously during classification further strengthens the discrimination ability of the classification model and the accuracy of the classification result.

Description

Sound identification method and device based on array hydrophone
Technical Field
The invention relates to the technical field of audio processing, in particular to a sound identification method and device based on an array hydrophone.
Background
Intelligent recognition of sound targets in environmental sound signals against a real-world background is a hot research topic in the field of acoustic signal processing, with wide military and civil applications such as underwater monitoring, security monitoring, medical monitoring, smart homes, and intelligent city management.
Real environments are complex and changeable; the signals are non-periodic and non-stationary, have a low signal-to-noise ratio, and most audio data contain noise interference, all of which make sound signal classification difficult. Traditional spectral feature analysis methods give unsatisfactory recognition results for such sound signals, with low recognition accuracy.
With the wide application of deep learning in computer vision, sound target recognition based on artificial intelligence has emerged. Thanks to their strong learning ability, adaptivity, and nonlinear mapping ability, artificial neural networks can now replace traditional spectral feature analysis and have become the mainstream method for sound target recognition.
However, existing artificial neural networks generally identify the category of a sound target from single-channel sound data and a single feature. Real environments are complex and changeable, and multiple sound targets are often mixed in the same environment; single-channel data and a single feature can hardly represent and distinguish different sound targets, so the accuracy of the recognition result is hard to guarantee.
Disclosure of Invention
The invention provides a sound identification method and a sound identification device based on an array hydrophone, which overcome the prior-art defect that identifying a sound target's category from single-channel sound data and a single feature yields inaccurate results, and thereby improve the accuracy of sound identification.
The invention provides a sound identification method based on an array hydrophone, which comprises the following steps:
collecting target sound based on a plurality of elements of the array hydrophone to obtain target audio signals of the plurality of elements;
extracting various features of the target audio signal of each element, and performing feature fusion on the various features to obtain fusion features of the target audio signal of each element;
inputting the fusion characteristics of the target audio signals of each element into a classification model to obtain the prediction category of the target sound corresponding to the target audio signals of each element;
fusing the prediction categories of the target sound corresponding to the target audio signals of the multiple elements to obtain the final category of the target sound;
the classification model is obtained by training based on the fusion characteristics of the sample audio signals and the reference classes of the sample sounds corresponding to the sample audio signals.
According to the sound identification method based on the array hydrophone provided by the invention, the feature extraction comprises chroma energy normalized (CENS) feature extraction, spectral contrast feature extraction, and mel-frequency cepstrum coefficient feature extraction.
According to the sound identification method based on the array hydrophone provided by the invention, fusing the prediction categories of the target sound corresponding to the target audio signals of the multiple elements to obtain the final category of the target sound comprises the following step:
based on a voting method, counting which prediction category occurs most frequently among the prediction categories corresponding to the multiple elements, and taking that most frequent prediction category as the final category of the target sound.
According to the sound identification method based on the array hydrophone provided by the invention, before the fusion characteristics of the target audio signals of each element are input into the classification model to obtain the prediction category of the target sound corresponding to the target audio signals of each element, the method further comprises the following steps:
and performing iterative training on the classification model according to the fusion characteristics of the sample audio signal and the reference category of the sample sound to iteratively optimize and adjust the weight parameters of the classification model, and storing the weight parameters of the classification model iterated each time until the loss value of the classification model converges and/or the classification accuracy meets a preset termination condition.
According to the sound identification method based on the array hydrophone provided by the invention, the method further comprises the following steps:
determining the optimal weight parameters of the classification model according to the result indicators obtained by verifying each iteration's weight parameters on a verification set;
wherein the result indicators comprise one or a combination of several of accuracy, average accuracy, recall, and F-score.
According to the sound identification method based on the array hydrophone, provided by the invention, the iterative training of the classification model is carried out according to the fusion characteristics of the sample audio signals and the reference category of the sample sound, and the iterative training comprises the following steps:
preprocessing the sample audio signal;
wherein the pre-processing comprises a combination of one or more of normalization, resizing, and data enhancement;
and performing iterative training on the parameters of the classification model according to the fusion characteristics of the preprocessed sample audio signals and the reference classes.
According to the array hydrophone-based sound recognition method provided by the invention, the data enhancement comprises one or a combination of several of mixed-noise data enhancement, time-offset data enhancement, and cut-and-fill data enhancement using sounds of the same kind.
The invention also provides a sound recognition device based on the array hydrophone, which comprises:
the acquisition module is used for acquiring target sound based on a plurality of elements of the array hydrophone to obtain target audio signals of the plurality of elements;
the feature extraction module is used for extracting multiple features from the target audio signal of each element and fusing the multiple features to obtain a fusion feature of the target audio signal of each element;
the first testing module is used for inputting the fusion feature of the target audio signal of each element into the classification model to obtain the prediction category of the target sound corresponding to the target audio signal of each element;
the second testing module is used for fusing the prediction categories of the target sound corresponding to the target audio signals of the multiple elements to obtain the final category of the target sound;
the classification model is obtained by training based on the fusion characteristics of the sample audio signals and the reference classes of the sample sounds corresponding to the sample audio signals.
The sound recognition device based on the array hydrophone further comprises a classification model training module, and is used for:
and performing iterative training on the classification model according to the fusion characteristics of the sample audio signal and the reference category of the sample sound to iteratively optimize and adjust the weight parameters of the classification model, and storing the weight parameters of the classification model iterated each time until the loss value of the classification model converges and/or the classification accuracy meets a preset termination condition.
The sound recognition device based on the array hydrophone further comprises a classification model evaluation module, and is used for:
determine the optimal weight parameters of the classification model according to the result indicators obtained by verifying each iteration's weight parameters on a verification set;
wherein the result indicators comprise one or a combination of several of accuracy, average accuracy, recall, and F-score.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the array hydrophone based sound identification method as described in any one of the above.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the array hydrophone based sound identification method as described in any one of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the array hydrophone based sound identification method as described in any one of the above.
According to the sound identification method and device based on the array hydrophone, target audio signals are collected by the multiple elements of the array hydrophone, and multiple features of each element's target audio signal are fused as the input information of the classification model, so that the input information contains rich features characterizing the target sound. The fusion feature of each element's target audio signal is input into the classification model to obtain the prediction category of the target sound for that element, and the prediction categories corresponding to the multiple elements are then fused to obtain the final category of the target sound, which effectively improves the accuracy and robustness of the classification result. During classification, the target audio signals of the multiple elements are learned synchronously, which effectively strengthens the discrimination ability of the classification model and further improves the accuracy of the classification result.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow diagram of a method for acoustic recognition based on array hydrophones according to the present invention;
FIG. 2 is a schematic structural diagram of a classification model in the array hydrophone based sound recognition method provided by the invention;
FIG. 3 is a second schematic flow chart of the array hydrophone based sound recognition method provided by the present invention;
FIG. 4 is a schematic diagram of a sound recognition apparatus based on array hydrophones according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The array hydrophone based sound identification method of the present invention is described below with reference to fig. 1, and includes:
Step 101, collecting a target sound based on a plurality of elements of an array hydrophone to obtain target audio signals of the plurality of elements;
the voice recognition method in the embodiment can be applied to the fields of underwater monitoring, security monitoring, medical monitoring, intelligent home furnishing, urban intelligent management and the like;
the voice recognition method in the present embodiment can be applied to various systems or apparatuses, such as actuators; the executor may be an intelligent terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, a vehicle-mounted terminal, and the like, and may also be a server or a cloud, and the like, which is not specifically limited in this embodiment.
The type and operating environment of the execution body can be set according to actual requirements; for example, it can be a computer with an NVIDIA GeForce RTX 3080 GPU (Graphics Processing Unit). The software implementation takes Windows as its platform, adopts the Python programming language, and is developed on the TensorFlow deep learning framework.
The array hydrophone comprises multiple elements, which synchronously collect the target sound. The attributes of the array hydrophone, such as the number of elements, the shape of the array, and the spatial positions of the elements, can be configured according to actual requirements; for example, the number of elements on an array hydrophone may be 8 or 10.
The target sound is the sound to be classified and may be, for example, an underwater sound, car sound, truck sound, or footstep sound; this embodiment does not specifically limit it.
Optionally, the target sound may be collected simultaneously by the multiple elements of the array hydrophone to obtain the target audio signals of the multiple elements, that is, one target audio signal of the target sound collected by each element.
It should be noted that the target audio signal may be an audio signal acquired by acquiring the target sound in real time based on a plurality of elements of the array hydrophone, or may be a target audio signal of the target sound acquired from a local storage of the plurality of elements of the array hydrophone. The source of the target sound is not particularly limited in this embodiment.
Step 102, extracting various features of the target audio signal of each element, and performing feature fusion on the various features to obtain the fusion feature of the target audio signal of each element;
optionally, before the target audio signal of each element is input into the classification model, multiple kinds of feature extraction may be performed on it to select feature matrices with a certain degree of discrimination for the scene.
Each extracted feature matrix is two-dimensional; the feature matrices are finally spliced into a new two-dimensional feature matrix to obtain the fusion feature of each element's target audio signal.
Compared with the prior art, which classifies sounds using a single feature, this embodiment extracts multiple features from the target audio signal, obtaining richer features that better characterize the signal and thus making the classification result of the target sound more accurate.
Step 103, inputting the fusion feature of the target audio signal of each element into a classification model to obtain the prediction category of the target sound corresponding to the target audio signal of each element;
the classification model includes, but is not limited to, an input layer, a hidden layer, an active layer, a pooling layer, a softmax classification layer, and an output layer, and the specific structure of the classification model is not specifically limited in this embodiment.
The classification model can be any of various deep learning models, such as a convolutional neural network; it includes, but is not limited to, VGG (Visual Geometry Group network), ResNet (Residual Network), the classical Inception v3 and Xception convolutional neural networks, DenseNet (Densely Connected Convolutional Network), and the like; this embodiment does not specifically limit the type of the classification model.
Fig. 2 illustrates an exemplary classification model. The classification model comprises one Entry Flow, eight Middle Flows, and one Exit Flow. The Entry Flow mainly keeps downsampling to reduce the spatial dimensions; the Middle Flow continually learns association relationships and optimizes the features; the Exit Flow gathers and sorts the features for the fully connected layers to express. The classification model has 14 modules; except for the first and last modules, the modules are connected by residual structures. The classification model contains 36 convolutional layers, in which the introduced depthwise separable convolution layers greatly reduce the parameter count, and correspondingly the computation, compared with standard convolutional layers.
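To make this structure concrete, the following is a minimal Keras sketch of an Xception-style classifier of the kind described above (an entry flow that downsamples, residual middle blocks built from depthwise separable convolutions, and an exit flow feeding the classifier). The layer widths, the 98 x 59 single-channel input, and the four-class output are illustrative assumptions, not the patent's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def middle_block(x, filters=256):
    """One residual middle-flow block built from depthwise separable convolutions."""
    shortcut = x
    for _ in range(3):
        x = layers.ReLU()(x)
        x = layers.SeparableConv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
    return layers.Add()([x, shortcut])  # residual connection between modules

def build_model(num_classes=4, input_shape=(98, 59, 1)):
    inputs = layers.Input(shape=input_shape)
    # Entry flow: ordinary convolutions that keep downsampling the spatial dimensions.
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)
    # Middle flow: eight residual blocks that keep refining the learned features.
    for _ in range(8):
        x = middle_block(x)
    # Exit flow: gather the features and hand them to the dense classifier.
    x = layers.SeparableConv2D(512, 3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```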
Optionally, after the fusion feature of each element's target audio signal is obtained, it may be input into the classification model directly, or after one or more kinds of preprocessing; this embodiment does not specifically limit this.
Then, after the classification model has learned the fusion feature of each element's target audio signal, it outputs the prediction category of the target sound corresponding to that element's target audio signal; the target audio signal of each element thus corresponds to one category of the target sound.
Step 104, fusing the prediction categories of the target sound corresponding to the target audio signals of the multiple elements to obtain the final category of the target sound.
Optionally, after the prediction categories of the target sound corresponding to the target audio signals of the multiple elements are obtained, these prediction categories may be fused to obtain the final category of the target sound.
Optionally, the fusion may add up the probabilities that the target sound corresponding to each element's target audio signal belongs to each preset category, determine the final probability that the target sound belongs to each preset category, and determine the final category of the target sound from those final probabilities.
Optionally, each preset category whose final probability is greater than a preset value may be taken as a target category; when there is exactly one target category, it is directly taken as the final category of the target sound; when there are several target categories, the one with the highest probability is taken as the final category of the target sound.
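A minimal sketch of this probability-sum fusion, assuming per-element softmax outputs are available; falling back to the highest summed probability when no single category exceeds the threshold is an added assumption:

```python
import numpy as np

def fuse_probabilities(element_probs: np.ndarray, threshold: float) -> int:
    """element_probs: (n_elements, n_classes) softmax outputs, one row per element.
    Returns the index of the final category of the target sound."""
    final_prob = element_probs.sum(axis=0)            # summed probability per preset category
    candidates = np.flatnonzero(final_prob > threshold)
    if len(candidates) == 1:                          # exactly one category passes the threshold
        return int(candidates[0])
    return int(final_prob.argmax())                   # several (or no) candidates: highest sum wins
```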
Alternatively, based on a voting method, one category may be selected from the prediction categories of the target sound corresponding to the multiple elements' target audio signals as the final category of the target sound;
or the prediction categories of the target sound corresponding to the multiple elements' target audio signals may be fused based on Dempster-Shafer (D-S) evidence theory to obtain the final category of the target sound; this embodiment does not specifically limit the fusion method.
Existing traditional sound identification adapts poorly to the environment: a slight change in the background, or other interference, can make the recognition rate very low, and robustness is poor. This embodiment instead uses the multiple elements of the array hydrophone to collect the target sound synchronously, identifies the multiple elements' target audio signals synchronously, and fuses the identification results of the several target audio signals to obtain the final classification result, so occasional interference does not affect the overall classification, and recognition accuracy is higher.
In addition, compared with a single-channel sound collection system, the array hydrophone can receive target sounds from farther away, enabling long-range identification.
In this embodiment, target audio signals are collected by the multiple elements of an array hydrophone, and multiple features of each element's target audio signal are fused as the input information of the classification model, so that the input information contains rich features characterizing the target sound. The fusion feature of each element's target audio signal is input into the classification model to obtain the prediction category of the target sound for that element, and the prediction categories corresponding to the multiple elements are then fused to obtain the final category of the target sound, effectively improving the accuracy and robustness of the classification result. During classification, the target audio signals of the multiple elements are learned synchronously, which effectively strengthens the discrimination ability of the classification model and further improves the accuracy of the classification result.
On the basis of the above embodiments, the feature extraction in this embodiment includes chroma energy normalized (CENS) feature extraction, spectral contrast feature extraction, and mel-frequency cepstrum coefficient feature extraction.
Optionally, the feature extraction may, for example, perform chroma energy normalized feature extraction, spectral contrast feature extraction, and mel-frequency cepstrum coefficient feature extraction on the time-domain data of each audio signal; this embodiment does not specifically limit this.
It should be noted that the feature extraction in the present embodiment is not limited to the above feature extraction methods, and other feature extraction methods that can represent the effective features of the target sound may be used.
The following describes the sound recognition method in this embodiment for the case where chroma energy normalized feature extraction, spectral contrast feature extraction, and mel-frequency cepstrum coefficient feature extraction are all performed on each audio signal.
Optionally, so that the input information of the classification model contains rich features of the target sound, two-dimensional chroma energy normalized feature extraction, two-dimensional spectral contrast feature extraction, and two-dimensional mel-frequency cepstrum coefficient feature extraction may all be performed on each audio signal;
then, the three feature matrices of each audio signal are spliced into a new two-dimensional feature matrix; finally, the new two-dimensional feature matrices of the multiple target audio signals are input into the classification model to obtain the category of the target sound corresponding to each target audio signal output by the model.
For example, for any audio signal, two-dimensional chroma energy normalized feature extraction on its time-domain data yields a feature matrix of size 98 x 12; two-dimensional spectral contrast feature extraction yields a feature matrix of size 98 x 7; and two-dimensional mel-frequency cepstrum coefficient feature extraction yields a feature matrix of size 98 x 40. The several feature matrices of the audio signal are spliced into a new two-dimensional feature matrix of size 98 x 59, and the new feature matrix of each element's audio signal is then input into the classification model for learning, to obtain the category of the target sound corresponding to each element's target audio signal.
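A minimal sketch of this per-element feature fusion, assuming librosa is used for feature extraction; with 5 s clips at 10 kHz and the default hop length of 512, each extractor produces 98 frames, so the spliced matrix is 98 x 59 as in the example above.

```python
import numpy as np
import librosa

def fuse_features(y: np.ndarray, sr: int = 10000, hop: int = 512) -> np.ndarray:
    """Extract CENS chroma, spectral contrast, and MFCC features from one
    element's time-domain signal and splice them into one 2-D matrix."""
    cens = librosa.feature.chroma_cens(y=y, sr=sr, hop_length=hop)            # (12, frames)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr, hop_length=hop)  # (7, frames), default n_bands=6
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, hop_length=hop)        # (40, frames)
    # Transpose to (frames, features) and splice along the feature axis:
    # 98x12 + 98x7 + 98x40 -> 98x59 for a 5 s clip at 10 kHz.
    return np.concatenate([cens.T, contrast.T, mfcc.T], axis=1)
```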
In this embodiment, features are extracted from each element's target audio signal before it is input into the classification model, so that the input information contains rich features of the target sound; this effectively improves the discrimination and characterization ability of the target-sound features and yields a more accurate classification result.
On the basis of the foregoing embodiment, in this embodiment, fusing the prediction categories of the target sound corresponding to the target audio signals of the multiple elements to obtain the final category of the target sound comprises: based on a voting method, counting which prediction category occurs most frequently among the prediction categories corresponding to the multiple elements, and taking that most frequent prediction category as the final category of the target sound.
Optionally, after each element's target audio signal is input into the classification model, the prediction category of the target sound corresponding to that element's target audio signal is obtained;
then, based on the voting method, the prediction categories of the target sound corresponding to the multiple elements' target audio signals are first counted to obtain the frequency of each prediction category; the prediction category with the highest frequency is taken as the final category of the target sound, and the classification probabilities of the target sound corresponding to the multiple elements' target audio signals are averaged as the final classification probability of the target sound.
By fusing the prediction categories of the target sound corresponding to the multiple elements' target audio signals through voting to obtain the final category, this method effectively overcomes the prior-art problems of a single feature characterization method, a single signal channel, low recognition accuracy, poor robustness, and a high false alarm rate, and achieves high-precision classification of target sounds in a complex acoustic environment.
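A minimal sketch of the voting fusion, assuming each element's predicted class index and softmax probabilities are available:

```python
import numpy as np
from collections import Counter

def vote(element_classes: list, element_probs: np.ndarray):
    """element_classes: predicted class index per element;
    element_probs: (n_elements, n_classes) softmax outputs."""
    final_class, _ = Counter(element_classes).most_common(1)[0]  # most frequent prediction
    final_prob = element_probs.mean(axis=0)[final_class]         # averaged classification probability
    return final_class, final_prob
```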
On the basis of the foregoing embodiments, before the fusion feature of each element's target audio signal is input into the classification model to obtain the prediction category of the target sound corresponding to each element's target audio signal, this embodiment further includes: iteratively training the classification model on the fusion features of the sample audio signals and the reference categories of the sample sounds so as to iteratively optimize and adjust its weight parameters, and storing the weight parameters of the classification model at each iteration, until the loss value of the classification model converges and/or the classification accuracy meets a preset termination condition.
Optionally, before the multiple target audio signals are input into the classification model, the classification model must be trained; to train it, a data set needs to be constructed.
Optionally, a set number of audio signals may be selected from the local storage of various sound collection devices to construct the data set; the selected number may be, for example, 8000 groups. The data set contains multiple sound categories, such as environmental background noise, car sounds, truck sounds, and footstep sounds; each category of sound is stored as a multi-channel wav-format audio file.
The sampling frequency and the duration of each sampling of the sound collection device can be set according to actual requirements, for example, a sampling frequency of 10000 Hz and a duration of 5 s.
In the actual simulation, the data set can be divided in a certain proportion: a corresponding number of groups of audio signals serve as training samples (the sample audio signals), a corresponding number as the verification set, and a corresponding number as the test set (the target audio signals).
For example, the footstep-sound subset contains 2000 groups of 5 s audio signals; at a ratio of 80%:20%, this subset is randomly divided into 1600 groups of audio signals for the input set and 400 groups for the test set. The 1600 groups of the input set are then randomly divided, again at a ratio of 80%:20%, into 1280 groups for the training set and 320 groups for the verification set. The other classes of sound may be divided in the same way.
For a data set containing the four categories of environmental background noise, car sound, truck sound, and footstep sound, this division yields 5120 groups of audio signals in the training set, 1280 groups in the verification set, and 1600 groups in the test set.
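A sketch of these nested 80%:20% splits, using scikit-learn's train_test_split (the splitting utility is an assumption; any random splitter would do):

```python
from sklearn.model_selection import train_test_split

def split_subset(signals, labels, seed=0):
    # 2000 groups -> 1600 input / 400 test, then 1600 -> 1280 train / 320 validation.
    x_in, x_test, y_in, y_test = train_test_split(
        signals, labels, test_size=0.2, random_state=seed)
    x_train, x_val, y_train, y_val = train_test_split(
        x_in, y_in, test_size=0.2, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```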
As shown in fig. 3, when training the classification model, the sample audio signals in the training set may be input into the model in batches; the loss function is evaluated under each batch's parameters so as to minimize its value. The number of groups of sample data in each batch may be set according to actual requirements, for example 16 groups of sample sounds per batch.
During training, the weight parameters of the classification model must also be stored at each training iteration. They may be stored after each iteration finishes, or stored once per preset period, for example once every 10 epochs plus once after the final training step. This embodiment does not specifically limit how the weight parameters of the classification model are stored.
The parameters of the classification model are continually updated according to the value of the loss function after each training iteration and the accuracy on the verification set, continually improving its precision, until one or more of the following are met: the loss value of the classification model converges, the classification accuracy meets a preset termination condition, or the maximum number of iterations is reached; training then stops, yielding a classification model with high classification accuracy. Compared with traditional sound recognition models, feature extraction, separation, and training with deeper and wider convolutional neural networks can adapt to background interference, giving stronger robustness and high recognition accuracy.
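A hedged TensorFlow training sketch matching this description: batches of 16, weights checkpointed every 10 epochs and after the final epoch. The optimizer, loss, epoch count, and file names are illustrative assumptions, and x_train/y_train/x_val/y_val are the splits from the sketch above.

```python
model = build_model()  # e.g. the Xception-style sketch above
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
epochs = 100
for epoch in range(1, epochs + 1):
    # One pass over the training set in batches of 16 groups of sample sounds.
    model.fit(x_train, y_train, batch_size=16, epochs=1,
              validation_data=(x_val, y_val), verbose=0)
    if epoch % 10 == 0 or epoch == epochs:           # checkpoint every 10 epochs and at the end
        model.save_weights(f"weights_epoch_{epoch:03d}.h5")
```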
In this embodiment, the classification model is trained on multiple sample audio signals, so it generalizes better to environmental changes, which helps accurately identify the category of a target sound in a real, complex environment; moreover, the evaluation indicators of the classification model adaptively select its optimal parameters, giving the model better performance.
On the basis of the above embodiment, this embodiment further includes: determining the optimal weight parameters of the classification model according to the result indicators obtained by verifying each iteration's weight parameters on a verification set; wherein the result indicators comprise one or a combination of several of accuracy, average accuracy, recall, and F-score.
Optionally, the performance of the classification model is mainly determined by its weight parameters; therefore, to obtain a classification model with higher accuracy and better suitability, the most suitable weight parameters must be selected.
To obtain the optimal parameters of the classification model, one set of weight parameters may be selected as optimal, from all the weight parameters stored during training, according to the result indicators obtained by verifying each set on the verification set.
The larger the accuracy, average accuracy, recall, and F-score, the better the performance of the classification model they characterize.
The F-score is the weighted harmonic mean of the precision rate and the recall rate, calculated as:

$$F_{score} = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$
where β is used to balance the importance of precision and recall, and P and R are precision and recall, respectively.
Optionally, when determining the optimal weight parameters of the classification model according to the result indicators of each iteration's weight parameters on the verification set, the performance index of each iteration's classification model may be calculated from one or a combination of several of its verification accuracy, average accuracy, recall, and F-score; this embodiment does not specifically limit this;
it should be noted that the result index in this embodiment is not limited to the above indexes, and other indexes that can represent the performance of the classification model may also be used.
Optionally, the optimal weight parameters may be those with the highest accuracy; or the highest average accuracy; or the highest combined accuracy, average accuracy, recall, and F-score; or the highest accuracy and average accuracy; or the highest accuracy and average accuracy together with the highest combined accuracy, average accuracy, recall, and F-score; or, additionally, the largest iteration epoch among those; this embodiment does not specifically limit this.
It should be noted that the optimal parameter may be an optimal parameter stored after the last iterative training, or may be an optimal parameter stored after a certain intermediate iterative training.
The following describes the sound recognition method in this embodiment in detail for the case where the performance index of each iteration's classification model is calculated from the accuracy, average accuracy, recall, and F-score together.
Optionally, to optimize the comprehensive performance of the classification model, the performance index of each iteration may be calculated by combining that iteration's accuracy, average accuracy, recall, and F-score;
the accuracy, the average accuracy, the recall rate and the score can be directly added to obtain the performance index of the classification model;
or inputting the accuracy, the average accuracy, the recall rate and the score into a fusion model, and fusing to obtain the performance index of the classification model;
or, carrying out weighted addition on the accuracy, the average accuracy, the recall rate and the score of each iteration classification model to obtain the performance index of each iteration classification model.
The weights of the accuracy, average accuracy, recall, and F-score can be set according to actual requirements, or determined with a weight analysis method.
The weight analysis method may be the analytic hierarchy process, a cyclic-ratio analysis method, principal component analysis, an entropy method, or the like; this embodiment does not limit the type of weight analysis method.
The accuracy, average accuracy, recall, and F-score of each iteration's classification model are multiplied by their corresponding weights and summed to obtain the performance index of that iteration's classification model; the specific calculation formula is:

$$P_i = \lambda_{i,1} S_{i,1} + \lambda_{i,2} S_{i,2} + \lambda_{i,3} S_{i,3} + \lambda_{i,4} S_{i,4}$$

where $P_i$ is the performance index of the $i$-th iteration's classification model, $\lambda_{i,1}$ to $\lambda_{i,4}$ are weight coefficients, and $S_{i,1}$ to $S_{i,4}$ are respectively the accuracy, average accuracy, recall, and F-score of the $i$-th iteration's classification model.
Then, the performance indices of all iterations in the whole training process are sorted; if there is a single maximum performance index in the sorted result, the weight parameters corresponding to it are taken as the optimal weight parameters of the classification model.
If several iterations tie for the maximum performance index, the weight parameters of the one with the largest iteration epoch are taken as the optimal weight parameters.
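A sketch of this adaptive selection: compute the weighted performance index P_i for each saved checkpoint and keep the one with the largest index, breaking ties in favour of the later iteration. The equal default weights and the history record layout are illustrative assumptions.

```python
def best_checkpoint(history, weights=(0.25, 0.25, 0.25, 0.25)):
    """history: one dict per saved iteration with keys 'accuracy',
    'avg_accuracy', 'recall', 'f_score' and 'path' (hypothetical layout)."""
    best_i, best_p = -1, float("-inf")
    for i, h in enumerate(history):
        p = (weights[0] * h["accuracy"] + weights[1] * h["avg_accuracy"]
             + weights[2] * h["recall"] + weights[3] * h["f_score"])
        if p >= best_p:            # '>=' keeps the largest iteration epoch on ties
            best_i, best_p = i, p
    return history[best_i]["path"]
```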
Then, each element's target audio signal is input into the classification model with the optimal weight parameters, obtaining the prediction category of the target sound corresponding to each element's target audio signal output by the model.
Compared with the prior art, which directly takes the parameters of the last iteration as the model's optimal parameters, this embodiment adaptively selects the optimal weight parameters of the classification model from one or a combination of several of its accuracy, average accuracy, recall, and F-score, giving the classification model better comprehensive performance and a more accurate classification result.
On the basis of the foregoing embodiment, in this embodiment, the iteratively training the classification model according to the fusion features of the sample audio signal and the reference class of the sample sound includes: preprocessing the sample audio signal; wherein the pre-processing comprises a combination of one or more of normalization, resizing, and data enhancement; and performing iterative training on the parameters of the classification model according to the fusion characteristics of the preprocessed sample audio signals and the reference classes.
Wherein the normalization may scale the sample audio signal to [-1, 1]; the normalization includes, but is not limited to, max-min normalization and standard normalization, which this embodiment does not specifically limit.
The resizing adjusts the durations of all sample audio signals to a uniform length t: a signal shorter than t is padded with random values close to 0, and a signal longer than t is truncated.
For example, the duration of each sample audio signal may be unified to 5 s: signals longer than 5 s are cut to 5 s, and signals shorter than 5 s are padded to 5 s with random values close to 0.
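A minimal preprocessing sketch: max-abs normalization to [-1, 1] followed by padding or truncation to a uniform length t (here 5 s at 10 kHz, i.e. 50000 samples); the scale of the near-zero padding values is an illustrative assumption.

```python
import numpy as np

def preprocess(y: np.ndarray, target_len: int = 50000) -> np.ndarray:
    y = y / (np.max(np.abs(y)) + 1e-12)              # normalize to [-1, 1]
    if len(y) > target_len:                          # longer than t: truncate
        return y[:target_len]
    pad = np.random.uniform(-1e-4, 1e-4, target_len - len(y))  # fill with values close to 0
    return np.concatenate([y, pad])
```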
In this embodiment, preprocessing and data enhancement of the sample audio signals enable the trained classification model to adapt to background interference, with stronger robustness, generalization, and compatibility, improving classification accuracy.
The above-described preprocessing may be performed on the sample audio signal before iterative optimization of the parameters of the classification model using the sample audio signal and the reference classes.
It should be noted that normalization and/or resizing may also be performed on the target audio signals, to keep dimension and size from affecting the classification result and thus improve the accuracy of identifying the target sound's category.
On the basis of the above embodiments, the data enhancement in this embodiment includes one or a combination of several of mixed-noise data enhancement, time-offset data enhancement, and cut-and-fill data enhancement using sounds of the same kind.
The data enhancement includes, but is not limited to, one or more of mixed-noise data enhancement, time-offset data enhancement, and same-kind-sound cut-and-fill data enhancement; this embodiment does not specifically limit it.
Mixed-noise data enhancement may randomly select a segment of background noise, mix it with the sample audio signal, and randomly adjust the volume; some random noise, such as Gaussian noise, may also be added. This embodiment does not specifically limit how mixed-noise data enhancement is implemented.
Time-offset data enhancement may take a segment at the beginning or end of the sample audio signal, e.g., 1 to 3 seconds, and overwrite it with random values very close to 0;
same-kind-sound cut-and-fill data enhancement may randomly select two sample audio signals of the same kind, cut a part from each, and combine the parts into a new sample audio signal.
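Hedged sketches of the three augmentations described above; the segment lengths, gains, and noise amplitudes are illustrative assumptions, and mix_noise assumes the background recording is longer than the sample.

```python
import numpy as np

def mix_noise(y, noise, rng=np.random.default_rng()):
    start = rng.integers(0, len(noise) - len(y))     # random background segment
    gain = rng.uniform(0.1, 0.5)                     # randomly adjusted volume
    return y + gain * noise[start:start + len(y)]

def time_offset(y, sr=10000, rng=np.random.default_rng()):
    n = int(rng.integers(1 * sr, 3 * sr))            # 1 to 3 seconds at head or tail
    out = y.copy()
    mask = rng.uniform(-1e-4, 1e-4, n)               # random values very close to 0
    if rng.random() < 0.5:
        out[:n] = mask
    else:
        out[-n:] = mask
    return out

def cut_and_fill(y1, y2, rng=np.random.default_rng()):
    cut = int(rng.integers(1, len(y1)))              # splice two same-class clips of equal length
    return np.concatenate([y1[:cut], y2[cut:]])
```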
In this embodiment, data enhancement of the sample audio signals effectively expands the training set and improves robustness across environments, so that the trained classification model adapts to background interference with stronger robustness, generalization, and compatibility, improving classification accuracy.
The array hydrophone based sound recognition apparatus provided by the present invention is described below, and the array hydrophone based sound recognition apparatus described below and the array hydrophone based sound recognition method described above may be referred to in correspondence with each other.
As shown in fig. 4, the present embodiment provides a sound recognition apparatus based on an array hydrophone, which includes an acquisition module 401, a feature extraction module 402, a first test module 403, and a second test module 404, where:
the acquisition module 401 is configured to acquire a target sound based on a plurality of elements of the array hydrophone to obtain a target audio signal of the plurality of elements;
the voice recognition method in the embodiment can be applied to the fields of underwater monitoring, security monitoring, medical monitoring, intelligent home furnishing, urban intelligent management and the like;
the voice recognition method in the present embodiment can be applied to various systems or apparatuses, such as actuators; the executor may be an intelligent terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, a vehicle-mounted terminal, and the like, and may also be a server or a cloud, and the like, which is not specifically limited in this embodiment.
The array hydrophone comprises multiple elements, which synchronously collect the target sound; the attributes of the array hydrophone, such as the number of elements, the shape of the array, and the spatial positions of the elements, can be configured according to actual requirements. The target sound is the sound to be classified and may be, for example, an underwater sound, car sound, truck sound, or footstep sound; this embodiment does not specifically limit it.
Optionally, the target sound may be collected simultaneously by the multiple elements of the array hydrophone to obtain the target audio signals of the multiple elements, that is, one target audio signal of the target sound collected by each element.
It should be noted that the target audio signal may be an audio signal acquired by acquiring the target sound in real time based on a plurality of elements of the array hydrophone, or may be a target audio signal of the target sound acquired from a local storage of the plurality of elements of the array hydrophone. The source of the target sound is not particularly limited in this embodiment.
The feature extraction module 402 is configured to extract multiple features from the target audio signal of each element and fuse the multiple features to obtain a fusion feature of the target audio signal of each element;
optionally, before the target audio signal of each element is input into the classification model, multiple kinds of feature extraction may be performed on it to select feature matrices with a certain degree of discrimination for the scene.
Each extracted feature matrix is two-dimensional; the feature matrices are finally spliced into a new two-dimensional feature matrix to obtain the fusion feature of each element's target audio signal.
Compared with the prior art, which classifies sounds using a single feature, this embodiment extracts multiple features from the target audio signal, obtaining richer features that better characterize the signal and thus making the classification result of the target sound more accurate.
The first testing module 403 is configured to input the fusion feature of each element's target audio signal into a classification model to obtain the prediction category of the target sound corresponding to each element's target audio signal;
the classification model includes, but is not limited to, an input layer, a hidden layer, an active layer, a pooling layer, a softmax classification layer, and an output layer, and the specific structure of the classification model is not specifically limited in this embodiment.
Optionally, after the fusion feature of each element's target audio signal is obtained, it may be input into the classification model directly, or after one or more kinds of preprocessing; this embodiment does not specifically limit this.
Then, after the classification model has learned the fusion feature of each element's target audio signal, it outputs the prediction category of the target sound corresponding to that element's target audio signal; the target audio signal of each element corresponds to one category of the target sound.
The second testing module 404 is configured to fuse the prediction categories of the target sound corresponding to the target audio signals of the multiple elements to obtain the final category of the target sound.
Optionally, after the prediction categories of the target sounds corresponding to the target audio signals of the multiple primitives are obtained, the prediction categories of the target sounds corresponding to the target audio signals of the multiple primitives may be fused to obtain a final category of the target sounds.
Optionally, the fusion manner may be that the probabilities of the target sounds corresponding to the target audio signals of the multiple primitives belonging to each preset category are added, the final probability of the target sound belonging to each preset category is determined, and the final category of the target sound is determined according to the final probability of the target sound belonging to each preset category.
Optionally, every preset category whose final probability is greater than a preset value may be taken as a target category; when there is exactly one target category, it is directly taken as the final category of the target sound; when there are several, the target category with the highest probability is taken as the final category (an illustrative sketch of this fusion is given after the alternatives below).
Alternatively, one category may be selected from the prediction categories of the target sound corresponding to the target audio signals of the plurality of elements, based on a voting selection method, as the final category of the target sound;
or the prediction categories of the target sound corresponding to the target audio signals of the plurality of elements may be fused based on Dempster-Shafer (D-S) evidence theory to obtain the final category of the target sound; this embodiment does not specifically limit the fusion method.
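As a non-limiting illustration, the probability-summation variant above might look as follows in NumPy; the threshold value and the fallback when no category exceeds it are assumptions:

```python
import numpy as np

def fuse_by_probability_sum(probs, threshold=0.5):
    """probs: array of shape (n_elements, n_categories) holding each
    element's predicted probability per preset category."""
    final = probs.sum(axis=0)                  # add probabilities per category
    candidates = np.flatnonzero(final > threshold)
    if len(candidates) == 1:                   # exactly one target category
        return int(candidates[0])
    # several candidates, or none (a case the text leaves open):
    # fall back to the most probable category
    return int(final.argmax())
```

And one simple reading of the D-S alternative, restricted to singleton hypotheses (a simplifying assumption; the patent does not spell out the frame of discernment):

```python
def dempster_combine(m1, m2):
    """Combine two mass vectors over singleton categories with
    Dempster's rule: keep the agreeing mass and renormalize away
    the conflicting mass."""
    joint = np.outer(m1, m2)
    agreement = np.diag(joint).copy()          # mass on matching categories
    conflict = joint.sum() - agreement.sum()   # mass on mismatching pairs
    if np.isclose(conflict, 1.0):
        raise ValueError("total conflict: evidence cannot be combined")
    return agreement / (1.0 - conflict)

def fuse_by_ds(element_masses):
    """Fold Dempster's rule over all elements' mass vectors and
    return the index of the winning category."""
    combined = element_masses[0]
    for m in element_masses[1:]:
        combined = dempster_combine(combined, m)
    return int(np.argmax(combined))
```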
Conventional sound recognition adapts poorly to the environment: a slight change in the background, or other interference, can drive the recognition rate very low, i.e., robustness is poor. In this embodiment, by contrast, the multiple elements of the array hydrophone collect the target sound synchronously, the target audio signals of the multiple elements are recognized synchronously, and the recognition results for the individual target audio signals are fused into the final classification result; occasional interference affecting individual elements therefore does not impair the overall classification, so the recognition accuracy is higher.
In addition, compared with a single-channel sound collection system, the array hydrophone can receive the target sound from a greater distance, enabling long-range identification.
In this embodiment, the plurality of elements of the array hydrophone collect the target audio signals, and multiple features of the target audio signal of each element are fused to serve as the input of the classification model, so that the input contains rich features characterizing the target sound. The fusion feature of the target audio signal of each element is input into the classification model to obtain the prediction category of the target sound corresponding to that signal, and the prediction categories corresponding to the target audio signals of the plurality of elements are then fused to obtain the final category of the target sound, which effectively improves the accuracy and robustness of the classification result. During classification, the target audio signals of the plurality of elements are learned synchronously, which effectively strengthens the discrimination capability of the classification model and further improves the accuracy of the classification result.
On the basis of the above embodiments, the feature extraction in this embodiment includes chroma energy normalized (CENS) feature extraction, spectral contrast feature extraction, and mel-frequency cepstral coefficient feature extraction.
On the basis of the foregoing embodiments, the second testing module in this embodiment is specifically configured to: based on a voting selection method, count the most frequent prediction category among the prediction categories corresponding to the plurality of elements, and take that most frequent prediction category as the final category of the target sound.
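A minimal sketch of such a vote follows, with the caveat that the tie-breaking behavior (the first-seen category wins) is an assumption the text does not specify:

```python
from collections import Counter

def fuse_by_vote(predicted_categories):
    """Majority vote over the per-element prediction categories."""
    return Counter(predicted_categories).most_common(1)[0][0]
```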
On the basis of the foregoing embodiments, this embodiment further includes a classification model training module, configured to: perform iterative training of the classification model according to the fusion features of the sample audio signals and the reference categories of the sample sounds, so as to iteratively optimize the weight parameters of the classification model, saving the weight parameters of the classification model after each iteration, until the loss value of the classification model converges and/or the classification accuracy meets a preset termination condition.
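Purely as an illustrative sketch of this loop (the optimizer, learning rate, and both stopping thresholds are assumptions, and the checkpoint file names are hypothetical):

```python
import torch

def train_classifier(model, loader, n_epochs=50, target_acc=0.95, tol=1e-4):
    """Optimize the weight parameters epoch by epoch, save them after
    every iteration, and stop once the loss converges and/or the
    accuracy meets the preset termination condition."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()  # expects raw logits; in practice
                                           # the softmax layer is folded in here
    prev_loss = float("inf")
    for epoch in range(n_epochs):
        total_loss, correct, seen = 0.0, 0, 0
        for features, labels in loader:    # fused features + reference categories
            optimizer.zero_grad()
            logits = model(features)
            loss = loss_fn(logits, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * len(labels)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            seen += len(labels)
        torch.save(model.state_dict(), f"weights_epoch{epoch}.pt")
        epoch_loss, acc = total_loss / seen, correct / seen
        if abs(prev_loss - epoch_loss) < tol or acc >= target_acc:
            break                          # converged / termination condition met
        prev_loss = epoch_loss
```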
On the basis of the foregoing embodiment, this embodiment further includes a classification model evaluation module, configured to: determine the optimal weight parameters of the classification model according to the result indices obtained by validating, on a validation set, the weight parameters produced by each iteration; wherein the result indices comprise a combination of one or more of accuracy, average precision, recall, and F-score.
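These indices could, for instance, be computed with scikit-learn as below; macro averaging is an assumption, since the patent does not fix the averaging mode. The saved weight file whose iteration maximizes the chosen index would then be kept as the optimal parameters.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def validation_indices(y_true, y_pred):
    """Result indices for one set of weight parameters on the
    validation set."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "average_precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f_score": f1_score(y_true, y_pred, average="macro"),
    }
```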
On the basis of the foregoing embodiment, the classification model training module in this embodiment is specifically configured to: preprocess the sample audio signals, wherein the preprocessing comprises a combination of one or more of normalization, resizing, and data enhancement; and perform iterative training of the parameters of the classification model according to the fusion features of the preprocessed sample audio signals and the reference categories.
On the basis of the above embodiments, the data enhancement in this embodiment includes a combination of one or more of mixed-noise data enhancement, time-shift data enhancement, and cut-and-fill data enhancement using sounds of the same category.
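The three augmentations might be sketched as follows; all magnitudes and segment sizes are illustrative assumptions, and every sample is assumed to share the same length:

```python
import numpy as np

def augment(signal, same_class_pool, noise_level=0.01, rng=np.random):
    """Produce one augmented variant per enhancement type."""
    out = []
    # 1) mixed-noise augmentation: superimpose low-level Gaussian noise
    out.append(signal + noise_level * rng.standard_normal(len(signal)))
    # 2) time-shift augmentation: circularly shift the waveform
    out.append(np.roll(signal, rng.randint(1, len(signal))))
    # 3) cut-and-fill with a sound of the same category: overwrite a
    #    segment with the corresponding segment of another sample
    donor = same_class_pool[rng.randint(len(same_class_pool))]
    start = rng.randint(0, len(signal) // 2)
    length = len(signal) // 4
    patched = signal.copy()
    patched[start:start + length] = donor[start:start + length]
    out.append(patched)
    return out
```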
Fig. 5 illustrates the physical structure of an electronic device. As shown in Fig. 5, the electronic device may include: a processor (processor) 501, a communication interface (Communications Interface) 502, a memory (memory) 503, and a communication bus 504, wherein the processor 501, the communication interface 502, and the memory 503 communicate with one another via the communication bus 504. The processor 501 may invoke logic instructions in the memory 503 to perform the sound identification method based on an array hydrophone, the method comprising: collecting target sound based on a plurality of elements of the array hydrophone to obtain target audio signals of the plurality of elements; extracting multiple features from the target audio signal of each element, and fusing the multiple features to obtain the fusion feature of the target audio signal of each element; inputting the fusion feature of the target audio signal of each element into a classification model to obtain the prediction category of the target sound corresponding to the target audio signal of each element; and fusing the prediction categories of the target sound corresponding to the target audio signals of the plurality of elements to obtain the final category of the target sound; wherein the classification model is trained based on the fusion features of sample audio signals and the reference categories of the sample sounds corresponding to the sample audio signals.
In addition, the logic instructions in the memory 503 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, wherein, when the computer program is executed by a processor, the computer can execute the sound identification method based on an array hydrophone provided by the above methods, the method comprising: collecting target sound based on a plurality of elements of the array hydrophone to obtain target audio signals of the plurality of elements; extracting multiple features from the target audio signal of each element, and fusing the multiple features to obtain the fusion feature of the target audio signal of each element; inputting the fusion feature of the target audio signal of each element into a classification model to obtain the prediction category of the target sound corresponding to the target audio signal of each element; and fusing the prediction categories of the target sound corresponding to the target audio signals of the plurality of elements to obtain the final category of the target sound; wherein the classification model is trained based on the fusion features of sample audio signals and the reference categories of the sample sounds corresponding to the sample audio signals.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the sound identification method based on an array hydrophone provided by the above methods, the method comprising: collecting target sound based on a plurality of elements of the array hydrophone to obtain target audio signals of the plurality of elements; extracting multiple features from the target audio signal of each element, and fusing the multiple features to obtain the fusion feature of the target audio signal of each element; inputting the fusion feature of the target audio signal of each element into a classification model to obtain the prediction category of the target sound corresponding to the target audio signal of each element; and fusing the prediction categories of the target sound corresponding to the target audio signals of the plurality of elements to obtain the final category of the target sound; wherein the classification model is trained based on the fusion features of sample audio signals and the reference categories of the sample sounds corresponding to the sample audio signals.
The above-described apparatus embodiments are merely illustrative; units described as separate parts may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A sound identification method based on an array hydrophone, characterized by comprising the following steps:
collecting target sound based on a plurality of elements of the array hydrophone to obtain target audio signals of the plurality of elements;
extracting multiple features from the target audio signal of each element, and fusing the multiple features to obtain the fusion feature of the target audio signal of each element;
inputting the fusion feature of the target audio signal of each element into a classification model to obtain the prediction category of the target sound corresponding to the target audio signal of each element;
fusing the prediction categories of the target sound corresponding to the target audio signals of the plurality of elements to obtain the final category of the target sound;
wherein the classification model is obtained by training based on the fusion features of sample audio signals and the reference categories of the sample sounds corresponding to the sample audio signals.
2. The array hydrophone based sound identification method of claim 1, wherein the feature extraction comprises chroma energy normalized (CENS) feature extraction, spectral contrast feature extraction, and mel-frequency cepstral coefficient feature extraction.
3. The array hydrophone based sound identification method of claim 1, wherein the fusing of the prediction categories of the target sound corresponding to the target audio signals of the plurality of elements to obtain the final category of the target sound comprises:
based on a voting selection method, counting the most frequent prediction category among the prediction categories corresponding to the plurality of elements, and taking the most frequent prediction category as the final category of the target sound.
4. The array hydrophone based sound identification method of any of claims 1-3, wherein before inputting the fusion feature of the target audio signal of each element into the classification model to obtain the prediction category of the target sound corresponding to the target audio signal of each element, the method further comprises:
performing iterative training of the classification model according to the fusion features of the sample audio signals and the reference categories of the sample sounds, so as to iteratively optimize the weight parameters of the classification model, and saving the weight parameters of the classification model after each iteration, until the loss value of the classification model converges and/or the classification accuracy meets a preset termination condition.
5. The array hydrophone based sound identification method of claim 4, further comprising:
determining the optimal weight parameters of the classification model according to the result indices obtained by validating, on a validation set, the weight parameters produced by each iteration;
wherein the result indices comprise a combination of one or more of accuracy, average precision, recall, and F-score.
6. The array hydrophone based sound identification method of claim 4, wherein iteratively training the classification model based on the fusion features of the sample audio signals and the reference categories of the sample sounds comprises:
preprocessing the sample audio signals;
wherein the preprocessing comprises a combination of one or more of normalization, resizing, and data enhancement;
and performing iterative training of the parameters of the classification model according to the fusion features of the preprocessed sample audio signals and the reference categories.
7. The array hydrophone based sound identification method of claim 6, wherein the data enhancement comprises a combination of one or more of mixed-noise data enhancement, time-shift data enhancement, and cut-and-fill data enhancement using sounds of the same category.
8. An array hydrophone based sound identification apparatus, comprising:
the acquisition module, configured to collect target sound based on a plurality of elements of the array hydrophone to obtain target audio signals of the plurality of elements;
the feature extraction module, configured to extract multiple features from the target audio signal of each element and to fuse the multiple features to obtain the fusion feature of the target audio signal of each element;
the first testing module, configured to input the fusion feature of the target audio signal of each element into the classification model to obtain the prediction category of the target sound corresponding to the target audio signal of each element;
the second testing module, configured to fuse the prediction categories of the target sound corresponding to the target audio signals of the plurality of elements to obtain the final category of the target sound;
wherein the classification model is obtained by training based on the fusion features of sample audio signals and the reference categories of the sample sounds corresponding to the sample audio signals.
9. The array hydrophone based sound identification apparatus of claim 8, further comprising a classification model training module configured to:
perform iterative training of the classification model according to the fusion features of the sample audio signals and the reference categories of the sample sounds, so as to iteratively optimize the weight parameters of the classification model, and save the weight parameters of the classification model after each iteration, until the loss value of the classification model converges and/or the classification accuracy meets a preset termination condition.
10. The array hydrophone based sound identification apparatus of claim 8, further comprising a classification model evaluation module configured to:
determine the optimal weight parameters of the classification model according to the result indices obtained by validating, on a validation set, the weight parameters produced by each iteration;
wherein the result indices comprise a combination of one or more of accuracy, average precision, recall, and F-score.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the array hydrophone based sound identification method of any of claims 1-7.
12. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, performs the steps of the array hydrophone based sound identification method of any of claims 1-7.
CN202111356702.XA 2021-11-16 2021-11-16 Sound identification method and device based on array hydrophone Active CN114220458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111356702.XA CN114220458B (en) 2021-11-16 2021-11-16 Sound identification method and device based on array hydrophone

Publications (2)

Publication Number Publication Date
CN114220458A true CN114220458A (en) 2022-03-22
CN114220458B CN114220458B (en) 2024-04-05

Family

ID=80697292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111356702.XA Active CN114220458B (en) 2021-11-16 2021-11-16 Sound identification method and device based on array hydrophone

Country Status (1)

Country Link
CN (1) CN114220458B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120224715A1 (en) * 2011-03-03 2012-09-06 Microsoft Corporation Noise Adaptive Beamforming for Microphone Arrays
CN105845127A (en) * 2015-01-13 2016-08-10 阿里巴巴集团控股有限公司 Voice recognition method and system
CN110473568A (en) * 2019-08-08 2019-11-19 Oppo广东移动通信有限公司 Scene recognition method, device, storage medium and electronic equipment
CN112183582A (en) * 2020-09-07 2021-01-05 中国海洋大学 Multi-feature fusion underwater target identification method
CN112364779A (en) * 2020-11-12 2021-02-12 中国电子科技集团公司第五十四研究所 Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN112541547A (en) * 2020-12-14 2021-03-23 哈尔滨工程大学 Weighted fusion weight determination method supporting underwater rapid target recognition

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722884A (en) * 2022-06-08 2022-07-08 深圳市润东来科技有限公司 Audio control method, device and equipment based on environmental sound and storage medium
CN114722884B (en) * 2022-06-08 2022-09-30 深圳市润东来科技有限公司 Audio control method, device and equipment based on environmental sound and storage medium
CN116366082A (en) * 2023-01-30 2023-06-30 深圳大学 Method for improving signal-to-noise ratio of microsecond ultra-short pulse underwater sound signal
CN116366082B (en) * 2023-01-30 2023-12-01 深圳大学 Method for improving signal-to-noise ratio of microsecond ultra-short pulse underwater sound signal
CN117198330A (en) * 2023-11-07 2023-12-08 国家海洋技术中心 Sound source identification method and system and electronic equipment
CN117198330B (en) * 2023-11-07 2024-01-30 国家海洋技术中心 Sound source identification method and system and electronic equipment

Also Published As

Publication number Publication date
CN114220458B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN114220458A (en) Sound identification method and device based on array hydrophone
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN110600017A (en) Training method of voice processing model, voice recognition method, system and device
CN109784366A (en) The fine grit classification method, apparatus and electronic equipment of target object
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN111526119B (en) Abnormal flow detection method and device, electronic equipment and computer readable medium
CN109256118B (en) End-to-end Chinese dialect identification system and method based on generative auditory model
CN110298235A (en) Hyperspectral abnormity detection method and system based on manifold constraint autoencoder network
CN110147745A (en) A kind of key frame of video detection method and device
CN111881972B (en) Black-out user identification method and device, server and storage medium
CN109378014A (en) A kind of mobile device source discrimination and system based on convolutional neural networks
CN110046279A (en) Prediction technique, medium, device and the calculating equipment of video file feature
CN115587337A (en) Method, device and storage medium for recognizing abnormal sound of vehicle door
CN116958637A (en) Training method, device, equipment and storage medium of image detection model
Xiao et al. AMResNet: An automatic recognition model of bird sounds in real environment
CN112906883A (en) Hybrid precision quantization strategy determination method and system for deep neural network
CN116884435A (en) Voice event detection method and device based on audio prompt learning
CN112035700A (en) Voice deep hash learning method and system based on CNN
CN109117352B (en) Server performance prediction method and device
CN115795353A (en) Underground metal target classification method and system based on unbalanced data set
CN112951270B (en) Voice fluency detection method and device and electronic equipment
CN112529637B (en) Service demand dynamic prediction method and system based on context awareness
CN115267672A (en) Method for detecting and positioning sound source
CN113327616A (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN112163494A (en) Video false face detection method and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant