CN114220458B - Voice recognition method and device based on array hydrophone - Google Patents

Voice recognition method and device based on array hydrophone

Info

Publication number
CN114220458B
Authority
CN
China
Prior art keywords
classification model
target
audio signals
primitives
sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111356702.XA
Other languages
Chinese (zh)
Other versions
CN114220458A (en)
Inventor
申和平
张海兵
冯权胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Puhui Ocean Photoelectric Technology Co ltd
Original Assignee
Wuhan Puhui Ocean Photoelectric Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Puhui Ocean Photoelectric Technology Co ltd filed Critical Wuhan Puhui Ocean Photoelectric Technology Co ltd
Priority to CN202111356702.XA
Publication of CN114220458A
Application granted
Publication of CN114220458B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Extracted parameters being spectral information of each sub-band
    • G10L25/21: Extracted parameters being power information
    • G10L25/24: Extracted parameters being the cepstrum
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Analysis technique using neural networks
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00: Details of transducers, loudspeakers or microphones
    • H04R1/44: Special adaptations for subaqueous use, e.g. for hydrophone

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention provides a voice recognition method and device based on an array hydrophone. The method comprises the following steps: collecting a target sound with a plurality of primitives of the array hydrophone to obtain a target audio signal for each primitive; extracting multiple features from each primitive's target audio signal and fusing the multiple features; inputting the fused features of each primitive's target audio signal into a classification model to obtain a predicted category of the target sound for that primitive; and fusing the predicted categories corresponding to the target audio signals of the plurality of primitives to obtain the final category of the target sound. By classifying on fused features drawn from multi-primitive signals and obtaining the final category of the target sound from them, the method and device effectively improve the accuracy and robustness of the classification result; in the classifying process, the model learns synchronously from the multiple target audio signals, which effectively strengthens its discriminative capability and further improves the accuracy of the classification result.

Description

Voice recognition method and device based on array hydrophone
Technical Field
The invention relates to the technical field of audio processing, in particular to a voice recognition method and device based on an array hydrophone.
Background
The technology for intelligently identifying the sound target of the environment sound signal under the background of the real environment is a hot research subject in the field of sound signal processing, and is widely applied to the fields of military and civil use, such as the fields of underwater monitoring, security monitoring, medical monitoring, intelligent home, intelligent city management and the like.
The real environment is complex and changeable; sound signals are aperiodic and non-stationary, have a low signal-to-noise ratio, and most audio data contain noise interference, all of which make classifying sound signals difficult and challenging. Identifying sound signals with traditional spectral-feature analysis methods gives unsatisfactory results and low recognition accuracy.
With the wide application of deep learning in the field of computer vision, voice target recognition technology based on artificial intelligence methods has also been developed. At present, the artificial neural network can replace the traditional spectral feature analysis method by virtue of strong learning ability, self-adaptability and nonlinear mapping ability, and becomes a mainstream method for voice target recognition.
However, existing artificial neural networks generally use single-channel sound data and a single feature to identify the category of a sound target. Because the real environment is complex and changeable and multiple sound targets mix in the same environment, single-channel sound data and a single feature can hardly characterize and distinguish different sound targets, and the accuracy of the recognition result is hard to guarantee.
Disclosure of Invention
The invention provides a voice recognition method and device based on an array hydrophone, which address the prior-art defect that recognition results are inaccurate when single-channel sound data and a single feature are used to recognize the category of a sound target, and which improve the accuracy of voice recognition.
The invention provides a voice recognition method based on an array hydrophone, which comprises the following steps:
based on a plurality of primitives of the array hydrophone, collecting target sound to obtain target audio signals of the primitives;
extracting multiple characteristics from the target audio signals of each primitive, and carrying out characteristic fusion on the multiple characteristics to obtain fusion characteristics of the target audio signals of each primitive;
inputting the fusion characteristics of the target audio signals of each primitive into a classification model to obtain the prediction category of the target sound corresponding to the target audio signals of each primitive;
fusing the predicted categories of the target sound corresponding to the target audio signals of the plurality of primitives to obtain the final category of the target sound;
the classification model is obtained through training based on fusion characteristics of a sample audio signal and reference categories of sample sounds corresponding to the sample audio signal.
According to the voice recognition method based on the array hydrophone, the feature extraction comprises chroma energy normalized statistics (CENS) feature extraction, spectral contrast feature extraction and Mel-frequency cepstral coefficient (MFCC) feature extraction.
According to the voice recognition method based on the array hydrophone provided by the invention, the predicted categories of the target voice corresponding to the target audio signals of the plurality of primitives are fused to obtain the final category of the target voice, and the voice recognition method comprises the following steps:
based on a voting selection method, the prediction category with the highest frequency is counted among the prediction categories corresponding to the plurality of primitives, and the prediction category with the highest frequency is used as the final category of the target sound.
According to the voice recognition method based on the array hydrophone, before the fusion characteristics of the target audio signals of all the primitives are input into the classification model to obtain the prediction category of the target voice corresponding to the target audio signals of all the primitives, the voice recognition method based on the array hydrophone further comprises the following steps:
and carrying out iterative training on the classification model according to the fusion characteristics of the sample audio signals and the reference category of the sample sound so as to iteratively optimize and adjust the weight parameters of the classification model, and storing the weight parameters of the classification model in each iteration until the loss value of the classification model converges and/or the classification accuracy meets the preset termination condition.
The invention provides a voice recognition method based on an array hydrophone, which further comprises the following steps:
determining the optimal weight parameters of the classification model according to the result indexes of the weight parameters of the classification model verified in the verification set in each iteration;
wherein the outcome index comprises a combination of one or more of accuracy, average accuracy, recall, and F-score.
According to the voice recognition method based on the array hydrophone provided by the invention, the iterative training of the classification model is carried out according to the fusion characteristics of the sample audio signals and the reference category of the sample voice, and the voice recognition method comprises the following steps:
preprocessing the sample audio signal;
wherein the preprocessing includes a combination of one or more of normalization, resizing, and data enhancement;
and carrying out iterative training on parameters of the classification model according to fusion characteristics of the preprocessed sample audio signals and the reference category.
According to the voice recognition method based on the array hydrophone, the data enhancement comprises one or more of mixed noise data enhancement, time offset data enhancement and shearing filling data enhancement on the same type of voice.
The invention also provides a voice recognition device based on the array hydrophone, which comprises:
the acquisition module is used for acquiring target sound based on a plurality of primitives of the array hydrophone to obtain target audio signals of the primitives;
the feature extraction module is used for extracting various features of the target audio signals of the primitives and carrying out feature fusion on the various features to obtain fusion features of the target audio signals of the primitives;
the first test module is used for inputting the fusion characteristics of the target audio signals of the primitives into the classification model to obtain the prediction category of the target sound corresponding to the target audio signals of the primitives;
the second test module is used for fusing the predicted categories of the target sound corresponding to the target audio signals of the plurality of primitives to obtain the final category of the target sound;
the classification model is obtained through training based on fusion characteristics of a sample audio signal and reference categories of sample sounds corresponding to the sample audio signal.
The invention provides a voice recognition device based on an array hydrophone, which further comprises a classification model training module, wherein the classification model training module is used for:
and carrying out iterative training on the classification model according to the fusion characteristics of the sample audio signals and the reference category of the sample sound so as to iteratively optimize and adjust the weight parameters of the classification model, and storing the weight parameters of the classification model in each iteration until the loss value of the classification model converges and/or the classification accuracy meets the preset termination condition.
According to the invention, the voice recognition device based on the array hydrophone further comprises a classification model evaluation module, which is used for:
determining the optimal weight parameters of the classification model according to the result indexes of the weight parameters of the classification model verified in the verification set in each iteration;
wherein the outcome index comprises a combination of one or more of accuracy, average accuracy, recall, and F-score.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed implements the steps of the array hydrophone based voice recognition method as described in any of the above.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the array hydrophone based voice recognition method as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor carries out the steps of a method of array hydrophone based voice recognition as described in any of the above.
According to the voice recognition method and device based on the array hydrophone, the target audio signals are collected through the plurality of primitives of the array hydrophone, and multiple features of each primitive's target audio signal are fused to serve as the input information of the classification model, so that the input information contains rich features characterizing the target sound. The fused features of each primitive's target audio signal are input into the classification model to obtain the prediction category of the target sound for that primitive, and the prediction categories corresponding to the plurality of primitives are then fused to obtain the final category of the target sound, effectively improving the accuracy and robustness of the classification result. In the classifying process, synchronous learning over the target audio signals of the plurality of primitives effectively strengthens the discriminative capability of the classification model, further improving the accuracy of the classification result.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a method for voice recognition based on an array hydrophone provided by the invention;
FIG. 2 is a schematic diagram of a classification model in the array hydrophone-based voice recognition method provided by the invention;
FIG. 3 is a second flow chart of the method for voice recognition based on the array hydrophone provided by the invention;
FIG. 4 is a schematic diagram of the structure of the array hydrophone-based voice recognition device provided by the invention;
fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The array hydrophone-based voice recognition method of the present invention is described below in conjunction with fig. 1, and includes:
step 101, collecting target sound based on a plurality of primitives of an array hydrophone to obtain target audio signals of the primitives;
The voice recognition method in the embodiment can be applied to the fields of underwater monitoring, security monitoring, medical monitoring, intelligent home, urban intelligent management and the like;
the voice recognition method in the present embodiment can be applied to different systems or devices, such as an actuator; the actuator may be an intelligent terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, a vehicle-mounted terminal, etc., or may be a server or a cloud terminal, etc., which is not particularly limited in this embodiment.
The type and running environment of the actuator can be set according to actual requirements; for example, the actuator can be a computer with an NVIDIA GeForce RTX 3080 GPU (Graphics Processing Unit). The software implementation takes Windows as the platform, adopts the Python programming language, and is developed on the TensorFlow deep learning framework.
The multiple primitives on the array hydrophone are used for synchronously collecting target sound. The attribute of the array hydrophone can be configured according to actual requirements, such as the number of the primitives, the array shape, the space position among the primitives and the like. For example, the number of elements on an array hydrophone is 8 or 10, etc.
The target sound is a sound to be classified, and may be a water sound, a car sound, a truck sound, a footstep sound, or the like, which is not particularly limited in this embodiment.
Alternatively, the target sound may be collected based on a plurality of primitives of the array hydrophone at the same time, and a target audio signal of the plurality of primitives, that is, one target audio signal of the target sound collected by each primitive, may be obtained.
It should be noted that the target audio signal may be an audio signal obtained by acquiring the target sound in real time based on a plurality of primitives of the array hydrophone, or may be a target audio signal of the target sound acquired from a local storage of a plurality of primitives of the array hydrophone. The present embodiment does not specifically limit the source of the target sound.
102, extracting various features from the target audio signals of each primitive, and carrying out feature fusion on the various features to obtain fusion features of the target audio signals of each primitive;
optionally, before the target audio signal of each primitive is input into the classification model, multiple feature extraction may be performed on the target audio signal of each primitive to select a feature matrix with a certain degree of distinction in the scene.
Each feature matrix is a two-dimensional feature matrix, and finally the feature matrices are spliced and combined into a new two-dimensional feature matrix so as to obtain the fusion feature of the target audio signals of each primitive.
Compared with the prior art, in which a single feature is used to classify sounds, extracting multiple features characterizes the target audio signal more richly, which in turn makes the classification result of the target sound more accurate.
Step 103, inputting the fusion characteristics of the target audio signals of each primitive into a classification model to obtain the prediction category of the target sound corresponding to the target audio signals of each primitive;
the classification model includes, but is not limited to, an input layer, a hidden layer, an activation layer, a pooling layer, a softmax classification layer, and an output layer, and the specific structure of the classification model is not specifically limited in this embodiment.
The classification model may be any of various deep learning models, such as a convolutional neural network; classification models include, but are not limited to, the VGG (Visual Geometry Group) network, the ResNet (Residual Network) series, the classical Inception-v3 and Xception convolutional neural networks, and the DenseNet (Densely Connected Convolutional Networks) series. This embodiment does not particularly limit the type of the classification model.
An exemplary classification model is shown in fig. 2. The classification model comprises one Entry Flow, eight Middle Flow modules and one Exit Flow. The Entry Flow mainly performs repeated downsampling to reduce the spatial dimension; the Middle Flow repeatedly learns association relations and optimizes the features; the Exit Flow summarizes and sorts the features for the fully connected layer to express. The classification model has 14 modules in total, all but the first and last of which are wrapped in residual connections. The classification model comprises 36 convolution layers; the depthwise separable convolution layers it introduces have a greatly reduced parameter count compared with standard convolution layers, with a corresponding reduction in computation.
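The following is a minimal sketch, not the patent's own code, of one such Middle Flow module in TensorFlow/Keras (the framework named in this embodiment): three depthwise separable convolutions wrapped in a linear residual connection. The filter count, input size and layer ordering follow the standard Xception design and are assumptions here.

```python
import tensorflow as tf

def middle_flow_module(x: tf.Tensor, filters: int = 728) -> tf.Tensor:
    """One Xception-style Middle Flow module: 3 separable convs + residual."""
    residual = x
    for _ in range(3):
        x = tf.keras.layers.ReLU()(x)
        # Depthwise separable convolution: far fewer parameters than a
        # standard convolution over the same receptive field.
        x = tf.keras.layers.SeparableConv2D(filters, 3, padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.Add()([x, residual])  # linear residual connection

# Assumed feature-map size at the Middle Flow stage.
inputs = tf.keras.Input(shape=(19, 19, 728))
model = tf.keras.Model(inputs, middle_flow_module(inputs))
```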
Optionally, after the fusion feature of the target audio signal of each primitive is obtained, the fusion feature of the target audio signal of each primitive may be directly input into the classification model, or may be input into the classification model after one or more preprocessing is performed on the fusion feature, which is not specifically limited in this embodiment.
Then, after the fusion characteristics of the target audio signals of the primitives are learned by the classification model, outputting the prediction category of the target sound corresponding to the target audio signals of the primitives; wherein the target audio signal of each primitive corresponds to a class of target sound.
And 104, fusing the predicted categories of the target sound corresponding to the target audio signals of the plurality of primitives to obtain the final category of the target sound.
Optionally, after obtaining the prediction categories of the target sounds corresponding to the target audio signals of the plurality of primitives, the prediction categories of the target sounds corresponding to the target audio signals of the plurality of primitives may be fused to obtain a final category of the target sounds.
Optionally, the fusion manner may be that probabilities that target sounds corresponding to target audio signals of the plurality of primitives belong to each preset category are added, final probabilities that the target sounds belong to each preset category are determined, and final categories of the target sounds are determined according to the final probabilities that the target sounds belong to each preset category.
Alternatively, a preset category having a final probability greater than a preset value may be taken as the target category; when the number of the target categories is one, the target category is directly used as the final category of the target sound; when the number of target categories is plural, the target category with the highest probability is taken as the final category of the target sound.
Or selecting one category from predicted categories of target sounds corresponding to target audio signals of a plurality of primitives as a final category of the target sounds based on a voting selection method;
Or fusing the predicted categories of the target sounds corresponding to the target audio signals of the plurality of primitives based on the D-S evidence theory to obtain the final category of the target sounds; the present embodiment is not particularly limited to the fusion method.
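As a minimal sketch of the probability-summing variant described above (not the patent's code; the preset value of 0.5, the empty-set fallback and the tie-breaking are assumptions):

```python
import numpy as np

def probability_fusion(per_primitive_probs: np.ndarray, preset: float = 0.5):
    """per_primitive_probs: (n_primitives, n_classes) class probabilities."""
    final_probs = per_primitive_probs.sum(axis=0)   # add per preset category
    targets = np.flatnonzero(final_probs > preset)  # target categories
    if len(targets) == 0:                           # fallback (assumption)
        return int(np.argmax(final_probs))
    # One target category: take it; several: take the most probable one.
    return int(targets[np.argmax(final_probs[targets])])
```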
Existing traditional voice recognition adapts poorly to the environment: slight changes in the background environment or other interference cause the recognition rate to drop sharply, and robustness is low.
In addition, compared with a single-channel sound collection system, the array hydrophone can receive target sounds from farther away and thus supports remote identification.
In this embodiment, the target audio signals are collected through the multiple primitives of the array hydrophone, and multiple features of each primitive's target audio signal are fused to serve as the input information of the classification model, so that the input information contains rich features characterizing the target sound. The fused features of each primitive's target audio signal are input into the classification model to obtain the prediction category of the target sound for that primitive, and the prediction categories corresponding to the multiple primitives are then fused to obtain the final category of the target sound, effectively improving the accuracy and robustness of the classification result. In the classifying process, synchronous learning over the target audio signals of the multiple primitives effectively strengthens the discriminative capability of the classification model, further improving the accuracy of the classification result.
On the basis of the above embodiment, the feature extraction in this embodiment comprises chroma energy normalized statistics (CENS) feature extraction, spectral contrast feature extraction, and Mel-frequency cepstral coefficient (MFCC) feature extraction.
Optionally, the feature extraction performed on the time-domain data of each audio signal may be CENS feature extraction, spectral contrast feature extraction, MFCC feature extraction, or the like, which is not particularly limited in this embodiment.
It should be noted that, the feature extraction in the present embodiment is not limited to the above-mentioned feature extraction methods, and other feature extraction methods that can characterize the effective features of the target sound may be used.
The voice recognition method in the present embodiment is specifically described below for the case where CENS feature extraction, spectral contrast feature extraction and MFCC feature extraction are applied simultaneously to each audio signal.
Optionally, in order to make the input information of the classification model contain rich features of the target sound, two-dimensional CENS feature extraction, two-dimensional spectral contrast feature extraction and two-dimensional MFCC feature extraction can be performed on each audio signal at the same time.
Then, the three feature matrices of each audio signal are spliced into one new two-dimensional feature matrix; finally, the new two-dimensional feature matrices of the multiple target audio signals are input into the classification model to obtain the class of the target sound corresponding to each target audio signal output by the classification model.
For example, for any audio signal, two-dimensional CENS feature extraction on its time-domain data yields a feature matrix of size 98*12; two-dimensional spectral contrast feature extraction yields a feature matrix of size 98*7; and two-dimensional MFCC feature extraction yields a feature matrix of size 98*40. The feature matrices are spliced into a new two-dimensional feature matrix of size 98*59. The new feature matrix of each primitive's audio signal is then input into the classification model for learning, to acquire the class of the target sound corresponding to each primitive's target audio signal.
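A minimal sketch of this fused feature construction with the librosa library follows; the library choice is an assumption (the patent does not name one), but with the embodiment's 10000 Hz sampling rate, 5 s clips and librosa's default hop of 512 samples, each extractor yields 1 + 50000 // 512 = 98 frames, matching the sizes above.

```python
import numpy as np
import librosa

SR, DURATION = 10000, 5  # sampling frequency and clip length per the embodiment

def fused_features(signal: np.ndarray, sr: int = SR) -> np.ndarray:
    """Splice CENS, spectral-contrast and MFCC matrices into one 98 x 59 matrix."""
    cens = librosa.feature.chroma_cens(y=signal, sr=sr)            # (12, 98)
    # fmin lowered to 100 Hz (assumption) so the 6 octave bands of the
    # spectral-contrast extractor fit under the 5 kHz Nyquist frequency.
    contrast = librosa.feature.spectral_contrast(y=signal, sr=sr,
                                                 fmin=100.0)       # (7, 98)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)        # (40, 98)
    return np.concatenate([cens, contrast, mfcc], axis=0).T        # (98, 59)

clip = np.random.randn(SR * DURATION).astype(np.float32)  # stand-in clip
print(fused_features(clip).shape)                          # (98, 59)
```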
According to the embodiment, before the target audio signals of the primitives are input into the classification model, the characteristics of the target audio signals are extracted, so that the input information of the classification model contains rich characteristics of target sounds, the distinguishing degree and the characterization capability of the characteristics of the target sounds can be effectively improved, and further more accurate classification results can be obtained.
Based on the above embodiment, in this embodiment, fusing the predicted categories of the target sound corresponding to the target audio signals of the multiple primitives to obtain a final category of the target sound includes: based on a voting selection method, the prediction category with the highest frequency is counted among the prediction categories corresponding to the plurality of primitives, and the prediction category with the highest frequency is used as the final category of the target sound.
Optionally, after the target audio signal of each primitive is input into the classification model, a prediction class of the target sound corresponding to the target audio signal of each primitive may be obtained;
then, based on a voting selection method, the prediction categories of the target sounds corresponding to the target audio signals of the plurality of primitives are first counted to obtain the frequency of occurrence of each prediction category; the prediction category with the highest frequency of occurrence is then taken as the final category of the target sound, and the classification probabilities of the target sound across the primitives' target audio signals are averaged to give the final classification probability of the target sound.
By fusing the prediction categories of the target sounds corresponding to the target audio signals of the multiple primitives through voting selection to obtain the final category of the target sound, this embodiment can effectively overcome the problems of a single feature characterization method, a single signal channel, low recognition accuracy, poor robustness and a high false-alarm rate in the prior art, and can classify target sounds in complex sound environments with high precision.
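A minimal sketch of this voting fusion (an illustrative implementation, not the patent's code):

```python
from collections import Counter
import numpy as np

def vote_fusion(per_primitive_probs: np.ndarray):
    """per_primitive_probs: (n_primitives, n_classes) softmax outputs."""
    votes = per_primitive_probs.argmax(axis=1)        # one predicted class each
    final_class = Counter(votes.tolist()).most_common(1)[0][0]
    final_prob = per_primitive_probs.mean(axis=0)[final_class]  # averaged
    return final_class, float(final_prob)

# Example: 8 primitives voting over 4 classes
# (background, car, truck, footsteps).
probs = np.random.dirichlet(np.ones(4), size=8)
print(vote_fusion(probs))
```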
On the basis of the above embodiments, in this embodiment, before inputting the fusion feature of the target audio signal of each primitive into the classification model to obtain the prediction class of the target sound corresponding to the target audio signal of each primitive, the method further includes: and carrying out iterative training on the classification model according to the fusion characteristics of the sample audio signals and the reference category of the sample sound so as to iteratively optimize and adjust the weight parameters of the classification model, and storing the weight parameters of the classification model in each iteration until the loss value of the classification model converges and/or the classification accuracy meets the preset termination condition.
Optionally, the classification model is trained before the plurality of target audio signals are input into it; training the classification model requires constructing a dataset.
Optionally, sets of audio signals may be selected from the local storage of a plurality of sound collection devices to construct the dataset; for example, 8000 sets may be selected. The dataset contains multiple categories of sounds, such as ambient background noise, car sounds, truck sounds and footstep sounds; each class of sound is stored in a multichannel wav-format audio file.
The sampling frequency and the duration of each sampling of the sound collection device can be set according to actual requirements, for example, the sampling frequency is 10000Hz, and the duration is 5s.
In the actual simulation process, the audio signals of the corresponding groups in the data set can be used as training samples, namely sample audio signals, the audio signals of the corresponding groups in the data set are used as verification sets, and the audio signals of the corresponding groups in the data set are used as test sets, namely target audio signals.
For example, the footstep-sound sub-dataset contains 2000 groups of 5 s audio signals. At an 80%:20% ratio, the sub-dataset is randomly divided into 1600 groups of audio signals for the input set and 400 groups for the test set; the 1600 groups of the input set are again randomly divided, at an 80%:20% ratio, into 1280 groups for the training set and 320 groups for the verification set. The other classes of sounds can be divided in the same manner.
After this division, the dataset containing the four categories of environmental background noise, car sound, truck sound and footstep sound yields a training set of 5120 groups of audio signals, a verification set of 1280 groups and a test set of 1600 groups.
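A minimal sketch of these 80%:20% splits, here using scikit-learn (an assumption; any shuffling utility would do):

```python
import numpy as np
from sklearn.model_selection import train_test_split

clips = np.arange(2000)  # indices of one class's 2000 audio groups
input_set, test_set = train_test_split(clips, test_size=0.2, random_state=0)
train_set, val_set = train_test_split(input_set, test_size=0.2, random_state=0)
print(len(train_set), len(val_set), len(test_set))  # 1280 320 400
```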
As shown in fig. 3, in the process of training the classification model, the sample audio signals in the training set can be input into the classification model for training in batches, and the loss function is evaluated under each batch's parameters so that its value is progressively minimized. The number of sets of sample data contained in each batch can be set according to practical requirements; for example, each batch contains 16 sets of sample sounds.
In the training process, the weight parameters of the training classification model of each iteration are also required to be saved; the saving mode can be that the weight parameters of the iterative training classification model are saved after each iterative training is completed; or after iterating the preset period, uniformly storing the weight parameters of the classification model in the preset period, for example, storing the weight parameters of the classification model once after iterating for 10 periods, and training to the last step to store once. The present embodiment does not specifically limit the manner of saving the weight parameters of the classification model.
After each iteration of training, the parameters of the classification model are updated according to the value of the loss function and the accuracy on the verification set, continuously improving the accuracy of the classification model, until the loss value of the classification model converges, the classification accuracy meets a preset termination condition, or the maximum number of iterations is reached (or some combination of these), whereupon training stops and a classification model with higher classification accuracy is obtained. Compared with a traditional voice recognition model, this embodiment uses a deeper and wider convolutional neural network to extract, separate and train on features, can adapt to background interference, and has stronger robustness and high recognition accuracy.
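A minimal TensorFlow/Keras sketch of this training loop with per-iteration checkpointing and a convergence stop follows; the stand-in model, random data and epoch count are assumptions, while the batch size of 16 follows the text.

```python
import numpy as np
import tensorflow as tf

# Stand-in for the Xception-like classifier over 98 x 59 fused features.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(98, 59, 1)),
    tf.keras.layers.SeparableConv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),  # 4 sound classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Save the weight parameters of every iteration (epoch).
    tf.keras.callbacks.ModelCheckpoint(
        "weights_epoch_{epoch:03d}.weights.h5", save_weights_only=True),
    # Stop once the validation loss converges (no further improvement).
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10),
]

# Random stand-ins for the fused feature matrices and class labels.
x_train = np.random.randn(128, 98, 59, 1).astype("float32")
y_train = np.random.randint(0, 4, size=128)
x_val = np.random.randn(32, 98, 59, 1).astype("float32")
y_val = np.random.randint(0, 4, size=32)

model.fit(x_train, y_train, batch_size=16, epochs=5,
          validation_data=(x_val, y_val), callbacks=callbacks)
```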
In this embodiment, the classification model is trained with various sample audio signals, so that it generalizes better to environmental changes and can accurately identify the category of the target sound in a real, complex environment; in addition, the optimal parameters of the classification model are adaptively selected according to its evaluation indexes, so that the classification model performs better.
On the basis of the above embodiment, the present embodiment further includes: determining the optimal weight parameters of the classification model according to the result indexes of the weight parameters of the classification model verified in the verification set in each iteration; wherein the outcome index comprises a combination of one or more of accuracy, average accuracy, recall, and F-score.
Optionally, since the performance of the classification model is mainly determined by its weight parameters, the most suitable set of weight parameters needs to be selected in order to obtain a classification model with higher accuracy and better fit.
To obtain the optimal parameters of the classification model, one group of weight parameters can be selected, from all the weight parameters saved during training, to serve as the optimal weight parameters, according to the result indexes obtained by verifying each iteration's weight parameters on the verification set.
The larger the accuracy, average accuracy, recall and F-score, the better the performance of the classification model.
The F-score F_β is the harmonic mean of the precision and the recall; its calculation formula is:

F_β = (1 + β²)·P·R / (β²·P + R)

where β is used to balance the importance of precision and recall, and P and R are the precision and recall, respectively.
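As a quick check of the formula with assumed values β = 1, P = 0.9 and R = 0.8:

```latex
F_1 = \frac{(1 + 1^2) \cdot 0.9 \cdot 0.8}{1^2 \cdot 0.9 + 0.8}
    = \frac{1.44}{1.7} \approx 0.847
```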
Optionally, when determining the optimal weight parameters of the classification model according to the result indexes obtained by verifying each iteration's weight parameters on the verification set, the performance index of each iteration's classification model may be calculated from one or more combinations of its accuracy, average accuracy, recall and F-score; this embodiment does not particularly limit the choice;
it should be noted that, the result index in the present embodiment is not limited to the above-mentioned several indexes, and other indexes that can characterize the performance of the classification model may be used.
Optionally, the optimal weight parameter may be a weight parameter with highest accuracy; or the weight parameter with highest average accuracy; or the weight parameter with highest combination of accuracy, average accuracy, recall and score; or the weight parameter with highest accuracy and average accuracy; or the weight parameters with the highest accuracy and average accuracy and the highest combination of the accuracy, average accuracy, recall and score; or the weight parameter with the highest accuracy and average accuracy, and the highest combination of accuracy, average accuracy, recall and score, and the largest iteration period, which is not specifically limited in this embodiment.
It should be noted that, the optimal parameter may be an optimal parameter stored after the last iterative training, or may be an optimal parameter stored after a certain intermediate iterative training.
The voice recognition method in this embodiment is specifically described below for the case in which the performance index of each iteration's classification model is calculated from the accuracy, average accuracy, recall and F-score together.
Optionally, in order to optimize the comprehensive performance index of the optimized classification model, the accuracy, average accuracy, recall and score of each iteration classification model can be combined, and the performance index of each iteration classification model can be calculated;
the accuracy, the average accuracy, the recall and the score can be directly added to obtain the performance index of the classification model;
or inputting the accuracy, the average accuracy, the recall and the score into a fusion model, and fusing to obtain the performance index of the classification model;
or, the accuracy, average accuracy, recall rate and score of each iteration classification model are weighted and added to obtain the performance index of each iteration classification model.
The accuracy, average accuracy, recall and scoring weights can be set according to actual requirements, and weights can be determined by a weight analysis method.
The weight analysis method may be a hierarchical analysis method, a loop ratio analysis method, a principal component analysis method, an entropy value analysis method, or the like, and the embodiment is not limited to the type of the weight analysis method.
The accuracy, average accuracy, recall and F-score of the classification model at each iteration are multiplied by the corresponding weights and added to obtain the performance index of the classification model at that iteration; the specific calculation formula is:

P_i = λ_{i,1}·S_{i,1} + λ_{i,2}·S_{i,2} + λ_{i,3}·S_{i,3} + λ_{i,4}·S_{i,4}

where P_i is the performance index of the i-th iteration's classification model, λ_{i,1}, λ_{i,2}, λ_{i,3} and λ_{i,4} are the weight coefficients, and S_{i,1}, S_{i,2}, S_{i,3} and S_{i,4} are respectively the accuracy, average accuracy, recall and F-score of the i-th iteration's classification model.
The performance indexes of all iterations over the whole training process are then sorted. If the sorted result contains a single maximum performance index, the weight parameters corresponding to it are taken as the optimal weight parameters of the classification model.
If several iterations share the maximum performance index, the weight parameters of the one with the largest iteration period are taken as the optimal weight parameters.
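A minimal sketch of this selection (an illustrative implementation; the λ values are assumptions):

```python
import numpy as np

def best_checkpoint(scores: np.ndarray, lam: np.ndarray) -> int:
    """scores: (n_iters, 4) accuracy, average accuracy, recall and F-score
    per saved iteration; lam: (4,) weight coefficients. Returns best index."""
    perf = scores @ lam                        # P_i for every iteration
    best = np.flatnonzero(perf == perf.max())  # all maximum performance indexes
    return int(best[-1])                       # tie-break: largest iteration

scores = np.random.rand(50, 4)       # stand-in validation metrics
lam = np.array([0.4, 0.2, 0.2, 0.2])  # assumed weight coefficients
print(best_checkpoint(scores, lam))
```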
And then, inputting the target audio signals of the primitives into a classification model of the optimal weight parameters to obtain the prediction type of the target sound corresponding to the target audio signals of the primitives output by the classification model.
Compared with the prior art, in which the parameters of the last iteration are directly used as the model's optimal parameters, this embodiment adaptively selects the optimal weight parameters of the classification model according to one or more of the accuracy, average accuracy, recall and F-score, so that the comprehensive performance of the classification model is better and the classification result more accurate.
On the basis of the foregoing embodiment, in this embodiment, performing iterative training on the classification model according to the fusion feature of the sample audio signal and the reference class of the sample sound includes: preprocessing the sample audio signal; wherein the preprocessing includes a combination of one or more of normalization, resizing, and data enhancement; and carrying out iterative training on parameters of the classification model according to fusion characteristics of the preprocessed sample audio signals and the reference category.
Wherein the normalization process may normalize the sample audio signal to between [ -1,1 ]; normalization processing includes, but is not limited to, maximum-minimum normalization and standard normalization, which is not specifically limited in this embodiment.
The size adjustment is used for adjusting the duration of all the sample audio signals to a uniform length t: signals shorter than t are padded to t with random values close to 0, and signals longer than t are truncated to t.
For example, the duration of each sample audio signal is unified to 5 s: signals longer than 5 s are cut to a 5 s duration, and signals shorter than 5 s are padded to 5 s with random values close to 0.
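A minimal sketch of this normalization and size adjustment (an illustrative implementation; the near-zero padding range is an assumption):

```python
import numpy as np

SR, T = 10000, 5  # sampling rate and unified duration from the embodiment

def preprocess(signal: np.ndarray, sr: int = SR, t: int = T) -> np.ndarray:
    signal = signal / (np.abs(signal).max() + 1e-12)  # normalize to [-1, 1]
    target = sr * t
    if len(signal) > target:
        return signal[:target]                 # longer than t: truncate
    pad = np.random.uniform(-1e-4, 1e-4, target - len(signal))
    return np.concatenate([signal, pad])       # shorter: pad near zero
```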
According to the embodiment, the sample audio signals are preprocessed and data are enhanced, so that the trained classification model can adapt to background interference, robustness, generalization and compatibility are stronger, and the classification accuracy is improved.
The sample audio signal may be preprocessed as described above before iteratively optimizing parameters of the classification model using the sample audio signal and the reference class.
It should be noted that, normalization and/or size adjustment may be performed on the target audio information, so as to avoid influence of dimensions and sizes on the classification result, so as to improve accuracy of identifying the category of the target sound.
On the basis of the above embodiment, the data enhancement in this embodiment includes a combination of one or more of mixed noise data enhancement, time offset data enhancement, and clipping-filling data enhancement for the same kind of sound.
Wherein the data enhancement includes, but is not limited to, one or more of a mixed noise data enhancement, a time offset data enhancement, and a same kind of sound cut filler data enhancement, which is not specifically limited in this embodiment.
The mixed-noise data enhancement may randomly select a section of background noise, mix it with a sample audio signal, and randomly adjust the volume; some random noise, such as Gaussian noise, may also be added. The implementation of the mixed-noise data enhancement is not specifically limited in this embodiment.
The time-offset data enhancement may shift the sample audio signal by a segment, e.g., 1 to 3 seconds, toward its beginning or end, overwriting the vacated portion of the sample audio signal with random values very close to 0.
The same-kind-sound cut-and-fill data enhancement may randomly select two sample audio signals of the same kind, cut a part from each, and combine the parts into a new sample audio signal.
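Minimal sketches of the three enhancements (illustrative implementations; the gains, shift lengths and noise levels are assumptions):

```python
import numpy as np

def mix_noise(signal, noise):
    """Mix a randomly chosen background-noise section in at random volume.
    Assumes the noise clip is at least as long as the signal."""
    start = np.random.randint(0, len(noise) - len(signal) + 1)
    gain = np.random.uniform(0.1, 0.5)
    return signal + gain * noise[start:start + len(signal)]

def time_offset(signal, sr=10000):
    """Shift by 1-3 s, filling the vacated samples with near-zero values."""
    shift = np.random.randint(1 * sr, 3 * sr)
    filler = np.random.uniform(-1e-4, 1e-4, shift)
    return np.concatenate([filler, signal[:-shift]])

def cut_and_splice(sig_a, sig_b):
    """Cut two equal-length same-class clips and combine into a new sample."""
    cut = np.random.randint(1, len(sig_a))
    return np.concatenate([sig_a[:cut], sig_b[cut:]])
```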
According to the embodiment, the data enhancement is carried out on the sample audio signal, so that the training set can be effectively expanded, the robustness of the sample in different environments is improved, the trained classification model can adapt to background interference, the robustness, generalization and compatibility are stronger, and the classification accuracy is improved.
The following describes the voice recognition device based on the array hydrophone provided by the invention, and the voice recognition device based on the array hydrophone and the voice recognition method based on the array hydrophone described above can be correspondingly referred to each other.
As shown in fig. 4, the present embodiment provides a voice recognition apparatus based on an array hydrophone, which includes an acquisition module 401, a feature extraction module 402, a first test module 403, and a second test module 404, wherein:
the acquisition module 401 is configured to acquire a target sound based on a plurality of primitives of the array hydrophone, so as to obtain target audio signals of the plurality of primitives;
the voice recognition method in the embodiment can be applied to the fields of underwater monitoring, security monitoring, medical monitoring, intelligent home, urban intelligent management and the like;
the voice recognition method in the present embodiment can be applied to different systems or devices, such as an actuator; the actuator may be an intelligent terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, a vehicle-mounted terminal, etc., or may be a server or a cloud terminal, etc., which is not particularly limited in this embodiment.
The multiple primitives on the array hydrophone are used for synchronously collecting target sound. The attribute of the array hydrophone can be configured according to actual requirements, such as the number of the primitives, the array shape, the space position among the primitives and the like. The target sound is a sound to be classified, and may be a water sound, a car sound, a truck sound, a footstep sound, or the like, which is not particularly limited in this embodiment.
Alternatively, the target sound may be collected based on a plurality of primitives of the array hydrophone at the same time, and a target audio signal of the plurality of primitives, that is, one target audio signal of the target sound collected by each primitive, may be obtained.
It should be noted that the target audio signal may be an audio signal obtained by acquiring the target sound in real time based on a plurality of primitives of the array hydrophone, or may be a target audio signal of the target sound acquired from a local storage of a plurality of primitives of the array hydrophone. The present embodiment does not specifically limit the source of the target sound.
The feature extraction module 402 is configured to perform multiple feature extraction on the target audio signal of each primitive, and perform feature fusion on the multiple features to obtain a fused feature of the target audio signal of each primitive;
optionally, before the target audio signal of each primitive is input into the classification model, multiple feature extraction may be performed on the target audio signal of each primitive to select a feature matrix with a certain degree of distinction in the scene.
Each feature matrix is a two-dimensional feature matrix, and finally the feature matrices are spliced and combined into a new two-dimensional feature matrix so as to obtain the fusion feature of the target audio signals of each primitive.
Compared with the prior art, in which a single feature is used to classify sounds, extracting multiple features characterizes the target audio signal more richly, which in turn makes the classification result of the target sound more accurate.
The first test module 403 is configured to input the fusion feature of the target audio signal of each primitive into a classification model, to obtain a prediction class of the target sound corresponding to the target audio signal of each primitive;
the classification model includes, but is not limited to, an input layer, a hidden layer, an activation layer, a pooling layer, a softmax classification layer, and an output layer, and the specific structure of the classification model is not specifically limited in this embodiment.
Optionally, after the fusion feature of the target audio signal of each primitive is obtained, the fusion feature of the target audio signal of each primitive may be directly input into the classification model, or may be input into the classification model after one or more preprocessing is performed on the fusion feature, which is not specifically limited in this embodiment.
Then, after the fusion characteristics of the target audio signals of the primitives are learned by the classification model, outputting the prediction category of the target sound corresponding to the target audio signals of the primitives; wherein the target audio signal of each primitive corresponds to a class of target sound.
The second test module 404 is configured to fuse the predicted categories of the target sound corresponding to the target audio signals of the multiple primitives, so as to obtain a final category of the target sound.
Optionally, after obtaining the prediction categories of the target sounds corresponding to the target audio signals of the plurality of primitives, the prediction categories of the target sounds corresponding to the target audio signals of the plurality of primitives may be fused to obtain a final category of the target sounds.
Optionally, the fusion manner may be that probabilities that target sounds corresponding to target audio signals of the plurality of primitives belong to each preset category are added, final probabilities that the target sounds belong to each preset category are determined, and final categories of the target sounds are determined according to the final probabilities that the target sounds belong to each preset category.
Alternatively, a preset category having a final probability greater than a preset value may be taken as the target category; when the number of the target categories is one, the target category is directly used as the final category of the target sound; when the number of target categories is plural, the target category with the highest probability is taken as the final category of the target sound.
Alternatively, one category may be selected from the predicted categories of the target sounds corresponding to the target audio signals of the plurality of primitives, based on a voting selection method, as the final category of the target sound;
or the predicted categories of the target sounds corresponding to the target audio signals of the plurality of primitives may be fused based on D-S evidence theory to obtain the final category of the target sound. This embodiment places no particular limitation on the fusion method.
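As one hedged illustration of the D-S evidence fusion, the sketch below applies Dempster's combination rule while treating each primitive's per-category probabilities as a mass function over singleton categories only; the embodiment does not prescribe this particular reduction.

```python
import numpy as np

def dempster_combine(m1: np.ndarray, m2: np.ndarray) -> np.ndarray:
    """Combine two mass vectors defined over the same singleton categories."""
    joint = np.outer(m1, m2)
    agreement = np.diag(joint).copy()          # mass where both sources agree
    conflict = joint.sum() - agreement.sum()   # K: mass on conflicting pairs
    if np.isclose(conflict, 1.0):
        raise ValueError("total conflict; Dempster's rule is undefined")
    return agreement / (1.0 - conflict)        # normalise out the conflict

def fuse_by_ds(per_primitive_probs: np.ndarray) -> int:
    """per_primitive_probs: (num_primitives, num_classes) softmax outputs."""
    masses = per_primitive_probs[0]
    for m in per_primitive_probs[1:]:
        masses = dempster_combine(masses, m)
    return int(np.argmax(masses))
```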
Existing traditional voice recognition adapts poorly to the environment: slight changes in the background environment or other interference cause the recognition rate to drop sharply, so its robustness is low.
In addition, compared with a single-channel sound collection system, the array hydrophone can receive target sounds from much farther away, enabling remote identification.
In this embodiment, target audio signals are collected by the multiple primitives of the array hydrophone, and multiple features of each primitive's target audio signal are fused as the input information of the classification model, so that the input contains rich features characterizing the target sound. The fusion feature of the target audio signal of each primitive is input into the classification model to obtain the predicted category of the target sound corresponding to that signal, and the predicted categories across the multiple primitives are then fused to obtain the final category of the target sound, which effectively improves the accuracy and robustness of the classification result. Moreover, because the target audio signals of the plurality of primitives are learned synchronously during classification, the discriminative capability of the classification model is strengthened, further improving the accuracy of the classification result.
On the basis of the above embodiment, the feature extraction in this embodiment includes chroma energy normalized statistics (CENS) feature extraction, spectral contrast feature extraction, and mel-frequency cepstral coefficient (MFCC) feature extraction.
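A minimal sketch of these three extractions and their splicing into one fused two-dimensional matrix, assuming librosa as the feature library; the sample rate and the number of MFCC coefficients are illustrative choices, not values fixed by the embodiment.

```python
import librosa
import numpy as np

def fuse_features(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Extract three 2D feature matrices and splice them row-wise."""
    cens = librosa.feature.chroma_cens(y=y, sr=sr)            # (12, frames)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # (7, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # (20, frames)
    # frame counts align for librosa's default hop length, so the matrices
    # can be spliced into a new two-dimensional fusion feature
    return np.vstack([cens, contrast, mfcc])                  # (39, frames)
```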
Based on the foregoing embodiments, the second test module in this embodiment is specifically configured to: based on a voting selection method, count the most frequent predicted category among the predicted categories corresponding to the plurality of primitives, and take that most frequent category as the final category of the target sound.
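The voting selection can be sketched in a few lines; the input is assumed to be the list of per-primitive predicted category indices.

```python
from collections import Counter

def fuse_by_vote(predicted_classes: list[int]) -> int:
    """Return the most frequent per-primitive predicted category."""
    return Counter(predicted_classes).most_common(1)[0][0]

# e.g. predictions from four primitives: category 2 wins the vote
assert fuse_by_vote([2, 2, 1, 2]) == 2
```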
Based on the above embodiments, this embodiment further includes a classification model training module configured to: iteratively train the classification model according to the fusion features of the sample audio signals and the reference categories of the sample sounds, so as to iteratively optimize the weight parameters of the classification model, saving the weight parameters of the classification model at each iteration, until the loss value of the classification model converges and/or the classification accuracy meets a preset termination condition.
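A minimal training-loop sketch of this module, assuming a PyTorch model that outputs per-category probabilities (as in the earlier sketch) and a data loader of (fusion feature, reference category) pairs; the optimizer, convergence tolerance, and accuracy target are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train(model, loader, max_epochs=100, loss_tol=1e-4, acc_target=0.95):
    optimiser = torch.optim.Adam(model.parameters())
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        total_loss, correct, seen = 0.0, 0, 0
        for x, y in loader:
            optimiser.zero_grad()
            log_probs = torch.log(model(x) + 1e-9)  # model emits probabilities
            loss = F.nll_loss(log_probs, y)
            loss.backward()
            optimiser.step()
            total_loss += loss.item() * len(y)
            correct += (log_probs.argmax(dim=1) == y).sum().item()
            seen += len(y)
        # save the weight parameters of the classification model at each iteration
        torch.save(model.state_dict(), f"weights_epoch_{epoch}.pt")
        epoch_loss, accuracy = total_loss / seen, correct / seen
        # stop once the loss converges and/or the accuracy meets the condition
        if abs(prev_loss - epoch_loss) < loss_tol or accuracy >= acc_target:
            break
        prev_loss = epoch_loss
```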
On the basis of the above embodiment, this embodiment further includes a classification model evaluation module configured to: determine the optimal weight parameters of the classification model according to the result indexes obtained by validating the weight parameters of each iteration on the validation set, where the result index comprises one or more of accuracy, average precision, recall, and F-score.
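A hedged sketch of this evaluation step, combining the four result indexes into a single weighted performance index per iteration and keeping the best-scoring checkpoint, with ties resolved in favor of the later iteration; the equal weight coefficients are an illustrative assumption.

```python
def performance_index(acc, avg_precision, recall, f_score,
                      weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four result indexes for one iteration."""
    scores = (acc, avg_precision, recall, f_score)
    return sum(w * s for w, s in zip(weights, scores))

def best_checkpoint(per_iteration_metrics):
    """per_iteration_metrics: list of (acc, avg_precision, recall, f_score)."""
    indices = [performance_index(*m) for m in per_iteration_metrics]
    best = max(indices)
    # if several iterations tie at the maximum index, take the latest one
    return max(i for i, p in enumerate(indices) if p == best)
```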
Based on the above embodiment, the classification model training module in this embodiment is specifically configured to: preprocess the sample audio signals, where the preprocessing includes one or more of normalization, resizing, and data enhancement; and iteratively train the parameters of the classification model according to the fusion features of the preprocessed sample audio signals and the reference categories.
On the basis of the above embodiment, the data enhancement in this embodiment includes one or more of mixed-noise data enhancement with sounds of the same kind, time-offset data enhancement, and clipping-filling data enhancement.
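A minimal sketch of the three data enhancements named above, operating on raw waveforms as NumPy arrays; the noise mixing scale, maximum shift, and target length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def mix_noise(y, other_same_kind, scale=0.3):
    """Mixed-noise enhancement: blend in another recording of the same kind."""
    n = min(len(y), len(other_same_kind))
    return y[:n] + scale * other_same_kind[:n]

def time_offset(y, max_shift=1600):
    """Time-offset enhancement: circularly shift the waveform."""
    return np.roll(y, rng.integers(-max_shift, max_shift))

def clip_and_fill(y, target_len):
    """Clipping-filling enhancement: crop to, or zero-pad up to, target_len."""
    if len(y) >= target_len:                      # clip a random window
        start = rng.integers(0, len(y) - target_len + 1)
        return y[start:start + target_len]
    return np.pad(y, (0, target_len - len(y)))    # fill with zeros
```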
Fig. 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 5, the electronic device may include: a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 communicate with each other via the communication bus 504. The processor 501 may invoke logic instructions in the memory 503 to perform the array-hydrophone-based voice recognition method, comprising: based on a plurality of primitives of the array hydrophone, collecting target sound to obtain target audio signals of the primitives; extracting multiple characteristics from the target audio signals of each primitive, and carrying out characteristic fusion on the multiple characteristics to obtain fusion characteristics of the target audio signals of each primitive; inputting the fusion characteristics of the target audio signals of each primitive into a classification model to obtain the predicted category of the target sound corresponding to the target audio signals of each primitive; and fusing the predicted categories of the target sound corresponding to the target audio signals of the plurality of primitives to obtain the final category of the target sound; wherein the classification model is obtained through training based on fusion characteristics of a sample audio signal and reference categories of sample sounds corresponding to the sample audio signal.
Further, the logic instructions in the memory 503 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the array hydrophone based voice recognition method provided by the above methods, the method comprising: based on a plurality of primitives of the array hydrophone, collecting target sound to obtain target audio signals of the primitives; extracting multiple characteristics from the target audio signals of each primitive, and carrying out characteristic fusion on the multiple characteristics to obtain fusion characteristics of the target audio signals of each primitive; inputting the fusion characteristics of the target audio signals of each primitive into a classification model to obtain the prediction category of the target sound corresponding to the target audio signals of each primitive; fusing the predicted categories of the target sound corresponding to the target audio signals of the plurality of primitives to obtain the final category of the target sound; the classification model is obtained through training based on fusion characteristics of a sample audio signal and reference categories of sample sounds corresponding to the sample audio signal.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the method of array hydrophone based voice recognition provided by the methods above, the method comprising: based on a plurality of primitives of the array hydrophone, collecting target sound to obtain target audio signals of the primitives; extracting multiple characteristics from the target audio signals of each primitive, and carrying out characteristic fusion on the multiple characteristics to obtain fusion characteristics of the target audio signals of each primitive; inputting the fusion characteristics of the target audio signals of each primitive into a classification model to obtain the prediction category of the target sound corresponding to the target audio signals of each primitive; fusing the predicted categories of the target sound corresponding to the target audio signals of the plurality of primitives to obtain the final category of the target sound; the classification model is obtained through training based on fusion characteristics of a sample audio signal and reference categories of sample sounds corresponding to the sample audio signal.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments or parts thereof.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of voice recognition based on an array hydrophone, comprising:
based on a plurality of primitives of the array hydrophone, collecting target sound to obtain target audio signals of the primitives;
extracting multiple characteristics from the target audio signals of each primitive, and carrying out characteristic fusion on the multiple characteristics to obtain fusion characteristics of the target audio signals of each primitive;
inputting the fusion characteristics of the target audio signals of each primitive into a classification model to obtain the prediction category of the target sound corresponding to the target audio signals of each primitive;
fusing the predicted categories of the target sound corresponding to the target audio signals of the plurality of primitives to obtain the final category of the target sound;
the classification model is obtained by training based on fusion characteristics of a sample audio signal and reference categories of sample sounds corresponding to the sample audio signal;
the optimal weight parameters of the classification model are obtained based on the following steps:
multiplying the accuracy, average precision, recall and F-score of the classification model at each iteration by corresponding weights and summing them, to obtain the performance index of the classification model at each iteration, where the specific calculation formula is:
$P_i = \lambda_{i,1} S_{i,1} + \lambda_{i,2} S_{i,2} + \lambda_{i,3} S_{i,3} + \lambda_{i,4} S_{i,4}$

wherein $P_i$ is the performance index of the classification model at the $i$-th iteration, $\lambda_{i,1}$, $\lambda_{i,2}$, $\lambda_{i,3}$ and $\lambda_{i,4}$ are the weight coefficients, and $S_{i,1}$, $S_{i,2}$, $S_{i,3}$ and $S_{i,4}$ are respectively the accuracy, average precision, recall and F-score of the classification model at the $i$-th iteration;

the F-score $F_{Score}$ is the harmonic mean of the precision and the recall, and its specific calculation formula is:

$F_{Score} = \frac{(1+\beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$

wherein $\beta$ is used to balance the importance of the precision and the recall, and $P$ and $R$ are the precision and the recall, respectively;
sorting the performance indexes of all iterations over the whole training process, and, if the sorted result contains one maximum performance index, taking the weight parameters corresponding to that maximum performance index as the optimal weight parameters of the classification model;
and if the sorted result contains a plurality of maximum performance indexes, taking the weight parameters corresponding to the maximum performance index with the latest iteration period as the optimal weight parameters.
2. The array hydrophone-based sound identification method of claim 1, wherein the feature extraction comprises chroma energy normalized statistics (CENS) feature extraction, spectral contrast feature extraction, and mel-frequency cepstral coefficient feature extraction.
3. The method for identifying a sound based on an array hydrophone according to claim 1, wherein the fusing the predicted categories of the target sound corresponding to the target audio signals of the plurality of primitives to obtain the final category of the target sound includes:
based on a voting selection method, counting, among the predicted categories corresponding to the plurality of primitives, the predicted category with the highest frequency, and taking the predicted category with the highest frequency as the final category of the target sound.
4. A method of array hydrophone based voice recognition according to any one of claims 1-3, further comprising, before inputting the fusion characteristics of the target audio signals of each element into the classification model to obtain the predicted class of the target sound corresponding to the target audio signal of each element:
and carrying out iterative training on the classification model according to the fusion characteristics of the sample audio signals and the reference category of the sample sound so as to iteratively optimize and adjust the weight parameters of the classification model, and storing the weight parameters of the classification model in each iteration until the loss value of the classification model converges and/or the classification accuracy meets the preset termination condition.
5. The array hydrophone-based sound identification method of claim 4, wherein the iterative training of the classification model based on the fusion characteristics of the sample audio signals and the reference class of the sample sounds comprises:
preprocessing the sample audio signal;
Wherein the preprocessing includes a combination of one or more of normalization, resizing, and data enhancement;
and carrying out iterative training on parameters of the classification model according to fusion characteristics of the preprocessed sample audio signals and the reference category.
6. The array-hydrophone-based sound identification method of claim 5, wherein the data enhancement includes a combination of one or more of mixed-noise data enhancement with sounds of the same kind, time-offset data enhancement, and clipping-filling data enhancement.
7. A voice recognition apparatus based on an array hydrophone, comprising:
the acquisition module is used for acquiring target sound based on a plurality of primitives of the array hydrophone to obtain target audio signals of the primitives;
the feature extraction module is used for extracting various features of the target audio signals of the primitives and carrying out feature fusion on the various features to obtain fusion features of the target audio signals of the primitives;
the first test module is used for inputting the fusion characteristics of the target audio signals of the primitives into the classification model to obtain the prediction category of the target sound corresponding to the target audio signals of the primitives;
The second test module is used for fusing the predicted categories of the target sound corresponding to the target audio signals of the plurality of primitives to obtain the final category of the target sound;
the classification model is obtained by training based on fusion characteristics of a sample audio signal and reference categories of sample sounds corresponding to the sample audio signal;
the optimal weight parameters of the classification model are obtained based on the following steps:
multiplying the accuracy, average precision, recall and F-score of the classification model at each iteration by corresponding weights and summing them, to obtain the performance index of the classification model at each iteration, where the specific calculation formula is:
$P_i = \lambda_{i,1} S_{i,1} + \lambda_{i,2} S_{i,2} + \lambda_{i,3} S_{i,3} + \lambda_{i,4} S_{i,4}$

wherein $P_i$ is the performance index of the classification model at the $i$-th iteration, $\lambda_{i,1}$, $\lambda_{i,2}$, $\lambda_{i,3}$ and $\lambda_{i,4}$ are the weight coefficients, and $S_{i,1}$, $S_{i,2}$, $S_{i,3}$ and $S_{i,4}$ are respectively the accuracy, average precision, recall and F-score of the classification model at the $i$-th iteration;

the F-score $F_{Score}$ is the harmonic mean of the precision and the recall, and its specific calculation formula is:

$F_{Score} = \frac{(1+\beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$

wherein $\beta$ is used to balance the importance of the precision and the recall, and $P$ and $R$ are the precision and the recall, respectively;
sorting the performance indexes of all iterations over the whole training process, and, if the sorted result contains one maximum performance index, taking the weight parameters corresponding to that maximum performance index as the optimal weight parameters of the classification model;
and if the sorted result contains a plurality of maximum performance indexes, taking the weight parameters corresponding to the maximum performance index with the latest iteration period as the optimal weight parameters.
8. The array hydrophone-based voice recognition apparatus of claim 7, further comprising a classification model training module for:
and carrying out iterative training on the classification model according to the fusion characteristics of the sample audio signals and the reference category of the sample sound so as to iteratively optimize and adjust the weight parameters of the classification model, and storing the weight parameters of the classification model in each iteration until the loss value of the classification model converges and/or the classification accuracy meets the preset termination condition.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the array hydrophone-based voice recognition method of any of claims 1 to 6 when the program is executed.
10. A non-transitory computer readable storage medium, on which a computer program is stored which, when executed by a processor, implements the steps of the array hydrophone based voice recognition method of any of claims 1 to 6.
CN202111356702.XA 2021-11-16 2021-11-16 Voice recognition method and device based on array hydrophone Active CN114220458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111356702.XA CN114220458B (en) 2021-11-16 2021-11-16 Voice recognition method and device based on array hydrophone


Publications (2)

Publication Number Publication Date
CN114220458A CN114220458A (en) 2022-03-22
CN114220458B true CN114220458B (en) 2024-04-05

Family

ID=80697292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111356702.XA Active CN114220458B (en) 2021-11-16 2021-11-16 Voice recognition method and device based on array hydrophone

Country Status (1)

Country Link
CN (1) CN114220458B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722884B (en) * 2022-06-08 2022-09-30 深圳市润东来科技有限公司 Audio control method, device and equipment based on environmental sound and storage medium
CN116366082B (en) * 2023-01-30 2023-12-01 深圳大学 Method for improving signal-to-noise ratio of microsecond ultra-short pulse underwater sound signal
CN117198330B (en) * 2023-11-07 2024-01-30 国家海洋技术中心 Sound source identification method and system and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105845127A (en) * 2015-01-13 2016-08-10 阿里巴巴集团控股有限公司 Voice recognition method and system
CN110473568A (en) * 2019-08-08 2019-11-19 Oppo广东移动通信有限公司 Scene recognition method, device, storage medium and electronic equipment
CN112183582A (en) * 2020-09-07 2021-01-05 中国海洋大学 Multi-feature fusion underwater target identification method
CN112364779A (en) * 2020-11-12 2021-02-12 中国电子科技集团公司第五十四研究所 Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN112541547A (en) * 2020-12-14 2021-03-23 哈尔滨工程大学 Weighted fusion weight determination method supporting underwater rapid target recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8929564B2 (en) * 2011-03-03 2015-01-06 Microsoft Corporation Noise adaptive beamforming for microphone arrays

Also Published As

Publication number Publication date
CN114220458A (en) 2022-03-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant