CN112581942A - Method, system, device and medium for recognizing target object based on voice - Google Patents
Method, system, device and medium for recognizing target object based on voice
- Publication number
- CN112581942A (application CN202011596083.7A)
- Authority
- CN
- China
- Prior art keywords
- training
- voice
- age
- audios
- gender
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Abstract
The invention provides a method, a system, a device and a medium for recognizing a target object based on voice, which acquire one or more voice audios for training; convert the waveforms of the one or more voice audios for training into a feature vector sequence; train a classification algorithm with the feature vector sequence to generate a classification model; and identify one or more voice audios to be processed through the classification model, determining the age and/or gender of one or more target objects in the one or more voice audios to be processed. Because the invention performs processing based on audio information, it avoids the various occlusion problems that image information may introduce when identifying the age and/or gender of a target object. Meanwhile, the classification model is generated by training a deep learning neural network, so age and/or gender recognition based on audio information is more robust; the age and/or gender of the target object can be identified efficiently and accurately even in noisy and complex environments.
Description
Technical Field
The present invention relates to the field of audio processing technologies, and in particular, to a method, a system, a device, and a medium for recognizing a target object based on speech.
Background
In human-computer interaction, the content of an interactive answer is often chosen according to the age and/or gender of the user, which can generally be determined from visual or audio information. However, due to the influence of factors such as the shooting equipment, shooting environment, shooting parameters and shooting technique, face pictures differ in picture definition, picture background, face illumination, face size and so on. In addition, a speaker's face is often partly occluded, for example by a mask or sunglasses. In many scenes a camera cannot be installed at all, so the age and/or gender of the user cannot be determined from visual information. Audio-based age and/or gender identification, in turn, often fails to achieve good results because of the influence of the capture device or the capture environment.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, it is an object of the present invention to provide a method, system, device and medium for recognizing a target object based on speech, which solve the technical problems in the prior art.
To achieve the above and other related objects, the present invention provides a method for recognizing age and/or gender based on voice, comprising the steps of:
acquiring one or more voice audios for training;
performing feature signal processing on the one or more voice audios for training, and converting the waveforms of the one or more voice audios for training into a feature vector sequence;
training a classification algorithm with the feature vector sequence to generate a classification model;
identifying one or more voice audios to be processed through the classification model, and determining the age and/or gender of one or more target objects in the one or more voice audios to be processed.
Optionally, the classification algorithm comprises one or more deep learning neural networks; the deep learning neural network includes at least one of: a time delay neural network, a factorization time delay neural network and a forward sequence memory network.
Optionally, the forward sequence memory network includes multiple layers of deep forward sequence memory networks (DFSMN), with a skip connection arranged between every two layers; the memory blocks of the deep forward sequence memory networks differ in size from layer to layer, growing from small to large with the depth of the layer.
Optionally, a skip connection is made after the context and the stride in the deep forward sequence memory network change, and the gradient of the current skip connection is transferred both to the next skip connection and to the deep forward sequence memory network two layers away.
Optionally, when the feature vector sequence is used to train the classification algorithm, if the training loss in the classification algorithm no longer decreases for 2 epochs, the classification model generated by the training is taken as the final classification model.
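The stopping rule in this claim — halt once the training loss has not decreased for 2 consecutive epochs — can be sketched as follows; the function name, the patience window default and the sample loss values are illustrative, not part of the claim:

```python
def should_stop(loss_history, patience=2):
    """Return True when the last `patience` epochs show no decrease in
    training loss relative to the best loss seen before that window."""
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    # Stop only if no epoch in the patience window improved on the earlier best.
    return all(loss >= best_before for loss in loss_history[-patience:])

# Hypothetical run: loss plateaus after epoch 3, so training stops.
losses = [2.1, 1.4, 0.9, 0.91, 0.92]
```

A training loop would call `should_stop` after each epoch and keep the model checkpoint from the last epoch before the plateau.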
Optionally, the waveforms of the one or more voice audios for training are converted into a feature vector sequence by a feature extraction method; the feature extraction method includes at least one of: fast fourier transform, short-time fourier transform, framing, windowing, pre-emphasis, mel filtering, discrete cosine transform.
Optionally, the method further comprises changing the number of filters in the mel filter bank, and converting the waveforms of the one or more voice audios for training into a feature vector sequence through the mel filter bank with the changed number of filters.
Optionally, before converting the waveforms of the one or more voice audios for training into the feature vector sequence, the method further comprises uniformly converting the format of the one or more voice audios for training by down-sampling and/or up-sampling; the converted audio format includes at least: wav format and/or pcm format.
The invention also provides a system for identifying age and/or gender based on voice, which comprises:
a collection module for obtaining one or more voice audios for training;
a feature extraction module for converting the waveforms of the one or more voice audios for training into a feature vector sequence;
a training module for training a classification algorithm with the feature vector sequence to generate a classification model;
and an identification module for identifying one or more voice audios to be processed through the classification model and determining the age and/or gender of one or more target objects in the one or more voice audios to be processed.
Optionally, the classification algorithm comprises one or more deep learning neural networks; the deep learning neural network includes at least one of: a time delay neural network, a factorization time delay neural network and a forward sequence memory network.
Optionally, the forward sequence memory network includes multiple layers of deep forward sequence memory networks (DFSMN), with a skip connection arranged between every two layers, the skip connection being made after the context and the stride in the deep forward sequence memory networks change; the memory blocks of the deep forward sequence memory networks differ in size from layer to layer, growing from small to large with the depth of the layer.
Optionally, when the feature vector sequence is used to train the classification algorithm, if the training loss in the classification algorithm no longer decreases for 2 epochs, the classification model generated by the training is taken as the final classification model.
Optionally, the feature extraction module converts the waveforms of the one or more voice audios for training into a feature vector sequence by a feature extraction method; the feature extraction method includes at least one of: fast fourier transform, short-time fourier transform, framing, windowing, pre-emphasis, mel filtering, discrete cosine transform.
Optionally, the feature extraction module further changes the number of filters in the mel filter bank, and converts the waveforms of the one or more voice audios for training into a feature vector sequence through the mel filter bank with the changed number of filters.
Optionally, the system further includes a preprocessing module for uniformly converting the format of the one or more voice audios for training by down-sampling and/or up-sampling before their waveforms are converted into the feature vector sequence; the converted audio format includes at least: wav format and/or pcm format.
The present invention also provides a device for identifying age and/or gender based on speech, comprising:
acquiring one or more voice audios for training;
converting the waveforms of the one or more voice audios for training into a feature vector sequence;
training a classification algorithm with the feature vector sequence to generate a classification model;
identifying one or more voice audios to be processed through the classification model, and determining the age and/or gender of one or more target objects in the one or more voice audios to be processed.
The present invention also provides an apparatus comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform a method as in any one of the above.
The present invention also provides one or more machine-readable media having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform the method as recited in any of the above.
As described above, the present invention provides a method, a system, a device and a medium for recognizing a target object based on voice, with the following advantages: one or more voice audios for training are acquired; the waveforms of the one or more voice audios for training are converted into a feature vector sequence; a classification algorithm is trained with the feature vector sequence to generate a classification model; and one or more voice audios to be processed are identified through the classification model, determining the age and/or gender of one or more target objects in them. The invention performs processing based on audio information, and thus avoids the problems that image information may introduce when identifying the age and/or gender of the target object: inconsistent picture definition, picture background, face illumination and face size, or various kinds of occlusion. Meanwhile, the classification model is generated by training a deep learning neural network, so age and/or gender recognition based on audio information is more robust; the age and/or gender of the target object can be identified efficiently and accurately in noisy and complex environments.
Drawings
FIG. 1 is a flowchart illustrating a method for recognizing a target object based on speech according to an embodiment;
FIG. 2 is a flowchart illustrating the process of identifying the age of a target object based on speech according to one embodiment;
FIG. 3 is a flowchart illustrating the process of recognizing gender of a target object based on speech according to an embodiment;
FIG. 4 is a block diagram of a forward sequence memory network according to an embodiment;
FIG. 5 is a block diagram of a system for recognizing a target object based on speech according to an embodiment;
FIG. 6 is a schematic hardware structure diagram of a terminal device according to an embodiment;
FIG. 7 is a schematic hardware structure diagram of a terminal device according to another embodiment.
Description of the element reference numerals
M10 acquisition module
M20 feature extraction module
M30 training module
M40 identification module
M50 preprocessing module
1100 input device
1101 first processor
1102 output device
1103 first memory
1104 communication bus
1200 processing assembly
1201 second processor
1202 second memory
1203 communication assembly
1204 Power supply Assembly
1205 multimedia assembly
1206 Audio component
1207 input/output interface
1208 sensor assembly
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Referring to fig. 1 to 4, the present invention provides a method for identifying the age and/or gender of a target object based on voice, comprising the following steps:
S100, acquiring one or more voice audios for training;
S200, converting the waveforms of the one or more voice audios for training into a feature vector sequence;
S300, training a classification algorithm with the feature vector sequence to generate a classification model;
S400, identifying one or more voice audios to be processed through the classification model, and determining the age and/or gender of one or more target objects in the one or more voice audios to be processed.
The method performs processing based on audio information, and avoids the problems that image information may introduce when identifying the age and/or gender of the target object: inconsistent picture definition, picture background, face illumination and face size, or various kinds of occlusion. Meanwhile, the method generates a classification model by training a deep learning neural network, so age and/or gender recognition based on audio information is more robust; the age and/or gender of the target object can be identified efficiently and accurately in noisy and complex environments.
As shown in fig. 2 and 4, an embodiment of the present application provides a method for recognizing an age of a target object based on speech, including the following steps:
and an audio input stage. The speech audio of one or more speakers is collected by a microphone or an array of microphones. The time for acquiring the voice audio in the embodiment of the application can be more than 1 second or more than 2 seconds; the collected voice audio includes at least voice audio of speakers of different ages. As an example, the division of the age group in the embodiment of the present application may be: (1) infants, 0-3 years old; (2) children, 3-12 years old; (3) teenagers, 12-18 years old; (4) middle-aged people, 18-45 years old; (5) the middle-aged and the elderly: over 45 years old.
Audio preprocessing stage. Embodiments of the present application generally require the input speech audio to be in wav and/or pcm format at a 16 kHz sampling rate, so audio in other formats needs to be uniformly converted by an audio processing method, for example by down-sampling and/or up-sampling the one or more voice audios for training. The sampling rate, also called sampling speed or sampling frequency, defines the number of samples per second extracted from a continuous signal to form a discrete signal, and is expressed in hertz (Hz). The inverse of the sampling frequency is the sampling period or sampling time, i.e. the time interval between samples; colloquially, the sampling frequency is how many signal samples a computer takes per second. Common audio formats include, but are not limited to: wav, pcm, mp3, ape, wma and the like.
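The down-/up-sampling step can be sketched with plain linear interpolation; this is a minimal illustration under stated assumptions, not the audio processing method of the embodiment, and a production resampler would low-pass filter before downsampling to avoid aliasing:

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a mono PCM signal from src_rate to dst_rate (Hz)
    by linear interpolation between neighbouring samples."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Upsampling 8 kHz audio to the required 16 kHz doubles the sample count.
```

Converting between wav and raw pcm is then only a matter of container headers; the sample data itself is unchanged.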
Feature extraction stage. So that the classification model can read the input audio signal normally, the embodiment of the present application may first perform feature extraction on the audio signal: for example, the one or more voice audios for training may be subjected to feature signal processing by a feature extraction method, converting their waveforms into a feature vector sequence. The feature extraction method comprises at least one of: fast fourier transform, short-time fourier transform, framing, windowing, pre-emphasis, mel filtering, discrete cosine transform. The embodiment of the present application may also change the number of filters in the mel filter bank and convert the waveforms of the one or more voice audios for training into a feature vector sequence through the mel filter bank with the changed number of filters. As an example, if the voice audio is wav and/or pcm audio at a 16 kHz sampling rate, the number of filters in the mel filter bank may be changed from 40 to 80; converting the waveforms through the 80-filter bank not only extracts the details of the spectrum of 16 kHz audio better, but also lets the classification model learn the audio features contained in the signal better, improving the recognition rate.
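Two of the listed feature-extraction steps — pre-emphasis, and framing with a Hamming window — can be sketched as follows; the coefficient 0.97 and the 25 ms frame / 10 ms hop (400/160 samples at 16 kHz) are common defaults, not values stated in the application, and the FFT, mel filtering and DCT steps are omitted:

```python
import math

def preemphasize(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]: boosts the high-frequency part
    of the spectrum before framing."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_and_window(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply a Hamming
    window to each frame to reduce spectral leakage."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

Each windowed frame would then go through an FFT and the mel filter bank (40 or 80 filters, as discussed above) to yield one feature vector of the sequence.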
Age identification stage. The embodiment of the present application trains a classification algorithm with the obtained feature vector sequence to generate an age classification model, identifies one or more voice audios to be processed through the age classification model, and determines the ages of one or more target objects in the one or more voice audios to be processed. In the embodiment of the present application, each voice audio used for training contains at least one sentence of speech and is labeled with an age, so that each training audio has a corresponding age label; in the process of training the classification model, audio samples of different age groups can be selected so that the proportion of each age group is balanced. The classification algorithm in the embodiment of the present application comprises one or more deep learning neural networks; the deep learning neural network includes at least one of: a time delay neural network (TDNN), a factorized time delay neural network (TDNNF) and a forward sequence memory network (FSMN). Fig. 4 schematically illustrates the structure of the forward sequence memory network FSMN; the structure comprises 10 layers of deep forward sequence memory networks (DFSMN), and a skip connection (shortcut) exists between every two DFSMN layers. Setting the skip connections optimizes the gradient transfer of the DFSMN, so that gradients propagate better through the stack and the model trains more effectively. In the embodiment of the present application, a skip connection is made after the context and the stride in the deep forward sequence memory network change, that is, after the values a and b of a DFSMN layer change.
When a skip connection is made, the gradient of the current skip connection is sent both to the next skip connection and to the DFSMN two layers away. Here, a x b in each deep forward sequence memory network layer means that the context is a and the stride is b; for example, 4 x 1 DFSMN indicates a context of 4 and a stride of 1; 8 x 1 DFSMN indicates a context of 8 and a stride of 1; 6 x 2 DFSMN indicates a context of 6 and a stride of 2; 10 x 2 DFSMN indicates a context of 10 and a stride of 2. When 4 x 1 DFSMN changes to 8 x 1 DFSMN, the DFSMN corresponding to 8 x 1 starts a skip connection, transferring its gradient to the DFSMN corresponding to 6 x 2 and to the next skip connection. The memory blocks of the deep forward sequence memory networks differ in size from layer to layer, growing from small to large with the depth of the layer: in fig. 4, the layer of the 4 x 1 DFSMN is lower than the layer of the 8 x 1 DFSMN, the layer of the 8 x 1 DFSMN is lower than the layer of the 6 x 2 DFSMN, and so on. When the forward sequence memory network FSMN is trained, the loss function used is cross entropy, judging the KL divergence between the classification result and the label; when the cross entropy has gradually decreased to the point where it basically no longer changes as training continues, that is, when the training loss no longer decreases for 2 epochs, the neural network training is considered to have reached a usable effect, and the classification model generated by the training is taken as the final classification model.
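A single DFSMN layer of the kind described — a look-back memory block with context a sampled every b frames (the stride), plus a skip connection from input to output — can be sketched with scalar hidden states; the coefficients here are hypothetical, and real layers use vector states and typically also look ahead:

```python
def fsmn_memory(hidden, context, stride, coeffs):
    """Memory block: m[t] = sum over i of coeffs[i-1] * hidden[t - i*stride],
    i.e. a weighted sum of `context` past hidden values spaced `stride` apart."""
    out = []
    for t in range(len(hidden)):
        acc = 0.0
        for i in range(1, context + 1):
            j = t - i * stride
            if j >= 0:
                acc += coeffs[i - 1] * hidden[j]
        out.append(acc)
    return out

def dfsmn_layer(hidden, context, stride, coeffs):
    """Skip connection: the layer input is added to the memory output, so
    gradients can flow straight through the 10-layer DFSMN stack."""
    return [h + m
            for h, m in zip(hidden, fsmn_memory(hidden, context, stride, coeffs))]
```

A larger stride lets deeper layers cover a longer time span with the same number of coefficients, which matches the progression from 4 x 1 layers up to 10 x 2 layers in fig. 4.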
Meanwhile, after two layers of time-restricted self-attention are added behind the forward sequence memory network FSMN structure, compared with an ordinary self-attention network, the embodiment of the present application can extract the audio context information of only part of the time span of a voice audio, instead of extracting the context information of the entire voice audio every time, which helps improve the inference speed of the model.
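The time restriction amounts to computing the attention softmax over a local window around each query frame rather than over the whole utterance; a minimal sketch, with hypothetical similarity scores standing in for real query/key projections:

```python
import math

def time_restricted_weights(scores, t, left, right):
    """Attention weights for query position t, softmaxed only over the
    window [t-left, t+right]; frames outside the window get weight 0.
    `scores` are per-frame similarity values (hypothetical here)."""
    lo = max(0, t - left)
    hi = min(len(scores), t + right + 1)
    exp_scores = [math.exp(s) for s in scores[lo:hi]]
    total = sum(exp_scores)
    return ([0.0] * lo
            + [e / total for e in exp_scores]
            + [0.0] * (len(scores) - hi))
```

Because each frame attends to at most left + right + 1 neighbours, the cost per frame is constant instead of growing with utterance length, which is the inference-speed benefit described above.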
Result output stage. After the age of one or more target objects in the one or more voice audios to be processed is determined, the age of the one or more target objects may also be output in text form.
This embodiment performs processing based on audio information, and avoids the problems that image information may introduce in age identification: inconsistent picture definition, picture background, face illumination and face size, or various kinds of occlusion. Meanwhile, this embodiment generates a classification model by training a deep learning neural network, so age recognition based on audio information is more robust; the age of the target object can be identified efficiently and accurately in a noisy and complex environment. The target object in the embodiment of the present application may be any person whose age is to be identified, as the actual situation requires.
As shown in fig. 3 and 4, an embodiment of the present application provides a method for recognizing gender of a target object based on voice, including the following steps:
and an audio input stage. The speech audio of one or more speakers is collected by a microphone or an array of microphones. The time for acquiring the voice audio in the embodiment of the application can be more than 1 second or more than 2 seconds; the collected voice audio includes at least voice audio of speakers of different genders. As an example, the gender classification in the embodiment of the present application may be: male, female and third sex.
Audio preprocessing stage. Embodiments of the present application generally require the input speech audio to be in wav and/or pcm format at a 16 kHz sampling rate, so audio in other formats needs to be uniformly converted by an audio processing method, for example by down-sampling and/or up-sampling the one or more voice audios for training. The sampling rate, also called sampling speed or sampling frequency, defines the number of samples per second extracted from a continuous signal to form a discrete signal, and is expressed in hertz (Hz). The inverse of the sampling frequency is the sampling period or sampling time, i.e. the time interval between samples; colloquially, the sampling frequency is how many signal samples a computer takes per second. Common audio formats include, but are not limited to: wav, pcm, mp3, ape, wma and the like.
Feature extraction stage. So that the classification model can read the input audio signal normally, the embodiment of the present application may first perform feature extraction on the audio signal: for example, the one or more voice audios for training may be subjected to feature signal processing by a feature extraction method, converting their waveforms into a feature vector sequence. The feature extraction method comprises at least one of: fast fourier transform, short-time fourier transform, framing, windowing, pre-emphasis, mel filtering, discrete cosine transform. The embodiment of the present application may also change the number of filters in the mel filter bank and convert the waveforms of the one or more voice audios for training into a feature vector sequence through the mel filter bank with the changed number of filters. As an example, if the voice audio is wav and/or pcm audio at a 16 kHz sampling rate, the number of filters in the mel filter bank may be changed from 40 to 80; converting the waveforms through the 80-filter bank not only extracts the details of the spectrum of 16 kHz audio better, but also lets the classification model learn the audio features contained in the signal better, improving the recognition rate.
Gender identification stage. A gender classification model can be generated by training a classification algorithm with the obtained feature vector sequence; the one or more voice audios to be processed are then identified by the gender classification model to determine the gender of the one or more target objects in them. In the embodiment of the application, each voice audio used for training contains at least one sentence of speech and is labeled with a gender, so that each training audio has a corresponding gender label; during training of the classification model, audio samples of different genders can be selected so that the proportion of each gender is balanced. The classification algorithm comprises one or more deep learning neural networks, including at least one of: a time-delay neural network (TDNN), a factorized time-delay neural network (TDNNF) and a feedforward sequential memory network (FSMN). Fig. 4 schematically illustrates the structure of the FSMN: it comprises 10 layers of deep feedforward sequential memory networks (DFSMNs), with a skip connection (shortcut) between every two DFSMN layers. The skip connections optimize gradient propagation through the DFSMN stack, letting gradients flow better and improving the training result. In the embodiment of the application, a skip connection is placed wherever the context and stride of the deep feedforward sequential memory network change, that is, wherever a and b change between DFSMN layers.
At a skip connection, the gradient of the current shortcut is passed both to the next shortcut and to the DFSMN two layers away. Here a × b for a DFSMN layer means a context of a and a stride of b: 4 × 1 DFSMN denotes a context of 4 and a stride of 1; 8 × 1 DFSMN a context of 8 and a stride of 1; 6 × 2 DFSMN a context of 6 and a stride of 2; 10 × 2 DFSMN a context of 10 and a stride of 2. When the layers change from 8 × 1 DFSMN to 6 × 2 DFSMN, the DFSMN corresponding to 6 × 2 starts a skip connection, passing the gradient on to the DFSMN corresponding to 10 × 2 and to the next skip connection. The memory blocks of the DFSMN layers differ in size, growing from small to large with depth: corresponding to fig. 4, the layer of the 4 × 1 DFSMN is lower than that of the 8 × 1 DFSMN, the layer of the 8 × 1 DFSMN is lower than that of the 6 × 2 DFSMN, and so on. When training the FSMN, the loss function is the cross entropy, evaluating the KL divergence between the classification result and the label; when the cross entropy decreases until it essentially stops changing as training continues, that is, when the training loss has not decreased for 2 epochs, the neural network training is considered to have reached a usable result, and the classification model generated by the training is taken as the final classification model.
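The a × b (context × stride) behavior of a DFSMN layer can be illustrated with a minimal memory-block sketch. The shared tap weight below is an assumption standing in for the learned per-tap filter coefficients, and real DFSMN layers operate on feature vectors rather than scalars.

```python
def fsmn_memory(frames, context, stride, weight=0.5):
    """FSMN-style memory block on a 1-D feature sequence (sketch).
    Each output frame adds a weighted sum of `context` past and `context`
    future frames, sampled every `stride` steps; `weight` is an assumed
    shared coefficient standing in for the learned per-tap filters."""
    n = len(frames)
    out = []
    for t in range(n):
        acc = frames[t]
        for i in range(1, context + 1):
            past, future = t - i * stride, t + i * stride
            if past >= 0:
                acc += weight * frames[past]
            if future < n:
                acc += weight * frames[future]
        out.append(acc)
    return out

# An "8 x 1" layer takes 8 taps back and forward with stride 1, while a
# "6 x 2" layer takes 6 taps back and forward but skips every other frame,
# covering a wider time span with fewer taps.
seq = [1.0] * 20
print(fsmn_memory(seq, context=8, stride=1)[10])   # 1 + 16 taps * 0.5 = 9.0
```

Because each layer only needs these local weighted sums (no recurrence), the stack stays cheap to evaluate, and the skip connections described above carry the gradient past layers whose a × b configuration changes.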
Meanwhile, by adding two layers of time-restricted self-attention networks after the FSMN structure, the embodiment of the application can, compared with an ordinary self-attention network, extract audio context information for only part of the time span of a voice audio instead of extracting context over the entire voice audio every time, which helps improve the model's inference speed.
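The time restriction can be pictured as an attention mask that limits each query frame to a window of neighboring frames. The window sizes below are assumed for illustration; the text gives no concrete values.

```python
def restricted_attention(n_frames, t, left, right):
    """Attention mask for query frame t: only frames within
    [t - left, t + right] are attended (the time restriction described
    above), instead of all n_frames as in ordinary self-attention."""
    return [1 if t - left <= j <= t + right else 0
            for j in range(n_frames)]

# Frame 5 of a 12-frame utterance attends to frames 2..8 only.
mask = restricted_attention(12, 5, left=3, right=3)
print(mask)   # [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

With the mask, the attention cost per frame is bounded by the window width rather than the utterance length, which is the source of the inference-speed gain claimed above.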
Result output stage. After the gender of the one or more target objects in the one or more voice audios to be processed is determined, the gender of the one or more target objects can also be output in text form.
This embodiment operates on audio information and therefore avoids the problems that image information may bring to gender identification: picture clarity, picture background, face illumination, inconsistent face sizes, and various kinds of occlusion. Meanwhile, because the classification model is generated by training a deep learning neural network, gender identification based on audio information is more robust; the gender of the target object can be identified efficiently and accurately even in noisy, complex environments. The target object in the embodiment of the present application may be any person whose gender is to be identified, according to the actual situation.
As shown in fig. 5, the present invention further provides a system for recognizing the age and/or gender of a target object based on voice, comprising:
a collecting module M10 for acquiring one or more voice audios as training;
a feature extraction module M20, configured to convert waveforms of one or more speech audios as training into a feature vector sequence;
the training module M30 is used for training a classification algorithm according to the characteristic vector sequence to generate a classification model;
the identification module M40 is configured to identify one or more voice audios to be processed through the classification model, and determine the age and/or gender of one or more target objects in the one or more voice audios to be processed.
The system operates on audio information and therefore avoids the problems that image information may bring to identifying the age and/or gender of the target object: picture clarity, picture background, face illumination, inconsistent face sizes, and various kinds of occlusion. Meanwhile, because the system generates the classification model by training a deep learning neural network, identifying age and/or gender on the basis of audio information is more robust; the age and/or gender of the target object can be identified efficiently and accurately in noisy, complex environments.
In an exemplary embodiment, a preprocessing module M50 is further included for uniformly converting the format of the one or more voice audios as training by down-sampling and/or up-sampling before converting the waveforms of the one or more voice audios as training into a feature vector sequence; the converted audio format includes at least: wav format and/or pcm format.
As shown in fig. 4 and 5, an embodiment of the present application provides a system for recognizing an age of a target object based on speech, including:
and an audio input stage. The speech audio of one or more speakers is collected by a microphone or an array of microphones. The time for acquiring the voice audio in the embodiment of the application can be more than 1 second or more than 2 seconds; the collected voice audio includes at least voice audio of speakers of different ages. As an example, the division of the age group in the embodiment of the present application may be: (1) infants, 0-3 years old; (2) children, 3-12 years old; (3) teenagers, 12-18 years old; (4) middle-aged people, 18-45 years old; (5) the middle-aged and the elderly: over 45 years old.
Audio preprocessing stage. Embodiments of the present application generally require the input speech audio to be 16 kHz wav and/or pcm; audio in other formats therefore needs to be converted uniformly by an audio processing method, for example by down-sampling and/or up-sampling the one or more voice audios used for training. The sampling rate, also called the sampling speed or sampling frequency, is the number of samples taken per second from a continuous signal to form a discrete signal, expressed in hertz (Hz); its reciprocal is the sampling period or sampling time, i.e. the time interval between samples. Colloquially, the sampling frequency is how many samples of the signal a computer takes per second. Common audio formats include, but are not limited to, wav, pcm, mp3, ape and wma.
Feature extraction stage. So that the classification model can read the input audio signal, the embodiment of the application may first perform feature extraction on the audio signal: the one or more voice audios used for training are processed by a feature extraction method, and their waveforms are converted into a feature vector sequence. The feature extraction method comprises at least one of the following: fast Fourier transform, short-time Fourier transform, framing, windowing, pre-emphasis, mel filtering, discrete cosine transform. The embodiment of the application may also change the number of filters in the mel filterbank and convert the waveforms of the one or more training voice audios into a feature vector sequence through the changed filterbank. As an example, if the voice audio is 16 kHz wav and/or pcm audio, the number of mel filters may be increased from 40 to 80; converting the waveforms into feature vector sequences through the 80-filter bank not only captures the details of the 16 kHz spectrum better, but also lets the classification model learn the information contained in the audio features better, improving the recognition rate.
Age identification stage. An age classification model can be generated by training a classification algorithm with the obtained feature vector sequence; the one or more voice audios to be processed are then identified by the age classification model to determine the ages of the one or more target objects in them. In the embodiment of the application, each voice audio used for training contains at least one sentence of speech and is labeled with an age, so that each training audio has a corresponding age label; during training of the classification model, audio samples of different age groups can be selected so that the proportion of each age group is balanced. The classification algorithm comprises one or more deep learning neural networks, including at least one of: a time-delay neural network (TDNN), a factorized time-delay neural network (TDNNF) and a feedforward sequential memory network (FSMN). Fig. 4 schematically illustrates the structure of the FSMN: it comprises 10 layers of deep feedforward sequential memory networks (DFSMNs), with a skip connection (shortcut) between every two DFSMN layers. The skip connections optimize gradient propagation through the DFSMN stack, letting gradients flow better and improving the training result. In the embodiment of the application, a skip connection is placed wherever the context and stride of the deep feedforward sequential memory network change, that is, wherever a and b change between DFSMN layers.
At a skip connection, the gradient of the current shortcut is passed both to the next shortcut and to the DFSMN two layers away. Here a × b for a DFSMN layer means a context of a and a stride of b: 4 × 1 DFSMN denotes a context of 4 and a stride of 1; 8 × 1 DFSMN a context of 8 and a stride of 1; 6 × 2 DFSMN a context of 6 and a stride of 2; 10 × 2 DFSMN a context of 10 and a stride of 2. When the layers change from 4 × 1 DFSMN to 8 × 1 DFSMN, the DFSMN corresponding to 8 × 1 starts a skip connection, passing the gradient on to the DFSMN corresponding to 6 × 2 and to the next skip connection. The memory blocks of the DFSMN layers differ in size, growing from small to large with depth: corresponding to fig. 4, the layer of the 4 × 1 DFSMN is lower than that of the 8 × 1 DFSMN, the layer of the 8 × 1 DFSMN is lower than that of the 6 × 2 DFSMN, and so on. When training the FSMN, the loss function is the cross entropy, evaluating the KL divergence between the classification result and the label; when the cross entropy decreases until it essentially stops changing as training continues, that is, when the training loss has not decreased for 2 epochs, the neural network training is considered to have reached a usable result, and the classification model generated by the training is taken as the final classification model.
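The stopping rule above, where training is considered converged once the loss has not dropped for 2 epochs, can be sketched as follows. The `min_delta` tolerance is an assumed detail, not taken from the text.

```python
def training_converged(epoch_losses, patience=2, min_delta=1e-4):
    """Return True when the cross-entropy loss has not improved for
    `patience` consecutive epochs (the 2-epoch rule described above);
    `min_delta` is an assumed tolerance for "basically unchanged"."""
    if len(epoch_losses) <= patience:
        return False
    best_before = min(epoch_losses[:-patience])
    recent = epoch_losses[-patience:]
    return all(loss > best_before - min_delta for loss in recent)

# Loss drops steadily, then stalls for the last 2 epochs: stop training.
losses = [2.10, 1.40, 0.95, 0.80, 0.805, 0.801]
print(training_converged(losses))   # True
```

A training loop would call this after each epoch and, once it returns True, keep the model produced so far as the final classification model.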
Meanwhile, by adding two layers of time-restricted self-attention networks after the FSMN structure, the embodiment of the application can, compared with an ordinary self-attention network, extract audio context information for only part of the time span of a voice audio instead of extracting context over the entire voice audio every time, which helps improve the model's inference speed.
Result output stage. After the age of the one or more target objects in the one or more voice audios to be processed is determined, the age of the one or more target objects can also be output in text form.
This embodiment operates on audio information and therefore avoids the problems that image information may bring to age identification: picture clarity, picture background, face illumination, inconsistent face sizes, and various kinds of occlusion. Meanwhile, because the embodiment generates the classification model by training a deep learning neural network, age identification based on audio information is more robust; the age of the target object can be identified efficiently and accurately even in noisy, complex environments. The target object in the embodiment of the present application may be any person whose age is to be identified, according to the actual situation.
As shown in fig. 4 and 5, an embodiment of the present application provides a system for recognizing gender of a target object based on voice, including:
and an audio input stage. The speech audio of one or more speakers is collected by a microphone or an array of microphones. The time for acquiring the voice audio in the embodiment of the application can be more than 1 second or more than 2 seconds; the collected voice audio includes at least voice audio of speakers of different genders. As an example, the gender classification in the embodiment of the present application may be: male, female and third sex.
Audio preprocessing stage. Embodiments of the present application generally require the input speech audio to be 16 kHz wav and/or pcm; audio in other formats therefore needs to be converted uniformly by an audio processing method, for example by down-sampling and/or up-sampling the one or more voice audios used for training. The sampling rate, also called the sampling speed or sampling frequency, is the number of samples taken per second from a continuous signal to form a discrete signal, expressed in hertz (Hz); its reciprocal is the sampling period or sampling time, i.e. the time interval between samples. Colloquially, the sampling frequency is how many samples of the signal a computer takes per second. Common audio formats include, but are not limited to, wav, pcm, mp3, ape and wma.
Feature extraction stage. So that the classification model can read the input audio signal, the embodiment of the application may first perform feature extraction on the audio signal: the one or more voice audios used for training are processed by a feature extraction method, and their waveforms are converted into a feature vector sequence. The feature extraction method comprises at least one of the following: fast Fourier transform, short-time Fourier transform, framing, windowing, pre-emphasis, mel filtering, discrete cosine transform. The embodiment of the application may also change the number of filters in the mel filterbank and convert the waveforms of the one or more training voice audios into a feature vector sequence through the changed filterbank. As an example, if the voice audio is 16 kHz wav and/or pcm audio, the number of mel filters may be increased from 40 to 80; converting the waveforms into feature vector sequences through the 80-filter bank not only captures the details of the 16 kHz spectrum better, but also lets the classification model learn the information contained in the audio features better, improving the recognition rate.
Gender identification stage. A gender classification model can be generated by training a classification algorithm with the obtained feature vector sequence; the one or more voice audios to be processed are then identified by the gender classification model to determine the gender of the one or more target objects in them. In the embodiment of the application, each voice audio used for training contains at least one sentence of speech and is labeled with a gender, so that each training audio has a corresponding gender label; during training of the classification model, audio samples of different genders can be selected so that the proportion of each gender is balanced. The classification algorithm comprises one or more deep learning neural networks, including at least one of: a time-delay neural network (TDNN), a factorized time-delay neural network (TDNNF) and a feedforward sequential memory network (FSMN). Fig. 4 schematically illustrates the structure of the FSMN: it comprises 10 layers of deep feedforward sequential memory networks (DFSMNs), with a skip connection (shortcut) between every two DFSMN layers. The skip connections optimize gradient propagation through the DFSMN stack, letting gradients flow better and improving the training result. In the embodiment of the application, a skip connection is placed wherever the context and stride of the deep feedforward sequential memory network change, that is, wherever a and b change between DFSMN layers.
At a skip connection, the gradient of the current shortcut is passed both to the next shortcut and to the DFSMN two layers away. Here a × b for a DFSMN layer means a context of a and a stride of b: 4 × 1 DFSMN denotes a context of 4 and a stride of 1; 8 × 1 DFSMN a context of 8 and a stride of 1; 6 × 2 DFSMN a context of 6 and a stride of 2; 10 × 2 DFSMN a context of 10 and a stride of 2. When the layers change from 8 × 1 DFSMN to 6 × 2 DFSMN, the DFSMN corresponding to 6 × 2 starts a skip connection, passing the gradient on to the DFSMN corresponding to 10 × 2 and to the next skip connection. The memory blocks of the DFSMN layers differ in size, growing from small to large with depth: corresponding to fig. 4, the layer of the 4 × 1 DFSMN is lower than that of the 8 × 1 DFSMN, the layer of the 8 × 1 DFSMN is lower than that of the 6 × 2 DFSMN, and so on. When training the FSMN, the loss function is the cross entropy, evaluating the KL divergence between the classification result and the label; when the cross entropy decreases until it essentially stops changing as training continues, that is, when the training loss has not decreased for 2 epochs, the neural network training is considered to have reached a usable result, and the classification model generated by the training is taken as the final classification model.
Meanwhile, by adding two layers of time-restricted self-attention networks after the FSMN structure, the embodiment of the application can, compared with an ordinary self-attention network, extract audio context information for only part of the time span of a voice audio instead of extracting context over the entire voice audio every time, which helps improve the model's inference speed.
Result output stage. After the gender of the one or more target objects in the one or more voice audios to be processed is determined, the gender of the one or more target objects can also be output in text form.
This embodiment operates on audio information and therefore avoids the problems that image information may bring to gender identification: picture clarity, picture background, face illumination, inconsistent face sizes, and various kinds of occlusion. Meanwhile, because the classification model is generated by training a deep learning neural network, gender identification based on audio information is more robust; the gender of the target object can be identified efficiently and accurately even in noisy, complex environments. The target object in the embodiment of the present application may be any person whose gender is to be identified, according to the actual situation.
The embodiment of the present application further provides a device for identifying age and/or gender based on voice, which is configured to perform the following:
acquiring one or more voice audios as training;
converting the waveform of the one or more speech audios as training into a sequence of feature vectors;
training a classification algorithm by using the characteristic vector sequence to generate a classification model;
identifying one or more voice audios to be processed through the classification model, and determining the age and/or gender of one or more target objects in the one or more voice audios to be processed.
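The four steps above can be tied together in a toy end-to-end sketch. Everything here is purely illustrative and not the patent's actual method: the deep learning network is replaced by a nearest-centroid stand-in, and the mean-absolute-amplitude "feature" stands in for the real feature vector sequence.

```python
def extract_feature(audio):
    """Toy stand-in for feature extraction: mean absolute amplitude."""
    return sum(abs(x) for x in audio) / len(audio)

def train(labeled_audios):
    """Steps 1-3: collect labeled audio, extract features, and fit a
    'model' (here just one feature centroid per class label)."""
    sums, counts = {}, {}
    for audio, label in labeled_audios:
        f = extract_feature(audio)
        sums[label] = sums.get(label, 0.0) + f
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def identify(model, audio):
    """Step 4: classify a to-be-processed audio with the trained model."""
    f = extract_feature(audio)
    return min(model, key=lambda label: abs(model[label] - f))

training_data = [([0.1, 0.2, 0.1], "female"), ([0.7, 0.9, 0.8], "male")]
model = train(training_data)
print(identify(model, [0.75, 0.85, 0.8]))   # "male"
```

In the actual embodiments the "model" is the trained TDNN/TDNNF/FSMN classifier and the feature is the mel filterbank vector sequence, but the train-then-identify control flow is the same.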
In this embodiment, the device executes the system or the method described above; for specific functions and technical effects, refer to the foregoing embodiments, which are not repeated here.
An embodiment of the present application further provides an apparatus, which may include: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of fig. 1. In practical applications, the apparatus may serve as a terminal device or as a server. Examples of the terminal device include: a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, a smart television, a wearable device, and the like.
The present embodiment also provides a non-volatile readable storage medium, where one or more modules (programs) are stored in the storage medium, and when the one or more modules are applied to a device, the device may execute instructions (instructions) included in the data processing method in fig. 1 according to the present embodiment.
Fig. 6 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.
In this embodiment, the processor of the terminal device includes a function for executing each module of the speech recognition apparatus in each device, and specific functions and technical effects may refer to the above embodiments, which are not described herein again.
Fig. 7 is a schematic hardware structure diagram of a terminal device according to another embodiment of the present application. FIG. 7 is a specific embodiment of the implementation of FIG. 6. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: communication components 1203, power components 1204, multimedia components 1205, audio components 1206, input/output interfaces 1207, and/or sensor components 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the method illustrated in fig. 1 described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power components 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia components 1205 include a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The audio component 1206 is configured to output and/or input speech signals. For example, the audio component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, audio component 1206 also includes a speaker for outputting voice signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect the open/closed state of the terminal device, the relative positioning of components, and the presence or absence of user contact with the terminal device. The sensor component 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate wired or wireless communication between the terminal device and other devices. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot for inserting a SIM card, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.
As can be seen from the above, the communication component 1203, the audio component 1206, the input/output interface 1207 and the sensor component 1208 in the embodiment of fig. 7 may be implemented as the input device in the embodiment of fig. 6.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical principles of the present invention are intended to be covered by the claims of the present invention.
Claims (13)
1. A method for identifying age and/or gender based on speech, comprising the steps of:
acquiring one or more voice audios for training;
converting the waveform of the one or more voice audios for training into a feature vector sequence;
training a classification algorithm with the feature vector sequence to generate a classification model;
identifying one or more voice audios to be processed through the classification model, and determining the age and/or gender of one or more target objects in the one or more voice audios to be processed.
2. The method for identifying age and/or gender based on speech according to claim 1, wherein the classification algorithm comprises one or more deep learning neural networks; the deep learning neural network comprises at least one of: a time delay neural network, a factorized time delay neural network and a forward sequence memory network.
3. The method for identifying age and/or gender based on speech according to claim 2, wherein the forward sequence memory network comprises a plurality of layers of deep forward sequence memory networks, with a skip connection arranged between every two layers of deep forward sequence memory networks; the memory modules of the deep forward sequence memory networks differ in size across layers, growing from small to large with increasing layer depth.
4. The method of claim 3, wherein the skip connection is performed after the context and stride of the deep forward sequence memory network are changed, and the gradient of the current skip connection is transmitted to the next skip connection and to the deep forward sequence memory network two layers away.
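Claims 3 and 4 describe a stack of deep forward sequence memory network (FSMN) layers with memory modules that grow with depth and skip connections bridging every two layers. The following numpy sketch illustrates that shape only; the layer count, dimensions, memory orders, and tanh nonlinearity are illustrative assumptions, not values from the patent.

```python
import numpy as np

class FSMNLayer:
    """One forward-sequential-memory layer: a linear projection plus a
    learned weighted sum over a window of past frames (the 'memory module').
    The memory order (window size) plays the role of the module's size."""
    def __init__(self, dim, memory_order, rng):
        self.W = rng.standard_normal((dim, dim)) * 0.1
        self.mem_w = rng.standard_normal(memory_order) * 0.1  # one weight per lag
        self.order = memory_order

    def forward(self, x):  # x: (frames, dim)
        mem = np.zeros_like(x)
        for lag in range(1, self.order + 1):
            mem[lag:] += self.mem_w[lag - 1] * x[:-lag]  # accumulate past frames
        return np.tanh(x @ self.W + mem)

def deep_fsmn(x, layers):
    """Stack FSMN layers with a skip connection every two layers, so
    activations (and, in training, gradients) can bypass intermediate blocks."""
    skip = x
    for i, layer in enumerate(layers, start=1):
        x = layer.forward(x)
        if i % 2 == 0:      # skip connection spans two layers
            x = x + skip
            skip = x
    return x

rng = np.random.default_rng(0)
# memory modules grow from small to large with depth, as claim 3 states
layers = [FSMNLayer(16, order, rng) for order in (2, 4, 8, 16)]
out = deep_fsmn(rng.standard_normal((50, 16)), layers)  # (50, 16) output
```

The identity skip used here is one common choice; the patent does not specify how the bypassed activations are combined.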
5. The method for identifying age and/or gender based on speech according to claim 1, wherein the waveform of the one or more voice audios for training is converted into a feature vector sequence by a feature extraction method; the feature extraction method comprises at least one of: fast Fourier transform, short-time Fourier transform, framing, windowing, pre-emphasis, mel filtering and discrete cosine transform.
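The operations listed in claim 5 compose the classic MFCC pipeline. A compact numpy sketch chaining pre-emphasis, framing, windowing, FFT, mel filtering, and DCT (parameter values are common defaults, not values taken from the patent):

```python
import numpy as np

def mel_features(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Toy MFCC extractor: waveform in, (frames, n_ceps) feature-vector
    sequence out. Defaults are conventional, not from the patent."""
    # pre-emphasis boosts high frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # framing + Hamming window
    n_frames = 1 + (len(sig) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(n_fft)
    # power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # triangular mel filterbank, spaced evenly on the mel scale
    mel_pts = np.linspace(0, 2595 * np.log10(1 + sr / 2 / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_ceps
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T
```

One second of 16 kHz audio yields 97 frames of 13 coefficients with these defaults, i.e. the "feature vector sequence" the claims feed to the classification algorithm.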
6. The method of any one of claims 1 to 5, further comprising, before converting the waveform of the one or more voice audios for training into a feature vector sequence, uniformly converting the format of the one or more voice audios by down-sampling and/or up-sampling; the converted audio format comprises at least one of: the wav format and the pcm format.
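The sample-rate unification in claim 6 can be sketched with a simple interpolation-based resampler. This is a toy version; a production down-sampler would low-pass filter first to avoid aliasing, and the target rate here is an illustrative assumption.

```python
import numpy as np

def resample(audio, orig_sr, target_sr):
    """Naive linear-interpolation resampler: maps the output time grid
    back onto the input grid and interpolates sample values."""
    n_out = int(round(len(audio) * target_sr / orig_sr))
    t_out = np.arange(n_out) * (orig_sr / target_sr)  # output times in input samples
    return np.interp(t_out, np.arange(len(audio)), audio)

# down-sample a 44.1 kHz clip and up-sample an 8 kHz clip to a common 16 kHz
x44 = np.random.default_rng(1).standard_normal(44100)
x16 = resample(x44, 44100, 16000)   # one second stays one second
```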
7. A system for identifying age and/or gender based on speech, comprising:
a collection module for obtaining one or more voice audios for training;
the feature extraction module is used for converting the waveform of the one or more voice audios for training into a feature vector sequence;
the training module is used for training a classification algorithm according to the feature vector sequence to generate a classification model;
and the identification module is used for identifying one or more voice audios to be processed through the classification model and determining the age and/or the gender of one or more target objects in the one or more voice audios to be processed.
8. The system for identifying age and/or gender based on speech according to claim 7, wherein the classification algorithm comprises one or more deep learning neural networks; the deep learning neural network comprises at least one of: a time delay neural network, a factorized time delay neural network and a forward sequence memory network.
9. The system of claim 8, wherein the forward sequence memory network comprises a plurality of layers of deep forward sequence memory networks, a skip connection is arranged between every two layers of deep forward sequence memory networks, and the skip connection is performed after the context and stride in the deep forward sequence memory networks are changed; the memory modules of the deep forward sequence memory networks differ in size across layers, growing from small to large with increasing layer depth.
10. The system according to claim 7, wherein, when the feature vector sequence is used to train the classification algorithm, if the training loss does not decrease for 2 consecutive epochs, the classification model generated after training is taken as the final classification model.
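The stopping rule in claim 10 is a patience-based early-stopping loop. A minimal sketch, where the `fit_epoch` callback is hypothetical (the patent does not specify the training interface) and `patience=2` matches the claim's 2-epoch criterion:

```python
def train_with_early_stop(fit_epoch, max_epochs=100, patience=2):
    """Run training epochs until the loss has failed to improve for
    `patience` consecutive epochs, then keep the current model as the
    final classification model. `fit_epoch(epoch)` trains one epoch
    and returns that epoch's training loss."""
    best = float("inf")
    stale = 0
    for epoch in range(max_epochs):
        loss = fit_epoch(epoch)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:   # loss flat or rising for 2 epochs: stop
                break
    return epoch + 1, best

# simulated loss curve that plateaus: training halts before the curve resumes
losses = [1.0, 0.6, 0.4, 0.4, 0.41, 0.39, 0.2]
epochs_run, best = train_with_early_stop(lambda e: losses[e])
```

With the simulated curve above, training stops after the fifth epoch with a best loss of 0.4, even though a later epoch would have improved; that trade-off is inherent to fixed-patience stopping.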
11. An apparatus for identifying age and/or gender based on speech, comprising:
acquiring one or more voice audios for training;
converting the waveform of the one or more voice audios for training into a feature vector sequence;
training a classification algorithm with the feature vector sequence to generate a classification model;
identifying one or more voice audios to be processed through the classification model, and determining the age and/or gender of one or more target objects in the one or more voice audios to be processed.
12. An apparatus for identifying age and/or gender based on speech, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of any of claims 1-6.
13. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011596083.7A CN112581942A (en) | 2020-12-29 | 2020-12-29 | Method, system, device and medium for recognizing target object based on voice |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011596083.7A CN112581942A (en) | 2020-12-29 | 2020-12-29 | Method, system, device and medium for recognizing target object based on voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112581942A true CN112581942A (en) | 2021-03-30 |
Family
ID=75143997
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011596083.7A Pending CN112581942A (en) | 2020-12-29 | 2020-12-29 | Method, system, device and medium for recognizing target object based on voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112581942A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108694954A (en) * | 2018-06-13 | 2018-10-23 | 广州势必可赢网络科技有限公司 | Gender and age recognition method, apparatus, device and readable storage medium
CN110136726A (en) * | 2019-06-20 | 2019-08-16 | 厦门市美亚柏科信息股份有限公司 | Voice gender estimation method, apparatus, system and storage medium
CN110335591A (en) * | 2019-07-04 | 2019-10-15 | 广州云从信息科技有限公司 | Parameter management method, apparatus, machine-readable medium and device
CN110619889A (en) * | 2019-09-19 | 2019-12-27 | Oppo广东移动通信有限公司 | Sign data identification method and device, electronic equipment and storage medium |
CN111091840A (en) * | 2019-12-19 | 2020-05-01 | 浙江百应科技有限公司 | Method for establishing gender identification model and gender identification method |
CN111179915A (en) * | 2019-12-30 | 2020-05-19 | 苏州思必驰信息科技有限公司 | Age identification method and device based on voice |
CN111785262A (en) * | 2020-06-23 | 2020-10-16 | 电子科技大学 | Speaker age and gender classification method based on residual error network and fusion characteristics |
CN111933148A (en) * | 2020-06-29 | 2020-11-13 | 厦门快商通科技股份有限公司 | Age identification method and device based on convolutional neural network and terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210330