CN111179915A - Age identification method and device based on voice


Info

Publication number: CN111179915A
Application number: CN201911394752.XA
Authority: CN (China)
Prior art keywords: training, age, neural network, data, audio
Legal status: Withdrawn (the listed status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 张艳 (Zhang Yan), 黄厚军 (Huang Houjun), 钱彦旻 (Qian Yanmin)
Original Assignee: AI Speech Ltd
Current Assignee: AI Speech Ltd
Application filed by AI Speech Ltd
Priority to: CN201911394752.XA
Publication of: CN111179915A

Classifications

    • G - PHYSICS; G10 - MUSICAL INSTRUMENTS; ACOUSTICS; G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use, for comparison or discrimination
    • G10L 15/063 - Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 25/30 - Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L 2015/0631 - Creating reference templates; clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a voice-based age identification method, which comprises: acquiring original audio and preprocessing it to determine a training data set; training a selected neural network model with the training data set to determine an algorithm model for age identification; and, when real-time audio data is acquired, identifying the audio data through the algorithm model to determine the age of the speaker. The invention also discloses a voice-based age identification device. With the disclosed scheme, the age of a speaker can be identified from the user's voice with high accuracy, which improves the user's interaction experience and meets the requirements of different scenarios.

Description

Age identification method and device based on voice
Technical Field
The invention relates to the technical field of intelligent voice, in particular to an age identification method and device based on voice.
Background
In recent years, with the development of intelligent voice technology, intelligent-voice-based products have appeared in large numbers. To improve user experience, these voice products are developing different application modes for different age groups. Accordingly, it is important to accurately identify the age group of the speaker in a user dialog. At present, the mature age recognition technology on the market is based on face recognition; voice-based age recognition is still at the research stage, and although some implementation schemes have been proposed, the existing schemes generally have the following problems to be overcome:
first, the required test audio is too long: the speaker needs to talk continuously for 5 s and a sentence of more than 20 words is required, which is demanding for the user, so the user experience is poor and it is not conducive to bringing the product to market;
second, these techniques do not perform well in noisy scenes such as in-vehicle environments or rooms with a television playing in the background.
Disclosure of Invention
To overcome the defects of the existing schemes, the inventors start from two aspects, early-stage data processing and algorithm selection, and provide a solution that can identify the user's age with high accuracy from short-duration audio.
According to an aspect of the present invention, there is provided a speech-based age recognition method, including
Acquiring original audio for preprocessing, and determining a training data set;
training the selected neural network model by using the training data set, and determining an algorithm model for age identification;
when real-time audio data are acquired, the audio data are identified through the algorithm model, and the age of a speaker is determined.
According to another aspect of the present invention, there is provided a voice-based age recognition apparatus including
The training set determining module is used for acquiring original audio to perform preprocessing and determining a training data set;
the model training module is used for training the selected neural network model by using the training data set and determining and storing an algorithm model for age identification; and
and the identification module is used for identifying the audio data through the algorithm model when the real-time audio data is acquired, and determining the age of the speaker.
According to a third aspect of the present invention, there is provided an electronic device comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the above method.
According to a fourth aspect of the invention, a storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
According to the scheme of the embodiments of the invention, the age of a speaker can be predicted from audio. Because a neural network model is used for training and the original audio is preprocessed before training, high recognition accuracy can be ensured. In addition, the selected neural network model enables fast and efficient age identification with a small number of parameters, and it can accurately identify short-duration audio without requiring the user to speak a long utterance, so the user experience is very good.
Drawings
FIG. 1 is a flowchart of a method for speech-based age recognition according to an embodiment of the present invention;
FIG. 2 is a flow diagram of the model training process in the embodiment of FIG. 1;
fig. 3 is a block diagram of a voice-based age recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, a component can be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers and can be run by various computer-readable media. The components may also communicate by way of local and/or remote processes in accordance with a signal having one or more data packets, e.g., signals from data interacting with another component in a local system, distributed system, and/or across a network of the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The scheme of identifying the age of a speaker from voice according to the embodiments of the present invention can be applied to any intelligent device with an intelligent voice interaction function, such as a mobile phone, a watch, an earphone or a personal computer, so that the voice product carried by the device has an age identification function; the scope of application of the present invention is, however, not limited thereto. With the scheme provided by the embodiments of the present invention, short-duration audio can be recognized and the age of the speaker can be determined accurately, which improves identification accuracy, makes it convenient for developers to customize products or skills according to the identified age characteristics, meets different user requirements, and improves the user experience during interaction.
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 and fig. 2 schematically show the flow of a voice-based age recognition method according to an embodiment of the present invention. As shown in fig. 1, the method of this embodiment includes the following steps:
step S101: and acquiring original audio for preprocessing, and determining a training data set. When the age of the speaker is recognized by using audio, the selected training data set is very important, and if the early-stage data processing is insufficient, the training effect is influenced, so that the recognition accuracy and the performance of the trained algorithm are seriously influenced. As shown in fig. 2, in the embodiment of the present invention, in an early stage, data enhancement processing is performed on the acquired original audio, and alignment (alignment algorithm, which is a technology for performing alignment processing on Voice feature data and is a conventional technology) and vad (Voice Activity Detection, which is a technology for accurately positioning a start point and an end point of a Voice from a Voice with noise and is a conventional technology) are performed on data after data enhancement, feature value extraction is performed on the data after alignment and vad processing, and then an age tag is set on the extracted feature value, so as to form a final training data set.
In the embodiment of the invention, during this early-stage processing the enhanced audio data is both aligned and VAD-processed, and the audio resulting from each of the two processing paths is used as training data. Training with data from both processing modes at the same time ensures age identification performance even when VAD cutting is inaccurate, so the trained algorithm model can recognize accurately whether or not the audio has been cut precisely.
The selected original audio is 0.2 s to 1 s long, preferably a wake-up word, and may be Chinese, English, or mixed Chinese and English, among others. The original audio may be obtained from online audio; for example, about 4000 hours of users' cached audio may be downloaded from the backend.
Preferably, the data enhancement of the original audio may be implemented by setting up a plurality of scenes and performing near-field and far-field sound pickup of the original audio in each scene, for example in scenes such as home, vehicle, shopping mall, roadside and office, to obtain the enhanced audio data. Simulating multiple scenes and picking up both near-field and far-field audio ensures the realism of the data on which the algorithm model is trained.
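To make the preprocessing described above concrete, the following is a minimal sketch, assuming 16 kHz mono WAV clips, a scene-noise recording at least as long as the clip, a 10 dB mixing SNR, a simple energy-threshold VAD and 40 mel bins; the alignment branch is omitted, and the file paths, thresholds and the age_label argument are hypothetical rather than values taken from the patent.

```python
# Illustrative preprocessing sketch (not the patent's exact pipeline).
import torch
import torchaudio

def augment_with_scene_noise(wave: torch.Tensor, noise: torch.Tensor,
                             snr_db: float) -> torch.Tensor:
    """Mix scene noise into the clip at a given SNR to mimic far-field pickup."""
    noise = noise[:, : wave.shape[1]]                 # assumes noise is at least clip length
    wave_power = wave.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = torch.sqrt(wave_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise

def energy_vad(wave: torch.Tensor, frame: int = 400, hop: int = 160,
               threshold: float = 1e-4) -> torch.Tensor:
    """Very simple energy-based VAD: keep the span of frames above an energy threshold."""
    frames = wave.unfold(1, frame, hop)               # (1, n_frames, frame)
    energy = frames.pow(2).mean(dim=2).squeeze(0)
    voiced = energy > threshold
    if not voiced.any():
        return wave                                   # fall back to the full clip
    start = voiced.nonzero()[0].item()
    end = voiced.nonzero()[-1].item()
    return wave[:, start * hop : end * hop + frame]

def make_training_example(wav_path: str, noise_path: str, age_label: int):
    wave, sr = torchaudio.load(wav_path)              # (1, T), e.g. a 16 kHz wake-word clip
    noise, _ = torchaudio.load(noise_path)            # scene noise (home, vehicle, mall, ...)
    far_field = augment_with_scene_noise(wave, noise, snr_db=10.0)
    trimmed = energy_vad(far_field)                   # VAD branch; alignment branch omitted
    fbank = torchaudio.compliance.kaldi.fbank(        # (frames, 40) fbank feature values
        trimmed, num_mel_bins=40, sample_frequency=sr)
    return fbank, age_label                           # feature matrix plus age tag
```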
Step S102: and training the selected neural network model by using the training data set, and determining an algorithm model for age identification. The embodiment of the invention selects the neural network model for training so as to identify the age of the speaker through the trained neural network model. Specifically, the complete process of training the neural network model once may be three steps including forward propagation, backward propagation and weight update. During training, training data are obtained from the training data set, characteristic values in the training data are used as input of the neural network model, age labels corresponding to the training data are used as output matching targets of the neural network model, and therefore the weight coefficients of the model are trained, and the trained neural network model is obtained.
Illustratively, the selected neural network model is a convolutional neural network. Training the convolutional neural network comprises: randomly taking batches of training data from the training data set, using the feature values in the training data as the input of the convolutional neural network and the corresponding age labels as its output targets, and repeatedly training the network through forward propagation, backward propagation and weight update in turn until the weights of the convolutional neural network converge to a preset range. In practice, training may be repeated several times to finally determine weight coefficients that better fit the actual situation, and to reduce memory consumption a mini-batch scheme is used, in which a small batch of training data is randomly drawn for each training step. First, in the forward propagation step, feature values are fed into the trunk of the convolutional neural network (the trunk consists of 4 CNN layers), embeddings are extracted and passed to the classification layer, the predicted probability of each label category is computed, and the loss function computes the loss from the predicted probabilities and the target age label. Then, in the backward propagation step, the backpropagation algorithm is used to compute the gradient of the loss function with respect to the weights of the convolutional neural network. Finally, stochastic gradient descent updates the network weights using these gradients. Once the weights of the convolutional neural network are trained, the latest weights and the network structure are used as the algorithm model to identify the age of the speaker.
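A minimal PyTorch sketch of this training procedure (random mini-batches, forward propagation, a softmax/cross-entropy loss against the age labels, backpropagation and a stochastic-gradient-descent weight update) might look as follows; the AgeCNN class it instantiates is sketched after the next paragraph, and the batch size, learning rate and epoch count are assumptions, not values from the patent.

```python
# Illustrative mini-batch SGD training loop; hyperparameters are assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_age_model(features: torch.Tensor, age_labels: torch.Tensor,
                    num_epochs: int = 20, batch_size: int = 64,
                    lr: float = 0.01) -> nn.Module:
    """features: (N, 1, frames, mel_bins) fbank tensors; age_labels: (N,) long class ids."""
    model = AgeCNN(num_classes=int(age_labels.max()) + 1)   # sketched after the next paragraph
    criterion = nn.CrossEntropyLoss()                       # softmax loss over age labels
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # stochastic gradient descent
    loader = DataLoader(TensorDataset(features, age_labels),
                        batch_size=batch_size, shuffle=True)  # random mini-batches
    for epoch in range(num_epochs):
        for batch_x, batch_y in loader:
            logits = model(batch_x)              # forward propagation
            loss = criterion(logits, batch_y)    # loss between prediction and age label
            optimizer.zero_grad()
            loss.backward()                      # backward propagation: gradients w.r.t. weights
            optimizer.step()                     # weight update
    return model
```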
Preferably, the feature values in the embodiment of the present invention are fbank features, and the network structure of the convolutional neural network is: 8-channel convolutional layer - selu() - maxpooling layer - 16-channel convolutional layer - selu() - maxpooling layer - embedding mean-variance layer - linear layer, with a softmax loss. Practice shows that the recognition rate of the algorithm model trained with these features and this network structure can reach 92% when distinguishing children from adults.
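Read literally, that layer list could be sketched as the following PyTorch module: an 8-channel convolution, selu(), max-pooling, a 16-channel convolution, selu(), max-pooling, mean-and-variance pooling of the resulting embedding over time, and a linear classification layer (the softmax loss is applied by the training criterion). Kernel sizes, the number of mel bins and the number of age classes are not specified in the patent and are assumed here.

```python
# Illustrative reading of the described CNN; kernel sizes and class count are assumptions.
import torch
from torch import nn

class AgeCNN(nn.Module):
    def __init__(self, num_mel_bins: int = 40, num_classes: int = 2):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),    # 8-channel convolutional layer
            nn.SELU(),                                    # selu() activation
            nn.MaxPool2d(2),                              # maxpooling layer
            nn.Conv2d(8, 16, kernel_size=3, padding=1),   # 16-channel convolutional layer
            nn.SELU(),
            nn.MaxPool2d(2),
        )
        embed_dim = 16 * (num_mel_bins // 4)              # channels x pooled mel bins
        self.classifier = nn.Linear(2 * embed_dim, num_classes)  # linear layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, 1, frames, mel_bins) fbank features -> (batch, num_classes) logits."""
        h = self.trunk(x)                                 # (batch, 16, frames/4, mel/4)
        h = h.permute(0, 2, 1, 3).flatten(2)              # (batch, time, 16 * mel/4)
        embedding = torch.cat(                            # mean-variance pooling over time
            [h.mean(dim=1), h.var(dim=1, unbiased=False)], dim=1)
        return self.classifier(embedding)                 # softmax loss applied during training
```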
Step S103: when the real-time audio data are acquired, the audio data are identified through an algorithm model, and the age of a speaker is determined. After the algorithm model is trained, the age identification can be carried out by applying the algorithm model, and the specific application method can be as follows: when audio data collected in real time are received, feature values of the audio data are extracted, the extracted feature values are input into the algorithm model, so that the output results of the age labels and the corresponding probabilities of the age labels can be obtained, and the ages corresponding to the age labels with the highest probabilities are selected to serve as the ages of the finally determined speakers.
With the above embodiment, the age of the speaker can be predicted from audio, and because a neural network model is used for training and the training data is enhanced with multi-scene data, the recognition accuracy is very high across different scenes. In addition, the algorithm model obtained in this way enables fast and efficient age identification with a small number of parameters and can accurately identify short-duration audio without requiring the user to speak a long utterance, so the user experience is very good. Moreover, with this scheme the age of the speaker can be judged from the picked-up wake-up audio at the same time as the device is woken up, so that a personalized response can be made, making the voice product more intelligent.
In other specific embodiments, relu() activation, the activation function used by most neural networks, can be used in the above convolutional neural network structure, but extensive practice has shown that the recognition rate with relu() activation is 0.3 percentage points lower than with selu().
In other implementations, the selected convolutional neural network may have the structure convolutional layer - selu() - maxpooling layer - convolutional layer - selu() - maxpooling layer - embedding mean layer - linear layer, with a softmax loss. This scheme performs about 1 percentage point worse than the network structure adopted in the embodiment of fig. 1.
In other embodiments, the neural network model used may also be a 512 x 6 deep neural network (DNN) model. Its performance is slightly worse than that of the convolutional neural network (CNN), with a recognition rate of 91%, but the DNN has a smaller number of parameters.
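One literal reading of such a 512 x 6 DNN is six fully connected layers of 512 units operating on a flattened, fixed-length feature vector; the input dimension and the selu() activation in the sketch below are assumptions, not details given in the patent.

```python
# Illustrative 512 x 6 DNN alternative; input size and activation are assumptions.
from torch import nn

def build_age_dnn(input_dim: int = 40 * 100, num_classes: int = 2) -> nn.Sequential:
    """Six hidden layers of 512 units on a flattened (e.g. 100-frame, 40-bin) fbank vector."""
    width = 512
    dims = [input_dim] + [width] * 6
    layers = []
    for i in range(6):
        layers += [nn.Linear(dims[i], dims[i + 1]), nn.SELU()]
    layers.append(nn.Linear(width, num_classes))     # classification layer (softmax loss)
    return nn.Sequential(*layers)
```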
In other embodiments, mfcc features or msdc features may also be used. Using mfcc features instead of fbank features, the recognition performance differs little, but extracting mfcc features takes slightly longer than extracting fbank features.
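For illustration, swapping fbank features for mfcc features with torchaudio's Kaldi-compatible frontend only changes the extraction call; the file name and feature dimensions below are placeholders.

```python
# Illustrative comparison of the two feature extractors; "wake_word.wav" is a placeholder.
import torchaudio

wave, sr = torchaudio.load("wake_word.wav")                  # short wake-word clip
fbank = torchaudio.compliance.kaldi.fbank(wave, num_mel_bins=40, sample_frequency=sr)
mfcc = torchaudio.compliance.kaldi.mfcc(wave, num_ceps=13, sample_frequency=sr)
print(fbank.shape, mfcc.shape)                               # (frames, 40) vs (frames, 13)
```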
FIG. 3 schematically shows a speech-based age recognition apparatus according to an embodiment of the present invention, which includes
A training set determining module 30, configured to obtain an original audio for preprocessing, and determine a training data set;
the model training module 31 is used for training the selected neural network model by using a training data set and determining an algorithm model for age identification; and
and the identification module 32 is configured to, when the real-time audio data is acquired, perform identification processing on the audio data through the algorithm model, and determine the age of the speaker.
Wherein the training set determining module 30 comprises
An enhancement processing unit 30A for performing data enhancement on the acquired original audio;
the preprocessing unit 30B is configured to perform align and vad processing on the data after data enhancement respectively;
the feature extraction unit 30C is configured to perform feature value extraction on the data after align and vad processing; and
and a label setting unit 30D for setting an age label to the extracted feature value to form a training data set.
As a preferred implementation example, the neural network model trained in the model training module 31 is a convolutional neural network, and the original audio selected by the training set determining module 30 is short-duration audio with a length of 0.2-1 s. The specific implementation of each module and unit in the apparatus embodiment is described with reference to the method part, and other implementations mentioned there also apply to the apparatus embodiment, so they are not repeated here.
With the above apparatus, the accuracy of age identification in different scenes can be improved, short-duration audio can be recognized, and high efficiency and cost-effectiveness are achieved.
In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform the above-mentioned voice-based age identification method of the present invention.
In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the above-described speech-based age recognition method.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described speech-based age recognition method.
In some embodiments, the present invention further provides a storage medium having a computer program stored thereon, where the computer program is capable of executing the above-mentioned age recognition method based on speech when the computer program is executed by a processor.
The age identifying device based on voice according to the embodiment of the present invention may be used to execute the age identifying method based on voice according to the embodiment of the present invention, and accordingly achieve the technical effect achieved by the age identifying method based on voice according to the embodiment of the present invention, and will not be described herein again. In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
Fig. 4 is a schematic hardware structure diagram of an electronic device for executing a speech-based age recognition method according to another embodiment of the present application, and as shown in fig. 4, the electronic device includes:
one or more processors 510 and memory 520, with one processor 510 being an example in fig. 4.
The apparatus for performing the voice-based age recognition method may further include: an input device 530 and an output device 540.
The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means, such as by a bus connection in fig. 4.
The memory 520, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the voice-based age recognition method in the embodiments of the present application. The processor 510 executes various functional applications of the server and data processing, i.e., implements the voice-based age recognition method in the above-described method embodiments, by executing the nonvolatile software programs, instructions, and modules stored in the memory 520.
The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the voice-based age recognition apparatus, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 520 optionally includes memory located remotely from processor 510, which may be connected to a voice-based age recognition device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may receive input numeric or character information and generate signals related to user settings and function control of the voice-based age recognition method. The output device 540 may include a display device such as a display screen.
The one or more modules described above are stored in the memory 520 and, when executed by the one or more processors 510, perform the speech-based age recognition method of any of the method embodiments described above.
The above product can execute the method provided by the embodiments of the present application and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in this embodiment, reference may be made to the method provided by the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.
(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly also by hardware. Based on this understanding, the part of the above technical solutions that in essence contributes to the related art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disc, and which includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in each embodiment or in certain parts of an embodiment.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present application.
What has been described above are merely some embodiments of the present invention. It will be apparent to those skilled in the art that various improvements and modifications can be made without departing from the inventive concept, and such improvements and modifications also fall within the protection scope of the present invention.

Claims (10)

1. A speech-based age identification method, comprising
Acquiring original audio for preprocessing, and determining a training data set;
training the selected neural network model by using the training data set, and determining an algorithm model for age identification;
when real-time audio data are acquired, the audio data are identified through the algorithm model, and the age of a speaker is determined.
2. The method of claim 1, wherein acquiring original audio for preprocessing and determining a training data set comprises
Performing data enhancement on the acquired original audio;
respectively carrying out align and vad processing on the data after data enhancement;
extracting characteristic values of the data after align and vad processing;
and setting an age label for the extracted characteristic value to form a training data set.
3. The identification method according to claim 2, wherein the original audio is 0.2s-1s long audio.
4. The method of claim 2 or 3, wherein performing data enhancement on the acquired original audio comprises
Setting a plurality of scenes, and respectively carrying out near-field and far-field sound pickup on original audio in each scene.
5. The method of claim 2, wherein the selected neural network model is a convolutional neural network, and training the selected neural network model using the training data set comprises
The method comprises the steps of randomly acquiring batch training data from a training data set, using characteristic values in the training data as input of a convolutional neural network, using age labels corresponding to the training data as output matching targets of the convolutional neural network, and training the convolutional neural network sequentially through forward propagation, backward propagation and weight updating until the weight of the convolutional neural network is converged to a preset range.
6. The method according to claim 5, wherein the feature values are fbank features and the network structure of the convolutional neural network is: 8-channel convolutional layer - selu() - maxpooling layer - 16-channel convolutional layer - selu() - maxpooling layer - embedding mean-variance layer - linear layer, with a softmax loss.
7. An age recognition device based on speech, comprising
The training set determining module is used for acquiring original audio to perform preprocessing and determining a training data set;
the model training module is used for training the selected neural network model by using the training data set and determining and storing an algorithm model for age identification; and
and the identification module is used for identifying the audio data through the algorithm model when the real-time audio data is acquired, and determining the age of the speaker.
8. The apparatus of claim 7, wherein the training set determination module comprises
The enhancement processing unit is used for performing data enhancement on the acquired original audio;
the preprocessing unit is used for respectively carrying out align and vad processing on the data after the data enhancement;
the characteristic extraction unit is used for extracting characteristic values of the data after align and vad processing; and
and the label setting unit is used for setting an age label for the extracted characteristic value to form a training data set.
9. The apparatus of claim 7 or 8, wherein the neural network model is a convolutional neural network.
10. Storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN201911394752.XA (priority date 2019-12-30, filing date 2019-12-30) - Age identification method and device based on voice - Withdrawn - CN111179915A

Priority Applications (1)

CN201911394752.XA (CN111179915A) - priority date 2019-12-30, filing date 2019-12-30 - Age identification method and device based on voice

Publications (1)

CN111179915A (published 2020-05-19)

Family

ID=70650594

Family Applications (1)

CN201911394752.XA (CN111179915A) - Age identification method and device based on voice

Country Status (1)

CN: CN111179915A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105761720A (en) * 2016-04-19 2016-07-13 北京地平线机器人技术研发有限公司 Interaction system based on voice attribute classification, and method thereof
CN108281138A (en) * 2017-12-18 2018-07-13 百度在线网络技术(北京)有限公司 Age discrimination model training and intelligent sound exchange method, equipment and storage medium
KR20190140801A (en) * 2018-05-23 2019-12-20 한국과학기술원 A multimodal system for simultaneous emotion, age and gender recognition
CN108962247A (en) * 2018-08-13 2018-12-07 南京邮电大学 Based on gradual neural network multidimensional voice messaging identifying system and its method
CN109448756A (en) * 2018-11-14 2019-03-08 北京大生在线科技有限公司 A kind of voice age recognition methods and system
CN110619889A (en) * 2019-09-19 2019-12-27 Oppo广东移动通信有限公司 Sign data identification method and device, electronic equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933148A (en) * 2020-06-29 2020-11-13 厦门快商通科技股份有限公司 Age identification method and device based on convolutional neural network and terminal
CN112331187A (en) * 2020-11-24 2021-02-05 苏州思必驰信息科技有限公司 Multi-task speech recognition model training method and multi-task speech recognition method
CN112397075A (en) * 2020-12-10 2021-02-23 北京猿力未来科技有限公司 Human voice audio recognition model training method, audio classification method and system
CN112397075B (en) * 2020-12-10 2024-05-28 北京猿力未来科技有限公司 Human voice audio frequency identification model training method, audio frequency classification method and system
CN112581942A (en) * 2020-12-29 2021-03-30 云从科技集团股份有限公司 Method, system, device and medium for recognizing target object based on voice
CN114157899A (en) * 2021-12-03 2022-03-08 北京奇艺世纪科技有限公司 Hierarchical screen projection method and device, readable storage medium and electronic equipment
CN114900767A (en) * 2022-04-28 2022-08-12 歌尔股份有限公司 Hearing protection method and device, terminal equipment and storage medium
WO2023206788A1 (en) * 2022-04-28 2023-11-02 歌尔股份有限公司 Hearing protection method and apparatus, terminal device and storage medium


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB02: Change of applicant information
    Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
    Applicant after: Sipic Technology Co.,Ltd.
    Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
    Applicant before: AI SPEECH Co.,Ltd.
WW01: Invention patent application withdrawn after publication
    Application publication date: 20200519