CN111179915A - Age identification method and device based on voice


Info

Publication number: CN111179915A
Application number: CN201911394752.XA
Authority: CN (China)
Prior art keywords: training, age, neural network, data, audio
Legal status: Withdrawn (the listed status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 张艳 (Zhang Yan), 黄厚军 (Huang Houjun), 钱彦旻 (Qian Yanmin)
Original Assignee: AI Speech Ltd
Current Assignee: AI Speech Ltd
Application filed by AI Speech Ltd
Priority to: CN201911394752.XA
Publication of: CN111179915A

Classifications

    • G - PHYSICS; G10 - MUSICAL INSTRUMENTS; ACOUSTICS; G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use, for comparison or discrimination
    • G10L 15/063 - Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 25/30 - Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L 2015/0631 - Creating reference templates; clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a voice-based age identification method, which comprises: acquiring original audio and preprocessing it to determine a training data set; training a selected neural network model with the training data set to determine an algorithm model for age identification; and, when real-time audio data is acquired, identifying the audio data through the algorithm model to determine the age of the speaker. The invention also discloses a voice-based age identification device. With the disclosed scheme, the age of a speaker can be identified from the user's voice with high accuracy, which improves the user's interaction experience and meets the requirements of different scenarios.

Description

Age identification method and device based on voice
Technical Field
The invention relates to the technical field of intelligent voice, in particular to an age identification method and device based on voice.
Background
In recent years, with the development of intelligent voice technology, intelligent-voice-based products have appeared in large numbers. To improve user experience, these voice products are developing different application modes for different age groups. Accordingly, it is important to accurately identify the age group of the speaker in a user dialog. At present, the mature age recognition technology on the market is based on face recognition; voice-based age recognition is still at the research stage, and although some implementation schemes have been proposed, the existing schemes generally have the following problems to be overcome:
first, the required test audio is too long: the speaker needs to talk continuously for 5 s and a sentence of more than 20 words is required, which is demanding for the user, so the user experience is poor and it is not conducive to bringing the product to market;
second, these techniques do not perform well in noisy scenes such as in-vehicle environments or rooms with a television playing in the background.
Disclosure of Invention
To overcome the defects of the existing schemes, the inventors start from two aspects, early-stage data processing and algorithm selection, and provide a solution that can identify the user's age with high accuracy from short-duration audio.
According to an aspect of the present invention, there is provided a speech-based age recognition method, including
Acquiring original audio for preprocessing, and determining a training data set;
training the selected neural network model by using the training data set, and determining an algorithm model for age identification;
when real-time audio data are acquired, the audio data are identified through the algorithm model, and the age of a speaker is determined.
According to another aspect of the present invention, there is provided a voice-based age recognition apparatus including
The training set determining module is used for acquiring original audio to perform preprocessing and determining a training data set;
the model training module is used for training the selected neural network model by using the training data set and determining and storing an algorithm model for age identification; and
and the identification module is used for identifying the audio data through the algorithm model when the real-time audio data is acquired, and determining the age of the speaker.
According to a third aspect of the present invention, there is provided an electronic device comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the above method.
According to a fourth aspect of the invention, a storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
According to the scheme of the embodiments of the invention, the age of a speaker can be predicted from audio. Because a neural network model is used for training and the original audio is preprocessed before training, high recognition accuracy can be ensured. In addition, the selected neural network model enables fast and efficient age identification with a small number of parameters, and it can accurately identify short-duration audio without requiring the user to speak a long utterance, so the user experience is very good.
Drawings
FIG. 1 is a flowchart of a method for speech-based age recognition according to an embodiment of the present invention;
FIG. 2 is a flow diagram of the model training process in the embodiment of FIG. 1;
fig. 3 is a block diagram of a voice-based age recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, a component can be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers and can be run by various computer-readable media. The components may also communicate by way of local and/or remote processes in accordance with a signal having one or more data packets, e.g., signals from data interacting with another component in a local system, distributed system, and/or across a network of the internet with other systems by way of the signal.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The scheme of identifying the age of a speaker from voice according to the embodiments of the present invention can be applied to any intelligent device with an intelligent voice interaction function, such as a mobile phone, a watch, an earphone or a personal computer, so that the voice product carried by the device has an age identification function; the scope of application of the present invention is, however, not limited thereto. With the scheme provided by the embodiments of the present invention, short-duration audio can be recognized and the age of the speaker can be determined accurately, which improves identification accuracy, makes it convenient for developers to customize products or skills according to the identified age characteristics, meets different user requirements, and improves the user experience during interaction.
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 and fig. 2 schematically show the flow of a voice-based age recognition method according to an embodiment of the present invention. As shown in fig. 1, the method of this embodiment includes the following steps:
step S101: and acquiring original audio for preprocessing, and determining a training data set. When the age of the speaker is recognized by using audio, the selected training data set is very important, and if the early-stage data processing is insufficient, the training effect is influenced, so that the recognition accuracy and the performance of the trained algorithm are seriously influenced. As shown in fig. 2, in the embodiment of the present invention, in an early stage, data enhancement processing is performed on the acquired original audio, and alignment (alignment algorithm, which is a technology for performing alignment processing on Voice feature data and is a conventional technology) and vad (Voice Activity Detection, which is a technology for accurately positioning a start point and an end point of a Voice from a Voice with noise and is a conventional technology) are performed on data after data enhancement, feature value extraction is performed on the data after alignment and vad processing, and then an age tag is set on the extracted feature value, so as to form a final training data set.
In the embodiment of the invention, during this early-stage processing the enhanced audio data is both aligned and VAD-processed, and the audio resulting from each of the two processing paths is used as training data. Training with data from both processing modes at the same time ensures age identification performance even when VAD cutting is inaccurate, so the trained algorithm model can recognize accurately whether or not the audio has been cut precisely.
The selected original audio is 0.2 s to 1 s long, preferably a wake-up word, and may be Chinese, English, or mixed Chinese and English, among others. The original audio may be obtained from online audio; for example, about 4000 hours of users' cached audio may be downloaded from the backend.
Preferably, the data enhancement of the original audio may be implemented by setting up a plurality of scenes and performing near-field and far-field sound pickup of the original audio in each scene, for example in scenes such as home, vehicle, shopping mall, roadside and office, to obtain the enhanced audio data. Simulating multiple scenes and picking up both near-field and far-field audio ensures the realism of the data on which the algorithm model is trained.
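To make the preprocessing described above concrete, the following is a minimal sketch, assuming 16 kHz mono WAV clips, a scene-noise recording at least as long as the clip, a 10 dB mixing SNR, a simple energy-threshold VAD and 40 mel bins; the alignment branch is omitted, and the file paths, thresholds and the age_label argument are hypothetical rather than values taken from the patent.

```python
# Illustrative preprocessing sketch (not the patent's exact pipeline).
import torch
import torchaudio

def augment_with_scene_noise(wave: torch.Tensor, noise: torch.Tensor,
                             snr_db: float) -> torch.Tensor:
    """Mix scene noise into the clip at a given SNR to mimic far-field pickup."""
    noise = noise[:, : wave.shape[1]]                 # assumes noise is at least clip length
    wave_power = wave.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = torch.sqrt(wave_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise

def energy_vad(wave: torch.Tensor, frame: int = 400, hop: int = 160,
               threshold: float = 1e-4) -> torch.Tensor:
    """Very simple energy-based VAD: keep the span of frames above an energy threshold."""
    frames = wave.unfold(1, frame, hop)               # (1, n_frames, frame)
    energy = frames.pow(2).mean(dim=2).squeeze(0)
    voiced = energy > threshold
    if not voiced.any():
        return wave                                   # fall back to the full clip
    start = voiced.nonzero()[0].item()
    end = voiced.nonzero()[-1].item()
    return wave[:, start * hop : end * hop + frame]

def make_training_example(wav_path: str, noise_path: str, age_label: int):
    wave, sr = torchaudio.load(wav_path)              # (1, T), e.g. a 16 kHz wake-word clip
    noise, _ = torchaudio.load(noise_path)            # scene noise (home, vehicle, mall, ...)
    far_field = augment_with_scene_noise(wave, noise, snr_db=10.0)
    trimmed = energy_vad(far_field)                   # VAD branch; alignment branch omitted
    fbank = torchaudio.compliance.kaldi.fbank(        # (frames, 40) fbank feature values
        trimmed, num_mel_bins=40, sample_frequency=sr)
    return fbank, age_label                           # feature matrix plus age tag
```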
Step S102: and training the selected neural network model by using the training data set, and determining an algorithm model for age identification. The embodiment of the invention selects the neural network model for training so as to identify the age of the speaker through the trained neural network model. Specifically, the complete process of training the neural network model once may be three steps including forward propagation, backward propagation and weight update. During training, training data are obtained from the training data set, characteristic values in the training data are used as input of the neural network model, age labels corresponding to the training data are used as output matching targets of the neural network model, and therefore the weight coefficients of the model are trained, and the trained neural network model is obtained.
Illustratively, the selected neural network model is a convolutional neural network. Training the convolutional neural network comprises: randomly taking batches of training data from the training data set, using the feature values in the training data as the input of the convolutional neural network and the corresponding age labels as its output targets, and repeatedly training the network through forward propagation, backward propagation and weight update in turn until the weights of the convolutional neural network converge to a preset range. In practice, training may be repeated several times to finally determine weight coefficients that better fit the actual situation, and to reduce memory consumption a mini-batch scheme is used, in which a small batch of training data is randomly drawn for each training step. First, in the forward propagation step, feature values are fed into the trunk of the convolutional neural network (the trunk consists of 4 CNN layers), embeddings are extracted and passed to the classification layer, the predicted probability of each label category is computed, and the loss function computes the loss from the predicted probabilities and the target age label. Then, in the backward propagation step, the backpropagation algorithm is used to compute the gradient of the loss function with respect to the weights of the convolutional neural network. Finally, stochastic gradient descent updates the network weights using these gradients. Once the weights of the convolutional neural network are trained, the latest weights and the network structure are used as the algorithm model to identify the age of the speaker.
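A minimal PyTorch sketch of this training procedure (random mini-batches, forward propagation, a softmax/cross-entropy loss against the age labels, backpropagation and a stochastic-gradient-descent weight update) might look as follows; the AgeCNN class it instantiates is sketched after the next paragraph, and the batch size, learning rate and epoch count are assumptions, not values from the patent.

```python
# Illustrative mini-batch SGD training loop; hyperparameters are assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_age_model(features: torch.Tensor, age_labels: torch.Tensor,
                    num_epochs: int = 20, batch_size: int = 64,
                    lr: float = 0.01) -> nn.Module:
    """features: (N, 1, frames, mel_bins) fbank tensors; age_labels: (N,) long class ids."""
    model = AgeCNN(num_classes=int(age_labels.max()) + 1)   # sketched after the next paragraph
    criterion = nn.CrossEntropyLoss()                       # softmax loss over age labels
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # stochastic gradient descent
    loader = DataLoader(TensorDataset(features, age_labels),
                        batch_size=batch_size, shuffle=True)  # random mini-batches
    for epoch in range(num_epochs):
        for batch_x, batch_y in loader:
            logits = model(batch_x)              # forward propagation
            loss = criterion(logits, batch_y)    # loss between prediction and age label
            optimizer.zero_grad()
            loss.backward()                      # backward propagation: gradients w.r.t. weights
            optimizer.step()                     # weight update
    return model
```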
Preferably, the feature values in the embodiment of the present invention are fbank features, and the network structure of the convolutional neural network is: 8-channel convolutional layer - selu() - maxpooling layer - 16-channel convolutional layer - selu() - maxpooling layer - embedding mean-variance layer - linear layer, with a softmax loss. Practice shows that the recognition rate of the algorithm model trained with these features and this network structure can reach 92% when distinguishing children from adults.
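Read literally, that layer list could be sketched as the following PyTorch module: an 8-channel convolution, selu(), max-pooling, a 16-channel convolution, selu(), max-pooling, mean-and-variance pooling of the resulting embedding over time, and a linear classification layer (the softmax loss is applied by the training criterion). Kernel sizes, the number of mel bins and the number of age classes are not specified in the patent and are assumed here.

```python
# Illustrative reading of the described CNN; kernel sizes and class count are assumptions.
import torch
from torch import nn

class AgeCNN(nn.Module):
    def __init__(self, num_mel_bins: int = 40, num_classes: int = 2):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),    # 8-channel convolutional layer
            nn.SELU(),                                    # selu() activation
            nn.MaxPool2d(2),                              # maxpooling layer
            nn.Conv2d(8, 16, kernel_size=3, padding=1),   # 16-channel convolutional layer
            nn.SELU(),
            nn.MaxPool2d(2),
        )
        embed_dim = 16 * (num_mel_bins // 4)              # channels x pooled mel bins
        self.classifier = nn.Linear(2 * embed_dim, num_classes)  # linear layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, 1, frames, mel_bins) fbank features -> (batch, num_classes) logits."""
        h = self.trunk(x)                                 # (batch, 16, frames/4, mel/4)
        h = h.permute(0, 2, 1, 3).flatten(2)              # (batch, time, 16 * mel/4)
        embedding = torch.cat(                            # mean-variance pooling over time
            [h.mean(dim=1), h.var(dim=1, unbiased=False)], dim=1)
        return self.classifier(embedding)                 # softmax loss applied during training
```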
Step S103: when the real-time audio data are acquired, the audio data are identified through an algorithm model, and the age of a speaker is determined. After the algorithm model is trained, the age identification can be carried out by applying the algorithm model, and the specific application method can be as follows: when audio data collected in real time are received, feature values of the audio data are extracted, the extracted feature values are input into the algorithm model, so that the output results of the age labels and the corresponding probabilities of the age labels can be obtained, and the ages corresponding to the age labels with the highest probabilities are selected to serve as the ages of the finally determined speakers.
With the above embodiment, the age of the speaker can be predicted from audio, and because a neural network model is used for training and the training data is enhanced with multi-scene data, the recognition accuracy is very high across different scenes. In addition, the algorithm model obtained in this way enables fast and efficient age identification with a small number of parameters and can accurately identify short-duration audio without requiring the user to speak a long utterance, so the user experience is very good. Moreover, with this scheme the age of the speaker can be judged from the picked-up wake-up audio at the same time as the device is woken up, so that a personalized response can be made, making the voice product more intelligent.
In other specific embodiments, relu() activation, the activation function used by most neural networks, can be used in the above convolutional neural network structure, but extensive practice has shown that the recognition rate with relu() activation is 0.3 percentage points lower than with selu().
In other implementations, the selected convolutional neural network may have the structure convolutional layer - selu() - maxpooling layer - convolutional layer - selu() - maxpooling layer - embedding mean layer - linear layer, with a softmax loss. This scheme performs about 1 percentage point worse than the network structure adopted in the embodiment of fig. 1.
In other embodiments, the neural network model used may also be a 512 x 6 deep neural network (DNN) model. Its performance is slightly worse than that of the convolutional neural network (CNN), with a recognition rate of 91%, but the DNN has a smaller number of parameters.
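One literal reading of such a 512 x 6 DNN is six fully connected layers of 512 units operating on a flattened, fixed-length feature vector; the input dimension and the selu() activation in the sketch below are assumptions, not details given in the patent.

```python
# Illustrative 512 x 6 DNN alternative; input size and activation are assumptions.
from torch import nn

def build_age_dnn(input_dim: int = 40 * 100, num_classes: int = 2) -> nn.Sequential:
    """Six hidden layers of 512 units on a flattened (e.g. 100-frame, 40-bin) fbank vector."""
    width = 512
    dims = [input_dim] + [width] * 6
    layers = []
    for i in range(6):
        layers += [nn.Linear(dims[i], dims[i + 1]), nn.SELU()]
    layers.append(nn.Linear(width, num_classes))     # classification layer (softmax loss)
    return nn.Sequential(*layers)
```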
In other embodiments, mfcc features or msdc features may also be used. Using mfcc features instead of fbank features, the recognition performance differs little, but extracting mfcc features takes slightly longer than extracting fbank features.
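For illustration, swapping fbank features for mfcc features with torchaudio's Kaldi-compatible frontend only changes the extraction call; the file name and feature dimensions below are placeholders.

```python
# Illustrative comparison of the two feature extractors; "wake_word.wav" is a placeholder.
import torchaudio

wave, sr = torchaudio.load("wake_word.wav")                  # short wake-word clip
fbank = torchaudio.compliance.kaldi.fbank(wave, num_mel_bins=40, sample_frequency=sr)
mfcc = torchaudio.compliance.kaldi.mfcc(wave, num_ceps=13, sample_frequency=sr)
print(fbank.shape, mfcc.shape)                               # (frames, 40) vs (frames, 13)
```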
FIG. 3 schematically shows a speech-based age recognition apparatus according to an embodiment of the present invention, which includes
A training set determining module 30, configured to obtain an original audio for preprocessing, and determine a training data set;
the model training module 31 is used for training the selected neural network model by using a training data set and determining an algorithm model for age identification; and
and the identification module 32 is configured to, when the real-time audio data is acquired, perform identification processing on the audio data through the algorithm model, and determine the age of the speaker.
Wherein the training set determining module 30 comprises
An enhancement processing unit 30A for performing data enhancement on the acquired original audio;
the preprocessing unit 30B is configured to perform align and vad processing on the data after data enhancement respectively;
the feature extraction unit 30C is configured to perform feature value extraction on the data after align and vad processing; and
and a label setting unit 30D for setting an age label to the extracted feature value to form a training data set.
As a preferred implementation example, the neural network model trained in the model training module 31 is a convolutional neural network, and the original audio selected by the training set determining module 30 is short-duration audio with a length of 0.2-1 s. The specific implementation of each module and unit in the apparatus embodiment is described with reference to the method part, and other implementations mentioned there also apply to the apparatus embodiment, so they are not repeated here.
With the above apparatus, the accuracy of age identification in different scenes can be improved, short-duration audio can be recognized, and high efficiency and cost-effectiveness are achieved.
In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, and the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform the above-mentioned voice-based age identification method of the present invention.
In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the above-described speech-based age recognition method.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described speech-based age recognition method.
In some embodiments, the present invention further provides a storage medium having a computer program stored thereon, where the computer program is capable of executing the above-mentioned age recognition method based on speech when the computer program is executed by a processor.
The age identifying device based on voice according to the embodiment of the present invention may be used to execute the age identifying method based on voice according to the embodiment of the present invention, and accordingly achieve the technical effect achieved by the age identifying method based on voice according to the embodiment of the present invention, and will not be described herein again. In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
Fig. 4 is a schematic hardware structure diagram of an electronic device for executing a speech-based age recognition method according to another embodiment of the present application, and as shown in fig. 4, the electronic device includes:
one or more processors 510 and memory 520, with one processor 510 being an example in fig. 4.
The apparatus for performing the voice-based age recognition method may further include: an input device 530 and an output device 540.
The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means, such as by a bus connection in fig. 4.
The memory 520, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the voice-based age recognition method in the embodiments of the present application. The processor 510 executes various functional applications of the server and data processing, i.e., implements the voice-based age recognition method in the above-described method embodiments, by executing the nonvolatile software programs, instructions, and modules stored in the memory 520.
The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the voice-based age recognition apparatus, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 520 optionally includes memory located remotely from processor 510, which may be connected to a voice-based age recognition device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 530 may receive input numeric or character information and generate signals related to user settings and function control of the voice-based age recognition method. The output device 540 may include a display device such as a display screen.
The one or more modules described above are stored in the memory 520 and, when executed by the one or more processors 510, perform the speech-based age recognition method of any of the method embodiments described above.
The above product can execute the method provided by the embodiments of the present application and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in this embodiment, reference may be made to the method provided by the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.
(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, as well as smart toys and portable car navigation devices.
(4) The server is similar to a general computer architecture, but has higher requirements on processing capability, stability, reliability, safety, expandability, manageability and the like because of the need of providing highly reliable services.
(5) And other electronic devices with data interaction functions.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly also by hardware. Based on this understanding, the part of the above technical solutions that in essence contributes to the related art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disc, and which includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in each embodiment or in certain parts of an embodiment.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present application.
What has been described above are merely some embodiments of the present invention. It will be apparent to those skilled in the art that various improvements and modifications can be made without departing from the inventive concept, and such improvements and modifications also fall within the protection scope of the present invention.

Claims (10)

1. A speech-based age identification method, comprising
Acquiring original audio for preprocessing, and determining a training data set;
training the selected neural network model by using the training data set, and determining an algorithm model for age identification;
when real-time audio data are acquired, the audio data are identified through the algorithm model, and the age of a speaker is determined.
2. The method of claim 1, wherein acquiring original audio for preprocessing and determining a training data set comprises
Performing data enhancement on the acquired original audio;
respectively carrying out align and vad processing on the data after data enhancement;
extracting characteristic values of the data after align and vad processing;
and setting an age label for the extracted characteristic value to form a training data set.
3. The identification method according to claim 2, wherein the original audio is 0.2s-1s long audio.
4. The method of claim 2 or 3, wherein performing data enhancement on the acquired original audio comprises
Setting a plurality of scenes, and respectively carrying out near-field and far-field sound pickup on original audio in each scene.
5. The method of claim 2, wherein the selected neural network model is a convolutional neural network, and training the selected neural network model using the training data set comprises
The method comprises the steps of randomly acquiring batch training data from a training data set, using characteristic values in the training data as input of a convolutional neural network, using age labels corresponding to the training data as output matching targets of the convolutional neural network, and training the convolutional neural network sequentially through forward propagation, backward propagation and weight updating until the weight of the convolutional neural network is converged to a preset range.
6. The method according to claim 5, wherein the feature values are fbank features and the network structure of the convolutional neural network is: 8-channel convolutional layer - selu() - maxpooling layer - 16-channel convolutional layer - selu() - maxpooling layer - embedding mean-variance layer - linear layer, with a softmax loss.
7. An age recognition device based on speech, comprising
The training set determining module is used for acquiring original audio to perform preprocessing and determining a training data set;
the model training module is used for training the selected neural network model by using the training data set and determining and storing an algorithm model for age identification; and
and the identification module is used for identifying the audio data through the algorithm model when the real-time audio data is acquired, and determining the age of the speaker.
8. The apparatus of claim 7, wherein the training set determination module comprises
The enhancement processing unit is used for performing data enhancement on the acquired original audio;
the preprocessing unit is used for respectively carrying out align and vad processing on the data after the data enhancement;
the characteristic extraction unit is used for extracting characteristic values of the data after align and vad processing; and
and the label setting unit is used for setting an age label for the extracted characteristic value to form a training data set.
9. The apparatus of claim 7 or 8, wherein the neural network model is a convolutional neural network.
10. Storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN201911394752.XA (priority date 2019-12-30, filing date 2019-12-30) - Age identification method and device based on voice - Withdrawn - CN111179915A

Priority Applications (1)

CN201911394752.XA (CN111179915A) - priority date 2019-12-30, filing date 2019-12-30 - Age identification method and device based on voice

Publications (1)

CN111179915A (published 2020-05-19)

Family

ID=70650594

Family Applications (1)

CN201911394752.XA (CN111179915A) - Age identification method and device based on voice

Country Status (1)

CN: CN111179915A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105761720A (en) * 2016-04-19 2016-07-13 北京地平线机器人技术研发有限公司 Interaction system based on voice attribute classification, and method thereof
CN108281138A (en) * 2017-12-18 2018-07-13 百度在线网络技术(北京)有限公司 Age discrimination model training and intelligent sound exchange method, equipment and storage medium
KR20190140801A (en) * 2018-05-23 2019-12-20 한국과학기술원 A multimodal system for simultaneous emotion, age and gender recognition
CN108962247A (en) * 2018-08-13 2018-12-07 南京邮电大学 Based on gradual neural network multidimensional voice messaging identifying system and its method
CN109448756A (en) * 2018-11-14 2019-03-08 北京大生在线科技有限公司 A kind of voice age recognition methods and system
CN110619889A (en) * 2019-09-19 2019-12-27 Oppo广东移动通信有限公司 Sign data identification method and device, electronic equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933148A (en) * 2020-06-29 2020-11-13 厦门快商通科技股份有限公司 Age identification method and device based on convolutional neural network and terminal
CN112331187A (en) * 2020-11-24 2021-02-05 苏州思必驰信息科技有限公司 Multi-task speech recognition model training method and multi-task speech recognition method
CN112397075A (en) * 2020-12-10 2021-02-23 北京猿力未来科技有限公司 Human voice audio recognition model training method, audio classification method and system
CN112397075B (en) * 2020-12-10 2024-05-28 北京猿力未来科技有限公司 Human voice audio frequency identification model training method, audio frequency classification method and system
CN112581942A (en) * 2020-12-29 2021-03-30 云从科技集团股份有限公司 Method, system, device and medium for recognizing target object based on voice
CN114157899A (en) * 2021-12-03 2022-03-08 北京奇艺世纪科技有限公司 Hierarchical screen projection method and device, readable storage medium and electronic equipment
CN114900767A (en) * 2022-04-28 2022-08-12 歌尔股份有限公司 Hearing protection method and device, terminal equipment and storage medium
WO2023206788A1 (en) * 2022-04-28 2023-11-02 歌尔股份有限公司 Hearing protection method and apparatus, terminal device and storage medium


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB02: Change of applicant information
    Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
    Applicant after: Sipic Technology Co.,Ltd.
    Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province
    Applicant before: AI SPEECH Co.,Ltd.
WW01: Invention patent application withdrawn after publication
    Application publication date: 20200519