WO2021175031A1 - Information prompting method and apparatus, electronic device, and medium - Google Patents

Information prompting method and apparatus, electronic device, and medium Download PDF

Info

Publication number
WO2021175031A1
WO2021175031A1 (application PCT/CN2021/072860)
Authority
WO
WIPO (PCT)
Prior art keywords
information
voice
category
sound
neutral
Prior art date
Application number
PCT/CN2021/072860
Other languages
French (fr)
Chinese (zh)
Inventor
马坤
刘微微
赵之砚
施奕明
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2021175031A1 publication Critical patent/WO2021175031A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding
    • G10L15/26 - Speech to text systems
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/14 - Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques where the extracted parameters are spectral information of each sub-band
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques using neural networks

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to an information prompt method, device, electronic equipment, and medium.
  • Speech-based recognition of a speaker's biological attributes is an important area of artificial intelligence. Recognizing a speaker's gender from the voice is a natural ability for humans, but for artificial intelligence it is a demanding task at the frontier of the field. The voices of male and female speakers usually differ significantly. However, the inventor realized that for a more neutral voice it is difficult to accurately identify the speaker's gender without careful analysis, which poses a great challenge for artificial intelligence. If neutral voices can be accurately identified, the applicability of speech-based speaker-attribute recognition in actual business scenarios (such as intelligent customer service systems) can be greatly improved.
  • The first aspect of the present application provides an information prompt method, the method including: a collecting step of collecting sound information; a confirming step of confirming the category of the sound information, wherein the category of the sound information includes male voice, female voice, and neutral voice; a recognition step of, when the category of the sound information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step of giving a prompt based on the recognized semantic information and the confirmed category; and a processing step of, when the category of the sound information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is the residual neural network ResNet-10 model.
  • The ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x and a fully connected layer, ten layers in total.
  • the second aspect of the present application is an information prompting device, the device comprising: a collection module for collecting sound information; a confirmation module for confirming the type of the sound information, wherein the type of the sound information includes a male voice , Female voice and neutral voice; recognition module, used to identify the semantic information corresponding to the voice information when the category of the voice information is confirmed to be male voice or female voice; prompt module, used to identify the semantic information according to the recognized semantic information and A prompt is given for the confirmed category; the processing module is used to identify the gender corresponding to the neutral voice according to the pre-trained gender recognition deep neural network model when the category of the voice information is confirmed to be a neutral voice, wherein the The pre-trained deep neural network model for gender recognition is the residual neural network ResNet-10 model.
  • The ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x and a fully connected layer, ten layers in total.
  • A third aspect of the present application provides an electronic device, wherein the electronic device includes a processor configured to execute computer-readable instructions stored in a memory to implement the following steps: a collecting step of collecting sound information; a confirming step of confirming the category of the sound information, wherein the category of the sound information includes male voice, female voice, and neutral voice; a recognition step of, when the category of the sound information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step of giving a prompt based on the recognized semantic information and the confirmed category; and a processing step of, when the category of the sound information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x.
  • A fourth aspect of the present application provides a computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the following steps: a collecting step of collecting sound information; a confirming step of confirming the category of the sound information, wherein the category of the sound information includes male voice, female voice, and neutral voice; a recognition step of, when the category of the sound information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step of giving a prompt according to the recognized semantic information and the confirmed category; and a processing step of, when the category of the sound information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model including the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x.
  • The information prompt method, device, electronic equipment, and storage medium described in this application collect sound information; confirm the category of the sound information; when the category of the sound information is confirmed to be a male voice or a female voice, recognize the semantic information corresponding to the sound information; give prompts according to the recognized semantic information and the confirmed category; and, when the category of the sound information is confirmed to be a neutral voice, identify the gender corresponding to the neutral voice according to the pre-trained gender recognition deep neural network model and then identify the semantic information corresponding to the sound information, so that accurate instructions can be provided to the user based on the identified semantic information and gender information.
  • This application draws on mature ideas from face recognition and promotes a larger classification boundary, so that speech data with fuzzy classification boundaries, such as neutral voices, can obtain effective gender attribution through deep training. This greatly improves the accuracy of gender recognition and improves the applicability of speaker gender recognition in actual business scenarios and intelligent customer service systems.
  • FIG. 1 is a flowchart of an information prompt method provided by Embodiment 1 of the present application.
  • Fig. 2 is a functional block diagram of the information prompting device provided in the second embodiment of the present application.
  • Fig. 3 is a schematic diagram of an electronic device provided in a third embodiment of the present application.
  • the information prompt method in the embodiment of the present application is applied to electronic equipment.
  • the information prompting function provided by the method of this application can be directly integrated on the electronic device, or a client for implementing the method of this application can be installed.
  • The method provided in this application can also run on a server or other devices in the form of a Software Development Kit (SDK); an interface for the information prompting function is provided in the form of an SDK, and an electronic device or other device can implement the information prompting function through the provided interface.
  • FIG. 1 is a flowchart of the information prompt method provided in Embodiment 1 of the present application. According to different requirements, the execution sequence in the flowchart can be changed, and some steps can be omitted.
  • the information prompting method can be applied to electronic devices such as a robot, and the robot can be a robot that guides the user.
  • the robot is used in a hospital.
  • the robot can give accurate instructions based on the identified user's gender and semantic information.
  • the method includes:
  • Step S1 collecting sound information.
  • a microphone is installed on the electronic device, and sound information can be collected through the microphone.
  • Step S2 Confirm the category of the voice information, wherein the category of the voice information includes male voice, female voice, and neutral voice.
  • The pitch frequency range of most human voices is 50Hz-400Hz.
  • The pitch frequency range of male voices is 50Hz-200Hz, and the pitch frequency range of female voices is 150Hz-400Hz. Comparing the two ranges, it can be seen that they partially overlap in 150Hz-200Hz. Within this overlapping pitch frequency range it is more difficult to distinguish whether the speaker is male or female, so the sound corresponding to the overlapping pitch frequency range can be defined as a neutral voice.
  • confirming the category of the sound information includes:
  • the first pitch frequency range of the male voice is set to 50 Hz-150 Hz
  • the second pitch frequency range of the female voice is 200 Hz-400 Hz
  • the third pitch frequency range of the neutral voice is 150 Hz-200 Hz.
  • the category of the sound information is confirmed according to the pitch frequency of the sound information.
  • When the pitch frequency of the sound information falls within the first pitch frequency range (for example, 50Hz-150Hz), it is confirmed that the category of the sound information is a male voice, and the process goes to step S3; when the pitch frequency falls within the second pitch frequency range (for example, 200Hz-400Hz), it is confirmed that the category is a female voice, and the process goes to step S3; when the pitch frequency falls within the third pitch frequency range (for example, 150Hz-200Hz), it is confirmed that the category is a neutral voice, and the process goes to step S5.
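  • The category-confirmation rule above can be sketched as follows (a minimal hypothetical illustration; the function name and the handling of the 150Hz and 200Hz boundary values are assumptions, since the text does not state which range the endpoints belong to):

```python
def classify_voice(pitch_hz: float) -> str:
    """Map a pitch frequency to a voice category using the three ranges
    described above (boundary inclusivity is an assumption)."""
    if 50 <= pitch_hz < 150:
        return "male"       # first pitch frequency range
    if 150 <= pitch_hz <= 200:
        return "neutral"    # third (overlapping) pitch frequency range
    if 200 < pitch_hz <= 400:
        return "female"     # second pitch frequency range
    return "unknown"        # outside the typical 50Hz-400Hz human range

print(classify_voice(120))   # male
print(classify_voice(175))   # neutral
print(classify_voice(300))   # female
```

A "male" or "female" result would proceed to the recognition step (S3), while "neutral" would be routed to the deep-model processing step (S5).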
  • Step S3 When it is confirmed that the type of the sound information is a male voice or a female voice, the semantic information corresponding to the sound information is recognized.
  • The semantic information corresponding to the sound information can be identified through a natural language processing method, which specifically includes: first converting the sound information into text information through speech recognition.
  • The text information is then preprocessed, where the preprocessing includes word segmentation and noise-word removal.
  • The above-mentioned basic concept library includes basic concepts of meaning and extended concepts corresponding to those basic concepts.
  • The semantic relation library includes relations and fuzzy semantic relations associated with the above-mentioned basic concept library, a sentence-pattern relation template, and a common-sense library.
  • step S4 a prompt is given according to the recognized semantic information and the confirmed category.
  • For the robot, when the determined category is a male voice, it can be confirmed that the gender of the user is male; when the determined category is a female voice, it can be confirmed that the gender of the user is female.
  • The robot cannot give instructions based on the recognized gender alone; it needs to give instructions in combination with the recognized semantic information.
  • the priority of the identified semantic information is higher than the priority of the confirmed category.
  • For example, the robot can determine that the user's gender is male based on the user's voice information. In some cases the robot only needs to determine the user's gender from the voice query information and then give instructions based on the determined gender: when a male user gives the voice inquiry "Where is the restroom?", the robot determines from the voice inquiry information that the user is male and then prompts the location of the men's restroom.
  • Step S5 When it is confirmed that the category of the voice information is a neutral voice, the gender corresponding to the neutral voice is recognized according to the pre-trained gender recognition deep neural network model, and then the flow returns to step S3.
  • In this case, the semantic information corresponding to the sound information needs to be acquired first, and correct instructions are then given based on the semantic information and the user's gender.
  • the gender recognition deep neural network model is a residual neural network ResNet model.
  • the ResNet model is a deep neural network model designed based on the AM-Softmax loss function, wherein the optimal decision boundary of the gender recognition deep neural network model is obtained by adjusting the parameter factors of the AM-Softmax loss function.
  • To recognize the gender corresponding to a neutral voice, a binary-classification deep model needs to be designed.
  • A binary classification model usually uses a sigmoid or softmax loss function.
  • However, the sigmoid or softmax loss function does not work well on data with blurred class boundaries.
  • the AM-Softmax loss function is used in this application to design a deep neural network model.
  • the AM-Softmax loss function can promote a larger classification boundary between categories.
  • The AM-Softmax loss function is: L_AMS = -(1/n) · Σᵢ log( e^(s·(cos θ_yᵢ − m)) / ( e^(s·(cos θ_yᵢ − m)) + Σ_{j≠yᵢ} e^(s·cos θ_j) ) ), where n is the number of training samples, yᵢ is the true class of sample i, cos θ_j is the cosine similarity between the sample's feature vector and the weight vector of class j, s is a scale factor, and m is an additive margin.
  • By adjusting the scale factor and the margin factor of the AM-Softmax loss function, the optimal decision boundary of the gender recognition deep neural network model can be obtained.
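  • The effect of the additive margin can be sketched in a dependency-free implementation (hypothetical function name; the default values s = 30.0 and m = 0.35 are common AM-Softmax settings assumed for illustration, not values taken from this application):

```python
import math

def am_softmax_loss(cos_rows, labels, s=30.0, m=0.35):
    """AM-Softmax loss over cosine similarities.

    cos_rows: one list of per-class cosine similarities per sample.
    labels: the true class index of each sample.
    The additive margin m is subtracted from the target-class cosine
    before scaling by s, which pushes the classes further apart."""
    total = 0.0
    for cos_theta, y in zip(cos_rows, labels):
        logits = [s * c for c in cos_theta]
        logits[y] = s * (cos_theta[y] - m)
        mx = max(logits)  # numerically stable log-sum-exp
        log_denom = mx + math.log(sum(math.exp(z - mx) for z in logits))
        total += -(logits[y] - log_denom)
    return total / len(labels)

# A well-separated prediction incurs a far lower loss than an ambiguous
# one, so samples near the male/female boundary are penalized heavily.
print(am_softmax_loss([[0.95, 0.10]], [0]) < am_softmax_loss([[0.55, 0.50]], [0]))  # True
```

This is why the margin promotes a larger classification boundary between the male and female classes than plain softmax.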
  • the target categories are only male and female.
  • The solution space of the problem is therefore relatively simple, and directly using a deep model from the field of image classification is prone to overfitting. Therefore, in this application, in order to avoid over-fitting and improve the generalization ability of the deep model, an existing deep model for image recognition is adapted to obtain the ResNet-10 model. Specifically, on the basis of ResNet-18, the model depth and the number of residual layers are reduced to obtain the ResNet-10 model.
  • The ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x and a fully connected layer, ten layers in total.
  • The parameters of ResNet-10 in the present application are shown in Table 1.
  • The max pool in Table 1 is the pooling layer, and the stride of the first layer of each of Conv3_x, Conv4_x, and Conv5_x is 2.
  • Each convolutional layer is followed by an activation layer (ReLU) and a regularization layer (Batch Normalization), as shown in Table 1.
  • Conv2_x, Conv3_x, Conv4_x, and Conv5_x each include 1 residual module (×1 blocks).
  • The last layer of Conv5_x is connected to a fully connected layer, which outputs the category result corresponding to the sound information.
  • The convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x are connected to an adaptive global average pooling layer. Because the problem to be solved in this application is a classification problem with few classes (male and female), average pooling works better than maximum pooling. Adaptive global average pooling is used in this application to avoid feature-size mismatch: since the feature size of the speech spectrogram fluctuates greatly, adaptive global average pooling performs better.
  • The convolution kernel of the input part of the ResNet-10 model is a 3×3 convolution kernel. The 3×3 kernel effectively reduces the amount of computation and is better suited to the speech spectrogram. In this way, the model is less prone to overfitting, and the magnitude of the model parameters is reduced at the same time.
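  • The ten-layer count can be reconciled as follows (a sketch under the assumption, carried over from ResNet-18's basic block, that each of Conv2_x through Conv5_x contains one residual block of two 3×3 convolutions; the text itself only names the stages):

```python
# Hypothetical per-stage layer inventory for the ResNet-10 described above.
stage_layers = {
    "Conv_1": 1,    # initial 3x3 convolution
    "Conv2_x": 2,   # 1 residual block = 2 convolutional layers
    "Conv3_x": 2,   # first layer uses stride 2
    "Conv4_x": 2,   # first layer uses stride 2
    "Conv5_x": 2,   # first layer uses stride 2
}
conv_total = sum(stage_layers.values())
total = conv_total + 1  # plus the final fully connected layer
print(conv_total, total)  # 9 10
```

Halving the residual blocks per stage relative to ResNet-18 (which has two per stage, 18 layers in total) yields this 10-layer variant.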
  • The pre-training method of the gender recognition deep neural network model includes the following steps:
  • the characteristic parameters corresponding to the male neutral voice and the female neutral voice include the Mel frequency cepstrum coefficient of the sound signal.
  • Mel-Frequency Cepstral Coefficient (MFCC) analysis is based on the auditory characteristics of the human ear: because the pitch perceived by the human ear is not linearly proportional to the frequency of the sound, the Mel frequency scale better matches the hearing characteristics of the human ear.
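  • This nonlinear relationship can be illustrated with the commonly used Hz-to-mel conversion (a sketch; this particular 2595·log10 formula is a standard choice in MFCC pipelines, assumed here rather than spelled out in the text):

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale (O'Shaughnessy's formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Equal steps in Hz shrink on the mel scale as frequency rises,
# mirroring the ear's reduced resolution at high frequencies.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(4100.0) - hz_to_mel(4000.0)
print(low_step > high_step)  # True
```

MFCC features are computed on this perceptually warped frequency axis, which is why they suit speaker-attribute recognition.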
  • The sample data is divided into a training set and a validation set according to a first preset ratio (for example, 70% of the samples are used for training).
  • using the training set to train the neural network model may further include: deploying the training of the deep neural network model on multiple graphics processing units (GPUs) for distributed training.
  • Deploying the training of the model on multiple graphics processors for distributed training can shorten the training time of the model and accelerate its convergence.
  • When the accuracy rate is greater than or equal to a preset accuracy rate, the training ends, and the trained deep neural network model is used as a classifier to identify the gender of the user corresponding to the current neutral voice; when the accuracy rate is less than the preset accuracy rate, the number of samples is increased and the deep neural network model is retrained until the accuracy rate is greater than or equal to the preset accuracy rate.
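  • The train-evaluate-expand cycle above can be sketched as follows (all function names are hypothetical, and target_acc and max_rounds are illustrative placeholders, since the text does not fix concrete values):

```python
def train_until_accurate(train_fn, eval_fn, expand_fn, samples,
                         target_acc=0.95, max_rounds=5):
    """Train, check validation accuracy, and expand the sample set and
    retrain while the accuracy stays below the preset threshold."""
    model = None
    for _ in range(max_rounds):
        model = train_fn(samples)
        if eval_fn(model) >= target_acc:
            break  # accuracy reached: use the model as the classifier
        samples = expand_fn(samples)  # e.g. via data augmentation
    return model

# Toy stand-ins: "training" just records the sample count and accuracy
# grows with it, so the loop expands the data twice before stopping.
model = train_until_accurate(lambda s: s,
                             lambda m: min(1.0, m / 100),
                             lambda s: s * 2,
                             30)
print(model)  # 120
```

In the application's setting, expand_fn would correspond to the neutral-voice data-enhancement procedure described next.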
  • the method for expanding the neutral sound is: superimposing noise on the collected neutral sound, obtaining a spectrogram of the neutral sound after superimposing the noise, and shuffling and recombining the spectrogram in the time direction.
  • The training data of neutral voices is expanded by data enhancement technology, because neutral voices are relatively rare; in order to train the deep neural network model, the collected neutral voices need to be expanded.
  • Superimposing noise on the collected neutral voice includes superimposing white noise on the collected neutral voice and/or mixing environmental noise into the collected neutral voice.
  • new_signal = 0.9 * original_signal + 0.1 * white_noise().
  • the real environmental noise may be noise collected from parks, bus stops, stadiums, coffee shops and other venues.
  • The neutral voice with superimposed noise is subjected to short-time Fourier transform processing to obtain a spectrogram, and the spectrogram is shuffled and recombined in the time direction to obtain training data.
  • For example, in the time direction of the spectrogram, the spectrogram corresponding to the neutral voice is cropped according to a fixed speech-frame-sequence length (such as 64 frames) to obtain speech fragments 64 frames long, and the speech fragments are then randomly recombined. For example, a 640-frame neutral voice spectrogram is cropped to obtain ten 64-frame speech fragments, and three of the ten fragments are randomly selected and spliced in sequence to obtain a new sound signal.
  • The method of randomly cropping and splicing the spectrogram is used here to expand the voice data. In this way, relatively large-scale neutral voice data is obtained, and the training data is expanded by dividing it into the corresponding gender data groups according to their labels.
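  • The two augmentation steps above can be sketched together (a dependency-free sketch with hypothetical names; for simplicity it splices fixed-length fragments of the noise-mixed signal directly, whereas the described method crops the spectrogram along the time axis after a short-time Fourier transform):

```python
import random

def augment_neutral(signal, noise, frame_len=64, pick=3, seed=0):
    """Mix noise at the 0.9/0.1 ratio, cut the result into fixed-length
    fragments, and randomly splice a few fragments into a new sample."""
    mixed = [0.9 * x + 0.1 * n for x, n in zip(signal, noise)]
    fragments = [mixed[i:i + frame_len]
                 for i in range(0, len(mixed) - frame_len + 1, frame_len)]
    rng = random.Random(seed)
    chosen = rng.sample(fragments, pick)  # e.g. 3 of the 10 fragments
    return [v for frag in chosen for v in frag]

new_sample = augment_neutral([1.0] * 640, [0.5] * 640)
print(len(new_sample))  # 3 fragments x 64 frames = 192 values
```

Repeating this with different random selections yields many distinct training samples from one collected neutral-voice recording.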
  • The information prompting method includes: collecting sound information; confirming the category of the sound information, wherein the category of the sound information includes male voice, female voice, and neutral voice; when the category of the sound information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the sound information; giving prompts according to the recognized semantic information and the confirmed category; and, when the category of the sound information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to the pre-trained gender recognition deep neural network model. This application can identify the neutral voice in the collected sound information and then identify the gender corresponding to the neutral voice, so as to provide more accurate instructions.
  • In this way, the user's voice feature space has a better classification boundary.
  • The information prompt method proposed in this application draws on mature ideas from face recognition and promotes the expansion of the classification boundary, so that voice data with fuzzy classification boundaries, such as neutral voices, can obtain effective gender attribution through deep training. This greatly improves the accuracy of gender recognition and improves the applicability of speaker gender recognition in actual business scenarios and intelligent customer service systems.
  • Fig. 2 is a diagram of the functional modules in a preferred embodiment of the information prompting device of the present application.
  • the information prompting device 20 runs in an electronic device.
  • the information prompting device 20 may include multiple functional modules composed of program code segments.
  • the program code of each program segment in the information prompt device 20 can be stored in a memory and executed by at least one processor to perform an information prompt function.
  • the information prompting device 20 can be divided into multiple functional modules according to the functions it performs.
  • the functional modules may include: an acquisition module 201, a confirmation module 202, an identification module 203, a prompt module 204, and a processing module 205.
  • A module referred to in the present application is a series of computer program segments that can be executed by at least one processor, that complete a fixed function, and that are stored in a memory. In some embodiments, the functions of each module will be detailed in subsequent embodiments.
  • the collection module 201 is used to collect sound information.
  • a microphone is installed on the robot, and sound information can be collected through the microphone.
  • the confirmation module 202 is used to confirm the category of the sound information, where the category of the sound information includes male voice, female voice, and neutral voice.
  • The pitch frequency range of most human voices is 50Hz-400Hz.
  • The pitch frequency range of male voices is 50Hz-200Hz, and the pitch frequency range of female voices is 150Hz-400Hz. Comparing the two ranges, it can be seen that they partially overlap in 150Hz-200Hz. Within this overlapping pitch frequency range it is more difficult to distinguish whether the speaker is male or female, so the sound corresponding to the overlapping pitch frequency range can be defined as a neutral voice.
  • the confirmation module 202 confirming the category of the sound information includes:
  • When the pitch frequency of the sound information falls within the first pitch frequency range, the confirmation module 202 confirms that the category of the sound information is a male voice; when the pitch frequency falls within the second pitch frequency range, the confirmation module 202 confirms that the category is a female voice; when the pitch frequency falls within the third pitch frequency range, the confirmation module 202 confirms that the category is a neutral voice.
  • the first pitch frequency range of the male voice is set to 50 Hz-150 Hz
  • the second pitch frequency range of the female voice is 200 Hz-400 Hz
  • the third pitch frequency range of the neutral voice is 150 Hz-200 Hz.
  • the category of the sound information is confirmed according to the pitch frequency of the sound information.
  • When the pitch frequency of the sound information falls within the first pitch frequency range (for example, 50Hz-150Hz), the confirmation module 202 confirms that the category of the sound information is a male voice; when the pitch frequency falls within the second pitch frequency range (for example, 200Hz-400Hz), the confirmation module 202 confirms that the category is a female voice; when the pitch frequency falls within the third pitch frequency range (for example, 150Hz-200Hz), the confirmation module 202 confirms that the category is a neutral voice.
  • the recognition module 203 is configured to recognize the semantic information corresponding to the sound information when it is confirmed that the type of the sound information is a male voice or a female voice.
  • The semantic information corresponding to the sound information can be identified through a natural language processing method, which specifically includes: first converting the sound information into text information through speech recognition.
  • The text information is then preprocessed, where the preprocessing includes word segmentation and noise-word removal.
  • The above-mentioned basic concept library includes basic concepts of meaning and extended concepts corresponding to those basic concepts.
  • The semantic relation library includes relations and fuzzy semantic relations associated with the above-mentioned basic concept library, a sentence-pattern relation template, and a common-sense library.
  • the prompt module 204 is used to give a prompt according to the recognized semantic information and the confirmed category.
  • For the robot, when the determined category is a male voice, it can be confirmed that the gender of the user is male; when the determined category is a female voice, it can be confirmed that the gender of the user is female.
  • The robot cannot give instructions based on the recognized gender alone; it needs to give instructions in combination with the recognized semantic information.
  • the priority of the identified semantic information is higher than the priority of the confirmed category.
  • the robot can determine that the user's gender is male based on the user's voice information.
  • the robot only needs to determine the gender of the user based on the voice inquiry information, and then give instructions based on the determined gender. For example, when a male user gives the voice inquiry message "Where is the restroom", the robot determines from the voice inquiry information that the corresponding user gender is male, and then gives a prompt with the location of the men's restroom.
  • the processing module 205 is configured to recognize the gender corresponding to the neutral voice according to the pre-trained gender recognition deep neural network model when it is confirmed that the category of the voice information is a neutral voice.
  • the semantic information corresponding to the voice information needs to be acquired first, and then correct instructions are given based on the semantic information and the user's gender.
  • the gender recognition deep neural network model is a residual neural network ResNet model.
  • the ResNet model is a deep neural network model designed based on the AM-Softmax loss function, wherein the optimal decision boundary of the gender recognition deep neural network model is obtained by adjusting the parameter factors of the AM-Softmax loss function.
  • a binary classification deep model needs to be designed.
  • a binary classification model usually uses a sigmoid or softmax loss function.
  • however, the sigmoid or softmax loss function does not work well on data with blurred class boundaries.
  • the AM-Softmax loss function is used in this application to design a deep neural network model.
  • the AM-Softmax loss function can promote a larger classification boundary between categories.
  • the AM-Softmax loss function (in its standard published form) is: L = -(1/n) Σ_i log( e^{s·(cos θ_{y_i} - m)} / ( e^{s·(cos θ_{y_i} - m)} + Σ_{j≠y_i} e^{s·cos θ_j} ) ), where s is the scale factor, m is the additive margin, and θ_j is the angle between the feature vector and the weight vector of class j.
  • the optimal decision boundary of the gender recognition deep neural network model can be obtained.
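To make the bullets above concrete, the standard published form of the AM-Softmax loss can be sketched for the two-class (male/female) case as follows; the scale s=30 and margin m=0.35 are illustrative values of our own, not parameters taken from this application:

```python
import numpy as np

def am_softmax_loss(cosines, labels, s=30.0, m=0.35):
    """AM-Softmax loss on cosine similarities between L2-normalized
    features and L2-normalized class weight vectors.

    cosines: (n, num_classes) array of cos(theta_j)
    labels:  (n,) array of target class indices
    s: scale factor; m: additive margin subtracted from the
    target-class cosine, which pushes the classes further apart.
    """
    cosines = np.asarray(cosines, dtype=float)
    labels = np.asarray(labels)
    n = cosines.shape[0]
    idx = np.arange(n)
    logits = s * cosines
    # subtract the margin only on the target-class logit
    logits[idx, labels] = s * (cosines[idx, labels] - m)
    # numerically stable log-softmax cross-entropy
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[idx, labels].mean())

# The margin makes even a correctly classified sample pay a higher
# loss than plain softmax would, enforcing a wider decision boundary.
cos = [[0.9, 0.1]]          # one sample whose target class is 0
with_margin = am_softmax_loss(cos, [0], s=30.0, m=0.35)
no_margin = am_softmax_loss(cos, [0], s=30.0, m=0.0)
```

With m=0, the expression reduces to the ordinary scaled softmax cross-entropy, which is why adjusting m (and s) moves the decision boundary.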
  • the target categories are only male and female.
  • the solution space of the problem is relatively simple. If a depth model from the field of image classification is used directly, it is prone to overfitting. Therefore, in this application, in order to avoid the phenomenon of over-fitting and improve the generalization ability of the depth model, the existing depth model for recognizing pictures is improved to obtain the ResNet-10 model. Specifically, on the basis of ResNet-18, the depth of the model and the number of residual layers are further reduced to obtain the ResNet-10 model.
  • the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x, plus a fully connected layer, for a total of 10 layers.
  • the parameters of ResNet-10 in the present invention can be referred to in Table 1 above.
  • the max pool in Table 1 is the pooling layer, where the stride of the first layer of Conv3_x, Conv4_x, and Conv5_x is 2, and each convolutional layer is followed by a ReLU activation layer and a Batch Normalization regularization layer.
  • Conv2_x, Conv3_x, Conv4_x, and Conv5_x each include one residual module (×1 blocks).
  • the last layer of Conv5_x is connected to a fully connected layer, and the fully connected layer outputs the category result corresponding to the sound information.
  • the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x are each connected to an adaptive global average pooling layer. Because the problem to be solved in this application is a classification problem with few classes (male and female), average pooling works better than maximum pooling. In this application, adaptive global average pooling is used to avoid feature size mismatch: since the feature size of the speech spectrogram fluctuates greatly, adaptive global average pooling performs better.
  • the convolution kernel of the input part of the ResNet-10 model is a 3×3 convolution kernel.
  • the 3×3 convolution kernel effectively reduces the amount of calculation and at the same time adapts better to the speech spectrogram.
  • in this way, the model is less prone to overfitting, and the magnitude of the model parameters is reduced at the same time.
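As an illustration of the residual structure these layers rely on, here is a framework-agnostic toy sketch (our own, not the patent's code) of the identity-shortcut computation out = ReLU(F(x) + x) used by each residual module; real Conv2_x to Conv5_x blocks use 3×3 convolutions rather than dense matrices:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """A toy residual module: two weight layers F(x) = W2·relu(W1·x)
    plus the identity shortcut, as in ResNet. The shortcut lets the
    input pass through unchanged when F contributes nothing."""
    return relu(w2 @ relu(w1 @ x) + x)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, w1, w2)

# with all-zero weights the block reduces to relu(x): the shortcut
# guarantees information can flow through even very shallow models
identity = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

This pass-through property is what makes it safe to shrink ResNet-18 down to 10 layers without blocking gradient flow.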
  • the pre-training method of the gender recognition deep neural network model includes:
  • the training method of the neural network model includes the following steps:
  • the characteristic parameters corresponding to the male neutral voice and the female neutral voice include the Mel frequency cepstrum coefficients of the sound signal.
  • the Mel-Frequency Cepstral Coefficients (MFCC) analysis is based on the auditory characteristics of the human ear. Because the level of the sound heard by the human ear is not linearly proportional to the frequency of the sound, the Mel frequency scale is more in line with the hearing characteristics of the human ear.
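The mel scale mentioned above follows an established conversion formula (general background, not specific to this application); a minimal sketch of the Hz-to-mel mapping that underlies MFCC analysis:

```python
import math

def hz_to_mel(f_hz):
    """Standard HTK-style mel-scale conversion: roughly linear below
    1 kHz and logarithmic above, mimicking the non-linear way the
    human ear perceives pitch."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

The compressive behavior above 1 kHz is the reason the mel scale is said to match human hearing better than a linear frequency axis.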
  • the first preset ratio (for example, 70%)
  • using the training set to train the neural network model may further include: deploying the training of the deep neural network model on multiple graphics processing units (GPUs) for distributed training.
  • the training of the model can be deployed on multiple graphics processors for distributed training, which can shorten the training time of the model and accelerate the convergence of the model.
  • the training ends, and the trained deep neural network model is used as a classifier to identify the gender of the user corresponding to the current neutral voice; if the accuracy rate is less than the preset accuracy rate, the number of samples is increased to retrain the deep neural network model until the accuracy rate is greater than or equal to the preset accuracy rate.
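The retrain-until-accurate procedure in the bullet above can be sketched as a simple control loop; `train_fn`, `eval_fn`, and `add_samples_fn` are placeholder hooks of our own, not APIs from this application:

```python
def train_until_accurate(train_fn, eval_fn, add_samples_fn,
                         target_acc, max_rounds=10):
    """Train, evaluate on the test set, and if the accuracy is below
    the preset threshold, enlarge the sample set and retrain."""
    for _ in range(max_rounds):
        model = train_fn()
        if eval_fn(model) >= target_acc:
            return model          # accuracy reached: training ends
        add_samples_fn()          # otherwise expand the data and retry
    raise RuntimeError("target accuracy not reached")

# toy usage: the fake accuracy improves as samples are added
state = {"n": 100}
model = train_until_accurate(
    train_fn=lambda: {"n": state["n"]},
    eval_fn=lambda m: min(0.99, 0.5 + m["n"] / 1000),  # fake accuracy
    add_samples_fn=lambda: state.update(n=state["n"] + 200),
    target_acc=0.9,
)
```

In the toy run the loop retrains twice (100 → 300 → 500 samples) before the fake accuracy clears the 0.9 threshold.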
  • the method for expanding the neutral sound is: superimposing noise on the collected neutral sound, obtaining a spectrogram of the neutral sound after superimposing the noise, and shuffling and recombining the spectrogram in the time direction.
  • the training data of the neutral sound is expanded by data enhancement technology, because neutral sounds are relatively rare; in order to train the deep neural network model, the collected neutral sounds need to be expanded.
  • superimposing noise on the collected neutral sound includes: superimposing white noise on the collected neutral sound and/or mixing environmental noise into the collected neutral sound.
  • new_signal = 0.9*original_signal + 0.1*white_noise().
  • the real environmental noise may be noise collected from parks, bus stops, stadiums, coffee shops and other venues.
  • the neutral sound after superimposed noise is subjected to short-time Fourier transform processing to obtain a spectrogram, and the spectrogram is shuffled and reorganized in the time direction to obtain training data.
  • for example, in the time direction of the spectrogram, the spectrogram corresponding to the neutral sound is cropped according to a fixed speech frame sequence length (such as 64 frames) to obtain speech fragments 64 frames long, and the fragments are then randomly recombined. For example, a 640-frame neutral sound spectrogram is cropped to obtain ten 64-frame speech fragments, and three of the ten fragments are randomly selected and sequentially spliced to obtain a new sound signal.
  • the method of random cropping and splicing of the spectrogram is used here to expand the voice data. In this way, a relatively large-scale neutral voice data set is obtained, and the training data is expanded by dividing it into the corresponding gender data groups according to the different labels.
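A minimal numpy sketch of the augmentation steps described above; the 0.9/0.1 mixing ratio, the 64-frame fragment length, and the three-fragment splice come from the text, while the toy signal and spectrogram data here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)

def add_white_noise(signal, noise_ratio=0.1):
    """new_signal = 0.9*original_signal + 0.1*white_noise()"""
    noise = rng.standard_normal(signal.shape)
    return (1.0 - noise_ratio) * signal + noise_ratio * noise

def shuffle_spectrogram(spec, frag_len=64, n_frags=3):
    """Crop the spectrogram into fixed-length fragments along the
    time axis, then randomly pick n_frags of them and splice them in
    sequence to synthesize a new training sample."""
    n_freq, n_frames = spec.shape
    fragments = [spec[:, i:i + frag_len]
                 for i in range(0, n_frames - frag_len + 1, frag_len)]
    chosen = rng.choice(len(fragments), size=n_frags, replace=False)
    return np.concatenate([fragments[i] for i in chosen], axis=1)

signal = rng.standard_normal(16000)              # 1 s of toy audio
noisy = add_white_noise(signal)
spec = np.abs(rng.standard_normal((128, 640)))   # toy 640-frame spectrogram
new_spec = shuffle_spectrogram(spec)             # 3 fragments of 64 frames
```

A 640-frame spectrogram yields ten 64-frame fragments, and splicing three of them gives a new 192-frame sample, matching the worked example in the text.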
  • the neutral sound-based information prompting device 20 includes a collection module 201, a confirmation module 202, an identification module 203, a prompt module 204, and a processing module 205.
  • the collection module 201 is used to collect sound information;
  • the confirmation module 202 is used to confirm the type of the sound information, wherein the types of the sound information include male voice, female voice, and neutral voice;
  • the recognition module 203 is used to identify the semantic information corresponding to the sound information when it is confirmed that the category of the sound information is a male voice or a female voice;
  • the prompt module 204 is used to give a prompt according to the recognized semantic information and the confirmed type;
  • the processing module 205 is configured to identify the gender corresponding to the neutral voice according to the pre-trained gender recognition deep neural network model when it is confirmed that the category of the voice information is a neutral voice.
  • This application can identify the neutral voice in the collected voice information and then identify the gender corresponding to the neutral voice, so as to provide more accurate instructions. When recognizing the gender corresponding to the neutral voice, the deep neural network model ResNet-10 trained based on AM-Softmax gives the user's voice feature space a better classification boundary in user gender recognition.
  • the information prompting device proposed in this application draws on mature ideas from face recognition and promotes a larger classification boundary, so that voice data with fuzzy classification boundaries, such as neutral voices, can be effectively trained in depth to obtain effective gender attribution recognition, which greatly improves the accuracy of gender recognition and improves the application ability of speaker gender recognition in actual business scenarios and intelligent customer service systems.
  • the above-mentioned integrated unit implemented in the form of a software function module may be stored in a computer readable storage medium.
  • the above-mentioned software function module is stored in a storage medium and includes several instructions to make a computer device (which can be a personal computer, a dual-screen device, or a network device, etc.) or a processor execute part of the methods of the various embodiments of the present application.
  • FIG. 3 is a schematic diagram of the electronic device provided in the third embodiment of the application.
  • the electronic device 3 includes a memory 31, at least one processor 32, a computer program 33 stored in the memory 31 and running on the at least one processor 32, at least one communication bus 34 and a database 35.
  • the computer program 33 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 31 and executed by the at least one processor 32, To complete this application.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and the computer-readable instruction segments are used to describe the execution process of the computer program 33 in the electronic device 3.
  • the electronic device 3 may be a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA) and other devices installed with applications.
  • the schematic diagram in FIG. 3 is only an example of the electronic device 3 and does not constitute a limitation on the electronic device 3.
  • the electronic device 3 may also include input and output devices, network access devices, buses, and so on.
  • the at least one processor 32 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the processor 32 may be a microprocessor, or the processor 32 may also be any conventional processor, etc.
  • the processor 32 is the control center of the electronic device 3 and uses various interfaces and lines to connect the various parts of the entire electronic device 3.
  • the memory 31 may be used to store the computer program 33 and/or modules/units.
  • the processor 32 runs or executes the computer programs and/or modules/units stored in the memory 31 and calls the data stored in the memory 31, thereby realizing various functions of the electronic device 3.
  • the memory 31 may mainly include a program storage area and a data storage area.
  • the program storage area may store an operating system, an application program required by at least one function, etc.; the data storage area may store data created according to the use of the electronic device 3, etc.
  • the memory 31 may include a volatile memory, and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a flash memory device, a high-speed random access memory, or another storage device.
  • the memory 31 stores program codes, and the at least one processor 32 can call the program codes stored in the memory 31 to perform related functions.
  • the modules (collection module 201, confirmation module 202, recognition module 203, prompt module 204, and processing module 205) described in FIG. 2 are program codes stored in the memory 31 and executed by the at least one processor 32, so as to realize the functions of the various modules to achieve the purpose of information prompting.
  • the database (Database) 35 is a warehouse built on the electronic device 3 for organizing, storing and managing data according to a data structure. Databases are usually divided into three types: hierarchical database, network database and relational database. In this embodiment, the database 35 is used to store collected sound information and the like.
  • the integrated module/unit of the electronic device 3 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • this application implements all or part of the processes in the above-mentioned embodiments and methods, and can also be completed by instructing relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium.
  • the computer program includes computer-readable instruction code.
  • the computer-readable instruction code may be in the form of source code, object code, executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), etc.
  • the functional units in the various embodiments of the present application may be integrated in the same processing unit, or each unit may exist alone physically, or two or more units may be integrated in the same unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.

Abstract

Provided are an information prompting method, an information prompting apparatus, an electronic device, and a storage medium. The method comprises: collecting voice information (S1); determining the category of the voice information (S2); when it is determined that the category of the voice information is a male voice or a female voice, identifying semantic information corresponding to the voice information (S3); giving a prompt according to the identified semantic information and the determined category (S4); and when it is determined that the category of the voice information is a neutral voice, identifying, according to a pre-trained gender identification deep neural network model, a gender corresponding to the neutral voice (S5), and then identifying semantic information corresponding to the voice information. The method can provide an accurate indication for a user according to the identified semantic information and gender information.

Description

Information prompt method, device, electronic equipment and medium

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 3, 2020, with application number 202010139944.2 and the invention title "Information Prompt Method, Apparatus, Electronic Equipment and Medium", the entire content of which is incorporated by reference in this application.
Technical field

This application relates to the field of artificial intelligence technology, and in particular to an information prompt method, device, electronic equipment, and medium.

Background art

Recognizing a speaker's biological attributes (such as gender) from speech is an important area in the field of artificial intelligence. Recognizing the gender of a speaker from the voice is a natural ability for humans, but for artificial intelligence it represents the highest level of progress. The voices of male and female speakers usually differ noticeably. However, the inventor realized that for a relatively neutral voice, it is difficult to accurately identify the gender of the speaker without careful discrimination. This is an even greater challenge for artificial intelligence. If neutral voices can be accurately identified, the application ability of speech-based recognition of speaker biological attributes in actual business scenarios (such as intelligent customer service systems) can be greatly improved.
Summary of the invention

In view of the above, it is necessary to propose an information prompt method, device, electronic device, and medium that, by identifying the neutral voice in the collected voice information and then identifying the gender corresponding to the neutral voice, provide users with more accurate instruction information.
The first aspect of the present application provides an information prompt method, the method including: a collecting step, collecting sound information; a confirming step, confirming the category of the sound information, wherein the categories of the sound information include male voice, female voice, and neutral voice; a recognition step, when it is confirmed that the category of the sound information is a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step, giving a prompt according to the recognized semantic information and the confirmed category; and a processing step, when it is confirmed that the category of the sound information is a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x plus a fully connected layer, 10 layers in total.

The second aspect of the present application provides an information prompt device, the device including: a collection module for collecting sound information; a confirmation module for confirming the category of the sound information, wherein the categories of the sound information include male voice, female voice, and neutral voice; a recognition module for recognizing the semantic information corresponding to the sound information when it is confirmed that the category of the sound information is a male voice or a female voice; a prompt module for giving a prompt according to the recognized semantic information and the confirmed category; and a processing module for recognizing, when it is confirmed that the category of the sound information is a neutral voice, the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x plus a fully connected layer, 10 layers in total.

The third aspect of the present application provides an electronic device, wherein the electronic device includes a processor configured to execute computer-readable instructions stored in a memory to implement the following steps: a collecting step, collecting sound information; a confirming step, confirming the category of the sound information, wherein the categories of the sound information include male voice, female voice, and neutral voice; a recognition step, when it is confirmed that the category of the sound information is a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step, giving a prompt according to the recognized semantic information and the confirmed category; and a processing step, when it is confirmed that the category of the sound information is a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x plus a fully connected layer, 10 layers in total.

The fourth aspect of the present application provides a computer-readable storage medium on which computer-readable instructions are stored, wherein the computer-readable instructions, when executed by a processor, implement the following steps: a collecting step, collecting sound information; a confirming step, confirming the category of the sound information, wherein the categories of the sound information include male voice, female voice, and neutral voice; a recognition step, when it is confirmed that the category of the sound information is a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step, giving a prompt according to the recognized semantic information and the confirmed category; and a processing step, when it is confirmed that the category of the sound information is a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x plus a fully connected layer, 10 layers in total.

The information prompt method, device, electronic equipment, and storage medium described in this application collect sound information; confirm the category of the sound information; when it is confirmed that the category of the sound information is a male voice or a female voice, recognize the semantic information corresponding to the sound information; give a prompt according to the recognized semantic information and the confirmed category; and, when it is confirmed that the category of the sound information is a neutral voice, recognize the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then recognize the semantic information corresponding to the sound information, so that accurate instructions can be provided to the user based on the recognized semantic information and gender information. When identifying the gender corresponding to a neutral voice, this application draws on mature ideas from face recognition and promotes a larger classification boundary, so that speech data with fuzzy classification boundaries, such as neutral voices, can be deeply trained to obtain effective gender attribution recognition, which greatly improves the accuracy of gender recognition and improves the application ability of speaker gender recognition in actual business scenarios and intelligent customer service systems.
Description of the drawings

In order to more clearly describe the technical solutions in the embodiments of the present application or the prior art, the following briefly introduces the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative work.

FIG. 1 is a flowchart of the information prompt method provided in Embodiment 1 of the present application.

FIG. 2 is a functional block diagram of the information prompt device provided in Embodiment 2 of the present application.

FIG. 3 is a schematic diagram of the electronic device provided in Embodiment 3 of the present application.

The following specific embodiments will further illustrate this application in conjunction with the above-mentioned drawings.
Detailed description of the embodiments

In order to understand the above objectives, features, and advantages of the application more clearly, the application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the application and the features in the embodiments can be combined with each other if there is no conflict.

Many specific details are set forth in the following description to facilitate a full understanding of the present application. The described embodiments are only a part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terms used in the description of the application herein are only for the purpose of describing specific embodiments and are not intended to limit the application.

The terms "first", "second", and "third" in the specification and claims of the present application and the above drawings are used to distinguish different objects, rather than to describe a specific sequence. In addition, the term "including" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or optionally also includes other steps or units inherent to these processes, methods, products, or devices.

The information prompt method in the embodiments of the present application is applied to an electronic device. For an electronic device that needs to perform information prompting, the information prompt function provided by the method of this application can be integrated directly on the electronic device, or a client for implementing the method of this application can be installed. For another example, the method provided in this application can also run on a server or other device in the form of a Software Development Kit (SDK), which provides an interface for the information prompt function; an electronic device or other device can then realize the information prompt function through the provided interface.
实施例一Example one
图1是本发明实施例一提供的信息提示方法的流程图。根据不同的需求,所述流程图中的执行顺序可以改变,某些步骤可以省略。FIG. 1 is a flowchart of an information prompt method provided in Embodiment 1 of the present invention. According to different requirements, the execution sequence in the flowchart can be changed, and some steps can be omitted.
在本实施方式中,所述信息提示方法可以应用于机器人等电子设备中,所述机器人可以是为用户指路的机器人。例如,所述机器人应用在医院中,当用户需要询问机器人医院的妇科门诊在哪里时,所述机器人可以根据识别的用户性别和语义信息给出准确的指示。或者当用户需要询问机器人某个公共场所的男洗手间或女洗手间在哪里时,所述机器人也可以根据识别的用户性别和语义信息给出准确的指示。所述方法包括:In this embodiment, the information prompting method can be applied to electronic devices such as a robot, and the robot can be a robot that guides the user. For example, the robot is used in a hospital. When the user needs to ask the gynecological clinic of the robot hospital, the robot can give accurate instructions based on the identified user's gender and semantic information. Or when the user needs to ask the robot where the men's restroom or the women's restroom is in a public place, the robot can also give accurate instructions based on the identified user's gender and semantic information. The method includes:
Step S1: collect voice information.
In this embodiment, a microphone is installed on the electronic device, and voice information can be collected through the microphone.
Step S2: determine the category of the voice information, where the category of the voice information is one of male voice, female voice, and neutral voice.
In the prior art, the pitch (fundamental) frequency of most human voices falls in the range 50 Hz-400 Hz. Normally, the pitch frequency of male voices falls in 50 Hz-200 Hz and that of female voices in 150 Hz-400 Hz. Comparing the male and female pitch frequency ranges shows that they partially overlap between 150 Hz and 200 Hz, and within this overlapping range it is difficult to tell whether the speaker is male or female. A voice whose pitch frequency falls in the overlapping range can therefore be defined as a neutral voice.
In this embodiment, determining the category of the voice information includes:
(1) extracting the pitch frequency of the voice information;
(2) comparing the pitch frequency of the voice information with a first pitch frequency range, a second pitch frequency range, and a third pitch frequency range;
(3) when the pitch frequency of the voice information falls within the first pitch frequency range, determining that the category of the voice information is male voice; when it falls within the second pitch frequency range, determining that the category is female voice; and when it falls within the third pitch frequency range, determining that the category is neutral voice.
Specifically, the first pitch frequency range (male voice) is set to 50 Hz-150 Hz, the second pitch frequency range (female voice) to 200 Hz-400 Hz, and the third pitch frequency range (neutral voice) to 150 Hz-200 Hz. In this embodiment, the category of the voice information is determined from its pitch frequency. When the pitch frequency falls within the first range (e.g., 50 Hz-150 Hz), the category is confirmed as male voice and the flow proceeds to step S3; when it falls within the second range (e.g., 200 Hz-400 Hz), the category is confirmed as female voice and the flow proceeds to step S3; when it falls within the third range (e.g., 150 Hz-200 Hz), the category is confirmed as neutral voice and the flow proceeds to step S5.
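Step S2 can be sketched as a simple range lookup. This is a minimal illustration assuming the pitch frequency (in Hz) has already been extracted from the signal; how the exact boundary values 150 Hz and 200 Hz are assigned is an assumption made here, since the filing only names the ranges.

```python
def classify_voice(pitch_hz):
    """Map a pitch frequency to 'male', 'female', or 'neutral'."""
    if 50 <= pitch_hz < 150:      # first range: male voice
        return "male"
    if 200 < pitch_hz <= 400:     # second range: female voice
        return "female"
    if 150 <= pitch_hz <= 200:    # third (overlapping) range: neutral voice
        return "neutral"
    return "unknown"              # outside the typical human pitch range

print(classify_voice(120))   # male
print(classify_voice(300))   # female
print(classify_voice(170))   # neutral
```

A "male" or "female" result would route the flow to step S3, and "neutral" to step S5.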
Step S3: when the category of the voice information is confirmed as male voice or female voice, recognize the semantic information corresponding to the voice information.
In this embodiment, the semantic information corresponding to the voice information can be recognized by natural language processing, which specifically includes:
converting the voice information into text information;
preprocessing the text information, the preprocessing including word segmentation and noise-word removal; and
performing semantic matching on the preprocessed text information against a pre-stored semantic relation library and basic concept library to obtain a semantic matching result.
In this embodiment, the basic concept library includes basic concepts of meanings and extended concepts corresponding to those basic concepts. The semantic relation library includes relations associated with the basic concept library, sentence-pattern relation templates, and a common-sense library, as well as fuzzy semantic relations.
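The semantic-matching step can be illustrated with a toy concept library. The dictionary below is an assumption for illustration only (the filing does not specify the library contents): each basic concept is listed with extended surface forms, and matching is reduced to a substring lookup over the recognized text, with more specific concepts checked first.

```python
# Hypothetical concept library: basic concept -> extended surface forms.
# Ordered from most to least specific so that "women's restroom" is not
# swallowed by the generic "restroom" entry.
CONCEPT_LIBRARY = {
    "women_restroom": ["女洗手间", "女厕所"],
    "men_restroom":   ["男洗手间", "男厕所"],
    "restroom":       ["洗手间", "厕所"],
}

def match_intent(text):
    """Return the first (most specific) concept whose form appears in text."""
    for concept, forms in CONCEPT_LIBRARY.items():
        if any(form in text for form in forms):
            return concept
    return None

print(match_intent("请问女洗手间在哪"))  # women_restroom
print(match_intent("请问洗手间在哪"))    # restroom
```

A production system would use the word segmentation, noise-word removal, and relation templates described above rather than raw substring search.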
Step S4: give a prompt according to the recognized semantic information and the confirmed category.
In this embodiment, when the confirmed category is male voice, the gender of the user can be determined to be male; when the confirmed category is female voice, the gender of the user can be determined to be female. However, the user may not be asking for a direction that depends on his or her own gender, so the robot cannot give an instruction based on the recognized gender alone; it must give the instruction according to the recognized semantic information as well.
Preferably, the recognized semantic information has a higher priority than the confirmed category.
For example, suppose a male user needs to ask, on behalf of a woman accompanying him, where the women's restroom is, and gives the voice query "Where is the women's restroom?". The robot can determine from the voice information that the user is male, but it should not give the location of the men's restroom; instead, it must use the semantic information corresponding to the user's voice query and give the location of the women's restroom. In this way the user receives a more accurate prompt and the user experience is improved.
For another example, when the user only gives the voice query "Where is the restroom?", the robot merely needs to determine the user's gender from the voice information and give a direction accordingly. For instance, when a male user asks "Where is the restroom?", the robot determines from the voice query that the user is male and gives the location of the men's restroom.
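The step-S4 decision rule in the two examples above can be sketched as follows. The intent names and prompt strings are illustrative assumptions: explicit semantics always win, and the inferred gender is only a fallback for a generic query.

```python
def give_prompt(intent, gender):
    """Choose a direction: semantic information first, gender second."""
    if intent == "women_restroom":   # explicit semantic request
        return "women's restroom"
    if intent == "men_restroom":
        return "men's restroom"
    if intent == "restroom":         # generic query: fall back to gender
        return "men's restroom" if gender == "male" else "women's restroom"
    return "unrecognized request"

print(give_prompt("women_restroom", "male"))  # women's restroom
print(give_prompt("restroom", "male"))        # men's restroom
```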
Step S5: when the category of the voice information is confirmed as neutral voice, recognize the gender corresponding to the neutral voice according to a pre-trained gender-recognition deep neural network model, after which the flow returns to step S3.
In this embodiment, after the user's gender has been recognized by the gender-recognition deep neural network model, in order to avoid giving a wrong instruction based on gender alone, the semantic information corresponding to the voice information must first be acquired, and a correct instruction is then given according to the semantic information and the user's gender.
In this embodiment, the gender-recognition deep neural network model is a residual neural network (ResNet) model. The ResNet model is a deep neural network model designed on the basis of the AM-Softmax loss function, where the optimal decision boundary of the gender-recognition deep neural network model is obtained by adjusting the margin factor of the AM-Softmax loss function.
In this embodiment, gender recognition calls for a binary-classification deep model. Binary classifiers usually use a sigmoid or softmax loss function; however, sigmoid and softmax losses perform poorly on data with blurred class boundaries. In order to classify gender accurately from neutral voices, enlarge the inter-class distance, and reduce the intra-class distance, the present application uses the AM-Softmax loss function to design the deep neural network model. The AM-Softmax loss function pushes the classification boundary between categories wider.
The AM-Softmax loss function (reproduced here in its standard form; the formula appears as an image in the original filing) is:

L_AMS = -(1/n) Σ_{i=1}^{n} log [ e^{s(cos θ_{y_i} - m)} / ( e^{s(cos θ_{y_i} - m)} + Σ_{j=1, j≠y_i}^{c} e^{s cos θ_j} ) ]

where s = 30 and m = 0.2. To improve convergence speed, a hyperparameter s is introduced and set to the fixed value 30.
When the margin factor m of the AM-Softmax loss function takes the value 0.2, the optimal decision boundary of the gender-recognition deep neural network model is obtained.
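A numeric sketch of the AM-Softmax loss for the two-class (male/female) case, with s = 30 and m = 0.2 as stated above. `cosines` holds the cosine similarity between a sample's embedding and each class weight vector, and `target` is the index of the true class; this stands in for the model's final layer and is not the filing's actual implementation.

```python
import math

def am_softmax_loss(cosines, target, s=30.0, m=0.2):
    """AM-Softmax: subtract margin m from the target cosine, scale by s."""
    logits = [s * (c - m) if j == target else s * c
              for j, c in enumerate(cosines)]
    log_sum = math.log(sum(math.exp(z) for z in logits))
    return log_sum - logits[target]   # -log softmax of the target logit

# A confidently classified sample yields a near-zero loss; a sample on
# the class boundary (equal cosines) is still penalized by the margin.
print(am_softmax_loss([0.9, 0.1], target=0))
print(am_softmax_loss([0.5, 0.5], target=0))
```

The margin m is what forces the inter-class gap: even when both cosines are equal, the target class must beat the other by m to drive the loss toward zero.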
In this embodiment, since gender recognition is a binary-classification problem with only two target categories (male and female), the solution space is relatively simple compared with image classification. Directly reusing a deep model from the image-classification field is therefore prone to overfitting. In the present application, to avoid overfitting and improve the generalization ability of the deep model, an existing image-recognition deep model is modified to obtain a ResNet-10 model. Specifically, starting from ResNet-18, the model depth is reduced again and the number of residual layers is decreased to obtain the ResNet-10 model.
In this embodiment, the ResNet-10 model comprises ten layers in total: the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x, plus a fully connected layer. The parameters of ResNet-10 in the present invention are shown in Table 1, in which max pool denotes a pooling layer. The stride of the first layer of each of Conv3_x, Conv4_x, and Conv5_x is 2, and every convolutional layer is followed by a ReLU activation layer and a Batch Normalization regularization layer. In Table 1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x each include one residual block (×1 blocks). To realize the binary-classification task of the gender-recognition model of the present invention, the last layer of Conv5_x is connected to a fully connected layer, which outputs the category result corresponding to the voice information.
In this embodiment, the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x are each followed by adaptive global average pooling. Because the problem solved by this application has few classes (male and female), average pooling works better than max pooling. Adaptive global average pooling is adopted to avoid feature-size mismatches: since the feature size of the speech spectrogram fluctuates considerably, adaptive global average pooling performs better.
Table 1 (ResNet-10 layer parameters; rendered as an image in the original filing)
In this embodiment, the convolution kernel of the input part of the ResNet-10 model is a 3×3 kernel. A 3×3 kernel effectively reduces the amount of computation while adapting better to the speech spectrogram. In addition, reducing the feature-map size of each residual layer of the ResNet-10 model makes the model less likely to overfit and reduces the magnitude of the model parameters.
In this embodiment, the method of pre-training the gender-recognition deep neural network model includes:
(1) augmenting the neutral voices to obtain training data; and
(2) training the deep neural network model on the augmented training data to obtain the gender-recognition deep neural network model.
In this embodiment, the training method of the neural network model includes the following steps:
(a) Obtain the characteristic parameters corresponding to the neutral voices and annotate each characteristic parameter with a category, so that the characteristic parameter carries a category label.
For example, select the characteristic parameters corresponding to 500 male neutral voices and 500 female neutral voices, and annotate each characteristic parameter with a category; "1" may be used as the parameter label for male neutral voices and "2" as the parameter label for female neutral voices.
In this embodiment, the characteristic parameters corresponding to the male and female neutral voices include the Mel-frequency cepstral coefficients (MFCCs) of the voice signal. MFCC analysis is based on the auditory characteristics of the human ear: because the perceived pitch of a sound is not linearly proportional to its frequency, the Mel frequency scale better matches the hearing characteristics of the human ear.
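The Mel scale mentioned above maps physical frequency (Hz) to a perceptual scale. The conversion below is the commonly used formula for computing MFCC filter banks; it is an assumption for illustration, since the filing does not spell out the conversion it uses.

```python
import math

def hz_to_mel(f_hz):
    """Common Hz-to-Mel conversion used when building MFCC filter banks."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Equal 100 Hz steps shrink on the Mel scale at high frequencies,
# matching the ear's reduced resolution there.
print(hz_to_mel(1000) - hz_to_mel(900))
print(hz_to_mel(8000) - hz_to_mel(7900))
```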
(b) Randomly divide the characteristic parameters into a training set of a first preset proportion and a validation set of a second preset proportion.
First, distribute the training samples of neutral voices of different genders into different folders, e.g., the male neutral-voice samples into a first folder and the female neutral-voice samples into a second folder. Then, from each folder, extract a first preset proportion (e.g., 70%) of the samples to form the overall training set, whose purpose is to train the deep neural network model; the remaining second preset proportion (e.g., 30%) of the samples from each folder is taken as the test set, whose purpose is to test the classification performance of the deep neural network model.
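The per-folder 70/30 split described above amounts to a stratified split: samples are grouped by gender label first and split within each group, so both classes keep the same ratio. The file names and labels below are illustrative stand-ins.

```python
import random

def stratified_split(samples_by_label, train_ratio=0.7, seed=42):
    """Split each label group separately, preserving the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label, samples in samples_by_label.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        train += [(s, label) for s in shuffled[:cut]]
        test  += [(s, label) for s in shuffled[cut:]]
    return train, test

data = {"male":   [f"m_{i}.wav" for i in range(500)],
        "female": [f"f_{i}.wav" for i in range(500)]}
train, test = stratified_split(data)
print(len(train), len(test))   # 700 300
```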
(c) Train the deep neural network model on the training set.
The process of feeding the training set into the constructed neural network model (e.g., ResNet-10) for training can be implemented with existing techniques and is not detailed here. In some embodiments, training the neural network model on the training set may further include deploying the training of the deep neural network model on multiple graphics processing units (GPUs) for distributed training. For example, using TensorFlow's distributed-training mechanism, the training of the model can be deployed across multiple GPUs, which shortens training time and accelerates model convergence.
(d) Verify the accuracy of the trained deep neural network model on the validation set.
In this embodiment, if the accuracy is greater than or equal to a preset accuracy, training ends and the trained deep neural network model is used as a classifier to recognize the gender of the user corresponding to the current neutral voice; if the accuracy is less than the preset accuracy, the number of samples is increased and the deep neural network model is retrained until the accuracy is greater than or equal to the preset accuracy.
In this embodiment, the method of augmenting the neutral voices is: superimpose noise on the collected neutral voices, obtain the spectrogram of the noise-superimposed neutral voices, and shuffle and recombine the spectrogram along the time axis.
In this embodiment, the neutral-voice training data is expanded through data-augmentation techniques, because neutral voices are relatively rare, while training the deep neural network model requires expanding the collected neutral-voice samples.
Specifically, superimposing noise on the collected neutral voices includes superimposing white noise on the collected neutral voices and/or mixing environmental noise into the collected neutral voices.
For example, Gaussian white noise is linearly superimposed on a collected neutral voice (original_signal) to obtain a new voice signal: new_signal = 0.9*original_signal + 0.1*white_noise().
For example, mixing real environmental noise into the collected neutral voice may consist of replacing the above Gaussian white noise with collected real environmental noise to obtain a new voice signal: new_signal = 0.9*original_signal + 0.1*real_noise(). The real environmental noise may be noise collected in places such as parks, bus stops, stadiums, and coffee shops.
The noise-superimposed neutral voice is processed by a short-time Fourier transform to obtain a spectrogram, and the spectrogram is shuffled and recombined along the time axis to obtain training data.
For example, along the time axis of the spectrogram, crop the spectrogram of the neutral voice into segments of a fixed frame length (e.g., 64 frames) and then recombine the segments at random. For instance, a 640-frame neutral-voice spectrogram is cropped into ten 64-frame speech segments, three of the ten segments are randomly selected, and they are spliced in sequence to obtain a new voice signal. Through these two steps, a sufficient amount of effective, high-quality neutral-voice data can be generated.
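The two augmentation steps above can be sketched in pure Python. This is a list-based illustration of the mechanics only, assuming a spectrogram is represented as a sequence of frames; real code would operate on arrays produced by a short-time Fourier transform.

```python
import random

def add_noise(signal, noise, w=0.9):
    """Linear noise superposition: new = w*original + (1-w)*noise."""
    return [w * s + (1 - w) * n for s, n in zip(signal, noise)]

def shuffle_spectrogram(frames, seg_len=64, n_pick=3, seed=0):
    """Crop into seg_len-frame segments and splice n_pick random segments."""
    segments = [frames[i:i + seg_len]
                for i in range(0, len(frames) - seg_len + 1, seg_len)]
    picked = random.Random(seed).sample(segments, n_pick)
    return [frame for seg in picked for frame in seg]

spec = list(range(640))            # stand-in for 640 spectrogram frames
new_spec = shuffle_spectrogram(spec)
print(len(new_spec))               # 192 (= 3 segments of 64 frames)
```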
Since the present application needs to design a deep model that recognizes the user's gender from voice spectrograms, random cropping and stacking of spectrograms is used here to expand the voice data. In this way a reasonably sized set of neutral-voice data is obtained and assigned, according to its labels, to the corresponding gender data group, thereby expanding the training data.
In summary, the information prompting method provided by the present invention includes: collecting voice information; determining the category of the voice information, where the category is one of male voice, female voice, and neutral voice; when the category is confirmed as male voice or female voice, recognizing the semantic information corresponding to the voice information; giving a prompt according to the recognized semantic information and the confirmed category; and, when the category is confirmed as neutral voice, recognizing the gender corresponding to the neutral voice with a pre-trained gender-recognition deep neural network model. The present application can identify the neutral voices in the collected voice information and then recognize the gender corresponding to each neutral voice, thereby providing more accurate directions. Moreover, when recognizing the gender corresponding to a neutral voice, the ResNet-10 deep neural network model trained with AM-Softmax gives the user's voice feature space a better classification boundary for gender recognition. Drawing on mature ideas from face recognition, the proposed information prompting method pushes the classification boundary wider, so that voice data with blurred class boundaries, such as neutral voices, can be effectively assigned a gender through deep training. This greatly improves the accuracy of gender recognition and enhances the applicability of speaker gender recognition in practical business scenarios and intelligent customer-service systems.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Those of ordinary skill in the art may make improvements without departing from the inventive concept of the present invention, and all such improvements fall within the protection scope of the present invention.
The functional modules and the hardware structure of an electronic device that implements the above information prompting are described below with reference to FIG. 2 and FIG. 3, respectively.
Embodiment 2
FIG. 2 is a functional module diagram of a preferred embodiment of the information prompting apparatus of the present invention.
In some embodiments, the information prompting apparatus 20 runs in an electronic device. The information prompting apparatus 20 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the information prompting apparatus 20 can be stored in a memory and executed by at least one processor to perform the information prompting function.
In this embodiment, the information prompting apparatus 20 can be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a collection module 201, a confirmation module 202, a recognition module 203, a prompt module 204, and a processing module 205. A module in the present invention refers to a series of computer program segments that can be executed by at least one processor to complete a fixed function and that are stored in a memory. In some embodiments, the functions of the modules are detailed in subsequent embodiments.
The collection module 201 is used to collect voice information.
In this embodiment, a microphone is installed on the robot, and voice information can be collected through the microphone.
The confirmation module 202 is used to determine the category of the voice information, where the category of the voice information is one of male voice, female voice, and neutral voice.
In the prior art, the pitch frequency of most human voices falls in the range 50 Hz-400 Hz. Normally, the pitch frequency of male voices falls in 50 Hz-200 Hz and that of female voices in 150 Hz-400 Hz. Comparing the male and female pitch frequency ranges shows that they partially overlap between 150 Hz and 200 Hz, and within this overlapping range it is difficult to tell whether the speaker is male or female. A voice whose pitch frequency falls in the overlapping range can therefore be defined as a neutral voice.
In this embodiment, the confirmation module 202 determining the category of the voice information includes:
(1) extracting the pitch frequency of the voice information;
(2) comparing the pitch frequency of the voice information with a first pitch frequency range, a second pitch frequency range, and a third pitch frequency range;
(3) when the pitch frequency of the voice information falls within the first pitch frequency range, the confirmation module 202 determines that the category of the voice information is male voice; when it falls within the second pitch frequency range, the confirmation module 202 determines that the category is female voice; and when it falls within the third pitch frequency range, the confirmation module 202 determines that the category is neutral voice.
Specifically, the first pitch frequency range (male voice) is set to 50 Hz-150 Hz, the second pitch frequency range (female voice) to 200 Hz-400 Hz, and the third pitch frequency range (neutral voice) to 150 Hz-200 Hz. In this embodiment, the category of the voice information is determined from its pitch frequency. When the pitch frequency falls within the first range (e.g., 50 Hz-150 Hz), the confirmation module 202 confirms the category as male voice; when it falls within the second range (e.g., 200 Hz-400 Hz), the confirmation module 202 confirms the category as female voice; and when it falls within the third range (e.g., 150 Hz-200 Hz), the confirmation module 202 confirms the category as neutral voice.
The recognition module 203 is used to recognize the semantic information corresponding to the voice information when the category of the voice information is confirmed as male voice or female voice.
In this embodiment, the semantic information corresponding to the voice information can be recognized by natural language processing, which specifically includes:
converting the voice information into text information;
preprocessing the text information, the preprocessing including word segmentation and noise-word removal; and
performing semantic matching on the preprocessed text information against a pre-stored semantic relation library and basic concept library to obtain a semantic matching result.
In this embodiment, the basic concept library includes basic concepts of meanings and extended concepts corresponding to those basic concepts. The semantic relation library includes relations associated with the basic concept library, sentence-pattern relation templates, and a common-sense library, as well as fuzzy semantic relations.
所述提示模块204用于根据识别的语义信息和确认的类别给出提示。The prompt module 204 is used to give a prompt according to the recognized semantic information and the confirmed category.
在本实施例中,当确定的类别为男性声音,可以确认用户的性别为男性。当确定的类别为女性声音,可以确认用户的性别为女性。此时,由于用户有可能并非需要根据自身性别得到机器人的指示。所以机器人还不能仅根据识别的性别给出指示,而是需要根据识别的语义信息来给出指示。In this embodiment, when the determined category is male voice, it can be confirmed that the gender of the user is male. When the determined category is female voice, it can be confirmed that the gender of the user is female. At this time, because the user may not need to get instructions from the robot based on his gender. Therefore, the robot can not only give instructions based on the recognized gender, but needs to give instructions based on the recognized semantic information.
优选地,所述识别的语义信息的优先级高于确认的类别的优先级。Preferably, the priority of the identified semantic information is higher than the priority of the confirmed category.
例如,当用户为男性,而该用户需要替他身边的女性询问女洗手间所在位置时。当男性用户给出语音询问信息“女洗手间在哪?”。此时,机器人可以根据用户的声音信息判定用户性别为男性。但是并不能给出男洗手间的位置的提示,而是需要根据用户给出的语音询问信息对应的语义信息来给出女洗手间所在位置的提示。由此,可以为用户提供更加准确的提示,提高用户体验。For example, when the user is a male, and the user needs to ask the women around him for the location of the female toilet. When a male user gives a voice asking message "Where is the female bathroom?". At this time, the robot can determine that the user's gender is male based on the user's voice information. However, it is not possible to give a reminder of the location of the men's restroom. Instead, it is necessary to give a reminder of the location of the women's restroom based on the semantic information corresponding to the voice inquiry information given by the user. As a result, more accurate prompts can be provided to users, and user experience can be improved.
For example, when the user only gives the voice inquiry "Where is the restroom?", the robot determines the user's gender from the voice information and gives an instruction according to the determined gender. For instance, when a male user asks "Where is the restroom?", the robot determines from the voice information that the user is male and gives the location of the men's restroom.
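The priority rule in these examples can be sketched as follows (a minimal illustration only; the function name, category strings, and keyword matching are hypothetical and not part of the disclosure):

```python
def choose_prompt(query: str, detected_gender: str) -> str:
    """Pick which restroom location to give.

    Semantic information has priority: if the query names a restroom
    explicitly, the gender detected from the voice is ignored.
    """
    q = query.lower()
    if "women" in q:                 # explicit semantic request wins
        return "women's restroom"
    if "men" in q:                   # "women" already handled above
        return "men's restroom"
    # Gender-neutral query: fall back to the gender detected from the voice.
    return "men's restroom" if detected_gender == "male" else "women's restroom"
```

For the first example above, `choose_prompt("Where is the women's restroom?", "male")` follows the semantic information and returns the women's restroom despite the male voice.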
The processing module 205 is configured to, when the category of the voice information is confirmed to be a neutral voice, recognize the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model.
In this embodiment, after the user's gender is recognized by the gender recognition deep neural network model, in order to avoid giving a wrong instruction based on gender alone, the semantic information corresponding to the voice information is first acquired, and a correct instruction is then given according to both the semantic information and the user's gender.
In this embodiment, the gender recognition deep neural network model is a residual neural network (ResNet) model. The ResNet model is a deep neural network model designed based on the AM-Softmax loss function, and the optimal decision boundary of the gender recognition deep neural network model is obtained by adjusting the margin factor of the AM-Softmax loss function.
In this embodiment, gender recognition requires a two-class deep model. Two-class models usually use a sigmoid or softmax loss function; however, sigmoid and softmax losses perform poorly on data with blurred class boundaries. In order to classify gender accurately from neutral voices by increasing the inter-class distance and reducing the intra-class distance, this application uses the AM-Softmax loss function to design the deep neural network model. The AM-Softmax loss function pushes the classification boundary between categories to be larger.
The AM-Softmax loss function is:

$$
L_{\mathrm{AMS}} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s\,(\cos\theta_{y_i}-m)}}{e^{s\,(\cos\theta_{y_i}-m)}+\sum_{j=1,\,j\neq y_i}^{c}e^{s\,\cos\theta_j}}
$$

where n is the number of training samples, c is the number of classes, and θ_{y_i} is the angle between the i-th feature vector and the weight vector of its ground-truth class. To improve convergence speed, a scaling hyperparameter s is introduced and fixed here at s = 30. When the margin factor m of the AM-Softmax loss function is set to 0.2, the optimal decision boundary of the gender recognition deep neural network model is obtained.
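A minimal numerical sketch of the AM-Softmax loss in plain Python (assuming the inputs are already cosine similarities between L2-normalized features and class-weight vectors; the function name is illustrative):

```python
import math

def am_softmax_loss(cosines, labels, s=30.0, m=0.2):
    """AM-Softmax loss over a batch.

    cosines: per-sample list [cos(theta_0), cos(theta_1), ...], the cosine
             similarity of the normalized feature with each class weight.
    labels:  ground-truth class index per sample.
    The margin m is subtracted from the target-class cosine before the
    scaled softmax, which enlarges the boundary between classes.
    """
    total = 0.0
    for cos_row, y in zip(cosines, labels):
        target = math.exp(s * (cos_row[y] - m))
        others = sum(math.exp(s * c) for j, c in enumerate(cos_row) if j != y)
        total += -math.log(target / (target + others))
    return total / len(labels)
```

A confidently classified sample (target cosine well above the others) yields a near-zero loss, while a boundary sample with equal cosines is penalized heavily because of the margin, which is exactly what makes the loss suitable for boundary-blurred neutral voices.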
In this embodiment, since gender recognition is a two-class problem whose target categories are only male and female, the solution space is simpler than that of image classification, and directly reusing a deep model from the image classification field is prone to overfitting. Accordingly, in this application, to avoid overfitting and to improve the generalization ability of the deep model, an existing deep image recognition model is modified to obtain a ResNet-10 model. Specifically, on the basis of ResNet-18, the model depth and the number of residual layers are reduced again to obtain the ResNet-10 model.
In this embodiment, the ResNet-10 model includes convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x and a fully connected layer, ten layers in total. The parameters of ResNet-10 in the present invention are shown in Table 1 above, where max pool denotes a pooling layer; the stride of the first layer of each of Conv3_x, Conv4_x, and Conv5_x is 2; each convolutional layer is followed by a ReLU activation layer and a Batch Normalization regularization layer; and Conv2_x, Conv3_x, Conv4_x, and Conv5_x each include one residual module (×1 blocks). To implement the two-class task of the gender recognition model of the present invention, the last layer of Conv5_x is connected to a fully connected layer, which outputs the category result corresponding to the voice information.
In this embodiment, the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x are each connected to an adaptive global average pooling layer. Because the problem addressed in this application has few classes (male and female), average pooling works better than max pooling. Moreover, since the feature sizes of speech spectrograms fluctuate considerably, adaptive global average pooling is used to avoid feature-size mismatches.
In this embodiment, the convolution kernel of the input part of the ResNet-10 model is a 3×3 kernel. The 3×3 kernel effectively reduces the amount of computation while adapting better to speech spectrograms. In addition, reducing the feature map size of each residual layer of the ResNet-10 model makes the model less prone to overfitting and reduces the number of model parameters.
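The effect of the stride-2 stages on the spectrogram's feature map can be sketched with the standard output-size arithmetic (a hedged illustration; the 64×64 input patch and same-padding scheme are assumptions, not taken from Table 1):

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Standard convolution output-size formula: (W + 2P - K) // S + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Assume a 64x64 spectrogram patch; Conv3_x, Conv4_x and Conv5_x each
# start with a stride-2 layer, so the feature map is halved three times.
size = 64
for stage in ("Conv3_x", "Conv4_x", "Conv5_x"):
    size = conv_out(size, stride=2)
print(size)  # 64 -> 32 -> 16 -> 8
```

This rapid shrinking of the feature maps is what keeps the parameter count low and, as the paragraph above notes, helps the small two-class model avoid overfitting.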
In this embodiment, the method of pre-training the gender recognition deep neural network model includes:
(1) expanding the neutral voice data to obtain training data;
(2) training the deep neural network model with the expanded training data to obtain the gender recognition deep neural network model.
In this embodiment, the training method of the neural network model includes the following steps:
(a) Obtain the characteristic parameters corresponding to the neutral voices, and label each characteristic parameter with a category, so that the characteristic parameter carries a category label.
For example, select the characteristic parameters corresponding to 500 male neutral voices and 500 female neutral voices, and label each characteristic parameter with a category; "1" may be used as the parameter label for male neutral voices, and "2" as the parameter label for female neutral voices.
In this embodiment, the characteristic parameters corresponding to the male and female neutral voices include the Mel-Frequency Cepstral Coefficients (MFCC) of the sound signal. MFCC analysis is based on the auditory characteristics of the human ear: since the perceived pitch of a sound is not linearly proportional to its frequency, the Mel frequency scale is better aligned with human hearing.
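The Mel scale underlying MFCC can be illustrated with the common Hz-to-Mel conversion (one standard formulation; the patent does not specify which variant its MFCC front end uses):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to Mel, using the common 2595*log10 form."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse conversion, Mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The scale is compressive: equal steps in Hz map to shrinking steps in
# Mel at higher frequencies, mirroring how pitch perception flattens.
# Under this formulation, 1 kHz maps to roughly 1000 Mel.
```

The compression is why Mel-scale features track perceived pitch better than raw frequency, which is the motivation the paragraph above gives for using MFCC.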
(b) Randomly divide the characteristic parameters into a training set in a first preset ratio and a validation set in a second preset ratio.
First, distribute the training samples of the neutral voices of different genders into different folders: for example, the training samples of male neutral voices into a first folder, and those of female neutral voices into a second folder. Then extract a first preset ratio (for example, 70%) of the samples from each folder to form the training set, which is used to train the deep neural network model; the remaining second preset ratio (for example, 30%) of the samples from each folder form the test set, which is used to test the classification performance of the deep neural network model.
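Splitting each folder separately keeps the class balance intact in both sets; a minimal standard-library sketch of this per-class split (the 70/30 ratio follows the example above; names are illustrative):

```python
import random

def stratified_split(samples_by_class, train_ratio=0.7, seed=42):
    """Split each class's samples train/test so both sets keep the balance."""
    rng = random.Random(seed)
    train, test = [], []
    for label, samples in samples_by_class.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(round(len(shuffled) * train_ratio))
        train += [(label, s) for s in shuffled[:cut]]
        test += [(label, s) for s in shuffled[cut:]]
    return train, test

# 500 samples per gender, as in the labeling example above.
data = {"male_neutral": list(range(500)), "female_neutral": list(range(500))}
train, test = stratified_split(data)
```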
(c) Train the deep neural network model with the training set.
The process of inputting the training set into the established neural network model (such as ResNet-10) for training can be implemented with existing techniques and is not detailed here. In some embodiments, training the neural network model with the training set may further include deploying the training of the deep neural network model on multiple graphics processing units (GPUs) for distributed training. For example, through TensorFlow's distributed training mechanism, the training of the model can be deployed on multiple GPUs, which shortens training time and accelerates model convergence.
(d) Verify the accuracy of the trained deep neural network model with the validation set.
In this embodiment, if the accuracy is greater than or equal to a preset accuracy, the training ends, and the trained deep neural network model is used as a classifier to identify the gender of the user corresponding to the current neutral voice; if the accuracy is less than the preset accuracy, the number of samples is increased and the deep neural network model is retrained until the accuracy is greater than or equal to the preset accuracy.
In this embodiment, the method of expanding the neutral voice data is as follows: superimpose noise on the collected neutral voices, obtain the spectrogram of each noise-superimposed neutral voice, and shuffle and recombine the spectrogram in the time direction.
In this embodiment, the training data of the neutral voices is expanded through data augmentation. Neutral voices are relatively rare, so the collected neutral voices need to be expanded in order to train the deep neural network model.
Specifically, superimposing noise on the collected neutral voices includes superimposing white noise on the collected neutral voices and/or mixing environmental noise into the collected neutral voices.
For example, Gaussian white noise is linearly superimposed on a collected neutral voice (original_signal) to obtain a new sound signal: new_signal = 0.9*original_signal + 0.1*white_noise().
For example, mixing real environmental noise into the collected neutral voices may be done by replacing the Gaussian white noise above with collected real environmental noise, obtaining a new sound signal: new_signal = 0.9*original_signal + 0.1*real_noise(). The real environmental noise may be noise collected from venues such as parks, bus stops, stadiums, and coffee shops.
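The linear mixing above can be sketched sample-by-sample in plain Python (illustrative only; a real pipeline would operate on audio arrays at a known sample rate and normalize levels first):

```python
import random

def mix_noise(original, noise, signal_weight=0.9, noise_weight=0.1):
    """Linearly mix a noise track into a signal: 0.9*signal + 0.1*noise."""
    return [signal_weight * s + noise_weight * n
            for s, n in zip(original, noise)]

random.seed(0)
# Stand-ins for one second of audio at an assumed 16 kHz sample rate.
original_signal = [random.uniform(-1.0, 1.0) for _ in range(16000)]
white_noise = [random.gauss(0.0, 1.0) for _ in range(16000)]
new_signal = mix_noise(original_signal, white_noise)
```

Swapping `white_noise` for a recorded environment track gives the real-noise variant, matching the substitution described above.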
The noise-superimposed neutral voices are processed by a short-time Fourier transform to obtain spectrograms, and the spectrograms are shuffled and recombined in the time direction to obtain the training data.
For example, in the time direction of the spectrogram, the spectrogram corresponding to a neutral voice is cropped into segments of a fixed speech-frame-sequence length (such as 64 frames), and the resulting 64-frame speech segments are then randomly recombined. For instance, a 640-frame neutral voice spectrogram is cropped into ten 64-frame speech segments, and three of the ten segments are randomly selected and spliced in sequence to obtain a new sound signal. Through the above two steps, a sufficient amount of effective, high-quality neutral voice data can be generated.
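The crop-and-recombine step can be sketched as follows (a hedged illustration; frames are represented as a plain list rather than real STFT columns, and the segment counts follow the 640-frame example above):

```python
import random

def crop_and_recombine(frames, segment_len=64, num_segments=3, seed=0):
    """Cut a spectrogram (a list of frames) into fixed-length segments,
    then splice a few randomly chosen segments into a new example."""
    segments = [frames[i:i + segment_len]
                for i in range(0, len(frames) - segment_len + 1, segment_len)]
    rng = random.Random(seed)
    chosen = rng.sample(segments, num_segments)  # 3 of the 10 segments
    return [frame for seg in chosen for frame in seg]

spectrogram = list(range(640))          # stand-in for 640 STFT frames
new_example = crop_and_recombine(spectrogram)
```

Each call with a different seed yields a different 192-frame splice, which is how a small pool of neutral-voice recordings is multiplied into many training examples.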
Since this application needs to design a deep model that identifies the user's gender from the speech spectrogram, random cropping and stacking of spectrograms is used here to expand the speech data. In this way, a relatively large amount of neutral voice data is obtained and divided into the corresponding gender data groups according to their labels, thereby expanding the training data.
In summary, the neutral-voice-based information prompting apparatus 20 provided by the present invention includes a collection module 201, a confirmation module 202, a recognition module 203, a prompt module 204, and a processing module 205. The collection module 201 is configured to collect voice information; the confirmation module 202 is configured to confirm the category of the voice information, the categories including male voice, female voice, and neutral voice; the recognition module 203 is configured to recognize the semantic information corresponding to the voice information when its category is confirmed to be a male voice or a female voice; the prompt module 204 is configured to give a prompt according to the recognized semantic information and the confirmed category; and the processing module 205 is configured to recognize the gender corresponding to a neutral voice according to the pre-trained gender recognition deep neural network model when the category of the voice information is confirmed to be a neutral voice. This application can identify the neutral voices in the collected voice information and then identify the gender corresponding to each neutral voice, thereby providing more accurate instructions.
Moreover, when recognizing the gender corresponding to a neutral voice, the deep neural network model ResNet-10 trained with AM-Softmax gives the user's voice feature space a better classification boundary. The information prompting apparatus proposed in this application also draws on mature ideas from face recognition to enlarge the classification boundary, so that voice data with blurred class boundaries, such as neutral voices, can be trained in depth to obtain effective gender attribution. This greatly improves the accuracy of gender recognition and enhances the applicability of speaker gender recognition in real business scenarios and intelligent customer service systems.
The integrated unit implemented in the form of a software function module described above may be stored in a computer-readable storage medium. The software function module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a dual-screen device, a network device, or the like) or a processor to execute parts of the methods described in the embodiments of the present invention.
FIG. 3 is a schematic diagram of the electronic device provided in Embodiment 3 of this application.
The electronic device 3 includes a memory 31, at least one processor 32, a computer program 33 stored in the memory 31 and executable on the at least one processor 32, at least one communication bus 34, and a database 35.
When the at least one processor 32 executes the computer program 33, the steps in the foregoing embodiments of the information prompting method are implemented.
Exemplarily, the computer program 33 may be divided into one or more modules/units, which are stored in the memory 31 and executed by the at least one processor 32 to complete this application. The one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, the instruction segments being used to describe the execution process of the computer program 33 in the electronic device 3.
The electronic device 3 may be a device on which an application is installed, such as a mobile phone, a tablet computer, or a personal digital assistant (PDA). Those skilled in the art can understand that FIG. 3 is merely an example of the electronic device 3 and does not constitute a limitation on it; the electronic device 3 may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device 3 may also include input/output devices, network access devices, buses, and so on.
The at least one processor 32 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 32 may be a microprocessor or any conventional processor. The processor 32 is the control center of the electronic device 3 and connects the various parts of the entire electronic device 3 through various interfaces and lines.
The memory 31 may be used to store the computer program 33 and/or the modules/units. The processor 32 implements the various functions of the electronic device 3 by running or executing the computer programs and/or modules/units stored in the memory 31 and calling the data stored in the memory 31. The memory 31 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of the electronic device 3, and the like. In addition, the memory 31 may include volatile memory and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, a high-speed random access memory, or another storage device.
The memory 31 stores program code, and the at least one processor 32 can call the program code stored in the memory 31 to perform related functions. For example, the modules described in FIG. 2 (the collection module 201, confirmation module 202, recognition module 203, prompt module 204, and processing module 205) are program code stored in the memory 31 and executed by the at least one processor 32, so as to realize the functions of the various modules for the purpose of information prompting.
The database 35 is a repository established on the electronic device 3 that organizes, stores, and manages data according to a data structure. Databases are usually divided into three types: hierarchical databases, network databases, and relational databases. In this embodiment, the database 35 is used to store the collected voice information and the like.
If the modules/units integrated in the electronic device 3 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, this application implements all or part of the processes in the methods of the above embodiments, which may also be completed by instructing relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the foregoing method embodiments. The computer program includes computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory, and the like.
In the several embodiments provided in this application, it should be understood that the disclosed electronic device and method may be implemented in other ways. For example, the electronic device embodiments described above are merely illustrative; for instance, the division of the units is only a logical function division, and there may be other division methods in actual implementation.
In addition, the functional units in the various embodiments of this application may be integrated in the same processing unit, each unit may exist alone physically, or two or more units may be integrated in the same unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
For those skilled in the art, it is obvious that this application is not limited to the details of the foregoing exemplary embodiments, and that this application can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, from whatever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than the above description, and all changes falling within the meaning and scope of equivalent elements of the claims are therefore intended to be embraced in this application. Any reference signs in the claims shall not be regarded as limiting the claims involved. In addition, it is obvious that the word "comprising" does not exclude other units, and the singular does not exclude the plural. Multiple units or apparatuses stated in this application may also be implemented by one unit or apparatus through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any specific order.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements can be made to the technical solutions of this application without departing from the spirit and scope of these technical solutions.

Claims (20)

  1. An information prompting method, wherein the method comprises:
    a collection step of collecting voice information;
    a confirmation step of confirming the category of the voice information, wherein the categories of the voice information include male voice, female voice, and neutral voice;
    a recognition step of, when the category of the voice information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the voice information;
    a prompt step of giving a prompt according to the recognized semantic information and the confirmed category; and
    a processing step of, when the category of the voice information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, the ResNet-10 model comprising convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x and a fully connected layer, ten layers in total.
  2. The information prompting method according to claim 1, wherein the confirmation step comprises:
    extracting the pitch frequency of the voice information;
    comparing the pitch frequency of the voice information with a first pitch frequency range, a second pitch frequency range, and a third pitch frequency range;
    when the pitch frequency of the voice information falls within the first pitch frequency range, confirming that the category of the voice information is a male voice;
    when the pitch frequency of the voice information falls within the second pitch frequency range, confirming that the category of the voice information is a female voice; and
    when the pitch frequency of the voice information falls within the third pitch frequency range, confirming that the category of the voice information is a neutral voice.
  3. The information prompting method according to claim 1, wherein recognizing the semantic information corresponding to the voice information comprises:
    converting the voice information into text information;
    preprocessing the text information, the preprocessing including word segmentation and noise-word removal; and
    performing semantic matching on the preprocessed text information according to a pre-stored semantic relation library and basic concept library to obtain a semantic matching result.
  4. The information prompting method according to claim 1, wherein the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x each comprise one residual module, and the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x are each connected to an adaptive global average pooling layer.
  5. The information prompting method according to claim 1, wherein the ResNet-10 model is a deep neural network model designed based on the AM-Softmax loss function, and when the margin factor of the AM-Softmax loss function is 0.2, the optimal decision boundary of the gender recognition deep neural network model is obtained.
  6. The information prompting method according to claim 1, wherein the method further comprises a step of pre-training the gender recognition deep neural network model, the step comprising:
    expanding the neutral voice data to obtain training data; and
    training the deep neural network model with the expanded training data to obtain the gender recognition deep neural network model.
  7. The information prompting method of claim 6, wherein expanding the neutral voice data to obtain training data comprises:
    superimposing noise on the collected neutral voice;
    obtaining a spectrogram of the neutral voice after the noise is superimposed;
    shuffling and recombining the spectrogram along the time axis.
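The three augmentation steps of claim 7 can be sketched end to end. The spectrogram here is a toy stand-in (fixed-length frames of the raw waveform); a real implementation would apply an STFT per frame, and the noise amplitude is an assumed parameter:

```python
import random

def superimpose_noise(samples, amplitude=0.05, rng=None):
    """Step 1: add uniform white noise to a waveform.

    `amplitude` is an assumed noise level, unspecified by the claim.
    """
    rng = rng or random.Random(0)
    return [x + rng.uniform(-amplitude, amplitude) for x in samples]

def spectrogram(samples, frame_len=4):
    """Step 2 (toy stand-in): split the waveform into fixed-length time
    frames. A real implementation would compute an STFT per frame."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def shuffle_time(frames, rng=None):
    """Step 3: shuffle and recombine the frames along the time axis."""
    rng = rng or random.Random(0)
    out = frames[:]
    rng.shuffle(out)
    return out
```

The shuffle preserves every frame's content while destroying temporal order, so each neutral-voice recording yields several distinct training spectrograms.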
  8. An information prompting apparatus, wherein the apparatus comprises:
    a collection module, configured to collect voice information;
    a confirmation module, configured to confirm the category of the voice information, wherein the category of the voice information comprises a male voice, a female voice and a neutral voice;
    a recognition module, configured to recognize the semantic information corresponding to the voice information when the category of the voice information is confirmed to be a male voice or a female voice;
    a prompt module, configured to give a prompt according to the recognized semantic information and the confirmed category;
    a processing module, configured to, when the category of the voice information is confirmed to be a neutral voice, recognize the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model comprises the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x and a fully connected layer, ten layers in total.
  9. An electronic device, wherein the electronic device comprises a processor configured to execute computer-readable instructions stored in a memory to implement the following steps:
    a collection step: collecting voice information;
    a confirmation step: confirming the category of the voice information, wherein the category of the voice information comprises a male voice, a female voice and a neutral voice;
    a recognition step: when the category of the voice information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the voice information;
    a prompt step: giving a prompt according to the recognized semantic information and the confirmed category;
    a processing step: when the category of the voice information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model, and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model comprises the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x and a fully connected layer, ten layers in total.
  10. The electronic device of claim 9, wherein when the processor executes the computer-readable instructions to confirm the category of the voice information, the confirming specifically comprises:
    extracting the pitch frequency of the voice information;
    comparing the pitch frequency of the voice information with a first pitch frequency range, a second pitch frequency range and a third pitch frequency range;
    when the pitch frequency of the voice information falls within the first pitch frequency range, confirming that the category of the voice information is a male voice;
    when the pitch frequency of the voice information falls within the second pitch frequency range, confirming that the category of the voice information is a female voice;
    when the pitch frequency of the voice information falls within the third pitch frequency range, confirming that the category of the voice information is a neutral voice.
  11. The electronic device of claim 9, wherein when the processor executes the computer-readable instructions to recognize the semantic information corresponding to the voice information, the recognizing specifically comprises:
    converting the voice information into text information;
    preprocessing the text information, the preprocessing comprising word segmentation and noise-word removal;
    performing semantic matching on the preprocessed text information according to a pre-stored semantic relation library and a basic concept library to obtain a semantic matching result.
  12. The electronic device of claim 9, wherein the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x each comprise one residual module, and each of the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x is followed by an adaptive global average pooling layer.
  13. The electronic device of claim 9, wherein the ResNet-10 model is a deep neural network model designed based on the AM-Softmax loss function, and when the parameter factor of the AM-Softmax loss function is 0.2, the optimal decision boundary of the gender recognition deep neural network model is obtained.
  14. The electronic device of claim 9, wherein the processor further executes the computer-readable instructions to implement a step of pre-training the gender recognition deep neural network model, the step comprising:
    expanding the neutral voice data to obtain training data;
    training the deep neural network model according to the expanded training data to obtain the gender recognition deep neural network model.
  15. The electronic device of claim 14, wherein when the processor executes the computer-readable instructions to expand the neutral voice data to obtain training data, the expanding comprises:
    superimposing noise on the collected neutral voice;
    obtaining a spectrogram of the neutral voice after the noise is superimposed;
    shuffling and recombining the spectrogram along the time axis.
  16. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    a collection step: collecting voice information;
    a confirmation step: confirming the category of the voice information, wherein the category of the voice information comprises a male voice, a female voice and a neutral voice;
    a recognition step: when the category of the voice information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the voice information;
    a prompt step: giving a prompt according to the recognized semantic information and the confirmed category;
    a processing step: when the category of the voice information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model, and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model comprises the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x and a fully connected layer, ten layers in total.
  17. The computer-readable storage medium of claim 16, wherein when the computer-readable instructions are executed by the processor to confirm the category of the voice information, the confirming specifically comprises:
    extracting the pitch frequency of the voice information;
    comparing the pitch frequency of the voice information with a first pitch frequency range, a second pitch frequency range and a third pitch frequency range;
    when the pitch frequency of the voice information falls within the first pitch frequency range, confirming that the category of the voice information is a male voice;
    when the pitch frequency of the voice information falls within the second pitch frequency range, confirming that the category of the voice information is a female voice;
    when the pitch frequency of the voice information falls within the third pitch frequency range, confirming that the category of the voice information is a neutral voice.
  18. The computer-readable storage medium of claim 16, wherein when the computer-readable instructions are executed by the processor to recognize the semantic information corresponding to the voice information, the recognizing specifically comprises:
    converting the voice information into text information;
    preprocessing the text information, the preprocessing comprising word segmentation and noise-word removal;
    performing semantic matching on the preprocessed text information according to a pre-stored semantic relation library and a basic concept library to obtain a semantic matching result.
  19. The computer-readable storage medium of claim 16, wherein the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x each comprise one residual module, and each of the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x is followed by an adaptive global average pooling layer.
  20. The computer-readable storage medium of claim 16, wherein the ResNet-10 model is a deep neural network model designed based on the AM-Softmax loss function, and when the parameter factor of the AM-Softmax loss function is 0.2, the optimal decision boundary of the gender recognition deep neural network model is obtained.
PCT/CN2021/072860 2020-03-03 2021-01-20 Information prompting method and apparatus, electronic device, and medium WO2021175031A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010139944.2A CN111462755A (en) 2020-03-03 2020-03-03 Information prompting method and device, electronic equipment and medium
CN202010139944.2 2020-03-03

Publications (1)

Publication Number Publication Date
WO2021175031A1 true WO2021175031A1 (en) 2021-09-10

Family

ID=71678415

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/072860 WO2021175031A1 (en) 2020-03-03 2021-01-20 Information prompting method and apparatus, electronic device, and medium

Country Status (2)

Country Link
CN (1) CN111462755A (en)
WO (1) WO2021175031A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462755A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Information prompting method and device, electronic equipment and medium
CN112447188B (en) * 2020-11-18 2023-10-20 中国人民解放军陆军工程大学 Acoustic scene classification method based on improved softmax function
CN112382301B (en) * 2021-01-12 2021-05-14 北京快鱼电子股份公司 Noise-containing voice gender identification method and system based on lightweight neural network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103167174A (en) * 2013-02-25 2013-06-19 广东欧珀移动通信有限公司 Output method, device and mobile terminal of mobile terminal greetings
CN105185385A (en) * 2015-08-11 2015-12-23 东莞市凡豆信息科技有限公司 Voice fundamental tone frequency estimation method based on gender anticipation and multi-frequency-band parameter mapping
US20180308487A1 (en) * 2017-04-21 2018-10-25 Go-Vivace Inc. Dialogue System Incorporating Unique Speech to Text Conversion Method for Meaningful Dialogue Response
CN108962223A (en) * 2018-06-25 2018-12-07 厦门快商通信息技术有限公司 A kind of voice gender identification method, equipment and medium based on deep learning
CN109961794A (en) * 2019-01-14 2019-07-02 湘潭大学 A kind of layering method for distinguishing speek person of model-based clustering
JP6553015B2 (en) * 2016-11-15 2019-07-31 日本電信電話株式会社 Speaker attribute estimation system, learning device, estimation device, speaker attribute estimation method, and program
CN110136726A (en) * 2019-06-20 2019-08-16 厦门市美亚柏科信息股份有限公司 A kind of estimation method, device, system and the storage medium of voice gender
CN110428843A (en) * 2019-03-11 2019-11-08 杭州雄迈信息技术有限公司 A kind of voice gender identification deep learning method
CN111462755A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Information prompting method and device, electronic equipment and medium


Also Published As

Publication number Publication date
CN111462755A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
WO2021175031A1 (en) Information prompting method and apparatus, electronic device, and medium
CN110097894B (en) End-to-end speech emotion recognition method and system
CN107578775B (en) Multi-classification voice method based on deep neural network
CN111179975B (en) Voice endpoint detection method for emotion recognition, electronic device and storage medium
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
WO2020248376A1 (en) Emotion detection method and apparatus, electronic device, and storage medium
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
WO2019227672A1 (en) Voice separation model training method, two-speaker separation method and associated apparatus
Jing et al. Prominence features: Effective emotional features for speech emotion recognition
US9020822B2 (en) Emotion recognition using auditory attention cues extracted from users voice
CN109036381A (en) Speech processing method and device, computer device, and readable storage medium
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
WO2021047319A1 (en) Voice-based personal credit assessment method and apparatus, terminal and storage medium
CN109036437A (en) Accent recognition method and apparatus, computer device, and computer-readable storage medium
CN111696579B (en) Speech emotion recognition method, device, equipment and computer storage medium
CN108962243A (en) Arrival reminding method and device, mobile terminal, and computer-readable storage medium
US20230013370A1 (en) Generating audio waveforms using encoder and decoder neural networks
CN113327586A (en) Voice recognition method and device, electronic equipment and storage medium
El-Moneim et al. Text-dependent and text-independent speaker recognition of reverberant speech based on CNN
CN114927126A (en) Scheme output method, device and equipment based on semantic analysis and storage medium
CN111898363A (en) Method and device for compressing long and difficult sentences of text, computer equipment and storage medium
CN111985231B (en) Unsupervised role recognition method and device, electronic equipment and storage medium
Lokitha et al. Smart voice assistance for speech disabled and paralyzed people
Srinivasan et al. Multi-view representation based speech assisted system for people with neurological disorders
Yan et al. In-tunnel accident detection system based on the learning of accident sound

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21764808

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.01.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21764808

Country of ref document: EP

Kind code of ref document: A1