WO2021175031A1 - Information prompting method and apparatus, electronic device, and medium - Google Patents

Information prompting method and apparatus, electronic device, and medium Download PDF

Info

Publication number
WO2021175031A1
WO2021175031A1 (application PCT/CN2021/072860)
Authority
WO
WIPO (PCT)
Prior art keywords
information
voice
category
sound
neutral
Prior art date
Application number
PCT/CN2021/072860
Other languages
French (fr)
Chinese (zh)
Inventor
马坤
刘微微
赵之砚
施奕明
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司 filed Critical 深圳壹账通智能科技有限公司
Publication of WO2021175031A1 publication Critical patent/WO2021175031A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding
    • G10L15/26 - Speech to text systems
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/14 - Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques where the extracted parameters are spectral information of each sub-band
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques using neural networks

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to an information prompt method, device, electronic equipment, and medium.
  • Speech-based recognition of a speaker's biological attributes is an important area of artificial intelligence. Recognizing a speaker's gender from the voice is a natural ability for humans, but for artificial intelligence it is a demanding task at the frontier of the field. The voices of male and female speakers usually differ significantly. However, the inventor realized that for a more neutral voice it is difficult to accurately identify the speaker's gender without careful analysis, which poses a great challenge for artificial intelligence. If neutral voices can be accurately identified, the applicability of speech-based speaker-attribute recognition in actual business scenarios (such as intelligent customer service systems) can be greatly improved.
  • The first aspect of the present application provides an information prompt method, the method including: a collecting step of collecting sound information; a confirming step of confirming the category of the sound information, wherein the category of the sound information includes male voice, female voice, and neutral voice; a recognition step of, when the category of the sound information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step of giving a prompt based on the recognized semantic information and the confirmed category; and a processing step of, when the category of the sound information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is the residual neural network ResNet-10 model.
  • The ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x and a fully connected layer, ten layers in total.
  • the second aspect of the present application is an information prompting device, the device comprising: a collection module for collecting sound information; a confirmation module for confirming the type of the sound information, wherein the type of the sound information includes a male voice , Female voice and neutral voice; recognition module, used to identify the semantic information corresponding to the voice information when the category of the voice information is confirmed to be male voice or female voice; prompt module, used to identify the semantic information according to the recognized semantic information and A prompt is given for the confirmed category; the processing module is used to identify the gender corresponding to the neutral voice according to the pre-trained gender recognition deep neural network model when the category of the voice information is confirmed to be a neutral voice, wherein the The pre-trained deep neural network model for gender recognition is the residual neural network ResNet-10 model.
  • The ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x and a fully connected layer, ten layers in total.
  • A third aspect of the present application provides an electronic device, wherein the electronic device includes a processor configured to execute computer-readable instructions stored in a memory to implement the following steps: a collecting step of collecting sound information; a confirming step of confirming the category of the sound information, wherein the category of the sound information includes male voice, female voice, and neutral voice; a recognition step of, when the category of the sound information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step of giving a prompt based on the recognized semantic information and the confirmed category; and a processing step of, when the category of the sound information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x.
  • A fourth aspect of the present application provides a computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the following steps: a collecting step of collecting sound information; a confirming step of confirming the category of the sound information, wherein the category of the sound information includes male voice, female voice, and neutral voice; a recognition step of, when the category of the sound information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step of giving a prompt according to the recognized semantic information and the confirmed category; and a processing step of, when the category of the sound information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model including the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x.
  • The information prompt method, device, electronic equipment, and storage medium described in this application collect sound information; confirm the category of the sound information; when the category of the sound information is confirmed to be a male voice or a female voice, recognize the semantic information corresponding to the sound information; give prompts according to the recognized semantic information and the confirmed category; and, when the category of the sound information is confirmed to be a neutral voice, identify the gender corresponding to the neutral voice according to the pre-trained gender recognition deep neural network model and then identify the semantic information corresponding to the sound information, so that accurate instructions can be provided to the user based on the identified semantic information and gender information.
  • This application draws on mature ideas from face recognition and promotes a larger classification boundary, so that speech data with fuzzy classification boundaries, such as neutral voices, can obtain effective gender attribution through deep training. This greatly improves the accuracy of gender recognition and improves the applicability of speaker gender recognition in actual business scenarios and intelligent customer service systems.
  • FIG. 1 is a flowchart of an information prompt method provided by Embodiment 1 of the present application.
  • Fig. 2 is a functional block diagram of the information prompting device provided in the second embodiment of the present application.
  • Fig. 3 is a schematic diagram of an electronic device provided in a third embodiment of the present application.
  • the information prompt method in the embodiment of the present application is applied to electronic equipment.
  • the information prompting function provided by the method of this application can be directly integrated on the electronic device, or a client for implementing the method of this application can be installed.
  • The method provided in this application can also run on a server or other devices in the form of a Software Development Kit (SDK); an interface for the information prompting function is provided in the form of an SDK, and an electronic device or other device can implement the information prompting function through the provided interface.
  • FIG. 1 is a flowchart of the information prompt method provided in Embodiment 1 of the present application. According to different requirements, the execution sequence in the flowchart can be changed, and some steps can be omitted.
  • the information prompting method can be applied to electronic devices such as a robot, and the robot can be a robot that guides the user.
  • the robot is used in a hospital.
  • the robot can give accurate instructions based on the identified user's gender and semantic information.
  • the method includes:
  • Step S1 collecting sound information.
  • a microphone is installed on the electronic device, and sound information can be collected through the microphone.
  • Step S2 Confirm the category of the voice information, wherein the category of the voice information includes male voice, female voice, and neutral voice.
  • The pitch frequency range of most human voices is 50Hz-400Hz.
  • The pitch frequency range of male voices is 50Hz-200Hz, and the pitch frequency range of female voices is 150Hz-400Hz. Comparing the two ranges, it can be seen that they partially overlap in 150Hz-200Hz. Within this overlapping pitch frequency range it is more difficult to distinguish whether the speaker is male or female, so the sound corresponding to the overlapping pitch frequency range can be defined as a neutral voice.
  • confirming the category of the sound information includes:
  • the first pitch frequency range of the male voice is set to 50 Hz-150 Hz
  • the second pitch frequency range of the female voice is 200 Hz-400 Hz
  • the third pitch frequency range of the neutral voice is 150 Hz-200 Hz.
  • the category of the sound information is confirmed according to the pitch frequency of the sound information.
  • When the pitch frequency of the sound information falls within the first pitch frequency range (for example, 50Hz-150Hz), it is confirmed that the category of the sound information is a male voice, and the process goes to step S3; when the pitch frequency falls within the second pitch frequency range (for example, 200Hz-400Hz), it is confirmed that the category is a female voice, and the process goes to step S3; when the pitch frequency falls within the third pitch frequency range (for example, 150Hz-200Hz), it is confirmed that the category is a neutral voice, and the process goes to step S5.
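  • The category-confirmation rule above can be sketched as follows (a minimal hypothetical illustration; the function name and the handling of the 150Hz and 200Hz boundary values are assumptions, since the text does not state which range the endpoints belong to):

```python
def classify_voice(pitch_hz: float) -> str:
    """Map a pitch frequency to a voice category using the three ranges
    described above (boundary inclusivity is an assumption)."""
    if 50 <= pitch_hz < 150:
        return "male"       # first pitch frequency range
    if 150 <= pitch_hz <= 200:
        return "neutral"    # third (overlapping) pitch frequency range
    if 200 < pitch_hz <= 400:
        return "female"     # second pitch frequency range
    return "unknown"        # outside the typical 50Hz-400Hz human range

print(classify_voice(120))   # male
print(classify_voice(175))   # neutral
print(classify_voice(300))   # female
```

A "male" or "female" result would proceed to the recognition step (S3), while "neutral" would be routed to the deep-model processing step (S5).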
  • Step S3 When it is confirmed that the type of the sound information is a male voice or a female voice, the semantic information corresponding to the sound information is recognized.
  • The semantic information corresponding to the sound information can be identified through a natural language processing method, which specifically includes: first converting the sound information into text information through speech recognition.
  • The text information is then preprocessed, where the preprocessing includes word segmentation and noise-word removal.
  • The above-mentioned basic concept library includes basic concepts of meaning and extended concepts corresponding to those basic concepts.
  • The semantic relation library includes relations and fuzzy semantic relations associated with the above-mentioned basic concept library, a sentence-pattern relation template, and a common-sense library.
  • step S4 a prompt is given according to the recognized semantic information and the confirmed category.
  • For the robot, when the determined category is a male voice, it can be confirmed that the gender of the user is male; when the determined category is a female voice, it can be confirmed that the gender of the user is female.
  • The robot cannot give instructions based on the recognized gender alone; it needs to give instructions in combination with the recognized semantic information.
  • the priority of the identified semantic information is higher than the priority of the confirmed category.
  • For example, the robot can determine that the user's gender is male based on the user's voice information. In some cases the robot only needs to determine the user's gender from the voice query information and then give instructions based on the determined gender: when a male user gives the voice inquiry "Where is the restroom?", the robot determines from the voice inquiry information that the user is male and then prompts the location of the men's restroom.
  • Step S5 When it is confirmed that the category of the voice information is a neutral voice, the gender corresponding to the neutral voice is recognized according to the pre-trained gender recognition deep neural network model, and then the flow returns to step S3.
  • In this case, the semantic information corresponding to the sound information needs to be acquired first, and correct instructions are then given based on the semantic information and the user's gender.
  • the gender recognition deep neural network model is a residual neural network ResNet model.
  • the ResNet model is a deep neural network model designed based on the AM-Softmax loss function, wherein the optimal decision boundary of the gender recognition deep neural network model is obtained by adjusting the parameter factors of the AM-Softmax loss function.
  • To recognize the gender corresponding to a neutral voice, a binary-classification deep model needs to be designed.
  • A binary classification model usually uses a sigmoid or softmax loss function.
  • However, the sigmoid or softmax loss function does not work well on data with blurred class boundaries.
  • the AM-Softmax loss function is used in this application to design a deep neural network model.
  • the AM-Softmax loss function can promote a larger classification boundary between categories.
  • The AM-Softmax loss function is: L_AMS = -(1/n) · Σᵢ log( e^(s·(cos θ_yᵢ − m)) / ( e^(s·(cos θ_yᵢ − m)) + Σ_{j≠yᵢ} e^(s·cos θ_j) ) ), where n is the number of training samples, yᵢ is the true class of sample i, cos θ_j is the cosine similarity between the sample's feature vector and the weight vector of class j, s is a scale factor, and m is an additive margin.
  • By adjusting the scale factor and the margin factor of the AM-Softmax loss function, the optimal decision boundary of the gender recognition deep neural network model can be obtained.
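  • The effect of the additive margin can be sketched in a dependency-free implementation (hypothetical function name; the default values s = 30.0 and m = 0.35 are common AM-Softmax settings assumed for illustration, not values taken from this application):

```python
import math

def am_softmax_loss(cos_rows, labels, s=30.0, m=0.35):
    """AM-Softmax loss over cosine similarities.

    cos_rows: one list of per-class cosine similarities per sample.
    labels: the true class index of each sample.
    The additive margin m is subtracted from the target-class cosine
    before scaling by s, which pushes the classes further apart."""
    total = 0.0
    for cos_theta, y in zip(cos_rows, labels):
        logits = [s * c for c in cos_theta]
        logits[y] = s * (cos_theta[y] - m)
        mx = max(logits)  # numerically stable log-sum-exp
        log_denom = mx + math.log(sum(math.exp(z - mx) for z in logits))
        total += -(logits[y] - log_denom)
    return total / len(labels)

# A well-separated prediction incurs a far lower loss than an ambiguous
# one, so samples near the male/female boundary are penalized heavily.
print(am_softmax_loss([[0.95, 0.10]], [0]) < am_softmax_loss([[0.55, 0.50]], [0]))  # True
```

This is why the margin promotes a larger classification boundary between the male and female classes than plain softmax.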
  • the target categories are only male and female.
  • The solution space of the problem is therefore relatively simple, and directly using a deep model from the field of image classification is prone to overfitting. Therefore, in this application, in order to avoid over-fitting and improve the generalization ability of the deep model, an existing deep model for image recognition is adapted to obtain the ResNet-10 model. Specifically, on the basis of ResNet-18, the model depth and the number of residual layers are reduced to obtain the ResNet-10 model.
  • The ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x and a fully connected layer, ten layers in total.
  • The parameters of ResNet-10 in the present application are shown in Table 1.
  • The max pool in Table 1 is the pooling layer, and the stride of the first layer of each of Conv3_x, Conv4_x, and Conv5_x is 2.
  • Each convolutional layer is followed by an activation layer (ReLU) and a regularization layer (Batch Normalization), as shown in Table 1.
  • Conv2_x, Conv3_x, Conv4_x, and Conv5_x each include 1 residual module (×1 blocks).
  • The last layer of Conv5_x is connected to a fully connected layer, which outputs the category result corresponding to the sound information.
  • The convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x are connected to an adaptive global average pooling layer. Because the problem to be solved in this application is a classification problem with few classes (male and female), average pooling works better than maximum pooling. Adaptive global average pooling is used in this application to avoid feature-size mismatch: since the feature size of the speech spectrogram fluctuates greatly, adaptive global average pooling performs better.
  • The convolution kernel of the input part of the ResNet-10 model is a 3×3 convolution kernel. The 3×3 kernel effectively reduces the amount of computation and is better suited to the speech spectrogram. In this way, the model is less prone to overfitting, and the magnitude of the model parameters is reduced at the same time.
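  • The ten-layer count can be reconciled as follows (a sketch under the assumption, carried over from ResNet-18's basic block, that each of Conv2_x through Conv5_x contains one residual block of two 3×3 convolutions; the text itself only names the stages):

```python
# Hypothetical per-stage layer inventory for the ResNet-10 described above.
stage_layers = {
    "Conv_1": 1,    # initial 3x3 convolution
    "Conv2_x": 2,   # 1 residual block = 2 convolutional layers
    "Conv3_x": 2,   # first layer uses stride 2
    "Conv4_x": 2,   # first layer uses stride 2
    "Conv5_x": 2,   # first layer uses stride 2
}
conv_total = sum(stage_layers.values())
total = conv_total + 1  # plus the final fully connected layer
print(conv_total, total)  # 9 10
```

Halving the residual blocks per stage relative to ResNet-18 (which has two per stage, 18 layers in total) yields this 10-layer variant.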
  • The pre-training method of the gender recognition deep neural network model includes the following steps:
  • the characteristic parameters corresponding to the male neutral voice and the female neutral voice include the Mel frequency cepstrum coefficient of the sound signal.
  • Mel-Frequency Cepstral Coefficient (MFCC) analysis is based on the auditory characteristics of the human ear: because the pitch perceived by the human ear is not linearly proportional to the frequency of the sound, the Mel frequency scale better matches the hearing characteristics of the human ear.
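  • This nonlinear relationship can be illustrated with the commonly used Hz-to-mel conversion (a sketch; this particular 2595·log10 formula is a standard choice in MFCC pipelines, assumed here rather than spelled out in the text):

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale (O'Shaughnessy's formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Equal steps in Hz shrink on the mel scale as frequency rises,
# mirroring the ear's reduced resolution at high frequencies.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(4100.0) - hz_to_mel(4000.0)
print(low_step > high_step)  # True
```

MFCC features are computed on this perceptually warped frequency axis, which is why they suit speaker-attribute recognition.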
  • The sample data is divided into a training set and a validation set according to a first preset ratio (for example, 70% of the samples are used for training).
  • using the training set to train the neural network model may further include: deploying the training of the deep neural network model on multiple graphics processing units (GPUs) for distributed training.
  • Deploying the training of the model on multiple graphics processors for distributed training can shorten the training time of the model and accelerate its convergence.
  • When the accuracy rate is greater than or equal to a preset accuracy rate, the training ends, and the trained deep neural network model is used as a classifier to identify the gender of the user corresponding to the current neutral voice; when the accuracy rate is less than the preset accuracy rate, the number of samples is increased and the deep neural network model is retrained until the accuracy rate is greater than or equal to the preset accuracy rate.
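  • The train-evaluate-expand cycle above can be sketched as follows (all function names are hypothetical, and target_acc and max_rounds are illustrative placeholders, since the text does not fix concrete values):

```python
def train_until_accurate(train_fn, eval_fn, expand_fn, samples,
                         target_acc=0.95, max_rounds=5):
    """Train, check validation accuracy, and expand the sample set and
    retrain while the accuracy stays below the preset threshold."""
    model = None
    for _ in range(max_rounds):
        model = train_fn(samples)
        if eval_fn(model) >= target_acc:
            break  # accuracy reached: use the model as the classifier
        samples = expand_fn(samples)  # e.g. via data augmentation
    return model

# Toy stand-ins: "training" just records the sample count and accuracy
# grows with it, so the loop expands the data twice before stopping.
model = train_until_accurate(lambda s: s,
                             lambda m: min(1.0, m / 100),
                             lambda s: s * 2,
                             30)
print(model)  # 120
```

In the application's setting, expand_fn would correspond to the neutral-voice data-enhancement procedure described next.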
  • the method for expanding the neutral sound is: superimposing noise on the collected neutral sound, obtaining a spectrogram of the neutral sound after superimposing the noise, and shuffling and recombining the spectrogram in the time direction.
  • The training data of neutral voices is expanded by data enhancement technology, because neutral voices are relatively rare; in order to train the deep neural network model, the collected neutral voices need to be expanded.
  • Superimposing noise on the collected neutral voice includes superimposing white noise on the collected neutral voice and/or mixing environmental noise into the collected neutral voice.
  • new_signal = 0.9 * original_signal + 0.1 * white_noise().
  • the real environmental noise may be noise collected from parks, bus stops, stadiums, coffee shops and other venues.
  • The neutral voice with superimposed noise is subjected to short-time Fourier transform processing to obtain a spectrogram, and the spectrogram is shuffled and recombined in the time direction to obtain training data.
  • For example, in the time direction of the spectrogram, the spectrogram corresponding to the neutral voice is cropped according to a fixed speech-frame-sequence length (such as 64 frames) to obtain speech fragments 64 frames long, and the speech fragments are then randomly recombined. For example, a 640-frame neutral voice spectrogram is cropped to obtain ten 64-frame speech fragments, and three of the ten fragments are randomly selected and spliced in sequence to obtain a new sound signal.
  • The method of randomly cropping and splicing the spectrogram is used here to expand the voice data. In this way, relatively large-scale neutral voice data is obtained, and the training data is expanded by dividing it into the corresponding gender data groups according to their labels.
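  • The two augmentation steps above can be sketched together (a dependency-free sketch with hypothetical names; for simplicity it splices fixed-length fragments of the noise-mixed signal directly, whereas the described method crops the spectrogram along the time axis after a short-time Fourier transform):

```python
import random

def augment_neutral(signal, noise, frame_len=64, pick=3, seed=0):
    """Mix noise at the 0.9/0.1 ratio, cut the result into fixed-length
    fragments, and randomly splice a few fragments into a new sample."""
    mixed = [0.9 * x + 0.1 * n for x, n in zip(signal, noise)]
    fragments = [mixed[i:i + frame_len]
                 for i in range(0, len(mixed) - frame_len + 1, frame_len)]
    rng = random.Random(seed)
    chosen = rng.sample(fragments, pick)  # e.g. 3 of the 10 fragments
    return [v for frag in chosen for v in frag]

new_sample = augment_neutral([1.0] * 640, [0.5] * 640)
print(len(new_sample))  # 3 fragments x 64 frames = 192 values
```

Repeating this with different random selections yields many distinct training samples from one collected neutral-voice recording.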
  • The information prompting method includes: collecting sound information; confirming the category of the sound information, wherein the category of the sound information includes male voice, female voice, and neutral voice; when the category of the sound information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the sound information; giving prompts according to the recognized semantic information and the confirmed category; and, when the category of the sound information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to the pre-trained gender recognition deep neural network model. This application can identify the neutral voice in the collected sound information and then identify the gender corresponding to the neutral voice, so as to provide more accurate instructions.
  • In this way, the user's voice feature space has a better classification boundary.
  • The information prompt method proposed in this application draws on mature ideas from face recognition and promotes the expansion of the classification boundary, so that voice data with fuzzy classification boundaries, such as neutral voices, can obtain effective gender attribution through deep training. This greatly improves the accuracy of gender recognition and improves the applicability of speaker gender recognition in actual business scenarios and intelligent customer service systems.
  • Fig. 2 is a diagram of the functional modules in a preferred embodiment of the information prompting device of the present application.
  • the information prompting device 20 runs in an electronic device.
  • the information prompting device 20 may include multiple functional modules composed of program code segments.
  • the program code of each program segment in the information prompt device 20 can be stored in a memory and executed by at least one processor to perform an information prompt function.
  • the information prompting device 20 can be divided into multiple functional modules according to the functions it performs.
  • the functional modules may include: an acquisition module 201, a confirmation module 202, an identification module 203, a prompt module 204, and a processing module 205.
  • A module referred to in the present application is a series of computer program segments that can be executed by at least one processor, that complete a fixed function, and that are stored in a memory. In some embodiments, the functions of each module will be detailed in subsequent embodiments.
  • the collection module 201 is used to collect sound information.
  • a microphone is installed on the robot, and sound information can be collected through the microphone.
  • the confirmation module 202 is used to confirm the category of the sound information, where the category of the sound information includes male voice, female voice, and neutral voice.
  • The pitch frequency range of most human voices is 50Hz-400Hz.
  • The pitch frequency range of male voices is 50Hz-200Hz, and the pitch frequency range of female voices is 150Hz-400Hz. Comparing the two ranges, it can be seen that they partially overlap in 150Hz-200Hz. Within this overlapping pitch frequency range it is more difficult to distinguish whether the speaker is male or female, so the sound corresponding to the overlapping pitch frequency range can be defined as a neutral voice.
  • the confirmation module 202 confirming the category of the sound information includes:
  • When the pitch frequency of the sound information falls within the first pitch frequency range, the confirmation module 202 confirms that the category of the sound information is a male voice; when the pitch frequency falls within the second pitch frequency range, the confirmation module 202 confirms that the category is a female voice; when the pitch frequency falls within the third pitch frequency range, the confirmation module 202 confirms that the category is a neutral voice.
  • the first pitch frequency range of the male voice is set to 50 Hz-150 Hz
  • the second pitch frequency range of the female voice is 200 Hz-400 Hz
  • the third pitch frequency range of the neutral voice is 150 Hz-200 Hz.
  • the category of the sound information is confirmed according to the pitch frequency of the sound information.
  • When the pitch frequency of the sound information falls within the first pitch frequency range (for example, 50Hz-150Hz), the confirmation module 202 confirms that the category of the sound information is a male voice; when the pitch frequency falls within the second pitch frequency range (for example, 200Hz-400Hz), the confirmation module 202 confirms that the category is a female voice; when the pitch frequency falls within the third pitch frequency range (for example, 150Hz-200Hz), the confirmation module 202 confirms that the category is a neutral voice.
  • the recognition module 203 is configured to recognize the semantic information corresponding to the sound information when it is confirmed that the type of the sound information is a male voice or a female voice.
  • The semantic information corresponding to the sound information can be identified through a natural language processing method, which specifically includes: first converting the sound information into text information through speech recognition.
  • The text information is then preprocessed, where the preprocessing includes word segmentation and noise-word removal.
  • The above-mentioned basic concept library includes basic concepts of meaning and extended concepts corresponding to those basic concepts.
  • The semantic relation library includes relations and fuzzy semantic relations associated with the above-mentioned basic concept library, a sentence-pattern relation template, and a common-sense library.
  • the prompt module 204 is used to give a prompt according to the recognized semantic information and the confirmed category.
  • For the robot, when the determined category is a male voice, it can be confirmed that the gender of the user is male; when the determined category is a female voice, it can be confirmed that the gender of the user is female.
  • The robot cannot give instructions based on the recognized gender alone; it needs to give instructions in combination with the recognized semantic information.
  • the priority of the identified semantic information is higher than the priority of the confirmed category.
  • the robot can determine that the user's gender is male based on the user's voice information.
  • the robot only needs to determine the gender of the user based on the voice inquiry information, and then give instructions based on the determined gender. For example, when a male user gives the voice inquiry message "Where is the restroom", the robot determines from the voice inquiry information that the corresponding user gender is male, and then gives a prompt with the location of the men's restroom.
  • the processing module 205 is configured to recognize the gender corresponding to the neutral voice according to the pre-trained gender recognition deep neural network model when it is confirmed that the category of the voice information is a neutral voice.
  • the semantic information corresponding to the voice information needs to be acquired first, and then correct instructions are given based on the semantic information and the user's gender.
  • the gender recognition deep neural network model is a residual neural network ResNet model.
  • the ResNet model is a deep neural network model designed based on the AM-Softmax loss function, wherein the optimal decision boundary of the gender recognition deep neural network model is obtained by adjusting the parameter factors of the AM-Softmax loss function.
  • a binary classification deep model needs to be designed.
  • a binary classification model usually uses a sigmoid or softmax loss function.
  • however, the sigmoid or softmax loss function does not work well on data with blurred class boundaries.
  • the AM-Softmax loss function is used in this application to design a deep neural network model.
  • the AM-Softmax loss function can promote a larger classification boundary between categories.
  • the AM-Softmax loss function (in its standard published form) is: L = -(1/n) Σ_i log( e^{s·(cos θ_{y_i} - m)} / ( e^{s·(cos θ_{y_i} - m)} + Σ_{j≠y_i} e^{s·cos θ_j} ) ), where s is the scale factor, m is the additive margin, and θ_j is the angle between the feature vector and the weight vector of class j.
  • the optimal decision boundary of the gender recognition deep neural network model can be obtained.
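To make the bullets above concrete, the standard published form of the AM-Softmax loss can be sketched for the two-class (male/female) case as follows; the scale s=30 and margin m=0.35 are illustrative values of our own, not parameters taken from this application:

```python
import numpy as np

def am_softmax_loss(cosines, labels, s=30.0, m=0.35):
    """AM-Softmax loss on cosine similarities between L2-normalized
    features and L2-normalized class weight vectors.

    cosines: (n, num_classes) array of cos(theta_j)
    labels:  (n,) array of target class indices
    s: scale factor; m: additive margin subtracted from the
    target-class cosine, which pushes the classes further apart.
    """
    cosines = np.asarray(cosines, dtype=float)
    labels = np.asarray(labels)
    n = cosines.shape[0]
    idx = np.arange(n)
    logits = s * cosines
    # subtract the margin only on the target-class logit
    logits[idx, labels] = s * (cosines[idx, labels] - m)
    # numerically stable log-softmax cross-entropy
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[idx, labels].mean())

# The margin makes even a correctly classified sample pay a higher
# loss than plain softmax would, enforcing a wider decision boundary.
cos = [[0.9, 0.1]]          # one sample whose target class is 0
with_margin = am_softmax_loss(cos, [0], s=30.0, m=0.35)
no_margin = am_softmax_loss(cos, [0], s=30.0, m=0.0)
```

With m=0, the expression reduces to the ordinary scaled softmax cross-entropy, which is why adjusting m (and s) moves the decision boundary.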
  • the target categories are only male and female.
  • the solution space of the problem is relatively simple. If a depth model from the field of image classification is used directly, it is prone to overfitting. Therefore, in this application, in order to avoid the phenomenon of over-fitting and improve the generalization ability of the depth model, the existing depth model for recognizing pictures is improved to obtain the ResNet-10 model. Specifically, on the basis of ResNet-18, the depth of the model and the number of residual layers are further reduced to obtain the ResNet-10 model.
  • the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x, plus a fully connected layer, for a total of 10 layers.
  • the parameters of ResNet-10 in the present invention can be referred to in Table 1 above.
  • the max pool in Table 1 is the pooling layer, where the stride of the first layer of Conv3_x, Conv4_x, and Conv5_x is 2, and each convolutional layer is followed by a ReLU activation layer and a Batch Normalization regularization layer.
  • Conv2_x, Conv3_x, Conv4_x, and Conv5_x each include one residual module (×1 blocks).
  • the last layer of Conv5_x is connected to a fully connected layer, and the fully connected layer outputs the category result corresponding to the sound information.
  • the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x are each connected to an adaptive global average pooling layer. Because the problem to be solved in this application is a classification problem with few classes (male and female), average pooling works better than maximum pooling. In this application, adaptive global average pooling is used to avoid feature size mismatch: since the feature size of the speech spectrogram fluctuates greatly, adaptive global average pooling performs better.
  • the convolution kernel of the input part of the ResNet-10 model is a 3×3 convolution kernel.
  • the 3×3 convolution kernel effectively reduces the amount of calculation and at the same time adapts better to the speech spectrogram.
  • in this way, the model is less prone to overfitting, and the magnitude of the model parameters is reduced at the same time.
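As an illustration of the residual structure these layers rely on, here is a framework-agnostic toy sketch (our own, not the patent's code) of the identity-shortcut computation out = ReLU(F(x) + x) used by each residual module; real Conv2_x to Conv5_x blocks use 3×3 convolutions rather than dense matrices:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """A toy residual module: two weight layers F(x) = W2·relu(W1·x)
    plus the identity shortcut, as in ResNet. The shortcut lets the
    input pass through unchanged when F contributes nothing."""
    return relu(w2 @ relu(w1 @ x) + x)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, w1, w2)

# with all-zero weights the block reduces to relu(x): the shortcut
# guarantees information can flow through even very shallow models
identity = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

This pass-through property is what makes it safe to shrink ResNet-18 down to 10 layers without blocking gradient flow.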
  • the pre-training method of the gender recognition deep neural network model includes:
  • the training method of the neural network model includes the following steps:
  • the characteristic parameters corresponding to the male neutral voice and the female neutral voice include the Mel frequency cepstrum coefficients of the sound signal.
  • the Mel-Frequency Cepstral Coefficients (MFCC) analysis is based on the auditory characteristics of the human ear. Because the level of the sound heard by the human ear is not linearly proportional to the frequency of the sound, the Mel frequency scale is more in line with the hearing characteristics of the human ear.
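The mel scale mentioned above follows an established conversion formula (general background, not specific to this application); a minimal sketch of the Hz-to-mel mapping that underlies MFCC analysis:

```python
import math

def hz_to_mel(f_hz):
    """Standard HTK-style mel-scale conversion: roughly linear below
    1 kHz and logarithmic above, mimicking the non-linear way the
    human ear perceives pitch."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

The compressive behavior above 1 kHz is the reason the mel scale is said to match human hearing better than a linear frequency axis.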
  • the first preset ratio (for example, 70%)
  • using the training set to train the neural network model may further include: deploying the training of the deep neural network model on multiple graphics processing units (GPUs) for distributed training.
  • the training of the model can be deployed on multiple graphics processors for distributed training, which can shorten the training time of the model and accelerate the convergence of the model.
  • the training ends, and the trained deep neural network model is used as a classifier to identify the gender of the user corresponding to the current neutral voice; if the accuracy rate is less than the preset accuracy rate, the number of samples is increased to retrain the deep neural network model until the accuracy rate is greater than or equal to the preset accuracy rate.
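The retrain-until-accurate procedure in the bullet above can be sketched as a simple control loop; `train_fn`, `eval_fn`, and `add_samples_fn` are placeholder hooks of our own, not APIs from this application:

```python
def train_until_accurate(train_fn, eval_fn, add_samples_fn,
                         target_acc, max_rounds=10):
    """Train, evaluate on the test set, and if the accuracy is below
    the preset threshold, enlarge the sample set and retrain."""
    for _ in range(max_rounds):
        model = train_fn()
        if eval_fn(model) >= target_acc:
            return model          # accuracy reached: training ends
        add_samples_fn()          # otherwise expand the data and retry
    raise RuntimeError("target accuracy not reached")

# toy usage: the fake accuracy improves as samples are added
state = {"n": 100}
model = train_until_accurate(
    train_fn=lambda: {"n": state["n"]},
    eval_fn=lambda m: min(0.99, 0.5 + m["n"] / 1000),  # fake accuracy
    add_samples_fn=lambda: state.update(n=state["n"] + 200),
    target_acc=0.9,
)
```

In the toy run the loop retrains twice (100 → 300 → 500 samples) before the fake accuracy clears the 0.9 threshold.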
  • the method for expanding the neutral sound is: superimposing noise on the collected neutral sound, obtaining a spectrogram of the neutral sound after superimposing the noise, and shuffling and recombining the spectrogram in the time direction.
  • the training data of the neutral sound is expanded by data enhancement technology, because neutral sounds are relatively rare; in order to train the deep neural network model, the collected neutral sounds need to be expanded.
  • superimposing noise on the collected neutral sound includes: superimposing white noise on the collected neutral sound and/or mixing environmental noise into the collected neutral sound.
  • new_signal = 0.9*original_signal + 0.1*white_noise().
  • the real environmental noise may be noise collected from parks, bus stops, stadiums, coffee shops and other venues.
  • the neutral sound after superimposed noise is subjected to short-time Fourier transform processing to obtain a spectrogram, and the spectrogram is shuffled and reorganized in the time direction to obtain training data.
  • for example, in the time direction of the spectrogram, the spectrogram corresponding to the neutral sound is cropped according to a fixed speech frame sequence length (such as 64 frames) to obtain speech fragments 64 frames long, and the fragments are then randomly recombined. For example, a 640-frame neutral sound spectrogram is cropped to obtain ten 64-frame speech fragments, and three of the ten fragments are randomly selected and sequentially spliced to obtain a new sound signal.
  • the method of random cropping and splicing of the spectrogram is used here to expand the voice data. In this way, a relatively large-scale neutral voice data set is obtained, and the training data is expanded by dividing it into the corresponding gender data groups according to the different labels.
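A minimal numpy sketch of the augmentation steps described above; the 0.9/0.1 mixing ratio, the 64-frame fragment length, and the three-fragment splice come from the text, while the toy signal and spectrogram data here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)

def add_white_noise(signal, noise_ratio=0.1):
    """new_signal = 0.9*original_signal + 0.1*white_noise()"""
    noise = rng.standard_normal(signal.shape)
    return (1.0 - noise_ratio) * signal + noise_ratio * noise

def shuffle_spectrogram(spec, frag_len=64, n_frags=3):
    """Crop the spectrogram into fixed-length fragments along the
    time axis, then randomly pick n_frags of them and splice them in
    sequence to synthesize a new training sample."""
    n_freq, n_frames = spec.shape
    fragments = [spec[:, i:i + frag_len]
                 for i in range(0, n_frames - frag_len + 1, frag_len)]
    chosen = rng.choice(len(fragments), size=n_frags, replace=False)
    return np.concatenate([fragments[i] for i in chosen], axis=1)

signal = rng.standard_normal(16000)              # 1 s of toy audio
noisy = add_white_noise(signal)
spec = np.abs(rng.standard_normal((128, 640)))   # toy 640-frame spectrogram
new_spec = shuffle_spectrogram(spec)             # 3 fragments of 64 frames
```

A 640-frame spectrogram yields ten 64-frame fragments, and splicing three of them gives a new 192-frame sample, matching the worked example in the text.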
  • the neutral sound-based information prompting device 20 includes a collection module 201, a confirmation module 202, an identification module 203, a prompt module 204, and a processing module 205.
  • the collection module 201 is used to collect sound information;
  • the confirmation module 202 is used to confirm the type of the sound information, wherein the types of the sound information include male voice, female voice, and neutral voice;
  • the recognition module 203 is used to identify the semantic information corresponding to the sound information when it is confirmed that the category of the sound information is a male voice or a female voice;
  • the prompt module 204 is used to give a prompt according to the recognized semantic information and the confirmed type;
  • the processing module 205 is configured to identify the gender corresponding to the neutral voice according to the pre-trained gender recognition deep neural network model when it is confirmed that the category of the voice information is a neutral voice.
  • This application can identify the neutral voice in the collected voice information and then identify the gender corresponding to the neutral voice, so as to provide more accurate instructions. When recognizing the gender corresponding to the neutral voice, the deep neural network model ResNet-10 trained based on AM-Softmax gives the user's voice feature space a better classification boundary in user gender recognition.
  • the information prompting device proposed in this application draws on mature ideas from face recognition and promotes a larger classification boundary, so that voice data with fuzzy classification boundaries, such as neutral voices, can be effectively trained in depth to obtain effective gender attribution recognition, which greatly improves the accuracy of gender recognition and improves the application ability of speaker gender recognition in actual business scenarios and intelligent customer service systems.
  • the above-mentioned integrated unit implemented in the form of a software function module may be stored in a computer readable storage medium.
  • the above-mentioned software function module is stored in a storage medium and includes several instructions to make a computer device (which can be a personal computer, a dual-screen device, or a network device, etc.) or a processor execute part of the methods of the various embodiments of the present application.
  • FIG. 3 is a schematic diagram of the electronic device provided in the third embodiment of the application.
  • the electronic device 3 includes a memory 31, at least one processor 32, a computer program 33 stored in the memory 31 and running on the at least one processor 32, at least one communication bus 34 and a database 35.
  • the computer program 33 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 31 and executed by the at least one processor 32, To complete this application.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and the computer-readable instruction segments are used to describe the execution process of the computer program 33 in the electronic device 3.
  • the electronic device 3 may be a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA) and other devices installed with applications.
  • the schematic diagram in FIG. 3 is only an example of the electronic device 3 and does not constitute a limitation on the electronic device 3.
  • the electronic device 3 may also include input and output devices, network access devices, buses, and so on.
  • the at least one processor 32 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the processor 32 may be a microprocessor, or the processor 32 may also be any conventional processor, etc.
  • the processor 32 is the control center of the electronic device 3 and uses various interfaces and lines to connect the various parts of the entire electronic device 3.
  • the memory 31 may be used to store the computer program 33 and/or modules/units.
  • the processor 32 runs or executes the computer programs and/or modules/units stored in the memory 31 and calls the data stored in the memory 31, thereby realizing various functions of the electronic device 3.
  • the memory 31 may mainly include a program storage area and a data storage area.
  • the program storage area may store an operating system, an application program required by at least one function, etc.; the data storage area may store data created according to the use of the electronic device 3, etc.
  • the memory 31 may include a volatile memory, and may also include a non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a flash memory device, a high-speed random access memory, or another storage device.
  • the memory 31 stores program codes, and the at least one processor 32 can call the program codes stored in the memory 31 to perform related functions.
  • the modules (collection module 201, confirmation module 202, recognition module 203, prompt module 204, and processing module 205) described in FIG. 2 are program codes stored in the memory 31 and executed by the at least one processor 32, so as to realize the functions of the various modules to achieve the purpose of information prompting.
  • the database (Database) 35 is a warehouse built on the electronic device 3 for organizing, storing and managing data according to a data structure. Databases are usually divided into three types: hierarchical database, network database and relational database. In this embodiment, the database 35 is used to store collected sound information and the like.
  • the integrated module/unit of the electronic device 3 is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • this application implements all or part of the processes in the above-mentioned embodiments and methods, and can also be completed by instructing relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium.
  • the computer program includes computer-readable instruction code.
  • the computer-readable instruction code may be in the form of source code, object code, executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), etc.
  • the functional units in the various embodiments of the present application may be integrated in the same processing unit, or each unit may exist alone physically, or two or more units may be integrated in the same unit.
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.

Abstract

Provided are an information prompting method, an information prompting apparatus, an electronic device, and a storage medium. The method comprises: collecting voice information (S1); determining the category of the voice information (S2); when it is determined that the category of the voice information is a male voice or a female voice, identifying semantic information corresponding to the voice information (S3); giving a prompt according to the identified semantic information and the determined category (S4); and when it is determined that the category of the voice information is a neutral voice, identifying, according to a pre-trained gender identification deep neural network model, a gender corresponding to the neutral voice (S5), and then identifying semantic information corresponding to the voice information. The method can provide an accurate indication for a user according to the identified semantic information and gender information.

Description

Information prompt method, device, electronic equipment and medium

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 3, 2020, with application number 202010139944.2 and the invention title "Information Prompt Method, Apparatus, Electronic Equipment and Medium", the entire content of which is incorporated by reference in this application.
Technical field

This application relates to the field of artificial intelligence technology, and in particular to an information prompt method, device, electronic equipment, and medium.

Background art

Recognizing a speaker's biological attributes (such as gender) from speech is an important area in the field of artificial intelligence. Recognizing the gender of a speaker from the voice is a natural ability for humans, but for artificial intelligence it represents the highest level of progress. The voices of male and female speakers usually differ noticeably. However, the inventor realized that for a relatively neutral voice, it is difficult to accurately identify the gender of the speaker without careful discrimination. This is an even greater challenge for artificial intelligence. If neutral voices can be accurately identified, the application ability of speech-based recognition of speaker biological attributes in actual business scenarios (such as intelligent customer service systems) can be greatly improved.
Summary of the invention

In view of the above, it is necessary to propose an information prompt method, device, electronic device, and medium that, by identifying the neutral voice in the collected voice information and then identifying the gender corresponding to the neutral voice, provide users with more accurate instruction information.
The first aspect of the present application provides an information prompt method, the method including: a collecting step, collecting sound information; a confirming step, confirming the category of the sound information, wherein the categories of the sound information include male voice, female voice, and neutral voice; a recognition step, when it is confirmed that the category of the sound information is a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step, giving a prompt according to the recognized semantic information and the confirmed category; and a processing step, when it is confirmed that the category of the sound information is a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x plus a fully connected layer, 10 layers in total.

The second aspect of the present application provides an information prompt device, the device including: a collection module for collecting sound information; a confirmation module for confirming the category of the sound information, wherein the categories of the sound information include male voice, female voice, and neutral voice; a recognition module for recognizing the semantic information corresponding to the sound information when it is confirmed that the category of the sound information is a male voice or a female voice; a prompt module for giving a prompt according to the recognized semantic information and the confirmed category; and a processing module for recognizing, when it is confirmed that the category of the sound information is a neutral voice, the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x plus a fully connected layer, 10 layers in total.

The third aspect of the present application provides an electronic device, wherein the electronic device includes a processor configured to execute computer-readable instructions stored in a memory to implement the following steps: a collecting step, collecting sound information; a confirming step, confirming the category of the sound information, wherein the categories of the sound information include male voice, female voice, and neutral voice; a recognition step, when it is confirmed that the category of the sound information is a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step, giving a prompt according to the recognized semantic information and the confirmed category; and a processing step, when it is confirmed that the category of the sound information is a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x plus a fully connected layer, 10 layers in total.

The fourth aspect of the present application provides a computer-readable storage medium on which computer-readable instructions are stored, wherein the computer-readable instructions, when executed by a processor, implement the following steps: a collecting step, collecting sound information; a confirming step, confirming the category of the sound information, wherein the categories of the sound information include male voice, female voice, and neutral voice; a recognition step, when it is confirmed that the category of the sound information is a male voice or a female voice, recognizing the semantic information corresponding to the sound information; a prompt step, giving a prompt according to the recognized semantic information and the confirmed category; and a processing step, when it is confirmed that the category of the sound information is a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model includes the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x plus a fully connected layer, 10 layers in total.

The information prompt method, device, electronic equipment, and storage medium described in this application collect sound information; confirm the category of the sound information; when it is confirmed that the category of the sound information is a male voice or a female voice, recognize the semantic information corresponding to the sound information; give a prompt according to the recognized semantic information and the confirmed category; and, when it is confirmed that the category of the sound information is a neutral voice, recognize the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then recognize the semantic information corresponding to the sound information, so that accurate instructions can be provided to the user based on the recognized semantic information and gender information. When identifying the gender corresponding to a neutral voice, this application draws on mature ideas from face recognition and promotes a larger classification boundary, so that speech data with fuzzy classification boundaries, such as neutral voices, can be deeply trained to obtain effective gender attribution recognition, which greatly improves the accuracy of gender recognition and improves the application ability of speaker gender recognition in actual business scenarios and intelligent customer service systems.
Description of the drawings

In order to more clearly describe the technical solutions in the embodiments of the present application or the prior art, the following briefly introduces the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative work.

FIG. 1 is a flowchart of the information prompt method provided in Embodiment 1 of the present application.

FIG. 2 is a functional block diagram of the information prompt device provided in Embodiment 2 of the present application.

FIG. 3 is a schematic diagram of the electronic device provided in Embodiment 3 of the present application.

The following specific embodiments will further illustrate this application in conjunction with the above-mentioned drawings.
Detailed description of the embodiments

In order to understand the above objectives, features, and advantages of the application more clearly, the application is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the application and the features in the embodiments can be combined with each other if there is no conflict.

Many specific details are set forth in the following description to facilitate a full understanding of the present application. The described embodiments are only a part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of this application. The terms used in the description of the application herein are only for the purpose of describing specific embodiments and are not intended to limit the application.

The terms "first", "second", and "third" in the specification and claims of the present application and the above drawings are used to distinguish different objects, rather than to describe a specific sequence. In addition, the term "including" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or optionally also includes other steps or units inherent to these processes, methods, products, or devices.

The information prompt method in the embodiments of the present application is applied to an electronic device. For an electronic device that needs to perform information prompting, the information prompt function provided by the method of this application can be integrated directly on the electronic device, or a client for implementing the method of this application can be installed. For another example, the method provided in this application can also run on a server or other device in the form of a Software Development Kit (SDK), which provides an interface for the information prompt function; an electronic device or other device can then realize the information prompt function through the provided interface.
实施例一Example one
图1是本发明实施例一提供的信息提示方法的流程图。根据不同的需求,所述流程图中的执行顺序可以改变,某些步骤可以省略。FIG. 1 is a flowchart of an information prompt method provided in Embodiment 1 of the present invention. According to different requirements, the execution sequence in the flowchart can be changed, and some steps can be omitted.
在本实施方式中,所述信息提示方法可以应用于机器人等电子设备中,所述机器人可以是为用户指路的机器人。例如,所述机器人应用在医院中,当用户需要询问机器人医院的妇科门诊在哪里时,所述机器人可以根据识别的用户性别和语义信息给出准确的指示。或者当用户需要询问机器人某个公共场所的男洗手间或女洗手间在哪里时,所述机器人也可以根据识别的用户性别和语义信息给出准确的指示。所述方法包括:In this embodiment, the information prompting method can be applied to electronic devices such as a robot, and the robot can be a robot that guides the user. For example, the robot is used in a hospital. When the user needs to ask the gynecological clinic of the robot hospital, the robot can give accurate instructions based on the identified user's gender and semantic information. Or when the user needs to ask the robot where the men's restroom or the women's restroom is in a public place, the robot can also give accurate instructions based on the identified user's gender and semantic information. The method includes:
Step S1: collect voice information.
In this embodiment, a microphone is installed on the electronic device, and voice information can be collected through the microphone.
Step S2: determine the category of the voice information, where the category of the voice information is one of male voice, female voice, and neutral voice.
In the prior art, the pitch (fundamental) frequency of most human voices falls in the range 50 Hz-400 Hz. Normally, the pitch frequency of male voices falls in 50 Hz-200 Hz and that of female voices in 150 Hz-400 Hz. Comparing the male and female pitch frequency ranges shows that they partially overlap between 150 Hz and 200 Hz, and within this overlapping range it is difficult to tell whether the speaker is male or female. A voice whose pitch frequency falls in the overlapping range can therefore be defined as a neutral voice.
In this embodiment, determining the category of the voice information includes:
(1) extracting the pitch frequency of the voice information;
(2) comparing the pitch frequency of the voice information with a first pitch frequency range, a second pitch frequency range, and a third pitch frequency range;
(3) when the pitch frequency of the voice information falls within the first pitch frequency range, determining that the category of the voice information is male voice; when it falls within the second pitch frequency range, determining that the category is female voice; and when it falls within the third pitch frequency range, determining that the category is neutral voice.
Specifically, the first pitch frequency range (male voice) is set to 50 Hz-150 Hz, the second pitch frequency range (female voice) to 200 Hz-400 Hz, and the third pitch frequency range (neutral voice) to 150 Hz-200 Hz. In this embodiment, the category of the voice information is determined from its pitch frequency. When the pitch frequency falls within the first range (e.g., 50 Hz-150 Hz), the category is confirmed as male voice and the flow proceeds to step S3; when it falls within the second range (e.g., 200 Hz-400 Hz), the category is confirmed as female voice and the flow proceeds to step S3; when it falls within the third range (e.g., 150 Hz-200 Hz), the category is confirmed as neutral voice and the flow proceeds to step S5.
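Step S2 can be sketched as a simple range lookup. This is a minimal illustration assuming the pitch frequency (in Hz) has already been extracted from the signal; how the exact boundary values 150 Hz and 200 Hz are assigned is an assumption made here, since the filing only names the ranges.

```python
def classify_voice(pitch_hz):
    """Map a pitch frequency to 'male', 'female', or 'neutral'."""
    if 50 <= pitch_hz < 150:      # first range: male voice
        return "male"
    if 200 < pitch_hz <= 400:     # second range: female voice
        return "female"
    if 150 <= pitch_hz <= 200:    # third (overlapping) range: neutral voice
        return "neutral"
    return "unknown"              # outside the typical human pitch range

print(classify_voice(120))   # male
print(classify_voice(300))   # female
print(classify_voice(170))   # neutral
```

A "male" or "female" result would route the flow to step S3, and "neutral" to step S5.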
Step S3: when the category of the voice information is confirmed as male voice or female voice, recognize the semantic information corresponding to the voice information.
In this embodiment, the semantic information corresponding to the voice information can be recognized by natural language processing, which specifically includes:
converting the voice information into text information;
preprocessing the text information, the preprocessing including word segmentation and noise-word removal; and
performing semantic matching on the preprocessed text information against a pre-stored semantic relation library and basic concept library to obtain a semantic matching result.
In this embodiment, the basic concept library includes basic concepts of meanings and extended concepts corresponding to those basic concepts. The semantic relation library includes relations associated with the basic concept library, sentence-pattern relation templates, and a common-sense library, as well as fuzzy semantic relations.
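The semantic-matching step can be illustrated with a toy concept library. The dictionary below is an assumption for illustration only (the filing does not specify the library contents): each basic concept is listed with extended surface forms, and matching is reduced to a substring lookup over the recognized text, with more specific concepts checked first.

```python
# Hypothetical concept library: basic concept -> extended surface forms.
# Ordered from most to least specific so that "women's restroom" is not
# swallowed by the generic "restroom" entry.
CONCEPT_LIBRARY = {
    "women_restroom": ["女洗手间", "女厕所"],
    "men_restroom":   ["男洗手间", "男厕所"],
    "restroom":       ["洗手间", "厕所"],
}

def match_intent(text):
    """Return the first (most specific) concept whose form appears in text."""
    for concept, forms in CONCEPT_LIBRARY.items():
        if any(form in text for form in forms):
            return concept
    return None

print(match_intent("请问女洗手间在哪"))  # women_restroom
print(match_intent("请问洗手间在哪"))    # restroom
```

A production system would use the word segmentation, noise-word removal, and relation templates described above rather than raw substring search.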
Step S4: give a prompt according to the recognized semantic information and the confirmed category.
In this embodiment, when the confirmed category is male voice, the gender of the user can be determined to be male; when the confirmed category is female voice, the gender of the user can be determined to be female. However, the user may not be asking for a direction that depends on his or her own gender, so the robot cannot give an instruction based on the recognized gender alone; it must give the instruction according to the recognized semantic information as well.
Preferably, the recognized semantic information has a higher priority than the confirmed category.
For example, suppose a male user needs to ask, on behalf of a woman accompanying him, where the women's restroom is, and gives the voice query "Where is the women's restroom?". The robot can determine from the voice information that the user is male, but it should not give the location of the men's restroom; instead, it must use the semantic information corresponding to the user's voice query and give the location of the women's restroom. In this way the user receives a more accurate prompt and the user experience is improved.
For another example, when the user only gives the voice query "Where is the restroom?", the robot merely needs to determine the user's gender from the voice information and give a direction accordingly. For instance, when a male user asks "Where is the restroom?", the robot determines from the voice query that the user is male and gives the location of the men's restroom.
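The step-S4 decision rule in the two examples above can be sketched as follows. The intent names and prompt strings are illustrative assumptions: explicit semantics always win, and the inferred gender is only a fallback for a generic query.

```python
def give_prompt(intent, gender):
    """Choose a direction: semantic information first, gender second."""
    if intent == "women_restroom":   # explicit semantic request
        return "women's restroom"
    if intent == "men_restroom":
        return "men's restroom"
    if intent == "restroom":         # generic query: fall back to gender
        return "men's restroom" if gender == "male" else "women's restroom"
    return "unrecognized request"

print(give_prompt("women_restroom", "male"))  # women's restroom
print(give_prompt("restroom", "male"))        # men's restroom
```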
Step S5: when the category of the voice information is confirmed as neutral voice, recognize the gender corresponding to the neutral voice according to a pre-trained gender-recognition deep neural network model, after which the flow returns to step S3.
In this embodiment, after the user's gender has been recognized by the gender-recognition deep neural network model, in order to avoid giving a wrong instruction based on gender alone, the semantic information corresponding to the voice information must first be acquired, and a correct instruction is then given according to the semantic information and the user's gender.
In this embodiment, the gender-recognition deep neural network model is a residual neural network (ResNet) model. The ResNet model is a deep neural network model designed on the basis of the AM-Softmax loss function, where the optimal decision boundary of the gender-recognition deep neural network model is obtained by adjusting the margin factor of the AM-Softmax loss function.
In this embodiment, gender recognition calls for a binary-classification deep model. Binary classifiers usually use a sigmoid or softmax loss function; however, sigmoid and softmax losses perform poorly on data with blurred class boundaries. In order to classify gender accurately from neutral voices, enlarge the inter-class distance, and reduce the intra-class distance, the present application uses the AM-Softmax loss function to design the deep neural network model. The AM-Softmax loss function pushes the classification boundary between categories wider.
The AM-Softmax loss function (reproduced here in its standard form; the formula appears as an image in the original filing) is:

L_AMS = -(1/n) Σ_{i=1}^{n} log [ e^{s(cos θ_{y_i} - m)} / ( e^{s(cos θ_{y_i} - m)} + Σ_{j=1, j≠y_i}^{c} e^{s cos θ_j} ) ]

where s = 30 and m = 0.2. To improve convergence speed, a hyperparameter s is introduced and set to the fixed value 30.
When the margin factor m of the AM-Softmax loss function takes the value 0.2, the optimal decision boundary of the gender-recognition deep neural network model is obtained.
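A numeric sketch of the AM-Softmax loss for the two-class (male/female) case, with s = 30 and m = 0.2 as stated above. `cosines` holds the cosine similarity between a sample's embedding and each class weight vector, and `target` is the index of the true class; this stands in for the model's final layer and is not the filing's actual implementation.

```python
import math

def am_softmax_loss(cosines, target, s=30.0, m=0.2):
    """AM-Softmax: subtract margin m from the target cosine, scale by s."""
    logits = [s * (c - m) if j == target else s * c
              for j, c in enumerate(cosines)]
    log_sum = math.log(sum(math.exp(z) for z in logits))
    return log_sum - logits[target]   # -log softmax of the target logit

# A confidently classified sample yields a near-zero loss; a sample on
# the class boundary (equal cosines) is still penalized by the margin.
print(am_softmax_loss([0.9, 0.1], target=0))
print(am_softmax_loss([0.5, 0.5], target=0))
```

The margin m is what forces the inter-class gap: even when both cosines are equal, the target class must beat the other by m to drive the loss toward zero.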
In this embodiment, since gender recognition is a binary-classification problem with only two target categories (male and female), the solution space is relatively simple compared with image classification. Directly reusing a deep model from the image-classification field is therefore prone to overfitting. In the present application, to avoid overfitting and improve the generalization ability of the deep model, an existing image-recognition deep model is modified to obtain a ResNet-10 model. Specifically, starting from ResNet-18, the model depth is reduced again and the number of residual layers is decreased to obtain the ResNet-10 model.
In this embodiment, the ResNet-10 model comprises ten layers in total: the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x, plus a fully connected layer. The parameters of ResNet-10 in the present invention are shown in Table 1, in which max pool denotes a pooling layer. The stride of the first layer of each of Conv3_x, Conv4_x, and Conv5_x is 2, and every convolutional layer is followed by a ReLU activation layer and a Batch Normalization regularization layer. In Table 1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x each include one residual block (×1 blocks). To realize the binary-classification task of the gender-recognition model of the present invention, the last layer of Conv5_x is connected to a fully connected layer, which outputs the category result corresponding to the voice information.
In this embodiment, the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x are each followed by adaptive global average pooling. Because the problem solved by this application has few classes (male and female), average pooling works better than max pooling. Adaptive global average pooling is adopted to avoid feature-size mismatches: since the feature size of the speech spectrogram fluctuates considerably, adaptive global average pooling performs better.
Table 1 (ResNet-10 layer parameters; rendered as an image in the original filing)
In this embodiment, the convolution kernel of the input part of the ResNet-10 model is a 3×3 kernel. A 3×3 kernel effectively reduces the amount of computation while adapting better to the speech spectrogram. In addition, reducing the feature-map size of each residual layer of the ResNet-10 model makes the model less likely to overfit and reduces the magnitude of the model parameters.
In this embodiment, the method of pre-training the gender-recognition deep neural network model includes:
(1) augmenting the neutral voices to obtain training data; and
(2) training the deep neural network model on the augmented training data to obtain the gender-recognition deep neural network model.
In this embodiment, the training method of the neural network model includes the following steps:
(a) Obtain the characteristic parameters corresponding to the neutral voices and annotate each characteristic parameter with a category, so that the characteristic parameter carries a category label.
For example, select the characteristic parameters corresponding to 500 male neutral voices and 500 female neutral voices, and annotate each characteristic parameter with a category; "1" may be used as the parameter label for male neutral voices and "2" as the parameter label for female neutral voices.
In this embodiment, the characteristic parameters corresponding to the male and female neutral voices include the Mel-frequency cepstral coefficients (MFCCs) of the voice signal. MFCC analysis is based on the auditory characteristics of the human ear: because the perceived pitch of a sound is not linearly proportional to its frequency, the Mel frequency scale better matches the hearing characteristics of the human ear.
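The Mel scale mentioned above maps physical frequency (Hz) to a perceptual scale. The conversion below is the commonly used formula for computing MFCC filter banks; it is an assumption for illustration, since the filing does not spell out the conversion it uses.

```python
import math

def hz_to_mel(f_hz):
    """Common Hz-to-Mel conversion used when building MFCC filter banks."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Equal 100 Hz steps shrink on the Mel scale at high frequencies,
# matching the ear's reduced resolution there.
print(hz_to_mel(1000) - hz_to_mel(900))
print(hz_to_mel(8000) - hz_to_mel(7900))
```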
(b) Randomly divide the characteristic parameters into a training set of a first preset proportion and a validation set of a second preset proportion.
First, distribute the training samples of neutral voices of different genders into different folders, e.g., the male neutral-voice samples into a first folder and the female neutral-voice samples into a second folder. Then, from each folder, extract a first preset proportion (e.g., 70%) of the samples to form the overall training set, whose purpose is to train the deep neural network model; the remaining second preset proportion (e.g., 30%) of the samples from each folder is taken as the test set, whose purpose is to test the classification performance of the deep neural network model.
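The per-folder 70/30 split described above amounts to a stratified split: samples are grouped by gender label first and split within each group, so both classes keep the same ratio. The file names and labels below are illustrative stand-ins.

```python
import random

def stratified_split(samples_by_label, train_ratio=0.7, seed=42):
    """Split each label group separately, preserving the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label, samples in samples_by_label.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        train += [(s, label) for s in shuffled[:cut]]
        test  += [(s, label) for s in shuffled[cut:]]
    return train, test

data = {"male":   [f"m_{i}.wav" for i in range(500)],
        "female": [f"f_{i}.wav" for i in range(500)]}
train, test = stratified_split(data)
print(len(train), len(test))   # 700 300
```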
(c) Train the deep neural network model on the training set.
The process of feeding the training set into the constructed neural network model (e.g., ResNet-10) for training can be implemented with existing techniques and is not detailed here. In some embodiments, training the neural network model on the training set may further include deploying the training of the deep neural network model on multiple graphics processing units (GPUs) for distributed training. For example, using TensorFlow's distributed-training mechanism, the training of the model can be deployed across multiple GPUs, which shortens training time and accelerates model convergence.
(d) Verify the accuracy of the trained deep neural network model on the validation set.
In this embodiment, if the accuracy is greater than or equal to a preset accuracy, training ends and the trained deep neural network model is used as a classifier to recognize the gender of the user corresponding to the current neutral voice; if the accuracy is less than the preset accuracy, the number of samples is increased and the deep neural network model is retrained until the accuracy is greater than or equal to the preset accuracy.
In this embodiment, the method of augmenting the neutral voices is: superimpose noise on the collected neutral voices, obtain the spectrogram of the noise-superimposed neutral voices, and shuffle and recombine the spectrogram along the time axis.
In this embodiment, the neutral-voice training data is expanded through data-augmentation techniques, because neutral voices are relatively rare, while training the deep neural network model requires expanding the collected neutral-voice samples.
Specifically, superimposing noise on the collected neutral voices includes superimposing white noise on the collected neutral voices and/or mixing environmental noise into the collected neutral voices.
For example, Gaussian white noise is linearly superimposed on a collected neutral voice (original_signal) to obtain a new voice signal: new_signal = 0.9*original_signal + 0.1*white_noise().
For example, mixing real environmental noise into the collected neutral voice may consist of replacing the above Gaussian white noise with collected real environmental noise to obtain a new voice signal: new_signal = 0.9*original_signal + 0.1*real_noise(). The real environmental noise may be noise collected in places such as parks, bus stops, stadiums, and coffee shops.
The noise-superimposed neutral voice is processed by a short-time Fourier transform to obtain a spectrogram, and the spectrogram is shuffled and recombined along the time axis to obtain training data.
For example, along the time axis of the spectrogram, crop the spectrogram of the neutral voice into segments of a fixed frame length (e.g., 64 frames) and then recombine the segments at random. For instance, a 640-frame neutral-voice spectrogram is cropped into ten 64-frame speech segments, three of the ten segments are randomly selected, and they are spliced in sequence to obtain a new voice signal. Through these two steps, a sufficient amount of effective, high-quality neutral-voice data can be generated.
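The two augmentation steps above can be sketched in pure Python. This is a list-based illustration of the mechanics only, assuming a spectrogram is represented as a sequence of frames; real code would operate on arrays produced by a short-time Fourier transform.

```python
import random

def add_noise(signal, noise, w=0.9):
    """Linear noise superposition: new = w*original + (1-w)*noise."""
    return [w * s + (1 - w) * n for s, n in zip(signal, noise)]

def shuffle_spectrogram(frames, seg_len=64, n_pick=3, seed=0):
    """Crop into seg_len-frame segments and splice n_pick random segments."""
    segments = [frames[i:i + seg_len]
                for i in range(0, len(frames) - seg_len + 1, seg_len)]
    picked = random.Random(seed).sample(segments, n_pick)
    return [frame for seg in picked for frame in seg]

spec = list(range(640))            # stand-in for 640 spectrogram frames
new_spec = shuffle_spectrogram(spec)
print(len(new_spec))               # 192 (= 3 segments of 64 frames)
```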
Since the present application needs to design a deep model that recognizes the user's gender from voice spectrograms, random cropping and stacking of spectrograms is used here to expand the voice data. In this way a reasonably sized set of neutral-voice data is obtained and assigned, according to its labels, to the corresponding gender data group, thereby expanding the training data.
In summary, the information prompting method provided by the present invention includes: collecting voice information; determining the category of the voice information, where the category is one of male voice, female voice, and neutral voice; when the category is confirmed as male voice or female voice, recognizing the semantic information corresponding to the voice information; giving a prompt according to the recognized semantic information and the confirmed category; and, when the category is confirmed as neutral voice, recognizing the gender corresponding to the neutral voice with a pre-trained gender-recognition deep neural network model. The present application can identify the neutral voices in the collected voice information and then recognize the gender corresponding to each neutral voice, thereby providing more accurate directions. Moreover, when recognizing the gender corresponding to a neutral voice, the ResNet-10 deep neural network model trained with AM-Softmax gives the user's voice feature space a better classification boundary for gender recognition. Drawing on mature ideas from face recognition, the proposed information prompting method pushes the classification boundary wider, so that voice data with blurred class boundaries, such as neutral voices, can be effectively assigned a gender through deep training. This greatly improves the accuracy of gender recognition and enhances the applicability of speaker gender recognition in practical business scenarios and intelligent customer-service systems.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Those of ordinary skill in the art may make improvements without departing from the inventive concept of the present invention, and all such improvements fall within the protection scope of the present invention.
The functional modules and the hardware structure of an electronic device that implements the above information prompting are described below with reference to FIG. 2 and FIG. 3, respectively.
Embodiment 2
FIG. 2 is a functional module diagram of a preferred embodiment of the information prompting apparatus of the present invention.
In some embodiments, the information prompting apparatus 20 runs in an electronic device. The information prompting apparatus 20 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the information prompting apparatus 20 can be stored in a memory and executed by at least one processor to perform the information prompting function.
In this embodiment, the information prompting apparatus 20 can be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a collection module 201, a confirmation module 202, a recognition module 203, a prompt module 204, and a processing module 205. A module in the present invention refers to a series of computer program segments that can be executed by at least one processor to complete a fixed function and that are stored in a memory. In some embodiments, the functions of the modules are detailed in subsequent embodiments.
The collection module 201 is used to collect voice information.
In this embodiment, a microphone is installed on the robot, and voice information can be collected through the microphone.
The confirmation module 202 is used to determine the category of the voice information, where the category of the voice information is one of male voice, female voice, and neutral voice.
In the prior art, the pitch frequency of most human voices falls in the range 50 Hz-400 Hz. Normally, the pitch frequency of male voices falls in 50 Hz-200 Hz and that of female voices in 150 Hz-400 Hz. Comparing the male and female pitch frequency ranges shows that they partially overlap between 150 Hz and 200 Hz, and within this overlapping range it is difficult to tell whether the speaker is male or female. A voice whose pitch frequency falls in the overlapping range can therefore be defined as a neutral voice.
In this embodiment, the confirmation module 202 determining the category of the voice information includes:
(1) extracting the pitch frequency of the voice information;
(2) comparing the pitch frequency of the voice information with a first pitch frequency range, a second pitch frequency range, and a third pitch frequency range;
(3) when the pitch frequency of the voice information falls within the first pitch frequency range, the confirmation module 202 determines that the category of the voice information is male voice; when it falls within the second pitch frequency range, the confirmation module 202 determines that the category is female voice; and when it falls within the third pitch frequency range, the confirmation module 202 determines that the category is neutral voice.
Specifically, the first pitch frequency range (male voice) is set to 50 Hz-150 Hz, the second pitch frequency range (female voice) to 200 Hz-400 Hz, and the third pitch frequency range (neutral voice) to 150 Hz-200 Hz. In this embodiment, the category of the voice information is determined from its pitch frequency. When the pitch frequency falls within the first range (e.g., 50 Hz-150 Hz), the confirmation module 202 confirms the category as male voice; when it falls within the second range (e.g., 200 Hz-400 Hz), the confirmation module 202 confirms the category as female voice; and when it falls within the third range (e.g., 150 Hz-200 Hz), the confirmation module 202 confirms the category as neutral voice.
The recognition module 203 is used to recognize the semantic information corresponding to the voice information when the category of the voice information is confirmed as male voice or female voice.
In this embodiment, the semantic information corresponding to the voice information can be recognized by natural language processing, which specifically includes:
converting the voice information into text information;
preprocessing the text information, the preprocessing including word segmentation and noise-word removal; and
performing semantic matching on the preprocessed text information against a pre-stored semantic relation library and basic concept library to obtain a semantic matching result.
In this embodiment, the basic concept library includes basic concepts of meanings and extended concepts corresponding to those basic concepts. The semantic relation library includes relations associated with the basic concept library, sentence-pattern relation templates, and a common-sense library, as well as fuzzy semantic relations.
所述提示模块204用于根据识别的语义信息和确认的类别给出提示。The prompt module 204 is used to give a prompt according to the recognized semantic information and the confirmed category.
在本实施例中,当确定的类别为男性声音,可以确认用户的性别为男性。当确定的类别为女性声音,可以确认用户的性别为女性。此时,由于用户有可能并非需要根据自身性别得到机器人的指示。所以机器人还不能仅根据识别的性别给出指示,而是需要根据识别的语义信息来给出指示。In this embodiment, when the determined category is male voice, it can be confirmed that the gender of the user is male. When the determined category is female voice, it can be confirmed that the gender of the user is female. At this time, because the user may not need to get instructions from the robot based on his gender. Therefore, the robot can not only give instructions based on the recognized gender, but needs to give instructions based on the recognized semantic information.
优选地,所述识别的语义信息的优先级高于确认的类别的优先级。Preferably, the priority of the identified semantic information is higher than the priority of the confirmed category.
例如,当用户为男性,而该用户需要替他身边的女性询问女洗手间所在位置时。当男性用户给出语音询问信息“女洗手间在哪?”。此时,机器人可以根据用户的声音信息判定用户性别为男性。但是并不能给出男洗手间的位置的提示,而是需要根据用户给出的语音询问信息对应的语义信息来给出女洗手间所在位置的提示。由此,可以为用户提供更加准确的提示,提高用户体验。For example, when the user is a male, and the user needs to ask the women around him for the location of the female toilet. When a male user gives a voice asking message "Where is the female bathroom?". At this time, the robot can determine that the user's gender is male based on the user's voice information. However, it is not possible to give a reminder of the location of the men's restroom. Instead, it is necessary to give a reminder of the location of the women's restroom based on the semantic information corresponding to the voice inquiry information given by the user. As a result, more accurate prompts can be provided to users, and user experience can be improved.
For example, when the user only gives the voice inquiry "Where is the restroom?", the robot determines the user's gender from the voice information and gives an instruction according to the determined gender. For instance, when a male user asks "Where is the restroom?", the robot determines from the voice information that the user is male and gives the location of the men's restroom.
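The priority rule in these examples can be sketched as follows (a minimal illustration only; the function name, category strings, and keyword matching are hypothetical and not part of the disclosure):

```python
def choose_prompt(query: str, detected_gender: str) -> str:
    """Pick which restroom location to give.

    Semantic information has priority: if the query names a restroom
    explicitly, the gender detected from the voice is ignored.
    """
    q = query.lower()
    if "women" in q:                 # explicit semantic request wins
        return "women's restroom"
    if "men" in q:                   # "women" already handled above
        return "men's restroom"
    # Gender-neutral query: fall back to the gender detected from the voice.
    return "men's restroom" if detected_gender == "male" else "women's restroom"
```

For the first example above, `choose_prompt("Where is the women's restroom?", "male")` follows the semantic information and returns the women's restroom despite the male voice.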
The processing module 205 is configured to, when the category of the voice information is confirmed to be a neutral voice, recognize the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model.
In this embodiment, after the user's gender is recognized by the gender recognition deep neural network model, in order to avoid giving a wrong instruction based on gender alone, the semantic information corresponding to the voice information is first acquired, and a correct instruction is then given according to both the semantic information and the user's gender.
In this embodiment, the gender recognition deep neural network model is a residual neural network (ResNet) model. The ResNet model is a deep neural network model designed based on the AM-Softmax loss function, and the optimal decision boundary of the gender recognition deep neural network model is obtained by adjusting the margin factor of the AM-Softmax loss function.
In this embodiment, gender recognition requires a two-class deep model. Two-class models usually use a sigmoid or softmax loss function; however, sigmoid and softmax losses perform poorly on data with blurred class boundaries. In order to classify gender accurately from neutral voices by increasing the inter-class distance and reducing the intra-class distance, this application uses the AM-Softmax loss function to design the deep neural network model. The AM-Softmax loss function pushes the classification boundary between categories to be larger.
The AM-Softmax loss function is:

$$
L_{\mathrm{AMS}} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s\,(\cos\theta_{y_i}-m)}}{e^{s\,(\cos\theta_{y_i}-m)}+\sum_{j=1,\,j\neq y_i}^{c}e^{s\,\cos\theta_j}}
$$

where n is the number of training samples, c is the number of classes, and θ_{y_i} is the angle between the i-th feature vector and the weight vector of its ground-truth class. To improve convergence speed, a scaling hyperparameter s is introduced and fixed here at s = 30. When the margin factor m of the AM-Softmax loss function is set to 0.2, the optimal decision boundary of the gender recognition deep neural network model is obtained.
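A minimal numerical sketch of the AM-Softmax loss in plain Python (assuming the inputs are already cosine similarities between L2-normalized features and class-weight vectors; the function name is illustrative):

```python
import math

def am_softmax_loss(cosines, labels, s=30.0, m=0.2):
    """AM-Softmax loss over a batch.

    cosines: per-sample list [cos(theta_0), cos(theta_1), ...], the cosine
             similarity of the normalized feature with each class weight.
    labels:  ground-truth class index per sample.
    The margin m is subtracted from the target-class cosine before the
    scaled softmax, which enlarges the boundary between classes.
    """
    total = 0.0
    for cos_row, y in zip(cosines, labels):
        target = math.exp(s * (cos_row[y] - m))
        others = sum(math.exp(s * c) for j, c in enumerate(cos_row) if j != y)
        total += -math.log(target / (target + others))
    return total / len(labels)
```

A confidently classified sample (target cosine well above the others) yields a near-zero loss, while a boundary sample with equal cosines is penalized heavily because of the margin, which is exactly what makes the loss suitable for boundary-blurred neutral voices.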
In this embodiment, since gender recognition is a two-class problem whose target categories are only male and female, the solution space is simpler than that of image classification, and directly reusing a deep model from the image classification field is prone to overfitting. Accordingly, in this application, to avoid overfitting and to improve the generalization ability of the deep model, an existing deep image recognition model is modified to obtain a ResNet-10 model. Specifically, on the basis of ResNet-18, the model depth and the number of residual layers are reduced again to obtain the ResNet-10 model.
In this embodiment, the ResNet-10 model includes convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x and a fully connected layer, ten layers in total. The parameters of ResNet-10 in the present invention are shown in Table 1 above, where max pool denotes a pooling layer; the stride of the first layer of each of Conv3_x, Conv4_x, and Conv5_x is 2; each convolutional layer is followed by a ReLU activation layer and a Batch Normalization regularization layer; and Conv2_x, Conv3_x, Conv4_x, and Conv5_x each include one residual module (×1 blocks). To implement the two-class task of the gender recognition model of the present invention, the last layer of Conv5_x is connected to a fully connected layer, which outputs the category result corresponding to the voice information.
In this embodiment, the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x are each connected to an adaptive global average pooling layer. Because the problem addressed in this application has few classes (male and female), average pooling works better than max pooling. Moreover, since the feature sizes of speech spectrograms fluctuate considerably, adaptive global average pooling is used to avoid feature-size mismatches.
In this embodiment, the convolution kernel of the input part of the ResNet-10 model is a 3×3 kernel. The 3×3 kernel effectively reduces the amount of computation while adapting better to speech spectrograms. In addition, reducing the feature map size of each residual layer of the ResNet-10 model makes the model less prone to overfitting and reduces the number of model parameters.
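The effect of the stride-2 stages on the spectrogram's feature map can be sketched with the standard output-size arithmetic (a hedged illustration; the 64×64 input patch and same-padding scheme are assumptions, not taken from Table 1):

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Standard convolution output-size formula: (W + 2P - K) // S + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Assume a 64x64 spectrogram patch; Conv3_x, Conv4_x and Conv5_x each
# start with a stride-2 layer, so the feature map is halved three times.
size = 64
for stage in ("Conv3_x", "Conv4_x", "Conv5_x"):
    size = conv_out(size, stride=2)
print(size)  # 64 -> 32 -> 16 -> 8
```

This rapid shrinking of the feature maps is what keeps the parameter count low and, as the paragraph above notes, helps the small two-class model avoid overfitting.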
In this embodiment, the method of pre-training the gender recognition deep neural network model includes:
(1) expanding the neutral voice data to obtain training data;
(2) training the deep neural network model with the expanded training data to obtain the gender recognition deep neural network model.
In this embodiment, the training method of the neural network model includes the following steps:
(a) Obtain the characteristic parameters corresponding to the neutral voices, and label each characteristic parameter with a category, so that the characteristic parameter carries a category label.
For example, select the characteristic parameters corresponding to 500 male neutral voices and 500 female neutral voices, and label each characteristic parameter with a category; "1" may be used as the parameter label for male neutral voices, and "2" as the parameter label for female neutral voices.
In this embodiment, the characteristic parameters corresponding to the male and female neutral voices include the Mel-Frequency Cepstral Coefficients (MFCC) of the sound signal. MFCC analysis is based on the auditory characteristics of the human ear: since the perceived pitch of a sound is not linearly proportional to its frequency, the Mel frequency scale is better aligned with human hearing.
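The Mel scale underlying MFCC can be illustrated with the common Hz-to-Mel conversion (one standard formulation; the patent does not specify which variant its MFCC front end uses):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to Mel, using the common 2595*log10 form."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse conversion, Mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The scale is compressive: equal steps in Hz map to shrinking steps in
# Mel at higher frequencies, mirroring how pitch perception flattens.
# Under this formulation, 1 kHz maps to roughly 1000 Mel.
```

The compression is why Mel-scale features track perceived pitch better than raw frequency, which is the motivation the paragraph above gives for using MFCC.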
(b) Randomly divide the characteristic parameters into a training set in a first preset ratio and a validation set in a second preset ratio.
First, distribute the training samples of the neutral voices of different genders into different folders: for example, the training samples of male neutral voices into a first folder, and those of female neutral voices into a second folder. Then extract a first preset ratio (for example, 70%) of the samples from each folder to form the training set, which is used to train the deep neural network model; the remaining second preset ratio (for example, 30%) of the samples from each folder form the test set, which is used to test the classification performance of the deep neural network model.
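Splitting each folder separately keeps the class balance intact in both sets; a minimal standard-library sketch of this per-class split (the 70/30 ratio follows the example above; names are illustrative):

```python
import random

def stratified_split(samples_by_class, train_ratio=0.7, seed=42):
    """Split each class's samples train/test so both sets keep the balance."""
    rng = random.Random(seed)
    train, test = [], []
    for label, samples in samples_by_class.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(round(len(shuffled) * train_ratio))
        train += [(label, s) for s in shuffled[:cut]]
        test += [(label, s) for s in shuffled[cut:]]
    return train, test

# 500 samples per gender, as in the labeling example above.
data = {"male_neutral": list(range(500)), "female_neutral": list(range(500))}
train, test = stratified_split(data)
```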
(c) Train the deep neural network model with the training set.
The process of inputting the training set into the established neural network model (such as ResNet-10) for training can be implemented with existing techniques and is not detailed here. In some embodiments, training the neural network model with the training set may further include deploying the training of the deep neural network model on multiple graphics processing units (GPUs) for distributed training. For example, through TensorFlow's distributed training mechanism, the training of the model can be deployed on multiple GPUs, which shortens training time and accelerates model convergence.
(d) Verify the accuracy of the trained deep neural network model with the validation set.
In this embodiment, if the accuracy is greater than or equal to a preset accuracy, the training ends, and the trained deep neural network model is used as a classifier to identify the gender of the user corresponding to the current neutral voice; if the accuracy is less than the preset accuracy, the number of samples is increased and the deep neural network model is retrained until the accuracy is greater than or equal to the preset accuracy.
In this embodiment, the method of expanding the neutral voice data is as follows: superimpose noise on the collected neutral voices, obtain the spectrogram of each noise-superimposed neutral voice, and shuffle and recombine the spectrogram in the time direction.
In this embodiment, the training data of the neutral voices is expanded through data augmentation. Neutral voices are relatively rare, so the collected neutral voices need to be expanded in order to train the deep neural network model.
Specifically, superimposing noise on the collected neutral voices includes superimposing white noise on the collected neutral voices and/or mixing environmental noise into the collected neutral voices.
For example, Gaussian white noise is linearly superimposed on a collected neutral voice (original_signal) to obtain a new sound signal: new_signal = 0.9*original_signal + 0.1*white_noise().
For example, mixing real environmental noise into the collected neutral voices may be done by replacing the Gaussian white noise above with collected real environmental noise, obtaining a new sound signal: new_signal = 0.9*original_signal + 0.1*real_noise(). The real environmental noise may be noise collected from venues such as parks, bus stops, stadiums, and coffee shops.
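The linear mixing above can be sketched sample-by-sample in plain Python (illustrative only; a real pipeline would operate on audio arrays at a known sample rate and normalize levels first):

```python
import random

def mix_noise(original, noise, signal_weight=0.9, noise_weight=0.1):
    """Linearly mix a noise track into a signal: 0.9*signal + 0.1*noise."""
    return [signal_weight * s + noise_weight * n
            for s, n in zip(original, noise)]

random.seed(0)
# Stand-ins for one second of audio at an assumed 16 kHz sample rate.
original_signal = [random.uniform(-1.0, 1.0) for _ in range(16000)]
white_noise = [random.gauss(0.0, 1.0) for _ in range(16000)]
new_signal = mix_noise(original_signal, white_noise)
```

Swapping `white_noise` for a recorded environment track gives the real-noise variant, matching the substitution described above.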
The noise-superimposed neutral voices are processed by a short-time Fourier transform to obtain spectrograms, and the spectrograms are shuffled and recombined in the time direction to obtain the training data.
For example, in the time direction of the spectrogram, the spectrogram corresponding to a neutral voice is cropped into segments of a fixed speech-frame-sequence length (such as 64 frames), and the resulting 64-frame speech segments are then randomly recombined. For instance, a 640-frame neutral voice spectrogram is cropped into ten 64-frame speech segments, and three of the ten segments are randomly selected and spliced in sequence to obtain a new sound signal. Through the above two steps, a sufficient amount of effective, high-quality neutral voice data can be generated.
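The crop-and-recombine step can be sketched as follows (a hedged illustration; frames are represented as a plain list rather than real STFT columns, and the segment counts follow the 640-frame example above):

```python
import random

def crop_and_recombine(frames, segment_len=64, num_segments=3, seed=0):
    """Cut a spectrogram (a list of frames) into fixed-length segments,
    then splice a few randomly chosen segments into a new example."""
    segments = [frames[i:i + segment_len]
                for i in range(0, len(frames) - segment_len + 1, segment_len)]
    rng = random.Random(seed)
    chosen = rng.sample(segments, num_segments)  # 3 of the 10 segments
    return [frame for seg in chosen for frame in seg]

spectrogram = list(range(640))          # stand-in for 640 STFT frames
new_example = crop_and_recombine(spectrogram)
```

Each call with a different seed yields a different 192-frame splice, which is how a small pool of neutral-voice recordings is multiplied into many training examples.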
Since this application needs to design a deep model that identifies the user's gender from the speech spectrogram, random cropping and stacking of spectrograms is used here to expand the speech data. In this way, a relatively large amount of neutral voice data is obtained and divided into the corresponding gender data groups according to their labels, thereby expanding the training data.
In summary, the neutral-voice-based information prompting apparatus 20 provided by the present invention includes a collection module 201, a confirmation module 202, a recognition module 203, a prompt module 204, and a processing module 205. The collection module 201 is configured to collect voice information; the confirmation module 202 is configured to confirm the category of the voice information, the categories including male voice, female voice, and neutral voice; the recognition module 203 is configured to recognize the semantic information corresponding to the voice information when its category is confirmed to be a male voice or a female voice; the prompt module 204 is configured to give a prompt according to the recognized semantic information and the confirmed category; and the processing module 205 is configured to recognize the gender corresponding to a neutral voice according to the pre-trained gender recognition deep neural network model when the category of the voice information is confirmed to be a neutral voice. This application can identify the neutral voices in the collected voice information and then identify the gender corresponding to each neutral voice, thereby providing more accurate instructions.
Moreover, when recognizing the gender corresponding to a neutral voice, the deep neural network model ResNet-10 trained with AM-Softmax gives the user's voice feature space a better classification boundary. The information prompting apparatus proposed in this application also draws on mature ideas from face recognition to enlarge the classification boundary, so that voice data with blurred class boundaries, such as neutral voices, can be trained in depth to obtain effective gender attribution. This greatly improves the accuracy of gender recognition and enhances the applicability of speaker gender recognition in real business scenarios and intelligent customer service systems.
The integrated unit implemented in the form of a software function module described above may be stored in a computer-readable storage medium. The software function module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a dual-screen device, a network device, or the like) or a processor to execute parts of the methods described in the embodiments of the present invention.
FIG. 3 is a schematic diagram of the electronic device provided in Embodiment 3 of this application.
The electronic device 3 includes a memory 31, at least one processor 32, a computer program 33 stored in the memory 31 and executable on the at least one processor 32, at least one communication bus 34, and a database 35.
When the at least one processor 32 executes the computer program 33, the steps in the foregoing embodiments of the information prompting method are implemented.
Exemplarily, the computer program 33 may be divided into one or more modules/units, which are stored in the memory 31 and executed by the at least one processor 32 to complete this application. The one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, the instruction segments being used to describe the execution process of the computer program 33 in the electronic device 3.
The electronic device 3 may be a device on which an application is installed, such as a mobile phone, a tablet computer, or a personal digital assistant (PDA). Those skilled in the art can understand that FIG. 3 is merely an example of the electronic device 3 and does not constitute a limitation on it; the electronic device 3 may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device 3 may also include input/output devices, network access devices, buses, and so on.
The at least one processor 32 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 32 may be a microprocessor or any conventional processor. The processor 32 is the control center of the electronic device 3 and connects the various parts of the entire electronic device 3 through various interfaces and lines.
The memory 31 may be used to store the computer program 33 and/or the modules/units. The processor 32 implements the various functions of the electronic device 3 by running or executing the computer programs and/or modules/units stored in the memory 31 and calling the data stored in the memory 31. The memory 31 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of the electronic device 3, and the like. In addition, the memory 31 may include volatile memory and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, a high-speed random access memory, or another storage device.
The memory 31 stores program code, and the at least one processor 32 can call the program code stored in the memory 31 to perform related functions. For example, the modules described in FIG. 2 (the collection module 201, confirmation module 202, recognition module 203, prompt module 204, and processing module 205) are program code stored in the memory 31 and executed by the at least one processor 32, so as to realize the functions of the various modules for the purpose of information prompting.
The database 35 is a repository established on the electronic device 3 that organizes, stores, and manages data according to a data structure. Databases are usually divided into three types: hierarchical databases, network databases, and relational databases. In this embodiment, the database 35 is used to store the collected voice information and the like.
If the modules/units integrated in the electronic device 3 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, this application implements all or part of the processes in the methods of the above embodiments, which may also be completed by instructing relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the foregoing method embodiments. The computer program includes computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory, and the like.
In the several embodiments provided in this application, it should be understood that the disclosed electronic device and method may be implemented in other ways. For example, the electronic device embodiments described above are merely illustrative; for instance, the division of the units is only a logical function division, and there may be other division methods in actual implementation.
In addition, the functional units in the various embodiments of this application may be integrated in the same processing unit, each unit may exist alone physically, or two or more units may be integrated in the same unit. The above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
For those skilled in the art, it is obvious that this application is not limited to the details of the foregoing exemplary embodiments, and that this application can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, from whatever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than the above description, and all changes falling within the meaning and scope of equivalent elements of the claims are therefore intended to be embraced in this application. Any reference signs in the claims shall not be regarded as limiting the claims involved. In addition, it is obvious that the word "comprising" does not exclude other units, and the singular does not exclude the plural. Multiple units or apparatuses stated in this application may also be implemented by one unit or apparatus through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any specific order.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements can be made to the technical solutions of this application without departing from the spirit and scope of these technical solutions.

Claims (20)

  1. An information prompting method, wherein the method comprises:
    a collection step of collecting voice information;
    a confirmation step of confirming the category of the voice information, wherein the categories of the voice information include male voice, female voice, and neutral voice;
    a recognition step of, when the category of the voice information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the voice information;
    a prompt step of giving a prompt according to the recognized semantic information and the confirmed category; and
    a processing step of, when the category of the voice information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, the ResNet-10 model comprising convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x and a fully connected layer, ten layers in total.
  2. The information prompting method according to claim 1, wherein the confirmation step comprises:
    extracting the pitch frequency of the voice information;
    comparing the pitch frequency of the voice information with a first pitch frequency range, a second pitch frequency range, and a third pitch frequency range;
    when the pitch frequency of the voice information falls within the first pitch frequency range, confirming that the category of the voice information is a male voice;
    when the pitch frequency of the voice information falls within the second pitch frequency range, confirming that the category of the voice information is a female voice; and
    when the pitch frequency of the voice information falls within the third pitch frequency range, confirming that the category of the voice information is a neutral voice.
  3. The information prompting method according to claim 1, wherein recognizing the semantic information corresponding to the voice information comprises:
    converting the voice information into text information;
    preprocessing the text information, the preprocessing including word segmentation and noise-word removal; and
    performing semantic matching on the preprocessed text information according to a pre-stored semantic relation library and basic concept library to obtain a semantic matching result.
  4. The information prompting method according to claim 1, wherein the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x each comprise one residual module, and the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x, and Conv5_x are each connected to an adaptive global average pooling layer.
  5. The information prompting method according to claim 1, wherein the ResNet-10 model is a deep neural network model designed based on the AM-Softmax loss function, and when the margin factor of the AM-Softmax loss function is 0.2, the optimal decision boundary of the gender recognition deep neural network model is obtained.
  6. The information prompting method according to claim 1, wherein the method further comprises a step of pre-training the gender recognition deep neural network model, the step comprising:
    expanding the neutral voice data to obtain training data; and
    training the deep neural network model with the expanded training data to obtain the gender recognition deep neural network model.
  7. The information prompting method of claim 6, wherein expanding the neutral voice data to obtain training data comprises:
    superimposing noise on the collected neutral voice;
    obtaining a spectrogram of the neutral voice after the noise is superimposed;
    shuffling and recombining the spectrogram along the time axis.
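The three augmentation steps of claim 7 can be sketched end to end. The spectrogram here is a toy stand-in (fixed-length frames of the raw waveform); a real implementation would apply an STFT per frame, and the noise amplitude is an assumed parameter:

```python
import random

def superimpose_noise(samples, amplitude=0.05, rng=None):
    """Step 1: add uniform white noise to a waveform.

    `amplitude` is an assumed noise level, unspecified by the claim.
    """
    rng = rng or random.Random(0)
    return [x + rng.uniform(-amplitude, amplitude) for x in samples]

def spectrogram(samples, frame_len=4):
    """Step 2 (toy stand-in): split the waveform into fixed-length time
    frames. A real implementation would compute an STFT per frame."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def shuffle_time(frames, rng=None):
    """Step 3: shuffle and recombine the frames along the time axis."""
    rng = rng or random.Random(0)
    out = frames[:]
    rng.shuffle(out)
    return out
```

The shuffle preserves every frame's content while destroying temporal order, so each neutral-voice recording yields several distinct training spectrograms.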
  8. An information prompting apparatus, wherein the apparatus comprises:
    a collection module, configured to collect voice information;
    a confirmation module, configured to confirm the category of the voice information, wherein the category of the voice information comprises a male voice, a female voice and a neutral voice;
    a recognition module, configured to recognize the semantic information corresponding to the voice information when the category of the voice information is confirmed to be a male voice or a female voice;
    a prompt module, configured to give a prompt according to the recognized semantic information and the confirmed category;
    a processing module, configured to, when the category of the voice information is confirmed to be a neutral voice, recognize the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model comprises the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x and a fully connected layer, ten layers in total.
  9. An electronic device, wherein the electronic device comprises a processor configured to execute computer-readable instructions stored in a memory to implement the following steps:
    a collection step: collecting voice information;
    a confirmation step: confirming the category of the voice information, wherein the category of the voice information comprises a male voice, a female voice and a neutral voice;
    a recognition step: when the category of the voice information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the voice information;
    a prompt step: giving a prompt according to the recognized semantic information and the confirmed category;
    a processing step: when the category of the voice information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model, and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model comprises the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x and a fully connected layer, ten layers in total.
  10. The electronic device of claim 9, wherein when the processor executes the computer-readable instructions to confirm the category of the voice information, the confirming specifically comprises:
    extracting the pitch frequency of the voice information;
    comparing the pitch frequency of the voice information with a first pitch frequency range, a second pitch frequency range and a third pitch frequency range;
    when the pitch frequency of the voice information falls within the first pitch frequency range, confirming that the category of the voice information is a male voice;
    when the pitch frequency of the voice information falls within the second pitch frequency range, confirming that the category of the voice information is a female voice;
    when the pitch frequency of the voice information falls within the third pitch frequency range, confirming that the category of the voice information is a neutral voice.
  11. The electronic device of claim 9, wherein when the processor executes the computer-readable instructions to recognize the semantic information corresponding to the voice information, the recognizing specifically comprises:
    converting the voice information into text information;
    preprocessing the text information, the preprocessing comprising word segmentation and noise-word removal;
    performing semantic matching on the preprocessed text information according to a pre-stored semantic relation library and a basic concept library to obtain a semantic matching result.
  12. The electronic device of claim 9, wherein the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x each comprise one residual module, and each of the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x is followed by an adaptive global average pooling layer.
  13. The electronic device of claim 9, wherein the ResNet-10 model is a deep neural network model designed based on the AM-Softmax loss function, and when the parameter factor of the AM-Softmax loss function is 0.2, the optimal decision boundary of the gender recognition deep neural network model is obtained.
  14. The electronic device of claim 9, wherein the processor further executes the computer-readable instructions to implement a step of pre-training the gender recognition deep neural network model, the step comprising:
    expanding the neutral voice data to obtain training data;
    training the deep neural network model according to the expanded training data to obtain the gender recognition deep neural network model.
  15. The electronic device of claim 14, wherein when the processor executes the computer-readable instructions to expand the neutral voice data to obtain training data, the expanding comprises:
    superimposing noise on the collected neutral voice;
    obtaining a spectrogram of the neutral voice after the noise is superimposed;
    shuffling and recombining the spectrogram along the time axis.
  16. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    a collection step: collecting voice information;
    a confirmation step: confirming the category of the voice information, wherein the category of the voice information comprises a male voice, a female voice and a neutral voice;
    a recognition step: when the category of the voice information is confirmed to be a male voice or a female voice, recognizing the semantic information corresponding to the voice information;
    a prompt step: giving a prompt according to the recognized semantic information and the confirmed category;
    a processing step: when the category of the voice information is confirmed to be a neutral voice, recognizing the gender corresponding to the neutral voice according to a pre-trained gender recognition deep neural network model, and then returning to the recognition step, wherein the pre-trained gender recognition deep neural network model is a residual neural network ResNet-10 model, and the ResNet-10 model comprises the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x and a fully connected layer, ten layers in total.
  17. The computer-readable storage medium of claim 16, wherein when the computer-readable instructions are executed by the processor to confirm the category of the voice information, the confirming specifically comprises:
    extracting the pitch frequency of the voice information;
    comparing the pitch frequency of the voice information with a first pitch frequency range, a second pitch frequency range and a third pitch frequency range;
    when the pitch frequency of the voice information falls within the first pitch frequency range, confirming that the category of the voice information is a male voice;
    when the pitch frequency of the voice information falls within the second pitch frequency range, confirming that the category of the voice information is a female voice;
    when the pitch frequency of the voice information falls within the third pitch frequency range, confirming that the category of the voice information is a neutral voice.
  18. The computer-readable storage medium of claim 16, wherein when the computer-readable instructions are executed by the processor to recognize the semantic information corresponding to the voice information, the recognizing specifically comprises:
    converting the voice information into text information;
    preprocessing the text information, the preprocessing comprising word segmentation and noise-word removal;
    performing semantic matching on the preprocessed text information according to a pre-stored semantic relation library and a basic concept library to obtain a semantic matching result.
  19. The computer-readable storage medium of claim 16, wherein the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x each comprise one residual module, and each of the convolutional layers Conv_1, Conv2_x, Conv3_x, Conv4_x and Conv5_x is followed by an adaptive global average pooling layer.
  20. The computer-readable storage medium of claim 16, wherein the ResNet-10 model is a deep neural network model designed based on the AM-Softmax loss function, and when the parameter factor of the AM-Softmax loss function is 0.2, the optimal decision boundary of the gender recognition deep neural network model is obtained.
PCT/CN2021/072860 2020-03-03 2021-01-20 Information prompting method and apparatus, electronic device, and medium WO2021175031A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010139944.2A CN111462755A (en) 2020-03-03 2020-03-03 Information prompting method and device, electronic equipment and medium
CN202010139944.2 2020-03-03

Publications (1)

Publication Number Publication Date
WO2021175031A1 true WO2021175031A1 (en) 2021-09-10

Family

ID=71678415

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/072860 WO2021175031A1 (en) 2020-03-03 2021-01-20 Information prompting method and apparatus, electronic device, and medium

Country Status (2)

Country Link
CN (1) CN111462755A (en)
WO (1) WO2021175031A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462755A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Information prompting method and device, electronic equipment and medium
CN112447188B (en) * 2020-11-18 2023-10-20 中国人民解放军陆军工程大学 Acoustic scene classification method based on improved softmax function
CN112382301B (en) * 2021-01-12 2021-05-14 北京快鱼电子股份公司 Noise-containing voice gender identification method and system based on lightweight neural network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103167174A (en) * 2013-02-25 2013-06-19 广东欧珀移动通信有限公司 Output method, device and mobile terminal of mobile terminal greetings
CN105185385A (en) * 2015-08-11 2015-12-23 东莞市凡豆信息科技有限公司 Voice fundamental tone frequency estimation method based on gender anticipation and multi-frequency-band parameter mapping
US20180308487A1 (en) * 2017-04-21 2018-10-25 Go-Vivace Inc. Dialogue System Incorporating Unique Speech to Text Conversion Method for Meaningful Dialogue Response
CN108962223A (en) * 2018-06-25 2018-12-07 厦门快商通信息技术有限公司 A kind of voice gender identification method, equipment and medium based on deep learning
CN109961794A (en) * 2019-01-14 2019-07-02 湘潭大学 A kind of layering method for distinguishing speek person of model-based clustering
JP6553015B2 (en) * 2016-11-15 2019-07-31 日本電信電話株式会社 Speaker attribute estimation system, learning device, estimation device, speaker attribute estimation method, and program
CN110136726A (en) * 2019-06-20 2019-08-16 厦门市美亚柏科信息股份有限公司 A kind of estimation method, device, system and the storage medium of voice gender
CN110428843A (en) * 2019-03-11 2019-11-08 杭州雄迈信息技术有限公司 A kind of voice gender identification deep learning method
CN111462755A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Information prompting method and device, electronic equipment and medium


Also Published As

Publication number Publication date
CN111462755A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
WO2021175031A1 (en) Information prompting method and apparatus, electronic device, and medium
CN110097894B (en) End-to-end speech emotion recognition method and system
CN107578775B (en) Multi-classification voice method based on deep neural network
CN111179975B (en) Voice endpoint detection method for emotion recognition, electronic device and storage medium
CN107680582B (en) Acoustic model training method, voice recognition method, device, equipment and medium
WO2020248376A1 (en) Emotion detection method and apparatus, electronic device, and storage medium
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
WO2019227672A1 (en) Voice separation model training method, two-speaker separation method and associated apparatus
Jing et al. Prominence features: Effective emotional features for speech emotion recognition
US9020822B2 (en) Emotion recognition using auditory attention cues extracted from users voice
CN109036381A (en) Speech processing method and device, computer device, and readable storage medium
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
WO2021047319A1 (en) Voice-based personal credit assessment method and apparatus, terminal and storage medium
CN109036437A (en) Accent recognition method and apparatus, computer device, and computer-readable storage medium
CN111696579B (en) Speech emotion recognition method, device, equipment and computer storage medium
CN108962243A (en) Arrival reminding method and device, mobile terminal, and computer-readable storage medium
US20230013370A1 (en) Generating audio waveforms using encoder and decoder neural networks
CN113327586A (en) Voice recognition method and device, electronic equipment and storage medium
El-Moneim et al. Text-dependent and text-independent speaker recognition of reverberant speech based on CNN
CN114927126A (en) Scheme output method, device and equipment based on semantic analysis and storage medium
CN111898363A (en) Method and device for compressing long and difficult sentences of text, computer equipment and storage medium
CN111985231B (en) Unsupervised role recognition method and device, electronic equipment and storage medium
Lokitha et al. Smart voice assistance for speech disabled and paralyzed people
Srinivasan et al. Multi-view representation based speech assisted system for people with neurological disorders
Yan et al. In-tunnel accident detection system based on the learning of accident sound

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21764808

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 23.01.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21764808

Country of ref document: EP

Kind code of ref document: A1