WO2021218136A1 - Voice-based user gender and age recognition method and apparatus, computer device, and storage medium

Voice-based user gender and age recognition method and apparatus, computer device, and storage medium

Info

Publication number: WO2021218136A1
Authority: WO, WIPO (PCT)
Prior art keywords: voice, user, data, voice data, current
Application number: PCT/CN2020/131612
Other languages: French (fr), Chinese (zh)
Inventors: 赵婧, 王健宗
Original Assignee: 平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021218136A1


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M3/00: Automatic or semi-automatic exchanges
    • H04M3/42: Systems providing special services or facilities to subscribers
    • H04M3/50: Centralised arrangements for answering calls; centralised arrangements for recording messages for absent or busy subscribers
    • H04M3/51: Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
    • H04M3/5166: Centralised call answering arrangements in combination with interactive voice response systems or voice portals, e.g. as front-ends
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/24: the extracted parameters being the cepstrum

Definitions

  • This application relates to the technical field of voice classification in artificial intelligence, and in particular to a voice-based user gender and age recognition method, apparatus, computer device, and storage medium.
  • The inventor realized that when a smart outbound-calling system automatically places calls to each user according to the user information in the outbound-call list, it determines the type of agent voice and the outbound speech flow based on the age and gender in that user information.
  • For example, when the user information indicates a middle-aged male, the smart outbound-calling system plays a female agent recording to carry out the call.
  • However, if the person who answers the call is not the listed user, the accuracy of the gender-matched playback is low.
  • The embodiments of the present application provide a voice-based user gender and age identification method, apparatus, computer device, and storage medium, intended to solve the prior-art problem that when a smart outbound-calling system automatically calls each user based on the user information in the outbound-call list and the person who answers the call is not the listed user, the accuracy of the gender-matched playback is low.
  • An embodiment of the present application provides a voice-based user gender and age identification method, which includes:
  • invoking a pre-stored voice reply strategy, obtaining the current voice reply data corresponding to the current user classification result in the voice reply strategy, and sending the current voice reply data to the user terminal.
  • An embodiment of the present application provides a voice-based user gender and age recognition apparatus, which includes:
  • a voice data receiving unit, configured to receive the current user voice data sent by the user terminal;
  • a voice preprocessing unit, configured to preprocess the current user voice data to obtain preprocessed voice data;
  • a mixed parameter sequence acquiring unit, configured to extract the short-time average amplitude of each frame of voice data in the preprocessed voice data, and to perform feature extraction of the mel-frequency cepstral coefficients and the first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, obtaining the mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data so as to form a mixed parameter feature time series;
  • a user classification unit, configured to input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter; and
  • a reply data sending unit, configured to invoke a pre-stored voice reply strategy, obtain the current voice reply data corresponding to the current user classification result in the voice reply strategy, and send the current voice reply data to the user terminal.
  • An embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the following steps are implemented:
  • invoking a pre-stored voice reply strategy, obtaining the current voice reply data corresponding to the current user classification result in the voice reply strategy, and sending the current voice reply data to the user terminal.
  • The embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following operations:
  • invoking a pre-stored voice reply strategy, obtaining the current voice reply data corresponding to the current user classification result in the voice reply strategy, and sending the current voice reply data to the user terminal.
  • The embodiments of the present application provide a voice-based user gender and age recognition method, apparatus, computer device, and storage medium. The method includes: receiving current user voice data sent by the user terminal; preprocessing the current user voice data to obtain preprocessed voice data; extracting the short-time average amplitude of each frame of voice data in the preprocessed voice data, and performing feature extraction of the mel-frequency cepstral coefficients and the first-order difference of the mel-frequency cepstral coefficients on each frame, to obtain the mixed parameter feature corresponding to each frame and form a mixed parameter feature time series; inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain the current user classification result corresponding to the current user voice data, the current user classification result including a gender parameter and an estimated age parameter; and invoking a pre-stored voice reply strategy, obtaining the current voice reply data corresponding to the current user classification result, and sending it to the user terminal.
  • This method comprehensively considers the influence of the short-time average amplitude, the mel-frequency cepstral coefficients, and the first-order difference of the mel-frequency cepstral coefficients on gender recognition, and achieves accurate recognition of gender and age from the user's voice.
  • FIG. 1 is a schematic diagram of an application scenario of a voice-based user gender and age identification method provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a voice-based user gender and age identification method provided by an embodiment of this application;
  • FIG. 3 is a schematic diagram of a sub-flow of a voice-based user gender and age identification method provided by an embodiment of this application;
  • FIG. 4 is a schematic diagram of another sub-flow of a voice-based user gender and age identification method provided by an embodiment of this application;
  • FIG. 5 is a schematic block diagram of a voice-based user gender and age recognition device provided by an embodiment of this application;
  • FIG. 6 is a schematic block diagram of subunits of the voice-based user gender and age recognition device provided by an embodiment of this application;
  • FIG. 7 is a schematic block diagram of another subunit of the voice-based user gender and age recognition device provided by an embodiment of this application;
  • FIG. 8 is a schematic block diagram of a computer device provided by an embodiment of this application.
  • Please refer to Figure 1 and Figure 2: Figure 1 is a schematic diagram of an application scenario of the voice-based user gender and age identification method provided by an embodiment of this application, and Figure 2 is a schematic flowchart of that method.
  • The voice-based user gender and age identification method is applied to a server and is executed by application software installed in the server.
  • The method includes steps S110 to S150.
  • S110: Receive current user voice data sent by the user terminal.
  • When the intelligent voice system deployed in the server needs to recognize the user's gender and age from voice, it first needs to receive the current user voice data uploaded by the user terminal, so as to perform the subsequent voice preprocessing and classification recognition process.
  • Specifically, the current user voice data (denoted s(t)) is first sampled with a sampling period T and discretized as s(n).
  • The sampling period should be chosen according to the bandwidth of the current user's voice data (per the Nyquist sampling theorem) to avoid aliasing and distortion of the signal in the frequency domain; the quantization of the discrete speech signal inevitably introduces a certain amount of quantization noise and distortion.
  • The preprocessing of the voice includes the steps of pre-emphasis, windowing, and framing.
  • In an embodiment, step S120 includes:
  • S121: Call a pre-stored sampling period to sample the current user voice data to obtain a current discrete voice signal;
  • S122: Call a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal to obtain a current pre-emphasized voice signal;
  • S123: Call a pre-stored Hamming window to window the current pre-emphasized voice signal to obtain windowed voice data;
  • S124: Call the pre-stored frame shift and frame length to divide the windowed voice data into frames to obtain the preprocessed voice data.
  • That is, the current user voice data s(t) is sampled with the sampling period T and discretized as s(n).
  • The first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter whose transfer function is given by formula (1): H(z) = 1 - a·z^(-1) (1)
  • The value of a is 0.98.
  • If the sampling value of the current discrete voice signal at time n is x(n), the pre-emphasized signal is given by formula (2): y(n) = x(n) - a·x(n-1) (2)
  • Let the time-domain signal corresponding to the windowed voice data be x(l).
  • The windowed voice data is then divided into frames.
  • The nth frame of voice data in the preprocessed voice data is xn(m), and xn(m) satisfies formula (3): xn(m) = ω(m)·x(nT + m), 0 ≤ m ≤ N-1 (3)
  • where T is the frame shift, N is the frame length, and ω(n) is the Hamming window function.
  • For each frame of voice data in the preprocessed voice data, the short-time average amplitude, the mel-frequency cepstral coefficients, and the first-order difference of the mel-frequency cepstral coefficients are extracted, and the extracted parameters together constitute the mixed parameter feature corresponding to that frame, so as to form the mixed parameter feature time series.
  • In this way, the important parameters extracted from the preprocessed voice data are obtained, and combining these important parameters allows user types to be classified more accurately (mainly by age and gender).
  • Specifically, the short-time average amplitude of the nth frame of voice data in the preprocessed voice data is calculated as Mn = Σ_{m=0}^{N-1} |xn(m)|,
  • where Mn represents the short-time average amplitude of the nth frame of voice data in the preprocessed voice data, the nth frame of voice data is xn(m), 0 ≤ m ≤ N-1, and N is the frame length.
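
Given the framed output of the earlier sketch, the short-time average amplitude is simply the per-frame sum of absolute sample values:

```python
import numpy as np

def short_time_average_amplitude(frames):
    """M_n = sum over m of |x_n(m)| for each frame x_n (frames: 2-D array)."""
    return np.abs(frames).sum(axis=1)
```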
  • In an embodiment, step S130 includes:
  • S131: Perform Fourier transform on each frame of the preprocessed voice data in sequence to obtain frequency-domain voice data;
  • S132: Take the absolute value of the frequency-domain voice data to obtain absolute-valued voice data;
  • S133: Filter the absolute-valued voice data through the mel filter bank to obtain mel-filtered voice data;
  • S134: Sequentially perform a logarithm operation and a discrete cosine transform on the mel-filtered voice data to obtain the mel-frequency cepstral coefficients corresponding to the preprocessed voice data;
  • S135: Obtain the difference between consecutive adjacent terms of the mel-frequency cepstral coefficients to obtain their first-order difference.
  • Because the preprocessed voice data is a time-domain signal whose characteristics are hard to observe directly, each frame is converted to the frequency domain.
  • This is done with the discrete Fourier transform (DFT) or, where applicable, the fast Fourier transform (FFT).
  • For an N-point signal, if N is a power of 2 the FFT can be used to speed up the computation; otherwise only the DFT can be used, and the computation slows as the number of points increases. Therefore, when framing, the number of points per frame should be chosen as a power of 2.
  • The absolute-valued voice data is then filtered by the mel filter bank to obtain the mel-filtered voice data.
  • The mel filter bank is a set of triangular band-pass filters spaced uniformly on the mel scale, which maps linear frequency f to Mel(f) = 2595·log10(1 + f/700).
  • The discrete cosine transform (DCT) completes the cepstral analysis: the time-domain signal is transformed to the frequency domain, the logarithm is taken, and the DCT is performed to obtain the cepstral coefficients. With the mel filter bank applied after the frequency-domain stage, the final result is the mel-frequency cepstral coefficients (MFCC).
  • The first-order difference is the difference between consecutive adjacent terms of a discrete sequence; here it is taken over the mel-frequency cepstral coefficients.
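
The chain of steps S131 to S135 (FFT, magnitude, mel filter bank, logarithm, DCT, first-order difference) can be sketched as follows. The filter count, FFT size, sampling rate, and number of cepstral coefficients kept are common defaults assumed here, not values fixed by this application, and the first-order difference is taken across consecutive frames, which is the conventional reading.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters=26, n_fft=256, sr=8000):
    """Triangular filters spaced uniformly on the mel scale (assumed defaults)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mfcc_and_delta(frames, n_fft=256, sr=8000, n_coeffs=13):
    # S131/S132: FFT of each frame, then the magnitude (absolute value) spectrum.
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # S133: filter the magnitude spectrum through the mel filter bank.
    mel_energy = mag @ mel_filterbank(n_fft=n_fft, sr=sr).T
    # S134: logarithm followed by the discrete cosine transform -> MFCC.
    mfcc = dct(np.log(mel_energy + 1e-10), type=2, axis=1, norm='ortho')[:, :n_coeffs]
    # S135: first-order difference between consecutive frames' coefficients.
    delta = np.diff(mfcc, axis=0, prepend=mfcc[:1])
    return mfcc, delta
```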
  • Each frame of voice data in the preprocessed voice data thus yields the three characteristic parameters above (short-time average amplitude, mel-frequency cepstral coefficient, and first-order difference of the mel-frequency cepstral coefficient); that is, each frame corresponds to a 1×3 row vector.
  • If the preprocessed voice data includes M frames of voice data,
  • the 1×3 row vectors corresponding to the frames are concatenated in time order to obtain a 1×3M row vector,
  • and the 1×3M row vector is the mixed parameter feature time series corresponding to the preprocessed voice data, as sketched below.
  • In other embodiments, each frame of voice data can additionally yield the fundamental frequency, the speech rate, and the sound pressure level, so as to form a mixed parameter feature time series with more parameter dimensions.
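
A short sketch of assembling the 1×3M mixed parameter feature time series described above. The text treats each frame as contributing a single value for each of the three parameters, so one representative MFCC value per frame is assumed here; the random demo arrays stand in for the real per-frame extracts.

```python
import numpy as np

M = 5                        # number of frames (demo value)
amp   = np.random.rand(M)    # short-time average amplitude per frame
mfcc1 = np.random.rand(M)    # one representative MFCC value per frame (assumption)
dmfcc = np.random.rand(M)    # its first-order difference per frame

# Interleave frame by frame: [amp_0, mfcc_0, dmfcc_0, amp_1, ...] -> 1 x 3M.
feature_series = np.stack([amp, mfcc1, dmfcc], axis=1).reshape(1, 3 * M)
```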
  • When the Gaussian mixture model is pre-trained, several sub-Gaussian mixture models need to be trained separately: for example, a first sub-Gaussian mixture model for identifying men aged 18-20, a second for men aged 21-30, a third for men aged 31-40, a fourth for men aged 41-50, a fifth for men aged 51-70, a sixth for identifying women aged 18-20, a seventh for women aged 21-30, an eighth for women aged 31-40, a ninth for women aged 41-50, and a tenth for women aged 51-70.
  • The Gaussian mixture model (GMM) refers to a probability distribution model of the form P(y|θ) = Σ_{k=1}^{K} αk·φ(y|θk) (4)
  • where αk is a mixing coefficient with αk ≥ 0 and Σ_{k=1}^{K} αk = 1, and φ(y|θk) is the Gaussian density of the kth component with parameters θk.
  • In an embodiment, the Gaussian mixture model in step S140 includes a plurality of sub-Gaussian mixture models, one of which is denoted as the first sub-Gaussian mixture model: a recognition model for recognizing males aged 18-20. Taking the training of this first sub-Gaussian mixture model as an example, before step S140 the method further includes:
  • acquiring first sample data, where the first sample data are mixed parameter feature time series corresponding to voice data of multiple 18-20-year-old males; training the first sub-Gaussian mixture model to be trained using the first sample data, to obtain the first sub-Gaussian mixture model for identifying males aged 18-20; and storing the trained first sub-Gaussian mixture model to the blockchain network.
  • Blockchain is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
  • Blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains a batch of network transaction information used to verify the validity of the information (anti-tampering) and to generate the next block.
  • The blockchain can include the blockchain underlying platform, the platform product service layer, and the application service layer.
  • The mixed parameter feature time series corresponding to the 18-20-year-old male voice data in the first sample data can be obtained by referring to the specific process, in steps S110 to S130, of obtaining the mixed parameter feature time series corresponding to the current user's voice data.
  • The process of training the first sub-Gaussian mixture model to be trained is to input multiple sets of mixed parameter feature time series and to solve for the parameters of the model through the EM algorithm (the expectation-maximization algorithm), thereby obtaining the first sub-Gaussian mixture model.
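
A minimal sketch of this training-and-classification scheme, using scikit-learn's GaussianMixture (whose fit method runs the EM algorithm) in place of a hand-written EM solver. The ten gender/age buckets follow the example above; the component count and feature layout are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

BUCKETS = [(g, a) for g in ("male", "female")
           for a in ("18-20", "21-30", "31-40", "41-50", "51-70")]

def train_sub_models(samples_per_bucket, n_components=4):
    """samples_per_bucket maps each (gender, age) bucket to a 2-D array of
    mixed parameter feature time series, one row per training utterance."""
    models = {}
    for bucket, X in samples_per_bucket.items():
        # fit() runs EM to solve for the mixing coefficients alpha_k and the
        # Gaussian component parameters of formula (4).
        models[bucket] = GaussianMixture(n_components=n_components).fit(X)
    return models

def classify(models, x):
    """Return the bucket whose sub-GMM assigns x the highest log-likelihood."""
    x = np.atleast_2d(x)
    return max(models, key=lambda bucket: models[bucket].score(x))
```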
  • The first sub-Gaussian mixture model trained in the server can be stored on the blockchain in a blockchain network (preferably a private chain, so that each subsidiary of the enterprise can use the private chain to call the first sub-Gaussian mixture model).
  • Besides the first sub-Gaussian mixture model, the other sub-Gaussian mixture models included in the Gaussian mixture model can also be stored in the blockchain network.
  • What is uploaded are the parameter values of the model (such as αk and the parameter values corresponding to φ(y|θk)).
  • The server is regarded as a blockchain node device in the blockchain network and has the authority to upload data to the blockchain network.
  • When the server needs to obtain the first sub-Gaussian mixture model from the blockchain network, it is first verified whether the server has the authority of a blockchain node device; if it does, the first sub-Gaussian mixture model is obtained.
  • The voice reply strategy stored in the server includes multiple pieces of voice style template data, each corresponding to one piece of voice reply data; in each voice style template, the speaker's gender, the speaker's style, and the speech flow are all preset.
  • For example, if the current user classification result is an 18-20-year-old male,
  • the current voice reply data corresponding to that classification result is a sweet female style with a lively speech flow. That is, when a male customer is recognized, the system automatically plays a recording by a sweet-voiced female agent and addresses the customer as "Mr." during the conversation to increase politeness; when a female customer answers the phone, it automatically plays a recording by a male agent with a deep voice and addresses her as "Ms." to show politeness. A relaxed and lively speech flow is used for young customers, and a mature, steady speech flow for older customers.
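
Functionally, the reply strategy is a lookup from the classification result to a preset voice style template. A hypothetical sketch follows; the template names and fields are illustrative, not taken from this application.

```python
# Hypothetical mapping from (gender, age bucket) to preset voice style template data.
REPLY_STRATEGY = {
    ("male", "18-20"):   {"agent_voice": "sweet_female", "flow": "lively",        "address": "Mr."},
    ("male", "51-70"):   {"agent_voice": "sweet_female", "flow": "mature_steady", "address": "Mr."},
    ("female", "21-30"): {"agent_voice": "deep_male",    "flow": "lively",        "address": "Ms."},
    # ... one entry per possible classification result
}

def current_reply_data(classification):
    """Fetch the preset voice reply data for the current user classification result."""
    return REPLY_STRATEGY[classification]
```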
  • In an embodiment, after step S150 the method further includes:
  • recognizing the current user's voice data through a pre-trained N-gram model to obtain a recognition result, and obtaining the unique user identity code corresponding to the user identification code field in the recognition result.
  • The current user's voice data is recognized through the N-gram model (an n-gram language model), and the recognition output is a whole sentence, for example: "My name is Zhang San, my gender is male, and my age is 25. Business A needs to be handled today."
  • The N-gram model can effectively recognize the current user's voice data and take the sentence with the highest recognition probability as the recognition result.
  • Since the current user's voice data has by now been converted into the text of the recognition result, several key strings can be located in the recognition result to obtain the user age value and user gender value corresponding to the user age field and the user gender field. At the same time, the unique user identity code corresponding to the user identification code field in the recognition result can also be obtained; the unique user identity code is preferably the user's ID number.
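
Once the N-gram recognizer has produced the text of the recognition result, the key strings can be located with simple patterns. A sketch assuming the phrasing of the example sentence above; the field wording and the ID-number phrase are illustrative assumptions.

```python
import re

def extract_fields(text):
    """Locate the user age, user gender, and user identity code fields in
    recognized text; the patterns assume the example sentence's phrasing."""
    age    = re.search(r"my age is (\d+)", text)
    gender = re.search(r"my gender is (\w+)", text)
    uid    = re.search(r"my ID number is ([0-9Xx]+)", text)  # assumed phrasing
    return {
        "age": int(age.group(1)) if age else None,
        "gender": gender.group(1) if gender else None,
        "user_id": uid.group(1) if uid else None,
    }

print(extract_fields("My name is Zhang San, my gender is male, and my age is 25."))
# -> {'age': 25, 'gender': 'male', 'user_id': None}
```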
  • In an embodiment, after the recognition result is obtained by recognizing the current user's voice data through the pre-trained N-gram model and the unique user identity code corresponding to the user identification code field is obtained, the method further includes:
  • obtaining, according to the unique user identity code, the user's true age value and true gender value corresponding to the user terminal, determining whether the value of the estimated age parameter is equal to the user's true age value, and determining whether the value of the gender parameter is equal to the user's true gender value;
  • and, if not, storing the current user classification result and the current user voice data in a first storage area created in advance.
  • That is, the user's real age and gender can be obtained through the unique user identity code.
  • The Gaussian mixture model has been used to classify the current user's voice data, giving the current user classification result that includes the gender parameter and the estimated age parameter.
  • The value of the estimated age parameter is then compared with the user's true age value, and the value of the gender parameter with the user's true gender value, to determine whether each pair is equal. Through these comparisons it can be judged whether the classification of the current user's voice data by the Gaussian mixture model is correct.
  • If the value of the estimated age parameter is not equal to the user's true age value, or the value of the gender parameter is not equal to the user's true gender value, the value of the gender parameter and/or the value of the estimated age parameter in the current user classification result is inaccurate.
  • In that case the current voice reply data corresponding to the current user classification result is not suitable for the current user, so the inaccurate current user classification result
  • and the current user voice data are stored in the first storage area created in advance.
  • In this way, the data for which the intelligent gender and age recognition was inaccurate are recorded in the customer's history, to facilitate subsequent improvement of the Gaussian mixture model.
  • If the value of the estimated age parameter is equal to the user's true age value and the value of the gender parameter is equal to the user's true gender value, the values of the gender parameter and the estimated age parameter in the current user classification result are both accurate. In this case the current voice reply data corresponding to the current user classification result is suitable for the current user, and no adjustment is needed.
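
The verification step then reduces to comparing the classification result against the true values retrieved via the unique user identity code and storing mismatches. A minimal sketch with an in-memory list standing in for the "first storage area"; since the classifier outputs age buckets, "equal" for the age parameter is interpreted here as the true age falling inside the predicted bucket.

```python
first_storage_area = []  # in-memory stand-in for the pre-created first storage area

def verify_and_store(classification, voice_data, true_age, true_gender):
    """Compare the estimated gender/age against the user's true values and
    store inaccurate cases for later Gaussian-mixture-model improvement."""
    est_gender, est_age_bucket = classification
    low, high = (int(v) for v in est_age_bucket.split("-"))
    accurate = est_gender == true_gender and low <= true_age <= high
    if not accurate:
        first_storage_area.append({"classification": classification,
                                   "voice_data": voice_data})
    return accurate
```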
  • This method comprehensively considers the influence of the short-time average amplitude, the mel-frequency cepstral coefficients, and their first-order difference on gender recognition, and achieves accurate recognition of gender and age from the user's voice.
  • An embodiment of the present application also provides a voice-based user gender and age recognition device, which is used to perform any of the foregoing voice-based user gender and age recognition methods.
  • FIG. 5 is a schematic block diagram of a voice-based user gender and age recognition device provided in an embodiment of the present application.
  • The voice-based user gender and age recognition device 100 can be configured in a server.
  • The voice-based user gender and age recognition device 100 includes a voice data receiving unit 110, a voice preprocessing unit 120, a mixed parameter sequence acquiring unit 130, a user classification unit 140, and a reply data sending unit 150.
  • The voice data receiving unit 110 is used to receive the current user voice data sent by the user terminal.
  • When the intelligent voice system deployed in the server needs to recognize the user's gender and age from voice, it first needs to receive the current user voice data uploaded by the user terminal, so as to perform the subsequent voice preprocessing and classification recognition process.
  • The voice preprocessing unit 120 is configured to preprocess the current user voice data to obtain preprocessed voice data.
  • Specifically, the current user voice data (denoted s(t)) is first sampled with a sampling period T and discretized as s(n).
  • The sampling period should be chosen according to the bandwidth of the current user's voice data (per the Nyquist sampling theorem) to avoid aliasing and distortion of the signal in the frequency domain; the quantization of the discrete speech signal inevitably introduces a certain amount of quantization noise and distortion.
  • The preprocessing of the voice includes the steps of pre-emphasis, windowing, and framing.
  • In an embodiment, the voice preprocessing unit 120 includes:
  • a voice data sampling unit 121, configured to call a pre-stored sampling period to sample the current user voice data to obtain a current discrete voice signal;
  • a pre-emphasis unit 122, configured to call a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal to obtain a current pre-emphasized voice signal;
  • a windowing unit 123, configured to call a pre-stored Hamming window to window the current pre-emphasized voice signal to obtain windowed voice data;
  • a framing unit 124, configured to call the pre-stored frame shift and frame length to divide the windowed voice data into frames to obtain the preprocessed voice data.
  • That is, the current user voice data s(t) is sampled with the sampling period T and discretized as s(n).
  • The first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter whose transfer function is the above formula (1).
  • If the sampling value of the current discrete voice signal at time n is x(n), the pre-emphasized signal is obtained as in the above formula (2).
  • Let the time-domain signal corresponding to the windowed voice data be x(l).
  • The windowed voice data is then divided into frames.
  • The nth frame of voice data in the preprocessed voice data is xn(m), and xn(m) satisfies the above formula (3).
  • The mixed parameter sequence acquiring unit 130 is configured to extract the short-time average amplitude of each frame of voice data in the preprocessed voice data, and to perform feature extraction of the mel-frequency cepstral coefficients and the first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, obtaining the mixed parameter features corresponding to each frame of voice data in the preprocessed voice data so as to form the mixed parameter feature time series.
  • For each frame of voice data in the preprocessed voice data, the short-time average amplitude, the mel-frequency cepstral coefficients, and the first-order difference of the mel-frequency cepstral coefficients are extracted, and the extracted parameters together constitute the mixed parameter feature corresponding to that frame, so as to form the mixed parameter feature time series.
  • In this way, the important parameters extracted from the preprocessed voice data are obtained, and combining these important parameters allows user types to be classified more accurately (mainly by age and gender).
  • Specifically, the short-time average amplitude of the nth frame of voice data in the preprocessed voice data is calculated according to the formula given above,
  • where Mn represents the short-time average amplitude of the nth frame of voice data in the preprocessed voice data, the nth frame of voice data is xn(m), 0 ≤ m ≤ N-1, and N is the frame length.
  • In an embodiment, the mixed parameter sequence acquiring unit 130 includes:
  • a Fourier transform unit 131, configured to perform Fourier transform on the preprocessed voice data in sequence to obtain frequency-domain voice data;
  • an absolute value acquiring unit 132, configured to take the absolute value of the frequency-domain voice data to obtain absolute-valued voice data;
  • a mel filtering unit 133, configured to pass the absolute-valued voice data through mel filtering to obtain mel-filtered voice data;
  • a mel-frequency cepstral coefficient acquiring unit 134, configured to sequentially perform a logarithm operation and a discrete cosine transform on the mel-filtered voice data to obtain the mel-frequency cepstral coefficients corresponding to the preprocessed voice data;
  • a first-order difference acquiring unit 135, configured to obtain the difference between consecutive adjacent terms of the mel-frequency cepstral coefficients to obtain the first-order difference of the mel-frequency cepstral coefficients.
  • Because the preprocessed voice data is a time-domain signal whose characteristics are hard to observe directly, each frame is converted to the frequency domain.
  • This is done with the discrete Fourier transform (DFT) or, where applicable, the fast Fourier transform (FFT).
  • For an N-point signal, if N is a power of 2 the FFT can be used to speed up the computation; otherwise only the DFT can be used, and the computation slows as the number of points increases. Therefore, when framing, the number of points per frame should be chosen as a power of 2.
  • The absolute-valued voice data is then filtered by the mel filter bank to obtain the mel-filtered voice data.
  • The mel filter bank is a set of triangular band-pass filters spaced uniformly on the mel scale, as described above.
  • The discrete cosine transform (DCT) completes the cepstral analysis: the time-domain signal is transformed to the frequency domain, the logarithm is taken, and the DCT is performed to obtain the cepstral coefficients. With the mel filter bank applied after the frequency-domain stage, the final result is the mel-frequency cepstral coefficients (MFCC).
  • The first-order difference is the difference between consecutive adjacent terms of a discrete sequence; here it is taken over the mel-frequency cepstral coefficients.
  • Each frame of voice data in the preprocessed voice data thus yields the three characteristic parameters above (short-time average amplitude, mel-frequency cepstral coefficient, and first-order difference of the mel-frequency cepstral coefficient); that is, each frame corresponds to a 1×3 row vector.
  • If the preprocessed voice data includes M frames of voice data,
  • the 1×3 row vectors corresponding to the frames are concatenated in time order to obtain a 1×3M row vector,
  • and the 1×3M row vector is the mixed parameter feature time series corresponding to the preprocessed voice data.
  • In other embodiments, each frame of voice data can additionally yield the fundamental frequency, the speech rate, and the sound pressure level, so as to form a mixed parameter feature time series with more parameter dimensions.
  • The user classification unit 140 is configured to input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter.
  • When the Gaussian mixture model is pre-trained, several sub-Gaussian mixture models need to be trained separately: for example, a first sub-Gaussian mixture model for identifying men aged 18-20, a second for men aged 21-30, a third for men aged 31-40, a fourth for men aged 41-50, a fifth for men aged 51-70, a sixth for identifying women aged 18-20, a seventh for women aged 21-30, an eighth for women aged 31-40, a ninth for women aged 41-50, and a tenth for women aged 51-70.
  • The Gaussian mixture model (GMM) refers to the probability distribution model of the above formula (4).
  • In an embodiment, the Gaussian mixture model of the user classification unit 140 includes a plurality of sub-Gaussian mixture models, one of which is denoted as the first sub-Gaussian mixture model:
  • a recognition model for recognizing males aged 18-20. Taking the training of the first sub-Gaussian mixture model for recognizing males aged 18-20 as an example, the voice-based user gender and age recognition device 100 further includes:
  • a first sample acquiring unit, configured to acquire first sample data, where the first sample data are mixed parameter feature time series corresponding to voice data of multiple 18-20-year-old males;
  • a first sub-model training unit, configured to train the first sub-Gaussian mixture model to be trained using the first sample data, to obtain the first sub-Gaussian mixture model for identifying males aged 18-20;
  • a sub-model on-chain unit, configured to store the trained first sub-Gaussian mixture model to the blockchain network.
  • The process of training the first sub-Gaussian mixture model to be trained is to input multiple sets of mixed parameter feature time series and to solve for the parameters of the model through the EM algorithm (the expectation-maximization algorithm), thereby obtaining the first sub-Gaussian mixture model.
  • The reply data sending unit 150 is configured to invoke a pre-stored voice reply strategy, obtain the current voice reply data corresponding to the current user classification result in the voice reply strategy, and send the current voice reply data to the user terminal.
  • The voice reply strategy stored in the server includes multiple pieces of voice style template data, each corresponding to one piece of voice reply data; in each voice style template, the speaker's gender, the speaker's style, and the speech flow are all preset.
  • For example, if the current user classification result is an 18-20-year-old male,
  • the current voice reply data corresponding to that classification result is a sweet female style with a lively speech flow. That is, when a male customer is recognized, the system automatically plays a recording by a sweet-voiced female agent and addresses the customer as "Mr." during the conversation to increase politeness.
  • When a female customer answers the phone, it automatically plays a recording by a male agent with a deep voice and addresses her as "Ms." to show politeness. A relaxed and lively speech flow is used for young customers, and a mature, steady speech flow for older customers.
  • In an embodiment, the voice-based user gender and age recognition device 100 further includes:
  • a unique identity code acquiring unit, configured to recognize the current user's voice data through a pre-trained N-gram model to obtain a recognition result, and to obtain the unique user identity code corresponding to the user identification code field in the recognition result.
  • The current user's voice data is recognized through the N-gram model (an n-gram language model), and the recognition output is a whole sentence, for example: "My name is Zhang San, my gender is male, and my age is 25. Business A needs to be handled today."
  • The N-gram model can effectively recognize the current user's voice data and take the sentence with the highest recognition probability as the recognition result.
  • Since the current user's voice data has by now been converted into the text of the recognition result, several key strings can be located in the recognition result to obtain the user age value and user gender value corresponding to the user age field and the user gender field. At the same time, the unique user identity code corresponding to the user identification code field in the recognition result can also be obtained; the unique user identity code is preferably the user's ID number.
  • In an embodiment, the voice-based user gender and age recognition device 100 further includes:
  • a gender and age comparison unit, configured to obtain, according to the unique user identity code, the user's true age value and true gender value corresponding to the user terminal, to determine whether the value of the estimated age parameter is equal to the user's true age value, and to determine whether the value of the gender parameter is equal to the user's true gender value;
  • an error data storage unit, configured to store the current user classification result and the current user voice data in a first storage area created in advance if the value of the estimated age parameter is not equal to the user's true age value or the value of the gender parameter is not equal to the user's true gender value.
  • That is, the user's real age and gender can be obtained through the unique user identity code.
  • The Gaussian mixture model has been used to classify the current user's voice data, giving the current user classification result that includes the gender parameter and the estimated age parameter.
  • The value of the estimated age parameter is then compared with the user's true age value, and the value of the gender parameter with the user's true gender value, to determine whether each pair is equal. Through these comparisons it can be judged whether the classification of the current user's voice data by the Gaussian mixture model is correct.
  • If the value of the estimated age parameter is not equal to the user's true age value, or the value of the gender parameter is not equal to the user's true gender value, the value of the gender parameter and/or the value of the estimated age parameter in the current user classification result is inaccurate.
  • In that case the current voice reply data corresponding to the current user classification result is not suitable for the current user, so the inaccurate current user classification result and the current user voice data are stored in the first storage area created in advance.
  • In this way, the data for which the intelligent gender and age recognition was inaccurate are recorded in the customer's history, to facilitate subsequent improvement of the Gaussian mixture model.
  • If the value of the estimated age parameter is equal to the user's true age value and the value of the gender parameter is equal to the user's true gender value, the values of the gender parameter and the estimated age parameter in the current user classification result are both accurate. In this case the current voice reply data corresponding to the current user classification result is suitable for the current user, and no adjustment is needed.
  • The device comprehensively considers the influence of the short-time average amplitude, the mel-frequency cepstral coefficients, and their first-order difference on gender recognition, and achieves accurate recognition of gender and age from the user's voice.
  • The above voice-based user gender and age recognition device can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 8.
  • FIG. 8 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • The computer device 500 is a server; the server may be an independent server or a server cluster composed of multiple servers.
  • The computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • When the computer program 5032 is executed, it can cause the processor 502 to execute the voice-based user gender and age identification method.
  • The processor 502 is used to provide computing and control capabilities and supports the operation of the entire computer device 500.
  • The internal memory 504 provides an environment for the running of the computer program 5032 in the non-volatile storage medium 503.
  • When the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the voice-based user gender and age identification method.
  • The network interface 505 is used for network communication, such as providing transmission of data information.
  • The structure shown in FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • The processor 502 is configured to run the computer program 5032 stored in the memory to implement the voice-based user gender and age identification method disclosed in the embodiments of the present application.
  • The embodiment of the computer device shown in FIG. 8 does not constitute a limitation on the specific configuration of the computer device.
  • The computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • The computer device may include only a memory and a processor; in such an embodiment the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 8 and are not repeated here.
  • The processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
  • In another embodiment of the present application, a computer-readable storage medium is provided.
  • The computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium stores a computer program which, when executed by a processor, implements the voice-based user gender and age identification method disclosed in the embodiments of the present application.
  • The disclosed equipment, apparatus, and method may be implemented in other ways.
  • The device embodiments described above are only illustrative.
  • The division of the units is only a logical function division; in actual implementation there may be other division methods, or units with the same function may be combined into one unit. For example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • The mutual coupling, direct coupling, or communication connection displayed or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may also be electrical, mechanical, or other forms of connection.
  • The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • The functional units in the various embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • The technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), magnetic disks, optical discs, and other media that can store program code.

Abstract

The present application relates to the technical field of voice classification in artificial intelligence, and provides a voice-based user gender and age recognition method and apparatus, a computer device, and a storage medium. The method comprises: preprocessing received current user voice data transmitted by a user side to obtain preprocessed voice data; performing feature extraction in terms of short-time average amplitude, Mel-frequency cepstral coefficient, and Mel-frequency cepstral coefficient first-order difference on each frame of voice data to obtain corresponding mixed parameter features, so as to form a mixed parameter feature time sequence; inputting the mixed parameter feature time sequence into a Gaussian mixture model to obtain a corresponding current user classification result; and calling a voice response strategy, obtaining corresponding current voice response data, and sending the current voice response data to the user side. Accurate gender and age recognition based on user voice is implemented.

Description

基于语音的用户性别年龄识别方法、装置、计算机设备及存储介质Voice-based user gender and age recognition method, device, computer equipment and storage medium
本申请要求于2020年4月27日提交中国专利局、申请号为202010345904.3,发明名称为“基于语音的用户性别年龄识别方法、装置及计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on April 27, 2020, the application number is 202010345904.3, and the invention title is "Voice-based user gender and age recognition method, device and computer equipment". The entire content of the Chinese patent application is approved The reference is incorporated in this application.
技术领域Technical field
本申请涉及人工智能中的语音分类技术领域,尤其涉及一种基于语音的用户性别年龄识别方法、装置、计算机设备及存储介质。This application relates to the technical field of voice classification in artificial intelligence, and in particular to a voice-based user gender and age recognition method, device, computer equipment, and storage medium.
背景技术Background technique
目前,发明人意识到,智能电话外呼系统在自动根据待外呼用户清单中的用户信息对各用户进行电话外呼时,均是根据用户信息中的年龄和性别来确定外呼坐席声音的类型和外呼流程。At present, the inventor realizes that when the smart phone outbound call system automatically makes outbound calls to each user according to the user information in the list of outbound users, it determines the outbound agent voice based on the age and gender in the user information. Type and outbound process.
例如根据用户信息获知该用户为中年男性时,则智能电话外呼系统则调用女性坐席录音以实现外呼。但是若发生接电话的用户不是本人时,导致性别播报准确率较低。For example, when it is known that the user is a middle-aged male according to the user information, the smart phone outbound call system calls the female agent to record to realize the outbound call. However, if the user who answers the phone is not himself, the accuracy of gender broadcast is low.
发明内容Summary of the invention
本申请实施例提供了一种基于语音的用户性别年龄识别方法、装置、计算机设备及存储介质,旨在解决现有技术智能电话外呼系统在自动根据待外呼用户清单中的用户信息对各用户进行电话外呼时,若接电话的用户不是本人,易导致性别播报准确率较低的问题。The embodiments of the present application provide a voice-based user gender and age identification method, device, computer equipment and storage medium, which are intended to solve the problem that the prior art smart phone out-call system automatically performs a check on each user based on the user information in the list of outbound users. When a user makes an outbound call, if the user who answers the call is not himself, it is easy to cause the problem of low accuracy of gender broadcast.
第一方面,本申请实施例提供了一种基于语音的用户性别年龄识别方法,其包括:In the first aspect, an embodiment of the present application provides a voice-based user gender and age identification method, which includes:
接收用户端发送的当前用户语音数据;Receive the current user voice data sent by the user terminal;
将所述当前用户语音数据进行预处理,得到预处理后语音数据;Preprocessing the current user voice data to obtain preprocessed voice data;
将所述预处理后语音数据中每一帧语音数据进行短时平均幅度的提取、并将每一帧语音数据进行梅尔频率倒谱系数、及梅尔频率倒谱系数一阶差分的特征提取,得到与所述预处理后语音数据中每一帧语音数据对应的混合参数特征,以组成混合参数特征时间序列;Perform short-term average amplitude extraction for each frame of voice data in the preprocessed voice data, and perform feature extraction of Mel frequency cepstral coefficient and Mel frequency cepstral coefficient first difference for each frame of voice data To obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data to form a time series of mixed parameter features;
将所述混合参数特征时间序列输入至预先训练的高斯混合模型,得到与所述当前用户语音数据对应的当前用户分类结果;其中,所述当前用户分类结果包括性别参数和预估年龄参数;以及Input the characteristic time sequence of the mixture parameters into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data; wherein the current user classification result includes a gender parameter and an estimated age parameter; and
调用预先存储的语音回复策略,获取在所述语音回复策略中与当前用户分类结果对应的当前语音回复数据,将所述当前语音回复数据发送至用户端。Invoke a pre-stored voice reply strategy, obtain current voice reply data corresponding to the current user classification result in the voice reply strategy, and send the current voice reply data to the user terminal.
第二方面,本申请实施例提供了一种基于语音的用户性别年龄识别装置,其包括:In the second aspect, an embodiment of the present application provides a voice-based user gender and age recognition device, which includes:
语音数据接收单元,用于接收用户端发送的当前用户语音数据;The voice data receiving unit is used to receive the current user voice data sent by the user terminal;
语音预处理单元,用于将所述当前用户语音数据进行预处理,得到预处理后语音数据;A voice preprocessing unit, configured to preprocess the current user voice data to obtain preprocessed voice data;
混合参数序列获取单元,用于将所述预处理后语音数据中每一帧语音数据进行短时平均幅度的提取、并将每一帧语音数据进行梅尔频率倒谱系数、及梅尔频率倒谱系数一阶差分的特征提取,得到与所述预处理后语音数据中每一帧语音数据对应的混合参数特征,以组成混合参数特征时间序列;The mixing parameter sequence acquisition unit is used to extract the short-term average amplitude of each frame of speech data in the preprocessed speech data, and perform the Mel frequency cepstrum coefficient and Mel frequency inversion of each frame of speech data. Feature extraction of the first-order difference of spectral coefficients to obtain mixed parameter characteristics corresponding to each frame of speech data in the preprocessed speech data to form a mixed parameter characteristic time series;
用户分类单元,用于将所述混合参数特征时间序列输入至预先训练的高斯混合模型,得到与所述当前用户语音数据对应的当前用户分类结果;其中,所述当前用户分类结果包括性别参数和预估年龄参数;以及The user classification unit is used to input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data; wherein, the current user classification result includes gender parameters and Estimated age parameters; and
回复数据发送单元,用于调用预先存储的语音回复策略,获取在所述语音回复策略中与当前用户分类结果对应的当前语音回复数据,将所述当前语音回复数据发送至用户端。The reply data sending unit is configured to call a pre-stored voice reply strategy, obtain the current voice reply data corresponding to the current user classification result in the voice reply strategy, and send the current voice reply data to the user terminal.
第三方面,本申请实施例又提供了一种计算机设备,其包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现 以下步骤:In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and running on the processor, and the processor executes the computer The following steps are implemented during the program:
接收用户端发送的当前用户语音数据;Receive the current user voice data sent by the user terminal;
将所述当前用户语音数据进行预处理,得到预处理后语音数据;Preprocessing the current user voice data to obtain preprocessed voice data;
将所述预处理后语音数据中每一帧语音数据进行短时平均幅度的提取、并将每一帧语音数据进行梅尔频率倒谱系数、及梅尔频率倒谱系数一阶差分的特征提取,得到与所述预处理后语音数据中每一帧语音数据对应的混合参数特征,以组成混合参数特征时间序列;Perform short-term average amplitude extraction for each frame of voice data in the preprocessed voice data, and perform feature extraction of Mel frequency cepstral coefficient and Mel frequency cepstral coefficient first difference for each frame of voice data To obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data to form a time series of mixed parameter features;
将所述混合参数特征时间序列输入至预先训练的高斯混合模型,得到与所述当前用户语音数据对应的当前用户分类结果;其中,所述当前用户分类结果包括性别参数和预估年龄参数;以及Input the characteristic time sequence of the mixture parameters into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data; wherein the current user classification result includes a gender parameter and an estimated age parameter; and
调用预先存储的语音回复策略,获取在所述语音回复策略中与当前用户分类结果对应的当前语音回复数据,将所述当前语音回复数据发送至用户端。Invoke a pre-stored voice reply strategy, obtain current voice reply data corresponding to the current user classification result in the voice reply strategy, and send the current voice reply data to the user terminal.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the following operations:
receiving current user voice data sent by a user terminal;
preprocessing the current user voice data to obtain preprocessed voice data;
extracting a short-time average amplitude from each frame of voice data in the preprocessed voice data, and extracting Mel-frequency cepstral coefficients and the first-order difference of the Mel-frequency cepstral coefficients from each frame, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and to form a mixed parameter feature time series;
inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, where the current user classification result includes a gender parameter and an estimated age parameter; and
invoking a pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal.
The embodiments of the present application provide a voice-based user gender and age recognition method and apparatus, a computer device, and a storage medium. The method includes: receiving current user voice data sent by a user terminal; preprocessing the current user voice data to obtain preprocessed voice data; extracting a short-time average amplitude from each frame of voice data in the preprocessed voice data, and extracting Mel-frequency cepstral coefficients and the first-order difference of the Mel-frequency cepstral coefficients from each frame, to obtain a mixed parameter feature corresponding to each frame and form a mixed parameter feature time series; inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, where the current user classification result includes a gender parameter and an estimated age parameter; and invoking a pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal. The method takes into account the combined influence of the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference on gender recognition, and thereby achieves accurate recognition of gender and age from the user's voice.
Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the voice-based user gender and age recognition method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the voice-based user gender and age recognition method provided by an embodiment of the present application;
FIG. 3 is a schematic sub-flowchart of the voice-based user gender and age recognition method provided by an embodiment of the present application;
FIG. 4 is another schematic sub-flowchart of the voice-based user gender and age recognition method provided by an embodiment of the present application;
FIG. 5 is a schematic block diagram of the voice-based user gender and age recognition apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic block diagram of subunits of the voice-based user gender and age recognition apparatus provided by an embodiment of the present application;
FIG. 7 is another schematic block diagram of subunits of the voice-based user gender and age recognition apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic block diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or combinations thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the voice-based user gender and age recognition method provided by an embodiment of the present application, and FIG. 2 is a schematic flowchart of that method. The method is applied in a server and executed by application software installed in the server.
As shown in FIG. 2, the method includes steps S110 to S150.
S110: Receive current user voice data sent by a user terminal.
In this embodiment, when the intelligent voice system deployed in the server needs to recognize gender and age from a user's voice, it first receives the current user voice data uploaded by the user terminal, so that the subsequent voice preprocessing and classification can be performed.
S120: Preprocess the current user voice data to obtain preprocessed voice data.
In this embodiment, because the actual voice signal (for example, the current user voice data collected in the present application) is an analog signal, before the voice signal is processed digitally, the current user voice data, denoted s(t), must first be sampled with a sampling period T and discretized into s(n). The sampling period should be chosen according to the bandwidth of the current user voice data (per the Nyquist sampling theorem) to avoid aliasing distortion in the frequency domain. Quantizing the discretized voice signal introduces a certain amount of quantization noise and distortion. Once the initial current user voice data is available, its preprocessing includes steps such as pre-emphasis, windowing, and framing.
In an embodiment, as shown in FIG. 3, step S120 includes:
S121: Sampling the current user voice data with a pre-stored sampling period to obtain a current discrete voice signal;
S122: Invoking a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal, to obtain a current pre-emphasized voice signal;
S123: Invoking a pre-stored Hamming window to window the current pre-emphasized voice signal, to obtain windowed voice data;
S124: Invoking a pre-stored frame shift and frame length to divide the windowed voice data into frames, to obtain the preprocessed voice data.
In this embodiment, before the voice signal is processed digitally, the current user voice data s(t) is first sampled with the sampling period T and discretized into s(n).
Then, when the pre-stored first-order FIR high-pass digital filter is invoked, this filter is a first-order non-recursive high-pass digital filter whose transfer function is given by equation (1):
H(z) = 1 − a·z^(−1)    (1)
In a specific implementation, a takes the value 0.98. For example, if the sample of the current discrete voice signal at time n is x(n), the corresponding sample of the current pre-emphasized voice signal after pre-emphasis is y(n) = x(n) − a·x(n−1).
The Hamming window invoked thereafter is given by equation (2):
ω(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1    (2)
Windowing the current pre-emphasized voice signal with the Hamming window yields windowed voice data that can be expressed as Q(n) = y(n)·ω(n).
Finally, when the pre-stored frame shift and frame length are invoked to divide the windowed voice data into frames, let the time-domain signal corresponding to the windowed voice data be x(l); the n-th frame of the resulting preprocessed voice data is x_n(m), and x_n(m) satisfies equation (3):
x_n(m) = ω(n)·x(n + m), 0 ≤ m ≤ N−1    (3)
where n = 0, 1T, 2T, ..., N is the frame length, T is the frame shift, and ω(n) is the Hamming window function.
Preprocessing the current user voice data in this way makes it directly usable for the subsequent extraction of acoustic parameters.
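By way of a non-limiting illustration, the following sketch implements the preprocessing chain of steps S121 to S124 in Python with NumPy; the frame length and frame shift values shown are assumptions for illustration, not values prescribed by the present application:

    import numpy as np

    def preprocess(signal, frame_len=256, frame_shift=128, a=0.98):
        # S122: pre-emphasis per equation (1), y(n) = x(n) - a*x(n-1)
        y = np.append(signal[0], signal[1:] - a * signal[:-1])
        # S123: Hamming window per equation (2)
        n = np.arange(frame_len)
        w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
        # S124: split into overlapping frames of length N with shift T, window each frame
        num_frames = 1 + (len(y) - frame_len) // frame_shift
        frames = np.stack([y[i * frame_shift: i * frame_shift + frame_len]
                           for i in range(num_frames)])
        return frames * w  # each row is one windowed frame x_n(m)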
S130: Extract a short-time average amplitude from each frame of voice data in the preprocessed voice data, and extract Mel-frequency cepstral coefficients and the first-order difference of the Mel-frequency cepstral coefficients from each frame, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and form a mixed parameter feature time series.
In this embodiment, when the important parameters are extracted from the preprocessed voice data, what is generally extracted is the short-time average amplitude, the Mel-frequency cepstral coefficients, and the first-order difference of the Mel-frequency cepstral coefficients; the extracted parameters then make up a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data, so as to form a mixed parameter feature time series. In this way, the important parameters extracted from the preprocessed voice data are obtained, and combining these parameters allows a more accurate classification of user type (mainly by age and gender).
When the short-time average amplitude of each frame of voice data in the preprocessed voice data is extracted, it is specifically computed as
M_n = Σ_{m=0}^{N−1} |x_n(m)|
where M_n denotes the short-time average amplitude of the n-th frame of the preprocessed voice data, x_n(m) is the n-th frame of the preprocessed voice data, 0 ≤ m ≤ N−1, and N is the frame length.
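Under the same assumptions as the preprocessing sketch above, the short-time average amplitude of every frame can be computed directly; `frames` is assumed to be the frame matrix returned by that sketch:

    import numpy as np

    def short_time_average_amplitude(frames):
        # M_n = sum over m of |x_n(m)| for each frame n
        return np.abs(frames).sum(axis=1)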
In an embodiment, as shown in FIG. 4, step S130 includes:
S131: Performing a Fourier transform on the preprocessed voice data to obtain frequency-domain voice data;
S132: Taking the absolute value of the frequency-domain voice data to obtain absolute-value voice data;
S133: Passing the absolute-value voice data through Mel filtering to obtain Mel-filtered voice data;
S134: Performing a logarithm operation followed by a discrete cosine transform on the Mel-filtered voice data, to obtain the Mel-frequency cepstral coefficients corresponding to the preprocessed voice data;
S135: Taking the difference between consecutive adjacent terms of the Mel-frequency cepstral coefficients, to obtain the first-order difference of the Mel-frequency cepstral coefficients.
In this embodiment, since the preprocessed voice data is a time-domain voice signal, mapping it onto linear frequencies requires the DFT (discrete Fourier transform) or the FFT (fast Fourier transform) to convert from the time domain to the frequency domain. For an N-point signal, if N/2 is an integer, the FFT can be used to speed up the processing; if N/2 is not an integer, only the DFT can be used, and the algorithm slows down as the number of points increases. Therefore, during framing, the number of points per frame must be an integer multiple of 2.
Since the output of the FFT is complex, with a real part and an imaginary part, taking its absolute value yields the modulus of the complex number and discards the phase. The modulus reflects the amplitude of the sound, and the amplitude carries the useful information; the human ear is not sensitive to the phase of a sound, so the phase can be ignored.
The absolute-value voice data is passed through a Mel filter bank to obtain the Mel-filtered voice data. The specific parameters of the Mel filter bank are as follows: the sampling rate is fs = 8000 Hz; the lowest frequency of the filter range is fl = 0 and the highest is fh = fs/2 = 8000/2 = 4000; the number of filters is M = 24; and the FFT length is N = 256. Passing the absolute-value voice data through Mel filtering applies Mel filtering to the linear frequencies, reflecting the auditory characteristics of the human ear.
When the Mel-filtered voice data undergoes the logarithm operation followed by the discrete cosine transform, the discrete cosine transform is the DCT: the time-domain signal is transformed to the frequency domain, the logarithm is taken, and the DCT is applied, yielding cepstral coefficients. If Mel filtering is inserted after the frequency-domain step, the result is the MFCC (Mel-frequency cepstral coefficients).
The first-order difference is the difference between consecutive adjacent terms of a discrete function. When the independent variable changes from x to x+1, the change in the function y = y(x) is Δy_x = y(x+1) − y(x), x = 0, 1, 2, ...; this is called the first-order difference of the function y(x) at the point x, written Δy_x = y_{x+1} − y_x, x = 0, 1, 2, ....
Since each frame of voice data in the preprocessed voice data yields the three feature parameters above (the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference), each frame corresponds to a 1×3 row vector. The preprocessed voice data contains M frames, and concatenating the 1×3 row vector of every frame in time order gives a 1×3M row vector, which is the mixed parameter feature time series corresponding to the preprocessed voice data.
In a specific implementation, in addition to the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference, three further parameters (fundamental frequency, speech rate, and sound pressure level) can also be obtained for each frame of the preprocessed voice data, forming a mixed parameter feature time series with more parameter dimensions.
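As a non-limiting sketch, the mixed parameter feature time series can be assembled as follows, here using the librosa library for the MFCC pipeline; the library choice is an assumption, and because MFCCs are normally a vector per frame, the sketch keeps only the first coefficient so that each frame yields the 1×3 layout described above:

    import numpy as np
    import librosa  # assumed available; any MFCC implementation would serve

    def mixed_feature_sequence(signal, sr=8000, n_fft=256, hop=128, n_mels=24):
        y = np.asarray(signal, dtype=float)
        # S131-S134: FFT -> absolute value -> Mel filter bank -> log -> DCT
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=n_fft, hop_length=hop, n_mels=n_mels)
        # S135: first-order difference of the MFCCs
        delta = librosa.feature.delta(mfcc, order=1)
        # short-time average amplitude M_n of each frame
        frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
        amp = np.abs(frames).sum(axis=0)
        m = min(mfcc.shape[1], amp.shape[0])
        per_frame = np.stack([amp[:m], mfcc[0, :m], delta[0, :m]], axis=1)  # M x 3
        return per_frame.reshape(-1)  # concatenated in time order: 1 x 3M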
S140: Input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, where the current user classification result includes a gender parameter and an estimated age parameter.
In this embodiment, pre-training the Gaussian mixture model requires training several sub-Gaussian mixture models separately, for example: a first sub-Gaussian mixture model for recognizing men aged 18-20, a second for men aged 21-30, a third for men aged 31-40, a fourth for men aged 41-50, a fifth for men aged 51-70, a sixth for women aged 18-20, a seventh for women aged 21-30, an eighth for women aged 31-40, a ninth for women aged 41-50, and a tenth for women aged 51-70.
A Gaussian mixture model (GMM) is a probability distribution model of the form of equation (4):
P(y|θ) = Σ_{k=1}^{K} α_k·φ(y|θ_k)    (4)
where α_k is a mixing coefficient with α_k ≥ 0 and Σ_{k=1}^{K} α_k = 1, and φ(y|θ_k) is the Gaussian density
φ(y|θ_k) = (1/(√(2π)·σ_k))·exp(−(y − μ_k)²/(2σ_k²))
where θ_k = (μ_k, σ_k²); the component φ(y|θ_k) is called the k-th sub-model.
In an embodiment, the Gaussian mixture model of step S140 includes a plurality of sub-Gaussian mixture models, one of which is denoted the first sub-Gaussian mixture model; the first sub-Gaussian mixture model is a recognition model for recognizing men aged 18-20. Taking the training of this first sub-model as an example, the method further includes, before step S140:
acquiring first sample data, where the first sample data consists of the mixed parameter feature time series corresponding to the voice data of a plurality of men aged 18-20;
training the first sub-Gaussian mixture model to be trained with the first sample data, to obtain the first sub-Gaussian mixture model for recognizing men aged 18-20; and
storing the trained first sub-Gaussian mixture model in a blockchain network.
A blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks produced in association with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include a blockchain underlying platform, a platform product service layer, and an application service layer.
In this embodiment, the way the mixed parameter feature time series corresponding to the voice data of men aged 18-20 is obtained for the first sample data can follow the specific process of steps S110 to S130 for obtaining the mixed parameter feature time series corresponding to the current user voice data. Training the first sub-Gaussian mixture model to be trained consists of inputting multiple sets of mixed parameter feature time series and solving for the parameters of the model with the EM algorithm (the expectation-maximization algorithm), thereby obtaining the first sub-Gaussian mixture model.
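As a non-limiting sketch of this training and classification scheme, scikit-learn's GaussianMixture (which fits by the EM algorithm) can stand in for each sub-model; the class labels, component count, and data layout below are illustrative assumptions:

    import numpy as np
    from sklearn.mixture import GaussianMixture  # fitted by the EM algorithm

    CLASSES = ["male_18_20", "male_21_30", "female_18_20"]  # illustrative subset

    def train_sub_models(samples_by_class, n_components=8):
        # samples_by_class: dict mapping a class label to an array whose rows
        # are mixed parameter feature time series of speakers in that class
        models = {}
        for label, feats in samples_by_class.items():
            gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
            gmm.fit(feats)  # EM solves for the alpha_k, mu_k, sigma_k parameters
            models[label] = gmm
        return models

    def classify(models, feature_row):
        # the sub-model giving the highest log-likelihood wins
        scores = {label: gmm.score(np.asarray(feature_row).reshape(1, -1))
                  for label, gmm in models.items()}
        return max(scores, key=scores.get)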
The trained first sub-Gaussian mixture model in the server can be stored on-chain in a blockchain network (preferably a private chain, so that the subsidiaries of an enterprise can use the private chain to invoke the first sub-Gaussian mixture model). Besides the first sub-Gaussian mixture model, the other sub-Gaussian mixture models of the Gaussian mixture model can likewise be stored on-chain. The parameter values included in every sub-Gaussian mixture model (such as the values corresponding to α_k and φ(y|θ_k)) are all stored in the blockchain network. In this process, the server is regarded as a blockchain node device of the blockchain network and has the authority to upload data to it. When the server needs to obtain the first sub-Gaussian mixture model from the blockchain network, whether the server has the authority of a blockchain node device is verified; if it does, the first sub-Gaussian mixture model is obtained, and a broadcast is made in the blockchain network to inform the blockchain node devices that the server has obtained the first sub-Gaussian mixture model.
S150: Invoke a pre-stored voice reply strategy, obtain from the voice reply strategy the current voice reply data corresponding to the current user classification result, and send the current voice reply data to the user terminal.
In this embodiment, the voice reply strategy stored in the server includes multiple kinds of voice style template data; each kind of voice style template data corresponds to one kind of voice reply data, and the speaker gender, speaker style, and scripted dialogue flow used by each voice style template are preset.
For example, if the current user classification result obtained is a man aged 18-20, the current voice reply data corresponding to that classification result in the voice reply strategy is a sweet-toned female voice with a lively dialogue flow. That is, when a male customer is recognized, a recording of a sweet-voiced female agent is invoked automatically and the customer is addressed as "Mr." in the dialogue flow, adding politeness; when a female customer answers the call, a recording of a deep male-voiced agent is invoked automatically and she is addressed as "Ms." out of courtesy. A relaxed and lively dialogue flow is invoked for young customers, and a mature and steady dialogue flow for older customers.
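A minimal sketch of such a lookup follows; the template names, age brackets, and honorifics are illustrative assumptions, since the real template data lives in the server's pre-stored voice reply strategy:

    REPLY_STRATEGY = {
        ("male", "18-20"):   {"agent_voice": "sweet_female", "flow": "lively", "honorific": "Mr."},
        ("female", "18-20"): {"agent_voice": "deep_male", "flow": "lively", "honorific": "Ms."},
        ("male", "51-70"):   {"agent_voice": "sweet_female", "flow": "steady", "honorific": "Mr."},
        ("female", "51-70"): {"agent_voice": "deep_male", "flow": "steady", "honorific": "Ms."},
    }

    def pick_reply(gender, age_bracket):
        # returns the voice reply template for the current user classification
        return REPLY_STRATEGY.get((gender, age_bracket))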
In an embodiment, after step S150 the method further includes:
recognizing the current user voice data with a pre-trained N-gram model to obtain a recognition result, and obtaining from the recognition result the unique user identity code corresponding to the user identity code field.
In this embodiment, the current user voice data is recognized by the N-gram model (a multi-gram language model), and the recognition yields a whole sentence, for example: "My name is Zhang San, gender male, age 25; today I need to handle business A." The N-gram model can recognize the current user voice data effectively, taking the sentence with the highest recognition probability as the recognition result.
Since the current user voice data has by now been converted into text (the recognition result), locating a few key strings in the recognition result is enough to obtain the user age value and user gender value corresponding to the user age field and user gender field respectively. The unique user identity code corresponding to the user identity code field can likewise be obtained from the recognition result; this unique user identity code is preferably the user's ID card number.
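By way of a non-limiting sketch, such a key string can be located in the recognized text with a regular expression; the 18-character pattern below assumes a mainland ID card number, the stated preference for the unique user identity code:

    import re

    ID_PATTERN = re.compile(r"\d{17}[\dXx]")  # 17 digits plus a digit or X check character

    def extract_identity_code(recognized_text):
        # locate the user identity code field inside the recognition result
        match = ID_PATTERN.search(recognized_text)
        return match.group(0) if match else None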
In an embodiment, after the current user voice data is recognized with the pre-trained N-gram model to obtain the recognition result and the unique user identity code corresponding to the user identity code field is obtained from it, the method further includes:
obtaining, according to the unique user identity code, the real age value and real gender value of the user corresponding to the user terminal, and judging whether the value of the estimated age parameter equals the user's real age value and whether the value of the gender parameter equals the user's real gender value; and
if the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, storing the current user classification result and the current user voice data in a pre-created first storage area.
In this embodiment, once the unique user identity code (that is, the user's ID card number) is obtained, the user's real age and gender can be obtained from it. Classifying the current user voice data with the Gaussian mixture model has meanwhile produced a current user classification result that includes the gender parameter and the estimated age parameter. The value of the estimated age parameter is then compared with the user's real age value, and the value of the gender parameter with the user's real gender value, to judge whether each pair is equal. This comparison determines whether the classification of the current user voice data by the Gaussian mixture model is correct.
If the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, the gender parameter and/or the estimated age parameter in the current user classification result is inaccurate; the current voice reply data obtained for that classification result is then unsuitable for the current user, so every inaccurately classified current user classification result, together with the corresponding current user voice data, is stored in the pre-created first storage area.
The data for which the intelligent gender and age recognition was inaccurate is recorded in the first storage area of the server as customer history, to facilitate subsequent improvement of the Gaussian mixture model.
If the value of the estimated age parameter equals the user's real age value and the value of the gender parameter equals the user's real gender value, both parameter values in the current user classification result are accurate; the current voice reply data obtained for that classification result suits the current user, and no adjustment of the current voice reply data for the current user classification result is needed.
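A minimal sketch of this verification step, under the assumption that the registered profile is reachable through the identity code and that the storage area behaves like a list, might look as follows:

    def verify_and_log(classification, real_age, real_gender, first_storage_area):
        # classification: dict with "estimated_age" and "gender" from the GMM
        correct = (classification["estimated_age"] == real_age
                   and classification["gender"] == real_gender)
        if not correct:
            # keep the inaccurate result (with a reference to the raw voice
            # data) as customer history for later model improvement
            first_storage_area.append(classification)
        return correct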
The method takes into account the combined influence of features such as the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference on gender recognition, and thereby achieves accurate recognition of gender and age from the user's voice.
An embodiment of the present application further provides a voice-based user gender and age recognition apparatus for performing any embodiment of the foregoing voice-based user gender and age recognition method. Specifically, please refer to FIG. 5, which is a schematic block diagram of the voice-based user gender and age recognition apparatus provided by an embodiment of the present application. The voice-based user gender and age recognition apparatus 100 can be configured in a server.
As shown in FIG. 5, the voice-based user gender and age recognition apparatus 100 includes: a voice data receiving unit 110, a voice preprocessing unit 120, a mixed parameter sequence acquisition unit 130, a user classification unit 140, and a reply data sending unit 150.
The voice data receiving unit 110 is configured to receive current user voice data sent by a user terminal.
In this embodiment, when the intelligent voice system deployed in the server needs to recognize gender and age from a user's voice, it first receives the current user voice data uploaded by the user terminal, so that the subsequent voice preprocessing and classification can be performed.
The voice preprocessing unit 120 is configured to preprocess the current user voice data to obtain preprocessed voice data.
In this embodiment, because the actual voice signal (for example, the current user voice data collected in the present application) is an analog signal, before the voice signal is processed digitally, the current user voice data s(t) must first be sampled with a sampling period T and discretized into s(n); the sampling period should be chosen according to the bandwidth of the current user voice data (per the Nyquist sampling theorem) to avoid aliasing distortion in the frequency domain. Quantizing the discretized voice signal introduces a certain amount of quantization noise and distortion. Once the initial current user voice data is available, its preprocessing includes steps such as pre-emphasis, windowing, and framing.
In an embodiment, as shown in FIG. 6, the voice preprocessing unit 120 includes:
a voice data sampling unit 121, configured to sample the current user voice data with a pre-stored sampling period to obtain a current discrete voice signal;
a pre-emphasis unit 122, configured to invoke a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal, to obtain a current pre-emphasized voice signal;
a windowing unit 123, configured to invoke a pre-stored Hamming window to window the current pre-emphasized voice signal, to obtain windowed voice data; and
a framing unit 124, configured to invoke a pre-stored frame shift and frame length to divide the windowed voice data into frames, to obtain the preprocessed voice data.
In this embodiment, before the voice signal is processed digitally, the current user voice data s(t) is first sampled with the sampling period T and discretized into s(n).
Then, when the pre-stored first-order FIR high-pass digital filter is invoked, this filter is a first-order non-recursive high-pass digital filter whose transfer function is given by equation (1) above.
For example, if the sample of the current discrete voice signal at time n is x(n), the corresponding sample of the current pre-emphasized voice signal after pre-emphasis is y(n) = x(n) − a·x(n−1).
Thereafter, the Hamming window invoked is the function of equation (2) above; windowing the current pre-emphasized voice signal with the Hamming window yields windowed voice data that can be expressed as Q(n) = y(n)·ω(n).
Finally, when the pre-stored frame shift and frame length are invoked to divide the windowed voice data into frames, let the time-domain signal corresponding to the windowed voice data be x(l); the n-th frame of the resulting preprocessed voice data is x_n(m), and x_n(m) satisfies equation (3) above. Preprocessing the current user voice data in this way makes it directly usable for the subsequent extraction of acoustic parameters.
The mixed parameter sequence acquisition unit 130 is configured to extract a short-time average amplitude from each frame of voice data in the preprocessed voice data, and to extract Mel-frequency cepstral coefficients and the first-order difference of the Mel-frequency cepstral coefficients from each frame, so as to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and to form a mixed parameter feature time series.
In this embodiment, when the important parameters are extracted from the preprocessed voice data, what is generally extracted is the short-time average amplitude, the Mel-frequency cepstral coefficients, and the first-order difference of the Mel-frequency cepstral coefficients; the extracted parameters then make up a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data, so as to form a mixed parameter feature time series. In this way, the important parameters extracted from the preprocessed voice data are obtained, and combining these parameters allows a more accurate classification of user type (mainly by age and gender).
When the short-time average amplitude of each frame of voice data in the preprocessed voice data is extracted, it is specifically computed as
M_n = Σ_{m=0}^{N−1} |x_n(m)|
where M_n denotes the short-time average amplitude of the n-th frame of the preprocessed voice data, x_n(m) is the n-th frame of the preprocessed voice data, 0 ≤ m ≤ N−1, and N is the frame length.
In an embodiment, as shown in FIG. 7, the mixed parameter sequence acquisition unit 130 includes:
a Fourier transform unit 131, configured to perform a Fourier transform on the preprocessed voice data to obtain frequency-domain voice data;
an absolute value unit 132, configured to take the absolute value of the frequency-domain voice data to obtain absolute-value voice data;
a Mel filtering unit 133, configured to pass the absolute-value voice data through Mel filtering to obtain Mel-filtered voice data;
a Mel-frequency cepstral coefficient acquisition unit 134, configured to perform a logarithm operation followed by a discrete cosine transform on the Mel-filtered voice data, to obtain the Mel-frequency cepstral coefficients corresponding to the preprocessed voice data; and
a first-order difference acquisition unit 135, configured to take the difference between consecutive adjacent terms of the Mel-frequency cepstral coefficients, to obtain the first-order difference of the Mel-frequency cepstral coefficients.
In this embodiment, since the preprocessed voice data is a time-domain voice signal, mapping it onto linear frequencies requires the DFT (discrete Fourier transform) or the FFT (fast Fourier transform) to convert from the time domain to the frequency domain. For an N-point signal, if N/2 is an integer, the FFT can be used to speed up the processing; if N/2 is not an integer, only the DFT can be used, and the algorithm slows down as the number of points increases. Therefore, during framing, the number of points per frame must be an integer multiple of 2.
Since the output of the FFT is complex, with a real part and an imaginary part, taking its absolute value yields the modulus of the complex number and discards the phase. The modulus reflects the amplitude of the sound, and the amplitude carries the useful information; the human ear is not sensitive to the phase of a sound, so the phase can be ignored.
The absolute-value voice data is passed through a Mel filter bank to obtain the Mel-filtered voice data. The specific parameters of the Mel filter bank are as follows: the sampling rate is fs = 8000 Hz; the lowest frequency of the filter range is fl = 0 and the highest is fh = fs/2 = 8000/2 = 4000; the number of filters is M = 24; and the FFT length is N = 256. Passing the absolute-value voice data through Mel filtering applies Mel filtering to the linear frequencies, reflecting the auditory characteristics of the human ear.
When the Mel-filtered voice data undergoes the logarithm operation followed by the discrete cosine transform, the discrete cosine transform is the DCT: the time-domain signal is transformed to the frequency domain, the logarithm is taken, and the DCT is applied, yielding cepstral coefficients. If Mel filtering is inserted after the frequency-domain step, the result is the MFCC (Mel-frequency cepstral coefficients).
The first-order difference is the difference between consecutive adjacent terms of a discrete function. When the independent variable changes from x to x+1, the change in the function y = y(x) is Δy_x = y(x+1) − y(x), x = 0, 1, 2, ...; this is called the first-order difference of the function y(x) at the point x, written Δy_x = y_{x+1} − y_x.
Since each frame of voice data in the preprocessed voice data yields the three feature parameters above (the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference), each frame corresponds to a 1×3 row vector. The preprocessed voice data contains M frames, and concatenating the 1×3 row vector of every frame in time order gives a 1×3M row vector, which is the mixed parameter feature time series corresponding to the preprocessed voice data.
In a specific implementation, in addition to the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference, three further parameters (fundamental frequency, speech rate, and sound pressure level) can also be obtained for each frame of the preprocessed voice data, forming a mixed parameter feature time series with more parameter dimensions.
The user classification unit 140 is configured to input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, where the current user classification result includes a gender parameter and an estimated age parameter.
In this embodiment, pre-training the Gaussian mixture model requires training several sub-Gaussian mixture models separately, for example: a first sub-Gaussian mixture model for recognizing men aged 18-20, a second for men aged 21-30, a third for men aged 31-40, a fourth for men aged 41-50, a fifth for men aged 51-70, a sixth for women aged 18-20, a seventh for women aged 21-30, an eighth for women aged 31-40, a ninth for women aged 41-50, and a tenth for women aged 51-70.
A Gaussian mixture model (GMM) is a probability distribution model of the form of equation (4) above.
In an embodiment, the Gaussian mixture model of the user classification unit 140 includes a plurality of sub-Gaussian mixture models, one of which is denoted the first sub-Gaussian mixture model; the first sub-Gaussian mixture model is a recognition model for recognizing men aged 18-20. Taking the training of this first sub-model as an example, the voice-based user gender and age recognition apparatus 100 further includes:
a first sample acquisition unit, configured to acquire first sample data, where the first sample data consists of the mixed parameter feature time series corresponding to the voice data of a plurality of men aged 18-20;
a first sub-model training unit, configured to train the first sub-Gaussian mixture model to be trained with the first sample data, to obtain the first sub-Gaussian mixture model for recognizing men aged 18-20; and
a sub-model on-chain unit, configured to store the trained first sub-Gaussian mixture model in a blockchain network.
In this embodiment, the way the mixed parameter feature time series corresponding to the voice data of men aged 18-20 is obtained for the first sample data can follow the specific process of obtaining the mixed parameter feature time series corresponding to the current user voice data. Training the first sub-Gaussian mixture model to be trained consists of inputting multiple sets of mixed parameter feature time series and solving for the parameters of the model with the EM algorithm (the expectation-maximization algorithm), thereby obtaining the first sub-Gaussian mixture model.
The reply data sending unit 150 is configured to invoke a pre-stored voice reply strategy, obtain from the voice reply strategy the current voice reply data corresponding to the current user classification result, and send the current voice reply data to the user terminal.
In this embodiment, the voice reply strategy stored in the server includes multiple kinds of voice style template data; each kind of voice style template data corresponds to one kind of voice reply data, and the speaker gender, speaker style, and scripted dialogue flow used by each voice style template are preset.
For example, if the current user classification result obtained is a man aged 18-20, the current voice reply data corresponding to that classification result in the voice reply strategy is a sweet-toned female voice with a lively dialogue flow. That is, when a male customer is recognized, a recording of a sweet-voiced female agent is invoked automatically and the customer is addressed as "Mr." in the dialogue flow, adding politeness; when a female customer answers the call, a recording of a deep male-voiced agent is invoked automatically and she is addressed as "Ms." out of courtesy. A relaxed and lively dialogue flow is invoked for young customers, and a mature and steady dialogue flow for older customers.
In an embodiment, the voice-based user gender and age recognition apparatus 100 further includes:
an identity code acquisition unit, configured to recognize the current user voice data with a pre-trained N-gram model to obtain a recognition result, and to obtain from the recognition result the unique user identity code corresponding to the user identity code field.
In this embodiment, the current user voice data is recognized by the N-gram model (a multi-gram language model), and the recognition yields a whole sentence, for example: "My name is Zhang San, gender male, age 25; today I need to handle business A." The N-gram model can recognize the current user voice data effectively, taking the sentence with the highest recognition probability as the recognition result.
Since the current user voice data has by now been converted into text (the recognition result), locating a few key strings in the recognition result is enough to obtain the user age value and user gender value corresponding to the user age field and user gender field respectively. The unique user identity code corresponding to the user identity code field can likewise be obtained from the recognition result; this unique user identity code is preferably the user's ID card number.
In an embodiment, the voice-based user gender and age recognition device 100 further includes:
a gender and age comparison unit, configured to obtain, according to the unique user identity code, the user's real age value and real gender value corresponding to the user terminal, and to judge whether the value of the estimated age parameter equals the user's real age value and whether the value of the gender parameter equals the user's real gender value; and
an error data storage unit, configured to store the current user classification result and the current user voice data into a pre-created first storage area if the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value.
In this embodiment, once the unique user identity code (i.e., the user's ID card number) has been obtained, the user's real age and gender can be retrieved through it. Classifying the current user voice data with the Gaussian mixture model yields a current user classification result that includes a gender parameter and an estimated age parameter. The value of the estimated age parameter is then compared with the user's real age value, and the value of the gender parameter with the user's real gender value. These comparisons determine whether the Gaussian-mixture-model classification of the current user voice data is correct.
If the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, the gender parameter and/or the estimated age parameter in the current user classification result is inaccurate. In that case, the current voice reply data obtained from the current user classification result is not suitable for the current user, so every inaccurately classified current user classification result and the corresponding current user voice data are stored into the pre-created first storage area.
The first storage area in the server records the data for which the automatic gender and age recognition was inaccurate, as the customer's history, to facilitate subsequent improvement of the Gaussian mixture model.
If the value of the estimated age parameter equals the user's real age value and the value of the gender parameter equals the user's real gender value, both the gender parameter and the estimated age parameter in the current user classification result are accurate; the current voice reply data obtained from the classification result suits the current user, and no adjustment of the reply for the current user is needed. A sketch of this audit step follows.
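A minimal sketch of the comparison-and-store audit, assuming a list-backed "first storage area" and a simple record layout (both are illustrative, not the patent's storage design):

```python
# Compare the GMM classification result with the real values retrieved via
# the unique identity code; archive mismatches for later model improvement.
def audit_classification(result: dict, real_age: int, real_gender: str,
                         first_storage_area: list, voice_data: bytes) -> bool:
    """Return True if the classification matches the user's real data."""
    correct = (result["estimated_age"] == real_age
               and result["gender"] == real_gender)
    if not correct:
        first_storage_area.append({"result": result, "voice": voice_data})
    return correct

store: list = []
ok = audit_classification({"gender": "male", "estimated_age": 25},
                          real_age=31, real_gender="male",
                          first_storage_area=store, voice_data=b"...")
print(ok, len(store))   # False 1 -> sample kept for retraining
```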
The device jointly considers the influence of features such as the short-time average amplitude, the mel-frequency cepstral coefficients, and the first-order difference of the mel-frequency cepstral coefficients on gender recognition, achieving accurate recognition of gender and age from the user's voice.
The above voice-based user gender and age recognition device may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 8.
Referring to FIG. 8, FIG. 8 is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 500 is a server, which may be a standalone server or a server cluster composed of multiple servers.
Referring to FIG. 8, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 causes the processor 502 to perform the voice-based user gender and age recognition method.
The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, it causes the processor 502 to perform the voice-based user gender and age recognition method.
The network interface 505 is used for network communication, such as transmitting data information. Those skilled in the art will understand that the structure shown in FIG. 8 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the voice-based user gender and age recognition method disclosed in the embodiments of this application.
Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 8 does not limit the specific configuration of the computer device; in other embodiments, the computer device may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structure and function of the memory and the processor are consistent with the embodiment shown in FIG. 8 and are not repeated here.
It should be understood that in the embodiments of this application, the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. It stores a computer program that, when executed by a processor, implements the voice-based user gender and age recognition method disclosed in the embodiments of this application.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally by function. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed devices, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative. The division of units is only a division by logical function; in actual implementation there may be other ways of dividing, units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may also be electrical, mechanical, or other forms of connection.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of this application.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, may exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of this application — in essence, or the part contributing to the prior art, or the whole or part of the technical solution — may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. A voice-based user gender and age recognition method, comprising:
    receiving current user voice data sent by a user terminal;
    preprocessing the current user voice data to obtain preprocessed voice data;
    extracting a short-time average amplitude from each frame of voice data in the preprocessed voice data, and performing feature extraction of mel-frequency cepstral coefficients and a first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and compose a mixed parameter feature time series;
    inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter; and
    invoking a pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal.
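To make the claimed pipeline concrete, here is a compact end-to-end sketch on synthetic audio. It is illustrative only: the class labels, frame sizes, pre-emphasis coefficient, two-model setup, and the log-spectrum stand-in for the full MFCC chain are assumptions; each step is treated more closely in the sketches under claims 4-8 below.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def frames_from(signal, frame_len=256, hop=128):
    # Pre-emphasis, then split into overlapping Hamming-windowed frames.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    win = np.hamming(frame_len)
    n = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i*hop:i*hop + frame_len] * win for i in range(n)])

def features(frames):
    amp = np.sum(np.abs(frames), axis=1, keepdims=True)    # short-time average amplitude
    logspec = np.log(np.abs(np.fft.rfft(frames)) + 1e-8)   # stand-in for the MFCC chain
    return np.hstack([amp, logspec[:, :12]])

rng = np.random.default_rng(0)
# One sub-GMM per class, trained on (fake) class-specific recordings.
models = {label: GaussianMixture(n_components=4, covariance_type="diag",
                                 random_state=0)
                 .fit(features(frames_from(rng.standard_normal(8000))))
          for label in ["male_18_20", "female_18_20"]}

test = features(frames_from(rng.standard_normal(8000)))
best = max(models, key=lambda k: models[k].score(test))    # highest avg log-likelihood wins
print("classified as:", best)
```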
  2. The voice-based user gender and age recognition method according to claim 1, wherein after invoking the pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal, the method further comprises:
    recognizing the current user voice data through a pre-trained N-gram model to obtain a recognition result, and obtaining from the recognition result the unique user identity code corresponding to the user identification code field.
  3. The voice-based user gender and age recognition method according to claim 2, wherein after recognizing the current user voice data through the pre-trained N-gram model to obtain the recognition result and obtaining from the recognition result the unique user identity code corresponding to the user identification code field, the method further comprises:
    obtaining, according to the unique user identity code, the user's real age value and real gender value corresponding to the user terminal, and judging whether the value of the estimated age parameter equals the user's real age value and whether the value of the gender parameter equals the user's real gender value; and
    if the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, storing the current user classification result and the current user voice data into a pre-created first storage area.
  4. The voice-based user gender and age recognition method according to claim 1, wherein preprocessing the current user voice data to obtain the preprocessed voice data comprises:
    sampling the current user voice data at a pre-stored sampling period to obtain a current discrete voice signal;
    pre-emphasizing the current discrete voice signal with a pre-stored first-order FIR high-pass digital filter to obtain a current pre-emphasized voice signal;
    windowing the current pre-emphasized voice signal with a pre-stored Hamming window to obtain windowed voice data;
    framing the windowed voice data with a pre-stored frame shift and frame length to obtain the preprocessed voice data.
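A sketch of this preprocessing chain follows. The 0.97 pre-emphasis coefficient and the 25 ms / 10 ms frame length and shift are common defaults, not values stated in the claim, and the Hamming window is applied per frame (the usual practical ordering, equivalent in effect):

```python
import numpy as np

def preprocess(voice: np.ndarray, fs: int = 8000) -> np.ndarray:
    # First-order FIR high-pass pre-emphasis: y[n] = x[n] - 0.97 * x[n-1]
    emphasized = np.append(voice[0], voice[1:] - 0.97 * voice[:-1])
    frame_len, frame_shift = int(0.025 * fs), int(0.010 * fs)   # 200 / 80 samples at 8 kHz
    window = np.hamming(frame_len)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    return np.stack([emphasized[i*frame_shift : i*frame_shift + frame_len] * window
                     for i in range(n_frames)])

frames = preprocess(np.random.randn(8000))   # one second of fake audio -> shape (98, 200)
```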
  5. The voice-based user gender and age recognition method according to claim 1, wherein performing feature extraction of the mel-frequency cepstral coefficients and the first-order difference of the mel-frequency cepstral coefficients on each frame of voice data comprises:
    performing a Fourier transform on the preprocessed voice data to obtain frequency-domain voice data;
    taking the absolute value of the frequency-domain voice data to obtain absolute-value voice data;
    passing the absolute-value voice data through mel filtering to obtain mel-filtered voice data;
    performing a logarithm operation and a discrete cosine transform on the mel-filtered voice data in sequence to obtain the mel-frequency cepstral coefficients corresponding to the preprocessed voice data;
    obtaining the difference between consecutive adjacent items of the mel-frequency cepstral coefficients to obtain the first-order difference of the mel-frequency cepstral coefficients.
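A sketch of this FFT → |·| → mel filter bank → log → DCT chain, plus the first-order difference. The simplified triangular filter bank, the 13 retained coefficients, and taking the difference across frames (the common reading of "consecutive adjacent items") are assumptions:

```python
import numpy as np
from scipy.fft import dct

def mel_bank(n_filters=24, n_fft=256, fs=8000):
    # Triangular filters spaced evenly on the mel scale between 0 and fs/2.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = np.floor((n_fft + 1) *
                   inv(np.linspace(mel(0), mel(fs / 2), n_filters + 2)) / fs).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = pts[i], pts[i + 1], pts[i + 2]
        bank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        bank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return bank

def mfcc_with_delta(frames):                       # frames: (n_frames, frame_len)
    mag = np.abs(np.fft.rfft(frames, n=256))       # Fourier transform, absolute value
    mels = np.log(mag @ mel_bank().T + 1e-8)       # mel filtering, then logarithm
    coeffs = dct(mels, type=2, norm="ortho")[:, :13]   # discrete cosine transform
    delta = np.diff(coeffs, axis=0)                # first-order difference of the MFCCs
    return coeffs, delta
```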
  6. The voice-based user gender and age recognition method according to claim 1, wherein the Gaussian mixture model includes multiple sub-Gaussian-mixture models, one of which is denoted as a first sub-Gaussian-mixture model, the first sub-Gaussian-mixture model being a recognition model for recognizing 18-20 year-old males;
    before inputting the mixed parameter feature time series into the pre-trained Gaussian mixture model to obtain the current user classification result corresponding to the current user voice data, the method further comprises:
    obtaining first sample data, the first sample data being the mixed parameter feature time series corresponding to the voice data of multiple 18-20 year-old males;
    training a to-be-trained first sub-Gaussian-mixture model with the first sample data to obtain the first sub-Gaussian-mixture model for recognizing 18-20 year-old males; and
    storing the trained first sub-Gaussian-mixture model to a blockchain network.
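A minimal training sketch for the first sub-model, assuming scikit-learn's GaussianMixture. The component count, diagonal covariance, and stand-in sample matrix are assumptions; persisting the fitted model to a blockchain network is outside this sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for the real first sample data: rows of mixed-parameter features
# extracted from 18-20 year-old male speech (shape: n_frames x n_features).
first_sample_data = np.random.randn(5000, 27)

male_18_20_gmm = GaussianMixture(n_components=16, covariance_type="diag",
                                 max_iter=200, random_state=0)
male_18_20_gmm.fit(first_sample_data)

# At inference time, the sub-model reporting the highest average
# log-likelihood for an utterance's features yields the classification.
print(male_18_20_gmm.score(np.random.randn(100, 27)))
```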
  7. The voice-based user gender and age recognition method according to claim 1, wherein extracting the short-time average amplitude from each frame of voice data in the preprocessed voice data comprises:
    calculating, according to
    $M_n = \sum_{m=0}^{N-1} |x_n(m)|$,
    the short-time average amplitude of the n-th frame of voice data in the preprocessed voice data, where $M_n$ denotes the short-time average amplitude of the n-th frame of voice data in the preprocessed voice data, the n-th frame of voice data in the preprocessed voice data is $x_n(m)$, $0 \le m \le N-1$, and $N$ is the frame length.
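The formula has a one-line counterpart in code: the short-time average amplitude of each frame is the sum of the absolute sample values (despite the name "average", the classical definition sums without dividing by N):

```python
import numpy as np

def short_time_average_amplitude(frames: np.ndarray) -> np.ndarray:
    """M_n = sum over m of |x_n(m)|, computed for every frame n at once."""
    return np.sum(np.abs(frames), axis=1)
```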
  8. The voice-based user gender and age recognition method according to claim 5, wherein passing the absolute-value voice data through mel filtering to obtain the mel-filtered voice data comprises passing the absolute-value voice data through a mel filter bank, wherein the sampling rate of the mel filter bank is fs = 8000 Hz, the lowest frequency of the filter frequency range is fl = 0, and the highest frequency of the filter frequency range is fh = fs/2 = 8000/2 = 4000; the number of filters is set to M = 24 and the FFT length to N = 256.
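A short sketch deriving the filter-bank layout from exactly these claimed parameters: 24 filters spaced evenly on the mel scale between fl = 0 and fh = 4000 Hz, mapped onto a 256-point FFT grid (the full bank construction is in the mel_bank helper shown under claim 5; the Hz-to-bin mapping here is a standard convention, not stated in the claim):

```python
import numpy as np

fs, fl, fh, M, N = 8000, 0.0, 8000 / 2, 24, 256
mel = lambda f: 2595 * np.log10(1 + f / 700)        # Hz -> mel
inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)    # mel -> Hz

edge_mels = np.linspace(mel(fl), mel(fh), M + 2)    # M filters need M + 2 edges
edge_hz = inv_mel(edge_mels)
fft_bins = np.floor((N + 1) * edge_hz / fs).astype(int)
print(fft_bins)   # 26 edge bins running from 0 up to N/2 = 128
```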
  9. A voice-based user gender and age recognition device, comprising:
    a voice data receiving unit, configured to receive current user voice data sent by a user terminal;
    a voice preprocessing unit, configured to preprocess the current user voice data to obtain preprocessed voice data;
    a mixed parameter sequence acquisition unit, configured to extract a short-time average amplitude from each frame of voice data in the preprocessed voice data, and perform feature extraction of mel-frequency cepstral coefficients and a first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and compose a mixed parameter feature time series;
    a user classification unit, configured to input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter; and
    a reply data sending unit, configured to invoke a pre-stored voice reply strategy, obtain from the voice reply strategy the current voice reply data corresponding to the current user classification result, and send the current voice reply data to the user terminal.
  10. A computer device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    receiving current user voice data sent by a user terminal;
    preprocessing the current user voice data to obtain preprocessed voice data;
    extracting a short-time average amplitude from each frame of voice data in the preprocessed voice data, and performing feature extraction of mel-frequency cepstral coefficients and a first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and compose a mixed parameter feature time series;
    inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter; and
    invoking a pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal.
  11. The computer device according to claim 10, wherein after invoking the pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal, the steps further comprise:
    recognizing the current user voice data through a pre-trained N-gram model to obtain a recognition result, and obtaining from the recognition result the unique user identity code corresponding to the user identification code field.
  12. The computer device according to claim 11, wherein after recognizing the current user voice data through the pre-trained N-gram model to obtain the recognition result and obtaining from the recognition result the unique user identity code corresponding to the user identification code field, the steps further comprise:
    obtaining, according to the unique user identity code, the user's real age value and real gender value corresponding to the user terminal, and judging whether the value of the estimated age parameter equals the user's real age value and whether the value of the gender parameter equals the user's real gender value; and
    if the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, storing the current user classification result and the current user voice data into a pre-created first storage area.
  13. The computer device according to claim 10, wherein preprocessing the current user voice data to obtain the preprocessed voice data comprises:
    sampling the current user voice data at a pre-stored sampling period to obtain a current discrete voice signal;
    pre-emphasizing the current discrete voice signal with a pre-stored first-order FIR high-pass digital filter to obtain a current pre-emphasized voice signal;
    windowing the current pre-emphasized voice signal with a pre-stored Hamming window to obtain windowed voice data;
    framing the windowed voice data with a pre-stored frame shift and frame length to obtain the preprocessed voice data.
  14. The computer device according to claim 10, wherein performing feature extraction of the mel-frequency cepstral coefficients and the first-order difference of the mel-frequency cepstral coefficients on each frame of voice data comprises:
    performing a Fourier transform on the preprocessed voice data to obtain frequency-domain voice data;
    taking the absolute value of the frequency-domain voice data to obtain absolute-value voice data;
    passing the absolute-value voice data through mel filtering to obtain mel-filtered voice data;
    performing a logarithm operation and a discrete cosine transform on the mel-filtered voice data in sequence to obtain the mel-frequency cepstral coefficients corresponding to the preprocessed voice data;
    obtaining the difference between consecutive adjacent items of the mel-frequency cepstral coefficients to obtain the first-order difference of the mel-frequency cepstral coefficients.
  15. The computer device according to claim 10, wherein the Gaussian mixture model includes multiple sub-Gaussian-mixture models, one of which is denoted as a first sub-Gaussian-mixture model, the first sub-Gaussian-mixture model being a recognition model for recognizing 18-20 year-old males;
    before inputting the mixed parameter feature time series into the pre-trained Gaussian mixture model to obtain the current user classification result corresponding to the current user voice data, the steps further comprise:
    obtaining first sample data, the first sample data being the mixed parameter feature time series corresponding to the voice data of multiple 18-20 year-old males;
    training a to-be-trained first sub-Gaussian-mixture model with the first sample data to obtain the first sub-Gaussian-mixture model for recognizing 18-20 year-old males; and
    storing the trained first sub-Gaussian-mixture model to a blockchain network.
  16. The computer device according to claim 10, wherein extracting the short-time average amplitude from each frame of voice data in the preprocessed voice data comprises:
    calculating, according to
    $M_n = \sum_{m=0}^{N-1} |x_n(m)|$,
    the short-time average amplitude of the n-th frame of voice data in the preprocessed voice data, where $M_n$ denotes the short-time average amplitude of the n-th frame of voice data in the preprocessed voice data, the n-th frame of voice data in the preprocessed voice data is $x_n(m)$, $0 \le m \le N-1$, and $N$ is the frame length.
  17. The computer device according to claim 14, wherein passing the absolute-value voice data through mel filtering to obtain the mel-filtered voice data comprises passing the absolute-value voice data through a mel filter bank, wherein the sampling rate of the mel filter bank is fs = 8000 Hz, the lowest frequency of the filter frequency range is fl = 0, and the highest frequency of the filter frequency range is fh = fs/2 = 8000/2 = 4000; the number of filters is set to M = 24 and the FFT length to N = 256.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the following operations:
    receiving current user voice data sent by a user terminal;
    preprocessing the current user voice data to obtain preprocessed voice data;
    extracting a short-time average amplitude from each frame of voice data in the preprocessed voice data, and performing feature extraction of mel-frequency cepstral coefficients and a first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and compose a mixed parameter feature time series;
    inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter; and
    invoking a pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal.
  19. The computer-readable storage medium according to claim 18, wherein after invoking the pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal, the operations further comprise:
    recognizing the current user voice data through a pre-trained N-gram model to obtain a recognition result, and obtaining from the recognition result the unique user identity code corresponding to the user identification code field.
  20. The computer-readable storage medium according to claim 19, wherein after recognizing the current user voice data through the pre-trained N-gram model to obtain the recognition result and obtaining from the recognition result the unique user identity code corresponding to the user identification code field, the operations further comprise:
    obtaining, according to the unique user identity code, the user's real age value and real gender value corresponding to the user terminal, and judging whether the value of the estimated age parameter equals the user's real age value and whether the value of the gender parameter equals the user's real gender value; and
    if the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, storing the current user classification result and the current user voice data into a pre-created first storage area.
PCT/CN2020/131612 2020-04-27 2020-11-26 Voice-based user gender and age recognition method and apparatus, computer device, and storage medium WO2021218136A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010345904.3A CN111683181B (en) 2020-04-27 2020-04-27 Voice-based user gender and age identification method and device and computer equipment
CN202010345904.3 2020-04-27

Publications (1)

Publication Number Publication Date
WO2021218136A1 true WO2021218136A1 (en) 2021-11-04

Family

ID=72433818

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/131612 WO2021218136A1 (en) 2020-04-27 2020-11-26 Voice-based user gender and age recognition method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN111683181B (en)
WO (1) WO2021218136A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115187431A (en) * 2022-09-15 2022-10-14 广州天辰信息科技有限公司 Endowment service robot system based on big data

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111683181B (en) * 2020-04-27 2022-04-12 平安科技(深圳)有限公司 Voice-based user gender and age identification method and device and computer equipment
CN113192510B (en) * 2020-12-29 2024-04-30 云从科技集团股份有限公司 Method, system and medium for realizing voice age and/or sex identification service
CN113194210B (en) * 2021-04-30 2023-02-24 中国银行股份有限公司 Voice call access method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8160877B1 (en) * 2009-08-06 2012-04-17 Narus, Inc. Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
CN103236259A (en) * 2013-03-22 2013-08-07 乐金电子研发中心(上海)有限公司 Voice recognition processing and feedback system, voice response method
CN104700843A (en) * 2015-02-05 2015-06-10 海信集团有限公司 Method and device for identifying ages
CN108694954A (en) * 2018-06-13 2018-10-23 广州势必可赢网络科技有限公司 A kind of Sex, Age recognition methods, device, equipment and readable storage medium storing program for executing
CN110648672A (en) * 2019-09-05 2020-01-03 深圳追一科技有限公司 Character image generation method, interaction method, device and terminal equipment
CN111683181A (en) * 2020-04-27 2020-09-18 平安科技(深圳)有限公司 Voice-based user gender and age identification method and device and computer equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106128467A (en) * 2016-06-06 2016-11-16 北京云知声信息技术有限公司 Method of speech processing and device
CN106157135A (en) * 2016-07-14 2016-11-23 微额速达(上海)金融信息服务有限公司 Antifraud system and method based on Application on Voiceprint Recognition Sex, Age
CN107170456A (en) * 2017-06-28 2017-09-15 北京云知声信息技术有限公司 Method of speech processing and device
CN109256138B (en) * 2018-08-13 2023-07-07 平安科技(深圳)有限公司 Identity verification method, terminal device and computer readable storage medium
CN109448756A (en) * 2018-11-14 2019-03-08 北京大生在线科技有限公司 A kind of voice age recognition methods and system
CN110246507B (en) * 2019-08-05 2021-08-24 上海优扬新媒信息技术有限公司 Voice recognition method and device


Also Published As

Publication number Publication date
CN111683181A (en) 2020-09-18
CN111683181B (en) 2022-04-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933609

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20933609

Country of ref document: EP

Kind code of ref document: A1