WO2021218136A1 - Voice-based user gender and age recognition method and apparatus, computer device, and storage medium

Voice-based user gender and age recognition method and apparatus, computer device, and storage medium

Info

Publication number: WO2021218136A1
Authority: WO, WIPO (PCT)
Prior art keywords: voice, user, data, voice data, current
Application number: PCT/CN2020/131612
Other languages: French (fr), Chinese (zh)
Inventors: 赵婧, 王健宗
Original Assignee: 平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021218136A1


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M3/00: Automatic or semi-automatic exchanges
    • H04M3/42: Systems providing special services or facilities to subscribers
    • H04M3/50: Centralised arrangements for answering calls; centralised arrangements for recording messages for absent or busy subscribers
    • H04M3/51: Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
    • H04M3/5166: Centralised call answering arrangements in combination with interactive voice response systems or voice portals, e.g. as front-ends
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/24: the extracted parameters being the cepstrum

Definitions

  • This application relates to the technical field of voice classification in artificial intelligence, and in particular to a voice-based user gender and age recognition method, apparatus, computer device, and storage medium.
  • The inventor realized that when a smart outbound-calling system automatically places calls to each user according to the user information in the outbound-call list, it determines the type of agent voice and the outbound speech flow based on the age and gender in that user information.
  • For example, when the user information indicates a middle-aged male, the smart outbound-calling system plays a female agent recording to carry out the call.
  • However, if the person who answers the call is not the listed user, the accuracy of the gender-matched playback is low.
  • The embodiments of the present application provide a voice-based user gender and age identification method, apparatus, computer device, and storage medium, intended to solve the prior-art problem that when a smart outbound-calling system automatically calls each user based on the user information in the outbound-call list and the person who answers the call is not the listed user, the accuracy of the gender-matched playback is low.
  • An embodiment of the present application provides a voice-based user gender and age identification method, which includes:
  • invoking a pre-stored voice reply strategy, obtaining the current voice reply data corresponding to the current user classification result in the voice reply strategy, and sending the current voice reply data to the user terminal.
  • An embodiment of the present application provides a voice-based user gender and age recognition apparatus, which includes:
  • a voice data receiving unit, configured to receive the current user voice data sent by the user terminal;
  • a voice preprocessing unit, configured to preprocess the current user voice data to obtain preprocessed voice data;
  • a mixed parameter sequence acquiring unit, configured to extract the short-time average amplitude of each frame of voice data in the preprocessed voice data, and to perform feature extraction of the mel-frequency cepstral coefficients and the first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, obtaining the mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data so as to form a mixed parameter feature time series;
  • a user classification unit, configured to input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter; and
  • a reply data sending unit, configured to invoke a pre-stored voice reply strategy, obtain the current voice reply data corresponding to the current user classification result in the voice reply strategy, and send the current voice reply data to the user terminal.
  • An embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the following steps are implemented:
  • invoking a pre-stored voice reply strategy, obtaining the current voice reply data corresponding to the current user classification result in the voice reply strategy, and sending the current voice reply data to the user terminal.
  • The embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following operations:
  • invoking a pre-stored voice reply strategy, obtaining the current voice reply data corresponding to the current user classification result in the voice reply strategy, and sending the current voice reply data to the user terminal.
  • The embodiments of the present application provide a voice-based user gender and age recognition method, apparatus, computer device, and storage medium. The method includes: receiving current user voice data sent by the user terminal; preprocessing the current user voice data to obtain preprocessed voice data; extracting the short-time average amplitude of each frame of voice data in the preprocessed voice data, and performing feature extraction of the mel-frequency cepstral coefficients and the first-order difference of the mel-frequency cepstral coefficients on each frame, to obtain the mixed parameter feature corresponding to each frame and form a mixed parameter feature time series; inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain the current user classification result corresponding to the current user voice data, the current user classification result including a gender parameter and an estimated age parameter; and invoking a pre-stored voice reply strategy, obtaining the current voice reply data corresponding to the current user classification result, and sending it to the user terminal.
  • This method comprehensively considers the influence of the short-time average amplitude, the mel-frequency cepstral coefficients, and the first-order difference of the mel-frequency cepstral coefficients on gender recognition, and achieves accurate recognition of gender and age from the user's voice.
  • FIG. 1 is a schematic diagram of an application scenario of a voice-based user gender and age identification method provided by an embodiment of this application;
  • FIG. 2 is a schematic flowchart of a voice-based user gender and age identification method provided by an embodiment of this application;
  • FIG. 3 is a schematic diagram of a sub-flow of a voice-based user gender and age identification method provided by an embodiment of this application;
  • FIG. 4 is a schematic diagram of another sub-flow of a voice-based user gender and age identification method provided by an embodiment of this application;
  • FIG. 5 is a schematic block diagram of a voice-based user gender and age recognition device provided by an embodiment of this application;
  • FIG. 6 is a schematic block diagram of subunits of the voice-based user gender and age recognition device provided by an embodiment of this application;
  • FIG. 7 is a schematic block diagram of another subunit of the voice-based user gender and age recognition device provided by an embodiment of this application;
  • FIG. 8 is a schematic block diagram of a computer device provided by an embodiment of this application.
  • Please refer to Figure 1 and Figure 2: Figure 1 is a schematic diagram of an application scenario of the voice-based user gender and age identification method provided by an embodiment of this application, and Figure 2 is a schematic flowchart of that method.
  • The voice-based user gender and age identification method is applied to a server and is executed by application software installed in the server.
  • The method includes steps S110 to S150.
  • S110: Receive current user voice data sent by the user terminal.
  • When the intelligent voice system deployed in the server needs to recognize the user's gender and age from voice, it first needs to receive the current user voice data uploaded by the user terminal, so as to perform the subsequent voice preprocessing and classification recognition process.
  • Specifically, the current user voice data (denoted s(t)) is first sampled with a sampling period T and discretized as s(n).
  • The sampling period should be chosen according to the bandwidth of the current user's voice data (per the Nyquist sampling theorem) to avoid aliasing and distortion of the signal in the frequency domain; the quantization of the discrete speech signal inevitably introduces a certain amount of quantization noise and distortion.
  • The preprocessing of the voice includes the steps of pre-emphasis, windowing, and framing.
  • In an embodiment, step S120 includes:
  • S121: Call a pre-stored sampling period to sample the current user voice data to obtain a current discrete voice signal;
  • S122: Call a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal to obtain a current pre-emphasized voice signal;
  • S123: Call a pre-stored Hamming window to window the current pre-emphasized voice signal to obtain windowed voice data;
  • S124: Call the pre-stored frame shift and frame length to divide the windowed voice data into frames to obtain the preprocessed voice data.
  • That is, the current user voice data s(t) is sampled with the sampling period T and discretized as s(n).
  • The first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter whose transfer function is given by formula (1): H(z) = 1 - a·z^(-1) (1)
  • The value of a is 0.98.
  • If the sampling value of the current discrete voice signal at time n is x(n), the pre-emphasized signal is given by formula (2): y(n) = x(n) - a·x(n-1) (2)
  • Let the time-domain signal corresponding to the windowed voice data be x(l).
  • The windowed voice data is then divided into frames.
  • The nth frame of voice data in the preprocessed voice data is xn(m), and xn(m) satisfies formula (3): xn(m) = ω(m)·x(nT + m), 0 ≤ m ≤ N-1 (3)
  • where T is the frame shift, N is the frame length, and ω(n) is the Hamming window function.
  • For each frame of voice data in the preprocessed voice data, the short-time average amplitude, the mel-frequency cepstral coefficients, and the first-order difference of the mel-frequency cepstral coefficients are extracted, and the extracted parameters together constitute the mixed parameter feature corresponding to that frame, so as to form the mixed parameter feature time series.
  • In this way, the important parameters extracted from the preprocessed voice data are obtained, and combining these important parameters allows user types to be classified more accurately (mainly by age and gender).
  • Specifically, the short-time average amplitude of the nth frame of voice data in the preprocessed voice data is calculated as Mn = Σ_{m=0}^{N-1} |xn(m)|,
  • where Mn represents the short-time average amplitude of the nth frame of voice data in the preprocessed voice data, the nth frame of voice data is xn(m), 0 ≤ m ≤ N-1, and N is the frame length.
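
Given the framed output of the earlier sketch, the short-time average amplitude is simply the per-frame sum of absolute sample values:

```python
import numpy as np

def short_time_average_amplitude(frames):
    """M_n = sum over m of |x_n(m)| for each frame x_n (frames: 2-D array)."""
    return np.abs(frames).sum(axis=1)
```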
  • In an embodiment, step S130 includes:
  • S131: Perform Fourier transform on each frame of the preprocessed voice data in sequence to obtain frequency-domain voice data;
  • S132: Take the absolute value of the frequency-domain voice data to obtain absolute-valued voice data;
  • S133: Filter the absolute-valued voice data through the mel filter bank to obtain mel-filtered voice data;
  • S134: Sequentially perform a logarithm operation and a discrete cosine transform on the mel-filtered voice data to obtain the mel-frequency cepstral coefficients corresponding to the preprocessed voice data;
  • S135: Obtain the difference between consecutive adjacent terms of the mel-frequency cepstral coefficients to obtain their first-order difference.
  • Because the preprocessed voice data is a time-domain signal whose characteristics are hard to observe directly, each frame is converted to the frequency domain.
  • This is done with the discrete Fourier transform (DFT) or, where applicable, the fast Fourier transform (FFT).
  • For an N-point signal, if N is a power of 2 the FFT can be used to speed up the computation; otherwise only the DFT can be used, and the computation slows as the number of points increases. Therefore, when framing, the number of points per frame should be chosen as a power of 2.
  • The absolute-valued voice data is then filtered by the mel filter bank to obtain the mel-filtered voice data.
  • The mel filter bank is a set of triangular band-pass filters spaced uniformly on the mel scale, which maps linear frequency f to Mel(f) = 2595·log10(1 + f/700).
  • The discrete cosine transform (DCT) completes the cepstral analysis: the time-domain signal is transformed to the frequency domain, the logarithm is taken, and the DCT is performed to obtain the cepstral coefficients. With the mel filter bank applied after the frequency-domain stage, the final result is the mel-frequency cepstral coefficients (MFCC).
  • The first-order difference is the difference between consecutive adjacent terms of a discrete sequence; here it is taken over the mel-frequency cepstral coefficients.
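
The chain of steps S131 to S135 (FFT, magnitude, mel filter bank, logarithm, DCT, first-order difference) can be sketched as follows. The filter count, FFT size, sampling rate, and number of cepstral coefficients kept are common defaults assumed here, not values fixed by this application, and the first-order difference is taken across consecutive frames, which is the conventional reading.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters=26, n_fft=256, sr=8000):
    """Triangular filters spaced uniformly on the mel scale (assumed defaults)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def mfcc_and_delta(frames, n_fft=256, sr=8000, n_coeffs=13):
    # S131/S132: FFT of each frame, then the magnitude (absolute value) spectrum.
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # S133: filter the magnitude spectrum through the mel filter bank.
    mel_energy = mag @ mel_filterbank(n_fft=n_fft, sr=sr).T
    # S134: logarithm followed by the discrete cosine transform -> MFCC.
    mfcc = dct(np.log(mel_energy + 1e-10), type=2, axis=1, norm='ortho')[:, :n_coeffs]
    # S135: first-order difference between consecutive frames' coefficients.
    delta = np.diff(mfcc, axis=0, prepend=mfcc[:1])
    return mfcc, delta
```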
  • Each frame of voice data in the preprocessed voice data thus yields the three characteristic parameters above (short-time average amplitude, mel-frequency cepstral coefficient, and first-order difference of the mel-frequency cepstral coefficient); that is, each frame corresponds to a 1×3 row vector.
  • If the preprocessed voice data includes M frames of voice data,
  • the 1×3 row vectors corresponding to the frames are concatenated in time order to obtain a 1×3M row vector,
  • and the 1×3M row vector is the mixed parameter feature time series corresponding to the preprocessed voice data, as sketched below.
  • In other embodiments, each frame of voice data can additionally yield the fundamental frequency, the speech rate, and the sound pressure level, so as to form a mixed parameter feature time series with more parameter dimensions.
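
A short sketch of assembling the 1×3M mixed parameter feature time series described above. The text treats each frame as contributing a single value for each of the three parameters, so one representative MFCC value per frame is assumed here; the random demo arrays stand in for the real per-frame extracts.

```python
import numpy as np

M = 5                        # number of frames (demo value)
amp   = np.random.rand(M)    # short-time average amplitude per frame
mfcc1 = np.random.rand(M)    # one representative MFCC value per frame (assumption)
dmfcc = np.random.rand(M)    # its first-order difference per frame

# Interleave frame by frame: [amp_0, mfcc_0, dmfcc_0, amp_1, ...] -> 1 x 3M.
feature_series = np.stack([amp, mfcc1, dmfcc], axis=1).reshape(1, 3 * M)
```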
  • When the Gaussian mixture model is pre-trained, several sub-Gaussian mixture models need to be trained separately: for example, a first sub-Gaussian mixture model for identifying men aged 18-20, a second for men aged 21-30, a third for men aged 31-40, a fourth for men aged 41-50, a fifth for men aged 51-70, a sixth for identifying women aged 18-20, a seventh for women aged 21-30, an eighth for women aged 31-40, a ninth for women aged 41-50, and a tenth for women aged 51-70.
  • The Gaussian mixture model (GMM) refers to a probability distribution model of the form P(y|θ) = Σ_{k=1}^{K} αk·φ(y|θk) (4)
  • where αk is a mixing coefficient with αk ≥ 0 and Σ_{k=1}^{K} αk = 1, and φ(y|θk) is the Gaussian density of the kth component with parameters θk.
  • In an embodiment, the Gaussian mixture model in step S140 includes a plurality of sub-Gaussian mixture models, one of which is denoted as the first sub-Gaussian mixture model: a recognition model for recognizing males aged 18-20. Taking the training of this first sub-Gaussian mixture model as an example, before step S140 the method further includes:
  • acquiring first sample data, where the first sample data are mixed parameter feature time series corresponding to voice data of multiple 18-20-year-old males; training the first sub-Gaussian mixture model to be trained using the first sample data, to obtain the first sub-Gaussian mixture model for identifying males aged 18-20; and storing the trained first sub-Gaussian mixture model to the blockchain network.
  • Blockchain is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms.
  • Blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains a batch of network transaction information used to verify the validity of the information (anti-tampering) and to generate the next block.
  • The blockchain can include the blockchain underlying platform, the platform product service layer, and the application service layer.
  • The mixed parameter feature time series corresponding to the 18-20-year-old male voice data in the first sample data can be obtained by referring to the specific process, in steps S110 to S130, of obtaining the mixed parameter feature time series corresponding to the current user's voice data.
  • The process of training the first sub-Gaussian mixture model to be trained is to input multiple sets of mixed parameter feature time series and to solve for the parameters of the model through the EM algorithm (the expectation-maximization algorithm), thereby obtaining the first sub-Gaussian mixture model.
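
A minimal sketch of this training-and-classification scheme, using scikit-learn's GaussianMixture (whose fit method runs the EM algorithm) in place of a hand-written EM solver. The ten gender/age buckets follow the example above; the component count and feature layout are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

BUCKETS = [(g, a) for g in ("male", "female")
           for a in ("18-20", "21-30", "31-40", "41-50", "51-70")]

def train_sub_models(samples_per_bucket, n_components=4):
    """samples_per_bucket maps each (gender, age) bucket to a 2-D array of
    mixed parameter feature time series, one row per training utterance."""
    models = {}
    for bucket, X in samples_per_bucket.items():
        # fit() runs EM to solve for the mixing coefficients alpha_k and the
        # Gaussian component parameters of formula (4).
        models[bucket] = GaussianMixture(n_components=n_components).fit(X)
    return models

def classify(models, x):
    """Return the bucket whose sub-GMM assigns x the highest log-likelihood."""
    x = np.atleast_2d(x)
    return max(models, key=lambda bucket: models[bucket].score(x))
```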
  • The first sub-Gaussian mixture model trained in the server can be stored on the blockchain in a blockchain network (preferably a private chain, so that each subsidiary of the enterprise can use the private chain to call the first sub-Gaussian mixture model).
  • Besides the first sub-Gaussian mixture model, the other sub-Gaussian mixture models included in the Gaussian mixture model can also be stored in the blockchain network.
  • What is uploaded are the parameter values of the model (such as αk and the parameter values corresponding to φ(y|θk)).
  • The server is regarded as a blockchain node device in the blockchain network and has the authority to upload data to the blockchain network.
  • When the server needs to obtain the first sub-Gaussian mixture model from the blockchain network, it is first verified whether the server has the authority of a blockchain node device; if it does, the first sub-Gaussian mixture model is obtained.
  • The voice reply strategy stored in the server includes multiple pieces of voice style template data, each corresponding to one piece of voice reply data; in each voice style template, the speaker's gender, the speaker's style, and the speech flow are all preset.
  • For example, if the current user classification result is an 18-20-year-old male,
  • the current voice reply data corresponding to that classification result is a sweet female style with a lively speech flow. That is, when a male customer is recognized, the system automatically plays a recording by a sweet-voiced female agent and addresses the customer as "Mr." during the conversation to increase politeness; when a female customer answers the phone, it automatically plays a recording by a male agent with a deep voice and addresses her as "Ms." to show politeness. A relaxed and lively speech flow is used for young customers, and a mature, steady speech flow for older customers.
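
Functionally, the reply strategy is a lookup from the classification result to a preset voice style template. A hypothetical sketch follows; the template names and fields are illustrative, not taken from this application.

```python
# Hypothetical mapping from (gender, age bucket) to preset voice style template data.
REPLY_STRATEGY = {
    ("male", "18-20"):   {"agent_voice": "sweet_female", "flow": "lively",        "address": "Mr."},
    ("male", "51-70"):   {"agent_voice": "sweet_female", "flow": "mature_steady", "address": "Mr."},
    ("female", "21-30"): {"agent_voice": "deep_male",    "flow": "lively",        "address": "Ms."},
    # ... one entry per possible classification result
}

def current_reply_data(classification):
    """Fetch the preset voice reply data for the current user classification result."""
    return REPLY_STRATEGY[classification]
```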
  • In an embodiment, after step S150 the method further includes:
  • recognizing the current user's voice data through a pre-trained N-gram model to obtain a recognition result, and obtaining the unique user identity code corresponding to the user identification code field in the recognition result.
  • The current user's voice data is recognized through the N-gram model (an n-gram language model), and the recognition output is a whole sentence, for example: "My name is Zhang San, my gender is male, and my age is 25. Business A needs to be handled today."
  • The N-gram model can effectively recognize the current user's voice data and take the sentence with the highest recognition probability as the recognition result.
  • Since the current user's voice data has by now been converted into the text of the recognition result, several key strings can be located in the recognition result to obtain the user age value and user gender value corresponding to the user age field and the user gender field. At the same time, the unique user identity code corresponding to the user identification code field in the recognition result can also be obtained; the unique user identity code is preferably the user's ID number.
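
Once the N-gram recognizer has produced the text of the recognition result, the key strings can be located with simple patterns. A sketch assuming the phrasing of the example sentence above; the field wording and the ID-number phrase are illustrative assumptions.

```python
import re

def extract_fields(text):
    """Locate the user age, user gender, and user identity code fields in
    recognized text; the patterns assume the example sentence's phrasing."""
    age    = re.search(r"my age is (\d+)", text)
    gender = re.search(r"my gender is (\w+)", text)
    uid    = re.search(r"my ID number is ([0-9Xx]+)", text)  # assumed phrasing
    return {
        "age": int(age.group(1)) if age else None,
        "gender": gender.group(1) if gender else None,
        "user_id": uid.group(1) if uid else None,
    }

print(extract_fields("My name is Zhang San, my gender is male, and my age is 25."))
# -> {'age': 25, 'gender': 'male', 'user_id': None}
```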
  • In an embodiment, after the recognition result is obtained by recognizing the current user's voice data through the pre-trained N-gram model and the unique user identity code corresponding to the user identification code field is obtained, the method further includes:
  • obtaining, according to the unique user identity code, the user's true age value and true gender value corresponding to the user terminal, determining whether the value of the estimated age parameter is equal to the user's true age value, and determining whether the value of the gender parameter is equal to the user's true gender value;
  • and, if not, storing the current user classification result and the current user voice data in a first storage area created in advance.
  • That is, the user's real age and gender can be obtained through the unique user identity code.
  • The Gaussian mixture model has been used to classify the current user's voice data, giving the current user classification result that includes the gender parameter and the estimated age parameter.
  • The value of the estimated age parameter is then compared with the user's true age value, and the value of the gender parameter with the user's true gender value, to determine whether each pair is equal. Through these comparisons it can be judged whether the classification of the current user's voice data by the Gaussian mixture model is correct.
  • If the value of the estimated age parameter is not equal to the user's true age value, or the value of the gender parameter is not equal to the user's true gender value, the value of the gender parameter and/or the value of the estimated age parameter in the current user classification result is inaccurate.
  • In that case the current voice reply data corresponding to the current user classification result is not suitable for the current user, so the inaccurate current user classification result
  • and the current user voice data are stored in the first storage area created in advance.
  • In this way, the data for which the intelligent gender and age recognition was inaccurate are recorded in the customer's history, to facilitate subsequent improvement of the Gaussian mixture model.
  • If the value of the estimated age parameter is equal to the user's true age value and the value of the gender parameter is equal to the user's true gender value, the values of the gender parameter and the estimated age parameter in the current user classification result are both accurate. In this case the current voice reply data corresponding to the current user classification result is suitable for the current user, and no adjustment is needed.
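
The verification step then reduces to comparing the classification result against the true values retrieved via the unique user identity code and storing mismatches. A minimal sketch with an in-memory list standing in for the "first storage area"; since the classifier outputs age buckets, "equal" for the age parameter is interpreted here as the true age falling inside the predicted bucket.

```python
first_storage_area = []  # in-memory stand-in for the pre-created first storage area

def verify_and_store(classification, voice_data, true_age, true_gender):
    """Compare the estimated gender/age against the user's true values and
    store inaccurate cases for later Gaussian-mixture-model improvement."""
    est_gender, est_age_bucket = classification
    low, high = (int(v) for v in est_age_bucket.split("-"))
    accurate = est_gender == true_gender and low <= true_age <= high
    if not accurate:
        first_storage_area.append({"classification": classification,
                                   "voice_data": voice_data})
    return accurate
```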
  • This method comprehensively considers the influence of the short-time average amplitude, the mel-frequency cepstral coefficients, and their first-order difference on gender recognition, and achieves accurate recognition of gender and age from the user's voice.
  • An embodiment of the present application also provides a voice-based user gender and age recognition device, which is used to perform any of the foregoing voice-based user gender and age recognition methods.
  • FIG. 5 is a schematic block diagram of a voice-based user gender and age recognition device provided in an embodiment of the present application.
  • The voice-based user gender and age recognition device 100 can be configured in a server.
  • The voice-based user gender and age recognition device 100 includes a voice data receiving unit 110, a voice preprocessing unit 120, a mixed parameter sequence acquiring unit 130, a user classification unit 140, and a reply data sending unit 150.
  • The voice data receiving unit 110 is used to receive the current user voice data sent by the user terminal.
  • When the intelligent voice system deployed in the server needs to recognize the user's gender and age from voice, it first needs to receive the current user voice data uploaded by the user terminal, so as to perform the subsequent voice preprocessing and classification recognition process.
  • The voice preprocessing unit 120 is configured to preprocess the current user voice data to obtain preprocessed voice data.
  • Specifically, the current user voice data (denoted s(t)) is first sampled with a sampling period T and discretized as s(n).
  • The sampling period should be chosen according to the bandwidth of the current user's voice data (per the Nyquist sampling theorem) to avoid aliasing and distortion of the signal in the frequency domain; the quantization of the discrete speech signal inevitably introduces a certain amount of quantization noise and distortion.
  • The preprocessing of the voice includes the steps of pre-emphasis, windowing, and framing.
  • In an embodiment, the voice preprocessing unit 120 includes:
  • a voice data sampling unit 121, configured to call a pre-stored sampling period to sample the current user voice data to obtain a current discrete voice signal;
  • a pre-emphasis unit 122, configured to call a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal to obtain a current pre-emphasized voice signal;
  • a windowing unit 123, configured to call a pre-stored Hamming window to window the current pre-emphasized voice signal to obtain windowed voice data;
  • a framing unit 124, configured to call the pre-stored frame shift and frame length to divide the windowed voice data into frames to obtain the preprocessed voice data.
  • That is, the current user voice data s(t) is sampled with the sampling period T and discretized as s(n).
  • The first-order FIR high-pass digital filter is a first-order non-recursive high-pass digital filter whose transfer function is the above formula (1).
  • If the sampling value of the current discrete voice signal at time n is x(n), the pre-emphasized signal is obtained as in the above formula (2).
  • Let the time-domain signal corresponding to the windowed voice data be x(l).
  • The windowed voice data is then divided into frames.
  • The nth frame of voice data in the preprocessed voice data is xn(m), and xn(m) satisfies the above formula (3).
  • The mixed parameter sequence acquiring unit 130 is configured to extract the short-time average amplitude of each frame of voice data in the preprocessed voice data, and to perform feature extraction of the mel-frequency cepstral coefficients and the first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, obtaining the mixed parameter features corresponding to each frame of voice data in the preprocessed voice data so as to form the mixed parameter feature time series.
  • For each frame of voice data in the preprocessed voice data, the short-time average amplitude, the mel-frequency cepstral coefficients, and the first-order difference of the mel-frequency cepstral coefficients are extracted, and the extracted parameters together constitute the mixed parameter feature corresponding to that frame, so as to form the mixed parameter feature time series.
  • In this way, the important parameters extracted from the preprocessed voice data are obtained, and combining these important parameters allows user types to be classified more accurately (mainly by age and gender).
  • Specifically, the short-time average amplitude of the nth frame of voice data in the preprocessed voice data is calculated according to the formula given above,
  • where Mn represents the short-time average amplitude of the nth frame of voice data in the preprocessed voice data, the nth frame of voice data is xn(m), 0 ≤ m ≤ N-1, and N is the frame length.
  • In an embodiment, the mixed parameter sequence acquiring unit 130 includes:
  • a Fourier transform unit 131, configured to perform Fourier transform on the preprocessed voice data in sequence to obtain frequency-domain voice data;
  • an absolute value acquiring unit 132, configured to take the absolute value of the frequency-domain voice data to obtain absolute-valued voice data;
  • a mel filtering unit 133, configured to pass the absolute-valued voice data through mel filtering to obtain mel-filtered voice data;
  • a mel-frequency cepstral coefficient acquiring unit 134, configured to sequentially perform a logarithm operation and a discrete cosine transform on the mel-filtered voice data to obtain the mel-frequency cepstral coefficients corresponding to the preprocessed voice data;
  • a first-order difference acquiring unit 135, configured to obtain the difference between consecutive adjacent terms of the mel-frequency cepstral coefficients to obtain the first-order difference of the mel-frequency cepstral coefficients.
  • Because the preprocessed voice data is a time-domain signal whose characteristics are hard to observe directly, each frame is converted to the frequency domain.
  • This is done with the discrete Fourier transform (DFT) or, where applicable, the fast Fourier transform (FFT).
  • For an N-point signal, if N is a power of 2 the FFT can be used to speed up the computation; otherwise only the DFT can be used, and the computation slows as the number of points increases. Therefore, when framing, the number of points per frame should be chosen as a power of 2.
  • The absolute-valued voice data is then filtered by the mel filter bank to obtain the mel-filtered voice data.
  • The mel filter bank is a set of triangular band-pass filters spaced uniformly on the mel scale, as described above.
  • The discrete cosine transform (DCT) completes the cepstral analysis: the time-domain signal is transformed to the frequency domain, the logarithm is taken, and the DCT is performed to obtain the cepstral coefficients. With the mel filter bank applied after the frequency-domain stage, the final result is the mel-frequency cepstral coefficients (MFCC).
  • The first-order difference is the difference between consecutive adjacent terms of a discrete sequence; here it is taken over the mel-frequency cepstral coefficients.
  • Each frame of voice data in the preprocessed voice data thus yields the three characteristic parameters above (short-time average amplitude, mel-frequency cepstral coefficient, and first-order difference of the mel-frequency cepstral coefficient); that is, each frame corresponds to a 1×3 row vector.
  • If the preprocessed voice data includes M frames of voice data,
  • the 1×3 row vectors corresponding to the frames are concatenated in time order to obtain a 1×3M row vector,
  • and the 1×3M row vector is the mixed parameter feature time series corresponding to the preprocessed voice data.
  • In other embodiments, each frame of voice data can additionally yield the fundamental frequency, the speech rate, and the sound pressure level, so as to form a mixed parameter feature time series with more parameter dimensions.
  • The user classification unit 140 is configured to input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter.
  • When the Gaussian mixture model is pre-trained, several sub-Gaussian mixture models need to be trained separately: for example, a first sub-Gaussian mixture model for identifying men aged 18-20, a second for men aged 21-30, a third for men aged 31-40, a fourth for men aged 41-50, a fifth for men aged 51-70, a sixth for identifying women aged 18-20, a seventh for women aged 21-30, an eighth for women aged 31-40, a ninth for women aged 41-50, and a tenth for women aged 51-70.
  • The Gaussian mixture model (GMM) refers to the probability distribution model of the above formula (4).
  • In an embodiment, the Gaussian mixture model of the user classification unit 140 includes a plurality of sub-Gaussian mixture models, one of which is denoted as the first sub-Gaussian mixture model:
  • a recognition model for recognizing males aged 18-20. Taking the training of the first sub-Gaussian mixture model for recognizing males aged 18-20 as an example, the voice-based user gender and age recognition device 100 further includes:
  • a first sample acquiring unit, configured to acquire first sample data, where the first sample data are mixed parameter feature time series corresponding to voice data of multiple 18-20-year-old males;
  • a first sub-model training unit, configured to train the first sub-Gaussian mixture model to be trained using the first sample data, to obtain the first sub-Gaussian mixture model for identifying males aged 18-20;
  • a sub-model on-chain unit, configured to store the trained first sub-Gaussian mixture model to the blockchain network.
  • The process of training the first sub-Gaussian mixture model to be trained is to input multiple sets of mixed parameter feature time series and to solve for the parameters of the model through the EM algorithm (the expectation-maximization algorithm), thereby obtaining the first sub-Gaussian mixture model.
  • The reply data sending unit 150 is configured to invoke a pre-stored voice reply strategy, obtain the current voice reply data corresponding to the current user classification result in the voice reply strategy, and send the current voice reply data to the user terminal.
  • The voice reply strategy stored in the server includes multiple pieces of voice style template data, each corresponding to one piece of voice reply data; in each voice style template, the speaker's gender, the speaker's style, and the speech flow are all preset.
  • For example, if the current user classification result is an 18-20-year-old male,
  • the current voice reply data corresponding to that classification result is a sweet female style with a lively speech flow. That is, when a male customer is recognized, the system automatically plays a recording by a sweet-voiced female agent and addresses the customer as "Mr." during the conversation to increase politeness.
  • When a female customer answers the phone, it automatically plays a recording by a male agent with a deep voice and addresses her as "Ms." to show politeness. A relaxed and lively speech flow is used for young customers, and a mature, steady speech flow for older customers.
  • In an embodiment, the voice-based user gender and age recognition device 100 further includes:
  • a unique identity code acquiring unit, configured to recognize the current user's voice data through a pre-trained N-gram model to obtain a recognition result, and to obtain the unique user identity code corresponding to the user identification code field in the recognition result.
  • The current user's voice data is recognized through the N-gram model (an n-gram language model), and the recognition output is a whole sentence, for example: "My name is Zhang San, my gender is male, and my age is 25. Business A needs to be handled today."
  • The N-gram model can effectively recognize the current user's voice data and take the sentence with the highest recognition probability as the recognition result.
  • Since the current user's voice data has by now been converted into the text of the recognition result, several key strings can be located in the recognition result to obtain the user age value and user gender value corresponding to the user age field and the user gender field. At the same time, the unique user identity code corresponding to the user identification code field in the recognition result can also be obtained; the unique user identity code is preferably the user's ID number.
  • In an embodiment, the voice-based user gender and age recognition device 100 further includes:
  • a gender and age comparison unit, configured to obtain, according to the unique user identity code, the user's true age value and true gender value corresponding to the user terminal, to determine whether the value of the estimated age parameter is equal to the user's true age value, and to determine whether the value of the gender parameter is equal to the user's true gender value;
  • an error data storage unit, configured to store the current user classification result and the current user voice data in a first storage area created in advance if the value of the estimated age parameter is not equal to the user's true age value or the value of the gender parameter is not equal to the user's true gender value.
  • That is, the user's real age and gender can be obtained through the unique user identity code.
  • The Gaussian mixture model has been used to classify the current user's voice data, giving the current user classification result that includes the gender parameter and the estimated age parameter.
  • The value of the estimated age parameter is then compared with the user's true age value, and the value of the gender parameter with the user's true gender value, to determine whether each pair is equal. Through these comparisons it can be judged whether the classification of the current user's voice data by the Gaussian mixture model is correct.
  • If the value of the estimated age parameter is not equal to the user's true age value, or the value of the gender parameter is not equal to the user's true gender value, the value of the gender parameter and/or the value of the estimated age parameter in the current user classification result is inaccurate.
  • In that case the current voice reply data corresponding to the current user classification result is not suitable for the current user, so the inaccurate current user classification result and the current user voice data are stored in the first storage area created in advance.
  • In this way, the data for which the intelligent gender and age recognition was inaccurate are recorded in the customer's history, to facilitate subsequent improvement of the Gaussian mixture model.
  • If the value of the estimated age parameter is equal to the user's true age value and the value of the gender parameter is equal to the user's true gender value, the values of the gender parameter and the estimated age parameter in the current user classification result are both accurate. In this case the current voice reply data corresponding to the current user classification result is suitable for the current user, and no adjustment is needed.
  • The device comprehensively considers the influence of the short-time average amplitude, the mel-frequency cepstral coefficients, and their first-order difference on gender recognition, and achieves accurate recognition of gender and age from the user's voice.
  • The above voice-based user gender and age recognition device can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 8.
  • FIG. 8 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • The computer device 500 is a server; the server may be an independent server or a server cluster composed of multiple servers.
  • The computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • When the computer program 5032 is executed, it can cause the processor 502 to execute the voice-based user gender and age identification method.
  • The processor 502 is used to provide computing and control capabilities and supports the operation of the entire computer device 500.
  • The internal memory 504 provides an environment for the running of the computer program 5032 in the non-volatile storage medium 503.
  • When the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the voice-based user gender and age identification method.
  • The network interface 505 is used for network communication, such as providing transmission of data information.
  • The structure shown in FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • The processor 502 is configured to run the computer program 5032 stored in the memory to implement the voice-based user gender and age identification method disclosed in the embodiments of the present application.
  • The embodiment of the computer device shown in FIG. 8 does not constitute a limitation on the specific configuration of the computer device.
  • The computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • The computer device may include only a memory and a processor; in such an embodiment the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 8 and are not repeated here.
  • The processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
  • In another embodiment of the present application, a computer-readable storage medium is provided.
  • The computer-readable storage medium may be non-volatile or volatile.
  • The computer-readable storage medium stores a computer program which, when executed by a processor, implements the voice-based user gender and age identification method disclosed in the embodiments of the present application.
  • The disclosed equipment, apparatus, and method may be implemented in other ways.
  • The device embodiments described above are only illustrative.
  • The division of the units is only a logical function division; in actual implementation there may be other division methods, or units with the same function may be combined into one unit. For example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • The mutual coupling, direct coupling, or communication connection displayed or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may also be electrical, mechanical, or other forms of connection.
  • The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • The functional units in the various embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a storage medium.
  • The technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), magnetic disks, optical discs, and other media that can store program code.

Abstract

The present application relates to the technical field of voice classification in artificial intelligence, and provides a voice-based user gender and age recognition method and apparatus, a computer device, and a storage medium. The method comprises: preprocessing received current user voice data transmitted by a user side to obtain preprocessed voice data; performing feature extraction in terms of short-time average amplitude, Mel-frequency cepstral coefficient, and Mel-frequency cepstral coefficient first-order difference on each frame of voice data to obtain corresponding mixed parameter features, so as to form a mixed parameter feature time sequence; inputting the mixed parameter feature time sequence into a Gaussian mixture model to obtain a corresponding current user classification result; and calling a voice response strategy, obtaining corresponding current voice response data, and sending the current voice response data to the user side. Accurate gender and age recognition based on user voice is implemented.

Description

基于语音的用户性别年龄识别方法、装置、计算机设备及存储介质Voice-based user gender and age recognition method, device, computer equipment and storage medium
本申请要求于2020年4月27日提交中国专利局、申请号为202010345904.3,发明名称为“基于语音的用户性别年龄识别方法、装置及计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on April 27, 2020, the application number is 202010345904.3, and the invention title is "Voice-based user gender and age recognition method, device and computer equipment". The entire content of the Chinese patent application is approved The reference is incorporated in this application.
技术领域Technical field
本申请涉及人工智能中的语音分类技术领域,尤其涉及一种基于语音的用户性别年龄识别方法、装置、计算机设备及存储介质。This application relates to the technical field of voice classification in artificial intelligence, and in particular to a voice-based user gender and age recognition method, device, computer equipment, and storage medium.
背景技术Background technique
目前,发明人意识到,智能电话外呼系统在自动根据待外呼用户清单中的用户信息对各用户进行电话外呼时,均是根据用户信息中的年龄和性别来确定外呼坐席声音的类型和外呼流程。At present, the inventor realizes that when the smart phone outbound call system automatically makes outbound calls to each user according to the user information in the list of outbound users, it determines the outbound agent voice based on the age and gender in the user information. Type and outbound process.
例如根据用户信息获知该用户为中年男性时,则智能电话外呼系统则调用女性坐席录音以实现外呼。但是若发生接电话的用户不是本人时,导致性别播报准确率较低。For example, when it is known that the user is a middle-aged male according to the user information, the smart phone outbound call system calls the female agent to record to realize the outbound call. However, if the user who answers the phone is not himself, the accuracy of gender broadcast is low.
发明内容Summary of the invention
本申请实施例提供了一种基于语音的用户性别年龄识别方法、装置、计算机设备及存储介质,旨在解决现有技术智能电话外呼系统在自动根据待外呼用户清单中的用户信息对各用户进行电话外呼时,若接电话的用户不是本人,易导致性别播报准确率较低的问题。The embodiments of the present application provide a voice-based user gender and age identification method, device, computer equipment and storage medium, which are intended to solve the problem that the prior art smart phone out-call system automatically performs a check on each user based on the user information in the list of outbound users. When a user makes an outbound call, if the user who answers the call is not himself, it is easy to cause the problem of low accuracy of gender broadcast.
第一方面,本申请实施例提供了一种基于语音的用户性别年龄识别方法,其包括:In the first aspect, an embodiment of the present application provides a voice-based user gender and age identification method, which includes:
接收用户端发送的当前用户语音数据;Receive the current user voice data sent by the user terminal;
将所述当前用户语音数据进行预处理,得到预处理后语音数据;Preprocessing the current user voice data to obtain preprocessed voice data;
将所述预处理后语音数据中每一帧语音数据进行短时平均幅度的提取、并将每一帧语音数据进行梅尔频率倒谱系数、及梅尔频率倒谱系数一阶差分的特征提取,得到与所述预处理后语音数据中每一帧语音数据对应的混合参数特征,以组成混合参数特征时间序列;Perform short-term average amplitude extraction for each frame of voice data in the preprocessed voice data, and perform feature extraction of Mel frequency cepstral coefficient and Mel frequency cepstral coefficient first difference for each frame of voice data To obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data to form a time series of mixed parameter features;
将所述混合参数特征时间序列输入至预先训练的高斯混合模型,得到与所述当前用户语音数据对应的当前用户分类结果;其中,所述当前用户分类结果包括性别参数和预估年龄参数;以及Input the characteristic time sequence of the mixture parameters into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data; wherein the current user classification result includes a gender parameter and an estimated age parameter; and
调用预先存储的语音回复策略,获取在所述语音回复策略中与当前用户分类结果对应的当前语音回复数据,将所述当前语音回复数据发送至用户端。Invoke a pre-stored voice reply strategy, obtain current voice reply data corresponding to the current user classification result in the voice reply strategy, and send the current voice reply data to the user terminal.
第二方面,本申请实施例提供了一种基于语音的用户性别年龄识别装置,其包括:In the second aspect, an embodiment of the present application provides a voice-based user gender and age recognition device, which includes:
语音数据接收单元,用于接收用户端发送的当前用户语音数据;The voice data receiving unit is used to receive the current user voice data sent by the user terminal;
语音预处理单元,用于将所述当前用户语音数据进行预处理,得到预处理后语音数据;A voice preprocessing unit, configured to preprocess the current user voice data to obtain preprocessed voice data;
混合参数序列获取单元,用于将所述预处理后语音数据中每一帧语音数据进行短时平均幅度的提取、并将每一帧语音数据进行梅尔频率倒谱系数、及梅尔频率倒谱系数一阶差分的特征提取,得到与所述预处理后语音数据中每一帧语音数据对应的混合参数特征,以组成混合参数特征时间序列;The mixing parameter sequence acquisition unit is used to extract the short-term average amplitude of each frame of speech data in the preprocessed speech data, and perform the Mel frequency cepstrum coefficient and Mel frequency inversion of each frame of speech data. Feature extraction of the first-order difference of spectral coefficients to obtain mixed parameter characteristics corresponding to each frame of speech data in the preprocessed speech data to form a mixed parameter characteristic time series;
用户分类单元,用于将所述混合参数特征时间序列输入至预先训练的高斯混合模型,得到与所述当前用户语音数据对应的当前用户分类结果;其中,所述当前用户分类结果包括性别参数和预估年龄参数;以及The user classification unit is used to input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data; wherein, the current user classification result includes gender parameters and Estimated age parameters; and
回复数据发送单元,用于调用预先存储的语音回复策略,获取在所述语音回复策略中与当前用户分类结果对应的当前语音回复数据,将所述当前语音回复数据发送至用户端。The reply data sending unit is configured to call a pre-stored voice reply strategy, obtain the current voice reply data corresponding to the current user classification result in the voice reply strategy, and send the current voice reply data to the user terminal.
第三方面,本申请实施例又提供了一种计算机设备,其包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现 以下步骤:In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and running on the processor, and the processor executes the computer The following steps are implemented during the program:
接收用户端发送的当前用户语音数据;Receive the current user voice data sent by the user terminal;
将所述当前用户语音数据进行预处理,得到预处理后语音数据;Preprocessing the current user voice data to obtain preprocessed voice data;
将所述预处理后语音数据中每一帧语音数据进行短时平均幅度的提取、并将每一帧语音数据进行梅尔频率倒谱系数、及梅尔频率倒谱系数一阶差分的特征提取,得到与所述预处理后语音数据中每一帧语音数据对应的混合参数特征,以组成混合参数特征时间序列;Perform short-term average amplitude extraction for each frame of voice data in the preprocessed voice data, and perform feature extraction of Mel frequency cepstral coefficient and Mel frequency cepstral coefficient first difference for each frame of voice data To obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data to form a time series of mixed parameter features;
将所述混合参数特征时间序列输入至预先训练的高斯混合模型,得到与所述当前用户语音数据对应的当前用户分类结果;其中,所述当前用户分类结果包括性别参数和预估年龄参数;以及Input the characteristic time sequence of the mixture parameters into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data; wherein the current user classification result includes a gender parameter and an estimated age parameter; and
调用预先存储的语音回复策略,获取在所述语音回复策略中与当前用户分类结果对应的当前语音回复数据,将所述当前语音回复数据发送至用户端。Invoke a pre-stored voice reply strategy, obtain current voice reply data corresponding to the current user classification result in the voice reply strategy, and send the current voice reply data to the user terminal.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the following operations:
receiving current user voice data sent by a user terminal;
preprocessing the current user voice data to obtain preprocessed voice data;
extracting a short-time average amplitude from each frame of voice data in the preprocessed voice data, and extracting Mel-frequency cepstral coefficients and the first-order difference of the Mel-frequency cepstral coefficients from each frame, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and to form a mixed parameter feature time series;
inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, where the current user classification result includes a gender parameter and an estimated age parameter; and
invoking a pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal.
The embodiments of the present application provide a voice-based user gender and age recognition method and apparatus, a computer device, and a storage medium. The method includes: receiving current user voice data sent by a user terminal; preprocessing the current user voice data to obtain preprocessed voice data; extracting a short-time average amplitude from each frame of voice data in the preprocessed voice data, and extracting Mel-frequency cepstral coefficients and the first-order difference of the Mel-frequency cepstral coefficients from each frame, to obtain a mixed parameter feature corresponding to each frame and form a mixed parameter feature time series; inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, where the current user classification result includes a gender parameter and an estimated age parameter; and invoking a pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal. The method takes into account the combined influence of the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference on gender recognition, and thereby achieves accurate recognition of gender and age from the user's voice.
Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the voice-based user gender and age recognition method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the voice-based user gender and age recognition method provided by an embodiment of the present application;
FIG. 3 is a schematic sub-flowchart of the voice-based user gender and age recognition method provided by an embodiment of the present application;
FIG. 4 is another schematic sub-flowchart of the voice-based user gender and age recognition method provided by an embodiment of the present application;
FIG. 5 is a schematic block diagram of the voice-based user gender and age recognition apparatus provided by an embodiment of the present application;
FIG. 6 is a schematic block diagram of subunits of the voice-based user gender and age recognition apparatus provided by an embodiment of the present application;
FIG. 7 is another schematic block diagram of subunits of the voice-based user gender and age recognition apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic block diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or combinations thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the voice-based user gender and age recognition method provided by an embodiment of the present application, and FIG. 2 is a schematic flowchart of that method. The method is applied in a server and executed by application software installed in the server.
As shown in FIG. 2, the method includes steps S110 to S150.
S110: Receive current user voice data sent by a user terminal.
In this embodiment, when the intelligent voice system deployed in the server needs to recognize gender and age from a user's voice, it first receives the current user voice data uploaded by the user terminal, so that the subsequent voice preprocessing and classification can be performed.
S120: Preprocess the current user voice data to obtain preprocessed voice data.
In this embodiment, because the actual voice signal (for example, the current user voice data collected in the present application) is an analog signal, before the voice signal is processed digitally, the current user voice data, denoted s(t), must first be sampled with a sampling period T and discretized into s(n). The sampling period should be chosen according to the bandwidth of the current user voice data (per the Nyquist sampling theorem) to avoid aliasing distortion in the frequency domain. Quantizing the discretized voice signal introduces a certain amount of quantization noise and distortion. Once the initial current user voice data is available, its preprocessing includes steps such as pre-emphasis, windowing, and framing.
In an embodiment, as shown in FIG. 3, step S120 includes:
S121: Sampling the current user voice data with a pre-stored sampling period to obtain a current discrete voice signal;
S122: Invoking a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal, to obtain a current pre-emphasized voice signal;
S123: Invoking a pre-stored Hamming window to window the current pre-emphasized voice signal, to obtain windowed voice data;
S124: Invoking a pre-stored frame shift and frame length to divide the windowed voice data into frames, to obtain the preprocessed voice data.
In this embodiment, before the voice signal is processed digitally, the current user voice data s(t) is first sampled with the sampling period T and discretized into s(n).
Then, when the pre-stored first-order FIR high-pass digital filter is invoked, this filter is a first-order non-recursive high-pass digital filter whose transfer function is given by equation (1):
H(z) = 1 − a·z^(−1)    (1)
In a specific implementation, a takes the value 0.98. For example, if the sample of the current discrete voice signal at time n is x(n), the corresponding sample of the current pre-emphasized voice signal after pre-emphasis is y(n) = x(n) − a·x(n−1).
The Hamming window invoked thereafter is given by equation (2):
ω(n) = 0.54 − 0.46·cos(2πn/(N−1)), 0 ≤ n ≤ N−1    (2)
Windowing the current pre-emphasized voice signal with the Hamming window yields windowed voice data that can be expressed as Q(n) = y(n)·ω(n).
Finally, when the pre-stored frame shift and frame length are invoked to divide the windowed voice data into frames, let the time-domain signal corresponding to the windowed voice data be x(l); the n-th frame of the resulting preprocessed voice data is x_n(m), and x_n(m) satisfies equation (3):
x_n(m) = ω(n)·x(n + m), 0 ≤ m ≤ N−1    (3)
where n = 0, 1T, 2T, ..., N is the frame length, T is the frame shift, and ω(n) is the Hamming window function.
Preprocessing the current user voice data in this way makes it directly usable for the subsequent extraction of acoustic parameters.
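By way of a non-limiting illustration, the following sketch implements the preprocessing chain of steps S121 to S124 in Python with NumPy; the frame length and frame shift values shown are assumptions for illustration, not values prescribed by the present application:

    import numpy as np

    def preprocess(signal, frame_len=256, frame_shift=128, a=0.98):
        # S122: pre-emphasis per equation (1), y(n) = x(n) - a*x(n-1)
        y = np.append(signal[0], signal[1:] - a * signal[:-1])
        # S123: Hamming window per equation (2)
        n = np.arange(frame_len)
        w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
        # S124: split into overlapping frames of length N with shift T, window each frame
        num_frames = 1 + (len(y) - frame_len) // frame_shift
        frames = np.stack([y[i * frame_shift: i * frame_shift + frame_len]
                           for i in range(num_frames)])
        return frames * w  # each row is one windowed frame x_n(m)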
S130: Extract a short-time average amplitude from each frame of voice data in the preprocessed voice data, and extract Mel-frequency cepstral coefficients and the first-order difference of the Mel-frequency cepstral coefficients from each frame, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and form a mixed parameter feature time series.
In this embodiment, when the important parameters are extracted from the preprocessed voice data, what is generally extracted is the short-time average amplitude, the Mel-frequency cepstral coefficients, and the first-order difference of the Mel-frequency cepstral coefficients; the extracted parameters then make up a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data, so as to form a mixed parameter feature time series. In this way, the important parameters extracted from the preprocessed voice data are obtained, and combining these parameters allows a more accurate classification of user type (mainly by age and gender).
When the short-time average amplitude of each frame of voice data in the preprocessed voice data is extracted, it is specifically computed as
M_n = Σ_{m=0}^{N−1} |x_n(m)|
where M_n denotes the short-time average amplitude of the n-th frame of the preprocessed voice data, x_n(m) is the n-th frame of the preprocessed voice data, 0 ≤ m ≤ N−1, and N is the frame length.
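Under the same assumptions as the preprocessing sketch above, the short-time average amplitude of every frame can be computed directly; `frames` is assumed to be the frame matrix returned by that sketch:

    import numpy as np

    def short_time_average_amplitude(frames):
        # M_n = sum over m of |x_n(m)| for each frame n
        return np.abs(frames).sum(axis=1)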
In an embodiment, as shown in FIG. 4, step S130 includes:
S131: Performing a Fourier transform on the preprocessed voice data to obtain frequency-domain voice data;
S132: Taking the absolute value of the frequency-domain voice data to obtain absolute-value voice data;
S133: Passing the absolute-value voice data through Mel filtering to obtain Mel-filtered voice data;
S134: Performing a logarithm operation followed by a discrete cosine transform on the Mel-filtered voice data, to obtain the Mel-frequency cepstral coefficients corresponding to the preprocessed voice data;
S135: Taking the difference between consecutive adjacent terms of the Mel-frequency cepstral coefficients, to obtain the first-order difference of the Mel-frequency cepstral coefficients.
In this embodiment, since the preprocessed voice data is a time-domain voice signal, mapping it onto linear frequencies requires the DFT (discrete Fourier transform) or the FFT (fast Fourier transform) to convert from the time domain to the frequency domain. For an N-point signal, if N/2 is an integer, the FFT can be used to speed up the processing; if N/2 is not an integer, only the DFT can be used, and the algorithm slows down as the number of points increases. Therefore, during framing, the number of points per frame must be an integer multiple of 2.
Since the output of the FFT is complex, with a real part and an imaginary part, taking its absolute value yields the modulus of the complex number and discards the phase. The modulus reflects the amplitude of the sound, and the amplitude carries the useful information; the human ear is not sensitive to the phase of a sound, so the phase can be ignored.
The absolute-value voice data is passed through a Mel filter bank to obtain the Mel-filtered voice data. The specific parameters of the Mel filter bank are as follows: the sampling rate is fs = 8000 Hz; the lowest frequency of the filter range is fl = 0 and the highest is fh = fs/2 = 8000/2 = 4000; the number of filters is M = 24; and the FFT length is N = 256. Passing the absolute-value voice data through Mel filtering applies Mel filtering to the linear frequencies, reflecting the auditory characteristics of the human ear.
When the Mel-filtered voice data undergoes the logarithm operation followed by the discrete cosine transform, the discrete cosine transform is the DCT: the time-domain signal is transformed to the frequency domain, the logarithm is taken, and the DCT is applied, yielding cepstral coefficients. If Mel filtering is inserted after the frequency-domain step, the result is the MFCC (Mel-frequency cepstral coefficients).
The first-order difference is the difference between consecutive adjacent terms of a discrete function. When the independent variable changes from x to x+1, the change in the function y = y(x) is Δy_x = y(x+1) − y(x), x = 0, 1, 2, ...; this is called the first-order difference of the function y(x) at the point x, written Δy_x = y_{x+1} − y_x, x = 0, 1, 2, ....
Since each frame of voice data in the preprocessed voice data yields the three feature parameters above (the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference), each frame corresponds to a 1×3 row vector. The preprocessed voice data contains M frames, and concatenating the 1×3 row vector of every frame in time order gives a 1×3M row vector, which is the mixed parameter feature time series corresponding to the preprocessed voice data.
In a specific implementation, in addition to the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference, three further parameters (fundamental frequency, speech rate, and sound pressure level) can also be obtained for each frame of the preprocessed voice data, forming a mixed parameter feature time series with more parameter dimensions.
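As a non-limiting sketch, the mixed parameter feature time series can be assembled as follows, here using the librosa library for the MFCC pipeline; the library choice is an assumption, and because MFCCs are normally a vector per frame, the sketch keeps only the first coefficient so that each frame yields the 1×3 layout described above:

    import numpy as np
    import librosa  # assumed available; any MFCC implementation would serve

    def mixed_feature_sequence(signal, sr=8000, n_fft=256, hop=128, n_mels=24):
        y = np.asarray(signal, dtype=float)
        # S131-S134: FFT -> absolute value -> Mel filter bank -> log -> DCT
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=n_fft, hop_length=hop, n_mels=n_mels)
        # S135: first-order difference of the MFCCs
        delta = librosa.feature.delta(mfcc, order=1)
        # short-time average amplitude M_n of each frame
        frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop)
        amp = np.abs(frames).sum(axis=0)
        m = min(mfcc.shape[1], amp.shape[0])
        per_frame = np.stack([amp[:m], mfcc[0, :m], delta[0, :m]], axis=1)  # M x 3
        return per_frame.reshape(-1)  # concatenated in time order: 1 x 3M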
S140: Input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, where the current user classification result includes a gender parameter and an estimated age parameter.
In this embodiment, pre-training the Gaussian mixture model requires training several sub-Gaussian mixture models separately, for example: a first sub-Gaussian mixture model for recognizing men aged 18-20, a second for men aged 21-30, a third for men aged 31-40, a fourth for men aged 41-50, a fifth for men aged 51-70, a sixth for women aged 18-20, a seventh for women aged 21-30, an eighth for women aged 31-40, a ninth for women aged 41-50, and a tenth for women aged 51-70.
A Gaussian mixture model (GMM) is a probability distribution model of the form of equation (4):
P(y|θ) = Σ_{k=1}^{K} α_k·φ(y|θ_k)    (4)
where α_k is a mixing coefficient with α_k ≥ 0 and Σ_{k=1}^{K} α_k = 1, and φ(y|θ_k) is the Gaussian density
φ(y|θ_k) = (1/(√(2π)·σ_k))·exp(−(y − μ_k)²/(2σ_k²))
where θ_k = (μ_k, σ_k²); the component φ(y|θ_k) is called the k-th sub-model.
In an embodiment, the Gaussian mixture model of step S140 includes a plurality of sub-Gaussian mixture models, one of which is denoted the first sub-Gaussian mixture model; the first sub-Gaussian mixture model is a recognition model for recognizing men aged 18-20. Taking the training of this first sub-model as an example, the method further includes, before step S140:
acquiring first sample data, where the first sample data consists of the mixed parameter feature time series corresponding to the voice data of a plurality of men aged 18-20;
training the first sub-Gaussian mixture model to be trained with the first sample data, to obtain the first sub-Gaussian mixture model for recognizing men aged 18-20; and
storing the trained first sub-Gaussian mixture model in a blockchain network.
A blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks produced in association with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include a blockchain underlying platform, a platform product service layer, and an application service layer.
In this embodiment, the way the mixed parameter feature time series corresponding to the voice data of men aged 18-20 is obtained for the first sample data can follow the specific process of steps S110 to S130 for obtaining the mixed parameter feature time series corresponding to the current user voice data. Training the first sub-Gaussian mixture model to be trained consists of inputting multiple sets of mixed parameter feature time series and solving for the parameters of the model with the EM algorithm (the expectation-maximization algorithm), thereby obtaining the first sub-Gaussian mixture model.
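As a non-limiting sketch of this training and classification scheme, scikit-learn's GaussianMixture (which fits by the EM algorithm) can stand in for each sub-model; the class labels, component count, and data layout below are illustrative assumptions:

    import numpy as np
    from sklearn.mixture import GaussianMixture  # fitted by the EM algorithm

    CLASSES = ["male_18_20", "male_21_30", "female_18_20"]  # illustrative subset

    def train_sub_models(samples_by_class, n_components=8):
        # samples_by_class: dict mapping a class label to an array whose rows
        # are mixed parameter feature time series of speakers in that class
        models = {}
        for label, feats in samples_by_class.items():
            gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
            gmm.fit(feats)  # EM solves for the alpha_k, mu_k, sigma_k parameters
            models[label] = gmm
        return models

    def classify(models, feature_row):
        # the sub-model giving the highest log-likelihood wins
        scores = {label: gmm.score(np.asarray(feature_row).reshape(1, -1))
                  for label, gmm in models.items()}
        return max(scores, key=scores.get)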
The trained first sub-Gaussian mixture model in the server can be stored on-chain in a blockchain network (preferably a private chain, so that the subsidiaries of an enterprise can use the private chain to invoke the first sub-Gaussian mixture model). Besides the first sub-Gaussian mixture model, the other sub-Gaussian mixture models of the Gaussian mixture model can likewise be stored on-chain. The parameter values included in every sub-Gaussian mixture model (such as the values corresponding to α_k and φ(y|θ_k)) are all stored in the blockchain network. In this process, the server is regarded as a blockchain node device of the blockchain network and has the authority to upload data to it. When the server needs to obtain the first sub-Gaussian mixture model from the blockchain network, whether the server has the authority of a blockchain node device is verified; if it does, the first sub-Gaussian mixture model is obtained, and a broadcast is made in the blockchain network to inform the blockchain node devices that the server has obtained the first sub-Gaussian mixture model.
S150: Invoke a pre-stored voice reply strategy, obtain from the voice reply strategy the current voice reply data corresponding to the current user classification result, and send the current voice reply data to the user terminal.
In this embodiment, the voice reply strategy stored in the server includes multiple kinds of voice style template data; each kind of voice style template data corresponds to one kind of voice reply data, and the speaker gender, speaker style, and scripted dialogue flow used by each voice style template are preset.
For example, if the current user classification result obtained is a man aged 18-20, the current voice reply data corresponding to that classification result in the voice reply strategy is a sweet-toned female voice with a lively dialogue flow. That is, when a male customer is recognized, a recording of a sweet-voiced female agent is invoked automatically and the customer is addressed as "Mr." in the dialogue flow, adding politeness; when a female customer answers the call, a recording of a deep male-voiced agent is invoked automatically and she is addressed as "Ms." out of courtesy. A relaxed and lively dialogue flow is invoked for young customers, and a mature and steady dialogue flow for older customers.
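A minimal sketch of such a lookup follows; the template names, age brackets, and honorifics are illustrative assumptions, since the real template data lives in the server's pre-stored voice reply strategy:

    REPLY_STRATEGY = {
        ("male", "18-20"):   {"agent_voice": "sweet_female", "flow": "lively", "honorific": "Mr."},
        ("female", "18-20"): {"agent_voice": "deep_male", "flow": "lively", "honorific": "Ms."},
        ("male", "51-70"):   {"agent_voice": "sweet_female", "flow": "steady", "honorific": "Mr."},
        ("female", "51-70"): {"agent_voice": "deep_male", "flow": "steady", "honorific": "Ms."},
    }

    def pick_reply(gender, age_bracket):
        # returns the voice reply template for the current user classification
        return REPLY_STRATEGY.get((gender, age_bracket))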
In an embodiment, after step S150 the method further includes:
recognizing the current user voice data with a pre-trained N-gram model to obtain a recognition result, and obtaining from the recognition result the unique user identity code corresponding to the user identity code field.
In this embodiment, the current user voice data is recognized by the N-gram model (a multi-gram language model), and the recognition yields a whole sentence, for example: "My name is Zhang San, gender male, age 25; today I need to handle business A." The N-gram model can recognize the current user voice data effectively, taking the sentence with the highest recognition probability as the recognition result.
Since the current user voice data has by now been converted into text (the recognition result), locating a few key strings in the recognition result is enough to obtain the user age value and user gender value corresponding to the user age field and user gender field respectively. The unique user identity code corresponding to the user identity code field can likewise be obtained from the recognition result; this unique user identity code is preferably the user's ID card number.
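By way of a non-limiting sketch, such a key string can be located in the recognized text with a regular expression; the 18-character pattern below assumes a mainland ID card number, the stated preference for the unique user identity code:

    import re

    ID_PATTERN = re.compile(r"\d{17}[\dXx]")  # 17 digits plus a digit or X check character

    def extract_identity_code(recognized_text):
        # locate the user identity code field inside the recognition result
        match = ID_PATTERN.search(recognized_text)
        return match.group(0) if match else None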
In an embodiment, after the current user voice data is recognized with the pre-trained N-gram model to obtain the recognition result and the unique user identity code corresponding to the user identity code field is obtained from it, the method further includes:
obtaining, according to the unique user identity code, the real age value and real gender value of the user corresponding to the user terminal, and judging whether the value of the estimated age parameter equals the user's real age value and whether the value of the gender parameter equals the user's real gender value; and
if the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, storing the current user classification result and the current user voice data in a pre-created first storage area.
In this embodiment, once the unique user identity code (that is, the user's ID card number) is obtained, the user's real age and gender can be obtained from it. Classifying the current user voice data with the Gaussian mixture model has meanwhile produced a current user classification result that includes the gender parameter and the estimated age parameter. The value of the estimated age parameter is then compared with the user's real age value, and the value of the gender parameter with the user's real gender value, to judge whether each pair is equal. This comparison determines whether the classification of the current user voice data by the Gaussian mixture model is correct.
If the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, the gender parameter and/or the estimated age parameter in the current user classification result is inaccurate; the current voice reply data obtained for that classification result is then unsuitable for the current user, so every inaccurately classified current user classification result, together with the corresponding current user voice data, is stored in the pre-created first storage area.
The data for which the intelligent gender and age recognition was inaccurate is recorded in the first storage area of the server as customer history, to facilitate subsequent improvement of the Gaussian mixture model.
If the value of the estimated age parameter equals the user's real age value and the value of the gender parameter equals the user's real gender value, both parameter values in the current user classification result are accurate; the current voice reply data obtained for that classification result suits the current user, and no adjustment of the current voice reply data for the current user classification result is needed.
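A minimal sketch of this verification step, under the assumption that the registered profile is reachable through the identity code and that the storage area behaves like a list, might look as follows:

    def verify_and_log(classification, real_age, real_gender, first_storage_area):
        # classification: dict with "estimated_age" and "gender" from the GMM
        correct = (classification["estimated_age"] == real_age
                   and classification["gender"] == real_gender)
        if not correct:
            # keep the inaccurate result (with a reference to the raw voice
            # data) as customer history for later model improvement
            first_storage_area.append(classification)
        return correct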
The method takes into account the combined influence of features such as the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference on gender recognition, and thereby achieves accurate recognition of gender and age from the user's voice.
An embodiment of the present application further provides a voice-based user gender and age recognition apparatus for performing any embodiment of the foregoing voice-based user gender and age recognition method. Specifically, please refer to FIG. 5, which is a schematic block diagram of the voice-based user gender and age recognition apparatus provided by an embodiment of the present application. The voice-based user gender and age recognition apparatus 100 can be configured in a server.
As shown in FIG. 5, the voice-based user gender and age recognition apparatus 100 includes: a voice data receiving unit 110, a voice preprocessing unit 120, a mixed parameter sequence acquisition unit 130, a user classification unit 140, and a reply data sending unit 150.
The voice data receiving unit 110 is configured to receive current user voice data sent by a user terminal.
In this embodiment, when the intelligent voice system deployed in the server needs to recognize gender and age from a user's voice, it first receives the current user voice data uploaded by the user terminal, so that the subsequent voice preprocessing and classification can be performed.
The voice preprocessing unit 120 is configured to preprocess the current user voice data to obtain preprocessed voice data.
In this embodiment, because the actual voice signal (for example, the current user voice data collected in the present application) is an analog signal, before the voice signal is processed digitally, the current user voice data s(t) must first be sampled with a sampling period T and discretized into s(n); the sampling period should be chosen according to the bandwidth of the current user voice data (per the Nyquist sampling theorem) to avoid aliasing distortion in the frequency domain. Quantizing the discretized voice signal introduces a certain amount of quantization noise and distortion. Once the initial current user voice data is available, its preprocessing includes steps such as pre-emphasis, windowing, and framing.
In an embodiment, as shown in FIG. 6, the voice preprocessing unit 120 includes:
a voice data sampling unit 121, configured to sample the current user voice data with a pre-stored sampling period to obtain a current discrete voice signal;
a pre-emphasis unit 122, configured to invoke a pre-stored first-order FIR high-pass digital filter to pre-emphasize the current discrete voice signal, to obtain a current pre-emphasized voice signal;
a windowing unit 123, configured to invoke a pre-stored Hamming window to window the current pre-emphasized voice signal, to obtain windowed voice data; and
a framing unit 124, configured to invoke a pre-stored frame shift and frame length to divide the windowed voice data into frames, to obtain the preprocessed voice data.
In this embodiment, before the voice signal is processed digitally, the current user voice data s(t) is first sampled with the sampling period T and discretized into s(n).
Then, when the pre-stored first-order FIR high-pass digital filter is invoked, this filter is a first-order non-recursive high-pass digital filter whose transfer function is given by equation (1) above.
For example, if the sample of the current discrete voice signal at time n is x(n), the corresponding sample of the current pre-emphasized voice signal after pre-emphasis is y(n) = x(n) − a·x(n−1).
Thereafter, the Hamming window invoked is the function of equation (2) above; windowing the current pre-emphasized voice signal with the Hamming window yields windowed voice data that can be expressed as Q(n) = y(n)·ω(n).
Finally, when the pre-stored frame shift and frame length are invoked to divide the windowed voice data into frames, let the time-domain signal corresponding to the windowed voice data be x(l); the n-th frame of the resulting preprocessed voice data is x_n(m), and x_n(m) satisfies equation (3) above. Preprocessing the current user voice data in this way makes it directly usable for the subsequent extraction of acoustic parameters.
The mixed parameter sequence acquisition unit 130 is configured to extract a short-time average amplitude from each frame of voice data in the preprocessed voice data, and to extract Mel-frequency cepstral coefficients and the first-order difference of the Mel-frequency cepstral coefficients from each frame, so as to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and to form a mixed parameter feature time series.
In this embodiment, when the important parameters are extracted from the preprocessed voice data, what is generally extracted is the short-time average amplitude, the Mel-frequency cepstral coefficients, and the first-order difference of the Mel-frequency cepstral coefficients; the extracted parameters then make up a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data, so as to form a mixed parameter feature time series. In this way, the important parameters extracted from the preprocessed voice data are obtained, and combining these parameters allows a more accurate classification of user type (mainly by age and gender).
When the short-time average amplitude of each frame of voice data in the preprocessed voice data is extracted, it is specifically computed as
M_n = Σ_{m=0}^{N−1} |x_n(m)|
where M_n denotes the short-time average amplitude of the n-th frame of the preprocessed voice data, x_n(m) is the n-th frame of the preprocessed voice data, 0 ≤ m ≤ N−1, and N is the frame length.
In an embodiment, as shown in FIG. 7, the mixed parameter sequence acquisition unit 130 includes:
a Fourier transform unit 131, configured to perform a Fourier transform on the preprocessed voice data to obtain frequency-domain voice data;
an absolute value unit 132, configured to take the absolute value of the frequency-domain voice data to obtain absolute-value voice data;
a Mel filtering unit 133, configured to pass the absolute-value voice data through Mel filtering to obtain Mel-filtered voice data;
a Mel-frequency cepstral coefficient acquisition unit 134, configured to perform a logarithm operation followed by a discrete cosine transform on the Mel-filtered voice data, to obtain the Mel-frequency cepstral coefficients corresponding to the preprocessed voice data; and
a first-order difference acquisition unit 135, configured to take the difference between consecutive adjacent terms of the Mel-frequency cepstral coefficients, to obtain the first-order difference of the Mel-frequency cepstral coefficients.
In this embodiment, since the preprocessed voice data is a time-domain voice signal, mapping it onto linear frequencies requires the DFT (discrete Fourier transform) or the FFT (fast Fourier transform) to convert from the time domain to the frequency domain. For an N-point signal, if N/2 is an integer, the FFT can be used to speed up the processing; if N/2 is not an integer, only the DFT can be used, and the algorithm slows down as the number of points increases. Therefore, during framing, the number of points per frame must be an integer multiple of 2.
Since the output of the FFT is complex, with a real part and an imaginary part, taking its absolute value yields the modulus of the complex number and discards the phase. The modulus reflects the amplitude of the sound, and the amplitude carries the useful information; the human ear is not sensitive to the phase of a sound, so the phase can be ignored.
The absolute-value voice data is passed through a Mel filter bank to obtain the Mel-filtered voice data. The specific parameters of the Mel filter bank are as follows: the sampling rate is fs = 8000 Hz; the lowest frequency of the filter range is fl = 0 and the highest is fh = fs/2 = 8000/2 = 4000; the number of filters is M = 24; and the FFT length is N = 256. Passing the absolute-value voice data through Mel filtering applies Mel filtering to the linear frequencies, reflecting the auditory characteristics of the human ear.
When the Mel-filtered voice data undergoes the logarithm operation followed by the discrete cosine transform, the discrete cosine transform is the DCT: the time-domain signal is transformed to the frequency domain, the logarithm is taken, and the DCT is applied, yielding cepstral coefficients. If Mel filtering is inserted after the frequency-domain step, the result is the MFCC (Mel-frequency cepstral coefficients).
The first-order difference is the difference between consecutive adjacent terms of a discrete function. When the independent variable changes from x to x+1, the change in the function y = y(x) is Δy_x = y(x+1) − y(x), x = 0, 1, 2, ...; this is called the first-order difference of the function y(x) at the point x, written Δy_x = y_{x+1} − y_x.
Since each frame of voice data in the preprocessed voice data yields the three feature parameters above (the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference), each frame corresponds to a 1×3 row vector. The preprocessed voice data contains M frames, and concatenating the 1×3 row vector of every frame in time order gives a 1×3M row vector, which is the mixed parameter feature time series corresponding to the preprocessed voice data.
In a specific implementation, in addition to the short-time average amplitude, the Mel-frequency cepstral coefficients, and their first-order difference, three further parameters (fundamental frequency, speech rate, and sound pressure level) can also be obtained for each frame of the preprocessed voice data, forming a mixed parameter feature time series with more parameter dimensions.
The user classification unit 140 is configured to input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, where the current user classification result includes a gender parameter and an estimated age parameter.
In this embodiment, pre-training the Gaussian mixture model requires training several sub-Gaussian mixture models separately, for example: a first sub-Gaussian mixture model for recognizing men aged 18-20, a second for men aged 21-30, a third for men aged 31-40, a fourth for men aged 41-50, a fifth for men aged 51-70, a sixth for women aged 18-20, a seventh for women aged 21-30, an eighth for women aged 31-40, a ninth for women aged 41-50, and a tenth for women aged 51-70.
A Gaussian mixture model (GMM) is a probability distribution model of the form of equation (4) above.
In an embodiment, the Gaussian mixture model of the user classification unit 140 includes a plurality of sub-Gaussian mixture models, one of which is denoted the first sub-Gaussian mixture model; the first sub-Gaussian mixture model is a recognition model for recognizing men aged 18-20. Taking the training of this first sub-model as an example, the voice-based user gender and age recognition apparatus 100 further includes:
a first sample acquisition unit, configured to acquire first sample data, where the first sample data consists of the mixed parameter feature time series corresponding to the voice data of a plurality of men aged 18-20;
a first sub-model training unit, configured to train the first sub-Gaussian mixture model to be trained with the first sample data, to obtain the first sub-Gaussian mixture model for recognizing men aged 18-20; and
a sub-model on-chain unit, configured to store the trained first sub-Gaussian mixture model in a blockchain network.
In this embodiment, the way the mixed parameter feature time series corresponding to the voice data of men aged 18-20 is obtained for the first sample data can follow the specific process of obtaining the mixed parameter feature time series corresponding to the current user voice data. Training the first sub-Gaussian mixture model to be trained consists of inputting multiple sets of mixed parameter feature time series and solving for the parameters of the model with the EM algorithm (the expectation-maximization algorithm), thereby obtaining the first sub-Gaussian mixture model.
The reply data sending unit 150 is configured to invoke a pre-stored voice reply strategy, obtain from the voice reply strategy the current voice reply data corresponding to the current user classification result, and send the current voice reply data to the user terminal.
In this embodiment, the voice reply strategy stored in the server includes multiple kinds of voice style template data; each kind of voice style template data corresponds to one kind of voice reply data, and the speaker gender, speaker style, and scripted dialogue flow used by each voice style template are preset.
For example, if the current user classification result obtained is a man aged 18-20, the current voice reply data corresponding to that classification result in the voice reply strategy is a sweet-toned female voice with a lively dialogue flow. That is, when a male customer is recognized, a recording of a sweet-voiced female agent is invoked automatically and the customer is addressed as "Mr." in the dialogue flow, adding politeness; when a female customer answers the call, a recording of a deep male-voiced agent is invoked automatically and she is addressed as "Ms." out of courtesy. A relaxed and lively dialogue flow is invoked for young customers, and a mature and steady dialogue flow for older customers.
In an embodiment, the voice-based user gender and age recognition apparatus 100 further includes:
an identity code acquisition unit, configured to recognize the current user voice data with a pre-trained N-gram model to obtain a recognition result, and to obtain from the recognition result the unique user identity code corresponding to the user identity code field.
In this embodiment, the current user voice data is recognized by the N-gram model (a multi-gram language model), and the recognition yields a whole sentence, for example: "My name is Zhang San, gender male, age 25; today I need to handle business A." The N-gram model can recognize the current user voice data effectively, taking the sentence with the highest recognition probability as the recognition result.
Since the current user voice data has by now been converted into text (the recognition result), locating a few key strings in the recognition result is enough to obtain the user age value and user gender value corresponding to the user age field and user gender field respectively. The unique user identity code corresponding to the user identity code field can likewise be obtained from the recognition result; this unique user identity code is preferably the user's ID card number.
In an embodiment, the voice-based user gender and age recognition device 100 further includes:
a gender and age comparison unit, configured to obtain, according to the unique user identity code, the user's real age value and real gender value corresponding to the user terminal, and to judge whether the value of the estimated age parameter equals the user's real age value and whether the value of the gender parameter equals the user's real gender value; and
an error data storage unit, configured to store the current user classification result and the current user voice data into a pre-created first storage area if the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value.
In this embodiment, once the unique user identity code (i.e., the user's ID card number) has been obtained, the user's real age and gender can be retrieved through it. Classifying the current user voice data with the Gaussian mixture model yields a current user classification result that includes a gender parameter and an estimated age parameter. The value of the estimated age parameter is then compared with the user's real age value, and the value of the gender parameter with the user's real gender value. These comparisons determine whether the Gaussian-mixture-model classification of the current user voice data is correct.
If the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, the gender parameter and/or the estimated age parameter in the current user classification result is inaccurate. In that case, the current voice reply data obtained from the current user classification result is not suitable for the current user, so every inaccurately classified current user classification result and the corresponding current user voice data are stored into the pre-created first storage area.
The first storage area in the server records the data for which the automatic gender and age recognition was inaccurate, as the customer's history, to facilitate subsequent improvement of the Gaussian mixture model.
If the value of the estimated age parameter equals the user's real age value and the value of the gender parameter equals the user's real gender value, both the gender parameter and the estimated age parameter in the current user classification result are accurate; the current voice reply data obtained from the classification result suits the current user, and no adjustment of the reply for the current user is needed. A sketch of this audit step follows.
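A minimal sketch of the comparison-and-store audit, assuming a list-backed "first storage area" and a simple record layout (both are illustrative, not the patent's storage design):

```python
# Compare the GMM classification result with the real values retrieved via
# the unique identity code; archive mismatches for later model improvement.
def audit_classification(result: dict, real_age: int, real_gender: str,
                         first_storage_area: list, voice_data: bytes) -> bool:
    """Return True if the classification matches the user's real data."""
    correct = (result["estimated_age"] == real_age
               and result["gender"] == real_gender)
    if not correct:
        first_storage_area.append({"result": result, "voice": voice_data})
    return correct

store: list = []
ok = audit_classification({"gender": "male", "estimated_age": 25},
                          real_age=31, real_gender="male",
                          first_storage_area=store, voice_data=b"...")
print(ok, len(store))   # False 1 -> sample kept for retraining
```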
The device jointly considers the influence of features such as the short-time average amplitude, the mel-frequency cepstral coefficients, and the first-order difference of the mel-frequency cepstral coefficients on gender recognition, achieving accurate recognition of gender and age from the user's voice.
The above voice-based user gender and age recognition device may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 8.
Referring to FIG. 8, FIG. 8 is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 500 is a server, which may be a standalone server or a server cluster composed of multiple servers.
Referring to FIG. 8, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 causes the processor 502 to perform the voice-based user gender and age recognition method.
The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, it causes the processor 502 to perform the voice-based user gender and age recognition method.
The network interface 505 is used for network communication, such as transmitting data information. Those skilled in the art will understand that the structure shown in FIG. 8 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device 500 to which the solution is applied; a specific computer device 500 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the voice-based user gender and age recognition method disclosed in the embodiments of this application.
Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 8 does not limit the specific configuration of the computer device; in other embodiments, the computer device may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structure and function of the memory and the processor are consistent with the embodiment shown in FIG. 8 and are not repeated here.
It should be understood that in the embodiments of this application, the processor 502 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. It stores a computer program that, when executed by a processor, implements the voice-based user gender and age recognition method disclosed in the embodiments of this application.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally by function. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed devices, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative. The division of units is only a division by logical function; in actual implementation there may be other ways of dividing, units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may also be electrical, mechanical, or other forms of connection.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of this application.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, may exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of this application — in essence, or the part contributing to the prior art, or the whole or part of the technical solution — may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (20)

  1. A voice-based user gender and age recognition method, comprising:
    receiving current user voice data sent by a user terminal;
    preprocessing the current user voice data to obtain preprocessed voice data;
    extracting a short-time average amplitude from each frame of voice data in the preprocessed voice data, and performing feature extraction of mel-frequency cepstral coefficients and a first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and compose a mixed parameter feature time series;
    inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter; and
    invoking a pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal.
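To make the claimed pipeline concrete, here is a compact end-to-end sketch on synthetic audio. It is illustrative only: the class labels, frame sizes, pre-emphasis coefficient, two-model setup, and the log-spectrum stand-in for the full MFCC chain are assumptions; each step is treated more closely in the sketches under claims 4-8 below.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def frames_from(signal, frame_len=256, hop=128):
    # Pre-emphasis, then split into overlapping Hamming-windowed frames.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    win = np.hamming(frame_len)
    n = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i*hop:i*hop + frame_len] * win for i in range(n)])

def features(frames):
    amp = np.sum(np.abs(frames), axis=1, keepdims=True)    # short-time average amplitude
    logspec = np.log(np.abs(np.fft.rfft(frames)) + 1e-8)   # stand-in for the MFCC chain
    return np.hstack([amp, logspec[:, :12]])

rng = np.random.default_rng(0)
# One sub-GMM per class, trained on (fake) class-specific recordings.
models = {label: GaussianMixture(n_components=4, covariance_type="diag",
                                 random_state=0)
                 .fit(features(frames_from(rng.standard_normal(8000))))
          for label in ["male_18_20", "female_18_20"]}

test = features(frames_from(rng.standard_normal(8000)))
best = max(models, key=lambda k: models[k].score(test))    # highest avg log-likelihood wins
print("classified as:", best)
```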
  2. The voice-based user gender and age recognition method according to claim 1, wherein after invoking the pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal, the method further comprises:
    recognizing the current user voice data through a pre-trained N-gram model to obtain a recognition result, and obtaining from the recognition result the unique user identity code corresponding to the user identification code field.
  3. The voice-based user gender and age recognition method according to claim 2, wherein after recognizing the current user voice data through the pre-trained N-gram model to obtain the recognition result and obtaining from the recognition result the unique user identity code corresponding to the user identification code field, the method further comprises:
    obtaining, according to the unique user identity code, the user's real age value and real gender value corresponding to the user terminal, and judging whether the value of the estimated age parameter equals the user's real age value and whether the value of the gender parameter equals the user's real gender value; and
    if the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, storing the current user classification result and the current user voice data into a pre-created first storage area.
  4. The voice-based user gender and age recognition method according to claim 1, wherein preprocessing the current user voice data to obtain the preprocessed voice data comprises:
    sampling the current user voice data at a pre-stored sampling period to obtain a current discrete voice signal;
    pre-emphasizing the current discrete voice signal with a pre-stored first-order FIR high-pass digital filter to obtain a current pre-emphasized voice signal;
    windowing the current pre-emphasized voice signal with a pre-stored Hamming window to obtain windowed voice data;
    framing the windowed voice data with a pre-stored frame shift and frame length to obtain the preprocessed voice data.
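A sketch of this preprocessing chain follows. The 0.97 pre-emphasis coefficient and the 25 ms / 10 ms frame length and shift are common defaults, not values stated in the claim, and the Hamming window is applied per frame (the usual practical ordering, equivalent in effect):

```python
import numpy as np

def preprocess(voice: np.ndarray, fs: int = 8000) -> np.ndarray:
    # First-order FIR high-pass pre-emphasis: y[n] = x[n] - 0.97 * x[n-1]
    emphasized = np.append(voice[0], voice[1:] - 0.97 * voice[:-1])
    frame_len, frame_shift = int(0.025 * fs), int(0.010 * fs)   # 200 / 80 samples at 8 kHz
    window = np.hamming(frame_len)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    return np.stack([emphasized[i*frame_shift : i*frame_shift + frame_len] * window
                     for i in range(n_frames)])

frames = preprocess(np.random.randn(8000))   # one second of fake audio -> shape (98, 200)
```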
  5. The voice-based user gender and age recognition method according to claim 1, wherein performing feature extraction of the mel-frequency cepstral coefficients and the first-order difference of the mel-frequency cepstral coefficients on each frame of voice data comprises:
    performing a Fourier transform on the preprocessed voice data to obtain frequency-domain voice data;
    taking the absolute value of the frequency-domain voice data to obtain absolute-value voice data;
    passing the absolute-value voice data through mel filtering to obtain mel-filtered voice data;
    performing a logarithm operation and a discrete cosine transform on the mel-filtered voice data in sequence to obtain the mel-frequency cepstral coefficients corresponding to the preprocessed voice data;
    obtaining the difference between consecutive adjacent items of the mel-frequency cepstral coefficients to obtain the first-order difference of the mel-frequency cepstral coefficients.
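A sketch of this FFT → |·| → mel filter bank → log → DCT chain, plus the first-order difference. The simplified triangular filter bank, the 13 retained coefficients, and taking the difference across frames (the common reading of "consecutive adjacent items") are assumptions:

```python
import numpy as np
from scipy.fft import dct

def mel_bank(n_filters=24, n_fft=256, fs=8000):
    # Triangular filters spaced evenly on the mel scale between 0 and fs/2.
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = np.floor((n_fft + 1) *
                   inv(np.linspace(mel(0), mel(fs / 2), n_filters + 2)) / fs).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = pts[i], pts[i + 1], pts[i + 2]
        bank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        bank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return bank

def mfcc_with_delta(frames):                       # frames: (n_frames, frame_len)
    mag = np.abs(np.fft.rfft(frames, n=256))       # Fourier transform, absolute value
    mels = np.log(mag @ mel_bank().T + 1e-8)       # mel filtering, then logarithm
    coeffs = dct(mels, type=2, norm="ortho")[:, :13]   # discrete cosine transform
    delta = np.diff(coeffs, axis=0)                # first-order difference of the MFCCs
    return coeffs, delta
```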
  6. The voice-based user gender and age recognition method according to claim 1, wherein the Gaussian mixture model includes multiple sub-Gaussian-mixture models, one of which is denoted as a first sub-Gaussian-mixture model, the first sub-Gaussian-mixture model being a recognition model for recognizing 18-20 year-old males;
    before inputting the mixed parameter feature time series into the pre-trained Gaussian mixture model to obtain the current user classification result corresponding to the current user voice data, the method further comprises:
    obtaining first sample data, the first sample data being the mixed parameter feature time series corresponding to the voice data of multiple 18-20 year-old males;
    training a to-be-trained first sub-Gaussian-mixture model with the first sample data to obtain the first sub-Gaussian-mixture model for recognizing 18-20 year-old males; and
    storing the trained first sub-Gaussian-mixture model to a blockchain network.
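A minimal training sketch for the first sub-model, assuming scikit-learn's GaussianMixture. The component count, diagonal covariance, and stand-in sample matrix are assumptions; persisting the fitted model to a blockchain network is outside this sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for the real first sample data: rows of mixed-parameter features
# extracted from 18-20 year-old male speech (shape: n_frames x n_features).
first_sample_data = np.random.randn(5000, 27)

male_18_20_gmm = GaussianMixture(n_components=16, covariance_type="diag",
                                 max_iter=200, random_state=0)
male_18_20_gmm.fit(first_sample_data)

# At inference time, the sub-model reporting the highest average
# log-likelihood for an utterance's features yields the classification.
print(male_18_20_gmm.score(np.random.randn(100, 27)))
```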
  7. The voice-based user gender and age recognition method according to claim 1, wherein extracting the short-time average amplitude from each frame of voice data in the preprocessed voice data comprises:
    calculating, according to
    $M_n = \sum_{m=0}^{N-1} |x_n(m)|$,
    the short-time average amplitude of the n-th frame of voice data in the preprocessed voice data, where $M_n$ denotes the short-time average amplitude of the n-th frame of voice data in the preprocessed voice data, the n-th frame of voice data in the preprocessed voice data is $x_n(m)$, $0 \le m \le N-1$, and $N$ is the frame length.
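The formula has a one-line counterpart in code: the short-time average amplitude of each frame is the sum of the absolute sample values (despite the name "average", the classical definition sums without dividing by N):

```python
import numpy as np

def short_time_average_amplitude(frames: np.ndarray) -> np.ndarray:
    """M_n = sum over m of |x_n(m)|, computed for every frame n at once."""
    return np.sum(np.abs(frames), axis=1)
```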
  8. The voice-based user gender and age recognition method according to claim 5, wherein passing the absolute-value voice data through mel filtering to obtain the mel-filtered voice data comprises passing the absolute-value voice data through a mel filter bank, wherein the sampling rate of the mel filter bank is fs = 8000 Hz, the lowest frequency of the filter frequency range is fl = 0, and the highest frequency of the filter frequency range is fh = fs/2 = 8000/2 = 4000; the number of filters is set to M = 24 and the FFT length to N = 256.
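A short sketch deriving the filter-bank layout from exactly these claimed parameters: 24 filters spaced evenly on the mel scale between fl = 0 and fh = 4000 Hz, mapped onto a 256-point FFT grid (the full bank construction is in the mel_bank helper shown under claim 5; the Hz-to-bin mapping here is a standard convention, not stated in the claim):

```python
import numpy as np

fs, fl, fh, M, N = 8000, 0.0, 8000 / 2, 24, 256
mel = lambda f: 2595 * np.log10(1 + f / 700)        # Hz -> mel
inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)    # mel -> Hz

edge_mels = np.linspace(mel(fl), mel(fh), M + 2)    # M filters need M + 2 edges
edge_hz = inv_mel(edge_mels)
fft_bins = np.floor((N + 1) * edge_hz / fs).astype(int)
print(fft_bins)   # 26 edge bins running from 0 up to N/2 = 128
```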
  9. A voice-based user gender and age recognition device, comprising:
    a voice data receiving unit, configured to receive current user voice data sent by a user terminal;
    a voice preprocessing unit, configured to preprocess the current user voice data to obtain preprocessed voice data;
    a mixed parameter sequence acquisition unit, configured to extract a short-time average amplitude from each frame of voice data in the preprocessed voice data, and perform feature extraction of mel-frequency cepstral coefficients and a first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and compose a mixed parameter feature time series;
    a user classification unit, configured to input the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter; and
    a reply data sending unit, configured to invoke a pre-stored voice reply strategy, obtain from the voice reply strategy the current voice reply data corresponding to the current user classification result, and send the current voice reply data to the user terminal.
  10. A computer device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    receiving current user voice data sent by a user terminal;
    preprocessing the current user voice data to obtain preprocessed voice data;
    extracting a short-time average amplitude from each frame of voice data in the preprocessed voice data, and performing feature extraction of mel-frequency cepstral coefficients and a first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and compose a mixed parameter feature time series;
    inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter; and
    invoking a pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal.
  11. The computer device according to claim 10, wherein after invoking the pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal, the steps further comprise:
    recognizing the current user voice data through a pre-trained N-gram model to obtain a recognition result, and obtaining from the recognition result the unique user identity code corresponding to the user identification code field.
  12. The computer device according to claim 11, wherein after recognizing the current user voice data through the pre-trained N-gram model to obtain the recognition result and obtaining from the recognition result the unique user identity code corresponding to the user identification code field, the steps further comprise:
    obtaining, according to the unique user identity code, the user's real age value and real gender value corresponding to the user terminal, and judging whether the value of the estimated age parameter equals the user's real age value and whether the value of the gender parameter equals the user's real gender value; and
    if the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, storing the current user classification result and the current user voice data into a pre-created first storage area.
  13. The computer device according to claim 10, wherein preprocessing the current user voice data to obtain the preprocessed voice data comprises:
    sampling the current user voice data at a pre-stored sampling period to obtain a current discrete voice signal;
    pre-emphasizing the current discrete voice signal with a pre-stored first-order FIR high-pass digital filter to obtain a current pre-emphasized voice signal;
    windowing the current pre-emphasized voice signal with a pre-stored Hamming window to obtain windowed voice data;
    framing the windowed voice data with a pre-stored frame shift and frame length to obtain the preprocessed voice data.
  14. The computer device according to claim 10, wherein performing feature extraction of the mel-frequency cepstral coefficients and the first-order difference of the mel-frequency cepstral coefficients on each frame of voice data comprises:
    performing a Fourier transform on the preprocessed voice data to obtain frequency-domain voice data;
    taking the absolute value of the frequency-domain voice data to obtain absolute-value voice data;
    passing the absolute-value voice data through mel filtering to obtain mel-filtered voice data;
    performing a logarithm operation and a discrete cosine transform on the mel-filtered voice data in sequence to obtain the mel-frequency cepstral coefficients corresponding to the preprocessed voice data;
    obtaining the difference between consecutive adjacent items of the mel-frequency cepstral coefficients to obtain the first-order difference of the mel-frequency cepstral coefficients.
  15. The computer device according to claim 10, wherein the Gaussian mixture model includes multiple sub-Gaussian-mixture models, one of which is denoted as a first sub-Gaussian-mixture model, the first sub-Gaussian-mixture model being a recognition model for recognizing 18-20 year-old males;
    before inputting the mixed parameter feature time series into the pre-trained Gaussian mixture model to obtain the current user classification result corresponding to the current user voice data, the steps further comprise:
    obtaining first sample data, the first sample data being the mixed parameter feature time series corresponding to the voice data of multiple 18-20 year-old males;
    training a to-be-trained first sub-Gaussian-mixture model with the first sample data to obtain the first sub-Gaussian-mixture model for recognizing 18-20 year-old males; and
    storing the trained first sub-Gaussian-mixture model to a blockchain network.
  16. The computer device according to claim 10, wherein extracting the short-time average amplitude from each frame of voice data in the preprocessed voice data comprises:
    calculating, according to
    $M_n = \sum_{m=0}^{N-1} |x_n(m)|$,
    the short-time average amplitude of the n-th frame of voice data in the preprocessed voice data, where $M_n$ denotes the short-time average amplitude of the n-th frame of voice data in the preprocessed voice data, the n-th frame of voice data in the preprocessed voice data is $x_n(m)$, $0 \le m \le N-1$, and $N$ is the frame length.
  17. The computer device according to claim 14, wherein passing the absolute-value voice data through mel filtering to obtain the mel-filtered voice data comprises passing the absolute-value voice data through a mel filter bank, wherein the sampling rate of the mel filter bank is fs = 8000 Hz, the lowest frequency of the filter frequency range is fl = 0, and the highest frequency of the filter frequency range is fh = fs/2 = 8000/2 = 4000; the number of filters is set to M = 24 and the FFT length to N = 256.
  18. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the following operations:
    receiving current user voice data sent by a user terminal;
    preprocessing the current user voice data to obtain preprocessed voice data;
    extracting a short-time average amplitude from each frame of voice data in the preprocessed voice data, and performing feature extraction of mel-frequency cepstral coefficients and a first-order difference of the mel-frequency cepstral coefficients on each frame of voice data, to obtain a mixed parameter feature corresponding to each frame of voice data in the preprocessed voice data and compose a mixed parameter feature time series;
    inputting the mixed parameter feature time series into a pre-trained Gaussian mixture model to obtain a current user classification result corresponding to the current user voice data, wherein the current user classification result includes a gender parameter and an estimated age parameter; and
    invoking a pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal.
  19. The computer-readable storage medium according to claim 18, wherein after invoking the pre-stored voice reply strategy, obtaining from the voice reply strategy the current voice reply data corresponding to the current user classification result, and sending the current voice reply data to the user terminal, the operations further comprise:
    recognizing the current user voice data through a pre-trained N-gram model to obtain a recognition result, and obtaining from the recognition result the unique user identity code corresponding to the user identification code field.
  20. The computer-readable storage medium according to claim 19, wherein after recognizing the current user voice data through the pre-trained N-gram model to obtain the recognition result and obtaining from the recognition result the unique user identity code corresponding to the user identification code field, the operations further comprise:
    obtaining, according to the unique user identity code, the user's real age value and real gender value corresponding to the user terminal, and judging whether the value of the estimated age parameter equals the user's real age value and whether the value of the gender parameter equals the user's real gender value; and
    if the value of the estimated age parameter does not equal the user's real age value, or the value of the gender parameter does not equal the user's real gender value, storing the current user classification result and the current user voice data into a pre-created first storage area.
PCT/CN2020/131612 2020-04-27 2020-11-26 Voice-based user gender and age recognition method and apparatus, computer device, and storage medium WO2021218136A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010345904.3A CN111683181B (en) 2020-04-27 2020-04-27 Voice-based user gender and age identification method and device and computer equipment
CN202010345904.3 2020-04-27

Publications (1)

Publication Number Publication Date
WO2021218136A1 true WO2021218136A1 (en) 2021-11-04

Family

ID=72433818

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/131612 WO2021218136A1 (en) 2020-04-27 2020-11-26 Voice-based user gender and age recognition method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN111683181B (en)
WO (1) WO2021218136A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115187431A (en) * 2022-09-15 2022-10-14 广州天辰信息科技有限公司 Endowment service robot system based on big data

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111683181B (en) * 2020-04-27 2022-04-12 平安科技(深圳)有限公司 Voice-based user gender and age identification method and device and computer equipment
CN113192510B (en) * 2020-12-29 2024-04-30 云从科技集团股份有限公司 Method, system and medium for realizing voice age and/or sex identification service
CN113194210B (en) * 2021-04-30 2023-02-24 中国银行股份有限公司 Voice call access method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8160877B1 (en) * 2009-08-06 2012-04-17 Narus, Inc. Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
CN103236259A (en) * 2013-03-22 2013-08-07 乐金电子研发中心(上海)有限公司 Voice recognition processing and feedback system, voice response method
CN104700843A (en) * 2015-02-05 2015-06-10 海信集团有限公司 Method and device for identifying ages
CN108694954A (en) * 2018-06-13 2018-10-23 广州势必可赢网络科技有限公司 A kind of Sex, Age recognition methods, device, equipment and readable storage medium storing program for executing
CN110648672A (en) * 2019-09-05 2020-01-03 深圳追一科技有限公司 Character image generation method, interaction method, device and terminal equipment
CN111683181A (en) * 2020-04-27 2020-09-18 平安科技(深圳)有限公司 Voice-based user gender and age identification method and device and computer equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106128467A (en) * 2016-06-06 2016-11-16 北京云知声信息技术有限公司 Method of speech processing and device
CN106157135A (en) * 2016-07-14 2016-11-23 微额速达(上海)金融信息服务有限公司 Antifraud system and method based on Application on Voiceprint Recognition Sex, Age
CN107170456A (en) * 2017-06-28 2017-09-15 北京云知声信息技术有限公司 Method of speech processing and device
CN109256138B (en) * 2018-08-13 2023-07-07 平安科技(深圳)有限公司 Identity verification method, terminal device and computer readable storage medium
CN109448756A (en) * 2018-11-14 2019-03-08 北京大生在线科技有限公司 A kind of voice age recognition methods and system
CN110246507B (en) * 2019-08-05 2021-08-24 上海优扬新媒信息技术有限公司 Voice recognition method and device


Also Published As

Publication number Publication date
CN111683181A (en) 2020-09-18
CN111683181B (en) 2022-04-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933609

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20933609

Country of ref document: EP

Kind code of ref document: A1