WO2021159902A1 - Age recognition method, apparatus and device, and computer-readable storage medium - Google Patents
- Publication number
- WO2021159902A1 (PCT/CN2021/071262)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- age
- target
- feature
- network model
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- This application relates to the field of artificial intelligence technology, and in particular to an age identification method, device, equipment, and computer-readable storage medium.
- The main purpose of this application is to provide an age identification method, apparatus, device, and computer-readable storage medium, aiming to solve the technical problem that traditional age identification has low accuracy.
- In order to achieve the above objective, an embodiment of the present application provides an age identification method, and the age identification method includes: obtaining real voice samples from a preset database, and performing sample expansion on the real voice samples based on the generative adversarial network (GAN) to obtain expanded voice samples; obtaining an age recognition network model through training with the expanded voice samples; obtaining the target voice of a target user, and converting the target voice into a corresponding input spectrogram; and extracting the depth feature of the input spectrogram through the age recognition network model, and determining the target age group to which the target user belongs according to the depth feature.
- an embodiment of the present application further provides an age identification device, the age identification device including:
- the sample expansion module is used to obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative adversarial network (GAN) to obtain expanded voice samples;
- the model training module is used to obtain an age recognition network model through training with the expanded voice samples;
- the voice conversion module is used to obtain the target voice of the target user and convert the target voice into a corresponding input spectrogram
- the age determination module is configured to extract the depth characteristics of the input spectrogram through the age recognition network model, and determine the target age group to which the target user belongs according to the depth characteristics.
- An embodiment of the present application further provides an age identification device. The age identification device includes a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein when the computer program is executed by the processor, the age identification method described above is implemented, and the age identification method includes the steps of: obtaining real voice samples from a preset database and performing GAN-based sample expansion on them to obtain expanded voice samples; obtaining an age recognition network model through training with the expanded voice samples; obtaining the target voice of a target user and converting it into a corresponding input spectrogram; and extracting the depth feature of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth feature.
- The embodiments of the present application also provide a computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the age identification method described above is implemented, and the age identification method includes the steps of: obtaining real voice samples from a preset database and performing GAN-based sample expansion on them to obtain expanded voice samples; obtaining an age recognition network model through training with the expanded voice samples; obtaining the target voice of a target user and converting it into a corresponding input spectrogram; and extracting the depth feature of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth feature.
- the embodiments of the present application can improve the generalization ability of age recognition and improve the accuracy of age recognition.
- FIG. 1 is a schematic diagram of the hardware structure of the age recognition device involved in the solution of the embodiment of the application;
- FIG. 2 is a schematic flowchart of the first embodiment of the age identification method of this application.
- FIG. 3 is a schematic flowchart of a second embodiment of the age identification method of this application.
- FIG. 4 is a schematic flowchart of a third embodiment of the age identification method of this application.
- the technical solution of this application can be applied to the fields of artificial intelligence, smart city, digital medical care, blockchain and/or big data technology.
- the data involved in this application such as voice samples, depth features, and/or determined age group information, can be stored in a database, or can be stored in a blockchain, such as distributed storage through a blockchain.
- This application is not limited in this respect.
- the age identification method involved in the embodiments of the present application is mainly applied to an age identification device, and the age identification device may be a device with a data processing function such as a server, a personal computer (PC), or a notebook computer.
- FIG. 1 is a schematic diagram of the hardware structure of the age recognition device involved in the solution of the embodiment of this application.
- the age recognition device may include a processor and a memory.
- the age identification device may also include a communication bus, a user interface, and/or a network interface.
- the age identification device includes a processor 1001 (for example, a central processing unit, a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
- the communication bus 1002 is used to realize the connection and communication between these components;
- the user interface 1003 may include a display (Display), an input unit such as a keyboard (Keyboard);
- the network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a wireless fidelity (Wi-Fi) interface);
- the memory 1005 may be a high-speed random access memory (RAM) or a non-volatile memory such as a disk memory; optionally, the memory 1005 may also be a storage device independent of the foregoing processor 1001.
- Those skilled in the art can understand that the hardware structure shown in FIG. 1 does not constitute a limitation to the present application, and may include more or fewer components than those shown in the figure, combine certain components, or use different component arrangements.
- the memory 1005 as a computer-readable storage medium in FIG. 1 may include an operating system, a network communication module, and a computer program.
- the network communication module can be used to connect to a preset database and perform data communication with the database; and the processor 1001 can call a computer program stored in the memory 1005 and execute the age identification method provided in the embodiment of the present application.
- the embodiment of the present application provides an age identification method.
- FIG. 2 is a schematic flowchart of the first embodiment of the age identification method of this application.
- the age identification method includes the following steps:
- Step S10: Obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative adversarial network (GAN) to obtain expanded voice samples;
- At present, when carrying out collection, loan companies often identify a user's age from the user's voice during the conversation in order to enhance the user experience and the collection effect, and then adopt different collection approaches according to the user's age. Traditional voice-based age recognition methods mostly perform statistical analysis of the signal-level characteristics of the voice to determine the speaker's age; however, limited by those signal-level characteristics, such methods have insufficient generalization ability, low recognition accuracy in practical applications, and poor application effect.
- In view of this, this embodiment proposes an age identification method. First, large-scale data samples are obtained through data expansion with a generative adversarial network (GAN); while the number of data samples is increased, the expanded samples also better conform to the distribution of the real data (that is, the quality of the samples is guaranteed). A sufficient number of sufficiently realistic data samples are then used to train an end-to-end network model, so that the training process can capture the hidden regularities of the data more accurately, which improves the performance of the resulting network model and hence the accuracy of subsequent age recognition performed with it. The target speech to be recognized is then converted into a spectrogram, and feature extraction is performed on the spectrogram through the obtained network model to obtain the depth features of the target speech. Compared with traditional age recognition based on signal-level features, these depth features cover more characteristics and pay more attention to the age attribute representations in the target speech that are otherwise difficult to recognize; using them to identify the target age group to which the target user belongs helps accurately capture the relationship between age and voice, thereby improving the generalization ability and accuracy of age recognition.
- the age identification method in this embodiment is implemented by an age identification device.
- the age identification device may be a server, a personal computer, a notebook computer, or other devices.
- a server is used as an example for description.
- the server in this embodiment may be a server in a collection system.
- the server is connected to a preset database.
- the database stores a number of real voice samples collected in advance; these real voice samples can be in the form of original voice or in the form of spectrograms, and each real voice sample carries a corresponding sample annotation whose content includes the age group of the user to whom the real voice sample belongs (of course, the annotation content may also include other information).
- In this embodiment, the server trains an age recognition network model for recognizing age, and the age recognition network model is constructed based on a deep neural network (machine learning). Considering that the real voice samples obtainable in practice may suffer from data imbalance, and that the number and quality of the samples have a considerable impact on the training result (i.e. the capability of the model), in this embodiment sample expansion is performed on the real voice samples to obtain expanded voice samples, thereby obtaining large-scale data samples.
- When sample expansion is performed, the real voice samples used should be in the form of spectrograms (which contain three-dimensional information of time, frequency, and amplitude); real voice samples in the form of original voice must first be converted into corresponding spectrograms by means of the short-time Fourier transform (STFT) or other methods.
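- As an illustration of this conversion step, the sketch below turns an original voice recording into a log-magnitude spectrogram with the short-time Fourier transform using the librosa library; the file name, sampling rate, and STFT parameters are illustrative assumptions rather than values specified by the patent.

```python
import numpy as np
import librosa

def wav_to_spectrogram(path, sr=16000, n_fft=512, hop_length=128):
    """Load a voice recording and convert it to a log-magnitude spectrogram.

    Returns a 2-D array of shape (1 + n_fft // 2, n_frames) holding the
    time-frequency-amplitude information described in the text.
    """
    waveform, _ = librosa.load(path, sr=sr)           # resample to a fixed rate
    stft = librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length)
    magnitude = np.abs(stft)                          # amplitude per (freq, time) bin
    return librosa.amplitude_to_db(magnitude, ref=np.max)

# Hypothetical usage:
# spec = wav_to_spectrogram("real_voice_sample.wav")
```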
- the generative adversarial network GAN includes two sub-networks, which can be called generator G (Generator) and discriminator D (Discriminator);
- G is the network that generates expanded samples: from a random noise z it generates a simulated sample, denoted G(z), that follows the distribution of the real voice samples as closely as possible;
- D is a discriminant network that judges whether an input sample is "real": an output of 1 means the sample is judged to be real, while an output of 0 means it is judged not to be real. During training, the goal of G is to generate simulated samples realistic enough to deceive D, while the goal of D is to distinguish the simulated samples generated by G from the real voice samples as well as possible.
- After sufficient adversarial training, G can generate simulated samples G(z) that are realistic enough to pass for real ones, and these simulated samples serve as the expanded voice samples.
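- To make the adversarial training just described concrete, here is a minimal, illustrative sketch in PyTorch (not the patent's implementation): a small generator G maps noise z to a simulated flattened spectrogram G(z), and a discriminator D outputs the probability that its input is real. The network sizes, noise dimension, learning rates, and data pipeline are all assumptions.

```python
import torch
from torch import nn

NOISE_DIM, SPEC_DIM = 100, 128 * 128   # placeholder sizes for a flattened spectrogram

G = nn.Sequential(                      # generator: noise z -> simulated spectrogram G(z)
    nn.Linear(NOISE_DIM, 512), nn.ReLU(),
    nn.Linear(512, SPEC_DIM), nn.Tanh(),   # assumes spectrograms scaled to [-1, 1]
)
D = nn.Sequential(                      # discriminator: spectrogram -> probability "real"
    nn.Linear(SPEC_DIM, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1), nn.Sigmoid(),
)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    """One adversarial update: D learns to separate real from G(z), G learns to fool D."""
    b = real_batch.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # update discriminator D
    z = torch.randn(b, NOISE_DIM)
    fake = G(z).detach()
    d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # update generator G: G wants D to output 1 for its samples
    z = torch.randn(b, NOISE_DIM)
    g_loss = bce(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# After enough steps, G(torch.randn(n, NOISE_DIM)) yields simulated spectrograms
# that can be added to the real samples as the expanded voice samples.
```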
- Step S20: Obtain an age recognition network model through training with the expanded voice samples;
- When the expanded voice samples are obtained, the server trains with them to obtain the age recognition network model. For the convenience of subsequent processing, the age recognition network model can be set up in an end-to-end form, that is, the input of the network model is speech and the output is the age group to which the speech belongs. The end-to-end approach does not require separate labeling for every intermediate process, which helps reduce the labeling workload and, at the same time, helps improve the accuracy of age recognition.
- For the specific structure of the age recognition network model, a deep network model can be used; for example, it can be implemented based on the classic deep residual network ResNet50. Of course, while the ResNet50 architecture is adopted, part of the structure can also be adjusted according to the actual situation.
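- As one hedged reading of "ResNet50 with partial adjustments", the sketch below uses torchvision's ResNet50 and changes only the input convolution (single-channel spectrograms instead of RGB images) and the classification head (age groups instead of 1000 ImageNet classes); the number of age groups is an assumed placeholder, not a value given in the patent.

```python
import torch
from torch import nn
from torchvision import models

NUM_AGE_GROUPS = 8   # assumed number of age-group classes

def build_age_model():
    """ResNet50 backbone adjusted for single-channel spectrogram input
    and an age-group classification head."""
    model = models.resnet50(weights=None)
    # spectrograms have one channel instead of three RGB channels
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # replace the 1000-class ImageNet head with an age-group head
    model.fc = nn.Linear(model.fc.in_features, NUM_AGE_GROUPS)
    return model

# logits = build_age_model()(torch.randn(4, 1, 224, 224))  # (batch, age-group scores)
```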
- Step S30: Obtain the target voice of the target user, and convert the target voice into a corresponding input spectrogram;
- After obtaining the age recognition network model, the server can perform voice age recognition during the collection process through this model. Specifically, when a collection item needs to be handled, the server first obtains the collection item information corresponding to the item, such as the loan user (target user) of a certain loan and the contact information, and then dials the collection call according to the collection item information; when the call is connected, a general greeting voice can first be played to confirm the identity of the answering party; when the target user replies by voice over the phone, the server obtains the target voice of the target user and then converts the target voice into the corresponding input spectrogram by means of the short-time Fourier transform for subsequent analysis and processing.
- Step S40: Extract the depth feature of the input spectrogram through the age recognition network model, and determine the target age group to which the target user belongs according to the depth feature.
- When the input spectrogram is obtained, feature extraction can be performed on it through the age recognition network model to obtain the corresponding depth feature, and the target age group to which the target user belongs is then determined according to the depth feature.
- Further, an attention mechanism can be introduced into the age recognition network model; this can be regarded as constructing an attention module and embedding it into the age recognition network model, for example inserting it after a certain intermediate feature layer of the model to perform feature optimization (feature refinement), after which the obtained optimized feature is subjected to subsequent processing (such as being fed into the next layer, or being used as the final feature).
- Accordingly, the age recognition network model in this embodiment includes an intermediate feature layer and a feature optimization layer, and the extraction of the depth feature in step S40 includes the following.
- The intermediate feature layer provides the feature extraction functions of a general intermediate network layer (including convolution, pooling, and the like), while the feature optimization layer is constructed based on the attention mechanism.
- the server may first extract the original features of the input spectrogram through the intermediate feature layer of the age recognition network model to obtain the corresponding original features.
- As for this original feature, it can be considered to include all the features of the input spectrogram, but not all of these features are necessarily related to age; if all of them were used for age recognition, the recognition accuracy might be affected, and the amount of calculation would also be too large, which would slow down recognition. Therefore, this embodiment further refines these original features.
- the original feature is optimized based on the attention mechanism to obtain the corresponding optimized feature, and the optimized feature is determined as the depth feature of the input spectrogram.
- the server optimizes the original features through the feature optimization layer of the age recognition network model and based on the attention mechanism to obtain the corresponding optimized features.
- The original feature can be represented by an original feature map, denoted F, and the optimized feature can likewise be represented by an optimized feature map, denoted F''. Specifically, F ∈ R^{C×H×W}, where R denotes the feature (spectrogram) space, C is the number of channels of the feature map, H is its height, and W is its width.
- For F, the corresponding one-dimensional channel attention map, denoted M_C(F), can be calculated. Each channel of F can be regarded as a feature detector, and channel attention mainly focuses on what is meaningful in the input. To compute channel attention efficiently, this embodiment uses max pooling and average pooling to compress F along the spatial dimensions, obtaining two different spatial context descriptors F^c_max and F^c_avg, which are then passed through a shared network composed of a multi-layer perceptron (MLP) to obtain M_C(F), that is:
- M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max)))
- where σ is the sigmoid function, W_0 ∈ R^{C/r×C} and W_1 ∈ R^{C×C/r} are the weights of the shared MLP, r is the compression ratio, and W_0 is followed by a ReLU activation. Multiplying F element-wise by M_C(F) yields the intermediate feature map F' ∈ R^{C×H×W}, where C, H, and W are as described above.
- For F', the corresponding two-dimensional spatial attention map, denoted M_S(F'), can be calculated; spatial attention mainly focuses on where the informative parts are located. Max pooling and average pooling are applied to F' along the channel dimension to obtain two different feature descriptors F'^s_max and F'^s_avg, the two descriptors are merged by concatenation, and a convolution operation is used to generate M_S(F'), that is:
- M_S(F') = σ(f^{7×7}([F'^s_avg; F'^s_max]))
- where σ is the sigmoid function and f^{7×7} denotes a convolutional layer with a 7×7 kernel. Multiplying F' element-wise by M_S(F') yields F'', which can be regarded as the optimized feature.
- After the optimized feature map F'' is obtained, the server can determine it as the depth feature of the input spectrogram and perform the age recognition processing based on this depth feature.
- It is worth noting that the intermediate feature layer of the age recognition network model may consist of two or more layers (here "or more" includes the stated number itself, the same below), and the feature optimization layer can be placed after any intermediate feature layer. For example, suppose the intermediate feature layer includes two layers, called the first layer and the second layer in order from the input. The feature optimization layer can be placed after the first layer and before the second layer; in this case, the input of the feature optimization layer is the output of the first layer, and the optimized feature output by the feature optimization layer is fed into the second layer. The feature optimization layer can also be placed after the second layer; in this case, the input of the feature optimization layer is the output of the second layer, and the optimized feature output by the feature optimization layer is directly used as the final depth feature for age recognition.
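- To make the channel-plus-spatial attention above concrete, the following sketch implements one possible feature optimization layer in PyTorch following the formulas given; the reduction ratio r = 16 and the exact placement of the module are assumptions rather than values fixed by the patent.

```python
import torch
from torch import nn

class FeatureOptimizationLayer(nn.Module):
    """Channel attention M_C followed by spatial attention M_S, as in the text:
    F' = M_C(F) * F,  F'' = M_S(F') * F'."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP: W_1(ReLU(W_0(.)))
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # f^{7x7}

    def forward(self, f):                               # f: (B, C, H, W)
        b, c, _, _ = f.shape
        # channel attention: pool over the spatial dimensions
        avg = self.mlp(f.mean(dim=(2, 3)))              # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))               # MLP(MaxPool(F))
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        f1 = m_c * f                                    # intermediate feature map F'
        # spatial attention: pool over the channel dimension
        avg_s = f1.mean(dim=1, keepdim=True)            # F'_avg^s
        max_s = f1.amax(dim=1, keepdim=True)            # F'_max^s
        m_s = torch.sigmoid(self.conv(torch.cat([avg_s, max_s], dim=1)))
        return m_s * f1                                 # optimized feature map F''
```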
- After the depth feature is obtained, the target age group to which the target user belongs can be determined according to the depth feature. The age recognition network model in this embodiment is in an end-to-end form, so the age recognition process can be implemented in the output layer of the age recognition network model; in the output layer, the server performs calculation on the depth feature in combination with the data of the expanded voice samples, so as to determine the target age group to which the target user belongs.
- the span range of the age group can be set according to the actual situation.
- Further, when the server determines the target age group to which the target user belongs, it can query a preset speech library according to the target age group to obtain the corresponding target speech template. The target speech template may be pre-set and stored by the relevant managers, with different speech templates for different age groups; when the server obtains the target speech template, it can broadcast voice according to the target speech template, so as to perform voice collection on the target user.
- In this embodiment, real voice samples are obtained from a preset database, and sample expansion is performed on the real voice samples based on the generative adversarial network (GAN) to obtain expanded voice samples; the age recognition network model is obtained through training with the expanded voice samples; the target voice of the target user is obtained and converted into a corresponding input spectrogram; and the depth feature of the input spectrogram is extracted through the age recognition network model, and the target age group to which the target user belongs is determined according to the depth feature.
- In this way, this embodiment first obtains large-scale data samples by means of GAN-based data expansion; while the number of data samples is increased, the expanded samples better conform to the distribution of the real data (that is, the quality of the samples is guaranteed). Training an end-to-end network model on a sufficient number of sufficiently realistic data samples allows the training process to capture the hidden regularities of the data samples more accurately, which improves the performance of the resulting network model and therefore the accuracy of subsequent age recognition using it. The target speech to be recognized is then converted into a spectrogram, and feature extraction is performed on the spectrogram through the obtained network model to obtain the depth features of the target speech. These depth features cover more characteristics than traditional signal-level features and pay more attention to the age attribute representations in the target speech that are not easily recognized, thereby improving the generalization ability and accuracy of age recognition.
- FIG. 3 is a schematic flowchart of a second embodiment of the age identification method of this application.
- step S30 includes:
- Step S31: Obtain the target voice of the target user, and determine whether the voice duration of the target voice is greater than a preset duration threshold;
- When the server obtains the target voice of the target user, it can first determine whether the voice duration of the target voice is greater than a preset duration threshold; the preset duration threshold may be set according to actual needs.
- Step S32: If the voice duration is greater than the preset duration threshold, perform voice cutting on the target voice to obtain two or more voice segments, and convert each voice segment into a corresponding segment spectrogram;
- If the voice duration is greater than the preset duration threshold, the server performs voice cutting on the target voice to obtain two or more voice segments, and then converts each voice segment into a corresponding segment spectrogram.
- The duration of each voice segment can be determined by different rules defined according to the actual situation. For example, when the voice duration of the target voice is greater than the preset duration threshold, different voice durations can correspond to different numbers of voice segments: if the preset duration threshold is 3 seconds, a voice duration greater than 3 seconds and not greater than 4 seconds may correspond to 2 voice segments, and a voice duration greater than 4 seconds may correspond to 3 voice segments; the target voice is then cut evenly according to the determined number of segments so that every voice segment has the same duration. For another example, when the voice duration of the target voice is greater than the preset duration threshold, the voice can be cut at every preset segment length: if the preset duration threshold is 3 seconds, the voice can be cut every 3 seconds, so a voice of 5 seconds is cut into two segments of 3 seconds and 2 seconds respectively. Of course, other cutting methods can also be used in practice. It is worth noting that if the voice duration of the target voice is less than or equal to the preset duration threshold, the entire target voice can be directly converted into the corresponding input spectrogram and the age recognition process of step S40 is performed.
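- As an illustration of the "cut every preset segment length" rule, the sketch below splits a waveform array into fixed-length segments; the 16 kHz sampling rate and the 3-second threshold are assumptions taken from the example above.

```python
import numpy as np

def cut_voice(waveform, sr=16000, threshold_s=3.0):
    """Return the whole voice if it is short enough, otherwise a list of
    segments cut every `threshold_s` seconds (the last one may be shorter)."""
    duration = len(waveform) / sr
    if duration <= threshold_s:
        return [waveform]
    step = int(threshold_s * sr)
    return [waveform[i:i + step] for i in range(0, len(waveform), step)]

# e.g. a 5-second voice at 16 kHz -> two segments of 3 s and 2 s
segments = cut_voice(np.random.randn(5 * 16000))
assert [len(s) / 16000 for s in segments] == [3.0, 2.0]
```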
- the step S40 includes:
- Step S41: Extract the depth feature of each segment spectrogram through the age recognition network model, and respectively determine the segment age group corresponding to each segment spectrogram according to its depth feature;
- In this case, the depth feature of each segment spectrogram can be extracted through the age recognition network model, and the segment age group corresponding to each segment spectrogram can be determined according to the depth feature of that segment spectrogram.
- For the feature extraction process of each segment spectrogram and the determination of its segment age group, please refer to step S40 above; this will not be repeated here.
- Step S42: Determine the target age group to which the target user belongs according to the segment age group corresponding to each segment spectrogram.
- After the segment age group corresponding to each segment spectrogram is obtained, the target age group to which the target user belongs can be determined from these segment age groups.
- Specifically, if the segment age groups corresponding to all segment spectrograms are the same, that same segment age group is determined as the target age group to which the target user belongs; if the segment age groups are not all the same, a voting decision rule can be defined according to the actual situation, and the target age group to which the target user belongs is determined according to the rule and the plurality of determined segment age groups. For example, the voting decision rule may use a median average: if the target voice corresponds to three segment spectrograms whose segment age groups are 22 to 24 years old, 26 to 28 years old, and 28 to 30 years old, the medians of the three age groups are 23, 27, and 29, and the average of the three medians, 26.3, is taken as the target age of the target user to determine the target age group to which the target user belongs. The voting decision rule may also be a majority rule: if the target voice corresponds to three segment spectrograms, two of which correspond to the age group 22 to 24 years old and one to the age group 26 to 28 years old, the age group 22 to 24 years old, which corresponds to the largest number of segment spectrograms, is determined as the target age group to which the target user belongs.
- Other voting decision rules can also be used, such as calculating a credibility score for each age group based on the voice duration of each voice segment and taking the age group with the highest credibility as the target age group to which the target user belongs.
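- The following sketch illustrates two of the voting decision rules mentioned above, combined for convenience as a majority vote with a median-average fallback; representing each age group as a (low, high) tuple is an illustrative encoding, not something specified by the patent.

```python
from collections import Counter

def decide_age_group(segment_groups):
    """segment_groups: list of (low, high) age-group tuples, one per voice segment.
    Returns the majority age group if one dominates; otherwise falls back to the
    median-average rule and returns a single target age."""
    counts = Counter(segment_groups)
    group, votes = counts.most_common(1)[0]
    if votes > len(segment_groups) // 2:          # clear majority
        return group
    medians = [(lo + hi) / 2 for lo, hi in segment_groups]
    return sum(medians) / len(medians)            # e.g. (23 + 27 + 29) / 3 ≈ 26.3

print(decide_age_group([(22, 24), (22, 24), (26, 28)]))   # -> (22, 24)
print(decide_age_group([(22, 24), (26, 28), (28, 30)]))   # -> 26.33...
```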
- In this embodiment, a long target voice is cut into multiple voice segments and age recognition is performed on each segment separately, which reduces the amount of calculation and improves the efficiency of age recognition; determining the target age group to which the target user belongs according to the multiple recognition results also helps reduce recognition errors caused by accidental factors in the recognition process and improves the accuracy of age recognition.
- FIG. 4 is a schematic flowchart of a third embodiment of an age identification method according to this application.
- the method further includes:
- Step S50: When receiving a collection instruction, dial a collection call according to the collection instruction, and obtain the corresponding connected voice after the call is connected;
- When the server receives the collection instruction, it can obtain the corresponding collection item information according to the collection instruction, such as the loan user (target user) of a certain loan and the contact information, then dial the collection call based on the collection item information, and obtain the connected voice of the other party after the call is connected.
- The collection instruction can be triggered by a manager through a certain terminal, or it can be triggered automatically by a collection plan stored in the server when the time reaches the collection time set by the collection plan.
- Step S60: Judge whether there are two or more user voices in the connected voice;
- When the connected voice is obtained, the server will judge whether there are two or more user voices in it.
- Step S70: If there are two or more user voices in the connected voice, determine the target voice of the target user from the user voices according to the voice duration and/or voice volume of each user voice.
- When the target user answers the call, he or she may be in a noisy environment or talking with another person; in that case, the connected voice obtained by the server will contain two or more user voices, and the server needs to determine the target voice of the target user among them in order to accurately identify the age of the target user. Specifically, the server can distinguish the individual user voices according to frequency, obtain the voice attributes of each user voice (the voice attributes include voice frequency, voice duration, voice volume, and the like), and then determine the target voice of the target user based on these voice attributes.
- For example, considering that the target user is usually the one holding the phone throughout the call, the user voice with the longest voice duration can be determined as the target voice of the target user; for another example, considering that the target user is usually the user closest to the phone and therefore the loudest, the user voice with the largest voice volume can be determined as the target voice of the target user. Of course, the two factors can also be combined: for each user voice, a duration score and a volume score are obtained from its voice duration and voice volume respectively, the two scores are added to obtain a comprehensive score, and the user voice with the highest comprehensive score is determined as the target voice of the target user. It is worth noting that if there is only one user voice in the connected voice, that user voice can be directly used as the target voice.
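- A minimal sketch of the combined duration-and-volume scoring described above; the normalization of the two scores and their equal weighting are assumptions, since the text only states that a duration score and a volume score are added.

```python
def pick_target_voice(user_voices):
    """user_voices: list of dicts like {"id": ..., "duration": seconds, "volume": dB}.
    Score each voice by normalized duration plus normalized volume and
    return the voice with the highest comprehensive score."""
    if len(user_voices) == 1:
        return user_voices[0]
    max_d = max(v["duration"] for v in user_voices) or 1.0
    max_v = max(v["volume"] for v in user_voices) or 1.0
    def score(v):
        return v["duration"] / max_d + v["volume"] / max_v
    return max(user_voices, key=score)

voices = [{"id": "A", "duration": 12.0, "volume": 55.0},
          {"id": "B", "duration": 4.0, "volume": 62.0}]
print(pick_target_voice(voices)["id"])   # -> "A" (longest and nearly loudest)
```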
- After the target voice is determined, the relevant age recognition processing of steps S30 and S40 can be executed.
- When the server of this embodiment performs call collection, if there are two or more user voices in the connected voice, the target voice is determined first and then the subsequent age recognition processing is performed, which avoids interference from the voices of multiple users with the recognition of the target user's age.
- the method further includes:
- When the server obtains the target age group to which the current target user belongs, it may also store it as a historical target age group.
- When a preset number of historical target age groups have been accumulated, or a preset period has elapsed, these historical target age groups can be summarized and counted to obtain the corresponding collection age distribution. For example, among a preset number of 100 historical target users, 30 belong to the age group of 26 to 28 and 70 belong to the age group of 30 to 32; or, for example, in the collection cycle of last month there were a total of 100 historical target users, among whom 30 are in the age group of 26 to 28 and 70 are in the age group of 30 to 32.
- Based on the collection age distribution, it is determined whether there is an age group with an abnormal number of users, and if there is, corresponding abnormal prompt information is sent to the corresponding management terminal.
- When the server obtains the collection age distribution, it can determine, based on this distribution, whether there is an age group with an abnormal number of users. For the abnormality judgment, a corresponding abnormality rule can be set in advance and the judgment made according to that rule, for example a threshold number or proportion of users for an age group: when the number of users corresponding to a certain age group exceeds the threshold number or proportion, the number of users in that age group can be considered abnormal. In this embodiment, if there is an age group with an abnormal number of users, it may be that the current loan business carries a certain risk in that age group, or it may be that the recognition ability of the age recognition network model has degraded, causing too many users to be identified as belonging to that age group; in either case, the server can send corresponding abnormal prompt information to the corresponding management terminal to prompt the relevant management personnel to inspect and handle the situation in time.
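- A sketch of the proportion-based abnormality rule described above: the collection age distribution is summarized from the historical target age groups and any age group whose share of users exceeds a threshold is flagged; the 60% threshold is purely illustrative.

```python
from collections import Counter

def find_abnormal_age_groups(historical_groups, max_share=0.6):
    """historical_groups: list of age-group labels, one per historical target user.
    Returns the age groups whose share of users exceeds `max_share`."""
    total = len(historical_groups)
    distribution = Counter(historical_groups)          # collection age distribution
    return [(group, count / total)
            for group, count in distribution.items()
            if count / total > max_share]

history = ["26-28"] * 30 + ["30-32"] * 70
print(find_abnormal_age_groups(history))   # -> [('30-32', 0.7)] -> send abnormal prompt
```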
- The server of this embodiment can analyze the collection age distribution of historical target users and judge from that distribution whether an abnormal situation exists, which is conducive to the timely detection of abnormal situations, reducing business risks and safeguarding the stability of the age recognition network model.
- an embodiment of the present application also provides an age identification device.
- the age identification device includes:
- the sample expansion module is used to obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative adversarial network (GAN) to obtain expanded voice samples;
- the model training module is used to obtain an age recognition network model through training with the expanded voice samples;
- the voice conversion module is used to obtain the target voice of the target user and convert the target voice into a corresponding input spectrogram
- the age determination module is configured to extract the depth characteristics of the input spectrogram through the age recognition network model, and determine the target age group to which the target user belongs according to the depth characteristics.
- Each virtual function module of the above-mentioned age recognition device is stored in the memory 1005 of the age recognition device shown in FIG. 1.
- the age recognition network model includes an intermediate feature layer and a feature optimization layer
- the age determination module includes:
- a feature extraction unit configured to perform original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain corresponding original features
- the feature optimization unit is used to perform feature optimization on the original feature based on the attention mechanism through the feature optimization layer of the age recognition network model to obtain the corresponding optimized feature, and determine the optimized feature as the input spectrogram The depth characteristics.
- the original feature includes an original feature map F, and the optimized feature includes an optimized feature map F''.
- The feature optimization unit is further configured to, through the feature optimization layer of the age recognition network model: calculate the one-dimensional channel attention map corresponding to the original feature map F of the original feature, and perform element-wise multiplication on F and the channel attention map to obtain the corresponding intermediate feature map F'; and calculate the two-dimensional spatial attention map corresponding to F', and perform element-wise multiplication on F' and the spatial attention map to obtain the corresponding optimized feature map F''.
- the voice conversion module includes:
- the duration judging unit is configured to obtain the target voice of the target user, and determine whether the voice duration of the target voice is greater than a preset duration threshold;
- the voice segmentation unit is configured to perform voice cutting on the target voice if the voice duration is greater than the preset duration threshold to obtain two or more voice segments, and respectively convert each voice segment into a corresponding segment spectrogram ;
- the age determination module is further configured to extract the depth feature of each segment spectrogram through the age recognition network model, determine the segment age group corresponding to each segment spectrogram according to its depth feature, and determine the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
- the age identification device further includes:
- the voice acquisition module is configured to, when receiving the collection instruction, dial a collection call according to the collection instruction, and obtain the corresponding connected voice after the call is connected;
- the voice judgment module is used to judge whether there are two or more user voices in the connected voice;
- the voice determining module is configured to, if there are two or more user voices in the connected voice, determine the target voice of the target user from the user voices according to the voice duration and/or voice volume of each user voice.
- the age identification device further includes:
- the voice collection module is used to obtain the corresponding target speech template according to the target age group, and perform voice collection on the target user according to the target speech template.
- the age identification device further includes:
- the distribution acquisition module is used to acquire the historical target age groups of a preset number of historical target users, or of the historical target users within a preset period, and obtain the collection age distribution according to the historical target age groups;
- the abnormality judgment module is used for judging whether there is an age group with an abnormal number of users based on the collection age distribution, and if it exists, sending corresponding abnormality prompt information to the corresponding management terminal.
- each module in the above-mentioned age recognition device corresponds to each step in the above-mentioned embodiment of the age recognition method, and the functions and realization processes thereof will not be repeated here.
- the embodiment of the present application also provides a computer-readable storage medium.
- the computer-readable storage medium of the present application stores a computer program, and when the computer program is executed by a processor, the steps of the above-mentioned age identification method are realized.
- The storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
- The technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the method described in each embodiment of the present application.
Abstract
An age recognition method, apparatus and device, and a computer-readable storage medium, relating to the technical field of artificial intelligence. Said method comprises: acquiring a real voice sample from a preset database, and performing sample expansion on the real voice sample on the basis of a generative adversarial network (GAN), to obtain an expanded voice sample (S10); training the expanded voice sample, to obtain an age recognition network model (S20); acquiring a target voice of a target user, and converting the target voice into a corresponding input spectrogram (S30); and extracting a depth feature of the input spectrogram by means of the age recognition network model, and determining, according to the depth feature, a target age group to which the target user belongs (S40). The described method can improve the accuracy of age recognition.
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 12, 2020, with application number 202010094834.9 and entitled "Age Recognition Method, Apparatus, Device, and Computer-readable Storage Medium", the entire contents of which are incorporated herein by reference.
本申请涉及人工智能技术领域,尤其涉及一种年龄识别方法、装置、设备及计算机可读存储介质。This application relates to the field of artificial intelligence technology, and in particular to an age identification method, device, equipment, and computer-readable storage medium.
发明人发现,目前,贷款公司在进行催收时,为了增强用户体验和催收效果,往往会根据对话过程中的用户语音对用户年龄进行识别,然后根据用户年龄采用不同的催收方式进行催收。发明人意识到,传统的语音年龄识别的方法,大多是基于声音的语音信号学特征进行统计学分析,进而确定说话者的年龄;但这种方法由于语音信号学特征的限制,其泛化能力不足,在实际应用中识别准确率低,应用效果不佳。The inventor found that currently, in order to enhance user experience and collection effects, loan companies often identify the user’s age based on the user’s voice during the conversation, and then adopt different collection methods for collection according to the user’s age. The inventor realizes that traditional speech age recognition methods are mostly based on the statistical analysis of the speech signal characteristics of the sound to determine the age of the speaker; however, due to the limitation of speech signal characteristics, this method has its generalization ability. Insufficient, the recognition accuracy is low in practical applications, and the application effect is not good.
发明内容Summary of the invention
本申请的主要目的在于提供一种年龄识别方法、装置、设备及计算机可读存储介质,旨在解决传统的年龄识别准确性低的技术问题。The main purpose of this application is to provide an age identification method, device, equipment, and computer-readable storage medium, aiming to solve the traditional technical problem of low accuracy in age identification.
为实现上述目的,本申请实施例提供一种年龄识别方法,所述年龄识别方法包括:To achieve the foregoing objective, an embodiment of the present application provides an age identification method, and the age identification method includes:
从预设数据库中获取真实语音样本,并基于生成式对抗网络GAN对所述真实语音样本进行样本扩充,得到扩充语音样本;Obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative confrontation network GAN to obtain expanded voice samples;
通过所述扩充语音样本训练得到年龄识别网络模型;Obtaining an age recognition network model through the extended speech sample training;
获取目标用户的目标语音,并将所述目标语音转换为对应的输入频谱图;Acquiring the target voice of the target user, and converting the target voice into a corresponding input spectrogram;
通过所述年龄识别网络模型提取所述输入频谱图的深度特征,并根据所述深度特征确定所述目标用户所属的目标年龄段。The depth feature of the input spectrogram is extracted through the age recognition network model, and the target age group to which the target user belongs is determined according to the depth feature.
此外,为实现上述目的,本申请实施例还提供一种年龄识别装置,所述年龄识别装置包括:In addition, in order to achieve the foregoing objective, an embodiment of the present application further provides an age identification device, the age identification device including:
样本扩充模块,用于从预设数据库中获取真实语音样本,并基于生成式对抗网络GAN对所述真实语音样本进行样本扩充,得到扩充语音样本;The sample expansion module is used to obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative confrontation network GAN to obtain expanded voice samples;
模型训练模块,用于通过所述扩充语音样本训练得到年龄识别网络模型;A model training module is used to obtain an age recognition network model through the extended speech sample training;
语音转换模块,用于获取目标用户的目标语音,并将所述目标语音转换为对应的输入频谱图;The voice conversion module is used to obtain the target voice of the target user and convert the target voice into a corresponding input spectrogram;
年龄确定模块,用于通过所述年龄识别网络模型提取所述输入频谱图的深度特征,并根据所述深度特征确定所述目标用户所属的目标年龄段。The age determination module is configured to extract the depth characteristics of the input spectrogram through the age recognition network model, and determine the target age group to which the target user belongs according to the depth characteristics.
此外,为实现上述目的,本申请实施例还提供一种年龄识别设备,所述年龄识别设备包括处理器、存储器、以及存储在所述存储器上并可被所述处理器执行的计算机程序,其中所述计算机程序被所述处理器执行时,实现如上述的年龄识别方法,该年龄识别方法包括以下步骤:In addition, in order to achieve the above object, an embodiment of the present application further provides an age identification device, the age identification device includes a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein When the computer program is executed by the processor, the age identification method as described above is realized, and the age identification method includes the following steps:
从预设数据库中获取真实语音样本,并基于生成式对抗网络GAN对所述真实语音样本进行样本扩充,得到扩充语音样本;Obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative confrontation network GAN to obtain expanded voice samples;
通过所述扩充语音样本训练得到年龄识别网络模型;Obtaining an age recognition network model through the extended speech sample training;
获取目标用户的目标语音,并将所述目标语音转换为对应的输入频谱图;Acquiring the target voice of the target user, and converting the target voice into a corresponding input spectrogram;
通过所述年龄识别网络模型提取所述输入频谱图的深度特征,并根据所述深度特征确定所述目标用户所属的目标年龄段。The depth feature of the input spectrogram is extracted through the age recognition network model, and the target age group to which the target user belongs is determined according to the depth feature.
此外,为实现上述目的,本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,其中所述计算机程序被处理器执行时,实现如上述的 年龄识别方法,该年龄识别方法包括以下步骤:In addition, in order to achieve the foregoing objective, the embodiments of the present application also provide a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when the computer program is executed by a processor, the above-mentioned age An identification method, the age identification method includes the following steps:
从预设数据库中获取真实语音样本,并基于生成式对抗网络GAN对所述真实语音样本进行样本扩充,得到扩充语音样本;Obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative confrontation network GAN to obtain expanded voice samples;
通过所述扩充语音样本训练得到年龄识别网络模型;Obtaining an age recognition network model through the extended speech sample training;
获取目标用户的目标语音,并将所述目标语音转换为对应的输入频谱图;Acquiring the target voice of the target user, and converting the target voice into a corresponding input spectrogram;
通过所述年龄识别网络模型提取所述输入频谱图的深度特征,并根据所述深度特征确定所述目标用户所属的目标年龄段。The depth feature of the input spectrogram is extracted through the age recognition network model, and the target age group to which the target user belongs is determined according to the depth feature.
本申请实施例能够提高年龄识别的泛化能力,提高年龄识别的准确性。The embodiments of the present application can improve the generalization ability of age recognition and improve the accuracy of age recognition.
图1为本申请实施例方案中涉及的年龄识别设备的硬件结构示意图;FIG. 1 is a schematic diagram of the hardware structure of the age recognition device involved in the solution of the embodiment of the application;
图2为本申请年龄识别方法第一实施例的流程示意图;FIG. 2 is a schematic flowchart of the first embodiment of the age identification method of this application;
图3为本申请年龄识别方法第二实施例的流程示意图;FIG. 3 is a schematic flowchart of a second embodiment of the age identification method of this application;
图4为本申请年龄识别方法第三实施例的流程示意图。FIG. 4 is a schematic flowchart of a third embodiment of the age identification method of this application.
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application.
本申请的技术方案可应用于人工智能、智慧城市、数字医疗、区块链和/或大数据技术领域。可选的,本申请涉及的数据如语音样本、深度特征和/或确定出的年龄段信息等可存储于数据库中,或者可以存储于区块链中,比如通过区块链分布式存储,本申请不做限定。The technical solution of this application can be applied to the fields of artificial intelligence, smart city, digital medical care, blockchain and/or big data technology. Optionally, the data involved in this application, such as voice samples, depth features, and/or determined age group information, can be stored in a database, or can be stored in a blockchain, such as distributed storage through a blockchain. The application is not limited.
本申请实施例涉及的年龄识别方法主要应用于年龄识别设备,该年龄识别设备可以是服务器、个人计算机(personal computer,PC)、笔记本电脑等具有数据处理功能的设备。The age identification method involved in the embodiments of the present application is mainly applied to an age identification device, and the age identification device may be a device with a data processing function such as a server, a personal computer (PC), or a notebook computer.
参照图1,图1为本申请实施例方案中涉及的年龄识别设备的硬件结构示意图。本申请实施例中,该年龄识别设备可以包括处理器和存储器。可选的,该年龄识别设备还可包括通信总线、用户接口和/或网络接口。例如,该年龄识别设备包括处理器1001(例如中央处理器Central Processing Unit,CPU),通信总线1002,用户接口1003,网络接口1004,存储器1005。其中,通信总线1002用于实现这些组件之间的连接通信;用户接口1003可以包括显示屏(Display)、输入单元比如键盘(Keyboard);网络接口1004可选的可以包括标准的有线接口、无线接口(如无线保真WIreless-FIdelity,WI-FI接口);存储器1005可以是高速随机存取存储器(random access memory,RAM),也可以是稳定的存储器(non-volatile memory),例如磁盘存储器,存储器1005可选的还可以是独立于前述处理器1001的存储装置。本领域技术人员可以理解,图1中示出的硬件结构并不构成对本申请的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。Referring to FIG. 1, FIG. 1 is a schematic diagram of the hardware structure of the age recognition device involved in the solution of the embodiment of this application. In the embodiment of the present application, the age recognition device may include a processor and a memory. Optionally, the age identification device may also include a communication bus, a user interface, and/or a network interface. For example, the age identification device includes a processor 1001 (for example, a central processing unit, a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Among them, the communication bus 1002 is used to realize the connection and communication between these components; the user interface 1003 may include a display (Display), an input unit such as a keyboard (Keyboard); the network interface 1004 may optionally include a standard wired interface, a wireless interface (Such as wireless fidelity WIreless-FIdelity, WI-FI interface); the memory 1005 can be a high-speed random access memory (random access memory, RAM), or a stable memory (non-volatile memory), such as a disk memory, a memory Optionally, 1005 may also be a storage device independent of the foregoing processor 1001. Those skilled in the art can understand that the hardware structure shown in FIG. 1 does not constitute a limitation to the present application, and may include more or less components than those shown in the figure, or a combination of certain components, or different component arrangements.
继续参照图1,图1中作为一种计算机可读存储介质的存储器1005可以包括操作系统、网络通信模块以及计算机程序。在图1中,网络通信模块可用于连接预设数据库,与数据库进行数据通信;而处理器1001可以调用存储器1005中存储的计算机程序,并执行本申请实施例提供的年龄识别方法。Continuing to refer to FIG. 1, the memory 1005 as a computer-readable storage medium in FIG. 1 may include an operating system, a network communication module, and a computer program. In FIG. 1, the network communication module can be used to connect to a preset database and perform data communication with the database; and the processor 1001 can call a computer program stored in the memory 1005 and execute the age identification method provided in the embodiment of the present application.
Based on the above hardware architecture, the embodiments of the age recognition method of this application are proposed.
An embodiment of this application provides an age recognition method.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a first embodiment of the age recognition method of this application.
In this embodiment, the age recognition method includes the following steps:
Step S10: obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples.
At present, when a loan company carries out collection, in order to improve the user experience and the collection effect, it often identifies the user's age from the user's voice during the conversation and then adopts a collection strategy suited to that age. Traditional voice-based age recognition methods mostly perform statistical analysis on speech-signal features to determine the speaker's age; limited by those signal features, however, such methods generalize poorly, so their recognition accuracy in practice is low and their application effect is unsatisfactory. To address this, this embodiment proposes an age recognition method. First, large-scale data samples are obtained by GAN-based data expansion, which increases the number of data samples while keeping them consistent with the distribution of real data (i.e., it guarantees sample quality); an end-to-end network model is then trained on a sufficient number of sufficiently realistic samples, so that the training process can capture the hidden regularities of the data more accurately, improving the performance of the resulting network model and, in turn, the accuracy of subsequent age recognition using that model. The target voice to be recognized is then converted into a spectrogram, and feature extraction is performed on the spectrogram through the trained network model to obtain depth features of the target voice. Compared with traditional age recognition based on signal features, these depth features cover more characteristics and are better able to capture age-related attributes of the target voice that are otherwise hard to identify. Using the depth features to identify the target age group of the target user helps capture the relationship between age and voice accurately, thereby improving the generalization ability and accuracy of age recognition.
The age recognition method in this embodiment is implemented by an age recognition device, which may be a server, a personal computer, a notebook computer or the like; in this embodiment a server is taken as an example. The server may be a server in a collection system and is connected to a preset database. The database stores a number of real voice samples collected in advance, which may be in the form of original speech or in the form of spectrograms. These real voice samples carry corresponding sample annotations, and the annotation content includes the age group of the user to whom the real voice sample belongs (the annotation may of course also include other information).
In this embodiment, before age recognition is performed, the age recognition network model used to recognize age needs to be trained first; the age recognition network model is constructed as a machine-learning deep neural network. Considering that the real voice samples obtainable in practice may suffer from data imbalance, while the quantity and quality of samples have a large influence on the training result (the model's capability), in this embodiment the real voice samples need to be expanded to obtain expanded voice samples and thus a large-scale set of data samples.
When performing sample expansion in this embodiment, in order to improve the efficiency of sample expansion and guarantee the quality of the expanded samples, the expansion can be carried out based on a generative adversarial network (GAN). It is worth noting that the real voice samples used for sample expansion should be in the form of spectrograms (including the three-dimensional information of time, frequency and amplitude); real voice samples in the form of original speech must first be converted into corresponding spectrograms through a short-time Fourier transform (or another method). A generative adversarial network (GAN) comprises two sub-networks, which may be called the generator G and the discriminator D. G is a network that generates expanded samples: from random noise it generates a simulated sample that follows the distribution of the real voice samples as closely as possible, denoted G(z). D is a discriminative network that judges whether an input sample is "real": an output of 1 indicates real, and an output of 0 indicates it cannot be real. During training, the goal of G is to generate simulated samples realistic enough to deceive D, while the goal of D is to separate the simulated samples generated by G from the real voice samples as well as possible. In the ideal case, G can generate simulated samples G(z) that are indistinguishable from real ones, so that D can hardly judge whether G(z) is real, i.e., D(G(z)) = 0.5. When this condition is met, a fully trained G is considered to have been obtained (i.e., GAN training is complete), and it is used to expand the real voice samples to obtain the expanded voice samples.
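As a minimal illustrative sketch of the adversarial training described above (not the implementation of this application), the following Python/PyTorch code trains a generator and a discriminator on spectrogram samples; the network sizes, the 128×128 single-channel spectrogram shape and the noise dimension are assumptions made only for illustration.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=100, spec_shape=(1, 128, 128)):
        super().__init__()
        self.spec_shape = spec_shape
        out_dim = spec_shape[0] * spec_shape[1] * spec_shape[2]
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim), nn.Tanh(),  # assumes spectrograms scaled to [-1, 1]
        )

    def forward(self, z):
        # G(z): a simulated spectrogram intended to follow the real-sample distribution.
        return self.net(z).view(-1, *self.spec_shape)

class Discriminator(nn.Module):
    def __init__(self, spec_shape=(1, 128, 128)):
        super().__init__()
        in_dim = spec_shape[0] * spec_shape[1] * spec_shape[2]
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1), nn.Sigmoid(),  # 1 = "real", 0 = "cannot be real"
        )

    def forward(self, x):
        return self.net(x)

def train_step(G, D, real_specs, opt_g, opt_d, noise_dim=100):
    bce = nn.BCELoss()
    batch = real_specs.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # D: separate real spectrograms from generated samples G(z).
    fake_specs = G(torch.randn(batch, noise_dim)).detach()
    d_loss = bce(D(real_specs), real_labels) + bce(D(fake_specs), fake_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # G: fool D; training is considered done when D(G(z)) approaches 0.5.
    g_loss = bce(D(G(torch.randn(batch, noise_dim))), real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()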
Step S20: obtain an age recognition network model by training with the expanded voice samples.
In this embodiment, after the expanded voice samples are obtained, the server trains on them to obtain the age recognition network model. For convenience of subsequent processing, the age recognition network model may be set up in an end-to-end form, that is, its input is a voice and its output is the age group to which the voice belongs. Compared with the traditional approach of extracting features with one model and classifying with another, the end-to-end approach does not require each stage to be annotated separately, which reduces the annotation workload and also helps improve the accuracy of age recognition. For the age recognition network model of this embodiment, in order to improve generalization and recognition accuracy, a deep network model may be used, for example one based on the classic deep residual network ResNet50; of course, when adopting the ResNet50 architecture, parts of the structure may also be adjusted according to the actual situation.
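A minimal sketch of how such an end-to-end model could be assembled on top of ResNet50 is shown below; the single-channel input adaptation, the number of age groups and the use of torchvision's resnet50 (with the weights argument of recent torchvision versions) are illustrative assumptions rather than a prescribed implementation.

import torch.nn as nn
from torchvision.models import resnet50

def build_age_model(num_age_groups: int) -> nn.Module:
    # Backbone: the classic deep residual network ResNet50 (randomly initialised here).
    model = resnet50(weights=None)
    # Spectrogram inputs have a single channel, so the stem convolution is adapted.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # End-to-end head: input is a spectrogram, output is directly one logit per age group.
    model.fc = nn.Linear(model.fc.in_features, num_age_groups)
    return model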
Step S30: obtain the target voice of the target user, and convert the target voice into a corresponding input spectrogram.
In this embodiment, once the age recognition network model has been obtained, the server can use it to perform voice-based age recognition during the collection process. Specifically, when a collection item needs to be collected, the collection item information corresponding to the item is first obtained, for example the borrower (target user) and contact information of a given loan, and the collection call is then dialed according to that information. When the call is connected, a general greeting may first be played to confirm the callee's identity; when the target user replies by voice over the phone, the server obtains the target voice of the target user and converts it into the corresponding input spectrogram by means of a short-time Fourier transform for subsequent analysis and processing.
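The short-time Fourier transform step can be sketched as follows; the sampling rate, window length and hop size are illustrative assumptions, and the log-magnitude output is one common way to retain the time, frequency and amplitude information mentioned above.

import numpy as np
from scipy.signal import stft

def voice_to_spectrogram(samples: np.ndarray, sr: int = 16000) -> np.ndarray:
    nperseg = int(0.025 * sr)             # 25 ms analysis window
    noverlap = nperseg - int(0.010 * sr)  # 10 ms hop between windows
    _, _, Z = stft(samples, fs=sr, nperseg=nperseg, noverlap=noverlap)
    # Log-magnitude spectrogram: retains time, frequency and amplitude information.
    return np.log1p(np.abs(Z))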
Step S40: extract depth features of the input spectrogram through the age recognition network model, and determine the target age group to which the target user belongs according to the depth features.
In this embodiment, once the input spectrogram is obtained, feature extraction is performed on it through the age recognition network model to obtain the corresponding depth features, and the target age group to which the target user belongs is then determined according to the depth features.
Further, a speech signal has both time-domain and frequency-domain attributes; both are reflected in the spectrogram and correspond to a variety of characteristics, some of which may be related to age and some of which may not (such as environmental noise characteristics). To improve the accuracy of age recognition, an attention mechanism may be introduced into the age recognition network model when performing feature extraction in this embodiment; equivalently, an attention module is constructed and embedded into the age recognition network model, for example inserted after an intermediate feature layer to refine features, with the resulting optimized features then used for subsequent processing (fed into the next layer, or taken as the final features). The age recognition network model in this embodiment includes an intermediate feature layer and a feature optimization layer, and step S40 includes:
performing original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain corresponding original features;
The age recognition network model in this embodiment includes an intermediate feature layer and a feature optimization layer. The intermediate feature layer has the feature-extraction functions of a typical intermediate network layer (including convolution, pooling, etc.), while the feature optimization layer is built on an attention mechanism. After obtaining the input spectrogram, the server may first extract original features from it through the intermediate feature layer of the age recognition network model. These original features can be regarded as covering all the characteristics of the input spectrogram, but not all of them are necessarily related to age; using all of them for age recognition could harm recognition accuracy, and the excessive amount of computation would also slow recognition down. This embodiment therefore also refines these original features.
performing feature optimization on the original features, based on the attention mechanism, through the feature optimization layer of the age recognition network model to obtain corresponding optimized features, and determining the optimized features as the depth features of the input spectrogram.
After obtaining the original features, the server optimizes them through the feature optimization layer of the age recognition network model based on the attention mechanism to obtain the corresponding optimized features. Specifically, the original features may be represented as an original feature map, denoted F, and the optimized features as an optimized feature map, denoted F''. When performing feature optimization, the original feature map F of the original features is obtained through the feature optimization layer of the age recognition network model, with

F ∈ R^{C×H×W}

where R is the feature-image (spectrogram) space, C is the number of image (spectrogram) channels, H is the image height, and W is the image width.
For this F, the corresponding one-dimensional channel attention map, denoted M_C(F), can be computed, with

M_C(F) ∈ R^{C×1×1}.

Each channel of F can be regarded as a feature detector, and channel attention focuses on what is meaningful in the input. To compute channel attention efficiently, this embodiment compresses F along the spatial dimensions using max pooling and average pooling respectively, obtaining two different spatial context descriptors F^c_max and F^c_avg; these two descriptors are then processed by a shared network composed of a multi-layer perceptron (MLP) to obtain M_C(F), that is,

M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max)))

where σ is the sigmoid function, W_0 ∈ R^{C/r×C}, W_1 ∈ R^{C×C/r}, r is the compression (reduction) ratio, and W_0 uses ReLU as its activation function.
After the channel attention map M_C(F) is obtained, element-wise multiplication is performed on F and the channel attention map to obtain the corresponding intermediate feature map F', that is,

F' = M_C(F) ⊗ F.
After F' is obtained, the corresponding two-dimensional spatial attention map, denoted M_S(F'), can be computed, with

M_S(F') ∈ R^{1×H×W}

where H and W have the meanings given above. Spatial attention focuses mainly on positional information. When computing spatial attention, max pooling and average pooling are first applied to F' along the channel dimension to obtain two different feature descriptors F'^s_max and F'^s_avg; the two descriptors are then merged by concatenation, and a convolution operation is used to generate M_S(F'), that is,

M_S(F') = σ(f^{7×7}([AvgPool(F'); MaxPool(F')])) = σ(f^{7×7}([F'^s_avg; F'^s_max]))

where σ is the sigmoid function, f^{7×7} denotes a 7×7 convolutional layer, F'^s_max is the feature descriptor obtained by max pooling F' along the channel dimension, and F'^s_avg is the feature descriptor obtained by average pooling F' along the channel dimension.
After the spatial attention map M_S(F') is obtained, element-wise multiplication is performed on F' and M_S(F') to obtain the corresponding optimized feature map F'', that is,

F'' = M_S(F') ⊗ F'.
F'' in the above expression is the optimized feature; the server determines it as the depth feature of the input spectrogram and performs age recognition based on it. It is worth noting that in practice the age recognition network model may have two or more intermediate feature layers (here and below "or more" includes the stated number), and the feature optimization layer may be placed after any of them. For example, if there are two intermediate feature layers, referred to as the first layer and the second layer in order from the input, the feature optimization layer may be placed after the first layer and before the second layer, in which case its input is the output of the first layer and the optimized features it outputs serve as the input of the second layer, whose processing yields the final depth features used for age recognition; alternatively, the feature optimization layer may be placed after the second layer, in which case its input is the output of the second layer and the optimized features it outputs are used directly as the final depth features for age recognition.
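The channel and spatial attention refinement described by the formulas above can be sketched as a CBAM-style module as follows; the reduction ratio r = 16 and the module boundaries are illustrative assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Shared MLP: W_0 (with ReLU) then W_1, applied to the avg- and max-pooled maps.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels),
        )

    def forward(self, F):                      # F: (B, C, H, W)
        avg = self.mlp(F.mean(dim=(2, 3)))     # F^c_avg
        mx = self.mlp(F.amax(dim=(2, 3)))      # F^c_max
        return torch.sigmoid(avg + mx).view(F.size(0), -1, 1, 1)  # M_C(F): (B, C, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # f^{7x7}

    def forward(self, Fp):                     # F': (B, C, H, W)
        avg = Fp.mean(dim=1, keepdim=True)     # F'^s_avg
        mx = Fp.amax(dim=1, keepdim=True)      # F'^s_max
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_S(F'): (B, 1, H, W)

class AttentionRefine(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.ca = ChannelAttention(channels, r)
        self.sa = SpatialAttention()

    def forward(self, F):
        Fp = self.ca(F) * F      # F'  = M_C(F) ⊗ F   (element-wise, broadcast)
        return self.sa(Fp) * Fp  # F'' = M_S(F') ⊗ F'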
Once the depth features are obtained, the target age group to which the target user belongs can be determined from them. Since the age recognition network model in this embodiment is end-to-end, the age recognition process can be carried out in its output layer: in that layer, the server computes the spatial distance between the depth features and the sample features of each expanded voice sample, and takes the expanded voice sample with the smallest spatial distance as the target sample matching the input spectrogram (the target voice). The sample annotation of the target sample is then queried to determine the sample age corresponding to the target sample, which can also be regarded as the voice age corresponding to the target voice, and the target age group to which the target user belongs is determined from that voice age. The span of each age group can of course be set according to the actual situation.
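A minimal sketch of this output-layer matching, assuming the sample features of the expanded voice samples have been precomputed and that Euclidean distance is used as the spatial distance, might look like this.

import numpy as np

def match_age_group(depth_feature: np.ndarray,
                    sample_features: np.ndarray,
                    sample_age_groups: list) -> str:
    # Spatial (here Euclidean) distance between the query feature and every
    # expanded voice sample's feature.
    dists = np.linalg.norm(sample_features - depth_feature, axis=1)
    nearest = int(np.argmin(dists))
    # The annotation of the nearest sample gives the sample age / voice age.
    return sample_age_groups[nearest]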
Further, after determining the target age group to which the target user belongs, the server can query a preset speech-script library according to that age group to obtain a corresponding target script template; the target script templates may be preset and stored by the relevant managers, with different templates for different age groups. After obtaining the target script template, the server broadcasts voice prompts according to it and thereby performs voice collection on the target user.
In this embodiment, real voice samples are obtained from a preset database and expanded based on a generative adversarial network (GAN) to obtain expanded voice samples; an age recognition network model is obtained by training with the expanded voice samples; the target voice of the target user is obtained and converted into a corresponding input spectrogram; and depth features of the input spectrogram are extracted through the age recognition network model, with the target age group to which the target user belongs determined according to those depth features. In this way, large-scale data samples are first obtained by GAN-based data expansion, which increases the number of data samples while keeping them consistent with the distribution of real data (i.e., it guarantees sample quality); an end-to-end network model is then trained on a sufficient number of sufficiently realistic samples, so that the training process can capture the hidden regularities of the data more accurately, improving the performance of the resulting model and hence the accuracy of subsequent age recognition. The target voice to be recognized is converted into a spectrogram and feature extraction is performed on it through the trained network model to obtain depth features of the target voice; compared with traditional age recognition based on signal features, these depth features cover more characteristics and are better able to capture age-related attributes of the target voice that are otherwise hard to identify. Using the depth features to identify the target age group of the target user helps capture the relationship between age and voice accurately, thereby improving the generalization ability and accuracy of age recognition.
Based on the embodiment shown in FIG. 2 above, a second embodiment of the age recognition method of this application is proposed.
Referring to FIG. 3, FIG. 3 is a schematic flowchart of the second embodiment of the age recognition method of this application.
In this embodiment, step S30 includes:
Step S31: obtain the target voice of the target user, and judge whether the voice duration of the target voice is greater than a preset duration threshold.
When the target voice lasts a long time, performing age recognition on it directly may make the computation in the model too heavy. In this embodiment, a long voice may therefore be cut into multiple voice segments and age recognition performed on each segment separately, which reduces the amount of computation and also helps improve the accuracy of age recognition. Specifically, when the server obtains the target voice of the target user, it first judges whether the voice duration of the target voice is greater than a preset duration threshold, which may be set according to actual needs.
Step S32: if the voice duration is greater than the preset duration threshold, cut the target voice to obtain two or more voice segments, and convert each voice segment into a corresponding segment spectrogram.
In this embodiment, if the voice duration of the target voice is greater than the preset duration threshold, the server cuts the target voice into two or more voice segments and then converts each segment into a corresponding segment spectrogram. When cutting, the duration of each segment can be determined by rules defined according to the actual situation. For example, when the voice duration exceeds the preset duration threshold, different voice durations may correspond to different numbers of segments: with a threshold of 3 seconds, a duration greater than 3 seconds and not greater than 4 seconds corresponds to 2 segments, and a duration greater than 4 seconds corresponds to 3 segments, after which the target voice is cut evenly into the determined number of segments so that all segments have the same duration. As another example, the voice may be cut once every preset segment length: with a threshold of 3 seconds, the voice is cut every 3 seconds, so a 5-second voice is cut into two segments of 3 seconds and 2 seconds respectively. Other cutting schemes are of course also possible in practice. It is worth noting that if the voice duration of the target voice is less than or equal to the preset duration threshold, the whole target voice can be converted directly into the corresponding input spectrogram and the age recognition processing of step S40 above is performed.
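The fixed-interval cutting rule in the second example above can be sketched as follows; the 3-second threshold and the handling of the final shorter remainder follow that example, while the waveform representation is an assumption.

import numpy as np

def cut_voice(samples: np.ndarray, sr: int, threshold_s: float = 3.0):
    total_s = len(samples) / sr
    if total_s <= threshold_s:
        return [samples]                  # short clips are processed whole
    seg_len = int(threshold_s * sr)
    # A 5 s clip becomes a 3 s segment followed by a 2 s segment.
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]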
Step S40 includes:
Step S41: extract the depth features of each segment spectrogram through the age recognition network model, and determine the segment age group corresponding to each segment spectrogram according to its depth features.
In this embodiment, once the segment spectrograms are obtained, the depth features of each segment spectrogram are extracted through the age recognition network model, and the segment age group corresponding to each segment spectrogram is determined according to its depth features. The feature extraction and age-group determination for each segment spectrogram follow step S40 above and are not repeated here.
Step S42: determine the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
In this embodiment, once the segment age group corresponding to each segment spectrogram has been determined, the target age group to which the target user belongs can be determined. If the segment age groups corresponding to all segment spectrograms are the same, that common segment age group is taken as the target age group of the target user; if they differ, a voting decision rule may be defined according to the actual situation, and the target age group is determined from that rule and the several determined segment age groups. For example, the voting decision rule may take a median-average form: if the target voice corresponds to three segment spectrograms whose age groups are 22 to 24, 26 to 28 and 28 to 30 respectively, the medians 23, 27 and 29 of the three groups are taken and their mean 26.3 is used as the target age of the target user, from which the target age group is determined. The rule may also be a majority decision: if the target voice corresponds to three segment spectrograms, two with the age group 22 to 24 and one with 26 to 28, the age group 22 to 24, which corresponds to the largest number of segment spectrograms, is determined as the target age group of the target user. Besides these examples, other forms of voting decision rules can also be adopted, for example weighting the credibility of each age group by the voice duration of the corresponding segments and taking the age group with the highest credibility as the target age group of the target user.
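Two of the voting decision rules above, majority decision with a median-average fallback, can be sketched as follows; the "lo-hi" string format for age groups is an assumption made only for illustration.

from collections import Counter
from typing import List

def vote_age_group(segment_groups: List[str]) -> str:
    # Majority vote over segment age groups; median-average fallback otherwise.
    counts = Counter(segment_groups)
    group, freq = counts.most_common(1)[0]
    if freq > len(segment_groups) / 2:      # clear majority, e.g. 2 of 3 segments
        return group
    # Median-average fallback: 22-24, 26-28, 28-30 -> medians 23, 27, 29 -> mean 26.3.
    medians = [(int(lo) + int(hi)) / 2
               for lo, hi in (g.split("-") for g in segment_groups)]
    target_age = sum(medians) / len(medians)
    # Map the averaged target age back onto the closest observed age group.
    return min(segment_groups,
               key=lambda g: abs((int(g.split("-")[0]) + int(g.split("-")[1])) / 2 - target_age))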
In this embodiment, a long voice is cut into multiple voice segments and age recognition is performed on each segment separately, which reduces the amount of computation and improves the efficiency of age recognition; moreover, because the target age group of the target user is determined from the recognition results of all segments, recognition errors caused by incidental factors in the recognition process are reduced and the accuracy of age recognition is improved.
Based on the embodiment shown in FIG. 2 above, a third embodiment of the age recognition method of this application is proposed.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of the third embodiment of the age recognition method of this application.
In this embodiment, before step S30, the method further includes:
Step S50: upon receiving a collection instruction, dial a collection call according to the collection instruction, and obtain the corresponding connected voice after the call is connected.
In this embodiment, when the server receives a collection instruction, it obtains the corresponding collection item information according to the instruction, for example the borrower (target user) and contact information of a given loan, then dials the collection call according to that information and obtains the other party's connected voice after the call is connected. The collection instruction may be triggered by a manager through a terminal, or the server may store a collection plan and automatically trigger the collection instruction when the collection time set in the plan is reached.
Step S60: judge whether two or more user voices are present in the connected voice.
After obtaining the connected voice, the server judges whether two or more user voices are present in it.
Step S70: if two or more user voices are present in the connected voice, determine the target voice of the target user among the user voices according to the voice duration and/or voice volume of each user voice.
In this embodiment, the target user may be in a noisy environment or talking to someone when answering the call, so the connected voice obtained by the server may contain two or more user voices. In that case the server needs to pick out the target voice of the target user in order to identify the target user's age accurately. Specifically, the server can separate the user voices by frequency and then obtain the voice attributes of each user voice, including voice frequency, voice duration and voice volume, and determine the target voice of the target user from those attributes. For example, since in practice the target user holds the phone throughout the call, the user voice with the longest duration can be determined as the target voice; or, since the target user is the person closest to the phone and therefore relatively the loudest, the user voice with the greatest volume can be determined as the target voice. The two factors can also be combined: for each user voice, a duration score and a volume score are obtained from the voice duration and voice volume respectively, the two scores are added to give a combined score, and the user voice with the highest combined score is determined as the target voice of the target user. It is worth noting that if the connected voice contains only one user voice, that user voice can be used directly as the target voice. Once the target voice of the target user has been determined, the age recognition processing of steps S30 and S40 above can be performed.
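The combined duration-and-volume scoring described above can be sketched as follows, assuming the individual user voices have already been separated into floating-point waveforms; the equal weighting of the two scores and the use of RMS energy as the volume measure are illustrative assumptions.

import numpy as np

def pick_target_voice(voices: list, sr: int) -> np.ndarray:
    durations = np.array([len(v) / sr for v in voices])
    volumes = np.array([np.sqrt(np.mean(v ** 2)) for v in voices])  # RMS loudness
    # Normalise each attribute to [0, 1] and add them into a combined score.
    combined = durations / durations.max() + volumes / volumes.max()
    return voices[int(np.argmax(combined))]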
In this way, when the server of this embodiment performs telephone collection, if two or more user voices are present in the connected voice, the target voice is determined first and the subsequent age recognition processing is then performed, which avoids the recognition errors that could result from performing age recognition on multiple user voices and improves the accuracy of age recognition.
Based on the embodiment shown in FIG. 2 above, a fourth embodiment of the age recognition method of this application is proposed.
In this embodiment, after step S40, the method further includes:
obtaining the historical target age groups of a preset number of historical target users, or of the historical target users within a preset period, and obtaining a collection age distribution according to the historical target age groups;
In this embodiment, when the server obtains the target age group of the current target user, it may also store it. When the historical target age groups of a preset number of historical target users, or of the historical target users within a preset period, have been collected, these historical target age groups can be aggregated and counted to obtain the corresponding collection age distribution. For example, among 100 historical target users, 30 belong to the 26-to-28 age group and 70 belong to the 30-to-32 age group; or, in last month's collection cycle there were 100 historical target users in total, of whom 30 belong to the 26-to-28 age group and 70 belong to the 30-to-32 age group.
judging, according to the collection age distribution, whether there is an age group with an abnormal number of users, and if so, sending corresponding abnormality prompt information to the corresponding management terminal.
In this embodiment, after obtaining the collection age distribution, the server can judge from it whether there is an age group with an abnormal number of users. For this judgment, a corresponding abnormality rule can be preset, for example a threshold or proportion for the number of users in an age group: when the number of users in a given age group exceeds the threshold or proportion, the number of users in that age group is considered abnormal. If such an age group exists, the current loan business may carry a certain risk in that age group, or the recognition capability of the age recognition network model may have degraded so that too many users are identified as belonging to it; in that case the server can send corresponding abnormality prompt information to the corresponding management terminal to prompt the relevant managers to inspect and handle the situation in time.
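A minimal sketch of such a proportion-based abnormality rule is shown below; the 60% share threshold is an illustrative assumption, not a value specified by this application.

from collections import Counter

def find_abnormal_age_groups(historical_groups: list, max_share: float = 0.6) -> list:
    total = len(historical_groups)
    counts = Counter(historical_groups)
    # Flag every age group whose share of recent collection calls exceeds the limit,
    # e.g. 70 of 100 users in "30-32" exceeds a 60% share and would be flagged.
    return [group for group, n in counts.items() if n / total > max_share]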
In this way, the server of this embodiment can analyze the collection age distribution of historical target users and determine from it whether an abnormal situation exists, which helps detect abnormal situations in time, reduce business risk and maintain the stability of the age recognition network model.
In addition, an embodiment of this application further provides an age recognition apparatus.
In this embodiment, the age recognition apparatus includes:
a sample expansion module, configured to obtain real voice samples from a preset database and perform sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples;
a model training module, configured to obtain an age recognition network model by training with the expanded voice samples;
a voice conversion module, configured to obtain the target voice of the target user and convert the target voice into a corresponding input spectrogram;
an age determination module, configured to extract depth features of the input spectrogram through the age recognition network model and determine, according to the depth features, the target age group to which the target user belongs.
The virtual function modules of the above age recognition apparatus are stored in the memory 1005 of the age recognition device shown in FIG. 1 and are used to implement all the functions of the computer program; when the modules are executed by the processor 1001, the age recognition function is implemented.
Further, the age recognition network model includes an intermediate feature layer and a feature optimization layer, and the age determination module includes:
a feature extraction unit, configured to perform original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain corresponding original features;
a feature optimization unit, configured to perform feature optimization on the original features, based on the attention mechanism, through the feature optimization layer of the age recognition network model to obtain corresponding optimized features, and determine the optimized features as the depth features of the input spectrogram.
Further, the original features include an original feature map F and the optimized features include an optimized feature map F''. The feature optimization unit is further configured to: obtain the original feature map F of the original features through the feature optimization layer of the age recognition network model; compute the one-dimensional channel attention map corresponding to F; perform element-wise multiplication on F and the channel attention map to obtain the corresponding intermediate feature map F'; compute the two-dimensional spatial attention map corresponding to F'; and perform element-wise multiplication on F' and the spatial attention map to obtain the corresponding optimized feature map F''.
Further, the voice conversion module includes:
a duration judging unit, configured to obtain the target voice of the target user and judge whether the voice duration of the target voice is greater than a preset duration threshold;
a voice segmentation unit, configured to, if the voice duration is greater than the preset duration threshold, cut the target voice to obtain two or more voice segments and convert each voice segment into a corresponding segment spectrogram.
The age determination module is further configured to extract the depth features of each segment spectrogram through the age recognition network model, determine the segment age group corresponding to each segment spectrogram according to its depth features, and determine the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
Further, the age recognition apparatus further includes:
a voice acquisition module, configured to, upon receiving a collection instruction, dial a collection call according to the collection instruction and obtain the corresponding connected voice after the call is connected;
a voice judging module, configured to judge whether two or more user voices are present in the connected voice;
a voice determination module, configured to, if two or more user voices are present in the connected voice, determine the target voice of the target user among the user voices according to the voice duration and/or voice volume of each user voice.
Further, the age recognition apparatus further includes:
a voice collection module, configured to obtain a corresponding target script template according to the target age group and perform voice collection on the target user according to the target script template.
Further, the age recognition apparatus further includes:
a distribution acquisition module, configured to obtain the historical target age groups of a preset number of historical target users, or of the historical target users within a preset period, and obtain a collection age distribution according to the historical target age groups;
an abnormality judging module, configured to judge, according to the collection age distribution, whether there is an age group with an abnormal number of users, and if so, send corresponding abnormality prompt information to the corresponding management terminal.
The function implementation of each module in the above age recognition apparatus corresponds to the steps in the embodiments of the age recognition method above; their functions and implementation processes are not repeated here one by one.
In addition, an embodiment of this application further provides a computer-readable storage medium.
The computer-readable storage medium of this application stores a computer program which, when executed by a processor, implements the steps of the age recognition method described above.
For the method implemented when the computer program is executed, reference may be made to the embodiments of the age recognition method of this application, which are not repeated here.
Optionally, the storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
It should be noted that, in this document, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or system. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or system that includes that element.
The serial numbers of the above embodiments of this application are for description only and do not represent the superiority or inferiority of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit the patent scope of this application; any equivalent structural or process transformation made using the content of the specification and drawings of this application, applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.
Claims (20)
- An age recognition method, wherein the age recognition method comprises: obtaining real voice samples from a preset database, and performing sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples; obtaining an age recognition network model by training with the expanded voice samples; obtaining a target voice of a target user, and converting the target voice into a corresponding input spectrogram; and extracting depth features of the input spectrogram through the age recognition network model, and determining, according to the depth features, a target age group to which the target user belongs.
- The age recognition method according to claim 1, wherein the age recognition network model comprises an intermediate feature layer and a feature optimization layer, and the step of extracting depth features of the input spectrogram through the age recognition network model comprises: performing original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain corresponding original features; and performing feature optimization on the original features, based on an attention mechanism, through the feature optimization layer of the age recognition network model to obtain corresponding optimized features, and determining the optimized features as the depth features of the input spectrogram.
- The age recognition method according to claim 2, wherein the original features comprise an original feature map F and the optimized features comprise an optimized feature map F'', and the step of performing feature optimization on the original features, based on the attention mechanism, through the feature optimization layer of the age recognition network model to obtain the corresponding optimized features comprises: obtaining the original feature map F of the original features through the feature optimization layer of the age recognition network model; computing a one-dimensional channel attention map corresponding to F; performing element-wise multiplication on F and the channel attention map to obtain a corresponding intermediate feature map F'; computing a two-dimensional spatial attention map corresponding to F'; and performing element-wise multiplication on F' and the spatial attention map to obtain the corresponding optimized feature map F''.
- The age recognition method according to claim 1, wherein the step of obtaining the target voice of the target user and converting the target voice into the corresponding input spectrogram comprises: obtaining the target voice of the target user, and judging whether a voice duration of the target voice is greater than a preset duration threshold; and if the voice duration is greater than the preset duration threshold, cutting the target voice to obtain two or more voice segments, and converting each voice segment into a corresponding segment spectrogram; and the step of extracting depth features of the input spectrogram through the age recognition network model and determining, according to the depth features, the target age group to which the target user belongs comprises: extracting depth features of each segment spectrogram through the age recognition network model, and determining a segment age group corresponding to each segment spectrogram according to the depth features of that segment spectrogram; and determining the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
- The age recognition method according to claim 1, wherein before the step of obtaining the target voice of the target user and converting the target voice into the corresponding input spectrogram, the method further comprises: upon receiving a collection instruction, dialing a collection call according to the collection instruction, and obtaining a corresponding connected voice after the call is connected; judging whether two or more user voices are present in the connected voice; and if two or more user voices are present in the connected voice, determining the target voice of the target user among the user voices according to the voice duration and/or voice volume of each user voice.
- 6. The age recognition method of claim 1, wherein after the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features, the method further comprises: acquiring a corresponding target speech template according to the target age group, and conducting voice collection on the target user according to the target speech template.
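Claim 6 then maps the predicted age group to a collection speech template. A trivial lookup-table sketch, with made-up age buckets and template names, illustrates the idea:

```python
# Hypothetical age buckets and template identifiers; the real mapping is not
# specified in this document.
AGE_GROUP_TEMPLATES = {
    "18-30": "younger_borrower_script",
    "31-50": "standard_script",
    "51+": "senior_borrower_script",
}

def choose_template(target_age_group: str) -> str:
    # Fall back to the standard script if the age group is unmapped.
    return AGE_GROUP_TEMPLATES.get(target_age_group, "standard_script")
```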
- 7. The age recognition method of any one of claims 1 to 6, wherein after the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features, the method further comprises: acquiring the historical target age groups of a preset number, or of a preset period, of historical target users, and obtaining a collection age distribution according to the historical target age groups; and determining, according to the collection age distribution, whether there is an age group with an abnormal number of users, and if so, sending corresponding abnormality prompt information to the corresponding management terminal.
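The distribution check in claim 7 can be realized as a simple statistical comparison of recent predictions against a historical baseline. The sketch below, with an assumed 10-percentage-point tolerance and illustrative baseline shares, flags age groups whose share deviates from the baseline; the flagged groups would then trigger the abnormality prompt to the management terminal.

```python
from collections import Counter
from typing import Dict, List

def find_abnormal_age_groups(recent: List[str],
                             baseline: Dict[str, float],
                             tolerance: float = 0.10) -> List[str]:
    """Return age groups whose recent share deviates from the baseline share by more than `tolerance`."""
    total = len(recent) or 1
    share = {g: n / total for g, n in Counter(recent).items()}
    return [g for g, expected in baseline.items()
            if abs(share.get(g, 0.0) - expected) > tolerance]

# Example with made-up numbers: "18-25" would be flagged here.
# find_abnormal_age_groups(["18-25"] * 80 + ["26-35"] * 20,
#                          {"18-25": 0.4, "26-35": 0.4, "36-50": 0.2})
```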
- 8. An age recognition apparatus, comprising: a sample expansion module, configured to acquire real voice samples from a preset database and perform sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples; a model training module, configured to train an age recognition network model with the expanded voice samples; a voice conversion module, configured to acquire the target voice of a target user and convert the target voice into a corresponding input spectrogram; and an age determination module, configured to extract the depth features of the input spectrogram through the age recognition network model and determine the target age group to which the target user belongs according to the depth features.
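The sample expansion module in claim 8 relies on a GAN to synthesize additional voice samples. A deliberately compact PyTorch sketch of that idea is shown below; the fully connected generator/discriminator, the 64x64 spectrogram patch size, and the optimizer settings are assumptions for illustration, not the networks used in the patent.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator over flattened 64x64 spectrogram patches.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                  nn.Linear(256, 64 * 64), nn.Tanh())
D = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real_patches: torch.Tensor) -> torch.Tensor:  # (B, 64*64), scaled to [-1, 1]
    b = real_patches.size(0)
    fake = G(torch.randn(b, 100))
    # Discriminator update: real -> 1, fake -> 0
    d_loss = bce(D(real_patches), torch.ones(b, 1)) + \
             bce(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator update: try to fool the discriminator
    g_loss = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return fake.detach()  # candidate expanded samples for the training set
```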
- 9. An age recognition device, comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements an age recognition method comprising the following steps: acquiring real voice samples from a preset database, and performing sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples; training an age recognition network model with the expanded voice samples; acquiring the target voice of a target user, and converting the target voice into a corresponding input spectrogram; and extracting the depth features of the input spectrogram through the age recognition network model, and determining the target age group to which the target user belongs according to the depth features.
- 10. The age recognition device of claim 9, wherein the age recognition network model comprises an intermediate feature layer and a feature optimization layer, and the step of extracting the depth features of the input spectrogram through the age recognition network model comprises: performing original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain the corresponding original feature; and performing feature optimization on the original feature based on the attention mechanism through the feature optimization layer of the age recognition network model to obtain the corresponding optimized feature, and determining the optimized feature as the depth feature of the input spectrogram.
- 11. The age recognition device of claim 10, wherein the original feature comprises an original feature map F, the optimized feature comprises an optimized feature map F", and the step of performing feature optimization on the original feature based on the attention mechanism through the feature optimization layer of the age recognition network model to obtain the corresponding optimized feature comprises: acquiring the original feature map F of the original feature through the feature optimization layer of the age recognition network model; computing the one-dimensional channel attention map corresponding to F; performing element-wise multiplication on F and the channel attention map to obtain the corresponding intermediate feature map F'; computing the two-dimensional spatial attention map corresponding to F'; and performing element-wise multiplication on F' and the spatial attention map to obtain the corresponding optimized feature map F".
- 12. The age recognition device of claim 9, wherein the step of acquiring the target voice of the target user and converting the target voice into a corresponding target spectrogram comprises: acquiring the target voice of the target user, and determining whether the voice duration of the target voice is greater than a preset duration threshold; and if the voice duration is greater than the preset duration threshold, performing voice cutting on the target voice to obtain two or more voice segments, and converting each voice segment into a corresponding segment spectrogram; and wherein the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features comprises: extracting the depth features of each segment spectrogram through the age recognition network model, and determining the segment age group corresponding to each segment spectrogram according to its depth features; and determining the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
- 13. The age recognition device of claim 9, wherein before the step of acquiring the target voice of the target user and converting the target voice into a corresponding input spectrogram, the method further comprises: upon receiving a collection instruction, dialing a collection call according to the collection instruction, and acquiring the corresponding connected voice after the call is answered; determining whether the connected voice contains two or more user voices; and if the connected voice contains two or more user voices, determining the target voice of the target user among the user voices according to the voice duration and/or voice volume of each user voice.
- 14. The age recognition device of any one of claims 9 to 13, wherein after the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features, the method further comprises: acquiring the historical target age groups of a preset number, or of a preset period, of historical target users, and obtaining a collection age distribution according to the historical target age groups; and determining, according to the collection age distribution, whether there is an age group with an abnormal number of users, and if so, sending corresponding abnormality prompt information to the corresponding management terminal.
- 15. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements an age recognition method comprising the following steps: acquiring real voice samples from a preset database, and performing sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples; training an age recognition network model with the expanded voice samples; acquiring the target voice of a target user, and converting the target voice into a corresponding input spectrogram; and extracting the depth features of the input spectrogram through the age recognition network model, and determining the target age group to which the target user belongs according to the depth features.
- 16. The computer-readable storage medium of claim 15, wherein the age recognition network model comprises an intermediate feature layer and a feature optimization layer, and the step of extracting the depth features of the input spectrogram through the age recognition network model comprises: performing original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain the corresponding original feature; and performing feature optimization on the original feature based on the attention mechanism through the feature optimization layer of the age recognition network model to obtain the corresponding optimized feature, and determining the optimized feature as the depth feature of the input spectrogram.
- 17. The computer-readable storage medium of claim 16, wherein the original feature comprises an original feature map F, the optimized feature comprises an optimized feature map F", and the step of performing feature optimization on the original feature based on the attention mechanism through the feature optimization layer of the age recognition network model to obtain the corresponding optimized feature comprises: acquiring the original feature map F of the original feature through the feature optimization layer of the age recognition network model; computing the one-dimensional channel attention map corresponding to F; performing element-wise multiplication on F and the channel attention map to obtain the corresponding intermediate feature map F'; computing the two-dimensional spatial attention map corresponding to F'; and performing element-wise multiplication on F' and the spatial attention map to obtain the corresponding optimized feature map F".
- 18. The computer-readable storage medium of claim 15, wherein the step of acquiring the target voice of the target user and converting the target voice into a corresponding target spectrogram comprises: acquiring the target voice of the target user, and determining whether the voice duration of the target voice is greater than a preset duration threshold; and if the voice duration is greater than the preset duration threshold, performing voice cutting on the target voice to obtain two or more voice segments, and converting each voice segment into a corresponding segment spectrogram; and wherein the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features comprises: extracting the depth features of each segment spectrogram through the age recognition network model, and determining the segment age group corresponding to each segment spectrogram according to its depth features; and determining the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
- 19. The computer-readable storage medium of claim 15, wherein before the step of acquiring the target voice of the target user and converting the target voice into a corresponding input spectrogram, the method further comprises: upon receiving a collection instruction, dialing a collection call according to the collection instruction, and acquiring the corresponding connected voice after the call is answered; determining whether the connected voice contains two or more user voices; and if the connected voice contains two or more user voices, determining the target voice of the target user among the user voices according to the voice duration and/or voice volume of each user voice.
- 20. The computer-readable storage medium of any one of claims 15 to 19, wherein after the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features, the method further comprises: acquiring the historical target age groups of a preset number, or of a preset period, of historical target users, and obtaining a collection age distribution according to the historical target age groups; and determining, according to the collection age distribution, whether there is an age group with an abnormal number of users, and if so, sending corresponding abnormality prompt information to the corresponding management terminal.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010094834.9 | 2020-02-12 | ||
CN202010094834.9A CN111312286A (en) | 2020-02-12 | 2020-02-12 | Age identification method, age identification device, age identification equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021159902A1 true WO2021159902A1 (en) | 2021-08-19 |
Family
ID=71150902
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/071262 WO2021159902A1 (en) | 2020-02-12 | 2021-01-12 | Age recognition method, apparatus and device, and computer-readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111312286A (en) |
WO (1) | WO2021159902A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113989100A (en) * | 2021-09-18 | 2022-01-28 | 西安电子科技大学 | Infrared texture sample expansion method based on pattern generation countermeasure network |
CN114067780A (en) * | 2021-11-04 | 2022-02-18 | 国家工业信息安全发展研究中心 | Vehicle-mounted voice recognition simulation test method, system and storage medium |
CN114694685A (en) * | 2022-04-12 | 2022-07-01 | 北京小米移动软件有限公司 | Voice quality evaluation method, device and storage medium |
CN114760523A (en) * | 2022-03-30 | 2022-07-15 | 咪咕数字传媒有限公司 | Audio and video processing method, device, equipment and storage medium |
CN117690431A (en) * | 2023-12-25 | 2024-03-12 | 杭州恒芯微电子技术有限公司 | Microphone system based on voice recognition |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111312286A (en) * | 2020-02-12 | 2020-06-19 | 深圳壹账通智能科技有限公司 | Age identification method, age identification device, age identification equipment and computer readable storage medium |
CN114861746B (en) * | 2021-12-15 | 2024-10-18 | 平安科技(深圳)有限公司 | Anti-fraud identification method and device based on big data and related equipment |
CN114708872B (en) * | 2022-03-22 | 2024-10-22 | 青岛海尔科技有限公司 | Voice instruction response method and device, storage medium and electronic device |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080249774A1 (en) * | 2007-04-03 | 2008-10-09 | Samsung Electronics Co., Ltd. | Method and apparatus for speech speaker recognition |
CN103310788A (en) * | 2013-05-23 | 2013-09-18 | 北京云知声信息技术有限公司 | Voice information identification method and system |
US20170018270A1 (en) * | 2015-07-16 | 2017-01-19 | Samsung Electronics Co., Ltd. | Speech recognition apparatus and method |
CN108922518A (en) * | 2018-07-18 | 2018-11-30 | 苏州思必驰信息科技有限公司 | voice data amplification method and system |
CN109147810A (en) * | 2018-09-30 | 2019-01-04 | 百度在线网络技术(北京)有限公司 | Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network |
CN109299701A (en) * | 2018-10-15 | 2019-02-01 | 南京信息工程大学 | Expand the face age estimation method that more ethnic group features cooperate with selection based on GAN |
CN109559736A (en) * | 2018-12-05 | 2019-04-02 | 中国计量大学 | A kind of film performer's automatic dubbing method based on confrontation network |
CN111312286A (en) * | 2020-02-12 | 2020-06-19 | 深圳壹账通智能科技有限公司 | Age identification method, age identification device, age identification equipment and computer readable storage medium |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107430677B (en) * | 2015-03-20 | 2022-04-12 | 英特尔公司 | Target identification based on improving binary convolution neural network characteristics |
KR101809511B1 (en) * | 2016-08-04 | 2017-12-15 | 세종대학교산학협력단 | Apparatus and method for age group recognition of speaker |
WO2019225801A1 (en) * | 2018-05-23 | 2019-11-28 | 한국과학기술원 | Method and system for simultaneously recognizing emotion, age, and gender on basis of voice signal of user |
CN111066082B (en) * | 2018-05-25 | 2020-08-28 | 北京嘀嘀无限科技发展有限公司 | Voice recognition system and method |
CN108924218B (en) * | 2018-06-29 | 2020-02-18 | 百度在线网络技术(北京)有限公司 | Method and device for pushing information |
CN110136726A (en) * | 2019-06-20 | 2019-08-16 | 厦门市美亚柏科信息股份有限公司 | A kind of estimation method, device, system and the storage medium of voice gender |
CN110556129B (en) * | 2019-09-09 | 2022-04-19 | 北京大学深圳研究生院 | Bimodal emotion recognition model training method and bimodal emotion recognition method |
CN110619889B (en) * | 2019-09-19 | 2022-03-15 | Oppo广东移动通信有限公司 | Sign data identification method and device, electronic equipment and storage medium |
- 2020-02-12: CN application CN202010094834.9A filed (published as CN111312286A), status: active, Pending
- 2021-01-12: PCT application PCT/CN2021/071262 filed (published as WO2021159902A1), status: active, Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN111312286A (en) | 2020-06-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021159902A1 (en) | Age recognition method, apparatus and device, and computer-readable storage medium | |
EP3839942A1 (en) | Quality inspection method, apparatus, device and computer storage medium for insurance recording | |
US20180261236A1 (en) | Speaker recognition method and apparatus, computer device and computer-readable medium | |
WO2021000408A1 (en) | Interview scoring method and apparatus, and device and storage medium | |
CN111694940B (en) | User report generation method and terminal equipment | |
CN110634472B (en) | Speech recognition method, server and computer readable storage medium | |
CN108922543B (en) | Model base establishing method, voice recognition method, device, equipment and medium | |
US20230206928A1 (en) | Audio processing method and apparatus | |
US9947323B2 (en) | Synthetic oversampling to enhance speaker identification or verification | |
CN107316635B (en) | Voice recognition method and device, storage medium and electronic equipment | |
CN108132952A (en) | A kind of active searching method and device based on speech recognition | |
CN111508524B (en) | Method and system for identifying voice source equipment | |
CN113807103B (en) | Recruitment method, device, equipment and storage medium based on artificial intelligence | |
CN112037800A (en) | Voiceprint nuclear model training method and device, medium and electronic equipment | |
CN114065720A (en) | Conference summary generation method and device, storage medium and electronic equipment | |
US20210287682A1 (en) | Information processing apparatus, control method, and program | |
WO2024093578A1 (en) | Voice recognition method and apparatus, and electronic device, storage medium and computer program product | |
US20220327803A1 (en) | Method of recognizing object, electronic device and storage medium | |
Wang et al. | Interference quality assessment of speech communication based on deep learning | |
CN115831125A (en) | Speech recognition method, device, equipment, storage medium and product | |
CN113889081A (en) | Speech recognition method, medium, device and computing equipment | |
CN109190556B (en) | Method for identifying notarization will authenticity | |
WO2021077333A1 (en) | Simultaneous interpretation method and device, and storage medium | |
CN118658467B (en) | Cheating detection method, device, equipment, storage medium and product | |
CN115620748B (en) | Comprehensive training method and device for speech synthesis and false identification evaluation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21753357 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.12.2022) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21753357 Country of ref document: EP Kind code of ref document: A1 |