WO2021159902A1 - Age recognition method, apparatus and device, and computer-readable storage medium - Google Patents
- Publication number
- WO2021159902A1 (PCT/CN2021/071262)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- age
- target
- feature
- network model
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- This application relates to the field of artificial intelligence technology, and in particular to an age identification method, device, equipment, and computer-readable storage medium.
- The main purpose of this application is to provide an age identification method, apparatus, device, and computer-readable storage medium, aiming to solve the technical problem that traditional age identification has low accuracy.
- In order to achieve the above objective, an embodiment of the present application provides an age identification method, and the age identification method includes: obtaining real voice samples from a preset database, and performing sample expansion on the real voice samples based on the generative adversarial network (GAN) to obtain expanded voice samples; obtaining an age recognition network model through training with the expanded voice samples; obtaining the target voice of a target user, and converting the target voice into a corresponding input spectrogram; and extracting the depth feature of the input spectrogram through the age recognition network model, and determining the target age group to which the target user belongs according to the depth feature.
- an embodiment of the present application further provides an age identification device, the age identification device including:
- the sample expansion module is used to obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative adversarial network (GAN) to obtain expanded voice samples;
- the model training module is used to obtain an age recognition network model through training with the expanded voice samples;
- the voice conversion module is used to obtain the target voice of the target user and convert the target voice into a corresponding input spectrogram
- the age determination module is configured to extract the depth characteristics of the input spectrogram through the age recognition network model, and determine the target age group to which the target user belongs according to the depth characteristics.
- An embodiment of the present application further provides an age identification device. The age identification device includes a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein when the computer program is executed by the processor, the age identification method described above is implemented, and the age identification method includes the steps of: obtaining real voice samples from a preset database and performing GAN-based sample expansion on them to obtain expanded voice samples; obtaining an age recognition network model through training with the expanded voice samples; obtaining the target voice of a target user and converting it into a corresponding input spectrogram; and extracting the depth feature of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth feature.
- The embodiments of the present application also provide a computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the age identification method described above is implemented, and the age identification method includes the steps of: obtaining real voice samples from a preset database and performing GAN-based sample expansion on them to obtain expanded voice samples; obtaining an age recognition network model through training with the expanded voice samples; obtaining the target voice of a target user and converting it into a corresponding input spectrogram; and extracting the depth feature of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth feature.
- the embodiments of the present application can improve the generalization ability of age recognition and improve the accuracy of age recognition.
- FIG. 1 is a schematic diagram of the hardware structure of the age recognition device involved in the solution of the embodiment of the application;
- FIG. 2 is a schematic flowchart of the first embodiment of the age identification method of this application.
- FIG. 3 is a schematic flowchart of a second embodiment of the age identification method of this application.
- FIG. 4 is a schematic flowchart of a third embodiment of the age identification method of this application.
- the technical solution of this application can be applied to the fields of artificial intelligence, smart city, digital medical care, blockchain and/or big data technology.
- the data involved in this application such as voice samples, depth features, and/or determined age group information, can be stored in a database, or can be stored in a blockchain, such as distributed storage through a blockchain.
- This application is not limited in this respect.
- the age identification method involved in the embodiments of the present application is mainly applied to an age identification device, and the age identification device may be a device with a data processing function such as a server, a personal computer (PC), or a notebook computer.
- FIG. 1 is a schematic diagram of the hardware structure of the age recognition device involved in the solution of the embodiment of this application.
- the age recognition device may include a processor and a memory.
- the age identification device may also include a communication bus, a user interface, and/or a network interface.
- the age identification device includes a processor 1001 (for example, a central processing unit, a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
- the communication bus 1002 is used to realize the connection and communication between these components;
- the user interface 1003 may include a display (Display), an input unit such as a keyboard (Keyboard);
- the network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a wireless fidelity (Wi-Fi) interface);
- the memory 1005 may be a high-speed random access memory (RAM) or a non-volatile memory such as a disk memory; optionally, the memory 1005 may also be a storage device independent of the foregoing processor 1001.
- Those skilled in the art can understand that the hardware structure shown in FIG. 1 does not constitute a limitation to the present application, and may include more or fewer components than those shown in the figure, combine certain components, or use different component arrangements.
- the memory 1005 as a computer-readable storage medium in FIG. 1 may include an operating system, a network communication module, and a computer program.
- the network communication module can be used to connect to a preset database and perform data communication with the database; and the processor 1001 can call a computer program stored in the memory 1005 and execute the age identification method provided in the embodiment of the present application.
- the embodiment of the present application provides an age identification method.
- FIG. 2 is a schematic flowchart of the first embodiment of the age identification method of this application.
- the age identification method includes the following steps:
- Step S10: Obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative adversarial network (GAN) to obtain expanded voice samples;
- At present, when carrying out collection, loan companies often identify a user's age from the user's voice during the conversation in order to enhance the user experience and the collection effect, and then adopt different collection approaches according to the user's age. Traditional voice-based age recognition methods mostly perform statistical analysis of the signal-level characteristics of the voice to determine the speaker's age; however, limited by those signal-level characteristics, such methods have insufficient generalization ability, low recognition accuracy in practical applications, and poor application effect.
- In view of this, this embodiment proposes an age identification method. First, large-scale data samples are obtained through data expansion with a generative adversarial network (GAN); while the number of data samples is increased, the expanded samples also better conform to the distribution of the real data (that is, the quality of the samples is guaranteed). A sufficient number of sufficiently realistic data samples are then used to train an end-to-end network model, so that the training process can capture the hidden regularities of the data more accurately, which improves the performance of the resulting network model and hence the accuracy of subsequent age recognition performed with it. The target speech to be recognized is then converted into a spectrogram, and feature extraction is performed on the spectrogram through the obtained network model to obtain the depth features of the target speech. Compared with traditional age recognition based on signal-level features, these depth features cover more characteristics and pay more attention to the age attribute representations in the target speech that are otherwise difficult to recognize; using them to identify the target age group to which the target user belongs helps accurately capture the relationship between age and voice, thereby improving the generalization ability and accuracy of age recognition.
- the age identification method in this embodiment is implemented by an age identification device.
- the age identification device may be a server, a personal computer, a notebook computer, or other devices.
- a server is used as an example for description.
- the server in this embodiment may be a server in a collection system.
- the server is connected to a preset database.
- the database stores a number of real voice samples collected in advance; these real voice samples can be in the form of original voice or in the form of spectrograms, and each real voice sample carries a corresponding sample annotation whose content includes the age group of the user to whom the real voice sample belongs (of course, the annotation content may also include other information).
- In this embodiment, the server trains an age recognition network model for recognizing age, and the age recognition network model is constructed based on a deep neural network (machine learning). Considering that the real voice samples obtainable in practice may suffer from data imbalance, and that the number and quality of the samples have a considerable impact on the training result (i.e. the capability of the model), in this embodiment sample expansion is performed on the real voice samples to obtain expanded voice samples, thereby obtaining large-scale data samples.
- When sample expansion is performed, the real voice samples used should be in the form of spectrograms (which contain three-dimensional information of time, frequency, and amplitude); real voice samples in the form of original voice must first be converted into corresponding spectrograms by means of the short-time Fourier transform (STFT) or other methods.
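- As an illustration of this conversion step, the sketch below turns an original voice recording into a log-magnitude spectrogram with the short-time Fourier transform using the librosa library; the file name, sampling rate, and STFT parameters are illustrative assumptions rather than values specified by the patent.

```python
import numpy as np
import librosa

def wav_to_spectrogram(path, sr=16000, n_fft=512, hop_length=128):
    """Load a voice recording and convert it to a log-magnitude spectrogram.

    Returns a 2-D array of shape (1 + n_fft // 2, n_frames) holding the
    time-frequency-amplitude information described in the text.
    """
    waveform, _ = librosa.load(path, sr=sr)           # resample to a fixed rate
    stft = librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length)
    magnitude = np.abs(stft)                          # amplitude per (freq, time) bin
    return librosa.amplitude_to_db(magnitude, ref=np.max)

# Hypothetical usage:
# spec = wav_to_spectrogram("real_voice_sample.wav")
```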
- the generative adversarial network GAN includes two sub-networks, which can be called generator G (Generator) and discriminator D (Discriminator);
- G is the network that generates expanded samples: from a random noise z it generates a simulated sample, denoted G(z), that follows the distribution of the real voice samples as closely as possible;
- D is a discriminant network that judges whether an input sample is "real": an output of 1 means the sample is judged to be real, while an output of 0 means it is judged not to be real. During training, the goal of G is to generate simulated samples realistic enough to deceive D, while the goal of D is to distinguish the simulated samples generated by G from the real voice samples as well as possible.
- After sufficient adversarial training, G can generate simulated samples G(z) that are realistic enough to pass for real ones, and these simulated samples serve as the expanded voice samples.
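- To make the adversarial training just described concrete, here is a minimal, illustrative sketch in PyTorch (not the patent's implementation): a small generator G maps noise z to a simulated flattened spectrogram G(z), and a discriminator D outputs the probability that its input is real. The network sizes, noise dimension, learning rates, and data pipeline are all assumptions.

```python
import torch
from torch import nn

NOISE_DIM, SPEC_DIM = 100, 128 * 128   # placeholder sizes for a flattened spectrogram

G = nn.Sequential(                      # generator: noise z -> simulated spectrogram G(z)
    nn.Linear(NOISE_DIM, 512), nn.ReLU(),
    nn.Linear(512, SPEC_DIM), nn.Tanh(),   # assumes spectrograms scaled to [-1, 1]
)
D = nn.Sequential(                      # discriminator: spectrogram -> probability "real"
    nn.Linear(SPEC_DIM, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1), nn.Sigmoid(),
)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    """One adversarial update: D learns to separate real from G(z), G learns to fool D."""
    b = real_batch.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # update discriminator D
    z = torch.randn(b, NOISE_DIM)
    fake = G(z).detach()
    d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # update generator G: G wants D to output 1 for its samples
    z = torch.randn(b, NOISE_DIM)
    g_loss = bce(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# After enough steps, G(torch.randn(n, NOISE_DIM)) yields simulated spectrograms
# that can be added to the real samples as the expanded voice samples.
```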
- Step S20: Obtain an age recognition network model through training with the expanded voice samples;
- When the expanded voice samples are obtained, the server trains with them to obtain the age recognition network model. For the convenience of subsequent processing, the age recognition network model can be set up in an end-to-end form, that is, the input of the network model is speech and the output is the age group to which the speech belongs. The end-to-end approach does not require separate labeling for every intermediate process, which helps reduce the labeling workload and, at the same time, helps improve the accuracy of age recognition.
- For the specific structure of the age recognition network model, a deep network model can be used; for example, it can be implemented based on the classic deep residual network ResNet50. Of course, while the ResNet50 architecture is adopted, part of the structure can also be adjusted according to the actual situation.
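- As one hedged reading of "ResNet50 with partial adjustments", the sketch below uses torchvision's ResNet50 and changes only the input convolution (single-channel spectrograms instead of RGB images) and the classification head (age groups instead of 1000 ImageNet classes); the number of age groups is an assumed placeholder, not a value given in the patent.

```python
import torch
from torch import nn
from torchvision import models

NUM_AGE_GROUPS = 8   # assumed number of age-group classes

def build_age_model():
    """ResNet50 backbone adjusted for single-channel spectrogram input
    and an age-group classification head."""
    model = models.resnet50(weights=None)
    # spectrograms have one channel instead of three RGB channels
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # replace the 1000-class ImageNet head with an age-group head
    model.fc = nn.Linear(model.fc.in_features, NUM_AGE_GROUPS)
    return model

# logits = build_age_model()(torch.randn(4, 1, 224, 224))  # (batch, age-group scores)
```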
- Step S30: Obtain the target voice of the target user, and convert the target voice into a corresponding input spectrogram;
- After obtaining the age recognition network model, the server can perform voice age recognition during the collection process through this model. Specifically, when a collection item needs to be handled, the server first obtains the collection item information corresponding to the item, such as the loan user (target user) of a certain loan and the contact information, and then dials the collection call according to the collection item information; when the call is connected, a general greeting voice can first be played to confirm the identity of the answering party; when the target user replies by voice over the phone, the server obtains the target voice of the target user and then converts the target voice into the corresponding input spectrogram by means of the short-time Fourier transform for subsequent analysis and processing.
- Step S40: Extract the depth feature of the input spectrogram through the age recognition network model, and determine the target age group to which the target user belongs according to the depth feature.
- When the input spectrogram is obtained, feature extraction can be performed on it through the age recognition network model to obtain the corresponding depth feature, and the target age group to which the target user belongs is then determined according to the depth feature.
- Further, an attention mechanism can be introduced into the age recognition network model; this can be regarded as constructing an attention module and embedding it into the age recognition network model, for example inserting it after a certain intermediate feature layer of the model to perform feature optimization (feature refinement), after which the obtained optimized feature is subjected to subsequent processing (such as being fed into the next layer, or being used as the final feature).
- Accordingly, the age recognition network model in this embodiment includes an intermediate feature layer and a feature optimization layer, and the extraction of the depth feature in step S40 includes the following.
- The intermediate feature layer provides the feature extraction functions of a general intermediate network layer (including convolution, pooling, and the like), while the feature optimization layer is constructed based on the attention mechanism.
- the server may first extract the original features of the input spectrogram through the intermediate feature layer of the age recognition network model to obtain the corresponding original features.
- As for this original feature, it can be considered to include all the features of the input spectrogram, but not all of these features are necessarily related to age; if all of them were used for age recognition, the recognition accuracy might be affected, and the amount of calculation would also be too large, which would slow down recognition. Therefore, this embodiment further refines these original features.
- the original feature is optimized based on the attention mechanism to obtain the corresponding optimized feature, and the optimized feature is determined as the depth feature of the input spectrogram.
- the server optimizes the original features through the feature optimization layer of the age recognition network model and based on the attention mechanism to obtain the corresponding optimized features.
- The original feature can be represented by an original feature map, denoted F, and the optimized feature can likewise be represented by an optimized feature map, denoted F''. Specifically, F ∈ R^{C×H×W}, where R denotes the feature (spectrogram) space, C is the number of channels of the feature map, H is its height, and W is its width.
- For F, the corresponding one-dimensional channel attention map, denoted M_C(F), can be calculated. Each channel of F can be regarded as a feature detector, and channel attention mainly focuses on what is meaningful in the input. To compute channel attention efficiently, this embodiment uses max pooling and average pooling to compress F along the spatial dimensions, obtaining two different spatial context descriptors F^c_max and F^c_avg, which are then passed through a shared network composed of a multi-layer perceptron (MLP) to obtain M_C(F), that is:
- M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max)))
- where σ is the sigmoid function, W_0 ∈ R^{C/r×C} and W_1 ∈ R^{C×C/r} are the weights of the shared MLP, r is the compression ratio, and W_0 is followed by a ReLU activation. Multiplying F element-wise by M_C(F) yields the intermediate feature map F' ∈ R^{C×H×W}, where C, H, and W are as described above.
- For F', the corresponding two-dimensional spatial attention map, denoted M_S(F'), can be calculated; spatial attention mainly focuses on where the informative parts are located. Max pooling and average pooling are applied to F' along the channel dimension to obtain two different feature descriptors F'^s_max and F'^s_avg, the two descriptors are merged by concatenation, and a convolution operation is used to generate M_S(F'), that is:
- M_S(F') = σ(f^{7×7}([F'^s_avg; F'^s_max]))
- where σ is the sigmoid function and f^{7×7} denotes a convolutional layer with a 7×7 kernel. Multiplying F' element-wise by M_S(F') yields F'', which can be regarded as the optimized feature.
- After the optimized feature map F'' is obtained, the server can determine it as the depth feature of the input spectrogram and perform the age recognition processing based on this depth feature.
- It is worth noting that the intermediate feature layer of the age recognition network model may consist of two or more layers (here "or more" includes the stated number itself, the same below), and the feature optimization layer can be placed after any intermediate feature layer. For example, suppose the intermediate feature layer includes two layers, called the first layer and the second layer in order from the input. The feature optimization layer can be placed after the first layer and before the second layer; in this case, the input of the feature optimization layer is the output of the first layer, and the optimized feature output by the feature optimization layer is fed into the second layer. The feature optimization layer can also be placed after the second layer; in this case, the input of the feature optimization layer is the output of the second layer, and the optimized feature output by the feature optimization layer is directly used as the final depth feature for age recognition.
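- To make the channel-plus-spatial attention above concrete, the following sketch implements one possible feature optimization layer in PyTorch following the formulas given; the reduction ratio r = 16 and the exact placement of the module are assumptions rather than values fixed by the patent.

```python
import torch
from torch import nn

class FeatureOptimizationLayer(nn.Module):
    """Channel attention M_C followed by spatial attention M_S, as in the text:
    F' = M_C(F) * F,  F'' = M_S(F') * F'."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP: W_1(ReLU(W_0(.)))
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # f^{7x7}

    def forward(self, f):                               # f: (B, C, H, W)
        b, c, _, _ = f.shape
        # channel attention: pool over the spatial dimensions
        avg = self.mlp(f.mean(dim=(2, 3)))              # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))               # MLP(MaxPool(F))
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        f1 = m_c * f                                    # intermediate feature map F'
        # spatial attention: pool over the channel dimension
        avg_s = f1.mean(dim=1, keepdim=True)            # F'_avg^s
        max_s = f1.amax(dim=1, keepdim=True)            # F'_max^s
        m_s = torch.sigmoid(self.conv(torch.cat([avg_s, max_s], dim=1)))
        return m_s * f1                                 # optimized feature map F''
```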
- After the depth feature is obtained, the target age group to which the target user belongs can be determined according to the depth feature. The age recognition network model in this embodiment is in an end-to-end form, so the age recognition process can be implemented in the output layer of the age recognition network model; in the output layer, the server performs calculation on the depth feature in combination with the data of the expanded voice samples, so as to determine the target age group to which the target user belongs.
- the span range of the age group can be set according to the actual situation.
- Further, when the server determines the target age group to which the target user belongs, it can query a preset speech library according to the target age group to obtain the corresponding target speech template. The target speech template may be pre-set and stored by the relevant managers, with different speech templates for different age groups; when the server obtains the target speech template, it can broadcast voice according to the target speech template, so as to perform voice collection on the target user.
- In this embodiment, real voice samples are obtained from a preset database, and sample expansion is performed on the real voice samples based on the generative adversarial network (GAN) to obtain expanded voice samples; the age recognition network model is obtained through training with the expanded voice samples; the target voice of the target user is obtained and converted into a corresponding input spectrogram; and the depth feature of the input spectrogram is extracted through the age recognition network model, and the target age group to which the target user belongs is determined according to the depth feature.
- In this way, this embodiment first obtains large-scale data samples by means of GAN-based data expansion; while the number of data samples is increased, the expanded samples better conform to the distribution of the real data (that is, the quality of the samples is guaranteed). Training an end-to-end network model on a sufficient number of sufficiently realistic data samples allows the training process to capture the hidden regularities of the data samples more accurately, which improves the performance of the resulting network model and therefore the accuracy of subsequent age recognition using it. The target speech to be recognized is then converted into a spectrogram, and feature extraction is performed on the spectrogram through the obtained network model to obtain the depth features of the target speech. These depth features cover more characteristics than traditional signal-level features and pay more attention to the age attribute representations in the target speech that are not easily recognized, thereby improving the generalization ability and accuracy of age recognition.
- FIG. 3 is a schematic flowchart of a second embodiment of the age identification method of this application.
- step S30 includes:
- Step S31: Obtain the target voice of the target user, and determine whether the voice duration of the target voice is greater than a preset duration threshold;
- When the server obtains the target voice of the target user, it can first determine whether the voice duration of the target voice is greater than a preset duration threshold; the preset duration threshold may be set according to actual needs.
- Step S32: If the voice duration is greater than the preset duration threshold, perform voice cutting on the target voice to obtain two or more voice segments, and convert each voice segment into a corresponding segment spectrogram;
- If the voice duration is greater than the preset duration threshold, the server performs voice cutting on the target voice to obtain two or more voice segments, and then converts each voice segment into a corresponding segment spectrogram.
- The duration of each voice segment can be determined by different rules defined according to the actual situation. For example, when the voice duration of the target voice is greater than the preset duration threshold, different voice durations can correspond to different numbers of voice segments: if the preset duration threshold is 3 seconds, a voice duration greater than 3 seconds and not greater than 4 seconds may correspond to 2 voice segments, and a voice duration greater than 4 seconds may correspond to 3 voice segments; the target voice is then cut evenly according to the determined number of segments so that every voice segment has the same duration. For another example, when the voice duration of the target voice is greater than the preset duration threshold, the voice can be cut at every preset segment length: if the preset duration threshold is 3 seconds, the voice can be cut every 3 seconds, so a voice of 5 seconds is cut into two segments of 3 seconds and 2 seconds respectively. Of course, other cutting methods can also be used in practice. It is worth noting that if the voice duration of the target voice is less than or equal to the preset duration threshold, the entire target voice can be directly converted into the corresponding input spectrogram and the age recognition process of step S40 is performed.
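- As an illustration of the "cut every preset segment length" rule, the sketch below splits a waveform array into fixed-length segments; the 16 kHz sampling rate and the 3-second threshold are assumptions taken from the example above.

```python
import numpy as np

def cut_voice(waveform, sr=16000, threshold_s=3.0):
    """Return the whole voice if it is short enough, otherwise a list of
    segments cut every `threshold_s` seconds (the last one may be shorter)."""
    duration = len(waveform) / sr
    if duration <= threshold_s:
        return [waveform]
    step = int(threshold_s * sr)
    return [waveform[i:i + step] for i in range(0, len(waveform), step)]

# e.g. a 5-second voice at 16 kHz -> two segments of 3 s and 2 s
segments = cut_voice(np.random.randn(5 * 16000))
assert [len(s) / 16000 for s in segments] == [3.0, 2.0]
```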
- the step S40 includes:
- Step S41: Extract the depth feature of each segment spectrogram through the age recognition network model, and respectively determine the segment age group corresponding to each segment spectrogram according to its depth feature;
- In this case, the depth feature of each segment spectrogram can be extracted through the age recognition network model, and the segment age group corresponding to each segment spectrogram can be determined according to the depth feature of that segment spectrogram.
- For the feature extraction process of each segment spectrogram and the determination of its segment age group, please refer to step S40 above; this will not be repeated here.
- Step S42: Determine the target age group to which the target user belongs according to the segment age group corresponding to each segment spectrogram.
- After the segment age group corresponding to each segment spectrogram is obtained, the target age group to which the target user belongs can be determined from these segment age groups.
- Specifically, if the segment age groups corresponding to all segment spectrograms are the same, that same segment age group is determined as the target age group to which the target user belongs; if the segment age groups are not all the same, a voting decision rule can be defined according to the actual situation, and the target age group to which the target user belongs is determined according to the rule and the plurality of determined segment age groups. For example, the voting decision rule may use a median average: if the target voice corresponds to three segment spectrograms whose segment age groups are 22 to 24 years old, 26 to 28 years old, and 28 to 30 years old, the medians of the three age groups are 23, 27, and 29, and the average of the three medians, 26.3, is taken as the target age of the target user to determine the target age group to which the target user belongs. The voting decision rule may also be a majority rule: if the target voice corresponds to three segment spectrograms, two of which correspond to the age group 22 to 24 years old and one to the age group 26 to 28 years old, the age group 22 to 24 years old, which corresponds to the largest number of segment spectrograms, is determined as the target age group to which the target user belongs.
- Other voting decision rules can also be used, such as calculating a credibility score for each age group based on the voice duration of each voice segment and taking the age group with the highest credibility as the target age group to which the target user belongs.
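- The following sketch illustrates two of the voting decision rules mentioned above, combined for convenience as a majority vote with a median-average fallback; representing each age group as a (low, high) tuple is an illustrative encoding, not something specified by the patent.

```python
from collections import Counter

def decide_age_group(segment_groups):
    """segment_groups: list of (low, high) age-group tuples, one per voice segment.
    Returns the majority age group if one dominates; otherwise falls back to the
    median-average rule and returns a single target age."""
    counts = Counter(segment_groups)
    group, votes = counts.most_common(1)[0]
    if votes > len(segment_groups) // 2:          # clear majority
        return group
    medians = [(lo + hi) / 2 for lo, hi in segment_groups]
    return sum(medians) / len(medians)            # e.g. (23 + 27 + 29) / 3 ≈ 26.3

print(decide_age_group([(22, 24), (22, 24), (26, 28)]))   # -> (22, 24)
print(decide_age_group([(22, 24), (26, 28), (28, 30)]))   # -> 26.33...
```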
- In this embodiment, a long target voice is cut into multiple voice segments and age recognition is performed on each segment separately, which reduces the amount of calculation and improves the efficiency of age recognition; determining the target age group to which the target user belongs according to the multiple recognition results also helps reduce recognition errors caused by accidental factors in the recognition process and improves the accuracy of age recognition.
- FIG. 4 is a schematic flowchart of a third embodiment of an age identification method according to this application.
- the method further includes:
- Step S50: When receiving a collection instruction, dial a collection call according to the collection instruction, and obtain the corresponding connected voice after the call is connected;
- When the server receives the collection instruction, it can obtain the corresponding collection item information according to the collection instruction, such as the loan user (target user) of a certain loan and the contact information, then dial the collection call based on the collection item information, and obtain the connected voice of the other party after the call is connected.
- The collection instruction can be triggered by a manager through a certain terminal, or it can be triggered automatically by a collection plan stored in the server when the time reaches the collection time set by the collection plan.
- Step S60: Judge whether there are two or more user voices in the connected voice;
- When the connected voice is obtained, the server will judge whether there are two or more user voices in it.
- Step S70: If there are two or more user voices in the connected voice, determine the target voice of the target user from the user voices according to the voice duration and/or voice volume of each user voice.
- When the target user answers the call, he or she may be in a noisy environment or talking with another person; in that case, the connected voice obtained by the server will contain two or more user voices, and the server needs to determine the target voice of the target user among them in order to accurately identify the age of the target user. Specifically, the server can distinguish the individual user voices according to frequency, obtain the voice attributes of each user voice (the voice attributes include voice frequency, voice duration, voice volume, and the like), and then determine the target voice of the target user based on these voice attributes.
- For example, considering that the target user is usually the one holding the phone throughout the call, the user voice with the longest voice duration can be determined as the target voice of the target user; for another example, considering that the target user is usually the user closest to the phone and therefore the loudest, the user voice with the largest voice volume can be determined as the target voice of the target user. Of course, the two factors can also be combined: for each user voice, a duration score and a volume score are obtained from its voice duration and voice volume respectively, the two scores are added to obtain a comprehensive score, and the user voice with the highest comprehensive score is determined as the target voice of the target user. It is worth noting that if there is only one user voice in the connected voice, that user voice can be directly used as the target voice.
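- A minimal sketch of the combined duration-and-volume scoring described above; the normalization of the two scores and their equal weighting are assumptions, since the text only states that a duration score and a volume score are added.

```python
def pick_target_voice(user_voices):
    """user_voices: list of dicts like {"id": ..., "duration": seconds, "volume": dB}.
    Score each voice by normalized duration plus normalized volume and
    return the voice with the highest comprehensive score."""
    if len(user_voices) == 1:
        return user_voices[0]
    max_d = max(v["duration"] for v in user_voices) or 1.0
    max_v = max(v["volume"] for v in user_voices) or 1.0
    def score(v):
        return v["duration"] / max_d + v["volume"] / max_v
    return max(user_voices, key=score)

voices = [{"id": "A", "duration": 12.0, "volume": 55.0},
          {"id": "B", "duration": 4.0, "volume": 62.0}]
print(pick_target_voice(voices)["id"])   # -> "A" (longest and nearly loudest)
```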
- After the target voice is determined, the relevant age recognition processing of steps S30 and S40 can be executed.
- When the server of this embodiment performs call collection, if there are two or more user voices in the connected voice, the target voice is determined first and then the subsequent age recognition processing is performed, which avoids interference from the voices of multiple users with the recognition of the target user's age.
- the method further includes:
- When the server obtains the target age group to which the current target user belongs, it may also store it as a historical target age group.
- When a preset number of historical target age groups have been accumulated, or a preset period has elapsed, these historical target age groups can be summarized and counted to obtain the corresponding collection age distribution. For example, among a preset number of 100 historical target users, 30 belong to the age group of 26 to 28 and 70 belong to the age group of 30 to 32; or, for example, in the collection cycle of last month there were a total of 100 historical target users, among whom 30 are in the age group of 26 to 28 and 70 are in the age group of 30 to 32.
- Based on the collection age distribution, it is determined whether there is an age group with an abnormal number of users, and if there is, corresponding abnormal prompt information is sent to the corresponding management terminal.
- When the server obtains the collection age distribution, it can determine, based on this distribution, whether there is an age group with an abnormal number of users. For the abnormality judgment, a corresponding abnormality rule can be set in advance and the judgment made according to that rule, for example a threshold number or proportion of users for an age group: when the number of users corresponding to a certain age group exceeds the threshold number or proportion, the number of users in that age group can be considered abnormal. In this embodiment, if there is an age group with an abnormal number of users, it may be that the current loan business carries a certain risk in that age group, or it may be that the recognition ability of the age recognition network model has degraded, causing too many users to be identified as belonging to that age group; in either case, the server can send corresponding abnormal prompt information to the corresponding management terminal to prompt the relevant management personnel to inspect and handle the situation in time.
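- A sketch of the proportion-based abnormality rule described above: the collection age distribution is summarized from the historical target age groups and any age group whose share of users exceeds a threshold is flagged; the 60% threshold is purely illustrative.

```python
from collections import Counter

def find_abnormal_age_groups(historical_groups, max_share=0.6):
    """historical_groups: list of age-group labels, one per historical target user.
    Returns the age groups whose share of users exceeds `max_share`."""
    total = len(historical_groups)
    distribution = Counter(historical_groups)          # collection age distribution
    return [(group, count / total)
            for group, count in distribution.items()
            if count / total > max_share]

history = ["26-28"] * 30 + ["30-32"] * 70
print(find_abnormal_age_groups(history))   # -> [('30-32', 0.7)] -> send abnormal prompt
```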
- The server of this embodiment can analyze the collection age distribution of historical target users and judge from that distribution whether an abnormal situation exists, which is conducive to the timely detection of abnormal situations, reducing business risks and safeguarding the stability of the age recognition network model.
- an embodiment of the present application also provides an age identification device.
- the age identification device includes:
- the sample expansion module is used to obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative adversarial network (GAN) to obtain expanded voice samples;
- the model training module is used to obtain an age recognition network model through training with the expanded voice samples;
- the voice conversion module is used to obtain the target voice of the target user and convert the target voice into a corresponding input spectrogram
- the age determination module is configured to extract the depth characteristics of the input spectrogram through the age recognition network model, and determine the target age group to which the target user belongs according to the depth characteristics.
- Each virtual function module of the above-mentioned age recognition device is stored in the memory 1005 of the age recognition device shown in FIG. 1.
- the age recognition network model includes an intermediate feature layer and a feature optimization layer
- the age determination module includes:
- a feature extraction unit configured to perform original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain corresponding original features
- the feature optimization unit is used to perform feature optimization on the original feature based on the attention mechanism through the feature optimization layer of the age recognition network model to obtain the corresponding optimized feature, and determine the optimized feature as the input spectrogram The depth characteristics.
- the original feature includes an original feature map F, and the optimized feature includes an optimized feature map F''.
- The feature optimization unit is further configured to, through the feature optimization layer of the age recognition network model: calculate the one-dimensional channel attention map corresponding to the original feature map F of the original feature, and perform element-wise multiplication on F and the channel attention map to obtain the corresponding intermediate feature map F'; and calculate the two-dimensional spatial attention map corresponding to F', and perform element-wise multiplication on F' and the spatial attention map to obtain the corresponding optimized feature map F''.
- the voice conversion module includes:
- the duration judging unit is configured to obtain the target voice of the target user, and determine whether the voice duration of the target voice is greater than a preset duration threshold;
- the voice segmentation unit is configured to perform voice cutting on the target voice if the voice duration is greater than the preset duration threshold to obtain two or more voice segments, and respectively convert each voice segment into a corresponding segment spectrogram ;
- the age determination module is further configured to extract the depth feature of each segment spectrogram through the age recognition network model, determine the segment age group corresponding to each segment spectrogram according to its depth feature, and determine the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
- the age identification device further includes:
- the voice acquisition module is configured to, when receiving the collection instruction, dial a collection call according to the collection instruction, and obtain the corresponding connected voice after the call is connected;
- the voice judgment module is used to judge whether there are two or more user voices in the connected voice;
- the voice determining module is configured to, if there are two or more user voices in the connected voice, determine the target voice of the target user from the user voices according to the voice duration and/or voice volume of each user voice.
- the age identification device further includes:
- the voice collection module is used to obtain the corresponding target speech template according to the target age group, and perform voice collection on the target user according to the target speech template.
- the age identification device further includes:
- the distribution acquisition module is used to acquire the historical target age groups of a preset number of historical target users, or of the historical target users within a preset period, and obtain the collection age distribution according to the historical target age groups;
- the abnormality judgment module is used for judging whether there is an age group with an abnormal number of users based on the collection age distribution, and if it exists, sending corresponding abnormality prompt information to the corresponding management terminal.
- each module in the above-mentioned age recognition device corresponds to each step in the above-mentioned embodiment of the age recognition method, and the functions and realization processes thereof will not be repeated here.
- the embodiment of the present application also provides a computer-readable storage medium.
- the computer-readable storage medium of the present application stores a computer program, and when the computer program is executed by a processor, the steps of the above-mentioned age identification method are realized.
- The storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
- The technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the method described in each embodiment of the present application.
Abstract
An age recognition method, apparatus and device, and a computer-readable storage medium, relating to the technical field of artificial intelligence. Said method comprises: acquiring a real voice sample from a preset database, and performing sample expansion on the real voice sample on the basis of a generative adversarial network (GAN), to obtain an expanded voice sample (S10); training the expanded voice sample, to obtain an age recognition network model (S20); acquiring a target voice of a target user, and converting the target voice into a corresponding input spectrogram (S30); and extracting a depth feature of the input spectrogram by means of the age recognition network model, and determining, according to the depth feature, a target age group to which the target user belongs (S40). The described method can improve the accuracy of age recognition.
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 12, 2020, with application number 202010094834.9 and entitled "Age Recognition Method, Apparatus, Device, and Computer-readable Storage Medium", the entire contents of which are incorporated herein by reference.
本申请涉及人工智能技术领域,尤其涉及一种年龄识别方法、装置、设备及计算机可读存储介质。This application relates to the field of artificial intelligence technology, and in particular to an age identification method, device, equipment, and computer-readable storage medium.
发明人发现,目前,贷款公司在进行催收时,为了增强用户体验和催收效果,往往会根据对话过程中的用户语音对用户年龄进行识别,然后根据用户年龄采用不同的催收方式进行催收。发明人意识到,传统的语音年龄识别的方法,大多是基于声音的语音信号学特征进行统计学分析,进而确定说话者的年龄;但这种方法由于语音信号学特征的限制,其泛化能力不足,在实际应用中识别准确率低,应用效果不佳。The inventor found that currently, in order to enhance user experience and collection effects, loan companies often identify the user’s age based on the user’s voice during the conversation, and then adopt different collection methods for collection according to the user’s age. The inventor realizes that traditional speech age recognition methods are mostly based on the statistical analysis of the speech signal characteristics of the sound to determine the age of the speaker; however, due to the limitation of speech signal characteristics, this method has its generalization ability. Insufficient, the recognition accuracy is low in practical applications, and the application effect is not good.
发明内容Summary of the invention
本申请的主要目的在于提供一种年龄识别方法、装置、设备及计算机可读存储介质,旨在解决传统的年龄识别准确性低的技术问题。The main purpose of this application is to provide an age identification method, device, equipment, and computer-readable storage medium, aiming to solve the traditional technical problem of low accuracy in age identification.
为实现上述目的,本申请实施例提供一种年龄识别方法,所述年龄识别方法包括:To achieve the foregoing objective, an embodiment of the present application provides an age identification method, and the age identification method includes:
从预设数据库中获取真实语音样本,并基于生成式对抗网络GAN对所述真实语音样本进行样本扩充,得到扩充语音样本;Obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative confrontation network GAN to obtain expanded voice samples;
通过所述扩充语音样本训练得到年龄识别网络模型;Obtaining an age recognition network model through the extended speech sample training;
获取目标用户的目标语音,并将所述目标语音转换为对应的输入频谱图;Acquiring the target voice of the target user, and converting the target voice into a corresponding input spectrogram;
通过所述年龄识别网络模型提取所述输入频谱图的深度特征,并根据所述深度特征确定所述目标用户所属的目标年龄段。The depth feature of the input spectrogram is extracted through the age recognition network model, and the target age group to which the target user belongs is determined according to the depth feature.
此外,为实现上述目的,本申请实施例还提供一种年龄识别装置,所述年龄识别装置包括:In addition, in order to achieve the foregoing objective, an embodiment of the present application further provides an age identification device, the age identification device including:
样本扩充模块,用于从预设数据库中获取真实语音样本,并基于生成式对抗网络GAN对所述真实语音样本进行样本扩充,得到扩充语音样本;The sample expansion module is used to obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative confrontation network GAN to obtain expanded voice samples;
模型训练模块,用于通过所述扩充语音样本训练得到年龄识别网络模型;A model training module is used to obtain an age recognition network model through the extended speech sample training;
语音转换模块,用于获取目标用户的目标语音,并将所述目标语音转换为对应的输入频谱图;The voice conversion module is used to obtain the target voice of the target user and convert the target voice into a corresponding input spectrogram;
年龄确定模块,用于通过所述年龄识别网络模型提取所述输入频谱图的深度特征,并根据所述深度特征确定所述目标用户所属的目标年龄段。The age determination module is configured to extract the depth characteristics of the input spectrogram through the age recognition network model, and determine the target age group to which the target user belongs according to the depth characteristics.
此外,为实现上述目的,本申请实施例还提供一种年龄识别设备,所述年龄识别设备包括处理器、存储器、以及存储在所述存储器上并可被所述处理器执行的计算机程序,其中所述计算机程序被所述处理器执行时,实现如上述的年龄识别方法,该年龄识别方法包括以下步骤:In addition, in order to achieve the above object, an embodiment of the present application further provides an age identification device, the age identification device includes a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein When the computer program is executed by the processor, the age identification method as described above is realized, and the age identification method includes the following steps:
从预设数据库中获取真实语音样本,并基于生成式对抗网络GAN对所述真实语音样本进行样本扩充,得到扩充语音样本;Obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative confrontation network GAN to obtain expanded voice samples;
通过所述扩充语音样本训练得到年龄识别网络模型;Obtaining an age recognition network model through the extended speech sample training;
获取目标用户的目标语音,并将所述目标语音转换为对应的输入频谱图;Acquiring the target voice of the target user, and converting the target voice into a corresponding input spectrogram;
通过所述年龄识别网络模型提取所述输入频谱图的深度特征,并根据所述深度特征确定所述目标用户所属的目标年龄段。The depth feature of the input spectrogram is extracted through the age recognition network model, and the target age group to which the target user belongs is determined according to the depth feature.
此外,为实现上述目的,本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,其中所述计算机程序被处理器执行时,实现如上述的 年龄识别方法,该年龄识别方法包括以下步骤:In addition, in order to achieve the foregoing objective, the embodiments of the present application also provide a computer-readable storage medium having a computer program stored on the computer-readable storage medium, and when the computer program is executed by a processor, the above-mentioned age An identification method, the age identification method includes the following steps:
从预设数据库中获取真实语音样本,并基于生成式对抗网络GAN对所述真实语音样本进行样本扩充,得到扩充语音样本;Obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on the generative confrontation network GAN to obtain expanded voice samples;
通过所述扩充语音样本训练得到年龄识别网络模型;Obtaining an age recognition network model through the extended speech sample training;
获取目标用户的目标语音,并将所述目标语音转换为对应的输入频谱图;Acquiring the target voice of the target user, and converting the target voice into a corresponding input spectrogram;
通过所述年龄识别网络模型提取所述输入频谱图的深度特征,并根据所述深度特征确定所述目标用户所属的目标年龄段。The depth feature of the input spectrogram is extracted through the age recognition network model, and the target age group to which the target user belongs is determined according to the depth feature.
本申请实施例能够提高年龄识别的泛化能力,提高年龄识别的准确性。The embodiments of the present application can improve the generalization ability of age recognition and improve the accuracy of age recognition.
图1为本申请实施例方案中涉及的年龄识别设备的硬件结构示意图;FIG. 1 is a schematic diagram of the hardware structure of the age recognition device involved in the solution of the embodiment of the application;
图2为本申请年龄识别方法第一实施例的流程示意图;FIG. 2 is a schematic flowchart of the first embodiment of the age identification method of this application;
图3为本申请年龄识别方法第二实施例的流程示意图;FIG. 3 is a schematic flowchart of a second embodiment of the age identification method of this application;
图4为本申请年龄识别方法第三实施例的流程示意图。FIG. 4 is a schematic flowchart of a third embodiment of the age identification method of this application.
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
应当理解,此处所描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application.
本申请的技术方案可应用于人工智能、智慧城市、数字医疗、区块链和/或大数据技术领域。可选的,本申请涉及的数据如语音样本、深度特征和/或确定出的年龄段信息等可存储于数据库中,或者可以存储于区块链中,比如通过区块链分布式存储,本申请不做限定。The technical solution of this application can be applied to the fields of artificial intelligence, smart city, digital medical care, blockchain and/or big data technology. Optionally, the data involved in this application, such as voice samples, depth features, and/or determined age group information, can be stored in a database, or can be stored in a blockchain, such as distributed storage through a blockchain. The application is not limited.
本申请实施例涉及的年龄识别方法主要应用于年龄识别设备,该年龄识别设备可以是服务器、个人计算机(personal computer,PC)、笔记本电脑等具有数据处理功能的设备。The age identification method involved in the embodiments of the present application is mainly applied to an age identification device, and the age identification device may be a device with a data processing function such as a server, a personal computer (PC), or a notebook computer.
参照图1,图1为本申请实施例方案中涉及的年龄识别设备的硬件结构示意图。本申请实施例中,该年龄识别设备可以包括处理器和存储器。可选的,该年龄识别设备还可包括通信总线、用户接口和/或网络接口。例如,该年龄识别设备包括处理器1001(例如中央处理器Central Processing Unit,CPU),通信总线1002,用户接口1003,网络接口1004,存储器1005。其中,通信总线1002用于实现这些组件之间的连接通信;用户接口1003可以包括显示屏(Display)、输入单元比如键盘(Keyboard);网络接口1004可选的可以包括标准的有线接口、无线接口(如无线保真WIreless-FIdelity,WI-FI接口);存储器1005可以是高速随机存取存储器(random access memory,RAM),也可以是稳定的存储器(non-volatile memory),例如磁盘存储器,存储器1005可选的还可以是独立于前述处理器1001的存储装置。本领域技术人员可以理解,图1中示出的硬件结构并不构成对本申请的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。Referring to FIG. 1, FIG. 1 is a schematic diagram of the hardware structure of the age recognition device involved in the solution of the embodiment of this application. In the embodiment of the present application, the age recognition device may include a processor and a memory. Optionally, the age identification device may also include a communication bus, a user interface, and/or a network interface. For example, the age identification device includes a processor 1001 (for example, a central processing unit, a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Among them, the communication bus 1002 is used to realize the connection and communication between these components; the user interface 1003 may include a display (Display), an input unit such as a keyboard (Keyboard); the network interface 1004 may optionally include a standard wired interface, a wireless interface (Such as wireless fidelity WIreless-FIdelity, WI-FI interface); the memory 1005 can be a high-speed random access memory (random access memory, RAM), or a stable memory (non-volatile memory), such as a disk memory, a memory Optionally, 1005 may also be a storage device independent of the foregoing processor 1001. Those skilled in the art can understand that the hardware structure shown in FIG. 1 does not constitute a limitation to the present application, and may include more or less components than those shown in the figure, or a combination of certain components, or different component arrangements.
继续参照图1,图1中作为一种计算机可读存储介质的存储器1005可以包括操作系统、网络通信模块以及计算机程序。在图1中,网络通信模块可用于连接预设数据库,与数据库进行数据通信;而处理器1001可以调用存储器1005中存储的计算机程序,并执行本申请实施例提供的年龄识别方法。Continuing to refer to FIG. 1, the memory 1005 as a computer-readable storage medium in FIG. 1 may include an operating system, a network communication module, and a computer program. In FIG. 1, the network communication module can be used to connect to a preset database and perform data communication with the database; and the processor 1001 can call a computer program stored in the memory 1005 and execute the age identification method provided in the embodiment of the present application.
Based on the above hardware architecture, the embodiments of the age recognition method of this application are proposed.
An embodiment of this application provides an age recognition method.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a first embodiment of the age recognition method of this application.
In this embodiment, the age recognition method includes the following steps:
Step S10: obtain real voice samples from a preset database, and perform sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples.
At present, when a loan company carries out collection, in order to improve the user experience and the collection effect, it often identifies the user's age from the user's voice during the conversation and then adopts a collection strategy suited to that age. Traditional voice-based age recognition methods mostly perform statistical analysis on speech-signal features to determine the speaker's age; limited by those signal features, however, such methods generalize poorly, so their recognition accuracy in practice is low and their application effect is unsatisfactory. To address this, this embodiment proposes an age recognition method. First, large-scale data samples are obtained by GAN-based data expansion, which increases the number of data samples while keeping them consistent with the distribution of real data (i.e., it guarantees sample quality); an end-to-end network model is then trained on a sufficient number of sufficiently realistic samples, so that the training process can capture the hidden regularities of the data more accurately, improving the performance of the resulting network model and, in turn, the accuracy of subsequent age recognition using that model. The target voice to be recognized is then converted into a spectrogram, and feature extraction is performed on the spectrogram through the trained network model to obtain depth features of the target voice. Compared with traditional age recognition based on signal features, these depth features cover more characteristics and are better able to capture age-related attributes of the target voice that are otherwise hard to identify. Using the depth features to identify the target age group of the target user helps capture the relationship between age and voice accurately, thereby improving the generalization ability and accuracy of age recognition.
The age recognition method in this embodiment is implemented by an age recognition device, which may be a server, a personal computer, a notebook computer or the like; in this embodiment a server is taken as an example. The server may be a server in a collection system and is connected to a preset database. The database stores a number of real voice samples collected in advance, which may be in the form of original speech or in the form of spectrograms. These real voice samples carry corresponding sample annotations, and the annotation content includes the age group of the user to whom the real voice sample belongs (the annotation may of course also include other information).
In this embodiment, before age recognition is performed, the age recognition network model used to recognize age needs to be trained first; the age recognition network model is constructed as a machine-learning deep neural network. Considering that the real voice samples obtainable in practice may suffer from data imbalance, while the quantity and quality of samples have a large influence on the training result (the model's capability), in this embodiment the real voice samples need to be expanded to obtain expanded voice samples and thus a large-scale set of data samples.
When performing sample expansion in this embodiment, in order to improve the efficiency of sample expansion and guarantee the quality of the expanded samples, the expansion can be carried out based on a generative adversarial network (GAN). It is worth noting that the real voice samples used for sample expansion should be in the form of spectrograms (including the three-dimensional information of time, frequency and amplitude); real voice samples in the form of original speech must first be converted into corresponding spectrograms through a short-time Fourier transform (or another method). A generative adversarial network (GAN) comprises two sub-networks, which may be called the generator G and the discriminator D. G is a network that generates expanded samples: from random noise it generates a simulated sample that follows the distribution of the real voice samples as closely as possible, denoted G(z). D is a discriminative network that judges whether an input sample is "real": an output of 1 indicates real, and an output of 0 indicates it cannot be real. During training, the goal of G is to generate simulated samples realistic enough to deceive D, while the goal of D is to separate the simulated samples generated by G from the real voice samples as well as possible. In the ideal case, G can generate simulated samples G(z) that are indistinguishable from real ones, so that D can hardly judge whether G(z) is real, i.e., D(G(z)) = 0.5. When this condition is met, a fully trained G is considered to have been obtained (i.e., GAN training is complete), and it is used to expand the real voice samples to obtain the expanded voice samples.
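As a minimal illustrative sketch of the adversarial training described above (not the implementation of this application), the following Python/PyTorch code trains a generator and a discriminator on spectrogram samples; the network sizes, the 128×128 single-channel spectrogram shape and the noise dimension are assumptions made only for illustration.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=100, spec_shape=(1, 128, 128)):
        super().__init__()
        self.spec_shape = spec_shape
        out_dim = spec_shape[0] * spec_shape[1] * spec_shape[2]
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, out_dim), nn.Tanh(),  # assumes spectrograms scaled to [-1, 1]
        )

    def forward(self, z):
        # G(z): a simulated spectrogram intended to follow the real-sample distribution.
        return self.net(z).view(-1, *self.spec_shape)

class Discriminator(nn.Module):
    def __init__(self, spec_shape=(1, 128, 128)):
        super().__init__()
        in_dim = spec_shape[0] * spec_shape[1] * spec_shape[2]
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1), nn.Sigmoid(),  # 1 = "real", 0 = "cannot be real"
        )

    def forward(self, x):
        return self.net(x)

def train_step(G, D, real_specs, opt_g, opt_d, noise_dim=100):
    bce = nn.BCELoss()
    batch = real_specs.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # D: separate real spectrograms from generated samples G(z).
    fake_specs = G(torch.randn(batch, noise_dim)).detach()
    d_loss = bce(D(real_specs), real_labels) + bce(D(fake_specs), fake_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # G: fool D; training is considered done when D(G(z)) approaches 0.5.
    g_loss = bce(D(G(torch.randn(batch, noise_dim))), real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()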
Step S20: obtain an age recognition network model by training with the expanded voice samples.
In this embodiment, after the expanded voice samples are obtained, the server trains on them to obtain the age recognition network model. For convenience of subsequent processing, the age recognition network model may be set up in an end-to-end form, that is, its input is a voice and its output is the age group to which the voice belongs. Compared with the traditional approach of extracting features with one model and classifying with another, the end-to-end approach does not require each stage to be annotated separately, which reduces the annotation workload and also helps improve the accuracy of age recognition. For the age recognition network model of this embodiment, in order to improve generalization and recognition accuracy, a deep network model may be used, for example one based on the classic deep residual network ResNet50; of course, when adopting the ResNet50 architecture, parts of the structure may also be adjusted according to the actual situation.
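A minimal sketch of how such an end-to-end model could be assembled on top of ResNet50 is shown below; the single-channel input adaptation, the number of age groups and the use of torchvision's resnet50 (with the weights argument of recent torchvision versions) are illustrative assumptions rather than a prescribed implementation.

import torch.nn as nn
from torchvision.models import resnet50

def build_age_model(num_age_groups: int) -> nn.Module:
    # Backbone: the classic deep residual network ResNet50 (randomly initialised here).
    model = resnet50(weights=None)
    # Spectrogram inputs have a single channel, so the stem convolution is adapted.
    model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # End-to-end head: input is a spectrogram, output is directly one logit per age group.
    model.fc = nn.Linear(model.fc.in_features, num_age_groups)
    return model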
Step S30: obtain the target voice of the target user, and convert the target voice into a corresponding input spectrogram.
In this embodiment, once the age recognition network model has been obtained, the server can use it to perform voice-based age recognition during the collection process. Specifically, when a collection item needs to be collected, the collection item information corresponding to the item is first obtained, for example the borrower (target user) and contact information of a given loan, and the collection call is then dialed according to that information. When the call is connected, a general greeting may first be played to confirm the callee's identity; when the target user replies by voice over the phone, the server obtains the target voice of the target user and converts it into the corresponding input spectrogram by means of a short-time Fourier transform for subsequent analysis and processing.
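The short-time Fourier transform step can be sketched as follows; the sampling rate, window length and hop size are illustrative assumptions, and the log-magnitude output is one common way to retain the time, frequency and amplitude information mentioned above.

import numpy as np
from scipy.signal import stft

def voice_to_spectrogram(samples: np.ndarray, sr: int = 16000) -> np.ndarray:
    nperseg = int(0.025 * sr)             # 25 ms analysis window
    noverlap = nperseg - int(0.010 * sr)  # 10 ms hop between windows
    _, _, Z = stft(samples, fs=sr, nperseg=nperseg, noverlap=noverlap)
    # Log-magnitude spectrogram: retains time, frequency and amplitude information.
    return np.log1p(np.abs(Z))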
Step S40: extract depth features of the input spectrogram through the age recognition network model, and determine the target age group to which the target user belongs according to the depth features.
In this embodiment, once the input spectrogram is obtained, feature extraction is performed on it through the age recognition network model to obtain the corresponding depth features, and the target age group to which the target user belongs is then determined according to the depth features.
Further, a speech signal has both time-domain and frequency-domain attributes; both are reflected in the spectrogram and correspond to a variety of characteristics, some of which may be related to age and some of which may not (such as environmental noise characteristics). To improve the accuracy of age recognition, an attention mechanism may be introduced into the age recognition network model when performing feature extraction in this embodiment; equivalently, an attention module is constructed and embedded into the age recognition network model, for example inserted after an intermediate feature layer to refine features, with the resulting optimized features then used for subsequent processing (fed into the next layer, or taken as the final features). The age recognition network model in this embodiment includes an intermediate feature layer and a feature optimization layer, and step S40 includes:
performing original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain corresponding original features;
The age recognition network model in this embodiment includes an intermediate feature layer and a feature optimization layer. The intermediate feature layer has the feature-extraction functions of a typical intermediate network layer (including convolution, pooling, etc.), while the feature optimization layer is built on an attention mechanism. After obtaining the input spectrogram, the server may first extract original features from it through the intermediate feature layer of the age recognition network model. These original features can be regarded as covering all the characteristics of the input spectrogram, but not all of them are necessarily related to age; using all of them for age recognition could harm recognition accuracy, and the excessive amount of computation would also slow recognition down. This embodiment therefore also refines these original features.
performing feature optimization on the original features, based on the attention mechanism, through the feature optimization layer of the age recognition network model to obtain corresponding optimized features, and determining the optimized features as the depth features of the input spectrogram.
After obtaining the original features, the server optimizes them through the feature optimization layer of the age recognition network model based on the attention mechanism to obtain the corresponding optimized features. Specifically, the original features may be represented as an original feature map, denoted F, and the optimized features as an optimized feature map, denoted F''. When performing feature optimization, the original feature map F of the original features is obtained through the feature optimization layer of the age recognition network model, with

F ∈ R^{C×H×W}

where R is the feature-image (spectrogram) space, C is the number of image (spectrogram) channels, H is the image height, and W is the image width.
For this F, the corresponding one-dimensional channel attention map, denoted M_C(F), can be computed, with

M_C(F) ∈ R^{C×1×1}.

Each channel of F can be regarded as a feature detector, and channel attention focuses on what is meaningful in the input. To compute channel attention efficiently, this embodiment compresses F along the spatial dimensions using max pooling and average pooling respectively, obtaining two different spatial context descriptors F^c_max and F^c_avg; these two descriptors are then processed by a shared network composed of a multi-layer perceptron (MLP) to obtain M_C(F), that is,

M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max)))

where σ is the sigmoid function, W_0 ∈ R^{C/r×C}, W_1 ∈ R^{C×C/r}, r is the compression (reduction) ratio, and W_0 uses ReLU as its activation function.
After the channel attention map M_C(F) is obtained, element-wise multiplication is performed on F and the channel attention map to obtain the corresponding intermediate feature map F', that is,

F' = M_C(F) ⊗ F.
After F' is obtained, the corresponding two-dimensional spatial attention map, denoted M_S(F'), can be computed, with

M_S(F') ∈ R^{1×H×W}

where H and W have the meanings given above. Spatial attention focuses mainly on positional information. When computing spatial attention, max pooling and average pooling are first applied to F' along the channel dimension to obtain two different feature descriptors F'^s_max and F'^s_avg; the two descriptors are then merged by concatenation, and a convolution operation is used to generate M_S(F'), that is,

M_S(F') = σ(f^{7×7}([AvgPool(F'); MaxPool(F')])) = σ(f^{7×7}([F'^s_avg; F'^s_max]))

where σ is the sigmoid function, f^{7×7} denotes a 7×7 convolutional layer, F'^s_max is the feature descriptor obtained by max pooling F' along the channel dimension, and F'^s_avg is the feature descriptor obtained by average pooling F' along the channel dimension.
After the spatial attention map M_S(F') is obtained, element-wise multiplication is performed on F' and M_S(F') to obtain the corresponding optimized feature map F'', that is,

F'' = M_S(F') ⊗ F'.
F'' in the above expression is the optimized feature; the server determines it as the depth feature of the input spectrogram and performs age recognition based on it. It is worth noting that in practice the age recognition network model may have two or more intermediate feature layers (here and below "or more" includes the stated number), and the feature optimization layer may be placed after any of them. For example, if there are two intermediate feature layers, referred to as the first layer and the second layer in order from the input, the feature optimization layer may be placed after the first layer and before the second layer, in which case its input is the output of the first layer and the optimized features it outputs serve as the input of the second layer, whose processing yields the final depth features used for age recognition; alternatively, the feature optimization layer may be placed after the second layer, in which case its input is the output of the second layer and the optimized features it outputs are used directly as the final depth features for age recognition.
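The channel and spatial attention refinement described by the formulas above can be sketched as a CBAM-style module as follows; the reduction ratio r = 16 and the module boundaries are illustrative assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Shared MLP: W_0 (with ReLU) then W_1, applied to the avg- and max-pooled maps.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels),
        )

    def forward(self, F):                      # F: (B, C, H, W)
        avg = self.mlp(F.mean(dim=(2, 3)))     # F^c_avg
        mx = self.mlp(F.amax(dim=(2, 3)))      # F^c_max
        return torch.sigmoid(avg + mx).view(F.size(0), -1, 1, 1)  # M_C(F): (B, C, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # f^{7x7}

    def forward(self, Fp):                     # F': (B, C, H, W)
        avg = Fp.mean(dim=1, keepdim=True)     # F'^s_avg
        mx = Fp.amax(dim=1, keepdim=True)      # F'^s_max
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_S(F'): (B, 1, H, W)

class AttentionRefine(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.ca = ChannelAttention(channels, r)
        self.sa = SpatialAttention()

    def forward(self, F):
        Fp = self.ca(F) * F      # F'  = M_C(F) ⊗ F   (element-wise, broadcast)
        return self.sa(Fp) * Fp  # F'' = M_S(F') ⊗ F'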
Once the depth features are obtained, the target age group to which the target user belongs can be determined from them. Since the age recognition network model in this embodiment is end-to-end, the age recognition process can be carried out in its output layer: in that layer, the server computes the spatial distance between the depth features and the sample features of each expanded voice sample, and takes the expanded voice sample with the smallest spatial distance as the target sample matching the input spectrogram (the target voice). The sample annotation of the target sample is then queried to determine the sample age corresponding to the target sample, which can also be regarded as the voice age corresponding to the target voice, and the target age group to which the target user belongs is determined from that voice age. The span of each age group can of course be set according to the actual situation.
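A minimal sketch of this output-layer matching, assuming the sample features of the expanded voice samples have been precomputed and that Euclidean distance is used as the spatial distance, might look like this.

import numpy as np

def match_age_group(depth_feature: np.ndarray,
                    sample_features: np.ndarray,
                    sample_age_groups: list) -> str:
    # Spatial (here Euclidean) distance between the query feature and every
    # expanded voice sample's feature.
    dists = np.linalg.norm(sample_features - depth_feature, axis=1)
    nearest = int(np.argmin(dists))
    # The annotation of the nearest sample gives the sample age / voice age.
    return sample_age_groups[nearest]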
Further, after determining the target age group to which the target user belongs, the server can query a preset speech-script library according to that age group to obtain a corresponding target script template; the target script templates may be preset and stored by the relevant managers, with different templates for different age groups. After obtaining the target script template, the server broadcasts voice prompts according to it and thereby performs voice collection on the target user.
In this embodiment, real voice samples are obtained from a preset database and expanded based on a generative adversarial network (GAN) to obtain expanded voice samples; an age recognition network model is obtained by training with the expanded voice samples; the target voice of the target user is obtained and converted into a corresponding input spectrogram; and depth features of the input spectrogram are extracted through the age recognition network model, with the target age group to which the target user belongs determined according to those depth features. In this way, large-scale data samples are first obtained by GAN-based data expansion, which increases the number of data samples while keeping them consistent with the distribution of real data (i.e., it guarantees sample quality); an end-to-end network model is then trained on a sufficient number of sufficiently realistic samples, so that the training process can capture the hidden regularities of the data more accurately, improving the performance of the resulting model and hence the accuracy of subsequent age recognition. The target voice to be recognized is converted into a spectrogram and feature extraction is performed on it through the trained network model to obtain depth features of the target voice; compared with traditional age recognition based on signal features, these depth features cover more characteristics and are better able to capture age-related attributes of the target voice that are otherwise hard to identify. Using the depth features to identify the target age group of the target user helps capture the relationship between age and voice accurately, thereby improving the generalization ability and accuracy of age recognition.
Based on the embodiment shown in FIG. 2 above, a second embodiment of the age recognition method of this application is proposed.
Referring to FIG. 3, FIG. 3 is a schematic flowchart of the second embodiment of the age recognition method of this application.
In this embodiment, step S30 includes:
Step S31: obtain the target voice of the target user, and judge whether the voice duration of the target voice is greater than a preset duration threshold.
When the target voice lasts a long time, performing age recognition on it directly may make the computation in the model too heavy. In this embodiment, a long voice may therefore be cut into multiple voice segments and age recognition performed on each segment separately, which reduces the amount of computation and also helps improve the accuracy of age recognition. Specifically, when the server obtains the target voice of the target user, it first judges whether the voice duration of the target voice is greater than a preset duration threshold, which may be set according to actual needs.
Step S32: if the voice duration is greater than the preset duration threshold, cut the target voice to obtain two or more voice segments, and convert each voice segment into a corresponding segment spectrogram.
In this embodiment, if the voice duration of the target voice is greater than the preset duration threshold, the server cuts the target voice into two or more voice segments and then converts each segment into a corresponding segment spectrogram. When cutting, the duration of each segment can be determined by rules defined according to the actual situation. For example, when the voice duration exceeds the preset duration threshold, different voice durations may correspond to different numbers of segments: with a threshold of 3 seconds, a duration greater than 3 seconds and not greater than 4 seconds corresponds to 2 segments, and a duration greater than 4 seconds corresponds to 3 segments, after which the target voice is cut evenly into the determined number of segments so that all segments have the same duration. As another example, the voice may be cut once every preset segment length: with a threshold of 3 seconds, the voice is cut every 3 seconds, so a 5-second voice is cut into two segments of 3 seconds and 2 seconds respectively. Other cutting schemes are of course also possible in practice. It is worth noting that if the voice duration of the target voice is less than or equal to the preset duration threshold, the whole target voice can be converted directly into the corresponding input spectrogram and the age recognition processing of step S40 above is performed.
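The fixed-interval cutting rule in the second example above can be sketched as follows; the 3-second threshold and the handling of the final shorter remainder follow that example, while the waveform representation is an assumption.

import numpy as np

def cut_voice(samples: np.ndarray, sr: int, threshold_s: float = 3.0):
    total_s = len(samples) / sr
    if total_s <= threshold_s:
        return [samples]                  # short clips are processed whole
    seg_len = int(threshold_s * sr)
    # A 5 s clip becomes a 3 s segment followed by a 2 s segment.
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]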
Step S40 includes:
Step S41: extract the depth features of each segment spectrogram through the age recognition network model, and determine the segment age group corresponding to each segment spectrogram according to its depth features.
In this embodiment, once the segment spectrograms are obtained, the depth features of each segment spectrogram are extracted through the age recognition network model, and the segment age group corresponding to each segment spectrogram is determined according to its depth features. The feature extraction and age-group determination for each segment spectrogram follow step S40 above and are not repeated here.
Step S42: determine the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
In this embodiment, once the segment age group corresponding to each segment spectrogram has been determined, the target age group to which the target user belongs can be determined. If the segment age groups corresponding to all segment spectrograms are the same, that common segment age group is taken as the target age group of the target user; if they differ, a voting decision rule may be defined according to the actual situation, and the target age group is determined from that rule and the several determined segment age groups. For example, the voting decision rule may take a median-average form: if the target voice corresponds to three segment spectrograms whose age groups are 22 to 24, 26 to 28 and 28 to 30 respectively, the medians 23, 27 and 29 of the three groups are taken and their mean 26.3 is used as the target age of the target user, from which the target age group is determined. The rule may also be a majority decision: if the target voice corresponds to three segment spectrograms, two with the age group 22 to 24 and one with 26 to 28, the age group 22 to 24, which corresponds to the largest number of segment spectrograms, is determined as the target age group of the target user. Besides these examples, other forms of voting decision rules can also be adopted, for example weighting the credibility of each age group by the voice duration of the corresponding segments and taking the age group with the highest credibility as the target age group of the target user.
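Two of the voting decision rules above, majority decision with a median-average fallback, can be sketched as follows; the "lo-hi" string format for age groups is an assumption made only for illustration.

from collections import Counter
from typing import List

def vote_age_group(segment_groups: List[str]) -> str:
    # Majority vote over segment age groups; median-average fallback otherwise.
    counts = Counter(segment_groups)
    group, freq = counts.most_common(1)[0]
    if freq > len(segment_groups) / 2:      # clear majority, e.g. 2 of 3 segments
        return group
    # Median-average fallback: 22-24, 26-28, 28-30 -> medians 23, 27, 29 -> mean 26.3.
    medians = [(int(lo) + int(hi)) / 2
               for lo, hi in (g.split("-") for g in segment_groups)]
    target_age = sum(medians) / len(medians)
    # Map the averaged target age back onto the closest observed age group.
    return min(segment_groups,
               key=lambda g: abs((int(g.split("-")[0]) + int(g.split("-")[1])) / 2 - target_age))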
In this embodiment, a long voice is cut into multiple voice segments and age recognition is performed on each segment separately, which reduces the amount of computation and improves the efficiency of age recognition; moreover, because the target age group of the target user is determined from the recognition results of all segments, recognition errors caused by incidental factors in the recognition process are reduced and the accuracy of age recognition is improved.
Based on the embodiment shown in FIG. 2 above, a third embodiment of the age recognition method of this application is proposed.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of the third embodiment of the age recognition method of this application.
In this embodiment, before step S30, the method further includes:
Step S50: upon receiving a collection instruction, dial a collection call according to the collection instruction, and obtain the corresponding connected voice after the call is connected.
In this embodiment, when the server receives a collection instruction, it obtains the corresponding collection item information according to the instruction, for example the borrower (target user) and contact information of a given loan, then dials the collection call according to that information and obtains the other party's connected voice after the call is connected. The collection instruction may be triggered by a manager through a terminal, or the server may store a collection plan and automatically trigger the collection instruction when the collection time set in the plan is reached.
Step S60: judge whether two or more user voices are present in the connected voice.
After obtaining the connected voice, the server judges whether two or more user voices are present in it.
Step S70: if two or more user voices are present in the connected voice, determine the target voice of the target user among the user voices according to the voice duration and/or voice volume of each user voice.
In this embodiment, the target user may be in a noisy environment or talking to someone when answering the call, so the connected voice obtained by the server may contain two or more user voices. In that case the server needs to pick out the target voice of the target user in order to identify the target user's age accurately. Specifically, the server can separate the user voices by frequency and then obtain the voice attributes of each user voice, including voice frequency, voice duration and voice volume, and determine the target voice of the target user from those attributes. For example, since in practice the target user holds the phone throughout the call, the user voice with the longest duration can be determined as the target voice; or, since the target user is the person closest to the phone and therefore relatively the loudest, the user voice with the greatest volume can be determined as the target voice. The two factors can also be combined: for each user voice, a duration score and a volume score are obtained from the voice duration and voice volume respectively, the two scores are added to give a combined score, and the user voice with the highest combined score is determined as the target voice of the target user. It is worth noting that if the connected voice contains only one user voice, that user voice can be used directly as the target voice. Once the target voice of the target user has been determined, the age recognition processing of steps S30 and S40 above can be performed.
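The combined duration-and-volume scoring described above can be sketched as follows, assuming the individual user voices have already been separated into floating-point waveforms; the equal weighting of the two scores and the use of RMS energy as the volume measure are illustrative assumptions.

import numpy as np

def pick_target_voice(voices: list, sr: int) -> np.ndarray:
    durations = np.array([len(v) / sr for v in voices])
    volumes = np.array([np.sqrt(np.mean(v ** 2)) for v in voices])  # RMS loudness
    # Normalise each attribute to [0, 1] and add them into a combined score.
    combined = durations / durations.max() + volumes / volumes.max()
    return voices[int(np.argmax(combined))]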
In this way, when the server of this embodiment performs telephone collection, if two or more user voices are present in the connected voice, the target voice is determined first and the subsequent age recognition processing is then performed, which avoids the recognition errors that could result from performing age recognition on multiple user voices and improves the accuracy of age recognition.
Based on the embodiment shown in FIG. 2 above, a fourth embodiment of the age recognition method of this application is proposed.
In this embodiment, after step S40, the method further includes:
obtaining the historical target age groups of a preset number of historical target users, or of the historical target users within a preset period, and obtaining a collection age distribution according to the historical target age groups;
In this embodiment, when the server obtains the target age group of the current target user, it may also store it. When the historical target age groups of a preset number of historical target users, or of the historical target users within a preset period, have been collected, these historical target age groups can be aggregated and counted to obtain the corresponding collection age distribution. For example, among 100 historical target users, 30 belong to the 26-to-28 age group and 70 belong to the 30-to-32 age group; or, in last month's collection cycle there were 100 historical target users in total, of whom 30 belong to the 26-to-28 age group and 70 belong to the 30-to-32 age group.
judging, according to the collection age distribution, whether there is an age group with an abnormal number of users, and if so, sending corresponding abnormality prompt information to the corresponding management terminal.
In this embodiment, after obtaining the collection age distribution, the server can judge from it whether there is an age group with an abnormal number of users. For this judgment, a corresponding abnormality rule can be preset, for example a threshold or proportion for the number of users in an age group: when the number of users in a given age group exceeds the threshold or proportion, the number of users in that age group is considered abnormal. If such an age group exists, the current loan business may carry a certain risk in that age group, or the recognition capability of the age recognition network model may have degraded so that too many users are identified as belonging to it; in that case the server can send corresponding abnormality prompt information to the corresponding management terminal to prompt the relevant managers to inspect and handle the situation in time.
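A minimal sketch of such a proportion-based abnormality rule is shown below; the 60% share threshold is an illustrative assumption, not a value specified by this application.

from collections import Counter

def find_abnormal_age_groups(historical_groups: list, max_share: float = 0.6) -> list:
    total = len(historical_groups)
    counts = Counter(historical_groups)
    # Flag every age group whose share of recent collection calls exceeds the limit,
    # e.g. 70 of 100 users in "30-32" exceeds a 60% share and would be flagged.
    return [group for group, n in counts.items() if n / total > max_share]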
In this way, the server of this embodiment can analyze the collection age distribution of historical target users and determine from it whether an abnormal situation exists, which helps detect abnormal situations in time, reduce business risk and maintain the stability of the age recognition network model.
In addition, an embodiment of this application further provides an age recognition apparatus.
In this embodiment, the age recognition apparatus includes:
a sample expansion module, configured to obtain real voice samples from a preset database and perform sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples;
a model training module, configured to obtain an age recognition network model by training with the expanded voice samples;
a voice conversion module, configured to obtain the target voice of the target user and convert the target voice into a corresponding input spectrogram;
an age determination module, configured to extract depth features of the input spectrogram through the age recognition network model and determine, according to the depth features, the target age group to which the target user belongs.
The virtual function modules of the above age recognition apparatus are stored in the memory 1005 of the age recognition device shown in FIG. 1 and are used to implement all the functions of the computer program; when the modules are executed by the processor 1001, the age recognition function is implemented.
Further, the age recognition network model includes an intermediate feature layer and a feature optimization layer, and the age determination module includes:
a feature extraction unit, configured to perform original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain corresponding original features;
a feature optimization unit, configured to perform feature optimization on the original features, based on the attention mechanism, through the feature optimization layer of the age recognition network model to obtain corresponding optimized features, and determine the optimized features as the depth features of the input spectrogram.
Further, the original features include an original feature map F and the optimized features include an optimized feature map F''. The feature optimization unit is further configured to: obtain the original feature map F of the original features through the feature optimization layer of the age recognition network model; compute the one-dimensional channel attention map corresponding to F; perform element-wise multiplication on F and the channel attention map to obtain the corresponding intermediate feature map F'; compute the two-dimensional spatial attention map corresponding to F'; and perform element-wise multiplication on F' and the spatial attention map to obtain the corresponding optimized feature map F''.
Further, the voice conversion module includes:
a duration judging unit, configured to obtain the target voice of the target user and judge whether the voice duration of the target voice is greater than a preset duration threshold;
a voice segmentation unit, configured to, if the voice duration is greater than the preset duration threshold, cut the target voice to obtain two or more voice segments and convert each voice segment into a corresponding segment spectrogram.
The age determination module is further configured to extract the depth features of each segment spectrogram through the age recognition network model, determine the segment age group corresponding to each segment spectrogram according to its depth features, and determine the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
Further, the age recognition apparatus further includes:
a voice acquisition module, configured to, upon receiving a collection instruction, dial a collection call according to the collection instruction and obtain the corresponding connected voice after the call is connected;
a voice judging module, configured to judge whether two or more user voices are present in the connected voice;
a voice determination module, configured to, if two or more user voices are present in the connected voice, determine the target voice of the target user among the user voices according to the voice duration and/or voice volume of each user voice.
Further, the age recognition apparatus further includes:
a voice collection module, configured to obtain a corresponding target script template according to the target age group and perform voice collection on the target user according to the target script template.
Further, the age recognition apparatus further includes:
a distribution acquisition module, configured to obtain the historical target age groups of a preset number of historical target users, or of the historical target users within a preset period, and obtain a collection age distribution according to the historical target age groups;
an abnormality judging module, configured to judge, according to the collection age distribution, whether there is an age group with an abnormal number of users, and if so, send corresponding abnormality prompt information to the corresponding management terminal.
The function implementation of each module in the above age recognition apparatus corresponds to the steps in the embodiments of the age recognition method above; their functions and implementation processes are not repeated here one by one.
In addition, an embodiment of this application further provides a computer-readable storage medium.
The computer-readable storage medium of this application stores a computer program which, when executed by a processor, implements the steps of the age recognition method described above.
For the method implemented when the computer program is executed, reference may be made to the embodiments of the age recognition method of this application, which are not repeated here.
Optionally, the storage medium involved in this application, such as the computer-readable storage medium, may be non-volatile or volatile.
It should be noted that, in this document, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or system. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or system that includes that element.
The serial numbers of the above embodiments of this application are for description only and do not represent the superiority or inferiority of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit the patent scope of this application; any equivalent structural or process transformation made using the content of the specification and drawings of this application, applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.
Claims (20)
- An age recognition method, wherein the age recognition method comprises: obtaining real voice samples from a preset database, and performing sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples; obtaining an age recognition network model by training with the expanded voice samples; obtaining a target voice of a target user, and converting the target voice into a corresponding input spectrogram; and extracting depth features of the input spectrogram through the age recognition network model, and determining, according to the depth features, a target age group to which the target user belongs.
- The age recognition method according to claim 1, wherein the age recognition network model comprises an intermediate feature layer and a feature optimization layer, and the step of extracting depth features of the input spectrogram through the age recognition network model comprises: performing original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain corresponding original features; and performing feature optimization on the original features, based on an attention mechanism, through the feature optimization layer of the age recognition network model to obtain corresponding optimized features, and determining the optimized features as the depth features of the input spectrogram.
- The age recognition method according to claim 2, wherein the original features comprise an original feature map F and the optimized features comprise an optimized feature map F'', and the step of performing feature optimization on the original features, based on the attention mechanism, through the feature optimization layer of the age recognition network model to obtain the corresponding optimized features comprises: obtaining the original feature map F of the original features through the feature optimization layer of the age recognition network model; computing a one-dimensional channel attention map corresponding to F; performing element-wise multiplication on F and the channel attention map to obtain a corresponding intermediate feature map F'; computing a two-dimensional spatial attention map corresponding to F'; and performing element-wise multiplication on F' and the spatial attention map to obtain the corresponding optimized feature map F''.
- The age recognition method according to claim 1, wherein the step of obtaining the target voice of the target user and converting the target voice into the corresponding input spectrogram comprises: obtaining the target voice of the target user, and judging whether a voice duration of the target voice is greater than a preset duration threshold; and if the voice duration is greater than the preset duration threshold, cutting the target voice to obtain two or more voice segments, and converting each voice segment into a corresponding segment spectrogram; and the step of extracting depth features of the input spectrogram through the age recognition network model and determining, according to the depth features, the target age group to which the target user belongs comprises: extracting depth features of each segment spectrogram through the age recognition network model, and determining a segment age group corresponding to each segment spectrogram according to the depth features of that segment spectrogram; and determining the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
- The age recognition method according to claim 1, wherein before the step of obtaining the target voice of the target user and converting the target voice into the corresponding input spectrogram, the method further comprises: upon receiving a collection instruction, dialing a collection call according to the collection instruction, and obtaining a corresponding connected voice after the call is connected; judging whether two or more user voices are present in the connected voice; and if two or more user voices are present in the connected voice, determining the target voice of the target user among the user voices according to the voice duration and/or voice volume of each user voice.
- 6. The age recognition method of claim 1, wherein after the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features, the method further comprises: acquiring a corresponding target speech template according to the target age group, and conducting voice collection on the target user according to the target speech template.
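Claim 6 then maps the predicted age group to a collection speech template. A trivial lookup-table sketch, with made-up age buckets and template names, illustrates the idea:

```python
# Hypothetical age buckets and template identifiers; the real mapping is not
# specified in this document.
AGE_GROUP_TEMPLATES = {
    "18-30": "younger_borrower_script",
    "31-50": "standard_script",
    "51+": "senior_borrower_script",
}

def choose_template(target_age_group: str) -> str:
    # Fall back to the standard script if the age group is unmapped.
    return AGE_GROUP_TEMPLATES.get(target_age_group, "standard_script")
```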
- 7. The age recognition method of any one of claims 1 to 6, wherein after the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features, the method further comprises: acquiring the historical target age groups of a preset number, or of a preset period, of historical target users, and obtaining a collection age distribution according to the historical target age groups; and determining, according to the collection age distribution, whether there is an age group with an abnormal number of users, and if so, sending corresponding abnormality prompt information to the corresponding management terminal.
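The distribution check in claim 7 can be realized as a simple statistical comparison of recent predictions against a historical baseline. The sketch below, with an assumed 10-percentage-point tolerance and illustrative baseline shares, flags age groups whose share deviates from the baseline; the flagged groups would then trigger the abnormality prompt to the management terminal.

```python
from collections import Counter
from typing import Dict, List

def find_abnormal_age_groups(recent: List[str],
                             baseline: Dict[str, float],
                             tolerance: float = 0.10) -> List[str]:
    """Return age groups whose recent share deviates from the baseline share by more than `tolerance`."""
    total = len(recent) or 1
    share = {g: n / total for g, n in Counter(recent).items()}
    return [g for g, expected in baseline.items()
            if abs(share.get(g, 0.0) - expected) > tolerance]

# Example with made-up numbers: "18-25" would be flagged here.
# find_abnormal_age_groups(["18-25"] * 80 + ["26-35"] * 20,
#                          {"18-25": 0.4, "26-35": 0.4, "36-50": 0.2})
```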
- 8. An age recognition apparatus, comprising: a sample expansion module, configured to acquire real voice samples from a preset database and perform sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples; a model training module, configured to train an age recognition network model with the expanded voice samples; a voice conversion module, configured to acquire the target voice of a target user and convert the target voice into a corresponding input spectrogram; and an age determination module, configured to extract the depth features of the input spectrogram through the age recognition network model and determine the target age group to which the target user belongs according to the depth features.
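The sample expansion module in claim 8 relies on a GAN to synthesize additional voice samples. A deliberately compact PyTorch sketch of that idea is shown below; the fully connected generator/discriminator, the 64x64 spectrogram patch size, and the optimizer settings are assumptions for illustration, not the networks used in the patent.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator over flattened 64x64 spectrogram patches.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                  nn.Linear(256, 64 * 64), nn.Tanh())
D = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def gan_step(real_patches: torch.Tensor) -> torch.Tensor:  # (B, 64*64), scaled to [-1, 1]
    b = real_patches.size(0)
    fake = G(torch.randn(b, 100))
    # Discriminator update: real -> 1, fake -> 0
    d_loss = bce(D(real_patches), torch.ones(b, 1)) + \
             bce(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator update: try to fool the discriminator
    g_loss = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return fake.detach()  # candidate expanded samples for the training set
```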
- 9. An age recognition device, comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements an age recognition method comprising the following steps: acquiring real voice samples from a preset database, and performing sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples; training an age recognition network model with the expanded voice samples; acquiring the target voice of a target user, and converting the target voice into a corresponding input spectrogram; and extracting the depth features of the input spectrogram through the age recognition network model, and determining the target age group to which the target user belongs according to the depth features.
- 10. The age recognition device of claim 9, wherein the age recognition network model comprises an intermediate feature layer and a feature optimization layer, and the step of extracting the depth features of the input spectrogram through the age recognition network model comprises: performing original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain the corresponding original feature; and performing feature optimization on the original feature based on the attention mechanism through the feature optimization layer of the age recognition network model to obtain the corresponding optimized feature, and determining the optimized feature as the depth feature of the input spectrogram.
- 11. The age recognition device of claim 10, wherein the original feature comprises an original feature map F, the optimized feature comprises an optimized feature map F", and the step of performing feature optimization on the original feature based on the attention mechanism through the feature optimization layer of the age recognition network model to obtain the corresponding optimized feature comprises: acquiring the original feature map F of the original feature through the feature optimization layer of the age recognition network model; computing the one-dimensional channel attention map corresponding to F; performing element-wise multiplication on F and the channel attention map to obtain the corresponding intermediate feature map F'; computing the two-dimensional spatial attention map corresponding to F'; and performing element-wise multiplication on F' and the spatial attention map to obtain the corresponding optimized feature map F".
- 12. The age recognition device of claim 9, wherein the step of acquiring the target voice of the target user and converting the target voice into a corresponding target spectrogram comprises: acquiring the target voice of the target user, and determining whether the voice duration of the target voice is greater than a preset duration threshold; and if the voice duration is greater than the preset duration threshold, performing voice cutting on the target voice to obtain two or more voice segments, and converting each voice segment into a corresponding segment spectrogram; and wherein the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features comprises: extracting the depth features of each segment spectrogram through the age recognition network model, and determining the segment age group corresponding to each segment spectrogram according to its depth features; and determining the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
- 13. The age recognition device of claim 9, wherein before the step of acquiring the target voice of the target user and converting the target voice into a corresponding input spectrogram, the method further comprises: upon receiving a collection instruction, dialing a collection call according to the collection instruction, and acquiring the corresponding connected voice after the call is answered; determining whether the connected voice contains two or more user voices; and if the connected voice contains two or more user voices, determining the target voice of the target user among the user voices according to the voice duration and/or voice volume of each user voice.
- 14. The age recognition device of any one of claims 9 to 13, wherein after the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features, the method further comprises: acquiring the historical target age groups of a preset number, or of a preset period, of historical target users, and obtaining a collection age distribution according to the historical target age groups; and determining, according to the collection age distribution, whether there is an age group with an abnormal number of users, and if so, sending corresponding abnormality prompt information to the corresponding management terminal.
- 15. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements an age recognition method comprising the following steps: acquiring real voice samples from a preset database, and performing sample expansion on the real voice samples based on a generative adversarial network (GAN) to obtain expanded voice samples; training an age recognition network model with the expanded voice samples; acquiring the target voice of a target user, and converting the target voice into a corresponding input spectrogram; and extracting the depth features of the input spectrogram through the age recognition network model, and determining the target age group to which the target user belongs according to the depth features.
- 16. The computer-readable storage medium of claim 15, wherein the age recognition network model comprises an intermediate feature layer and a feature optimization layer, and the step of extracting the depth features of the input spectrogram through the age recognition network model comprises: performing original feature extraction on the input spectrogram through the intermediate feature layer of the age recognition network model to obtain the corresponding original feature; and performing feature optimization on the original feature based on the attention mechanism through the feature optimization layer of the age recognition network model to obtain the corresponding optimized feature, and determining the optimized feature as the depth feature of the input spectrogram.
- 17. The computer-readable storage medium of claim 16, wherein the original feature comprises an original feature map F, the optimized feature comprises an optimized feature map F", and the step of performing feature optimization on the original feature based on the attention mechanism through the feature optimization layer of the age recognition network model to obtain the corresponding optimized feature comprises: acquiring the original feature map F of the original feature through the feature optimization layer of the age recognition network model; computing the one-dimensional channel attention map corresponding to F; performing element-wise multiplication on F and the channel attention map to obtain the corresponding intermediate feature map F'; computing the two-dimensional spatial attention map corresponding to F'; and performing element-wise multiplication on F' and the spatial attention map to obtain the corresponding optimized feature map F".
- 18. The computer-readable storage medium of claim 15, wherein the step of acquiring the target voice of the target user and converting the target voice into a corresponding target spectrogram comprises: acquiring the target voice of the target user, and determining whether the voice duration of the target voice is greater than a preset duration threshold; and if the voice duration is greater than the preset duration threshold, performing voice cutting on the target voice to obtain two or more voice segments, and converting each voice segment into a corresponding segment spectrogram; and wherein the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features comprises: extracting the depth features of each segment spectrogram through the age recognition network model, and determining the segment age group corresponding to each segment spectrogram according to its depth features; and determining the target age group to which the target user belongs according to the segment age groups corresponding to the segment spectrograms.
- 19. The computer-readable storage medium of claim 15, wherein before the step of acquiring the target voice of the target user and converting the target voice into a corresponding input spectrogram, the method further comprises: upon receiving a collection instruction, dialing a collection call according to the collection instruction, and acquiring the corresponding connected voice after the call is answered; determining whether the connected voice contains two or more user voices; and if the connected voice contains two or more user voices, determining the target voice of the target user among the user voices according to the voice duration and/or voice volume of each user voice.
- 20. The computer-readable storage medium of any one of claims 15 to 19, wherein after the step of extracting the depth features of the input spectrogram through the age recognition network model and determining the target age group to which the target user belongs according to the depth features, the method further comprises: acquiring the historical target age groups of a preset number, or of a preset period, of historical target users, and obtaining a collection age distribution according to the historical target age groups; and determining, according to the collection age distribution, whether there is an age group with an abnormal number of users, and if so, sending corresponding abnormality prompt information to the corresponding management terminal.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010094834.9 | 2020-02-12 | ||
CN202010094834.9A CN111312286A (en) | 2020-02-12 | 2020-02-12 | Age identification method, age identification device, age identification equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021159902A1 true WO2021159902A1 (en) | 2021-08-19 |
Family
ID=71150902
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/071262 WO2021159902A1 (en) | 2020-02-12 | 2021-01-12 | Age recognition method, apparatus and device, and computer-readable storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111312286A (en) |
WO (1) | WO2021159902A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113989100A (en) * | 2021-09-18 | 2022-01-28 | 西安电子科技大学 | Infrared texture sample expansion method based on pattern generation countermeasure network |
CN114067780A (en) * | 2021-11-04 | 2022-02-18 | 国家工业信息安全发展研究中心 | Vehicle-mounted voice recognition simulation test method, system and storage medium |
CN114694685A (en) * | 2022-04-12 | 2022-07-01 | 北京小米移动软件有限公司 | Voice quality evaluation method, device and storage medium |
CN114760523A (en) * | 2022-03-30 | 2022-07-15 | 咪咕数字传媒有限公司 | Audio and video processing method, device, equipment and storage medium |
CN117690431A (en) * | 2023-12-25 | 2024-03-12 | 杭州恒芯微电子技术有限公司 | Microphone system based on voice recognition |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111312286A (en) * | 2020-02-12 | 2020-06-19 | 深圳壹账通智能科技有限公司 | Age identification method, age identification device, age identification equipment and computer readable storage medium |
CN114861746B (en) * | 2021-12-15 | 2024-10-18 | 平安科技(深圳)有限公司 | Anti-fraud identification method and device based on big data and related equipment |
CN114708872B (en) * | 2022-03-22 | 2024-10-22 | 青岛海尔科技有限公司 | Voice instruction response method and device, storage medium and electronic device |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080249774A1 (en) * | 2007-04-03 | 2008-10-09 | Samsung Electronics Co., Ltd. | Method and apparatus for speech speaker recognition |
CN103310788A (en) * | 2013-05-23 | 2013-09-18 | 北京云知声信息技术有限公司 | Voice information identification method and system |
US20170018270A1 (en) * | 2015-07-16 | 2017-01-19 | Samsung Electronics Co., Ltd. | Speech recognition apparatus and method |
CN108922518A (en) * | 2018-07-18 | 2018-11-30 | 苏州思必驰信息科技有限公司 | voice data amplification method and system |
CN109147810A (en) * | 2018-09-30 | 2019-01-04 | 百度在线网络技术(北京)有限公司 | Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network |
CN109299701A (en) * | 2018-10-15 | 2019-02-01 | 南京信息工程大学 | Expand the face age estimation method that more ethnic group features cooperate with selection based on GAN |
CN109559736A (en) * | 2018-12-05 | 2019-04-02 | 中国计量大学 | A kind of film performer's automatic dubbing method based on confrontation network |
CN111312286A (en) * | 2020-02-12 | 2020-06-19 | 深圳壹账通智能科技有限公司 | Age identification method, age identification device, age identification equipment and computer readable storage medium |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107430677B (en) * | 2015-03-20 | 2022-04-12 | 英特尔公司 | Target identification based on improving binary convolution neural network characteristics |
KR101809511B1 (en) * | 2016-08-04 | 2017-12-15 | 세종대학교산학협력단 | Apparatus and method for age group recognition of speaker |
WO2019225801A1 (en) * | 2018-05-23 | 2019-11-28 | 한국과학기술원 | Method and system for simultaneously recognizing emotion, age, and gender on basis of voice signal of user |
CN111066082B (en) * | 2018-05-25 | 2020-08-28 | 北京嘀嘀无限科技发展有限公司 | Voice recognition system and method |
CN108924218B (en) * | 2018-06-29 | 2020-02-18 | 百度在线网络技术(北京)有限公司 | Method and device for pushing information |
CN110136726A (en) * | 2019-06-20 | 2019-08-16 | 厦门市美亚柏科信息股份有限公司 | A kind of estimation method, device, system and the storage medium of voice gender |
CN110556129B (en) * | 2019-09-09 | 2022-04-19 | 北京大学深圳研究生院 | Bimodal emotion recognition model training method and bimodal emotion recognition method |
CN110619889B (en) * | 2019-09-19 | 2022-03-15 | Oppo广东移动通信有限公司 | Sign data identification method and device, electronic equipment and storage medium |
- 2020-02-12: CN application CN202010094834.9A filed (published as CN111312286A), status: active, Pending
- 2021-01-12: PCT application PCT/CN2021/071262 filed (published as WO2021159902A1), status: active, Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN111312286A (en) | 2020-06-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021159902A1 (en) | Age recognition method, apparatus and device, and computer-readable storage medium | |
EP3839942A1 (en) | Quality inspection method, apparatus, device and computer storage medium for insurance recording | |
US20180261236A1 (en) | Speaker recognition method and apparatus, computer device and computer-readable medium | |
WO2021000408A1 (en) | Interview scoring method and apparatus, and device and storage medium | |
CN111694940B (en) | User report generation method and terminal equipment | |
CN110634472B (en) | Speech recognition method, server and computer readable storage medium | |
CN108922543B (en) | Model base establishing method, voice recognition method, device, equipment and medium | |
US20230206928A1 (en) | Audio processing method and apparatus | |
US9947323B2 (en) | Synthetic oversampling to enhance speaker identification or verification | |
CN107316635B (en) | Voice recognition method and device, storage medium and electronic equipment | |
CN108132952A (en) | A kind of active searching method and device based on speech recognition | |
CN111508524B (en) | Method and system for identifying voice source equipment | |
CN113807103B (en) | Recruitment method, device, equipment and storage medium based on artificial intelligence | |
CN112037800A (en) | Voiceprint nuclear model training method and device, medium and electronic equipment | |
CN114065720A (en) | Conference summary generation method and device, storage medium and electronic equipment | |
US20210287682A1 (en) | Information processing apparatus, control method, and program | |
WO2024093578A1 (en) | Voice recognition method and apparatus, and electronic device, storage medium and computer program product | |
US20220327803A1 (en) | Method of recognizing object, electronic device and storage medium | |
Wang et al. | Interference quality assessment of speech communication based on deep learning | |
CN115831125A (en) | Speech recognition method, device, equipment, storage medium and product | |
CN113889081A (en) | Speech recognition method, medium, device and computing equipment | |
CN109190556B (en) | Method for identifying notarization will authenticity | |
WO2021077333A1 (en) | Simultaneous interpretation method and device, and storage medium | |
CN118658467B (en) | Cheating detection method, device, equipment, storage medium and product | |
CN115620748B (en) | Comprehensive training method and device for speech synthesis and false identification evaluation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21753357 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.12.2022) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21753357 Country of ref document: EP Kind code of ref document: A1 |