CN113555022A - Voice-based same-person identification method, device, equipment and storage medium


Info

Publication number
CN113555022A
Authority
CN
China
Prior art keywords
voice
recognized
preset
recognition
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110836229.9A
Other languages
Chinese (zh)
Inventor
刘源
王健宗
彭俊清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110836229.9A
Publication of CN113555022A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification
    • G10L 17/18 - Artificial neural networks; Connectionist approaches

Abstract

The invention relates to the field of artificial intelligence, and discloses a voice-based same-person identification method, device, equipment and storage medium. The method comprises the following steps: extracting characteristic parameters of the voice to be recognized; determining the age bracket of the target user based on a preset vector machine model and the characteristic parameters; extracting voice data corresponding to the age bracket from a preset voice database of a registered user; respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network and outputting tone characteristic vectors; and judging whether the target user and the registered user are the same person. According to the invention, by performing format conversion and age recognition on the voice, the voice of a registered user in the same age bracket as the target user is extracted and compared for same-person identification, so that the voice recognition rate and the accuracy of same-person identification are improved. In addition, the invention also relates to blockchain technology, and the voice to be recognized and the characteristic parameters can be stored in a blockchain.

Description

Voice-based same-person identification method, device, equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a voice-based same-person identification method, device, equipment and storage medium.
Background
With the continuous development of artificial intelligence, speech is widely applied in many fields: for example, human-computer interaction, where speech is used to control equipment or to hold intelligent spoken dialogues with a robot, and medicine, where speech supports disease-assisted diagnosis, health management, remote consultation and the like. A large number of human-computer interaction products therefore need to distinguish the speakers themselves, that is, to identify and distinguish the identities of speakers by their voices.
In the prior art, when the identity of a speaker is identified from voice, only a limited-length segment of voice features is extracted from the voice data of the target user for recognition. The result cannot accurately represent the individual characteristics of the speaker, and because the recognition result is based on probability calculation, very high resolution is difficult to achieve, so the accuracy of same-person identification is low.
Disclosure of Invention
The invention mainly aims to solve the technical problem that the accuracy of the voice-based same-person recognition is low in the prior art.
The invention provides a voice-based same-person identification method, which comprises the following steps: acquiring a voice to be recognized of a target user, and extracting mark parameter information of the voice to be recognized; performing parameter analysis on the mark parameter information to determine the format type and attribute information of the voice to be recognized; performing format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after format conversion; performing age recognition on the voice to be recognized based on a preset vector machine model and the characteristic parameters, determining the age bracket of the target user, and extracting voice data corresponding to the age bracket from a preset voice database of a registered user; and respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone characteristic vectors, comparing the tone characteristic vectors of the voice data and the voice to be recognized, and judging whether the target user and the registered user are the same person.
Optionally, in a first implementation manner of the first aspect of the present invention, the performing format conversion on the voice to be recognized according to the format type and the attribute information, and extracting feature parameters of the voice to be recognized after format conversion includes: extracting a sampling rate, a bit rate and a sound channel in the attribute information of the voice to be recognized according to the format type; judging whether the sampling rate and the bit rate meet preset requirements or not; if the preset requirement is not met, converting the sampling rate and the bit rate based on a preset conversion rule, and judging whether the sound channel of the voice to be recognized is a single sound channel; if the sound channel is not the single sound channel, converting the sound channel into the single sound channel according to a preset sound channel conversion rule; and extracting the characteristic parameters of the voice to be recognized after the format conversion, wherein the characteristic parameters comprise time domain characteristic parameters and frequency domain characteristic parameters.
Optionally, in a second implementation manner of the first aspect of the present invention, the performing age recognition on the speech to be recognized based on a preset vector machine model and the feature parameters, determining an age group of the target user, and extracting speech data corresponding to the age group from a preset speech database of a registered user includes: performing dimensionality reduction and aggregation treatment on the characteristic parameters to obtain age characteristic parameters; performing age recognition on the voice to be recognized based on a preset vector machine model and the age characteristic parameters to obtain a recognition result; comparing the recognition result with the recognition rate in the vector machine model, and calculating the confidence coefficient of the recognition result; determining the age bracket of the target user according to the confidence coefficient; and extracting voice data corresponding to the age bracket from a preset voice database of the registered user.
Optionally, in a third implementation manner of the first aspect of the present invention, the respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting tone feature vectors, comparing the tone feature vectors, and determining whether the target user and the registered user are the same person includes: respectively extracting the voiceprint features of the voice data and the voice to be recognized based on a preset deep convolutional neural network; clustering the voiceprint features to obtain tone feature vectors; calculating a similarity value of the tone feature vectors, and judging whether the similarity value is not less than a preset tone similarity threshold value; and if so, determining that the target user and the registered user are the same person.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the clustering the voiceprint features to obtain a timbre feature vector includes: calculating a chroma characteristic value of the voiceprint characteristic, and generating a voiceprint matrix according to the chroma characteristic value; inputting the voiceprint features into the deep convolutional neural network, and outputting tone feature representations; and mapping the tone characteristic representation to a preset characteristic space, and carrying out quantitative characterization on the tone characteristic representation according to the characteristic space to obtain a tone characteristic vector.
Optionally, in a fifth implementation manner of the first aspect of the present invention, before the extracting of the voiceprint features of the voice data and the voice to be recognized based on the preset deep convolutional neural network, the method further includes: respectively performing framing processing on the voice data and the voice to be recognized to obtain audio frames; extracting short-time energy of the audio frames, and judging whether the short-time energy is smaller than a preset energy threshold, wherein the short-time energy is the strength of an audio frame at different moments; and if so, eliminating the corresponding audio frames.
Optionally, in a sixth implementation manner of the first aspect of the present invention, after the respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting tone characteristic vectors, comparing the tone characteristic vectors, and judging whether the target user and the registered user are the same person, the method further includes: extracting the frame voiceprint characteristics of the voice to be recognized; calculating the posterior probability of the frame voiceprint characteristics based on a preset time delay neural network; calculating the one-hot value of the posterior probability; classifying the frame voiceprint characteristics according to the one-hot value, and identifying the frame voiceprint characteristics according to the classification result; and performing voice registration on the target user according to the identification, and storing the voice to be recognized into the voice database of the registered user.
A second aspect of the present invention provides a voice-based same-person recognition apparatus, including: an acquisition module, used for acquiring the voice to be recognized of a target user and extracting the mark parameter information of the voice to be recognized; an analysis module, used for performing parameter analysis on the mark parameter information and determining the format type and attribute information of the voice to be recognized; a conversion module, used for performing format conversion on the voice to be recognized according to the format type and the attribute information and extracting the characteristic parameters of the voice to be recognized after format conversion; a recognition module, used for performing age recognition on the voice to be recognized based on a preset vector machine model and the characteristic parameters, determining the age bracket of the target user and extracting voice data corresponding to the age bracket from a preset voice database of a registered user; and a comparison module, used for respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone characteristic vectors, comparing the tone characteristic vectors of the voice data and the voice to be recognized, and judging whether the target user and the registered user are the same person.
Optionally, in a first implementation manner of the second aspect of the present invention, the conversion module includes: the first extraction unit is used for extracting a sampling rate, a bit rate and a sound channel in the attribute information of the voice to be recognized according to the format type; the judging unit is used for judging whether the sampling rate and the bit rate meet preset requirements or not; the first conversion unit is used for converting the sampling rate and the bit rate based on a preset conversion rule and judging whether the sound channel of the voice to be recognized is a single sound channel or not if the sampling rate and the bit rate do not meet the preset requirement; a second conversion unit, configured to convert the channel into a mono channel according to a preset channel conversion rule if the channel is not a mono channel; and the second extraction unit is used for extracting the characteristic parameters of the voice to be recognized after the format conversion, wherein the characteristic parameters comprise time domain characteristic parameters and frequency domain characteristic parameters.
Optionally, in a second implementation manner of the second aspect of the present invention, the identification module is specifically configured to: performing dimensionality reduction and aggregation treatment on the characteristic parameters to obtain age characteristic parameters; performing age recognition on the voice to be recognized based on a preset vector machine model and the age characteristic parameters to obtain a recognition result; comparing the recognition result with the recognition rate in the vector machine model, and calculating the confidence coefficient of the recognition result; determining the age bracket of the target user according to the confidence coefficient; and extracting voice data corresponding to the age bracket from a preset voice database of the registered user.
Optionally, in a third implementation manner of the second aspect of the present invention, the comparing module includes: the third extraction unit is used for respectively extracting the voiceprint features of the voice data and the voice to be recognized based on a preset deep convolutional neural network; the clustering unit is used for clustering the voiceprint features to obtain tone feature vectors; the calculation unit is used for calculating the similarity value of the tone feature vectors and judging whether the similarity value is not less than a preset tone similarity threshold value; and the determining unit is used for determining that the target user and the registered user are the same person if the similarity value is not less than the preset tone similarity threshold value.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the clustering unit is specifically configured to: calculating a chroma characteristic value of the voiceprint characteristic, and generating a voiceprint matrix according to the chroma characteristic value; inputting the voiceprint features into the deep convolutional neural network, and outputting tone feature representations; and mapping the tone characteristic representation to a preset characteristic space, and carrying out quantitative characterization on the tone characteristic representation according to the characteristic space to obtain a tone characteristic vector.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the voice-based same-person recognition apparatus further includes a rejecting module, which is specifically configured to: respectively performing framing processing on the voice data and the voice to be recognized to obtain audio frames; extracting short-time energy of the audio frames, and judging whether the short-time energy is smaller than a preset energy threshold value; and if so, eliminating the corresponding audio frames.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the voice-based same-person recognition apparatus further includes a registration module, which is specifically configured to: extracting the frame voiceprint characteristics of the voice to be recognized; calculating the posterior probability of the frame voiceprint characteristics based on a preset time delay neural network; calculating the one-hot value of the posterior probability; classifying the frame voiceprint characteristics according to the one-hot value, and identifying the frame voiceprint characteristics according to the classification result; and performing voice registration on the target user according to the identification, and storing the voice to be recognized into the voice database of the registered user.
A third aspect of the present invention provides a voice-based same-person recognition device, including: a memory having instructions stored therein and at least one processor, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the instructions in the memory to cause the voice-based same-person recognition device to perform the steps of the voice-based same-person identification method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon instructions which, when executed on a computer, cause the computer to perform the steps of the voice-based same-person identification method described above.
In the technical scheme provided by the invention, the mark parameter information in the voice to be recognized of the target user is acquired, parameter analysis is performed on the mark parameter information, and the characteristic parameters of the voice to be recognized are extracted. The age bracket of the target user is determined based on a preset vector machine model and the characteristic parameters, and voice data corresponding to the age bracket is extracted from a preset voice database of a registered user, so that the voice data of registered users requiring same-person identification is selected in a targeted manner and the voice recognition rate is improved. The voice data and the voice to be recognized are then respectively input into a preset deep convolutional neural network, tone characteristic vectors are output, and whether the target user and the registered user are the same person is judged.
Drawings
FIG. 1 is a diagram of a first embodiment of the voice-based same-person identification method according to an embodiment of the present invention;
FIG. 2 is a diagram of a second embodiment of the voice-based same-person identification method according to an embodiment of the present invention;
FIG. 3 is a diagram of a third embodiment of the voice-based same-person identification method according to an embodiment of the present invention;
FIG. 4 is a diagram of a fourth embodiment of the voice-based same-person identification method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of the voice-based same-person recognition apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another embodiment of the voice-based same-person recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an embodiment of the voice-based same-person recognition device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a voice-based same-person identification method, device, equipment and storage medium. Mark parameter information in the voice to be recognized of a target user is acquired, parameter analysis is performed on the mark parameter information, and characteristic parameters of the voice to be recognized are extracted. The age bracket of the target user is determined based on a preset vector machine model and the characteristic parameters, voice data corresponding to the age bracket is extracted from a preset voice database of a registered user, and the voice data of registered users requiring same-person identification is selected in a targeted manner, which improves the voice recognition rate. The voice data and the voice to be recognized are respectively input into a preset deep convolutional neural network, tone characteristic vectors are output, and whether the target user and the registered user are the same person is judged. The embodiment of the invention can be applied to intelligent diagnosis and remote consultation; by extracting tone characteristic vectors from the voice and comparing them, the accuracy of same-person identification is improved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, the specific contents of an embodiment of the present invention are described below. Referring to fig. 1, a first embodiment of the voice-based same-person identification method according to an embodiment of the present invention includes:
101, acquiring a voice to be recognized of a target user, and extracting mark parameter information of the voice to be recognized;
in the embodiment of the invention, the voice-based same-person identification method can be applied to intelligent diagnosis and treatment and remote consultation. Based on medical data, the medical platform collects the voice to be recognized of a target user in real time during a digital inquiry, uploads the collected voice to be recognized to a server for storage, preprocesses the voice to be recognized when it is stored, and extracts mark parameter information from the preprocessed voice to be recognized. The preprocessing includes framing, windowing and pre-emphasis. Framing cuts the voice signal into segments according to its short-time stationarity; the frame length is generally 20 ms and the frame shift is generally 10 ms. Windowing generally adopts a Hamming window or a Hanning window: the main-lobe width corresponds to the frequency resolution (the wider the main lobe, the lower the frequency resolution), so a window function should concentrate the energy on the main lobe as much as possible, or keep the relative amplitude of the largest side lobe as small as possible. Because the Hamming window has larger side-lobe attenuation in its amplitude-frequency characteristic and can reduce the Gibbs effect, it is generally selected for windowing voice signals. Because the voice signal is easily affected by glottal excitation and oral-nasal radiation, frequency components above 800 Hz attenuate at about 6 dB/octave, so the energy of the high-frequency part needs to be raised by pre-emphasis to compensate for the high-frequency loss; pre-emphasis is generally realized with a first-order high-pass filter. In addition, the preprocessing may also include anti-aliasing filtering.
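By way of illustration only, the following is a minimal Python sketch of this preprocessing; the NumPy implementation and the pre-emphasis coefficient of 0.97 are assumptions, as the embodiment fixes only the 20 ms frame length, the 10 ms frame shift and the Hamming window:

```python
import numpy as np

def preprocess(signal: np.ndarray, sample_rate: int = 8000,
               frame_ms: float = 20.0, shift_ms: float = 10.0,
               alpha: float = 0.97) -> np.ndarray:
    """Pre-emphasize, frame and window a speech signal (sketch)."""
    # Pre-emphasis: first-order high-pass filter y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)    # 160 samples at 8 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 80 samples at 8 kHz
    num_frames = 1 + (len(emphasized) - frame_len) // frame_shift

    # Hamming window: large side-lobe attenuation, reduces the Gibbs effect.
    window = np.hamming(frame_len)
    return np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])  # shape: (num_frames, frame_len)

frames = preprocess(np.random.randn(8000))  # one second of dummy audio
```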
In addition, the embodiment of the invention can acquire and process the related voice data based on the artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The server in the embodiment of the present invention may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
102, performing parameter analysis on the mark parameter information, and determining the format type and attribute information of the voice to be recognized;
and performing parameter analysis on each parameter in the flag parameter information, and determining the format type and attribute information of the voice to be recognized according to the type of each parameter contained in the flag parameter information because the flag parameter information in different format types is inconsistent. The format types of the voice comprise WAV, MP3 and OGG. These speech formats all have industry-uniform standards, and a recording device uses a certain format, and a player must decode the format, otherwise the speech cannot be played normally. Specifically, the flag parameter information is input into a voice format recognition unit of the server, verification of a file header in a WAV format, standard parameters and the like is carried out, if the WAV standard is met, the voice is judged to be in the WAV format, otherwise, decoder/parameter verification of an MP3 format standard is carried out on the flag parameter information, if the MP3 standard is met, the voice is judged to be in an MP3 format, otherwise, decoder/parameter verification of an OGG format standard is carried out on the flag parameter information, if the OGG standard is met, the voice is judged to be in the OGG format, and the format is returned to the server. If the format of the voice still can not be recognized, an information prompt which does not support the format voice is sent.
In this embodiment, WAV is a sound file format developed by Microsoft that conforms to the RIFF (Resource Interchange File Format) specification and is used for storing audio information resources on the Windows platform. The WAV header is a piece of data at the beginning of the file that describes the body data. The WAV file header consists of 4 parts: the RIFF block (RIFF-Chunk), the format block (Format-Chunk), the additional block (Fact-Chunk) and the data block (Data-Chunk). The file header contains 44 bytes of flag parameter information, including a 4-byte RIFF mark, a 4-byte file length, a 4-byte WAVE type block mark, and so on; by checking this flag parameter information, it can be determined whether the voice file is WAV.
MP3, commonly referred to as MPEG Audio Layer 3, is an efficient computer audio coding scheme that compresses audio files at a large compression ratio into smaller files with the .mp3 extension while substantially preserving the sound quality of the source file. An MP3 file is generally divided into three parts: ID3V2, audio data and ID3V1. The audio data records parameter information such as the sampling rate and bit rate, and checking this flag parameter information can determine whether the voice file is MP3.
OGG, short for Ogg Vorbis, is an audio compression format similar to music formats such as MP3. After decoding, an OGG file forms a bitstream whose head consists of three packet headers, in order: an identification header (Identification Header), a comment header (Comment Header) and a setup header (Setup Header). The identification header records the version and simple audio characteristics of the stream (such as the sampling rate and number of channels), and whether the voice to be recognized is in OGG format can be determined by checking these audio characteristics.
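A hedged sketch of this header-based detection is given below. The byte signatures (the "RIFF"/"WAVE" marks of a WAV header, the ID3 tag or 0xFFEx frame sync of MP3, and the "OggS" capture pattern of an Ogg stream) are standard, but the exact verification logic of the embodiment's voice format recognition unit is not specified, so this is illustrative only:

```python
def detect_format(data: bytes) -> str:
    """Guess WAV / MP3 / OGG from the first bytes of a file (sketch)."""
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"OggS":                       # Ogg capture pattern
        return "ogg"
    if data[:3] == b"ID3" or (len(data) >= 2
                              and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0):
        return "mp3"                              # ID3 tag or MP3 frame sync
    raise ValueError("unsupported voice format")  # triggers the prompt above

print(detect_format(b"RIFF\x24\x08\x00\x00WAVE"))  # -> "wav"
```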
103, performing format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after the format conversion;
according to the format type, extracting the sampling rate, the bit rate and the sound channel in the attribute information of the voice to be recognized, respectively comparing whether the sampling rate and the bit rate are the same as preset requirements, namely judging whether the sampling rate and the bit rate meet the preset requirements, if the sampling rate and the bit rate do not meet the preset requirements, carrying out format conversion, and if the sampling rate and the bit rate meet the preset requirements, carrying out format conversion, wherein the preset requirements are that the sampling rate is 8k, the bit rate is 16 bits, and the process of carrying out format conversion on the voice to be recognized is that the sampling rate and the bit rate of the voice to be recognized are converted into the sampling rate and the bit rate meeting the preset requirements according to preset conversion rules.
It is then judged whether the sound channel of the voice to be recognized is mono; if not, i.e., the voice to be recognized is two-channel, it is converted into mono according to a preset channel conversion rule. Specifically, the server provides a JNI interface for channel separation based on a C++ program, with a parameter (left channel or right channel) defined in the interface; the Java program calls the interface and passes in the left channel, and after the C++ program completes processing, the two-channel speech is converted into mono and returned to the Java program.
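Purely for illustration, the conversion rules can be reproduced in Python as below (an assumption: the embodiment uses a Java/C++ JNI service instead, and the SciPy resampler stands in for its unspecified conversion rule). The sketch keeps the left channel, resamples to the preset 8 kHz, and quantizes to 16-bit:

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def convert(samples: np.ndarray, src_rate: int, target_rate: int = 8000) -> np.ndarray:
    """Convert float samples in [-1, 1] to 8 kHz, 16-bit, mono (sketch)."""
    if samples.ndim == 2:            # two-channel input: keep the left channel,
        samples = samples[:, 0]      # as the JNI interface described above does
    if src_rate != target_rate:      # resample to the preset sampling rate
        g = gcd(src_rate, target_rate)
        samples = resample_poly(samples, target_rate // g, src_rate // g)
    samples = np.clip(samples, -1.0, 1.0)
    return (samples * 32767).astype(np.int16)  # preset 16-bit depth

mono_8k = convert(np.random.uniform(-1, 1, (44100, 2)), src_rate=44100)
```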
Characteristic parameters contained in the converted voice to be recognized are then extracted. The characteristic parameters include time-domain characteristic parameters and frequency-domain characteristic parameters: the time-domain characteristic parameters include the short-time zero-crossing rate, the short-time energy spectrum and the pitch period, and the frequency-domain characteristic parameters include linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC).
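As one possible realization (an assumption; the patent names no library), the listed characteristic parameters can be extracted with librosa:

```python
import librosa
import numpy as np

y = librosa.tone(440, sr=8000, duration=1.0)           # stand-in for converted speech
mfcc = librosa.feature.mfcc(y=y, sr=8000, n_mfcc=13)   # frequency-domain: MFCC
zcr = librosa.feature.zero_crossing_rate(y)            # time-domain: zero-crossing rate
f0 = librosa.yin(y, fmin=50, fmax=400, sr=8000)        # time-domain: pitch estimate
frames = librosa.util.frame(y, frame_length=160, hop_length=80)
energy = np.sum(frames ** 2, axis=0)                   # time-domain: short-time energy
lpc = librosa.lpc(y, order=12)  # LP coefficients, from which LPCC can be derived
```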
104, performing age recognition on the voice to be recognized based on a preset vector machine model and characteristic parameters, determining the age bracket of the target user, and extracting voice data corresponding to the age bracket from a preset voice database of a registered user;
inputting the characteristic parameters into a preset vector machine model, carrying out parameter analysis on the characteristic parameters by the vector machine model, extracting age characteristic parameters from the characteristic parameters, carrying out age recognition on the voice to be recognized according to the age characteristic parameters to obtain a recognition result, and determining the age bracket corresponding to the voice to be recognized. And after the age group corresponding to the voice to be recognized is determined, extracting the voice data of the same age group from a preset voice database of the registered user. The preset voice database of the registered user comprises voice data of all registered users registered by the medical platform in the server.
After the age bracket of the target user corresponding to the voice to be recognized is determined, voice data corresponding to that age bracket is extracted from the preset voice database of registered users, i.e., the voice data of registered users in the same age bracket is extracted from the voice database. Framing processing is then performed on the voice data and the voice to be recognized respectively to obtain audio frames; the short-time energy of each audio frame is extracted, and it is judged whether the short-time energy is smaller than a preset energy threshold, where the short-time energy is the strength of an audio frame at different moments. If the short-time energy is smaller than the preset energy threshold, the corresponding audio frame is removed; that is, audio-frame filtering is performed on both the voice data and the voice to be recognized, which improves the accuracy of the subsequent comparison of tone characteristic vectors.
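A minimal sketch of this audio-frame filtering step follows; the energy threshold is a placeholder value, since the embodiment does not fix it:

```python
import numpy as np

def remove_low_energy_frames(frames: np.ndarray,
                             energy_threshold: float = 1e-4) -> np.ndarray:
    """Drop frames whose short-time energy falls below the preset threshold."""
    energy = np.sum(frames.astype(np.float64) ** 2, axis=1)  # strength per frame
    return frames[energy >= energy_threshold]
```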
And 105, respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone characteristic vectors, comparing the tone characteristic vectors of the voice data and the voice to be recognized, and judging whether the target user and the registered user are the same person.
The voice data and the voice to be recognized are respectively input into the deep convolutional neural network preset on the server; the network analyzes the voice and outputs a three-dimensional tone characteristic representation, which is mapped into a preset feature space and quantitatively characterized through that feature space to obtain a tone characteristic vector. The tone characteristic vector of the registered user and the tone characteristic vector corresponding to the voice to be recognized are then compared in the feature space: when the two tone characteristic vectors are consistent, the target user corresponding to the voice to be recognized and the registered user are the same person; when they are inconsistent, they are not the same person. The preset deep convolutional neural network is trained in advance; its training process is prior art and is not described here.
In this embodiment, the voice data of any two target users may also be acquired, and format conversion, age recognition and comparison of tone characteristic vectors performed on the two voices to determine whether the two target users are the same person. That is, this embodiment may perform same-person identification between a target user to be identified and a registered user, or between two target users to be identified.
In the embodiment of the invention, format conversion and age recognition are performed on the voice to be recognized, and then, according to the processing result, the voice data of registered users in the same age bracket as the voice to be recognized is selected in a targeted manner for same-person identification, improving the voice recognition rate; tone characteristic vectors are then extracted from the voices for vector comparison, improving the accuracy of same-person identification.
Referring to fig. 2, a second embodiment of the method for recognizing a person based on speech according to the embodiment of the present invention includes:
201, acquiring a voice to be recognized of a target user, and extracting mark parameter information of the voice to be recognized;
202, performing parameter analysis on the mark parameter information, and determining the format type and attribute information of the voice to be recognized;
203, performing format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after the format conversion;
204, performing dimensionality reduction and aggregation processing on the characteristic parameters to obtain age characteristic parameters;
in this embodiment, the dimensionality reduction processing reduces the data dimension of the characteristic parameters using a principal component analysis algorithm (PCA), the aggregation processing aggregates the data of the characteristic parameters using a K-means clustering algorithm, and the age characteristic parameters are obtained by performing this dimensionality reduction and aggregation on the characteristic parameters. Dimensionality reduction and aggregation of data are prior art and are not described here.
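For illustration (the component and cluster counts are assumptions, as is the use of scikit-learn), the dimensionality-reduction and aggregation step might look like:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def age_features(feature_params: np.ndarray) -> np.ndarray:
    """PCA dimensionality reduction followed by K-means aggregation (sketch)."""
    reduced = PCA(n_components=8).fit_transform(feature_params)
    kmeans = KMeans(n_clusters=4, n_init=10).fit(reduced)
    # Aggregate: represent each sample by the centroid of its cluster.
    return kmeans.cluster_centers_[kmeans.labels_]

print(age_features(np.random.rand(100, 39)).shape)  # (100, 8)
```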
205, performing age recognition on the voice to be recognized based on a preset vector machine model and age characteristic parameters to obtain a recognition result;
the server is provided with a Vector Machine model, according to a Support Vector Machine (Support Vector Machine) method of the Vector Machine model, the age characteristic parameters are mapped into a high-dimensional or even infinite-dimensional characteristic space (Hilbert space) through a nonlinear mapping p, the expansion theorem of a kernel function is applied, and on the premise of hardly increasing the calculation complexity, the age bracket to which the voice to be recognized belongs is recognized according to the Vector Machine model and the age characteristic parameters, so that the recognition result is obtained.
206, comparing the recognition result with the recognition rate in the vector machine model, and calculating the confidence coefficient of the recognition result;
207, determining the age bracket of the target user according to the confidence coefficient;
the recognition rate of each age group is set in the vector machine model, the recognition result is compared with the recognition rate, the confidence coefficient of the recognition result is calculated, and the confidence coefficient of the recognition result is analyzed according to the recognition rate. And when the confidence coefficient value is not less than the preset confidence threshold value, the recognition result is more accurate, and the age bracket of the target user corresponding to the voice to be recognized can be determined.
208, extracting voice data corresponding to the age group from a preset voice database of the registered user;
and 209, respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone characteristic vectors, comparing the voice data and the tone characteristic vectors of the voice to be recognized, and judging whether the target user and the registered user are the same.
After the age bracket of the target user corresponding to the voice to be recognized is determined, voice data corresponding to that age bracket is extracted from the preset voice database of registered users, i.e., the voice data of registered users in the same age bracket is extracted from the voice database. The voice data and the voice to be recognized are respectively input into the deep convolutional neural network preset on the server; the network analyzes the voice and outputs a three-dimensional tone characteristic representation, which is mapped into a preset feature space and quantitatively characterized through that feature space to obtain a tone characteristic vector. The tone characteristic vector of the registered user and the tone characteristic vector corresponding to the voice to be recognized are then compared in the feature space: when the two tone characteristic vectors are consistent, the target user corresponding to the voice to be recognized and the registered user are the same person; when they are inconsistent, they are not the same person. The preset deep convolutional neural network is trained in advance; its training process is prior art and is not described here.
In the embodiment of the present invention, the steps 201-203 are the same as the steps 101-103 in the first embodiment of the voice-based human recognition method, and will not be described herein.
In the embodiment of the invention, format conversion is performed on the voice to be recognized and characteristic parameters are extracted; dimensionality reduction and aggregation are performed on the characteristic parameters to obtain the age characteristic parameters; age recognition is performed according to the age characteristic parameters, the confidence of the recognition result is calculated, and the age bracket of the voice to be recognized is determined, so that the voice data of registered users in the same age bracket as the voice to be recognized is selected in a targeted manner for same-person identification, improving the voice recognition rate.
Referring to fig. 3, a third embodiment of the method for identifying a person based on speech according to the embodiment of the present invention includes:
301, acquiring a voice to be recognized of a target user, and extracting mark parameter information of the voice to be recognized;
302, performing parameter analysis on the mark parameter information, and determining the format type and attribute information of the voice to be recognized;
303, performing format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after the format conversion;
304, performing age recognition on the voice to be recognized based on a preset vector machine model and characteristic parameters, determining the age bracket of the target user, and extracting voice data corresponding to the age bracket from a preset voice database of a registered user;
305, respectively extracting the voiceprint features of the voice data and the voice to be recognized based on a preset deep convolutional neural network;
in the present embodiment, the voiceprint feature (voiceprint) is a feature representing the voice characteristics of the user, i.e., the speaker. Voiceprint features can be extracted through a preset neural network model, where the neural network model is a pre-trained sequential deep convolutional neural network.
Specifically, according to a voice endpoint detection method (VAD) in the deep convolutional neural network pipeline, endpoint detection is performed on the voice data and the voice to be recognized, dividing them into multiple segments of audio data in which a speaker is talking, for example the audio data corresponding to the time segments 0-3 seconds, 4-7 seconds and 7-10 seconds. Feature (embedding) extraction is then performed on this audio data, i.e., the voiceprint feature corresponding to each segment of audio data is extracted. A voiceprint feature can be regarded as a vector whose dimension can be set as required, for example 128 or 512 dimensions; the unique characteristics of the speaker are characterized through the voiceprint feature, and audio data of different durations all yield vectors of the same fixed dimension. For example, the matrix corresponding to each piece of audio data may be input into the deep convolutional neural network; since a voiceprint can be viewed as frequency content arranged in time order, each matrix is a two-dimensional array of time and frequency, and the network outputs a fixed-dimension vector for each matrix.
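A purely illustrative PyTorch sketch of this mapping is shown below; the architecture is an assumption and not the patent's network, but it demonstrates how a time-by-frequency matrix of any duration yields a fixed-dimension (here 128-dimensional) voiceprint vector:

```python
import torch
import torch.nn as nn

class VoiceprintNet(nn.Module):
    """Maps a (frequency x time) matrix to a fixed-dimension embedding (sketch)."""
    def __init__(self, n_freq: int = 40, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_freq, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x).mean(dim=2)   # average over time: any duration works
        return self.fc(h)              # (batch, embed_dim)

net = VoiceprintNet()
segment = torch.randn(1, 40, 300)      # one VAD segment, 300 frames
print(net(segment).shape)              # torch.Size([1, 128])
```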
306, clustering the voiceprint features to obtain tone feature vectors;
in this embodiment, the clustering process for the voiceprint features may be clustering by using a K-means clustering algorithm (K-means) or spectral clustering, where K represents the number of classes, and K may be determined according to the number of speakers in the target speech data.
After the server extracts the corresponding voiceprint features, they can be assembled into a matrix. Each row of the matrix represents the voiceprint feature corresponding to one segment of audio data in the speech; the voiceprint feature is a fixed-dimension vector, and the duration of the audio data corresponding to each row may differ. For example, the first row of the matrix may represent the vector for 0-3 seconds, the second row the vector for 4-7 seconds, the third row the vector for 7-10 seconds, and so on. The matrix of voiceprint features is clustered to obtain a clustering result for the voiceprint feature of each segment of audio data, and after clustering, the voiceprint features are vectorized to obtain the tone characteristic vectors.
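A sketch of clustering such a matrix with K-means follows (scikit-learn is an assumption, and K is set by the assumed number of speakers, as described above):

```python
import numpy as np
from sklearn.cluster import KMeans

segment_vectors = np.random.rand(10, 128)  # one 128-dim voiceprint per segment
k = 2                                      # assumed number of speakers
labels = KMeans(n_clusters=k, n_init=10).fit_predict(segment_vectors)
# Vectorize per cluster: here, the mean of each speaker's segment vectors.
timbre_vectors = np.stack(
    [segment_vectors[labels == i].mean(axis=0) for i in range(k)]
)
```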
307, calculating a similarity value of the tone characteristic vector, and judging whether the similarity value is not less than a preset tone similarity threshold value;
and 308, if the similarity value is not less than the preset tone similarity threshold, determining that the target user and the registered user are the same person.
Similarity calculation is performed between the tone characteristic vector of the target user corresponding to the voice to be recognized and the tone characteristic vector of the registered user, i.e., the similarity value of the two tone characteristic vectors is calculated, and whether the target user corresponding to the voice to be recognized and the registered user are the same person is judged according to the similarity value. When the similarity value of the two tone characteristic vectors is not less than the preset tone similarity threshold, the target user and the registered user are determined to be the same person; when the similarity value is less than the threshold, they are not the same person.
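The patent does not fix the similarity metric or the threshold value; as one hedged example, cosine similarity with an assumed threshold of 0.75 would implement the decision as:

```python
import numpy as np

def is_same_person(v1: np.ndarray, v2: np.ndarray,
                   threshold: float = 0.75) -> bool:
    """Compare two tone characteristic vectors against the similarity threshold."""
    similarity = float(np.dot(v1, v2) /
                       (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return similarity >= threshold

v = np.random.rand(128)
print(is_same_person(v, v))  # identical vectors -> True
```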
In the embodiment of the present invention, the steps 301-304 are the same as the steps 101-104 in the first embodiment of the voice-based same person identification method, and will not be described herein.
In the embodiment of the invention, the voiceprint features in the voices of the target user and the registered user are extracted, the voiceprint features are clustered to obtain the tone feature vectors, the similarity value of the tone feature vectors is calculated, and whether the target user and the registered user are the same person is determined according to the comparison between the similarity value and the similarity threshold value, so that the accuracy of the recognition of the same person is improved.
Referring to fig. 4, a fourth embodiment of the method for recognizing a person based on voice according to the embodiment of the present invention includes:
401, acquiring a voice to be recognized of a target user, and extracting mark parameter information of the voice to be recognized;
402, performing parameter analysis on the mark parameter information, and determining the format type and attribute information of the voice to be recognized;
403, performing format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after the format conversion;
404, performing age recognition on the voice to be recognized based on a preset vector machine model and characteristic parameters, determining the age bracket of the target user, and extracting voice data corresponding to the age bracket from a preset voice database of a registered user;
405, respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting tone characteristic vectors, comparing the tone characteristic vectors, and judging whether the target user and the registered user are the same person;
406, extracting frame voiceprint features of the speech to be recognized;
when the target user and the registered user are not the same person, the target user can complete voice registration to become a registered user; or, when the target user and a registered user are the same person, the voice can also be registered and added to the voice database of the registered user, enriching the registered user's voice data. In this embodiment, the voiceprint feature may be the Mel-frequency cepstral coefficients (MFCC). Specifically, the characteristic parameter MFCC of the voice to be recognized is extracted directly; since MFCC extraction automatically performs framing and produces the MFCC corresponding to each frame, the frame voiceprint feature corresponding to each frame of the voice to be recognized is obtained. Alternatively, the voice to be recognized may be sliced in units of frames and the MFCC features of the slices extracted separately, which likewise yields the MFCC feature, i.e., the frame voiceprint feature, corresponding to each frame.
407, calculating the posterior probability of the frame voiceprint characteristics based on a preset time delay neural network;
in this embodiment, a Time Delay Neural Network (TDNN) is used to implement the Universal Background Model (UBM), i.e., a TDNN-UBM model is used to calculate the posterior probability. Specifically, the voiceprint features of each frame are used as input, and the posterior probability corresponding to each frame of the voice to be recognized is obtained based on the TDNN-UBM model.
408, calculating the one-hot value of the posterior probability;
409, classifying the frame voiceprint characteristics according to the one-hot value, and identifying the frame voiceprint characteristics according to the classification result;
and classifying the voiceprint features of each frame according to the obtained posterior probability. Calculating a hot unique value (one-hot) of each posterior probability, classifying the frame voiceprint features corresponding to the same hot unique value into the same classification, and recording the hot unique value as a type identifier of the corresponding classification, namely identifying the frame voiceprint features according to the hot unique value.
In this embodiment, the collected voices to be recognized are classified, and the corresponding type identifications are recorded for searching and matching in subsequent voiceprint recognition. The posterior probability corresponding to each frame segment is obtained by means of the TDNN-UBM model, and the frame segments are classified based on the posterior probability, thereby completing the convergence of the voice to be recognized and extracting its key features. Frame segments of the same type are then placed in the same class to obtain more definite identification characteristics, so that more comprehensive verification can be provided for subsequent recognition and the recognition accuracy improved. In addition, when the frame voiceprint features are classified based on the posterior probability, the classification standard can be generated by calculating the one-hot value of the posterior probability, which improves the classification accuracy.
And 410, performing voice registration on the target user according to the identification, and storing the voice to be recognized into a voice database of the registered user.
According to the type identification, the user corresponding to the voice to be recognized is registered by voice: a 1:1 registration interface provided by the medical platform hosted on the server is called, with parameters (the registered user's identification number and voice path) defined in the registration interface; registration is achieved by calling this interface, and when registration is completed, the voice used for registration is stored into the voice database of registered users. Note that the voice stored at registration is not the original voice to be recognized, but the 8 kHz, 16-bit, WAV-format voice obtained after the steps described above (same-person recognition processing, classification and identification).
In the embodiment of the present invention, the steps 401-405 are the same as the steps 101-105 in the first embodiment of the voice-based same person identification method, and are not described herein again.
In the embodiment of the invention, after same-person identification is completed, the frame voiceprint features of the target user's voice are extracted and then classified and identified, so that the target user is registered by voice and stored into the voice database, expanding the registered users and voice data.
With reference to fig. 5, the method for recognizing a person based on voice in the embodiment of the present invention is described above, and a device for recognizing a person based on voice in the embodiment of the present invention is described below, where an embodiment of the device for recognizing a person based on voice in the embodiment of the present invention includes:
an obtaining module 501, configured to obtain a to-be-recognized voice of a target user, and extract flag parameter information of the to-be-recognized voice;
an analysis module 502, configured to perform parameter analysis on the flag parameter information, and determine a format type and attribute information of the speech to be recognized;
a conversion module 503, configured to perform format conversion on the voice to be recognized according to the format type and the attribute information, and extract a feature parameter of the voice to be recognized after the format conversion;
the recognition module 504 is configured to perform age recognition on the speech to be recognized based on a preset vector machine model and the feature parameters, determine an age group of the target user, and extract speech data corresponding to the age group from a preset speech database of a registered user;
a comparison module 505, configured to respectively input the voice data and the voice to be recognized into a preset deep convolutional neural network, output corresponding tone feature vectors, compare the tone feature vectors of the voice data and the voice to be recognized, and determine whether the target user and the registered user are the same person.
In the embodiment of the invention, the voice-based same-person recognition device performs format conversion and age recognition on the voice to be recognized, and then, according to the processing result, selects the voice data of registered users in the same age group as the target user for same-person recognition, which improves the voice recognition rate; tone feature vectors are then extracted from the voices for vector comparison, which improves the accuracy of same-person recognition.
Referring to fig. 6, another embodiment of the speech-based same person recognition apparatus according to the embodiment of the present invention includes:
an obtaining module 501, configured to obtain a to-be-recognized voice of a target user, and extract flag parameter information of the to-be-recognized voice;
an analysis module 502, configured to perform parameter analysis on the flag parameter information, and determine a format type and attribute information of the speech to be recognized;
a conversion module 503, configured to perform format conversion on the voice to be recognized according to the format type and the attribute information, and extract a feature parameter of the voice to be recognized after the format conversion;
the recognition module 504 is configured to perform age recognition on the speech to be recognized based on a preset vector machine model and the feature parameters, determine an age group of the target user, and extract speech data corresponding to the age group from a preset speech database of a registered user;
a comparison module 505, configured to input the voice data and the voice to be recognized into a preset deep convolutional neural network respectively, output corresponding tone feature vectors, compare the tone feature vectors of the voice data and the voice to be recognized, and determine whether the target user and the registered user are the same person.
Wherein the converting module 503 comprises:
a first extracting unit 5031, configured to extract, according to the format type, a sampling rate, a bit rate, and a channel in the attribute information of the speech to be recognized;
a determining unit 5032, configured to determine whether the sampling rate and the bit rate meet preset requirements;
a first converting unit 5033, configured to, if the sampling rate and the bit rate do not meet preset requirements, convert the sampling rate and the bit rate based on a preset conversion rule, and determine whether the channel is a mono channel;
a second converting unit 5034, configured to convert the sound channel of the to-be-recognized speech into a mono channel according to a preset sound channel conversion rule if the sound channel is not a mono channel;
a second extracting unit 5035, configured to extract feature parameters of the voice to be recognized after format conversion, where the feature parameters include a time domain feature parameter and a frequency domain feature parameter.
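A minimal sketch of such a conversion, assuming the pydub library (which wraps ffmpeg) and the 8 kHz, 16-bit, mono WAV target named in the embodiments; the library choice and function name are assumptions, not part of the disclosure:

from pydub import AudioSegment

def to_standard_wav(src_path, dst_path):
    audio = AudioSegment.from_file(src_path)
    if audio.frame_rate != 8000:
        audio = audio.set_frame_rate(8000)  # resample to 8 kHz
    if audio.sample_width != 2:
        audio = audio.set_sample_width(2)   # 16-bit samples
    if audio.channels != 1:
        audio = audio.set_channels(1)       # down-mix to mono
    audio.export(dst_path, format="wav")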
The recognition module 504 is specifically configured to:
performing dimensionality reduction and aggregation processing on the characteristic parameters to obtain age characteristic parameters;
performing age recognition on the voice to be recognized based on a preset vector machine model and the age characteristic parameters to obtain a recognition result;
comparing the recognition result with the recognition rate in the vector machine model, and calculating the confidence coefficient of the recognition result;
determining the age bracket of the target user according to the confidence coefficient;
and extracting voice data corresponding to the age bracket from a preset voice database of the registered user.
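A minimal sketch of this age-recognition step, assuming PCA for the dimensionality reduction and scikit-learn's SVC as the vector machine model; the training data, age-bracket labels, and confidence threshold below are placeholders, not disclosed parameters:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Placeholder training data: rows are feature parameters per utterance,
# labels index age brackets (e.g. 0 = youth, 1 = adult, 2 = senior).
X_train = np.random.rand(300, 40)
y_train = np.random.randint(0, 3, size=300)

pca = PCA(n_components=10).fit(X_train)               # dimensionality reduction
svm = SVC(probability=True).fit(pca.transform(X_train), y_train)

def predict_age_bracket(feature_params, threshold=0.6):
    probs = svm.predict_proba(pca.transform(feature_params.reshape(1, -1)))[0]
    bracket = int(np.argmax(probs))
    # Only accept the bracket when the confidence clears the threshold.
    return bracket if probs[bracket] >= threshold else None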
Wherein the comparing module 505 comprises:
a third extraction unit 5051, configured to extract voiceprint features of the speech data and the speech to be recognized, respectively, based on a preset deep convolutional neural network;
a clustering unit 5052, configured to perform clustering processing on the voiceprint features to obtain tone feature vectors;
a calculating unit 5053, configured to calculate a similarity value between the tone feature vectors, and determine whether the similarity value is not less than a preset tone similarity threshold;
a determining unit 5054, configured to determine that the target user and the registered user are the same person if the similarity value is not less than the preset tone similarity threshold.
The clustering unit 5052 is specifically configured to:
calculating a chroma characteristic value of the voiceprint characteristic, and generating a voiceprint matrix according to the chroma characteristic value;
inputting the voiceprint features into the deep convolutional neural network, and outputting tone feature representations;
and mapping the tone feature representation to a preset feature space, and quantitatively characterizing the tone feature representation in the feature space to obtain a tone feature vector.
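A minimal sketch of the final comparison, assuming cosine similarity as the similarity value between tone feature vectors; the 0.75 threshold is an assumed placeholder, not a disclosed parameter:

import numpy as np

def is_same_person(enrolled_vector, probe_vector, threshold=0.75):
    # Cosine similarity between the registered user's tone feature
    # vector and the target user's tone feature vector.
    cosine = np.dot(enrolled_vector, probe_vector) / (
        np.linalg.norm(enrolled_vector) * np.linalg.norm(probe_vector))
    return cosine >= threshold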
The voice-based same-person recognition device further includes a removing module 506, which is specifically configured to:
respectively performing framing processing on the voice data and the voice to be recognized to obtain audio frames;
extracting short-time energy of the audio frame, and judging whether the short-time energy is smaller than a preset energy threshold value;
if yes, the corresponding audio frame is eliminated.
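A minimal sketch of this silence-removal step, assuming fixed-length frames and a placeholder energy threshold:

import numpy as np

def drop_low_energy_frames(signal, frame_len=200, energy_threshold=1e-4):
    # 200 samples per frame is 25 ms at the 8 kHz sampling rate used
    # after format conversion (an assumed framing choice).
    num_frames = len(signal) // frame_len
    frames = signal[:num_frames * frame_len].reshape(num_frames, frame_len)
    energy = np.mean(frames.astype(np.float64) ** 2, axis=1)  # short-time energy
    return frames[energy >= energy_threshold]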
The voice-based same-person recognition apparatus further includes a registration module 507, which is specifically configured to:
extracting the frame voiceprint characteristics of the voice to be recognized;
calculating the posterior probability of the frame voiceprint characteristics based on a preset time delay neural network;
calculating the one-hot value of the posterior probability;
classifying the frame voiceprint features according to the one-hot value, and identifying the frame voiceprint features according to the classification result;
and performing voice registration on the target user according to the identification, and storing the voice to be recognized into a voice database of the registered user.
In the embodiment of the invention, the voice-based same-person recognition device performs format conversion and age recognition on the voice, extracts the voices of registered users in the same age group as the target user, and compares them with the target user's voice, which improves the voice recognition rate and the accuracy of same-person recognition.
Referring to fig. 7, an embodiment of the voice-based same-person recognition apparatus according to the present invention is described in detail below from the viewpoint of hardware processing.
Fig. 7 is a schematic structural diagram of a voice-based same-person recognition device 700 according to an embodiment of the present invention. The device 700 may include one or more processors (CPUs) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing applications 733 or data 732. The memory 720 and the storage medium 730 may be transient storage or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), and each module may include a series of instruction operations on the voice-based same-person recognition device 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and execute the series of instruction operations in the storage medium 730 on the voice-based same-person recognition device 700.
The voice-based same-person recognition device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input-output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the configuration shown in fig. 7 does not constitute a limitation of the voice-based same-person recognition device, which may include more or fewer components than those shown, may combine some components, or may arrange the components differently.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The present invention also provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium, having stored therein instructions that, when run on a computer, cause the computer to perform the steps of the voice-based same-person recognition method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A voice-based same-person recognition method is characterized by comprising the following steps:
acquiring a voice to be recognized of a target user, and extracting flag parameter information of the voice to be recognized;
performing parameter analysis on the flag parameter information to determine a format type and attribute information of the voice to be recognized;
carrying out format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after format conversion;
performing age recognition on the voice to be recognized based on a preset vector machine model and the characteristic parameters, determining the age group of the target user, and extracting voice data corresponding to the age group from a voice database of a preset registered user;
and respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone feature vectors, comparing the tone feature vectors of the voice data and the voice to be recognized, and judging whether the target user and the registered user are the same person.
2. The method according to claim 1, wherein the converting the format of the voice to be recognized according to the format type and the attribute information, and extracting the feature parameters of the voice to be recognized after the format conversion comprises:
extracting a sampling rate, a bit rate and a sound channel in the attribute information of the voice to be recognized according to the format type;
judging whether the sampling rate and the bit rate meet preset requirements or not;
if the preset requirement is not met, converting the sampling rate and the bit rate based on a preset conversion rule, and judging whether the sound channel of the voice to be recognized is a single sound channel;
if the sound channel is not the single sound channel, converting the sound channel into the single sound channel according to a preset sound channel conversion rule;
and extracting the characteristic parameters of the voice to be recognized after the format conversion, wherein the characteristic parameters comprise time domain characteristic parameters and frequency domain characteristic parameters.
3. The method according to claim 2, wherein the performing age recognition on the speech to be recognized based on a preset vector machine model and the feature parameters, determining an age group of the target user, and extracting speech data corresponding to the age group from a preset speech database of registered users comprises:
performing dimensionality reduction and aggregation processing on the characteristic parameters to obtain age characteristic parameters;
performing age recognition on the voice to be recognized based on a preset vector machine model and the age characteristic parameters to obtain a recognition result;
comparing the recognition result with the recognition rate in the vector machine model, and calculating the confidence coefficient of the recognition result;
determining the age bracket of the target user according to the confidence coefficient;
and extracting voice data corresponding to the age bracket from a preset voice database of the registered user.
4. The method according to claim 3, wherein the respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone feature vectors, comparing the tone feature vectors, and determining whether the target user and the registered user are the same person comprises:
respectively extracting the voice data and the voiceprint features of the voice to be recognized based on a preset deep convolutional neural network;
clustering the voiceprint features to obtain tone feature vectors;
calculating a similarity value between the tone feature vectors, and judging whether the similarity value is not less than a preset tone similarity threshold;
and if so, determining that the target user and the registered user are the same person.
5. The method according to claim 4, wherein the clustering the voiceprint features to obtain a tone feature vector comprises:
calculating a chroma characteristic value of the voiceprint characteristic, and generating a voiceprint matrix according to the chroma characteristic value;
inputting the voiceprint features into the deep convolutional neural network, and outputting tone feature representations;
and mapping the tone feature representation to a preset feature space, and quantitatively characterizing the tone feature representation in the feature space to obtain the tone feature vector.
6. The method according to claim 5, wherein before the respectively extracting, based on the preset deep convolutional neural network, the voiceprint features of the voice data and the voice to be recognized, the method further comprises:
respectively performing framing processing on the voice data and the voice to be recognized to obtain audio frames;
extracting short-time energy of the audio frames, and judging whether the short-time energy is smaller than a preset energy threshold, wherein the short-time energy represents the intensity of an audio frame at different moments;
if yes, the corresponding audio frame is eliminated.
7. The method according to any one of claims 1 to 6, wherein after the respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone feature vectors, comparing the tone feature vectors, and determining whether the target user and the registered user are the same person, the method further comprises:
extracting the frame voiceprint characteristics of the voice to be recognized;
calculating the posterior probability of the frame voiceprint characteristics based on a preset time delay neural network;
calculating the one-hot value of the posterior probability;
classifying the frame voiceprint features according to the one-hot value, and identifying the frame voiceprint features according to the classification result;
and performing voice registration on the target user according to the identification, and storing the voice to be recognized into a voice database of the registered user.
8. A voice-based same-person recognition device, the voice-based same-person recognition device comprising:
the acquisition module is used for acquiring the voice to be recognized of a target user and extracting the flag parameter information of the voice to be recognized;
the analysis module is used for carrying out parameter analysis on the flag parameter information and determining the format type and the attribute information of the voice to be recognized;
the conversion module is used for carrying out format conversion on the voice to be recognized according to the format type and the attribute information and extracting the characteristic parameters of the voice to be recognized after the format conversion;
the recognition module is used for carrying out age recognition on the voice to be recognized based on a preset vector machine model and the characteristic parameters, determining the age bracket of the target user and extracting voice data corresponding to the age bracket from a voice database of a preset registered user;
and the comparison module is used for respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone feature vectors, comparing the tone feature vectors of the voice data and the voice to be recognized, and judging whether the target user and the registered user are the same person.
9. A voice-based same-person recognition device, the voice-based same-person recognition device comprising:
a memory storing instructions and at least one processor, the memory and the at least one processor being interconnected by a line;
the at least one processor invoking the instructions in the memory to cause the voice-based same-person recognition device to perform the steps of the voice-based same-person recognition method according to any one of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of the voice-based same-person recognition method according to any one of claims 1-7.
CN202110836229.9A 2021-07-23 2021-07-23 Voice-based same-person identification method, device, equipment and storage medium Pending CN113555022A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110836229.9A CN113555022A (en) 2021-07-23 2021-07-23 Voice-based same-person identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113555022A 2021-10-26

Family

ID=78132623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110836229.9A Pending CN113555022A (en) 2021-07-23 2021-07-23 Voice-based same-person identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113555022A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036436A (en) * 2018-09-18 2018-12-18 广州势必可赢网络科技有限公司 A kind of voice print database method for building up, method for recognizing sound-groove, apparatus and system
CN109448756A (en) * 2018-11-14 2019-03-08 北京大生在线科技有限公司 A kind of voice age recognition methods and system
CN109920435A (en) * 2019-04-09 2019-06-21 厦门快商通信息咨询有限公司 A kind of method for recognizing sound-groove and voice print identification device
CN110610709A (en) * 2019-09-26 2019-12-24 浙江百应科技有限公司 Identity distinguishing method based on voiceprint recognition
CN111326163A (en) * 2020-04-15 2020-06-23 厦门快商通科技股份有限公司 Voiceprint recognition method, device and equipment
CN112562681A (en) * 2020-12-02 2021-03-26 腾讯科技(深圳)有限公司 Speech recognition method and apparatus, and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114143608A (en) * 2021-11-05 2022-03-04 深圳Tcl新技术有限公司 Content recommendation method and device, computer equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination