CN113823294B - Cross-channel voiceprint recognition method, device, equipment and storage medium - Google Patents
- Publication number: CN113823294B (application number CN202111390613.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
Abstract
The invention provides a cross-channel voiceprint recognition method, device, equipment and storage medium. The method comprises: acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are collected in channels of a set channel set, and the set channel set comprises at least two different channels; and inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint recognition model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the processing result. The cross-channel voiceprint recognition model is obtained by training, through multiple iteration processes, on voiceprint audio data collected in the set channel set, and in each iteration process the model parameters are trained on voiceprint audio data collected in two different channels. The technical scheme of the invention can improve the accuracy of cross-channel voiceprint recognition.
Description
Technical Field
The present invention relates to the field of speech processing and voiceprint recognition technologies, and in particular, to a cross-channel voiceprint recognition method, apparatus, electronic device, and non-transitory computer-readable storage medium.
Background
In recent years, with intensive research on voiceprint recognition technology, voiceprint recognition systems have achieved satisfactory performance under single-channel conditions. In practical applications, however, the voice signal may be transmitted through different channels, such as a network channel or a telephone channel. These channel differences distort the speech signal to different degrees and degrade the performance of the voiceprint recognition system. For example, in the registration phase the user's voice may be collected over the network channel, while in the recognition phase it is picked up over the telephone channel; in that case, voiceprint recognition performance degrades greatly due to channel mismatch. Considering the diversity of voiceprint authentication scenarios, single-channel voiceprint recognition technology greatly limits the popularization and application of voiceprint technology.
Therefore, overcoming the influence of channel variation on recognition performance, so as to improve the performance of voiceprint recognition systems under cross-channel conditions, is a technical problem that currently needs to be solved.
Disclosure of Invention
The invention provides a cross-channel voiceprint recognition method and device, electronic equipment and a non-transitory computer readable storage medium, which are used for solving the problem that cross-channel voiceprint recognition is difficult in the prior art and improving the accuracy of cross-channel voiceprint recognition.
The invention provides a cross-channel voiceprint recognition method, which comprises the following steps: acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels; inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through the voiceprint audio data collected in two different channels in each iteration process.
The cross-channel voiceprint recognition method provided by the invention further comprises a training process of the cross-channel voiceprint recognition model, wherein the training process comprises the following steps: acquiring a sample voiceprint audio data set collected in the set channel set, wherein the set channel set comprises a first channel and a second channel, and the sample voiceprint audio data in the sample voiceprint audio data set are collected in the at least two different channels; selecting sample voiceprint audio data in one channel, calculating a first loss function and an updated intermediate parameter of the sample voiceprint audio data in the channel corresponding to the sample voiceprint audio data, selecting sample voiceprint audio data in another channel except the one channel based on the updated intermediate parameter and the first loss function, calculating a second loss function and an updated model parameter of the sample voiceprint audio data in the channel corresponding to the sample voiceprint audio data, and completing an iteration process; and reselecting sample voiceprint audio data to perform an iterative process until the second loss function is converged to obtain the cross-channel voiceprint recognition model.
According to the cross-channel voiceprint recognition method provided by the invention, the at least two different channels comprise at least one of the following channel classifications: a wireless channel, a wired channel, and a storage channel.
According to the cross-channel voiceprint identification method provided by the invention, the voiceprint audio data to be identified comprise first data collected in a first channel and second data collected in a second channel; after obtaining the voiceprint audio data processing result, the method further includes: acquiring a similarity relation between the first data and the second data according to a voiceprint audio data processing result corresponding to the first data and a voiceprint audio data processing result corresponding to the second data; and identifying whether the first data and the second data are from the same speaker according to the magnitude relation between the similarity relation and a set first threshold value.
According to the cross-channel voiceprint identification method provided by the invention, the voiceprint audio data to be identified comprises third data collected in a first channel; after obtaining the voiceprint audio data processing result, the method further includes: acquiring a similarity relation between the third data and the on-library data according to a voiceprint audio data processing result corresponding to the third data and on-library data in a voiceprint library, wherein the on-library data is obtained according to voiceprint audio data collected in a second channel; selecting fourth data with the maximum similarity with the third data from the database data according to the similarity relation; and identifying whether the third data and the fourth data are from the same speaker according to the magnitude relation between the similarity of the third data and the fourth data and a set second threshold value.
According to the cross-channel voiceprint recognition method provided by the invention, the similarity relation is obtained by calculating the cosine distance or by performing probabilistic linear discriminant analysis.
According to the cross-channel voiceprint recognition method provided by the invention, in each iteration process the intermediate parameters are updated according to the following formula:

θ′ = θ − α·∇_θ L_i(θ; D_i)

wherein L_i(θ; D_i) is the loss function of the model parameters θ on channel i, D_i is the voiceprint audio data collected from channel i, α is the learning rate of the local update, and −α·∇_θ L_i(θ; D_i) is the amount of change in θ. The model parameters θ are then updated according to the following formula:

θ ← θ − β·∇_θ L_j(θ′; D_j)

wherein j ≠ i, L_j(θ′; D_j) is the loss function of the intermediate parameters θ′ on channel j, D_j is the voiceprint audio data collected from channel j, and β is the learning rate of the global update.
The invention provides a cross-channel voiceprint recognition device, which comprises: the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voiceprint audio data to be identified, the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels; the identification unit is used for inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result so as to identify the voiceprint audio data according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through the voiceprint audio data collected in two different channels in each iteration process.
According to the cross-channel voiceprint recognition device provided by the invention, the device further comprises a training unit used for performing a training process on the cross-channel voiceprint recognition model, and the training unit comprises: a first obtaining subunit, configured to obtain a sample voiceprint audio data set collected in the set channel set, where the sample voiceprint audio data in the sample voiceprint audio data set are collected in the at least two different channels; the iteration subunit is configured to select sample voiceprint audio data in one channel, calculate a first loss function and an updated intermediate parameter of the sample voiceprint audio data in a channel corresponding to the iteration subunit, select sample voiceprint audio data in another channel other than the one channel based on the updated intermediate parameter and the first loss function, calculate a second loss function and an updated model parameter of the sample voiceprint audio data in the channel corresponding to the iteration subunit, complete an iteration process, and reselect the sample voiceprint audio data to perform the iteration process until the second loss function converges, so as to obtain the cross-channel voiceprint identification model.
According to the cross-channel voiceprint recognition device provided by the invention, the at least two different channels comprise at least one of the following channel classifications: a wireless channel, a wired channel, and a storage channel.
According to the cross-channel voiceprint recognition device provided by the invention, the voiceprint audio data to be recognized comprise first data collected in the first channel and second data collected in the second channel; the apparatus further includes a first similarity relation determination unit configured to: after the voiceprint audio data processing result is obtained, acquiring the similarity relation between the first data and the second data according to the voiceprint audio data processing result corresponding to the first data and the voiceprint audio data processing result corresponding to the second data; and identifying whether the first data and the second data are from the same speaker according to the magnitude relation between the similarity relation and a set first threshold value.
According to the cross-channel voiceprint recognition device provided by the invention, the voiceprint audio data to be recognized comprise third data collected in the first channel; the apparatus further includes a second similarity relation determination unit configured to: acquiring a similarity relation between the third data and the on-library data according to a voiceprint audio data processing result corresponding to the third data and the on-library data in a voiceprint library, wherein the on-library data is obtained according to the voiceprint audio data collected in the second channel; selecting fourth data with the maximum similarity with the third data from the database data according to the similarity relation; and identifying whether the third data and the fourth data are from the same speaker according to the magnitude relation between the similarity of the third data and the fourth data and a set second threshold value.
According to the cross-channel voiceprint recognition device provided by the invention, the iteration subunit is further configured to: during each iteration process, update the intermediate parameters according to the following formula:

θ′ = θ − α·∇_θ L_i(θ; D_i)

wherein L_i(θ; D_i) is the loss function of the model parameters θ on channel i, D_i is the voiceprint audio data collected from channel i, α is the learning rate of the local update, and −α·∇_θ L_i(θ; D_i) is the amount of change in θ; and update the model parameters θ according to the following formula:

θ ← θ − β·∇_θ L_j(θ′; D_j)

wherein j ≠ i, L_j(θ′; D_j) is the loss function of the intermediate parameters θ′ on channel j, D_j is the voiceprint audio data collected from channel j, and β is the learning rate of the global update.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the cross-channel voiceprint recognition method as described in any of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the cross-channel voiceprint recognition method as described in any of the above.
According to the cross-channel voiceprint recognition method and device, the electronic equipment and the non-transitory computer-readable storage medium provided by the invention, the model training of each iteration process is performed on voiceprint audio data collected in two different channels, so that a cross-channel voiceprint recognition model applicable to different channels is obtained, and the voiceprint audio data to be recognized can be accurately recognized by means of this model.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a cross-channel voiceprint recognition method provided by the present invention;
FIG. 2 is a flow chart illustrating a training process of a cross-channel voiceprint recognition model provided by the present invention;
FIG. 3 is a flow chart of a two iteration process provided by the present invention;
FIG. 4 is a schematic structural diagram of a cross-channel voiceprint recognition apparatus provided by the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the one or more embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the invention. As used in one or more embodiments of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present invention refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used herein to describe various information in one or more embodiments of the present invention, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present invention. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination", depending on the context.
The terms used in the examples of the present invention are explained below:
voiceprint: one type of information in a speech signal is a general term for speech features that characterize the identity of a speaker and speech models built based on these features. Because the different speakers use different vocal organs such as tongue, oral cavity, nasal cavity, vocal cords, lung, etc. in different sizes and forms, and considering the difference of different speakers in age, character, language habit, etc., the characteristics of different speakers such as vocal volume and vocal frequency are greatly different. It can be said that the voiceprint patterns of any two persons are not identical.
Voiceprint recognition: also called speaker recognition, a biometric identification technology that automatically identifies a speaker by computer, using the voiceprint features in a speech signal that represent the speaker's personal information together with various information recognition technologies. Voiceprint recognition is essentially a type of pattern recognition problem. A typical voiceprint recognition system generally consists of two phases: registration and recognition. In registration, a speaker model is trained from the user's enrolled voice; in recognition, the system judges whether an unknown voice comes from a specified speaker.
In the related art, traditional voiceprint recognition technology is based on statistical probability models, the most classical being the Gaussian mixture model plus universal background model (GMM-UBM) architecture. To further enhance the expressive power of speaker characteristics under limited data, various subspace models were proposed in succession, the most notable of which is the i-vector model. The i-vector model introduces an important concept: the speaker characterization vector (speaker embedding), i.e., a continuous vector of fixed length used to characterize the speaker's traits.
In recent years, building on deep learning methods, researchers have proposed a series of voiceprint recognition models, such as the d-vector model and the x-vector model. Such models map a speech signal of arbitrary duration into a continuous vector of fixed length called a deep speaker characterization vector (deep speaker embedding). A space describing speaker characteristics is constructed from these characterization vectors; in this space, scoring and decision making for voiceprint recognition can be carried out.
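The mapping from a variable-duration signal to a fixed-length characterization vector can be illustrated with a minimal sketch; the framing front end and the mean-plus-standard-deviation statistics pooling below are simplified stand-ins for a full d-vector or x-vector network, not the actual model:

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Slice a 1-D waveform into overlapping frames (toy front end)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

def statistics_pooling(frame_vectors):
    """Pool per-frame vectors into one fixed-length utterance vector,
    as x-vector systems do: concatenate mean and standard deviation."""
    return np.concatenate([frame_vectors.mean(axis=0), frame_vectors.std(axis=0)])

rng = np.random.default_rng(0)
short_utt = rng.normal(size=8_000)   # 0.5 s at 16 kHz
long_utt = rng.normal(size=48_000)   # 3 s at 16 kHz

# Regardless of duration, the pooled embedding has the same dimensionality.
e1 = statistics_pooling(frame_features(short_utt))
e2 = statistics_pooling(frame_features(long_utt))
```

Because the pooling step collapses the frame axis, utterances of any length land in the same fixed-dimensional space, which is what makes scoring between them possible.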
For mainstream speaker models, the training goal is usually to maximally distinguish different speakers without considering channel disturbance, which makes them hard to apply effectively to the cross-channel task. To address the cross-channel problem, researchers have conducted a series of studies, which mainly fall into two classes: one is channel adaptation; the other is channel generalization. For channel adaptation, the basic idea is to project channel A into channel B through some mapping function and complete registration and identification on channel B; for channel generalization, the basic idea is to learn a channel-independent space, project both channel A and channel B into that space, and perform registration and identification there.
Owing to channel disturbance, it is difficult for existing cross-channel voiceprint recognition schemes to achieve high recognition accuracy.
To solve this problem, an embodiment of the present invention provides a cross-channel voiceprint recognition scheme. The scheme is a channel robustness optimization method that improves the channel generalization of a voiceprint recognition system and thereby addresses the cross-channel recognition problem. The technical scheme of the embodiment of the invention belongs to the second class above, channel generalization.
The following detailed description of exemplary embodiments of the invention refers to the accompanying drawings.
Fig. 1 is a flowchart illustrating a cross-channel voiceprint recognition method according to an embodiment of the present invention. The method provided by the embodiment of the invention can be executed by any electronic device with computer processing capability, such as a terminal or a server. As shown in fig. 1, the cross-channel voiceprint recognition method includes:
Step 102, acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are collected in channels of a set channel set, and the set channel set comprises at least two different channels.
Step 104, inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint recognition model to obtain a voiceprint audio data processing result, so as to identify the voiceprint audio data according to the processing result.
Specifically, the at least two channels may be a first channel and a second channel with different transmission media.
In particular, the cross-channel voiceprint recognition model is a deep neural network model. The data processing result is the feature vector of the voiceprint audio data to be recognized output by the model, i.e., a speaker characterization vector. According to this speaker characterization vector, two pieces of voiceprint audio data can be compared with each other, or the currently input voiceprint can be compared with the voiceprints in the database, within the space describing speaker characteristics.
In the embodiment of the invention, in the training process of the cross-channel voiceprint recognition model, each iteration part adopts voiceprint audio data in two different channels for training, so that channel generalization can be better realized, and the accuracy is higher during cross-channel voiceprint recognition.
Before step 104, a training process for the cross-channel voiceprint recognition model is further included. As shown in fig. 2, the training process includes:
Step 201, acquiring a sample voiceprint audio data set collected in the set channel set, wherein the sample voiceprint audio data in the sample voiceprint audio data set are collected in the at least two different channels.
Step 202, selecting sample voiceprint audio data in one channel and calculating a first loss function and an updated intermediate parameter in the corresponding channel; then, based on the updated intermediate parameter and the first loss function, selecting sample voiceprint audio data in another channel and calculating a second loss function and updated model parameters, completing one iteration process.
In step 203, it is determined whether the second loss function has converged; if yes, step 204 is executed, and if not, step 202 is executed.
And step 204, obtaining a cross-channel voiceprint recognition model.
In step 202, the operation of updating the intermediate parameters is a local update phase of model parameter update, and the operation of updating the model parameters is a global update phase of model parameter update. The training data for these two phases come from different channels.
In an embodiment of the invention, the at least two different channels comprise at least one of the following channel classes: a wireless channel, a wired channel, and a storage channel.
The two different channels may be different channels in the same category of channels, for example, channels of two different transmission media in a wired channel, or two channels in different categories, for example, one is a wired channel and one is a wireless channel.
In one embodiment, the training data of the two phases come from two different channels. Let D = {D_i, D_j} denote the data sets collected from channel i and channel j, let B_i ⊆ D_i and B_j ⊆ D_j be the subsets (mini-batches) of channel i and channel j used in one iteration, and let θ denote the model parameters of the model being trained.

The local update is performed according to:

θ′ = θ − α·∇_θ L_i(θ; B_i)

wherein L_i(θ; B_i) is the loss function of θ on channel i, B_i is the voiceprint audio data collected from channel i, α is the learning rate of the local update, and −α·∇_θ L_i(θ; B_i) is the amount of change in θ.

The global update is performed according to:

θ ← θ − β·∇_θ L_j(θ′; B_j)

wherein j ≠ i, L_j(θ′; B_j) is the loss function of θ′ on channel j, B_j is the voiceprint audio data collected from channel j, and β is the learning rate of the global update.

In this solution, the model parameters θ are updated only in the global update; the local update merely computes the intermediate parameters θ′ at which the gradient for the global update is evaluated.
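One iteration of this local-plus-global schedule can be sketched numerically; the quadratic per-channel losses and the first-order treatment of the global gradient (evaluating the channel-j gradient at the intermediate parameters and applying it to the model parameters) are illustrative assumptions, not the actual network or losses of this scheme:

```python
import numpy as np

def channel_loss_grad(theta, target):
    """Toy per-channel loss L(theta) = 0.5 * ||theta - target||^2 and its
    gradient; each channel's data is summarised here by one target point."""
    loss = 0.5 * np.sum((theta - target) ** 2)
    return loss, theta - target

def meta_iteration(theta, target_i, target_j, alpha=0.1, beta=0.1):
    # Local update on channel i: theta_prime = theta - alpha * grad L_i(theta).
    _, g_i = channel_loss_grad(theta, target_i)
    theta_prime = theta - alpha * g_i
    # Global update on channel j: the gradient is computed at theta_prime
    # but applied to theta (first-order approximation of the meta-gradient).
    _, g_j = channel_loss_grad(theta_prime, target_j)
    return theta - beta * g_j

theta = np.zeros(2)
target_a = np.array([1.0, 0.0])  # optimum for "channel A" data
target_b = np.array([0.0, 1.0])  # optimum for "channel B" data
for step in range(200):
    # Alternate which channel drives the local vs. the global update.
    if step % 2 == 0:
        theta = meta_iteration(theta, target_a, target_b)
    else:
        theta = meta_iteration(theta, target_b, target_a)
# theta settles near a compromise serving both channels.
```

Because each round updates θ using one channel's gradient evaluated after a step on the other channel, the parameters are pulled toward a region that works for both channels rather than overfitting either one.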
Before step 104, the voiceprint audio data to be recognized also needs to be preprocessed. The preprocessing operation may be a noise reduction operation, a removal of silent-segment data, or both performed together.
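A minimal energy-based silent-segment removal, standing in for a real voice activity detector (the frame sizes and the relative threshold below are illustrative assumptions), might look like:

```python
import numpy as np

def remove_silence(signal, frame_len=400, hop=160, rel_threshold=0.1):
    """Drop low-energy (silent) frames from a waveform.

    A frame is kept when its RMS energy exceeds `rel_threshold` times the
    utterance's peak frame energy -- a simple stand-in for a real VAD.
    """
    frames = [signal[i : i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energies = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    keep = energies > rel_threshold * energies.max()
    kept = [f for f, k in zip(frames, keep) if k]
    return np.concatenate(kept) if kept else signal[:0]

rng = np.random.default_rng(1)
speech = rng.normal(scale=1.0, size=8_000)     # loud middle segment
silence = rng.normal(scale=0.001, size=8_000)  # near-silent head and tail
trimmed = remove_silence(np.concatenate([silence, speech, silence]))
# The near-silent head and tail are discarded, shortening the signal.
```

Note that the overlapping frames are concatenated naively here; a production preprocessor would reassemble non-overlapping audio or operate on feature frames instead.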
In voiceprint recognition technology, two pieces of voiceprint audio data can be compared to determine whether they come from the same speaker, i.e., one-to-one confirmation; alternatively, the voiceprint audio data belonging to the same speaker as the current voiceprint audio data can be found among multiple candidates, i.e., one-to-many recognition.
In one embodiment of the invention, the voiceprint audio data to be identified comprises first data collected in a first channel and second data collected in a second channel; after step 104, one-to-one confirmation of the voiceprint audio data may be performed, and specifically, a similarity relationship between the first data and the second data is obtained according to a voiceprint audio data processing result corresponding to the first data and a voiceprint audio data processing result corresponding to the second data; and identifying whether the first data and the second data are from the same speaker according to the magnitude relation between the similarity relation and the set first threshold value.
This embodiment may be used for one-to-one validation of voiceprint audio data collected under different channels of the same determined user. For example, the voiceprint audio data of the user collected at the mobile phone end is compared with the voiceprint audio data of the same user collected at other equipment for confirmation.
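As an illustration of this one-to-one confirmation, the sketch below scores two hypothetical embeddings with the cosine distance and applies a decision threshold; the embedding values and the 0.6 threshold are assumptions, since a real threshold would be tuned on held-out trial pairs:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(emb_channel1, emb_channel2, threshold=0.6):
    """One-to-one confirmation: same speaker iff the score clears the
    threshold. The 0.6 value is illustrative, not a recommended setting."""
    return cosine_score(emb_channel1, emb_channel2) >= threshold

# Hypothetical embeddings: two channels of one speaker, plus an impostor.
same_a = np.array([0.9, 0.1, 0.4])
same_b = np.array([0.8, 0.2, 0.5])
diff_c = np.array([-0.7, 0.6, -0.2])

accept = verify(same_a, same_b)  # first data vs. second data
reject = verify(same_a, diff_c)  # first data vs. an impostor
```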
In another embodiment of the invention, the voiceprint audio data to be identified comprises third data collected in the first channel; after step 104, identifying the voiceprint audio data, specifically, obtaining a similarity relationship between the third data and the in-library data according to a voiceprint audio data processing result corresponding to the third data and the in-library data in the voiceprint library, wherein the in-library data is obtained according to the voiceprint audio data collected in the second channel; selecting fourth data with the maximum similarity with the third data from the database data according to the similarity relation; and identifying whether the third data and the fourth data are from the same speaker according to the magnitude relation between the similarity of the third data and the fourth data and a set second threshold value.
Wherein the voiceprint library stores voiceprint audio data of a plurality of different speakers of the second channel. This embodiment can be used for one-to-many recognition of voiceprint audio data acquired under different channels of an uncertain user. For example, the voiceprint audio data of a certain user collected at the mobile phone end is compared with the database data collected at other equipment for identification.
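The one-to-many case can be sketched the same way: score the probe against every entry of a hypothetical voiceprint library, keep the best match, and accept it only if it clears the threshold. All names, embedding values, and the threshold below are illustrative:

```python
import numpy as np

def cosine_score(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe, enrolled, threshold=0.6):
    """One-to-many recognition: score the probe against every enrolled
    embedding, take the best match, and accept it only above the threshold.
    Returns (speaker_id or None, best_score)."""
    best_id, best = None, -1.0
    for spk, emb in enrolled.items():
        s = cosine_score(probe, emb)
        if s > best:
            best_id, best = spk, s
    return (best_id if best >= threshold else None), best

library = {  # hypothetical library built from second-channel audio
    "alice": np.array([1.0, 0.0, 0.2]),
    "bob": np.array([0.1, 1.0, -0.3]),
}
probe = np.array([0.9, 0.1, 0.25])  # first-channel probe embedding
match, score = identify(probe, library)
```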
The similarity relation can be obtained by calculating the cosine distance or by performing probabilistic linear discriminant analysis (PLDA). Calculating similarity with the cosine distance is simple; back-end algorithms such as PLDA are slightly more complex but offer higher scoring accuracy.
The channel robustness optimization method provided by the embodiment of the invention has great advantages under cross-channel conditions.
Assume that the order of channel $c_1$ and channel $c_2$ in the training process is not limited. Then the following formula holds:

$$\mathbb{E}[\mathcal{L}] \approx \mathcal{L}_{c_1}(\theta) + \mathcal{L}_{c_2}(\theta) - \alpha\,\nabla_{\theta}\mathcal{L}_{c_1}(\theta)\cdot\nabla_{\theta}\mathcal{L}_{c_2}(\theta) \tag{2}$$
In formula (2), the first term on the right side of the equation, $\mathcal{L}_{c_1}(\theta)+\mathcal{L}_{c_2}(\theta)$, accumulates the loss value of each channel's data in the data set, which is equivalent to the loss value of mixed training over multiple channels. The second term on the right side of the equation can be regarded as a regularization term: the inner product of the gradients of the loss function on the different channels.
In model training, the optimization objective is to minimize the loss function. Optimizing the first term on the right side of the equation makes the model parameters $\theta$ converge gradually; while the first term converges, the second term keeps the gradient directions of the different channels as consistent as possible, that is, the directions coincide and the inner product is maximal. This means that optimizing the objective function ensures, on the one hand, that recognition performance is optimized on each individual channel and, on the other hand, that these per-channel optimizations remain consistent across channels.
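Where the gradient inner-product term comes from can be sketched by a first-order Taylor expansion of the global-update loss around the local update, assuming a small local learning rate $\alpha$ and the per-channel loss notation used above:

```latex
% Local update on channel c_1, then evaluate the loss on channel c_2:
\mathcal{L}_{c_2}\!\bigl(\theta - \alpha \nabla_{\theta}\mathcal{L}_{c_1}(\theta)\bigr)
  \;\approx\; \mathcal{L}_{c_2}(\theta)
  \;-\; \alpha\,\nabla_{\theta}\mathcal{L}_{c_1}(\theta)\cdot\nabla_{\theta}\mathcal{L}_{c_2}(\theta)
% Averaging over the two possible channel orders and adding the
% local-step loss yields the accumulated per-channel losses minus the
% gradient inner product, i.e. the structure of formula (2).
```

Since the inner-product term enters the minimized objective with a negative sign, minimizing the loss maximizes the agreement of the two channels' gradient directions.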
Hereinafter, the training and testing process of the present invention is illustrated taking the 16 kHz voice data of a network channel and the 8 kHz voice data of a telephone channel as an example.
Fig. 3 is a diagram of the iterative process and shows two rounds of iteration, in which $\theta_0$, $\theta_1$ and $\theta_2$ denote the model parameters at each iteration and the solid arrows represent the parameter update direction of each iteration. As indicated by the dashed arrows in the figure, the first round of training includes two steps, a local update and a global update: the first step, the local update, uses the 8 kHz channel data, and the second step, the global update, uses the 16 kHz channel data. Similarly, in the second round of training, the local update uses the 16 kHz channel data and the global update uses the 8 kHz channel data. After a number of rounds of training, the finally optimized model parameters $\theta^{*}$ are obtained.
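One such round can be sketched as follows; the linear model, the squared-error loss, and the names `loss_grad` and `train_round` are illustrative stand-ins for the real voiceprint network, and the learning rates are arbitrary small values:

```python
import numpy as np

def loss_grad(theta, x, y):
    """Loss and gradient of 0.5*||x @ theta - y||^2 on one channel's batch."""
    err = x @ theta - y
    return 0.5 * float(err @ err), x.T @ err

def train_round(theta, batch_local, batch_global, alpha=0.01, beta=0.01):
    """One iteration: local update on one channel's data, then a global
    update of the model parameters using the other channel's data."""
    x_l, y_l = batch_local
    _, g_local = loss_grad(theta, x_l, y_l)
    theta_mid = theta - alpha * g_local            # intermediate parameters
    x_g, y_g = batch_global
    _, g_global = loss_grad(theta_mid, x_g, y_g)   # gradient at the intermediate point
    return theta - beta * g_global                 # global parameter update

# Toy "channels": two feature distributions sharing one underlying mapping.
rng = np.random.default_rng(1)
true_theta = rng.standard_normal(5)
x8 = rng.standard_normal((20, 5)); ch_8k = (x8, x8 @ true_theta)
x16 = rng.standard_normal((20, 5)); ch_16k = (x16, x16 @ true_theta)

theta = np.zeros(5)
for i in range(50):                                # alternate the channel order per round
    a, b = (ch_8k, ch_16k) if i % 2 == 0 else (ch_16k, ch_8k)
    theta = train_round(theta, a, b)
print(loss_grad(theta, *ch_8k)[0])                 # loss shrinks over the rounds
```

Swapping which channel supplies the local step and which supplies the global step on alternate rounds mirrors the 8 kHz/16 kHz alternation described for Fig. 3.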
In the test stage, the 16 kHz voice data of the network channel and the 8 kHz voice data of the telephone channel can each be mapped by the model into the same parameter space, and registration and confirmation or identification are completed in that space.
In an application scenario of the embodiment of the invention, a user registers a voiceprint through an application program on a mobile terminal and consults services through a call center. In this process, the merchant's business system uses voiceprint recognition to authenticate the user's identity and ensure business security. Voice with a sampling rate of 16 kHz is collected through the network channel of the mobile terminal, while voice with a sampling rate of 8 kHz is collected through the telephone channel; comparing the two belongs to cross-channel comparison, namely cross-channel recognition.
The technical scheme of the embodiment of the invention has a simple training process and can easily be ported to various deep learning frameworks. In addition, it guarantees both the optimization of each channel and the consistency of those optimizations, avoiding optimization deviation between different channels and preventing overfitting on particular channels.
As can be seen from formula (2), the method has a solid mathematical foundation, which supports the effectiveness of the scheme. The technical scheme of the embodiment of the invention is not only suitable for the cross-channel voiceprint recognition problem, but can also be generalized to other pattern-recognition applications, such as cross-channel image recognition tasks like face recognition.
According to the cross-channel voiceprint recognition method provided by the invention, model training in each iteration process is carried out by adopting the voiceprint audio data collected in two different channels, so that a cross-channel voiceprint recognition model suitable for different channels can be obtained, and the cross-channel voiceprint recognition model can be used for accurately recognizing the voiceprint audio data to be recognized.
The cross-channel voiceprint recognition device provided by the invention is described below, and the cross-channel voiceprint recognition device described below and the cross-channel voiceprint recognition method described above can be referred to correspondingly.
As shown in fig. 4, an apparatus for cross-channel voiceprint recognition according to an embodiment of the present invention includes:
an obtaining unit 402, configured to obtain voiceprint audio data to be identified, where the voiceprint audio data to be identified is collected in a channel in a set channel set, and the set channel set includes at least two different channels.
The identification unit 404 is configured to input voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and perform voiceprint audio data identification according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in a set channel through a plurality of iteration processes, and model parameters are trained by the voiceprint audio data collected in two different channels in each iteration process.
In the embodiment of the present invention, the training unit is further included for performing a training process on the cross-channel voiceprint recognition model, and the training unit includes: the first acquisition subunit is used for acquiring a sample voiceprint audio data set acquired in a set channel set, wherein the sample voiceprint audio data in the sample voiceprint audio data set are acquired in at least two different channels; the iteration subunit is used for selecting sample voiceprint audio data in one channel, calculating a first loss function and an updated intermediate parameter of the sample voiceprint audio data in the channel corresponding to the iteration subunit, selecting sample voiceprint audio data in another channel except the channel based on the updated intermediate parameter and the first loss function, calculating a second loss function and an updated model parameter of the sample voiceprint audio data in the channel corresponding to the iteration subunit, completing an iteration process, and reselecting the sample voiceprint audio data to perform an iteration process until the second loss function is converged to obtain a cross-channel voiceprint recognition model.
In the embodiment of the invention, the voice print audio data to be identified comprises first data collected in a first channel and second data collected in a second channel; the apparatus further includes a first similarity relation determination unit configured to: after obtaining the voiceprint audio data processing result, acquiring a similarity relation between the first data and the second data according to the voiceprint audio data processing result corresponding to the first data and the voiceprint audio data processing result corresponding to the second data; and identifying whether the first data and the second data are from the same speaker according to the magnitude relation between the similarity relation and the set first threshold value.
In the embodiment of the invention, the voiceprint audio data to be identified comprises third data collected in a first channel; the apparatus further includes a second similarity relation determination unit configured to: acquire a similarity relation between the third data and the in-library data according to the voiceprint audio data processing result corresponding to the third data and the in-library data in the voiceprint library, wherein the in-library data are obtained from voiceprint audio data collected in a second channel; select, from the in-library data and according to the similarity relation, fourth data having the maximum similarity with the third data; and identify whether the third data and the fourth data come from the same speaker according to the magnitude relation between the similarity of the third data and the fourth data and a set second threshold value.
In an embodiment of the invention, the at least two different channels comprise at least one of the following channel classes: a wireless channel, a wired channel, and a storage channel.
The two different channels may be different channels in the same category of channels, for example, channels of two different transmission media in a wired channel, or two channels in different categories, for example, one is a wired channel and one is a wireless channel.
In an embodiment of the present invention, the iteration unit is further configured to: during each iteration, update the intermediate parameters according to the following formula:

$$\theta' = \theta + \Delta\theta, \qquad \Delta\theta = -\alpha\,\nabla_{\theta}\mathcal{L}_{c_i}(\theta; x_{c_i})$$

wherein $\mathcal{L}_{c_i}$ is the loss function of $\theta$ on channel $c_i$, $x_{c_i}$ is the voiceprint audio data collected from channel $c_i$, $\alpha$ is the learning rate of the local update, and $\Delta\theta$ is the amount of change of $\theta$; and update the model parameters according to the following formula:

$$\theta \leftarrow \theta - \beta\,\nabla_{\theta}\mathcal{L}_{c_j}(\theta'; x_{c_j})$$

wherein $\mathcal{L}_{c_j}$ is the loss function of $\theta'$ on channel $c_j$, $x_{c_j}$ is the voiceprint audio data collected from channel $c_j$, and $\beta$ is the learning rate of the global update.
Since each functional module of the cross-channel voiceprint recognition apparatus in the exemplary embodiment of the present invention corresponds to the step of the exemplary embodiment of the cross-channel voiceprint recognition method, for details that are not disclosed in the embodiment of the apparatus of the present invention, please refer to the above-mentioned embodiment of the cross-channel voiceprint recognition method of the present invention.
According to the cross-channel voiceprint recognition device provided by the invention, model training in each iteration process is carried out by adopting the voiceprint audio data collected in two different channels, so that a cross-channel voiceprint recognition model suitable for different channels can be obtained, and the cross-channel voiceprint recognition model can be used for accurately recognizing the voiceprint audio data to be recognized.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor)510, a communication Interface (Communications Interface)520, a memory (memory)530 and a communication bus 540, wherein the processor 510, the communication Interface 520 and the memory 530 communicate with each other via the communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform a cross-channel voiceprint recognition method comprising: acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels; inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through the voiceprint audio data collected in two different channels in each iteration process.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the cross-channel voiceprint recognition method provided by the above methods, the method comprising: acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels; inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through the voiceprint audio data collected in two different channels in each iteration process.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program that when executed by a processor is implemented to perform the cross-channel voiceprint recognition methods provided above, the method comprising: acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels; inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the voiceprint audio data processing result; the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through the voiceprint audio data collected in two different channels in each iteration process.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (9)
1. A cross-channel voiceprint recognition method, comprising:
acquiring voiceprint audio data to be identified, wherein the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels;
inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result, and identifying the voiceprint audio data according to the voiceprint audio data processing result;
the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through voiceprint audio data collected in two different channels in each iteration process;
the training process of the cross-channel voiceprint recognition model comprises the following steps:
acquiring a sample voiceprint audio data set collected in the set channel set, wherein the sample voiceprint audio data in the sample voiceprint audio data set are collected in the at least two different channels;
selecting sample voiceprint audio data in one channel, calculating a first loss function and an updated intermediate parameter of the sample voiceprint audio data in the channel corresponding to the sample voiceprint audio data, selecting sample voiceprint audio data in another channel except the one channel based on the updated intermediate parameter and the first loss function, calculating a second loss function and an updated model parameter of the sample voiceprint audio data in the channel corresponding to the sample voiceprint audio data, and completing an iteration process;
and reselecting sample voiceprint audio data to perform an iterative process until the second loss function is converged to obtain the cross-channel voiceprint recognition model.
2. The method of claim 1, wherein the at least two different channels comprise at least one of the following channel classifications: a wireless channel, a wired channel, and a storage channel.
3. The method according to claim 1, wherein the voiceprint audio data to be identified comprises first data collected in a first channel and second data collected in a second channel; after obtaining the voiceprint audio data processing result, the method further includes:
acquiring a similar relation between the first data and the second data according to a voiceprint audio data processing result corresponding to the first data and a voiceprint audio data processing result corresponding to the second data;
and identifying whether the first data and the second data are from the same speaker according to the magnitude relation between the similarity relation and a set first threshold value.
4. The method of claim 1, wherein the voiceprint audio data to be identified comprises third data collected in a first channel; after obtaining the voiceprint audio data processing result, the method further comprises:

acquiring a similarity relation between the third data and the in-library data according to a voiceprint audio data processing result corresponding to the third data and the in-library data in a voiceprint library, wherein the in-library data are obtained from voiceprint audio data collected in a second channel;

selecting, from the in-library data and according to the similarity relation, fourth data with the maximum similarity with the third data;
and identifying whether the third data and the fourth data are from the same speaker according to the magnitude relation between the similarity of the third data and the fourth data and a set second threshold value.
5. The method according to claim 3 or 4, wherein the similarity relation is obtained by calculating a cosine distance or performing a probabilistic linear discriminant analysis.
6. The method of claim 1, wherein during each of said iteration processes, the intermediate parameters are updated according to the following formula:

$$\theta' = \theta + \Delta\theta, \qquad \Delta\theta = -\alpha\,\nabla_{\theta}\mathcal{L}_{c_i}(\theta; x_{c_i})$$

wherein $\mathcal{L}_{c_i}$ is the loss function of $\theta$ on channel $c_i$, $x_{c_i}$ is the voiceprint audio data collected from channel $c_i$, $\alpha$ is the learning rate of the local update, and $\Delta\theta$ is the amount of change of $\theta$; and the model parameters are updated according to the following formula:

$$\theta \leftarrow \theta - \beta\,\nabla_{\theta}\mathcal{L}_{c_j}(\theta'; x_{c_j})$$

wherein $\mathcal{L}_{c_j}$ is the loss function of $\theta'$ on channel $c_j$, $x_{c_j}$ is the voiceprint audio data collected from channel $c_j$, and $\beta$ is the learning rate of the global update.
7. A cross-channel voiceprint recognition apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voiceprint audio data to be identified, the voiceprint audio data to be identified are acquired in channels in a set channel set, and the set channel set comprises at least two different channels;
the identification unit is used for inputting the voiceprint audio data to be identified into a preset cross-channel voiceprint identification model to obtain a voiceprint audio data processing result so as to identify the voiceprint audio data according to the voiceprint audio data processing result;
the cross-channel voiceprint recognition model is obtained by training voiceprint audio data collected in the set channel set through multiple iteration processes, and model parameters are trained through voiceprint audio data collected in two different channels in each iteration process;
a training unit for performing a training process on the cross-channel voiceprint recognition model, the training unit comprising: a first obtaining subunit, configured to obtain a sample voiceprint audio data set collected in the set channel set, where the sample voiceprint audio data in the sample voiceprint audio data set are collected in the at least two different channels; the iteration subunit is configured to select sample voiceprint audio data in one channel, calculate a first loss function and an updated intermediate parameter of the sample voiceprint audio data in a channel corresponding to the iteration subunit, select sample voiceprint audio data in another channel other than the one channel based on the updated intermediate parameter and the first loss function, calculate a second loss function and an updated model parameter of the sample voiceprint audio data in the channel corresponding to the iteration subunit, complete an iteration process, and reselect the sample voiceprint audio data to perform the iteration process until the second loss function converges, so as to obtain the cross-channel voiceprint identification model.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 6 are implemented when the processor executes the program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111390613.7A CN113823294B (en) | 2021-11-23 | 2021-11-23 | Cross-channel voiceprint recognition method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113823294A CN113823294A (en) | 2021-12-21 |
CN113823294B true CN113823294B (en) | 2022-03-11 |
Family
ID=78919679
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111390613.7A Active CN113823294B (en) | 2021-11-23 | 2021-11-23 | Cross-channel voiceprint recognition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113823294B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||