CN112328994A - Voiceprint data processing method and device, electronic equipment and storage medium

Info

Publication number: CN112328994A
Application number: CN202011289173.1A
Authority: CN
Inventors: 杜诗宣; 任君; 罗超; 胡泓; 李巍
Original assignee: Ctrip Computer Technology Shanghai Co Ltd
Current assignee: Ctrip Computer Technology Shanghai Co Ltd
Priority date: 2020-11-17
Filing date: 2020-11-17
Publication date: 2021-02-05

Abstract

The invention relates to the technical field of voiceprint recognition, and provides a voiceprint data processing method and device, electronic equipment and a storage medium. The voiceprint data processing method comprises the following steps: obtaining a real-time audio stream indicating an operation order and a user identification of the order; obtaining current voiceprint data containing current voiceprint characteristics and current audio quality according to the real-time audio stream; according to the user identification, searching whether first voiceprint data with the user identification as an index exists in a voiceprint database to obtain a first judgment result; if the first judgment result is yes, comparing whether the current voiceprint data is similar to the first voiceprint data or not to obtain a second judgment result; and if the second judgment result is yes, updating the first voiceprint data by the current voiceprint data according to the current audio quality. The invention updates the voiceprint database based on the audio quality of the user's current call, allows the order operation authority to be determined from the user identity recognized by voiceprint, and protects the property and information security of the user.

Description

Voiceprint data processing method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of voiceprint recognition, in particular to a voiceprint data processing method and device, electronic equipment and a storage medium.

Background

Voiceprint recognition is a technique for identity confirmation based on the voice of a user. Because pronunciation organs and pronunciation habits differ between people, even when different users say the same words, their sound waveforms differ, and so does the corresponding voiceprint information. Therefore, voiceprint information can be used to identify the user, and the property and information security of the user is protected.

Particularly in a scenario where a calling user wishes to modify or cancel an order, voiceprint recognition has the advantages of quickly recognizing the user identity and shortening the verification time, so that whether the calling user is the order owner can be quickly confirmed during the call.

However, a premise of voiceprint recognition applications is that there is a voiceprint library that stores the user's accurate voiceprint information. In the prior art, a voiceprint library is formed by extracting voiceprint information of historical call audio of a user. But the resulting voiceprint library suffers from a number of drawbacks.

First, ideal call audio requires the user to record with good recording equipment in a quiet environment to ensure audio quality. However, the historical call audio of a user is easily affected by background noise and signal conditions, and in practice the requirements on the user's environment and equipment cannot be met.

Secondly, compared with physiological characteristics such as irises and fingerprints, which do not change easily, the voice is unstable and changes with factors such as the age and health of the user, so the voiceprint library needs to be updated in time.

Thirdly, if the voiceprint library is updated with newly generated call audio of the user, the updated voiceprint information cannot be guaranteed to be better than the originally stored voiceprint information, because the user's environment varies between calls and the audio quality is unstable.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the invention and therefore may include information that does not constitute prior art that is already known to a person of ordinary skill in the art.

Disclosure of Invention

In view of the above, the present invention provides a voiceprint data processing method, apparatus, electronic device and storage medium, which can update a voiceprint database based on the audio quality of the current call of a user, and can determine an order operation authority based on a user identity identified by a voiceprint, thereby protecting the property and information security of the user.

One aspect of the present invention provides a voiceprint data processing method, including: obtaining a real-time audio stream indicating an operation order and a user identification of the order; obtaining current voiceprint data containing current voiceprint characteristics and current audio quality according to the real-time audio stream; according to the user identification, searching whether first voiceprint data with the user identification as an index exists in a voiceprint database to obtain a first judgment result; if the first judgment result is yes, comparing whether the current voiceprint data is similar to the first voiceprint data or not to obtain a second judgment result; and when the second judgment result is yes, updating the first voiceprint data by the current voiceprint data according to the current audio quality.

In some embodiments, said updating said first voiceprint data with said current voiceprint data comprises: judging whether the current audio quality exceeds the first audio quality of the first voiceprint data; if so, replacing the first voiceprint data with the current voiceprint data; and if not, replacing the first voiceprint data with the weighted average of the current voiceprint data and the first voiceprint data.

In some embodiments, the voiceprint data processing method further comprises: when the second judgment result is negative, searching whether second voiceprint data with the user identification as an index exists in the voiceprint database to obtain a third judgment result, wherein the first voiceprint data corresponds to a common voiceprint record, and the second voiceprint data corresponds to a standby voiceprint record; if so, comparing whether the current voiceprint data is similar to the second voiceprint data to obtain a fourth judgment result; and if so, updating the second voiceprint data by the current voiceprint data according to the current audio quality, and interchanging the first voiceprint data and the updated second voiceprint data.

In some embodiments, the voiceprint data processing method further comprises: if the third judgment result is negative, acquiring a fifth judgment result, confirmed by the customer service, of whether the call user indicating the operation order and the user identification correspond to the same user; and if the fifth judgment result is yes, screening the audio segment with the audio quality higher than the quality threshold value in the real-time audio stream, and storing the current voiceprint data of the screened audio segment as second voiceprint data with the user identifier as an index.

In some embodiments, the voiceprint data processing method further comprises: when the second judgment result, the fourth judgment result or the fifth judgment result is yes, allowing the order to be operated; and if the fourth judgment result or the fifth judgment result is negative, writing the current voiceprint characteristics into a blacklist, and preventing the order from being operated.

In some embodiments, the voiceprint data processing method further comprises: if the first judgment result is negative, whether the call user indicating the operation order and the user identification correspond to the same user is confirmed through the customer service; if so, screening the audio segment with the audio quality higher than the quality threshold value in the real-time audio stream, storing the current voiceprint data of the screened audio segment as first voiceprint data with the user identification as an index, and allowing the order to be operated; and if not, writing the current voiceprint characteristics into a blacklist, and preventing the order from being operated.

In some embodiments, the comparing whether the current voiceprint data is similar to the first voiceprint data comprises: calculating cosine similarity between the current voiceprint feature and the first voiceprint feature; when the cosine similarity exceeds a similarity threshold, judging that the current voiceprint data is similar to the first voiceprint data; and when the cosine similarity is smaller than the similarity threshold, judging that the current voiceprint data is not similar to the first voiceprint data.

In some embodiments, the step of obtaining the current voiceprint features comprises: preprocessing the real-time audio stream to obtain short-time Fourier features; and inputting the short-time Fourier features into a trained voiceprint model to obtain the current voiceprint features, which comprises: performing feature extraction on the short-time Fourier features through a feature extraction layer comprising a convolutional network and a residual network to obtain frame-level audio features; performing feature transformation on the frame-level audio features through a feature transformation layer comprising an average layer, an affine layer and a regularization layer to obtain segment-level audio features; and carrying out vector conversion on the segment-level audio features through an embedding layer to obtain the current voiceprint features.

In some embodiments, the voiceprint model, when trained, further comprises a binary classification network layer connected to the embedding layer, and the training process of the voiceprint model comprises: obtaining a plurality of groups of sample audio streams, wherein each group of sample audio streams corresponds to a user label; preprocessing each group of the sample audio streams to obtain an effective audio segment of each user label; training an initial model comprising the feature extraction layer, the feature transformation layer, the embedding layer and the binary classification network layer by taking the effective audio segment and the user label as initial training data; screening the effective audio segment of each user label based on the initial model to obtain target training data containing the screened effective audio segment and the corresponding user label; and training the initial model according to the target training data to obtain the voiceprint model.

In some embodiments, said screening the effective audio segment of each user label based on said initial model comprises: inputting the short-time Fourier features of the effective audio segment of each user label into the initial model to obtain initial voiceprint features output by the embedding layer of the initial model; and calculating the similarity between the initial voiceprint features of each user label, and screening out the effective audio segments corresponding to the initial voiceprint features whose similarity is higher than a set threshold value.

In some embodiments, the pre-processing each set of the sample audio streams includes: cutting each group of the sample audio streams to obtain a plurality of sample audio segments; and carrying out endpoint detection on each sample audio segment to obtain an effective audio segment with silence and noise filtered.

Another aspect of the present invention provides a voiceprint data processing apparatus comprising: an audio acquisition module configured to acquire a real-time audio stream indicating an operation order and a user identifier of the order; a feature acquisition module configured to acquire current voiceprint data including current voiceprint features and current audio quality according to the real-time audio stream; a first judgment module configured to search, according to the user identification, whether first voiceprint data with the user identification as an index exists in a voiceprint database to obtain a first judgment result; a second judgment module configured to compare, when the first judgment result is yes, whether the current voiceprint data is similar to the first voiceprint data to obtain a second judgment result; and a first updating module configured to update the first voiceprint data with the current voiceprint data according to the current audio quality when the second judgment result is yes.

Yet another aspect of the present invention provides an electronic device, comprising: a processor; a memory having executable instructions stored therein; wherein the executable instructions, when executed by the processor, implement the voiceprint data processing method of any of the above embodiments.

Yet another aspect of the present invention provides a computer-readable storage medium storing a program that when executed implements the voiceprint data processing method of any of the above embodiments.

Compared with the prior art, the invention has the beneficial effects that:

By acquiring a real-time audio stream indicating an operation order and a user identifier of the order, the invention monitors the scenario in which a call user wishes to operate an order; by comparing the current voiceprint data, which contains the voiceprint information and the audio quality, with the historical voiceprint data under the user identification, it accurately judges whether the call user matches the order user.

Further, the historical voiceprint data under the user identification is updated based on the current audio quality, so that the updated voiceprint data is superior to the prestored voiceprint data; and the order operation authority can be determined based on the identified identity of the call user, protecting the property and information security of the order user.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

FIG. 1 is a schematic diagram illustrating steps of a voiceprint data processing method according to an embodiment of the invention;

FIG. 2 is a schematic diagram illustrating the steps of a voiceprint data processing method in a further embodiment of the invention;

FIG. 3 is a schematic diagram illustrating the steps of a voiceprint data processing method in a further embodiment of the invention;

FIG. 4 is a schematic diagram of a network structure of a voiceprint model in an embodiment of the invention;

FIG. 5 is a schematic diagram illustrating a network structure of a voiceprint model during training according to an embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating steps of a training process of a voiceprint model in an embodiment of the invention;

FIG. 7 is a diagram illustrating an exemplary audio quality determination process according to an embodiment of the present invention;

FIG. 8 shows a block schematic diagram of a voiceprint data processing apparatus in an embodiment of the invention;

FIG. 9 shows a schematic structural diagram of an electronic device in an embodiment of the invention; and

FIG. 10 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.

Furthermore, the drawings are merely schematic illustrations of the invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The step numbers in the following embodiments are merely used to indicate different execution contents, and the execution order between the steps is not strictly limited. The use of "first," "second," and similar terms in the detailed description is not intended to imply any order, quantity, or importance, but rather is used to distinguish one element from another. It should be noted that features of the embodiments of the invention and of the different embodiments may be combined with each other without conflict.

Fig. 1 shows the main steps of a voiceprint data processing method in an embodiment, and referring to fig. 1, the voiceprint data processing method in the embodiment includes: in step S110, a real-time audio stream indicating an operation order and a user identifier of the order are obtained; in step S120, obtaining current voiceprint data including current voiceprint characteristics and current audio quality according to the real-time audio stream; in step S130, according to the user identifier, retrieving from the voiceprint database whether there is first voiceprint data indexed by the user identifier, to obtain a first determination result; in step S140, if the first determination result is yes, comparing whether the current voiceprint data is similar to the first voiceprint data to obtain a second determination result; and in step S150, when the second determination result is yes, updating the first voiceprint data with the current voiceprint data according to the current audio quality.

Indicating an operation order means that, during the call, the calling user wishes to perform an operation on the order, such as canceling the order, modifying the order, or querying order information. These operations relate to the property and information security of the user, so the identity of the calling user needs to be verified to prevent the order from being maliciously modified or queried. Conventionally, the customer service checks whether the calling user is the order user by means of one or more simple items of information such as name and order number. The order user is the order owner, and the user identification corresponds to the order user.

With the above voiceprint data processing method, the invention monitors the scenario in which a call user wishes to operate an order by obtaining the real-time audio stream indicating the operation order and the user identification of the order. The current voiceprint data, which contains the voiceprint information and the audio quality, is compared with the first voiceprint data under the user identification to accurately judge whether the call user matches the order user. The information about the call user is embodied by the current voiceprint data, and the information about the order user is embodied by the user identification and the data content indexed by the user identification. Further, the first voiceprint data is updated based on the current audio quality, so that the updated voiceprint data is better than the pre-stored voiceprint data.

In a specific example, the step of updating the first voiceprint data with the current voiceprint data specifically includes: judging whether the current audio quality exceeds the first audio quality of the first voiceprint data; if so, replacing the first voiceprint data with the current voiceprint data; and if not, replacing the first voiceprint data by the weighted average of the current voiceprint data and the first voiceprint data.
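The update rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `VoiceprintData` type, the `current_weight` parameter, and the choice to average the quality score together with the feature vector are all assumptions, since the text does not specify the weights or how the stored quality is updated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoiceprintData:
    """Hypothetical container: a voiceprint feature vector plus a quality score."""
    feature: List[float]
    quality: float  # assumed convention: higher means better audio quality


def update_voiceprint(stored: VoiceprintData,
                      current: VoiceprintData,
                      current_weight: float = 0.5) -> VoiceprintData:
    """Return the data that should replace the stored voiceprint data."""
    # Case 1: the current call has better audio quality, so it replaces the record.
    if current.quality > stored.quality:
        return VoiceprintData(list(current.feature), current.quality)
    # Case 2: otherwise blend the two records with a weighted average.
    # The weight value is an assumption; the text only says "weighted average".
    w = current_weight
    blended = [w * c + (1.0 - w) * s
               for c, s in zip(current.feature, stored.feature)]
    quality = w * current.quality + (1.0 - w) * stored.quality
    return VoiceprintData(blended, quality)
```

For example, with a stored record of quality 0.8 and a current call of quality 0.4, the stored feature is kept in blended form rather than overwritten by the noisier sample.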

Fig. 2 shows the main steps of a voiceprint data processing method in a further embodiment, and referring to fig. 2, on the basis of the above embodiment, the voiceprint data processing method further includes: in step S210, when the second determination result is negative, searching whether second voiceprint data indexed by the user identifier exists in the voiceprint database to obtain a third determination result; in step S220, if the third determination result is yes, comparing whether the current voiceprint data is similar to the second voiceprint data, and obtaining a fourth determination result; and in step S230, if the fourth determination result is yes, updating the second voiceprint data with the current voiceprint data according to the current audio quality, and interchanging the first voiceprint data and the updated second voiceprint data.

The first voiceprint data corresponds to a common voiceprint record, and the second voiceprint data corresponds to a standby voiceprint record. The complete data content indexed by the user identification comprises: first voiceprint data comprising a first voiceprint feature and a first audio quality; and second voiceprint data comprising a second voiceprint feature and a second audio quality. When the voiceprint database is created and two types of voiceprint data belonging to the same user are judged to be dissimilar, the type with the larger number of samples can be registered as the first voiceprint data, and the type with the smaller number as the second voiceprint data. The second voiceprint data may also be voiceprint data that is later identified as belonging to a user but is not similar to the first voiceprint data of that user, and is then additionally stored under the user identification of that user.

Specifically, in a practical scenario, the first voiceprint data corresponds to the voice of the user when speaking normally, and the second voiceprint data corresponds to the voice of the user when the user has a cold; or the first voiceprint data corresponds to the voice of the user at an earlier time, and the second voiceprint data corresponds to the voice of the user after it has changed due to age, physical condition and the like.

When updating the second voiceprint data with the current voiceprint data, referring to the updating process of the first voiceprint data, the method specifically includes: judging whether the current audio quality exceeds a second audio quality of the second voiceprint data; if so, replacing the second voiceprint data with the current voiceprint data; and if not, replacing the second voiceprint data by the weighted average of the current voiceprint data and the second voiceprint data.

Further, when the current voiceprint data of the user is not similar to the first voiceprint data but is similar to the second voiceprint data, the current actual voice of the user is closer to the second voiceprint data. Therefore, the first voiceprint data and the updated second voiceprint data are interchanged, so that the updated second voiceprint data becomes the common voiceprint record and is used first to identify the user in subsequent calls.
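The interchange of the common and standby records can be sketched as below. The `UserRecord` layout with `primary` and `standby` fields is a hypothetical data structure chosen for illustration; the text only states that both records are indexed by the user identification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VoiceprintData:
    feature: List[float]
    quality: float


@dataclass
class UserRecord:
    """Hypothetical per-user entry of the voiceprint database."""
    primary: VoiceprintData            # "first" (common) voiceprint data
    standby: Optional[VoiceprintData]  # "second" (standby) voiceprint data


def promote_standby(record: UserRecord,
                    updated_standby: VoiceprintData) -> UserRecord:
    """Swap roles: the updated standby data becomes the common record,
    and the former common record is kept as the standby record."""
    return UserRecord(primary=updated_standby, standby=record.primary)
```

Keeping the former common record as standby, rather than discarding it, means a user whose voice returns to normal after a cold can still be matched on a later call.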

Fig. 3 shows the main steps of a voiceprint data processing method in a further embodiment, and referring to fig. 3, on the basis of the above embodiment, the voiceprint data processing method further includes: in step S310, when the third determination result is negative, obtaining a fifth determination result, confirmed by the customer service, of whether the call user indicating the operation order and the user identifier correspond to the same user; in step S320, if the fifth determination result is yes, screening out an audio segment with an audio quality higher than the quality threshold in the real-time audio stream, and storing the current voiceprint data of the screened out audio segment as the second voiceprint data indexed by the user identifier.

Manual confirmation of the user's identity serves as a supplement when voiceprint data is missing; the customer service can confirm whether the call user is the order user through one or more items of order information.

Further, if the fifth determination result is negative, step S330 is executed to write the current voiceprint feature into a blacklist and prevent the current call user from operating the order, so that the order is not maliciously tampered with by an unrelated user. If the fourth determination result is negative, step S330 is also executed to write the current voiceprint feature of the calling user into the blacklist and prevent the calling user from operating the order.

In addition, when any of the second determination result, the fourth determination result, and the fifth determination result is yes, it indicates that the calling user and the order user are the same person, and therefore, the calling user may be allowed to operate the order.

Further, when the first determination result is negative, the method further includes: confirming, through the customer service, whether the call user indicating the operation order and the user identifier correspond to the same user; if yes, executing step S340 shown in fig. 3, screening out an audio segment with audio quality higher than a quality threshold in the real-time audio stream, storing current voiceprint data of the screened out audio segment as first voiceprint data with the user identifier as an index, and allowing the call user to operate the order; and if not, writing the current voiceprint features into a blacklist, and preventing the call user from operating the order. Fig. 3 mainly shows the update process of the voiceprint database, and does not specifically list the related process of determining the user's operation authority for the order according to the identified identity of the call user.

According to the above embodiments, during a user's call, the current voiceprint features can be calculated and the current audio quality can be judged in real time from the transmitted audio stream to obtain the current voiceprint data, while the first voiceprint data under the user identification is searched for in the voiceprint database. If the first voiceprint data does not exist, audio segments of poor quality can be discarded and the high-quality current voiceprint data is registered. If the first voiceprint data exists, the similarity between the current voiceprint data and the first voiceprint data is compared, and if the calling user is confirmed to be the order user, the first voiceprint data is updated based on the current audio quality. If the current voiceprint data is judged to be dissimilar to the first voiceprint data, the voiceprint database is further searched for second voiceprint data of the calling user; if the second voiceprint data exists and the identity of the calling user is confirmed, the second voiceprint data is updated based on the current audio quality and exchanged with the first voiceprint data. If the calling user and the order user are judged not to be the same person, the current voiceprint features of the calling user are recorded in a blacklist. If the second voiceprint data does not exist, whether to register second voiceprint data for the calling user is determined according to the customer service's verification of the calling user's identity. In this way, the voiceprint database is accurately updated based on audio quality, and the information security of the order user is protected.
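The five judgment results described in the embodiments can be read as one decision tree. The sketch below encodes that tree under stated assumptions: the `Action` names and the `decide` and `order_allowed` functions are hypothetical, and the boolean inputs stand in for the database lookups, similarity comparisons, and customer-service confirmation that the text describes.

```python
from enum import Enum


class Action(Enum):
    UPDATE_FIRST = "update first voiceprint data"
    UPDATE_SECOND_AND_SWAP = "update second voiceprint data and interchange"
    REGISTER_FIRST = "register current data as first voiceprint data"
    REGISTER_SECOND = "register current data as second voiceprint data"
    BLACKLIST = "write current voiceprint feature into blacklist"


def decide(has_first: bool, similar_to_first: bool,
           has_second: bool, similar_to_second: bool,
           agent_confirmed: bool) -> Action:
    """Map the judgment results onto a database action."""
    if not has_first:                       # first judgment result: no
        return Action.REGISTER_FIRST if agent_confirmed else Action.BLACKLIST
    if similar_to_first:                    # second judgment result: yes
        return Action.UPDATE_FIRST
    if has_second:                          # third judgment result: yes
        if similar_to_second:               # fourth judgment result: yes
            return Action.UPDATE_SECOND_AND_SWAP
        return Action.BLACKLIST             # fourth judgment result: no
    # Third judgment result: no, so fall back to customer-service confirmation
    # (the fifth judgment result).
    return Action.REGISTER_SECOND if agent_confirmed else Action.BLACKLIST


def order_allowed(action: Action) -> bool:
    """The order may be operated in every branch except blacklisting."""
    return action is not Action.BLACKLIST
```

In this reading, registration branches would additionally screen the stream for audio segments above the quality threshold before storing anything; that step is omitted here to keep the control flow visible.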

In the above embodiments, whether two sets of voiceprint data are similar can be determined by calculating the cosine similarity of the two voiceprint features. For example, the step of comparing whether the current voiceprint data is similar to the first voiceprint data comprises: calculating the cosine similarity between the current voiceprint feature and the first voiceprint feature; when the cosine similarity exceeds a similarity threshold, judging that the current voiceprint data is similar to the first voiceprint data; and when the cosine similarity is smaller than the similarity threshold, judging that the current voiceprint data is not similar to the first voiceprint data. Alternatively, the comparison can be implemented through a simple binary classification network: the two sets of voiceprint data, namely the two voiceprint features and the two audio quality scores, are input into the binary classification network, which outputs whether the two sets of voiceprint data are similar.
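A minimal sketch of the cosine-similarity comparison follows. The threshold value 0.7 is an illustrative assumption, as the text does not give one, and treating a similarity exactly equal to the threshold as "not similar" is also an assumption, since the text only covers the "exceeds" and "smaller than" cases.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def is_similar(current: Sequence[float], stored: Sequence[float],
               threshold: float = 0.7) -> bool:
    """True when the similarity strictly exceeds the (assumed) threshold."""
    return cosine_similarity(current, stored) > threshold
```

Because cosine similarity depends only on direction, it is insensitive to the overall scale of the embedding, which suits feature vectors that have been length-normalized by the model.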

In the above embodiments, the voiceprint features are calculated by a voiceprint model. Fig. 4 shows the network structure of the voiceprint model in the embodiment. With reference to fig. 4, the step of obtaining the current voiceprint features includes: preprocessing the real-time audio stream to obtain short-time Fourier features; and inputting the short-time Fourier features into the trained voiceprint model 400 to obtain the current voiceprint features, which specifically comprises: performing feature extraction on the short-time Fourier features through a feature extraction layer 410 comprising a convolutional network and a residual network to obtain frame-level audio features; performing feature transformation on the frame-level audio features through a feature transformation layer 420 comprising an average layer, an affine layer and a regularization layer to obtain segment-level audio features; and performing vector conversion on the segment-level audio features through the embedding layer 430 to obtain the current voiceprint features.

Fig. 5 shows the network structure of the voiceprint model during training in the embodiment. Referring to fig. 5, the voiceprint model during training further includes a binary classification network layer 440 connected to the embedding layer 430. The voiceprint model is an end-to-end model whose input is a short-time Fourier transform (STFT) feature extracted from an audio segment of a preset duration, for example 400 ms, with dimension (400, 101). In the feature extraction layer 410, conv is a convolution layer, 5 × 5 and 3 × 3 are the sizes of the convolution kernels, and 64 and 64 × 3 are the numbers of convolution kernels; adding a dropout layer after the convolution layer can improve the generalization capability. In the feature transformation layer 420, average is an averaging layer, affine is an affine layer, and L2 norm is a regularization layer; the feature transformation layer 420 converts the frame-level audio features into segment-level audio features and also normalizes the audio features. The embedding layer (embedding) 430 outputs a 512-dimensional vector as the voiceprint feature of the input audio segment. Further, the binary classification network layer 440 may perform identity recognition on the generated voiceprint feature and determine whether it corresponds to a preset user label. The binary classification network layer 440 specifically includes a cosine similarity function (cosine similarity) and a cross-entropy loss function (categorical cross-entropy loss).
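To illustrate how an average layer, an affine layer and an L2 norm layer can turn frame-level features into one segment-level vector, here is a small pure-Python sketch. It is not the trained model: the function names, the toy weights and bias, and the reading of "L2 norm" as length normalization of the output vector are assumptions made for illustration; in the model the affine parameters would be learned.

```python
import math
from typing import List


def average_frames(frames: List[List[float]]) -> List[float]:
    """Average layer: mean over the time axis, one value per feature dimension."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def affine(x: List[float], weights: List[List[float]],
           bias: List[float]) -> List[float]:
    """Affine layer: y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def l2_normalize(x: List[float], eps: float = 1e-12) -> List[float]:
    """Scale the vector to unit Euclidean length (assumed meaning of 'L2 norm')."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / max(norm, eps) for v in x]


def frames_to_segment(frames: List[List[float]],
                      weights: List[List[float]],
                      bias: List[float]) -> List[float]:
    """Frame-level features -> one segment-level feature vector."""
    return l2_normalize(affine(average_frames(frames), weights, bias))
```

Averaging over time is what makes the output independent of the number of frames, so segments of different lengths map to vectors of the same size.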

Fig. 6 shows the main steps of the training process of the voiceprint model in the embodiment, and in combination with fig. 6, the training process of the voiceprint model includes:

Step S510, obtain a plurality of groups of sample audio streams, where each group of sample audio streams corresponds to a user tag. For example, in one particular example, the telephone recordings of 5520 speakers are collected.

Step S520, preprocess each group of sample audio streams to obtain the effective audio segments of each user tag. The preprocessing specifically comprises the following steps: cutting each group of sample audio streams to obtain a plurality of sample audio segments; and carrying out endpoint detection on each sample audio segment to obtain effective audio segments with silence and noise filtered out. In one particular example, the transcript and the corresponding timestamps may be obtained from an automatic speech recognition (ASR) service, so that a complete telephone recording is segmented into a plurality of sample audio segments of the speaker's speech. Some sample audio segments may consist mostly of silence or noise, and therefore endpoint detection is carried out on all sample audio segments using voice activity detection (VAD) after cutting.

Specifically, the VAD of this embodiment adopts a clustering method. A 30ms sample audio segment is taken as input; the feature energies, comprising 6 subband energies, are calculated first, and the distribution probabilities of a speech segment and a non-speech segment are then calculated through a Gaussian model to obtain the weighted log-likelihood ratio of the 6 subband energy features. Whether the weighted log-likelihood ratio exceeds a set threshold value is then judged: if so, the sample audio segment is determined to be a speech segment and is kept as an effective audio segment; if the ratio is lower than the set threshold value, the sample audio segment is discarded. With this method, the original information of the speaker can be effectively retained, and the training result is not affected by excessive chopping of the sample audio stream.
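The decision rule above can be sketched as follows. This is a simplified illustration, assuming one Gaussian per class and per subband; the means, standard deviations, weights and threshold are hypothetical values chosen for the example, not parameters disclosed in the embodiment.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log density of a univariate Gaussian, evaluated elementwise."""
    return -0.5 * np.log(2.0 * np.pi * std ** 2) - (x - mean) ** 2 / (2.0 * std ** 2)

def weighted_llr(subband_energy, speech, noise, weights):
    """Weighted log-likelihood ratio over the subbands.

    speech and noise are (mean, std) pairs of arrays, one entry per subband.
    """
    llr = gaussian_logpdf(subband_energy, *speech) - gaussian_logpdf(subband_energy, *noise)
    return float(np.dot(weights, llr))

def is_speech(subband_energy, speech, noise, weights, threshold):
    """Keep the segment only if the weighted ratio exceeds the threshold."""
    return weighted_llr(subband_energy, speech, noise, weights) > threshold

# Hypothetical model parameters for 6 subbands (log-energy domain).
SPEECH = (np.full(6, 8.0), np.full(6, 2.0))
NOISE = (np.full(6, 3.0), np.full(6, 2.0))
WEIGHTS = np.full(6, 1.0 / 6.0)
THRESHOLD = 0.0

loud_segment = np.full(6, 8.5)    # energies close to the speech model
quiet_segment = np.full(6, 2.5)   # energies close to the non-speech model
```

With these values, `is_speech(loud_segment, SPEECH, NOISE, WEIGHTS, THRESHOLD)` keeps the segment, while the quiet segment is discarded.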

Step S530, train an initial model comprising the feature extraction layer, the feature conversion layer, the embedding layer and the two-classification network layer by taking the effective audio segments and the user tags as initial training data. Specifically, the network model shown in fig. 5 is trained with the STFT features extracted from the effective audio segments as input and the corresponding user tags as output.

Step S540, screen the effective audio segments of each user tag based on the initial model to obtain target training data containing the screened effective audio segments and the corresponding user tags. This specifically comprises the following steps: inputting the short-time Fourier features of the effective audio segments of each user tag into the initial model to obtain the initial voiceprint features output by the embedding layer of the initial model; and calculating the similarity between the initial voiceprint features of each user tag, and retaining the effective audio segments corresponding to the initial voiceprint features whose similarity is higher than a set threshold value.

Step S550, train the initial model according to the target training data to obtain the voiceprint model.

Since the training data are obtained from telephone recordings, even after the effective audio segments are obtained it cannot be guaranteed that all effective audio segments under the same user tag correspond to the same speaker. Therefore, in this embodiment, an initial voiceprint model is first trained with the unscreened data; the initial voiceprint model is used to remove, from the effective audio segments under each user tag, the audio segments that obviously do not belong to that user tag; and the screened data are then used for training to obtain a finer voiceprint model.
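One possible way to carry out the screening of step S540 is sketched below. The embodiment does not state how the similarity within a user tag is computed, so this example assumes cosine similarity between each initial voiceprint feature and the centroid of all features under the same tag; the function name, the threshold and the sample vectors are hypothetical.

```python
import numpy as np

def screen_segments(embeddings, threshold):
    """Return indices of segments whose cosine similarity to the
    centroid of the tag's embeddings exceeds the threshold."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    similarity = unit @ centroid
    return [i for i, s in enumerate(similarity) if s > threshold]

# Hypothetical initial voiceprint features under one user tag:
# three segments from the tagged speaker and one from someone else.
embeddings = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.0, 0.1],
    [1.0, 0.0, 0.0],
    [-0.2, 1.0, 0.0],
])
kept = screen_segments(embeddings, threshold=0.6)
```

Here the fourth segment points in a clearly different direction from the other three and is the only one removed; the remaining segments would form the target training data for step S550.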

In the above embodiments, the audio quality is calculated by an audio quality evaluation model. Fig. 7 shows the process of determining the audio quality in the embodiment. Referring to fig. 7, the audio quality evaluation adopts a single-ended objective speech evaluation method, so that the degree of interference from background noise, multiplicative noise, speech truncation and the like, which affect the voiceprint recognition result, can be measured. The method consists of preprocessing, feature parameter estimation and a perceptual mapping model, and measures three classes of distortion: unnaturalness of speech, noise, and speech truncation. After level calibration, the speech signal is subjected to IRS filtering and fourth-order Butterworth high-pass filtering, and a total of 42 speech feature parameters are extracted, among which the key parameters are the pitch period, the peak value of the LP coefficients, the SNR, robotization, EstSegSNR, sharp declines and speech interruptions. Finally, the audio quality evaluation result is obtained through distortion type judgment and result mapping. The audio quality determination process shown in fig. 7 is a conventional technique and thus will not be described in detail.

Based on the above embodiments, consider one particular scenario in which a caller phones an online travel agency and wishes to modify a hotel order. The voiceprint database of the online travel agency pre-stores historical voiceprint data (first voiceprint data/second voiceprint data) of a plurality of users. To create the voiceprint database, users who have had hotel orders and call audio in the last year can be selected; after the uids of suppliers and partner platforms are removed, the number of users is 4 million. The call audio of these 4 million users is passed through the voiceprint model and the audio quality evaluation to obtain the voiceprint features and the audio quality score corresponding to each uid, which are stored with the uid (user identification) as an index.

The voiceprint data processing process comprises the following steps. First, supplier and agent uids are filtered out according to a list, and the real-time audio stream of the calling user is obtained; the real-time audio stream is transmitted in units of 4s, the effective audio segments with silent segments and noise filtered out are obtained through the VAD endpoint detection technique, and audio quality evaluation and STFT feature calculation are carried out to obtain a 400 x 101 matrix and the current audio quality. Second, the current voiceprint features of each effective audio segment are calculated with the trained voiceprint model, the corresponding user identification is searched for in the voiceprint database, and, if historical voiceprint data indexed by the user identification exist in the voiceprint database, the similarity between the historical voiceprint data and the current voiceprint data is calculated. While the real-time audio stream continues to be transmitted, the effective audio segments whose audio quality scores are lower than a threshold value are discarded, the current voiceprint features of all retained effective audio segments whose quality reaches the standard are weighted and averaged to obtain the current voiceprint data, and the current voiceprint data are compared with the historical voiceprint data in the voiceprint database in real time. If the voiceprints are similar, the process goes on to determine how to update the historical voiceprint data; if the voiceprints are not similar, the identity of the calling user is confirmed through customer service verification and the like.

Therefore, by taking the audio quality into account during voiceprint registration, voiceprint verification and voiceprint updating, audio with serious interference is discarded according to the quality score and audio with little loss of speaker information is retained, which can effectively improve the accuracy of voiceprint recognition and protect the information security of the user.
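The quality filtering, weighted averaging and real-time comparison described above can be sketched as follows. This is an illustrative sketch only: it assumes that the quality scores themselves serve as the averaging weights and that the aggregated quality is the mean of the retained scores, neither of which is specified in the embodiment, and all names and thresholds are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def aggregate_current_voiceprint(features, qualities, quality_threshold):
    """Drop segments below the quality threshold, then average the
    remaining features using the quality scores as weights (assumed)."""
    features = np.asarray(features, dtype=float)
    qualities = np.asarray(qualities, dtype=float)
    keep = qualities >= quality_threshold
    if not keep.any():
        return None, None                      # nothing usable yet
    feature = np.average(features[keep], axis=0, weights=qualities[keep])
    return feature, float(qualities[keep].mean())

def is_same_speaker(current, historical, similarity_threshold):
    """Similar if the cosine similarity exceeds the similarity threshold."""
    return cosine_similarity(current, historical) > similarity_threshold

# Hypothetical per-segment features and quality scores from one call.
seg_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
seg_qualities = [4.0, 1.0, 2.0]
current_feature, current_quality = aggregate_current_voiceprint(
    seg_features, seg_qualities, quality_threshold=2.0)
historical_feature = np.array([0.9, 0.1])
```

In this example the second segment is dropped for low quality, and the aggregated feature is then compared with the stored feature via `is_same_speaker(current_feature, historical_feature, 0.8)`.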

In conclusion, the voiceprint data processing method of the invention detects the scenario in which a calling user wishes to operate on an order by obtaining the real-time audio stream indicating the order operation and the user identification of the order; by comparing the current voiceprint data, which contain the voiceprint information and the audio quality, with the historical voiceprint data under the user identification, it accurately judges whether the calling user is the actual owner of the order; further, the historical voiceprint data under the user identification are updated based on the current audio quality, so that the updated voiceprint data are superior to the prestored voiceprint data; and the order operation authority can be determined based on the identified identity of the calling user, protecting the property and information security of the order user.
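The quality-based update of the historical voiceprint data can be illustrated with the rule recited in claim 2: replace the stored data when the current audio quality is higher, otherwise store a weighted average of the two. In the sketch below, the weight `current_weight` and the choice to average the quality score with the same weight are assumptions made for the example, and the function name is hypothetical.

```python
import numpy as np

def update_voiceprint(stored_feature, stored_quality,
                      current_feature, current_quality,
                      current_weight=0.5):
    """Return the (feature, quality) pair to store after a successful match."""
    current = np.asarray(current_feature, dtype=float)
    stored = np.asarray(stored_feature, dtype=float)
    if current_quality > stored_quality:
        # Current call is cleaner: replace the stored data outright.
        return current, float(current_quality)
    # Otherwise blend the current data into the stored data (assumed weights).
    w = current_weight
    feature = w * current + (1.0 - w) * stored
    quality = w * current_quality + (1.0 - w) * stored_quality
    return feature, float(quality)

# Hypothetical stored record and two possible current calls.
replaced = update_voiceprint([1.0, 0.0], 3.0, [0.0, 1.0], 4.0)
blended = update_voiceprint([1.0, 0.0], 3.0, [0.0, 1.0], 2.0)
```

With these inputs the higher-quality call replaces the stored record, while the lower-quality call only shifts it halfway toward the current feature.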

The embodiment of the present invention further provides a voiceprint data processing apparatus, which can be used to implement the voiceprint data processing method described in any of the above embodiments.

Fig. 8 shows the main modules of the voiceprint data processing apparatus in the embodiment. Referring to fig. 8, the voiceprint data processing apparatus 600 in the embodiment includes: an audio acquisition module 610 configured to obtain a real-time audio stream indicating an operation order and a user identification of the order; a feature obtaining module 620 configured to obtain current voiceprint data including current voiceprint features and current audio quality according to the real-time audio stream; a first judging module 630 configured to search, according to the user identification, whether first voiceprint data indexed by the user identification exist in the voiceprint database to obtain a first judgment result; a second judging module 640 configured to compare, if the first judgment result is yes, whether the current voiceprint data are similar to the first voiceprint data to obtain a second judgment result; and a first updating module 650 configured to update the first voiceprint data with the current voiceprint data according to the current audio quality when the second judgment result is yes.

Further, the voiceprint data processing apparatus 600 can also comprise modules for implementing the other steps of any of the voiceprint data processing method embodiments described above. For the specific principle of each module, reference can be made to any of the above embodiments of the voiceprint data processing method, and the description is not repeated here.

The voiceprint data processing apparatus provided by the invention can detect the scenario in which a calling user wishes to operate on an order by obtaining the real-time audio stream indicating the order operation and the user identification of the order; by comparing the current voiceprint data, which contain the voiceprint information and the audio quality, with the historical voiceprint data under the user identification, it accurately judges whether the calling user is the actual owner of the order; further, the historical voiceprint data under the user identification are updated based on the current audio quality, so that the updated voiceprint data are superior to the prestored voiceprint data; and the order operation authority can be determined based on the identified identity of the calling user, protecting the property and information security of the order user.

The embodiment of the present invention further provides an electronic device, which includes a processor and a memory, where the memory stores executable instructions, and when the executable instructions are executed by the processor, the voiceprint data processing method described in any of the above embodiments is implemented.

The electronic device can detect the scenario in which a calling user wishes to operate on an order by obtaining the real-time audio stream indicating the order operation and the user identification of the order; by comparing the current voiceprint data, which contain the voiceprint information and the audio quality, with the historical voiceprint data under the user identification, it accurately judges whether the calling user is the actual owner of the order; further, the historical voiceprint data under the user identification are updated based on the current audio quality, so that the updated voiceprint data are superior to the prestored voiceprint data; and the order operation authority can be determined based on the identified identity of the calling user, protecting the property and information security of the order user.

Fig. 9 is a schematic structural diagram of an electronic device in an embodiment of the present invention. It should be understood that fig. 9 only schematically illustrates the various modules; these modules may be virtual software modules or actual hardware modules, and combining or splitting these modules, or adding other modules, all fall within the scope of the present invention.

As shown in fig. 9, the electronic device 700 is embodied in the form of a general purpose computing device. The components of the electronic device 700 include, but are not limited to: at least one processing unit 710, at least one storage unit 720, a bus 730 connecting the different platform components (including the storage unit 720 and the processing unit 710), a display unit 740, etc.

Wherein the storage unit stores program codes, which can be executed by the processing unit 710, so that the processing unit 710 executes the steps of the voiceprint data processing method described in any of the above embodiments. For example, the processing unit 710 may perform the steps as shown in fig. 1 to 3 and 6.

The storage unit 720 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 7201 and/or a cache memory unit 7202, and may further include a read only memory unit (ROM) 7203.

The storage unit 720 may also include a program/utility 7204 having one or more program modules 7205, such program modules 7205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 730 may be any representation of one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 700 may also communicate with one or more external devices 800, and the external devices 800 may be one or more of a keyboard, a pointing device, a bluetooth device, and the like. These external devices 800 enable a user to interactively communicate with the electronic device 700. The electronic device 700 may also be capable of communicating with one or more other computing devices, including routers, modems. Such communication may occur via an input/output (I/O) interface 750. Also, the electronic device 700 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 760. The network adapter 760 may communicate with other modules of the electronic device 700 via the bus 730. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 700, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.

The embodiment of the present invention further provides a computer-readable storage medium for storing a program, and when the program is executed, the method for processing voiceprint data described in any of the above embodiments is implemented. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the voiceprint data processing method described in any of the embodiments above, when the program product is run on the terminal device.

The computer-readable storage medium can detect the scenario in which a calling user wishes to operate on an order by obtaining the real-time audio stream indicating the order operation and the user identification of the order; by comparing the current voiceprint data, which contain the voiceprint information and the audio quality, with the historical voiceprint data under the user identification, it accurately judges whether the calling user is the actual owner of the order; further, the historical voiceprint data under the user identification are updated based on the current audio quality, so that the updated voiceprint data are superior to the prestored voiceprint data; and the order operation authority can be determined based on the identified identity of the calling user, protecting the property and information security of the order user.

Fig. 10 is a schematic structural diagram of a computer-readable storage medium of the present invention. Referring to fig. 10, a program product 900 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of readable storage media include, but are not limited to: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device, such as through the internet using an internet service provider.

The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims

1. A voiceprint data processing method, comprising:

obtaining a real-time audio stream indicating an operation order and a user identification of the order;

obtaining current voiceprint data containing current voiceprint characteristics and current audio quality according to the real-time audio stream;

according to the user identification, searching whether first voiceprint data with the user identification as an index exists in a voiceprint database to obtain a first judgment result;

if the first judgment result is yes, comparing whether the current voiceprint data is similar to the first voiceprint data or not to obtain a second judgment result; and

when the second judgment result is yes, updating the first voiceprint data with the current voiceprint data according to the current audio quality.

2. The voiceprint data processing method of claim 1, wherein said updating said first voiceprint data with said current voiceprint data comprises:

judging whether the current audio quality exceeds the first audio quality of the first voiceprint data;

if so, replacing the first voiceprint data with the current voiceprint data;

and if not, replacing the first voiceprint data with the weighted average of the current voiceprint data and the first voiceprint data.

3. The voiceprint data processing method according to claim 1, further comprising:

when the second judgment result is negative, searching whether second voiceprint data with the user identification as an index exists in the voiceprint database to obtain a third judgment result, wherein the first voiceprint data corresponds to a common voiceprint record, and the second voiceprint data corresponds to a standby voiceprint record;

if so, comparing whether the current voiceprint data is similar to the second voiceprint data to obtain a fourth judgment result; and

if so, updating the second voiceprint data with the current voiceprint data according to the current audio quality, and interchanging the first voiceprint data and the updated second voiceprint data.

4. The voiceprint data processing method according to claim 3, further comprising:

if the third judgment result is negative, acquiring a fifth judgment result, confirmed by the customer service, of whether the call user indicating the operation order and the user identification correspond to the same user;

and if so, screening the audio segment with the audio quality higher than the quality threshold value in the real-time audio stream, and storing the current voiceprint data of the screened audio segment as second voiceprint data with the user identifier as an index.

5. The voiceprint data processing method according to claim 4, further comprising:

when the second judgment result, the fourth judgment result or the fifth judgment result is yes, allowing the order to be operated;

and if the fourth judgment result or the fifth judgment result is negative, writing the current voiceprint characteristics into a blacklist, and preventing the order from being operated.

6. The voiceprint data processing method according to claim 1, further comprising:

if the first judgment result is negative, whether the call user indicating the operation order and the user identification correspond to the same user is confirmed through the customer service;

if so, screening the audio segment with the audio quality higher than the quality threshold value in the real-time audio stream, storing the current voiceprint data of the screened audio segment as first voiceprint data with the user identification as an index, and allowing the order to be operated;

and if not, writing the current voiceprint characteristics into a blacklist, and preventing the order from being operated.

7. The voiceprint data processing method according to claim 1, wherein said comparing whether the current voiceprint data is similar to the first voiceprint data comprises:

calculating cosine similarity between the current voiceprint feature and the first voiceprint feature;

when the cosine similarity exceeds a similarity threshold, judging that the current voiceprint data is similar to the first voiceprint data;

and when the cosine similarity is smaller than the similarity threshold, judging that the current voiceprint data is not similar to the first voiceprint data.

8. The voiceprint data processing method of claim 1, wherein the step of obtaining said current voiceprint characteristics comprises:

preprocessing the real-time audio stream to obtain short-time Fourier characteristics;

inputting the short-time Fourier features into a trained voiceprint model to obtain the current voiceprint features, wherein the method comprises the following steps:

performing feature extraction on the short-time Fourier features through a feature extraction layer comprising a convolutional network and a residual network to obtain frame-level audio features;

performing feature conversion on the frame-level audio features through a feature conversion layer comprising an average layer, an affine layer and a regularization layer to obtain segment-level audio features; and

carrying out vector conversion on the segment-level audio features through an embedding layer to obtain the current voiceprint features.

9. The voiceprint data processing method according to claim 8, wherein the voiceprint model, when being trained, further comprises a two-classification network layer connected to the embedding layer, and the training process of the voiceprint model comprises:

obtaining a plurality of groups of sample audio streams, wherein each group of sample audio streams corresponds to a user label;

preprocessing each group of the sample audio streams to obtain an effective audio segment of each user label;

training an initial model comprising the feature extraction layer, the feature conversion layer, the embedding layer and the two-classification network layer by taking the effective audio segment and the user label as initial training data;

screening the effective audio segment of each user label based on the initial model to obtain target training data containing the screened effective audio segment and the corresponding user label; and

training the initial model according to the target training data to obtain the voiceprint model.

10. The voiceprint data processing method of claim 9, wherein said screening the effective audio segment of each said user label based on said initial model comprises:

inputting the short-time Fourier characteristics of the effective audio segment of each user tag into the initial model to obtain initial voiceprint characteristics output by an embedding layer of the initial model; and

calculating the similarity between the initial voiceprint features of each user label, and retaining the effective audio segment corresponding to the initial voiceprint feature with the similarity higher than a set threshold value.

11. The voiceprint data processing method of claim 9 wherein said pre-processing each set of said sample audio streams comprises:

cutting each group of the sample audio streams to obtain a plurality of sample audio segments; and

carrying out endpoint detection on each sample audio segment to obtain an effective audio segment with silence and noise filtered out.

12. A voiceprint data processing apparatus comprising:

an audio acquisition module configured to acquire a real-time audio stream indicating an operation order and a user identification of the order;

a feature acquisition module configured to acquire current voiceprint data including current voiceprint features and current audio quality according to the real-time audio stream;

a first judgment module configured to search, according to the user identification, whether first voiceprint data with the user identification as an index exists in a voiceprint database to obtain a first judgment result;

a second judgment module configured to compare, when the first judgment result is yes, whether the current voiceprint data is similar to the first voiceprint data to obtain a second judgment result; and

a first updating module configured to update the first voiceprint data with the current voiceprint data according to the current audio quality when the second judgment result is yes.

13. An electronic device, comprising:

a processor;

a memory having executable instructions stored therein;

wherein the executable instructions, when executed by the processor, implement a voiceprint data processing method according to any one of claims 1 to 11.

14. A computer-readable storage medium storing a program, wherein the program, when executed, implements the voiceprint data processing method according to any one of claims 1 to 11.