CN114429768A - Training method, device, equipment and storage medium for speaker log model

Info

Publication number: CN114429768A (granted publication: CN114429768B)
Application number: CN202210177866.4A
Country/Authority: China (CN)
Original language: Chinese (zh)
Inventor: 罗艺
Assignee: Tencent Technology Shenzhen Co Ltd
Legal status: Active (granted)

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/22Interactive procedures; Man-machine interfaces
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/18Artificial neural networks; Connectionist approaches


Abstract

The present application discloses a training method, apparatus, device and storage medium for a speaker log model, belonging to the field of artificial intelligence. The method comprises the following steps: acquiring a feature sequence and a real label of a sample speech signal; obtaining an estimated attractor sequence from the feature sequence; inputting the feature sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker class probability; calculating a first loss function value based on the estimated speaker class probability and the real label; and updating the model parameters based on the first loss function value. With this method, the trained speaker log model can achieve higher speech-signal recognition accuracy and therefore generate more accurate speaker logs.

Description

Training method, device, equipment and storage medium for speaker log model
Technical Field
The embodiments of the present application relate to the field of artificial intelligence, and in particular to a training method, apparatus, device and storage medium for a speaker log model.
Background
A speaker log identifies the speaking stages of different speakers in a collected speech signal so as to determine which speaker corresponds to each segment of speech, thereby assisting a speaker recognition system in recognizing each speaker. It is applied in various scenarios such as conference recording and customer-service supervision.
In the related art, when a speaker log is generated for a speech signal, a trained speaker log model separately computes an estimated speaker class probability and an estimated speaker number probability for the speech signal; the speaker classes are determined from the estimated speaker class probability, and the number of speakers is determined from the estimated speaker number probability.
In the related art, training the speaker log model requires calculating a first loss function value from the estimated speaker class probability and a second loss function value from the estimated speaker number probability. The resulting training effect is poor, so speaker recognition accuracy is low and the speaker log is generated with low accuracy.
Disclosure of Invention
The present application provides a training method, apparatus, device and storage medium for a speaker log model, which can improve the accuracy of speaker log generation. The technical solution is as follows:
According to an aspect of the present application, there is provided a training method for a speaker log model, the method including:
acquiring a feature sequence and a real label of a sample speech signal, wherein the real label is a label representing the real speaker classes;
obtaining an estimated attractor sequence from the feature sequence, wherein each attractor in the estimated attractor sequence characterizes one speaker class;
inputting the feature sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker class probability, which is the probability of each speaker class estimated by the speaker log model;
calculating a first loss function value based on the estimated speaker class probability and the real label;
updating model parameters of the speaker log model based on the first loss function value.
According to an aspect of the present application, feature extraction is performed on the speech features of the sample speech signal through a non-negative function in the feature extraction network of the speaker log model to obtain extracted features;
and the values of the extracted features are normalized to obtain the feature sequence of the sample speech signal.
According to an aspect of the present application, there is provided a speaker recognition method, the method including:
acquiring a feature sequence of a speech signal;
obtaining an estimated attractor sequence from the feature sequence, wherein each attractor in the estimated attractor sequence characterizes one speaker class;
inputting the feature sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker class probability, which is the probability of each speaker class estimated by the speaker log model;
and determining the speaker classes and the number of speakers corresponding to the speech signal based on the estimated speaker class probability.
According to an aspect of the present application, the speaker log model further includes a feature extraction network, and the method further includes:
acquiring speech features of the speech signal, wherein the speech features are time-frequency feature data of the speech signal;
and performing feature extraction on the speech features through the feature extraction network to obtain the feature sequence of the speech signal.
According to an aspect of the present application, feature extraction is performed on the speech features through a non-negative function in the feature extraction network of the speaker log model to obtain extracted features;
and the values of the extracted features are normalized to obtain the feature sequence of the speech signal.
According to an aspect of the present application, there is provided a training apparatus for a speaker log model, the apparatus including:
a first acquisition module, configured to acquire a feature sequence and a real label of a sample speech signal, the real label being a label representing the real speaker classes;
a second acquisition module, configured to obtain an estimated attractor sequence from the feature sequence, wherein each attractor in the estimated attractor sequence characterizes one speaker class;
an estimation module, configured to input the feature sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker class probability, which is the probability of each speaker class estimated by the speaker log model;
a calculation module, configured to calculate a first loss function value based on the estimated speaker class probability and the real label;
and an updating module, configured to update the model parameters of the speaker log model based on the first loss function value.
According to an aspect of the present application, there is provided a speaker recognition apparatus, the apparatus including:
a first acquisition module, configured to acquire a feature sequence of a speech signal;
a second acquisition module, configured to obtain an estimated attractor sequence from the feature sequence, wherein each attractor in the estimated attractor sequence characterizes one speaker class;
an estimation module, configured to input the feature sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker class probability, which is the probability of each speaker class estimated by the speaker log model;
and a recognition module, configured to determine the speaker classes and the number of speakers corresponding to the speech signal based on the estimated speaker class probability.
According to another aspect of the present application, there is provided a computer device comprising a processor and a memory, wherein at least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor to implement the training method for the speaker log model or the speaker recognition method described above.
According to another aspect of the present application, there is provided a computer storage medium having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor to implement the training method of the speaker log model or the speaker recognition method as described above.
According to another aspect of the present application, there is provided a computer program product comprising a computer program stored in a computer readable storage medium; the computer program is read from the computer readable storage medium and executed by a processor of a computer device, so that the computer device executes the training method of the speaker log model or the speaker recognition method as described above.
The beneficial effects brought by the technical solution provided by the present application include at least the following:
The feature sequence and the real label of a sample speech signal are acquired, and an estimated attractor sequence is obtained from the feature sequence; the estimated speaker class probability is obtained from the feature sequence and the estimated attractor sequence, and a first loss function value is calculated based on the estimated speaker class probability and the real label; the model parameters of the speaker log model are updated based on the first loss function value.
Because the first loss function value is calculated from the estimated speaker class probability and the real label, and the model parameters of the speaker log model are updated with this value, the trained speaker log model can achieve higher speech-signal recognition accuracy and therefore generate more accurate speaker logs.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a method for training a speaker log model according to an exemplary embodiment of the present application;
FIG. 2 is an architectural diagram of a computer system provided in an exemplary embodiment of the present application;
FIG. 3 is a flowchart of a method for training a speaker log model provided by an exemplary embodiment of the present application;
FIG. 4 is a flowchart of a method for training a speaker log model provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a structure for calculating estimated speaker class probabilities according to an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a structure for calculating a probability of an ideal speaker class according to an exemplary embodiment of the present application;
FIG. 7 is a flowchart of a speaker recognition method provided by an exemplary embodiment of the present application;
FIG. 8 is a flow chart of a method for speaker recognition provided by an exemplary embodiment of the present application;
FIG. 9 is a schematic structural diagram of a speaker recognition method according to an exemplary embodiment of the present application;
FIG. 10 is a block diagram of a training apparatus for a speaker log model provided in an exemplary embodiment of the present application;
FIG. 11 is a block diagram of a speaker recognition device according to an exemplary embodiment of the present application;
FIG. 12 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the embodiments of the present application are described in detail below with reference to the accompanying drawings.
The embodiment of the application provides a technical scheme of a training method of a speaker log model, and as shown in fig. 1, the method can be executed by computer equipment, and the computer equipment can be a terminal or a server.
Illustratively, the computer device obtains the feature sequence 103 and the real label 102 of a sample speech signal 101.
The real label 102 is a label, obtained through normalization, that represents the real speaker classes; a row in the real label 102 represents a speaker class and a column represents a timestamp. The real label 102 in fig. 1 can be read as follows: only the first speaker speaks in the first timestamp, the second and third speakers speak in the second timestamp, only the third speaker speaks in the third timestamp, and the first speaker speaks in the fourth timestamp.
The sample speech signal 101 is a speech sample on which speaker information is to be recognized. The feature sequence 103 consists of the feature vectors corresponding to each timestamp in the sample speech signal 101, such as T1, T2, T3, ..., Tt, where T1 represents the feature vector corresponding to the first timestamp.
The computer equipment inputs the characteristic sequence 103 into an encoder 104 in a speaker log model, and a neural network in the encoder 104 encodes the characteristic sequence 103 to obtain a speaker characteristic vector; and inputting the speaker feature vector to a decoder 105 in the speaker log model, and decoding the speaker feature vector by a neural network in the decoder 105 to obtain an estimated attractor sequence 106.
The speaker log model is used to identify the speaker classes and/or the number of speakers corresponding to each time point in a speech signal; it detects which speaker corresponds to each segment of speech by distinguishing the speaking stages of different speakers in the signal. It should be noted that "identity" here means the speaker log model can tell different speakers apart at each time point; it cannot determine the specific identity of the speaker at each time point. For example, if two people are speaking during the third second of the speech signal, the speaker log model can only determine that two people, A and B, are speaking in the third second; it cannot determine whether A is "Zhang San" or "Li Si".
In one possible implementation, the specific speaker identity corresponding to each time point is determined by prerecording each speaker's voice in a voiceprint recognition model and comparing the speech at each time point in the speech signal against the prerecorded speaker voices in that model.
For example, in a conference-recording scenario, speaker recognition is performed on the audio recorded during the conference through the speaker log model to obtain the speaker recognition result of the speech signal, from which the speaker log of the conference is generated; for example, C and D speak simultaneously from 3 min 5 s to 4 min 2 s, and only C speaks from 6 min 30 s to 8 min 2 s. When the conference content is reviewed later, the speech of a target speaker can be listened to selectively according to the speaker log; for example, to hear only C's speech, it suffices to listen to 3 min 5 s to 4 min 2 s and 6 min 30 s to 8 min 2 s.
The speaker feature vector is a vector representing a speech feature of the speaker obtained by encoding the feature sequence 103 by the encoder 104.
The estimated attractor sequence 106 refers to the estimated speaker characteristics that the speaker log model derives from the feature sequence 103, such as S1, S2, S3, ..., Sn, where each attractor in the estimated attractor sequence 106 characterizes one speaker class.
For example, suppose the sample speech signal 101 contains the speech of three persons A, B and C plus some noise. The feature sequence 103 of the sample speech signal 101 is processed by the encoder-decoder to obtain the estimated attractor sequence 106 corresponding to the sample speech signal 101, which contains 4 attractors: S1, S2, S3 and S4. The computer device assigns the human-voice parts of the sample speech signal 101 to the different attractors S1, S2 and S3, and assigns the non-human-voice part of the sample speech signal 101 uniformly to the attractor S4.
The computer device calculates the similarity between each feature in the feature sequence 103 and each attractor in the estimated attractor sequence 106 by vector dot product, and inputs the similarity results of the feature sequence 103 and the estimated attractor sequence 106 into the classifier network 107 in the speaker log model for calculation, obtaining the estimated speaker class probability 108.
The estimated speaker class probability 108 refers to the probability of the speaker class estimated by the speaker log model. One row of the estimated speaker class probabilities 108 represents a speaker class and one column represents a timestamp, and taking the first column of the estimated speaker class probabilities 108 of fig. 1 as an example, the first column can be expressed as: the probability of the first speaker speaking in the first time stamp is 0.71, the probability of the second speaker speaking in the first time stamp is 0.13, the probability of the third speaker speaking in the first time stamp is 0.15, and the probability of the fourth speaker speaking in the first time stamp is 0.01.
The classifier network 107 is used to determine the probability of the speaker class corresponding to each timestamp.
The computer device calculates a first loss function value 109 based on the estimated speaker class probability 108 and the real label 102.
Illustratively, the computer device derives the ideal attractor sequence 110 based on the product of the feature sequence 103 and the real tag 102.
The ideal attractor sequence 110 refers to the real speaker characteristics calculated by the speaker log model from the feature sequence and the real label 102, such as Q1, Q2, ..., Qm, where each attractor in the ideal attractor sequence characterizes one speaker class.
For example, suppose the sample speech signal 101 contains the speech of three persons A, B and C plus some noise. Multiplying the feature sequence 103 of the sample speech signal 101 by the real label 102 yields the ideal attractor sequence 110 corresponding to the sample speech signal 101, which contains 3 attractors: Q1, Q2 and Q3. The computer device assigns the human-voice parts of the sample speech signal 101 to the different attractors Q1, Q2 and Q3, and assigns the non-human-voice part of the sample speech signal 101 to an invalid attractor, whose corresponding vector is the all-zero vector.
The computer device calculates a similarity between each feature in the sequence of features 103 and each attractor in the sequence of ideal attractors 110 by a vector dot product; and inputting the similarity results of the characteristic sequence 103 and the ideal attractor sequence 110 into a classifier network 107 in a speaker log model for calculation to obtain the ideal speaker class probability 111.
The ideal speaker class probability 111 refers to the probability of the true speaker class based on the true tag. One row of the ideal speaker class probabilities 111 represents a speaker class and one column represents a timestamp.
The computer device calculates a second loss function value 112 based on the ideal speaker class probability 111 and the real label 102.
The computer device updates the model parameters of the speaker log model based on the sum of the first loss function value 109 and the second loss function value 112, thereby obtaining a trained speaker log model.
In summary, in the method provided in this embodiment, the first loss function value is calculated according to the estimated speaker class probability and the real tag, the second loss function value is calculated according to the ideal speaker class probability and the real tag, and the model parameters of the speaker log model are updated according to the first loss function value and the second loss function value, so that the trained speaker log model can have higher speech signal recognition accuracy, and thus a more accurate speaker log is generated.
Fig. 2 shows an architectural diagram of a computer system provided in an embodiment of the present application. The computer system may include: a terminal 100 and a server 200.
The terminal 100 may be an electronic device such as a mobile phone, a tablet computer, a vehicle-mounted terminal (car machine), a wearable device, a personal computer (PC), a smart voice interaction device, a smart home appliance, an aircraft, or an unmanned terminal. A client running a target application may be installed in the terminal 100; the target application may be an application that supports voice recording or another application that provides a voice recording function, which is not limited in this application. The form of the target application is also not limited in this application; it may include, but is not limited to, an application (App) installed in the terminal 100, an applet, or a web page.
The server 200 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The server 200 may be the background server of the target application, configured to provide background services for the client of the target application.
Cloud technology is a hosting technology that unifies hardware, software, network and other resources within a wide-area or local-area network to realize the computation, storage, processing and sharing of data. It is the general term for the network technology, information technology, integration technology, management-platform technology, application technology and so on applied in the cloud-computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: background services of technical network systems, such as video websites, picture websites and other web portals, require a large amount of computing and storage resources. With the development of the internet industry, each article may have its own identification mark that needs to be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industry data need strong system background support, which can only be realized through cloud computing.
In some embodiments, the servers described above may also be implemented as nodes in a blockchain system. The Blockchain (Blockchain) is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. The block chain, which is essentially a decentralized database, is a string of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, which is used for verifying the validity (anti-counterfeiting) of the information and generating a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The terminal 100 and the server 200 may communicate with each other through a network, such as a wired or wireless network.
In the training method or the speaker recognition method for the speaker log model provided by the embodiments of the present application, each step may be executed by a computer device, i.e., an electronic device with data computation, processing and storage capabilities. Taking the implementation environment shown in fig. 2 as an example, the terminal 100 may execute the training method or the speaker recognition method (for example, the client of the target application installed and running in the terminal 100 executes it), the server 200 may execute it, or the terminal 100 and the server 200 may execute it interactively, which is not limited in this application.
FIG. 3 is a flowchart of a method for training a speaker log model according to an exemplary embodiment of the present application. The method may be performed by a computer device, which may be the terminal 100 or the server 200 in fig. 2. The method comprises the following steps:
step 302: and acquiring a characteristic sequence and a real label of the sample voice signal.
The real label is a label representing the real speaker classes. A row in the real label represents a speaker class, and a column represents a timestamp.
For example, the real label can be a 3x3 matrix whose rows correspond to the first, second and third speakers and whose columns correspond to timestamps t, t+1 and t+2. Taking the first column as an example: a 0 indicates that the first speaker does not speak within timestamp t, and the two 0.5 values indicate that the second and third speakers speak simultaneously within timestamp t. Taking the third column as an example: the 1 indicates that only the third speaker speaks within timestamp t+2.
The sample voice signal is a voice sample to be subjected to speaker information recognition.
The feature sequence is a feature vector corresponding to each time stamp in the sample speech signal.
For example, the feature sequence can be written as a matrix whose first column is the feature vector corresponding to timestamp t and whose second column is the feature vector corresponding to timestamp t+1; the number of rows of the feature sequence equals the dimension of the feature vector for each timestamp.
Wherein the manner of obtaining the sample speech signal includes at least one of the following:
1. The computer device receives the sample speech signal. For example, a terminal that initiates a customer-service call records the call content during the customer-service stage and, after recording ends, sends the speech signal of the call content to the server for recognition.
2. The computer device obtains the sample speech signal from a stored database. For example, in a conference application, the conference is recorded, the speech of each speaker is stored on the server, and when speaker recognition is needed, the recorded conference speech signal is retrieved from the stored conference recordings for recognition.
It should be noted that the above-mentioned manner of obtaining the sample speech signal is only an illustrative example, and the embodiment of the present application does not limit this.
Step 304 a: and obtaining an estimated attractor sequence according to the characteristic sequence.
The estimated attractor sequence refers to the estimated speaker characteristics that the speaker log model derives from the feature sequence; that is, the speaker log model estimates the speaker characteristics from the features in the feature sequence, for example S1, S2, S3, ..., Sn, where each attractor in the estimated attractor sequence characterizes one speaker class.
For example, suppose the sample speech signal contains the speech of three persons A, B and C plus some noise. An estimated attractor sequence is obtained from the feature sequence of the sample speech signal; it contains 4 attractors, S1, S2, S3 and S4. The computer device assigns the human-voice parts of the sample speech signal to the different attractors S1, S2 and S3, and assigns the non-human-voice part of the sample speech signal uniformly to the attractor S4.
Step 306: and inputting the characteristic sequence and the estimated attractor sequence into a speaker log model to obtain the estimated speaker classification probability.
The estimated speaker class probability refers to the probabilities of the speaker classes at different timestamps, as estimated by the speaker log model. The estimated speaker class probability is a two-dimensional matrix, where a row of the matrix represents a speaker class and a column of the matrix represents a timestamp; e.g., the first probability value in the first row represents the probability that the first speaker speaks in the first timestamp.
In the case of obtaining the estimated attractor sequence, the computer device inputs the feature sequence and the estimated attractor sequence into a speaker log model, thereby obtaining an estimated speaker class probability.
Step 308: a first loss function value is calculated based on the estimated speaker class probability and the true label.
In the case of obtaining the estimated speaker class probability, the computer device calculates a first loss function value based on the estimated speaker class probability and the true label.
Optionally, the first loss function value is at least one of a cross entropy between the estimated speaker class probability and the real tag, a mean square error between the estimated speaker class probability and the real tag, and an absolute difference between the estimated speaker class probability and the real tag, but is not limited thereto.
Step 310: model parameters of the speaker log model are updated based on the first loss function value.
Illustratively, the computer device updates model parameters of the speaker log model based on the first loss function value to obtain a trained speaker log model.
The updating of the model parameters refers to updating network parameters in the speaker log model, or updating network parameters of each network module in the model, or updating network parameters of each network layer in the model, but is not limited thereto, and the embodiment of the present application does not limit this.
The model parameters of the speaker log model comprise at least one of network parameters of a feature extraction network, network parameters of an encoder, network parameters of a decoder and network parameters of a classifier network in the speaker log model.
Under the condition of obtaining the first loss function value, the computer equipment updates network parameters of the feature extraction network, the encoder, the decoder and the classifier network in the speaker log model based on the first loss function value to obtain the updated feature extraction network, the encoder, the decoder and the classifier network, so that the trained speaker log model is obtained.
In summary, in the method provided in this embodiment, an estimated attractor sequence is obtained according to a feature sequence by obtaining the feature sequence and the real tag of the sample voice signal; obtaining estimated speaker class probability according to the characteristic sequence and the estimated attractor sequence, and calculating a first loss function value based on the estimated speaker class probability and the real label; and updating the model parameters of the speaker log model according to the first loss function value, so that the trained speaker log model can have higher speech signal identification precision, and a more accurate speaker log is generated.
FIG. 4 is a flowchart of a method for training a speaker log model according to an exemplary embodiment of the present application. The method may be performed by a computer device, which may be the terminal 100 or the server 200 in fig. 2. The method comprises the following steps:
step 402: and acquiring a characteristic sequence and a real label of the sample voice signal.
The real label refers to a label representing a real speaker category. One row in the real label represents a speaker class and one column represents a time stamp.
The sample voice signal is a voice sample to be subjected to speaker information recognition.
The real label is a label which is obtained by normalization processing and used for representing the category of a real speaker, the real label is a two-dimensional matrix, one row in the matrix represents one speaker category, and one column in the matrix represents one time stamp.
The real label $\tilde{M}$ is obtained by normalizing each column of the label $M$ before normalization:

$$\tilde{M}_{s,t} = \frac{M_{s,t}}{\sum_{s'=1}^{S} M_{s',t}}$$

where $\tilde{M}, M \in \mathbb{R}^{S \times T}$, $\mathbb{R}$ denotes the real numbers, T is the number of timestamps, S is the number of real speakers, and M is the real label before normalization.
For example, where two speakers overlap within a timestamp, the corresponding column of M contains two 1s, which normalization turns into 0.5 each. Taking the first column of the normalized real label $\tilde{M}$ as an example, the value 0.5 for the second and the third speaker indicates that the two are speaking simultaneously.
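As an illustration, this column-wise normalization can be sketched as follows; the handling of silent, all-zero columns is an assumption not specified above:

```python
import torch

# Minimal sketch, assuming M is a binary speaker-activity matrix of shape
# (S, T). Each column is divided by the number of active speakers in that
# timestamp; clamping the divisor to 1 (an assumption) keeps silent,
# all-zero columns at zero instead of dividing by zero.
def normalize_label(M: torch.Tensor) -> torch.Tensor:
    col_sum = M.sum(dim=0, keepdim=True)   # (1, T): active speakers per timestamp
    return M / col_sum.clamp_min(1.0)      # two overlapped speakers -> 0.5 each
```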
Illustratively, the computer device obtains a speech feature of a sample speech signal, and performs feature extraction on the speech feature through a feature extraction network to obtain a feature sequence of the sample speech signal.
The speech features are time-frequency feature data of the sample speech signal. The feature extraction network extracts features in the speech features.
Optionally, the speech features include at least one of the speech spectrum and the Mel-Frequency Cepstral Coefficients (MFCC) of the sample speech signal, but are not limited thereto.
In one possible implementation, feature extraction is performed on the speech features through a non-negative function in the feature extraction network of the speaker log model to obtain extracted features, and the values of the extracted features are normalized to obtain the feature sequence of the sample speech signal.
Optionally, the non-negative function in the feature extraction network is at least one of the activation function Sigmoid, the activation function Softplus, and the linear rectification function ReLU, but is not limited thereto.
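A minimal sketch of such a feature extraction network, assuming a single linear projection with Softplus as the non-negative function and a per-frame normalization; the actual network structure is not fixed by this section:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    # Hypothetical feature-extraction network: a linear projection, a
    # non-negative activation (Softplus, one of the options named above),
    # and a per-frame normalization of the extracted feature values.
    def __init__(self, in_dim: int, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, in_dim) time-frequency features, e.g. spectrogram frames
        h = F.softplus(self.proj(x))  # non-negative extracted features
        return h / h.sum(dim=-1, keepdim=True).clamp_min(1e-8)  # normalized feature sequence
```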
Step 404 a: and obtaining an estimated attractor sequence according to the characteristic sequence.
The feature sequence is a feature vector corresponding to each time stamp in the sample speech signal.
The estimated attractor sequence refers to the estimated speaker characteristics that the speaker log model derives from the feature sequence; that is, the speaker log model estimates the speaker characteristics from the features in the feature sequence, for example S1, S2, S3, ..., Sn, where each attractor in the estimated attractor sequence characterizes one speaker class.
Exemplarily, the speaker log model comprises an encoder and a decoder, the computer device inputs the characteristic sequence into the encoder, and a neural network in the encoder encodes the characteristic sequence to obtain a speaker characteristic vector; the computer equipment inputs the speaker characteristic vector into a decoder, and the neural network in the decoder decodes the speaker characteristic vector to obtain an estimated attractor sequence.
Optionally, the neural network in the encoder is a neural network that is independent of the input order of the feature sequence, for example a fully self-attentive network (Transformer). The neural network in the decoder is at least one of a Recurrent Neural Network (RNN) and a Long Short-Term Memory network (LSTM), but is not limited thereto.
The speaker feature vector is a vector representing the speech characteristics of a speaker, obtained by encoding the feature sequence with the encoder.
For example, suppose the sample speech signal contains the speech of three persons A, B and C plus some noise. The feature sequence of the sample speech signal is processed by the encoder and decoder to obtain the estimated attractor sequence corresponding to the sample speech signal, which contains 4 attractors, S1, S2, S3 and S4. The computer device assigns the human-voice parts of the sample speech signal to the different attractors S1, S2 and S3, and assigns the non-human-voice part of the sample speech signal uniformly to the attractor S4.
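The encoder-decoder can be sketched as follows, assuming a Transformer encoder and an LSTM decoder that is fed zero vectors and seeded with a pooled encoder state; these decoding details are assumptions, not the patent's prescribed design:

```python
import torch
import torch.nn as nn

class AttractorEstimator(nn.Module):
    # Sketch of the encoder-decoder described above: an order-independent
    # Transformer encoder produces speaker feature vectors; an LSTM decoder
    # emits one attractor per decoding step. Layer sizes are illustrative,
    # and d must be divisible by n_heads.
    def __init__(self, d: int, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.decoder = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

    def forward(self, feats: torch.Tensor, n_attractors: int) -> torch.Tensor:
        # feats: (1, T, D) feature sequence -> returns (1, n_attractors, D)
        enc = self.encoder(feats)                 # speaker feature vectors
        h0 = enc.mean(dim=1).unsqueeze(0)         # (1, 1, D) pooled initial state (assumption)
        c0 = torch.zeros_like(h0)
        step_in = feats.new_zeros(1, n_attractors, feats.size(-1))  # zero input per step
        attractors, _ = self.decoder(step_in, (h0, c0))
        return attractors
```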
Step 406 a: and inputting the characteristic sequence and the estimated attractor sequence into a speaker log model to obtain the estimated speaker classification probability.
The estimated speaker class probability refers to the probability of speaker classes corresponding to different timestamps estimated by the speaker log model. Estimating the speaker class probability is a two-dimensional matrix, where one row of the matrix represents a speaker class and one column of the matrix represents a timestamp, e.g., a first probability value of a first row represents a probability of a first speaker speaking in a first timestamp.
Illustratively, the speaker log model includes a classifier network. The computer device calculates the similarity between each feature in the feature sequence and each attractor in the estimated attractor sequence, and inputs the similarity results of the feature sequence and the estimated attractor sequence into the classifier network for calculation, thereby obtaining the estimated speaker class probability.
The classifier network is used to determine the probability of the speaker class corresponding to each timestamp.
The formula for the estimated speaker class probability P can be expressed as:

$$P = \mathrm{Softmax}(A^{\mathsf{T}} N, \ \mathrm{dim} = 0)$$

where $A \in \mathbb{R}^{D \times S}$ is the estimated attractor sequence output by the decoder, $\mathbb{R}$ denotes the real numbers, D is the vector dimension, S is the number of attractors, $A^{\mathsf{T}}$ is the transpose of the two-dimensional matrix characterized by the estimated attractor sequence, $N \in \mathbb{R}^{D \times T}$ is the feature sequence, T is the number of timestamps, $P \in \mathbb{R}^{S \times T}$, and dim = 0 denotes the first dimension of the vector, i.e., the softmax is taken over the speaker dimension.
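This formula translates directly into code; the sketch below assumes A and N are given as (D, S) and (D, T) matrices:

```python
import torch

# Sketch of the formula above. Softmax over dim 0 makes each column
# (timestamp) a distribution over the S speaker classes.
def speaker_class_probability(A: torch.Tensor, N: torch.Tensor) -> torch.Tensor:
    similarity = A.transpose(0, 1) @ N       # (S, T) dot-product similarities
    return torch.softmax(similarity, dim=0)  # P in R^(S x T)
```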
Step 408 a: a first loss function value is calculated based on the estimated speaker class probability and the true label.
In the case of obtaining the estimated speaker class probability, the computer device calculates a first loss function value based on the estimated speaker class probability and the true label.
Optionally, the first loss function value is at least one of a cross entropy between the estimated speaker class probability and the real tag, a mean square error between the estimated speaker class probability and the real tag, and an absolute difference between the estimated speaker class probability and the real tag, but is not limited thereto.
For example, taking the first loss function value as the cross entropy as an example, the computer device obtains the cross entropy between the estimated speaker class probability and the real label based on the probability value of each position in the estimated speaker class probability and the value of the corresponding position in the real label.
The cross entropy is calculated as:

$$L_1 = -\sum_{i} m_i \log n_i$$

where m is the real label, n is the estimated speaker class probability, $n_i$ is the i-th probability value in the estimated speaker class probability, and $m_i$ is the value at the corresponding position in the real label.
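A sketch of this cross entropy, assuming the estimated probability and the normalized real label are both (S, T) matrices; the epsilon guard is an added assumption to keep the logarithm finite:

```python
import torch

# Sketch of the first loss: element-wise cross entropy between the
# normalized real label M_norm and the estimated probability P.
def cross_entropy_loss(P: torch.Tensor, M_norm: torch.Tensor) -> torch.Tensor:
    return -(M_norm * torch.log(P + 1e-8)).sum()
```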
For example, fig. 5 shows a schematic structural diagram of calculating the estimated speaker class probability, and a computer device obtains the speech features 502 and the real labels 511 of the sample speech signal 501, and performs feature extraction on the speech features 502 through the feature extraction network 503 to obtain the feature sequence 504 of the sample speech signal 501. The computer equipment inputs the characteristic sequence 504 into an encoder 505 for encoding processing to obtain a speaker characteristic vector; and the speaker feature vector is input to the decoder 506 for decoding processing to obtain an estimated attractor sequence 507.
The computer device performs similarity calculation 508 on the feature sequence 504 and the estimated attractor sequence 507, and the computer device calculates the similarity between each feature in the feature sequence 504 and each attractor in the estimated attractor sequence 507; and inputting the similarity results of the feature sequence 504 and the estimated attractor sequence 507 into a classifier network 509 for calculation, thereby obtaining an estimated speaker classification probability 510.
The estimated speaker class probability 510 refers to the probability of the speaker class estimated by the speaker log model. One row in estimated speaker class probability 510 represents a speaker class and one column represents a timestamp.
For example, the estimated speaker class probability 510 is a 4x4 matrix whose 4 rows represent the first, second, third and fourth speakers, and whose 4 columns represent the first, second, third and fourth timestamps.
Taking the first column as an example, the first column of the estimated speaker class probability 510 can be read as: the probability that the first speaker speaks in the first timestamp is 0.71, the probability that the second speaker speaks in the first timestamp is 0.13, the probability that the third speaker speaks in the first timestamp is 0.15, and the probability that the fourth speaker speaks in the first timestamp is 0.05.
The computer device calculates a first loss function value 512 based on the estimated speaker class probability 510 and the true label 511.
Step 404 b: and obtaining an ideal attractor sequence based on the product of the real label and the characteristic sequence.
The ideal attractor sequence refers to the real speaker characteristics calculated by the speaker log model from the feature sequence and the real label; that is, the speaker log model determines the real speaker features in the feature sequence according to the speaker classes in the real label, for example Q1, Q2, ..., Qm, where each attractor in the ideal attractor sequence characterizes one speaker class.
Illustratively, the computer device derives an ideal attractor sequence based on a product of the true tag and the feature sequence.
The ideal attractor sequence $\hat{A}$ can be calculated column by column as:

$$\hat{A}_{:,i} = \frac{N \, (\tilde{M}^{\mathsf{T}})_{:,i}}{\sum_{t=1}^{T} \tilde{M}_{i,t}}$$

where $\hat{A} \in \mathbb{R}^{D \times S}$, $\mathbb{R}$ denotes the real numbers, D is the vector dimension, S is the number of attractors, N is the feature sequence, $\tilde{M}^{\mathsf{T}}$ is the transpose of the two-dimensional matrix characterized by the real label, $(\tilde{M}^{\mathsf{T}})_{:,i}$ selects the i-th column of the transposed real label (i.e., the i-th row of the real label), and the denominator sums along the i-th row of the real label.
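A sketch of this computation, assuming N is the (D, T) feature sequence and M_norm the (S, T) normalized real label; each ideal attractor is then the label-weighted average of one speaker's frame features:

```python
import torch

# Sketch of the ideal-attractor formula above. The clamp is an added
# assumption to guard against speakers with zero label mass.
def ideal_attractors(N: torch.Tensor, M_norm: torch.Tensor) -> torch.Tensor:
    weighted = N @ M_norm.transpose(0, 1)     # (D, S): per-speaker feature sums
    mass = M_norm.sum(dim=1).clamp_min(1e-8)  # (S,): per-speaker label mass
    return weighted / mass                    # broadcasts over rows -> (D, S)
```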
Step 406 b: and inputting the characteristic sequence and the ideal attractor sequence into a speaker log model to obtain the ideal speaker classification probability.
The ideal speaker class probability refers to the probability of a true speaker class based on the true label. Similar to estimating the speaker class probability, the ideal speaker class probability is a two-dimensional matrix, where one row of the matrix represents a speaker class and one column of the matrix represents a timestamp, e.g., the first probability value of the first row represents the probability of the first speaker speaking in the first timestamp.
Illustratively, the speaker log model includes a classifier network. The computer device calculates the similarity between each feature in the feature sequence and each attractor in the ideal attractor sequence, and inputs the similarity results of the feature sequence and the ideal attractor sequence into the classifier network in the speaker log model for calculation, obtaining the ideal speaker class probability.
The formula for the ideal speaker class probability $\hat{P}$ can be expressed as:

$$\hat{P} = \mathrm{Softmax}(\hat{A}^{\mathsf{T}} N, \ \mathrm{dim} = 0)$$

where $\hat{P} \in \mathbb{R}^{S \times T}$, $\mathbb{R}$ denotes the real numbers, D is the vector dimension, S is the number of attractors, $\hat{A}$ is the ideal attractor sequence, $\hat{A}^{\mathsf{T}}$ is the transpose of the two-dimensional matrix characterized by the ideal attractor sequence, N is the feature sequence, and dim = 0 denotes the first dimension of the vector.
Step 408 b: and calculating a second loss function value based on the ideal speaker category probability and the real label.
Under the condition of obtaining the ideal speaker class probability, the computer equipment calculates a second loss function value based on the ideal speaker class probability and the real label.
Optionally, the second loss function value is at least one of a cross entropy between the ideal speaker class probability and the real tag, a mean square error between the ideal speaker class probability and the real tag, and an absolute difference between the ideal speaker class probability and the real tag, but is not limited thereto.
For example, fig. 6 shows a schematic structural diagram of calculating the ideal speaker class probability, and a computer device obtains the speech feature 602 and the real tag 609 of the sample speech signal 601, and performs feature extraction on the speech feature 602 through the feature extraction network 603 to obtain the feature sequence 604 of the sample speech signal 601. The computer device derives an ideal attractor sequence 605 based on the product of the true tag 609 and the signature sequence 604.
The computer device performs similarity calculation 606 on the sequence of features 604 and the sequence of ideal attractors 605, and the computer device calculates the similarity between each feature in the sequence of features 604 and each attractor in the sequence of ideal attractors 605; and the similarity results of the feature sequence 604 and the ideal attractor sequence 605 are input into a classifier network 607 for calculation, so as to obtain the ideal speaker classification probability 608.
The ideal speaker class probability 608 refers to the probability of the true speaker class based on the true tag 609. One row in the ideal speaker class probability 608 represents a speaker class and one column represents a timestamp.
For example, the ideal speaker class probability 608 is a 4x4 matrix whose 4 rows represent the first, second, third and fourth speakers, and whose 4 columns represent the first, second, third and fourth timestamps.
Taking the first column as an example, the first column of the ideal speaker class probability 608 can be read as: the probability that the first speaker speaks in the first timestamp is 0.61, the probability that the second speaker speaks in the first timestamp is 0.21, the probability that the third speaker speaks in the first timestamp is 0.17, and the probability that the fourth speaker speaks in the first timestamp is 0.01.
The computer device calculates a second loss function value 610 based on the ideal speaker class probability 608 and the true label 609.
The computer device determines the ideal attractor sequence from the real label and the feature sequence, i.e., it determines the real speaker features in the feature sequence according to the speaker classes in the real label; it then obtains the ideal speaker class probability from the feature sequence and the ideal attractor sequence, and calculates the second loss function value based on the ideal speaker class probability and the real label. Updating the model parameters of the speaker log model based on the second loss function value enables the updated feature extraction network to extract the speaker features in the speech features more accurately and to form a more accurate feature sequence, thereby giving the speaker log model higher speech-signal recognition accuracy.
Step 410: updating a model parameter of the speaker log model based on a sum of the first loss function value and the second loss function value.
Illustratively, the first loss function value is a first cross entropy between the estimated speaker class probability and the true label, and the second loss function value is a second cross entropy between the ideal speaker class probability and the true label.
Updating the model parameters of the speaker log model based on the sum of the first cross entropy and the second cross entropy, thereby obtaining the trained speaker log model.
The updating of the model parameter refers to updating a network parameter in the speaker log model, or updating a network parameter of each network module in the model, or updating a network parameter of each network layer in the model, but is not limited thereto, and the embodiment of the present application does not limit this.
The model parameters of the speaker log model comprise at least one of network parameters of a feature extraction network, network parameters of an encoder, network parameters of a decoder and network parameters of a classifier network in the speaker log model.
In some embodiments, updating the model parameters of the speaker log model includes updating the network parameters of all of the network modules in the speaker log model, or, fixing the network parameters of a portion of the network modules in the speaker log model and updating only the network parameters of the remaining portion of the network modules. For example, when updating the model parameters of the speaker log model, the fixed speaker log model extracts the network parameters of the network, the encoder network parameters and the decoder network parameters, and only updates the network parameters of the classifier network.
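For illustration, such a partial update might look as follows; the attribute names on the model object are hypothetical:

```python
import torch

# Illustrative partial update, assuming a speaker log model exposing
# hypothetical submodules feature_extractor, encoder, decoder and
# classifier: freeze the first three and update only the classifier.
def configure_classifier_only_update(model) -> torch.optim.Optimizer:
    for module in (model.feature_extractor, model.encoder, model.decoder):
        for p in module.parameters():
            p.requires_grad = False  # fixed during this training phase
    return torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
```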
Using the sum of the first cross entropy and the second cross entropy as the error, the network parameters of the feature extraction network, the encoder, the decoder and the classifier network in the speaker log model are updated by an error back-propagation algorithm, so that the sum of the first and second cross entropies becomes smaller and smaller until it converges, thereby obtaining the trained speaker log model.
Convergence of the sum of the first and second cross entropies means at least one of the following, but is not limited thereto: the sum no longer changes; the error difference between two adjacent training iterations of the speaker log model is smaller than a preset value; or the number of training iterations of the speaker log model reaches a preset number.
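Putting the pieces together, one training step might be sketched as below, reusing the helper functions sketched earlier; all module and function names are illustrative rather than the patent's reference implementation:

```python
import torch

# Minimal training-step sketch combining normalize_label,
# speaker_class_probability, ideal_attractors and cross_entropy_loss
# from the sketches above. model.feature_extractor and
# model.estimate_attractors are hypothetical.
def train_step(model, optimizer, speech_feats, M):
    M_norm = normalize_label(M)                                 # (S, T)
    N = model.feature_extractor(speech_feats).transpose(0, 1)   # (D, T) feature sequence
    A_est = model.estimate_attractors(N)                        # (D, S) via encoder-decoder
    P_est = speaker_class_probability(A_est, N)                 # estimated probability
    A_ideal = ideal_attractors(N, M_norm)                       # (D, S) ideal attractors
    P_ideal = speaker_class_probability(A_ideal, N)             # ideal probability
    loss = cross_entropy_loss(P_est, M_norm) + cross_entropy_loss(P_ideal, M_norm)
    optimizer.zero_grad()
    loss.backward()   # error back-propagation
    optimizer.step()  # update model parameters
    return loss.item()
```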
In order to verify the effectiveness of the training method for the speaker log model provided in the embodiment of the present application, a compared speaker log model and the improved speaker log model (i.e., the speaker log model of the present solution) were selected for a comparison test, with the following results:
TABLE-speaker log error rate comparison TABLE
Figure BDA0003521058500000181
As can be seen from Table 1, the speaker log model provided in the embodiment of the present application achieves a clear improvement over the baseline speaker log model, and the accuracy of the speaker log model is improved.
In summary, in the method provided in this embodiment, the feature sequence of the sample voice signal is extracted from its voice features through the feature extraction network of the speaker log model, and the real label of the sample voice signal is obtained; the estimated attractor sequence is obtained from the feature sequence, and the ideal attractor sequence is obtained from the product of the feature sequence and the real label; the estimated speaker class probability is obtained from the feature sequence and the estimated attractor sequence, and a first cross entropy is calculated based on the estimated speaker class probability and the real label; the ideal speaker class probability is obtained from the feature sequence and the ideal attractor sequence, and a second cross entropy is calculated based on the ideal speaker class probability and the real label; and the model parameters of the speaker log model are updated according to the first cross entropy and the second cross entropy, so that the trained speaker log model achieves higher speech signal recognition accuracy and generates more accurate speaker logs.
The above embodiments illustrate the training method for the speaker log model; the speaker recognition method based on the trained speaker log model is described next.
FIG. 7 is a flowchart of a speaker recognition method according to an exemplary embodiment of the present application. The method may be performed by a computer device, which may be the terminal 100 or the server 200 in fig. 2. The method comprises the following steps:
step 702: a feature sequence of a speech signal is obtained.
The voice signal is the speech whose speaker information is to be recognized. The feature sequence consists of the feature vectors corresponding to the timestamps in the voice signal.
The voice signal may be acquired in at least one of the following ways:
1. The computer device receives the voice signal. For example, a terminal that initiates a customer service call records the call content during the customer service session and, after the recording is finished, sends the voice signal of the call content to the server for recognition.
2. The computer device obtains the voice signal from a stored database. For example, in a conference application, the conference is recorded, the speech of each speaker in the conference is stored on the server, and when speaker recognition is needed, the recorded voice signal of the conference is retrieved from the stored conference recording for recognition.
It should be noted that the above-mentioned manner of acquiring the speech signal is only an illustrative example, and the embodiment of the present application does not limit this.
Step 704: obtaining an estimated attractor sequence according to the feature sequence.
The estimated attractor sequence consists of the speaker features estimated by the speaker log model from the feature sequence; that is, the speaker log model estimates speaker features from the features in the feature sequence, for example S1, S2, S3, ..., Sn, where each attractor in the estimated attractor sequence characterizes one speaker class.
Step 706: inputting the feature sequence and the estimated attractor sequence into the speaker log model to obtain the estimated speaker class probability.
The estimated speaker class probability refers to the probabilities of the speaker classes corresponding to different timestamps, as estimated by the speaker log model. The estimated speaker class probability is a two-dimensional matrix, where each row of the matrix represents a speaker class and each column represents a timestamp; for example, the first probability value in the first row represents the probability of the first speaker speaking in the first timestamp.
Under the condition of acquiring the estimated attractor sequence, the computer equipment inputs the characteristic sequence and the estimated attractor sequence into a pre-trained speaker log model so as to obtain the estimated speaker classification probability.
Step 708: determining the speaker class and the number of speakers corresponding to the voice signal based on the estimated speaker class probability.
The speaker class is the class of a speaker in the voice signal; that is, the speaker class corresponding to each timestamp is determined in the recognized voice signal, for example, A speaks in the first timestamp and B speaks in the second timestamp. The number of speakers refers to the number of speakers in the voice signal.
Under the condition of obtaining the probability of the estimated speaker class, the computer equipment determines the speaker class and the number of speakers corresponding to the voice signal according to the probability value in the probability of the estimated speaker class.
In summary, in the method provided by this embodiment, the feature sequence of the voice signal is acquired, and the estimated attractor sequence is obtained from the feature sequence; the estimated speaker class probability is obtained from the feature sequence and the estimated attractor sequence; and the speaker class and the number of speakers corresponding to the voice signal are determined according to the estimated speaker class probability, thereby generating a more accurate speaker log.
FIG. 8 is a flowchart of a speaker recognition method according to an exemplary embodiment of the present application. The method may be performed by a computer device, which may be the terminal 100 or the server 200 in fig. 2. The method comprises the following steps:
step 802: a feature sequence of a speech signal is obtained.
The voice signal is the speech whose speaker information is to be recognized. The feature sequence consists of the feature vectors corresponding to the timestamps in the voice signal.
Illustratively, the computer device acquires a voice feature of the voice signal, and performs feature extraction on the voice feature through a feature extraction network to obtain a feature sequence of the voice signal.
The speech features are time-frequency feature data of the speech signal. The feature extraction network extracts features in the speech features.
Optionally, the voice feature includes at least one of a speech spectrum and Mel-Frequency Cepstral Coefficients (MFCC) of the voice signal, but is not limited thereto.
In one possible implementation, feature extraction is performed on the voice features through a non-negative function in the feature extraction network of the speaker log model to obtain extracted features; the values of the extracted features are then normalized to obtain the feature sequence of the voice signal.
Optionally, the non-negative function in the feature extraction network is at least one of an activation function Sigmoid, an activation function Softplus, and a linear rectification function ReLU, but is not limited thereto.
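A minimal sketch of this extraction step, assuming PyTorch; the layer sizes and the choice of L1 normalization are illustrative assumptions (the text only states that a non-negative function is applied and the values are normalized):

```python
import torch
import torch.nn as nn

# Illustrative feature extraction network: a linear layer followed by a
# non-negative activation (Softplus here; Sigmoid or ReLU also fit the text).
feature_net = nn.Sequential(nn.Linear(80, 256), nn.Softplus())

def extract_feature_sequence(speech_feature: torch.Tensor) -> torch.Tensor:
    # speech_feature: (T, 80) time-frequency frames, e.g. log-mel spectrogram bins.
    extracted = feature_net(speech_feature)            # (T, 256), values >= 0
    # Normalize the extracted values per frame (L1 normalization assumed).
    normalized = extracted / extracted.sum(dim=1, keepdim=True).clamp_min(1e-8)
    return normalized.T                                # (D, T) feature sequence
```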
Step 804: obtaining an estimated attractor sequence according to the feature sequence.
The estimated attractor sequence consists of the speaker features estimated by the speaker log model from the feature sequence; that is, the speaker log model estimates speaker features from the features in the feature sequence, for example S1, S2, S3, ..., Sn, where each attractor in the estimated attractor sequence characterizes one speaker class.
Exemplarily, the speaker log model comprises an encoder and a decoder, the computer device inputs the characteristic sequence into the encoder, and a neural network in the encoder encodes the characteristic sequence to obtain a speaker characteristic vector; the computer equipment inputs the speaker characteristic vector into a decoder, and the neural network in the decoder decodes the speaker characteristic vector to obtain an estimated attractor sequence.
Optionally, the neural network in the encoder is a neural network that is independent of the input order of the feature sequence, for example a Transformer, i.e. a fully self-attention-based network. The neural network in the decoder is at least one of a Recurrent Neural Network (RNN) and a Long Short-Term Memory network (LSTM), but is not limited thereto.
The speaker feature vector is a vector, obtained by encoding the feature sequence with the encoder, that represents the speech features of the speakers.
For example, suppose the voice signal contains the voices of three persons A, B and C plus some noise. The feature sequence of the voice signal is processed by the encoder-decoder to obtain an estimated attractor sequence corresponding to the voice signal, containing 4 attractors S1, S2, S3 and S4: the computer device assigns the human voice parts of the voice signal to the different attractors S1, S2 and S3, and uniformly assigns the non-human-voice part of the voice signal to the attractor S4.
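A minimal sketch of such an encoder-decoder, assuming PyTorch; the class name, layer sizes, the fixed attractor count, and the way the decoder is seeded from the encoder output are all illustrative assumptions rather than the patent's confirmed design:

```python
import torch
import torch.nn as nn

class AttractorEstimator(nn.Module):
    def __init__(self, dim: int = 256, n_attractors: int = 4):
        super().__init__()
        # No positional encoding is used, so the pooled encoder summary
        # below is independent of the input order of the feature sequence.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.n_attractors = n_attractors

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (D, T) feature sequence -> (1, T, D) for batch-first modules.
        enc = self.encoder(feats.T.unsqueeze(0))
        # Seed the LSTM hidden state with the encoder summary and decode a
        # fixed number of attractors from zero inputs (one simple choice).
        h0 = enc.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()  # (1, 1, D)
        zeros = torch.zeros(1, self.n_attractors, enc.size(-1))
        attractors, _ = self.decoder(zeros, (h0, torch.zeros_like(h0)))
        return attractors.squeeze(0).T             # (D, S) estimated attractor sequence
```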
Step 806: inputting the feature sequence and the estimated attractor sequence into the speaker log model to obtain the estimated speaker class probability.
The estimated speaker class probability refers to the probabilities of the speaker classes corresponding to different timestamps, as estimated by the speaker log model. The estimated speaker class probability is a two-dimensional matrix, where each row of the matrix represents a speaker class and each column represents a timestamp; for example, the first probability value in the first row represents the probability of the first speaker speaking in the first timestamp.
Under the condition of acquiring the estimated attractor sequence, the computer equipment inputs the characteristic sequence and the estimated attractor sequence into a pre-trained speaker log model so as to obtain the estimated speaker classification probability.
In one possible implementation, the speaker log model includes a classifier network. The computer device calculates the similarity between each feature in the feature sequence and each attractor in the estimated attractor sequence, and inputs the similarity results of the feature sequence and the estimated attractor sequence into the classifier network for calculation, thereby obtaining the estimated speaker class probability.
The classifier network is used to determine the probability of the speaker class corresponding to each timestamp.
The formula for determining the estimated speaker class probability P can be expressed as:

P = Softmax(A^T N, dim = 0)

where A ∈ R^(D×S) is the estimated attractor sequence output by the decoder, with R the real numbers, D the vector dimension and S the number of attractors; A^T is the transpose of the two-dimensional matrix representing the estimated attractor sequence; N ∈ R^(D×T) is the feature sequence, with T the number of timestamps; P ∈ R^(S×T); and dim = 0 indicates that the Softmax is taken over the first dimension of the vector.
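This formula translates directly into a few tensor operations; a minimal sketch assuming PyTorch, with illustrative names:

```python
import torch

def estimated_speaker_probability(A: torch.Tensor, N: torch.Tensor) -> torch.Tensor:
    """P = Softmax(A^T N, dim=0).

    A: (D, S) estimated attractor sequence; N: (D, T) feature sequence.
    Returns P: (S, T), where P[s, t] is the probability that speaker class s
    is speaking at timestamp t.
    """
    similarity = A.T @ N                  # (S, T) inner-product similarities
    return torch.softmax(similarity, dim=0)

# Usage: with D=256, S=4, T=100, every column of P sums to 1 -- each column
# is a distribution over the speaker classes at that timestamp.
P = estimated_speaker_probability(torch.randn(256, 4), torch.randn(256, 100))
```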
For example, fig. 9 shows a schematic structural diagram of a speaker recognition method, and a computer device acquires a speech feature 902 of a speech signal 901, and performs feature extraction on the speech feature 902 through a feature extraction network 903 to obtain a feature sequence 904 of the speech signal 901. The computer equipment inputs the characteristic sequence 904 into an encoder 905 for encoding processing to obtain a speaker characteristic vector; and the speaker feature vector is input to the decoder 906 for decoding processing, resulting in the estimated attractor sequence 907.
The computer device performs similarity calculation 908 on the feature sequence 904 and the estimated attractor sequence 907, calculating the similarity between each feature in the feature sequence 904 and each attractor in the estimated attractor sequence 907; the similarity results of the feature sequence 904 and the estimated attractor sequence 907 are then input into a classifier network 909 for calculation, so as to obtain an estimated speaker class probability 910.
The estimated speaker class probability 910 refers to the probability of the speaker class estimated by the speaker log model. One row in the estimated speaker class probability 910 represents a speaker class and one column represents a timestamp.
For example, the estimated speaker class probability 910 is a 4×4 matrix (rendered as an image in the original document and not reproduced here; only its first column, (0.71, 0.13, 0.15, 0.01), is recoverable from the example below). The 4 rows in the estimated speaker class probability 910 represent the first speaker, the second speaker, the third speaker and the fourth speaker, respectively, and the 4 columns represent the first timestamp, the second timestamp, the third timestamp and the fourth timestamp, respectively.
Taking the first column as an example, the first column in estimating the speaker classification probability 910 can be expressed as: the probability of the first speaker speaking in the first time stamp is 0.71, the probability of the second speaker speaking in the first time stamp is 0.13, the probability of the third speaker speaking in the first time stamp is 0.15, and the probability of the fourth speaker speaking in the first time stamp is 0.01.
Step 808a: when a first probability value corresponding to a first timestamp in the estimated speaker class probability is greater than a first threshold, determining that the speaker corresponding to the first probability value is the speaker class corresponding to the first timestamp in the voice signal.
The speaker class refers to the class of a speaker in the voice signal; that is, the speaker classes corresponding to different timestamps are determined in the recognized voice signal, for example, A speaks in the first timestamp, and A and B speak in the second timestamp.
For example, the estimated speaker class probability is obtained as a 4×4 matrix (rendered as an image in the original document and not reproduced here).
If the first threshold is set to 0.5, the speaker classes can be determined from the estimated speaker class probability as follows: the speaker class in the first timestamp is the first speaker, no speaker is speaking in the second timestamp, the speaker class in the third timestamp is the third speaker, and the speaker class in the fourth timestamp is the first speaker.
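A minimal sketch of this per-timestamp thresholding, assuming PyTorch and the example threshold of 0.5; names are illustrative:

```python
import torch

def speaker_classes(P: torch.Tensor, first_threshold: float = 0.5):
    """Return, per timestamp, the speaker classes whose probability in
    P (shape (S, T)) exceeds the first threshold; an empty list means
    no speaker is speaking at that timestamp."""
    active = P > first_threshold                   # (S, T) boolean mask
    return [torch.nonzero(active[:, t]).flatten().tolist()
            for t in range(P.size(1))]
```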
Step 808 b: determining a speaker category label vector corresponding to the voice signal based on the estimated speaker category probability; and determining the number of speakers corresponding to the voice signals according to the speaker category label vector.
The number of speakers refers to the number of speakers in the speech signal.
The speaker class label vector is a label vector representing a speaker class obtained based on a probability value in the estimated speaker class probability.
In one possible implementation, the class label corresponding to the first probability value in the speaker class label vector is set to 1 if the first probability value corresponding to the first timestamp in the estimated speaker class probability is greater than the second threshold.
And under the condition that the second probability value and the third probability value corresponding to the second timestamp in the estimated speaker category probability are both greater than a third threshold and both smaller than the second threshold, setting a category label corresponding to the second probability value and a category label corresponding to the third probability value in the speaker category label vector to be 1.
And under the condition that the fourth probability value corresponding to the third timestamp in the estimated speaker category probability is not more than the third threshold, setting the category label corresponding to the maximum probability value in the fourth probability value corresponding to the third timestamp in the speaker category label vector as 1.
The computer device derives the speaker category label vector based on the values of the category labels. The computer device then sums the speaker category label vector along the timestamp dimension to obtain a speaker category number vector, and determines the number of speakers corresponding to the voice signal based on the number of non-zero elements in the speaker category number vector.
For example, the estimated speaker class probability is obtained as a 4×4 matrix (rendered as an image in the original document and not reproduced here). The 4 rows in the estimated speaker class probability represent the first speaker, the second speaker, the third speaker and the fourth speaker, respectively, and the 4 columns represent the first timestamp, the second timestamp, the third timestamp and the fourth timestamp, respectively.
Setting the second threshold to 0.5 and the third threshold to 0.25, the speaker category label vector can be determined accordingly (the resulting label matrix is rendered as an image in the original document and is not reproduced here).
The computer device sums the speaker category label vector along the timestamp dimension to obtain a speaker category number vector (rendered as an image in the original document and not reproduced here).
The computer device determines the number of speakers corresponding to the voice signal based on the number of non-zero elements in the speaker category number vector; in this example, the number of speakers corresponding to the estimated speaker class probability is 3.
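A minimal sketch combining the three labeling rules and the counting step, assuming PyTorch and the example thresholds of 0.5 and 0.25; names are illustrative:

```python
import torch

def count_speakers(P: torch.Tensor, second: float = 0.5, third: float = 0.25) -> int:
    """Build the speaker category label vector from P (shape (S, T)) and
    count the speakers as the number of non-zero rows."""
    # Rule 1: values above the second threshold are labeled 1.
    # Rule 2: values between the third and second thresholds are labeled 1.
    labels = ((P > second) | ((P > third) & (P <= second))).to(torch.int64)
    # Rule 3: if no value at a timestamp exceeds the third threshold,
    # label only the maximum value of that column.
    empty = ~((P > third).any(dim=0))
    cols = torch.nonzero(empty).flatten()
    labels[P.argmax(dim=0)[cols], cols] = 1
    counts = labels.sum(dim=1)             # sum along the timestamp dimension
    return int((counts > 0).sum())         # non-zero elements = number of speakers
```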
In summary, in the method provided in this embodiment, the feature sequence of the voice signal is acquired, and the estimated attractor sequence is obtained from the feature sequence; the estimated speaker class probability is obtained from the feature sequence and the estimated attractor sequence; and the speaker class and the number of speakers corresponding to the voice signal are determined according to the estimated speaker class probability, thereby generating a more accurate speaker log.
The method provided by this embodiment can be applied to various scenarios such as conference recording and customer service work supervision. Illustratively, the application scenarios of the method include at least one of the following:
First, the conference recording scenario.
That is, the audio of the conference is recorded to obtain a voice signal, and speaker recognition is performed on the voice signal using the speaker recognition method provided in the embodiment of the present application to obtain the speaker recognition result corresponding to the voice signal, thereby generating the speaker log corresponding to the conference. Later, when the conference content is reviewed, the speech of a target speaker can be listened to selectively according to the speaker log.
Second, the customer service work supervision scenario.
The telephone/voice communication content of a customer service agent is recorded to obtain a voice signal, speaker recognition is performed on the voice signal through the speaker recognition method provided in the embodiment of the present application to obtain the speaker recognition result corresponding to the voice signal, and the speaker log corresponding to the customer service call is generated. Subsequently, a manager can extract the telephone/voice content of the customer service agent and, according to the speaker log, selectively listen to the parts spoken by the agent, thereby supervising the customer service work.
It is to be noted that the above application scenarios are only illustrative examples, and the embodiment of the present application may also be applied to various scenarios such as cloud technology, artificial intelligence, car networking, map navigation, smart transportation, and assisted driving.
It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are authorized by the user or sufficiently authorized by various parties, and the collection, use, and processing of the relevant data is required to comply with relevant laws and regulations and standards in relevant countries and regions. For example, the speech signal and the sample speech signal referred to in this application are both obtained with sufficient authorization.
FIG. 10 is a block diagram illustrating an exemplary embodiment of a training apparatus for a speaker log model. The apparatus may be implemented as all or part of a computer device in software, hardware or a combination of both, the apparatus comprising:
the first obtaining module 1001 is configured to obtain a feature sequence and a real tag of a sample speech signal, where the real tag is a tag representing a category of a real speaker.
A second obtaining module 1002, configured to obtain an estimated attractor sequence according to the feature sequence, where one attractor in the estimated attractor sequence represents one speaker category.
An estimating module 1003, configured to input the feature sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker class probability, where the estimated speaker class probability is a probability of a speaker class estimated by the speaker log model.
A calculating module 1004 for calculating a first loss function value based on the estimated speaker class probability and the true label.
An updating module 1005, configured to update the model parameters of the speaker log model based on the first loss function value.
In a possible implementation manner, the second obtaining module 1002 is further configured to input the feature sequence to the encoder for encoding processing, so as to obtain a speaker feature vector; and inputting the speaker characteristic vector to the decoder for decoding processing to obtain the estimated attractor sequence.
In a possible implementation manner, the second obtaining module 1002 is further configured to calculate a similarity between each feature in the feature sequence and each attractor in the estimated attractor sequence;
and inputting the similarity results of the characteristic sequences and the estimated attractor sequences into the classifier network for calculation to obtain the estimated speaker classification probability.
In a possible implementation manner, the second obtaining module 1002 is further configured to obtain an ideal attractor sequence according to the feature sequence and the real tag, where an attractor in the ideal attractor sequence represents a speaker category.
In a possible implementation manner, the estimation module 1003 is further configured to input the feature sequence and the ideal attractor sequence into the speaker log model to obtain an ideal speaker class probability, where the ideal speaker class probability is a probability of a real speaker class obtained based on the real tag.
In one possible implementation, the calculating module 1004 is further configured to calculate a second loss function value based on the ideal speaker class probability and the real label.
In a possible implementation manner, the updating module 1005 is further configured to update the model parameter of the speaker log model based on the first loss function value and the second loss function value.
In a possible implementation manner, the second obtaining module 1002 is further configured to obtain the ideal attractor sequence based on a product of the real tag and the feature sequence.
In a possible implementation manner, the second obtaining module 1002 is further configured to calculate a similarity between each feature in the feature sequence and each attractor in the ideal attractor sequence; and input the similarity results of the feature sequence and the ideal attractor sequence into the classifier network in the speaker log model for calculation to obtain the ideal speaker class probability.
In a possible implementation manner, the first obtaining module 1001 is further configured to obtain a speech feature of the sample speech signal, where the speech feature is time-frequency feature data of the sample speech signal; and extracting the characteristics of the voice characteristics through the characteristic extraction network to obtain the characteristic sequence of the sample voice signal.
In a possible implementation manner, the first obtaining module 1001 is further configured to perform feature extraction on the speech feature through a non-negative function in the feature extraction network of the speaker log model to obtain an extracted feature; and normalizing the extracted feature value to obtain the feature sequence of the sample voice signal.
The first loss function value is a first cross entropy between the estimated speaker class probability and the real tags, and the second loss function value is a second cross entropy between the ideal speaker class probability and the real tags.
In a possible implementation manner, the updating module 1005 is further configured to update the model parameter of the speaker log model based on a sum of the first cross entropy and the second cross entropy.
FIG. 11 is a schematic diagram illustrating a speaker recognition apparatus according to an exemplary embodiment of the present application. The apparatus may be implemented as all or part of a computer device in software, hardware or a combination of both, the apparatus comprising:
a first obtaining module 1101, configured to obtain a feature sequence of a speech signal.
A second obtaining module 1102, configured to obtain an estimated attractor sequence according to the feature sequence, where an attractor in the estimated attractor sequence represents a speaker category.
An estimating module 1103, configured to input the feature sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker class probability, where the estimated speaker class probability is a probability of a speaker class estimated by the speaker log model.
And the recognition module 1104 is configured to determine a speaker type and a speaker number corresponding to the voice signal based on the estimated speaker type probability.
In a possible implementation manner, the second obtaining module 1102 is further configured to input the feature sequence to the encoder for encoding, so as to obtain a speaker feature vector; and inputting the speaker characteristic vector into the decoder for decoding to obtain the estimated attractor sequence.
The speaker log model includes a classifier network.
In a possible implementation manner, the second obtaining module 1102 is further configured to calculate a similarity between each feature in the feature sequence and each estimated attractor in the estimated attractor sequence; and inputting the similarity results of the characteristic sequence and the estimated attractor sequence into the classifier network for calculation to obtain the estimated speaker class probability.
In one possible implementation, the identifying module 1104 is further configured to determine that the speaker corresponding to the first probability value is the speaker class corresponding to the first timestamp in the speech signal if the first probability value corresponding to the first timestamp in the estimated speaker class probability is greater than a first threshold.
In a possible implementation manner, the identifying module 1104 is further configured to determine a speaker class label vector corresponding to the speech signal based on the estimated speaker class probability; and determining the number of the speakers corresponding to the voice signals according to the speaker category label vector.
The speaker category label vector is a label vector representing a speaker category obtained based on a probability value in the estimated speaker category probability.
In a possible implementation, the identifying module 1104 is further configured to set the class label corresponding to a first probability value in the speaker class label vector to 1 if the first probability value, corresponding to a first timestamp in the estimated speaker class probability, is greater than a second threshold;
setting the class label corresponding to the second probability value and the class label corresponding to the third probability value in the speaker class label vector to be 1 under the condition that the second probability value and the third probability value corresponding to a second timestamp in the estimated speaker class probability are both greater than a third threshold and are both smaller than the second threshold;
setting the category label corresponding to the maximum probability value in the fourth probability values corresponding to the third timestamp in the speaker category label vector to be 1 under the condition that the fourth probability values corresponding to the third timestamp in the estimated speaker category probability are not more than the third threshold;
the speaker class label vector is derived based on the values of the class labels.
In a possible implementation manner, the identifying module 1104 is further configured to sum the speaker category label vector along the timestamp dimension to obtain a speaker category number vector; and determine the number of speakers corresponding to the voice signal based on the number of non-zero elements in the speaker category number vector.
In a possible implementation manner, the first obtaining module 1101 is further configured to obtain a voice feature of the voice signal, where the voice feature is time-frequency feature data of the voice signal; and extracting the characteristics of the voice characteristics through the characteristic extraction network to obtain the characteristic sequence of the voice signal.
In a possible implementation manner, the first obtaining module 1101 is further configured to perform feature extraction on the speech feature through a non-negative function in the feature extraction network of the speaker log model to obtain an extracted feature; and normalizing the values of the extracted features to obtain the feature sequence of the voice signal.
Fig. 12 shows a block diagram of a computer device 1200 according to an exemplary embodiment of the present application. The computer device 1200 may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III) or an MP4 player (Moving Picture Experts Group Audio Layer IV). The computer device 1200 may also be referred to by other names such as user equipment or portable terminal.
Generally, computer device 1200 includes: a processor 1201 and a memory 1202.
The processor 1201 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1201 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1201 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1201 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and drawing content that the display screen needs to display. In some embodiments, the processor 1201 may further include an AI (Artificial Intelligence) processor for processing a computing operation related to machine learning.
Memory 1202 may include one or more computer-readable storage media, which may be tangible and non-transitory. Memory 1202 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1202 is used to store at least one instruction for execution by the processor 1201 to implement a speaker log model training method or a speaker recognition method provided in embodiments of the present application.
In some embodiments, the computer device 1200 may further optionally include: a peripheral interface 1203 and at least one peripheral. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1204, touch display 1205, camera 1206, audio circuitry 1207, and power supply 1208.
The peripheral interface 1203 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1201 and the memory 1202. In some embodiments, the processor 1201, memory 1202, and peripheral interface 1203 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1201, the memory 1202 and the peripheral device interface 1203 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 1204 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1204 communicates with a communication network and other communication devices by electromagnetic signals. The radio frequency circuit 1204 converts an electric signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1204 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, etc. The radio frequency circuit 1204 may communicate with other terminals through at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 1204 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The touch display 1205 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. The touch display screen 1205 also has the ability to acquire touch signals on or over the surface of the touch display screen 1205. The touch signal may be input to the processor 1201 as a control signal for processing. The touch display 1205 is used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the touch display 1205 may be one, providing the front panel of the computer device 1200; in other embodiments, the touch display 1205 can be at least two, respectively disposed on different surfaces of the computer device 1200 or in a folded design; in some embodiments, the touch display 1205 may be a flexible display disposed on a curved surface or on a folded surface of the computer device 1200. Even more, the touch display panel 1205 can be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The touch Display panel 1205 can be made of a material such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
Camera assembly 1206 is used to capture images or video. Optionally, camera assembly 1206 includes a front camera and a rear camera. Generally, a front camera is used for realizing video call or self-shooting, and a rear camera is used for realizing shooting of pictures or videos. In some embodiments, the number of the rear cameras is at least two, and each of the rear cameras is any one of a main camera, a depth-of-field camera and a wide-angle camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize a panoramic shooting function and a VR (Virtual Reality) shooting function. In some embodiments, camera assembly 1206 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 1207 is used to provide an audio interface between a user and the computer device 1200. The audio circuitry 1207 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals into the processor 1201 for processing or inputting the electric signals into the radio frequency circuit 1204 to achieve voice communication. For stereo capture or noise reduction purposes, the microphones may be multiple and located at different locations on the computer device 1200. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1201 or the radio frequency circuit 1204 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 1207 may also include a headphone jack.
The power supply 1208 is used to power the various components in the computer device 1200. The power supply 1208 may be an alternating current, direct current, disposable battery, or rechargeable battery. When power supply 1208 includes a rechargeable battery, the rechargeable battery can be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the computer device 1200 also includes one or more sensors 1209. The one or more sensors 1209 include, but are not limited to: acceleration sensor 1210, gyro sensor 1211, pressure sensor 1212, optical sensor 1213 and proximity sensor 1214.
The acceleration sensor 1210 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the computer apparatus 1200. For example, the acceleration sensor 1210 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1201 may control the touch display 1205 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1210. The acceleration sensor 1210 may also be used for game or user motion data acquisition.
The gyro sensor 1211 may detect a body direction and a rotation angle of the computer device 1200, and the gyro sensor 1211 may collect a 3D motion of the user on the computer device 1200 in cooperation with the acceleration sensor 1210. The processor 1201, based on the data collected by the gyro sensor 1211, may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensors 1212 may be disposed on the side bezel of the computer device 1200 and/or on the lower layers of the touch display 1205. When the pressure sensor 1212 is disposed on the side frame of the computer device 1200, a user's grip signal on the computer device 1200 may be detected, and left-right hand recognition or shortcut operation may be performed according to the grip signal. When the pressure sensor 1212 is disposed at a lower layer of the touch display screen 1205, the operability control on the UI interface can be controlled according to the pressure operation of the user on the touch display screen 1205. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The optical sensor 1213 is used to collect the ambient light intensity. In one embodiment, the processor 1201 may control the display brightness of the touch display 1205 according to the ambient light intensity collected by the optical sensor 1213. Specifically, when the ambient light intensity is high, the display brightness of the touch display panel 1205 is increased; when the ambient light intensity is low, the display brightness of the touch display panel 1205 is turned down. In another embodiment, the processor 1201 may also dynamically adjust the shooting parameters of the camera assembly 1206 according to the ambient light intensity collected by the optical sensor 1213.
A proximity sensor 1214, also known as a distance sensor, is typically provided on the front of the computer device 1200. The proximity sensor 1214 is used to measure the distance between the user and the front of the computer device 1200. In one embodiment, when the proximity sensor 1214 detects that the distance between the user and the front of the computer device 1200 is gradually decreasing, the processor 1201 controls the touch display 1205 to switch from the screen-on state to the screen-off state; when the proximity sensor 1214 detects that the distance between the user and the front of the computer device 1200 is gradually increasing, the processor 1201 controls the touch display 1205 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in FIG. 12 is not intended to be limiting of the computer device 1200 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
The embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, where the memory stores at least one program, and the at least one program is loaded by the processor and executed to implement the training method for the speaker log model or the speaker recognition method provided in the foregoing method embodiments.
The embodiment of the present application further provides a computer-readable storage medium, where at least one program is stored in the storage medium, and the at least one program is loaded and executed by a processor to implement the method for training a speaker log model or the method for speaker recognition provided in the above-mentioned method embodiments.
It is understood that in the embodiments of the present application, data related to user data processing related to user identity or characteristics, such as related data, historical data, and figures, etc., need to be approved or approved by a user when the above embodiments of the present application are applied to specific products or technologies, and the collection, use and processing of the related data need to comply with relevant laws and regulations and standards of relevant countries and regions.
It should be understood that reference to "a plurality" herein means two or more. "and/or" describes the association relationship of the associated object, indicating that there may be three relationships, for example, a and/or B, which may indicate: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only an example of the present application and should not be taken as limiting; any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (20)

1. A method of training a speaker log model, the method comprising:
acquiring a characteristic sequence and a real label of a sample voice signal, wherein the real label is a label representing the category of a real speaker;
obtaining an estimated attractor sequence according to the characteristic sequence, wherein one attractor in the estimated attractor sequence represents one speaker class;
inputting the characteristic sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker class probability, wherein the estimated speaker class probability refers to the probability of the speaker class estimated by the speaker log model;
calculating a first loss function value based on the estimated speaker class probability and the true label;
updating model parameters of the speaker log model based on the first loss function value.
2. The training method of claim 1, wherein the speaker log model comprises an encoder and a decoder;
the obtaining of the estimated attractor sequence according to the characteristic sequence comprises:
inputting the characteristic sequence into the encoder for encoding processing to obtain a speaker characteristic vector;
and inputting the speaker characteristic vector into the decoder for decoding to obtain the estimated attractor sequence.
3. The training method of claim 2, wherein the speaker log model comprises a classifier network;
inputting the characteristic sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker classification probability, including:
calculating a similarity between each feature in the sequence of features and each attractor in the sequence of estimated attractors;
and inputting the similarity results of the characteristic sequences and the estimated attractor sequences into the classifier network for calculation to obtain the estimated speaker classification probability.
4. A training method as claimed in any one of claims 1 to 3, characterized in that the method further comprises:
acquiring an ideal attractor sequence according to the characteristic sequence and the real label, wherein one attractor in the ideal attractor sequence represents a speaker class;
inputting the characteristic sequence and the ideal attractor sequence into the speaker log model to obtain an ideal speaker class probability, wherein the ideal speaker class probability is the probability of a real speaker class obtained based on the real label;
calculating a second loss function value based on the ideal speaker class probability and the real label;
updating the model parameters of the speaker log model based on the first loss function value, comprising:
updating model parameters of the speaker log model based on the first loss function value and the second loss function value.
5. The training method of claim 4, wherein the obtaining of the ideal attractor sequence according to the feature sequence and the real label comprises:
and obtaining the ideal attractor sequence based on the product of the real label and the characteristic sequence.
6. The training method of claim 5, wherein the speaker log model comprises a classifier network;
inputting the feature sequence and the ideal attractor sequence into the classifier network in the speaker log model to obtain an ideal speaker classification probability, including:
calculating a similarity between each feature in the sequence of features and each attractor in the sequence of ideal attractors;
and inputting the similarity results of the characteristic sequence and the ideal attractor sequence into the classifier network in the speaker log model for calculation to obtain the ideal speaker class probability.
7. The training method as claimed in claim 4, wherein the first loss function value is a first cross entropy between the estimated speaker class probability and the real label, and the second loss function value is a second cross entropy between the ideal speaker class probability and the real label;
updating model parameters of the speaker log model based on the first loss function value and the second loss function value, comprising:
updating model parameters of the speaker log model based on the sum of the first cross entropy and the second cross entropy.
8. The training method of any one of claims 1 to 7, wherein the speaker log model further comprises a feature extraction network;
the obtaining of the feature sequence of the sample speech signal includes:
acquiring voice characteristics of the sample voice signal, wherein the voice characteristics are time-frequency characteristic data of the sample voice signal;
and performing feature extraction on the voice features through the feature extraction network to obtain the feature sequence of the sample voice signal.
9. A method for speaker recognition, the method comprising:
acquiring a characteristic sequence of a voice signal;
obtaining an estimated attractor sequence according to the characteristic sequence, wherein one attractor in the estimated attractor sequence represents one speaker class;
inputting the characteristic sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker class probability, wherein the estimated speaker class probability refers to the probability of the speaker class estimated by the speaker log model;
and determining the speaker type and the number of speakers corresponding to the voice signal based on the estimated speaker type probability.
10. The method of claim 9, wherein the speaker log model comprises an encoder and a decoder;
the obtaining of the estimated attractor sequence according to the characteristic sequence comprises:
inputting the characteristic sequence into the encoder for encoding processing to obtain a speaker characteristic vector;
and inputting the speaker characteristic vector into the decoder for decoding to obtain the estimated attractor sequence.
11. The method of claim 10, wherein the speaker log model comprises a classifier network;
inputting the characteristic sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker classification probability, including:
calculating a similarity between each feature in the sequence of features and each of the estimated attractors in the sequence of estimated attractors;
and inputting the similarity results of the characteristic sequences and the estimated attractor sequences into the classifier network for calculation to obtain the estimated speaker classification probability.
12. The method according to any one of claims 9 to 11, wherein said determining the speaker class and the speaker count corresponding to the speech signal based on the estimated speaker class probability comprises:
and determining that the speaker corresponding to the first probability value is the speaker class corresponding to the first timestamp in the voice signal when the first probability value corresponding to the first timestamp in the estimated speaker class probability is greater than a first threshold.
13. The method according to any one of claims 9 to 11, wherein said determining the speaker class and the speaker count corresponding to the speech signal based on the estimated speaker class probability comprises:
determining a speaker class label vector corresponding to the voice signal based on the estimated speaker class probability;
determining the number of the speakers corresponding to the voice signals according to the speaker category label vectors;
the speaker category label vector is a label vector representing a speaker category obtained based on a probability value in the estimated speaker category probability.
14. The method of claim 13, wherein determining a speaker class label vector corresponding to the speech signal based on the estimated speaker class probability comprises:
setting the class label corresponding to a first probability value in the speaker class label vector to 1 when the first probability value, corresponding to a first timestamp in the estimated speaker class probability, is greater than a second threshold;
setting the class label corresponding to the second probability value and the class label corresponding to the third probability value in the speaker class label vector to be 1 under the condition that the second probability value and the third probability value corresponding to a second timestamp in the estimated speaker class probability are both greater than a third threshold and are both smaller than the second threshold;
setting the category label corresponding to the maximum probability value in the fourth probability values corresponding to the third timestamp in the speaker category label vector to be 1 under the condition that the fourth probability values corresponding to the third timestamp in the estimated speaker category probability are not more than the third threshold;
the speaker class label vector is derived based on the values of the class labels.
15. The method of claim 14, wherein said determining the number of speakers corresponding to the speech signal based on the speaker class label vector comprises:
summing the speaker category label vector along the timestamp dimension to obtain a speaker category number vector;
and determining the number of the speakers corresponding to the voice signals based on the number of nonzero elements in the speaker category number vector.
16. An apparatus for training a speaker log model, the apparatus comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a characteristic sequence and a real label of a sample voice signal, and the real label is a label representing the category of a real speaker;
the second acquisition module is used for acquiring an estimated attractor sequence according to the characteristic sequence, wherein one attractor in the estimated attractor sequence represents one speaker class;
the estimation module is used for inputting the characteristic sequence and the estimated attractor sequence into the speaker log model to obtain an estimated speaker class probability, wherein the estimated speaker class probability refers to the probability of the speaker class obtained by estimation of the speaker log model;
a calculation module for calculating a first loss function value based on the estimated speaker class probability and the real label;
and the updating module is used for updating the model parameters of the speaker log model based on the first loss function value.
17. A speaker recognition apparatus, the apparatus comprising:
a first acquisition module, configured to acquire a feature sequence of a speech signal;
a second acquisition module, configured to acquire an estimated attractor sequence according to the feature sequence, wherein one attractor in the estimated attractor sequence represents one speaker class;
an estimation module, configured to input the feature sequence and the estimated attractor sequence into a speaker log model to obtain an estimated speaker class probability, wherein the estimated speaker class probability refers to the speaker class probability estimated by the speaker log model; and
a recognition module, configured to determine the speaker class and the number of speakers corresponding to the speech signal based on the estimated speaker class probability.
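(Illustration, not part of the claims.) At inference time, the recognition apparatus of claim 17 chains the same pieces end to end. A hypothetical sketch reusing the helpers introduced above:

```python
import torch

def recognize(diarization_model, attractor_net, features, second_thr=0.7, third_thr=0.3):
    """Return the speaker class label vector and the speaker count for one speech signal."""
    with torch.no_grad():
        attractors = attractor_net(features)                       # estimated attractor sequence
        probs = diarization_model(features, attractors).cpu().numpy()
    labels = probs_to_labels(probs, second_thr, third_thr)         # claim 14 sketch above
    return labels, count_speakers(labels)                          # claim 15 sketch above
```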
18. A computer device, characterized in that the computer device comprises a processor and a memory, the memory storing at least one computer program, the at least one computer program being loaded and executed by the processor to implement the training method of a speaker log model according to any one of claims 1 to 8, or the speaker recognition method according to any one of claims 9 to 15.
19. A computer storage medium having at least one computer program stored thereon, the at least one computer program being loaded and executed by a processor to implement the training method of a speaker log model according to any one of claims 1 to 8, or the speaker recognition method according to any one of claims 9 to 15.
20. A computer program product, characterized in that the computer program product comprises a computer program stored in a computer-readable storage medium; a processor of a computer device reads the computer program from the computer-readable storage medium and executes it, causing the computer device to perform the training method of the speaker log model according to any one of claims 1 to 8, or the speaker recognition method according to any one of claims 9 to 15.
CN202210177866.4A 2022-02-25 2022-02-25 Training method, device, equipment and storage medium of speaker log model Active CN114429768B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210177866.4A CN114429768B (en) 2022-02-25 2022-02-25 Training method, device, equipment and storage medium of speaker log model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210177866.4A CN114429768B (en) 2022-02-25 2022-02-25 Training method, device, equipment and storage medium of speaker log model

Publications (2)

Publication Number Publication Date
CN114429768A true CN114429768A (en) 2022-05-03
CN114429768B CN114429768B (en) 2024-02-02

Family

ID=81312670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210177866.4A Active CN114429768B (en) 2022-02-25 2022-02-25 Training method, device, equipment and storage medium of speaker log model

Country Status (1)

Country Link
CN (1) CN114429768B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200184353A1 (en) * 2018-12-06 2020-06-11 Fujitsu Limited Learning of a feature based on betti sequences obtained from time series data
CN113990327A (en) * 2021-11-18 2022-01-28 北京达佳互联信息技术有限公司 Method for training representation extraction model of speaking object and method for identifying identity of speaking object

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
常烜恺 (CHANG Xuankai): "Monaural Multi-Speaker Speech Separation and Recognition" (in Chinese), China Masters' Theses Full-text Database, Information Science and Technology Series, no. 2020 *

Also Published As

Publication number Publication date
CN114429768B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN110097019B (en) Character recognition method, character recognition device, computer equipment and storage medium
CN111696532B (en) Speech recognition method, device, electronic equipment and storage medium
CN109558512A (en) A kind of personalized recommendation method based on audio, device and mobile terminal
CN110807361A (en) Human body recognition method and device, computer equipment and storage medium
TWI757668B (en) Network optimization method and device, image processing method and device, storage medium
CN110600040B (en) Voiceprint feature registration method and device, computer equipment and storage medium
CN108320756B (en) Method and device for detecting whether audio is pure music audio
CN111986691B (en) Audio processing method, device, computer equipment and storage medium
CN108922531B (en) Slot position identification method and device, electronic equipment and storage medium
CN111581958A (en) Conversation state determining method and device, computer equipment and storage medium
CN111445901A (en) Audio data acquisition method and device, electronic equipment and storage medium
CN111739517A (en) Speech recognition method, speech recognition device, computer equipment and medium
CN113269612A (en) Article recommendation method and device, electronic equipment and storage medium
CN113515987A (en) Palm print recognition method and device, computer equipment and storage medium
CN112001442B (en) Feature detection method, device, computer equipment and storage medium
CN111341307A (en) Voice recognition method and device, electronic equipment and storage medium
CN110232417B (en) Image recognition method and device, computer equipment and computer readable storage medium
CN114547429A (en) Data recommendation method and device, server and storage medium
CN109829067B (en) Audio data processing method and device, electronic equipment and storage medium
CN110837557A (en) Abstract generation method, device, equipment and medium
CN113744736B (en) Command word recognition method and device, electronic equipment and storage medium
CN114996515A (en) Training method of video feature extraction model, text generation method and device
CN111310701B (en) Gesture recognition method, device, equipment and storage medium
CN114429768B (en) Training method, device, equipment and storage medium of speaker log model
CN113707162A (en) Voice signal processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant