CN113345466A - Main speaker voice detection method, device and equipment based on multi-microphone scene

Main speaker voice detection method, device and equipment based on multi-microphone scene

Info

Publication number
CN113345466A
Authority
CN
China
Prior art keywords
voice
data
voiceprint
main speaker
slice
Prior art date
Legal status
Granted
Application number
CN202110609713.8A
Other languages
Chinese (zh)
Other versions
CN113345466B (en)
Inventor
罗剑
王健宗
程宁
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110609713.8A
Publication of CN113345466A
Application granted
Publication of CN113345466B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application provides a method, an apparatus, computer equipment and a computer-readable storage medium for detecting the voice of a main speaker in a multi-microphone scene, belonging to the technical field of artificial intelligence. Voice data are acquired over a plurality of preset voice channels corresponding to a plurality of preset microphones arranged in the voice scene; the voiceprint features of the main speaker contained in the voice data are obtained from the voice data; multi-voice-channel fusion coding is performed on the voice data and the voiceprint features to extract the speech hidden vector sequence data of the main speaker contained in them; and the hidden vector sequence data are decoded to obtain the voice data of the main speaker. By analysing the main speaker automatically and making use of the multi-microphone hardware resources of the voice scene, the voice of the main speaker can be detected automatically and the accuracy of main-speaker voice detection in the voice scene can be improved.

Description

Main speaker voice detection method, device and equipment based on multi-microphone scene
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for detecting a main speaker voice based on a multi-microphone scenario, a computer device, and a computer-readable storage medium.
Background
In speech recognition with interfering sounds, for example in scenarios such as in-vehicle voice assistants, automatic generation of open-class subtitles, or automated conference chairing, a main-speaker speech filtering system conventionally has to register the voiceprint features of the main speaker first and then perform speech recognition based on those voiceprint features; otherwise the main speaker's voice cannot be recognized. Registering the main speaker's voiceprint features typically means recording several minutes of the speaker's audio in order to generate an x-vector (the speaker's voiceprint feature). The inventors realized that recognizing speech by pre-registering the main speaker's voiceprint in this way makes the application of speech recognition inconvenient and reduces both the efficiency of recognizing the main speaker's voice and the flexibility of speech recognition.
Disclosure of Invention
The application provides a method and a device for detecting the voice of a main speaker based on a multi-microphone scene, computer equipment and a computer readable storage medium, which can solve the technical problem of low efficiency of recognizing the voice of the main speaker in the traditional technology.
In a first aspect, the present application provides a method for detecting a speech of a main speaker based on a multi-microphone scenario, the method comprising: acquiring voice data transmitted by a plurality of preset voice channels corresponding to a plurality of preset microphones set based on a voice scene; acquiring the voiceprint characteristics of a main speaker contained in the voice data based on the voice data; performing multi-voice channel fusion coding on the voice data and the voiceprint features to extract voice hidden vector sequence data of the main speaker contained in the voice data and the voiceprint features; and decoding the voice hiding vector sequence data to obtain the main speaker voice data of the main speaker.
In a second aspect, the present application further provides a device for detecting a speech of a main speaker based on a multi-microphone scenario, the device comprising: the first acquisition unit is used for acquiring voice data transmitted by a plurality of preset voice channels corresponding to a plurality of preset microphones set based on a voice scene; the second acquisition unit is used for acquiring the voiceprint characteristics of a main speaker contained in the voice data based on the voice data; a coding unit, configured to perform multi-voice channel fusion coding on the voice data and the voiceprint feature to extract voice hidden vector sequence data of the main speaker included in the voice data and the voiceprint feature; and the decoding unit is used for decoding the voice hiding vector sequence data to obtain the main speaker voice data of the main speaker.
In a third aspect, the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the multi-microphone scene-based main speaker speech detection method when executing the computer program.
In a fourth aspect, the present application further provides a computer readable storage medium having a computer program stored thereon, which, when executed by a processor, causes the processor to perform the steps of the multi-microphone scene based primary speaker speech detection method.
The application provides a method and an apparatus for detecting the voice of a main speaker based on a multi-microphone scene, computer equipment and a computer-readable storage medium. Voice data transmitted by a plurality of preset voice channels are acquired through the preset voice channels corresponding to a plurality of preset microphones arranged in the voice scene; the voiceprint features of the main speaker contained in the voice data are obtained from the voice data; multi-voice-channel fusion coding is performed on the voice data and the voiceprint features to extract the speech hidden vector sequence data of the main speaker contained in them; and the hidden vector sequence data are decoded to obtain the main speaker voice data of the main speaker. Compared with the traditional technology, which recognizes the main speaker's voice on the basis of pre-registration, the embodiments of the application recognize the main speaker's voice through automatic main-speaker analysis and full use of the multi-microphone hardware resources in the voice scene, without registering the main speaker's voice. The main speaker's voice can therefore be detected automatically in the voice scene, the accuracy of main-speaker voice detection is improved, and the efficiency of recognizing the main speaker's voice and the flexibility of speech recognition are improved, which in turn improves the convenience of widely deploying voice detection and the utilization of the multi-microphone hardware resources of the voice scene.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flowchart of a method for detecting a speech of a main speaker based on a multi-microphone scenario according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a main speaker voice automatic filtering system of a main speaker voice detection method based on a multi-microphone scene according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a first sub-flow of a method for detecting a speech of a main speaker based on a multi-microphone scenario according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a second sub-flow of a method for detecting a speech of a main speaker based on a multi-microphone scenario according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a third sub-flow of a method for detecting a speech of a main speaker based on a multi-microphone scenario according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a fourth sub-flow of a method for detecting a speech of a main speaker based on a multi-microphone scenario according to an embodiment of the present application;
FIG. 7 is a fifth sub-flowchart of a method for detecting a speech of a main speaker based on a multi-microphone scenario according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a main speaker analysis and voiceprint extraction module of the multi-microphone scene-based main speaker voice detection method according to the embodiment of the present application;
FIG. 9 is a sixth sub-flowchart of a method for detecting a speech of a main speaker based on a multi-microphone scenario according to an embodiment of the present application;
FIG. 10 is a block diagram illustrating a multi-voice channel fused encoder module of a multi-microphone scene based method for detecting a primary speaker's voice in accordance with an embodiment of the present invention;
FIG. 11 is a block diagram illustrating a decoder module of the multi-microphone scene-based method for detecting a main speaker's speech according to an embodiment of the present invention;
FIG. 12 is a schematic block diagram of a multi-microphone scenario based primary speaker speech detection apparatus according to an embodiment of the present application; and
fig. 13 is a schematic block diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for detecting the speech of a main speaker based on a multi-microphone scenario according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps S11-S14:
s11, acquiring voice data transmitted by the preset voice channel based on a plurality of preset voice channels corresponding to a plurality of preset microphones set in the voice scene.
Specifically, a plurality of microphones are arranged in a voice scene, and when speech is produced in the scene, the voice data corresponding to that speech are collected through the preset voice channel of each preset microphone. For example, in a voice scene such as an in-vehicle environment, an open-class environment or a conference scene, a plurality of microphones are set up in that environment; each preset microphone corresponds to one preset voice channel, so the plurality of preset microphones correspond to a plurality of preset voice channels, and the speech produced in the scene is collected simultaneously by all microphones of the same preset voice scene. For example, when a speech A is produced in the voice scene, the n microphones arranged in the scene (n ≥ 1, n an integer) collect it simultaneously, yielding voice data A' corresponding to the speech A. Since every microphone picks up the speech A, the data collected by the i-th microphone is Ai (i = 1 … n), and the resulting voice data are A' = {A1, A2, A3 … An}. Although the data source of every microphone is the same speech A, the individual recordings A1, A2, A3 … An are generally not identical, because collection factors such as the pickup angle differ from microphone to microphone. Referring to fig. 2, fig. 2 is a schematic structural diagram of the automatic main-speaker voice filtering system of the multi-microphone scene-based main speaker voice detection method provided in this application. In the application scenario shown in fig. 2, four microphones are included (microphone 0, microphone 1, microphone 2 and microphone 3), and the speech produced in that scenario is collected by the four microphones simultaneously.
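For illustration only, a minimal sketch of collecting the per-microphone recordings A1 … An into one multi-channel array might look as follows; the soundfile dependency and the file names are assumptions, not components specified by the patent.

```python
# Illustrative sketch (not from the patent): gather synchronized recordings of the
# same utterance from n microphones into one multi-channel array A' = {A1, ..., An}.
import numpy as np
import soundfile as sf  # assumed I/O library

def collect_channels(wav_paths):
    """Read one mono file per microphone and stack them as (channels, samples)."""
    channels = [sf.read(p)[0] for p in wav_paths]        # each Ai: 1-D float array
    length = min(len(c) for c in channels)               # align lengths defensively
    return np.stack([c[:length] for c in channels])      # shape (n, T_samples)

# e.g. a 4-microphone setup as in fig. 2 (microphone 0 ... microphone 3)
# voice_data = collect_channels(["mic0.wav", "mic1.wav", "mic2.wav", "mic3.wav"])
```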
And S12, acquiring the voiceprint characteristics of the main speaker contained in the voice data based on the voice data.
Specifically, after the voice data transmitted through the plurality of preset voice channels are obtained, the voice data need to be analyzed, because the voice scene may contain interfering voices (for example other people's voices) in addition to the main speaker's voice. The voiceprint features of the main speaker are therefore extracted from the voice data, and the main speaker's voice is then separated from the voice data according to those voiceprint features.
Further, referring to fig. 3, fig. 3 is a schematic view of a first sub-flow of a method for detecting a voice of a main speaker based on a multi-microphone scenario according to an embodiment of the present application, as shown in fig. 3, in this example, the step of acquiring a voiceprint feature of the main speaker included in the voice data based on the voice data includes:
s121, slicing the voice data according to a time sequence corresponding to the voice data to obtain a plurality of voice slice data;
s122, inputting the voice slice data into a preset voiceprint detection model based on a residual error network so as to extract slice voiceprint characteristics contained in the voice slice data;
and S123, clustering all the slice voiceprint characteristics to obtain target voiceprint characteristics, and taking the target voiceprint characteristics as the voiceprint characteristics of the main speaker contained in the voice data.
A voiceprint is a sound-wave spectrum that carries speech information. Voiceprint recognition, also called speaker recognition, uses the voiceprint characteristics of speech for speaker identification and speaker verification.
Specifically, the voice data contain the main speaker's voice together with other interfering sounds, and the main speaker's voice differs markedly from the other sounds in terms of its voiceprint. The voice data are time-sequential data formed according to the order of pronunciation, the voiceprint features of the main speaker are contained throughout the voice data, and these voiceprint features can therefore be extracted from the voice data.
The voice data may be sliced according to their time sequence to obtain a plurality of voice slice data. Voiceprint detection is then performed on the voice slice data with a preset voiceprint detection model, which may be a deep-learning model and in particular a residual-network-based model, so that the slice voiceprint features contained in each voice slice are extracted. All slice voiceprint features are then clustered so that the dominant voiceprint feature is extracted from all the voice slices; this dominant feature is taken as the target voiceprint feature, and the target voiceprint feature is taken as the voiceprint feature of the main speaker contained in the voice data. In this way the main speaker's voiceprint features are obtained from the voice slice data. Besides the preset residual-network-based voiceprint detection model, other deep-learning detection models may be used for the voiceprint detection, such as a Gated Recurrent Unit (GRU) neural network, an improvement on the RNN. The residual network, ResNet, is a CNN classification model; in the embodiments of the present application a Wide ResNet is preferred, and Wide ResNet-34 in particular can extract voiceprint features more effectively. It should be noted that the parameters of the preset residual-network-based voiceprint detection model are trained separately, and once that training is complete the parameters are not updated during the training of the model of the embodiments of the present application.
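For illustration, a hedged sketch of how a slice could be turned into an x-vector is shown below; it computes MFCC features (which the description of fig. 8 mentions as the input representation) and passes them through a separately pre-trained, frozen voiceprint network. The librosa dependency and the voiceprint_model placeholder are assumptions, not components specified by the patent.

```python
# Hedged sketch: compute MFCC features for one speech slice and pass them through a
# separately pre-trained, frozen voiceprint network to obtain a slice x-vector.
# `voiceprint_model` is a placeholder, not an API defined by the patent.
import librosa
import torch

def slice_xvector(slice_waveform, sr, voiceprint_model, n_mfcc=40):
    mfcc = librosa.feature.mfcc(y=slice_waveform, sr=sr, n_mfcc=n_mfcc)  # (N, frames)
    feats = torch.from_numpy(mfcc).float().unsqueeze(0)                  # (1, N, frames)
    with torch.no_grad():                                                # parameters stay frozen
        xvec = voiceprint_model(feats)                                   # (1, embedding_dim)
    return xvec.squeeze(0)
```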
Further, referring to fig. 4, fig. 4 is a schematic diagram illustrating a second sub-flow of a method for detecting a voice of a main speaker based on a multi-microphone scenario according to an embodiment of the present application, as shown in fig. 4, in this example, the step of inputting the voice slice data to a preset voiceprint detection model based on a residual error network to extract a slice voiceprint feature included in the voice slice data includes:
s1221, judging whether the voice slice data represent a mute scene;
s1222, if the voice slice data does not represent a mute scene, taking the voice slice data as target voice slice data;
s1223, inputting all the target voice slice data to a preset voiceprint detection model based on a residual error network so as to extract slice voiceprint characteristics contained in the voice slice data;
s1224, if the voice slice data represents a mute scene, discarding the voice slice data.
Silence detection, also referred to as silence suppression, voice activity detection, voice endpoint detection or voice boundary detection (voice activity detection is abbreviated VAD), is used to detect whether a signal contains human speech, and may be based on an HMM (Hidden Markov Model), an MLP (Multilayer Perceptron) or a DNN (Deep Neural Network).
Specifically, the silent periods of the speech, such as pauses in pronunciation, produce silence in the voice slice data; if a slice contains only silence, extracting voiceprint features from it is meaningless. In order to improve the efficiency of acquiring the main speaker's voiceprint features from the voice data, and thereby the efficiency of main-speaker voice detection, it is first judged whether each voice slice represents a mute scene, that is, whether the speech corresponding to the slice is silent. If the voice slice data represent a mute scene, they contain no voiceprint features and no further voiceprint extraction needs to be performed on them; if they do not represent a mute scene, they contain voiceprint features, and the slice voiceprint features contained in them are further extracted. Silence detection is therefore performed on the voice slice data to judge whether they represent a mute scene: if so, the voice slice data are discarded; if not, the voice slice data are taken as target voice slice data, and all the target voice slice data are input into the preset residual-network-based voiceprint detection model to extract the slice voiceprint features they contain.
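As a simple illustration of this filtering step, the sketch below uses an energy threshold as a stand-in for the HMM-, MLP- or DNN-based VAD mentioned above; the threshold value and the function names are assumptions.

```python
# Minimal energy-based stand-in for the VAD step (the patent allows HMM/MLP/DNN VAD);
# the -40 dBFS threshold is an illustrative assumption, not a value from the patent.
import numpy as np

def is_silent(slice_waveform, threshold_db=-40.0):
    rms = np.sqrt(np.mean(np.square(slice_waveform)) + 1e-12)
    return 20.0 * np.log10(rms + 1e-12) < threshold_db

def keep_voiced(slices):
    """Discard slices that represent a mute scene; keep the rest as target slices."""
    return [s for s in slices if not is_silent(s)]
```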
Further, referring to fig. 5, fig. 5 is a third sub-flowchart of the method for detecting a voice of a main speaker based on a multi-microphone scene according to the present invention, as shown in fig. 5, in this example, the step of slicing the voice data according to a time sequence corresponding to the voice data to obtain a plurality of voice slice data includes:
s1211, acquiring a voice data tensor corresponding to the voice data, wherein the voice data tensor comprises a time shaft corresponding to the voice data;
and S1212, segmenting the voice data tensor into overlapped slices along the time axis to obtain a plurality of voice slice data.
The data used by a neural network are stored in multidimensional NumPy arrays, also called tensors; a tensor is a data container, and a dimension of a tensor is generally called an axis. When time (or sequence order) matters for the data, the data should be stored in a tensor with a time axis; because the temporal order of pronunciation is important for voice data, the embodiments of the present application store the voice data in a data tensor with a time axis.
Specifically, after the voice data tensor corresponding to the voice data is obtained, the tensor is sliced along the time axis it contains. Instead of cutting the tensor into consecutive, non-overlapping segments (for example one voice slice every 2 seconds), the tensor may be cut into overlapping slices along the time axis, for example slices two seconds long with a one-second overlap: seconds 1-2 form one slice, seconds 2-3 form the next, seconds 3-4 the next, and so on, so that a plurality of overlapping voice slice data are obtained. Because each pair of adjacent slices overlaps, the parameters extracted from the slices transition smoothly, so the slice voiceprint features subsequently extracted from each voice slice describe the main speaker's voiceprint more accurately and reflect it more faithfully; improving the accuracy of voiceprint extraction in this way improves the accuracy of main-speaker voice detection.
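For illustration, a minimal sketch of the overlapped slicing described above (two-second windows with a one-second overlap) might look as follows; the sampling-rate handling and the function name are assumptions.

```python
# Sketch of the overlapped slicing described above: 2-second windows hopped by
# 1 second along the time axis, so consecutive slices share one second of audio.
def overlapped_slices(voice_tensor, sr, win_s=2.0, hop_s=1.0):
    """voice_tensor: array of shape (channels, samples); returns (channels, win) slices."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    n_samples = voice_tensor.shape[-1]
    return [voice_tensor[..., start:start + win]
            for start in range(0, n_samples - win + 1, hop)]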
Further, referring to fig. 6, fig. 6 is a fourth sub-flowchart of the method for detecting a voice of a main speaker based on a multi-microphone scene according to the present invention, as shown in fig. 6, in this example, the step of clustering all the slice voiceprint features to obtain a target voiceprint feature includes:
s1231, based on preset condensed hierarchical classification, combining all the voiceprint features of the slices to obtain a voiceprint feature binary tree;
and S1232, determining the target voiceprint characteristics according to the voiceprint characteristic binary tree.
Agglomerative hierarchical clustering starts by treating each individual point as its own cluster and successively merges the two closest clusters until only one cluster remains, finally forming a binary tree; in other words, each object is first regarded as its own category, and the two most similar clusters are merged repeatedly until all objects belong to the same category.
Specifically, since agglomerative hierarchical clustering merges the two clustering objects with the shortest distance (i.e. the closest ones), after all the slice voiceprint features are passed into the preset agglomerative hierarchical clustering, it keeps merging the sample pair consisting of the two slice voiceprint features with the shortest distance, where a sample pair is the pair of slice voiceprint features merged in one clustering step, until everything has been merged. A voiceprint feature binary tree is thus generated, which contains a plurality of clusters. The target voiceprint feature can then be determined from this binary tree: for example, the voiceprint feature corresponding to the final clustering result of the tree may be taken as the target voiceprint feature, or the voiceprint feature corresponding to the sub-cluster with the most node members among the sub-clusters of the tree may be selected as the target voiceprint feature.
Further, referring to fig. 7, fig. 7 is a fifth sub-flowchart of a method for detecting a voice of a main speaker based on a multi-microphone scenario according to an embodiment of the present application, as shown in fig. 7, in this example, the step of determining a target voiceprint feature according to the binary voiceprint feature tree includes:
s12321, determining all initial clusters of which the respective proximity degrees of all sample pairs in the clusters contained in the binary voiceprint feature tree are less than or equal to a preset proximity threshold according to the binary voiceprint feature tree, wherein the sample pairs are a pair of slice voiceprint features clustered each time in a clustering process;
s12322, screening out the clusters with the most node members in all the initial clusters as target clusters;
and S12323, obtaining the slice voiceprint characteristics corresponding to the central point of the target cluster, and taking the slice voiceprint characteristics corresponding to the central point as the target voiceprint characteristics.
Specifically, to determine the target voiceprint feature from the voiceprint feature binary tree, a heuristic algorithm may be adopted: according to the binary tree, the proximity of every sample pair within each cluster of the tree is calculated, where proximity may be measured by Euclidean distance, Manhattan distance, Mahalanobis distance, cosine similarity, the Jaccard coefficient, Bregman divergence, and so on. Based on these proximities, all initial clusters in which the proximity of every sample pair is less than or equal to a preset proximity threshold are determined, the cluster with the most node members among these initial clusters is selected as the target cluster, the slice voiceprint feature corresponding to the center point of the target cluster is obtained, and that feature is taken as the target voiceprint feature. In this way all slice voiceprint features are clustered to obtain the target voiceprint feature; a sketch of this step is given after the following list. The center point of the target cluster may be obtained by Partitioning Around Medoids (PAM), which can use an arbitrary distance. The PAM algorithm is as follows:
(1) randomly select K observations (each called a center point);
(2) calculate the distance/dissimilarity between every observation and each center;
(3) assign each observation to its nearest center point;
(4) calculate the sum of the distances from each center point to its observations (the total cost);
(5) select a point in a class that is not the center and interchange it with the center point;
(6) reassign each point to its nearest center point;
(7) calculate the total cost again;
(8) if the total cost is less than the total cost calculated in step (4), take the new point as the center point;
(9) repeat steps (5) to (8) until the center points no longer change.
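A hedged sketch of this clustering-and-selection step is given below; it uses SciPy's agglomerative clustering and a medoid computation as stand-ins for the binary tree traversal and the PAM procedure, so the SciPy dependency and the function names are assumptions rather than details from the patent.

```python
# Hedged sketch of the clustering heuristic: build the agglomerative binary tree,
# keep clusters whose merge distance is <= lambda, pick the one with the most
# members, and return its medoid as the estimated main-speaker x-vector.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def estimate_main_xvector(xvectors, lam):
    """xvectors: (num_slices, dim); lam: Euclidean distance threshold."""
    Z = linkage(xvectors, method="average", metric="euclidean")
    labels = fcluster(Z, t=lam, criterion="distance")       # cut the tree at lambda
    largest = np.argmax(np.bincount(labels)[1:]) + 1         # cluster with most members
    members = xvectors[labels == largest]
    dists = cdist(members, members).sum(axis=1)              # medoid = min total distance
    return members[np.argmin(dists)]
```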
Further, referring to fig. 8, fig. 8 is a schematic structural diagram of the main speaker analysis and voiceprint extraction module of the multi-microphone scene-based main speaker voice detection method according to an embodiment of the present application. As shown in fig. 8, the input signal tensor corresponding to the voice data may be windowed with a window function, i.e. a signal of limited time-domain width; a Wide ResNet is used to extract the speaker feature vector (x-vector) of each frame of the signal, and clustering finally yields the feature vector of the most dominant speaker. Here T denotes the total length of the input data in the time domain, C the number of voice channels of the input data, and N the number of MFCC (Mel-Frequency Cepstral Coefficients) feature dimensions of the input data. For example, when the input voice signal is windowed, the input voice data can be divided into segments two seconds long with a one-second overlap to obtain the voice slice data, and all the voice slice data are input into Wide ResNet-34; Wide ResNet-34 in particular extracts voiceprint features more effectively. The parameters of the Wide ResNet-34 network are pre-trained separately and are not updated during the training of the embodiments of the present application. The x-vector extracted from each voice segment is passed into the agglomerative hierarchical clustering, which keeps merging the sample pairs with the shortest distance until everything has been merged, finally generating a binary tree. Further, a heuristic algorithm may be used: among all clusters whose merge distance is smaller than a threshold λ, the center point of the cluster with the most members is selected as the final estimated x-vector, namely:
    k = argmax_i count_subtree(node_i)   subject to   height(node_i) ≤ λ        (1)

    x̂ = center(node_k),   x̂ ∈ X                                                 (2)

wherein X is the set of all x-vectors, node_i is the i-th node in the binary tree, count_subtree is a function that counts the number of child nodes of a cluster node, height is a function that computes the Euclidean distance between the two child clusters of a cluster, λ is the preset threshold on the Euclidean distance between two child clusters, k is the node of the class with the most node members, and x̂ denotes the final estimated x-vector.
And S13, performing multi-voice channel fusion coding on the voice data and the voiceprint features to extract voice hidden vector sequence data of the main speaker contained in the voice data and the voiceprint features.
The speech hidden vector sequence data, which may also be referred to as a hidden vector sequence, are used to describe the vector features of the main speaker's voice hidden in the voice data.
Specifically, the voice data come from a plurality of preset microphones, that is, they are multi-voice-channel data, and the voiceprint features of the main speaker contained in the voice data have been obtained from them. Multi-voice-channel fusion coding can therefore be performed on the voice data and the voiceprint features to extract the speech hidden vector sequence data of the main speaker contained in the voice data and the voiceprint features; the hidden vector sequence data describe the features of the main speaker's voice that were originally hidden in the voice data, so that the voice data corresponding to the main speaker's voice are obtained by fusing the multiple voice channels. Because the main speaker's voiceprint features are introduced at an early stage, the encoder learns, with the help of those voiceprint features, the main speaker's voice corresponding to them from the multiple voice channels, so the main speaker's voice corresponding to the extracted hidden vector sequence data is more accurate and the detected main-speaker voice is more accurate.
Further, fig. 9 is a sixth sub-flowchart of the method for detecting a speech of a main speaker based on a multi-microphone scenario according to the present application, as shown in fig. 9, in this example, the step of performing multi-voice channel fusion coding on the speech data and the voiceprint feature to extract speech hidden vector sequence data of the main speaker included in the speech data and the voiceprint feature includes:
s131, acquiring hidden voice channel voiceprint characteristics contained among a plurality of voice channels corresponding to the voice data according to the voiceprint characteristics based on a self-attention module among preset voice channels;
s132, acquiring hidden time axis voiceprint characteristics, contained in a time axis, of each voice channel corresponding to the voice data according to the voiceprint characteristics based on a preset time domain self-attention module;
s133, combining the voiceprint feature of the hidden voice channel and the voiceprint feature of the hidden time axis into a sequence to obtain voice hidden vector sequence data of the main speaker contained in the voice data and the voiceprint feature.
Specifically, based on the preset inter-voice-channel self-attention module, which operates across the channels of the voice data, the hidden voice-channel voiceprint features contained among the plurality of voice channels corresponding to the voice data are obtained according to the voiceprint features; based on the preset time-domain self-attention module, which operates along the time axis of the voice data, the hidden time-axis voiceprint features contained on the time axis of each voice channel corresponding to the voice data are obtained according to the voiceprint features; and the hidden voice-channel voiceprint features and the hidden time-axis voiceprint features are combined into a sequence to obtain the speech hidden vector sequence data of the main speaker contained in the voice data and the voiceprint features.
Referring to fig. 10, fig. 10 is a schematic diagram of the multi-voice-channel fusion encoder module of the multi-microphone scene-based main speaker voice detection method according to an embodiment of the present application. The multi-voice-channel fusion encoder in this example fuses the voice data input by the multiple voice channels and introduces the x-vector at an early stage, that is, the x-vector is used as an input parameter of a fully connected layer in the model, so that the encoder can learn to use the information of the x-vector and, according to the x-vector, learn from the multiple voice channels the main speaker's voice corresponding to that x-vector. The most important module is the time-domain/frequency-domain self-attention module, which is stacked repeatedly n times and consists of two main parts: the inter-voice-channel self-attention module and the time-domain self-attention module.
In the inter-voice-channel self-attention module, C is the number of voice channels, N is the feature dimensionality and T is the number of frames of the input tensor; x_t^c denotes the features recorded by voice channel c at time t, and X^c = (x_1^c, …, x_T^c) denotes the feature set recorded by voice channel c at all times. The inter-voice-channel self-attention module can then be expressed as

    z_t^c = FCN( xvector ⊕ MultiHeadAttn(x_t^1, x_t^2, …, x_t^C) )              (3)

wherein ⊕ is the vector splicing (concatenation) operation, MultiHeadAttn is a multi-head self-attention mechanism and FCN is a fully connected network; by splicing the x-vector in front of the fully connected network, the model can fuse the target x-vector at the early feature extraction stage.
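As a rough illustration of the inter-voice-channel self-attention step described above, the following PyTorch sketch attends across the C channel features at each frame and splices the x-vector in front of the fully connected network. The class name, layer sizes and the choice of PyTorch are assumptions, not details fixed by the patent.

```python
# Hedged PyTorch sketch of the inter-voice-channel self-attention block: attend
# across the C channels at each frame, splice the target x-vector in front of the
# fully connected network. feat_dim must be divisible by n_heads.
import torch
import torch.nn as nn

class InterChannelSelfAttention(nn.Module):
    def __init__(self, feat_dim, xvec_dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.fcn = nn.Sequential(nn.Linear(feat_dim + xvec_dim, feat_dim), nn.ReLU())

    def forward(self, x, xvec):
        """x: (batch, T, C, feat_dim); xvec: (batch, xvec_dim)."""
        b, t, c, d = x.shape
        tokens = x.reshape(b * t, c, d)                      # the channels form the sequence
        attended, _ = self.attn(tokens, tokens, tokens)      # (b*t, C, d)
        attended = attended.reshape(b, t, c, d)
        xv = xvec[:, None, None, :].expand(b, t, c, xvec.shape[-1])
        return self.fcn(torch.cat([attended, xv], dim=-1))   # fuse the x-vector early
```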
In the time-domain self-attention module, the model is required to learn information along the time axis. Since the length of the time axis may exceed the number of dimensions the self-attention mechanism can process at one time, the embodiments of the present application may adopt the long-range-dependency model structure of Transformer-XL, which adds a self-attention mechanism across segments of the time axis on top of the basic Transformer structure and thus propagates ultra-long-range dependencies. Its main principle can be expressed as

    m_τ^(n-1) = [ stop_grad(h_(τ-1)^(n-1)) ; h_τ^(n-1) ]
    q = h_τ^(n-1) W_Q,   k = m_τ^(n-1) W_K,   v = m_τ^(n-1) W_V                  (4)
    h_τ^n = softmax( q kᵀ / sqrt(d_attn) ) v

where τ denotes the index of the segment currently being processed and n denotes the layer of the time-domain/frequency-domain self-attention module currently being computed; stop_grad is a function that terminates gradient propagation, indicating that the previous hidden state is no longer updated; d_attn denotes the number of dimensions the self-attention mechanism can process at one time; W_Q, W_K and W_V are learnable parameters; and h_τ^n is the hidden vector learned by the self-attention.
Formula (4) above describes the computation for a single head of a single voice channel. The hidden vectors of all voice channels are spliced together, giving formula (5):

    H_t^n = [ h_t^(n,0) ; h_t^(n,1) ; … ; h_t^(n,C-1) ]                          (5)

where h_t^(n,c-1) denotes the hidden vector of the (c-1)-th voice channel at time t in the n-th layer of the self-attention module, so the input size of every layer of the time-domain/frequency-domain self-attention module stays consistent. The structure of the inter-voice-channel self-attention module described above is therefore shared by every layer.
For the hidden vectors h_t^(N,c) of the final N-th layer, the embodiments of the present application use a fully connected network to fuse the features of all voice channels:

    e_t = FCN( h_t^(N,1) ⊕ h_t^(N,2) ⊕ … ⊕ h_t^(N,C) )                           (6)

    E = (e_1, e_2, …, e_T)                                                        (7)

As shown in formula (7), the hidden vector sequence E is subsequently transmitted to the decoder, where ℝ denotes the space of the input voice data, N denotes the N-th (final) layer of the model, C the number of voice channels, and T the T-th time instant.
And S14, decoding the voice hiding vector sequence data to obtain the main speaker voice data of the main speaker.
Specifically, after the speech hidden vector sequence data are obtained, they are decoded to obtain the main speaker voice data of the main speaker. Because the main speaker voice data are based on the screened voiceprint features of the main speaker, they can be regarded as the voice data corresponding to the main speaker's voice; the other noise is thereby filtered out, and the clean voice of the main speaker is extracted and output so that it can be processed subsequently, for example by performing speech recognition on the main speaker voice data. Referring to fig. 11, fig. 11 is a schematic diagram of the decoder module structure of the multi-microphone scene-based main speaker voice detection method according to an embodiment of the present application; the overall structure of the decoder is shown in fig. 11, and its output is the reconstructed main-speaker speech, denoted ŷ_t below.
During training, the regression loss function Huber Loss is used to compare the decoder output ŷ_t with the ground-truth speech x_t of the training sample (i.e. the true sample values). The formula used by Huber Loss is

    L_δ(x_t, ŷ_t) = (1/2) (x_t - ŷ_t)^2                if |x_t - ŷ_t| ≤ δ
    L_δ(x_t, ŷ_t) = δ |x_t - ŷ_t| - (1/2) δ^2          otherwise
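For illustration only, the following sketch shows one way such a training step could look in PyTorch, using the built-in Huber loss; the decoder, the optimizer and the delta value are assumptions, not components fixed by the patent.

```python
# Hedged training sketch: Huber loss between the decoder's reconstructed main-speaker
# speech and the ground-truth sample x_t. The delta value is an illustrative assumption.
import torch.nn as nn

huber = nn.HuberLoss(delta=1.0)

def training_step(decoder, hidden_sequence, target_speech, optimizer):
    predicted = decoder(hidden_sequence)          # reconstructed main-speaker speech
    loss = huber(predicted, target_speech)        # robust to occasional large errors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```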
in actual use, the recognized voice of the main speaker is directly corresponded
Figure BDA0003095190000000143
The voice of the main speaker can be further recognized by transmitting downstream tasks, and the accuracy of the voice recognition of the main speaker is improved.
In the embodiments of the present application, voice data transmitted by a plurality of preset voice channels are acquired through the preset voice channels corresponding to a plurality of preset microphones arranged in the voice scene; the voiceprint features of the main speaker contained in the voice data are obtained from the voice data; multi-voice-channel fusion coding is performed on the voice data and the voiceprint features to extract the speech hidden vector sequence data of the main speaker contained in them; and the hidden vector sequence data are decoded to obtain the main speaker voice data. Compared with the traditional technology, which recognizes the main speaker's voice on the basis of pre-registration, the embodiments of the present application use automatic main-speaker analysis and make full use of the multi-microphone hardware resources of the voice scene, so the main speaker's voice can be detected automatically in the voice scene without being registered. This improves the accuracy of main-speaker voice detection in the voice scene, the efficiency of recognizing the main speaker's voice, and the flexibility of speech recognition, which in turn improves the convenience of widely deploying voice detection and the utilization of the multi-microphone hardware resources of the voice scene.
It should be noted that the method for detecting the speech of the main speaker based on the multi-microphone scenario described in the foregoing embodiments may recombine the technical features included in different embodiments as needed to obtain a combined implementation, but all of the methods are within the scope of the present application.
Referring to fig. 12, fig. 12 is a schematic block diagram of a device for detecting a speech of a main speaker based on a multi-microphone scenario according to an embodiment of the present application. Corresponding to the multi-microphone scene-based main speaker voice detection method, the embodiment of the application also provides a multi-microphone scene-based main speaker voice detection device. As shown in fig. 12, the multi-microphone scene based primary speaker speech detection apparatus 12 includes means for performing the multi-microphone scene based primary speaker speech detection method described above, and the multi-microphone scene based primary speaker speech detection apparatus 12 may be configured in a computer device. Specifically, referring to fig. 12, the multi-microphone scene-based main speaker speech detection apparatus 12 includes a first obtaining unit 121, a second obtaining unit 122, an encoding unit 123 and a decoding unit 124.
The first obtaining unit 121 is configured to obtain, based on a plurality of preset voice channels corresponding to a plurality of preset microphones set in a voice scene, voice data transmitted by the preset voice channels; a second obtaining unit 122, configured to obtain a voiceprint feature of a main speaker included in the voice data based on the voice data; an encoding unit 123, configured to perform multi-voice channel fusion encoding on the voice data and the voiceprint feature to extract voice hidden vector sequence data of the main speaker included in the voice data and the voiceprint feature; a decoding unit 124, configured to decode the speech hidden vector sequence data to obtain main speaker speech data of the main speaker.
In an embodiment, the second obtaining unit 122 includes:
the first slicing subunit is used for slicing the voice data according to the time sequence corresponding to the voice data to obtain a plurality of voice slicing data;
the first extraction subunit is used for inputting the voice slice data into a preset voiceprint detection model based on a residual error network so as to extract slice voiceprint features contained in the voice slice data;
and the clustering subunit is used for clustering all the slice voiceprint characteristics to obtain target voiceprint characteristics, and taking the target voiceprint characteristics as the voiceprint characteristics of a main speaker contained in the voice data.
In one embodiment, the extraction subunit includes:
the judging subunit is used for judging whether the voice slice data represents a mute scene;
the first screening subunit is used for taking the voice slice data as target voice slice data if the voice slice data does not represent a mute scene;
and the second extraction subunit is used for inputting all the target voice slice data into a preset voiceprint detection model based on a residual error network so as to extract the slice voiceprint features contained in the voice slice data.
In one embodiment, the first slice subunit comprises:
the first acquiring subunit is configured to acquire a voice data tensor corresponding to the voice data, where the voice data tensor includes a time axis corresponding to the voice data;
and the second slicing subunit is used for slicing the voice data tensor into overlapped slices along the time axis to obtain a plurality of voice slice data.
In one embodiment, the clustering subunit includes:
the merging subunit is used for merging all the voiceprint features of the slices based on preset condensed hierarchical classification to obtain a voiceprint feature binary tree;
and the first determining subunit is used for determining the target voiceprint characteristics according to the voiceprint characteristic binary tree.
In an embodiment, the first determining subunit includes:
a second determining subunit, configured to determine, according to the binary voiceprint feature tree, all initial clusters in which respective proximity degrees of all sample pairs in clusters included in the binary voiceprint feature tree are smaller than or equal to a preset proximity threshold, where the sample pair is a pair of the slice voiceprint features clustered each time in a clustering process;
the second screening subunit is used for screening out the cluster with the most node members in all the initial clusters as a target cluster;
and the second acquisition subunit is used for acquiring the slice voiceprint characteristics corresponding to the central point of the target cluster and taking the slice voiceprint characteristics corresponding to the central point as the target voiceprint characteristics.
In one embodiment, the encoding unit 123 includes:
a third obtaining subunit, configured to obtain, based on a preset inter-voice-channel self-attention module, a voiceprint feature of a hidden voice channel included among multiple voice channels corresponding to the voice data according to the voiceprint feature;
a fourth obtaining subunit, configured to obtain, based on a preset time domain self-attention module, a hidden time axis voiceprint feature included on a time axis in each of the voice channels corresponding to the voice data according to the voiceprint feature;
and the combining subunit is used for combining the voiceprint features of the hidden voice channel and the voiceprint features of the hidden time axis into a sequence to obtain voice hidden vector sequence data of the main speaker contained in the voice data and the voiceprint features.
It should be noted that, as can be clearly understood by those skilled in the art, the specific implementation process of the main speaker voice detection apparatus and each unit based on the multi-microphone scenario may refer to the corresponding description in the foregoing method embodiment, and for convenience and brevity of description, no further description is provided herein.
Meanwhile, the division and connection modes of the units in the multi-microphone scene-based main speaker voice detection device are only used for illustration, in other embodiments, the multi-microphone scene-based main speaker voice detection device can be divided into different units as required, and the units in the multi-microphone scene-based main speaker voice detection device can also adopt different connection sequences and modes to complete all or part of the functions of the multi-microphone scene-based main speaker voice detection device.
The above-described multi-microphone scenario based primary speaker speech detection apparatus may be implemented in the form of a computer program that may be run on a computer device such as that shown in fig. 13.
Referring to fig. 13, fig. 13 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a computer device such as a desktop computer or a server, or may be a component or part of another device.
Referring to fig. 13, the computer device 500 includes a processor 502, a memory, which may include a non-volatile storage medium 503 and an internal memory 504, which may also be a volatile storage medium, and a network interface 505 connected by a system bus 501.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, causes the processor 502 to perform a method for detecting the speech of a main speaker based on a multi-microphone scenario as described above.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can execute a method for detecting the main speaker voice based on the multi-microphone scenario.
The network interface 505 is used for network communication with other devices. Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing device 500 to which the disclosed aspects apply, as a particular computing device 500 may include more or less components than those shown, or may combine certain components, or have a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 13, and are not described herein again.
In the method for detecting the speech of the main speaker based on the multi-microphone scenario, the processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps: acquiring voice data transmitted by a plurality of preset voice channels corresponding to a plurality of preset microphones set based on a voice scene; acquiring the voiceprint characteristics of a main speaker contained in the voice data based on the voice data; performing multi-voice channel fusion coding on the voice data and the voiceprint features to extract voice hidden vector sequence data of the main speaker contained in the voice data and the voiceprint features; and decoding the voice hiding vector sequence data to obtain the main speaker voice data of the main speaker.
In an embodiment, when the processor 502 implements the step of obtaining the voiceprint feature of the main speaker included in the speech data based on the speech data, the following steps are specifically implemented:
slicing the voice data according to a time sequence corresponding to the voice data to obtain a plurality of voice slice data; inputting the voice slice data into a preset voiceprint detection model based on a residual network so as to extract slice voiceprint characteristics contained in the voice slice data; and clustering all the slice voiceprint characteristics to obtain target voiceprint characteristics, and taking the target voiceprint characteristics as the voiceprint characteristics of the main speaker contained in the voice data.
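For illustration, a slice-level voiceprint model built around residual (skip-connection) blocks might look like the following PyTorch sketch; the layer sizes, feature dimension, and embedding dimension are assumptions made for the example and are not taken from the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock1d(nn.Module):
    """One residual block over the frame axis of a voice slice."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)   # skip connection: the "residual" path

class SliceVoiceprintModel(nn.Module):
    """Maps one voice slice (feature_dim x frames) to a fixed-length embedding."""
    def __init__(self, feature_dim: int = 40, emb_dim: int = 192):
        super().__init__()
        self.front = nn.Conv1d(feature_dim, 128, kernel_size=5, padding=2)
        self.blocks = nn.Sequential(ResBlock1d(128), ResBlock1d(128))
        self.pool = nn.AdaptiveAvgPool1d(1)   # average over time
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, slice_feats):            # (batch, feature_dim, frames)
        h = F.relu(self.front(slice_feats))
        h = self.blocks(h)
        h = self.pool(h).squeeze(-1)           # (batch, 128)
        return F.normalize(self.proj(h), dim=-1)   # unit-length slice voiceprint

# Example: 8 slices, 40 features per frame, 200 frames each.
model = SliceVoiceprintModel()
embeddings = model(torch.randn(8, 40, 200))    # (8, 192)
```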
In an embodiment, when the processor 502 implements the step of inputting the voice slice data into a preset voiceprint detection model based on a residual network to extract a slice voiceprint feature included in the voice slice data, the following steps are specifically implemented:
judging whether the voice slice data represent a mute scene or not; if the voice slice data do not represent a mute scene, taking the voice slice data as target voice slice data; and inputting all the target voice slice data into a preset voiceprint detection model based on a residual network so as to extract the slice voiceprint characteristics contained in the voice slice data.
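A simple energy-based check is one way to judge whether a slice represents a mute scene. The sketch below assumes float waveforms in [-1, 1] and an arbitrary -45 dB floor; the application itself does not fix a concrete criterion, so this is only an illustrative choice.

```python
import numpy as np

def is_mute_scene(voice_slice: np.ndarray, threshold_db: float = -45.0) -> bool:
    """Treat a slice as a mute scene when its RMS level falls below a floor.

    voice_slice: float waveform samples in [-1, 1] for one slice.
    threshold_db: assumed energy floor; a trained VAD could be used instead.
    """
    rms = np.sqrt(np.mean(np.square(voice_slice)) + 1e-12)
    return 20.0 * np.log10(rms + 1e-12) < threshold_db

def filter_mute_slices(voice_slices: list) -> list:
    """Keep only the slices that do not represent a mute scene."""
    return [s for s in voice_slices if not is_mute_scene(s)]
```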
In an embodiment, when the processor 502 implements the step of slicing the voice data according to the time sequence corresponding to the voice data to obtain a plurality of voice slice data, the following steps are specifically implemented:
acquiring a voice data tensor corresponding to the voice data, wherein the voice data tensor comprises a time axis corresponding to the voice data; and segmenting the voice data tensor into overlapping slices along the time axis to obtain a plurality of voice slice data.
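The overlapped slicing along the time axis can be sketched as follows; the slice length and hop size are illustrative parameters (any hop smaller than the slice length yields overlapping slices).

```python
import numpy as np

def slice_with_overlap(voice_tensor: np.ndarray, slice_len: int, hop: int) -> list:
    """Cut a (channels, samples) voice data tensor into overlapping slices
    along the time axis."""
    _, n_samples = voice_tensor.shape
    slices = []
    for start in range(0, max(n_samples - slice_len, 0) + 1, hop):
        slices.append(voice_tensor[:, start:start + slice_len])
    return slices

# Example: 2-second slices with 50% overlap at a 16 kHz sampling rate.
tensor = np.random.randn(4, 10 * 16000)           # 4 channels, 10 seconds
voice_slices = slice_with_overlap(tensor, slice_len=2 * 16000, hop=16000)
```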
In an embodiment, when the processor 502 implements the step of clustering all the slice voiceprint features to obtain the target voiceprint feature, the following steps are specifically implemented:
based on preset agglomerative hierarchical clustering, merging all the slice voiceprint features to obtain a voiceprint feature binary tree; and determining the target voiceprint feature according to the voiceprint feature binary tree.
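A minimal sketch of the agglomerative (bottom-up) clustering step using SciPy's linkage routine, whose output matrix encodes exactly such a binary merge tree. Average linkage with cosine distance is an assumption made for the example; the application does not fix a particular metric.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def build_voiceprint_tree(slice_voiceprints: np.ndarray) -> np.ndarray:
    """Agglomerative hierarchical clustering of the slice voiceprints.

    slice_voiceprints: (n_slices, emb_dim) array of slice embeddings.
    Returns the SciPy linkage matrix, which encodes the binary merge tree:
    each row records the two clusters merged and their merge distance.
    """
    return linkage(slice_voiceprints, method="average", metric="cosine")
```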
In an embodiment, when the processor 502 implements the step of determining the target voiceprint feature according to the binary voiceprint feature tree, the following steps are specifically implemented:
determining, according to the voiceprint feature binary tree, all initial clusters in which the proximity of every sample pair within the cluster is less than or equal to a preset proximity threshold, wherein a sample pair is the pair of slice voiceprint features merged at each step of the clustering process; screening out, from all the initial clusters, the cluster with the most node members as a target cluster; and acquiring the slice voiceprint feature corresponding to the central point of the target cluster, and taking that slice voiceprint feature as the target voiceprint feature.
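Continuing the previous sketch, the target cluster and its central voiceprint could be selected as follows. The threshold value is arbitrary, the cophenetic-distance cut is used as a proxy for the pairwise proximity condition, and taking the member nearest the cluster mean as the "central point" is an assumption made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster

def target_voiceprint(slice_voiceprints: np.ndarray,
                      tree: np.ndarray,
                      proximity_threshold: float = 0.3) -> np.ndarray:
    """Pick the main speaker's voiceprint from the binary merge tree.

    1. Cut the tree so that members of each cluster are within the proximity
       threshold of one another (approximated via cophenetic distance).
    2. Take the cluster with the most members, assumed to be the main speaker.
    3. Return the member embedding nearest the cluster centre.
    """
    labels = fcluster(tree, t=proximity_threshold, criterion="distance")
    biggest = np.bincount(labels)[1:].argmax() + 1      # labels start at 1
    members = slice_voiceprints[labels == biggest]
    centre = members.mean(axis=0)
    nearest = np.linalg.norm(members - centre, axis=1).argmin()
    return members[nearest]
```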
In an embodiment, when the processor 502 performs the multi-voice channel fusion coding on the voice data and the voiceprint feature to extract the voice hidden vector sequence data of the main speaker included in the voice data and the voiceprint feature, the following steps are specifically performed:
based on a preset inter-voice-channel self-attention module, acquiring, according to the voiceprint features, hidden voice-channel voiceprint features contained among the plurality of voice channels corresponding to the voice data; based on a preset time-domain self-attention module, acquiring, according to the voiceprint features, hidden time-axis voiceprint features contained on the time axis of each voice channel corresponding to the voice data; and combining the hidden voice-channel voiceprint features and the hidden time-axis voiceprint features into a sequence to obtain the voice hidden vector sequence data of the main speaker contained in the voice data and the voiceprint features.
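For illustration, the inter-voice-channel self-attention and the time-domain self-attention can be sketched with standard multi-head attention layers as below; the dimensions, head count, and the way the voiceprint is injected are assumptions made for the example, not the application's actual architecture.

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    """Illustrative fusion coder: one self-attention over the channel axis and
    one over the time axis, conditioned on the main speaker's voiceprint, with
    the two hidden streams joined into a single hidden-vector sequence."""

    def __init__(self, feat_dim: int = 80, vp_dim: int = 192, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(feat_dim + vp_dim, d_model)   # inject the voiceprint
        self.channel_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.time_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, feats: torch.Tensor, voiceprint: torch.Tensor) -> torch.Tensor:
        # feats: (channels, frames, feat_dim); voiceprint: (vp_dim,)
        c, t, _ = feats.shape
        vp = voiceprint.expand(c, t, -1)                    # broadcast voiceprint
        x = self.proj(torch.cat([feats, vp], dim=-1))       # (c, t, d_model)

        # Self-attention *between voice channels*: attend across the channel
        # axis at every frame -> hidden voice-channel features.
        xc = x.transpose(0, 1)                              # (t, c, d_model)
        ch_hidden, _ = self.channel_attn(xc, xc, xc)
        ch_hidden = ch_hidden.mean(dim=1)                   # (t, d_model)

        # Time-domain self-attention: attend along the time axis of each
        # channel -> hidden time-axis features, averaged over channels.
        tm_hidden, _ = self.time_attn(x, x, x)              # (c, t, d_model)
        tm_hidden = tm_hidden.mean(dim=0)                   # (t, d_model)

        # Combine the two hidden streams into one hidden-vector sequence.
        return self.merge(torch.cat([ch_hidden, tm_hidden], dim=-1))

# Example: 4 channels, 100 frames of 80-dim features, a 192-dim voiceprint.
encoder = FusionEncoder()
hidden_sequence = encoder(torch.randn(4, 100, 80), torch.randn(192))   # (100, 256)
```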
It should be understood that in the embodiments of the present application, the processor 502 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the process steps of the method embodiments described above.
Accordingly, the present application also provides a computer-readable storage medium. The computer readable storage medium may be a non-volatile computer readable storage medium or a volatile computer readable storage medium, and the computer readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the steps of the multi-microphone scene based main speaker voice detection method described in the above embodiments.
The computer readable storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or a memory of the device. The computer readable storage medium may also be an external storage device equipped on the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the device.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The storage medium is a physical, non-transitory storage medium, and may be any of various physical storage media capable of storing a computer program, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two; to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above in general terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative: the division of the units is only a division by logical function, and there may be other ways of dividing them in actual implementation; units or components may be combined or integrated into another system, and some features may be omitted or not implemented.
The steps in the methods of the embodiments of the present application may be reordered, combined, or deleted according to actual needs. The units in the devices of the embodiments of the present application may be combined, divided, or deleted according to actual needs. In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the software product is stored in a storage medium and includes instructions for causing an electronic device (which may be a personal computer, a terminal, or a network device) to perform all or part of the steps of the methods according to the embodiments of the present application.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A multi-microphone scene-based main speaker voice detection method, the method comprising:
acquiring voice data transmitted by a plurality of preset voice channels corresponding to a plurality of preset microphones set based on a voice scene;
acquiring the voiceprint characteristics of a main speaker contained in the voice data based on the voice data;
performing multi-voice channel fusion coding on the voice data and the voiceprint features to extract voice hidden vector sequence data of the main speaker contained in the voice data and the voiceprint features;
and decoding the voice hidden vector sequence data to obtain the main speaker voice data of the main speaker.
2. The multi-microphone scene-based voice detection method for the main speaker according to claim 1, wherein the step of obtaining the voiceprint characteristics of the main speaker contained in the voice data based on the voice data comprises:
slicing the voice data according to a time sequence corresponding to the voice data to obtain a plurality of voice slice data;
inputting the voice slice data into a preset voiceprint detection model based on a residual network so as to extract slice voiceprint characteristics contained in the voice slice data;
and clustering all the slice voiceprint characteristics to obtain target voiceprint characteristics, and taking the target voiceprint characteristics as the voiceprint characteristics of the main speaker contained in the voice data.
3. The multi-microphone scene-based method for detecting the voice of the main speaker as claimed in claim 2, wherein the step of inputting the voice slice data into a preset voiceprint detection model based on a residual network to extract the slice voiceprint features contained in the voice slice data comprises:
judging whether the voice slice data represent a mute scene or not;
if the voice slice data do not represent a mute scene, taking the voice slice data as target voice slice data;
and inputting all the target voice slice data into a preset voiceprint detection model based on a residual network so as to extract the slice voiceprint characteristics contained in the voice slice data.
4. The multi-microphone scene-based method for detecting the speech of the main speaker as claimed in claim 2, wherein the step of slicing the speech data according to the time sequence corresponding to the speech data to obtain a plurality of speech slice data comprises:
acquiring a voice data tensor corresponding to the voice data, wherein the voice data tensor comprises a time axis corresponding to the voice data;
and segmenting the voice data tensor into overlapped slices along the time axis to obtain a plurality of voice slice data.
5. The multi-microphone scene-based method for detecting the speech of the main speaker as claimed in claim 2, wherein the step of clustering all the slice voiceprint features to obtain a target voiceprint feature comprises:
based on preset agglomerative hierarchical clustering, merging all the slice voiceprint features to obtain a voiceprint feature binary tree;
and determining the target voiceprint characteristics according to the voiceprint characteristic binary tree.
6. The multi-microphone scene-based method for detecting the voice of the main speaker as claimed in claim 5, wherein the step of determining the target voiceprint characteristics according to the binary voiceprint characteristic tree comprises:
determining all initial clusters of which the respective proximity degrees of all sample pairs in the clusters contained in the binary voiceprint feature tree are less than or equal to a preset proximity threshold according to the binary voiceprint feature tree, wherein the sample pairs are a pair of slice voiceprint features clustered each time in a clustering process;
screening out the cluster with the most node members in all the initial clusters as a target cluster;
and acquiring the slice voiceprint characteristics corresponding to the central point of the target cluster, and taking the slice voiceprint characteristics corresponding to the central point as target voiceprint characteristics.
7. The multi-microphone scene-based method for detecting the speech of the main speaker as claimed in claim 1, wherein the step of performing multi-voice channel fusion coding on the speech data and the voiceprint features to extract the sequence data of the speech concealment vector of the main speaker included in the speech data and the voiceprint features comprises:
acquiring hidden voice channel voiceprint characteristics contained among a plurality of voice channels corresponding to the voice data according to the voiceprint characteristics based on a self-attention module among preset voice channels;
acquiring hidden time axis voiceprint characteristics, contained on a time axis, of each voice channel corresponding to the voice data according to the voiceprint characteristics based on a preset time domain self-attention module;
and combining the voiceprint feature of the hidden voice channel and the voiceprint feature of the hidden time axis into a sequence to obtain voice hidden vector sequence data of the main speaker contained in the voice data and the voiceprint feature.
8. A multi-microphone scene-based main speaker voice detection device, the device comprising:
the first acquisition unit is used for acquiring voice data transmitted by a plurality of preset voice channels corresponding to a plurality of preset microphones set based on a voice scene;
the second acquisition unit is used for acquiring the voiceprint characteristics of a main speaker contained in the voice data based on the voice data;
a coding unit, configured to perform multi-voice channel fusion coding on the voice data and the voiceprint feature to extract voice hidden vector sequence data of the main speaker included in the voice data and the voiceprint feature;
and the decoding unit is used for decoding the voice hiding vector sequence data to obtain the main speaker voice data of the main speaker.
9. A computer device, comprising a memory and a processor coupled to the memory; the memory is used for storing a computer program; the processor is adapted to run the computer program to perform the steps of the method according to any of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when being executed by a processor, realizes the steps of the method according to any one of claims 1 to 7.
CN202110609713.8A 2021-06-01 2021-06-01 Main speaker voice detection method, device and equipment based on multi-microphone scene Active CN113345466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110609713.8A CN113345466B (en) 2021-06-01 2021-06-01 Main speaker voice detection method, device and equipment based on multi-microphone scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110609713.8A CN113345466B (en) 2021-06-01 2021-06-01 Main speaker voice detection method, device and equipment based on multi-microphone scene

Publications (2)

Publication Number Publication Date
CN113345466A true CN113345466A (en) 2021-09-03
CN113345466B CN113345466B (en) 2024-03-01

Family

ID=77472700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110609713.8A Active CN113345466B (en) 2021-06-01 2021-06-01 Main speaker voice detection method, device and equipment based on multi-microphone scene

Country Status (1)

Country Link
CN (1) CN113345466B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111480197A (en) * 2017-12-15 2020-07-31 三菱电机株式会社 Speech recognition system
CN111557029A (en) * 2017-12-15 2020-08-18 三菱电机株式会社 Method and system for training a multilingual speech recognition network and speech recognition system for performing multilingual speech recognition
CN111989742A (en) * 2018-04-13 2020-11-24 三菱电机株式会社 Speech recognition system and method for using speech recognition system
US20190318725A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Research Laboratories, Inc. Methods and Systems for Recognizing Simultaneous Speech by Multiple Speakers
US20190341058A1 (en) * 2018-05-06 2019-11-07 Microsoft Technology Licensing, Llc Joint neural network for speaker recognition
US20190341055A1 (en) * 2018-05-07 2019-11-07 Microsoft Technology Licensing, Llc Voice identification enrollment
CN110503969A (en) * 2018-11-23 2019-11-26 腾讯科技(深圳)有限公司 A kind of audio data processing method, device and storage medium
CN110428842A (en) * 2019-08-13 2019-11-08 广州国音智能科技有限公司 Speech model training method, device, equipment and computer readable storage medium
CN110648678A (en) * 2019-09-20 2020-01-03 厦门亿联网络技术股份有限公司 Scene identification method and system for conference with multiple microphones
CN110675891A (en) * 2019-09-25 2020-01-10 电子科技大学 Voice separation method and module based on multilayer attention mechanism
CN111048064A (en) * 2020-03-13 2020-04-21 同盾控股有限公司 Voice cloning method and device based on single speaker voice synthesis data set
CN111429919A (en) * 2020-03-30 2020-07-17 招商局金融科技有限公司 Anti-sound crosstalk method based on conference recording system, electronic device and storage medium
CN111739539A (en) * 2020-06-10 2020-10-02 北京小米松果电子有限公司 Method, device and storage medium for determining number of speakers
CN112435684A (en) * 2020-11-03 2021-03-02 中电金信软件有限公司 Voice separation method and device, computer equipment and storage medium
CN112289323A (en) * 2020-12-29 2021-01-29 深圳追一科技有限公司 Voice data processing method and device, computer equipment and storage medium
CN112786009A (en) * 2021-02-26 2021-05-11 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113742471A (en) * 2021-09-15 2021-12-03 重庆大学 Vector retrieval type dialogue method of general question-answering system
CN113742471B (en) * 2021-09-15 2023-09-12 重庆大学 Vector retrieval type dialogue method of Pu-Fa question-answering system
CN115171721A (en) * 2022-07-03 2022-10-11 北京星汉博纳医药科技有限公司 Audio data slice identification processing method
CN115171721B (en) * 2022-07-03 2023-10-17 北京星汉博纳医药科技有限公司 Audio data slice identification processing method

Also Published As

Publication number Publication date
CN113345466B (en) 2024-03-01

Similar Documents

Publication Publication Date Title
US11636860B2 (en) Word-level blind diarization of recorded calls with arbitrary number of speakers
Takahashi et al. Recursive speech separation for unknown number of speakers
CN107393526B (en) Voice silence detection method, device, computer equipment and storage medium
Stöter et al. Countnet: Estimating the number of concurrent speakers using supervised learning
CN110852215B (en) Multi-mode emotion recognition method and system and storage medium
Heittola et al. Supervised model training for overlapping sound events based on unsupervised source separation
Tong et al. A comparative study of robustness of deep learning approaches for VAD
Cakir et al. Multi-label vs. combined single-label sound event detection with deep neural networks
CN111739539A (en) Method, device and storage medium for determining number of speakers
CN113345466A (en) Main speaker voice detection method, device and equipment based on multi-microphone scene
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
CN113488063B (en) Audio separation method based on mixed features and encoding and decoding
Hebbar et al. Robust speech activity detection in movie audio: Data resources and experimental evaluation
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
Kumawat et al. Applying TDNN Architectures for Analyzing Duration Dependencies on Speech Emotion Recognition.
Sasou Automatic identification of pathological voice quality based on the GRBAS categorization
Imoto et al. Acoustic scene analysis from acoustic event sequence with intermittent missing event
CN114822557A (en) Method, device, equipment and storage medium for distinguishing different sounds in classroom
JP2013235050A (en) Information processing apparatus and method, and program
CN112992175A (en) Voice distinguishing method and voice recording device thereof
Jingzhou et al. Audio segmentation and classification approach based on adaptive CNN in broadcast domain
Büker et al. Deep convolutional neural networks for double compressed AMR audio detection
CN112735385B (en) Voice endpoint detection method, device, computer equipment and storage medium
Wilkinghoff et al. TACos: Learning temporally structured embeddings for few-shot keyword spotting with dynamic time warping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant