CN118136045A - Speech feature extraction method, and related method, device, equipment and storage medium - Google Patents


Info

Publication number
CN118136045A
Authority
CN
China
Prior art keywords
sample
voice
speech
segments
channel
Prior art date
Legal status
Pending
Application number
CN202410096686.2A
Other languages
Chinese (zh)
Inventor
胡今朝
吴重亮
李永超
吴明辉
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202410096686.2A
Publication of CN118136045A
Legal status: Pending


Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The application discloses a speech feature extraction method, together with a related method, apparatus, device, and storage medium. The speech feature extraction method includes: acquiring speech to be processed; and performing feature extraction on the speech segments in a plurality of speech channels based on a feature extraction model to obtain the speech feature of each speech segment in the plurality of speech channels. The feature extraction model is trained on a sample speech set with at least three kinds of contrastive learning, where the sample speech set contains sample speech segments from a plurality of sample multi-channel speeches, and the three kinds of contrastive learning are: comparing first feature similarities between sample speech segments from the same and from different sample multi-channel speeches, comparing second feature similarities between sample speech segments from the same and from different channels within the same sample multi-channel speech, and comparing third feature similarities between sample speech segments with the same and with different timings within the same sample multi-channel speech. With this scheme, the speech feature extraction accuracy for multi-channel speech can be improved.

Description

Speech feature extraction method, and related method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method for extracting speech features, and related methods, devices, apparatuses, and storage media.
Background
Speech processing tasks such as speech recognition and speech emotion classification have great application value in many scenarios, including international conferences, cross-border travel, and intelligent customer service.
However, the accuracy of speech processing depends heavily on the accuracy of speech feature extraction. Existing speech feature extraction techniques focus mainly on single-channel speech and pay little attention to multi-channel speech. Moreover, the few speech feature extraction techniques applied to multi-channel speech simply treat each speech channel in the multi-channel speech as an independent single-channel signal during feature extraction, so the additional information that multi-channel speech carries over single-channel speech cannot be fully modeled, which greatly limits the accuracy of subsequent speech processing tasks. In view of this, how to improve the accuracy of speech feature extraction for multi-channel speech is a problem to be solved.
Disclosure of Invention
The present application mainly solves the technical problem of providing a speech feature extraction method, together with a related method, apparatus, device, and storage medium, which can improve the speech feature extraction accuracy for multi-channel speech.
In order to solve the above technical problem, a first aspect of the present application provides a speech feature extraction method, including: acquiring speech to be processed, where the speech to be processed contains a plurality of speech channels; and performing feature extraction on the speech segments in the plurality of speech channels based on a feature extraction model to obtain the speech feature of each speech segment in the plurality of speech channels. The feature extraction model is trained on a sample speech set with at least three kinds of contrastive learning, where the sample speech set contains sample speech segments from a plurality of sample multi-channel speeches, and the three kinds of contrastive learning are: comparing first feature similarities between sample speech segments from the same and from different sample multi-channel speeches, comparing second feature similarities between sample speech segments from the same and from different channels within the same sample multi-channel speech, and comparing third feature similarities between sample speech segments with the same and with different timings within the same sample multi-channel speech.
In order to solve the above technical problem, a second aspect of the present application provides a speech processing method, including: performing feature extraction on speech to be processed to obtain the speech feature of each speech segment in a plurality of speech channels in the speech to be processed, where the speech features are obtained with the speech feature extraction method of the first aspect; and processing the speech features of the speech segments in the plurality of speech channels in the speech to be processed to obtain a processing result of the speech to be processed.
In order to solve the above technical problem, a third aspect of the present application provides a speech feature extraction apparatus, including an acquisition module and an extraction module. The acquisition module is configured to acquire speech to be processed, where the speech to be processed contains a plurality of speech channels. The extraction module is configured to perform feature extraction on the speech segments in the plurality of speech channels based on a feature extraction model to obtain the speech feature of each speech segment in the plurality of speech channels. The feature extraction model is trained on a sample speech set with at least three kinds of contrastive learning, where the sample speech set contains sample speech segments from a plurality of sample multi-channel speeches, and the three kinds of contrastive learning are: comparing first feature similarities between sample speech segments from the same and from different sample multi-channel speeches, comparing second feature similarities between sample speech segments from the same and from different channels within the same sample multi-channel speech, and comparing third feature similarities between sample speech segments with the same and with different timings within the same sample multi-channel speech.
In order to solve the above technical problem, a fourth aspect of the present application provides a speech processing apparatus, including an extraction module and a processing module. The extraction module is configured to perform feature extraction on speech to be processed to obtain the speech feature of each speech segment in a plurality of speech channels in the speech to be processed, where the speech features are obtained with the speech feature extraction apparatus of the third aspect. The processing module is configured to process the speech features of the speech segments in the plurality of speech channels in the speech to be processed to obtain a processing result of the speech to be processed.
In order to solve the above technical problem, a fifth aspect of the present application provides an electronic device, including a memory and a processor coupled to each other, where the memory stores program instructions, and the processor is configured to execute the program instructions to implement the speech feature extraction method of the first aspect or the speech processing method of the second aspect.
In order to solve the above technical problem, a sixth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor for implementing the speech feature extraction method of the above first aspect or implementing the speech processing method of the above second aspect.
According to the above scheme, the speech to be processed is acquired, where the speech to be processed contains a plurality of speech channels, and feature extraction is performed on the speech segments in the plurality of speech channels based on a feature extraction model to obtain the speech feature of each speech segment in the plurality of speech channels. The feature extraction model is trained on a sample speech set with at least three kinds of contrastive learning, where the sample speech set contains sample speech segments from a plurality of sample multi-channel speeches, and the three kinds of contrastive learning compare feature similarities at the recording level, the channel level, and the timing level. The feature extraction model can therefore exploit the facts that the several channels of one multi-channel speech share similar content, that segments of the same channel within the same multi-channel speech are similar, and that segments with the same timing within the same multi-channel speech are similar, and use them for contrastive learning, so that it fully models the additional information that multi-channel speech carries over single-channel speech and extracts speech features with richer information. In this way, the speech feature extraction accuracy for multi-channel speech can be improved.
Drawings
FIG. 1 is a flow chart of an embodiment of a method for extracting speech features according to the present application;
FIG. 2a is a schematic diagram of one embodiment of a speech segmentation;
FIG. 2b is a schematic diagram of a framework of one embodiment of a feature extraction model;
FIG. 2c is a schematic diagram of one embodiment of a speech feature;
FIG. 2d is a process diagram of an embodiment of measuring a first loss;
FIG. 2e is a process diagram of an embodiment of measuring a second loss;
FIG. 2f is a process diagram of an embodiment of measuring a third loss;
FIG. 3a is a schematic flow chart of an embodiment of a first stage training;
FIG. 3b is a flow chart of an embodiment of the second stage training;
FIG. 4 is a flow chart of an embodiment of a speech processing method of the present application;
FIG. 5 is a schematic diagram of a speech feature extraction apparatus according to an embodiment of the application;
FIG. 6 is a schematic diagram of a speech processing device according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a frame of an embodiment of an electronic device of the present application;
FIG. 8 is a schematic diagram of a frame of one embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes embodiments of the present application in detail with reference to the drawings.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "/" herein generally indicates that the associated object is an "or" relationship. Further, "a plurality" herein means two or more than two.
Referring to fig. 1, fig. 1 is a flowchart illustrating an embodiment of a method for extracting speech features according to the present application. Specifically, the method may include the steps of:
step S11: and acquiring the voice to be processed.
In the embodiment of the present disclosure, the speech to be processed includes a plurality of speech channels. It should be noted that, in practical applications, although the different positions of the microphones cause certain differences among the speech signals of different channels, the multiple speech channels essentially share the same sound source, that is, the different speech channels are essentially homologous and complementary to one another. In addition, the embodiment of the present disclosure does not limit the specific number of speech channels included in the speech to be processed. For example, the speech to be processed may include two, three, or more than three speech channels, which is not exemplified one by one here.
Step S12: and carrying out feature extraction on the voice segments in the voice channels based on the feature extraction model to obtain the voice features of each voice segment in the voice channels.
In the embodiment of the present disclosure, the feature extraction model is trained on a sample speech set with at least three kinds of contrastive learning, where the sample speech set contains sample speech segments from a plurality of sample multi-channel speeches, and the three kinds of contrastive learning are: comparing first feature similarities between sample speech segments from the same and from different sample multi-channel speeches, comparing second feature similarities between sample speech segments from the same and from different channels within the same sample multi-channel speech, and comparing third feature similarities between sample speech segments with the same and with different timings within the same sample multi-channel speech.
In one implementation scenario, a microphone array may be used to collect real speech to obtain sample multi-channel speech. Alternatively, sample multi-channel speech may be obtained by playing back a single-channel sample speech. The above are only a few possible ways of obtaining sample multi-channel speech, and the specific manner of obtaining sample multi-channel speech in actual use is not limited thereby.
In one implementation scenario, after the sample multi-channel speech is obtained, in order to obtain the speech segments that contain speech information, voice activity detection (VAD) may be performed on the sample multi-channel speech, so that the sample speech segments in each speech channel contained in the sample multi-channel speech are obtained. It should be noted that voice activity detection may be implemented with network structures such as an LSTM, a CNN, a DNN, or a self-attention network, which is not limited here. Referring to fig. 2a, fig. 2a is a schematic diagram of an embodiment of speech segmentation. As shown in fig. 2a, taking a sample multi-channel speech that includes four speech channels as an example, the sample multi-channel speech may be segmented by voice activity detection (as shown by the dashed lines in fig. 2a) to obtain the sample speech segments in each speech channel. Of course, the illustration of fig. 2a is only one possible example of speech segmentation in practical application and does not limit the specific result of speech segmentation.
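As a concrete illustration of this step, the following is a minimal Python sketch of slicing a multi-channel waveform into per-channel speech segments. The energy-threshold VAD, the frame sizes, and the function names are assumptions made for the example, since the embodiment above allows any VAD implementation (LSTM, CNN, DNN, self-attention, and so on).

```python
import numpy as np

def energy_vad_segments(channel_wave, sr, frame_ms=25, hop_ms=10, thresh_db=-35.0):
    """Return (start, end) sample indices of speech segments in one channel.

    A toy energy-threshold VAD used only for illustration; any of the
    detectors mentioned above could be substituted."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(channel_wave) - frame) // hop)
    # Frame-level log energy in dB.
    energy = np.array([
        10.0 * np.log10(np.mean(channel_wave[i * hop:i * hop + frame] ** 2) + 1e-10)
        for i in range(n_frames)
    ])
    active = energy > thresh_db
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i * hop                              # segment begins
        elif not is_active and start is not None:
            segments.append((start, i * hop + frame))    # segment ends
            start = None
    if start is not None:
        segments.append((start, len(channel_wave)))
    return segments

def segment_multichannel(wave, sr):
    """wave: (num_channels, num_samples) array; returns one segment list per channel."""
    return [energy_vad_segments(ch, sr) for ch in wave]
```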
In one implementation scenario, the feature extraction model may include, but is not limited to, a network model such as WavLM, and the network structure of the feature extraction model is not limited here. Referring to fig. 2b, fig. 2b is a schematic diagram of a framework of an embodiment of the feature extraction model. As shown in fig. 2b, the feature extraction model may comprise an extraction network and an encoding network: the extraction network is used at least for extracting the initial features of a speech segment from the acoustic features of the speech segment, and the encoding network is used at least for encoding the initial features of the speech segment into the speech features of the speech segment. The extraction network may include, but is not limited to, a convolutional neural network, and the encoding network may include, but is not limited to, a Transformer, and the network structures of the extraction network and the encoding network are not limited here.
In one particular implementation scenario, the acoustic features of the speech segment may include, but are not limited to: FBank, MFCC, etc., the specific kind of acoustic features is not limited herein.
In a specific implementation scenario, taking the convolutional neural network as an example of the extraction network, the extraction network may specifically include N layers, and each layer may include a time domain convolutional layer, a layer normalization layer, and an activation function layer. The number of layers N included in the extraction network may be set to 2, 3, 4, 5, 6, 7, or the like, and the number of layers included in the extraction network is not limited.
In one specific implementation scenario, taking a Transformer as the encoding network, the encoding network may employ a gated relative position bias, so that relative position information can be introduced into the attention computation inside the Transformer to better model local information.
In a specific implementation scenario, please refer to fig. 2b and fig. 2c, where fig. 2c is a schematic diagram of an embodiment of the speech features. In the process of extracting features from multi-channel speech, initial features that respectively characterize each speech channel in the multi-channel speech may be obtained. To distinguish them from the initial features of the speech segments, the initial features extracted from the acoustic features of the speech segments may be referred to as first initial features (for example, X1 to X6 in fig. 2b), and the initial feature characterizing the speech channel to which the speech segments belong may be referred to as a second initial feature (for example, P in fig. 2b). On this basis, the first speech feature of each speech segment in the speech channel (for example, Z1 to Z6 in fig. 2b) and the second speech feature characterizing the speech channel (for example, E in fig. 2b) can be obtained by encoding the second initial feature of the speech channel together with the first initial features of the speech segments in the speech channel. It should be noted that, in practical application, a feature sequence containing only the first speech feature of each speech segment may be output (as shown in the second row of feature sequences in fig. 2c) for subsequent speech processing tasks such as speech recognition and speech emotion classification; alternatively, a feature sequence containing both the first speech features and the second speech feature may be output (as shown in the first row of feature sequences in fig. 2c) for such tasks, and the specific content of the speech features is not limited here. In this way, the first initial feature of a speech segment is extracted from the acoustic features of the speech segment, the second initial feature of the speech channel to which the speech segment belongs is obtained, and the first speech feature of each speech segment in the speech channel and the second speech feature of the speech channel are obtained by encoding the second initial feature of the speech channel and the first initial features of the speech segments in the speech channel, so that a channel-level speech feature can be provided in addition to the segment-level speech features, which improves the richness of the speech features.
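To make the structure above concrete, the following is a minimal PyTorch sketch of the extraction network, a channel embedding playing the role of the second initial feature P, and the encoding network. The class name, layer sizes, and the use of a vanilla TransformerEncoder (in place of a WavLM-style encoder with gated relative position bias) are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class FeatureExtractionModel(nn.Module):
    """Sketch of the extraction network + encoding network described above."""

    def __init__(self, n_acoustic=80, d_model=256, n_conv_layers=3,
                 n_channels=4, n_enc_layers=6, n_heads=4):
        super().__init__()
        convs, in_dim = [], n_acoustic
        for _ in range(n_conv_layers):
            convs += [nn.Conv1d(in_dim, d_model, kernel_size=3, padding=1),
                      nn.GroupNorm(1, d_model),    # layer-norm-like normalization
                      nn.GELU()]
            in_dim = d_model
        self.extraction_net = nn.Sequential(*convs)              # -> first initial features X
        self.channel_embed = nn.Embedding(n_channels, d_model)   # -> second initial feature P
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoding_net = nn.TransformerEncoder(enc_layer, n_enc_layers)

    def forward(self, acoustic, channel_id):
        # acoustic: (batch, n_segments, n_acoustic); channel_id: (batch,)
        x = self.extraction_net(acoustic.transpose(1, 2)).transpose(1, 2)  # X_1..X_T
        p = self.channel_embed(channel_id).unsqueeze(1)                    # P
        out = self.encoding_net(torch.cat([p, x], dim=1))                  # prepend channel token
        e, z = out[:, 0], out[:, 1:]   # E: channel-level feature, Z_1..Z_T: segment features
        return z, e
```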
In one implementation scenario, the sample speech set may be divided into two first subsets. For these two first subsets, the first loss is measured based on the first feature similarities of the sample speech features between pairs of the contained sample speech segments, the second loss is measured based on the second feature similarities of the sample speech features between pairs of the contained sample speech segments, and the third loss is measured based on the third feature similarities of the sample speech features between pairs of the contained sample speech segments, so that the network parameters of the feature extraction model can be adjusted based on the first loss, the second loss, and the third loss. Through these three contrastive-learning losses, different speech segments with the same timing in the same multi-channel speech are forced to have speech features that are as close as possible, speech segments from the same channel in the same multi-channel speech are forced to have speech features that are as close as possible, and speech segments from the same multi-channel speech are forced to have speech features that are as close as possible, thereby improving the speech feature extraction accuracy of the feature extraction model.
In a specific implementation scenario, the two first subsets may contain the same number of sample speech segments. In this case, if the sample speech set contains an odd number of sample speech segments, either of the following may be performed before dividing the sample speech set: randomly discarding any odd number of sample speech segments in the sample speech set, or randomly duplicating any odd number of sample speech segments in the sample speech set. Taking a sample speech set containing 9 sample speech segments as an example, an odd number (such as 1 or 3) of sample speech segments may be randomly discarded, or an odd number (such as 1 or 3) of sample speech segments may be randomly duplicated. Of course, the above are only a few possible examples of preprocessing the sample speech set in practical application and do not limit other processing manners. In addition, when the sample speech set contains an even number of sample speech segments, the sample speech set may be divided directly into the two first subsets without the foregoing preprocessing. In this way, the two first subsets contain the same number of sample speech segments, and before the sample speech set is divided into the two first subsets, any odd number of sample speech segments in the sample speech set is randomly discarded or randomly duplicated, which ensures that the total number of sample speech segments in the sample speech set is even.
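A minimal sketch of this preprocessing and split is given below; discarding or duplicating exactly one segment (rather than any odd number) and the use of Python's random module are simplifications made for the example.

```python
import random

def split_into_two_subsets(sample_segments, drop_if_odd=True, rng=None):
    """Split a list of sample speech segments into two equally sized first subsets.

    If the set has an odd size, either randomly discard one segment or randomly
    duplicate one so that the total becomes even (both options are allowed)."""
    rng = rng or random.Random()
    segments = list(sample_segments)
    if len(segments) % 2 == 1:
        if drop_if_odd:
            segments.pop(rng.randrange(len(segments)))
        else:
            segments.append(rng.choice(segments))
    rng.shuffle(segments)
    half = len(segments) // 2
    return segments[:half], segments[half:]
```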
In a specific implementation scenario, the sample acoustic features of each sample speech segment may be extracted first, and then, based on the acoustic features of the sample speech segment, the sample initial features of the sample speech segment may be extracted, so that based on the sample initial features of the sample speech segment, the sample speech features of the sample speech segment may be obtained by encoding. It should be noted that, the sample acoustic feature, the sample initial feature, and the sample voice feature of the sample voice segment may refer to the relevant extraction process of the foregoing voice segment specifically, which is not described herein again.
In a specific implementation scenario, after the sample speech features of the sample speech segments are obtained, the first loss may be measured through the first contrastive learning. Specifically, the first feature similarity between sample speech segments from the same sample multi-channel speech is inversely related to the first loss, and the first feature similarity between sample speech segments from different sample multi-channel speeches is positively related to the first loss. For ease of description, taking two sample speech segments S_i and S_j drawn from the two first subsets as an example, their first feature similarity P_ij is characterized, as in formula (1), in terms of SIMILARITY(S_i, S_j), the similarity between the sample speech features of S_i and S_j, which may, for example, be measured with a feature distance, without limitation here. On this basis, the first loss Loss_contrastive-source can be expressed as:
Loss_contrastive-source = -Σ_positive log(P_ij) + Σ_negative log(P_ij) ……(2)
In the above formula (2), positive denotes pairs of sample speech segments from the same sample multi-channel speech, and negative denotes pairs of sample speech segments from different sample multi-channel speeches. Referring to fig. 2d, fig. 2d is a schematic process diagram of an embodiment of measuring the first loss. As shown in fig. 2d, the sample speech set is divided into two first subsets, and the sample speech segments in the two first subsets are passed through the feature extraction model to obtain their sample speech features. In the figure, each sample speech feature is labeled with a superscript indicating the sequence number of the sample multi-channel speech from which the segment comes and a subscript indicating its timing within that speech (for example, the feature of the segment with timing sub9 from the 2nd sample multi-channel speech in the sample speech set), and the other sample speech features in fig. 2d are labeled in the same way, which is not illustrated one by one here. The horizontal and vertical directions in fig. 2d represent the sample speech features of the segments in the two first subsets, respectively, where diagonally filled blocks represent first feature similarities between sample speech segments from the same sample multi-channel speech, and blank blocks represent first feature similarities between sample speech segments from different sample multi-channel speeches. In the above manner, the first feature similarity between sample speech segments from the same sample multi-channel speech is inversely related to the first loss, and the first feature similarity between sample speech segments from different sample multi-channel speeches is positively related to the first loss, so that by minimizing the first loss during training, the first feature similarity between sample speech segments from the same sample multi-channel speech is forced to be as large as possible and that between sample speech segments from different sample multi-channel speeches as small as possible, which helps to improve the feature extraction accuracy of the feature extraction model in the whole-speech dimension.
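A hedged sketch of how such a loss could be computed from the sample speech features of the two first subsets follows. Since formula (1) is only described in terms of SIMILARITY(S_i, S_j), the softmax-normalized cosine similarity standing in for P_ij, the temperature, and the function signature are assumptions; the same routine is reused later for the other contrastive losses simply by changing the positive/negative masks.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(feats_a, feats_b, pos_mask, neg_mask, temperature=0.1):
    """Contrastive loss in the spirit of formula (2).

    feats_a, feats_b: (n, d) sample speech features of the two subsets.
    pos_mask / neg_mask: (n, n) boolean masks marking positive / negative pairs.
    Assumption: P_ij is a temperature-scaled, row-softmax-normalized cosine
    similarity between the two feature sets."""
    sim = F.cosine_similarity(feats_a.unsqueeze(1), feats_b.unsqueeze(0), dim=-1)
    p = torch.softmax(sim / temperature, dim=1).clamp_min(1e-8)
    # Formula (2): minus log P_ij over positive pairs, plus log P_ij over negatives.
    return -torch.log(p[pos_mask]).sum() + torch.log(p[neg_mask]).sum()
```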
In a specific implementation scenario, after the sample speech features of the sample speech segments are obtained, the second loss is measured through the second contrastive learning. Specifically, the second feature similarity between sample speech segments from the same channel in the same sample multi-channel speech is inversely related to the second loss, and the second feature similarity between sample speech segments from different channels in the same sample multi-channel speech is positively related to the second loss. It should be noted that the specific process of measuring the second feature similarity based on the sample speech features of any two sample speech segments may refer to the foregoing process of measuring the first feature similarity, which is not repeated here. In addition, the calculation of the second loss may refer to the calculation of the first loss, the difference being that, when calculating the second loss, positive in formula (2) denotes pairs of sample speech segments from the same channel in the same sample multi-channel speech, and negative denotes pairs of sample speech segments from different channels in the same sample multi-channel speech. Referring to fig. 2e, fig. 2e is a schematic process diagram of an embodiment of measuring the second loss. As shown in fig. 2e, the sample speech set is divided into two first subsets, and the sample speech segments in the two first subsets are passed through the feature extraction model to obtain their sample speech features. In the figure, each sample speech feature is labeled with a superscript indicating the channel of the sample multi-channel speech from which the segment comes and a subscript indicating its timing within that speech (for example, the feature of the segment with timing sub3 from the 2nd channel of the sample multi-channel speech), and the other sample speech features in fig. 2e are labeled in the same way, which is not illustrated one by one here. The horizontal and vertical directions in fig. 2e represent the sample speech features of the segments in the two first subsets, respectively, where diagonally filled blocks represent second feature similarities between sample speech segments from the same channel in the same sample multi-channel speech, and blank blocks represent second feature similarities between sample speech segments from different channels in the same sample multi-channel speech. In the above manner, the second feature similarity between sample speech segments of the same channel in the same sample multi-channel speech is inversely related to the second loss, and the second feature similarity between sample speech segments of different channels in the same sample multi-channel speech is positively related to the second loss, so that by minimizing the second loss during training, the second feature similarity between sample speech segments of the same channel in the same sample multi-channel speech is forced to be as large as possible and that between sample speech segments of different channels as small as possible, which helps to improve the feature extraction accuracy of the feature extraction model in the speech-channel dimension.
In a specific implementation scenario, after the sample speech features of the sample speech segments are obtained, the third loss is measured through the third contrastive learning. Specifically, the third feature similarity between sample speech segments with the same timing in the same sample multi-channel speech is inversely related to the third loss, and the third feature similarity between sample speech segments with different timings in the same sample multi-channel speech is positively related to the third loss. It should be noted that the specific process of measuring the third feature similarity based on the sample speech features of any two sample speech segments may refer to the foregoing process of measuring the first feature similarity, which is not repeated here. In addition, the calculation of the third loss may refer to the calculation of the first loss, the difference being that, when calculating the third loss, positive in formula (2) denotes pairs of sample speech segments with the same timing in the same sample multi-channel speech, and negative denotes pairs of sample speech segments with different timings in the same sample multi-channel speech. Referring to fig. 2f, fig. 2f is a schematic process diagram of an embodiment of measuring the third loss. As shown in fig. 2f, the sample speech set is divided into two first subsets, and the sample speech segments in the two first subsets are passed through the feature extraction model to obtain their sample speech features. In the figure, each sample speech feature is labeled with a superscript indicating the channel of the sample multi-channel speech from which the segment comes and a subscript indicating its timing within that speech (for example, the feature of the segment with timing sub9 from the 4th channel of the sample multi-channel speech), and the other sample speech features in fig. 2f are labeled in the same way, which is not illustrated one by one here. The horizontal and vertical directions in fig. 2f represent the sample speech features of the segments in the two first subsets, respectively, where diagonally filled blocks represent third feature similarities between sample speech segments with the same timing in the same sample multi-channel speech, and blank blocks represent third feature similarities between sample speech segments with different timings in the same sample multi-channel speech. In the above manner, the third feature similarity between sample speech segments with the same timing in the same sample multi-channel speech is inversely related to the third loss, and the third feature similarity between sample speech segments with different timings in the same sample multi-channel speech is positively related to the third loss, so that by minimizing the third loss during training, the third feature similarity between sample speech segments with the same timing in the same sample multi-channel speech is forced to be as large as possible and that between sample speech segments with different timings as small as possible, which helps to improve the feature extraction accuracy of the feature extraction model in the speech-timing dimension.
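The three contrastive losses thus reuse the same computation and differ only in how positive and negative pairs are chosen. The sketch below builds the corresponding boolean masks from per-segment metadata and can be fed to the contrastive_loss sketch above; the metadata field names are illustrative assumptions, not taken from the document.

```python
import torch

def build_pair_masks(meta_a, meta_b):
    """Build positive/negative pair masks for the three contrastive losses.

    meta_a, meta_b: lists of dicts describing the segments in the two subsets,
    with illustrative keys 'source' (which sample multi-channel speech),
    'channel', and 'timing'."""
    n_a, n_b = len(meta_a), len(meta_b)
    same_source = torch.zeros(n_a, n_b, dtype=torch.bool)
    same_channel = torch.zeros(n_a, n_b, dtype=torch.bool)
    same_timing = torch.zeros(n_a, n_b, dtype=torch.bool)
    for i, ma in enumerate(meta_a):
        for j, mb in enumerate(meta_b):
            same_source[i, j] = ma['source'] == mb['source']
            same_channel[i, j] = ma['channel'] == mb['channel']
            same_timing[i, j] = ma['timing'] == mb['timing']
    return {
        # first loss: same vs. different sample multi-channel speech
        'source': (same_source, ~same_source),
        # second loss: same vs. different channel within the same speech
        'channel': (same_source & same_channel, same_source & ~same_channel),
        # third loss: same vs. different timing within the same speech
        'timing': (same_source & same_timing, same_source & ~same_timing),
    }
```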
In a specific implementation scenario, in order to further improve the modeling capability of the feature extraction model, after the sample acoustic features of each sample speech segment are extracted, clustering may be performed based on the sample acoustic features of the sample speech segments to obtain the sample cluster label of each sample speech segment, and after the sample initial features of the sample speech segments are extracted from their sample acoustic features, the sample initial features of at least one sample speech segment may be randomly masked. Illustratively, as shown in fig. 2b, the sample initial features X2 to X4 may be masked. Of course, the illustration of fig. 2b is only one possible example of random masking in practical application and does not limit the specific masking of the sample initial features. On this basis, the sample speech features of each sample speech segment (such as Z1 to Z6 in fig. 2b) can be obtained by encoding based on the sample initial features of the unmasked sample speech segments, and the predicted cluster labels of the masked sample speech segments can be obtained by prediction, so that the network parameters of the feature extraction model can be adjusted based on the first loss, the second loss, the third loss, and a fourth loss measured from the difference between the sample cluster labels and the predicted cluster labels of the masked sample speech segments. Illustratively, the first, second, third, and fourth losses may be summed to obtain the total loss Loss_total:
Loss_total = Loss_contrastive-channel + Loss_contrastive-source + Loss_contrastive-content + Loss_maskprediction ……(3)
In the above formula (3), Loss_contrastive-source represents the first loss, Loss_contrastive-channel represents the second loss, Loss_contrastive-content represents the third loss, and Loss_maskprediction represents the fourth loss. In any training round after the first round, the feature extraction model may first take the sample speech features extracted for each sample speech segment after the previous round of training, cluster on these sample speech features to obtain the sample cluster labels of the sample speech segments for the current round, and then measure the fourth loss of the current round of training. In this way, the sample initial features of at least one sample speech segment are randomly masked during training, the sample speech features of each sample speech segment are obtained by encoding based on the sample initial features of the unmasked sample speech segments, and the predicted cluster labels of the masked sample speech segments are obtained by prediction, so that the network parameters of the feature extraction model are adjusted based on the first loss, the second loss, the third loss, and the fourth loss measured from the difference between the sample cluster labels and the predicted cluster labels of the masked sample speech segments; this improves the ability of the feature extraction model to model the correlation between speech segments and therefore improves its feature extraction accuracy.
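A hedged sketch of the masking and cluster-label prediction described above, together with the total loss of formula (3), might look as follows. The pseudo-labels are assumed to be supplied by an external clustering step (for example k-means over acoustic features or previous-round features), and the learnable mask embedding, linear classifier, layer sizes, and masking probability are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedClusterPrediction(nn.Module):
    """Sketch of the masked cluster-label prediction used for the fourth loss."""

    def __init__(self, d_model=256, n_clusters=100, mask_prob=0.3):
        super().__init__()
        self.mask_embed = nn.Parameter(torch.zeros(d_model))   # replaces masked X_i
        self.classifier = nn.Linear(d_model, n_clusters)
        self.mask_prob = mask_prob

    def mask_inputs(self, initial_feats):
        # initial_feats: (batch, n_segments, d) sample initial features X.
        mask = torch.rand(initial_feats.shape[:2], device=initial_feats.device) < self.mask_prob
        masked = initial_feats.clone()
        masked[mask] = self.mask_embed
        return masked, mask

    def loss(self, encoded_feats, cluster_labels, mask):
        # encoded_feats: (batch, n_segments, d) sample speech features Z;
        # cluster_labels: (batch, n_segments) pseudo-labels obtained by clustering.
        logits = self.classifier(encoded_feats[mask])
        return F.cross_entropy(logits, cluster_labels[mask])

def total_loss(loss_channel, loss_source, loss_content, loss_maskprediction):
    """Formula (3): the four losses are simply summed."""
    return loss_channel + loss_source + loss_content + loss_maskprediction
```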
In one implementation scenario, in order to further improve the extraction accuracy of the feature extraction model, before being trained with the at least three kinds of contrastive learning, the feature extraction model is first trained to convergence with at least the first contrastive learning, and then trained to convergence with at least the second contrastive learning and the third contrastive learning. Note that the first contrastive learning includes comparing first feature similarities between sample speech segments from the same and from different sample multi-channel speeches, the second contrastive learning includes comparing second feature similarities between sample speech segments from the same and from different channels in the same sample multi-channel speech, and the third contrastive learning includes comparing third feature similarities between sample speech segments with the same and with different timings in the same sample multi-channel speech. The specific processes may be referred to in the following disclosed embodiments and are not described in detail here.
According to the above scheme, the speech to be processed is acquired, where the speech to be processed contains a plurality of speech channels, and feature extraction is performed on the speech segments in the plurality of speech channels based on a feature extraction model to obtain the speech feature of each speech segment in the plurality of speech channels. The feature extraction model is trained on a sample speech set with at least three kinds of contrastive learning, where the sample speech set contains sample speech segments from a plurality of sample multi-channel speeches, and the three kinds of contrastive learning compare feature similarities at the recording level, the channel level, and the timing level. The feature extraction model can therefore exploit the facts that the several channels of one multi-channel speech share similar content, that segments of the same channel within the same multi-channel speech are similar, and that segments with the same timing within the same multi-channel speech are similar, and use them for contrastive learning, so that it fully models the additional information that multi-channel speech carries over single-channel speech and extracts speech features with richer information. In this way, the speech feature extraction accuracy for multi-channel speech can be improved.
Referring to fig. 3a, fig. 3a is a schematic flow chart of an embodiment of the first stage training. Specifically, the method may include the steps of:
Step S3a1: the sample speech set is partitioned into two first subsets and sample acoustic features of each sample speech segment are extracted.
Reference may be made specifically to the foregoing descriptions of the disclosed embodiments, and details are not repeated herein.
Step S3a2: clustering is carried out based on the sample acoustic features of the sample voice segments to obtain sample clustering labels of the sample voice segments, and the sample initial features of the sample voice segments are extracted based on the sample acoustic features of the sample voice segments.
Reference may be made specifically to the foregoing descriptions of the disclosed embodiments, and details are not repeated herein.
Step S3a3: randomly masking sample initial characteristics of at least one sample voice segment, coding to obtain sample voice characteristics of each sample voice segment based on the sample initial characteristics of the unmasked sample voice segments, and predicting to obtain prediction clustering labels of the masked sample voice segments.
Reference may be made specifically to the foregoing descriptions of the disclosed embodiments, and details are not repeated herein.
Step S3a4: Measure the first loss based on the first feature similarities of the sample speech features between pairs of the sample speech segments contained in the two first subsets, and measure the fourth loss based on the sample cluster labels and the predicted cluster labels of the masked sample speech segments.
Reference may be made specifically to the foregoing descriptions of the disclosed embodiments, and details are not repeated herein.
Step S3a5: based on the first loss and the fourth loss, network parameters of the feature extraction model are adjusted.
Illustratively, the first loss and the fourth loss may be summed to obtain the total loss Loss_total:
Loss_total = Loss_contrastive-source + Loss_maskprediction ……(4)
In the above formula (4), Loss_contrastive-source represents the first loss, and Loss_maskprediction represents the fourth loss. On this basis, the network parameters of the feature extraction model may be adjusted based on the total loss. Reference may be made specifically to the foregoing descriptions of the disclosed embodiments, and details are not repeated here.
In the above scheme, the sample speech set is divided into two first subsets, and the sample acoustic features of each sample speech segment are extracted. On this basis, clustering is performed on the sample acoustic features of the sample speech segments to obtain their sample cluster labels, the sample initial features of the sample speech segments are extracted from their sample acoustic features, the sample initial features of at least one sample speech segment are randomly masked, the sample speech features of each sample speech segment are obtained by encoding based on the sample initial features of the unmasked sample speech segments, and the predicted cluster labels of the masked sample speech segments are obtained by prediction. The first loss is then measured based on the first feature similarities of the sample speech features between pairs of the sample speech segments contained in the two first subsets, the fourth loss is measured based on the sample cluster labels and the predicted cluster labels of the masked sample speech segments, and the network parameters of the feature extraction model are adjusted based on the first loss and the fourth loss. In this way, the first training stage can, at a macroscopic level, improve the feature extraction accuracy of the feature extraction model in the global speech dimension and its ability to model the correlation between speech segments through the first loss and the fourth loss, which improves the training efficiency of the feature extraction model.
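Under the same assumptions as the earlier sketches (contrastive_loss, build_pair_masks, and the masked cluster-label prediction), the first-stage objective of formula (4) can be assembled as follows; the function name and argument layout are illustrative.

```python
def stage1_total_loss(feats_a, feats_b, meta_a, meta_b, loss_maskprediction):
    """First-stage objective, formula (4):
    Loss_total = Loss_contrastive-source + Loss_maskprediction.

    feats_a / feats_b are the sample speech features of the two first subsets
    drawn from the whole sample speech set; contrastive_loss and
    build_pair_masks are the sketches given earlier."""
    pos, neg = build_pair_masks(meta_a, meta_b)['source']
    loss_source = contrastive_loss(feats_a, feats_b, pos, neg)
    return loss_source + loss_maskprediction
```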
Referring to fig. 3b, fig. 3b is a flowchart illustrating an embodiment of the second training stage. Specifically, the method may include the steps of:
Step S3b1: sample speech segments from the same sample multi-channel speech within the sample speech set are partitioned into two second subsets.
It should be noted that, unlike the embodiments disclosed above, each training round in the second stage trains on the sample speech segments obtained by segmenting a single sample multi-channel speech, rather than on the whole sample speech set. In addition, the two second subsets may contain the same number of sample speech segments, and the dividing manner may refer to the dividing manner of the first subsets in the foregoing disclosed embodiments, which is not repeated here.
Step S3b2: For the two second subsets: measure the fifth loss based on the second feature similarities of the sample speech features between pairs of the contained sample speech segments, and measure the sixth loss based on the third feature similarities of the sample speech features between pairs of the contained sample speech segments.
Specifically, similar to the embodiments disclosed above, the second feature similarity between sample speech segments from the same channel in the same sample multi-channel speech is inversely related to the fifth loss, and the second feature similarity between sample speech segments from different channels in the same sample multi-channel speech is positively related to the fifth loss. In addition, the third feature similarity between sample speech segments with the same timing in the same sample multi-channel speech is inversely related to the sixth loss, and the third feature similarity between sample speech segments with different timings in the same sample multi-channel speech is positively related to the sixth loss. Details can be found in the foregoing descriptions of the second loss and the third loss in the disclosed embodiments and are not repeated here.
Step S3b3: based on the fifth loss and the sixth loss, network parameters of the feature extraction model are adjusted.
Specifically, the fifth loss and the sixth loss may be summed to obtain the total loss Loss_total:
Loss_total = Loss_contrastive-channel + Loss_contrastive-content ……(5)
In the above formula (5), Loss_contrastive-channel represents the fifth loss, and Loss_contrastive-content represents the sixth loss. On this basis, the network parameters of the feature extraction model may be adjusted based on the total loss.
In the above scheme, the sample speech segments from the same sample multi-channel speech in the sample speech set are divided into two second subsets. For these two second subsets, the fifth loss is measured based on the second feature similarities of the sample speech features between pairs of the contained sample speech segments, the sixth loss is measured based on the third feature similarities of the sample speech features between pairs of the contained sample speech segments, and the network parameters of the feature extraction model are adjusted based on the fifth loss and the sixth loss. In this way, after the first stage has, at a macroscopic level, improved the feature extraction accuracy of the feature extraction model in the global speech dimension and its ability to model the correlation between speech segments, the second stage improves, at a microscopic level, the channel-oriented and timing-oriented modeling capability of the feature extraction model through the fifth loss and the sixth loss, which improves the training efficiency of the feature extraction model.
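Again reusing the earlier sketches, the second-stage objective of formula (5) can be assembled as follows; since the two second subsets come from a single sample multi-channel speech, every pair already shares the same source.

```python
def stage2_total_loss(feats_a, feats_b, meta_a, meta_b):
    """Second-stage objective, formula (5):
    Loss_total = Loss_contrastive-channel + Loss_contrastive-content.

    feats_a / feats_b are the sample speech features of the two second subsets;
    contrastive_loss and build_pair_masks are the sketches given earlier."""
    masks = build_pair_masks(meta_a, meta_b)
    loss_channel = contrastive_loss(feats_a, feats_b, *masks['channel'])   # fifth loss
    loss_content = contrastive_loss(feats_a, feats_b, *masks['timing'])    # sixth loss
    return loss_channel + loss_content
```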
Referring to fig. 4, fig. 4 is a flowchart illustrating an embodiment of a speech processing method according to the present application.
Specifically, the method may include the steps of:
Step S41: Perform feature extraction based on the speech to be processed to obtain the speech feature of each speech segment in a plurality of speech channels in the speech to be processed.
In the embodiment of the present disclosure, the voice feature is obtained based on the steps in the embodiment of the method for extracting a voice feature, and specifically, reference may be made to the foregoing disclosed embodiment, which is not described herein again.
Step S42: Process the speech features of the speech segments in the plurality of speech channels in the speech to be processed to obtain the processing result of the speech to be processed.
It should be noted that the speech processing may include, but is not limited to, tasks such as speech recognition and speech emotion classification, which are not limited here. For example, taking speech recognition as the speech processing task, a speech decoding model may be connected after the feature extraction model to decode the speech features of the speech segments in the plurality of speech channels in the speech to be processed into the recognized text of the speech to be processed, where the speech decoding model may include, but is not limited to, a network structure such as a Transformer. Alternatively, taking speech emotion classification as the speech processing task, an emotion classification model may be connected after the feature extraction model to classify the speech features of the speech segments in the plurality of speech channels in the speech to be processed into the emotion category of the speech to be processed, where the emotion classification model may include, but is not limited to, a network structure such as a multi-layer perceptron, which is not limited here. Of course, the above are only a few possible examples of speech processing, and the specific content of the speech processing is not limited here.
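As one possible way to attach a downstream task to the extracted features, the sketch below shows a simple speech emotion classification head (a multi-layer perceptron over pooled features); the pooling strategy, layer sizes, and class count are assumptions, and a Transformer-based decoder could be attached in an analogous way for speech recognition.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    """Simple MLP head over the speech features of all segments in all channels."""

    def __init__(self, d_model=256, n_emotions=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_emotions))

    def forward(self, speech_features):
        # speech_features: (n_channels, n_segments, d_model) from the feature
        # extraction model; mean-pool over channels and segments, then classify.
        pooled = speech_features.mean(dim=(0, 1))
        return self.mlp(pooled)
```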
According to the above scheme, feature extraction is performed on the speech to be processed to obtain the speech feature of each speech segment in the plurality of speech channels in the speech to be processed, where the speech features are obtained with the steps in the embodiments of the speech feature extraction method, so that the speech feature extraction accuracy for multi-channel speech can be improved; the speech features of the speech segments in the plurality of speech channels in the speech to be processed are then processed to obtain the processing result of the speech to be processed, so that the processing accuracy for multi-channel speech can be improved.
Referring to fig. 5, fig. 5 is a schematic diagram of a framework of an embodiment of a speech feature extraction apparatus 50 of the present application. The speech feature extraction apparatus 50 includes an acquisition module 51 and an extraction module 52. The acquisition module 51 is configured to acquire speech to be processed, where the speech to be processed contains a plurality of speech channels. The extraction module 52 is configured to perform feature extraction on the speech segments in the plurality of speech channels based on a feature extraction model to obtain the speech feature of each speech segment in the plurality of speech channels. The feature extraction model is trained on a sample speech set with at least three kinds of contrastive learning, where the sample speech set contains sample speech segments from a plurality of sample multi-channel speeches, and the three kinds of contrastive learning are: comparing first feature similarities between sample speech segments from the same and from different sample multi-channel speeches, comparing second feature similarities between sample speech segments from the same and from different channels within the same sample multi-channel speech, and comparing third feature similarities between sample speech segments with the same and with different timings within the same sample multi-channel speech.
According to the above scheme, the speech feature extraction apparatus 50 acquires the speech to be processed, which contains a plurality of speech channels, and performs feature extraction on the speech segments in the plurality of speech channels based on a feature extraction model to obtain the speech feature of each speech segment in the plurality of speech channels. The feature extraction model is trained on a sample speech set with at least three kinds of contrastive learning, where the sample speech set contains sample speech segments from a plurality of sample multi-channel speeches, and the three kinds of contrastive learning compare feature similarities at the recording level, the channel level, and the timing level. The feature extraction model can therefore exploit the facts that the several channels of one multi-channel speech share similar content, that segments of the same channel within the same multi-channel speech are similar, and that segments with the same timing within the same multi-channel speech are similar, and use them for contrastive learning, so that it fully models the additional information that multi-channel speech carries over single-channel speech and extracts speech features with richer information. In this way, the speech feature extraction accuracy for multi-channel speech can be improved.
In some disclosed embodiments, the speech feature extraction apparatus 50 includes a first dividing module for dividing the sample speech set into two first subsets, and a first metric module configured, for the two first subsets, to: measure the first loss based on the first feature similarities of the sample speech features between pairs of the contained sample speech segments, measure the second loss based on the second feature similarities of the sample speech features between pairs of the contained sample speech segments, and measure the third loss based on the third feature similarities of the sample speech features between pairs of the contained sample speech segments. The speech feature extraction apparatus 50 further includes a first adjustment module for adjusting the network parameters of the feature extraction model based on the first loss, the second loss, and the third loss.
In some disclosed embodiments, the first feature similarity between sample speech segments from the same sample multi-channel speech is inversely related to the first loss, and the first feature similarity between sample speech segments from different sample multi-channel speeches is positively related to the first loss; and/or the second feature similarity between sample speech segments from the same channel in the same sample multi-channel speech is inversely related to the second loss, and the second feature similarity between sample speech segments from different channels in the same sample multi-channel speech is positively related to the second loss; and/or the third feature similarity between sample speech segments with the same timing in the same sample multi-channel speech is inversely related to the third loss, and the third feature similarity between sample speech segments with different timings in the same sample multi-channel speech is positively related to the third loss.
In some disclosed embodiments, the speech feature extraction device 50 includes an acoustic extraction module for extracting sample acoustic features of each sample speech segment; the voice feature extraction device 50 comprises a feature clustering module for clustering based on the sample acoustic features of the sample voice segments to obtain sample clustering labels of the sample voice segments, and the voice feature extraction device 50 comprises an initial extraction module for extracting to obtain sample initial features of the sample voice segments based on the sample acoustic features of the sample voice segments; the speech feature extraction means 50 comprise a feature masking module for randomly masking sample initial features of at least one sample speech segment; the voice feature extraction device 50 includes a feature encoding module, configured to encode, based on sample initial features of the non-masked sample voice segments, sample voice features of each sample voice segment and predict a prediction cluster label of the masked sample voice segments; the first adjustment module is specifically configured to adjust network parameters of the feature extraction model based on the first loss, the second loss, the third loss, and a fourth loss measured based on a difference between a sample cluster label and a predictive cluster label of the masked sample speech segment.
In some disclosed embodiments, the same number of sample speech segments are contained within the two first subsets, and in the case that the sample speech set contains an odd number of sample speech segments, the first partitioning module is specifically for performing any one of: randomly discarding any odd number of sample speech segments in the sample speech set; any odd number of sample speech segments within the sample speech set are randomly replicated.
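A minimal sketch of this balancing step is given below; discarding or duplicating exactly one randomly chosen segment (an odd number) before splitting is an illustrative choice.

```python
import random

def balance_and_split(segments):
    """Make the segment count even, then split into two equal first subsets.
    Here exactly one randomly chosen segment (an odd number) is either
    discarded or duplicated; illustrative only."""
    segments = list(segments)
    if len(segments) % 2 == 1:
        idx = random.randrange(len(segments))
        if random.random() < 0.5:
            segments.pop(idx)                 # randomly discard
        else:
            segments.append(segments[idx])    # randomly duplicate
    random.shuffle(segments)
    half = len(segments) // 2
    return segments[:half], segments[half:]
```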
In some disclosed embodiments, prior to being trained by the at least three kinds of contrast learning, the feature extraction model is further trained to convergence by at least a first contrast learning, and then trained to convergence by at least a second contrast learning and a third contrast learning; the first contrast learning includes comparing first feature similarities between sample voice segments from the same and different sample multi-channel voices, the second contrast learning includes comparing second feature similarities between sample voice segments from the same and different channels in the same sample multi-channel voice, and the third contrast learning includes comparing third feature similarities between sample voice segments from the same and different time sequences in the same sample multi-channel voice.
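A possible training schedule consistent with this order is sketched below; the per-batch loss helpers, the use of fixed epoch counts in place of an explicit convergence test, and the joint third stage are assumptions for illustration.

```python
def staged_training(model, loader, optimizer, epochs_per_stage=(10, 10, 10)):
    """Illustrative three-stage schedule matching the order described above:
    first contrast learning alone, then second + third, then all three jointly.
    Fixed epoch counts stand in for training to convergence; the per-batch
    loss helpers (first_loss, second_loss, third_loss) are hypothetical."""
    stages = [
        lambda batch: first_loss(model, batch),
        lambda batch: second_loss(model, batch) + third_loss(model, batch),
        lambda batch: first_loss(model, batch)
                      + second_loss(model, batch)
                      + third_loss(model, batch),
    ]
    for stage_loss, n_epochs in zip(stages, epochs_per_stage):
        for _ in range(n_epochs):
            for batch in loader:
                loss = stage_loss(batch)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```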
In some disclosed embodiments, the voice feature extraction device 50 includes a first partitioning module for partitioning the sample voice set into two first subsets; the voice feature extraction device 50 includes an acoustic extraction module for extracting sample acoustic features of each sample voice segment; the voice feature extraction device 50 includes a feature clustering module for clustering based on the sample acoustic features of the sample voice segments to obtain sample cluster labels of the sample voice segments; the voice feature extraction device 50 includes an initial extraction module for extracting sample initial features of the sample voice segments based on the sample acoustic features of the sample voice segments; the voice feature extraction device 50 includes a feature masking module for randomly masking the sample initial features of at least one sample voice segment; the voice feature extraction device 50 includes a feature encoding module for encoding, based on the sample initial features of the non-masked sample voice segments, the sample voice features of each sample voice segment, and predicting prediction cluster labels of the masked sample voice segments; the voice feature extraction device 50 includes a second metric module configured to measure a first loss based on the first feature similarities of the sample voice features between every two of the sample voice segments contained in each of the two first subsets, and to measure a fourth loss based on the sample cluster labels and the prediction cluster labels of the masked sample voice segments; and the voice feature extraction device 50 includes a second adjustment module for adjusting network parameters of the feature extraction model based on the first loss and the fourth loss.
In some disclosed embodiments, the voice feature extraction device 50 includes a second partitioning module for partitioning the sample voice segments from the same sample multi-channel voice within the sample voice set into two second subsets; the voice feature extraction device 50 includes a third metric module for, with respect to the two second subsets: measuring a fifth loss based on the second feature similarities of the sample voice features between every two of the contained sample voice segments, and measuring a sixth loss based on the third feature similarities of the sample voice features between every two of the contained sample voice segments; and the voice feature extraction device 50 includes a third adjustment module for adjusting network parameters of the feature extraction model based on the fifth loss and the sixth loss.
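The fifth and sixth losses differ from the earlier ones mainly in that the two second subsets are drawn from a single sample multi-channel voice. A sketch under that reading is given below; the segment attributes, the model interface, and the reuse of the earlier contrastive_loss helper are assumptions.

```python
import random
from collections import defaultdict
import torch

def within_recording_losses(model, segments):
    """Sketch of the fifth and sixth losses: only segments from the same
    sample multi-channel voice are contrasted with each other.  The segment
    attributes (recording_id, channel, time_index, initial_feature), the
    model interface, and the reuse of contrastive_loss are assumptions."""
    by_recording = defaultdict(list)
    for seg in segments:
        by_recording[seg.recording_id].append(seg)
    fifth = sixth = 0.0
    for segs in by_recording.values():
        random.shuffle(segs)
        half = len(segs) // 2
        if half == 0:
            continue
        a, b = segs[:half], segs[half:2 * half]     # two second subsets
        fa = model(torch.stack([s.initial_feature for s in a]))
        fb = model(torch.stack([s.initial_feature for s in b]))
        fifth = fifth + contrastive_loss(fa, fb,
                                         torch.tensor([s.channel for s in a]),
                                         torch.tensor([s.channel for s in b]))
        sixth = sixth + contrastive_loss(fa, fb,
                                         torch.tensor([s.time_index for s in a]),
                                         torch.tensor([s.time_index for s in b]))
    return fifth, sixth
```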
In some disclosed embodiments, the extraction module 52 includes a first initial acquisition sub-module for extracting a first initial feature of a voice segment based on an acoustic feature of the voice segment; the extraction module 52 includes a second initial acquisition sub-module for acquiring a second initial feature characterizing the voice channel to which the voice segment belongs; and the extraction module 52 includes a feature encoding sub-module for encoding, based on the second initial feature of the voice channel and the first initial features of the voice segments in the voice channel, the first voice feature of each voice segment in the voice channel and a second voice feature characterizing the voice channel.
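One way to realize this encoding is to treat the second initial feature as a learned channel embedding that is prepended to the segment features before a Transformer encoder, in the spirit of a CLS token. The sketch below follows that assumption; all sizes, layer counts and the embedding-based channel representation are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAwareEncoder(nn.Module):
    """Sketch of channel-aware encoding: a learned channel embedding stands in
    for the second initial feature and is prepended, CLS-style, to the first
    initial features of the segments in that channel; a Transformer encoder
    then yields per-segment first voice features and a channel-level second
    voice feature.  All sizes and the embedding choice are assumptions."""

    def __init__(self, feat_dim=256, n_channels=8, n_layers=4, n_heads=4):
        super().__init__()
        self.channel_embed = nn.Embedding(n_channels, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, segment_feats, channel_idx):
        # segment_feats: (B, T, D) first initial features of the segments
        # channel_idx:   (B,)      index of the voice channel they belong to
        ch = self.channel_embed(channel_idx).unsqueeze(1)   # (B, 1, D) channel token
        x = torch.cat([ch, segment_feats], dim=1)
        y = self.encoder(x)
        channel_feature = y[:, 0]       # second voice feature (channel level)
        segment_features = y[:, 1:]     # first voice features (per segment)
        return segment_features, channel_feature
```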
Referring to fig. 6, fig. 6 is a schematic framework diagram of a voice processing device 60 according to an embodiment of the present application. The voice processing device 60 includes an extraction module 61 and a processing module 62. The extraction module 61 is used for performing feature extraction based on the voice to be processed to obtain the voice features of each voice segment in the plurality of voice channels in the voice to be processed, where the voice features are obtained based on the voice feature extraction device in the foregoing voice feature extraction device embodiments; the processing module 62 is configured to perform processing based on the voice features of each voice segment in the plurality of voice channels in the voice to be processed, so as to obtain a processing result of the voice to be processed.
In the above scheme, the voice processing device 60 performs feature extraction based on the voice to be processed to obtain the voice features of each voice segment in the plurality of voice channels in the voice to be processed, where the voice features are obtained based on the voice feature extraction device in the foregoing voice feature extraction device embodiments, which improves the voice feature extraction precision of multi-channel voice; the voice processing device 60 then performs processing based on the voice features of each voice segment in the plurality of voice channels in the voice to be processed to obtain a processing result of the voice to be processed, so that the processing precision of multi-channel voice can be improved.
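A minimal sketch of how such a processing pipeline could be wired on top of the feature extractor is given below; the classification head, the mean pooling and the choice of emotion classification as the downstream task are assumptions for illustration.

```python
import torch.nn as nn

class MultiChannelSpeechPipeline(nn.Module):
    """Sketch of the extract-then-process flow: a pretrained feature
    extractor (e.g. the ChannelAwareEncoder sketch above) followed by a task
    head.  Mean pooling and emotion classification are illustrative choices."""

    def __init__(self, feature_extractor, feat_dim=256, n_classes=4):
        super().__init__()
        self.feature_extractor = feature_extractor
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, segment_feats, channel_idx):
        seg_feats, _ = self.feature_extractor(segment_feats, channel_idx)
        pooled = seg_feats.mean(dim=1)      # pool segment features per channel
        return self.head(pooled)            # processing result per channel
```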
Referring to fig. 7, fig. 7 is a schematic framework diagram of an electronic device 70 according to an embodiment of the present application. The electronic device 70 includes a memory 71 and a processor 72; the memory 71 stores program instructions, and the processor 72 is configured to execute the program instructions to implement the steps of any of the speech feature extraction method embodiments described above, or to implement the steps of the speech processing method embodiments described above. Reference may be made to the foregoing disclosed embodiments for details, which are not repeated here. Specifically, the electronic device 70 may include, but is not limited to, a server and the like.
Specifically, the processor 72 is configured to control itself and the memory 71 to implement the steps of any of the speech feature extraction method embodiments described above, or to implement the steps of the speech processing method embodiments described above. The processor 72 may also be referred to as a CPU (Central Processing Unit). The processor 72 may be an integrated circuit chip having signal processing capabilities. The processor 72 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 72 may be jointly implemented by integrated circuit chips.
In the above scheme, the electronic device 70 acquires the voice to be processed, where the voice to be processed includes a plurality of voice channels, and performs feature extraction on the voice segments in the plurality of voice channels based on a feature extraction model to obtain the voice features of each voice segment in the plurality of voice channels. The feature extraction model is obtained by training with a sample voice set through at least three kinds of contrast learning, the sample voice set includes sample voice segments from a plurality of sample multi-channel voices, and the three kinds of contrast learning include: comparing first feature similarities between sample voice segments from the same and different sample multi-channel voices, comparing second feature similarities between sample voice segments from the same and different channels in the same sample multi-channel voice, and comparing third feature similarities between sample voice segments from the same and different time sequences in the same sample multi-channel voice. In this way, the feature extraction model can perform contrast learning by using the content similarity among the plurality of channels of a multi-channel voice, the similarity within the same channel, and the similarity between voice segments at the same time sequence in the same multi-channel voice, so that it can fully model the additional information of multi-channel voice compared with single-channel voice and extract voice features with richer information. Therefore, the voice feature extraction precision of multi-channel voice can be improved, and the voice processing precision can be improved accordingly.
Referring to FIG. 8, FIG. 8 is a schematic diagram of a computer readable storage medium 80 according to an embodiment of the application. The computer readable storage medium 80 stores program instructions 81 that can be executed by a processor, the program instructions 81 being configured to implement steps in any of the above-described speech feature extraction method embodiments or to implement steps in the above-described speech processing method embodiments.
In the above scheme, by executing the program instructions 81 stored in the computer-readable storage medium 80, the voice to be processed is acquired, where the voice to be processed includes a plurality of voice channels, and feature extraction is performed on the voice segments in the plurality of voice channels based on a feature extraction model to obtain the voice features of each voice segment in the plurality of voice channels. The feature extraction model is obtained by training with a sample voice set through at least three kinds of contrast learning, the sample voice set includes sample voice segments from a plurality of sample multi-channel voices, and the three kinds of contrast learning include: comparing first feature similarities between sample voice segments from the same and different sample multi-channel voices, comparing second feature similarities between sample voice segments from the same and different channels in the same sample multi-channel voice, and comparing third feature similarities between sample voice segments from the same and different time sequences in the same sample multi-channel voice. In this way, the feature extraction model can perform contrast learning by using the content similarity among the plurality of channels of a multi-channel voice, the similarity within the same channel, and the similarity between voice segments at the same time sequence in the same multi-channel voice, so that it can fully model the additional information of multi-channel voice compared with single-channel voice and extract voice features with richer information. Therefore, the voice feature extraction precision of multi-channel voice can be improved, and the voice processing precision can be improved accordingly.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The above description of the various embodiments focuses on the differences between the embodiments; for the same or similar parts, reference may be made to each other, and details are not repeated herein for brevity.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical, or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
If the technical solution of the present application involves personal information, a product applying the technical solution of the present application clearly informs the individual of the personal information processing rules and obtains the individual's voluntary consent before processing the personal information. If the technical solution of the present application involves sensitive personal information, a product applying the technical solution of the present application obtains the individual's separate consent before processing the sensitive personal information, and at the same time meets the requirement of "explicit consent". For example, a clear and conspicuous sign is set up at a personal information collection device such as a camera to inform people that they have entered the personal information collection range and that personal information will be collected; if an individual voluntarily enters the collection range, it is deemed that the individual consents to the collection of his or her personal information. Alternatively, on a device that processes personal information, with obvious identification/information used to inform the individual of the personal information processing rules, personal authorization is obtained by means of pop-up information or by asking the individual to upload his or her personal information. The personal information processing rules may include information such as the personal information processor, the purpose of personal information processing, the processing method, and the types of personal information to be processed.

Claims (14)

1. A method for extracting speech features, comprising:
acquiring voice to be processed; wherein the voice to be processed comprises a plurality of voice channels;
Performing feature extraction on the voice segments in the voice channels based on a feature extraction model to obtain voice features of the voice segments in the voice channels;
The feature extraction model is obtained by training with a sample voice set through at least three kinds of contrast learning; the sample voice set contains sample speech segments from a plurality of sample multi-channel voices, and the three kinds of contrast learning comprise: comparing first feature similarities between sample speech segments from the same and different sample multi-channel speech, comparing second feature similarities between sample speech segments from the same and different channels in the same sample multi-channel speech, and comparing third feature similarities between sample speech segments from the same and different time sequences in the same sample multi-channel speech.
2. The method of claim 1, wherein the training step of the feature extraction model comprises:
dividing the sample speech set into two first subsets;
For the two first subsets: measuring to obtain a first loss based on the first feature similarity of the sample voice features between every two of the sample voice segments contained in each, measuring to obtain a second loss based on the second feature similarity of the sample voice features between every two of the sample voice segments contained in each, and measuring to obtain a third loss based on the third feature similarity of the sample voice features between every two of the sample voice segments contained in each;
Based on the first loss, the second loss, and the third loss, network parameters of the feature extraction model are adjusted.
3. The method of claim 2, wherein the first feature similarity between sample speech segments from the same sample multi-channel speech is inversely related to the first penalty, and wherein the first feature similarity between sample speech segments from different sample multi-channel speech is positively related to the first penalty;
And/or the second feature similarity between sample speech segments from the same channel in the same sample multi-channel speech is inversely related to the second penalty, and the second feature similarity between sample speech segments from different channels in the same sample multi-channel speech is positively related to the second penalty;
And/or the third feature similarity between sample speech segments from the same time sequence in the same sample multi-channel speech is inversely related to the third penalty, and the third feature similarity between sample speech segments from different time sequences in the same sample multi-channel speech is positively related to the third penalty.
4. The method of claim 2, wherein before the measuring to obtain the first loss based on the first feature similarities of the sample speech features between every two of the sample speech segments contained in each of the two first subsets, the method further comprises:
Extracting sample acoustic features of each sample voice segment;
Clustering based on the sample acoustic features of the sample voice segments to obtain sample clustering labels of the sample voice segments, and extracting to obtain sample initial features of the sample voice segments based on the sample acoustic features of the sample voice segments;
Randomly masking sample initial characteristics of at least one sample voice segment, coding to obtain sample voice characteristics of each sample voice segment based on the sample initial characteristics of the sample voice segments which are not masked, and predicting to obtain prediction clustering labels of the sample voice segments which are masked;
The adjusting network parameters of the feature extraction model based on the first loss, the second loss, and the third loss includes:
and adjusting network parameters of the feature extraction model based on the first loss, the second loss, the third loss and a fourth loss measured based on a difference between a sample cluster label and a predicted cluster label of the masked sample speech segment.
5. The method of claim 2, wherein the same number of the sample speech segments are contained within the two first subsets, and wherein, if the sample speech set contains an odd number of the sample speech segments, the method further comprises, prior to the dividing the sample speech set into the two first subsets, any of:
Randomly discarding any odd number of the sample speech segments in the sample speech set;
Randomly copying any odd number of the sample speech segments within the sample speech set.
6. The method of any one of claims 1 to 5, wherein the feature extraction model is further trained to converge by at least a first contrast learning and then at least a second contrast learning and a third contrast learning prior to training by at least the three contrast learning;
wherein the first contrast learning comprises: comparing first feature similarities between sample speech segments from the same and different sample multi-channel speech; the second contrast learning comprises: comparing second feature similarities between sample speech segments from the same and different channels in the same sample multi-channel speech; and the third contrast learning comprises: comparing third feature similarities between sample speech segments from the same and different time sequences in the same sample multi-channel speech.
7. The method of claim 6, wherein the training to converge through at least a first contrast learning comprises:
dividing the sample voice set into two first subsets, and extracting sample acoustic features of each sample voice segment;
Clustering based on the sample acoustic features of the sample voice segments to obtain sample clustering labels of the sample voice segments, and extracting to obtain sample initial features of the sample voice segments based on the sample acoustic features of the sample voice segments;
Randomly masking sample initial characteristics of at least one sample voice segment, coding to obtain sample voice characteristics of each sample voice segment based on the sample initial characteristics of the sample voice segments which are not masked, and predicting to obtain prediction clustering labels of the sample voice segments which are masked;
Measuring to obtain a first loss based on first feature similarity of the sample voice features between every two sample voice segments contained in the two first subsets, and measuring to obtain a fourth loss based on sample clustering labels and predictive clustering labels of the masked sample voice segments;
based on the first loss and the fourth loss, network parameters of the feature extraction model are adjusted.
8. The method of claim 6, wherein the training to convergence through at least a second contrast learning and a third contrast learning comprises:
dividing sample speech segments from the same sample multi-channel speech within the sample speech set into two second subsets;
for the two second subsets: measuring to obtain a fifth loss based on second feature similarities of the sample speech features between every two of the sample speech segments contained in each, and measuring to obtain a sixth loss based on third feature similarities of the sample speech features between every two of the sample speech segments contained in each;
based on the fifth loss and the sixth loss, network parameters of the feature extraction model are adjusted.
9. The method of claim 1, wherein the feature extraction of the speech segments in the plurality of speech channels based on the feature extraction model to obtain the speech features of each of the speech segments in the plurality of speech channels comprises:
extracting and obtaining a first initial characteristic of the voice segment based on the acoustic characteristic of the voice segment, and obtaining a second initial characteristic representing the voice channel to which the voice segment belongs;
And based on the second initial characteristic of the voice channel and the first initial characteristic of each voice segment in the voice channel, encoding to obtain the first voice characteristic of each voice segment in the voice channel and the second voice characteristic representing the voice channel.
10. A method of speech processing, comprising:
extracting features based on the voice to be processed to obtain voice features of each voice segment in a plurality of voice channels in the voice to be processed; wherein the speech feature is obtained based on the speech feature extraction method of any one of claims 1 to 9;
Processing based on the voice characteristics of each voice segment in a plurality of voice channels in the voice to be processed to obtain a processing result of the voice to be processed.
11. A speech feature extraction apparatus, comprising:
The acquisition module is used for acquiring the voice to be processed; wherein the voice to be processed comprises a plurality of voice channels;
the extraction module is used for carrying out feature extraction on the voice segments in the voice channels based on a feature extraction model to obtain voice features of the voice segments in the voice channels;
The feature extraction model is obtained by training with a sample voice set through at least three kinds of contrast learning; the sample voice set contains sample speech segments from a plurality of sample multi-channel voices, and the three kinds of contrast learning comprise: comparing first feature similarities between sample speech segments from the same and different sample multi-channel speech, comparing second feature similarities between sample speech segments from the same and different channels in the same sample multi-channel speech, and comparing third feature similarities between sample speech segments from the same and different time sequences in the same sample multi-channel speech.
12. A speech processing apparatus, comprising:
The extraction module is used for extracting characteristics based on the voice to be processed to obtain voice characteristics of each voice segment in a plurality of voice channels in the voice to be processed; wherein the speech features are derived based on the speech feature extraction means of claim 11;
and the processing module is used for processing based on the voice characteristics of each voice segment in a plurality of voice channels in the voice to be processed to obtain a processing result of the voice to be processed.
13. An electronic device comprising at least a memory and a processor coupled to each other, the memory having stored therein program instructions for executing the program instructions to implement the speech feature extraction method of any one of claims 1 to 9 or to implement the speech processing method of claim 10.
14. A computer readable storage medium, characterized in that program instructions executable by a processor for implementing the speech feature extraction method of any one of claims 1 to 9 or the speech processing method of claim 10 are stored.
CN202410096686.2A 2024-01-23 2024-01-23 Speech feature extraction method, and related method, device, equipment and storage medium Pending CN118136045A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410096686.2A CN118136045A (en) 2024-01-23 2024-01-23 Speech feature extraction method, and related method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410096686.2A CN118136045A (en) 2024-01-23 2024-01-23 Speech feature extraction method, and related method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118136045A true CN118136045A (en) 2024-06-04

Family

ID=91236529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410096686.2A Pending CN118136045A (en) 2024-01-23 2024-01-23 Speech feature extraction method, and related method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118136045A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination