US20200234717A1 - Speaker separation model training method, two-speaker separation method and computing device - Google Patents
- Publication number
- US20200234717A1 (application US16/652,452)
- Authority
- US
- United States
- Prior art keywords
- feature
- speaker
- vector
- speech
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/18—Artificial neural networks; Connectionist approaches
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0208—Noise filtering
Definitions
- the present disclosure relates to a technical field of biometrics, specifically a speaker separation model training method, a two-speaker separation method, a terminal, and a storage medium.
- Speaker separation technology refers to the process of automatically dividing and labeling the speech of a multi-person conversation according to speaker, solving the problem of "who spoke when."
- Separation of two speakers refers to separating recordings of two speakers speaking one after the other on the same audio track into two audio tracks, each audio track containing the recording of one speaker.
- The separation of two speakers is widely used in many fields, with extensive needs in industries such as radio, television, media, and customer service centers.
- a speaker separation model training method a two-speaker separation method, a terminal, and a storage medium are disclosed.
- Training the speaker separation model in advance significantly enhances a feature extraction capability of the model on input speech data and reduces a risk of performance degradation when the network level deepens; separating the two speakers' speech according to the trained speaker separation model improves an accuracy of the two-speaker separation, especially for lengthy conversations.
- a first aspect of the present disclosure provides a speaker separation model training method, the method includes:
- a second aspect of the present disclosure provides a two-speaker separation method, the method includes:
- a third aspect of the present disclosure provides a terminal, the terminal includes a processor and a storage device, and the processor executes computer-readable instructions stored in the storage device to implement the speaker separation model training method and/or the two-speaker separation method.
- a fourth aspect of the present disclosure provides a non-transitory storage medium having stored thereon computer-readable instructions that, when executed by a processor, implement the speaker separation model training method and/or the two-speaker separation method.
- the speaker separation model training method, the two-speaker separation method, the terminal, and the storage medium described in the present disclosure train the speaker separation model in advance, which enhances a feature extraction capability of the model on input speech data and reduces a risk of performance degradation when the network hierarchy deepens; separating the two speakers' speech according to the trained speaker separation model improves an accuracy of the two-speaker separation, especially for a long conversation.
- FIG. 1 shows a flowchart of a speaker separation model training method provided in an embodiment of the present disclosure.
- FIG. 2 shows a flowchart of a two-speaker separation method provided in an embodiment of the present disclosure.
- FIG. 3 shows a schematic diagram of determining a local maximum according to a segmentation point and a corresponding distance value provided in an embodiment of the present disclosure.
- FIG. 4 shows a schematic structural diagram of a speaker separation model training device provided in an embodiment of the present disclosure.
- FIG. 5 shows a schematic structural diagram of a two-speaker separation device provided in an embodiment of the present disclosure.
- FIG. 6 shows a schematic structural diagram of a terminal provided in an embodiment of the present disclosure.
- the speaker separation model training method and/or the two speaker separation method in the embodiments of the present disclosure are applied to one or more electronic terminals.
- the speaker separation model training method and/or the two speaker separation method may also be applied to a hardware environment composed of a terminal and a server connected to the terminal through a network.
- the network includes, but is not limited to: a wide area network, a metropolitan area network, or a local area network.
- the speaker separation model training method and/or the two speaker separation method in the embodiment of the present disclosure can be executed by the server alone, by the terminal alone, or jointly by the server and the terminal.
- in one scenario, the speaker separation model training method in the embodiment of the present disclosure is executed by the server, and the two speaker separation method in the embodiment of the present disclosure is executed by the terminal.
- the speaker separation model training function and/or the two speaker separation function provided by the method of the present disclosure can be directly integrated on the terminal, or installed as a client for implementing the methods of the present disclosure.
- the methods provided in the present disclosure can also be run on a server or other device in the form of a Software Development Kit (SDK), and provide the speaker separation model training function and/or two speaker separation function in the form of an SDK interface.
- the terminal or other equipment can implement the speaker separation model training method and/or the two speaker separation method through the provided interface.
- FIG. 1 is a flowchart of a speaker separation model training method in an embodiment of the present disclosure. According to different requirements, the order of the steps in the flow can be changed, and some steps can be omitted. Within each step, sub-steps can be sub-numbered.
- the acquiring of the plurality of audio data may include the following two manners:
- In a first manner, the audio data is recorded using an audio device, for example, a voice recorder.
- In a second manner, the audio data is obtained from an audio data set.
- the audio data set is an open source data set, such as a UBM data set and a TV data set.
- the open source audio data set is dedicated to training a speaker separation model and testing an accuracy of the trained speaker separation model.
- the UBM data set and the TV data set are from NIST04, NIST05 and NIST06, including about 500 h of audio data, recording a total of 577 speakers, with an average of about 15 sentences per person.
- processing of the plurality of audio data includes one or more of the following combinations:
- the acquired audio data may contain various noises.
- a low-pass filter can be used to remove white noise and random noise in the audio data.
- a dual-threshold comparison method can be used for voice activity detection to detect valid audio and invalid audio in the audio data.
- the valid audio is speech, and the invalid audio is defined relative to the valid audio, including but not limited to recorded silent passages.
- a label refers to an identity attribute tag representing the audio data.
- a first audio data of speaker A is labeled with an identity attribute tag 01
- a second audio data of speaker B is labeled with an identity attribute tag 02 .
- MFCC (Mel Frequency Cepstrum Coefficient) features can be extracted from the processed audio data as the audio features.
- the preset neural network model is stacked using a neural network structure with a predetermined number of layers.
- the predetermined number of layers is a preset number of network layers; for example, 9 to 12 layers of the neural network structure are set in advance to train the neural network model.
- each layer of the neural network structure includes: a first convolution layer, a first modified linear unit, a second convolution layer, and a second modified linear unit, an average layer, a fully connected layer, and a normalization layer.
- a convolution kernel of each convolution layer is 3×3, a stride is 1×1, and a number of channels is 64;
- a specific process of training the preset neural network model based on the input audio features includes:
- a function of the average layer is temporal pooling, that is, calculating an average value of vector sequences along the time axis.
- the average layer calculates an average value of the vector sequence output from the forward long short-term memory (LSTM) network to obtain a forward average vector, and an average value of the vector sequence output from the backward LSTM network to obtain a backward average vector.
- the fully connected layer concatenates the forward average vector and the backward average vector into one vector.
- the normalization layer uses a normalization function to process the concatenated vector to obtain a normalized one-dimensional vector feature with a length of 1.
- the normalization function can be a Euclidean distance function, a Manhattan distance function, or a minimum absolute error function.
- the Euclidean distance function is used to normalize the concatenated vector processed by the fully connected layer to obtain a normalized one-dimensional vector feature.
- the normalization process compresses the concatenated vector processed by the fully connected layer, making the concatenated vector robust, thereby further improving a robustness of the speaker separation model.
- the normalization using the Euclidean distance function prevents the speaker separation model from overfitting, thereby improving a generalization ability of the speaker separation model. Subsequently optimizing neural network parameters in the speaker separation model becomes more stable and faster.
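As an illustrative sketch (not the patent's exact implementation), normalizing the concatenated vector to unit Euclidean length can be written as follows; the function name `l2_normalize` is an assumption for illustration:

```python
import numpy as np

def l2_normalize(concatenated: np.ndarray) -> np.ndarray:
    """Scale a vector so its Euclidean (L2) norm is 1."""
    norm = np.linalg.norm(concatenated)
    if norm == 0:
        return concatenated  # avoid division by zero for an all-zero vector
    return concatenated / norm

# Example: concatenate a forward and a backward average vector, then normalize.
forward_avg = np.array([3.0, 0.0])
backward_avg = np.array([0.0, 4.0])
vector_feature = l2_normalize(np.concatenate([forward_avg, backward_avg]))
```

The resulting vector feature has length 1 regardless of the scale of the concatenated input, which is what makes the subsequent similarity comparisons well-behaved.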
- the preset first similarity function can be a cosine similarity function, as shown in formula (1-1) below:
  COS(x_i, x_j) = (x_i · x_j) / (‖x_i‖ ‖x_j‖)   (1-1)
- x_i represents the first vector feature of the first speaker
- x_j represents the second vector feature of the first speaker
- COS(x_i, x_j) is the calculated first similarity value
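Formula (1-1) is the standard cosine similarity; a minimal sketch (the function name is assumed for illustration):

```python
import numpy as np

def cosine_similarity(x_i: np.ndarray, x_j: np.ndarray) -> float:
    """COS(x_i, x_j) = (x_i . x_j) / (||x_i|| * ||x_j||)."""
    return float(np.dot(x_i, x_j) / (np.linalg.norm(x_i) * np.linalg.norm(x_j)))

# Two vector features of the same speaker should score close to 1.
s12 = cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([1.1, 2.1, 2.9]))
```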
- the preset second similarity function can be same as or different from the preset first similarity function.
- the preset second similarity function is an Lp norm, as shown in the following formula (1-2), where the sum runs over the vector components:
  L_p(x_i, y_i) = (Σ |x_i − y_i|^p)^(1/p)   (1-2)
- x_i represents the first vector feature of the first speaker
- y_i represents the third vector feature of the second speaker
- L_p(x_i, y_i) is the calculated second similarity value
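The Lp norm of the difference between the two vector features can be sketched as below; the choice p = 2 (Euclidean) and the function name are illustrative assumptions:

```python
import numpy as np

def lp_distance(x: np.ndarray, y: np.ndarray, p: float = 2.0) -> float:
    """L_p(x, y) = (sum over components of |x_k - y_k|^p)^(1/p)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

# With p = 2 this is the Euclidean distance between the two vector features.
d = lp_distance(np.array([0.0, 0.0]), np.array([3.0, 4.0]))
```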
- the preset loss function can be as shown in the following formula (1-3):
- S_i^13 is the second similarity value, that is, the similarity between the first vector feature of the first speaker and the third vector feature of the second speaker.
- S_i^12 is the first similarity value, that is, the similarity between the first vector feature of the first speaker and the second vector feature of the first speaker.
- L is the calculated loss function value.
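The exact form of formula (1-3) is not reproduced in this text. As an assumed stand-in, not the patent's actual formula, a triplet-style hinge loss consistent with the surrounding description — rewarding a high same-speaker similarity S_i^12 and penalizing a high different-speaker similarity S_i^13 — could be sketched as:

```python
import numpy as np

def hinge_loss(s12: np.ndarray, s13: np.ndarray, margin: float = 0.5) -> float:
    """Assumed triplet-style loss: penalize samples whose different-speaker
    similarity s13 is not at least `margin` below the same-speaker similarity
    s12. This is an illustrative stand-in for formula (1-3)."""
    return float(np.mean(np.maximum(0.0, s13 - s12 + margin)))

# First sample is well separated (no penalty); second is not (penalized).
loss = hinge_loss(np.array([0.9, 0.8]), np.array([0.1, 0.7]))
```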
- the present disclosure acquires a plurality of audio data as training samples and labels the audio data. Audio features of the audio data are extracted. Vector features are obtained by inputting the audio features into a preset neural network model for training. A first vector feature and a second vector feature of a first speaker are selected, and a first similarity value between the first vector feature and the second vector feature is calculated according to a preset first similarity function. A third vector feature of a second speaker is selected, and a second similarity value between the first vector feature and the third vector feature is calculated according to a preset second similarity function.
- the first similarity value and the second similarity value are inputted into a preset loss function to calculate a loss function value; and when the loss function value is less than or equal to a preset loss function threshold, the training process of the speaker separation model is ended, and parameters in the speaker separation model are updated.
- the speaker separation model for training based on the convolutional neural network in the present disclosure has strong feature extraction capabilities, which reduces a risk of performance degradation when the network level deepens.
- vector features of different audio data from the same speaker are kept as similar as possible, while vector features of audio data from different speakers are kept as different as possible, so that the calculated loss function reaches the convergence condition faster. Training time of the speaker separation model is saved, and a separation efficiency of the speaker separation model is improved.
- FIG. 2 is a flowchart of a method for two-speaker separation in an embodiment of the present disclosure. According to different requirements, the order of the steps in the flow can be changed, and some can be omitted. Within each step, sub-steps can be sub-numbered.
- In block 21, processing a speech signal to be separated.
- a process of processing the speech signal to be separated includes:
- a digital filter can be used to perform pre-emphasis processing on the speech signal to be separated, to boost the high-frequency part of the speech signal.
- the details are shown below in formula (2-1):
  S̃(n) = S(n) − a · S(n−1)   (2-1)
- S(n) is the speech signal to be separated
- a is a pre-emphasis coefficient, generally taken as 0.95
- S̃(n) is the speech signal after the pre-emphasis processing.
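Formula (2-1) is a first-order pre-emphasis (high-pass) filter; a minimal sketch, with the first sample passed through unchanged as an assumption:

```python
import numpy as np

def pre_emphasis(s: np.ndarray, a: float = 0.95) -> np.ndarray:
    """S~(n) = S(n) - a * S(n-1); the first sample is kept as-is."""
    out = np.empty_like(s, dtype=float)
    out[0] = s[0]
    out[1:] = s[1:] - a * s[:-1]
    return out

# A constant (low-frequency) signal is strongly attenuated after the first sample.
emphasized = pre_emphasis(np.array([1.0, 1.0, 1.0]))
```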
- the speech signal to be separated can be framed according to a preset framing parameter.
- the preset framing parameter can be, for example, a frame length of 10-30 ms.
- the speech signal to be separated is framed every 25 milliseconds to obtain a plurality of speech frames.
- the speech signal obtained after framing is a time series of characteristic parameters, composed of the characteristic parameters of each frame.
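The framing step can be sketched as follows; the 16 kHz sample rate and non-overlapping 25 ms frames are illustrative assumptions (practical front ends often use overlapping frames):

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int, frame_ms: int = 25) -> list:
    """Split a signal into consecutive, non-overlapping frames of
    `frame_ms` milliseconds each."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

# 1 second of audio at 16 kHz with 25 ms frames gives 40 frames of 400 samples.
frames = frame_signal(np.zeros(16000), 16000)
```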
- a length of the first sliding window and the second sliding window can be 0.7-2 seconds.
- a segmentation point between the first sliding window and the second sliding window is a possible speaker dividing point of the processed voice signal.
- the first sliding window corresponds to the first speech segment
- the second sliding window corresponds to the second speech segment.
- the segmentation point between the first sliding window and the second sliding window is also the segmentation point between the first speech segment and the second speech segment.
- the first speech segment is inputted into the trained speaker separation model, and the trained speaker separation model extracts an MFCC(Mel Frequency Cepstrum Coefficient) of the first speech segment to obtain a first speech vector.
- the second speech segment is inputted into the trained speaker separation model and the trained speaker separation model extracts an MFCC of the second speech segment to obtain a second speech vector.
- a preset distance function can be used to calculate a distance value between each first speech vector and each corresponding second speech vector.
- the preset distance function can be, for example, a Euclidean distance function.
- the detailed process of calculating the distance value between the first speech vector and the second speech vector by using the Euclidean distance function is not repeated in the present disclosure.
- the preset time period can be 5 ms.
- a plurality of segmentation points of the sliding window can be obtained, thereby obtaining a plurality of first speech fragments and a plurality of second speech fragments. That is, each time the first sliding window and the second sliding window are slid at the same time according to the preset time period, a candidate segmentation point is obtained.
- Each candidate segmentation point is the segmentation point between the first speech segment and the second speech segment, and a distance value is calculated for it; there are as many distance values as there are segmentation points.
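The sliding-window procedure above can be sketched as follows; `embed` is a hypothetical placeholder standing in for the trained speaker separation model, and the toy mean-based embedding in the usage example is purely illustrative:

```python
import numpy as np

def window_distances(signal, win_len, hop, embed):
    """Slide two adjacent windows over `signal`. At each candidate segmentation
    point, compute the Euclidean distance between the embeddings of the left
    (first) and right (second) speech segments."""
    points, distances = [], []
    start = 0
    while start + 2 * win_len <= len(signal):
        split = start + win_len                       # candidate segmentation point
        left = embed(signal[start:split])             # first speech segment
        right = embed(signal[split:split + win_len])  # second speech segment
        points.append(split)
        distances.append(float(np.linalg.norm(left - right)))
        start += hop                                  # slide both windows together
    return points, distances

# Toy embedding (mean of the window): the true speaker change at sample 100
# should yield the largest distance.
signal = np.concatenate([np.zeros(100), np.ones(100)])
pts, ds = window_distances(signal, 50, 25, lambda w: np.array([w.mean()]))
```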
- a specific process of determining local maximum values according to all the distance values includes:
- for each segmentation point, determining whether f(n) is greater than both f(n−1) and f(n+1), where f(n) is the distance value corresponding to the segmentation point, f(n−1) is the distance value corresponding to the segmentation point before it, and f(n+1) is the distance value corresponding to the segmentation point after it;
- a plurality of local maximum values can be determined, and a segmentation point corresponding to each local maximum value is a segmentation point to be found.
- each segmentation point corresponds to a distance value, for example, S1, S2, S3, S4, S5, S6, S7, S8, S9, and S10.
- each remaining distance value is compared with its previous distance value and its subsequent distance value to determine the local maximum values.
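The neighbor comparison described above can be sketched as below; endpoints are skipped because they lack one neighbor:

```python
def local_maxima(distances):
    """Return indices n with f(n) > f(n-1) and f(n) > f(n+1)."""
    return [n for n in range(1, len(distances) - 1)
            if distances[n] > distances[n - 1] and distances[n] > distances[n + 1]]

# Distance values S1..S10 with peaks at S2, S4, S6, S9 (0-based indices 1, 3, 5, 8),
# mirroring the example in FIG. 3.
peaks = local_maxima([0.1, 0.8, 0.3, 0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.1])
```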
- the determining of local maximum values according to all the distance values can include:
- a smooth curve can be drawn with the segmentation points as the horizontal axis and the distance values corresponding to the segmentation points as the vertical axis, as shown in FIG. 3 .
- Solving a tangent at each point in FIG. 3 shows that the slope of the tangent is zero at the points corresponding to S2, S4, S6, and S9, so S2, S4, S6, and S9 are determined as local maximum values.
- the speech signal to be separated is again segmented by using the segmentation points corresponding to the local maximum values as a new segmentation point, so as to obtain a plurality of new speech fragments.
- a process of re-segmentation is to find time points of dialogue exchange between two different speakers from the speech signal to be separated, and then the speech signal to be separated can be segmented into several speech segments according to the time points. Each speech segment contains the speech of only one speaker.
- each new speech segment contains only one speaker's speech.
- All speech segments belonging to one speaker are clustered by a clustering method and then recombined together.
- the clustering method can be a K-means clustering or a bottom-up hierarchical clustering (HAC).
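As a sketch of the clustering step, a minimal k-means with k = 2 (one cluster per speaker) is shown below; a real system might instead use a library implementation such as scikit-learn's KMeans, or bottom-up hierarchical clustering. The farthest-point initialization is an illustrative assumption:

```python
import numpy as np

def two_means(vectors, iters=20):
    """Minimal k-means with k=2: cluster speech-segment vectors into two
    speakers. Initialization uses the first vector and the vector farthest
    from it as the two starting centers."""
    X = np.asarray(vectors, dtype=float)
    far = int(np.argmax(np.linalg.norm(X - X[0], axis=1)))
    centers = np.stack([X[0], X[far]])
    for _ in range(iters):
        # assign every vector to its nearest center
        labels = np.argmin(
            np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        for k in (0, 1):
            if np.any(labels == k):  # recompute each non-empty cluster's mean
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Segment vectors near 0 belong to one speaker, those near 10 to the other.
labels = two_means([[0.1], [0.2], [10.0], [9.9], [0.0]])
```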
- In the two-speaker separation method described above, a speech signal to be separated is first processed.
- a first sliding window and a second sliding window that are adjacent to each other are established and slid from the starting position of the processed speech signal.
- a first speech segment and a second speech segment are obtained according to the segmentation point between the first sliding window and the second sliding window.
- the first speech segment is inputted into a speaker separation model for feature extraction to obtain a first speech vector
- the second speech segment is inputted into the speaker separation model for feature extraction to obtain a second speech vector.
- a distance value between the first speech vector and the second speech vector is calculated as a distance value corresponding to the segmentation point.
- the sliding window can be moved according to a preset time period, and each time the sliding window is moved, two speech fragments are obtained until the second sliding window reaches the end of the pre-processed voice signal and the distance value corresponding to each segmentation point is obtained.
- local maximum values are determined according to the distance values, and the speech signal to be separated is segmented according to the segmentation points corresponding to the local maximum values to obtain new speech segments.
- the new speech segment is clustered into the respective speech of different speakers.
- a plurality of speech fragments are obtained through several sliding processes.
- the trained speaker separation model is used to extract features of the speech fragments. The calculated distance values are compared to determine the local maximum values, the segmentation points corresponding to the local maximum values are used as new segmentation points to segment the speech signal to be separated again, and the speech fragments of the two speakers are obtained, achieving effective separation.
- the following describes functional modules and hardware structure of a terminal that implements the above-mentioned speaker separation model training method and the two-speaker separation method, with reference to FIGS. 4-6 .
- FIG. 4 is a schematic structural diagram of a speaker separation model training device in a preferred embodiment of the present disclosure.
- the speaker separation model training device 40 runs in a terminal.
- the speaker separation model training device 40 can include a plurality of function modules consisting of program code segments.
- the program code of each program code segment in the speaker separation model training device 40 can be stored in a memory and executed by at least one processor to train the speaker separation model (described in detail in FIG. 1 ).
- the speaker separation model training device 40 in the terminal can be divided into a plurality of functional modules, according to the performed functions.
- the functional modules can include: an acquisition module 401 , a processing module 402 , a feature extraction module 403 , a training module 404 , a calculation module 405 , and an update module 406 .
- a module as referred to in the present disclosure refers to a series of computer-readable instruction segments stored in a memory that can be executed by at least one processor and that perform fixed functions. In some embodiments, the functions of each module will be detailed in the following description.
- the acquisition module 401 is configured to acquire a plurality of audio data of multiple speakers.
- the acquiring of the plurality of audio data may include the following two manners:
- In a first manner, the audio data is recorded using an audio device, for example, a voice recorder.
- In a second manner, the audio data is obtained from an audio data set.
- the audio data set is an open source data set, such as a UBM data set and a TV data set.
- the open source audio data set is dedicated to training a speaker separation model and testing an accuracy of the trained speaker separation model.
- the UBM data set and the TV data set are from NIST04, NIST05 and NIST06, including about 500 h of audio data, recording a total of 577 speakers, with an average of about 15 sentences per person.
- the processing module 402 is configured to process each of the plurality of audio data.
- processing of the plurality of audio data includes one or more of the following combinations:
- the acquired audio data may contain various noises.
- a low-pass filter can be used to remove white noise and random noise in the audio data.
- a dual-threshold comparison method can be used for voice activity detection to detect valid audio and invalid audio in the audio data.
- the valid audio is speech, and the invalid audio is defined relative to the valid audio, including but not limited to recorded silent passages.
- a label refers to an identity attribute tag representing the audio data.
- a first audio data of speaker A is labeled with an identity attribute tag 01
- a second audio data of speaker B is labeled with an identity attribute tag 02 .
- the feature extraction module 403 is configured to extract audio features of the processed audio data.
- MFCC (Mel Frequency Cepstrum Coefficient) features can be extracted from the processed audio data as the audio features.
- the training module 404 is configured to input the audio features into a preset neural network model for training, to obtain vector features.
- the preset neural network model is stacked using a neural network structure with a predetermined number of layers.
- the predetermined number of layers is a preset number of network layers; for example, 9 to 12 layers of the neural network structure are set in advance to train the neural network model.
- each layer of the neural network structure includes: a first convolution layer, a first modified linear unit, a second convolution layer, and a second modified linear unit, an average layer, a fully connected layer, and a normalization layer.
- a convolution kernel of each convolution layer is 3×3, a stride is 1×1, and a number of channels is 64;
- a specific process of training the preset neural network model based on the input audio features includes:
- a function of the average layer is temporal pooling, that is, calculating an average value of vector sequences along the time axis.
- the average layer calculates an average value of the vector sequence output from the forward long short-term memory (LSTM) network to obtain a forward average vector, and an average value of the vector sequence output from the backward LSTM network to obtain a backward average vector.
- the fully connected layer concatenates the forward average vector and the backward average vector into one vector.
- the normalization layer uses a normalization function to process the concatenated vector to obtain a normalized one-dimensional vector feature with a length of 1.
- the normalization function can be a Euclidean distance function, a Manhattan distance function, or a minimum absolute error function.
- the Euclidean distance function is used to normalize the concatenated vector processed by the fully connected layer to obtain a normalized one-dimensional vector feature.
- the normalization process compresses the concatenated vector processed by the fully connected layer, making the concatenated vector robust, thereby further improving a robustness of the speaker separation model.
- the normalization using the Euclidean distance function prevents the speaker separation model from overfitting, thereby improving a generalization ability of the speaker separation model. Subsequently optimizing neural network parameters in the speaker separation model becomes more stable and faster.
- the calculation module 405 is configured to select a first vector feature and a second vector feature of a first speaker, and calculate a first similarity value between the first vector feature and the second vector feature according to a preset first similarity function.
- the preset first similarity function can be a cosine similarity function, as shown in formula (1-1) below:
  COS(x_i, x_j) = (x_i · x_j) / (‖x_i‖ ‖x_j‖)   (1-1)
- x_i represents the first vector feature of the first speaker
- x_j represents the second vector feature of the first speaker
- COS(x_i, x_j) is the calculated first similarity value
- the calculation module 405 is also configured to select a third vector feature of a second speaker, and calculate a second similarity value between the first vector feature and the third vector feature according to a preset second similarity function.
- the preset second similarity function can be same as or different from the preset first similarity function.
- the preset second similarity function is an L_p norm, as shown in the following formula (1-2):
- L_p(x_i, y_i) = (Σ_k |x_i(k) − y_i(k)|^p)^(1/p)  (1-2)
- x_i represents the first vector feature of the first speaker
- y_i represents the third vector feature of the second speaker
- L_p(x_i, y_i) is the calculated second similarity value
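Assuming formula (1-2) is the standard L_p distance between the two vectors (the formula image itself is not reproduced in the text), a sketch:

```python
import numpy as np

def lp_distance(x, y, p=2):
    # (sum_k |x_k - y_k|^p)^(1/p); p=2 gives the Euclidean distance,
    # p=1 the Manhattan distance.
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

print(lp_distance([0, 0], [3, 4], p=2))  # 5.0
print(lp_distance([0, 0], [3, 4], p=1))  # 7.0
```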
- the update module 406 is configured to input the first similarity value and the second similarity value into a preset loss function to calculate a loss function value.
- when the loss function value is less than or equal to a preset loss function threshold, the training process of the speaker separation model is ended, and parameters in the speaker separation model are updated.
- the preset loss function can be as shown in the following formula (1-3):
- S_i13 is the second similarity value, that is, the similarity between the first vector feature of the first speaker and the third vector feature of the second speaker.
- S_i12 is the first similarity value, that is, the similarity between the first vector feature of the first speaker and the second vector feature of the first speaker.
- L is the calculated loss function value.
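Formula (1-3) is not reproduced in the text; the hinge-style stand-in below merely illustrates the stated intent — the same-speaker similarity should exceed the cross-speaker similarity — with an assumed margin hyperparameter, and is not the disclosure's exact loss:

```python
def hinge_loss(s12, s13, margin=0.5):
    # Hypothetical stand-in for formula (1-3): penalize whenever the
    # cross-speaker similarity s13 comes within `margin` of the
    # same-speaker similarity s12; the loss is 0 once s12 >= s13 + margin.
    return max(0.0, s13 - s12 + margin)

print(hinge_loss(s12=1.0, s13=0.25))   # 0.0  (speakers well separated)
print(hinge_loss(s12=0.25, s13=0.5))   # 0.75 (not separated enough)
```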
- the present disclosure acquires a plurality of audio data as training samples and labels the audio data. Audio features of the audio data are extracted. Vector features are obtained by inputting the audio features into a preset neural network model for training. A first vector feature and a second vector feature of a first speaker are selected, and a first similarity value between the first vector feature and the second vector feature is calculated according to a preset first similarity function. A third vector feature of a second speaker is selected, and a second similarity value between the first vector feature and the third vector feature is calculated according to a preset second similarity function.
- the first similarity value and the second similarity value are inputted into a preset loss function to calculate a loss function value; and when the loss function value is less than or equal to a preset loss function threshold, the training process of the speaker separation model is ended, and parameters in the speaker separation model are updated.
- the speaker separation model for training based on the convolutional neural network in the present disclosure has strong feature extraction capabilities, which reduces a risk of performance degradation when the network level deepens.
- vector features of different audio data of the same speaker are kept as similar as possible, and vector features of the audio data of different speakers are kept as different as possible, so that the calculated loss function satisfies the convergence condition faster. Training time of the speaker separation model is saved, and the separation efficiency of the speaker separation model is improved.
- FIG. 5 is a schematic structural diagram of a preferred embodiment of a two-speaker separation device of the present disclosure.
- the two-speaker separation device 50 runs in a terminal.
- the two-speaker separation device 50 can include a plurality of function modules consisting of program code segments.
- the program code of each program code segment in the two-speaker separation device 50 can be stored in a memory and executed by at least one processor to separate speech signals of two speakers into two speech segments.
- Each segment of speech contains the speech of only one speaker (described in detail in FIG. 2 ).
- the two-speaker separation device 50 in the terminal can be divided into a plurality of functional modules, according to the performed functions.
- the functional modules can include: a signal processing module 501 , a first segmentation module 502 , a vector extraction module 503 , a calculation module 504 , a comparison module 505 , a second segmentation module 506 , and a clustering module 507 .
- a module as referred to in the present disclosure refers to a series of computer-readable instruction segments that can be executed by at least one processor and that are capable of performing fixed functions, which are stored in a memory. In some embodiments, the functions of each module will be detailed in the following embodiments.
- the above-mentioned integrated unit implemented in a form of software functional modules can be stored in a non-transitory readable storage medium.
- the above software function modules are stored in a storage medium and include several instructions for causing a computer device (which can be a personal computer, a dual-screen device, or a network device) or a processor to execute the method described in the various embodiments of the present disclosure.
- the signal processing module 501 is configured to process a speech signal to be separated.
- a process of the signal processing module 501 processing the speech signal to be separated includes:
- a digital filter can be used to perform pre-emphasis processing on the speech signal to be separated to boost the high-frequency part of the speech signal.
- the details are shown below in formula (2-1):
- S̃(n) = S(n) − a·S(n−1)  (2-1)
- S(n) is the speech signal to be separated
- a is a pre-emphasis coefficient, generally taken as 0.95
- S̃(n) is the speech signal after the pre-emphasis processing.
- the speech signal to be separated can be framed according to a preset framing parameter.
- the preset framing parameter can be, for example, a frame length of 10-30 ms.
- the speech signal to be separated is framed every 25 milliseconds to obtain a plurality of speech frames.
- each speech frame obtained after framing is a characteristic parameter time series composed of the characteristic parameters of each frame.
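The pre-emphasis of formula (2-1) and the fixed-length framing described above can be sketched as follows; the 16 kHz sampling rate and the helper names are our assumptions:

```python
import numpy as np

def pre_emphasis(signal, a=0.95):
    # Formula (2-1): S~(n) = S(n) - a * S(n - 1); boosts the
    # high-frequency part of the speech signal.
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - a * signal[:-1])

def frame_signal(signal, sample_rate=16000, frame_ms=25):
    # Split into non-overlapping 25 ms frames (the preset framing parameter);
    # real front ends usually add frame overlap and a window function.
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

emphasized = pre_emphasis(np.ones(16000))  # 1 second of a constant signal
frames = frame_signal(emphasized)
print(frames.shape)  # (40, 400)
```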
- the first segmentation module 502 is configured to establish a first sliding window and a second sliding window that are adjacent to each other and slide from a starting position of the processed speech signal, and to obtain a first speech segment and a second speech segment according to the segmentation point of the first sliding window and the second sliding window.
- a length of the first sliding window and the second sliding window can be 0.7-2 seconds.
- a segmentation point between the first sliding window and the second sliding window is a possible speaker dividing point of the processed speech signal.
- the first sliding window corresponds to the first speech segment
- the second sliding window corresponds to the second speech segment.
- the segmentation point between the first sliding window and the second sliding window is also the segmentation point between the first speech segment and the second speech segment.
- the vector extraction module 503 is configured to input the first speech segment into a speaker separation model for feature extraction to obtain a first speech vector, and to input the second speech segment into the speaker separation model for feature extraction to obtain a second speech vector.
- the first speech segment is inputted into the trained speaker separation model, and the trained speaker separation model extracts an MFCC (Mel Frequency Cepstrum Coefficient) of the first speech segment to obtain a first speech vector.
- the second speech segment is inputted into the trained speaker separation model and the trained speaker separation model extracts an MFCC of the second speech segment to obtain a second speech vector.
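The MFCC extraction can be illustrated with a textbook single-frame computation (power spectrum, mel filterbank, log, DCT-II); this is not the trained model's exact front end, and the parameters are illustrative:

```python
import numpy as np

def mfcc(frame, sample_rate, n_mels=26, n_mfcc=13):
    # Textbook single-frame MFCC sketch: power spectrum -> mel filterbank
    # -> log -> DCT-II. Real front ends add windowing, liftering, deltas.
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Mel-spaced triangular filterbank between 0 Hz and the Nyquist frequency.
    points = inv_mel(np.linspace(mel(0), mel(sample_rate / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * points / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    log_energies = np.log(fbank @ power + 1e-10)
    # DCT-II decorrelates the log filterbank energies into cepstral coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return basis @ log_energies

tone = np.sin(2 * np.pi * 440 * np.arange(400) / 16000)  # one 25 ms frame at 16 kHz
print(mfcc(tone, 16000).shape)  # (13,)
```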
- the calculation module 504 is configured to calculate a distance value between the first speech vector and the second speech vector as a distance value corresponding to the segmentation point.
- a preset distance function can be used to calculate a distance value between each first speech vector and each corresponding second speech vector.
- the preset distance function can be, for example, a Euclidean distance function.
- the process of calculating the distance value between the first speech vector and the second speech vector by using the Euclidean distance function is not described in detail in the present disclosure.
- the preset time period can be 5 ms.
- each time the first sliding window and the second sliding window are slid simultaneously by the preset time period, a candidate segmentation point is obtained; a plurality of segmentation points of the sliding windows can thus be obtained, thereby obtaining a plurality of first speech segments and a plurality of second speech segments.
- each candidate segmentation point is the segmentation point of a first speech segment and a second speech segment, and a distance value is calculated correspondingly; there are as many distance values as there are segmentation points.
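The candidate-segmentation-point loop can be sketched as follows; the mean-pooling `embed` is only a stand-in for the trained speaker separation model, and the window and hop sizes are illustrative:

```python
import numpy as np

def candidate_distances(features, win=100, hop=10):
    # Slide two adjacent windows over the frame features; the boundary
    # between them is a candidate segmentation point, scored by the
    # Euclidean distance between the two window embeddings.
    embed = lambda seg: seg.mean(axis=0)  # stand-in for the separation model
    distances = []
    start = 0
    while start + 2 * win <= len(features):
        first = embed(features[start:start + win])             # first sliding window
        second = embed(features[start + win:start + 2 * win])  # second sliding window
        distances.append(float(np.linalg.norm(first - second)))
        start += hop
    return distances  # one distance value per candidate segmentation point

rng = np.random.default_rng(0)
feats = rng.normal(size=(400, 13))      # 400 frames of 13-dim features
print(len(candidate_distances(feats)))  # one value per candidate point
```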
- the comparison module 505 is configured to acquire the distance value corresponding to each segmentation point, and determine local maximum values according to all the distance values.
- a specific process of the comparison module 505 determining local maximum values according to all the distance values includes: comparing each distance value f(n) with its neighbors, where f(n) is a local maximum value when f(n) > f(n-1) and f(n) > f(n+1);
- f(n) is a distance value corresponding to the segmentation point
- f(n-1) is a distance value corresponding to a segmentation point before the segmentation point
- f(n+1) is a distance value corresponding to a segmentation point after the segmentation point
- a plurality of local maximum values can be determined in this way, and a segmentation point corresponding to each local maximum value is a segmentation point to be found.
- each segmentation point corresponds to a distance value, for example, S1, S2, S3, S4, S5, S6, S7, S8, S9, and S10.
- each current distance value is compared with the previous distance value and the subsequent distance value to determine the local maximum values.
- the comparison module 505 determining of local maximum values according to all the distance values can include:
- a smooth curve can be drawn with the segmentation points as the horizontal axis and the distance values corresponding to the segmentation points as the vertical axis, as shown in FIG. 3 .
- Taking the tangent at each point in FIG. 3 shows that the slope of the tangent at the points corresponding to S2, S4, S6, and S9 is zero, so S2, S4, S6, and S9 are determined as local maximum values.
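The neighbor-comparison rule above can be sketched directly; the example distance list is illustrative:

```python
def local_maxima(distances):
    # f(n) is a local maximum when f(n) > f(n-1) and f(n) > f(n+1);
    # returns the indices n of the qualifying segmentation points.
    return [n for n in range(1, len(distances) - 1)
            if distances[n] > distances[n - 1] and distances[n] > distances[n + 1]]

print(local_maxima([1.0, 3.0, 2.0, 5.0, 1.0, 4.0, 0.5]))  # [1, 3, 5]
```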
- the second segmentation module 506 is configured to segment the speech signal to be separated according to the segmentation points corresponding to the local maximum values to obtain new speech segments.
- the speech signal to be separated is again segmented by using the segmentation points corresponding to the local maximum values as a new segmentation point, so as to obtain a plurality of new speech fragments.
- a process of re-segmentation is to find time points of dialogue exchange between two different speakers from the speech signal to be separated, and then the speech signal to be separated can be segmented into several speech segments according to the time points. Each speech segment contains the speech of only one speaker.
- each new speech segment contains only one speaker's speech.
- the clustering module 507 is configured to cluster the new speech segments into speech segments of two speakers.
- All speech segments belonging to one speaker are clustered by a clustering method and then recombined together.
- the clustering method can be a K-means clustering or a bottom-up hierarchical clustering (HAC).
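A minimal K-means with K=2 (one of the two clustering methods mentioned) can be sketched as follows; the segment embeddings and iteration count are illustrative assumptions:

```python
import numpy as np

def two_speaker_kmeans(segment_vectors, iters=20, seed=0):
    # Minimal K-means with K=2: each speech-segment embedding is assigned
    # to one of the two speakers (HAC would serve the same purpose).
    X = np.asarray(segment_vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(iters):
        # assign each segment to the nearest center, then re-estimate centers
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

segments = [[0.1, 0.0], [0.0, 0.2], [5.0, 5.1], [4.9, 5.0]]
labels = two_speaker_kmeans(segments)
print(labels)  # two segments per speaker
```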
- in the two-speaker separation method of the present disclosure, a speech signal to be separated is processed.
- a first sliding window and a second sliding window that are adjacent to each other are established and slid from a starting position of the processed speech signal.
- a first speech segment and a second speech segment are obtained according to the segmentation point of the first sliding window and the second sliding window.
- the first speech segment is inputted into a speaker separation model for feature extraction to obtain a first speech vector
- the second speech segment is inputted into the speaker separation model for feature extraction to obtain a second speech vector.
- a distance value between the first speech vector and the second speech vector is calculated as a distance value corresponding to the segmentation point.
- the sliding windows can be moved according to a preset time period, and each time the sliding windows are moved, two speech fragments are obtained, until the second sliding window reaches the end of the processed speech signal and the distance value corresponding to each segmentation point is obtained.
- local maximum values are determined according to the distance values, and the speech signal to be separated is segmented according to the segmentation points corresponding to the local maximum values to obtain new speech segments.
- the new speech segment is clustered into the respective speech of different speakers.
- a plurality of speech fragments are obtained through several sliding processes.
- the trained speaker separation model is used to extract features of the speech fragments. The calculated distance values are compared to determine the local maximum values, the segmentation points corresponding to the local maximum values are used as new segmentation points to segment the speech signal to be separated again, and the speech segments of the two speakers are obtained; the separation is thus effective.
- FIG. 6 is a schematic structural diagram of a terminal provided in embodiment 5 of the present disclosure.
- the terminal 3 may include: a memory 31 , at least one processor 32 , computer-readable instructions 33 stored in the memory 31 and executable on the at least one processor 32 , and at least one communication bus 34 .
- the at least one processor 32 executes the computer-readable instructions 33 to implement the steps in the speaker separation model training method and/or two speaker separation method described above.
- the computer-readable instructions 33 can be divided into one or more modules/units, and the one or more modules/units are stored in the memory 31 and executed by the at least one processor 32 to complete the speaker separation model training method and/or the two speaker separation method of the present disclosure.
- the one or more modules/units can be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe execution processes of the computer-readable instructions 33 in the terminal 3 .
- the terminal 3 can be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
- the schematic diagram is only an example of the terminal 3, and does not constitute a limitation on the terminal 3.
- Another terminal 3 may include more or fewer components than shown in the figures, or combine some components or have different components.
- the terminal 3 may further include an input/output device, a network access device, a bus, and the like.
- the at least one processor 32 can be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
- the processor 32 can be a microprocessor, or the processor 32 can be any conventional processor.
- the processor 32 is a control center of the terminal 3 , and connects various parts of the entire terminal 3 by using various interfaces and lines.
- the memory 31 can be configured to store the computer-readable instructions 33 and/or modules/units.
- the processor 32 may run or execute the computer-readable instructions and/or modules/units stored in the memory 31 , and may call data stored in the memory 31 to implement various functions of the terminal 3 .
- the memory 31 mainly includes a storage program area and a storage data area.
- the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), etc.
- the storage data area may store data (such as audio data, a phone book, etc.) created according to use of the terminal 3 .
- the memory 31 may include a high-speed random access memory, and may also include a non-transitory storage medium, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), and a secure digital (SD) Card, a flash card, at least one disk storage device, a flash memory device, or other non-transitory solid-state storage device.
- When the modules/units integrated in the terminal 3 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a non-transitory readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments implemented by the present disclosure can also be completed by related hardware instructed by computer-readable instructions.
- the computer-readable instructions can be stored in a non-transitory readable storage medium.
- the computer-readable instructions when executed by the processor, may implement the steps of the foregoing method embodiments.
- the computer-readable instructions include computer-readable instruction codes, and the computer-readable instruction codes can be in a source code form, an object code form, an executable file, or some intermediate form.
- the non-transitory readable storage medium can include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a U disk, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM).
- each functional unit in each embodiment of the present disclosure can be integrated into one processing unit, or can be physically present separately in each unit, or two or more units can be integrated into one unit.
- the above integrated unit can be implemented in a form of hardware or in a form of a software functional unit.
Abstract
Description
- This application claims priority of Chinese Patent Application No. 201810519521.6, entitled “speaker separation model training method, two-speaker separation method and related equipment” filed on May 28, 2018 in the China National Intellectual Property Administration (CNIPA), the entire contents of which are incorporated by reference herein.
- The present disclosure relates to a technical field of biometrics, specifically a speaker separation model training method, a two-speaker separation method, a terminal, and a storage medium.
- Separating and obtaining specific human voices from others within massive amounts of data, such as telephone recordings, news broadcasts, conference recordings, etc. is often necessary. Speaker separation technology refers to a process of automatically dividing speech according to the speakers from a multi-person conversation and labeling it, to solve a problem of “when and who speaks.”
- Separation of two speakers refers to separating recordings of two speakers speaking one after the other on the same audio track into two audio tracks, each audio track containing the recording of one speaker. The separation of two speakers is widely used in many fields, and has extensive needs in industries and fields such as radio, television, media, and customer service centers.
- Traditional speaker separation technology that uses Bayesian Information Criterion (BIC) as a similarity measure can achieve better results in a separation task of short-term conversations, but as the duration of the conversations increases, the Single Gaussian model of the BIC is not enough to describe the distribution of different speaker data, so speaker separation capability is poor.
- In view of the above, a speaker separation model training method, a two-speaker separation method, a terminal, and a storage medium are disclosed. Training the speaker separation model in advance significantly enhances the feature extraction capability of the model on input speech data and reduces a risk of performance degradation when the network level deepens; separating the two speakers' speech according to the trained speaker separation model improves the accuracy of the two-speaker separation, especially for lengthy conversations.
- A first aspect of the present disclosure provides a speaker separation model training method, the method includes:
- acquiring a plurality of audio data of multiple speakers;
- processing each of the plurality of audio data:
- extracting audio features of the processed audio data;
- inputting the audio features into a preset neural network model for training to obtain vector features;
- selecting a first vector feature and a second vector feature of a first speaker, and calculating a first similarity value between the first vector feature and the second vector feature according to a preset first similarity function;
- selecting a third vector feature of a second speaker, and calculating a second similarity value between the first vector feature and the third vector feature according to a preset second similarity function;
- inputting the first similarity value and the second similarity value into a preset loss function to calculate a loss function value; and when the loss function value is less than or equal to a preset loss function threshold, ending the training process of the speaker separation model, and updating parameters in the speaker separation model.
- A second aspect of the present disclosure provides a two-speaker separation method, the method includes:
- 1) processing a speech signal to be separated;
- 2) establishing a first sliding window and a second sliding window that are adjacent to each other and slide from a starting position of the processed speech signal, and obtaining a first speech segment and a second speech segment according to the segmentation point of the first sliding window and the second sliding window;
- 3) inputting the first speech segment into a speaker separation model for feature extraction to obtain a first speech vector, and inputting the second speech segment into the speaker separation model for feature extraction to obtain a second speech vector, where the speaker separation model is trained by using the method according to any one of claims 1 to 5;
- 4) calculating a distance value between the first speech vector and the second speech vector as a distance value corresponding to the segmentation point;
- 5) moving the first sliding window and the second sliding window simultaneously along a time axis direction for a preset time period, and repeating steps 2)-5) until the second sliding window reaches an end of the processed speech signal;
- 6) acquiring the distance value corresponding to each segmentation point, and determining local maximum values according to all the distance values;
- 7) segmenting the speech signal to be separated according to the segmentation points corresponding to the local maximum values to obtain new speech segments;
- 8) clustering the new speech segments into speech segments of two speakers.
- A third aspect of the present disclosure provides a terminal, the terminal includes a processor and a storage device, and the processor executes computer-readable instructions stored in the storage device to implement the speaker separation model training method and/or the two-speaker separation method.
- A fourth aspect of the present disclosure provides a non-transitory storage medium having stored thereon computer-readable instructions that, when executed by a processor, implement the speaker separation model training method and/or the two-speaker separation method.
- The speaker separation model training method, the two-speaker separation method, the terminal, and the storage medium described in the present disclosure train the speaker separation model in advance, which enhances the feature extraction capability of the model on input speech data and reduces a risk of performance degradation when the network hierarchy deepens; separating the two speakers' speech according to the trained speaker separation model improves the accuracy of the two-speaker separation, especially for a long conversation.
FIG. 1 shows a flowchart a speaker separation model training method provided in an embodiment of the present disclosure. -
FIG. 2 shows a flowchart of a two-speaker separation method provided in an embodiment of the present disclosure. -
FIG. 3 shows a schematic diagram of determining a local maximum according to a segmentation point and a corresponding distance value provided in an embodiment of the present disclosure. -
FIG. 4 shows a schematic structural diagram of a speaker separation model training device provided in an embodiment of the present disclosure. -
FIG. 5 shows a schematic structural diagram of a two-speaker separation device provided in an embodiment of the present disclosure. -
FIG. 6 shows a schematic structural diagram of a terminal provided in an embodiment of the present disclosure. - The following specific embodiments will further explain the present disclosure in combination with the above drawings.
- For clarity of illustration of the objectives, features and advantages of the present disclosure, the drawings combined with the detailed description illustrate the embodiments of the present disclosure hereinafter. It is noted that embodiments of the present disclosure and features of the embodiments can be combined when there is no conflict.
- Various details are described in the following descriptions for a better understanding of the present disclosure; however, the present disclosure may also be implemented in ways other than those described herein. The scope of the present disclosure is not to be limited by the specific embodiments disclosed below.
- Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present disclosure belongs. The terms used herein in the present disclosure are only for the purpose of describing specific embodiments, and are not intended to limit the present disclosure.
- The speaker separation model training method and/or the two-speaker separation method in the embodiments of the present disclosure are applied to one or more electronic terminals. The speaker separation model training method and/or the two-speaker separation method may also be applied to a hardware environment composed of a terminal and a server connected to the terminal through a network. The network includes, but is not limited to: a wide area network, a metropolitan area network, or a local area network. The speaker separation model training method and/or the two-speaker separation method in the embodiments of the present disclosure can be executed by the server, by the terminal, or by both the server and the terminal. For example, the speaker separation model training method in the embodiment of the present disclosure is executed by the server, and the two-speaker separation method in the embodiment of the present disclosure is executed by the terminal.
- For a terminal that needs to perform the speaker separation model training method and/or the two-speaker separation method, the speaker separation model training function and/or the two-speaker separation function provided by the methods of the present disclosure can be directly integrated on the terminal, or installed as a client for implementing the methods of the present disclosure. For another example, the methods provided in the present disclosure can also run on a server or other device in the form of a Software Development Kit (SDK), providing the speaker separation model training function and/or the two-speaker separation function through an SDK interface. The terminal or other equipment can implement the speaker separation model training method and/or the two-speaker separation method through the provided interface.
FIG. 1 is a flowchart of a speaker separation training method in an embodiment of the present disclosure. According to different requirements, the order of the steps in the flow can be changed, and some steps can be omitted. Within each step, sub-steps can be sub-numbered. - In
block 11, acquiring a plurality of audio data of multiple speakers. - In the embodiment, the acquiring of the plurality of audio data may include the following two manners:
- (1) An audio device (for example, a voice recorder, etc.) is set in advance, and the speeches of a plurality of people talking amongst themselves are recorded on-site through the audio device to obtain audio data.
- (2) Acquiring a plurality of audio data from an audio data set.
- The audio data set is an open source data set, such as a UBM data set and a TV data set. The open source audio data set is dedicated to training a speaker separation model and testing an accuracy of the trained speaker separation model. The UBM data set and the TV data set are from NIST04, NIST05 and NIST06, including about 500 h of audio data, recording a total of 577 speakers, with an average of about 15 sentences per person.
- In
block 12, processing each of the plurality of audio data. - In the embodiment, after acquiring the plurality of audio data, processing of the plurality of audio data is required. Processing of the audio data includes one or more of the following combinations:
- 1) performing noise reduction processing on the audio data;
- The acquired audio data may contain various noises. In order to extract purest audio data from the original noisy audio data, a low-pass filter can be used to remove white noise and random noise in the audio data.
- 2) performing voice activity detection on the noise-reduced audio data, and deleting invalid audio to obtain standard audio data samples;
- In the embodiment, a dual-threshold comparison method can be used for voice activity detection to detect valid audio and invalid audio in the audio data. The valid audio is speech; the invalid audio is everything other than the valid audio, including but not limited to recorded silent passages.
- 3) labeling the standard audio data samples to indicate the speaker to which each of the standard audio data samples belongs.
- A label refers to an identity attribute tag representing the audio data. For example, a first audio data of speaker A is labeled with an identity attribute tag 01, a second audio data of speaker B is labeled with an identity attribute tag 02.
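The dual-threshold voice activity detection of step 2) above can be illustrated with a simplified energy-based sketch; real implementations typically also use the zero-crossing rate, and the thresholds here are arbitrary assumptions:

```python
import numpy as np

def dual_threshold_vad(frames, high=0.5, low=0.1):
    # Frames whose short-time energy exceeds the high threshold are
    # marked as speech; speech regions are then extended through
    # neighboring frames whose energy still exceeds the low threshold.
    energy = (np.asarray(frames, dtype=float) ** 2).mean(axis=1)
    speech = energy > high
    for i in range(1, len(speech)):           # extend speech regions forward
        if speech[i - 1] and energy[i] > low:
            speech[i] = True
    for i in range(len(speech) - 2, -1, -1):  # extend speech regions backward
        if speech[i + 1] and energy[i] > low:
            speech[i] = True
    return speech

frames = np.array([[0.0, 0.0], [0.4, 0.4], [1.0, 1.0], [0.4, 0.4], [0.0, 0.0]])
print(dual_threshold_vad(frames))  # speech detected in the middle frames only
```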
- In
block 13, extracting audio features of the processed audio data. - In the embodiment, Mel Frequency Cepstrum Coefficient (MFCC) spectral characteristics etc. can be used to extract the audio features of the processed audio data. The MFCC is known in the prior art and not described in detail in the present disclosure.
- In block 14, inputting the audio features into a preset neural network model for training to obtain vector features. - In the embodiment, the preset neural network model is stacked using a neural network structure with a predetermined number of layers, that is, a preset number of network layers. For example, 9-12 layers in a neural network structure are set in advance to train a neural network training model.
- Specifically, each layer of the neural network structure includes: a first convolution layer, a first modified linear unit, a second convolution layer, a second modified linear unit, an average layer, a fully connected layer, and a normalization layer. The convolution kernels of the convolution layers are 3*3, the step size is 1*1, and the number of channels is 64.
- A specific process of training the preset neural network model based on the input audio features includes:
- 1) inputting the audio feature into the first convolution layer to perform a first convolution process to obtain a first convolution feature;
- 2) inputting the first convolution feature into the first modified linear unit to perform a first modified process to obtain a first modified feature;
- 3) inputting the first modified feature into the second convolution layer to perform a second convolution process to obtain a second convolution feature;
- 4) summing the audio feature and the second convolution feature and inputting the sum into the second modified linear unit to obtain a second modified feature;
- 5) inputting the second modified feature to the average layer, the fully connected layer, and the normalization layer in order to obtain a one-dimensional vector feature.
- The average layer can function as a temporal pool, used to calculate an average value of vector sequences along the time axis. The average layer calculates an average value of the vector sequences output from a forward long short-term memory network to obtain a forward average vector, and an average value of the vector sequences output from a backward long short-term memory network to obtain a backward average vector. The fully connected layer concatenates the forward average vector and the backward average vector into one vector. The normalization layer uses a normalization function to process the concatenated vector to obtain a normalized one-dimensional vector feature with a length of 1.
- In the embodiment, the normalization function can be a Euclidean distance function, a Manhattan distance function, or a minimum absolute error function. Optionally, the Euclidean distance function is used to normalize the concatenated vector processed by the fully connected layer to obtain a normalized one-dimensional vector feature. The normalization process can compress the concatenated vector processed by the fully connected layer, making it robust, thereby further improving a robustness of the speaker separation model. In addition, the normalization using the Euclidean distance function prevents the speaker separation model from overfitting, thereby improving a generalization ability of the speaker separation model. Subsequently optimizing neural network parameters in the speaker separation model becomes more stable and faster.
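The residual layer described above (two convolutions with a skip connection and modified linear units, followed by temporal averaging, a fully connected layer, and length normalization) can be sketched in NumPy with random toy weights. The 1-D simplification, toy sizes, and random weights are all assumptions for illustration; the patent specifies 3*3 kernels and 64 channels.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def conv_same(x, w):
    """'Same' 1-D convolution over the time axis: x is (time, channels),
    w is (kernel=3, channels, channels). Stride 1, zero padding."""
    pad = np.pad(x, ((1, 1), (0, 0)))
    return np.stack([
        sum(pad[t + k] @ w[k] for k in range(3)) for t in range(x.shape[0])
    ])

def residual_embedding(feats, w1, w2, w_fc):
    """Hedged sketch of one layer: conv -> ReLU -> conv, add the input back,
    ReLU, average over time, fully connected, then L2 normalization."""
    h = relu(conv_same(feats, w1))     # first conv + first modified (ReLU) unit
    h = conv_same(h, w2)               # second conv
    h = relu(feats + h)                # skip connection + second modified unit
    pooled = h.mean(axis=0)            # average layer over the time axis
    v = pooled @ w_fc                  # fully connected layer
    return v / np.linalg.norm(v)       # normalization layer: unit length

T, C = 20, 8                           # toy sizes (the patent uses 64 channels)
x = rng.standard_normal((T, C))
w1, w2 = rng.standard_normal((2, 3, C, C)) * 0.1
emb = residual_embedding(x, w1, w2, rng.standard_normal((C, C)))
```

The output `emb` is a one-dimensional vector feature with length 1, matching the normalization layer's description.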
- In block 15, selecting a first vector feature and a second vector feature of a first speaker, and calculating a first similarity value between the first vector feature and the second vector feature according to a preset first similarity function. - In the embodiment, the preset first similarity function can be a cosine similarity function, as shown in formula (1-1) below:
-
COS(xi, xj) = xi^T xj (1-1) - Wherein, xi represents the first vector feature of the first speaker, xj represents the second vector feature of the first speaker, and COS(xi, xj) is the calculated first similarity value.
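A minimal numeric check of formula (1-1), assuming (as the inner-product form implies) that the embeddings have already been normalized to unit length:

```python
import numpy as np

def first_similarity(xi, xj):
    """Formula (1-1): COS(xi, xj) = xi^T xj for unit-length embeddings."""
    return float(np.dot(xi, xj))

xi = np.array([0.6, 0.8])          # unit-length toy embeddings
xj = np.array([0.8, 0.6])
sim = first_similarity(xi, xj)     # 0.6*0.8 + 0.8*0.6 = 0.96
```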
- In block 16, selecting a third vector feature of a second speaker, and calculating a second similarity value between the first vector feature and the third vector feature according to a preset second similarity function. - In the embodiment, the preset second similarity function can be the same as or different from the preset first similarity function.
- Optionally, the preset second similarity function is an LP norm, as shown in the following formula (1-2):
- Lp(xi, yi) = (Σ|xi − yi|^p)^(1/p) (1-2)
- Wherein, xi represents the first vector feature of the first speaker, yi represents the third vector feature of the second speaker, and Lp(xi, yi) is the calculated second similarity value.
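The Lp-norm similarity can be illustrated as below. The standard form (Σ|xi − yi|^p)^(1/p) is a reconstruction consistent with the variables defined for formula (1-2), not necessarily the exact patented expression.

```python
import numpy as np

def lp_distance(xi, yi, p=2):
    """Illustrative Lp-norm dissimilarity between two embeddings:
    (sum |xi - yi|^p)^(1/p). p=2 gives the Euclidean distance."""
    return float(np.sum(np.abs(xi - yi) ** p) ** (1.0 / p))

d2 = lp_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))        # sqrt(2)
d1 = lp_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]), p=1)   # 2.0
```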
- In block 17, inputting the first similarity value and the second similarity value into a preset loss function to calculate a loss function value; and when the loss function value is less than or equal to a preset loss function threshold, ending the training process of the speaker separation model, and updating parameters in the speaker separation model. - In the embodiment, the preset loss function can be as shown in the following formula (1-3):
- L = Σi max(0, α + Si13 − Si12) (1-3)
- wherein, α is a positive constant and generally ranges from 0.05 to 0.2. Si13 is the second similarity value, that is, a value of similarity between the first vector feature of the first speaker and the third vector feature of the second speaker. Si12 is the first similarity value, that is, a similarity between the first vector feature of the first speaker and the second vector feature of the first speaker. L is the calculated loss function value.
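A hinge ("triplet"-style) reading of the loss can be sketched as follows. The max(0, α + Si13 − Si12) form is an assumption based on the margin α and the two similarity values described above: it is zero once the same-speaker similarity exceeds the cross-speaker similarity by at least α.

```python
def triplet_style_loss(s12, s13, alpha=0.1):
    """Hedged sketch of a formula (1-3)-style loss: a hinge pushing the
    same-speaker similarity s12 above the cross-speaker similarity s13
    by at least the margin alpha. The exact patented form may differ."""
    return max(0.0, alpha + s13 - s12)

loss_bad = triplet_style_loss(s12=0.3, s13=0.8)   # wrong ordering -> positive loss
loss_ok = triplet_style_loss(s12=0.9, s13=0.1)    # margin satisfied -> zero loss
```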
- The present disclosure acquires a plurality of audio data as training samples and labels the audio data. Audio features of the audio data are extracted. Vector features are obtained by inputting the audio features into a preset neural network model for training. A first vector feature and a second vector feature of a first speaker are selected, and a first similarity value between the first vector feature and the second vector feature is calculated according to a preset first similarity function. A third vector feature of a second speaker is selected, and a second similarity value between the first vector feature and the third vector feature is calculated according to a preset second similarity function. The first similarity value and the second similarity value are inputted into a preset loss function to calculate a loss function value; and when the loss function value is less than or equal to a preset loss function threshold, the training process of the speaker separation model is ended, and parameters in the speaker separation model are updated. The speaker separation model trained on the convolutional neural network in the present disclosure has strong feature extraction capabilities, which reduces a risk of performance degradation as the network deepens. In addition, vector features of different audio data of the same speaker can be kept as similar as possible, and vector features of the audio data of different speakers as different as possible, so that the calculated loss function reaches the convergence condition faster. Training time of the speaker separation model is saved, and a separation efficiency of the speaker separation model is improved.
-
FIG. 2 is a flowchart of a method for two-speaker separation in an embodiment of the present disclosure. According to different requirements, the order of the steps in the flow can be changed, and some can be omitted. Within each step, sub-steps can be sub-numbered. - In block 21, processing a speech signal to be separated. - In the embodiment, a process of processing the speech signal to be separated includes:
- 1) Pre-emphasis processing
- In the embodiment, a digital filter can be used to perform pre-emphasis processing on the speech signal to be separated, to boost the high-frequency part of the speech signal. The details are shown below in formula (2-1):
-
S̃(n) = S(n) − a*S(n−1) (2-1) - wherein, S(n) is the speech signal to be separated, a is a pre-emphasis coefficient (0.95 is generally taken), and S̃(n) is the speech signal after the pre-emphasis processing.
- Due to factors such as human vocal organs and the equipment that collects speech signals, problems such as aliasing and higher-order harmonic distortion readily appear in the collected speech signals. Pre-emphasis processing of the speech signal to be separated compensates the high-frequency parts of the speech signal that are suppressed by the pronunciation system and highlights the high-frequency formants, ensuring that the speech signal to be separated is more uniform and smoother, and improving the actual separation of the speech signal to be separated.
- 2) Framed Processing
- The speech signal to be separated can be framed according to a preset framing parameter. The preset framing parameter can be, for example, a frame length of 10-30 ms. Optionally, the speech signal to be separated is framed every 25 milliseconds to obtain a plurality of speech frames. After framing, the speech signal to be separated becomes a time series composed of the characteristic parameters of each frame.
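The two processing steps above can be sketched together. The 16 kHz sample rate and non-overlapping frames are simplifying assumptions; practical front ends usually use overlapping frames.

```python
import numpy as np

def pre_emphasis(s, a=0.95):
    """Formula (2-1): S~(n) = S(n) - a * S(n-1), with S~(0) = S(0)."""
    out = s.copy()
    out[1:] = s[1:] - a * s[:-1]
    return out

def frame_signal(s, sr=16000, frame_ms=25):
    """Split the signal into non-overlapping 25 ms frames (a simplification)."""
    flen = int(sr * frame_ms / 1000)
    n = len(s) // flen
    return s[:n * flen].reshape(n, flen)

sig = np.ones(16000)                          # 1 s of a constant signal
frames = frame_signal(pre_emphasis(sig))      # 40 frames of 400 samples each
```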
- In block 22, establishing a first sliding window and a second sliding window that are adjacent to each other and sliding from a starting position of the processed speech signal, and obtaining a first speech segment and a second speech segment according to the segmentation point of the first sliding window and the second sliding window. - In the embodiment, a length of the first sliding window and the second sliding window can be 0.7-2 seconds. A segmentation point between the first sliding window and the second sliding window is a possible speaker dividing point of the processed voice signal. The first sliding window corresponds to the first speech segment, and the second sliding window corresponds to the second speech segment. The segmentation point between the first sliding window and the second sliding window is also the segmentation point between the first speech segment and the second speech segment.
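The paired sliding windows can be sketched as follows. The 1-second window and 5 ms hop are example values within the ranges stated in the text; each triple pairs the two window contents with their shared boundary, the candidate speaker-change point.

```python
import numpy as np

def candidate_segmentations(signal, sr=16000, win_s=1.0, hop_ms=5):
    """Enumerate (first-window, second-window, split-index) triples: two
    adjacent windows slide together along the signal, and their boundary
    is a candidate speaker segmentation point."""
    wlen = int(win_s * sr)
    hop = int(hop_ms * sr / 1000)
    out = []
    start = 0
    while start + 2 * wlen <= len(signal):
        split = start + wlen
        out.append((signal[start:split], signal[split:split + wlen], split))
        start += hop
    return out

triples = candidate_segmentations(np.zeros(16000 * 3))  # 3 s of audio
```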
- In block 23, inputting the first speech segment into a speaker separation model for feature extraction to obtain a first speech vector, and inputting the second speech segment into the speaker separation model for feature extraction to obtain a second speech vector. - The first speech segment is inputted into the trained speaker separation model, and the trained speaker separation model extracts an MFCC (Mel Frequency Cepstrum Coefficient) of the first speech segment to obtain a first speech vector. The second speech segment is inputted into the trained speaker separation model, and the trained speaker separation model extracts an MFCC of the second speech segment to obtain a second speech vector.
- In block 24, calculating a distance value between the first speech vector and the second speech vector as a distance value corresponding to the segmentation point. - In the embodiment, a preset distance function can be used to calculate a distance value between each first speech vector and each corresponding second speech vector. The preset distance function can be, for example, a Euclidean distance. A process of calculating the distance value of the first speech vector and the second speech vector by using the Euclidean distance function is not described in the present disclosure.
- In block 25, moving the first sliding window and the second sliding window simultaneously along a time axis direction for a preset time period, and repeating blocks 22-25 until the second sliding window reaches an end of the processed speech signal. - In the embodiment, the preset time period can be 5 ms. By sliding the first sliding window and the second sliding window on the processed voice signal, a plurality of segmentation points of the sliding windows can be obtained, thereby obtaining a plurality of first speech fragments and a plurality of second speech fragments. That is, each time the first sliding window and the second sliding window are slid at the same time according to the preset time period, a candidate segmentation point is obtained. Each candidate segmentation point is the segmentation point of the first speech segment and the second speech segment, and a distance value can be calculated correspondingly. There are as many distance values as there are segmentation points.
- In block 26, acquiring the distance value corresponding to each segmentation point, and determining local maximum values according to all the distance values. - In the embodiment, a specific process of determining local maximum values according to all the distance values includes:
- arranging the distance values corresponding to the segmentation points in chronological order of the segmentation points;
- determining whether f(n) is greater than or equal to both f(n−1) and f(n+1), where f(n) is a distance value corresponding to the segmentation point, f(n−1) is a distance value corresponding to a segmentation point before the segmentation point, and f(n+1) is a distance value corresponding to a segmentation point after the segmentation point;
- when f(n)≥f(n−1) and f(n)≥f(n+1), determining that f(n) is the local maximum.
- A plurality of local maximum values can be determined, and a segmentation point corresponding to each local maximum value is a segmentation point to be found.
- For example, suppose that 10 segmentation points are obtained according to the sliding of the first sliding window and the second sliding window, such as T1, T2, T3, T4, T5, T6, T7, T8, T9, and T10. Each segmentation point corresponds to a distance value, for example, S1, S2, S3, S4, S5, S6, S7, S8, S9, and S10. The 10 distance values are arranged in chronological order of the segmentation points. If S2>=S1 and S2>=S3, S2 is a local maximum value. Then determine whether S4>=S3 and S4>=S5. If S4>=S3 and S4>=S5, then S4 is a local maximum value. By analogy, each remaining distance value is compared with the distance values before and after it to determine the local maximum values.
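The comparison rule above can be sketched directly. The toy distance values below are illustrative, shaped so that S2, S4, S6, and S9 (0-based indices 1, 3, 5, 8) come out as local maxima:

```python
import numpy as np

def local_maxima(dist):
    """Indices n with f(n) >= f(n-1) and f(n) >= f(n+1); endpoints are
    not candidates because they lack a neighbour on one side."""
    dist = np.asarray(dist)
    return [n for n in range(1, len(dist) - 1)
            if dist[n] >= dist[n - 1] and dist[n] >= dist[n + 1]]

# toy distance values S1..S10 in chronological order of segmentation points
s = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.2, 0.1, 0.6, 0.2]
peaks = local_maxima(s)   # 0-based indices of S2, S4, S6, S9
```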
- In an alternative embodiment, the determining of local maximum values according to all the distance values can include:
- drawing a smooth curve with the segmentation points as the horizontal axis and the distance values corresponding to the segmentation points as the vertical axis.
- calculating a slope of a tangent to each point on the curve;
- determining as the local maximum a distance value corresponding to the point where the slope of the tangent is zero.
- In order to visualize the local maximum values, a smooth curve can be drawn with the segmentation points as the horizontal axis and the distance values corresponding to the segmentation points as the vertical axis, as shown in
FIG. 3. Taking the tangent at each point in FIG. 3 shows that the slope of the tangent at the points corresponding to S2, S4, S6, and S9 is zero, and S2, S4, S6, and S9 are determined as local maximum values. - In block 27, segmenting the speech signal to be separated according to the segmentation points corresponding to the local maximum values to obtain new speech segments. - In the embodiment, after determining the local maximum values, the speech signal to be separated is segmented again by using the segmentation points corresponding to the local maximum values as new segmentation points, so as to obtain a plurality of new speech fragments. The purpose of re-segmentation is to find the time points of dialogue exchange between two different speakers in the speech signal to be separated; the speech signal to be separated can then be segmented into several speech segments according to those time points. Each speech segment contains the speech of only one speaker.
- For example, if S2, S4, S6, and S9 are the local maximum values, the corresponding segmentation points T2, T4, T6, and T9 are used as new segmentation points to segment the speech signal to be separated, to obtain 5 new speech segments. Each new speech segment contains only one speaker's speech.
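Re-segmentation at the chosen points can be sketched as a simple split; representing segmentation points as sample indices is an assumption for illustration:

```python
import numpy as np

def split_at(signal, points):
    """Cut the signal at the chosen segmentation points (sample indices),
    yielding one segment per stretch between consecutive change points."""
    edges = [0] + sorted(points) + [len(signal)]
    return [signal[a:b] for a, b in zip(edges[:-1], edges[1:])]

sig = np.arange(100)
segments = split_at(sig, [20, 40, 60, 90])   # 4 change points -> 5 segments
```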
- In block 28, clustering the new speech segments into speech segments of two speakers. - All speech segments belonging to one speaker are clustered by a clustering method and then recombined together.
- In the embodiment, the clustering method can be a K-means clustering or a bottom-up hierarchical clustering (HAC). The clustering method is known in the prior art and is not described in detail in the present disclosure.
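As a sketch only, K-means with K=2 on toy 2-D segment embeddings; a real system would use a library implementation of K-means or hierarchical clustering (HAC) as the text notes.

```python
import numpy as np

def two_means(vectors, iters=20, seed=0):
    """Tiny K-means (K=2) sketch for grouping segment embeddings by
    speaker: alternate between assigning each vector to its nearest
    centre and recomputing each centre as its cluster mean."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), 2, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(vectors[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = vectors[labels == k].mean(axis=0)
    return labels

# two well-separated toy "speakers"
v = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = two_means(v)
```

All segments sharing a label are recombined into one speaker's speech.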
- It can be known from the above that the present disclosure processes a speech signal to be separated. A first sliding window and a second sliding window that are adjacent to each other are established and slid from a starting position of the processed speech signal. A first speech segment and a second speech segment are obtained according to the segmentation point of the first sliding window and the second sliding window. The first speech segment is inputted into a speaker separation model for feature extraction to obtain a first speech vector, and the second speech segment is inputted into the speaker separation model for feature extraction to obtain a second speech vector. A distance value between the first speech vector and the second speech vector is calculated as a distance value corresponding to the segmentation point. The sliding windows can be moved according to a preset time period, and each time the sliding windows are moved, two speech fragments are obtained, until the second sliding window reaches the end of the pre-processed voice signal and the distance value corresponding to each segmentation point is obtained. Local maximum values are determined according to the distance values, and the speech signal to be separated is segmented according to the segmentation points corresponding to the local maximum values to obtain new speech segments. The new speech segments are clustered into the respective speech of different speakers. In the present disclosure, a plurality of speech fragments are obtained through several sliding processes. The trained speaker separation model is used to extract features of the speech fragments. The local maximum values are determined by comparing the calculated distance values, the segmentation points corresponding to the local maximum values are used as new segmentation points to segment the speech signal to be separated again, and the speech fragments of the two speakers are obtained; the separation is effective.
- The embodiments described above are only specific implementations of the present disclosure, but a scope of protection of the present disclosure is not limited to this. For those of ordinary skill in the art, without departing from the creative concept of this disclosure, they can also make improvements, but these all belong to the scope of this disclosure.
- The following describes functional modules and hardware structure of a terminal that implements the above-mentioned speaker separation model training method and the two-speaker separation method, with reference to FIGS. 4-6. -
FIG. 4 is a schematic structural diagram of a device in a preferred embodiment for speaker separation model training. - In some embodiments, the speaker separation model training device 40 runs in a terminal. The speaker separation model training device 40 can include a plurality of function modules consisting of program code segments. The program code of each program code segment in the speaker separation model training device 40 can be stored in a memory and executed by at least one processor to train the speaker separation model (described in detail in FIG. 1). - In the embodiment, the speaker separation model training device 40 in the terminal can be divided into a plurality of functional modules, according to the performed functions. The functional modules can include: an acquisition module 401, a processing module 402, a feature extraction module 403, a training module 404, a calculation module 405, and an update module 406. A module as referred to in the present disclosure refers to a series of computer-readable instruction segments that can be executed by at least one processor and that are capable of performing fixed functions, which are stored in a memory. In some embodiments, the functions of each module will be detailed in the following embodiments. - The
acquisition module 401 is configured to acquire a plurality of audio data of multiple speakers. - In the embodiment, the acquiring of the plurality of audio data may include the following two manners:
- (1) An audio device (for example, a voice recorder, etc.) is set in advance, and the speeches of a plurality of people talking amongst themselves are recorded on-site through the audio device to obtain audio data.
- (2) Acquire a plurality of audio data from an audio data set.
- The audio data set is an open source data set, such as a UBM data set and a TV data set. The open source audio data set is dedicated to training a speaker separation model and testing an accuracy of the trained speaker separation model. The UBM data set and the TV data set are from NIST04, NIST05 and NIST06, including about 500 h of audio data, recording a total of 577 speakers, with an average of about 15 sentences per person.
- The
processing module 402 is configured to process each of the plurality of audio data. - In the embodiment, after acquiring the plurality of audio data, processing of the plurality of audio data is required. The processing performed by the processing module 402 on the audio data includes one or more of the following combinations: - 1) performing noise reduction processing on the audio data;
- The acquired audio data may contain various noises. In order to extract purest audio data from the original noisy audio data, a low-pass filter can be used to remove white noise and random noise in the audio data.
- 2) performing voice activity detection on the noise-reduced audio data, and deleting invalid audio to obtain standard audio data samples:
- In the embodiment, a dual-threshold comparison method can be used for voice activity detection to detect valid audio and invalid audio in the audio data. The valid audio is speech; the invalid audio is everything other than speech, including but not limited to recorded silent passages.
- 3) labeling the standard audio data samples to indicate the speaker to which each of the standard audio data samples belongs.
- A label refers to an identity attribute tag representing the audio data. For example, a first audio data of speaker A is labeled with an identity attribute tag 01, a second audio data of speaker B is labeled with an identity attribute tag 02.
- The
feature extraction module 403 is configured to extract audio features of the processed audio data. - In the embodiment, Mel Frequency Cepstrum Coefficient (MFCC) spectral characteristics etc. can be used to extract the audio features of the processed audio data. The MFCC is known in the prior art and not described in detail in the present disclosure.
- The
training module 404 is configured to input the audio features into a preset neural network model for training, to obtain vector features. - In the embodiment, the preset neural network model is stacked using a neural network structure with a predetermined number of layers, that is, a preset number of network layers. For example, 9-12 layers in a neural network structure are set in advance to train a neural network training model.
- Specifically, each layer of the neural network structure includes: a first convolution layer, a first modified linear unit, a second convolution layer, a second modified linear unit, an average layer, a fully connected layer, and a normalization layer. The convolution kernels of the convolution layers are 3*3, the step size is 1*1, and the number of channels is 64.
- A specific process of training the preset neural network model based on the input audio features includes:
- 1) inputting the audio feature into the first convolution layer to perform a first convolution process to obtain a first convolution feature;
- 2) inputting the first convolution feature into the first modified linear unit to perform a first modified process to obtain a first modified feature;
- 3) inputting the first modified feature into the second convolution layer to perform a second convolution process to obtain a second convolution feature;
- 4) summing the audio feature and the second convolution feature and inputting the sum into the second modified linear unit to obtain a second modified feature;
- 5) inputting the second modified feature to the average layer, the fully connected layer, and the normalization layer in order to obtain a one-dimensional vector feature.
- The average layer can function as a temporal pool, used to calculate an average value of vector sequences along the time axis. The average layer calculates an average value of the vector sequences output from a forward long short-term memory network to obtain a forward average vector, and an average value of the vector sequences output from a backward long short-term memory network to obtain a backward average vector. The fully connected layer concatenates the forward average vector and the backward average vector into one vector. The normalization layer uses a normalization function to process the concatenated vector to obtain a normalized one-dimensional vector feature with a length of 1.
- In the embodiment, the normalization function can be a Euclidean distance function, a Manhattan distance function, or a minimum absolute error function. Optionally, the Euclidean distance function is used to normalize the concatenated vector processed by the fully connected layer to obtain a normalized one-dimensional vector feature. The normalization process can compress the concatenated vector processed by the fully connected layer, making it robust, thereby further improving a robustness of the speaker separation model. In addition, the normalization using the Euclidean distance function prevents the speaker separation model from overfitting, thereby improving a generalization ability of the speaker separation model. Subsequently optimizing neural network parameters in the speaker separation model becomes more stable and faster.
- The
calculation module 405 is configured to select a first vector feature and a second vector feature of a first speaker, and calculate a first similarity value between the first vector feature and the second vector feature according to a preset first similarity function. - In the embodiment, the preset first similarity function can be a cosine similarity function, as shown in formula (1-1) below:
-
COS(xi, xj) = xi^T xj (1-1) - Wherein, xi represents the first vector feature of the first speaker, xj represents the second vector feature of the first speaker, and COS(xi, xj) is the calculated first similarity value.
- The
calculation module 405 is also configured to select a third vector feature of a second speaker, and calculate a second similarity value between the first vector feature and the third vector feature according to a preset second similarity function. - In the embodiment, the preset second similarity function can be same as or different from the preset first similarity function.
- Optionally, the preset second similarity function is an LP norm, as shown in the following formula (1-2):
- Lp(xi, yi) = (Σ|xi − yi|^p)^(1/p) (1-2)
- Wherein, xi represents the first vector feature of the first speaker, yi represents the third vector feature of the second speaker, and Lp(xi, yi) is the calculated second similarity value.
- The
update module 406 is configured to input the first similarity value and the second similarity value into a preset loss function to calculate a loss function value. When the loss function value is less than or equal to a preset loss function threshold, the training process of the speaker separation model is ended, and parameters in the speaker separation model are updated. - In the embodiment, the preset loss function can be as shown in the following formula (1-3):
- L = Σi max(0, α + Si13 − Si12) (1-3)
- wherein, α is a positive constant and generally ranges from 0.05 to 0.2. Si13 is the second similarity value, that is, a value of similarity between the first vector feature of the first speaker and the third vector feature of the second speaker. Si12 is the first similarity value, that is, a similarity between the first vector feature of the first speaker and the second vector feature of the first speaker. L is the calculated loss function value.
- The present disclosure acquires a plurality of audio data as training samples and labels the audio data. Audio features of the audio data are extracted. Vector features are obtained by inputting the audio features into a preset neural network model for training. A first vector feature and a second vector feature of a first speaker are selected, and a first similarity value between the first vector feature and the second vector feature is calculated according to a preset first similarity function. A third vector feature of a second speaker is selected, and a second similarity value between the first vector feature and the third vector feature is calculated according to a preset second similarity function. The first similarity value and the second similarity value are inputted into a preset loss function to calculate a loss function value; and when the loss function value is less than or equal to a preset loss function threshold, the training process of the speaker separation model is ended, and parameters in the speaker separation model are updated. The speaker separation model trained on the convolutional neural network in the present disclosure has strong feature extraction capabilities, which reduces a risk of performance degradation as the network deepens. In addition, vector features of different audio data of the same speaker can be kept as similar as possible, and vector features of the audio data of different speakers as different as possible, so that the calculated loss function reaches the convergence condition faster. Training time of the speaker separation model is saved, and a separation efficiency of the speaker separation model is improved.
-
FIG. 5 is a schematic structural diagram of a preferred embodiment of a two-speaker separation device of the present disclosure. - In some embodiments, the two-speaker separation device 50 runs in a terminal. The two-speaker separation device 50 can include a plurality of function modules consisting of program code segments. The program code of each program code segment in the two-speaker separation device 50 can be stored in a memory and executed by at least one processor to perform separation of speech signals of two speakers to obtain two speech segments. Each segment of speech contains the speech of only one speaker (described in detail in FIG. 2). - In the embodiment, the two-speaker separation device 50 in the terminal can be divided into a plurality of functional modules, according to the performed functions. The functional modules can include: a signal processing module 501, a first segmentation module 502, a vector extraction module 503, a calculation module 504, a comparison module 505, a second segmentation module 506, and a clustering module 507. A module as referred to in the present disclosure refers to a series of computer-readable instruction segments that can be executed by at least one processor and that are capable of performing fixed functions, which are stored in a memory. In some embodiments, the functions of each module will be detailed in the following embodiments. - The above-mentioned integrated unit implemented in a form of software functional modules can be stored in a non-transitory readable storage medium. The above software function modules are stored in a storage medium and include several instructions for causing a computer device (which can be a personal computer, a dual-screen device, or a network device) or a processor to execute the method described in various embodiments in the present disclosure.
- The signal processing module 501 is configured to process a speech signal to be separated. - In the embodiment, the process by which the signal processing module 501 processes the speech signal to be separated includes: - 1) Pre-emphasis processing
- In the embodiment, a digital filter can be used to perform pre-emphasis processing on the speech signal to be separated, in order to boost the high-frequency part of the speech signal. The details are shown below in formula (2-1):
-
S̃(n)=S(n)−a*S(n−1) (2-1)
- wherein S(n) is the speech signal to be separated, a is the pre-emphasis coefficient (0.95 is typically used), and S̃(n) is the speech signal after the pre-emphasis processing.
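For illustration, formula (2-1) can be applied sample by sample as in the following Python sketch (the function name and the handling of the first sample are assumptions of this sketch, not part of the disclosure):

```python
def pre_emphasis(signal, a=0.95):
    """Apply formula (2-1): each output sample is s(n) - a*s(n-1).
    The first sample has no predecessor and is kept unchanged."""
    return [signal[0]] + [signal[n] - a * signal[n - 1]
                          for n in range(1, len(signal))]
```

A slowly varying (low-frequency) signal is strongly attenuated while rapid changes pass through, which is how the high-frequency part is boosted.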
- Due to factors such as the human vocal organs and the equipment that collects the speech signals, problems such as aliasing and higher-order harmonic distortion readily appear in the collected speech signals. Pre-emphasis processing of the speech signal to be separated compensates the high-frequency parts that are suppressed by the pronunciation system and highlights the high-frequency formants, ensuring that the speech signal to be separated is more uniform and smoother, and improving the actual separation of the speech signal.
- 2) Framing processing
- The speech signal to be separated can be framed according to a preset framing parameter, which can be, for example, a frame length of 10-30 ms. Optionally, the speech signal to be separated is framed every 25 milliseconds to obtain a plurality of speech frames. Each speech frame obtained after framing is thus a time series composed of the characteristic parameters of that frame.
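A minimal framing sketch, assuming a 16 kHz sample rate and non-overlapping 25 ms frames (both are assumptions of this sketch; the disclosure only specifies the 10-30 ms range):

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25):
    """Split a sample sequence into consecutive fixed-length frames;
    a trailing partial frame is dropped."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

At 16 kHz, a 25 ms frame contains 400 samples.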
- The first segmentation module 502 is configured to establish a first sliding window and a second sliding window that are adjacent to each other and slide from a starting position of the processed speech signal, and to obtain a first speech segment and a second speech segment according to the segmentation point between the first sliding window and the second sliding window. - In the embodiment, the length of each of the first sliding window and the second sliding window can be 0.7-2 seconds. The segmentation point between the first sliding window and the second sliding window is a possible speaker dividing point of the processed speech signal. The first sliding window corresponds to the first speech segment, and the second sliding window corresponds to the second speech segment. The segmentation point between the first sliding window and the second sliding window is also the segmentation point between the first speech segment and the second speech segment.
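The candidate segmentation points produced by sliding the two adjacent windows can be enumerated as follows (a sketch only; the tuple layout and the names are assumptions, not part of the disclosure):

```python
def candidate_segmentation_points(total_len, window_len, step):
    """Slide two adjacent windows of equal length from the start of the
    signal. Each position yields (first_window_start, segmentation_point,
    second_window_end); sliding stops when the second window would pass
    the end of the signal."""
    points = []
    start = 0
    while start + 2 * window_len <= total_len:
        points.append((start, start + window_len, start + 2 * window_len))
        start += step
    return points
```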
- The vector extraction module 503 is configured to input the first speech segment into a speaker separation model for feature extraction to obtain a first speech vector, and to input the second speech segment into the speaker separation model for feature extraction to obtain a second speech vector. - The first speech segment is inputted into the trained speaker separation model, which extracts the MFCC (Mel Frequency Cepstrum Coefficient) features of the first speech segment to obtain the first speech vector. The second speech segment is inputted into the trained speaker separation model, which extracts the MFCC features of the second speech segment to obtain the second speech vector.
- The calculation module 504 is configured to calculate a distance value between the first speech vector and the second speech vector as the distance value corresponding to the segmentation point. - In the embodiment, a preset distance function can be used to calculate the distance value between each first speech vector and each corresponding second speech vector. - The preset distance function can be, for example, the Euclidean distance function. The process of calculating the distance value between the first speech vector and the second speech vector by using the Euclidean distance function is not described in the present disclosure.
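The Euclidean distance mentioned above is standard; a minimal version for two speech vectors of equal length:

```python
import math

def euclidean_distance(vec1, vec2):
    # Square root of the summed squared component-wise differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec1, vec2)))
```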
- The first sliding window and the second sliding window are moved simultaneously along the time axis by a preset time period, and the above-mentioned modules (502-504) are repeatedly executed until the second sliding window reaches the end of the processed speech signal.
- In the embodiment, the preset time period can be 5 ms. By sliding the first sliding window and the second sliding window over the processed speech signal, a plurality of segmentation points can be obtained, thereby obtaining a plurality of first speech segments and a plurality of second speech segments. That is, each time the first sliding window and the second sliding window are slid simultaneously by the preset time period, a candidate segmentation point is obtained. Each candidate segmentation point is the segmentation point between a first speech segment and a second speech segment, and a distance value can be calculated for it. There are as many distance values as there are segmentation points.
- The comparison module 505 is configured to acquire the distance value corresponding to each segmentation point, and to determine local maximum values according to all the distance values. - In the embodiment, a specific process of the comparison module 505 determining local maximum values according to all the distance values includes: - 1) arranging the distance values corresponding to the segmentation points in chronological order of the segmentation points;
- 2) determining whether f(n) is greater than or equal to f(n−1) and greater than or equal to f(n+1), where f(n) is the distance value corresponding to a segmentation point, f(n−1) is the distance value corresponding to the segmentation point before it, and f(n+1) is the distance value corresponding to the segmentation point after it; - 3) when f(n)≥f(n−1) and f(n)≥f(n+1), determining that f(n) is a local maximum.
- A plurality of local maximum values can be determined, and a segmentation point corresponding to each local maximum value is a segmentation point to be found.
- For example, suppose that 10 segmentation points are obtained according to the sliding of the first sliding window and the second sliding window, such as T1, T2, T3, T4, T5, T6, T7, T8, T9, and T10. Each segmentation point corresponds to a distance value, for example, S1, S2, S3, S4, S5, S6, S7, S8, S9, and S10. The 10 distance values are arranged in chronological order of the segmentation points. If S2>=S1 and S2>=S3, S2 is a local maximum value. It is then determined whether S4>=S3 and S4>=S5; if S4>=S3 and S4>=S5, then S4 is a local maximum value. By analogy, each remaining distance value is compared with the distance values before and after it to determine the local maximum values.
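Steps 1)-3) amount to the following check over the time-ordered distance values (the 0-based indexing is an implementation detail of this sketch, not specified in the disclosure):

```python
def local_maxima(distances):
    """Return the indices n with f(n) >= f(n-1) and f(n) >= f(n+1);
    the first and last values have only one neighbour and are skipped."""
    return [n for n in range(1, len(distances) - 1)
            if distances[n] >= distances[n - 1]
            and distances[n] >= distances[n + 1]]
```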
- In an alternative embodiment, the comparison module 505 determining the local maximum values according to all the distance values can include: - drawing a smooth curve with the segmentation points as the horizontal axis and the distance values corresponding to the segmentation points as the vertical axis; - calculating the slope of the tangent at each point on the curve; and - determining as a local maximum each distance value corresponding to a point where the slope of the tangent is zero.
- In order to visualize the local maximum values, a smooth curve can be drawn with the segmentation points as the horizontal axis and the distance values corresponding to the segmentation points as the vertical axis, as shown in FIG. 3. Solving the tangent at each point in FIG. 3 shows that the slope of the tangent at the points corresponding to S2, S4, S6, and S9 is zero, so S2, S4, S6, and S9 are determined to be local maximum values. - The second segmentation module 506 is configured to segment the speech signal to be separated according to the segmentation points corresponding to the local maximum values to obtain new speech segments. - In the embodiment, after the local maximum values are determined, the speech signal to be separated is segmented again by using the segmentation points corresponding to the local maximum values as new segmentation points, so as to obtain a plurality of new speech segments. The purpose of re-segmentation is to find the time points at which the dialogue switches between the two speakers, and then to segment the speech signal to be separated into several speech segments according to those time points. Each speech segment contains the speech of only one speaker.
- For example, if S2, S4, S6, and S9 are the local maximum values, the corresponding segmentation points T2, T4, T6, and T9 are used as new segmentation points to segment the speech signal to be separated, to obtain 5 new speech segments. Each new speech segment contains only one speaker's speech.
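Re-segmentation at the chosen points can be sketched as a simple split (the function name and the list-of-samples representation are assumptions of this sketch):

```python
def split_at(samples, cut_points):
    """Split the signal at the given positions, producing
    len(cut_points) + 1 new segments."""
    segments, prev = [], 0
    for cut in sorted(cut_points):
        segments.append(samples[prev:cut])
        prev = cut
    segments.append(samples[prev:])
    return segments
```

With 4 new segmentation points, this yields the 5 new speech segments of the example above.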
- The clustering module 507 is configured to cluster the new speech segments into speech segments of two speakers.
- All speech segments belonging to one speaker are clustered by a clustering method and then recombined together.
- In the embodiment, the clustering method can be K-means clustering or bottom-up hierarchical agglomerative clustering (HAC). The clustering method is known in the prior art and is not described in detail in the present disclosure.
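As one possible realisation of the clustering step, a toy two-cluster K-means over fixed-length feature vectors is sketched below (the initialisation and the iteration count are arbitrary choices of this sketch, not the disclosure's method):

```python
def two_means(vectors, iters=10):
    """Assign each feature vector to one of two speakers with k-means (k=2)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    c0, c1 = vectors[0], vectors[-1]  # naive initialisation
    for _ in range(iters):
        groups = ([], [])
        for v in vectors:
            groups[0 if sq_dist(v, c0) <= sq_dist(v, c1) else 1].append(v)
        # Recompute each centroid as the mean of its group (keep old if empty).
        c0 = [sum(col) / len(groups[0]) for col in zip(*groups[0])] if groups[0] else c0
        c1 = [sum(col) / len(groups[1]) for col in zip(*groups[1])] if groups[1] else c1
    return [0 if sq_dist(v, c0) <= sq_dist(v, c1) else 1 for v in vectors]
```

Segments with the same label would then be recombined into the speech of one speaker.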
- It can be seen from the above that the present disclosure processes a speech signal to be separated. A first sliding window and a second sliding window that are adjacent to each other are established and slid from a starting position of the processed speech signal. A first speech segment and a second speech segment are obtained according to the segmentation point between the first sliding window and the second sliding window. The first speech segment is inputted into a speaker separation model for feature extraction to obtain a first speech vector, and the second speech segment is inputted into the speaker separation model for feature extraction to obtain a second speech vector. A distance value between the first speech vector and the second speech vector is calculated as the distance value corresponding to the segmentation point. The sliding windows are moved by a preset time period, and each move yields two speech segments, until the second sliding window reaches the end of the processed speech signal and the distance value corresponding to each segmentation point has been obtained. Local maximum values are determined according to the distance values, and the speech signal to be separated is segmented according to the segmentation points corresponding to the local maximum values to obtain new speech segments. The new speech segments are clustered into the respective speech of the two speakers. In the present disclosure, a plurality of speech segments is obtained through several sliding passes, and the trained speaker separation model is used to extract features of the speech segments. The calculated distance values are compared to determine the local maximum values, and the segmentation points corresponding to the local maximum values are used as new segmentation points to segment the speech signal to be separated again, so that the speech segments of the two speakers are obtained and the separation is effective.
-
FIG. 6 is a schematic structural diagram of a terminal provided in embodiment 5 of the present disclosure. - The terminal 3 may include: a memory 31, at least one processor 32, computer-readable instructions 33 stored in the memory 31 and executable on the at least one processor 32, and at least one communication bus 34. - The at least one processor 32 executes the computer-readable instructions 33 to implement the steps in the speaker separation model training method and/or the two-speaker separation method described above. - Exemplarily, the computer-readable instructions 33 can be divided into one or more modules/units, and the one or more modules/units are stored in the memory 31 and executed by the at least one processor 32 to complete the speaker separation model training method and/or the two-speaker separation method of the present disclosure. The one or more modules/units can be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution processes of the computer-readable instructions 33 in the terminal 3. - The terminal 3 can be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. Those skilled in the art will understand that the schematic diagram is merely an example of the terminal 3 and does not constitute a limitation on the terminal 3. Another terminal 3 may include more or fewer components than shown in the figure, combine some components, or have different components. For example, the terminal 3 may further include an input/output device, a network access device, a bus, and the like. - The at least one processor 32 can be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The processor 32 can be a microprocessor, or the processor 32 can be any conventional processor. The processor 32 is the control center of the terminal 3 and connects the various parts of the entire terminal 3 by using various interfaces and lines. - The memory 31 can be configured to store the computer-readable instructions 33 and/or the modules/units. The processor 32 may run or execute the computer-readable instructions and/or modules/units stored in the memory 31, and may call data stored in the memory 31 to implement various functions of the terminal 3. The memory 31 mainly includes a storage program area and a storage data area. The storage program area may store an operating system and an application program required for at least one function (such as a sound playback function, an image playback function, etc.). The storage data area may store data (such as audio data, a phone book, etc.) created according to the use of the terminal 3. In addition, the memory 31 may include a high-speed random access memory, and may also include a non-transitory storage medium, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or another non-transitory solid-state storage device. - When the modules/units integrated in the terminal 3 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a non-transitory readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments implemented by the present disclosure can also be completed by related hardware instructed by computer-readable instructions. The computer-readable instructions can be stored in a non-transitory readable storage medium and, when executed by a processor, may implement the steps of the foregoing method embodiments. The computer-readable instructions include computer-readable instruction codes, which can be in a source code form, an object code form, an executable file, or some intermediate form. The non-transitory readable storage medium can include any entity or device capable of carrying the computer-readable instruction code, such as a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM). - In the several embodiments provided in the present application, it should be understood that the disclosed terminal and methods can be implemented in other ways. For example, the embodiments of the devices described above are merely illustrative; the divisions of the units are only logical function divisions, and there can be other manners of division in actual implementation.
- In addition, each functional unit in each embodiment of the present disclosure can be integrated into one processing unit, or can be physically present separately in each unit, or two or more units can be integrated into one unit. The above integrated unit can be implemented in a form of hardware or in a form of a software functional unit.
- The present disclosure is not limited to the details of the above-described exemplary embodiments, and the present disclosure can be embodied in other specific forms without departing from the spirit or essential characteristics of the present disclosure. Therefore, the present embodiments are to be considered as illustrative and not restrictive, and the scope of the present disclosure is defined by the appended claims. All changes and variations in the meaning and scope of equivalent elements are included in the present disclosure. Any reference sign in the claims should not be construed as limiting the claim. Furthermore, the word “comprising” does not exclude other units nor does the singular exclude the plural. A plurality of units or devices stated in the system claims may also be implemented by one unit or device through software or hardware. Words such as first and second are used to indicate names, but not in any particular order.
- Finally, the above embodiments are only used to illustrate technical solutions of the present disclosure, and are not to be taken as restrictions on the technical solutions. Although the present disclosure has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solutions described in one embodiments can be modified, or some of technical features can be equivalently substituted, and that these modifications or substitutions are not to detract from the essence of the technical solutions or from the scope of the technical solutions of the embodiments of the present disclosure.
Claims (16)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810519521.6 | 2018-05-28 | ||
CN201810519521.6A CN108766440B (en) | 2018-05-28 | 2018-05-28 | Speaker separation model training method, two-speaker separation method and related equipment |
PCT/CN2018/100174 WO2019227672A1 (en) | 2018-05-28 | 2018-08-13 | Voice separation model training method, two-speaker separation method and associated apparatus |
Publications (2)
Publication Number | Publication Date |
---|---|
US20200234717A1 true US20200234717A1 (en) | 2020-07-23 |
US11158324B2 US11158324B2 (en) | 2021-10-26 |
Family
ID=64006219
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/652,452 Active 2038-08-20 US11158324B2 (en) | 2018-05-28 | 2018-08-13 | Speaker separation model training method, two-speaker separation method and computing device |
Country Status (5)
Country | Link |
---|---|
US (1) | US11158324B2 (en) |
JP (1) | JP2020527248A (en) |
CN (1) | CN108766440B (en) |
SG (1) | SG11202003722SA (en) |
WO (1) | WO2019227672A1 (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111613249A (en) * | 2020-05-22 | 2020-09-01 | 云知声智能科技股份有限公司 | Voice analysis method and equipment |
CN112071329A (en) * | 2020-09-16 | 2020-12-11 | 腾讯科技(深圳)有限公司 | Multi-person voice separation method and device, electronic equipment and storage medium |
CN112700766A (en) * | 2020-12-23 | 2021-04-23 | 北京猿力未来科技有限公司 | Training method and device of voice recognition model and voice recognition method and device |
CN112820292A (en) * | 2020-12-29 | 2021-05-18 | 平安银行股份有限公司 | Method, device, electronic device and storage medium for generating conference summary |
CN113362831A (en) * | 2021-07-12 | 2021-09-07 | 科大讯飞股份有限公司 | Speaker separation method and related equipment thereof |
CN113544700A (en) * | 2020-12-31 | 2021-10-22 | 商汤国际私人有限公司 | Neural network training method and device, and associated object detection method and device |
CN113571085A (en) * | 2021-07-24 | 2021-10-29 | 平安科技(深圳)有限公司 | Voice separation method, system, device and storage medium |
CN113657289A (en) * | 2021-08-19 | 2021-11-16 | 北京百度网讯科技有限公司 | Training method and device of threshold estimation model and electronic equipment |
US11250854B2 (en) | 2019-11-25 | 2022-02-15 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for voice interaction, device and computer-readable storage medium |
CN114363531A (en) * | 2022-01-14 | 2022-04-15 | 中国平安人寿保险股份有限公司 | H5-based case comment video generation method, device, equipment and medium |
WO2022173104A1 (en) * | 2021-02-10 | 2022-08-18 | 삼성전자 주식회사 | Electronic device supporting improvement of voice activity detection |
WO2022211590A1 (en) * | 2021-04-01 | 2022-10-06 | Samsung Electronics Co., Ltd. | Electronic device for processing user utterance and controlling method thereof |
WO2022265210A1 (en) * | 2021-06-18 | 2022-12-22 | 삼성전자주식회사 | Electronic device and personalized voice-processing method for electronic device |
CN115659162A (en) * | 2022-09-15 | 2023-01-31 | 云南财经大学 | Method, system and equipment for extracting features in radar radiation source signal pulse |
Families Citing this family (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109545186B (en) * | 2018-12-16 | 2022-05-27 | 魔门塔(苏州)科技有限公司 | Speech recognition training system and method |
CN109686382A (en) * | 2018-12-29 | 2019-04-26 | 平安科技(深圳)有限公司 | A kind of speaker clustering method and device |
CN110197665B (en) * | 2019-06-25 | 2021-07-09 | 广东工业大学 | Voice separation and tracking method for public security criminal investigation monitoring |
CN110444223B (en) * | 2019-06-26 | 2023-05-23 | 平安科技(深圳)有限公司 | Speaker separation method and device based on cyclic neural network and acoustic characteristics |
CN110289002B (en) * | 2019-06-28 | 2021-04-27 | 四川长虹电器股份有限公司 | End-to-end speaker clustering method and system |
CN110390946A (en) * | 2019-07-26 | 2019-10-29 | 龙马智芯(珠海横琴)科技有限公司 | A kind of audio signal processing method, device, electronic equipment and storage medium |
CN110718228B (en) * | 2019-10-22 | 2022-04-12 | 中信银行股份有限公司 | Voice separation method and device, electronic equipment and computer readable storage medium |
CN111312256B (en) * | 2019-10-31 | 2024-05-10 | 平安科技(深圳)有限公司 | Voice identification method and device and computer equipment |
CN110853618B (en) * | 2019-11-19 | 2022-08-19 | 腾讯科技(深圳)有限公司 | Language identification method, model training method, device and equipment |
CN110992967A (en) * | 2019-12-27 | 2020-04-10 | 苏州思必驰信息科技有限公司 | Voice signal processing method and device, hearing aid and storage medium |
CN111145761B (en) * | 2019-12-27 | 2022-05-24 | 携程计算机技术(上海)有限公司 | Model training method, voiceprint confirmation method, system, device and medium |
CN111191787B (en) * | 2019-12-30 | 2022-07-15 | 思必驰科技股份有限公司 | Training method and device of neural network for extracting speaker embedded features |
CN111370032B (en) * | 2020-02-20 | 2023-02-14 | 厦门快商通科技股份有限公司 | Voice separation method, system, mobile terminal and storage medium |
JP7359028B2 (en) * | 2020-02-21 | 2023-10-11 | 日本電信電話株式会社 | Learning devices, learning methods, and learning programs |
CN111370019B (en) * | 2020-03-02 | 2023-08-29 | 字节跳动有限公司 | Sound source separation method and device, and neural network model training method and device |
CN111009258A (en) * | 2020-03-11 | 2020-04-14 | 浙江百应科技有限公司 | Single sound channel speaker separation model, training method and separation method |
US11392639B2 (en) * | 2020-03-31 | 2022-07-19 | Uniphore Software Systems, Inc. | Method and apparatus for automatic speaker diarization |
CN111477240B (en) * | 2020-04-07 | 2023-04-07 | 浙江同花顺智能科技有限公司 | Audio processing method, device, equipment and storage medium |
CN111524521B (en) * | 2020-04-22 | 2023-08-08 | 北京小米松果电子有限公司 | Voiceprint extraction model training method, voiceprint recognition method, voiceprint extraction model training device and voiceprint recognition device |
CN111524527B (en) * | 2020-04-30 | 2023-08-22 | 合肥讯飞数码科技有限公司 | Speaker separation method, speaker separation device, electronic device and storage medium |
CN111640438B (en) * | 2020-05-26 | 2023-09-05 | 同盾控股有限公司 | Audio data processing method and device, storage medium and electronic equipment |
CN111680631B (en) * | 2020-06-09 | 2023-12-22 | 广州视源电子科技股份有限公司 | Model training method and device |
CN111785291A (en) * | 2020-07-02 | 2020-10-16 | 北京捷通华声科技股份有限公司 | Voice separation method and voice separation device |
CN111933153B (en) * | 2020-07-07 | 2024-03-08 | 北京捷通华声科技股份有限公司 | Voice segmentation point determining method and device |
CN111985934A (en) * | 2020-07-30 | 2020-11-24 | 浙江百世技术有限公司 | Intelligent customer service dialogue model construction method and application |
CN111899755A (en) * | 2020-08-11 | 2020-11-06 | 华院数据技术(上海)有限公司 | Speaker voice separation method and related equipment |
CN112071330B (en) * | 2020-09-16 | 2022-09-20 | 腾讯科技(深圳)有限公司 | Audio data processing method and device and computer readable storage medium |
CN112489682B (en) * | 2020-11-25 | 2023-05-23 | 平安科技(深圳)有限公司 | Audio processing method, device, electronic equipment and storage medium |
CN112289323B (en) * | 2020-12-29 | 2021-05-28 | 深圳追一科技有限公司 | Voice data processing method and device, computer equipment and storage medium |
WO2023281717A1 (en) * | 2021-07-08 | 2023-01-12 | 日本電信電話株式会社 | Speaker diarization method, speaker diarization device, and speaker diarization program |
WO2023047475A1 (en) * | 2021-09-21 | 2023-03-30 | 日本電信電話株式会社 | Estimation device, estimation method, and estimation program |
CN115171716B (en) * | 2022-06-14 | 2024-04-19 | 武汉大学 | Continuous voice separation method and system based on spatial feature clustering and electronic equipment |
CN117037255A (en) * | 2023-08-22 | 2023-11-10 | 北京中科深智科技有限公司 | 3D expression synthesis method based on directed graph |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0272398A (en) * | 1988-09-07 | 1990-03-12 | Hitachi Ltd | Preprocessor for speech signal |
KR100612840B1 (en) | 2004-02-18 | 2006-08-18 | 삼성전자주식회사 | Speaker clustering method and speaker adaptation method based on model transformation, and apparatus using the same |
JP2008051907A (en) * | 2006-08-22 | 2008-03-06 | Toshiba Corp | Utterance section identification apparatus and method |
WO2016095218A1 (en) * | 2014-12-19 | 2016-06-23 | Dolby Laboratories Licensing Corporation | Speaker identification using spatial information |
JP6430318B2 (en) * | 2015-04-06 | 2018-11-28 | 日本電信電話株式会社 | Unauthorized voice input determination device, method and program |
CN106683661B (en) * | 2015-11-05 | 2021-02-05 | 阿里巴巴集团控股有限公司 | Role separation method and device based on voice |
JP2017120595A (en) * | 2015-12-29 | 2017-07-06 | 花王株式会社 | Method for evaluating state of application of cosmetics |
KR102648770B1 (en) * | 2016-07-14 | 2024-03-15 | 매직 립, 인코포레이티드 | Deep neural network for iris identification |
US9824692B1 (en) * | 2016-09-12 | 2017-11-21 | Pindrop Security, Inc. | End-to-end speaker recognition using deep neural network |
JP6365859B1 (en) | 2016-10-11 | 2018-08-01 | エスゼット ディージェイアイ テクノロジー カンパニー リミテッドSz Dji Technology Co.,Ltd | IMAGING DEVICE, IMAGING SYSTEM, MOBILE BODY, METHOD, AND PROGRAM |
US10497382B2 (en) * | 2016-12-16 | 2019-12-03 | Google Llc | Associating faces with voices for speaker diarization within videos |
CN107221320A (en) * | 2017-05-19 | 2017-09-29 | 百度在线网络技术(北京)有限公司 | Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model |
CN107180628A (en) * | 2017-05-19 | 2017-09-19 | 百度在线网络技术(北京)有限公司 | Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model |
CN107342077A (en) * | 2017-05-27 | 2017-11-10 | 国家计算机网络与信息安全管理中心 | A kind of speaker segmentation clustering method and system based on factorial analysis |
CN107680611B (en) * | 2017-09-13 | 2020-06-16 | 电子科技大学 | Single-channel sound separation method based on convolutional neural network |
US10529349B2 (en) * | 2018-04-16 | 2020-01-07 | Mitsubishi Electric Research Laboratories, Inc. | Methods and systems for end-to-end speech separation with unfolded iterative phase reconstruction |
US10963273B2 (en) * | 2018-04-20 | 2021-03-30 | Facebook, Inc. | Generating personalized content summaries for users |
-
2018
- 2018-05-28 CN CN201810519521.6A patent/CN108766440B/en active Active
- 2018-08-13 WO PCT/CN2018/100174 patent/WO2019227672A1/en active Application Filing
- 2018-08-13 JP JP2019572830A patent/JP2020527248A/en active Pending
- 2018-08-13 US US16/652,452 patent/US11158324B2/en active Active
- 2018-08-13 SG SG11202003722SA patent/SG11202003722SA/en unknown
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11250854B2 (en) | 2019-11-25 | 2022-02-15 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for voice interaction, device and computer-readable storage medium |
CN111613249A (en) * | 2020-05-22 | 2020-09-01 | 云知声智能科技股份有限公司 | Voice analysis method and equipment |
CN112071329A (en) * | 2020-09-16 | 2020-12-11 | 腾讯科技(深圳)有限公司 | Multi-person voice separation method and device, electronic equipment and storage medium |
CN112700766A (en) * | 2020-12-23 | 2021-04-23 | 北京猿力未来科技有限公司 | Training method and device of voice recognition model and voice recognition method and device |
CN112820292A (en) * | 2020-12-29 | 2021-05-18 | 平安银行股份有限公司 | Method, device, electronic device and storage medium for generating conference summary |
CN113544700A (en) * | 2020-12-31 | 2021-10-22 | 商汤国际私人有限公司 | Neural network training method and device, and associated object detection method and device |
WO2022173104A1 (en) * | 2021-02-10 | 2022-08-18 | 삼성전자 주식회사 | Electronic device supporting improvement of voice activity detection |
WO2022211590A1 (en) * | 2021-04-01 | 2022-10-06 | Samsung Electronics Co., Ltd. | Electronic device for processing user utterance and controlling method thereof |
US11676580B2 (en) | 2021-04-01 | 2023-06-13 | Samsung Electronics Co., Ltd. | Electronic device for processing user utterance and controlling method thereof |
WO2022265210A1 (en) * | 2021-06-18 | 2022-12-22 | 삼성전자주식회사 | Electronic device and personalized voice-processing method for electronic device |
CN113362831A (en) * | 2021-07-12 | 2021-09-07 | 科大讯飞股份有限公司 | Speaker separation method and related equipment thereof |
CN113571085A (en) * | 2021-07-24 | 2021-10-29 | 平安科技(深圳)有限公司 | Voice separation method, system, device and storage medium |
CN113657289A (en) * | 2021-08-19 | 2021-11-16 | 北京百度网讯科技有限公司 | Training method and device of threshold estimation model and electronic equipment |
CN114363531A (en) * | 2022-01-14 | 2022-04-15 | 中国平安人寿保险股份有限公司 | H5-based case comment video generation method, device, equipment and medium |
CN115659162A (en) * | 2022-09-15 | 2023-01-31 | 云南财经大学 | Method, system and equipment for extracting features in radar radiation source signal pulse |
Also Published As
Publication number | Publication date |
---|---|
WO2019227672A1 (en) | 2019-12-05 |
CN108766440A (en) | 2018-11-06 |
US11158324B2 (en) | 2021-10-26 |
CN108766440B (en) | 2020-01-14 |
JP2020527248A (en) | 2020-09-03 |
SG11202003722SA (en) | 2020-12-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11158324B2 (en) | Speaker separation model training method, two-speaker separation method and computing device | |
Czyzewski et al. | An audio-visual corpus for multimodal automatic speech recognition | |
Vijayasenan et al. | An information theoretic approach to speaker diarization of meeting data | |
WO2020253051A1 (en) | Lip language recognition method and apparatus | |
CN111128223A (en) | Text information-based auxiliary speaker separation method and related device | |
CN111785275A (en) | Voice recognition method and device | |
EP3944238A1 (en) | Audio signal processing method and related product | |
WO2021082941A1 (en) | Video figure recognition method and apparatus, and storage medium and electronic device | |
CN110136726A (en) | Voice gender estimation method, device, system and storage medium |
CN107680584B (en) | Method and device for segmenting audio | |
CN115798459B (en) | Audio processing method and device, storage medium and electronic equipment | |
Kenai et al. | A new architecture based VAD for speaker diarization/detection systems | |
CN116741155A (en) | Speech recognition method, training method, device and equipment of speech recognition model | |
CN113782005B (en) | Speech recognition method and device, storage medium and electronic equipment | |
CN114171032A (en) | Cross-channel voiceprint model training method, recognition method, device and readable medium | |
CN114495911A (en) | Speaker clustering method, device and equipment | |
CN113889081A (en) | Speech recognition method, medium, device and computing equipment | |
US20220277761A1 (en) | Impression estimation apparatus, learning apparatus, methods and programs for the same | |
CN112820292B (en) | Method, device, electronic device and storage medium for generating meeting summary | |
US20230169981A1 (en) | Method and apparatus for performing speaker diarization on mixed-bandwidth speech signals | |
US20240160849A1 (en) | Speaker diarization supporting episodical content | |
CN113689861B (en) | Intelligent track dividing method, device and system for mono call recording | |
CN114078484B (en) | Speech emotion recognition method, device and storage medium | |
CN113724689B (en) | Speech recognition method and related device, electronic equipment and storage medium | |
WO2024055751A1 (en) | Audio data processing method and apparatus, device, storage medium, and program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment | Owner name: PING AN TECHNOLOGY (SHENZHEN) CO., LTD., CHINA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ZHAO, FENG;WANG, JIANZONG;XIAO, JING;REEL/FRAME:052271/0135. Effective date: 20200110 |
FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: EX PARTE QUAYLE ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO EX PARTE QUAYLE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |