CN115457973A - Speaker segmentation method, system, terminal and storage medium - Google Patents
- Publication number
- CN115457973A (application number CN202211085775.4A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- segmented
- voice
- voice data
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
Abstract
The invention provides a speaker segmentation method, system, terminal and storage medium, wherein the method comprises the following steps: acquiring voice data to be segmented, and extracting feature vectors from the voice data to be segmented; performing standard deviation processing on the feature vectors to obtain a standard deviation vector, and combining the standard deviation vector with an initial zero vector; and inputting the combined vector into a pre-trained speaker analysis model for speaker analysis to obtain a speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector. By performing standard deviation processing on the feature vectors, the invention effectively eliminates the influence of differences in unit and scale between the feature vectors and improves their accuracy; combining the standard deviation vector with the initial zero vector acts as feature enhancement on the standard deviation vector and improves the accuracy of speaker segmentation.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method, a system, a terminal, and a storage medium for segmenting a speaker.
Background
Speaker segmentation technology mainly answers the question of who spoke at what time. Industry solutions mostly combine multiple cooperating modules, such as voice activity segmentation, speaker voiceprint localization and clustering; because jointly tuning multiple modules is relatively complex and handles problems such as overlapping speech poorly, schemes that optimize an end-to-end neural network as a whole have emerged.
Compared with jointly tuning several different modules, an end-to-end speaker segmentation task implemented with a single-module neural network model can handle multi-speaker overlap well through frame-level multi-label classification. The network has two objectives: first, deciding for each small time segment (generally one frame) how many people are speaking; second, deciding which speakers are active in each segment. Improving the accuracy of both objectives improves the overall speaker segmentation performance.
In existing speaker segmentation methods, the number of speakers is estimated with low accuracy, so the overall speech segmentation performance is poor and the user experience is degraded.
Disclosure of Invention
The embodiments of the invention aim to provide a speaker segmentation method, system, terminal and storage medium, so as to solve the problem that existing speaker segmentation methods have low accuracy.
An embodiment of the invention provides a speaker segmentation method comprising the following steps:
acquiring voice data to be segmented, and extracting feature vectors from the voice data to be segmented;
performing standard deviation processing on the feature vectors to obtain a standard deviation vector, and combining the standard deviation vector with an initial zero vector;
and inputting the combined vector into a pre-trained speaker analysis model for speaker analysis to obtain a speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector.
Further, the determining the number of speakers in the voice data to be segmented according to the speaker number vector includes:
matching each value in the speaker number vector against a preset value, and taking the number of successful matches as the speaker count.
Further, before the obtaining of the voice data to be segmented, the method further includes:
performing voice segmentation on the voice data to be segmented to obtain segmented voices, and respectively extracting voice features of the segmented voices;
and respectively calculating the feature similarity between the voice features of different segmented voices, and performing voice combination on the segmented voices according to the feature similarity.
Furthermore, after determining the number of speakers in the voice data to be segmented according to the speaker number vector, the method further includes:
respectively obtaining the phase-sensitive mask of each speaker in the voice data to be segmented, and obtaining the voice frequency spectrum of the voice data to be segmented;
respectively determining the voice frequency spectrum of each speaker according to the phase-sensitive mask of each speaker and the voice frequency spectrum of the voice data to be segmented;
and respectively determining the speaking voice of each speaker in the voice data to be segmented according to the voice frequency spectrum of each speaker.
Furthermore, after the obtaining of the voice data to be segmented, the method further includes:
filtering the voice data to be segmented, and pre-emphasizing the voice data to be segmented after filtering;
and performing frame division processing on the pre-emphasized voice data to be segmented, and performing voice windowing processing on the voice data to be segmented after the frame division processing.
Further, after performing the speech windowing on the speech data to be segmented after the framing processing, the method further includes:
and carrying out endpoint detection on the voice data to be segmented after the voice windowing processing, and denoising the voice data to be segmented according to an endpoint detection result.
Furthermore, after the determining, according to the voice frequency spectrum of each speaker, of the speaking voice of each speaker in the voice data to be segmented, the method further includes:
and respectively acquiring a time-frequency mask of each speaking voice, and carrying out voice optimization on the speaking voice according to the time-frequency mask.
It is another object of an embodiment of the present invention to provide a speaker segmentation system, which includes:
the characteristic extraction module is used for acquiring voice data to be segmented and extracting characteristic vectors in the voice data to be segmented;
the vector combination module is used for performing standard deviation processing on the feature vectors to obtain a standard deviation vector, and combining the standard deviation vector with an initial zero vector;
and the speaker number output module is used for inputting the combined vector into the pre-trained speaker analysis model for speaker analysis to obtain the speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector.
It is another object of the embodiments of the present invention to provide a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method when executing the computer program.
It is a further object of embodiments of the present invention to provide a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the above-mentioned method steps.
According to the embodiments of the invention, standard deviation processing of the feature vectors effectively eliminates the influence of differences in unit and scale between them and improves their accuracy; combining the standard deviation vector with the initial zero vector acts as feature enhancement on the standard deviation vector and improves the accuracy of speaker segmentation; inputting the combined vector into a pre-trained speaker analysis model for speaker analysis automatically yields the speaker number vector, based on which the number of speakers in the voice data to be segmented can be effectively determined.
Drawings
FIG. 1 is a flowchart of a speaker segmentation method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a speaker segmentation method according to a first embodiment of the present invention;
FIG. 3 is a flowchart of a speaker segmentation method according to a second embodiment of the present invention;
FIG. 4 is a schematic diagram of a speaker segmentation system according to a third embodiment of the present invention;
fig. 5 is a schematic structural diagram of a terminal device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Example one
Referring to fig. 1 and fig. 2, which show a speaker segmentation method according to a first embodiment of the present invention, the method can be applied to any terminal device or system and includes the steps of:
step S10, acquiring voice data to be segmented and extracting a feature vector in the voice data to be segmented;
the feature vector may be extracted by using a feature extraction network, where the feature extraction network may be set according to requirements, for example, the feature extraction network may be set as a transformer network or a transformer network;
optionally, in this step, before the obtaining of the voice data to be segmented, the method further includes:
performing voice segmentation on the voice data to be segmented to obtain segmented voices, and respectively extracting voice features of the segmented voices;
the method comprises the steps that voice segmentation is carried out on voice data to be segmented through preset time length to obtain segmented voices, feature extraction is carried out on each segmented voice input feature extraction network to obtain voice features, and the preset time length can be set according to requirements;
and respectively calculating the feature similarity between the voice features of different segmented voices, and performing voice combination on the segmented voices according to the feature similarity.
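As a hedged illustration of the segment-and-merge step above (the patent gives no code; the 1-second segment length, cosine similarity measure, and 0.9 merge threshold are assumptions introduced here), a minimal NumPy sketch might look like:

```python
import numpy as np

def split_segments(samples, sr, seg_seconds=1.0):
    """Split audio into fixed-duration segments (the preset duration is configurable)."""
    seg_len = int(sr * seg_seconds)
    return [samples[i:i + seg_len] for i in range(0, len(samples) - seg_len + 1, seg_len)]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def merge_similar(segments, features, threshold=0.9):
    """Merge adjacent segments whose feature vectors are sufficiently similar."""
    merged = [segments[0]]
    merged_feats = [features[0]]
    for seg, feat in zip(segments[1:], features[1:]):
        if cosine_similarity(merged_feats[-1], feat) >= threshold:
            merged[-1] = np.concatenate([merged[-1], seg])   # similar: join the audio
            merged_feats[-1] = (merged_feats[-1] + feat) / 2  # update the running feature
        else:
            merged.append(seg)
            merged_feats.append(feat)
    return merged
```

In practice the feature vectors here would come from the feature extraction network; the merge rule above is only one simple way to combine segments by similarity.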
Further, after the obtaining of the voice data to be segmented, the method further includes:
filtering the voice data to be segmented, and pre-emphasizing the filtered voice data; the filtering step effectively removes noise from the voice data to be segmented;
performing framing processing on the pre-emphasized voice data, and applying voice windowing to the framed voice data; framing the pre-emphasized data effectively improves the accuracy of the subsequent windowing of the voice data to be segmented.
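The pre-emphasis, framing and windowing chain can be sketched as follows; the 0.97 pre-emphasis coefficient, 25 ms frame length, 10 ms hop at 16 kHz, and Hamming window are conventional choices assumed here, not values stated in the patent:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; boosts high frequencies before framing."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len=400, hop=160):
    """Split into overlapping frames (25 ms / 10 ms at 16 kHz) and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```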
Further, after performing the speech windowing on the speech data to be segmented after the framing processing, the method further includes:
carrying out endpoint detection on the voice data to be segmented after voice windowing, and denoising the voice data according to the endpoint detection result; based on the detection result, silent portions of the voice data to be segmented can be effectively identified and deleted, which denoises the data and improves its accuracy.
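A minimal energy-based endpoint detector in the spirit of the step above; the energy-ratio criterion is an assumption, since the patent does not specify how endpoints are detected:

```python
import numpy as np

def drop_silence(frames, energy_ratio=0.1):
    """Simple energy-based endpoint detection: keep frames whose short-time
    energy exceeds a fraction of the mean energy; silence frames are dropped."""
    energy = np.sum(frames ** 2, axis=1)
    keep = energy > energy_ratio * energy.mean()
    return frames[keep], keep
```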
Step S20, standard deviation processing is carried out on the characteristic vector to obtain a standard deviation vector, and the standard deviation vector is combined with the initial zero vector to obtain a speaker number vector;
the method comprises the steps of carrying out standard deviation processing on feature vectors, effectively eliminating the influence of unit and scale difference between the feature vectors, improving the accuracy of the feature vectors, obtaining a speaker number vector by combining the standard deviation vector and an initial zero vector, effectively playing a feature enhancement effect on the standard deviation vector, improving the accuracy of speaker segmentation, and optionally carrying out dimension adaptation by adding a DNN network during the standard deviation processing on the feature vectors, so that the accuracy of the standard deviation vector is further improved; specifically, in this step, the lstm network is used as a decoder to combine the standard deviation Vector and the initial Zero Vector (Zero Vector), and a speaker number Vector is output;
step S30, inputting the speaker number vector into a pre-trained speaker analysis model for speaker analysis to obtain a speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector;
the number vector of the speakers can be automatically obtained by inputting the number vector of the speakers into a pre-trained speaker analysis model for speaker analysis, and the number of the speakers in the voice data to be segmented can be effectively determined based on the number vector of the speakers;
optionally, in this step, the determining, according to the speaker number vector, the number of speakers in the speech data to be segmented includes:
matching each value in the speaker number vector against a preset value, and taking the number of successful matches as the speaker count, wherein the preset value can be set as required. Each dimension of the vector whose value exceeds a threshold (generally 0.5) is counted as one speaker, and the judgment stops at the first dimension below the threshold; the number of 1s in the final speaker number vector is then the number of speakers, e.g. (1, 1, 1, 0) and (1, 1, 0, 0) are the vectors for 3 and 2 speakers respectively.
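The counting rule described above can be sketched directly; the example vectors follow the (1, 1, 1, 0) / (1, 1, 0, 0) reading of the text:

```python
def count_speakers(count_vector, threshold=0.5):
    """Count dimensions of the speaker number vector above the threshold,
    stopping at the first dimension that falls below it."""
    n = 0
    for value in count_vector:
        if value > threshold:
            n += 1
        else:
            break
    return n
```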
In this embodiment, standard deviation processing of the feature vectors effectively eliminates the influence of differences in unit and scale between them and improves their accuracy; combining the standard deviation vector with the initial zero vector acts as feature enhancement on the standard deviation vector and improves the accuracy of speaker segmentation; inputting the combined vector into a pre-trained speaker analysis model for speaker analysis automatically yields the speaker number vector, based on which the number of speakers in the voice data to be segmented can be effectively determined.
Example two
Please refer to fig. 3, which is a flowchart of a speaker segmentation method according to a second embodiment of the present invention; this embodiment further refines the processing after step S30 of the first embodiment, and includes the steps of:
step S40, respectively obtaining the phase sensitive mask of each speaker in the voice data to be segmented, and obtaining the voice frequency spectrum of the voice data to be segmented;
wherein the Phase-Sensitive Mask (PSM) is the Ideal Amplitude Mask (IAM) multiplied by the cosine of the phase difference between the clean speech and the noisy speech;
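Given complex STFTs of the clean and noisy signals, the PSM definition above can be computed as follows (the function and variable names are illustrative; the patent itself gives no formulas):

```python
import numpy as np

def phase_sensitive_mask(clean_stft, noisy_stft, eps=1e-8):
    """PSM = ideal amplitude mask |S|/|Y| times the cosine of the phase
    difference between clean speech S and the noisy mixture Y."""
    iam = np.abs(clean_stft) / (np.abs(noisy_stft) + eps)
    phase_diff = np.angle(clean_stft) - np.angle(noisy_stft)
    return iam * np.cos(phase_diff)
```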
s50, respectively determining the voice frequency spectrum of each speaker according to the phase sensitivity mask of each speaker and the voice frequency spectrum of the voice data to be segmented;
the method comprises the steps that the phase sensitivity mask of each speaker and the voice frequency spectrum of voice data to be segmented can effectively and respectively determine the voice frequency spectrum of each speaker, and the accuracy of subsequent speaking voice of each speaker in the voice data to be segmented is improved based on the voice frequency spectrum of each speaker;
step S60, determining the speaking voice of each speaker in the voice data to be segmented according to the voice frequency spectrum of each speaker;
optionally, in this step, after determining the speaking voice of each speaker in the voice data to be segmented according to the voice frequency spectrum of each speaker, the method further includes: respectively acquiring a time-frequency mask of each speaking voice, and performing voice optimization on the speaking voice according to the time-frequency mask; a deep learning model is trained based on the time-frequency mask, and the speaking voice is optimized with the trained model, which improves the accuracy of the speaking voice.
In this embodiment, the voice frequency spectrum of each speaker can be effectively determined from that speaker's phase-sensitive mask together with the voice frequency spectrum of the voice data to be segmented; basing the extraction of each speaker's speech on these spectra improves its accuracy, and the speaking voice of each speaker in the voice data to be segmented can then be determined automatically, which facilitates segmenting each speaking voice in the voice data.
EXAMPLE III
Please refer to fig. 4, which is a schematic structural diagram of a speaker segmentation system 100 according to a third embodiment of the present invention, including: the system comprises a feature extraction module 10, a vector combination module 11 and a speaker number output module 12, wherein:
the feature extraction module 10 is configured to obtain voice data to be segmented, and extract feature vectors in the voice data to be segmented. The feature vector may be extracted by using a feature extraction network, and the feature extraction network may be set according to requirements, for example, the feature extraction network may be set as a transform network or a former network.
Optionally, the feature extraction module 10 is further configured to: perform voice segmentation on the voice data to be segmented to obtain segmented voices, and respectively extract voice features of the segmented voices; the voice data is split into segments of a preset duration (which can be set as required), and each segment is fed into the feature extraction network to obtain its voice features;
and respectively calculating the feature similarity between the voice features of different segmented voices, and performing voice combination on the segmented voices according to the feature similarity.
Further, the feature extraction module 10 is further configured to: filter the voice data to be segmented, and pre-emphasize the filtered voice data; the filtering step effectively removes noise from the voice data to be segmented;
perform framing processing on the pre-emphasized voice data, and apply voice windowing to the framed voice data; framing the pre-emphasized data effectively improves the accuracy of the subsequent windowing.
Further, the feature extraction module 10 is further configured to: carry out endpoint detection on the voice data to be segmented after voice windowing, and denoise the voice data according to the endpoint detection result; based on the detection result, silent portions of the voice data to be segmented can be effectively identified and deleted, which denoises the data and improves its accuracy.
And the vector combination module 11 is used for performing standard deviation processing on the feature vectors to obtain a standard deviation vector, and combining the standard deviation vector with the initial zero vector. Standard deviation processing effectively eliminates the influence of differences in unit and scale between the feature vectors; combining the standard deviation vector with the initial zero vector acts as feature enhancement on the standard deviation vector and improves the accuracy of speaker segmentation. Optionally, a DNN can be added during the standard deviation processing for dimension adaptation, further improving the accuracy of the standard deviation vector; specifically, an LSTM network serving as a decoder combines the standard deviation vector and the initial Zero Vector and outputs the speaker number vector.
And the speaker number output module 12 is used for inputting the combined vector into the pre-trained speaker analysis model for speaker analysis to obtain the speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector. Feeding the input into the pre-trained speaker analysis model automatically yields the speaker number vector, based on which the number of speakers in the voice data to be segmented can be effectively determined.
Optionally, the speaker count output module 12 is further configured to: match each value in the speaker number vector against a preset value, and take the number of successful matches as the speaker count; the preset value can be set as required. Each dimension of the vector whose value exceeds a threshold (generally 0.5) is counted as one speaker, and the judgment stops at the first dimension below the threshold; the number of 1s in the final speaker number vector is then the number of speakers, e.g. (1, 1, 1, 0) and (1, 1, 0, 0) are the vectors for 3 and 2 speakers respectively.
Further, the speaker count output module 12 is further configured to: respectively acquire the phase-sensitive mask of each speaker in the voice data to be segmented, and acquire the voice frequency spectrum of the voice data to be segmented;
respectively determine the voice frequency spectrum of each speaker according to the phase-sensitive mask of each speaker and the voice frequency spectrum of the voice data to be segmented;
and respectively determining the speaking voice of each speaker in the voice data to be segmented according to the voice frequency spectrum of each speaker.
Further, the speaker count output module 12 is further configured to: and respectively acquiring a time-frequency mask of each speaking voice, and carrying out voice optimization on the speaking voice according to the time-frequency mask.
In this embodiment, standard deviation processing of the feature vectors effectively eliminates the influence of differences in unit and scale between them and improves their accuracy; combining the standard deviation vector with the initial zero vector acts as feature enhancement on the standard deviation vector and improves the accuracy of speaker segmentation; inputting the combined vector into a pre-trained speaker analysis model for speaker analysis automatically yields the speaker number vector, based on which the number of speakers in the voice data to be segmented can be effectively determined.
Example four
Fig. 5 is a block diagram of a terminal device 2 according to a fourth embodiment of the present application. As shown in fig. 5, the terminal device 2 of this embodiment includes: a processor 20, a memory 21 and a computer program 22, such as a program for a speaker segmentation method, stored in said memory 21 and executable on said processor 20. The processor 20, when executing the computer program 22, performs the steps of the various embodiments of the individual speaker segmentation method described above.
Illustratively, the computer program 22 may be partitioned into one or more modules, which are stored in the memory 21 and executed by the processor 20 to accomplish the present application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 22 in the terminal device 2. The terminal device may include, but is not limited to, a processor 20, a memory 21.
The Processor 20 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 21 may be an internal storage unit of the terminal device 2, such as a hard disk or a memory of the terminal device 2. The memory 21 may also be an external storage device of the terminal device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the terminal device 2. The memory 21 is used for storing the computer program and other programs and data required by the terminal device. The memory 21 may also be used to temporarily store data that has been output or is to be output.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated module, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer-readable storage medium, which may be non-volatile or volatile. Based on such understanding, all or part of the flow of the methods of the embodiments described above may be realized by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc. It should be noted that the content of the computer-readable storage medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable storage media exclude electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present application, and they should be construed as being included in the present application.
Claims (10)
1. A method for speaker segmentation, the method comprising:
acquiring voice data to be segmented, and extracting a feature vector in the voice data to be segmented;
performing standard deviation processing on the feature vector to obtain a standard deviation vector, and combining the standard deviation vector with an initial zero vector to obtain a combined vector; and
inputting the combined vector into a pre-trained speaker analysis model for speaker analysis to obtain a speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector.
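The vector construction in claim 1 can be sketched as follows. This is a minimal illustration: the function name, the per-frame feature layout, the zero-vector length, and the use of concatenation as the "combination" step are all assumptions, since the claim fixes none of them.

```python
import numpy as np

def build_model_input(frame_features: np.ndarray, num_zeros: int = 8) -> np.ndarray:
    """Standard-deviation processing plus zero-vector combination (claim 1).

    frame_features: (num_frames, feature_dim) matrix of per-frame feature
    vectors extracted from the voice data to be segmented.
    """
    std_vector = frame_features.std(axis=0)   # standard deviation per feature dimension
    zero_vector = np.zeros(num_zeros)         # initial zero vector
    # Combination is assumed to be concatenation; the claim only says "combine".
    return np.concatenate([std_vector, zero_vector])
```

The result would then be fed to the pre-trained speaker analysis model, whose architecture the claim leaves open.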
2. The method as claimed in claim 1, wherein the determining the number of speakers in the voice data to be segmented according to the speaker number vector comprises:
matching each value in the speaker number vector against a preset value, and taking the number of successful matches as the number of speakers.
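The counting rule of claim 2 can be illustrated as below; the preset value and the matching tolerance are illustrative parameters, not values fixed by the claim.

```python
import numpy as np

def count_speakers(speaker_number_vector: np.ndarray,
                   preset_value: float = 1.0,
                   tolerance: float = 0.5) -> int:
    """Count entries of the model output that match a preset value (claim 2).

    Each successful match contributes one speaker to the count. The preset
    value and tolerance here are assumed for illustration.
    """
    matches = np.abs(speaker_number_vector - preset_value) <= tolerance
    return int(matches.sum())
```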
3. The speaker segmentation method according to claim 1, wherein after the acquiring the voice data to be segmented, the method further comprises:
performing voice segmentation on the voice data to be segmented to obtain segmented voices, and respectively extracting voice features of the segmented voices;
respectively calculating the feature similarity between the voice features of different segmented voices, and merging the segmented voices according to the feature similarity.
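The similarity-based merging in claim 3 could look like the following sketch. Cosine similarity, the 0.8 threshold, and merging only consecutive segments are all assumptions; the claim specifies neither the similarity measure nor the merging rule.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_segments(segment_features, threshold: float = 0.8):
    """Group consecutive segmented voices whose features are similar (claim 3).

    segment_features: list of 1-D feature vectors, one per segmented voice.
    Returns lists of segment indices; indices in the same list are merged.
    """
    if not segment_features:
        return []
    groups = [[0]]
    for i in range(1, len(segment_features)):
        prev = segment_features[groups[-1][-1]]
        if cosine_similarity(prev, segment_features[i]) >= threshold:
            groups[-1].append(i)   # similar enough: merge with previous group
        else:
            groups.append([i])     # dissimilar: start a new group
    return groups
```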
4. The method as claimed in claim 1, wherein after the determining the number of speakers in the voice data to be segmented according to the speaker number vector, the method further comprises:
respectively acquiring a phase-sensitive mask of each speaker in the voice data to be segmented, and acquiring a voice spectrum of the voice data to be segmented;
respectively determining the voice spectrum of each speaker according to the phase-sensitive mask of each speaker and the voice spectrum of the voice data to be segmented;
and respectively determining the speaking voice of each speaker in the voice data to be segmented according to the voice spectrum of each speaker.
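The masking step of claim 4 is conventionally an element-wise multiplication of each speaker's mask with the mixture spectrum, as sketched below. The function name is hypothetical, and how the per-speaker masks are estimated is left open by the claim.

```python
import numpy as np

def separate_speakers(mixture_spectrum: np.ndarray, masks):
    """Recover each speaker's spectrum from the mixture spectrum (claim 4).

    mixture_spectrum: (freq_bins, frames) spectrum of the voice data to be
    segmented. masks: one phase-sensitive mask per speaker, same shape as
    the mixture. Masks are clipped to [0, 1] before applying, a common
    convention for phase-sensitive masks.
    """
    return [np.clip(mask, 0.0, 1.0) * mixture_spectrum for mask in masks]
```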
5. The speaker segmentation method according to claim 1, wherein after the acquiring the voice data to be segmented, the method further comprises:
filtering the voice data to be segmented, and pre-emphasizing the filtered voice data to be segmented; and
framing the pre-emphasized voice data to be segmented, and windowing the framed voice data to be segmented.
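The pre-emphasis, framing, and windowing chain of claim 5 can be sketched as follows. The coefficient 0.97 and the 400/160-sample frame length and hop (25 ms / 10 ms at 16 kHz) are customary defaults, not values fixed by the claim; the upstream filtering step is assumed to have already been applied.

```python
import numpy as np

def preprocess(signal: np.ndarray, frame_len: int = 400, hop: int = 160,
               alpha: float = 0.97) -> np.ndarray:
    """Pre-emphasis, framing, and Hamming windowing (claim 5)."""
    # Pre-emphasis: boost high frequencies with a first-order difference.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split into overlapping fixed-length frames.
    num_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(num_frames)])
    # Windowing: taper each frame to reduce spectral leakage.
    return frames * np.hamming(frame_len)
```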
6. The speaker segmentation method according to claim 5, wherein after the windowing of the framed voice data to be segmented, the method further comprises:
performing endpoint detection on the windowed voice data to be segmented, and denoising the voice data to be segmented according to the endpoint detection result.
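Claim 6 does not specify the endpoint detector; a minimal short-time-energy sketch, assuming windowed frames as input and a threshold set as a fraction of the peak frame energy, would be:

```python
import numpy as np

def detect_speech_frames(frames: np.ndarray, threshold_ratio: float = 0.1) -> np.ndarray:
    """Short-time-energy endpoint detection over windowed frames (claim 6).

    Returns a boolean mask of frames judged to contain speech; dropping the
    remaining frames is a simple form of denoising. The threshold_ratio is
    an illustrative assumption.
    """
    energy = (frames ** 2).sum(axis=1)          # short-time energy per frame
    return energy >= threshold_ratio * energy.max()
```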
7. The method as claimed in claim 4, wherein after the determining the speaking voice of each speaker in the voice data to be segmented according to the voice spectrum of each speaker, the method further comprises:
and respectively acquiring a time-frequency mask of each speaking voice, and carrying out voice optimization on the speaking voice according to the time-frequency mask.
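Claim 7 leaves the "voice optimization" step unspecified; one simple reading is to binarize each speaking voice's time-frequency mask and zero out bins treated as residual interference, as in this sketch (the 0.5 threshold and function name are assumptions):

```python
import numpy as np

def refine_with_tf_mask(speaker_spectrum: np.ndarray, tf_mask: np.ndarray) -> np.ndarray:
    """Apply a time-frequency mask to one speaker's spectrum (claim 7).

    Time-frequency bins where the mask is below 0.5 are treated as residual
    interference and suppressed.
    """
    binary_mask = (tf_mask >= 0.5).astype(speaker_spectrum.dtype)
    return speaker_spectrum * binary_mask
```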
8. A speaker segmentation system, the system comprising:
the feature extraction module is used for acquiring voice data to be segmented and extracting a feature vector from the voice data to be segmented;
the vector combination module is used for performing standard deviation processing on the feature vector to obtain a standard deviation vector, and combining the standard deviation vector with an initial zero vector to obtain a combined vector; and
the speaker number output module is used for inputting the combined vector into a pre-trained speaker analysis model for speaker analysis to obtain a speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of a method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211085775.4A CN115457973A (en) | 2022-09-06 | 2022-09-06 | Speaker segmentation method, system, terminal and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115457973A (en) | 2022-12-09 |
Family
ID=84302937
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211085775.4A Pending CN115457973A (en) | 2022-09-06 | 2022-09-06 | Speaker segmentation method, system, terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115457973A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5687287A (en) * | 1995-05-22 | 1997-11-11 | Lucent Technologies Inc. | Speaker verification method and apparatus using mixture decomposition discrimination |
JP2004139049A (en) * | 2002-09-24 | 2004-05-13 | Matsushita Electric Ind Co Ltd | Speaker normalization method and speech recognition device using the same |
CN107393527A (en) * | 2017-07-17 | 2017-11-24 | 广东讯飞启明科技发展有限公司 | The determination methods of speaker's number |
CN108777146A (en) * | 2018-05-31 | 2018-11-09 | 平安科技(深圳)有限公司 | Speech model training method, method for distinguishing speek person, device, equipment and medium |
CN110176243A (en) * | 2018-08-10 | 2019-08-27 | 腾讯科技(深圳)有限公司 | Sound enhancement method, model training method, device and computer equipment |
CN111462762A (en) * | 2020-03-25 | 2020-07-28 | 清华大学 | Speaker vector regularization method and device, electronic equipment and storage medium |
CN112331216A (en) * | 2020-10-29 | 2021-02-05 | 同济大学 | Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN |
Non-Patent Citations (1)
Title |
---|
Ruchir Travadi et al.: "Total Variability Layer in Deep Neural Network Embeddings for Speaker Verification", IEEE Signal Processing Letters, vol. 26, no. 6, 11 April 2019 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117011924A (en) * | 2023-10-07 | 2023-11-07 | 之江实验室 | Method and system for estimating number of speakers based on voice and image |
CN117011924B (en) * | 2023-10-07 | 2024-02-13 | 之江实验室 | Method and system for estimating number of speakers based on voice and image |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109473123B (en) | Voice activity detection method and device | |
CN110956957B (en) | Training method and system of speech enhancement model | |
CN106486130B (en) | Noise elimination and voice recognition method and device | |
CN112053695A (en) | Voiceprint recognition method and device, electronic equipment and storage medium | |
CN110047519B (en) | Voice endpoint detection method, device and equipment | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
CN110428835B (en) | Voice equipment adjusting method and device, storage medium and voice equipment | |
CN108806707B (en) | Voice processing method, device, equipment and storage medium | |
US20200227069A1 (en) | Method, device and apparatus for recognizing voice signal, and storage medium | |
CN116490920A (en) | Method for detecting an audio challenge, corresponding device, computer program product and computer readable carrier medium for a speech input processed by an automatic speech recognition system | |
CN112599148A (en) | Voice recognition method and device | |
CN113112992B (en) | Voice recognition method and device, storage medium and server | |
CN115457973A (en) | Speaker segmentation method, system, terminal and storage medium | |
CN117727298B (en) | Deep learning-based portable computer voice recognition method and system | |
WO2021127990A1 (en) | Voiceprint recognition method based on voice noise reduction and related apparatus | |
CN114996489A (en) | Method, device and equipment for detecting violation of news data and storage medium | |
CN113782036B (en) | Audio quality assessment method, device, electronic equipment and storage medium | |
CN112002307B (en) | Voice recognition method and device | |
CN116229987B (en) | Campus voice recognition method, device and storage medium | |
CN110299133B (en) | Method for judging illegal broadcast based on keyword | |
CN108053834A (en) | audio data processing method, device, terminal and system | |
CN114420136A (en) | Method and device for training voiceprint recognition model and storage medium | |
EP3680901A1 (en) | A sound processing apparatus and method | |
CN107993666B (en) | Speech recognition method, speech recognition device, computer equipment and readable storage medium | |
CN111402898B (en) | Audio signal processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||