CN115457973A - Speaker segmentation method, system, terminal and storage medium - Google Patents
- Publication number
- CN115457973A (application number CN202211085775.4A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- segmented
- voice
- voice data
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
Abstract
The invention provides a speaker segmentation method, system, terminal and storage medium, wherein the method comprises the following steps: acquiring voice data to be segmented, and extracting feature vectors from the voice data to be segmented; performing standard deviation processing on the feature vectors to obtain a standard deviation vector, and combining the standard deviation vector with an initial zero vector; and inputting the combined vector into a pre-trained speaker analysis model for speaker analysis to obtain a speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector. By performing standard deviation processing on the feature vectors, the invention effectively eliminates the influence of differences in unit and scale between the feature vectors and improves their accuracy; combining the standard deviation vector with the initial zero vector acts as feature enhancement on the standard deviation vector and improves the accuracy of speaker segmentation.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method, a system, a terminal, and a storage medium for segmenting a speaker.
Background
Speaker segmentation technology mainly answers the question of who spoke at what time. Industry solutions mostly combine multiple cooperating modules, such as voice activity segmentation, speaker voiceprint localization and clustering; because jointly tuning multiple modules is relatively complex and handles problems such as overlapping speech poorly, schemes that optimize an end-to-end neural network as a whole have emerged.
Compared with jointly tuning several different modules, an end-to-end speaker segmentation task implemented with a single-module neural network model can handle multi-speaker overlap well through frame-level multi-label classification. The network has two objectives: first, deciding for each small time segment (generally one frame) how many people are speaking; second, deciding which speakers are active in each segment. Improving the accuracy of both objectives improves the overall speaker segmentation performance.
In existing speaker segmentation methods, the number of speakers is estimated with low accuracy, so the overall speech segmentation performance is poor and the user experience is degraded.
Disclosure of Invention
The embodiments of the invention aim to provide a speaker segmentation method, system, terminal and storage medium, so as to solve the problem that existing speaker segmentation methods have low accuracy.
An embodiment of the invention provides a speaker segmentation method comprising the following steps:
acquiring voice data to be segmented, and extracting feature vectors from the voice data to be segmented;
performing standard deviation processing on the feature vectors to obtain a standard deviation vector, and combining the standard deviation vector with an initial zero vector;
and inputting the combined vector into a pre-trained speaker analysis model for speaker analysis to obtain a speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector.
Further, the determining the number of speakers in the voice data to be segmented according to the speaker number vector includes:
matching each value in the speaker number vector against a preset value, and taking the number of successful matches as the speaker count.
Further, before the obtaining of the voice data to be segmented, the method further includes:
performing voice segmentation on the voice data to be segmented to obtain segmented voices, and respectively extracting voice features of the segmented voices;
and respectively calculating the feature similarity between the voice features of different segmented voices, and performing voice combination on the segmented voices according to the feature similarity.
Furthermore, after determining the number of speakers in the voice data to be segmented according to the speaker number vector, the method further includes:
respectively obtaining the phase-sensitive mask of each speaker in the voice data to be segmented, and obtaining the voice frequency spectrum of the voice data to be segmented;
respectively determining the voice frequency spectrum of each speaker according to the phase-sensitive mask of each speaker and the voice frequency spectrum of the voice data to be segmented;
and respectively determining the speaking voice of each speaker in the voice data to be segmented according to the voice frequency spectrum of each speaker.
Furthermore, after the obtaining of the voice data to be segmented, the method further includes:
filtering the voice data to be segmented, and pre-emphasizing the voice data to be segmented after filtering;
and performing frame division processing on the pre-emphasized voice data to be segmented, and performing voice windowing processing on the voice data to be segmented after the frame division processing.
Further, after performing the speech windowing on the speech data to be segmented after the framing processing, the method further includes:
and carrying out endpoint detection on the voice data to be segmented after the voice windowing processing, and denoising the voice data to be segmented according to an endpoint detection result.
Furthermore, after the determining, according to the voice frequency spectrum of each speaker, of the speaking voice of each speaker in the voice data to be segmented, the method further includes:
and respectively acquiring a time-frequency mask of each speaking voice, and carrying out voice optimization on the speaking voice according to the time-frequency mask.
It is another object of an embodiment of the present invention to provide a speaker segmentation system, which includes:
the characteristic extraction module is used for acquiring voice data to be segmented and extracting characteristic vectors in the voice data to be segmented;
the vector combination module is used for performing standard deviation processing on the feature vectors to obtain a standard deviation vector, and combining the standard deviation vector with an initial zero vector;
and the speaker number output module is used for inputting the combined vector into the pre-trained speaker analysis model for speaker analysis to obtain the speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector.
It is another object of the embodiments of the present invention to provide a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method when executing the computer program.
It is a further object of embodiments of the present invention to provide a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the above-mentioned method steps.
According to the embodiments of the invention, standard deviation processing of the feature vectors effectively eliminates the influence of differences in unit and scale between them and improves their accuracy; combining the standard deviation vector with the initial zero vector acts as feature enhancement on the standard deviation vector and improves the accuracy of speaker segmentation; inputting the combined vector into a pre-trained speaker analysis model for speaker analysis automatically yields the speaker number vector, based on which the number of speakers in the voice data to be segmented can be effectively determined.
Drawings
FIG. 1 is a flowchart of a speaker segmentation method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a speaker segmentation method according to a first embodiment of the present invention;
FIG. 3 is a flowchart of a speaker segmentation method according to a second embodiment of the present invention;
FIG. 4 is a schematic diagram of a speaker segmentation system according to a third embodiment of the present invention;
fig. 5 is a schematic structural diagram of a terminal device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Example one
Referring to fig. 1 and fig. 2, which show a speaker segmentation method according to a first embodiment of the present invention, the method can be applied to any terminal device or system and includes the steps of:
step S10, acquiring voice data to be segmented and extracting a feature vector in the voice data to be segmented;
the feature vector may be extracted by using a feature extraction network, where the feature extraction network may be set according to requirements, for example, the feature extraction network may be set as a transformer network or a transformer network;
optionally, in this step, before the obtaining of the voice data to be segmented, the method further includes:
performing voice segmentation on the voice data to be segmented to obtain segmented voices, and respectively extracting voice features of the segmented voices;
the method comprises the steps that voice segmentation is carried out on voice data to be segmented through preset time length to obtain segmented voices, feature extraction is carried out on each segmented voice input feature extraction network to obtain voice features, and the preset time length can be set according to requirements;
and respectively calculating the feature similarity between the voice features of different segmented voices, and performing voice combination on the segmented voices according to the feature similarity.
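As a hedged illustration of the segment-and-merge step above (the patent gives no code; the 1-second segment length, cosine similarity measure, and 0.9 merge threshold are assumptions introduced here), a minimal NumPy sketch might look like:

```python
import numpy as np

def split_segments(samples, sr, seg_seconds=1.0):
    """Split audio into fixed-duration segments (the preset duration is configurable)."""
    seg_len = int(sr * seg_seconds)
    return [samples[i:i + seg_len] for i in range(0, len(samples) - seg_len + 1, seg_len)]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def merge_similar(segments, features, threshold=0.9):
    """Merge adjacent segments whose feature vectors are sufficiently similar."""
    merged = [segments[0]]
    merged_feats = [features[0]]
    for seg, feat in zip(segments[1:], features[1:]):
        if cosine_similarity(merged_feats[-1], feat) >= threshold:
            merged[-1] = np.concatenate([merged[-1], seg])   # similar: join the audio
            merged_feats[-1] = (merged_feats[-1] + feat) / 2  # update the running feature
        else:
            merged.append(seg)
            merged_feats.append(feat)
    return merged
```

In practice the feature vectors here would come from the feature extraction network; the merge rule above is only one simple way to combine segments by similarity.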
Further, after the obtaining of the voice data to be segmented, the method further includes:
filtering the voice data to be segmented, and pre-emphasizing the filtered voice data; the filtering step effectively removes noise from the voice data to be segmented;
performing framing processing on the pre-emphasized voice data, and applying voice windowing to the framed voice data; framing the pre-emphasized data effectively improves the accuracy of the subsequent windowing of the voice data to be segmented.
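The pre-emphasis, framing and windowing chain can be sketched as follows; the 0.97 pre-emphasis coefficient, 25 ms frame length, 10 ms hop at 16 kHz, and Hamming window are conventional choices assumed here, not values stated in the patent:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; boosts high frequencies before framing."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len=400, hop=160):
    """Split into overlapping frames (25 ms / 10 ms at 16 kHz) and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```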
Further, after performing the speech windowing on the speech data to be segmented after the framing processing, the method further includes:
carrying out endpoint detection on the voice data to be segmented after voice windowing, and denoising the voice data according to the endpoint detection result; based on the detection result, silent portions of the voice data to be segmented can be effectively identified and deleted, which denoises the data and improves its accuracy.
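A minimal energy-based endpoint detector in the spirit of the step above; the energy-ratio criterion is an assumption, since the patent does not specify how endpoints are detected:

```python
import numpy as np

def drop_silence(frames, energy_ratio=0.1):
    """Simple energy-based endpoint detection: keep frames whose short-time
    energy exceeds a fraction of the mean energy; silence frames are dropped."""
    energy = np.sum(frames ** 2, axis=1)
    keep = energy > energy_ratio * energy.mean()
    return frames[keep], keep
```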
Step S20, standard deviation processing is carried out on the characteristic vector to obtain a standard deviation vector, and the standard deviation vector is combined with the initial zero vector to obtain a speaker number vector;
the method comprises the steps of carrying out standard deviation processing on feature vectors, effectively eliminating the influence of unit and scale difference between the feature vectors, improving the accuracy of the feature vectors, obtaining a speaker number vector by combining the standard deviation vector and an initial zero vector, effectively playing a feature enhancement effect on the standard deviation vector, improving the accuracy of speaker segmentation, and optionally carrying out dimension adaptation by adding a DNN network during the standard deviation processing on the feature vectors, so that the accuracy of the standard deviation vector is further improved; specifically, in this step, the lstm network is used as a decoder to combine the standard deviation Vector and the initial Zero Vector (Zero Vector), and a speaker number Vector is output;
step S30, inputting the speaker number vector into a pre-trained speaker analysis model for speaker analysis to obtain a speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector;
the number vector of the speakers can be automatically obtained by inputting the number vector of the speakers into a pre-trained speaker analysis model for speaker analysis, and the number of the speakers in the voice data to be segmented can be effectively determined based on the number vector of the speakers;
optionally, in this step, the determining, according to the speaker number vector, the number of speakers in the speech data to be segmented includes:
matching each value in the speaker number vector against a preset value, and taking the number of successful matches as the speaker count, wherein the preset value can be set as required. Each dimension of the vector whose value exceeds a threshold (generally 0.5) is counted as one speaker, and the judgment stops at the first dimension below the threshold; the number of 1s in the final speaker number vector is then the number of speakers, e.g. (1, 1, 1, 0) and (1, 1, 0, 0) are the vectors for 3 and 2 speakers respectively.
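The counting rule described above can be sketched directly; the example vectors follow the (1, 1, 1, 0) / (1, 1, 0, 0) reading of the text:

```python
def count_speakers(count_vector, threshold=0.5):
    """Count dimensions of the speaker number vector above the threshold,
    stopping at the first dimension that falls below it."""
    n = 0
    for value in count_vector:
        if value > threshold:
            n += 1
        else:
            break
    return n
```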
In this embodiment, standard deviation processing of the feature vectors effectively eliminates the influence of differences in unit and scale between them and improves their accuracy; combining the standard deviation vector with the initial zero vector acts as feature enhancement on the standard deviation vector and improves the accuracy of speaker segmentation; inputting the combined vector into a pre-trained speaker analysis model for speaker analysis automatically yields the speaker number vector, based on which the number of speakers in the voice data to be segmented can be effectively determined.
Example two
Please refer to fig. 3, which is a flowchart of a speaker segmentation method according to a second embodiment of the present invention; this embodiment further refines the processing after step S30 of the first embodiment, and includes the steps of:
step S40, respectively obtaining the phase sensitive mask of each speaker in the voice data to be segmented, and obtaining the voice frequency spectrum of the voice data to be segmented;
wherein the Phase-Sensitive Mask (PSM) is the Ideal Amplitude Mask (IAM) multiplied by the cosine of the phase difference between the clean speech and the noisy speech;
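Given complex STFTs of the clean and noisy signals, the PSM definition above can be computed as follows (the function and variable names are illustrative; the patent itself gives no formulas):

```python
import numpy as np

def phase_sensitive_mask(clean_stft, noisy_stft, eps=1e-8):
    """PSM = ideal amplitude mask |S|/|Y| times the cosine of the phase
    difference between clean speech S and the noisy mixture Y."""
    iam = np.abs(clean_stft) / (np.abs(noisy_stft) + eps)
    phase_diff = np.angle(clean_stft) - np.angle(noisy_stft)
    return iam * np.cos(phase_diff)
```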
s50, respectively determining the voice frequency spectrum of each speaker according to the phase sensitivity mask of each speaker and the voice frequency spectrum of the voice data to be segmented;
the method comprises the steps that the phase sensitivity mask of each speaker and the voice frequency spectrum of voice data to be segmented can effectively and respectively determine the voice frequency spectrum of each speaker, and the accuracy of subsequent speaking voice of each speaker in the voice data to be segmented is improved based on the voice frequency spectrum of each speaker;
step S60, determining the speaking voice of each speaker in the voice data to be segmented according to the voice frequency spectrum of each speaker;
optionally, in this step, after determining the speaking voice of each speaker in the voice data to be segmented according to the voice frequency spectrum of each speaker, the method further includes: respectively acquiring a time-frequency mask of each speaking voice, and performing voice optimization on the speaking voice according to the time-frequency mask; a deep learning model is trained based on the time-frequency mask, and the speaking voice is optimized with the trained model, which improves the accuracy of the speaking voice.
In this embodiment, the voice frequency spectrum of each speaker can be effectively determined from that speaker's phase-sensitive mask together with the voice frequency spectrum of the voice data to be segmented; basing the extraction of each speaker's speech on these spectra improves its accuracy, and the speaking voice of each speaker in the voice data to be segmented can then be determined automatically, which facilitates segmenting each speaking voice in the voice data.
EXAMPLE III
Please refer to fig. 4, which is a schematic structural diagram of a speaker segmentation system 100 according to a third embodiment of the present invention, including: the system comprises a feature extraction module 10, a vector combination module 11 and a speaker number output module 12, wherein:
the feature extraction module 10 is configured to obtain voice data to be segmented, and extract feature vectors in the voice data to be segmented. The feature vector may be extracted by using a feature extraction network, and the feature extraction network may be set according to requirements, for example, the feature extraction network may be set as a transform network or a former network.
Optionally, the feature extraction module 10 is further configured to: perform voice segmentation on the voice data to be segmented to obtain segmented voices, and respectively extract voice features of the segmented voices; the voice data is split into segments of a preset duration (which can be set as required), and each segment is fed into the feature extraction network to obtain its voice features;
and respectively calculating the feature similarity between the voice features of different segmented voices, and performing voice combination on the segmented voices according to the feature similarity.
Further, the feature extraction module 10 is further configured to: filter the voice data to be segmented, and pre-emphasize the filtered voice data; the filtering step effectively removes noise from the voice data to be segmented;
perform framing processing on the pre-emphasized voice data, and apply voice windowing to the framed voice data; framing the pre-emphasized data effectively improves the accuracy of the subsequent windowing.
Further, the feature extraction module 10 is further configured to: carry out endpoint detection on the voice data to be segmented after voice windowing, and denoise the voice data according to the endpoint detection result; based on the detection result, silent portions of the voice data to be segmented can be effectively identified and deleted, which denoises the data and improves its accuracy.
And the vector combination module 11 is used for performing standard deviation processing on the feature vectors to obtain a standard deviation vector, and combining the standard deviation vector with the initial zero vector. Standard deviation processing effectively eliminates the influence of differences in unit and scale between the feature vectors; combining the standard deviation vector with the initial zero vector acts as feature enhancement on the standard deviation vector and improves the accuracy of speaker segmentation. Optionally, a DNN can be added during the standard deviation processing for dimension adaptation, further improving the accuracy of the standard deviation vector; specifically, an LSTM network serving as a decoder combines the standard deviation vector and the initial Zero Vector and outputs the speaker number vector.
And the speaker number output module 12 is used for inputting the combined vector into the pre-trained speaker analysis model for speaker analysis to obtain the speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector. Feeding the input into the pre-trained speaker analysis model automatically yields the speaker number vector, based on which the number of speakers in the voice data to be segmented can be effectively determined.
Optionally, the speaker count output module 12 is further configured to: match each value in the speaker number vector against a preset value, and take the number of successful matches as the speaker count; the preset value can be set as required. Each dimension of the vector whose value exceeds a threshold (generally 0.5) is counted as one speaker, and the judgment stops at the first dimension below the threshold; the number of 1s in the final speaker number vector is then the number of speakers, e.g. (1, 1, 1, 0) and (1, 1, 0, 0) are the vectors for 3 and 2 speakers respectively.
Further, the speaker count output module 12 is further configured to: respectively acquire the phase-sensitive mask of each speaker in the voice data to be segmented, and acquire the voice frequency spectrum of the voice data to be segmented;
respectively determine the voice frequency spectrum of each speaker according to the phase-sensitive mask of each speaker and the voice frequency spectrum of the voice data to be segmented;
and respectively determining the speaking voice of each speaker in the voice data to be segmented according to the voice frequency spectrum of each speaker.
Further, the speaker count output module 12 is further configured to: and respectively acquiring a time-frequency mask of each speaking voice, and carrying out voice optimization on the speaking voice according to the time-frequency mask.
In this embodiment, standard deviation processing of the feature vectors effectively eliminates the influence of differences in unit and scale between them and improves their accuracy; combining the standard deviation vector with the initial zero vector acts as feature enhancement on the standard deviation vector and improves the accuracy of speaker segmentation; inputting the combined vector into a pre-trained speaker analysis model for speaker analysis automatically yields the speaker number vector, based on which the number of speakers in the voice data to be segmented can be effectively determined.
Example four
Fig. 5 is a block diagram of a terminal device 2 according to a fourth embodiment of the present application. As shown in fig. 5, the terminal device 2 of this embodiment includes: a processor 20, a memory 21 and a computer program 22, such as a program for a speaker segmentation method, stored in said memory 21 and executable on said processor 20. The processor 20, when executing the computer program 22, performs the steps of the various embodiments of the individual speaker segmentation method described above.
Illustratively, the computer program 22 may be partitioned into one or more modules, which are stored in the memory 21 and executed by the processor 20 to accomplish the present application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 22 in the terminal device 2. The terminal device may include, but is not limited to, a processor 20, a memory 21.
The Processor 20 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 21 may be an internal storage unit of the terminal device 2, such as a hard disk or a memory of the terminal device 2. The memory 21 may also be an external storage device of the terminal device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the terminal device 2. The memory 21 is used for storing the computer program and other programs and data required by the terminal device. The memory 21 may also be used to temporarily store data that has been output or is to be output.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated module, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer-readable storage medium, which may be non-volatile or volatile. Based on such understanding, all or part of the flow of the methods of the embodiments described above may be realized by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc. It should be noted that the content of the computer-readable storage medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable storage media exclude electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present application, and they should be construed as being included in the present application.
Claims (10)
1. A method for speaker segmentation, the method comprising:
acquiring voice data to be segmented, and extracting a feature vector in the voice data to be segmented;
performing standard deviation processing on the feature vector to obtain a standard deviation vector, and combining the standard deviation vector with an initial zero vector to obtain a combined vector; and
inputting the combined vector into a pre-trained speaker analysis model for speaker analysis to obtain a speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector.
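The vector construction in claim 1 can be sketched as follows. This is a minimal illustration: the function name, the per-frame feature layout, the zero-vector length, and the use of concatenation as the "combination" step are all assumptions, since the claim fixes none of them.

```python
import numpy as np

def build_model_input(frame_features: np.ndarray, num_zeros: int = 8) -> np.ndarray:
    """Standard-deviation processing plus zero-vector combination (claim 1).

    frame_features: (num_frames, feature_dim) matrix of per-frame feature
    vectors extracted from the voice data to be segmented.
    """
    std_vector = frame_features.std(axis=0)   # standard deviation per feature dimension
    zero_vector = np.zeros(num_zeros)         # initial zero vector
    # Combination is assumed to be concatenation; the claim only says "combine".
    return np.concatenate([std_vector, zero_vector])
```

The result would then be fed to the pre-trained speaker analysis model, whose architecture the claim leaves open.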
2. The method as claimed in claim 1, wherein the determining the number of speakers in the voice data to be segmented according to the speaker number vector comprises:
matching each value in the speaker number vector against a preset value, and taking the number of successful matches as the number of speakers.
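The counting rule of claim 2 can be illustrated as below; the preset value and the matching tolerance are illustrative parameters, not values fixed by the claim.

```python
import numpy as np

def count_speakers(speaker_number_vector: np.ndarray,
                   preset_value: float = 1.0,
                   tolerance: float = 0.5) -> int:
    """Count entries of the model output that match a preset value (claim 2).

    Each successful match contributes one speaker to the count. The preset
    value and tolerance here are assumed for illustration.
    """
    matches = np.abs(speaker_number_vector - preset_value) <= tolerance
    return int(matches.sum())
```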
3. The speaker segmentation method according to claim 1, wherein after the acquiring the voice data to be segmented, the method further comprises:
performing voice segmentation on the voice data to be segmented to obtain segmented voices, and respectively extracting voice features of the segmented voices;
respectively calculating the feature similarity between the voice features of different segmented voices, and merging the segmented voices according to the feature similarity.
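The similarity-based merging in claim 3 could look like the following sketch. Cosine similarity, the 0.8 threshold, and merging only consecutive segments are all assumptions; the claim specifies neither the similarity measure nor the merging rule.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_segments(segment_features, threshold: float = 0.8):
    """Group consecutive segmented voices whose features are similar (claim 3).

    segment_features: list of 1-D feature vectors, one per segmented voice.
    Returns lists of segment indices; indices in the same list are merged.
    """
    if not segment_features:
        return []
    groups = [[0]]
    for i in range(1, len(segment_features)):
        prev = segment_features[groups[-1][-1]]
        if cosine_similarity(prev, segment_features[i]) >= threshold:
            groups[-1].append(i)   # similar enough: merge with previous group
        else:
            groups.append([i])     # dissimilar: start a new group
    return groups
```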
4. The method as claimed in claim 1, wherein after the determining the number of speakers in the voice data to be segmented according to the speaker number vector, the method further comprises:
respectively acquiring a phase-sensitive mask of each speaker in the voice data to be segmented, and acquiring a voice spectrum of the voice data to be segmented;
respectively determining the voice spectrum of each speaker according to the phase-sensitive mask of each speaker and the voice spectrum of the voice data to be segmented;
and respectively determining the speaking voice of each speaker in the voice data to be segmented according to the voice spectrum of each speaker.
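The masking step of claim 4 is conventionally an element-wise multiplication of each speaker's mask with the mixture spectrum, as sketched below. The function name is hypothetical, and how the per-speaker masks are estimated is left open by the claim.

```python
import numpy as np

def separate_speakers(mixture_spectrum: np.ndarray, masks):
    """Recover each speaker's spectrum from the mixture spectrum (claim 4).

    mixture_spectrum: (freq_bins, frames) spectrum of the voice data to be
    segmented. masks: one phase-sensitive mask per speaker, same shape as
    the mixture. Masks are clipped to [0, 1] before applying, a common
    convention for phase-sensitive masks.
    """
    return [np.clip(mask, 0.0, 1.0) * mixture_spectrum for mask in masks]
```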
5. The speaker segmentation method according to claim 1, wherein after the acquiring the voice data to be segmented, the method further comprises:
filtering the voice data to be segmented, and pre-emphasizing the filtered voice data to be segmented; and
framing the pre-emphasized voice data to be segmented, and windowing the framed voice data to be segmented.
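The pre-emphasis, framing, and windowing chain of claim 5 can be sketched as follows. The coefficient 0.97 and the 400/160-sample frame length and hop (25 ms / 10 ms at 16 kHz) are customary defaults, not values fixed by the claim; the upstream filtering step is assumed to have already been applied.

```python
import numpy as np

def preprocess(signal: np.ndarray, frame_len: int = 400, hop: int = 160,
               alpha: float = 0.97) -> np.ndarray:
    """Pre-emphasis, framing, and Hamming windowing (claim 5)."""
    # Pre-emphasis: boost high frequencies with a first-order difference.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split into overlapping fixed-length frames.
    num_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(num_frames)])
    # Windowing: taper each frame to reduce spectral leakage.
    return frames * np.hamming(frame_len)
```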
6. The speaker segmentation method according to claim 5, wherein after the windowing of the framed voice data to be segmented, the method further comprises:
performing endpoint detection on the windowed voice data to be segmented, and denoising the voice data to be segmented according to the endpoint detection result.
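Claim 6 does not specify the endpoint detector; a minimal short-time-energy sketch, assuming windowed frames as input and a threshold set as a fraction of the peak frame energy, would be:

```python
import numpy as np

def detect_speech_frames(frames: np.ndarray, threshold_ratio: float = 0.1) -> np.ndarray:
    """Short-time-energy endpoint detection over windowed frames (claim 6).

    Returns a boolean mask of frames judged to contain speech; dropping the
    remaining frames is a simple form of denoising. The threshold_ratio is
    an illustrative assumption.
    """
    energy = (frames ** 2).sum(axis=1)          # short-time energy per frame
    return energy >= threshold_ratio * energy.max()
```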
7. The method as claimed in claim 4, wherein after the determining the speaking voice of each speaker in the voice data to be segmented according to the voice spectrum of each speaker, the method further comprises:
and respectively acquiring a time-frequency mask of each speaking voice, and carrying out voice optimization on the speaking voice according to the time-frequency mask.
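Claim 7 leaves the "voice optimization" step unspecified; one simple reading is to binarize each speaking voice's time-frequency mask and zero out bins treated as residual interference, as in this sketch (the 0.5 threshold and function name are assumptions):

```python
import numpy as np

def refine_with_tf_mask(speaker_spectrum: np.ndarray, tf_mask: np.ndarray) -> np.ndarray:
    """Apply a time-frequency mask to one speaker's spectrum (claim 7).

    Time-frequency bins where the mask is below 0.5 are treated as residual
    interference and suppressed.
    """
    binary_mask = (tf_mask >= 0.5).astype(speaker_spectrum.dtype)
    return speaker_spectrum * binary_mask
```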
8. A speaker segmentation system, the system comprising:
the feature extraction module is used for acquiring voice data to be segmented and extracting a feature vector from the voice data to be segmented;
the vector combination module is used for performing standard deviation processing on the feature vector to obtain a standard deviation vector, and combining the standard deviation vector with an initial zero vector to obtain a combined vector; and
the speaker number output module is used for inputting the combined vector into a pre-trained speaker analysis model for speaker analysis to obtain a speaker number vector, and determining the number of speakers in the voice data to be segmented according to the speaker number vector.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of a method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211085775.4A CN115457973A (en) | 2022-09-06 | 2022-09-06 | Speaker segmentation method, system, terminal and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115457973A (en) | 2022-12-09 |
Family
ID=84302937
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211085775.4A Pending CN115457973A (en) | 2022-09-06 | 2022-09-06 | Speaker segmentation method, system, terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115457973A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5687287A (en) * | 1995-05-22 | 1997-11-11 | Lucent Technologies Inc. | Speaker verification method and apparatus using mixture decomposition discrimination |
JP2004139049A (en) * | 2002-09-24 | 2004-05-13 | Matsushita Electric Ind Co Ltd | Speaker normalization method and speech recognition device using the same |
CN107393527A (en) * | 2017-07-17 | 2017-11-24 | 广东讯飞启明科技发展有限公司 | The determination methods of speaker's number |
CN108777146A (en) * | 2018-05-31 | 2018-11-09 | 平安科技(深圳)有限公司 | Speech model training method, method for distinguishing speek person, device, equipment and medium |
CN110176243A (en) * | 2018-08-10 | 2019-08-27 | 腾讯科技(深圳)有限公司 | Sound enhancement method, model training method, device and computer equipment |
CN111462762A (en) * | 2020-03-25 | 2020-07-28 | 清华大学 | Speaker vector regularization method and device, electronic equipment and storage medium |
CN112331216A (en) * | 2020-10-29 | 2021-02-05 | 同济大学 | Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN |
Non-Patent Citations (1)
Title |
---|
Ruchir Travadi et al.: "Total Variability Layer in Deep Neural Network Embeddings for Speaker Verification", IEEE Signal Processing Letters, vol. 26, no. 6, 11 April 2019 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117011924A (en) * | 2023-10-07 | 2023-11-07 | 之江实验室 | Method and system for estimating number of speakers based on voice and image |
CN117011924B (en) * | 2023-10-07 | 2024-02-13 | 之江实验室 | Method and system for estimating number of speakers based on voice and image |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109473123B (en) | Voice activity detection method and device | |
CN110956957B (en) | Training method and system of speech enhancement model | |
CN106486130B (en) | Noise elimination and voice recognition method and device | |
CN112053695A (en) | Voiceprint recognition method and device, electronic equipment and storage medium | |
CN110047519B (en) | Voice endpoint detection method, device and equipment | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
CN110428835B (en) | Voice equipment adjusting method and device, storage medium and voice equipment | |
CN108806707B (en) | Voice processing method, device, equipment and storage medium | |
US20200227069A1 (en) | Method, device and apparatus for recognizing voice signal, and storage medium | |
CN116490920A (en) | Method for detecting an audio challenge, corresponding device, computer program product and computer readable carrier medium for a speech input processed by an automatic speech recognition system | |
CN112599148A (en) | Voice recognition method and device | |
CN113112992B (en) | Voice recognition method and device, storage medium and server | |
CN115457973A (en) | Speaker segmentation method, system, terminal and storage medium | |
CN117727298B (en) | Deep learning-based portable computer voice recognition method and system | |
WO2021127990A1 (en) | Voiceprint recognition method based on voice noise reduction and related apparatus | |
CN114996489A (en) | Method, device and equipment for detecting violation of news data and storage medium | |
CN113782036B (en) | Audio quality assessment method, device, electronic equipment and storage medium | |
CN112002307B (en) | Voice recognition method and device | |
CN116229987B (en) | Campus voice recognition method, device and storage medium | |
CN110299133B (en) | Method for judging illegal broadcast based on keyword | |
CN108053834A (en) | audio data processing method, device, terminal and system | |
CN114420136A (en) | Method and device for training voiceprint recognition model and storage medium | |
EP3680901A1 (en) | A sound processing apparatus and method | |
CN107993666B (en) | Speech recognition method, speech recognition device, computer equipment and readable storage medium | |
CN111402898B (en) | Audio signal processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||