CN114974258A - Speaker separation method, device, equipment and storage medium based on voice processing - Google Patents

Speaker separation method, device, equipment and storage medium based on voice processing

Info

Publication number
CN114974258A
CN114974258A (Application No. CN202210891372.2A)
Authority
CN
China
Prior art keywords
voice
feature
speaker
processed
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210891372.2A
Other languages
Chinese (zh)
Other versions
CN114974258B (en)
Inventor
黄石磊
程刚
陈诚
廖晨
熊霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd filed Critical Shenzhen Raisound Technology Co ltd
Priority to CN202210891372.2A priority Critical patent/CN114974258B/en
Publication of CN114974258A publication Critical patent/CN114974258A/en
Application granted granted Critical
Publication of CN114974258B publication Critical patent/CN114974258B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to a speaker separation method, apparatus, device and storage medium based on voice processing. The method comprises the following steps: segmenting the voice to be processed according to speaker change point markers of the voice to be processed and preset time scales to obtain at least one voice segment set; generating voice features for each voice segment set; performing feature extraction and feature fusion operations on the voice features of each voice segment set based on a pre-constructed model to obtain a target feature matrix of each voice segment set; calculating a similarity matrix for each target feature matrix; performing a clustering operation on each similarity matrix based on a spectral clustering algorithm to obtain a clustering result for each voice segment set; and performing a voting operation on the clustering results of the voice segment sets to generate a target result of the voice to be processed. The method and device can accurately separate the speaker information in the voice to be processed and obtain the speaking start time points, speaking durations and/or speaker label information of the voice to be processed.

Description

Speaker separation method, device, equipment and storage medium based on voice processing
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speaker separation method, apparatus, device, and storage medium based on speech processing.
Background
At present, speaker separation technology classifies each frame of audio data in a recording by speaker, identifies the audio data belonging to the same speaker, and then labels the identified data with speaker information.
However, in practice existing speaker separation technology cannot handle some complex speech data (for example, speech in which multiple speakers talk at the same time), so the accuracy and robustness of the final separation result are poor.
Disclosure of Invention
In view of the foregoing, the present application provides a speaker separation method, apparatus, device and storage medium based on speech processing, which aim to improve the accuracy and robustness of the result of separating the speakers in the voice to be processed.
In a first aspect, the present application provides a speaker separation method based on speech processing, the method comprising:
acquiring a voice to be processed, and segmenting the voice to be processed according to a speaker change point mark of the voice to be processed and a preset time scale to obtain at least one voice segment set;
respectively generating voice features of each voice fragment set, and respectively performing feature extraction and feature fusion operations on the voice features of each voice fragment set based on a pre-constructed model to obtain a target feature matrix of each voice fragment set;
calculating a similarity matrix of each target feature matrix, and performing clustering operation on each similarity feature matrix based on a spectral clustering algorithm to obtain a clustering result of each voice fragment set;
and executing voting operation on the clustering result of each voice segment set to generate a target result of the voice to be processed, wherein the target result comprises the speaker starting time point, the speaking duration and/or the speaker tag information of the voice to be processed.
In a second aspect, the present application provides a speaker separating apparatus based on speech processing, comprising:
a segmentation module: configured to acquire a voice to be processed, and segment the voice to be processed according to speaker change point markers of the voice to be processed and preset time scales to obtain at least one voice segment set;
an extraction module: configured to respectively generate voice features of each voice segment set, and respectively perform feature extraction and feature fusion operations on the voice features of each voice segment set based on a pre-constructed model to obtain a target feature matrix of each voice segment set;
a clustering module: configured to calculate a similarity matrix of each target feature matrix, and perform a clustering operation on each similarity matrix based on a spectral clustering algorithm to obtain a clustering result of each voice segment set;
a voting module: configured to perform a voting operation on the clustering result of each voice segment set to generate a target result of the voice to be processed, wherein the target result comprises the speaker start time point, speaking duration and/or speaker tag information of the voice to be processed.
In a third aspect, the present application provides an electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the speaker separation method based on speech processing according to any embodiment of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speaker separation method based on speech processing according to any one of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
according to the speaker separation method, device, equipment and storage medium based on voice processing, voice to be processed is segmented through speaker change point marks of the voice to be processed and preset time scales, voice fragment sets of multiple scales can be obtained for subsequent feature fusion, voice features of each voice fragment set are generated, feature extraction and feature fusion operations are respectively performed on the voice features of each voice fragment set according to a pre-established model, shallow features and deep features in the voice fragment sets can be fused, clustering operations are performed on the basis of the fused features by using a spectral clustering algorithm to obtain clustering results of each voice fragment set, voting operations are performed on each clustering result, speaker change points and the clustering results of the voice fragment sets segmented by a preset time scale can be fused, and a speaker starting time point with more accurate separation results and higher robustness can be obtained, A speaking duration and/or speaker tag information.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below; other drawings can be obtained by those skilled in the art from these drawings without inventive effort.
FIG. 1 is a flowchart illustrating a speaker separation method based on speech processing according to a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of a network structure of a pre-constructed model according to the present application;
FIG. 3 is a schematic diagram of an RTTM file corresponding to a clustering result of the present application;
FIG. 4 is a schematic diagram of generating the target result of the voice to be processed according to the present application;
FIG. 5 is a block diagram of a speaker separation apparatus according to a preferred embodiment of the present application based on speech processing;
FIG. 6 is a schematic view of an electronic device according to a preferred embodiment of the present application;
the implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the descriptions referring to "first", "second", etc. in this application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that the combination can be realized by a person skilled in the art; where a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present application.
The application provides a speaker separation method based on voice processing. Referring to fig. 1, a flowchart of a speaker separation method based on speech processing according to an embodiment of the present invention is shown. The method may be performed by an electronic device, which may be implemented by software and/or hardware. The speaker separation method based on the voice processing comprises the following steps:
step S10: acquiring a voice to be processed, and segmenting the voice to be processed according to a speaker change point mark of the voice to be processed and a preset time scale to obtain at least one voice segment set;
in this embodiment, the to-be-processed speech may be speech that needs to identify an identity of a speaker in the speech, or identify information such as a start point of speaking time and speaking duration of each speaker in the speech, for example, the to-be-processed speech may be speech recorded in a conference process, speech recorded by a law enforcement recorder, or the like. The speaker change point mark of the voice to be processed can refer to a real time point mark of the voice to be processed, the time point mark of the voice to be processed can be obtained according to an annotation file of the voice to be processed and used as the speaker change point mark, the voice to be processed can be segmented through the speaker change point mark to obtain a voice segment set, and then, the voice segment set is uniformly segmented by adopting preset different time scales on the basis of the voice segment set to obtain another voice segment set. By adopting a method of combining the changing point based on the speaker and the uniform segmentation, the mute part in the voice to be processed can be deleted, and the voice information with different scales can be obtained at the same time.
In one embodiment, the segmenting the to-be-processed speech according to the speaker change point markers of the to-be-processed speech and a preset time scale to obtain at least one speech segment set includes:
detecting the speaker change point mark of each voice segment in the voice to be processed by utilizing a voice endpoint detection algorithm;
segmenting the voice to be processed based on the time point marks to obtain a first voice fragment set;
and uniformly dividing the first voice fragment set according to a preset time scale to obtain a second voice fragment set.
The speaker change point markers of each speech segment in the speech to be processed can be detected by using a speech endpoint detection algorithm, and the speech to be processed is segmented based on the time point markers to obtain a first speech segment set, for example, the segmented first speech segment set contains speech segments [ I, J, K ]. The first set of speech segments is then uniformly divided into a second set of speech segments, e.g., the second set of speech segments includes speech segments [ I1, I2, J1, J2, K1, K2 ].
Further, the uniformly segmenting the first voice segment set according to a preset time scale to obtain a second voice segment set includes:
respectively segmenting each voice segment in the first voice segment set by adopting a plurality of time scales to obtain a plurality of sub voice segments corresponding to each voice segment;
and summarizing a plurality of sub voice sections corresponding to each voice section to obtain the second voice section set.
Specifically, three time scales, namely window length 1.0 s with frame shift 0.25 s, window length 1.0 s with frame shift 0.50 s, and window length 1.5 s with frame shift 0.75 s, can be adopted to segment each voice segment in the first voice segment set, obtaining a plurality of sub voice segments corresponding to each voice segment; the sub voice segments corresponding to each voice segment are then gathered to obtain the second voice segment set. For example, if the first voice segment set contains the voice segments [ I, J, K ], segmenting voice segment I with the three different time scales yields sub voice segments I1, I2 and I3, segmenting voice segment J yields sub voice segments J1, J2 and J3, and segmenting voice segment K yields sub voice segments K1, K2 and K3, so the second voice segment set contains the voice segments [ I1, I2, I3, J1, J2, J3, K1, K2, K3 ]. For a fixed window length, reducing the frame shift can lower the speaker separation error rate to some extent, and among different window lengths, longer windows give relatively more accurate speaker representations; the method combining speaker change points with uniform segmentation therefore obtains voice information at different scales.
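The two-stage segmentation described above can be sketched as follows. This is a minimal illustration rather than the patent's reference implementation: the function names and the handling of the final (shorter) tail window are assumptions, while the three (window, frame shift) scales come from the description.

```python
# Two-stage segmentation sketch: cut the recording at speaker change points,
# then slide several window/shift scales over each resulting segment.

def split_at_change_points(duration, change_points):
    """Cut [0, duration) at the given change-point timestamps (seconds)."""
    bounds = [0.0] + sorted(change_points) + [duration]
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e - s > 0]

def uniform_subsegments(segment, window, shift):
    """Slide a fixed window over one segment; the tail is kept as a shorter piece."""
    start, end = segment
    subs, t = [], start
    while t < end:
        subs.append((t, min(t + window, end)))
        if t + window >= end:
            break
        t += shift
    return subs

# The three scales mentioned in the description: (window, frame shift) in seconds.
SCALES = [(1.0, 0.25), (1.0, 0.50), (1.5, 0.75)]

def multi_scale_segments(duration, change_points):
    first_set = split_at_change_points(duration, change_points)   # e.g. [I, J, K]
    second_set = {f"{w}/{s}": [sub for seg in first_set
                               for sub in uniform_subsegments(seg, w, s)]
                  for (w, s) in SCALES}
    return first_set, second_set
```

Keeping the sub-segments grouped per scale, as above, makes it straightforward to produce one clustering hypothesis per scale for the later voting step.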
Step S20: respectively generating voice features of each voice fragment set, and respectively performing feature extraction and feature fusion operations on the voice features of each voice fragment set based on a pre-constructed model to obtain a target feature matrix of each voice fragment set;
in this embodiment, each obtained voice fragment set may be stored in a (Rich Transcription Time Marked, RTTM) file, a spectrogram is generated through discrete fourier transform for each voice fragment set (which may be an audio in a wav format, and has a sampling rate of 16000) according to the RTTM file, a filterbank is generated through a Mel filter, an 80-dimensional log Mel filterbank feature is obtained after a log is taken, and an MCFF feature may also be obtained through discrete cosine transform. And then, respectively performing feature extraction and feature fusion operation on the voice features of each voice segment set by using the model to obtain a target feature matrix of each voice segment set.
In an embodiment, the performing, based on the pre-constructed model, feature extraction and feature fusion operations on the speech features of each speech segment set to obtain a target feature matrix of each speech segment set includes:
respectively utilizing the multilayer feature extraction network of the model to perform feature extraction operation on the voice features of each voice fragment set to obtain a plurality of initial feature matrixes of each voice fragment set;
and respectively executing fusion operation on the plurality of initial feature matrixes of each voice fragment set by utilizing the feature fusion network of the model to obtain a target feature matrix of each voice fragment set.
FIG. 2 is a schematic diagram of the network structure of the pre-constructed model of the present application. The model has 5 re-parameterized network modules (i.e., feature extraction network modules), and the voice features extracted by these 5 modules become progressively richer from the shallow layers to the deep layers, so that rich voice features can be obtained. Each layer of the feature extraction network outputs a feature matrix, which is stored in the multi-layer feature fusion layer and at the same time passed to the next feature extraction network module for further feature extraction, so that 5 feature matrices are obtained. The multi-layer feature fusion layer combines the 5 feature matrices corresponding to each voice segment set into one feature matrix by rows, and the target feature matrix of each voice segment set is obtained after attentive statistics pooling, a fully connected layer, normalization and an additive angular margin (AAM) loss activation. The multi-layer feature fusion effectively exploits both the shallow and the deep features of the voice, so that a more robust output result is obtained subsequently.
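As a rough illustration of this backbone, the following PyTorch sketch keeps the output of every feature-extraction block, concatenates the outputs row-wise, applies attention-weighted statistics pooling and projects to a normalized speaker embedding. The channel widths, the use of 1-D convolutions and the pooling details are assumptions; the AAM-softmax classifier used during training is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """One feature-extraction block (3x3-style conv + BN + ReLU on 1-D features)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm1d(c_out)

    def forward(self, x):                       # x: (B, C, T)
        return F.relu(self.bn(self.conv(x)))

class AttentiveStatsPool(nn.Module):
    """Attention-weighted mean and standard deviation over the time axis."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, T)
        w = torch.softmax(self.attn(x), dim=-1)
        mu = (w * x).sum(dim=-1)
        sigma = ((w * (x - mu.unsqueeze(-1)) ** 2).sum(dim=-1) + 1e-6).sqrt()
        return torch.cat([mu, sigma], dim=1)    # (B, 2C)

class FusionEmbedder(nn.Module):
    def __init__(self, feat_dim=80, width=256, emb_dim=192, n_blocks=5):
        super().__init__()
        dims = [feat_dim] + [width] * n_blocks
        self.blocks = nn.ModuleList(ConvBlock(dims[i], dims[i + 1])
                                    for i in range(n_blocks))
        self.pool = AttentiveStatsPool(width * n_blocks)
        self.fc = nn.Linear(2 * width * n_blocks, emb_dim)

    def forward(self, feats):                   # feats: (B, 80, T) log-mel frames
        outs, x = [], feats
        for block in self.blocks:
            x = block(x)
            outs.append(x)                      # keep every layer's output for fusion
        fused = torch.cat(outs, dim=1)          # row-wise (channel) concatenation
        emb = self.fc(self.pool(fused))
        return F.normalize(emb, dim=-1)         # length-normalised speaker embedding
```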
Further, the performing feature extraction operations on the speech features of each speech segment set by using the multi-layer feature extraction networks of the models respectively to obtain a plurality of initial feature matrices of each speech segment set includes:
respectively inputting the voice features of each voice fragment set into a first-layer feature extraction network of the model to obtain a first feature matrix of each voice fragment set;
inputting the first feature matrix of each voice fragment set into a second-layer feature extraction network of the model to obtain a second feature matrix of each voice fragment set;
inputting the second feature matrix of each voice fragment set into a third layer feature extraction network of the model to obtain a third feature matrix of each voice fragment set;
inputting the third feature matrix of each voice fragment set into a fourth-layer feature extraction network of the model to obtain a fourth feature matrix of each voice fragment set;
inputting the fourth feature matrix of each voice segment set into the fifth-layer feature extraction network of the model to obtain a fifth feature matrix of each voice segment set;
and taking the first feature matrix, the second feature matrix, the third feature matrix, the fourth feature matrix and the fifth feature matrix of each voice segment set as a plurality of initial feature matrices of the voice segment set.
Each layer of the feature extraction network can adopt 3 × 3 convolution kernels, whose computational density (computation amount divided by the time used) can be about four times that of 1 × 1 and 5 × 5 convolution kernels. The single-path architecture adopted for feature extraction is more computationally efficient and uses less memory, which reduces the occupation of storage units and gives the model a more efficient inference rate.
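The "re-parameterized" modules and the single-path inference architecture mentioned above can be illustrated with a small, hedged sketch of structural re-parameterization: branches trained in parallel are folded into one 3 × 3 convolution so that inference uses a single path. BatchNorm fusion is omitted for brevity (an assumption); the folding of the 1 × 1 and identity branches into the 3 × 3 kernel shown here is exact.

```python
import torch
import torch.nn.functional as F

def merge_branches(w3, b3, w1, b1, channels):
    """Fold parallel 3x3 + 1x1 + identity branches into one 3x3 kernel and bias."""
    w = w3.clone()
    w[:, :, 1:2, 1:2] += w1                      # 1x1 branch lands in the kernel centre
    for c in range(channels):                    # identity branch (in == out channels)
        w[c, c, 1, 1] += 1.0
    return w, b3 + b1

channels, t = 4, 10
x = torch.randn(1, channels, 3, t)
w3, b3 = torch.randn(channels, channels, 3, 3), torch.randn(channels)
w1, b1 = torch.randn(channels, channels, 1, 1), torch.randn(channels)

multi = (F.conv2d(x, w3, b3, padding=1)
         + F.conv2d(x, w1, b1, padding=0)
         + x)                                    # training-time multi-branch output
w, b = merge_branches(w3, b3, w1, b1, channels)
single = F.conv2d(x, w, b, padding=1)            # inference-time single-path output
print(torch.allclose(multi, single, atol=1e-5))  # True: the two paths are equivalent
```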
Step S30: calculating a similarity matrix of each target feature matrix, and performing clustering operation on each similarity feature matrix based on a spectral clustering algorithm to obtain a clustering result of each voice fragment set;
in this embodiment, the similarity matrix of each target feature matrix is calculated by using a cosine similarity algorithm, and then a smaller value in the similarity matrix is deleted, so as to pay more attention to the highlighted value. And then, clustering each similarity characteristic matrix according to a spectral clustering algorithm to obtain a clustering result of each voice fragment set, distributing time marks to corresponding speaker labels after clustering is finished, and generating a plurality of RTTM files comprising speaker starting time points, duration and speaker labels.
FIG. 3 is a schematic diagram of the RTTM file corresponding to a clustering result of the present application. Taking the first row of FIG. 3 as an example, "0.000" represents the speaker start time point, "2.995" represents the speaking duration, and "1" represents the speaker label, i.e., which speaker is talking.
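For reference, RTTM files conventionally use the space-separated layout SPEAKER <file-id> <channel> <start> <duration> <NA> <NA> <speaker-label> <NA> <NA>. The two lines below are illustrative only: the first is consistent with the values read from FIG. 3, while the file id and the second line are invented for illustration.

```
SPEAKER <file-id> 1 0.000 2.995 <NA> <NA> 1 <NA> <NA>
SPEAKER <file-id> 1 2.995 1.240 <NA> <NA> 2 <NA> <NA>
```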
Specifically, the clustering operation performed on each similarity feature matrix based on the spectral clustering algorithm to obtain a clustering result of each voice fragment set includes:
and constructing a Laplace matrix according to each similarity matrix, acquiring first K eigenvectors of each Laplace matrix, and constructing a matrix corresponding to each first K eigenvectors, wherein the first K eigenvectors are obtained by prediction based on a maximum feature gap algorithm. And carrying out clustering operation on the matrix corresponding to each K eigenvectors based on a K mean algorithm to obtain a clustering result of each voice fragment set.
Step S40: and executing voting operation on the clustering result of each voice segment set to generate a target result of the voice to be processed, wherein the target result comprises the speaker starting time point, the speaking duration and/or the speaker tag information of the voice to be processed.
In this embodiment, in order to make the output result more robust, an overlap-aware voting algorithm may be adopted to fuse the clustering results of the voice segment sets by voting.
Specifically, the performing a voting operation on the clustering result of each speech segment set to generate a target result of the to-be-processed speech includes:
reading the label file corresponding to the clustering result of each voice segment set, retaining the non-overlapping parts of the speaker time points in each label file, voting on the overlapping parts of the speaker time points in the label files and retaining the result with the most votes, and generating the target result of the voice to be processed.
The RTTM file corresponding to the clustering result of each voice segment set is read, and the RTTM files are mapped onto a common timeline for voting: the non-overlapping parts of the time markers are retained directly, and in the overlapping parts the result given by the majority of the labels is retained. FIG. 4 is a schematic diagram of generating the target result of the voice to be processed. From time point T0 to T1, the time markers for speaker A and speaker B do not overlap, so this part is retained directly. At time point T1, Hypothesis 1 and Hypothesis 2 label the segment as speaker A, while in Hypothesis 3 speaker A and speaker B overlap, so the majority result (speaker A speaking) is retained at T1. At time point T2, speaker A and speaker B overlap in Hypothesis 1, Hypothesis 2 labels the segment as speaker A, and speaker A and speaker B also overlap in Hypothesis 3, so the majority result (speaker A and speaker B overlapping) is retained at T2. The target result of the voice to be processed is thereby obtained, and the RTTM file corresponding to the target result is generated.
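A minimal sketch of this voting step is given below: each hypothesis (one per segmentation scale) is rasterized onto a common timeline, non-overlapping regions are kept as-is, and disagreements keep the majority label. The 10 ms voting grid and the tie-breaking behaviour of Counter.most_common are assumptions, and the sketch assigns a single label per frame, whereas the overlap-aware algorithm described above can also retain overlapping speakers.

```python
from collections import Counter

def vote(hypotheses, duration, step=0.01):
    """hypotheses: list of lists of (start, dur, speaker) tuples, one list per scale."""
    n = int(round(duration / step))
    grids = []
    for hyp in hypotheses:
        grid = [None] * n
        for start, dur, spk in hyp:
            lo, hi = int(round(start / step)), int(round((start + dur) / step))
            for i in range(lo, min(n, hi)):
                grid[i] = spk
        grids.append(grid)

    final = []
    for i in range(n):
        votes = [g[i] for g in grids if g[i] is not None]
        final.append(Counter(votes).most_common(1)[0][0] if votes else None)

    # Merge consecutive frames with the same label back into (start, dur, speaker) rows.
    result, i = [], 0
    while i < n:
        if final[i] is None:
            i += 1
            continue
        j = i
        while j < n and final[j] == final[i]:
            j += 1
        result.append((i * step, (j - i) * step, final[i]))
        i = j
    return result
```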
Referring to fig. 5, a functional block diagram of the speaker separation apparatus 100 based on speech processing according to the present invention is shown.
The speaker separating apparatus 100 based on speech processing according to the present application may be installed in an electronic device. According to the implemented functions, the speaker separating apparatus 100 based on speech processing may include a segmentation module 110, an extraction module 120, a clustering module 130, and a voting module 140. A module, which may also be referred to as a unit in this application, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the segmentation module 110: configured to acquire a voice to be processed, and segment the voice to be processed according to speaker change point markers of the voice to be processed and preset time scales to obtain at least one voice segment set;
the extraction module 120: configured to respectively generate voice features of each voice segment set, and respectively perform feature extraction and feature fusion operations on the voice features of each voice segment set based on a pre-constructed model to obtain a target feature matrix of each voice segment set;
the clustering module 130: configured to calculate a similarity matrix of each target feature matrix, and perform a clustering operation on each similarity matrix based on a spectral clustering algorithm to obtain a clustering result of each voice segment set;
the voting module 140: configured to perform a voting operation on the clustering result of each voice segment set to generate a target result of the voice to be processed, wherein the target result comprises the speaker start time point, speaking duration and/or speaker tag information of the voice to be processed.
In one embodiment, the segmenting the to-be-processed speech according to the speaker change point markers of the to-be-processed speech and a preset time scale to obtain at least one speech segment set includes:
detecting the speaker change point mark of each voice segment in the voice to be processed by utilizing a voice endpoint detection algorithm;
segmenting the voice to be processed based on the time point marks to obtain a first voice fragment set;
and uniformly dividing the first voice fragment set according to a preset time scale to obtain a second voice fragment set.
In an embodiment, the uniformly segmenting the first speech segment set according to a preset time scale to obtain a second speech segment set includes:
respectively segmenting each voice segment in the first voice segment set by adopting a plurality of time scales to obtain a plurality of sub voice segments corresponding to each voice segment;
and summarizing a plurality of sub voice sections corresponding to each voice section to obtain the second voice section set.
In an embodiment, the performing feature extraction and feature fusion operations on the speech features of each speech segment set based on a pre-constructed model to obtain a target feature matrix of each speech segment set includes:
respectively utilizing the multilayer feature extraction network of the model to perform feature extraction operation on the voice features of each voice fragment set to obtain a plurality of initial feature matrixes of each voice fragment set;
and respectively executing fusion operation on the plurality of initial feature matrixes of each voice fragment set by utilizing the feature fusion network of the model to obtain a target feature matrix of each voice fragment set.
In one embodiment, the performing, by using the multi-layer feature extraction network of the model respectively, a feature extraction operation on the speech features of each speech segment set to obtain a plurality of initial feature matrices of each speech segment set includes:
respectively inputting the voice features of each voice fragment set into a first-layer feature extraction network of the model to obtain a first feature matrix of each voice fragment set;
inputting the first feature matrix of each voice fragment set into a second-layer feature extraction network of the model to obtain a second feature matrix of each voice fragment set;
inputting the second feature matrix of each voice fragment set into a third layer feature extraction network of the model to obtain a third feature matrix of each voice fragment set;
inputting the third feature matrix of each voice fragment set into a fourth-layer feature extraction network of the model to obtain a fourth feature matrix of each voice fragment set;
inputting the fourth feature matrix of each voice segment set into the fifth-layer feature extraction network of the model to obtain a fifth feature matrix of each voice segment set;
and taking the first feature matrix, the second feature matrix, the third feature matrix, the fourth feature matrix and the fifth feature matrix of each voice segment set as a plurality of initial feature matrices of the voice segment set.
In an embodiment, the clustering operation performed on each similarity feature matrix based on the spectral clustering algorithm to obtain a clustering result of each speech segment set includes:
constructing a Laplacian matrix according to each similarity matrix;
acquiring the first K eigenvectors of each Laplacian matrix, and constructing a matrix corresponding to the first K eigenvectors, wherein K is predicted based on a maximum eigengap algorithm;
and performing a clustering operation on the matrix corresponding to the K eigenvectors based on the K-means algorithm to obtain the clustering result of each voice segment set.
In one embodiment, the performing a voting operation on the clustering result of each speech segment set to generate a target result of the to-be-processed speech includes:
reading the mark files corresponding to the clustering result of each voice fragment set, reserving the non-overlapping parts of the speaker time points in each mark file, voting the overlapping parts of the speaker time points in the mark files, reserving the result with the highest vote number, and generating the target result of the voice to be processed.
Fig. 6 is a schematic diagram of an electronic device 1 according to a preferred embodiment of the present application.
The electronic device 1 includes but is not limited to: a memory 11, a processor 12, a display 13 and a communication interface 14. The electronic device 1 is connected to a network via the communication interface 14. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another communication network.
The memory 11 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 11 may be an internal storage unit of the electronic device 1, such as a hard disk or a memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like equipped with the electronic device 1. Of course, the memory 11 may also comprise both an internal memory unit and an external memory device of the electronic device 1. In this embodiment, the memory 11 is generally used for storing an operating system installed in the electronic device 1 and various application software, such as a program code of the speaker separation program 10 based on speech processing. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 12 is typically used to control the overall operation of the electronic device 1, such as performing control and processing related to data interaction or communication. In this embodiment, the processor 12 is configured to execute program code stored in the memory 11 or to process data, such as the program code of the speaker separation program 10 based on speech processing.
The display 13 may be referred to as a display screen or display unit. In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like. The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visual work interface.
The communication interface 14 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface), the communication interface 14 typically being used for establishing a communication connection between the electronic device 1 and other devices.
Fig. 6 shows only the electronic device 1 with the components 11-14 and the speaker separation program 10 based on speech processing, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further include a user interface, the user interface may include a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further include a standard wired interface and a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
The electronic device 1 may further include a Radio Frequency (RF) circuit, a sensor, an audio circuit, and the like, which are not described in detail herein.
In the above embodiment, the processor 12, when executing the speaker separation program 10 based on speech processing stored in the memory 11, can realize the following steps:
acquiring a voice to be processed, and segmenting the voice to be processed according to a speaker change point mark of the voice to be processed and a preset time scale to obtain at least one voice fragment set;
respectively generating voice features of each voice fragment set, and respectively performing feature extraction and feature fusion operations on the voice features of each voice fragment set based on a pre-constructed model to obtain a target feature matrix of each voice fragment set;
calculating a similarity matrix of each target feature matrix, and performing clustering operation on each similarity feature matrix based on a spectral clustering algorithm to obtain a clustering result of each voice fragment set;
and executing voting operation on the clustering result of each voice segment set to generate a target result of the voice to be processed, wherein the target result comprises the speaker starting time point, the speaking duration and/or the speaker tag information of the voice to be processed.
The storage device may be the memory 11 of the electronic device 1, or may be another storage device communicatively connected to the electronic device 1.
For a detailed description of the above steps, please refer to the above description of fig. 5 regarding a functional block diagram of an embodiment of the speaker separation apparatus 100 based on speech processing and fig. 1 regarding a flowchart of an embodiment of a speaker separation method based on speech processing.
In addition, the embodiment of the present application also provides a computer-readable storage medium, which may be non-volatile or volatile. The computer readable storage medium may be any one or any combination of hard disks, multimedia cards, SD cards, flash memory cards, SMCs, Read Only Memories (ROMs), Erasable Programmable Read Only Memories (EPROMs), portable compact disc read only memories (CD-ROMs), USB memories, etc. The computer readable storage medium includes a data storage area and a program storage area, the program storage area stores a speaker separation program 10 based on speech processing, and the speaker separation program 10 based on speech processing realizes the following operations when being executed by a processor:
acquiring a voice to be processed, and segmenting the voice to be processed according to a speaker change point mark of the voice to be processed and a preset time scale to obtain at least one voice segment set;
respectively generating voice features of each voice fragment set, and respectively performing feature extraction and feature fusion operations on the voice features of each voice fragment set based on a pre-constructed model to obtain a target feature matrix of each voice fragment set;
calculating a similarity matrix of each target feature matrix, and performing clustering operation on each similarity feature matrix based on a spectral clustering algorithm to obtain a clustering result of each voice fragment set;
and executing voting operation on the clustering result of each voice segment set to generate a target result of the voice to be processed, wherein the target result comprises the speaker starting time point, the speaking duration and/or the speaker tag information of the voice to be processed.
The embodiment of the computer-readable storage medium of the present application is substantially the same as the embodiment of the speaker separation method based on speech processing, and will not be described herein again.
It should be noted that the above-mentioned serial numbers of the embodiments of the present application are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, an electronic device, or a network device) to execute the method according to the embodiments of the present application.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (10)

1. A speaker separation method based on speech processing, the method comprising:
acquiring a voice to be processed, and segmenting the voice to be processed according to a speaker change point mark of the voice to be processed and a preset time scale to obtain at least one voice segment set;
respectively generating voice features of each voice fragment set, and respectively performing feature extraction and feature fusion operation on the voice features of each voice fragment set based on a pre-constructed model to obtain a target feature matrix of each voice fragment set;
calculating a similarity matrix of each target feature matrix, and performing clustering operation on each similarity feature matrix based on a spectral clustering algorithm to obtain a clustering result of each voice fragment set;
and executing voting operation on the clustering result of each voice segment set to generate a target result of the voice to be processed, wherein the target result comprises the speaker starting time point, the speaking duration and/or the speaker tag information of the voice to be processed.
2. The method as claimed in claim 1, wherein the segmenting the to-be-processed speech according to the speaker variation point markers of the to-be-processed speech and a preset time scale to obtain at least one speech segment set comprises:
detecting the speaker change point mark of each voice segment in the voice to be processed by utilizing a voice endpoint detection algorithm;
segmenting the voice to be processed based on the time point marks to obtain a first voice fragment set;
and uniformly dividing the first voice fragment set according to a preset time scale to obtain a second voice fragment set.
3. The method as claimed in claim 2, wherein the uniformly dividing the first speech segment set according to a predetermined time scale to obtain a second speech segment set comprises:
respectively segmenting each voice segment in the first voice segment set by adopting a plurality of time scales to obtain a plurality of sub voice segments corresponding to each voice segment;
and summarizing a plurality of sub voice sections corresponding to each voice section to obtain the second voice section set.
4. The method for separating speakers based on speech processing as claimed in claim 1, wherein said performing feature extraction and feature fusion operations on the speech features of each speech segment set based on the pre-constructed model to obtain the target feature matrix of each speech segment set comprises:
respectively utilizing the multilayer feature extraction network of the model to perform feature extraction operation on the voice features of each voice fragment set to obtain a plurality of initial feature matrixes of each voice fragment set;
and respectively executing fusion operation on the plurality of initial feature matrixes of each voice fragment set by utilizing the feature fusion network of the model to obtain a target feature matrix of each voice fragment set.
5. The method as claimed in claim 4, wherein said performing feature extraction on the speech features of each speech segment set by using the multi-layer feature extraction network of the model to obtain a plurality of initial feature matrices of each speech segment set comprises:
respectively inputting the voice features of each voice fragment set into a first-layer feature extraction network of the model to obtain a first feature matrix of each voice fragment set;
inputting the first feature matrix of each voice fragment set into a second-layer feature extraction network of the model to obtain a second feature matrix of each voice fragment set;
inputting the second feature matrix of each voice fragment set into a third layer feature extraction network of the model to obtain a third feature matrix of each voice fragment set;
inputting the third feature matrix of each voice fragment set into a fourth-layer feature extraction network of the model to obtain a fourth feature matrix of each voice fragment set;
inputting the fourth feature matrix of each voice segment set into the fifth-layer feature extraction network of the model to obtain a fifth feature matrix of each voice segment set;
and taking the first feature matrix, the second feature matrix, the third feature matrix, the fourth feature matrix and the fifth feature matrix of each voice segment set as a plurality of initial feature matrices of the voice segment set.
6. The method of claim 1, wherein the clustering operation performed on each similarity feature matrix based on the spectral clustering algorithm to obtain a clustering result for each speech segment set comprises:
constructing a Laplacian matrix according to each similarity matrix;
acquiring the first K eigenvectors of each Laplacian matrix, and constructing a matrix corresponding to the first K eigenvectors, wherein K is predicted based on a maximum eigengap algorithm;
and performing a clustering operation on the matrix corresponding to the K eigenvectors based on the K-means algorithm to obtain the clustering result of each voice segment set.
7. The method as claimed in claim 1, wherein the performing a voting operation on the clustering result of each speech segment set to generate the target result of the speech to be processed comprises:
reading the mark files corresponding to the clustering result of each voice fragment set, reserving the non-overlapping parts of the speaker time points in each mark file, voting the overlapping parts of the speaker time points in the mark files, reserving the result with the highest vote number, and generating the target result of the voice to be processed.
8. A speaker separation apparatus based on speech processing, the apparatus comprising:
a segmentation module: configured to acquire a voice to be processed, and segment the voice to be processed according to speaker change point markers of the voice to be processed and preset time scales to obtain at least one voice segment set;
an extraction module: configured to respectively generate voice features of each voice segment set, and respectively perform feature extraction and feature fusion operations on the voice features of each voice segment set based on a pre-constructed model to obtain a target feature matrix of each voice segment set;
a clustering module: configured to calculate a similarity matrix of each target feature matrix, and perform a clustering operation on each similarity matrix based on a spectral clustering algorithm to obtain a clustering result of each voice segment set;
a voting module: configured to perform a voting operation on the clustering result of each voice segment set to generate a target result of the voice to be processed, wherein the target result comprises the speaker start time point, speaking duration and/or speaker tag information of the voice to be processed.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the speaker separation method based on speech processing according to any one of claims 1 to 7 when executing a program stored in a memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for speaker separation based on speech processing according to any one of claims 1 to 7.
CN202210891372.2A 2022-07-27 2022-07-27 Speaker separation method, device, equipment and storage medium based on voice processing Active CN114974258B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210891372.2A CN114974258B (en) 2022-07-27 2022-07-27 Speaker separation method, device, equipment and storage medium based on voice processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210891372.2A CN114974258B (en) 2022-07-27 2022-07-27 Speaker separation method, device, equipment and storage medium based on voice processing

Publications (2)

Publication Number Publication Date
CN114974258A (en) 2022-08-30
CN114974258B CN114974258B (en) 2022-12-16

Family

ID=82969333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210891372.2A Active CN114974258B (en) 2022-07-27 2022-07-27 Speaker separation method, device, equipment and storage medium based on voice processing

Country Status (1)

Country Link
CN (1) CN114974258B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091984A (en) * 2023-04-12 2023-05-09 中国科学院深圳先进技术研究院 Video object segmentation method, device, electronic equipment and storage medium

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2048656A1 (en) * 2007-10-10 2009-04-15 Harman/Becker Automotive Systems GmbH Speaker recognition
US20110106531A1 (en) * 2009-10-30 2011-05-05 Sony Corporation Program endpoint time detection apparatus and method, and program information retrieval system
CN102682760A (en) * 2011-03-07 2012-09-19 株式会社理光 Overlapped voice detection method and system
CN105913849A (en) * 2015-11-27 2016-08-31 中国人民解放军总参谋部陆航研究所 Event detection based speaker segmentation method
CN108198547A (en) * 2018-01-18 2018-06-22 深圳市北科瑞声科技股份有限公司 Sound end detecting method, device, computer equipment and storage medium
CN109346104A (en) * 2018-08-29 2019-02-15 昆明理工大学 A kind of audio frequency characteristics dimension reduction method based on spectral clustering
CN110543822A (en) * 2019-07-29 2019-12-06 浙江理工大学 finger vein identification method based on convolutional neural network and supervised discrete hash algorithm
CN111063341A (en) * 2019-12-31 2020-04-24 苏州思必驰信息科技有限公司 Method and system for segmenting and clustering multi-person voice in complex environment
CN111312256A (en) * 2019-10-31 2020-06-19 平安科技(深圳)有限公司 Voice identity recognition method and device and computer equipment
CN113362831A (en) * 2021-07-12 2021-09-07 科大讯飞股份有限公司 Speaker separation method and related equipment thereof
US20210280169A1 (en) * 2020-03-03 2021-09-09 International Business Machines Corporation Metric learning of speaker diarization
CN113851136A (en) * 2021-09-26 2021-12-28 平安科技(深圳)有限公司 Clustering-based speaker recognition method, device, equipment and storage medium
CN114203185A (en) * 2021-12-01 2022-03-18 厦门快商通科技股份有限公司 Time sequence voiceprint feature combination identification method and device
CN114283817A (en) * 2021-12-27 2022-04-05 思必驰科技股份有限公司 Speaker verification method and system

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2048656A1 (en) * 2007-10-10 2009-04-15 Harman/Becker Automotive Systems GmbH Speaker recognition
US20110106531A1 (en) * 2009-10-30 2011-05-05 Sony Corporation Program endpoint time detection apparatus and method, and program information retrieval system
CN102682760A (en) * 2011-03-07 2012-09-19 株式会社理光 Overlapped voice detection method and system
CN105913849A (en) * 2015-11-27 2016-08-31 中国人民解放军总参谋部陆航研究所 Event detection based speaker segmentation method
CN108198547A (en) * 2018-01-18 2018-06-22 深圳市北科瑞声科技股份有限公司 Sound end detecting method, device, computer equipment and storage medium
CN109346104A (en) * 2018-08-29 2019-02-15 昆明理工大学 A kind of audio frequency characteristics dimension reduction method based on spectral clustering
CN110543822A (en) * 2019-07-29 2019-12-06 浙江理工大学 finger vein identification method based on convolutional neural network and supervised discrete hash algorithm
CN111312256A (en) * 2019-10-31 2020-06-19 平安科技(深圳)有限公司 Voice identity recognition method and device and computer equipment
CN111063341A (en) * 2019-12-31 2020-04-24 苏州思必驰信息科技有限公司 Method and system for segmenting and clustering multi-person voice in complex environment
US20210280169A1 (en) * 2020-03-03 2021-09-09 International Business Machines Corporation Metric learning of speaker diarization
CN113362831A (en) * 2021-07-12 2021-09-07 科大讯飞股份有限公司 Speaker separation method and related equipment thereof
CN113851136A (en) * 2021-09-26 2021-12-28 平安科技(深圳)有限公司 Clustering-based speaker recognition method, device, equipment and storage medium
CN114203185A (en) * 2021-12-01 2022-03-18 厦门快商通科技股份有限公司 Time sequence voiceprint feature combination identification method and device
CN114283817A (en) * 2021-12-27 2022-04-05 思必驰科技股份有限公司 Speaker verification method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QIAN Yanmin et al.: "Audio-Visual Deep Neural Network for Robust Person Verification", IEEE/ACM Transactions on Audio, Speech, and Language Processing *
XIAO Jinzhuang et al.: "Text-independent speaker recognition applying the AAM loss function", Laser Journal *
CHEN Fen: "Research and implementation of unsupervised speaker clustering methods", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091984A (en) * 2023-04-12 2023-05-09 中国科学院深圳先进技术研究院 Video object segmentation method, device, electronic equipment and storage medium
CN116091984B (en) * 2023-04-12 2023-07-18 中国科学院深圳先进技术研究院 Video object segmentation method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114974258B (en) 2022-12-16

Similar Documents

Publication Publication Date Title
CN106446816B (en) Face recognition method and device
WO2019205391A1 (en) Apparatus and method for generating vehicle damage classification model, and computer readable storage medium
CN110443692B (en) Enterprise credit auditing method, device, equipment and computer readable storage medium
US9449163B2 (en) Electronic device and method for logging in application program of the electronic device
US9367728B2 (en) Fingerprint recognition method and device thereof
CN108711161A (en) A kind of image partition method, image segmentation device and electronic equipment
CN109766072B (en) Information verification input method and device, computer equipment and storage medium
WO2016015621A1 (en) Human face picture name recognition method and system
CN114974258B (en) Speaker separation method, device, equipment and storage medium based on voice processing
CN110781856A (en) Heterogeneous face recognition model training method, face recognition method and related device
CN111178147A (en) Screen crushing and grading method, device, equipment and computer readable storage medium
CN114241499A (en) Table picture identification method, device and equipment and readable storage medium
CN112380978B (en) Multi-face detection method, system and storage medium based on key point positioning
CN106250755A (en) For generating the method and device of identifying code
CN111401981B (en) Bidding method, device and storage medium of bidding cloud host
CN112749694A (en) Method and device for identifying image direction and nameplate characters
CN112363814A (en) Task scheduling method and device, computer equipment and storage medium
CN109255214B (en) Authority configuration method and device
CN104637496B (en) Computer system and audio comparison method
CN115455271A (en) Label generating method, device and equipment based on search query words and storage medium
CN112071331B (en) Voice file restoration method and device, computer equipment and storage medium
CN112395450B (en) Picture character detection method and device, computer equipment and storage medium
CN113342825A (en) Buried point data processing method, buried point data processing device, buried point data processing equipment and computer readable storage medium
CN113420178A (en) Data processing method and equipment
CN112396103A (en) Image classification method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant