CN111128223B - Text information-based auxiliary speaker separation method and related device - Google Patents


Info

Publication number
CN111128223B
Authority
CN
China
Prior art keywords
voice
information
speaker
transition point
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911424875.3A
Other languages
Chinese (zh)
Other versions
CN111128223A (en)
Inventor
方昕
柳林
刘海波
方磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201911424875.3A priority Critical patent/CN111128223B/en
Publication of CN111128223A publication Critical patent/CN111128223A/en
Application granted granted Critical
Publication of CN111128223B publication Critical patent/CN111128223B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Abstract

The embodiment of the application discloses a text information-based auxiliary speaker separation method and a related device. The method comprises the following steps: acquiring first voice information to be separated; performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to preliminary segmentation and clustering of different speakers in the first voice information; performing voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice characterization information acquisition; inputting the second voice information into a pre-trained speaker transition point identification model, and determining the speaker transition points in the second voice information; and obtaining a target separation result according to the speaker transition points and the first separation result. In this way, text information is derived from the acquired first voice information, the low-level acoustic features and the text information are fused for speaker separation, and the accuracy of speaker separation is improved.

Description

Text information-based auxiliary speaker separation method and related device
Technical Field
The application relates to the technical field of electronic equipment, in particular to a method and a related device for assisting speaker separation based on text information.
Background
In recent years, with the continuous improvement of audio processing technology, acquiring specific voices of interest from massive data such as telephone recordings, news broadcasts and conference recordings has become a research focus. Speaker separation technology refers to the process of automatically dividing and labeling the voices in a multi-person conversation according to speaker, that is, solving the problem of "who spoke when". With speaker separation technology, audio data streams can be managed in a structured way and the roles of different people in the audio can be effectively distinguished, which in turn provides a basis for structuring audio content at a higher semantic level. Speaker separation technology has many practical applications: the separated results can be used for speaker adaptation to improve the accuracy of speech recognition; they can assist in automatically transcribing telephone and meeting data and in building speaker audio files, thereby enabling speaker audio file management; in addition, automatic tracking and labeling of corpora can be realized through speaker separation. In the speaker separation process, the acoustic features of the speech are generally used as the basis for judgment, and different speakers are distinguished by the timbre information of the speech; however, when two speakers in a piece of speech have the same gender and similar timbres, separation errors often occur.
Disclosure of Invention
The embodiment of the application provides a text information-based auxiliary speaker separation method and a related device, so that speaker separation is performed by fusing low-level acoustic features with text information, improving the accuracy of speaker separation.
In a first aspect, an embodiment of the present application provides a method for assisting speaker separation based on text information, including:
acquiring first voice information to be separated;
performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to preliminary segmentation and clustering of different speakers in the first voice information;
performing voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice characterization information acquisition;
inputting the second voice information into a pre-trained speaker transition point identification model, and determining a speaker transition point in the second voice information;
and obtaining a target separation result according to the transition point of the speaker and the first separation result.
In a second aspect, embodiments of the present application provide a speaker separation aid apparatus based on text information, including a processing unit and a communication unit, wherein,
the processing unit is used for acquiring first voice information to be separated; performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to preliminary segmentation and clustering of different speakers in the first voice information; performing voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice representation information acquisition; inputting the second voice information into a pre-trained speaker transition point identification model, and determining the transition point of the speaker in the second voice information; and obtaining a target separation result according to the transition point of the speaker and the first separation result.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for executing steps in any method of the first aspect of the embodiment of the present application.
In a fourth aspect, the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program makes a computer perform part or all of the steps described in any one of the methods of the first aspect of the present application.
In a fifth aspect, the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform some or all of the steps as described in any one of the methods of the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
It can be seen that, in the embodiment of the application, the electronic device acquires first voice information to be separated; performs first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to preliminary segmentation and clustering of different speakers in the first voice information; performs voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice characterization information acquisition; inputs the second voice information into a pre-trained speaker transition point identification model and determines the speaker transition points in the second voice information; and obtains a target separation result according to the speaker transition points and the first separation result. In this way, text information is derived from the acquired first voice information, the low-level acoustic features and the text information are fused for speaker separation, and the accuracy of speaker separation is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of a system for assisting speaker separation based on text information according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for assisting speaker separation based on text information according to an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart illustrating another method for speaker separation assistance based on text information according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 5 is a block diagram illustrating functional units of an apparatus for assisting speaker separation based on text information according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, fig. 1 is a schematic diagram of a text information-based speaker separation aid system 100. The speaker separation aid system 100 includes a speech acquiring device 110 and a speech processing device 120, the speech acquiring device 110 being connected to the speech processing device 120. The speech acquiring device 110 is used to acquire speech data and send the speech data to the speech processing device 120 for processing, and the speech processing device 120 is used to process the speech data and output a processing result. The speaker separation aid system 100 may be an integrated single device or may include multiple devices; for convenience of description, the speaker separation aid system 100 is generically referred to as an electronic device in the present application. The electronic device may include various handheld devices, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to a wireless modem and having wireless communication capability, as well as various forms of User Equipment (UE), Mobile Stations (MS), terminal devices (terminal device), and the like.
The following describes embodiments of the present application in detail.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a method for assisting speaker separation based on text information according to an embodiment of the present application, applied to the electronic device shown in fig. 1, where as shown in the diagram, the method for assisting speaker separation based on text information includes:
s201, acquiring first voice information to be separated;
the first voice message to be separated may be a bilateral voice of a voice call record.
In a specific implementation, the first voice information may be acquired as audio data recorded by an audio device (e.g., a recorder), such as recording the first voice information during a call, or by selecting one or more pieces of audio data from a pre-recorded audio data set.
S202, performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to preliminary segmentation and clustering of different speakers in the first voice information;
and performing speaker segmentation and speaker clustering on the first voice information to be separated by adopting a traditional speaker separation system based on acoustic characteristics to obtain a first separation result after preliminary separation. The speaker segmentation means that time points of conversion of roles of different speakers are found out from a section of voice with a plurality of speakers, and then the voice is segmented according to the time points to be a plurality of small segments of the speakers. Optimally, the small segments obtained after the segmentation only contain one speaker information, and meanwhile, a continuous segment belonging to a continuous speaker cannot be mistakenly segmented into two or more small segments. The speaker clustering refers to that short speaker segments obtained by segmenting the speaker in the previous step are recombined and combined together by utilizing some clustering methods. Thereby obtaining a starting point of each speaker's voice and a transition point of the speaker's transition.
In a specific implementation, before the first separation processing is performed on the first voice information to be separated, the first voice information may also be detected and denoised. For example, telephone speech may contain various types of noise, such as coughing, laughing, and other people's voices in the background; detection and denoising of the effective speech of multiple speakers may be performed based on energy detection, cross-channel analysis, and the like.
S203, performing voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice characterization information acquisition;
Specifically, voice recognition is performed on each separated voice segment, and the speech content is obtained by transcription, i.e., the second voice information; alternatively, voice characterization information is collected from each separated voice segment to obtain the second voice information.
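For concreteness, a hedged sketch of the recognition branch of S203 is given below; recognize is a hypothetical placeholder for whatever speech recognition engine is used, not a specific API.

```python
# Sketch of the recognition branch of S203: transcribe each preliminarily
# separated segment to obtain the second voice information as text.
# `recognize` is a hypothetical placeholder, not a specific engine's API.
def recognize(samples, sample_rate):
    raise NotImplementedError("plug in a real speech recognition engine here")

def transcribe_segments(samples, sample_rate, first_separation_result):
    second_voice_info = []
    for start, end, speaker_label in first_separation_result:
        second_voice_info.append({
            "start": start,
            "end": end,
            "preliminary_speaker": speaker_label,
            "text": recognize(samples[start:end], sample_rate),
        })
    return second_voice_info
```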
S204, inputting the second voice information into a pre-trained speaker transition point identification model, and determining the transition point of the speaker in the second voice information;
the content of the speech text obtained in step S203 is sent to a pre-trained speaker transition point identification model to detect the speaker transition point, where the speaker transition point is a point where the identity of the speaker changes.
S205, obtaining a target separation result according to the transition point of the speaker and the first separation result.
The first separation result is reconfirmed according to the points of speaker identity change detected by the pre-trained speaker transition point identification model, so as to obtain a target separation result with a better speaker separation effect.
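The following sketch illustrates one possible way S205 could combine the two results; the re-cutting and label-alternation rule assumes a two-speaker call and is an illustrative assumption rather than the patent's exact procedure.

```python
# Sketch of S205: re-confirm the preliminary labels using the transition points
# found by the text-based model. The rule used here (re-cut each segment at
# confirmed in-segment transition points and alternate the label afterwards)
# assumes a two-speaker call and is an illustrative assumption only.
def refine_separation(first_result, transition_points):
    """first_result: [(start, end, label), ...]; transition_points: positions
    (same time units) where the text model says the speaker changes."""
    refined = []
    for start, end, label in first_result:
        inside = sorted(t for t in transition_points if start < t < end)
        cuts = [start] + inside + [end]
        for i in range(len(cuts) - 1):
            refined.append((cuts[i], cuts[i + 1], (label + i) % 2))
    return refined
```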
It can be seen that, in the embodiment of the application, the electronic device acquires first voice information to be separated; performs first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to preliminary segmentation and clustering of different speakers in the first voice information; performs voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice characterization information acquisition; inputs the second voice information into a pre-trained speaker transition point identification model and determines the speaker transition points in the second voice information; and obtains a target separation result according to the speaker transition points and the first separation result. In this way, text information is derived from the acquired first voice information, the low-level acoustic features and the text information are fused for speaker separation, and the accuracy of speaker separation is improved.
In one possible example, the performing voice processing on the first separation result to obtain second voice information includes: acquiring voice characterization information of the first separation result to obtain second voice information, wherein the second voice information comprises text characterization, confidence degree characterization and voice characteristic characterization; or, performing voice recognition on the first separation result to obtain the second voice information, wherein the second voice information comprises voice text information; performing voice word segmentation according to the voice text information to obtain word segmentation results; and marking attribute information of the word segmentation result, wherein the attribute information comprises word segmentation part of speech and word meaning of the word segmentation.
Specifically, a text characterization, a confidence characterization and a speech feature characterization are extracted through a preset characterization extraction model to obtain the second voice information. Alternatively, word segmentation is performed on the recognized voice text information to obtain a word segmentation result, which makes it easier to detect the speaker transition points in the voice text. The word segmentation may be performed in the following ways: word segmentation based on character string matching, which segments the text by matching the voice text against dictionary words (a minimal sketch follows Table 1); or word segmentation through a preset word segmentation model, and the like. The word segmentation process is shown in Table 1 below:
Table 1 (word segmentation example; the table appears as an image in the original publication and is not reproduced here)
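Below is a small sketch of the character string matching approach, using forward maximum matching against a toy dictionary; the dictionary contents are assumptions for demonstration only.

```python
# Illustrative forward maximum-matching word segmentation against a dictionary,
# one simple realization of the character string matching approach mentioned
# above. The toy dictionary below is an assumption for demonstration only.
def max_match_segment(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

print(max_match_segment("你好请问有什么可以帮您",
                        {"你好", "请问", "什么", "可以", "帮您"}))
# -> ['你好', '请问', '有', '什么', '可以', '帮您']
```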
In this example, the preset characterization extraction model is used for performing text characterization, speech feature characterization and confidence vector characterization extraction according to the first separation result, or the speech text is subjected to word segmentation to obtain a word segmentation result, so that an information source of the speaker transition point identification model is enhanced, and the accuracy of judging the speaker transition point is improved.
In one possible example, the collecting of voice characterization information from the first separation result to obtain the second voice information, where the second voice information includes a text characterization, a confidence characterization and a speech feature characterization, includes: extracting a text characterization from the first separation result through a first preset characterization extraction model; determining a word boundary for each word in the first separation result; inputting the word boundaries into a second preset characterization extraction model to extract a speech feature characterization; identifying a recognition confidence vector for each word of the first separation result; acquiring a preset confidence vector matrix; and determining a confidence vector characterization according to the recognition confidence vector and the preset confidence vector matrix.
Specifically, voice characterization information is collected from the first separation result. Text characterization Pt extraction may be performed at word granularity using the first preset characterization extraction model. According to the transcription result of the first separation result, the word boundary of each word of the text content of the first voice information is obtained; the frames within each word boundary are extracted and input into the second preset characterization extraction model for speech characterization extraction, and the speech feature characterization Ps of the word is then obtained by average pooling. A recognition confidence vector is identified for each word of the content represented by the current text characterization Pt; a preset confidence vector matrix V is acquired, and the recognition confidence vector is multiplied directly by the confidence vector matrix V to obtain the confidence vector characterization Pc.
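A hedged PyTorch sketch of how the three characterizations Pt, Ps and Pc could be produced for one word is shown below; the embedding sizes, the 20-bin confidence split and the frame alignment are illustrative assumptions.

```python
# Hedged PyTorch sketch of producing the three per-word characterizations:
# text characterization Pt, speech feature characterization Ps (average pooling
# over the word's frames), and confidence characterization Pc (one-hot confidence
# bin multiplied by a learned matrix V). All dimensions and the 20-bin split are
# illustrative assumptions.
import torch
import torch.nn as nn

class WordCharacterizer(nn.Module):
    def __init__(self, vocab_size=5000, text_dim=128, frame_dim=40,
                 speech_dim=128, conf_bins=20, conf_dim=32):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, text_dim)                 # -> Pt
        self.frame_proj = nn.Linear(frame_dim, speech_dim)                 # frames -> Ps space
        self.conf_matrix = nn.Parameter(torch.randn(conf_bins, conf_dim))  # matrix V
        self.conf_bins = conf_bins

    def forward(self, word_id, word_frames, confidence):
        pt = self.text_emb(word_id)                        # text characterization
        ps = self.frame_proj(word_frames).mean(dim=0)      # average pooling over frames
        one_hot = torch.zeros(self.conf_bins)
        one_hot[min(int(confidence * self.conf_bins), self.conf_bins - 1)] = 1.0
        pc = one_hot @ self.conf_matrix                    # confidence characterization
        return torch.cat([pt, ps, pc], dim=-1)             # head-to-tail splicing

# One word whose boundary covers 10 feature frames, recognition confidence 0.99:
model = WordCharacterizer()
p = model(torch.tensor(42), torch.randn(10, 40), 0.99)
print(p.shape)  # torch.Size([288]) = 128 + 128 + 32
```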
In this example, the preset characterization extraction model is used for performing the text characterization, the speech feature characterization and the confidence vector characterization extraction according to the first separation result, so that the information source of the speaker transition point identification model is enhanced, and the accuracy of determining the speaker transition point is improved.
In one possible example, when the second speech information is characterized as text, the inputting the second speech information into a pre-trained speaker transition point identification model to determine a speaker transition point in the second speech information comprises: inputting the text representation into a pre-trained speaker transition point identification model, and detecting the speaker transition point to obtain a detection result; and reconfirming the first separation result according to the detection result to determine the speaker transition point.
Here, the text characterization is the voice text content. The text characterization is input into the pre-trained speaker transition point identification model to detect speaker transition points, and the first separation result obtained by the separation processing is reconfirmed, so that transition points between speakers with similar timbres can be detected with the help of the additional text information.
In a specific implementation, when the timbres of two speakers are very close and the speakers are difficult to separate with a conventional speaker separation system based on acoustic features, the speaker separation process uses not only the conventional low-level acoustic features that capture timbre, but also the speaker transition point information contained in the text characterization. The pre-trained speaker transition point identification model can therefore achieve a better effect when separating speakers with close timbres, effectively improving the robustness of speaker separation.
It can be seen that in the present example, the accuracy of detecting a speaker transition point is improved by additional textual information by entering a textual representation into a pre-trained speaker transition point identification model to obtain a speaker transition point and re-validating the first separation result.
In one possible example, the second voice information includes a text characterization, a confidence characterization and a speech feature characterization, and the inputting of the second voice information into a pre-trained speaker transition point identification model to determine the speaker transition points in the second voice information comprises: splicing the text characterization, the confidence characterization and the speech feature characterization head to tail to obtain a comprehensive characterization vector; inputting the comprehensive characterization vector into the pre-trained speaker transition point identification model and detecting the speaker transition points to obtain a detection result; and reconfirming the first separation result according to the detection result to determine the speaker transition points.
Here, the text characterization, the confidence characterization and the speech feature characterization are input into the pre-trained speaker transition point identification model to detect speaker transition points, and the first separation result obtained by the separation processing is reconfirmed, so that transition points between speakers with similar timbres can be detected with the help of the additional text information.
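The following sketch shows one plausible form of such a speaker transition point identification model operating on the comprehensive characterization vectors; the bidirectional LSTM architecture and the dimensions are assumptions, chosen only to make the word-level classification concrete.

```python
# Hedged sketch of a speaker transition point identification model operating on
# the comprehensive characterization vectors (Pt;Ps;Pc spliced head to tail):
# a bidirectional LSTM encodes the word sequence and each word is classified as
# transition point / not. Architecture and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TransitionPointModel(nn.Module):
    def __init__(self, input_dim=288, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 2)   # two classes per word

    def forward(self, word_vectors):                     # (batch, n_words, input_dim)
        encoded, _ = self.encoder(word_vectors)
        return self.classifier(encoded)                  # (batch, n_words, 2) logits

model = TransitionPointModel()
logits = model(torch.randn(1, 12, 288))    # one utterance of 12 words
print(logits.argmax(dim=-1).shape)         # label 1 marks a predicted transition point
```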
In a specific implementation, when the timbres of two speakers are very close and the speakers are difficult to separate with a conventional speaker separation system based on acoustic features, the speaker separation process not only uses the conventional low-level acoustic features that capture timbre, but also inputs the text characterization, the confidence characterization and the speech feature characterization into the pre-trained speaker transition point identification model to detect the speaker transition point information. This enriches the information sources of the speaker transition point identification model, so that the pre-trained model achieves a better effect when separating speakers with close timbres, effectively improving the robustness of speaker separation.
It can be seen that in the present example, the accuracy of detecting a speaker transition point is improved by additional textual information by inputting a textual representation, a confidence measure, and a phonetic feature representation into a pre-trained speaker transition point recognition model to obtain a speaker transition point and reconfirming the first separation result.
In one possible example, the inputting of the second voice information into a pre-trained speaker transition point identification model to determine the speaker transition points in the second voice information includes: inputting the part of speech and word meaning of each segmented word into the pre-trained speaker transition point identification model and detecting the speaker transition points to obtain a detection result; and reconfirming the first separation result according to the detection result to determine the speaker transition points.
Here, the part of speech and word meaning of each segmented word are input into the pre-trained speaker transition point identification model to detect speaker transition points, and the first separation result obtained by the separation processing is reconfirmed, so that transition points between speakers with similar timbres can be detected with the help of the additional text information.
In a specific implementation, the general sentence components in modern Chinese are of eight kinds, namely subject, predicate, object, predicator, attributive, adverbial, complement and head word. In a complete sentence, the words stand in certain combinational relationships, and the sentence can be divided into different components according to these relationships. Sentence components are filled by words or phrases, and the speaker of a sentence can be analyzed through its components. For example, in a customer service call transcribed as a single stream of text, the agent may say "Hello, what can I help you with?", the caller may reply "Hello, I would like to check my call charges", and the agent may answer "OK, one moment please". The speaker transition points can be distinguished by the greetings (i.e., the two occurrences of "Hello"), and can also be detected through the speakers' speaking habits: within the same dialogue, the different forms of address each party uses toward the other make it evident that different speakers are involved.
It can be seen that, in the present example, by inputting the part of speech and word meaning of each segmented word into the pre-trained speaker transition point recognition model to obtain the speaker transition points and reconfirming the first separation result, the accuracy of detecting speaker transition points is improved through the additional text information.
In one possible example, the performing of the first separation processing on the first voice information to be separated to obtain a first separation result includes: extracting the vocal characteristics of the first voice information to be separated, wherein the vocal characteristics include voiceprint, timbre and pitch; and inputting the vocal characteristics into a preset speaker separation model for processing to obtain the first separation result.
Before the separation processing, active voice detection may be performed. Active voice detection identifies the effective speech parts and the non-speech parts; speaker separation and subsequent processing only need to process and analyze the effective speech parts, while the non-speech parts, which have a large negative influence on the speaker separation effect, are removed. Depending on the type of speech document, the non-speech segments may include silence, music, indoor noise, background noise, and the like.
Specifically, the vocal characteristics of the first voice information, such as voiceprint, timbre and pitch, are extracted, and the first voice information is segmented according to these vocal characteristics; that is, the speaker change points in the first voice information are detected and used for segmentation. The voices of different speakers can be extracted according to the frequency-domain characteristics of the speech signal of the first voice information, where the frequency-domain characteristics may be Mel-frequency cepstral coefficients (MFCC) or filter bank features (FBank).
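As an illustration, MFCC or FBank features can be extracted as follows with librosa; the sampling rate and filter counts are typical values, not values mandated by the patent.

```python
# Extracting MFCC or FBank-style frequency-domain features with librosa.
# The 8 kHz sampling rate and filter counts are typical values for telephone
# speech, not values specified by the patent.
import librosa
import numpy as np

def extract_features(wav_path: str, kind: str = "mfcc") -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=8000)
    if kind == "mfcc":
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    else:  # log mel filter-bank energies (FBank)
        feats = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))
    return feats.T  # (n_frames, n_features)
```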
In a specific implementation, a fixed-length sliding analysis window containing N frames can be defined, the reliability of each frame within each analysis window is calculated, and a relevant criterion is used to judge whether a speaker change point exists within the analysis window. Each detected division point is stored in a division point set, and when the speech sequence to be analyzed reaches the end of the speech, all detected division points are output. All the obtained division points are then analyzed to determine the true division points.
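A minimal sketch of such a sliding-window change point detector is given below; the symmetric Gaussian divergence and the threshold are assumed stand-ins for whatever reliability measure and criterion a real system would use.

```python
# Minimal sketch of the fixed-length sliding analysis window: compare simple
# Gaussian statistics of the left and right halves of each window and mark a
# candidate speaker change point where their divergence exceeds a threshold.
# The divergence measure and threshold are assumptions standing in for the
# "relevant criterion" mentioned above.
import numpy as np

def gauss_divergence(a: np.ndarray, b: np.ndarray) -> float:
    ma, mb = a.mean(axis=0), b.mean(axis=0)
    va, vb = a.var(axis=0) + 1e-6, b.var(axis=0) + 1e-6
    return float(np.sum(0.5 * (va / vb + vb / va - 2) +
                        0.5 * (ma - mb) ** 2 * (1 / va + 1 / vb)))

def detect_change_points(feats: np.ndarray, win: int = 200, step: int = 10,
                         threshold: float = 50.0) -> list[int]:
    """feats: (n_frames, n_dims). Returns frame indices of candidate change points."""
    points = []
    for start in range(0, len(feats) - win, step):
        mid = start + win // 2
        if gauss_divergence(feats[start:mid], feats[mid:start + win]) > threshold:
            points.append(mid)
    return points
```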
Therefore, in this example, not only is the voice text content fully utilized, but the timbre information of the speech is also better utilized, so that when speakers with similar timbres are separated, the separation effect is better and the accuracy is higher.
In one possible example, before the acquiring of the first voice information to be separated, the method further includes: acquiring training voice information, wherein the training voice information comprises recorded single-sided (one speaker per channel) voice; performing voice content recognition through a preset voice recognition system to obtain voice text content; splicing the voice text contents in time order to obtain a merged voice text; and processing a preset speaker transition point model according to the merged voice text and a cross entropy criterion to obtain the pre-trained speaker transition point identification model. Alternatively: acquiring training voice information, wherein the training voice information comprises recorded single-sided voice; performing voice content recognition through a preset voice recognition system to obtain voice text content; splicing the voice text contents in time order to obtain a merged voice text; acquiring a training text characterization, a training confidence characterization and a training speech feature characterization according to the merged voice text and the first voice information; splicing the training text characterization, the training confidence characterization and the training speech feature characterization head to tail to obtain a comprehensive training feature vector; and processing the preset speaker transition point model according to the comprehensive training feature vector and the cross entropy criterion to obtain the pre-trained speaker transition point identification model.
Specifically, a large number of voice calls recorded with each side on a separate channel are used; the calls are transcribed into text content, the voice text contents of the different speakers are spliced together in chronological order, and a preset neural network model is trained to predict, at each word position, whether it is a speaker transition point, yielding a text-level speaker transition point recognition model.
In a specific implementation, first, a large amount of one-sided recorded voice-call speech is transcribed with a speech recognition system to obtain the text content. The transcribed text contents are then spliced in chronological order of the roles A and B (for example, A is the telephone customer service agent and B is the caller seeking help): the agent A says "Hello, what can I help you with?", the caller B says "Hello, I want to check a call charge", the agent A says "OK, one moment please", and so on, so that the whole call forms one complete sentence in chronological order, and every call is turned into such a complete sentence. Text characterization extraction is then performed at word granularity with a preset characterization extraction model, and the model predicts whether the current word is a speaker transition point, i.e., a place where role A and role B switch. The preset neural network model can be trained with the cross entropy (CE) criterion to obtain a speaker transition point identification model based on text information.
For the second training variant, a large number of voice calls recorded with each side on a separate channel are likewise transcribed into text, the voice text contents of the different speakers are spliced in chronological order, and the preset neural network model is trained at each word position to detect speaker transition points, this time also using the speech feature and confidence characterizations, to obtain the speaker transition point recognition model.
In a specific implementation, first, a large amount of one-sided recorded voice-call speech is transcribed with a speech recognition system to obtain the text content. The transcribed text contents are spliced in chronological order of the roles A and B, as in the example above, so that every call forms one complete sentence. With the word as granularity, a neural network is used to extract the text characterization. The original training recordings are likewise merged and transcribed in chronological order to obtain the speech recognition text content, and the word boundary of each word in that text is determined, for example as a frame-number range [m, n]. If the current word has the word boundary [10, 20], the original speech features of frames 10 to 20 are extracted and input into the preset neural network model, a speech characterization is extracted with the preset characterization extraction model, and the speech feature characterization of the word is then obtained by average pooling. For each word, a recognition confidence vector is derived from the current speech recognition output. The confidence may be, but is not limited to, the acoustic model confidence or the confidence of the acoustic plus language model; it is generally a value between 0 and 1, with a higher value indicating a more reliable word. The interval [0, 1] is divided uniformly into 20 sub-intervals; the dimension corresponding to the sub-interval containing the current word's confidence is set to "1" and the remaining dimensions to "0", giving the recognition confidence vector of the current word. For example, if the confidence of the current word is 0.99, it falls into the last sub-interval, and the resulting 20-dimensional confidence vector is [0, 0, ..., 0, 1]. The confidence vector is multiplied by a confidence vector matrix to convert it into a confidence vector characterization of fixed dimension; the confidence vector matrix is randomly initialized and then updated with the network gradient. The speech feature characterization, the recognition confidence characterization and the text characterization are spliced head to tail to obtain a higher-dimensional characterization vector P, which is then used to predict whether the current word is a speaker transition point: the place where role A and role B switch is the last word spoken before the switch, i.e., that word is a transition point, and all other words are not. In the example call above, only the last word of each speaker's turn (the word after which the speaker changes) is a transition point; none of the other words are.
The training of the speaker transition point model can be performed using the CE criterion. After training is completed, a speaker transition point identification model based on text information is obtained.
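For completeness, a hedged sketch of a CE training step for such a model is shown below in PyTorch; it assumes the word-level characterization vectors and the TransitionPointModel from the earlier sketches and omits data preparation.

```python
# Hedged sketch of one cross entropy (CE) training step on word-level labels
# (1 for the last word before a speaker change, 0 elsewhere). It assumes the
# TransitionPointModel and characterization dimensions from the earlier sketches
# and omits data preparation.
import torch
import torch.nn as nn

def train_step(model, optimizer, word_vectors, labels):
    """word_vectors: (batch, n_words, input_dim); labels: (batch, n_words) in {0, 1}."""
    logits = model(word_vectors)                              # (batch, n_words, 2)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 2), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (model assumed from the earlier sketch):
# model = TransitionPointModel()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = train_step(model, optimizer, torch.randn(4, 30, 288),
#                   torch.randint(0, 2, (4, 30)))
```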
It can be seen that, in this example, by taking a large amount of one-sided voice-call recordings, performing speech content recognition with a speech recognition system and transcribing them into text content, a speaker transition point identification model based on text information is obtained; correcting the transition points through this speaker transition point identification model allows the two kinds of information to be fused more fully.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating a method for assisting speaker separation based on text information according to an embodiment of the present application, and the method is applied to the electronic device shown in fig. 1, where as shown in the figure, the method for assisting speaker separation based on text information includes:
s301, acquiring first voice information to be separated;
s302, performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to preliminary segmentation and clustering of different speakers in the first voice information;
s303, inputting the text representation into a pre-trained speaker transition point identification model, and detecting the speaker transition point to obtain a detection result;
s304, reconfirming the first separation result according to the detection result, and determining a speaker transition point;
s305, obtaining a target separation result according to the transition point of the speaker and the first separation result.
It can be seen that, in the embodiment of the application, the electronic device acquires first voice information to be separated; performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to performing preliminary segmentation and clustering on different speakers in the first voice information; performing voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice representation information acquisition; inputting the second voice information into a pre-trained speaker transition point identification model, and determining the transition point of the speaker in the second voice information; and obtaining a target separation result according to the transition point of the speaker and the first separation result. Therefore, the text information is obtained through the obtained first voice information, the bottom acoustic characteristics and the text information are fused to separate the speaker, and the accuracy of speaker separation is improved.
In accordance with the embodiment shown in fig. 2, please refer to fig. 4, fig. 4 is a schematic structural diagram of an electronic device 400 provided in an embodiment of the present application, and as shown in the drawing, the electronic device 400 includes an application processor 410, a memory 420, a communication interface 430, and one or more programs 421, where the one or more programs 421 are stored in the memory 420 and configured to be executed by the application processor 410, and the one or more programs 421 include instructions for performing the following steps;
acquiring first voice information to be separated;
performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to preliminary segmentation and clustering of different speakers in the first voice information;
performing voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice characterization information acquisition;
inputting the second voice information into a pre-trained speaker transition point identification model, and determining a speaker transition point in the second voice information;
and obtaining a target separation result according to the transition point of the speaker and the first separation result.
It can be seen that, in the embodiment of the application, the electronic device acquires first voice information to be separated; performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to performing preliminary segmentation and clustering on different speakers in the first voice information; performing voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice representation information acquisition; inputting the second voice information into a pre-trained speaker transition point identification model, and determining the transition point of the speaker in the second voice information; and obtaining a target separation result according to the transition point of the speaker and the first separation result. Therefore, the text information is obtained through the obtained first voice information, the bottom acoustic characteristics and the text information are fused to separate the speaker, and accuracy of speaker separation is improved.
In one possible example, in the aspect of performing the voice processing on the first separation result to obtain the second voice information, the instructions in the program are specifically configured to perform the following operations: acquiring voice characterization information of the first separation result to obtain second voice information, wherein the second voice information comprises text characterization, confidence degree characterization and voice characteristic characterization; or, performing voice recognition on the first separation result to obtain the second voice information, wherein the second voice information comprises voice text information; performing voice word segmentation according to the voice text information to obtain word segmentation results; and marking attribute information of the word segmentation result, wherein the attribute information comprises word segmentation part of speech and word meaning of the word segmentation.
In one possible example, in the case that the voice characterization information is collected for the first separation result, the second voice information is obtained, where the second voice information includes text characterization, confidence metric, and voice feature characterization, the instructions in the program are specifically configured to perform the following operations: extracting text representations from the first separation result through a first preset representation extraction model; determining a word boundary for each word in the first separation result; inputting the character boundary into a second preset representation extraction model to extract a voice feature representation; identifying an identification confidence vector for each word of the first separation result; acquiring a preset confidence coefficient vector matrix; and determining confidence coefficient vector representation according to the recognition confidence coefficient vector and the preset confidence coefficient vector matrix.
In one possible example, when the second speech information is text-characterized, the instructions in the program are specifically configured to perform the following operations in the respect of determining a transition point of a speaker in the second speech information by inputting the second speech information into a pre-trained speaker transition point recognition model: inputting the text representation into a pre-trained speaker transition point identification model, and detecting the speaker transition point to obtain a detection result; and reconfirming the first separation result according to the detection result to determine the speaker transition point.
In one possible example, the second speech information includes a text characterization, a confidence measure, and a speech feature characterization, and the instructions in the program are specifically configured to perform the following in the aspects of inputting the second speech information into a pre-trained speaker transition point recognition model to determine a speaker transition point in the second speech information: performing head-to-tail splicing on the text characterization, the confidence degree characterization and the voice characteristic characterization to obtain a comprehensive characterization vector; inputting the comprehensive characterization vector into a pre-trained speaker transition point identification model, and detecting the speaker transition point to obtain a detection result; and reconfirming the first separation result according to the detection result to determine the speaker transition point.
In one possible example, in determining the transition point of the speaker in the second speech information by inputting the second speech information into a pre-trained speaker transition point recognition model, the instructions in the program are specifically configured to: inputting the word segmentation part of speech and the word segmentation meaning of speech into a pre-trained speaker transition point identification model, and detecting the speaker transition point to obtain a detection result; and reconfirming the first separation result according to the detection result to determine the speaker transition point.
In one possible example, in terms of performing the first separation processing on the first voice information to be separated to obtain a first separation result, the instructions in the program are specifically configured to perform the following operations: extracting the vocal characteristics of the first voice information to be separated, wherein the vocal characteristics include voiceprint, timbre and pitch; and inputting the vocal characteristics into a preset speaker separation model for processing to obtain the first separation result.
In one possible example, the program further includes instructions for performing the following operations: before the first voice information to be separated is acquired, acquiring training voice information, wherein the training voice information comprises recorded single-sided voice; performing voice content recognition through a preset voice recognition system to obtain voice text content; splicing the voice text contents in time order to obtain a merged voice text; and processing a preset speaker transition point model according to the merged voice text and a cross entropy criterion to obtain the pre-trained speaker transition point identification model. Alternatively: acquiring training voice information, wherein the training voice information comprises recorded single-sided voice; performing voice content recognition through a preset voice recognition system to obtain voice text content; splicing the voice text contents in time order to obtain a merged voice text; acquiring a training text characterization, a training confidence characterization and a training speech feature characterization according to the merged voice text and the first voice information; splicing the training text characterization, the training confidence characterization and the training speech feature characterization head to tail to obtain a comprehensive training feature vector; and processing the preset speaker transition point model according to the comprehensive training feature vector and the cross entropy criterion to obtain the pre-trained speaker transition point identification model.
The above description has introduced the solution of the embodiment of the present application mainly from the perspective of the method-side implementation process. It is understood that the electronic device comprises corresponding hardware structures and/or software modules for performing the respective functions in order to realize the above-mentioned functions. Those of skill in the art will readily appreciate that the present application is capable of hardware or a combination of hardware and computer software implementing the various illustrative elements and algorithm steps described in connection with the embodiments provided herein. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Fig. 5 is a block diagram of functional elements of a text-based speaker separation aid 500 according to an embodiment of the present application. The apparatus 500 for assisting speaker separation based on text information is applied to an electronic device including a processing unit 501 and a communication unit 502, wherein,
the processing unit 501 is configured to obtain first voice information to be separated; performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to preliminary segmentation and clustering of different speakers in the first voice information; performing voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice representation information acquisition; inputting the second voice information into a pre-trained speaker transition point identification model, and determining a transition point of the speaker in the second voice information; and obtaining a target separation result according to the transition point of the speaker and the first separation result.
The apparatus 500 for assisting speaker separation based on text information may further include a storage unit 503 for storing program codes and data of an electronic device. The processing unit 501 may be a processor, the communication unit 502 may be an internal communication interface, and the storage unit 503 may be a memory.
It can be seen that, in the embodiment of the application, the electronic device acquires first voice information to be separated; performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to performing preliminary segmentation and clustering on different speakers in the first voice information; performing voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice representation information acquisition; inputting the second voice information into a pre-trained speaker transition point identification model, and determining the transition point of the speaker in the second voice information; and obtaining a target separation result according to the transition point of the speaker and the first separation result. Therefore, the text information is obtained through the obtained first voice information, the bottom acoustic characteristics and the text information are fused to separate the speaker, and accuracy of speaker separation is improved.
In a possible example, in terms of performing the voice processing on the first separation result to obtain the second voice information, the processing unit 501 is specifically configured to: acquiring voice characterization information of the first separation result to obtain second voice information, wherein the second voice information comprises text characterization, confidence degree characterization and voice characteristic characterization; or, performing voice recognition on the first separation result to obtain the second voice information, wherein the second voice information comprises voice text information; performing voice word segmentation according to the voice text information to obtain word segmentation results; and marking attribute information of the word segmentation result, wherein the attribute information comprises word segmentation part of speech and word meaning of the word segmentation.
In a possible example, in terms of collecting the voice characterization information of the first separation result to obtain the second voice information, where the second voice information comprises a text characterization, a confidence characterization and a voice feature characterization, the processing unit 501 is specifically configured to: extract the text characterization from the first separation result through a first preset characterization extraction model; determine a word boundary for each word in the first separation result; input the word boundaries into a second preset characterization extraction model to extract the voice feature characterization; obtain a recognition confidence vector for each word of the first separation result; acquire a preset confidence vector matrix; and determine the confidence characterization according to the recognition confidence vectors and the preset confidence vector matrix.
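Purely as a sketch under stated assumptions, the three characterizations might be assembled as follows; text_encoder, boundary_encoder and confidence_matrix stand in for the first preset characterization extraction model, the second preset characterization extraction model and the preset confidence vector matrix, respectively, and the per-word fields (word, start, end, confidence_vector) are hypothetical.

```python
import numpy as np

# Hedged sketch: build text, confidence and voice feature characterizations per word.
def build_characterizations(words, audio_features, text_encoder, boundary_encoder,
                            confidence_matrix):
    # Text characterization via the first preset characterization extraction model
    text_repr = text_encoder([w["word"] for w in words])

    # Word boundaries fed to the second preset model to get the voice feature characterization
    boundaries = [(w["start"], w["end"]) for w in words]
    speech_repr = boundary_encoder(audio_features, boundaries)

    # Recognition confidence vectors projected by the preset confidence vector matrix
    conf_vectors = np.stack([w["confidence_vector"] for w in words])   # (n_words, k)
    conf_repr = conf_vectors @ confidence_matrix                        # (n_words, d)

    return text_repr, conf_repr, speech_repr
```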
In one possible example, when the second voice information is a text characterization, in terms of inputting the second voice information into the pre-trained speaker transition point identification model to determine the speaker transition point in the second voice information, the processing unit 501 is specifically configured to: input the text characterization into the pre-trained speaker transition point identification model and detect speaker transition points to obtain a detection result; and reconfirm the first separation result according to the detection result to determine the speaker transition point.
In one possible example, the second voice information comprises a text characterization, a confidence characterization and a voice feature characterization, and in terms of inputting the second voice information into the pre-trained speaker transition point identification model to determine the speaker transition point in the second voice information, the processing unit 501 is specifically configured to: splice the text characterization, the confidence characterization and the voice feature characterization head to tail to obtain a comprehensive characterization vector; input the comprehensive characterization vector into the pre-trained speaker transition point identification model and detect speaker transition points to obtain a detection result; and reconfirm the first separation result according to the detection result to determine the speaker transition point.
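A minimal sketch of the head-to-tail splicing and detection step is given below, assuming each characterization is a per-word matrix, that the transition point model exposes a predict method returning per-word change probabilities, and that refine_boundaries is a hypothetical helper that reconfirms the first separation result.

```python
import numpy as np

# Hedged sketch: splice the three characterizations and detect speaker transition points.
def detect_transition_points(text_repr, conf_repr, speech_repr,
                             transition_model, first_result):
    # Head-to-tail splicing into a comprehensive characterization vector per word
    combined = np.concatenate([text_repr, conf_repr, speech_repr], axis=-1)

    # Per-word probability of a speaker change at this position
    change_probs = transition_model.predict(combined)
    detected = [i for i, p in enumerate(change_probs) if p > 0.5]

    # Reconfirm the first separation result with the detected transition points
    return refine_boundaries(first_result, detected)
```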
In one possible example, in terms of inputting the second voice information into the pre-trained speaker transition point identification model to determine the speaker transition point in the second voice information, the processing unit 501 is specifically configured to: input the part of speech and the word sense of each segmented word into the pre-trained speaker transition point identification model and detect speaker transition points to obtain a detection result; and reconfirm the first separation result according to the detection result to determine the speaker transition point.
In a possible example, in terms of performing the first separation processing on the first voice information to be separated to obtain the first separation result, the processing unit 501 is specifically configured to: extract vocal features of the first voice information to be separated, wherein the vocal features comprise voiceprint, pitch and timbre; and input the vocal features into a preset speaker separation model for processing to obtain the first separation result.
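As one possible, non-limiting reading of this step, the first separation could look like the sketch below; feature_extractor and speaker_separation_model are placeholders for whatever voiceprint/pitch/timbre extractor and preset separation model a concrete system provides.

```python
# Hedged sketch of the first separation processing.
def first_separation(first_voice_info, feature_extractor, speaker_separation_model):
    # Vocal features of the voice to be separated (e.g. voiceprint, pitch, timbre)
    features = feature_extractor(first_voice_info)

    # Preliminary segmentation and clustering into per-speaker segments,
    # e.g. a list of (start_time, end_time, speaker_id) tuples
    return speaker_separation_model.segment_and_cluster(features)
```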
In one possible example, the processing unit 501 is further configured to: before the first voice information to be separated is acquired, acquire training voice information, wherein the training voice information comprises recorded single-channel voice; perform voice content recognition through a preset voice recognition system to obtain voice text content; splice the voice text content in time order to obtain a merged recorded voice text; and train a preset speaker transition point model according to the merged recorded voice text and a cross entropy criterion to obtain the pre-trained speaker transition point identification model; or, acquire training voice information, wherein the training voice information comprises recorded single-channel voice; perform voice content recognition through a preset voice recognition system to obtain voice text content; splice the voice text content in time order to obtain a merged recorded voice text; acquire a training text characterization, a training confidence characterization and a training voice feature characterization according to the merged recorded voice text and the first voice information; splice the training text characterization, the training confidence characterization and the training voice feature characterization head to tail to obtain a comprehensive training feature vector; and train the preset speaker transition point model according to the comprehensive training feature vector and the cross entropy criterion to obtain the pre-trained speaker transition point identification model.
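For the training branch, a minimal PyTorch-style sketch with a cross entropy criterion is given below; it assumes the data preparation (recorded single-channel voice, voice recognition, time-ordered splicing into the merged recorded voice text, and characterization extraction) has already produced the comprehensive training feature vectors train_vectors and per-word transition labels train_labels, and that model is any trainable transition point model.

```python
import torch
import torch.nn as nn

# Hedged sketch: train a speaker transition point model with a cross entropy criterion.
def train_transition_point_model(model, train_vectors, train_labels,
                                 epochs=10, lr=1e-3):
    criterion = nn.CrossEntropyLoss()                 # the cross entropy criterion
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = model(train_vectors)                 # (n_words, 2): no-change / change
        loss = criterion(logits, train_labels)        # labels: LongTensor, 1 at transitions
        loss.backward()
        optimizer.step()
    return model
```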
It can be understood that, since the method embodiments and the apparatus embodiments are different presentations of the same technical concept, the content described for the method embodiments in this application applies equally to the apparatus embodiments, and is not repeated here.
Embodiments of the present application also provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, the computer program enabling a computer to execute part or all of the steps of any one of the methods described in the above method embodiments, and the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods as described in the above method embodiments. The computer program product may be a software installation package, the computer comprising an electronic device.
It should be noted that for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art will recognize that the embodiments described in this specification are preferred embodiments and that acts or modules referred to are not necessarily required for this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing relevant hardware, where the program is stored in a computer-readable memory, and the memory may include a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (11)

1. A method for assisting speaker separation based on text information is characterized by comprising the following steps:
acquiring first voice information to be separated;
performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to preliminary segmentation and clustering of different speakers in the first voice information;
and performing voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice characterization information acquisition, and the voice characterization information comprises: text characterization, confidence characterization and voice feature characterization;
inputting the second voice information into a pre-trained speaker transition point identification model, and determining a speaker transition point in the second voice information;
and obtaining a target separation result according to the transition point of the speaker and the first separation result.
2. The method of claim 1, wherein the performing voice processing on the first separation result to obtain the second voice information comprises:
collecting voice characterization information of the first separation result to obtain the second voice information, wherein the second voice information comprises a text characterization, a confidence characterization and a voice feature characterization; or,
performing voice recognition on the first separation result to obtain voice text information;
performing voice word segmentation according to the voice text information to obtain word segmentation results;
and marking attribute information of the word segmentation results, wherein the attribute information comprises the part of speech and the word sense of each segmented word.
3. The method of claim 2, wherein the collecting the voice characterization information of the first separation result to obtain the second voice information, the second voice information comprising a text characterization, a confidence characterization and a voice feature characterization, comprises:
extracting a text characterization from the first separation result through a first preset characterization extraction model;
determining a word boundary for each word in the first separation result;
inputting the word boundaries into a second preset characterization extraction model to extract a voice feature characterization;
obtaining a recognition confidence vector for each word of the first separation result;
acquiring a preset confidence vector matrix;
and determining a confidence characterization according to the recognition confidence vectors and the preset confidence vector matrix.
4. The method of claim 3, wherein when the second voice information is a text characterization, the inputting the second voice information into a pre-trained speaker transition point identification model to determine a speaker transition point in the second voice information comprises:
inputting the text characterization into a pre-trained speaker transition point identification model, and detecting the speaker transition point to obtain a detection result;
and reconfirming the first separation result according to the detection result to determine the speaker transition point.
5. The method of claim 3, wherein the second voice information comprises a text characterization, a confidence characterization and a voice feature characterization, and wherein the inputting the second voice information into a pre-trained speaker transition point identification model to determine a speaker transition point in the second voice information comprises:
performing head-to-tail splicing on the text characterization, the confidence characterization and the voice feature characterization to obtain a comprehensive characterization vector;
inputting the comprehensive characterization vector into a pre-trained speaker transition point identification model, and detecting the speaker transition point to obtain a detection result;
and reconfirming the first separation result according to the detection result to determine the speaker transition point.
6. The method of claim 2, wherein the inputting the second voice information into a pre-trained speaker transition point identification model to determine a speaker transition point in the second voice information comprises:
inputting the part of speech and the word sense of each segmented word into a pre-trained speaker transition point identification model, and detecting the speaker transition point to obtain a detection result;
and reconfirming the first separation result according to the detection result to determine the speaker transition point.
7. The method according to claim 1, wherein the performing a first separation process on the first speech information to be separated to obtain a first separation result includes:
extracting vocal features of the first voice information to be separated, wherein the vocal features comprise voiceprint, pitch and timbre;
and inputting the vocal features into a preset speaker separation model for processing to obtain the first separation result.
8. The method according to any one of claims 1 to 7, wherein before the obtaining the first voice information to be separated, the method further comprises:
acquiring training voice information, wherein the training voice information comprises recorded single-channel voice;
performing voice content recognition through a preset voice recognition system to obtain voice text content;
splicing the voice text content in time order to obtain a merged recorded voice text;
training a preset speaker transition point model according to the merged recorded voice text and a cross entropy criterion to obtain a pre-trained speaker transition point identification model; or,
acquiring training voice information, wherein the training voice information comprises recorded single-channel voice;
carrying out voice content recognition through a preset voice recognition system to obtain voice text content;
splicing the voice text content in time order to obtain a merged recorded voice text;
acquiring a training text characterization, a training confidence characterization and a training voice feature characterization according to the merged recorded voice text and the first voice information;
performing head-to-tail splicing on the training text characterization, the training confidence characterization and the training voice feature characterization to obtain a comprehensive training feature vector;
and training the preset speaker transition point model according to the comprehensive training feature vector and the cross entropy criterion to obtain the pre-trained speaker transition point identification model.
9. A device for assisting speaker separation based on text information, the device comprising a processing unit and a communication unit, wherein,
the processing unit is used for acquiring first voice information to be separated; performing first separation processing on the first voice information to be separated to obtain a first separation result, wherein the first separation processing refers to preliminary segmentation and clustering of different speakers in the first voice information; and performing voice processing on the first separation result to obtain second voice information, wherein the voice processing comprises voice recognition or voice characterization information acquisition, and the voice characterization information comprises: text characterization, confidence characterization and voice feature characterization; inputting the second voice information into a pre-trained speaker transition point identification model, and determining a transition point of the speaker in the second voice information; and obtaining a target separation result according to the transition point of the speaker and the first separation result.
10. An electronic device comprising an application processor, a communication interface and a memory, the application processor, the communication interface and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the application processor being configured to invoke the program instructions to perform the method of any of claims 1 to 8.
11. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 8.
CN201911424875.3A 2019-12-30 2019-12-30 Text information-based auxiliary speaker separation method and related device Active CN111128223B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911424875.3A CN111128223B (en) 2019-12-30 2019-12-30 Text information-based auxiliary speaker separation method and related device

Publications (2)

Publication Number Publication Date
CN111128223A CN111128223A (en) 2020-05-08
CN111128223B true CN111128223B (en) 2022-08-05

Family

ID=70507328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911424875.3A Active CN111128223B (en) 2019-12-30 2019-12-30 Text information-based auxiliary speaker separation method and related device

Country Status (1)

Country Link
CN (1) CN111128223B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111539221B (en) * 2020-05-13 2023-09-12 北京焦点新干线信息技术有限公司 Data processing method and system
CN111613249A (en) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 Voice analysis method and equipment
CN111899755A (en) * 2020-08-11 2020-11-06 华院数据技术(上海)有限公司 Speaker voice separation method and related equipment
CN111968657B (en) * 2020-08-17 2022-08-16 北京字节跳动网络技术有限公司 Voice processing method and device, electronic equipment and computer readable medium
CN112201275A (en) * 2020-10-09 2021-01-08 深圳前海微众银行股份有限公司 Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN112270169B (en) * 2020-10-14 2023-07-25 北京百度网讯科技有限公司 Method and device for predicting dialogue roles, electronic equipment and storage medium
CN112688859B (en) * 2020-12-18 2022-09-02 维沃移动通信有限公司 Voice message sending method and device, electronic equipment and readable storage medium
CN112735384A (en) * 2020-12-28 2021-04-30 科大讯飞股份有限公司 Turning point detection method, device and equipment applied to speaker separation
CN114792522A (en) * 2021-01-26 2022-07-26 阿里巴巴集团控股有限公司 Audio signal processing method, conference recording and presenting method, apparatus, system and medium
CN113793592A (en) * 2021-10-29 2021-12-14 浙江核新同花顺网络信息股份有限公司 Method and system for distinguishing speakers
CN114726635B (en) * 2022-04-15 2023-09-12 北京三快在线科技有限公司 Authority verification method and device, electronic equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5706782B2 (en) * 2010-08-17 2015-04-22 本田技研工業株式会社 Sound source separation device and sound source separation method
US10134400B2 (en) * 2012-11-21 2018-11-20 Verint Systems Ltd. Diarization using acoustic labeling
US9336781B2 (en) * 2013-10-17 2016-05-10 Sri International Content-aware speaker recognition
US9542948B2 (en) * 2014-04-09 2017-01-10 Google Inc. Text-dependent speaker identification
US20160189730A1 (en) * 2014-12-30 2016-06-30 Iflytek Co., Ltd. Speech separation method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105427858A (en) * 2015-11-06 2016-03-23 科大讯飞股份有限公司 Method and system for achieving automatic voice classification
CN107886955A (en) * 2016-09-29 2018-04-06 百度在线网络技术(北京)有限公司 A kind of personal identification method, device and the equipment of voice conversation sample
CN108735200A (en) * 2018-06-27 2018-11-02 北京灵伴即时智能科技有限公司 A kind of speaker's automatic marking method
CN110136727A (en) * 2019-04-16 2019-08-16 平安科技(深圳)有限公司 Speaker's personal identification method, device and storage medium based on speech content
CN110444223A (en) * 2019-06-26 2019-11-12 平安科技(深圳)有限公司 Speaker's separation method and device based on Recognition with Recurrent Neural Network and acoustic feature
CN110517667A (en) * 2019-09-03 2019-11-29 龙马智芯(珠海横琴)科技有限公司 A kind of method of speech processing, device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Robust speaker recognition based on DNN/i-vectors and speech separation; Jorge Chang; 2017-06-19; 5415-5419 *
Research on Speaker Recognition Modeling Based on Deep Learning; Feng Yong; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2017-09-30; I136-10 *

Also Published As

Publication number Publication date
CN111128223A (en) 2020-05-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant