CN113936664A

CN113936664A - Voiceprint registration method and device, electronic equipment and storage medium

Info

Publication number: CN113936664A
Application number: CN202111175777.8A
Authority: CN
Inventors: 钟锟; 郭晓天; 吴苇康; 孙国俊
Original assignee: Hefei Xunfei Reading And Writing Technology Co ltd
Current assignee: Hefei Xunfei Reading And Writing Technology Co ltd
Priority date: 2021-10-09
Filing date: 2021-10-09
Publication date: 2022-01-14

Abstract

The invention provides a voiceprint registration method, a voiceprint registration device, electronic equipment and a storage medium, wherein the method comprises the following steps: carrying out voiceprint separation on voice data to obtain an initial role in the voice data; receiving identity information of the initial role; and performing voiceprint registration on the initial role based on the identity information of the initial role and the voiceprint characteristics of the initial role in the voice data. The method, the device, the electronic equipment and the storage medium solve the problem that voiceprint registration before recording is very complicated, improve the efficiency of voiceprint registration and realize fast and accurate voiceprint registration.

Description

Voiceprint registration method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of voice processing technologies, and in particular, to a voiceprint registration method and apparatus, an electronic device, and a storage medium.

Background

In recent years, multi-person conferences have become increasingly common. In such a conference, multiple speakers take turns speaking or talk over one another. To improve the efficiency of a multi-person conference, role separation and role labeling must be performed for each speaker taking part in the discussion, and role labeling presupposes that voiceprint registration has been performed for the role.

At present, voiceprint registration usually requires a registration voice to be recorded in advance, before recording starts, and voiceprint registration is then performed based on that registration voice. This makes the voiceprint registration process cumbersome and the user experience poor.

Disclosure of Invention

The invention provides a voiceprint registration method and device, electronic equipment and a storage medium, which are used for solving the defects that the voiceprint registration process is complicated and the user experience is poor in the prior art.

The invention provides a voiceprint registration method, which comprises the following steps:

carrying out voiceprint separation on voice data to obtain an initial role in the voice data;

receiving identity information of the initial role;

and performing voiceprint registration on the initial role based on the identity information of the initial role and the voiceprint characteristics of the initial role in the voice data.

According to a voiceprint registration method provided by the present invention, the receiving the identity information of the initial role includes:

displaying the transcription text of the role voice corresponding to the initial role in the voice data;

and receiving the identity information of the initial role corresponding to the transcription text.

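Alternatively, according to the voiceprint registration method provided by the present invention, the receiving the identity information of the initial role includes:

displaying the role voice corresponding to the initial role in the voice data;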

and receiving the identity information of the initial role corresponding to the role voice.

According to the voiceprint registration method provided by the invention, the voiceprint characteristics are determined based on the following steps:

selecting sample voice of the initial role from the role voice corresponding to the initial role in the voice data;

and carrying out voiceprint extraction on the sample voice to obtain the voiceprint characteristics of the initial role.

According to a voiceprint registration method provided by the present invention, the selecting a sample voice of the initial role from the role voices corresponding to the initial role in the voice data includes:

selecting the sample voice of the initial role from the segments of role voice based on the voice duration and/or the voice definition of each segment of role voice corresponding to the initial role.

According to the voiceprint registration method provided by the invention, the voice definition is determined based on the number of modal particles (filler words) and/or the number of semantic errors contained in the corresponding role voice.

According to the voiceprint registration method provided by the invention, the selecting a sample voice of the initial role from the role voices corresponding to the initial role may alternatively include:

receiving a sample voice of the initial role selected from the role voices corresponding to the initial role.

The present invention also provides a voiceprint registration apparatus, comprising:

the voice print separation unit is used for carrying out voice print separation on voice data to obtain an initial role in the voice data;

the identity information receiving unit is used for receiving the identity information of the initial role;

and the voiceprint registration unit is used for carrying out voiceprint registration on the initial role based on the identity information of the initial role and the voiceprint characteristics of the initial role in the voice data.

The present invention also provides an electronic device, comprising a memory, a processor and a computer program stored in the memory and operable on the processor, wherein the processor implements the steps of the voiceprint registration method as described in any of the above when executing the program.

The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the voiceprint registration method as described in any one of the above.

The voiceprint registration method, the voiceprint registration device, the electronic equipment and the storage medium provided by the invention are used for carrying out voiceprint registration based on the identity information of the initial role in the voice data and the voiceprint characteristics of the initial role in the voice data, so that the multiplexing of the voice data is realized, the voice special for voiceprint registration is not required to be additionally recorded, the problem that the voiceprint registration before recording is very complicated is solved, the efficiency of voiceprint registration is improved, and the fast and accurate voiceprint registration is realized.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of a voiceprint registration method provided by the present invention;

FIG. 2 is a flowchart illustrating step 120 of the voiceprint registration method provided by the present invention;

FIG. 3 is a second flowchart illustrating a step 120 of the voiceprint registration method according to the present invention;

FIG. 4 is a flow chart illustrating a voiceprint feature determination process provided by the present invention;

FIG. 5 is an illustration of an interface display for speaker labeling in accordance with the present invention;

FIG. 6 is a second illustration of an interface display for speaker labeling according to the present invention;

FIG. 7 is a schematic structural diagram of a voiceprint registration apparatus provided by the present invention;

FIG. 8 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

To solve the above problem, the present invention provides a method for performing voiceprint registration during recording or after recording is finished, aiming at fast and accurate voiceprint registration and an improved user experience. FIG. 1 is a schematic flow chart of the voiceprint registration method provided by the present invention. As shown in FIG. 1, the method includes the following steps:

Step 110, carrying out voiceprint separation on the voice data to obtain an initial role in the voice data.

Specifically, before voiceprint registration is performed, voice data needs to be acquired, where the voice data may be intercepted from a real-time recorded voice data stream, or may be intercepted from already recorded voice data, and this is not specifically limited in the embodiment of the present invention.

After the voice data is obtained, voiceprint separation can be performed on it to determine each speaker, i.e., each role, in the voice data. The voice data may contain a single role or multiple roles. When it contains multiple roles, voiceprint separation distinguishes the voiceprint features of the roles contained in the voice data, determines the time interval in which each role speaks, and thereby separates the roles. Because the voiceprint separation in step 110 is performed while the identity information of each role in the voice data is still unknown, it amounts to blind source separation, and each role obtained by it is therefore defined as an initial role. The initial roles can be labeled "[speaker 1]", "[speaker 2]", "[speaker 3]", and so on, with different labels representing different speakers. As for the specific labeling form, each speaker may be associated with a time axis in the voice data, or with the transcription text under that time axis.
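The labeling of initial roles described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that some separation step has already produced segments of the form (start, end, cluster id), and the function name `label_initial_roles` and the cluster ids are hypothetical.

```python
from collections import defaultdict


def label_initial_roles(segments):
    """Group separated segments into placeholder roles.

    segments: iterable of (start_sec, end_sec, cluster_id) tuples, assumed to
    come from an upstream voiceprint separation step (not implemented here).
    Returns {"[speaker N]": [(start_sec, end_sec), ...]} with N assigned in
    order of first appearance on the time axis.
    """
    labels = {}                   # cluster_id -> "[speaker N]"
    timeline = defaultdict(list)  # "[speaker N]" -> speaking intervals
    for start, end, cluster_id in sorted(segments):
        if cluster_id not in labels:
            labels[cluster_id] = f"[speaker {len(labels) + 1}]"
        timeline[labels[cluster_id]].append((start, end))
    return dict(timeline)


# Hypothetical separation output: two clusters, identities still unknown.
segments = [(0.0, 4.2, "c7"), (4.2, 9.0, "c3"), (9.0, 12.5, "c7")]
roles = label_initial_roles(segments)
```

Each placeholder label carries only a time axis at this point; no identity information is attached until step 120.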

Step 120, receiving identity information of the initial role.

Specifically, after the initial role in the voice data is determined in step 110, the identity information of the initial role needs to be further determined, so that voiceprint registration can subsequently be performed on the initial role according to that identity information.

Here, the identity information of the initial role may be marked by the user, or may be determined by searching a pre-stored database, which is not specifically limited in this embodiment of the present invention. The following description takes the case in which the identity information of the initial role is marked by the user as an example:

After the initial role in the voice data is determined, the user labels the initial role in the voice data with identity information, and the identity information of the initial role labeled by the user is thereby obtained.

Here, the user may mark the identity information of the initial role through an intelligent terminal on which the initial role is currently displayed. The intelligent terminal may be an intelligent device with an interactive function, such as a smart phone or a tablet computer. The initial role may be displayed by playing the role voice corresponding to the initial role in the voice data, or by displaying the transcription text of that role voice. In addition to these two kinds of information, role-related information obtained by analyzing the role voice, such as the age and gender of the role, may also be displayed.

After obtaining the information displayed by the intelligent terminal, the user can input the identity information of the initial role through the intelligent terminal, thereby labeling the initial role with identity information. For example, the user may connect an earphone to the intelligent terminal; the intelligent terminal plays the role voice corresponding to the initial role through the earphone, and the user determines the identity information of the initial role from the played role voice and labels it by touch screen input, keyboard input, mouse input, or the like. The label may be, for example, the speaker [Xiaoming] of a certain voice segment. For another example, the user may determine the identity information of the initial role from the transcription text of the role voice displayed by the intelligent terminal and label the identity information at the beginning of the corresponding transcription text, for example "[Xiaoming] speech content 1", where [Xiaoming] represents the identity information of the initial role corresponding to "speech content 1".

After that, the intelligent terminal can send the identity information of the initial role marked by the user to a server that performs voiceprint registration, and the server receives it. Alternatively, the intelligent terminal itself can undertake the voiceprint registration function, in which case the intelligent terminal obtains the identity information of the initial role marked by the user.

In addition, the following description will take the example that the identity information of the initial role is determined by searching:

A database storing the identity information of different speakers may be constructed in advance; for example, the database may contain information such as the names, genders, and ages of the participants. After the initial role in the voice data is obtained, role-related information inferred by analyzing the role voice corresponding to the initial role, such as the age and gender of the role, may be matched against the speakers in the database to determine the identity information of the initial role. For example, if analysis of the role voice indicates that the initial role is a female between 20 and 30 years old, a female between 20 and 30 years old is searched for in the pre-constructed database, and the identity information of the speaker found is used as the identity information of the initial role.
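The database lookup in this example can be sketched as below. The record layout, the names other than Xiaoming, and the rule of accepting only a unique match are all assumptions made for illustration; the text does not specify how ambiguous matches are handled.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    # Hypothetical record layout for the pre-constructed participant database.
    name: str
    gender: str
    age: int


def find_candidates(database, gender, min_age, max_age):
    """Return participants whose gender and age match the attributes
    inferred from the role voice of an initial role."""
    return [
        p for p in database
        if p.gender == gender and min_age <= p.age <= max_age
    ]


database = [
    Participant("Xiaoming", "male", 35),
    Participant("Xiaohong", "female", 26),
    Participant("Lao Wang", "male", 52),
]

# Attributes inferred from the role voice: female, 20 to 30 years old.
matches = find_candidates(database, "female", 20, 30)
# Assumption: accept the match only when it is unique; otherwise fall back
# to another route such as user labeling.
identity = matches[0].name if len(matches) == 1 else None
```

When several participants share the inferred attributes, a lookup like this cannot decide between them, which is one reason the user-marking route remains available.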

It should be noted that, the marking of the identity information of the initial role by the user is actually marking the identity information of the real speaker corresponding to the initial role, and the identity information may include the name of the speaker, and may also include a nickname, a position, and the like of the speaker.

Step 130, performing voiceprint registration on the initial role based on the identity information of the initial role and the voiceprint feature of the initial role in the voice data.

Specifically, after the identity information corresponding to the initial role in the voice data is determined in step 120, labeling the identity of the initial role further requires the voiceprint feature of the initial role in the voice data, so that voiceprint registration can be performed on the initial role according to its identity information and its voiceprint feature.

It should be noted that the voiceprint feature of the initial role may be determined either before or after step 120, which is not specifically limited in the embodiment of the present invention. If it is determined before step 120, the process may include the following steps: first, voiceprint separation is performed on the voice data to distinguish the voiceprint features of the initial roles in the voice data and thereby determine the time intervals in which each initial role speaks; on this basis, the voiceprint feature of an initial role is determined from the voice segments within the speaking time intervals of that initial role. If it is determined after step 120, the process may include the following steps: first, the role voice corresponding to the marked initial role is obtained from the voice data; then, voiceprint extraction is performed on that role voice to obtain the voiceprint feature of the initial role. Before voiceprint extraction, voice with better voice quality can be selected from the role voice corresponding to the initial role, so that more accurate voiceprint features of the initial role are obtained from it and the accuracy of voiceprint registration is improved.

After determining the voiceprint features of the initial role, the voiceprint registration of the initial role can be performed according to the identity information of the initial role and the voiceprint features of the initial role in the voice data. After voiceprint registration is completed, in the subsequent voice recording or transcription process, the corresponding relation between the voiceprint characteristics of the initial role and the identity information of the initial role can be directly applied to determine the identity information of each speaker in the voice data.
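A minimal sketch of registration and later reuse of the stored correspondence follows. It assumes that a voiceprint feature is a fixed-length vector of floats and that speakers are matched by cosine similarity against a threshold; neither assumption comes from the text, and the class name `VoiceprintRegistry` and the threshold value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VoiceprintRegistry:
    """Stores the correspondence between identity information and voiceprint features."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed minimum similarity for a match
        self._entries = {}          # identity -> voiceprint feature vector

    def register(self, identity, voiceprint):
        """Voiceprint registration: bind identity information to a voiceprint feature."""
        self._entries[identity] = list(voiceprint)

    def identify(self, voiceprint):
        """Return the best-matching registered identity, or None if no
        registered voiceprint reaches the threshold."""
        best_identity, best_score = None, self.threshold
        for identity, registered in self._entries.items():
            score = cosine_similarity(voiceprint, registered)
            if score >= best_score:
                best_identity, best_score = identity, score
        return best_identity


registry = VoiceprintRegistry()
registry.register("Xiaoming", [1.0, 0.0, 0.0])
```

In a later recording, `registry.identify(feature)` would return "Xiaoming" for a feature close to the registered one and `None` for an unregistered speaker.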

The voiceprint registration method provided by the invention is used for carrying out voiceprint registration based on the identity information of the initial role in the voice data and the voiceprint characteristics of the initial role in the voice data, realizes the multiplexing of the voice data, does not need to additionally record the voice special for voiceprint registration, solves the problem that the voiceprint registration before recording is very complicated, improves the efficiency of voiceprint registration and realizes the fast and accurate voiceprint registration.

Based on the above embodiment, before the voiceprint separation is performed on the voice data, the method further includes:

obtaining the voice data from the recorded voice data stream.

Specifically, before performing voiceprint registration, voice data required for voiceprint registration needs to be acquired from a recorded voice data stream, where the recording may be voice recording or video recording, and this is not particularly limited in this embodiment of the present invention.

The duration of the recorded voice data stream cannot be determined in advance. When the voice data is obtained from the recorded voice data stream, the whole stream can be used as the voice data if its duration is short. Conversely, if the duration is long, reaching dozens of minutes or even hours, a segment can be intercepted from the recorded voice data stream as the required voice data. For example, the duration of the voice data can be preset, and a segment of that preset duration is intercepted from the recorded voice data stream.
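The interception rule can be sketched as follows, assuming the recorded stream is available as a sequence of samples at a known sample rate. The function name, the default preset duration, and the offset parameter are hypothetical; the text does not say where in the stream the segment is taken from.

```python
def take_voice_data(samples, sample_rate, preset_seconds=60.0, offset_seconds=0.0):
    """Return the voice data to be used for voiceprint registration.

    If the recorded stream is no longer than the preset duration, the whole
    stream is used; otherwise a segment of the preset duration is cut out,
    starting at offset_seconds.
    """
    total_seconds = len(samples) / sample_rate
    if total_seconds <= preset_seconds:
        return samples
    start = int(offset_seconds * sample_rate)
    end = start + int(preset_seconds * sample_rate)
    return samples[start:end]


# Toy stream: 10 samples at 2 samples per second, i.e. 5 seconds of audio.
stream = list(range(10))
whole = take_voice_data(stream, sample_rate=2, preset_seconds=10.0)
clip = take_voice_data(stream, sample_rate=2, preset_seconds=2.0, offset_seconds=1.0)
```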

The method provided by the embodiment of the invention can acquire the voice data from the recorded voice data stream, and can save the process of additionally recording the voice data, thereby greatly reducing the time consumed by voiceprint registration and accelerating the process of voiceprint registration.

Based on the foregoing embodiment, FIG. 2 is a schematic flow diagram of step 120 in the voiceprint registration method provided by the present invention, and as shown in FIG. 2, step 120 includes:

Step 121, displaying the transcription text of the role voice corresponding to the initial role in the voice data;

Step 122, receiving the identity information of the initial role corresponding to the transcribed text.

Specifically, after the initial role in the voice data is obtained in step 110, if the initial role needs to be voiceprint registered, the identity information of the initial role needs to be determined, that is, the identity information of the real speaker corresponding to the initial role needs to be determined, so that the initial role can be voiceprint registered according to the identity information of the initial role.

The identity information of the initial role is determined as follows. In step 121, voice transcription is first performed on the voice data to obtain the transcription text of the voice data; then, according to the time axis corresponding to each initial role in the voice data, the transcription text of the role voice corresponding to each initial role is located in the transcription text of the voice data and displayed on the intelligent terminal. Subsequently, step 122 is executed to receive the identity information of the initial role corresponding to the transcribed text. This identity information may be determined by searching a pre-stored database according to the transcribed text, or may be marked by the user, which is not specifically limited in this embodiment of the present invention.

The following description takes receiving user-marked identity information of the initial role corresponding to the transcribed text as an example: the user can input the identity information of the initial role through the intelligent terminal, thereby labeling the initial role with identity information. It should be noted that the user may input identity information for every displayed initial role, or may select one or more of the displayed initial roles and input identity information only for them, which is not specifically limited in the embodiment of the present invention.

After that, the intelligent terminal can send the user-labeled identity information of the initial role corresponding to the transcribed text to a server that performs voiceprint registration, and the server receives it. Alternatively, the intelligent terminal itself can undertake the voiceprint registration function, in which case the intelligent terminal obtains the user-labeled identity information of the initial role corresponding to the transcribed text.
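Steps 121 and 122 can be sketched as below: transcribed segments are assigned to initial roles by time-axis overlap, and a user-supplied identity replaces the placeholder label at the start of each line. The overlap rule, the function names, and the data layout are assumptions made for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap between two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def transcript_by_role(role_intervals, transcript):
    """Assign each transcribed segment to the initial role whose speaking
    intervals overlap it the most.

    role_intervals: {"[speaker N]": [(start, end), ...]}
    transcript: [(start, end, text), ...]
    """
    result = {label: [] for label in role_intervals}
    for t_start, t_end, text in transcript:
        best_label, best_overlap = None, 0.0
        for label, intervals in role_intervals.items():
            total = sum(overlap(t_start, t_end, s, e) for s, e in intervals)
            if total > best_overlap:
                best_label, best_overlap = label, total
        if best_label is not None:
            result[best_label].append(text)
    return result


def render(by_role, identities):
    """Build display lines, using the user-labeled identity where one exists."""
    lines = []
    for label, texts in by_role.items():
        name = identities.get(label)
        shown = f"[{name}]" if name else label
        lines.extend(f"{shown} {text}" for text in texts)
    return lines


role_intervals = {"[speaker 1]": [(0.0, 4.0), (9.0, 12.0)], "[speaker 2]": [(4.0, 9.0)]}
transcript = [
    (0.5, 3.5, "speech content 1"),
    (4.5, 8.0, "speech content 2"),
    (9.5, 11.0, "speech content 3"),
]
by_role = transcript_by_role(role_intervals, transcript)
# The user labels only one of the displayed initial roles.
lines = render(by_role, {"[speaker 1]": "Xiaoming"})
```

Roles the user has not labeled keep their placeholder, which matches the note that identity information may be entered for only some of the displayed initial roles.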

In addition, the following description takes the case in which the identity information of the initial role is determined by searching as an example: a database storing the identity information of different speakers may be constructed in advance; for example, the database may contain information such as the participants' meeting outlines and meeting summaries. After the initial role in the voice data is obtained, role-related information inferred by analyzing the transcription text of the role voice corresponding to the initial role, such as the meeting outline or meeting summary it relates to, may be matched against the speakers in the database to determine the identity information of the initial role. For example, if analysis of the transcription text indicates that the initial role is the speaker for a particular field, the speaker associated with that field in the meeting summary is searched for in the pre-constructed database, and the identity information of the speaker found is used as the identity information of the initial role.

According to the method provided by the embodiment of the invention, after the initial role in the voice data is determined, the transcription text of the role voice corresponding to the initial role is displayed. The user can determine the identity information of the initial role directly from the displayed transcription text and label it through the intelligent terminal; alternatively, the corresponding speaker is searched for in a pre-stored database according to the transcription text, and the identity information of the speaker found is used as the identity information of the initial role. The server can thus receive the identity information of the initial role corresponding to the transcription text. This ensures the accuracy of the identity information of the initial role, speeds up its determination, and supports fast and accurate voiceprint registration.

Based on the above embodiment, FIG. 3 is a second schematic flow chart of step 120 in the voiceprint registration method provided by the present invention, and as shown in FIG. 3, step 120 includes:

Step 123, displaying role voice corresponding to the initial role in the voice data;

Step 124, receiving the identity information of the initial role corresponding to the role voice.

The identity information of the initial role is determined as follows. In step 123, the role voice corresponding to the initial role is first determined from the voice data and then displayed on the intelligent terminal. Subsequently, step 124 is executed to receive the identity information of the initial role corresponding to the role voice. This identity information may be determined by searching a pre-stored database according to the role voice corresponding to the initial role, or may be marked by the user, which is not specifically limited in this embodiment of the present invention. The following description takes user-marked identity information of the initial role corresponding to the role voice as an example: the user can input the identity information of the initial role through the intelligent terminal, thereby labeling the initial role with identity information. It should be noted that the user may input identity information for every displayed initial role, or may select one or more of the displayed initial roles and input identity information only for them, which is not specifically limited in the embodiment of the present invention. After that, the intelligent terminal can send the identity information of the initial role marked by the user to a server that performs voiceprint registration, and the server receives it. Alternatively, the intelligent terminal itself can undertake the voiceprint registration function, in which case the intelligent terminal obtains the identity information of the initial role marked by the user.

In addition, the following description takes the case in which the identity information of the initial role is determined by searching as an example: a database storing the identity information of different speakers may be constructed in advance; for example, the database may contain information such as the names, genders, and ages of the participants. After the initial role in the voice data is obtained, role-related information inferred by analyzing the role voice corresponding to the initial role, such as the age and gender of the role, may be matched against the speakers in the database to determine the identity information of the initial role. For example, if analysis of the role voice indicates that the initial role is a male between 30 and 40 years old, a male between 30 and 40 years old is searched for in the pre-constructed database, and the identity information of the speaker found is used as the identity information of the initial role.

According to the method provided by the embodiment of the invention, after the initial role in the voice data is determined, the role voice corresponding to the initial role is displayed. The user can determine the identity information of the initial role directly from the displayed role voice and label it through the intelligent terminal; alternatively, the corresponding speaker is searched for in the pre-constructed database according to the role-related information obtained by analyzing the role voice of the initial role, and the identity information of the speaker found is used as the identity information of the initial role. The server can thus receive the identity information of the initial role corresponding to the role voice. This ensures the accuracy of the identity information of the initial role, speeds up its determination, and supports fast and accurate voiceprint registration.

In addition, the intelligent terminal can display both the role voice corresponding to the initial role and the transcription text of that role voice. The user can determine the identity information of the initial role from either or both kinds of information displayed by the intelligent terminal and input it through the intelligent terminal, thereby labeling the initial role with identity information.

Based on the above embodiment, FIG. 4 is a schematic diagram of a determining process of a voiceprint feature provided by the present invention, and as shown in FIG. 4, the voiceprint feature is determined based on the following steps:

Step 410, selecting a sample voice of an initial role from the role voice corresponding to the initial role in the voice data;

Step 420, carrying out voiceprint extraction on the sample voice to obtain the voiceprint characteristics of the initial role.

Specifically, after the initial role in the voice data and the identity information of the initial role are obtained through steps 110 and 120, labeling the identity of the initial role further requires the voiceprint feature of the initial role in the voice data, so that voiceprint registration can be performed on the initial role according to its identity information and its voiceprint feature.

The voiceprint feature of the initial role can be obtained by performing voiceprint extraction on a role voice corresponding to the initial role, wherein the role voice is a voice section corresponding to the initial role in the voice data, and the voice section is obtained by performing voiceprint separation on the voice data.

Considering that the initial role may correspond to a plurality of segments of role voices in the voice data, if voiceprint extraction is performed on each segment of role voice, not only is it necessary to take a long time, but also the progress of voiceprint registration is delayed, based on this embodiment of the present invention, before voiceprint extraction, step 410 may be further performed to select sample voices from the role voices corresponding to the initial role in the voice data, where the sample voices are the role voices with higher voice quality and longer voice duration in the role voices corresponding to the initial role, and the voice quality of the role voices may be determined by evaluation of voice definition, noise size, and the like, that is, the sample voices corresponding to the initial role may be selected from each segment of role voices according to at least one of the voice definition, noise size, and voice duration of each segment of role voices.

It should be noted that the sample voice is not limited to a single segment, that is, it is not necessary to use only the role voice with the highest voice quality and/or the longest voice duration as the sample voice. Because the voiceprint features extracted from the role voice of the same initial role in different time intervals may differ slightly, multiple segments of sample voice may also be selected, that is, the top several segments of role voice ranked by voice quality and/or voice duration are used as the sample voices.

After the sample voice corresponding to the initial role is selected, step 420 may be executed to perform voiceprint extraction on the sample voice, thereby obtaining the voiceprint features of the initial role. When the sample voice consists of multiple segments, the voiceprint extraction process comprises the following steps: firstly, carrying out voiceprint extraction on each segment of sample voice to obtain the voiceprint features of the initial role in each segment of sample voice; and then, fusing the voiceprint features of the initial role in each segment of sample voice to obtain the voiceprint features of the initial role, where the fusion may be performed, for example, by calculating the average value.
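The fusion-by-averaging step described above can be sketched as follows. This is a minimal illustration, assuming each voiceprint feature is a fixed-length vector of floats; the function name `fuse_voiceprints` is hypothetical, and the voiceprint extractor itself is not shown because the description does not specify one.

```python
from typing import List, Sequence


def fuse_voiceprints(features: Sequence[Sequence[float]]) -> List[float]:
    """Fuse per-segment voiceprint features by element-wise averaging.

    Assumption: every feature is a vector of the same dimension.
    """
    if not features:
        raise ValueError("at least one voiceprint feature is required")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all voiceprint features must have the same dimension")
    n = len(features)
    # Average each dimension across all sample-voice segments.
    return [sum(f[i] for f in features) / n for i in range(dim)]
```

With a single segment of sample voice the function simply returns that segment's feature, so the same code path covers both the single-segment and multi-segment cases.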

Then, the initial role can be registered with voiceprint according to the identity information of the initial role and the voiceprint characteristics of the initial role in the voice data.

Based on the above embodiment, step 410 includes:

and selecting sample voice of the initial role from each segment of role voice based on the voice time length and/or the voice definition of each segment of role voice corresponding to the initial role.

Voiceprint registration of the initial role relies on the voiceprint features of the initial role. These features are obtained by voiceprint extraction from the sample voice of the initial role, which in turn is selected from the role voices corresponding to the initial role in the voice data. When a segment of role voice is short, accurate voiceprint features of the initial role cannot be extracted from it, so that segment cannot be used as the sample voice of the initial role.

In addition, if unclear voice sections exist in the role voice corresponding to the initial role, voiceprint extraction based on that role voice is seriously affected, and the accuracy of the extracted voiceprint features is low. Therefore, when the sample voice is selected from the role voices corresponding to the initial role, the voice duration and the voice definition of each segment of role voice are critical: they determine the accuracy of the voiceprint features of the initial role and thereby indirectly affect the voiceprint registration of the initial role.

Based on this, in step 410, when selecting the sample voice corresponding to the initial role from the role voices corresponding to the initial role in the voice data, the sample voice of the initial role can be selected from each segment of role voices by referring to the voice duration of each segment of role voices or the voice definition of each segment of voice data; and the influence of the voice duration and the voice definition of each section of voice data on the voiceprint extraction process can be comprehensively considered, and the sample voice of the initial role is selected from each section of role voice.

The voice definition here characterizes how clear the voice is, and can be determined by one or more of the number of mood words (tone words), the noise level, and the number of semantic errors in each segment of role voice.

The voice duration is the length of time from the beginning to the end of each segment of role voice. The voice duration of the selected sample voice may be required to be greater than or equal to 15 seconds, 20 seconds, 25 seconds, and the like. Preferably, in the embodiment of the present invention, the voice duration of the sample voice is greater than or equal to 15 seconds. That is, with 15 seconds as the threshold, role voices shorter than 15 seconds are filtered out, and the sample voice is selected from the remaining role voices.
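The duration filter described above can be sketched as follows. The 15-second threshold comes from the embodiment; the `RoleVoice` type, its field names, and the function name are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import List

MIN_DURATION_S = 15.0  # threshold preferred in the embodiment


@dataclass
class RoleVoice:
    segment_id: str  # hypothetical identifier for a segment of role voice
    start_s: float   # start time within the voice data, in seconds
    end_s: float     # end time within the voice data, in seconds

    @property
    def duration_s(self) -> float:
        # Voice duration: time from the beginning to the end of the segment.
        return self.end_s - self.start_s


def filter_by_duration(
    voices: List[RoleVoice], min_duration_s: float = MIN_DURATION_S
) -> List[RoleVoice]:
    """Drop role voices shorter than the threshold; keep the original order."""
    return [v for v in voices if v.duration_s >= min_duration_s]
```

The threshold is a parameter so that the alternative values mentioned in the description (20 or 25 seconds) can be used without changing the function.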

Based on the above embodiment, the voice definition is determined based on the number of mood words and/or the number of semantic errors contained in the corresponding role voice.

Actually recorded voice data may contain a large number of mood words (tone words). A large number of mood words adversely affects subsequent voiceprint extraction and voiceprint registration, and a voiceprint registered from role voice containing many mood words can be misleading and cause a large number of recognition errors in subsequent use. In addition, semantic errors may exist in the role voice. For example, in the transcription text of a role voice, "compared to the gene, more is that the message 712 is dead frequently as if the difference has not been felt yet for a while, but the message T two is not yet encountered", fragments such as "712" and "T two" are semantic errors. A large number of semantic errors likewise adversely affects subsequent voiceprint extraction and voiceprint registration.

Based on this, in the embodiment of the present invention, when the sample speech is selected according to the speech clarity of each segment of role speech, the sample speech corresponding to the initial role can be selected from each segment of role speech corresponding to the initial role further according to the number of the mood words contained in each segment of role speech or the number of semantic errors in each segment of role speech; and selecting sample voice corresponding to the initial role from each segment of role voice corresponding to the initial role by combining two factors of the number of the tone words contained in each segment of role voice and the number of semantic errors in each segment of role voice.

It should be noted that, when both factors are used as selection conditions, the number of mood words and the number of semantic errors may be applied without regard to order, or one after the other, which is not specifically limited in the embodiment of the present invention.

The following describes the selection of the sample voice of the initial role, taking as an example the order in which the number of mood words is applied first and the number of semantic errors second:

firstly, acquiring each segment of role voice corresponding to an initial role from voice data;

labeling the tone words in each character voice, and determining the number of the tone words contained in each character voice;

then, according to the number of the tone words contained in each segment of role voice, selecting candidate sample voice from each segment of role voice corresponding to the initial role, namely, according to the sequence of the number of the tone words in each segment of role voice from less to more, sequentially selecting role voice as candidate sample voice;

then, performing semantic understanding on each segment of candidate sample voice to obtain the number of semantic errors in each segment of candidate sample voice, namely labeling the semantic errors in each segment of candidate sample voice and determining the number of semantic errors in each segment of candidate sample voice;

and finally, determining sample voice corresponding to the initial role from each segment of candidate sample voice according to the number of semantic errors in each segment of candidate sample voice, namely sequentially selecting the candidate sample voice as the sample voice of the initial role according to the sequence of the number of semantic errors in each segment of candidate sample voice from less to more.
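The two-stage selection in the steps above can be sketched as follows. The sketch assumes that the mood-word count and the semantic-error count of each segment have already been produced by some labeling and semantic-understanding step that the description does not specify; the type `ScoredVoice`, the function name, and the cut-off parameters `num_candidates` and `num_samples` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredVoice:
    segment_id: str            # hypothetical identifier for a segment of role voice
    mood_word_count: int       # assumed to come from an upstream labeling step
    semantic_error_count: int  # assumed to come from an upstream semantic-understanding step


def select_samples(
    voices: List[ScoredVoice], num_candidates: int, num_samples: int
) -> List[ScoredVoice]:
    """Select sample voices: fewest mood words first, then fewest semantic errors."""
    # Stage 1: order by mood-word count (ascending) and keep the candidates.
    candidates = sorted(voices, key=lambda v: v.mood_word_count)[:num_candidates]
    # Stage 2: order the candidates by semantic-error count (ascending).
    return sorted(candidates, key=lambda v: v.semantic_error_count)[:num_samples]
```

Because Python's `sorted` is stable, candidates with equal semantic-error counts keep their mood-word ordering from the first stage.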

According to the method provided by the embodiment of the invention, the sample voice corresponding to the initial role is selected from the role voices corresponding to the initial role according to one or both of the number of mood words and the number of semantic errors contained in each segment of role voice. Selecting on these conditions ensures the voice definition of the selected sample voice. Using role voice with higher voice definition as the sample voice for voiceprint registration improves the accuracy of the voiceprint features obtained by voiceprint extraction and ensures the reliability and accuracy of voiceprint registration.

Based on the above embodiment, step 410 includes:

and receiving sample voice of the initial role selected from the character voices corresponding to the initial role.

In addition to being selected by the server from the role voices corresponding to the initial role in the voice data, the sample voice may also be selected by the user. Specifically, the process includes: firstly, acquiring the role voices corresponding to the initial role from the voice data; then, having the user select the sample voice of the initial role from the segments of role voice corresponding to the initial role; and then, receiving the sample voice of the initial role selected by the user.

It should be noted that, in view of the influence of the voice duration and the voice definition on the voiceprint feature, when the user selects the sample voice, the user may refer to the voice duration of each segment of character voice or the voice definition of each segment of voice data, and select the sample voice of the initial character from each segment of character voice; the two factors of the speech duration and the speech definition of each segment of speech data may also be integrated, and the sample speech of the initial character is selected from each segment of character speech, which is not specifically limited in the embodiment of the present invention.

Based on the above embodiment, the number of sample voices is equal to or less than the preset number.

Specifically, before voiceprint registration is performed on the initial role according to the identity information of the initial role and the voiceprint features of the initial role in the voice data, the number of sample voices needs to be determined and compared with a preset number. The preset number is the maximum number of sample voices that the device for voiceprint registration can accommodate, and can be predetermined according to the storage capacity of that device. In the embodiment of the present invention, the preset number is 64 segments.

Further, if the number of sample voices reaches a preset number, which indicates that the number of sample voices stored in the device for voiceprint registration at this time reaches a maximum number that can be accommodated, a new sample voice cannot be selected for voiceprint registration at this time, and if a new sample voice needs to be selected for voiceprint registration from each segment of role voices corresponding to the initial role, a part of sample voices which have completed voiceprint registration needs to be deleted, so that the number of sample voices is reduced, and thus, voiceprint registration can be performed based on the new sample voices.

Correspondingly, if the number of the sample voices does not reach the preset number, which indicates that the number of the sample voices stored in the equipment for voiceprint registration does not reach the maximum number capable of being accommodated at this time, the voiceprint registration can be performed on the initial role directly according to the identity information of the initial role and the voiceprint characteristics of the initial role in the voice data.
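The capacity check described above can be sketched as a small store with an upper bound. The limit of 64 comes from the embodiment; the class `SampleStore` and its method names are hypothetical, and samples are represented only by a speaker name and a sample identifier.

```python
from typing import List, Tuple

MAX_SAMPLES = 64  # preset number in the embodiment


class SampleStore:
    """Holds registered sample voices up to a preset maximum."""

    def __init__(self, max_samples: int = MAX_SAMPLES) -> None:
        self.max_samples = max_samples
        self._samples: List[Tuple[str, str]] = []  # (speaker name, sample id)

    def is_full(self) -> bool:
        return len(self._samples) >= self.max_samples

    def add(self, speaker: str, sample_id: str) -> bool:
        """Register a sample; return False if the preset number is reached."""
        if self.is_full():
            return False
        self._samples.append((speaker, sample_id))
        return True

    def delete(self, sample_id: str) -> bool:
        """Delete a registered sample to free capacity; return True if found."""
        before = len(self._samples)
        self._samples = [s for s in self._samples if s[1] != sample_id]
        return len(self._samples) < before
```

Returning `False` from `add` rather than raising leaves it to the caller to prompt the user to delete registered samples first, as the embodiment describes.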

Based on the above embodiment, the voiceprint registration of the initial role further includes:

and marking the speaker for the voice data based on the identity information of each initial role and the voiceprint characteristics of each initial role in the voice data.

Specifically, in actual voice recording, multiple speakers usually speak at intervals, and different speakers often need to be labeled according to the actual recording requirements. After labeling, the role voice corresponding to each speaker is the speaking content of that speaker.

Fig. 5 is one of display diagrams of an interface for speaker labeling according to the present invention, and as shown in fig. 5, when the device for voiceprint registration performs a first voice transcription on recorded voice data, it first needs to perform voiceprint separation on the voice data, i.e. distinguish and judge the voiceprint features of each speaker in the voice data, and determine the matching degree between the voiceprint features of each speaker in the voice data; then, according to the matching degree of the voiceprint features of each speaker, performing initial role labeling on each speaker, and labeling the speaker with higher matching degree of the voiceprint features as the same speaker, for example, labeling the speaking content of the same speaker in different time intervals as: [ speaker 1 ] speaking content 1, and [ speaker 1 ] speaking content 2; the speakers with lower matching degree of the voiceprint features are labeled as different speakers, for example, the speaking contents of different speakers are labeled as [ speaker 1 ] speaking content 1 and [ speaker 2 ] speaking content 2, so as to obtain the initial role labels of the speakers in the voice data.

Then, updating the initial role label according to the voiceprint features of the initial role which has finished the voiceprint registration and the identity information of the initial role, thereby obtaining the role label of each speaker in the voice data, wherein the process can specifically be that the voiceprint features of the initial role matched with the voiceprint features of each speaker in the voice data are determined from the voiceprint features of the initial role which has finished the voiceprint registration; and then, attaching the identity information corresponding to the voiceprint characteristics of the initial role to the corresponding speaker, namely modifying the name of the corresponding speaker in the transcription text obtained by voice transcription into the name contained in the identity information corresponding to the initial role, wherein the modification of the name of the speaker can be realized by clicking the name of the speaker, inputting the name of the speaker contained in the identity information in a popped name modification frame and clicking a confirmation button.
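The matching step described above can be sketched as follows. The description only speaks of a "matching degree" between voiceprint features; cosine similarity and the threshold of 0.8 are assumptions made for this illustration, and the function names and the example names used with it are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as an assumed 'matching degree'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_identity(
    speaker_feature: Sequence[float],
    registered: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed value, not taken from the description
) -> Optional[str]:
    """Return the registered name that best matches, or None if below threshold."""
    best_name, best_score = None, threshold
    for name, feature in registered.items():
        score = cosine_similarity(speaker_feature, feature)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def relabel(
    initial_labels: Dict[str, Sequence[float]],
    registered: Dict[str, Sequence[float]],
) -> Dict[str, str]:
    """Map each initial role label to a registered name, or keep it unchanged."""
    return {
        label: match_identity(feature, registered) or label
        for label, feature in initial_labels.items()
    }
```

Speakers whose features match no registered voiceprint keep their initial label, for example [ speaker 3 ], so they can still be named manually.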

It should be noted that, when the speaker name is modified, an option of "modify this one" or "modify all" also needs to be selected in the pop-up name modification box. "Modify this one" means that only the clicked speaker name is modified, for example, only the clicked [ speaker 2 ] is changed to the entered name. "Modify all" means that every occurrence of the clicked speaker name in the transcribed text is modified, for example, every [ speaker 2 ] in the transcribed text in fig. 5 is changed to the entered name, while the other speaker names ([ speaker 1 ], [ speaker 3 ], etc.) are left unchanged.
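The difference between modifying one speaker name and modifying all occurrences can be sketched as follows. The transcript is modeled, as an assumption, as a list of (speaker label, content) pairs; the function name and the `modify_all` flag are hypothetical.

```python
from typing import List, Tuple

Utterance = Tuple[str, str]  # (speaker label, speaking content)


def rename_speaker(
    transcript: List[Utterance], index: int, new_name: str, modify_all: bool
) -> List[Utterance]:
    """Rename the clicked speaker label, either only at `index` or everywhere."""
    clicked_label = transcript[index][0]
    result: List[Utterance] = []
    for i, (speaker, content) in enumerate(transcript):
        if i == index or (modify_all and speaker == clicked_label):
            result.append((new_name, content))
        else:
            result.append((speaker, content))
    return result
```

The function returns a new list rather than mutating the transcript, so the interface can discard the change if the user cancels the dialog.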

Considering that there may be slight differences in voiceprint features extracted from the same speaker in different periods of time, when selecting a sample voice, a plurality of sample voices may be added for voiceprint registration, so that the voiceprint features of each speaker can be distinguished more accurately in the subsequent voice transcription process.

Based on this, the name modification box popped up in the embodiment of the present invention is further provided with a check box, the content of which is that "the corresponding audio is stored as the voice sample of the speaker", it can be understood that if the check box is checked, the role voice corresponding to the clicked speaker name is stored, the stored role voice is used as the sample voice of the clicked speaker, and after the check is finished, the confirmation button is clicked, and then the registration can be completed.

However, when the voice duration of the role voice corresponding to the clicked speaker name is short, or the number of sample voices has reached the preset number, the pop-up name modification box is displayed accordingly. Fig. 6 is a second display diagram of the interface for speaker labeling provided by the present invention. As shown in fig. 6, if the voice duration of the role voice corresponding to the clicked speaker name is less than 15 seconds, the pop-up name modification box displays "the audio time corresponding to the selected transcription result is less than 15 seconds". In this case, voiceprint registration cannot be performed based on this role voice, the check box cannot be checked, and a role voice with a voice duration exceeding 15 seconds needs to be reselected as the sample voice for voiceprint registration.

If the number of sample voices has reached the preset number, the pop-up name modification box displays "the sample reaches the upper limit 64 and can be managed in the registered speaker interface". In this case, voiceprint registration cannot be performed based on the new sample voice. Instead, the registered speaker interface can be used to delete part of the registered sample voices and reduce the number of sample voices, after which the role voice corresponding to the clicked speaker name can be used as the sample voice for voiceprint registration.
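The two dialog conditions described above can be sketched as a single decision function. The 15-second and 64-sample limits and the two messages come from the description; the `DialogState` type, the function name, and the choice to check the duration before the sample count when both conditions hold are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

MIN_DURATION_S = 15.0  # minimum voice duration in the embodiment
MAX_SAMPLES = 64       # preset number of sample voices in the embodiment


@dataclass
class DialogState:
    checkbox_enabled: bool   # whether "store as voice sample" can be checked
    message: Optional[str]   # hint shown in the name modification box, if any


def name_dialog_state(duration_s: float, stored_samples: int) -> DialogState:
    """Decide what the name modification box shows for the clicked role voice."""
    if duration_s < MIN_DURATION_S:
        return DialogState(
            False,
            "the audio time corresponding to the selected transcription "
            "result is less than 15 seconds",
        )
    if stored_samples >= MAX_SAMPLES:
        return DialogState(
            False,
            "the sample reaches the upper limit 64 and can be managed in "
            "the registered speaker interface",
        )
    return DialogState(True, None)
```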

According to the method provided by the embodiment of the invention, check boxes are checked in the popped-up name modification boxes, the role voice corresponding to the name of the clicked speaker is used as the sample voice, voiceprint registration is carried out according to the sample voice, so that the speaker can be automatically identified and displayed in a subsequent voice transcription scene conveniently, additional voice data recording is not needed for voiceprint registration, the step of additionally recording voice data is omitted, and the progress of voiceprint registration is accelerated.

The following describes the voiceprint registration apparatus provided by the present invention, and the voiceprint registration apparatus described below and the voiceprint registration method described above can be referred to correspondingly.

Fig. 7 is a schematic structural diagram of a voiceprint registration apparatus provided in the present invention, and as shown in fig. 7, the apparatus includes:

a voiceprint separation unit 710, configured to perform voiceprint separation on voice data to obtain an initial role in the voice data;

an identity information receiving unit 720, configured to receive identity information of the initial role;

a voiceprint registration unit 730, configured to perform voiceprint registration on the initial role based on the identity information of the initial role and a voiceprint feature of the initial role in the voice data.
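The cooperation of the three units can be sketched as follows. This is only an illustration under simplifying assumptions: the separation unit is modeled as a callable that returns each initial role together with its voiceprint feature, the identity receiving unit as a callable that returns a name for a role, and registration as storing the feature under that name. All names in the sketch are hypothetical, and no real separation or extraction model is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Feature = List[float]


@dataclass
class VoiceprintRegistrationApparatus:
    # Stand-in for unit 710: voice data -> {initial role: voiceprint feature}
    separate: Callable[[bytes], Dict[str, Feature]]
    # Stand-in for unit 720: initial role -> identity information (a name here)
    receive_identity: Callable[[str], str]
    # Storage used by the stand-in for unit 730
    registry: Dict[str, Feature] = field(default_factory=dict)

    def register(self, voice_data: bytes) -> Dict[str, Feature]:
        """Register every initial role found in the voice data."""
        for role, feature in self.separate(voice_data).items():
            identity = self.receive_identity(role)
            # Unit 730: bind the identity information to the voiceprint feature.
            self.registry[identity] = feature
        return self.registry
```

Passing the units in as callables keeps the sketch independent of any particular separation model or user interface.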

The voiceprint registration device provided by the invention can be used for carrying out voiceprint registration based on the identity information of the initial role in the voice data and the voiceprint characteristics of the initial role in the voice data, so that the multiplexing of the voice data is realized, the voice special for voiceprint registration is not required to be additionally recorded, the problem that the voiceprint registration before recording is very complicated is solved, the efficiency of voiceprint registration is improved, and the fast and accurate voiceprint registration is realized.

Based on the above embodiment, the identity information receiving unit 720 is configured to:

displaying role voice corresponding to the initial role in the voice data;

Based on the above embodiment, the apparatus further includes a voiceprint feature determination unit, configured to:

Based on the above embodiment, the voiceprint feature determination unit is configured to:

Fig. 8 illustrates a physical structure diagram of an electronic device. As shown in fig. 8, the electronic device may include: a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a voiceprint registration method comprising: carrying out voiceprint separation on voice data to obtain an initial role in the voice data; receiving identity information of the initial role; and performing voiceprint registration on the initial role based on the identity information of the initial role and the voiceprint characteristics of the initial role in the voice data.

In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the voiceprint registration method provided by the above methods, the method comprising: carrying out voiceprint separation on voice data to obtain an initial role in the voice data; receiving identity information of the initial role; and performing voiceprint registration on the initial role based on the identity information of the initial role and the voiceprint characteristics of the initial role in the voice data.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the voiceprint registration method provided by the above methods, the method comprising: carrying out voiceprint separation on voice data to obtain an initial role in the voice data; receiving identity information of the initial role; and performing voiceprint registration on the initial role based on the identity information of the initial role and the voiceprint characteristics of the initial role in the voice data.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A voiceprint registration method, comprising:

receiving identity information of the initial role;

2. The voiceprint registration method according to claim 1, wherein the receiving identity information of the initial role comprises:

3. The voiceprint registration method according to claim 1, wherein the receiving identity information of the initial role comprises:

displaying role voice corresponding to the initial role in the voice data;

4. The voiceprint registration method according to any one of claims 1 to 3, wherein the voiceprint feature is determined based on the following steps:

5. The voiceprint registration method according to claim 4, wherein the selecting the sample voice of the initial character from the character voices corresponding to the initial character in the voice data includes:

6. The voiceprint registration method according to claim 5, wherein the speech intelligibility is determined based on the number of mood words and/or the number of semantic errors contained in the speech of the corresponding character.

7. The voiceprint registration method according to claim 4, wherein the selecting the sample voice of the initial character from the character voices corresponding to the initial character in the voice data includes:

8. A voiceprint registration apparatus, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the voiceprint registration method according to any one of claims 1 to 7 when executing the program.

10. A non-transitory computer readable storage medium, having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the voiceprint registration method of any one of claims 1 to 7.