CN114299957A - Voiceprint separation method and device, electronic equipment and storage medium - Google Patents

Voiceprint separation method and device, electronic equipment and storage medium

Info

Publication number
CN114299957A
Authority
CN
China
Prior art keywords
text information
audio
voiceprint
information set
clip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111448940.3A
Other languages
Chinese (zh)
Inventor
郭启行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111448940.3A priority Critical patent/CN114299957A/en
Publication of CN114299957A publication Critical patent/CN114299957A/en
Pending legal-status Critical Current

Abstract

The disclosure provides a voiceprint separation method, apparatus, device, and storage medium, relating to the field of computer technology and in particular to speech recognition. The specific implementation is as follows: removing noise segments from at least one audio segment corresponding to audio data based on a confidence recognition result of the at least one audio segment, to obtain a target audio segment set; obtaining a voiceprint feature corresponding to at least one target audio segment in the target audio segment set; and clustering the at least one target audio segment based on the voiceprint features to obtain a voiceprint separation result corresponding to the audio data. Embodiments of the disclosure can effectively remove noise segments from audio data, improve the accuracy of obtaining valid audio segments, and improve the accuracy of voiceprint separation.

Description

Voiceprint separation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a voiceprint separation method and apparatus, an electronic device, and a storage medium.
Background
Audio recognition technology enables smart devices to understand human language. It is an interdisciplinary field involving digital signal processing, artificial intelligence, linguistics, mathematical statistics, acoustics, affective science, and psychology. The technology supports many applications, such as automated customer service, automatic speech translation, command and control, and voice verification codes. In recent years, with the rise of artificial intelligence, speech recognition technology has made breakthroughs in both theory and application, moving from the laboratory to the market and gradually entering daily life.
Disclosure of Invention
The present disclosure provides a voiceprint separation method, apparatus, electronic device, and storage medium for improving accuracy of audio recognition.
According to an aspect of the present disclosure, there is provided a voiceprint separation method including:
removing noise segments in at least one audio segment based on a confidence recognition result of the at least one audio segment corresponding to the audio data, and acquiring a target audio segment set;
acquiring a voiceprint characteristic corresponding to at least one target audio clip in the target audio clip set;
and clustering the at least one target audio segment based on the voiceprint characteristics to obtain a voiceprint separation result corresponding to the audio data.
According to another aspect of the present disclosure, there is provided a voiceprint separation apparatus comprising:
the audio acquisition unit is used for removing noise fragments in at least one audio fragment based on the confidence coefficient identification result of the at least one audio fragment corresponding to the audio data to acquire a target audio fragment set;
the voiceprint acquisition unit is used for acquiring a voiceprint characteristic corresponding to at least one target audio clip in the target audio clip set;
and the voiceprint separation unit is used for clustering the at least one target audio clip based on the voiceprint characteristics to obtain a voiceprint separation result corresponding to the audio data.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the preceding aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of the preceding aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any one of the preceding aspects.
In one or more embodiments of the present disclosure, based on a confidence recognition result of at least one audio segment corresponding to audio data, noise segments in the at least one audio segment can be removed to obtain a target audio segment set; a voiceprint feature corresponding to at least one target audio segment in the target audio segment set is obtained; and the at least one target audio segment is clustered based on the voiceprint features to obtain a voiceprint separation result corresponding to the audio data. This can improve the accuracy of audio recognition.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a background schematic diagram of a voiceprint separation method used to implement an embodiment of the present disclosure;
FIG. 2 is a system architecture diagram for implementing the voiceprint separation method of an embodiment of the present disclosure;
FIG. 3 is a schematic flow diagram of a voiceprint separation method according to a first embodiment of the present disclosure;
FIG. 4 is a schematic flow diagram of a voiceprint separation method according to a second embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a scenario for implementing the voiceprint separation method of an embodiment of the present disclosure;
FIG. 6a is a schematic structural diagram of a first voiceprint separation apparatus for implementing the voiceprint separation method of the embodiments of the present disclosure;
FIG. 6b is a schematic structural diagram of a second voiceprint separation apparatus for implementing the voiceprint separation method of the embodiments of the present disclosure;
FIG. 6c is a schematic structural diagram of a third voiceprint separation apparatus for implementing the voiceprint separation method of the embodiments of the present disclosure;
FIG. 6d is a schematic structural diagram of a fourth voiceprint separation apparatus for implementing the voiceprint separation method of the embodiments of the present disclosure;
FIG. 6e is a schematic structural diagram of a fifth voiceprint separation apparatus for implementing the voiceprint separation method of the embodiment of the disclosure;
FIG. 7 is a block diagram of an electronic device for implementing the voiceprint separation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
With the development of science and technology, speech recognition technology is advancing more and more rapidly. When a piece of audio contains the speech of at least two speakers, various voiceprint separation methods have been developed in order to identify and separate each speaker's audio from it.
Fig. 1 illustrates a background schematic diagram of a voiceprint separation method used to implement embodiments of the present disclosure, according to some embodiments. As shown in fig. 1, when the terminal receives a piece of audio, it may segment the audio and obtain at least one audio sentence. Having obtained at least one audio sentence, the terminal can perform speech recognition on each separated audio sentence through a recognition engine, converting it into text. The terminal can thereby obtain the transcribed text corresponding to different users.
In some embodiments, FIG. 2 shows a system architecture diagram used to implement the voiceprint separation method of embodiments of the present disclosure. As shown in fig. 2, when receiving a piece of audio, the terminal 21 may segment it through Voice Activity Detection (VAD) to obtain the audio segments corresponding to the audio. The terminal can extract voiceprint features from the segmented audio segments. When the terminal has acquired the voiceprint features corresponding to at least one audio segment, it can cluster the extracted voiceprint features, so that the audio of each speaker forms one category; that is, the terminal clusters the audio of the same speaker into a single category. To obtain the text information corresponding to the audio, the terminal may send the audio to the server 23 through the network 22, and the server 23 may recognize the audio, obtain the corresponding text information, and send it to the terminal 21.
It is easily understood that, in addition to the speakers' audio, external noise also exists in the audio. Due to this external noise, at least one extra category can be introduced into the voiceprint separation result, making the clustering result inaccurate and affecting the accuracy of voiceprint separation.
The present application will be described in detail with reference to specific examples.
In a first embodiment, as shown in fig. 3, fig. 3 shows a flow chart of a voiceprint separation method according to a first embodiment of the present disclosure. The method may be implemented by means of a computer program and may run on a voiceprint separation apparatus. The computer program may be integrated into an application or may run as a standalone tool application.
Wherein, the voiceprint separation apparatus may be a terminal with a voiceprint separation function, including but not limited to: wearable devices, handheld devices, personal computers, tablet computers, in-vehicle devices, smart phones, computing devices, or other processing devices connected to a wireless modem. Terminals may be called by different names in different networks, for example: user equipment, access terminal, subscriber unit, subscriber station, mobile station, remote terminal, mobile device, user terminal, wireless communication device, user agent or user equipment, cellular telephone, cordless telephone, Personal Digital Assistant (PDA), or a terminal in a fifth-generation (5G), fourth-generation (4G), third-generation (3G), or future evolved network. The execution body may be a terminal with a voiceprint separation function, or a server with a voiceprint separation function.
Specifically, the voiceprint separation method comprises the following steps:
s301, based on the confidence recognition result of at least one audio clip corresponding to the audio data, removing noise clips in the at least one audio clip, and acquiring a target audio clip set;
according to some embodiments, the audio data refers to single-channel audio data received by the terminal through an acquisition module, for example a Bluetooth headset or a microphone. The audio data does not refer to one fixed piece of audio data: it may differ in acquisition time, in audio content, and in audio duration.
It is to be understood that, for example, when only one speaker talks, the audio data may contain the audio of one speaker; when two speakers talk, it may contain the audio of two speakers; and when three speakers talk, it may contain the audio of three speakers.
In some embodiments, the audio segment refers to an audio segment formed by a terminal slicing received audio data. The audio clip does not refer specifically to a fixed audio clip. For example, the audio pieces may be different based on the number of slices of the audio data, or may be different based on the length of time for which the audio data is sliced. For example, the audio segment may be a 10s audio segment, a 20s audio segment, or a 30s audio segment.
In some embodiments, confidence is the likelihood that the recognition result is the correct result when the terminal performs audio recognition on the audio clip. The higher the confidence value is, the higher the possibility that the recognition result is the correct result is, which is an important basis for performing audio recognition.
Optionally, the confidence recognition result refers to a result obtained after the terminal performs confidence recognition on each audio clip. The confidence recognition result does not refer to a fixed confidence recognition result. The confidence recognition result may be different based on, for example, the difference in audio segments. For example, if the audio content of the audio segments is different, the confidence level recognition result may also change correspondingly.
In some embodiments, a noise segment refers to a segment of audio in which external noise is present. The noise section does not refer to a fixed noise section. The noise section may be different based on, for example, the audio content included in the audio data. For example, the noise segment may be an unvoiced noise segment, and may also be a vocal noise segment.
In some embodiments, the target audio segment refers to audio data to be voiceprint separated. The target audio segment does not refer to a fixed target audio segment. The target audio piece may be different based on, for example, the audio data.
In some embodiments, a set of target audio segments refers to a collection of at least one target audio segment. The target audio segment set does not refer specifically to a fixed target audio segment set. For example, the set of target audio segments may vary based on the number of target audio segments and may also vary based on the duration of time that the target audio segments correspond to.
It is easily understood that, when the terminal receives the audio data, the terminal may segment the received audio data into at least one audio segment. When the terminal receives the audio clip, the terminal may perform confidence recognition on the audio clip. The terminal can remove the noise segments in the audio segments based on the confidence coefficient identification result of the audio segments, so that the target audio segment set is obtained.
S302, acquiring a voiceprint feature corresponding to at least one target audio clip in a target audio clip set;
according to some embodiments, a voiceprint feature refers to an acoustic feature related to the anatomy of the human vocal mechanism, and does not refer to one fixed voiceprint feature. For example, the voiceprint feature may differ based on the speaker corresponding to the target audio segment, or based on the time point corresponding to the target audio segment. Voiceprint features include, but are not limited to, nasality, deep breathy sounds, hoarseness, or laughter.
It is easy to understand that when the terminal acquires the target audio clip set, the voiceprint feature corresponding to at least one target audio clip in the target audio clip set can be acquired. Namely, the terminal can obtain the voiceprint feature corresponding to each target audio clip in the target audio clip set.
S303, clustering at least one target audio fragment based on the voiceprint characteristics to obtain a voiceprint separation result corresponding to the audio data.
According to some embodiments, clustering refers to the process of dividing a set of physical or abstract objects into at least one class consisting of similar objects. Clustering at least one target audio clip refers to a process of clustering at least one target audio clip in a target audio clip set based on different voiceprint characteristics corresponding to the target audio clip.
In some embodiments, the voiceprint separation result refers to a voiceprint separation result corresponding to the audio data obtained when the terminal performs clustering on at least one target audio segment. The voiceprint separation result does not refer specifically to a fixed voiceprint separation result. The voiceprint separation result can be different based on, for example, the voiceprint characteristics corresponding to the audio data.
It is easy to understand that, when the terminal acquires the voiceprint feature corresponding to the at least one target audio segment, based on the voiceprint feature, the terminal may cluster the at least one target audio segment to obtain a voiceprint separation result corresponding to the audio data.
In the embodiments of the present disclosure, removing the noise segments from the at least one audio segment based on the confidence recognition result of the at least one audio segment corresponding to the audio data, and obtaining the target audio segment set, removes the noise in the audio data, improves the accuracy of obtaining the target audio segment set, and reduces the influence of noise segments on the voiceprint separation result. Obtaining the voiceprint feature corresponding to at least one target audio segment in the set and clustering the target audio segments based on those features to obtain the voiceprint separation result avoids spending computing resources on separating noise segments, so no extra computation is consumed; at the same time, because the noise segments have been removed, the accuracy of obtaining the target audio segments, the accuracy of clustering them, and therefore the accuracy of voiceprint separation are all improved.
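For illustration only, the S301-S303 flow described above can be sketched as follows. This is a minimal Python sketch and not the claimed implementation; the helpers recognize_confidence, extract_voiceprint, and cluster_segments are hypothetical placeholders for a confidence-scoring recognition pass, a voiceprint feature extractor, and a clustering step.

```python
# Minimal sketch of the S301-S303 flow. recognize_confidence, extract_voiceprint
# and cluster_segments are hypothetical placeholders, not functions defined by
# this disclosure.
def separate_voiceprints(audio_segments, confidence_threshold=0.5):
    # S301: drop noise segments whose confidence falls below the threshold
    target_segments = [
        seg for seg in audio_segments
        if recognize_confidence(seg) >= confidence_threshold
    ]
    # S302: extract a voiceprint feature for every remaining target segment
    features = [extract_voiceprint(seg) for seg in target_segments]
    # S303: cluster by voiceprint feature; segments in one cluster are
    # attributed to the same speaker
    labels = cluster_segments(features)
    return list(zip(target_segments, labels))
```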
Referring to fig. 4, fig. 4 is a flow chart illustrating a voiceprint separation method according to a second embodiment of the disclosure. Specifically:
S401, segmenting the audio data based on the silence detection result of the audio data to obtain at least one audio segment corresponding to the audio data;
the specific process is as above, and is not described herein again.
In some embodiments, the technical solution of the present disclosure may be applied to a voiceprint separation scenario containing only two classes, or to a multi-class voiceprint separation scenario. The technical solution of the present disclosure can, for example, be applied to a customer service quality inspection scenario.
According to some embodiments, silence detection refers to detecting the start endpoint and the end endpoint of a segment of speech according to a specific rule, where the start endpoint can be regarded as the first detected syllable in the speech signal and the end endpoint can be regarded as the last detected syllable in the speech signal.
For example, in the speech signal "how is the weather today", the first syllable is the start endpoint and the last syllable is the end endpoint. Silence detection may be performed by, for example, a UniMRCP-based VAD, a WebRTC-based VAD, or a DNN-based VAD.
In some embodiments, UniMRCP is an open-source, cross-platform implementation of the MRCP protocol written in C/C++. It includes an MRCP client and an MRCP server, each of which can be used independently. It encapsulates the SIP, RTSP, SDP, MRCPv1, MRCPv2, and RTP/RTCP stacks and provides a consistent API for voice service integrators.
In some embodiments, the UniMRCP-based VAD sets a silence detection threshold and continuously compares the average energy of the audio data over a set time window against that threshold. If the average energy of any segment of audio data changes from below the silence detection threshold to above it, the start of that segment is the start endpoint of a piece of speech; if the average energy changes from above the silence detection threshold to below it, the end of that segment is the end endpoint of a piece of speech. All start endpoints and end endpoints are recorded.
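For illustration, a minimal sketch of the energy-threshold endpoint detection described above, assuming a 1-D array of float samples; the frame length and threshold values are illustrative and are not taken from UniMRCP.

```python
import numpy as np

def detect_endpoints(samples, frame_len=400, energy_threshold=1e-3):
    """Energy-threshold endpoint detection as described above (a simplified
    sketch, not the UniMRCP implementation). frame_len and energy_threshold
    are illustrative values."""
    starts, ends = [], []
    in_speech = False
    for i in range(0, len(samples) - frame_len, frame_len):
        energy = np.mean(samples[i:i + frame_len] ** 2)  # average frame energy
        if not in_speech and energy >= energy_threshold:
            starts.append(i)      # energy rises above threshold: start endpoint
            in_speech = True
        elif in_speech and energy < energy_threshold:
            ends.append(i)        # energy falls below threshold: end endpoint
            in_speech = False
    if in_speech:                 # audio ends while speech is still active
        ends.append(len(samples))
    return list(zip(starts, ends))
```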
In some embodiments, the name WebRTC is an abbreviation of Web Real-Time Communication, an API that supports real-time voice or video conversations in web browsers. WebRTC enables web-based video conferencing, following the WHATWG standard, with the goal of providing Real-Time Communication (RTC) capability in the browser through simple JavaScript. WebRTC provides the core technology of video conferencing, including audio and video capture, encoding and decoding, network transmission, and display, and supports multiple platforms: Windows, Linux, macOS, and Android.
In some embodiments, the WebRTC-based VAD obtains feature vectors of different pieces of audio by using a Gaussian model; the start point of the feature vector of any piece of audio is the start endpoint of that piece of audio, the end point of the feature vector is its end endpoint, and all start endpoints and end endpoints are recorded.
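For reference, the open-source py-webrtcvad binding exposes the WebRTC VAD described above. The sketch below assumes 16-bit mono PCM at 16 kHz and 30 ms frames; the surrounding segmentation logic is an assumption rather than part of this disclosure.

```python
import webrtcvad

def webrtc_speech_flags(pcm_bytes, sample_rate=16000, frame_ms=30, mode=2):
    """Label each frame speech / non-speech with the WebRTC VAD (a sketch;
    assumes 16-bit mono PCM and a frame length WebRTC accepts: 10/20/30 ms)."""
    vad = webrtcvad.Vad(mode)                             # 0 (least) .. 3 (most aggressive)
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    flags = []
    for i in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        flags.append(vad.is_speech(pcm_bytes[i:i + frame_bytes], sample_rate))
    return flags
```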
It is easy to understand that the silence detection result refers to a set of start endpoints and end endpoints obtained by performing silence detection on audio data.
According to some embodiments, the terminal slicing the audio data may refer to the terminal slicing the audio data based on the silence detection result. For example, the terminal may filter out, with a filter, the audio between an end endpoint and the next start endpoint. The durations of the audio segments obtained by slicing the audio data may differ, and may be determined, for example, based on the audio content corresponding to the audio data.
It is easy to understand that after receiving the audio data, the terminal may obtain a silence detection result corresponding to the audio data through silence detection, and based on the silence detection result, the terminal may segment the effective audio data to obtain at least one audio clip corresponding to the audio data. For example, the terminal receives 5min of audio data, acquires 3 start endpoints and 3 end endpoints through silence detection, and filters out the audio data from the end endpoint to the start endpoint through a filter, so that the terminal can acquire, for example, 40s of audio data, 90s of audio data, and 2min of audio data.
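A minimal sketch of the slicing step in the example above, assuming the endpoint pairs have already been detected:

```python
def slice_by_endpoints(samples, endpoints):
    """Keep only the audio between each (start, end) endpoint pair; the data
    from an end endpoint to the next start endpoint is treated as silence and
    filtered out (a sketch of the slicing step described above)."""
    return [samples[start:end] for start, end in endpoints]

# With the 5-minute example above (3 start and 3 end endpoints), this would
# yield three audio segments, e.g. roughly 40 s, 90 s and 2 min long.
```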
S402, based on the confidence coefficient identification result of at least one audio segment corresponding to the audio data, removing a noise segment in the at least one audio segment, and acquiring a target audio segment set;
the specific process is as above, and is not described herein again.
According to some embodiments, the terminal may identify the confidence using methods including, but not limited to, an acoustic score method, a decoder-based extraction method, a heuristic calculation method, and a log-likelihood ratio method.
In some embodiments, when the terminal performs confidence recognition by using an acoustic score method, for example, the acoustic posterior probability obtained in the recognition process may be directly used, that is, the terminal may use the acoustic score as the confidence, and the higher the acoustic score is, the higher the confidence is.
Alternatively, for example, suppose the audio data has a score of 5 for the path "people university", a score of 3 for one path "china"-"people", and a score of 2 for another path "china"-"people". Then the confidence of "people university" is 5/(5+3+2) = 0.5, the confidence of "china" is (3+2)/(5+3+2) = 0.5, the confidence of the first "people" is 3/(5+3+2) = 0.3, and the confidence of the second "people" is 2/(5+3+2) = 0.2.
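For illustration, the normalization in the example above can be sketched as follows; the path labels are placeholders, not the actual lattice entries.

```python
def path_confidence(path_scores):
    """Normalize competing lattice-path scores into confidences, as in the
    worked example above: each path's confidence is its score divided by the
    total score mass. The path labels are illustrative placeholders."""
    total = sum(path_scores.values())
    return {path: score / total for path, score in path_scores.items()}

# 5 / (5 + 3 + 2) = 0.5, 3 / 10 = 0.3, 2 / 10 = 0.2
print(path_confidence({"path_a": 5, "path_b": 3, "path_c": 2}))
```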
In some embodiments, when the terminal performs confidence level recognition by using a decoding extraction method, the confidence level is extracted from the N-best multi-candidate list, the word network and/or the confusion network output by the decoder.
In some embodiments, when the terminal performs the confidence level identification by using a heuristic calculation method, for example, a degradation score of a statistical language model may be used.
In some embodiments, when the terminal performs confidence level identification in a log-likelihood ratio manner, the log-likelihood ratio between the optimal candidate and a certain alternative hypothesis is used as the confidence level.
According to some embodiments, the confidence threshold refers to a threshold corresponding to the confidence, and the confidence threshold is used for determining whether the audio segment is a noise segment. The confidence threshold is not specific to a fixed threshold and may be modified based on the threshold setting instructions. The threshold setting instruction may be, for example, a text threshold setting instruction, a click threshold setting instruction, a timing threshold setting instruction, or a voice threshold setting instruction, and the like.
In some embodiments, when the terminal identifies the confidence level of at least one audio segment corresponding to the audio data, if the confidence level identification result indicates that the text information corresponding to the audio segment is not identified, the terminal may determine the audio segment as a noise segment and remove the noise segment.
Alternatively, for example, when the acoustic score corresponding to audio segment A is 0, the terminal may determine audio segment A to be a noise segment and remove it. When the heuristic-calculation recognition result corresponding to audio segment B is 0, the terminal may determine audio segment B to be a noise segment and remove it.
In some embodiments, when the terminal identifies the confidence level of at least one audio segment corresponding to the audio data, if the confidence level identification result indicates that the confidence level corresponding to the audio segment is less than the confidence level threshold, the audio segment is determined as a noise segment, and the noise segment is removed. That is, if the confidence recognition result indicates that the confidence corresponding to the audio segment is smaller than the confidence threshold, the terminal may determine the audio segment as a human voice noise segment and remove the human voice noise segment.
Alternatively, for example, when the acoustic score corresponding to audio segment A is 40 and the acoustic score threshold is 60, audio segment A is determined to be a human-voice noise segment and removed. When the heuristic-calculation recognition result corresponding to audio segment B is 20 and the corresponding threshold is 70, audio segment B is determined to be a human-voice noise segment and removed. When the log-likelihood ratio corresponding to audio segment C is 2 and the log-likelihood ratio threshold is 5, the terminal may determine audio segment C to be a human-voice noise segment and remove it.
In some embodiments, when the terminal identifies the confidence level of at least one audio segment corresponding to the audio data, if the confidence level identification result indicates that the confidence level corresponding to the audio segment is greater than or equal to the confidence level threshold, the audio segment is determined to be the target audio segment, and the target audio segment is added to the target audio segment set.
Optionally, for example, when the acoustic score corresponding to audio segment A is 60 and the acoustic score threshold is 60, audio segment A is determined to be a target audio segment and the terminal may add it to the target audio segment set. When the heuristic-calculation recognition result corresponding to audio segment B is 90 and the corresponding threshold is 70, audio segment B is determined to be a target audio segment and added to the target audio segment set. When the log-likelihood ratio corresponding to audio segment C is 10 and the log-likelihood ratio threshold is 5, audio segment C is determined to be a target audio segment and added to the target audio segment set.
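For illustration, the three rules above (no recognized text, confidence below the threshold, confidence at or above the threshold) can be combined as in the following sketch; recognize_with_confidence is a hypothetical helper returning the recognized text and its confidence for one segment.

```python
def filter_by_confidence(segments, threshold):
    """Apply the three rules above: segments with no recognized text or with
    confidence below the threshold are treated as noise and dropped; the rest
    form the target audio segment set. recognize_with_confidence is a
    hypothetical helper, not part of this disclosure."""
    target_set = []
    for seg in segments:
        text, confidence = recognize_with_confidence(seg)
        if not text:                  # no text recognized: non-speech noise segment
            continue
        if confidence < threshold:    # below threshold: human-voice noise segment
            continue
        target_set.append(seg)        # at or above threshold: target audio segment
    return target_set
```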
It is easy to understand that, after the terminal receives at least one audio segment corresponding to the audio data, the terminal may perform confidence recognition on the at least one audio segment corresponding to the audio data, and remove a noise segment in the at least one audio segment based on a confidence recognition result of the at least one audio segment corresponding to the audio data, so as to obtain the target audio segment set.
Alternatively, for example, when the terminal acquires one 10s audio piece, one 20s audio piece, and one 30s audio piece, the terminal may identify the acoustic scores of the three audio pieces, respectively. The acoustic score of the 10s audio segment may be 20, the acoustic score of the 20s audio segment may be 60, the acoustic score of the 30s audio segment may be 80, and the acoustic score threshold may be 50. According to the acoustic scores of the three audio segments, the terminal can remove the 10s audio segment, leave the 20s audio segment and the 30s audio segment as the target audio segment, and add the 20s audio segment and the 30s audio segment to the target audio segment set.
S403, acquiring a voiceprint feature corresponding to at least one target audio clip in the target audio clip set;
the specific process is as above, and is not described herein again.
According to some embodiments, the voiceprint features may be obtained, for example, by framing the audio segment according to a fixed frame length and frame shift, and extracting short-time voiceprint features in each frame of audio. The voiceprint feature may be obtained by, for example, calculating a fundamental frequency of a fixed speech frame, determining a frame length of a current frame according to a value of the fundamental frequency, framing the audio segment according to the frame length and the frame shift, and extracting a short-time voiceprint feature in each frame of the audio segment. The voiceprint features may include, for example, at least one Mel-frequency Cepstrum Coefficient (MFCC), lexical features, prosodic features, language information, and channel information.
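A minimal sketch of the fixed frame length / frame shift variant described above, using the librosa library for MFCC extraction; the library choice, the sample rate, and the averaging into one segment-level vector are assumptions for illustration, not prescribed by this disclosure.

```python
import librosa
import numpy as np

def extract_mfcc_voiceprint(wav_path, frame_length=0.025, frame_shift=0.010, n_mfcc=13):
    """Frame the segment with a fixed frame length / frame shift, extract
    short-time MFCCs per frame, and average them into a single segment-level
    vector (an illustrative simplification)."""
    y, sr = librosa.load(wav_path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(frame_length * sr),       # 25 ms analysis window
        hop_length=int(frame_shift * sr),   # 10 ms frame shift
    )                                       # shape: (n_mfcc, n_frames)
    return np.mean(mfcc, axis=1)            # one fixed-length voiceprint vector
```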
It is easy to understand that after the terminal acquires the target audio segment set, the voiceprint feature corresponding to at least one target audio segment in the target audio segment set can be acquired. For example, the terminal may obtain a voiceprint feature corresponding to each target audio clip in the target audio clip set, or the terminal may obtain a voiceprint feature corresponding to a part of target audio clips in the target audio clip set.
In some embodiments, the at least one target audio segment included in the target audio segment set may be, for example, target audio segments Q1-Q10, and the terminal acquires voiceprint feature q1 for the Q1, Q3, Q5, Q7, and Q9 target audio segments, and voiceprint feature q2 for the Q2, Q4, Q6, Q8, and Q10 target audio segments.
S404, clustering at least one target audio segment based on the voiceprint features to obtain a first clustering result and a second clustering result corresponding to the audio data, wherein different clustering results correspond to different voiceprint features;
the specific process is as above, and is not described herein again.
According to some embodiments, the first clustering result does not refer to one fixed clustering result; it may differ based on, for example, differences in voiceprint features. For example, when the at least one target audio segment is target audio segments Q1-Q10, and the terminal acquires voiceprint feature q1 for the Q1, Q3, Q5, Q7, and Q9 target audio segments and voiceprint feature q2 for the Q2, Q4, Q6, Q8, and Q10 target audio segments, the first clustering result may be the cluster of the Q1, Q3, Q5, Q7, and Q9 target audio segments, or it may be the cluster of the Q2, Q4, Q6, Q8, and Q10 target audio segments.
According to some embodiments, the second clustering result does not refer to one fixed clustering result either; it may also differ based on differences in voiceprint features, and it is different from the first clustering result. For example, when the first clustering result is the cluster of the Q2, Q4, Q6, Q8, and Q10 target audio segments, the second clustering result may be the cluster of the Q1, Q3, Q5, Q7, and Q9 target audio segments; when the first clustering result is the cluster of the Q1, Q3, Q5, Q7, and Q9 target audio segments, the second clustering result may be the cluster of the Q2, Q4, Q6, Q8, and Q10 target audio segments.
It is easy to understand that, after the terminal acquires the voiceprint feature corresponding to the at least one target audio segment, the terminal may perform clustering on the at least one target audio segment based on the voiceprint feature to obtain a first clustering result and a second clustering result corresponding to the audio data.
Optionally, for example, when the at least one target audio segment is target audio segments Q1-Q10, and the terminal acquires voiceprint feature q1 for the Q1, Q3, Q5, Q7, and Q9 target audio segments and voiceprint feature q2 for the Q2, Q4, Q6, Q8, and Q10 target audio segments, the clustering results acquired by the terminal may be, for example: the first clustering result is the cluster of the Q1, Q3, Q5, Q7, and Q9 target audio segments, and the second clustering result is the cluster of the Q2, Q4, Q6, Q8, and Q10 target audio segments.
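For illustration, clustering the segment-level voiceprint vectors into the two clusters of this example can be sketched with scikit-learn; the algorithm choice and the fixed number of two classes are assumptions for the two-speaker scenario, not a limitation of the method.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_two_speakers(voiceprint_vectors):
    """Cluster segment-level voiceprint vectors into two classes, matching the
    two-speaker (e.g. customer-service) scenario above. A sketch: the patent
    does not fix the clustering algorithm or the number of classes."""
    X = np.stack(voiceprint_vectors)
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
    first_cluster = [i for i, label in enumerate(labels) if label == 0]
    second_cluster = [i for i, label in enumerate(labels) if label == 1]
    return first_cluster, second_cluster   # indices of the Q1..Q10-style segments
```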
S405, performing voice recognition on at least one audio segment corresponding to the first clustering result to obtain a first text information set;
the specific process is as above, and is not described herein again.
According to some embodiments, speech recognition may use, for example, a stochastic model method, in which speech is recognized through the steps of feature extraction, template training, template classification, and template matching. Speech recognition may also use a neural network method, for example extracting features from the audio data, aligning text and audio with CTC (Connectionist Temporal Classification), processing the feature map with a Convolutional Neural Network (CNN), extracting the main features by max pooling, and training with a defined CTC loss function. Alternatively, a probabilistic grammar analysis method may be used, which applies knowledge of the corresponding level to each level of the problem.
In some embodiments, the first text information does not refer to a fixed text information. For example, the first text information may be different based on the difference of the first clustering results, may also be different based on the difference of the number of audio segments corresponding to the first clustering results, and may also be different based on the difference of the audio contents of the audio segments corresponding to the first clustering results.
Alternatively, for example, when the audio clip is a user audio clip, the text message may be "when to ship, size is not accurate", for example. When the audio piece is a customer service audio piece, the text message may be, for example, "now available, immediately shipped".
It is easy to understand that, after the terminal acquires the first clustering result, the terminal may perform speech recognition on at least one audio segment corresponding to the first clustering result, so as to obtain the first text information set.
Optionally, for example, when the first clustering result obtained by the terminal is clustering of the Q1 target audio segment, the Q3 target audio segment, the Q5 target audio segment, the Q7 target audio segment, and the Q9 target audio segment, speech recognition may be performed on the Q1 target audio segment, the Q3 target audio segment, the Q5 target audio segment, the Q7 target audio segment, and the Q9 target audio segment by using a neural network method, so as to obtain text information corresponding to the Q1 target audio segment, text information corresponding to the Q3 target audio segment, text information corresponding to the Q5 target audio segment, text information corresponding to the Q7 target audio segment, and text information corresponding to the Q9 target audio segment.
It is easy to understand that when the first clustering result obtained by the terminal is the clustering of the Q2 target audio segment, the Q4 target audio segment, the Q6 target audio segment, the Q8 target audio segment, and the Q10 target audio segment, speech recognition can be performed on the Q2 target audio segment, the Q4 target audio segment, the Q6 target audio segment, the Q8 target audio segment, and the Q10 target audio segment by using a random model method, so as to obtain text information corresponding to the Q2 target audio segment, text information corresponding to the Q4 target audio segment, text information corresponding to the Q6 target audio segment, text information corresponding to the Q8 target audio segment, and text information corresponding to the Q10 target audio segment.
S406, performing voice recognition on at least one audio segment corresponding to the second clustering result to obtain a second text information set;
the specific process is as above, and is not described herein again.
According to some embodiments, steps S405 and S406 may be performed simultaneously or sequentially. For example, the terminal may perform speech recognition on at least one audio segment corresponding to the first clustering result and at least one audio segment corresponding to the second clustering result at the same time, to obtain the first text information set and the second text information set. Alternatively, the terminal may first perform speech recognition on at least one audio segment corresponding to the first clustering result to obtain the first text information set, and then perform speech recognition on at least one audio segment corresponding to the second clustering result to obtain the second text information set. Or the terminal may first perform speech recognition on at least one audio segment corresponding to the second clustering result to obtain the second text information set, and then perform speech recognition on at least one audio segment corresponding to the first clustering result to obtain the first text information set.
In some embodiments, the second text information does not refer to a fixed text information. The second textual information may be different, for example, based on the audio segment. For example, when the audio clip is a user audio clip, the text message may be "when to ship, size is not accurate", for example. When the audio piece is a customer service audio piece, the text message may be, for example, "now available, immediately shipped".
It is easy to understand that, after the terminal obtains the second clustering result, the terminal can perform voice recognition on at least one audio segment corresponding to the second clustering result, so as to obtain a second text information set;
optionally, for example, when the second clustering result obtained by the terminal is clustering of the Q1 target audio segment, the Q3 target audio segment, the Q5 target audio segment, the Q7 target audio segment, and the Q9 target audio segment, speech recognition may be performed on the Q1 target audio segment, the Q3 target audio segment, the Q5 target audio segment, the Q7 target audio segment, and the Q9 target audio segment by using a neural network method, so as to obtain text information corresponding to the Q1 target audio segment, text information corresponding to the Q3 target audio segment, text information corresponding to the Q5 target audio segment, text information corresponding to the Q7 target audio segment, and text information corresponding to the Q9 target audio segment.
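For illustration, steps S405 and S406 can each be sketched as a loop over one clustering result; asr_recognize is a hypothetical stand-in for whichever recognizer (stochastic-model method, CNN+CTC, etc.) is used.

```python
def recognize_cluster(cluster_segments):
    """Build the text information set for one clustering result by running
    speech recognition over each of its audio segments (steps S405/S406).
    asr_recognize is a hypothetical helper, not part of this disclosure."""
    return [asr_recognize(segment) for segment in cluster_segments]

# first_text_set = recognize_cluster(first_cluster_segments)    # S405
# second_text_set = recognize_cluster(second_cluster_segments)  # S406
```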
S407, outputting a first text information set and a second text information set corresponding to the audio data.
The specific process is as above, and is not described herein again.
In some embodiments, the terminal performs speech recognition on at least one audio segment corresponding to the first clustering result to obtain a first text information set, and performs speech recognition on at least one audio segment corresponding to the second clustering result to obtain a second text information set, and the terminal may output the first text information set and the second text information set corresponding to the audio data.
According to some embodiments, when the terminal acquires the first text information set and the second text information set, it acquires the first piece of text information in the first text information set and the first piece of text information in the second text information set.
For example, when the first text information set is the user text information set, its first piece of text information may be "is the shop owner there". When the second text information set is the customer service text information set, its first piece of text information may be, for example, "hello, may I ask what you need help with".
In some embodiments, the terminal may take as the first piece of text information in a text information set, for example, the earliest text information recognized during speech recognition according to the time order of the audio segments. Alternatively, the terminal may determine the first piece of text information according to a vector of the text information.
In some embodiments, if the first piece of text information in the first text information set meets the customer service text information requirement, the first text information set is determined to be the customer service text information set and the second text information set is determined to be the user text information set. Fig. 5 is a scene schematic diagram for implementing the voiceprint separation method of the embodiment of the present disclosure. As shown in fig. 5, for example, if the first piece of text information in the first set is "hello, may I ask what you need help with" and the first piece of text information in the second set is "hello, do the shoes come in size 38", the terminal may determine that the first text information set is the customer service text information set and that the second text information set is the user text information set.
In some embodiments, if the first piece of text information in the second text information set meets the customer service text information requirement, the second text information set is determined to be the customer service text information set and the first text information set is determined to be the user text information set. For example, if the first piece of text information obtained from the second set is "hello, may I ask what you need help with", the second text information set is determined to be the customer service text information set and the first text information set is determined to be the user text information set.
It is easy to understand that the customer service text information requirement refers to a condition for judging whether a piece of text information is customer service text; it does not refer to one fixed requirement. The requirement may, for example, include text information that customer service agents use with high frequency, and when that customer service text information changes, the requirement may change accordingly.
Optionally, the customer service text information requirement may be, for example, whether the first piece of text information belongs to a customer service text information set, where the customer service text information may include: "hello, may I ask what you need help with", "we can ship immediately", "sorry, we are out of stock right now".
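For illustration, a minimal sketch of this membership-style requirement check; the phrase set is illustrative and would in practice be the configured collection of high-frequency customer service text.

```python
CUSTOMER_SERVICE_PHRASES = {
    "hello, may i ask what you need help with",
    "we can ship immediately",
    "sorry, we are out of stock right now",
}

def assign_sets_by_first_text(first_of_first_set, first_of_second_set):
    """Decide which text information set is the customer service set by
    checking which set's first piece of text meets the requirement.
    A sketch; the phrase set above is an illustrative assumption."""
    if first_of_first_set.lower() in CUSTOMER_SERVICE_PHRASES:
        return {"customer_service": "first", "user": "second"}
    if first_of_second_set.lower() in CUSTOMER_SERVICE_PHRASES:
        return {"customer_service": "second", "user": "first"}
    return {"customer_service": None, "user": None}   # undetermined
```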
It is easy to understand that when the terminal acquires the first text information set and the second text information set, the terminal may acquire the first piece of text information in the first text information set, or the first piece of text information in the second text information set. Through these first pieces of text information, the terminal can judge the types of the first text information set and the second text information set.
According to some embodiments, the voiceprint features corresponding to the first text information set are obtained; if the voiceprint features belong to the customer service voiceprint feature set, the first text information set is determined to be the customer service text information set and the second text information set is determined to be the user text information set; and if the voiceprint features do not belong to the customer service voiceprint feature set, the second text information set is determined to be the customer service text information set and the first text information set is determined to be the user text information set.
Optionally, for example, if the sound vector and the context vector in the voiceprint feature corresponding to the first text information set belong to the customer service voiceprint feature range, the terminal may determine that the first text information set is the customer service text information set, and determine that the second text information set is the user text information set. If the sound vector and the context vector in the voiceprint feature corresponding to the first text information set do not belong to the range of the voiceprint feature of the customer service, the terminal can determine that the second text information set is the customer service text information set and determine that the first text information set is the user text information set.
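For illustration, one possible way to test whether the voiceprint features of the first text information set fall inside the customer service voiceprint feature set is a cosine-similarity comparison against a stored customer service voiceprint; the similarity measure and threshold below are assumptions, not taken from this disclosure.

```python
import numpy as np

def belongs_to_customer_service(voiceprint, cs_centroid, threshold=0.7):
    """Check whether a voiceprint feature vector falls inside the customer
    service voiceprint feature set by cosine similarity against a stored
    customer service voiceprint centroid (an illustrative assumption)."""
    cos = np.dot(voiceprint, cs_centroid) / (
        np.linalg.norm(voiceprint) * np.linalg.norm(cs_centroid)
    )
    return cos >= threshold
```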
It is easy to understand that when the terminal acquires the first text information set and the second text information set, the terminal may determine the types of the first text information set and the second text information set by acquiring the voiceprint features corresponding to the first text information set.
It is easy to understand that when the terminal acquires the first text information set and the second text information set, the terminal may output the first text information set and the second text information set corresponding to the audio data, respectively. When the terminal acquires the two sets, it can judge their types by acquiring the first piece of text information in the first text information set, by acquiring the first piece of text information in the second text information set, or by the voiceprint features corresponding to the first text information set.
In one or more embodiments of the present disclosure, segmenting the audio data based on its silence detection result to obtain at least one audio segment removes the silent portions of the audio data, reduces voiceprint feature extraction on noise segments, and improves voiceprint separation efficiency. Secondly, removing the noise segments from the at least one audio segment based on the confidence recognition result and obtaining the target audio segment set improves the accuracy of obtaining target audio segments and the efficiency and accuracy of voiceprint separation. In addition, obtaining the voiceprint feature corresponding to at least one target audio segment in the target audio segment set and clustering the target audio segments based on those features to obtain the first clustering result and the second clustering result corresponding to the audio data means the noise segments need not be clustered and no extra computing resources are consumed; the influence of noise segments on the voiceprint separation result is reduced, the accuracy of obtaining target audio segments is improved, and the accuracy of voiceprint separation is improved. In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user personal information involved all comply with the relevant laws and regulations and do not violate public order and good customs.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Referring to fig. 6a, a schematic structural diagram of a first voiceprint separation apparatus for implementing the voiceprint separation method according to the embodiment of the disclosure is shown. The voiceprint separation apparatus 600 can be implemented as all or part of an apparatus by software, hardware, or a combination of both. The voiceprint separation apparatus 600 includes an audio acquisition unit 610, a voiceprint acquisition unit 620, and a voiceprint separation unit 630, wherein:
the audio obtaining unit 610 is configured to remove a noise segment in at least one audio segment based on a confidence recognition result of the at least one audio segment corresponding to the audio data, and obtain a target audio segment set;
a voiceprint obtaining unit 620, configured to obtain a voiceprint feature corresponding to at least one target audio clip in the target audio clip set;
the voiceprint separation unit 630 is configured to cluster at least one target audio segment based on the voiceprint feature, so as to obtain a voiceprint separation result corresponding to the audio data.
Optionally, the audio obtaining unit 610 is configured to remove a noise segment in at least one audio segment based on a confidence recognition result of the at least one audio segment corresponding to the audio data, and when the target audio segment set is obtained, specifically configured to:
if the confidence coefficient identification result indicates that the text information corresponding to the audio clip is not identified, determining the audio clip as a noise clip, and removing the noise clip;
if the confidence coefficient identification result indicates that the confidence coefficient corresponding to the audio clip is smaller than the confidence coefficient threshold value, determining the audio clip as a noise clip, and removing the noise clip;
and if the confidence coefficient identification result indicates that the confidence coefficient corresponding to the audio clip is greater than or equal to the confidence coefficient threshold value, determining the audio clip as the target audio clip, and adding the target audio clip into the target audio clip set.
Optionally, fig. 6b shows a schematic structural diagram of a second voiceprint separation apparatus for implementing the voiceprint separation method according to the embodiment of the disclosure, and as shown in fig. 6b, the voiceprint separation apparatus 600 further includes an audio slicing unit 640, where:
the audio slicing unit 640 is configured to segment the audio data based on a silence detection result of the audio data, and obtain at least one audio segment corresponding to the audio data.
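As a hedged illustration of the kind of silence detection the audio slicing unit 640 relies on, the sketch below splits a waveform at low-energy stretches. The frame length, energy threshold and minimum segment length are assumed values, and a production system could equally use a dedicated VAD model instead of this simple energy rule.

```python
from typing import List

import numpy as np


def split_on_silence(audio: np.ndarray, sample_rate: int,
                     frame_ms: int = 30, energy_threshold: float = 1e-4,
                     min_segment_ms: int = 300) -> List[np.ndarray]:
    """Split a mono waveform into speech segments wherever silent stretches occur.

    A frame counts as silent when its mean energy falls below energy_threshold;
    consecutive non-silent frames form one segment, and segments shorter than
    min_segment_ms are discarded.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    min_len = int(sample_rate * min_segment_ms / 1000)
    segments: List[np.ndarray] = []
    start = None

    for begin in range(0, len(audio), frame_len):
        frame = audio[begin:begin + frame_len].astype(np.float64)
        silent = float(np.mean(frame ** 2)) < energy_threshold
        if not silent and start is None:
            start = begin                        # speech begins
        elif silent and start is not None:
            if begin - start >= min_len:         # speech ends; keep if long enough
                segments.append(audio[start:begin])
            start = None

    if start is not None and len(audio) - start >= min_len:
        segments.append(audio[start:])           # trailing speech with no closing silence
    return segments
```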
Optionally, fig. 6c shows a schematic structural diagram of a third voiceprint separation apparatus for implementing the voiceprint separation method according to the embodiment of the disclosure, as shown in fig. 6c, the voiceprint separation unit 630 includes an audio clustering subunit 631, a speech recognition subunit 632, and an information output subunit 633, and the voiceprint separation unit 630 is configured to cluster at least one target audio segment based on a voiceprint feature, and when a voiceprint separation result corresponding to audio data is obtained:
the audio clustering subunit 631 is configured to cluster at least one target audio segment based on the voiceprint features to obtain a first clustering result and a second clustering result corresponding to the audio data, where different clustering results correspond to different voiceprint features;
a voice recognition subunit 632, configured to perform voice recognition on at least one audio segment corresponding to the first clustering result to obtain a first text information set;
the voice recognition subunit 632 is further configured to perform voice recognition on at least one audio segment corresponding to the second clustering result to obtain a second text information set;
the information output subunit 633 is configured to output the first text information set and the second text information set corresponding to the audio data.
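A minimal sketch of how the clustering, voice recognition and information output subunits could be combined is shown below. The asr callable, the dictionary keys and the assumption that cluster label 0 maps to the first clustering result are illustrative choices, not taken from the disclosure.

```python
from typing import Callable, Dict, List, Sequence


def build_text_information_sets(
    target_segments: Sequence,        # target audio segments kept after noise removal
    cluster_labels: Sequence[int],    # 0 or 1 per segment, from the voiceprint clustering
    asr: Callable[[object], str],     # speech recognizer for a single segment
) -> Dict[str, List[str]]:
    """Run speech recognition per cluster and return the two text information sets."""
    first_set: List[str] = []
    second_set: List[str] = []
    for segment, label in zip(target_segments, cluster_labels):
        text = asr(segment)
        # Label 0 is assumed to correspond to the first clustering result.
        (first_set if label == 0 else second_set).append(text)
    return {"first_text_information_set": first_set,
            "second_text_information_set": second_set}
```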
Optionally, fig. 6d shows a schematic structural diagram of a fourth voiceprint separation apparatus for implementing the voiceprint separation method according to the embodiment of the present disclosure. As shown in fig. 6d, the voiceprint separation apparatus 600 further includes a first set determining unit 650, configured to obtain the first text information in the first text information set and obtain the first second text information in the second text information set;
if the first text information meets the customer service text information requirement, determining that the first text information set is a customer service text information set, and determining that the second text information set is a user text information set;
and if the first second text information meets the customer service text information requirement, determining that the second text information set is a customer service text information set, and determining that the first text information set is a user text information set.
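The disclosure does not define what the "customer service text information requirement" is; as one hedged assumption, the sketch below treats it as a check of each set's first utterance against a few typical customer-service openers and assigns the two text information sets accordingly. All opener strings and the fallback behavior are illustrative.

```python
from typing import List, Tuple

# Illustrative openers only; the actual "customer service text information
# requirement" is not specified in the disclosure.
CUSTOMER_SERVICE_OPENERS = ("hello, how can i help", "thank you for calling",
                            "this is customer service")


def meets_customer_service_requirement(text: str) -> bool:
    lowered = text.lower()
    return any(opener in lowered for opener in CUSTOMER_SERVICE_OPENERS)


def assign_roles_by_text(first_set: List[str],
                         second_set: List[str]) -> Tuple[List[str], List[str]]:
    """Return (customer_service_set, user_set) based on each set's first utterance."""
    if first_set and meets_customer_service_requirement(first_set[0]):
        return first_set, second_set
    if second_set and meets_customer_service_requirement(second_set[0]):
        return second_set, first_set
    # Fallback when neither opener matches (an assumption; the disclosure is silent here).
    return first_set, second_set
```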
Optionally, fig. 6e shows a schematic structural diagram of a fifth voiceprint separation apparatus for implementing the voiceprint separation method according to the embodiment of the present disclosure, and as shown in fig. 6e, the voiceprint separation apparatus 600 further includes a second set determining unit 660, configured to obtain a voiceprint feature corresponding to the first text information set;
if the voiceprint features belong to the customer service voiceprint feature set, determining that the first text information set is the customer service text information set, and determining that the second text information set is the user text information set;
and if the voiceprint features do not belong to the customer service voiceprint feature set, determining that the second text information set is the customer service text information set, and determining that the first text information set is the user text information set.
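Similarly, "belonging to the customer service voiceprint feature set" could be tested in many ways; the sketch below assumes a set of enrolled customer-service voiceprint vectors and a cosine-similarity threshold of 0.75, both of which are illustrative assumptions rather than details from the disclosure.

```python
import numpy as np


def belongs_to_customer_service(voiceprint: np.ndarray,
                                service_voiceprints: np.ndarray,
                                threshold: float = 0.75) -> bool:
    """Cosine-similarity membership test against enrolled customer-service voiceprints."""
    v = voiceprint / (np.linalg.norm(voiceprint) + 1e-12)
    refs = service_voiceprints / (
        np.linalg.norm(service_voiceprints, axis=1, keepdims=True) + 1e-12)
    return bool(np.max(refs @ v) >= threshold)


def assign_roles_by_voiceprint(first_set, second_set,
                               first_set_voiceprint: np.ndarray,
                               service_voiceprints: np.ndarray):
    """Return (customer_service_set, user_set) using the first set's voiceprint feature."""
    if belongs_to_customer_service(first_set_voiceprint, service_voiceprints):
        return first_set, second_set
    return second_set, first_set
```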
It should be noted that, when the voiceprint separation apparatus provided in the foregoing embodiments executes the voiceprint separation method, the division into the above functional modules is only used as an example; in practical applications, the above functions may be distributed to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the voiceprint separation apparatus and the voiceprint separation method provided in the above embodiments belong to the same concept; for details of the implementation process, refer to the method embodiments, which are not described herein again.
The above-mentioned serial numbers of the embodiments of the present disclosure are merely for description and do not represent the merits of the embodiments.
In one or more embodiments of the present disclosure, the audio obtaining unit may remove noise segments in at least one audio segment based on the confidence recognition result of the at least one audio segment corresponding to the audio data to obtain a target audio segment set; the voiceprint obtaining unit may obtain a voiceprint feature corresponding to at least one target audio segment in the target audio segment set; and the voiceprint separation unit may cluster the at least one target audio segment based on the voiceprint features to obtain a voiceprint separation result corresponding to the audio data. Therefore, through the flow of separating first and then recognizing, noise segments in the audio data can be removed without consuming extra computing resources, the accuracy of obtaining effective audio segments can be improved, and the accuracy of voiceprint separation is improved.
In the technical solution of the present disclosure, the acquisition, storage, use and other processing of the personal information of the related user all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement the voiceprint separation method of an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 includes a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. The RAM 703 may also store various programs and data required for the operation of the device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
A number of components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above, such as the voiceprint separation method. For example, in some embodiments, the voiceprint separation method can be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the voiceprint separation method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the voiceprint separation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system and overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
According to an embodiment of the present disclosure, the present disclosure further provides a computer storage medium. The computer storage medium may store a plurality of instructions, and the instructions are adapted to be loaded by a processor to execute the voiceprint separation method according to the embodiments shown in fig. 3 to 4; for the specific execution process, reference may be made to the specific descriptions of the embodiments shown in fig. 3 to 4, which are not repeated here. The computer-readable storage medium may include, but is not limited to, any type of disk, including floppy disks, optical disks, DVDs, CD-ROMs, microdrives, and magneto-optical disks, as well as ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
According to an embodiment of the present disclosure, the present disclosure further provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium; the computer program includes at least one instruction, and the at least one instruction is loaded by a processor to execute the voiceprint separation method according to the embodiments shown in fig. 3 to 4; for the specific execution process, reference may be made to the specific descriptions of the embodiments shown in fig. 3 to 4, which are not repeated here.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A voiceprint separation method comprising:
removing noise segments in at least one audio segment based on a confidence recognition result of the at least one audio segment corresponding to the audio data, and acquiring a target audio segment set;
acquiring a voiceprint characteristic corresponding to at least one target audio clip in the target audio clip set;
and clustering the at least one target audio segment based on the voiceprint characteristics to obtain a voiceprint separation result corresponding to the audio data.
2. The method according to claim 1, wherein the removing noise segments from at least one audio segment based on the confidence recognition result of the at least one audio segment corresponding to the audio data to obtain the target audio segment set comprises:
if the confidence recognition result indicates that no text information corresponding to the audio clip is recognized, determining the audio clip as a noise clip, and removing the noise clip;
if the confidence recognition result indicates that the confidence corresponding to the audio clip is less than a confidence threshold, determining the audio clip as a noise clip, and removing the noise clip;
if the confidence recognition result indicates that the confidence corresponding to the audio clip is greater than or equal to the confidence threshold, determining the audio clip as a target audio clip, and adding the target audio clip to a target audio clip set.
3. The method of claim 1, further comprising:
and segmenting the audio data based on the silence detection result of the audio data to obtain at least one audio segment corresponding to the audio data.
4. The method of claim 1, wherein the clustering the at least one target audio segment based on the voiceprint features to obtain a voiceprint separation result corresponding to the audio data comprises:
clustering the at least one target audio segment based on the voiceprint features to obtain a first clustering result and a second clustering result corresponding to the audio data, wherein different clustering results correspond to different voiceprint features;
performing voice recognition on at least one audio segment corresponding to the first clustering result to obtain a first text information set;
performing voice recognition on at least one audio segment corresponding to the second clustering result to obtain a second text information set;
and outputting the first text information set and the second text information set corresponding to the audio data.
5. The method of claim 4, further comprising:
acquiring first text information in the first text information set;
acquiring first second text information in the second text information set;
if the first text information meets the customer service text information requirement, determining that the first text information set is a customer service text information set, and determining that the second text information set is a user text information set;
and if the first second text information meets the customer service text information requirement, determining that the second text information set is the customer service text information set, and determining that the first text information set is the user text information set.
6. The method of claim 4, further comprising:
acquiring voiceprint characteristics corresponding to the first text information set;
if the voiceprint features belong to a customer service voiceprint feature set, determining that the first text information set is a customer service text information set, and determining that the second text information set is a user text information set;
and if the voiceprint features do not belong to a customer service voiceprint feature set, determining that the second text information set is the customer service text information set, and determining that the first text information set is the user text information set.
7. A voiceprint separation apparatus comprising:
the audio acquisition unit is used for removing noise fragments in at least one audio fragment based on the confidence coefficient identification result of the at least one audio fragment corresponding to the audio data to acquire a target audio fragment set;
the voiceprint acquisition unit is used for acquiring a voiceprint characteristic corresponding to at least one target audio clip in the target audio clip set;
and the voiceprint separation unit is used for clustering the at least one target audio clip based on the voiceprint characteristics to obtain a voiceprint separation result corresponding to the audio data.
8. The apparatus according to claim 7, wherein the audio obtaining unit is configured to remove a noise segment from at least one audio segment corresponding to the audio data based on a confidence level recognition result of the at least one audio segment, and when the target audio segment set is obtained, specifically configured to:
if the confidence recognition result indicates that no text information corresponding to the audio clip is recognized, determining the audio clip as a noise clip, and removing the noise clip;
if the confidence recognition result indicates that the confidence corresponding to the audio clip is less than a confidence threshold, determining the audio clip as a noise clip, and removing the noise clip;
if the confidence recognition result indicates that the confidence corresponding to the audio clip is greater than or equal to the confidence threshold, determining the audio clip as a target audio clip, and adding the target audio clip to a target audio clip set.
9. The apparatus according to claim 7, further comprising an audio slicing unit, configured to slice audio data based on a silence detection result of the audio data, and obtain at least one audio segment corresponding to the audio data.
10. The apparatus according to claim 7, wherein the voiceprint separation unit includes an audio clustering subunit, a speech recognition subunit, and an information output subunit, and is configured to, based on the voiceprint feature, cluster the at least one target audio segment, and when a voiceprint separation result corresponding to the audio data is obtained:
the audio clustering subunit is configured to cluster the at least one target audio segment based on the voiceprint features to obtain a first clustering result and a second clustering result corresponding to the audio data, where different clustering results correspond to different voiceprint features;
the voice recognition subunit is configured to perform voice recognition on at least one audio segment corresponding to the first clustering result to obtain a first text information set;
the voice recognition subunit is further configured to perform voice recognition on at least one audio segment corresponding to the second clustering result to obtain a second text information set;
the information output subunit is configured to output the first text information set and the second text information set corresponding to the audio data.
11. The apparatus according to claim 10, wherein the apparatus further comprises a first set determining unit, configured to obtain a first literal information in the first set of literal information;
acquiring first second text information in the second text information set;
if the first text information meets the customer service text information requirement, determining that the first text information set is a customer service text information set, and determining that the second text information set is a user text information set;
and if the first second text information meets the customer service text information requirement, determining that the second text information set is the customer service text information set, and determining that the first text information set is the user text information set.
12. The apparatus according to claim 10, wherein the apparatus further comprises a second set determining unit, configured to obtain a voiceprint feature corresponding to the first text information set;
if the voiceprint features belong to a customer service voiceprint feature set, determining that the first text information set is a customer service text information set, and determining that the second text information set is a user text information set;
and if the voiceprint features do not belong to a customer service voiceprint feature set, determining that the second text information set is the customer service text information set, and determining that the first text information set is the user text information set.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
CN202111448940.3A 2021-11-29 2021-11-29 Voiceprint separation method and device, electronic equipment and storage medium Pending CN114299957A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111448940.3A CN114299957A (en) 2021-11-29 2021-11-29 Voiceprint separation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111448940.3A CN114299957A (en) 2021-11-29 2021-11-29 Voiceprint separation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114299957A true CN114299957A (en) 2022-04-08

Family

ID=80966299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111448940.3A Pending CN114299957A (en) 2021-11-29 2021-11-29 Voiceprint separation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114299957A (en)

Patent Citations (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120221330A1 (en) * 2011-02-25 2012-08-30 Microsoft Corporation Leveraging speech recognizer feedback for voice activity detection
US20140278394A1 (en) * 2013-03-12 2014-09-18 Motorola Mobility Llc Apparatus and Method for Beamforming to Obtain Voice and Noise Signals
CN105529028A (en) * 2015-12-09 2016-04-27 百度在线网络技术(北京)有限公司 Voice analytical method and apparatus
US20180293221A1 (en) * 2017-02-14 2018-10-11 Microsoft Technology Licensing, Llc Speech parsing with intelligent assistant
CN108428446A (en) * 2018-03-06 2018-08-21 北京百度网讯科技有限公司 Audio recognition method and device
US20190385636A1 (en) * 2018-06-13 2019-12-19 Baidu Online Network Technology (Beijing) Co., Ltd. Voice activity detection method and apparatus
US10867617B1 (en) * 2018-12-10 2020-12-15 Amazon Technologies, Inc. Techniques for processing audio data
WO2020192890A1 (en) * 2019-03-25 2020-10-01 Omilia Natural Language Solutions Ltd. Systems and methods for speaker verification
CN110136727A (en) * 2019-04-16 2019-08-16 平安科技(深圳)有限公司 Speaker's personal identification method, device and storage medium based on speech content
CN110309216A (en) * 2019-05-10 2019-10-08 焦点科技股份有限公司 A kind of customer service voices quality detecting method based on text classification
US20210074309A1 (en) * 2019-09-06 2021-03-11 Apple Inc. Automatic speech recognition triggering system
CN111312256A (en) * 2019-10-31 2020-06-19 平安科技(深圳)有限公司 Voice identity recognition method and device and computer equipment
CN111125337A (en) * 2019-12-31 2020-05-08 慧择保险经纪有限公司 Text type determination method and device
CN111429935A (en) * 2020-02-28 2020-07-17 北京捷通华声科技股份有限公司 Voice speaker separation method and device
US20220254350A1 (en) * 2020-04-28 2022-08-11 Ping An Technology (Shenzhen) Co., Ltd. Method, apparatus and device for voiceprint recognition of original speech, and storage medium
CN112233680A (en) * 2020-09-27 2021-01-15 科大讯飞股份有限公司 Speaker role identification method and device, electronic equipment and storage medium
CN112435684A (en) * 2020-11-03 2021-03-02 中电金信软件有限公司 Voice separation method and device, computer equipment and storage medium
CN112562682A (en) * 2020-12-02 2021-03-26 携程计算机技术(上海)有限公司 Identity recognition method, system, equipment and storage medium based on multi-person call
CN112735432A (en) * 2020-12-24 2021-04-30 北京百度网讯科技有限公司 Audio recognition method and device, electronic equipment and storage medium
CN112289323A (en) * 2020-12-29 2021-01-29 深圳追一科技有限公司 Voice data processing method and device, computer equipment and storage medium
CN112966082A (en) * 2021-03-05 2021-06-15 北京百度网讯科技有限公司 Audio quality inspection method, device, equipment and storage medium
CN112634908A (en) * 2021-03-09 2021-04-09 北京世纪好未来教育科技有限公司 Voice recognition method, device, equipment and storage medium
CN113055537A (en) * 2021-04-13 2021-06-29 上海东普信息科技有限公司 Voice quality inspection method, device, equipment and storage medium for customer service personnel
CN113362831A (en) * 2021-07-12 2021-09-07 科大讯飞股份有限公司 Speaker separation method and related equipment thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
于擎: "A Brief Analysis of the Application of Voiceprint Case Linking in Solving Contactless Online Fraud Cases" *
王英利, 岳俊发: "On Contextual Issues in Voiceprint Identification" *

Similar Documents

Publication Publication Date Title
US10825470B2 (en) Method and apparatus for detecting starting point and finishing point of speech, computer device and storage medium
CN111128223B (en) Text information-based auxiliary speaker separation method and related device
CN108682420B (en) Audio and video call dialect recognition method and terminal equipment
WO2017084334A1 (en) Language recognition method, apparatus and device and computer storage medium
CN113327609B (en) Method and apparatus for speech recognition
CN111797632B (en) Information processing method and device and electronic equipment
US11393458B2 (en) Method and apparatus for speech recognition
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
CN113380238A (en) Method for processing audio signal, model training method, apparatus, device and medium
CN112614514A (en) Valid voice segment detection method, related device and readable storage medium
CN112201275A (en) Voiceprint segmentation method, voiceprint segmentation device, voiceprint segmentation equipment and readable storage medium
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN114267342A (en) Recognition model training method, recognition method, electronic device and storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN111414748A (en) Traffic data processing method and device
CN115831125A (en) Speech recognition method, device, equipment, storage medium and product
CN113053390B (en) Text processing method and device based on voice recognition, electronic equipment and medium
CN111970311B (en) Session segmentation method, electronic device and computer readable medium
CN114299957A (en) Voiceprint separation method and device, electronic equipment and storage medium
CN114242064A (en) Speech recognition method and device, and training method and device of speech recognition model
CN114203180A (en) Conference summary generation method and device, electronic equipment and storage medium
CN114038487A (en) Audio extraction method, device, equipment and readable storage medium
CN113051426A (en) Audio information classification method and device, electronic equipment and storage medium
CN111582708A (en) Medical information detection method, system, electronic device and computer-readable storage medium
CN112735432A (en) Audio recognition method and device, electronic equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220408)