CN110718228A - Voice separation method and device, electronic equipment and computer readable storage medium
- Publication number: CN110718228A
- Application number: CN201911006481.6A
- Authority: CN (China)
- Prior art keywords: audio, target call, call audio, separation, features
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications (G10L — Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding)
- G10L 17/02 — Speaker identification or verification: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L 17/04 — Speaker identification or verification: training, enrolment or model building
- G10L 17/06, G10L 17/14 — Decision making techniques; pattern matching strategies; use of phonemic categorisation or speech recognition prior to speaker recognition or verification
- G10L 17/18 — Artificial neural networks; connectionist approaches
- G10L 17/22 — Interactive procedures; man-machine interfaces
- G10L 21/02, G10L 21/0272 — Speech enhancement: voice signal separating
Abstract
The application provides a voice separation method and device, an electronic device, and a computer-readable storage medium, applied to the technical field of voice processing. The method extracts three-dimensional features from the spectrogram corresponding to a target call audio, making full use of speaker information and improving the accuracy of voice segmentation. In addition, the segmentation labels of the target call audio are output automatically by a pre-trained recurrent neural network, so change points are judged more accurately, which further improves segmentation accuracy; and because no additional clustering algorithm is required, voice separation can be performed end to end.
Description
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech separation method, apparatus, electronic device, and computer-readable storage medium.
Background
With the development of multimedia technology, the number of audio files keeps growing rapidly. How to effectively manage and exploit this massive amount of audio information, that is, how to turn audio data from 'unordered' into 'ordered', is an important problem, and speaker segmentation is a key technology for solving it.
At present, voice segmentation of audio is achieved by change-point detection followed by clustering: a speaker change-point detection method first splits the speech into many short segments, and bottom-up clustering then merges segments from the same speaker. However, without prior knowledge of the speakers in the speech, speaker information is not fully exploited during segmentation and clustering, which affects the accuracy of audio segmentation. The existing change-point-detection-plus-clustering approach therefore suffers from insufficient use of speaker information and low voice segmentation accuracy.
Disclosure of Invention
The application provides a voice separation method, a voice separation apparatus, an electronic device and a computer-readable storage medium, which are used to improve the efficiency of credit card application verification, reduce the fraud risk of newly submitted applications and uncover fraudulent applications among existing ones, and adopts the following technical solutions:
in a first aspect, a speech separation method is provided, the method comprising,
acquiring a spectrogram corresponding to the target call audio;
extracting three-dimensional features of a spectrogram based on a pre-trained convolutional neural network, wherein the three-dimensional features comprise time dimension features, frequency dimension features and channel dimension features;
performing average pooling on the three-dimensional features in a frequency dimension to obtain pooled features;
inputting the pooled features into a pre-trained recurrent neural network to obtain a segmentation label on a time dimension;
and performing voice separation on the target call audio based on the segmentation label on the time dimension.
Optionally, the obtaining of the spectrogram corresponding to the target call audio includes:
acquiring a target call audio;
carrying out standardization processing on the target call audio to obtain a digital signal corresponding to the target call audio;
and obtaining a spectrogram corresponding to the target call audio from the digital signal through a sliding-window method and FFT (fast Fourier transform).
Optionally, performing voice separation on the target call audio based on the segmentation tag in the time dimension, including:
dividing the target call audio into a plurality of audio segments based on the division labels in the time dimension;
and splicing the audio segments corresponding to the segmentation labels in the same time dimension to obtain the audio after the target call audio is separated.
Optionally, the method further comprises:
extracting the voiceprint characteristics of the audio segment corresponding to any segmentation label;
calculating the similarity between the voiceprint feature of the audio clip corresponding to any one segmentation label and the voiceprint feature of at least one pre-stored customer service;
and determining that the audio of the target call to be identified after the audio separation is the audio of the target object or the audio of the customer service based on the first similarity calculation result.
Optionally, the method further comprises:
extracting voiceprint characteristics of the audio after the target call audio is separated;
calculating the similarity between the voiceprint characteristics of the audio subjected to the separation processing of the target call audio and the voiceprint characteristics of at least one pre-stored customer service;
and determining that the audio of the target call to be identified after the audio separation is the audio of the target object or the audio of the customer service based on the second similarity calculation result.
Optionally, the target call audio is a call audio of the credit card applicant with the customer service.
Optionally, the method further comprises: carrying out voice separation on the target call audio to obtain the audio of the credit card applicant;
fraud identification is based on the credit card applicant's audio.
In a second aspect, there is provided a speech separation apparatus, comprising,
the first acquisition module is used for acquiring a spectrogram corresponding to the target call audio;
the extraction module is used for extracting three-dimensional features of the spectrogram based on the pre-trained convolutional neural network, wherein the three-dimensional features comprise time dimension features, frequency dimension features and channel dimension features;
the pooling processing module is used for carrying out average pooling processing on the three-dimensional characteristics in a frequency dimension to obtain pooled characteristics;
the segmentation label module is used for inputting the pooled features into a pre-trained recurrent neural network to obtain segmentation labels in a time dimension;
and the voice separation module is used for carrying out voice separation on the target call audio based on the segmentation label on the time dimension.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring a target call audio;
the processing module is used for carrying out standardization processing on the target call audio to obtain a digital signal corresponding to the target call audio;
and the transformation module is used for obtaining a spectrogram corresponding to the target call audio through a sliding window method and FFT transformation based on the digital signal.
Optionally, the voice separation module comprises:
a dividing unit, configured to divide the target call audio into a plurality of audio segments based on the division labels in the time dimension;
and the splicing unit is used for splicing the audio segments corresponding to the segmentation labels in the same time dimension to obtain the audio frequency after the target call audio frequency is separated.
Optionally, the apparatus further comprises:
the first extraction module is used for extracting the voiceprint characteristics of the audio segment corresponding to any segmentation label;
the first calculation module is used for calculating the similarity between the voiceprint feature of the audio clip corresponding to any one of the segmentation labels and the voiceprint feature of at least one pre-stored customer service;
and the first determining module is used for determining that the audio after the target call audio separation to be identified is the audio of the target object or the audio of the customer service based on the first similarity calculation result.
Optionally, the apparatus further comprises:
the second extraction module is used for extracting the voiceprint characteristics of the audio after the target call audio is subjected to separation processing;
the second calculation module is used for calculating the similarity between the voiceprint characteristics of the audio obtained by separating the target call audio and the voiceprint characteristics of at least one pre-stored customer service;
and the second determining module is used for determining that the audio after the target call audio separation to be identified is the audio of the target object or the audio of the customer service based on the second similarity calculation result.
Optionally, the target call audio is a call audio of the credit card applicant with the customer service.
Optionally, the apparatus further comprises:
the separation module is used for carrying out voice separation on the target call audio to obtain the audio of the credit card applicant;
an identification module for fraud identification based on the credit card applicant's audio.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the speech separation method shown in the first aspect.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions that, when executed on a computer, cause the computer to perform the speech separation method shown in the first aspect.
Compared with the prior art, in which voice separation is performed by change-point detection and clustering, the voice separation method of this application obtains the spectrogram corresponding to the target call audio; extracts three-dimensional features of the spectrogram based on a pre-trained convolutional neural network, the three-dimensional features including time-dimension, frequency-dimension and channel-dimension features; performs average pooling on the three-dimensional features in the frequency dimension to obtain pooled features; inputs the pooled features into a pre-trained recurrent neural network to obtain segmentation labels in the time dimension; and performs voice separation on the target call audio based on the segmentation labels in the time dimension. By extracting the three-dimensional features of the spectrogram corresponding to the target call audio, the application makes full use of speaker information and can improve the accuracy of voice segmentation. In addition, because the segmentation labels of the target call audio are output automatically by the pre-trained recurrent neural network, change points are judged more accurately, which further improves segmentation accuracy; and since no additional clustering algorithm is needed, voice separation can be performed end to end.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flow chart of a speech separation method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech separation apparatus according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of another speech separation apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
An embodiment of the present application provides a speech separation method, as shown in fig. 1, the method may include the following steps:
step S101, obtaining a spectrogram corresponding to target call audio;
specifically, the spectrogram corresponding to the target call audio is obtained through a corresponding data acquisition method. The target call audio may be a call audio of two or more people; specifically, it may be the call audio between a credit card applicant and the customer service.
Step S102, extracting three-dimensional features of a spectrogram based on a pre-trained convolutional neural network, wherein the three-dimensional features comprise time dimension features, frequency dimension features and channel dimension features;
specifically, the three-dimensional features of the spectrogram are extracted by a pre-trained convolutional neural network (such as ResNet-18 or VGG-16): the two-dimensional spectrogram (f, t) is input into the convolutional neural network to obtain a three-dimensional feature map (f, t, c), where f is the frequency dimension, t is the time dimension and c is the channel dimension; the channel dimension is determined by the number of convolution kernels of the convolutional layer.
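As an informal sketch of this step (not part of the original disclosure), the shape transformation from a two-dimensional spectrogram to a three-dimensional feature map can be illustrated in Python with PyTorch; the layer stack, kernel counts and input sizes below are illustrative assumptions, not the pre-trained ResNet-18/VGG-16 network referred to above:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pre-trained CNN trunk; any stack of convolutional
# layers maps a (1, f, t) spectrogram to a (c, f, t) feature map, where the channel
# dimension c equals the number of convolution kernels in the last layer.
feature_extractor = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(6, 6, kernel_size=3, padding=1),
    nn.ReLU(),
)

spectrogram = torch.randn(1, 1, 28, 100)    # (batch, 1, f, t): 28 frequency bins, 100 time frames
features = feature_extractor(spectrogram)   # (batch, c, f, t) = (1, 6, 28, 100)
print(features.shape)
```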
Step S103, carrying out average pooling on the three-dimensional features in frequency dimension to obtain pooled features;
specifically, the three-dimensional features are averaged over the frequency dimension to obtain the pooled features, so that the three-dimensional feature map is converted into a two-dimensional feature map: the length of the time dimension is preserved, while the size of the other dimension is determined by the number of convolution kernels of the feature map output by the convolutional network, so call recordings of different lengths can be processed.
For example, with channel = 6, f = 28 and t = 28, converting the three-dimensional feature map into a two-dimensional feature map may proceed as follows: mean pooling is performed along the frequency direction f, that is, at each time point the 28 values along the frequency direction are averaged, yielding a one-dimensional vector of 28 elements; the one-dimensional vectors obtained from the 6 channels are then stacked together to obtain the final two-dimensional features (6 rows and 28 columns).
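A minimal sketch of this conversion for the example above (6 channels, f = 28, t = 28), assuming PyTorch tensors; averaging over the frequency axis collapses (c, f, t) into a (c, t) map whose time length is preserved:

```python
import torch

feature_map = torch.randn(6, 28, 28)   # (c, f, t): 6 channels, 28 frequency bins, 28 time steps
pooled = feature_map.mean(dim=1)       # mean pooling along the frequency direction
print(pooled.shape)                    # torch.Size([6, 28]): 6 rows, 28 columns (time preserved)
```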
Step S104, inputting the pooled features into a pre-trained recurrent neural network to obtain a segmentation label on a time dimension;
specifically, the pooled features are input into a pre-trained recurrent neural network to obtain segmentation labels in the time dimension. A recurrent neural network can perform sequence analysis on the features extracted in the time dimension and, by correlating the input over a period of time before and after each point, effectively and accurately output the speaker label at the corresponding time point, so that speaker voice separation can be achieved through these labels. The recurrent neural network may be an LSTM (Long Short-Term Memory) network, or any other recurrent neural network capable of implementing the present application; the application is not limited herein.
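A minimal sketch of this step, assuming PyTorch; a bidirectional LSTM followed by a per-frame classifier emits one speaker label per time step. The layer sizes, the choice of a bidirectional LSTM and the two-speaker output are illustrative assumptions, not the trained network of this application:

```python
import torch
import torch.nn as nn

class SegmentLabeler(nn.Module):
    """Illustrative recurrent labeler: one speaker label per time step."""
    def __init__(self, in_dim=6, hidden=64, num_speakers=2):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_speakers)

    def forward(self, pooled):                 # pooled: (batch, t, c), e.g. the (6, 28) map transposed
        hidden_states, _ = self.lstm(pooled)   # each frame is correlated with its surrounding context
        logits = self.classifier(hidden_states)
        return logits.argmax(dim=-1)           # (batch, t): segmentation label per time point

labels = SegmentLabeler()(torch.randn(1, 28, 6))
print(labels.shape)                            # torch.Size([1, 28])
```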
And step S105, carrying out voice separation on the target call audio based on the segmentation label on the time dimension.
Specifically, the target call audio is subjected to voice separation based on the segmentation label in the time dimension, and the target call audio may be subjected to voice separation based on the mapping relationship between the segmentation label and the corresponding audio segment.
Compared with the prior art, in which voice separation is performed by change-point detection and clustering, the method obtains the spectrogram corresponding to the target call audio, extracts three-dimensional features of the spectrogram based on a pre-trained convolutional neural network (the three-dimensional features including time-dimension, frequency-dimension and channel-dimension features), performs average pooling on the three-dimensional features in the frequency dimension to obtain pooled features, inputs the pooled features into a pre-trained recurrent neural network to obtain segmentation labels in the time dimension, and performs voice separation on the target call audio based on those segmentation labels. The method extracts the three-dimensional features of the spectrogram corresponding to the target call audio, makes full use of speaker information, and can improve the accuracy of voice segmentation; in addition, the segmentation labels of the target call audio are output automatically by the pre-trained recurrent neural network, so change points are judged more accurately, which further improves segmentation accuracy, and since no additional clustering algorithm is needed, end-to-end voice separation can be realized.
The embodiment of the present application provides a possible implementation manner, and step S101 includes:
step S106 (not shown in the figure), acquiring a target call audio;
specifically, the target call audio is acquired by an audio acquisition device of a terminal device such as a mobile phone.
Step S107 (not shown in the figure), standardizing the target call audio to obtain a digital signal corresponding to the target call audio;
specifically, the sound signal is converted into an electronic signal by a mobile phone or telephone microphone, sampled at a sampling rate of 16 kHz or higher, and output as a sequence of 16-bit quantized digital samples. Because recording settings differ, the output voice signal may be a dual-channel signal, that is, it contains two similar digital sequences; in that case, one of the two sequences is taken as the signal for speaker separation. When the sampling rate is not 16 kHz, the signal is converted to 16 kHz by resampling or interpolation, and when the quantized signal is not a 16-bit integer, it is converted to 16-bit quantization by linear mapping.
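A sketch of the normalization described above, assuming NumPy and SciPy; the choice of libraries and the peak-based linear mapping into the 16-bit range are illustrative assumptions:

```python
import numpy as np
from scipy.signal import resample_poly

def normalize_call_audio(samples, sample_rate, target_rate=16000):
    """Illustrative normalization: single channel, 16 kHz, 16-bit integer range."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.ndim == 2:                        # dual-channel recording: keep one channel
        samples = samples[:, 0]
    if sample_rate != target_rate:               # resample to 16 kHz if necessary
        samples = resample_poly(samples, target_rate, sample_rate)
    peak = np.max(np.abs(samples)) or 1.0        # linear mapping into the 16-bit integer range
    return np.int16(samples / peak * 32767)
```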
Step S108 (not shown in the figure), a spectrogram corresponding to the target call audio is obtained through a sliding window method and FFT transformation based on the digital signal.
Specifically, the normalized sound becomes a sequence (array) of digital samples. At regular intervals (for example, every 0.025 seconds), a block of data is taken from the array through a sliding window (for example, 512 samples, corresponding to 512/16000 = 0.032 seconds) and an FFT is computed. Before the FFT, the signal segment to be transformed may be windowed (that is, multiplied by a window function to emphasize the middle of the segment and reduce the edge effects of the FFT). The absolute values of the complex signal obtained from the FFT at each time interval are arranged as one column of an image, in time order, to obtain the spectrogram.
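A minimal sketch of the sliding-window FFT described above, assuming NumPy; the 512-sample block, 0.025-second interval and Hann window follow the example values in the text, while the function itself is an illustration rather than the claimed implementation:

```python
import numpy as np

def make_spectrogram(signal, sample_rate=16000, win_len=512, hop_seconds=0.025):
    """Each column is the |FFT| magnitude of one windowed block, arranged in time order."""
    hop = int(hop_seconds * sample_rate)         # 0.025 s -> 400 samples at 16 kHz
    window = np.hanning(win_len)                 # emphasize the middle of the block, reduce FFT edge effects
    columns = []
    for start in range(0, len(signal) - win_len + 1, hop):
        block = signal[start:start + win_len] * window
        columns.append(np.abs(np.fft.rfft(block)))
    return np.stack(columns, axis=1)             # (frequency bins, time frames)
```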
For the embodiment of the application, the problem of obtaining the spectrogram corresponding to the target call audio is solved.
The embodiment of the present application provides a possible implementation manner, and specifically, step S105 includes:
step S1051 (not shown in the figure), dividing the target call audio into a plurality of audio pieces based on the division tags in the time dimension;
specifically, the features extracted and processed by the CNN are input into an RNN, and the RNN outputs a segmentation mask label in the time dimension; the length of the label sequence corresponds to the length of the spectrogram, and because the spectrogram corresponds to the original audio, the output segmentation labels correspond to the original audio as well. The target call audio is then split into a plurality of audio segments based on the segmentation labels in the time dimension.
Step S1052 (not shown in the figure), concatenating the audio segments corresponding to the segmentation labels in the same time dimension to obtain the audio after the target call audio is separated.
Illustratively, the segmentation labels are binary labels (e.g. 0 and 1), where label 0 represents one speaker's audio and label 1 represents the other speaker's audio. The audio segments corresponding to the same segmentation label in the time dimension are spliced together to obtain the audio after the target call audio has been separated: the segments labelled 0 are spliced together and the segments labelled 1 are spliced together.
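A sketch of this split-and-splice step, assuming NumPy and binary frame labels that have already been mapped back to sample positions; the 400-sample frame hop is an illustrative assumption:

```python
import numpy as np

def split_speakers(samples, frame_labels, hop=400):
    """Concatenate all frames labelled 0 and all frames labelled 1 into two separate audios."""
    speakers = {0: [], 1: []}
    for i, label in enumerate(frame_labels):
        speakers[int(label)].append(samples[i * hop:(i + 1) * hop])
    return {label: np.concatenate(chunks) if chunks else np.array([])
            for label, chunks in speakers.items()}
```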
The embodiment of the application solves the problem of how to separate the voice of the target call audio according to the segmentation label.
The embodiment of the present application provides a possible implementation manner, and further, the method includes,
step S109 (not shown in the figure), extracting the voiceprint feature of the audio segment corresponding to any of the segmentation labels;
specifically, the voiceprint features of the audio segment corresponding to any segmentation label are extracted through a corresponding voiceprint feature extraction method (such as a deep learning method); in the above example, the audio clip may be an audio clip with a label of 0, or an audio clip with a label of 1.
Step S110 (not shown in the figure), calculating a similarity between a voiceprint feature of an audio clip corresponding to any one of the segmentation tags and a voiceprint feature of at least one pre-stored customer service;
specifically, the similarity between the voiceprint feature of the audio segment corresponding to any segmentation tag and the voiceprint feature of the at least one pre-stored customer service is calculated, where the similarity may be realized by calculating a distance between vectors, where the distance may be a cosine distance, a euclidean distance, a hamming distance, or the like, and is not limited herein.
Step S111 (not shown in the figure) determines that the audio after the target call audio separation to be recognized is the audio of the target object or the audio of the customer service based on the first similarity calculation result.
Specifically, it is determined that the audio after the target call audio separation to be identified is the audio of the target object or the audio of the customer service based on the first similarity calculation result.
Specifically, when the voiceprint feature of only one customer service agent is pre-stored: if the similarity between the voiceprint feature of the audio segment corresponding to a segmentation label and the pre-stored customer service voiceprint feature is smaller than a predetermined threshold, the audio corresponding to that segmentation label is the audio of the target object (i.e. the credit card applicant); if the similarity is greater than the predetermined threshold, the audio corresponding to that segmentation label is the audio of the customer service, and the audio corresponding to the other segmentation label is the audio of the target object (i.e. the credit card applicant).
Specifically, when voiceprint features of a plurality of customer services are prestored, similarity between the voiceprint feature of the audio clip corresponding to any segmentation label and the voiceprint feature of each prestored customer service is respectively calculated to obtain a plurality of similarity values; if any similarity value is larger than a preset threshold value, the audio corresponding to the segmentation label is the audio of the customer service; if all the similarity values are smaller than the predetermined threshold, it indicates that the audio corresponding to the segmentation label is the audio of the target object (i.e. the credit card applicant).
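A sketch of the matching logic for the multi-agent case above, assuming cosine similarity between voiceprint embedding vectors; the threshold value and the embedding representation are illustrative assumptions:

```python
import numpy as np

def identify_segment(segment_voiceprint, agent_voiceprints, threshold=0.7):
    """Label a segment as customer service if any stored agent voiceprint is similar enough,
    otherwise treat it as the target object (the credit card applicant)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    similarities = [cosine(segment_voiceprint, agent) for agent in agent_voiceprints]
    return "customer service" if max(similarities) > threshold else "target object"
```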
The method and the device solve the problem that the audio obtained after the audio separation of the target call is the audio of the target object or the audio of the customer service.
The embodiment of the present application provides a possible implementation manner, and further, the method further includes:
step S112 (not shown in the figure), extracting voiceprint features of the audio after the target call audio is subjected to the separation processing;
step S113 (not shown in the figure), calculating a similarity between a voiceprint feature of the audio obtained by performing the separation processing on the target call audio and a voiceprint feature of at least one pre-stored customer service;
step S114 (not shown in the figure), it is determined that the audio after the target call audio separation to be recognized is the audio of the target object or the audio of the customer service based on the second similarity calculation result.
Specifically, the voiceprint feature of the audio after the target call audio is subjected to the separation processing is extracted through a corresponding voiceprint feature extraction method (such as a deep learning method).
Illustratively, the segmentation tag is a binary tag, that is, after the target call audio is subjected to separation processing, the audio of two persons can be obtained, the voiceprint feature of the audio corresponding to any person is extracted, the similarity between the voiceprint feature of the audio corresponding to any person and the voiceprint feature of the pre-stored customer service is calculated, and based on the similarity calculation result, the audio after the target call audio is separated is determined to be the audio of the target object or the audio of the customer service. The method for determining that the audio obtained after the target call audio separation is the audio of the target object or the audio of the customer service based on the corresponding number of customer service voiceprint features may refer to the above embodiments, and details are not repeated here.
The method and the device solve the problem that the audio obtained after the audio separation of the target call is the audio of the target object or the audio of the customer service.
The embodiment of the application provides a possible implementation manner, wherein the target call audio is the call audio of the credit card applicant and the customer service.
Specifically, the target call audio is the call audio of the credit card applicant and the customer service, and a specific application scenario may be a scenario in which the credit card applicant applies for a credit card through a voice call.
For the embodiment of the application, the target call audio is the call audio of the credit card applicant and the customer service, and the problem of voice separation in a credit card application scene is solved.
The embodiment of the present application provides a possible implementation manner, and further, the method further includes:
step S115 (not shown in the figure), the voice of the target call audio is separated to obtain the audio of the credit card applicant;
step S116 (not shown), fraud identification is performed based on the audio of the credit card applicant.
Specifically, the target call audio is the call audio of the credit card applicant and the customer service, the audio of the credit card applicant can be obtained by performing voice separation on the target call audio, and further, fraud identification can be performed according to the audio of the credit card applicant.
The fraud identification step may be as follows: obtain the audio information of a plurality of applicants; based on this audio information, extract the voiceprint features of each target object through a pre-trained first deep learning model to obtain a voiceprint feature pool; cluster the voiceprint features in the pool with a clustering algorithm to obtain at least one cluster, and rank the clusters according to their similarity; then determine the fraudulent applicants based on the ranking result, where all applicants in the top N clusters may be flagged as fraudulent applicants, or all applicants in clusters whose similarity exceeds a certain threshold may be flagged.
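A sketch of the cluster-and-rank step described above, assuming scikit-learn agglomerative clustering over voiceprint embeddings; the distance threshold, the default (Ward/Euclidean) linkage, and ranking clusters by size as a proxy for similarity are illustrative simplifications of the procedure in the text:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def find_suspect_applicants(voiceprints, applicant_ids, distance_threshold=0.5, top_n=3):
    """Group applicants whose voiceprints cluster tightly together and flag the largest groups."""
    clustering = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold).fit(np.asarray(voiceprints))
    clusters = {}
    for applicant, label in zip(applicant_ids, clustering.labels_):
        clusters.setdefault(label, []).append(applicant)
    ranked = sorted(clusters.values(), key=len, reverse=True)   # crude proxy for intra-cluster similarity
    return [applicant for group in ranked[:top_n] if len(group) > 1 for applicant in group]
```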
For the embodiment of the application, fraud identification is carried out based on the audio frequency of the credit card applicant, and the problem of identification of credit card fraud application is solved.
Fig. 2 is a speech separation apparatus according to an embodiment of the present application, where the apparatus 20 includes: a first acquisition module 201, an extraction module 202, a pooling processing module 203, a split tag module 204, and a voice separation module 205, wherein,
a first obtaining module 201, configured to obtain a spectrogram corresponding to a target call audio;
the extraction module 202 is configured to extract three-dimensional features of a spectrogram based on a pre-trained convolutional neural network, where the three-dimensional features include a time dimension feature, a frequency dimension feature, and a channel dimension feature;
the pooling processing module 203 is used for performing average pooling processing on the three-dimensional features in frequency dimension to obtain pooled features;
a segmentation label module 204, configured to input the pooled features to a pre-trained recurrent neural network to obtain a segmentation label in a time dimension;
and the voice separation module 205 is configured to perform voice separation on the target call audio based on the segmentation tag in the time dimension.
The embodiment of the application provides a voice separation apparatus. Compared with the prior art, in which voice separation is performed by change-point detection and clustering, the apparatus obtains the spectrogram corresponding to the target call audio, extracts three-dimensional features of the spectrogram based on a pre-trained convolutional neural network (the three-dimensional features including time-dimension, frequency-dimension and channel-dimension features), performs average pooling on the three-dimensional features in the frequency dimension to obtain pooled features, inputs the pooled features into a pre-trained recurrent neural network to obtain segmentation labels in the time dimension, and performs voice separation on the target call audio based on those segmentation labels. By extracting the three-dimensional features of the spectrogram corresponding to the target call audio, the apparatus makes full use of speaker information and can improve the accuracy of voice segmentation; in addition, the segmentation labels of the target call audio are output automatically by the pre-trained recurrent neural network, so change points are judged more accurately, which further improves segmentation accuracy, and since no additional clustering algorithm is needed, end-to-end voice separation can be realized.
The speech separation apparatus of this embodiment can perform the speech separation method provided in the above embodiments of this application; the implementation principles are similar and are not repeated here.
As shown in fig. 3, the present embodiment provides another speech separation apparatus, where the apparatus 30 includes: a first acquisition module 301, an extraction module 302, a pooling processing module 303, a split tag module 304, and a voice separation module 305, wherein,
the first obtaining module 301 is configured to obtain a spectrogram corresponding to a target call audio;
the first obtaining module 301 in fig. 3 has the same or similar function as the first obtaining module 201 in fig. 2.
An extraction module 302, configured to extract three-dimensional features of a spectrogram based on a pre-trained convolutional neural network, where the three-dimensional features include a time dimension feature, a frequency dimension feature, and a channel dimension feature;
wherein the extraction module 302 in fig. 3 has the same or similar function as the extraction module 201 in fig. 2.
The pooling processing module 303 is configured to perform average pooling processing on the three-dimensional features in a frequency dimension to obtain pooled features;
wherein the pooling processing module 303 of fig. 3 is the same or similar in function to the pooling processing module 203 of fig. 2.
A segmentation label module 304, configured to input the pooled features to a pre-trained recurrent neural network to obtain a segmentation label in a time dimension;
the split tag module 304 in fig. 3 has the same or similar function as the split tag module 204 in fig. 2.
And a voice separation module 305, configured to perform voice separation on the target call audio based on the segmentation tag in the time dimension.
The voice separation module 305 in fig. 3 has the same or similar function as the voice separation module 205 in fig. 2.
The embodiment of the present application provides a possible implementation manner, and further, the apparatus further includes:
a second obtaining module 306, configured to obtain a target call audio;
the processing module 307 is configured to perform standardization processing on the target call audio to obtain a digital signal corresponding to the target call audio;
and the transformation module 308 is configured to obtain a spectrogram corresponding to the target call audio through a sliding window method and FFT transformation based on the digital signal.
For the embodiment of the application, the problem of obtaining the spectrogram corresponding to the target call audio is solved.
The embodiment of the present application provides a possible implementation manner, and the voice separation module 305 includes:
a division unit 3051 (not shown in the figure) for dividing the target call audio into a plurality of audio pieces based on the division label in the time dimension;
the splicing unit 3052 (not shown in the figure) is configured to splice the audio segments corresponding to the split tags in the same time dimension to obtain an audio obtained after the target call audio is subjected to separation processing.
The embodiment of the application solves the problem of how to separate the voice of the target call audio according to the segmentation label.
The embodiment of the present application provides a possible implementation manner, and further, the apparatus 30 further includes:
a first extraction module 309, configured to extract a voiceprint feature of an audio segment corresponding to any segmentation tag;
the first calculating module 310 is configured to calculate a similarity between a voiceprint feature of an audio clip corresponding to any one of the segmentation tags and a voiceprint feature of at least one pre-stored customer service;
and a first determining module 311, configured to determine, based on the first similarity calculation result, that the audio obtained after the target call audio separation to be identified is the audio of the target object or the audio of the customer service.
The method and the device solve the problem that the audio obtained after the audio separation of the target call is the audio of the target object or the audio of the customer service.
The embodiment of the present application provides a possible implementation manner, and further, the apparatus 30 further includes:
a second extracting module 312, configured to extract a voiceprint feature of the audio obtained after the target call audio is subjected to the separation processing;
the second calculation module 313 is configured to calculate a similarity between a voiceprint feature of the audio obtained by performing the separation processing on the target call audio and a voiceprint feature of at least one pre-stored customer service;
and the second determining module 314 is configured to determine, based on the second similarity calculation result, that the audio obtained after the target call audio separation to be identified is the audio of the target object or the audio of the customer service.
The method and the device solve the problem that the audio obtained after the audio separation of the target call is the audio of the target object or the audio of the customer service.
The embodiment of the application provides a possible implementation manner, wherein the target call audio is the call audio of the credit card applicant and the customer service.
For the embodiment of the application, the target call audio is the call audio of the credit card applicant and the customer service, and the problem of voice separation in a credit card application scene is solved.
The embodiment of the present application provides a possible implementation manner, and further, the apparatus 30 further includes:
the separation module 315 is configured to perform voice separation on the target call audio to obtain an audio of the credit card applicant;
an identification module 316 for fraud identification based on the credit card applicant's audio.
For the embodiment of the application, fraud identification is carried out based on the audio frequency of the credit card applicant, and the problem of identification of credit card fraud application is solved.
The embodiment of the application provides a voice separation apparatus. Compared with the prior art, in which voice separation is performed by change-point detection and clustering, the apparatus obtains the spectrogram corresponding to the target call audio, extracts three-dimensional features of the spectrogram based on a pre-trained convolutional neural network (the three-dimensional features including time-dimension, frequency-dimension and channel-dimension features), performs average pooling on the three-dimensional features in the frequency dimension to obtain pooled features, inputs the pooled features into a pre-trained recurrent neural network to obtain segmentation labels in the time dimension, and performs voice separation on the target call audio based on those segmentation labels. By extracting the three-dimensional features of the spectrogram corresponding to the target call audio, the apparatus makes full use of speaker information and can improve the accuracy of voice segmentation; in addition, the segmentation labels of the target call audio are output automatically by the pre-trained recurrent neural network, so change points are judged more accurately, which further improves segmentation accuracy, and since no additional clustering algorithm is needed, end-to-end voice separation can be realized.
The embodiment of the present application provides a speech separation apparatus, which is suitable for the method shown in the above embodiment, and is not described herein again.
An embodiment of the present application provides an electronic device, as shown in fig. 4. The electronic device 40 shown in fig. 4 includes a processor 401 and a memory 403, where the processor 401 is connected to the memory 403, for example via a bus 402. Further, the electronic device 40 may also include a transceiver 404. It should be noted that in practical applications the transceiver 404 is not limited to one, and the structure of the electronic device 40 does not limit the embodiments of the present application. In this embodiment, the processor 401 is used to implement the functions of the first obtaining module, the extraction module, the pooling processing module, the segmentation label module and the voice separation module shown in fig. 2 or fig. 3, and the functions of the second obtaining module, the processing module, the transformation module, the first extraction module, the first calculation module, the first determining module, the second extraction module, the second calculation module, the second determining module, the separation module and the identification module shown in fig. 3. The transceiver 404 includes a receiver and a transmitter.
The processor 401 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 401 may also be a combination of computing functions, e.g., comprising one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 402 may include a path that transfers information between the above components. The bus 402 may be a PCI bus or an EISA bus, etc. The bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
The memory 403 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 403 is used for storing application program codes for executing the scheme of the application, and the execution is controlled by the processor 401. Processor 401 is configured to execute application program code stored in memory 403 to implement the functionality of the voice separation apparatus provided by the embodiments shown in fig. 2 or fig. 3.
Compared with the prior art, in which voice separation is performed by change-point detection and clustering, the electronic device provided by the embodiment of the application obtains the spectrogram corresponding to the target call audio, extracts three-dimensional features of the spectrogram based on a pre-trained convolutional neural network (the three-dimensional features including time-dimension, frequency-dimension and channel-dimension features), performs average pooling on the three-dimensional features in the frequency dimension to obtain pooled features, inputs the pooled features into a pre-trained recurrent neural network to obtain segmentation labels in the time dimension, and performs voice separation on the target call audio based on those segmentation labels. By extracting the three-dimensional features of the spectrogram corresponding to the target call audio, the application makes full use of speaker information and can improve the accuracy of voice segmentation; in addition, the segmentation labels of the target call audio are output automatically by the pre-trained recurrent neural network, so change points are judged more accurately, which further improves segmentation accuracy, and since no additional clustering algorithm is needed, end-to-end voice separation can be realized.
The embodiment of the application provides an electronic device suitable for the method embodiment. And will not be described in detail herein.
The present application provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method shown in the above embodiments is implemented.
Compared with the prior art, in which voice separation is performed by change-point detection and clustering, the computer-readable storage medium provided by the embodiment of the application causes the computer to obtain the spectrogram corresponding to the target call audio, extract three-dimensional features of the spectrogram based on a pre-trained convolutional neural network (the three-dimensional features including time-dimension, frequency-dimension and channel-dimension features), perform average pooling on the three-dimensional features in the frequency dimension to obtain pooled features, input the pooled features into a pre-trained recurrent neural network to obtain segmentation labels in the time dimension, and perform voice separation on the target call audio based on those segmentation labels. By extracting the three-dimensional features of the spectrogram corresponding to the target call audio, the application makes full use of speaker information and can improve the accuracy of voice segmentation; in addition, the segmentation labels of the target call audio are output automatically by the pre-trained recurrent neural network, so change points are judged more accurately, which further improves segmentation accuracy, and since no additional clustering algorithm is needed, end-to-end voice separation can be realized.
The embodiment of the application provides a computer-readable storage medium which is suitable for the method embodiment. And will not be described in detail herein.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, there is no strict restriction on the execution order, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present application, and such improvements and modifications shall also fall within the protection scope of the present application.
Claims (10)
1. A method of speech separation, comprising:
acquiring a spectrogram corresponding to the target call audio;
extracting three-dimensional features of the spectrogram based on a pre-trained convolutional neural network, wherein the three-dimensional features comprise time dimension features, frequency dimension features and channel dimension features;
performing average pooling on the three-dimensional features in a frequency dimension to obtain pooled features;
inputting the pooled features into a pre-trained recurrent neural network to obtain a segmentation label on a time dimension;
and performing voice separation on the target call audio based on the segmentation label on the time dimension.
2. The method of claim 1, wherein the obtaining the spectrogram corresponding to the target call audio comprises:
acquiring a target call audio;
standardizing the target call audio to obtain a digital signal corresponding to the target call audio;
and obtaining the spectrogram corresponding to the target call audio from the digital signal by means of a sliding window method and fast Fourier transform (FFT).
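A minimal sketch of the spectrogram step in claim 2, assuming a Hann window and magnitude FFT frames; the window length and hop size below are illustrative choices, not values given in the application.

```python
# Sketch only: win_len and hop are assumed values for an 8 kHz call signal.
import numpy as np

def spectrogram(signal: np.ndarray, win_len: int = 400, hop: int = 160) -> np.ndarray:
    """Return a (freq_bins, time_frames) magnitude spectrogram."""
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window   # sliding window
        frames.append(np.abs(np.fft.rfft(frame)))        # FFT of each frame
    return np.stack(frames, axis=1)

# Example: one second of normalised 8 kHz call audio.
audio = np.random.randn(8000).astype(np.float32)
spec = spectrogram(audio)
```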
3. The method of claim 1, wherein the performing voice separation on the target call audio based on the segmentation label in the time dimension comprises:
dividing the target call audio into a plurality of audio segments based on the segmentation label in the time dimension;
and splicing the audio segments corresponding to the same segmentation label in the time dimension to obtain the audio obtained after the separation processing of the target call audio.
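A minimal sketch of the separation step in claim 3, assuming each segmentation label covers one fixed-hop frame of the call audio; the frame-to-sample mapping is an assumption for illustration.

```python
# Sketch only: assumes one label per fixed-length frame of `hop` samples.
import numpy as np

def separate_by_labels(audio: np.ndarray, labels: np.ndarray, hop: int = 160) -> dict:
    """Group audio samples by the segmentation label of their frame."""
    separated = {}
    for i, label in enumerate(labels):
        segment = audio[i * hop:(i + 1) * hop]
        separated.setdefault(int(label), []).append(segment)
    # Splice together all segments that carry the same label.
    return {label: np.concatenate(segs) for label, segs in separated.items()}

audio = np.random.randn(8000).astype(np.float32)
labels = np.array([0] * 25 + [1] * 25)        # 50 frames, two speakers
speaker_audio = separate_by_labels(audio, labels)
```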
4. The method of claim 3, further comprising:
extracting a voiceprint feature of the audio segment corresponding to any segmentation label;
calculating the similarity between the voiceprint feature of the audio segment corresponding to the segmentation label and at least one pre-stored customer-service voiceprint feature to obtain a first similarity calculation result;
and determining, based on the first similarity calculation result, whether the audio obtained after the separation processing of the target call audio is the audio of the target object or the audio of the customer service.
5. The method of claim 3, further comprising:
extracting a voiceprint feature of the audio obtained after the separation processing of the target call audio;
calculating the similarity between the voiceprint feature of the audio obtained after the separation processing of the target call audio and at least one pre-stored customer-service voiceprint feature to obtain a second similarity calculation result;
and determining, based on the second similarity calculation result, whether the audio obtained after the separation processing of the target call audio is the audio of the target object or the audio of the customer service.
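A minimal sketch of the similarity check used in claims 4 and 5, assuming cosine similarity between fixed-length voiceprint embeddings; how the voiceprint features are extracted is not specified here, and the decision threshold is an illustrative choice.

```python
# Sketch only: the embedding size and threshold are assumptions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_stream(voiceprint, service_voiceprints, threshold=0.7) -> str:
    """Label a separated stream as customer service if it matches any stored voiceprint."""
    best = max(cosine_similarity(voiceprint, v) for v in service_voiceprints)
    return "customer_service" if best >= threshold else "target_object"

stream_vp = np.random.randn(256)                      # voiceprint of one stream
stored = [np.random.randn(256) for _ in range(3)]     # pre-stored service voiceprints
role = classify_stream(stream_vp, stored)
```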
6. The method of any one of claims 1-5, wherein the target call audio is the call audio of a conversation between a credit card applicant and a customer service agent.
7. The method of claim 6, further comprising:
performing voice separation on the target call audio to obtain the audio of the credit card applicant;
and performing fraud identification based on the audio of the credit card applicant.
8. A speech separation apparatus, comprising:
the first acquisition module is used for acquiring a spectrogram corresponding to the target call audio;
the extraction module is used for extracting three-dimensional features of the spectrogram based on a pre-trained convolutional neural network, wherein the three-dimensional features comprise time dimension features, frequency dimension features and channel dimension features;
the pooling processing module is used for performing average pooling processing on the three-dimensional features in the frequency dimension to obtain pooled features;
the segmentation label module is used for inputting the pooled features into a pre-trained recurrent neural network to obtain segmentation labels in a time dimension;
and the voice separation module is used for carrying out voice separation on the target call audio based on the segmentation label on the time dimension.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to perform the speech separation method according to any one of claims 1 to 7.
10. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the speech separation method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911006481.6A CN110718228B (en) | 2019-10-22 | 2019-10-22 | Voice separation method and device, electronic equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110718228A true CN110718228A (en) | 2020-01-21 |
CN110718228B CN110718228B (en) | 2022-04-12 |
Family
ID=69213097
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911006481.6A Active CN110718228B (en) | 2019-10-22 | 2019-10-22 | Voice separation method and device, electronic equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110718228B (en) |
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0598469A2 (en) * | 1992-10-27 | 1994-05-25 | Daniel P. Dunlevy | Interactive credit card fraud control process |
US20070282605A1 (en) * | 2005-04-21 | 2007-12-06 | Anthony Rajakumar | Method and System for Screening Using Voice Data and Metadata |
WO2015191140A2 (en) * | 2014-03-24 | 2015-12-17 | Taylor Thomas Jason | Voice-key electronic commerce |
US20180254040A1 (en) * | 2017-03-03 | 2018-09-06 | Microsoft Technology Licensing, Llc | Multi-talker speech recognizer |
CN107563758A (en) * | 2017-07-18 | 2018-01-09 | 厦门快商通科技股份有限公司 | Detection method and system for habitual-offender fraud in financial credit review business |
CN107818798A (en) * | 2017-10-20 | 2018-03-20 | 百度在线网络技术(北京)有限公司 | Customer service quality evaluating method, device, equipment and storage medium |
WO2019199554A1 (en) * | 2018-04-11 | 2019-10-17 | Microsoft Technology Licensing, Llc | Multi-microphone speech separation |
CN108766440A (en) * | 2018-05-28 | 2018-11-06 | 平安科技(深圳)有限公司 | Speaker separation model training method, two-speaker separation method and related equipment |
CN109034502A (en) * | 2018-09-04 | 2018-12-18 | 中国光大银行股份有限公司信用卡中心 | Anti-fraud prediction method and device |
CN109544324A (en) * | 2018-11-27 | 2019-03-29 | 深圳前海微众银行股份有限公司 | Credit anti-fraud method, system, equipment and computer readable storage medium |
CN109545228A (en) * | 2018-12-14 | 2019-03-29 | 厦门快商通信息技术有限公司 | End-to-end speaker segmentation method and system |
CN110085251A (en) * | 2019-04-26 | 2019-08-02 | 腾讯音乐娱乐科技(深圳)有限公司 | Voice extraction method, voice extraction device and related products |
CN110265037A (en) * | 2019-06-13 | 2019-09-20 | 中信银行股份有限公司 | Identity verification method and device, electronic equipment and computer readable storage medium |
CN110223165A (en) * | 2019-06-14 | 2019-09-10 | 哈尔滨哈银消费金融有限责任公司 | Anti-fraud and credit risk prediction method and system based on a relational network |
CN110289002A (en) * | 2019-06-28 | 2019-09-27 | 四川长虹电器股份有限公司 | End-to-end speaker clustering method and system |
Non-Patent Citations (4)
Title |
---|
CHUNLEI ZHANG, ET AL.: "An Investigation of Deep-Learning Frameworks for Speaker Verification Antispoofing", IEEE Journal of Selected Topics in Signal Processing *
JILIANG ZHANG, ET AL.: "T2FA: Transparent Two-Factor Authentication", IEEE Access *
YU LINGFEI, ET AL.: "Research and Application of a Voiceprint Recognition Method Based on Deep Recurrent Networks", Application Research of Computers *
CHEN GUANYU: "A Credit Card Fraud Risk Detection Network Model Based on kNN-Smote-LSTM", China Doctoral Dissertations and Master's Theses Full-text Database (Master's) *
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110648670A (en) * | 2019-10-22 | 2020-01-03 | 中信银行股份有限公司 | Fraud identification method and device, electronic equipment and computer-readable storage medium |
CN110648670B (en) * | 2019-10-22 | 2021-11-26 | 中信银行股份有限公司 | Fraud identification method and device, electronic equipment and computer-readable storage medium |
CN111246285A (en) * | 2020-03-24 | 2020-06-05 | 北京奇艺世纪科技有限公司 | Method for separating sound in comment video and method and device for adjusting volume |
CN111640422A (en) * | 2020-05-13 | 2020-09-08 | 广州国音智能科技有限公司 | Voice and human voice separation method and device, terminal and storage medium |
CN111933153A (en) * | 2020-07-07 | 2020-11-13 | 北京捷通华声科技股份有限公司 | Method and device for determining voice segmentation points |
CN111933153B (en) * | 2020-07-07 | 2024-03-08 | 北京捷通华声科技股份有限公司 | Voice segmentation point determining method and device |
CN111968652B (en) * | 2020-07-14 | 2022-08-26 | 重庆邮电大学 | Speaker identification method based on 3DCNN-LSTM and storage medium |
CN111968652A (en) * | 2020-07-14 | 2020-11-20 | 重庆邮电大学 | Speaker identification method based on 3DCNN-LSTM and storage medium |
CN111899755A (en) * | 2020-08-11 | 2020-11-06 | 华院数据技术(上海)有限公司 | Speaker voice separation method and related equipment |
CN111968650A (en) * | 2020-08-17 | 2020-11-20 | 科大讯飞股份有限公司 | Voice matching method and device, electronic equipment and storage medium |
CN111968650B (en) * | 2020-08-17 | 2024-04-30 | 科大讯飞股份有限公司 | Voice matching method and device, electronic equipment and storage medium |
CN112382282A (en) * | 2020-11-06 | 2021-02-19 | 北京五八信息技术有限公司 | Voice denoising processing method and device, electronic equipment and storage medium |
CN112420077A (en) * | 2020-11-19 | 2021-02-26 | 展讯通信(上海)有限公司 | Sound positioning method and device, testing method and system, equipment and storage medium |
CN112420077B (en) * | 2020-11-19 | 2022-08-16 | 展讯通信(上海)有限公司 | Sound positioning method and device, testing method and system, equipment and storage medium |
CN112634875B (en) * | 2021-03-04 | 2021-06-08 | 北京远鉴信息技术有限公司 | Voice separation method, voice separation device, electronic device and storage medium |
CN112634875A (en) * | 2021-03-04 | 2021-04-09 | 北京远鉴信息技术有限公司 | Voice separation method, voice separation device, electronic device and storage medium |
CN114187917A (en) * | 2021-12-14 | 2022-03-15 | 科大讯飞股份有限公司 | Speaker separation method, device, electronic equipment and storage medium |
CN118447853A (en) * | 2023-09-11 | 2024-08-06 | 上海荣耀智慧科技开发有限公司 | Audio processing method and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110718228B (en) | 2022-04-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110718228B (en) | Voice separation method and device, electronic equipment and computer readable storage medium | |
CN108766440B (en) | Speaker separation model training method, two-speaker separation method and related equipment | |
CN110265037B (en) | Identity verification method and device, electronic equipment and computer readable storage medium | |
CN110648670B (en) | Fraud identification method and device, electronic equipment and computer-readable storage medium | |
CN111243603B (en) | Voiceprint recognition method, system, mobile terminal and storage medium | |
CN109410956B (en) | Object identification method, device, equipment and storage medium of audio data | |
CN109545226B (en) | Voice recognition method, device and computer readable storage medium | |
CN110570870A (en) | Text-independent voiceprint recognition method, device and equipment | |
CN115312033A (en) | Speech emotion recognition method, device, equipment and medium based on artificial intelligence | |
CN110634492A (en) | Login verification method and device, electronic equipment and computer readable storage medium | |
US10910000B2 (en) | Method and device for audio recognition using a voting matrix | |
CN115115981A (en) | Data processing method, device, equipment, storage medium and computer program product | |
CN113327584A (en) | Language identification method, device, equipment and storage medium | |
CN113420178A (en) | Data processing method and equipment | |
CN113571085B (en) | Voice separation method, system, device and storage medium | |
CN116705034A (en) | Voiceprint feature extraction method, speaker recognition method, model training method and device | |
CN116844573A (en) | Speech emotion recognition method, device, equipment and medium based on artificial intelligence | |
CN111933153B (en) | Voice segmentation point determining method and device | |
CN113139561B (en) | Garbage classification method, garbage classification device, terminal equipment and storage medium | |
CN111402898A (en) | Audio signal processing method, device, equipment and storage medium | |
CN113113051A (en) | Audio fingerprint extraction method and device, computer equipment and storage medium | |
CN114999465B (en) | Speaker segmentation and clustering method, device, storage medium and electronic device | |
CN117727312B (en) | Target noise separation method, system and terminal equipment | |
CN118551075B (en) | Audio retrieval method, system, electronic equipment and product | |
CN118278971B (en) | Customer group screening system and method based on conversation big data analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||