CN112712790B - Speech extraction method, device, equipment and medium for target speaker - Google Patents


Info

Publication number
CN112712790B
Authority
CN
China
Prior art keywords
voice
voice data
processed
extracted
speaker
Prior art date
Legal status
Active
Application number
CN202011545184.1A
Other languages
Chinese (zh)
Other versions
CN112712790A (en)
Inventor
张舒婷
赖众程
杨念慈
何利斌
李会璟
王小红
刘彦国
Current Assignee
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Application filed by Ping An Bank Co Ltd
Priority to CN202011545184.1A
Publication of CN112712790A
Application granted
Publication of CN112712790B

Classifications

    • G10L15/00 Speech recognition (G PHYSICS; G10 Musical instruments, acoustics; G10L Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding)
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Computer Vision & Pattern Recognition
  • Artificial Intelligence
  • Telephonic Communication Services

Abstract

The application relates to the technical field of artificial intelligence, and discloses a voice extraction method, device, equipment and medium for a target speaker. The method comprises the following steps: determining a plurality of first voice data segments to be extracted from first voice data to be processed in a first direction by adopting a preset segmentation method; performing segmented extraction on second voice data to be processed in a second direction according to the plurality of first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted; pairing the first voice data segments to be extracted and the second voice data segments to be extracted that share the same time to obtain a plurality of voice data segment pairs to be extracted; and inputting each voice data segment pair to be extracted into a voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, which are then spliced in time sequence to obtain target voice data of the target speaker. The cost of business quality assessment is thereby reduced, and the comprehensiveness of business quality assessment is improved.

Description

Speech extraction method, device, equipment and medium for target speaker
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a medium for extracting voice aiming at a target speaker.
Background
At present, the service quality of service personnel is uneven, with problems such as non-standard service scripts and unfriendly attitudes. To improve the service quality of service personnel, service quality is assessed through manual spot checks and mystery-visit spot checks, which consumes a large amount of manpower and financial resources and is therefore costly; moreover, such spot checks only reflect the service situation at some moments, so the resulting service quality assessment is one-sided.
Disclosure of Invention
The application mainly aims to provide a voice extraction method, device, equipment and medium for a target speaker, so as to solve the technical problems in the prior art that assessing service quality in the service industry through manual spot checks and mystery-visit spot checks is costly and yields a one-sided service quality assessment.
In order to achieve the above object, the present application proposes a method for extracting speech for a target speaker, the method comprising:
acquiring first to-be-processed voice data and second to-be-processed voice data of a target speaker in the same time period, wherein the first to-be-processed voice data is voice data obtained according to voice signals in a first direction, and the second to-be-processed voice data is voice data obtained according to voice signals in a second direction;
Carrying out segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted;
the second voice data to be processed are extracted in a segmented mode according to the plurality of first voice data segments to be extracted, and a plurality of second voice data segments to be extracted are obtained;
carrying out data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted;
respectively inputting each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model comprises: a first coding transformation module, a second coding transformation module, a speaker separation learning module and a decoding transformation module, and the single speaker voice extraction model is a model obtained based on TasNet network training;
and splicing the voice data segments of the target speakers in time sequence to obtain the target voice data of the target speakers.
Further, the step of obtaining the first to-be-processed voice data and the second to-be-processed voice data of the target speaker in the same time period includes:
Acquiring the voice signals of the target speaker in the first direction and the voice signals of the target speaker in the second direction in the same time period;
segmenting the voice signal in the first direction by adopting a first preset duration to obtain a plurality of segmented voice signal segments in the first direction;
inputting each segmented first-direction voice signal segment into a digital filter to obtain a plurality of filtered first-direction voice signal segments;
performing discrete Fourier transform on each filtered first-direction voice signal segment to obtain a plurality of transformed first-direction voice signal segments;
performing inverse discrete Fourier transform on the plurality of transformed first-direction voice signal segments to obtain noise-reduced first-direction voice data;
segmenting the voice signal in the second direction by adopting the first preset duration to obtain a plurality of segmented voice signal segments in the second direction;
inputting each segmented second-direction voice signal segment into a digital filter to obtain a plurality of filtered second-direction voice signal segments;
performing discrete Fourier transform on each filtered second-direction voice signal segment to obtain a plurality of transformed second-direction voice signal segments;
Performing inverse discrete Fourier transform on the plurality of transformed second-direction voice signal segments to obtain noise-reduced second-direction voice data;
pre-emphasis processing is carried out on the noise-reduced voice data in the first direction to obtain the first voice data to be processed;
and pre-emphasis processing is carried out on the noise-reduced voice data in the second direction to obtain the second voice data to be processed.
Further, the step of segmenting the first speech data to be processed by using a preset segmentation method to obtain a plurality of first speech data segments to be extracted includes:
framing the first voice data to be processed by adopting a second preset duration to obtain a plurality of first voice data frames to be processed;
respectively carrying out voice energy calculation on each first voice data frame to be processed to obtain first voice energy corresponding to each of the plurality of first voice data frames to be processed;
extracting, according to a preset quantity, first voice energies from the beginning of the first voice energies corresponding to the plurality of first voice data frames to be processed, to obtain a plurality of first initial voice energies;
average value calculation is carried out on the plurality of first initial voice energies to obtain first background voice energy corresponding to the plurality of first voice data frames to be processed;
respectively subtracting the first background voice energy from the first voice energy corresponding to each first voice data frame to be processed to obtain first voice energy difference values corresponding to the plurality of first voice data frames to be processed;
respectively comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value;
when the first voice energy difference value corresponding to the first voice data frame to be processed is larger than the voice energy threshold value, determining that the mute type of the first voice data frame to be processed corresponding to the first voice energy difference value is a non-mute frame;
when the first voice energy difference value corresponding to the first voice data frame to be processed is smaller than or equal to the voice energy threshold value, determining that the mute type of the first voice data frame to be processed corresponding to the first voice energy difference value is a mute frame;
and carrying out mute frame deletion processing on the plurality of first voice data frames to be processed by adopting a mute frame quantity threshold value and the mute category to obtain a plurality of first voice data segments to be extracted.
Further, the step of performing mute frame deletion processing on the plurality of first to-be-processed voice data frames by using the mute frame number threshold and the mute category to obtain the plurality of first to-be-extracted voice data segments includes:
Calculating the number of the silence frames of the first voice data frames to be processed according to time continuity to obtain the number of the first continuous silence frames;
respectively comparing the number of each first continuous mute frame with the threshold value of the number of the mute frames;
and deleting the first to-be-processed voice data frames corresponding to all the first continuous silence frame numbers larger than the silence frame number threshold from the first to-be-processed voice data frames when the first continuous silence frame numbers are larger than the silence frame number threshold, so as to obtain the first to-be-extracted voice data segments.
Further, the step of extracting the second to-be-processed voice data in a segmented manner according to the plurality of first to-be-extracted voice data segments to obtain a plurality of second to-be-extracted voice data segments includes:
extracting the starting time and the ending time of each first voice data segment to be extracted respectively to obtain a first starting time and a first ending time which correspond to each of the plurality of first voice data segments to be extracted;
and respectively adopting a first starting time and a first ending time corresponding to each first voice data segment to be extracted to perform segmented extraction from the second voice data to be processed, so as to obtain a plurality of second voice data segments to be extracted.
Further, the step of respectively inputting each voice data segment pair to be extracted into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments includes:
inputting the first voice data segment to be extracted of the voice data segment pair to the first coding conversion module of the single speaker voice extraction model for coding conversion to obtain a first coding conversion result;
inputting the second voice data segment to be extracted of the voice data segment pair to the second coding conversion module of the single speaker voice extraction model for coding conversion to obtain a second coding conversion result;
calling the speaker separation learning module of the single speaker voice extraction model to perform speaker separation learning on the first code transformation result and the second code transformation result to obtain a target mask matrix;
invoking the decoding transformation module of the single speaker voice extraction model to perform decoding transformation on the target mask matrix to obtain the target speaker voice data segment corresponding to the voice data segment to be extracted;
and repeatedly executing the step of inputting the first voice data segment to be extracted of the voice data segment pair to the first coding conversion module of the single speaker voice extraction model to carry out coding conversion to obtain a first coding conversion result until all the voice data segment pairs to be extracted respectively correspond to the voice data segments of the target speaker.
Further, before the step of respectively inputting each voice data segment pair to be extracted into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the method includes:
obtaining a plurality of training samples, the training samples comprising: voice sample data in a first direction, voice sample data in a second direction and voice calibration data;
inputting the voice sample data of the training sample in the first direction into a first to-be-trained coding conversion module of a to-be-trained voice extraction model and the voice sample data in the second direction into a second to-be-trained coding conversion module of the to-be-trained voice extraction model, and obtaining single speaker training data output by the to-be-trained voice extraction model, wherein the to-be-trained voice extraction model is a model obtained by modifying the TasNet network;
inputting the voice calibration data and the single speaker training data into a loss function for calculation to obtain a loss value of the voice extraction model to be trained, updating parameters of the voice extraction model to be trained according to the loss value, and using the updated voice extraction model to be trained for calculating the single speaker training data next time;
Repeating the steps until the loss value reaches a first convergence condition or the iteration number reaches a second convergence condition, and determining the speech extraction model to be trained, of which the loss value reaches the first convergence condition or the iteration number reaches the second convergence condition, as the single speaker speech extraction model.
The application also provides a voice extraction device aiming at the target speaker, which comprises:
the voice data acquisition module is used for acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to voice signals in a first direction, and the second voice data to be processed is voice data obtained according to voice signals in a second direction;
the first segmentation processing module is used for carrying out segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted;
the second segmentation extraction module is used for carrying out segmentation extraction on the second voice data to be processed according to the plurality of first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;
The to-be-extracted voice data segment pair determining module is used for carrying out data extraction on the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments at the same time to obtain a plurality of to-be-extracted voice data segment pairs;
the target speaker voice data segment determining module is configured to respectively input each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, where the single speaker voice extraction model includes: a first coding transformation module, a second coding transformation module, a speaker separation learning module and a decoding transformation module, and the single speaker voice extraction model is a model obtained based on TasNet network training;
and the target voice data determining module is used for splicing the voice data segments of the target speakers in time sequence to obtain target voice data of the target speakers.
The application also proposes a computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
The application also proposes a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method of any of the above.
According to the voice extraction method, device, equipment and medium for the target speaker, the first voice data to be processed in the first direction and the second voice data to be processed in the second direction of the target speaker in the same time period are subjected to segmentation processing and same-time data extraction to obtain a plurality of voice data segment pairs to be extracted; each pair is then input into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training; finally, the target voice data of the target speaker is obtained by splicing the plurality of target speaker voice data segments in time sequence. The speech of the target speaker is thus extracted rapidly, accurately and automatically, the cost of business quality assessment is reduced, the comprehensiveness of business quality assessment is improved through the complete voice data of the target speaker, and the privacy security of the voice data of other speakers is protected.
Drawings
FIG. 1 is a flowchart of a method for extracting voice for a target speaker according to an embodiment of the present application;
FIG. 2 is a block diagram schematically illustrating a voice extraction apparatus for a target speaker according to an embodiment of the present application;
fig. 3 is a schematic block diagram of a computer device according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In order to solve the technical problems in the prior art that assessing service quality in the service industry through manual spot checks and mystery-visit spot checks is costly and yields a one-sided service quality assessment, the application provides a voice extraction method for a target speaker, applied to the technical field of artificial intelligence. In the voice extraction method for the target speaker, the voices of the target speaker recorded in different directions are subjected to segmentation processing and then input into a single speaker voice extraction model obtained based on TasNet network training for voice extraction, and the extracted voices are spliced in time sequence to obtain voice data containing only the voice of the target speaker. In this way, the speech of the target speaker is extracted rapidly, accurately and automatically, the cost of business quality assessment is reduced, the comprehensiveness of business quality assessment is improved through the complete voice data of the target speaker, and the privacy security of the voice data of other speakers is protected.
Referring to fig. 1, in an embodiment of the present application, a method for extracting voice for a target speaker is provided, where the method includes:
s1: acquiring first to-be-processed voice data and second to-be-processed voice data of a target speaker in the same time period, wherein the first to-be-processed voice data is voice data obtained according to voice signals in a first direction, and the second to-be-processed voice data is voice data obtained according to voice signals in a second direction;
s2: carrying out segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted;
s3: the second voice data to be processed are extracted in a segmented mode according to the plurality of first voice data segments to be extracted, and a plurality of second voice data segments to be extracted are obtained;
s4: carrying out data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted;
s5: respectively inputting each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model comprises: a first coding transformation module, a second coding transformation module, a speaker separation learning module and a decoding transformation module, and the single speaker voice extraction model is a model obtained based on TasNet network training;
S6: and splicing the voice data segments of the target speakers in time sequence to obtain the target voice data of the target speakers.
In this way, the first voice data to be processed in the first direction and the second voice data to be processed in the second direction of the target speaker in the same time period are subjected to segmentation processing and same-time data extraction to obtain a plurality of voice data segment pairs to be extracted; each pair is then input into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training; finally, the target voice data of the target speaker is obtained by splicing the plurality of target speaker voice data segments in time sequence. The speech of the target speaker is thus extracted rapidly, accurately and automatically, the cost of business quality assessment is reduced, the comprehensiveness of business quality assessment is improved through the complete voice data of the target speaker, and the privacy security of the voice data of other speakers is protected.
For S1, the first to-be-processed voice data and the second to-be-processed voice data of the target speaker in the same time period may be obtained from the database, or the first to-be-processed voice data and the second to-be-processed voice data of the target speaker in the same time period recorded by the recording device may be directly obtained.
The first voice data to be processed is voice data obtained according to voice signals in a first direction, the second voice data to be processed is voice data obtained according to voice signals in a second direction, and the first direction and the second direction are different directions.
The first voice data to be processed and the second voice data to be processed are voice signals which are recorded by a target speaker in the same time period, namely, the voice signals in the first direction and the voice signals in the second direction are simultaneously recorded, and the recording duration of the voice signals in the first direction is the same as the recording duration of the voice signals in the second direction.
For S2, the first speech data to be processed is divided into multiple small speech data segments by using a preset segmentation method, and each small speech data segment is used as a first speech data segment to be extracted.
And for S3, dividing the second voice data to be processed into a plurality of small voice data sections by adopting the same starting time and ending time as the plurality of first voice data sections to be extracted, and taking each small voice data section as a second voice data section to be extracted.
It may be understood that the second to-be-processed voice data may be subjected to the segmentation processing first, and then the segmentation processing result is adopted to perform the segmentation extraction on the first to-be-processed voice data, which is not limited herein.
It is to be understood that the step S3 may also be directly performed by the method of the step S2, which is not specifically limited herein.
For S4, the to-be-extracted voice data segments of the same start time and end time in the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments are formed into one to-be-extracted voice data segment pair. That is, each of the pairs of speech data segments to be extracted includes a first speech data segment to be extracted and a second speech data segment to be extracted. The starting time of a first voice data segment to be extracted and the starting time of a second voice data segment to be extracted in the same voice data segment pair are the same, and the ending time of the first voice data segment to be extracted and the ending time of the second voice data segment to be extracted in the same voice data segment pair are the same.
For example, in the voice data segment pair D1, the start time of the first voice data segment to be extracted is 1 hour 0 minutes 0 seconds and its end time is 2 hours 0 minutes 0 seconds; the start time of the second voice data segment to be extracted is also 1 hour 0 minutes 0 seconds and its end time is also 2 hours 0 minutes 0 seconds. This example is not intended to be specifically limiting.
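For illustration only, the pairing of segments that share the same start and end time could be implemented as in the following Python sketch; the Segment structure (start and end times in seconds plus the raw samples) is an assumed representation and is not specified by the application.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Segment:
    start: float                                   # segment start time in seconds
    end: float                                     # segment end time in seconds
    samples: list = field(default_factory=list)    # raw speech samples of the segment

def pair_segments(first_segments: List[Segment],
                  second_segments: List[Segment]) -> List[Tuple[Segment, Segment]]:
    """Pair first- and second-direction segments with identical start and end times."""
    by_interval = {(seg.start, seg.end): seg for seg in second_segments}
    pairs = []
    for seg in first_segments:
        match = by_interval.get((seg.start, seg.end))
        if match is not None:                      # only intervals present in both directions form a pair
            pairs.append((seg, match))
    return pairs
```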
And S5, respectively inputting the first voice data segment to be extracted and the second voice data segment to be extracted of each voice data segment pair to be extracted into the single speaker voice extraction model at the same time for voice extraction, and obtaining a plurality of target speaker voice data segments output by the single speaker voice extraction model. That is, each target speaker voice data segment contains only the voice data of the target speaker, and each voice data segment pair to be extracted yields one target speaker voice data segment through the single speaker voice extraction model.
The first code transformation module and the second code transformation module output data to the speaker separation learning module, and the speaker separation learning module outputs data to the decoding transformation module.
It can be appreciated that the first transcoding module and the second transcoding module of the single speaker speech extraction model have the same structure, and the parameters of the first transcoding module and the second transcoding module of the single speaker speech extraction model are the same.
The first code conversion module is a module obtained by training the Encoder module of a TasNet (Time-domain Audio Separation Network, a single-channel real-time speech separation network), the second code conversion module is a module obtained by training the Encoder module of the TasNet network, the speaker separation learning module is a module obtained by training the Separation module of the TasNet network, and the decoding conversion module is a module obtained by training the Decoder module of the TasNet network.
The first code transformation module is used for code transformation. The second code transformation module is used for code transformation. The speaker separation learning module is used for performing speaker separation learning. The decoding transformation module is used for performing decoding transformation.
And S6, splicing each target speaker voice data segment in the voice data segments of the target speakers according to the time sequence, and taking the spliced voice data as target voice data of the target speakers. The target voice data of the target speaker only comprises the voice data of the target speaker and is the complete voice data of the target speaker in the same time period in the step S1, so that the comprehensiveness of business quality assessment is improved, the voice data of other speakers are removed, and the privacy safety of the other speakers is protected.
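A minimal sketch of the time-ordered splicing in S6 is given below, assuming each target speaker voice data segment is available as a (start_time, samples) pair; this representation is an assumption made for illustration.

```python
import numpy as np

def splice_target_segments(segments):
    """Concatenate target-speaker voice data segments in chronological order.

    `segments` is assumed to be an iterable of (start_time_seconds, samples) pairs;
    the concrete output format of the extraction model is not fixed by the text.
    """
    ordered = sorted(segments, key=lambda item: item[0])   # sort by start time
    if not ordered:
        return np.array([])
    return np.concatenate([samples for _, samples in ordered])
```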
In one embodiment, the step of acquiring the first to-be-processed voice data and the second to-be-processed voice data of the target speaker in the same time period includes:
s101: acquiring the voice signals of the target speaker in the first direction and the voice signals of the target speaker in the second direction in the same time period;
S102: segmenting the voice signal in the first direction by adopting a first preset duration to obtain a plurality of segmented voice signal segments in the first direction;
s103: inputting each segmented first-direction voice signal segment into a digital filter to obtain a plurality of filtered first-direction voice signal segments;
s104: performing discrete Fourier transform on each filtered first-direction voice signal segment to obtain a plurality of transformed first-direction voice signal segments;
s105: performing inverse discrete Fourier transform on the plurality of transformed first-direction voice signal segments to obtain noise-reduced first-direction voice data;
s106: segmenting the voice signal in the second direction by adopting the first preset duration to obtain a plurality of segmented voice signal segments in the second direction;
s107: inputting each segmented second-direction voice signal segment into a digital filter to obtain a plurality of filtered second-direction voice signal segments;
s108: performing discrete Fourier transform on each filtered second-direction voice signal segment to obtain a plurality of transformed second-direction voice signal segments;
S109: performing inverse discrete Fourier transform on the plurality of transformed second-direction voice signal segments to obtain noise-reduced second-direction voice data;
s110: pre-emphasis processing is carried out on the noise-reduced voice data in the first direction to obtain the first voice data to be processed;
s111: and pre-emphasis processing is carried out on the noise-reduced voice data in the second direction to obtain the second voice data to be processed.
The embodiment realizes the segmentation filtering, discrete Fourier transformation, inverse discrete Fourier transformation and pre-emphasis processing of the voice signals, thereby improving the voice data quality of the obtained first voice data to be processed and the second voice data to be processed and the accuracy of the target voice data of the determined target speaker.
For S101, the voice signal in the first direction is a voice signal recorded by the recording device in the first direction for the target speaker. The voice signal in the second direction is a voice signal recorded by the recording device in the second direction for the target speaker.
The recording device in the first direction and the recording device in the second direction can be independently arranged, and can also be integrated on the same electronic device. For example, the first direction recording device and the second direction recording device are integrated into the smart chest card, where the first direction recording device faces the mouth of the target speaker and the second direction recording device faces directly in front of the target speaker, and the examples are not limited specifically.
For S102, dividing the voice signal in the first direction into multiple segments of small voice signals by using a first preset duration, and taking each segment of small voice signal as a segmented voice signal segment in the first direction.
Optionally, the first preset duration is 20ms.
For S103, the digital filter may be a filter that can remove additive noise from the prior art, which is not described herein.
For S104, performing a discrete fourier transform on the filtered first-direction speech signal segment, where the filtered first-direction speech signal segment after the discrete fourier transform is used as a transformed first-direction speech signal segment.
And S105, sequencing the plurality of transformed first-direction voice signal segments according to the time sequence, performing inverse discrete Fourier transform on the sequenced plurality of transformed first-direction voice signal segments, and obtaining noise-reduced first-direction voice data after inverse discrete Fourier transform. The noise-reduced first-direction voice data is clean voice data.
For S106, the first preset duration is adopted to divide the voice signal in the second direction into multiple segments of small voice signals, and each segment of small voice signal is used as a segmented voice signal segment in the second direction.
For S107, the digital filter may be a filter that can remove additive noise from the prior art, which is not described herein.
And S108, performing discrete Fourier transform on the filtered second-direction voice signal segment, wherein the filtered second-direction voice signal segment after discrete Fourier transform is used as a transformed second-direction voice signal segment.
And S109, sequencing the plurality of transformed second direction voice signal segments according to the time sequence, performing inverse discrete Fourier transform on the sequenced plurality of transformed second direction voice signal segments, and obtaining noise-reduced second direction voice data after inverse discrete Fourier transform. The noise-reduced second-direction voice data is clean voice data.
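The chain of steps S102 to S109 can be pictured, for one direction, with the following Python sketch; the filter taps are placeholders, since the text does not specify the digital filter or any frequency-domain processing between the discrete Fourier transform and its inverse.

```python
import numpy as np
from scipy.signal import lfilter

def denoise_direction(signal: np.ndarray, sample_rate: int, segment_ms: int = 20) -> np.ndarray:
    """Sketch of the per-direction noise-reduction chain: segment the signal,
    pass each segment through a digital filter, apply a discrete Fourier
    transform, then reconstruct the time-domain data with the inverse transform."""
    seg_len = int(sample_rate * segment_ms / 1000)              # 20 ms per segment (first preset duration)
    b = np.ones(5) / 5.0                                        # placeholder FIR taps (assumption)
    cleaned = []
    for start in range(0, len(signal), seg_len):
        segment = signal[start:start + seg_len]
        filtered = lfilter(b, [1.0], segment)                   # digital filter
        spectrum = np.fft.rfft(filtered)                        # discrete Fourier transform
        cleaned.append(np.fft.irfft(spectrum, n=len(segment)))  # inverse transform back to time domain
    return np.concatenate(cleaned)                              # segments rejoined in time order
```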
For S110, pre-emphasis is a signal processing method for compensating the high frequency component of the input signal to compensate for the excessive attenuation of the high frequency component during transmission.
And performing pre-emphasis processing on the voice data subjected to noise reduction in the first direction by adopting a first-order FIR high-pass digital filter to obtain the first voice data to be processed.
Optionally, the pre-emphasis coefficient α of the first-order FIR high-pass digital filter is in the range 0.9 < α < 1.0.
Optionally, the pre-emphasis coefficient of the first-order FIR high-pass digital filter is 0.97.
And S111, performing pre-emphasis processing on the voice data subjected to the noise reduction in the second direction by adopting a first-order FIR high-pass digital filter to obtain the second voice data to be processed.
Optionally, the pre-emphasis coefficient α of the first-order FIR high-pass digital filter is in the range 0.9 < α < 1.0.
Optionally, the pre-emphasis coefficient of the first-order FIR high-pass digital filter is 0.97.
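As a sketch, the first-order FIR high-pass pre-emphasis described above amounts to y(n) = x(n) − α·x(n−1); the implementation below assumes α = 0.97, matching the optional value given.

```python
import numpy as np

def pre_emphasis(speech: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order FIR high-pass pre-emphasis: y(n) = x(n) - alpha * x(n-1)."""
    emphasized = np.empty_like(speech, dtype=float)
    emphasized[0] = speech[0]                          # first sample has no predecessor
    emphasized[1:] = speech[1:] - alpha * speech[:-1]
    return emphasized
```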
In one embodiment, the step of performing segmentation processing on the first to-be-processed voice data by using a preset segmentation method to obtain a plurality of first to-be-extracted voice data segments includes:
s21: framing the first voice data to be processed by adopting a second preset duration to obtain a plurality of first voice data frames to be processed;
s22: respectively carrying out voice energy calculation on each first voice data frame to be processed to obtain first voice energy corresponding to each of the plurality of first voice data frames to be processed;
s23: extracting, according to a preset quantity, first voice energies from the beginning of the first voice energies corresponding to the plurality of first voice data frames to be processed, to obtain a plurality of first initial voice energies;
S24: average value calculation is carried out on the plurality of first initial voice energies to obtain first background voice energy corresponding to the plurality of first voice data frames to be processed;
s25: respectively subtracting the first background voice energy from the first voice energy corresponding to each first voice data frame to be processed to obtain first voice energy difference values corresponding to the plurality of first voice data frames to be processed;
s26: respectively comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value;
s27: when the first voice energy difference value corresponding to the first voice data frame to be processed is larger than the voice energy threshold value, determining that the mute type of the first voice data frame to be processed corresponding to the first voice energy difference value is a non-mute frame;
s28: when the first voice energy difference value corresponding to the first voice data frame to be processed is smaller than or equal to the voice energy threshold value, determining that the mute type of the first voice data frame to be processed corresponding to the first voice energy difference value is a mute frame;
s29: and carrying out mute frame deletion processing on the plurality of first voice data frames to be processed by adopting a mute frame quantity threshold value and the mute category to obtain a plurality of first voice data segments to be extracted.
The embodiment realizes that frames are divided firstly, then the mute category of each frame is determined according to the voice energy of each frame, finally a plurality of first voice data segments to be extracted are obtained by deleting the mute frame according to the mute category, the number of voice data segments input into a single speaker voice extraction model is reduced, the voice extraction efficiency is improved, the mute duration in the finally obtained target voice data of the target speaker is reduced, and the efficiency of carrying out business quality assessment based on the target voice data of the target speaker is improved.
For S21, dividing the first to-be-processed voice data into multiple frames of voice data by using a second preset duration, and taking each frame of voice data as a first to-be-processed voice data frame. By dividing the voice data frame, the error of deleting the mute frame later is reduced, which is beneficial to further improving the accuracy of the target voice data of the target speaker.
Optionally, the second preset time period is 30ms.
And S22, performing voice energy calculation on the first voice data frame to be processed, and taking the calculated voice energy as first voice energy corresponding to the first voice data frame to be processed.
The first speech energy E_n is calculated as:
E_n = Σ_m [x(m) · w(m)]²
where x(m) is the first voice data frame to be processed and w(m) is a window function (a rectangular window corresponding to that frame); since the window is rectangular, the speech energy equals the sum of the squares of all speech samples in the frame.
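A minimal sketch of steps S22 to S28 follows, computing the per-frame energy with a rectangular window, estimating the background energy from the first frames, and labelling each frame; the concrete energy threshold is an assumed placeholder, since no value is given in the text.

```python
import numpy as np

def frame_energies(frames):
    """Short-time energy per frame with a rectangular window: E_n = sum_m [x(m)*w(m)]^2."""
    return [float(np.sum(np.square(frame))) for frame in frames]

def classify_silence(frames, n_background: int = 10, energy_threshold: float = 1e-3):
    """Label each frame as silence or non-silence (steps S23 to S28).

    The background energy is the mean energy of the first n_background frames
    (10 is the optional value mentioned below); energy_threshold is an assumption.
    """
    energies = frame_energies(frames)
    background = float(np.mean(energies[:n_background]))      # S23-S24: background voice energy
    labels = []
    for energy in energies:
        diff = energy - background                            # S25: frame energy minus background energy
        labels.append("non-silence" if diff > energy_threshold else "silence")  # S26-S28
    return labels
```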
For S23, extracting a preset number of first to-be-processed speech data frames from the beginning of the plurality of first to-be-processed speech data frames, and taking the extracted preset number of first to-be-processed speech data frames as a plurality of background to-be-processed speech data frames; and taking the first voice energy corresponding to each background voice data frame to be processed as first beginning voice energy.
Optionally, the preset number is 10.
And S24, carrying out average value calculation on the voice energy of the first beginning voice energy, and taking the average value of the voice energy obtained by calculation as first background voice energy corresponding to the voice data frames to be processed.
And for S25, subtracting the first background voice energy from the first voice energy corresponding to the first voice data frame to be processed to obtain a voice energy difference value, and taking the obtained voice energy difference value as a first voice energy difference value corresponding to the first voice data frame to be processed.
For S26, a speech energy threshold is obtained; and independently comparing the first voice energy difference value corresponding to each first voice data frame to be processed in the first voice energy difference values corresponding to the first voice data frames to be processed with a voice energy threshold value.
For S27, when the first speech energy difference value corresponding to the first to-be-processed speech data frame is greater than the speech energy threshold, it means that the first to-be-processed speech data frame corresponding to the first speech energy difference value has a larger difference from the background speech corresponding to the first background speech energy, and the target speaker and/or other speakers are speaking at this time, so that it may be determined that the silence type of the first to-be-processed speech data frame corresponding to the first speech energy difference value is a non-silence frame.
For S28, when the first speech energy difference value corresponding to the first to-be-processed speech data frame is less than or equal to the speech energy threshold, it means that the first to-be-processed speech data frame corresponding to the first speech energy difference value is not different from the background speech corresponding to the first background speech energy, and the target speaker and other speakers are not speaking at this time, so that it may be determined that the silence type of the first to-be-processed speech data frame corresponding to the first speech energy difference value is a silence frame.
And for S29, deleting those runs of consecutive first voice data frames to be processed whose silence category is silence frame and whose length exceeds the silence frame number threshold, and determining the plurality of first voice data segments to be extracted from the first voice data frames to be processed that remain after deletion. That is, the total voice duration of the plurality of first voice data segments to be extracted is less than or equal to the total voice duration of the plurality of first voice data frames to be processed.
Optionally, mute frame deletion processing is performed on the plurality of first voice data frames to be processed by adopting the silence frame number threshold and the silence category to obtain a plurality of first voice data frames to be combined; adjacent voice data frames among the first voice data frames to be combined are then merged in time sequence to obtain the plurality of first voice data segments to be extracted. This further reduces the number of voice data segments input into the single speaker voice extraction model and improves voice extraction efficiency. For example, suppose the first voice data frames to be processed, ordered in time, are voice data frame 1 to voice data frame 7, and voice data frame 3 and voice data frame 4 are deleted as silence frames. The first voice data frames to be combined are then voice data frame 1, voice data frame 2, voice data frame 5, voice data frame 6 and voice data frame 7. Merging adjacent frames in time sequence, voice data frame 1 and voice data frame 2 are combined, and voice data frame 5, voice data frame 6 and voice data frame 7 are combined, giving two first voice data segments to be extracted: the first contains voice data frame 1 and voice data frame 2, and the second contains voice data frame 5, voice data frame 6 and voice data frame 7. This example is not intended to be specifically limiting.
In one embodiment, the step of performing mute frame deletion processing on the plurality of first to-be-processed voice data frames by using the mute frame number threshold and the mute category to obtain the plurality of first to-be-extracted voice data segments includes:
s291: calculating the number of the silence frames of the first voice data frames to be processed according to time continuity to obtain the number of the first continuous silence frames;
s292: respectively comparing the number of each first continuous mute frame with the threshold value of the number of the mute frames;
s293: and deleting the first to-be-processed voice data frames corresponding to all the first continuous silence frame numbers larger than the silence frame number threshold from the first to-be-processed voice data frames when the first continuous silence frame numbers are larger than the silence frame number threshold, so as to obtain the first to-be-extracted voice data segments.
According to the embodiment, a plurality of first voice data segments to be extracted are obtained by deleting the mute frame according to the mute category, the number of voice data segments input into a single speaker voice extraction model is reduced, the voice extraction efficiency is improved, the mute duration in the finally obtained target voice data of the target speaker is reduced, and the efficiency of service quality assessment based on the target voice data of the target speaker is improved.
For S291, the plurality of first frames of voice data to be processed are ordered in chronological order; and calculating the number of the continuous silence frames of the sequenced multiple first voice data frames to be processed to obtain the number of multiple first continuous silence frames.
For example, if the silence categories of voice data frame 3, voice data frame 4 and voice data frame 6 are non-silence frames while voice data frame 1, voice data frame 2, voice data frame 5 and voice data frame 7 are silence frames, the first continuous silence frame count is 2 (voice data frame 1 and voice data frame 2), the second continuous silence frame count is 1 (voice data frame 5), and the third continuous silence frame count is 1 (voice data frame 7). This example is not intended to be specifically limiting.
For S292, each of the first continuous silence frame numbers is compared with a silence frame number threshold separately.
For S293, when the number of the first continuous silence frames is greater than the threshold of the number of silence frames, it means that the number of the first to-be-processed voice data frames corresponding to the number of the first continuous silence frames reaches a deletion condition, deleting the first to-be-processed voice data frames corresponding to all the number of the first continuous silence frames greater than the threshold of the number of silence frames from the plurality of first to-be-processed voice data frames, and obtaining the plurality of first to-be-extracted voice data segments according to the plurality of first to-be-processed voice data frames after deletion.
When the number of the first continuous silence frames is smaller than or equal to the silence frame number threshold, no processing is needed, so that excessive deletion is avoided and the speaking rate of the target speaker in the target voice data is not distorted.
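The deletion of over-long silence runs and the merging of the remaining, time-adjacent frames into segments (S29 and S291 to S293) could look as follows; deriving segment times from a fixed 30 ms frame length is an assumption made only for this illustration.

```python
def drop_long_silence_runs(frames, labels, max_silence_frames: int, frame_ms: int = 30):
    """Delete runs of consecutive silence frames longer than the threshold and merge
    the surviving, time-adjacent frames into segments.

    Returns a list of (start_seconds, end_seconds, frame_list) tuples."""
    surviving = []                                   # indices of frames that are kept
    i = 0
    while i < len(frames):
        if labels[i] == "silence":
            j = i
            while j < len(frames) and labels[j] == "silence":
                j += 1                               # length of the continuous silence run (S291)
            if j - i <= max_silence_frames:          # short runs stay, preserving the speaking rate
                surviving.extend(range(i, j))
            i = j                                    # over-long runs are deleted entirely (S293)
        else:
            surviving.append(i)
            i += 1

    segments = []                                    # merge frames with adjacent original indices
    for idx in surviving:
        if segments and idx == segments[-1][-1] + 1:
            segments[-1].append(idx)
        else:
            segments.append([idx])

    frame_s = frame_ms / 1000.0
    return [(seg[0] * frame_s, (seg[-1] + 1) * frame_s, [frames[k] for k in seg])
            for seg in segments]
```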
In one embodiment, the step of extracting the second to-be-processed voice data according to the plurality of first to-be-extracted voice data segments to obtain a plurality of second to-be-extracted voice data segments includes:
s31: extracting the starting time and the ending time of each first voice data segment to be extracted respectively to obtain a first starting time and a first ending time which correspond to each of the plurality of first voice data segments to be extracted;
s32: and respectively adopting a first starting time and a first ending time corresponding to each first voice data segment to be extracted to perform segmented extraction from the second voice data to be processed, so as to obtain a plurality of second voice data segments to be extracted.
The embodiment realizes the segmented extraction of the second voice data to be processed according to the plurality of first voice data segments to be extracted, thereby providing a data basis for extracting the voice data segment pairs to be extracted.
For S31, extracting a first speech data segment to be extracted from the plurality of first speech data segments to be extracted as a target first speech data segment to be extracted; acquiring a start time of a target first voice data segment to be extracted as a first start time corresponding to the target first voice data segment to be extracted, and acquiring an end time of the target first voice data segment to be extracted as a first end time corresponding to the target first voice data segment to be extracted; and repeatedly executing the step of extracting a first voice data segment to be extracted from the plurality of first voice data segments to be extracted as a target first voice data segment to be extracted until the first starting time and the first ending time corresponding to each of the plurality of first voice data segments to be extracted are determined.
For S32, extracting a first speech data segment to be extracted from the plurality of first speech data segments to be extracted as a target first speech data segment to be extracted; performing segmented extraction from the second voice data to be processed according to a first start time and a first end time corresponding to the first voice data segment to be extracted of the target, and taking the voice data obtained by segmented extraction as the second voice data segment to be extracted corresponding to the first voice data segment to be extracted of the target; and repeating the step of extracting a first voice data segment to be extracted from the plurality of first voice data segments to be extracted as a target first voice data segment to be extracted until determining second voice data segments to be extracted, which correspond to the plurality of first voice data segments to be extracted respectively.
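Cutting the second voice data to be processed at the start and end times of the first voice data segments to be extracted (S31 and S32) reduces, in effect, to index slicing, as in this hedged sketch; the time-to-sample conversion is an implementation detail not fixed by the text.

```python
import numpy as np

def extract_second_direction_segments(second_data: np.ndarray, sample_rate: int, segment_times):
    """Cut the second-direction voice data at the first-direction segment boundaries.

    `segment_times` is assumed to be a list of (start_seconds, end_seconds) pairs."""
    segments = []
    for start_s, end_s in segment_times:
        start_idx = int(round(start_s * sample_rate))
        end_idx = int(round(end_s * sample_rate))
        segments.append(second_data[start_idx:end_idx])   # same time span as the first-direction segment
    return segments
```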
In one embodiment, the step of respectively inputting each voice data segment pair to be extracted into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments includes:
s51: inputting the first voice data segment to be extracted of the voice data segment pair to the first coding conversion module of the single speaker voice extraction model for coding conversion to obtain a first coding conversion result;
S52: inputting the second voice data segment to be extracted of the voice data segment pair to the second coding conversion module of the single speaker voice extraction model for coding conversion to obtain a second coding conversion result;
s53: calling the speaker separation learning module of the single speaker voice extraction model to perform speaker separation learning on the first code transformation result and the second code transformation result to obtain a target mask matrix;
s54: invoking the decoding transformation module of the single speaker voice extraction model to perform decoding transformation on the target mask matrix to obtain the target speaker voice data segment corresponding to the voice data segment to be extracted;
s55: and repeatedly executing the step of inputting the first voice data segment to be extracted of the voice data segment pair to the first coding conversion module of the single speaker voice extraction model to carry out coding conversion to obtain a first coding conversion result until all the voice data segment pairs to be extracted respectively correspond to the voice data segments of the target speaker.
The embodiment realizes that the first voice data segment to be extracted and the second voice data segment to be extracted of the voice data segment pair to be extracted are simultaneously input into the single speaker voice extraction model to carry out voice extraction, thereby realizing rapid, accurate and automatic extraction of the speaking voice of the target speaker.
For S51, the first voice data segment to be extracted of the voice data segment pair to be extracted is input into the first coding conversion module of the single speaker voice extraction model for coding conversion to obtain the first coding conversion result; that is, the first coding conversion module of the single speaker voice extraction model is trained with voice signals in the first direction.
For S52, the second voice data segment to be extracted of the voice data segment pair to be extracted is input into the second coding conversion module of the single speaker voice extraction model for coding conversion to obtain the second coding conversion result; that is, the second coding conversion module of the single speaker voice extraction model is trained with voice signals in the second direction.
For S53, the target mask matrix refers to a mask matrix of the target speaker.
For S54, the decoding transformation module of the single speaker voice extraction model is invoked to perform decoding transformation on the target mask matrix so as to restore the waveform, thereby obtaining the target speaker voice data segment corresponding to the voice data segment pair to be extracted.
For S55, steps S51 to S55 are repeatedly performed until the target speaker voice data segments corresponding to all of the voice data segment pairs to be extracted are determined.
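For illustration only, the loop over S51–S55 can be summarized with a toy dual-input model in PyTorch. The class DualInputExtractor below is a hypothetical stand-in whose layer types and sizes are assumptions; it is not the TasNet-based model the patent trains, but it mirrors the four modules (two encoders, a separation network producing a mask, and a decoder) and the per-pair inference loop.

```python
import torch
import torch.nn as nn

class DualInputExtractor(nn.Module):
    """Toy stand-in for the single speaker voice extraction model: two encoders,
    a separation network that predicts a mask, and a decoder that restores a waveform."""
    def __init__(self, n_filters=64, kernel_size=16, stride=8):
        super().__init__()
        self.encoder_a = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        self.encoder_b = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        self.separator = nn.Sequential(
            nn.Conv1d(2 * n_filters, n_filters, 1), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, 1), nn.Sigmoid())        # mask values in [0, 1]
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, seg_a, seg_b):
        feat_a = self.encoder_a(seg_a)                               # S51: encode first-direction segment
        feat_b = self.encoder_b(seg_b)                               # S52: encode second-direction segment
        mask = self.separator(torch.cat([feat_a, feat_b], dim=1))    # S53: target mask matrix
        return self.decoder(mask * feat_a)                           # S54: decode the masked features

model = DualInputExtractor().eval()
segment_pairs = [(torch.randn(1, 1, 16000), torch.randn(1, 1, 16000)) for _ in range(3)]

target_segments = []
with torch.no_grad():
    for seg_a, seg_b in segment_pairs:                               # S55: repeat for every pair
        target_segments.append(model(seg_a, seg_b).squeeze())
```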
In one embodiment, before the step of inputting each of the to-be-extracted voice data segment pairs into the single speaker voice extraction model for voice extraction to obtain the plurality of target speaker voice data segments, the method further includes:
s051: obtaining a plurality of training samples, the training samples comprising: voice sample data in a first direction, voice sample data in a second direction and voice calibration data;
s052: inputting the voice sample data of the training sample in the first direction into a first to-be-trained coding conversion module of a to-be-trained voice extraction model and the voice sample data in the second direction into a second to-be-trained coding conversion module of the to-be-trained voice extraction model, and obtaining single speaker training data output by the to-be-trained voice extraction model, wherein the to-be-trained voice extraction model is a model obtained by modifying the TasNet network;
s053: inputting the voice calibration data and the single speaker training data into a loss function for calculation to obtain a loss value of the voice extraction model to be trained, updating parameters of the voice extraction model to be trained according to the loss value, and using the updated voice extraction model to be trained for calculating the single speaker training data next time;
S054: repeating the steps until the loss value reaches a first convergence condition or the iteration number reaches a second convergence condition, and determining the speech extraction model to be trained, of which the loss value reaches the first convergence condition or the iteration number reaches the second convergence condition, as the single speaker speech extraction model.
In this embodiment, the single speaker voice extraction model is obtained based on TasNet network training, which provides the basis for the subsequent separation of the voice data of a single speaker.
For S051, each training sample includes voice sample data in the first direction, voice sample data in the second direction, and voice calibration data.
The voice calibration data is voice data of a single speaker calibrated by voice sample data in a first direction and voice sample data in a second direction.
The voice sample data in the first direction, the voice sample data in the second direction and the voice calibration data are all voice data in the time domain.
For S052, the voice extraction model to be trained includes a first code transformation module to be trained, a second code transformation module to be trained, a speaker separation learning module to be trained and a decoding transformation module to be trained; the first code transformation module to be trained and the second code transformation module to be trained are both connected with the speaker separation learning module to be trained, and the speaker separation learning module to be trained is connected with the decoding transformation module to be trained.
Optionally, the first to-be-trained code conversion module and the second to-be-trained code conversion module both adopt an Encoder module of the TasNet network, the to-be-trained speaker Separation learning module adopts a Separation module of the TasNet network, and the to-be-trained decoding conversion module adopts a Decoder module of the TasNet network.
The Encoder module of the TasNet network comprises a convolution layer with a 1*1 convolution kernel, a regularization layer and a fully connected layer.
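A minimal PyTorch sketch of such an encoder block is given below, built only from the three components listed above (a 1*1 convolution, a normalization layer standing in for the regularization layer, and a fully connected layer); the channel sizes are illustrative assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Hypothetical encoder block: 1x1 convolution, normalization, fully connected layer."""
    def __init__(self, in_channels=64, hidden_channels=128, out_features=128):
        super().__init__()
        self.conv1x1 = nn.Conv1d(in_channels, hidden_channels, kernel_size=1)
        self.norm = nn.GroupNorm(1, hidden_channels)    # global normalization over the channels
        self.fc = nn.Linear(hidden_channels, out_features)

    def forward(self, x):                 # x: (batch, channels, frames)
        y = self.norm(self.conv1x1(x))
        y = y.transpose(1, 2)             # (batch, frames, channels) for the linear layer
        return self.fc(y)

features = EncoderBlock()(torch.randn(2, 64, 100))
print(features.shape)                     # torch.Size([2, 100, 128])
```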
For S053, the loss function SI-SNR (scale-invariant signal-to-noise ratio) is:

$$s_{target}=\frac{\langle \hat{s}, s\rangle}{\|s\|^{2}}\,s,\qquad e_{noise}=\hat{s}-s_{target},\qquad \mathrm{SI\text{-}SNR}=10\log_{10}\frac{\|s_{target}\|^{2}}{\|e_{noise}\|^{2}},$$

where $\hat{s}$ represents the single speaker training data, $s$ represents the speech calibration data, $\langle \hat{s}, s\rangle$ is the dot product of the vector $\hat{s}$ and the vector $s$, $\log(\cdot)$ refers to the logarithmic function, $\|s\|$ is the second norm of the speech calibration data, $\|s_{target}\|$ is the second norm of $s_{target}$, and $\|e_{noise}\|$ is the second norm of $e_{noise}$.
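A compact PyTorch implementation of this loss is sketched below as one possible reading of the formula; the mean subtraction, the small eps term and the negation of SI-SNR for minimization are common practical choices, not steps stated in the patent.

```python
import torch

def si_snr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SNR between the single speaker training data (estimate)
    and the speech calibration data (target); both are 1-D time-domain tensors."""
    estimate = estimate - estimate.mean()          # zero-mean, a common practical step
    target = target - target.mean()
    s_target = (torch.dot(estimate, target) / (target.pow(2).sum() + eps)) * target
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))
    return -si_snr                                 # minimizing the loss maximizes SI-SNR

loss = si_snr_loss(torch.randn(16000), torch.randn(16000))
```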
And for S054, repeatedly executing the steps S052 to S054 until the loss value reaches a first convergence condition or the iteration number reaches a second convergence condition, and determining the speech extraction model to be trained, of which the loss value reaches the first convergence condition or the iteration number reaches the second convergence condition, as the single speaker speech extraction model.
The first convergence condition means that the magnitudes of the loss values calculated in two adjacent iterations satisfy the Lipschitz condition (Lipschitz continuity condition).
The number of iterations in the second convergence condition refers to the number of times the to-be-trained voice extraction model has been used to calculate the single speaker training data; each such calculation increases the iteration count by 1.
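Steps S052 to S054 can be summarized with a hypothetical training loop. It reuses the toy DualInputExtractor and si_snr_loss from the sketches above, approximates the first convergence condition by a simple threshold on the change of the mean loss, and bounds the iteration count for the second convergence condition; the optimizer, learning rate, tolerance and dummy data are assumptions.

```python
import torch

def train(model, samples, max_iterations=1000, tol=1e-4, lr=1e-3):
    """samples: list of (first_direction, second_direction, calibration) 1-D waveforms."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    previous_loss = None
    for _ in range(max_iterations):                       # second convergence condition
        total = 0.0
        for first_dir, second_dir, calibration in samples:
            estimate = model(first_dir.view(1, 1, -1), second_dir.view(1, 1, -1)).flatten()
            loss = si_snr_loss(estimate, calibration)     # S053: compare with calibration data
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                              # updated model is reused next time
            total += loss.item()
        mean_loss = total / len(samples)
        if previous_loss is not None and abs(previous_loss - mean_loss) < tol:
            break                                         # simplified first convergence condition
        previous_loss = mean_loss
    return model                                          # S054: the single speaker model

samples = [(torch.randn(16000), torch.randn(16000), torch.randn(16000)) for _ in range(4)]
single_speaker_model = train(DualInputExtractor(), samples)
```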
Referring to fig. 2, the present application also proposes a voice extraction apparatus for a target speaker, the apparatus comprising:
a voice data obtaining module 100, configured to obtain first to-be-processed voice data and second to-be-processed voice data of a target speaker in a same time period, where the first to-be-processed voice data is voice data obtained according to a voice signal in a first direction, and the second to-be-processed voice data is voice data obtained according to a voice signal in a second direction;
the first segmentation processing module 200 is configured to perform segmentation processing on the first to-be-processed voice data by using a preset segmentation method, so as to obtain a plurality of first to-be-extracted voice data segments;
the second segment extraction module 300 is configured to segment extract the second to-be-processed voice data according to the plurality of first to-be-extracted voice data segments, so as to obtain a plurality of second to-be-extracted voice data segments;
The to-be-extracted voice data segment pair determining module 400 is configured to perform data extraction on the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments at the same time to obtain a plurality of to-be-extracted voice data segment pairs;
the target speaker voice data segment determining module 500 is configured to input each of the to-be-extracted voice data segment pairs into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, where the single speaker voice extraction model includes a first coding transformation module, a second coding transformation module, a speaker separation learning module and a decoding transformation module, and the single speaker voice extraction model is a model obtained based on TasNet network training;
the target voice data determining module 600 is configured to splice the plurality of target speaker voice data segments in time sequence to obtain target voice data of the target speaker.
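For completeness, the splicing performed by the target voice data determining module 600 amounts to ordering the extracted segments by their start times and concatenating them. A minimal NumPy sketch follows; the function name splice_segments and the dummy data are assumptions.

```python
import numpy as np

def splice_segments(segments, start_times):
    """Concatenate the target speaker voice data segments in time order."""
    ordered = [seg for _, seg in sorted(zip(start_times, segments), key=lambda pair: pair[0])]
    return np.concatenate(ordered)

target_voice_data = splice_segments(
    segments=[np.random.randn(16000), np.random.randn(24000)],   # two extracted segments
    start_times=[2.5, 0.0])                                      # their start times in seconds
```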
According to the present application, the first to-be-processed voice data of the target speaker in the first direction and the second to-be-processed voice data in the second direction within the same time period are subjected to segmentation processing and to data extraction at the same time, so as to obtain a plurality of to-be-extracted voice data segment pairs; each to-be-extracted voice data segment pair is then input into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training; finally, the plurality of target speaker voice data segments are spliced in time sequence to obtain the target voice data of the target speaker. The speech of the target speaker is thereby extracted quickly, accurately and automatically, the cost of business quality assessment is reduced, the comprehensiveness of business quality assessment is improved through the complete voice data of the target speaker, and the privacy security of the voice data of other speakers is protected.
Referring to fig. 3, in an embodiment of the present application, there is further provided a computer device, which may be a server, and an internal structure thereof may be as shown in fig. 3. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data involved in the voice extraction method for a target speaker. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement a voice extraction method for a target speaker. The voice extraction method for the target speaker comprises the following steps: acquiring first to-be-processed voice data and second to-be-processed voice data of a target speaker in the same time period, wherein the first to-be-processed voice data is voice data obtained according to voice signals in a first direction, and the second to-be-processed voice data is voice data obtained according to voice signals in a second direction; carrying out segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted; carrying out segmented extraction on the second voice data to be processed according to the plurality of first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted; carrying out data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted; respectively inputting each voice data segment pair to be extracted into a single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model comprises a first coding transformation module, a second coding transformation module, a speaker separation learning module and a decoding transformation module, and the single speaker voice extraction model is a model obtained based on TasNet network training; and splicing the plurality of target speaker voice data segments in time sequence to obtain the target voice data of the target speaker.
According to the computer device described above, the first to-be-processed voice data of the target speaker in the first direction and the second to-be-processed voice data in the second direction within the same time period are subjected to segmentation processing and to data extraction at the same time, so as to obtain a plurality of to-be-extracted voice data segment pairs; each to-be-extracted voice data segment pair is then input into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training; finally, the plurality of target speaker voice data segments are spliced in time sequence to obtain the target voice data of the target speaker. The speech of the target speaker is thereby extracted quickly, accurately and automatically, the cost of business quality assessment is reduced, the comprehensiveness of business quality assessment is improved through the complete voice data of the target speaker, and the privacy security of the voice data of other speakers is protected.
An embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for extracting speech for a target speaker, including the steps of: acquiring first to-be-processed voice data and second to-be-processed voice data of a target speaker in the same time period, wherein the first to-be-processed voice data is voice data obtained according to voice signals in a first direction, and the second to-be-processed voice data is voice data obtained according to voice signals in a second direction; carrying out segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted; the second voice data to be processed are extracted in a segmented mode according to the plurality of first voice data segments to be extracted, and a plurality of second voice data segments to be extracted are obtained; carrying out data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted; and respectively carrying out voice extraction on each voice data segment to be extracted by inputting a single speaker voice extraction model to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model comprises the following steps: the system comprises a first coding transformation module, a second coding transformation module, a speaker separation learning module and a decoding transformation module, wherein the single speaker voice extraction model is a model obtained based on TasNet network training; and splicing the voice data segments of the target speakers in time sequence to obtain the target voice data of the target speakers.
According to the voice extraction method for the target speaker, the first to-be-processed voice data of the target speaker in the first direction and the second to-be-processed voice data in the second direction within the same time period are subjected to segmentation processing and to data extraction at the same time, so as to obtain a plurality of to-be-extracted voice data segment pairs; each to-be-extracted voice data segment pair is then input into the single speaker voice extraction model for voice extraction to obtain a plurality of target speaker voice data segments, the single speaker voice extraction model being a model obtained based on TasNet network training; finally, the plurality of target speaker voice data segments are spliced in time sequence to obtain the target voice data of the target speaker. The speech of the target speaker is thereby extracted quickly, accurately and automatically, the cost of business quality assessment is reduced, the comprehensiveness of business quality assessment is improved through the complete voice data of the target speaker, and the privacy security of the voice data of other speakers is protected.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided by the present application may include non-volatile and/or volatile memory. The non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application or directly or indirectly applied to other related technical fields are included in the scope of the application.

Claims (9)

1. A method of speech extraction for a targeted speaker, the method comprising:
acquiring first to-be-processed voice data and second to-be-processed voice data of a target speaker in the same time period, wherein the first to-be-processed voice data is voice data obtained according to voice signals in a first direction, and the second to-be-processed voice data is voice data obtained according to voice signals in a second direction;
Carrying out segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted;
the second voice data to be processed are extracted in a segmented mode according to the plurality of first voice data segments to be extracted, and a plurality of second voice data segments to be extracted are obtained;
carrying out data extraction on the plurality of first voice data segments to be extracted and the plurality of second voice data segments to be extracted at the same time to obtain a plurality of voice data segment pairs to be extracted;
and respectively carrying out voice extraction on each voice data segment to be extracted by inputting a single speaker voice extraction model to obtain a plurality of target speaker voice data segments, wherein the single speaker voice extraction model comprises the following steps: the system comprises a first coding transformation module, a second coding transformation module, a speaker separation learning module and a decoding transformation module, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;
splicing the voice data segments of the target speakers according to time sequence to obtain target voice data of the target speakers;
the step of segmenting the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted comprises the following steps:
Framing the first voice data to be processed by adopting a second preset duration to obtain a plurality of first voice data frames to be processed;
respectively carrying out voice energy calculation on each first voice data frame to be processed to obtain first voice energy corresponding to each of the plurality of first voice data frames to be processed;
extracting the first voice energy from the first voice energy corresponding to each of the plurality of first voice data frames to be processed according to a preset quantity to obtain a plurality of first beginning voice energy;
average value calculation is carried out on the plurality of first initial voice energy to obtain first background voice energy corresponding to the plurality of first voice data frames to be processed;
respectively subtracting the first voice energy corresponding to each first voice data frame to be processed from the first background voice energy to obtain first voice energy difference values corresponding to the plurality of first voice data frames to be processed;
respectively comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value;
when the first voice energy difference value corresponding to the first voice data frame to be processed is larger than the voice energy threshold value, determining that the mute type of the first voice data frame to be processed corresponding to the first voice energy difference value is a non-mute frame;
When the first voice energy difference value corresponding to the first voice data frame to be processed is smaller than or equal to the voice energy threshold value, determining that the mute type of the first voice data frame to be processed corresponding to the first voice energy difference value is a mute frame;
and carrying out mute frame deletion processing on the plurality of first voice data frames to be processed by adopting a mute frame quantity threshold value and the mute category to obtain a plurality of first voice data segments to be extracted.
2. The voice extraction method for a target speaker according to claim 1, wherein the step of acquiring the first voice data to be processed and the second voice data to be processed of the target speaker in the same period of time comprises:
acquiring the voice signals of the target speaker in the first direction and the voice signals of the target speaker in the second direction in the same time period;
segmenting the voice signal in the first direction by adopting a first preset duration to obtain a plurality of segmented voice signal segments in the first direction;
inputting each segmented first-direction voice signal segment into a digital filter to obtain a plurality of filtered first-direction voice signal segments;
Performing discrete Fourier transform on each filtered first-direction voice signal segment to obtain a plurality of transformed first-direction voice signal segments;
performing inverse discrete Fourier transform on the plurality of transformed first-direction voice signal segments to obtain noise-reduced first-direction voice data;
segmenting the voice signal in the second direction by adopting the first preset duration to obtain a plurality of segmented voice signal segments in the second direction;
inputting each segmented second-direction voice signal segment into a digital filter to obtain a plurality of filtered second-direction voice signal segments;
performing discrete Fourier transform on each filtered second-direction voice signal segment to obtain a plurality of transformed second-direction voice signal segments;
performing inverse discrete Fourier transform on the plurality of transformed second-direction voice signal segments to obtain noise-reduced second-direction voice data;
pre-emphasis processing is carried out on the noise-reduced voice data in the first direction to obtain the first voice data to be processed;
and pre-emphasis processing is carried out on the noise-reduced voice data in the second direction to obtain the second voice data to be processed.
3. The method for extracting voice for a target speaker according to claim 2, wherein the step of performing mute frame deletion processing on the plurality of first to-be-processed voice data frames using a mute frame number threshold and the mute category to obtain the plurality of first to-be-extracted voice data segments includes:
calculating the number of the silence frames of the first voice data frames to be processed according to time continuity to obtain the number of the first continuous silence frames;
respectively comparing the number of each first continuous mute frame with the threshold value of the number of the mute frames;
and deleting the first to-be-processed voice data frames corresponding to all the first continuous silence frame numbers larger than the silence frame number threshold from the first to-be-processed voice data frames when the first continuous silence frame numbers are larger than the silence frame number threshold, so as to obtain the first to-be-extracted voice data segments.
4. The method for extracting voice for a target speaker according to claim 1, wherein the step of extracting the second voice data to be processed according to the plurality of first voice data segments to be extracted in a segmented manner to obtain a plurality of second voice data segments to be extracted includes:
Extracting the starting time and the ending time of each first voice data segment to be extracted respectively to obtain a first starting time and a first ending time which correspond to each of the plurality of first voice data segments to be extracted;
and respectively adopting a first starting time and a first ending time corresponding to each first voice data segment to be extracted to perform segmented extraction from the second voice data to be processed, so as to obtain a plurality of second voice data segments to be extracted.
5. The method for extracting speech of a target speaker according to claim 1, wherein the step of extracting speech of each of the speech data segments to be extracted from the input single speaker speech extraction model to obtain a plurality of speech data segments of the target speaker comprises:
inputting the first voice data segment to be extracted of the voice data segment pair to the first coding conversion module of the single speaker voice extraction model for coding conversion to obtain a first coding conversion result;
inputting the second voice data segment to be extracted of the voice data segment pair to the second coding conversion module of the single speaker voice extraction model for coding conversion to obtain a second coding conversion result;
Calling the speaker separation learning module of the single speaker voice extraction model to perform speaker separation learning on the first code transformation result and the second code transformation result to obtain a target mask matrix;
invoking the decoding transformation module of the single speaker voice extraction model to perform decoding transformation on the target mask matrix to obtain the target speaker voice data segment corresponding to the voice data segment to be extracted;
and repeatedly executing the step of inputting the first voice data segment to be extracted of the voice data segment pair to the first coding conversion module of the single speaker voice extraction model to carry out coding conversion to obtain a first coding conversion result until all the voice data segment pairs to be extracted respectively correspond to the voice data segments of the target speaker.
6. The method for extracting speech of a target speaker according to claim 1, wherein before the step of extracting speech of each of the speech data segments to be extracted from the input single speaker speech extraction model to obtain a plurality of speech data segments of the target speaker, respectively, the method comprises:
Obtaining a plurality of training samples, the training samples comprising: voice sample data in a first direction, voice sample data in a second direction and voice calibration data;
inputting the voice sample data of the training sample in the first direction into a first to-be-trained coding conversion module of a to-be-trained voice extraction model and the voice sample data in the second direction into a second to-be-trained coding conversion module of the to-be-trained voice extraction model, and obtaining single speaker training data output by the to-be-trained voice extraction model, wherein the to-be-trained voice extraction model is a model obtained by modifying the TasNet network;
inputting the voice calibration data and the single speaker training data into a loss function for calculation to obtain a loss value of the voice extraction model to be trained, updating parameters of the voice extraction model to be trained according to the loss value, and using the updated voice extraction model to be trained for calculating the single speaker training data next time;
repeating the steps until the loss value reaches a first convergence condition or the iteration number reaches a second convergence condition, and determining the speech extraction model to be trained, of which the loss value reaches the first convergence condition or the iteration number reaches the second convergence condition, as the single speaker speech extraction model.
7. A speech extraction apparatus for a targeted speaker, the apparatus comprising:
the voice data acquisition module is used for acquiring first voice data to be processed and second voice data to be processed of a target speaker in the same time period, wherein the first voice data to be processed is voice data obtained according to voice signals in a first direction, and the second voice data to be processed is voice data obtained according to voice signals in a second direction;
the first segmentation processing module is used for carrying out segmentation processing on the first voice data to be processed by adopting a preset segmentation method to obtain a plurality of first voice data segments to be extracted;
framing the first voice data to be processed by adopting a second preset duration to obtain a plurality of first voice data frames to be processed;
respectively carrying out voice energy calculation on each first voice data frame to be processed to obtain first voice energy corresponding to each of the plurality of first voice data frames to be processed;
extracting the first voice energy from the first voice energy corresponding to each of the plurality of first voice data frames to be processed according to a preset quantity to obtain a plurality of first beginning voice energy;
Average value calculation is carried out on the plurality of first initial voice energy to obtain first background voice energy corresponding to the plurality of first voice data frames to be processed;
respectively subtracting the first voice energy corresponding to each first voice data frame to be processed from the first background voice energy to obtain first voice energy difference values corresponding to the plurality of first voice data frames to be processed;
respectively comparing the first voice energy difference value corresponding to each first voice data frame to be processed with a voice energy threshold value;
when the first voice energy difference value corresponding to the first voice data frame to be processed is larger than the voice energy threshold value, determining that the mute type of the first voice data frame to be processed corresponding to the first voice energy difference value is a non-mute frame;
when the first voice energy difference value corresponding to the first voice data frame to be processed is smaller than or equal to the voice energy threshold value, determining that the mute type of the first voice data frame to be processed corresponding to the first voice energy difference value is a mute frame;
performing mute frame deletion processing on the plurality of first voice data frames to be processed by adopting a mute frame quantity threshold and the mute category to obtain a plurality of first voice data segments to be extracted;
The second segmentation extraction module is used for carrying out segmentation extraction on the second voice data to be processed according to the plurality of first voice data segments to be extracted to obtain a plurality of second voice data segments to be extracted;
the to-be-extracted voice data segment pair determining module is used for carrying out data extraction on the plurality of first to-be-extracted voice data segments and the plurality of second to-be-extracted voice data segments at the same time to obtain a plurality of to-be-extracted voice data segment pairs;
the target speaker voice data segment determining module is configured to perform voice extraction on each to-be-extracted voice data segment to an input single speaker voice extraction model to obtain a plurality of target speaker voice data segments, where the single speaker voice extraction model includes: the system comprises a first coding transformation module, a second coding transformation module, a speaker separation learning module and a decoding transformation module, wherein the single speaker voice extraction model is a model obtained based on TasNet network training;
and the target voice data determining module is used for splicing the voice data segments of the target speakers in time sequence to obtain target voice data of the target speakers.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202011545184.1A 2020-12-23 2020-12-23 Speech extraction method, device, equipment and medium for target speaker Active CN112712790B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011545184.1A CN112712790B (en) 2020-12-23 2020-12-23 Speech extraction method, device, equipment and medium for target speaker

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011545184.1A CN112712790B (en) 2020-12-23 2020-12-23 Speech extraction method, device, equipment and medium for target speaker

Publications (2)

Publication Number Publication Date
CN112712790A CN112712790A (en) 2021-04-27
CN112712790B true CN112712790B (en) 2023-08-15

Family

ID=75543939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011545184.1A Active CN112712790B (en) 2020-12-23 2020-12-23 Speech extraction method, device, equipment and medium for target speaker

Country Status (1)

Country Link
CN (1) CN112712790B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345464A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Voice extraction method, system, device and storage medium
CN113724694B (en) * 2021-11-01 2022-03-08 深圳市北科瑞声科技股份有限公司 Voice conversion model training method and device, electronic equipment and storage medium
CN115019804B (en) * 2022-08-03 2022-11-01 北京惠朗时代科技有限公司 Multi-verification type voiceprint recognition method and system for multi-employee intensive sign-in

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102522081A (en) * 2011-12-29 2012-06-27 北京百度网讯科技有限公司 Method for detecting speech endpoints and system
CN110428842A (en) * 2019-08-13 2019-11-08 广州国音智能科技有限公司 Speech model training method, device, equipment and computer readable storage medium
CN111243576A (en) * 2020-01-16 2020-06-05 腾讯科技(深圳)有限公司 Speech recognition and model training method, device, equipment and storage medium
CN111524525A (en) * 2020-04-28 2020-08-11 平安科技(深圳)有限公司 Original voice voiceprint recognition method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5519689B2 (en) * 2009-10-21 2014-06-11 パナソニック株式会社 Sound processing apparatus, sound processing method, and hearing aid
CN106683680B (en) * 2017-03-10 2022-03-25 百度在线网络技术(北京)有限公司 Speaker recognition method and device, computer equipment and computer readable medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102522081A (en) * 2011-12-29 2012-06-27 北京百度网讯科技有限公司 Method for detecting speech endpoints and system
CN110428842A (en) * 2019-08-13 2019-11-08 广州国音智能科技有限公司 Speech model training method, device, equipment and computer readable storage medium
CN111243576A (en) * 2020-01-16 2020-06-05 腾讯科技(深圳)有限公司 Speech recognition and model training method, device, equipment and storage medium
CN111524525A (en) * 2020-04-28 2020-08-11 平安科技(深圳)有限公司 Original voice voiceprint recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112712790A (en) 2021-04-27

Similar Documents

Publication Publication Date Title
CN112712790B (en) Speech extraction method, device, equipment and medium for target speaker
CN109326299B (en) Speech enhancement method, device and storage medium based on full convolution neural network
DE69831288T2 (en) Sound processing adapted to ambient noise
CN108922544B (en) Universal vector training method, voice clustering method, device, equipment and medium
CN110930976B (en) Voice generation method and device
CN112581973B (en) Voice enhancement method and system
CN109616139A (en) Pronunciation signal noise power spectral density estimation method and device
CN112053695A (en) Voiceprint recognition method and device, electronic equipment and storage medium
DE69911169T2 (en) METHOD FOR DECODING AN AUDIO SIGNAL WITH CORRECTION OF TRANSMISSION ERRORS
CN111429932A (en) Voice noise reduction method, device, equipment and medium
CN109658943B (en) Audio noise detection method and device, storage medium and mobile terminal
CN113177630B (en) Data memory elimination method and device for deep learning model
CN116631412A (en) Method for judging voice robot through voiceprint matching
CN114694674A (en) Speech noise reduction method, device and equipment based on artificial intelligence and storage medium
WO2020015546A1 (en) Far-field speech recognition method, speech recognition model training method, and server
Parmar et al. Comparison of performance of the features of speech signal for non-intrusive speech quality assessment
CN111028852A (en) Noise removing method in intelligent calling system based on CNN
CN113889085A (en) Speech recognition method, apparatus, device, storage medium and program product
JP7184236B2 (en) Voiceprint Recognition Method, Apparatus, Equipment, and Storage Medium
CN114827363A (en) Method, device and readable storage medium for eliminating echo in call process
JPS628800B2 (en)
CN110958417B (en) Method for removing compression noise of video call video based on voice clue
CN111883154B (en) Echo cancellation method and device, computer-readable storage medium, and electronic device
CN113516987A (en) Speaker recognition method, device, storage medium and equipment
CN110648681A (en) Voice enhancement method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant