CN111754982A - Noise elimination method and device for voice call, electronic equipment and storage medium

Info

Publication number: CN111754982A
Application number: CN202010570483.4A
Authority: CN (China)
Prior art keywords: voice, speaker, detected, category, feature set
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 孙岩丹, 王瑞璋, 马骏, 王少军, 肖京
Current Assignee: Ping An Technology Shenzhen Co Ltd
Original Assignee: Ping An Technology Shenzhen Co Ltd

Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202010570483.4A
Publication of CN111754982A
Priority to PCT/CN2020/121571 (WO2021151310A1)

Classifications

All classifications fall under G PHYSICS › G10 MUSICAL INSTRUMENTS; ACOUSTICS › G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING:

    • G10L15/05 — Speech recognition; segmentation; word boundary detection
    • G10L17/02 — Speaker identification or verification; preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L17/20 — Speaker identification or verification; pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L21/0216 — Speech enhancement; noise filtering characterised by the method used for estimating noise
    • G10L21/0232 — Noise filtering; processing in the frequency domain
    • G10L25/24 — Speech or voice analysis characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 — Speech or voice analysis characterised by the analysis technique
    • G10L25/30 — Analysis technique using neural networks
    • G10L25/45 — Speech or voice analysis characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the invention relates to voiceprint recognition technology and discloses a noise elimination method for voice calls, which comprises the following steps: performing voice endpoint detection on the call audio to obtain a human voice set; performing voice feature extraction on the human voice set to obtain a voice feature set; intercepting from the voice feature set, in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, clustering each to-be-detected voice feature set, and scoring the clustering results; and, according to the scores, dividing the human voice set into a first speaker voice and a second speaker voice, identifying the background speaker voice between them, and deleting the background speaker voice from the human voice set. The invention also relates to blockchain technology, and the call audio can be stored in a blockchain. The invention can delete background human voices in a voice call, thereby improving voice call quality.

Description

Noise elimination method and device for voice call, electronic equipment and storage medium
Technical Field
The present invention relates to the field of voiceprint recognition technologies, and in particular, to a method and an apparatus for eliminating noise in a voice call, an electronic device, and a computer-readable storage medium.
Background
Customer service systems, particularly intelligent outbound systems, often need to face background noise interference from the environment in which the customer is located. Among all noise sources, background human speech is the most disruptive: the automatic speech recognition of an intelligent outbound system will also recognize the background speech and may treat it as the conversation target, greatly reducing the success rate of the whole conversation.
However, the current noise cancellation technology mainly cancels background noise of non-human voice, and the noise cancellation effect for background human voice is poor, resulting in poor voice call quality.
Disclosure of Invention
The invention provides a noise elimination method and device for voice calls, an electronic device, and a computer-readable storage medium, mainly aiming to delete background human voices in a voice call and improve the success rate of a conversation system.
In order to achieve the above object, a noise cancellation method for voice call provided by the present invention includes:
carrying out voice endpoint detection on the call audio to obtain a voice set;
carrying out voice feature extraction on the voice set to obtain a voice feature set;
intercepting from the voice feature set, in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, to obtain a plurality of to-be-detected voice feature sets; clustering each to-be-detected voice feature set; and scoring the obtained clustering results with a preset evaluation algorithm to obtain a score value for each to-be-detected voice feature set;
dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
and calculating the durations of the first speaker voice and the second speaker voice, determining the background speaker voice in the human voice set according to those durations, and deleting the background speaker voice from the human voice set.
Optionally, the performing voice feature extraction on the human voice set to obtain a voice feature set includes:
performing pre-emphasis, framing and windowing on the human voice set to obtain a speech frame sequence;
obtaining a corresponding frequency spectrum for each frame of voice in the voice frame sequence through fast Fourier transform;
converting the spectrum into a mel-frequency spectrum by a mel-filter bank;
and performing cepstrum analysis on the Mel spectrum to obtain the voice feature set corresponding to the human voice set.
Optionally, the clustering each to-be-detected speech feature set includes:
step a, randomly selecting two feature vectors in the voice feature set to be detected as a category center;
b, for each feature vector in the voice feature set to be detected, clustering the feature vector and the nearest class center by calculating the distance from the feature vector to each class center to obtain two initial classes;
step c, updating the category centers of the two initial categories;
and d, repeating the step b and the step c until the iteration times reach a preset time threshold value, and obtaining two standard categories.
Optionally, the dividing the set of human voices into a first speaker voice and a second speaker voice according to the score value includes:
selecting one of the voice feature sets to be detected, and acquiring a corresponding score value;
comparing the scoring value with a preset scoring threshold;
when the score value is larger than a preset score threshold value, combining two standard categories of the selected voice feature set to be detected into a single voice category, calculating a category center of the single voice category, and generating a first speaker voice according to the single voice category and the category center;
when the score value is smaller than or equal to a preset score threshold value, generating a first speaker voice and a second speaker voice according to the two standard categories;
and selecting the next voice feature set to be detected, acquiring a corresponding score value, and classifying two standard categories in the voice feature set to be detected into the first speaker voice or the second speaker voice according to the score value.
Optionally, the classifying two standard categories in the to-be-detected speech feature set into the first speaker voice or the second speaker voice according to the score value includes:
if the score value is larger than the score threshold value, combining the two standard categories of the voice feature set to be detected into a single voice category, calculating the category center of the single voice category, and classifying the single voice category into the first speaker voice or the second speaker voice according to the cosine distance between the category center of the single voice category and the category centers of the first speaker voice and the second speaker voice;
and if the score value is smaller than or equal to a score threshold value, classifying the two standard categories into the first speaker voice and the second speaker voice respectively according to cosine distances between category centers of the two standard categories in the voice feature set to be detected and category centers of the first speaker voice and the second speaker voice.
Optionally, the categorizing comprises:
combining the single voice category with the first speaker voice or the second speaker voice, recalculating a combined category center, and accumulating the frame number of the single voice category and the duration of the first speaker voice or the second speaker voice; or
And combining the two standard categories with the first speaker voice and the second speaker voice respectively, recalculating a combined category center, and accumulating the frame number of the standard categories and the time length of the first speaker voice or the second speaker voice.
Optionally, the deleting the background voice from the voice speech set includes:
calculating the time length proportion of the background voice in the call by using a preset time length algorithm;
comparing the duration proportion with a preset proportion threshold;
and when the duration proportion is greater than the proportion threshold value, deleting the background voice from the voice set, and removing the background voice in the call audio.
In order to solve the above problem, the present invention also provides a noise canceling device for voice call, the device comprising:
the voice endpoint detection module is used for carrying out voice endpoint detection on the call audio to obtain a voice set;
the voice feature extraction module is used for extracting voice features of the voice set to obtain a voice feature set;
the cluster scoring module is used for intercepting from the voice feature set, in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, obtaining a plurality of to-be-detected voice feature sets, clustering each to-be-detected voice feature set, and scoring the clustering results with a preset evaluation algorithm to obtain a score value for each to-be-detected voice feature set;
the voice classification module is used for dividing the voice speech set into a first speaker voice and a second speaker voice according to the scoring value;
and the background voice removing module is used for calculating the voice time of the first speaker and the voice time of the second speaker, judging the background voice in the voice set according to the voice time of the first speaker and the voice time of the second speaker and deleting the background voice from the voice set.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
and the processor executes the instructions stored in the memory to realize the noise elimination method of the voice call.
In order to solve the above problem, the present invention also provides a computer-readable storage medium including a storage data area storing created data and a storage program area storing a computer program which, when executed by a processor, implements the noise canceling method of a voice call as described in any one of the above.
The embodiment of the invention performs voice endpoint detection on the call audio and deletes the non-human-voice noise in it, reducing the subsequent computational load; performs voice feature extraction on the human voice set to obtain a voice feature set, making it easier to separate out background voices in the call audio later; intercepts from the voice feature set, in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, obtaining a plurality of to-be-detected voice feature sets, clusters each to-be-detected voice feature set, and scores the clustering results with a preset evaluation algorithm to obtain a score value for each to-be-detected voice feature set, so that fragmented, blurred, and low-volume background voices can be detected through clustering and scoring; divides the human voice set into a first speaker voice and a second speaker voice according to the score values, allowing the audio characteristics of the speakers and of the background voices to be stored and dynamically updated in real time; and calculates the durations of the first speaker voice and the second speaker voice, determines the background speaker voice in the human voice set according to those durations, and deletes the background speaker voice from the human voice set, thereby improving voice call quality. Therefore, the noise elimination method and device for voice calls and the computer-readable storage medium provided by the invention can delete background human voices in a voice call and improve the success rate of a conversation system.
Drawings
Fig. 1 is a flowchart illustrating a method for eliminating noise in a voice call according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a speech feature extraction method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of a human voice separation method according to an embodiment of the present invention;
fig. 4 is a block diagram of a noise cancellation apparatus for voice call according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an internal structure of an electronic device implementing a noise cancellation method for voice call according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The noise elimination method for voice calls provided by the embodiment of the present application is executed by at least one of electronic devices such as a server or a terminal that can be configured to execute the method, but the executing entity is not limited to these. In other words, the method may be performed by software or hardware installed in the terminal device or the server device, and the software may be a blockchain platform. The server includes, but is not limited to: a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
The invention provides a noise elimination method for voice communication. Fig. 1 is a schematic flow chart of a noise cancellation method for voice call according to an embodiment of the present invention.
In this embodiment, the method for eliminating noise in a voice call includes:
and S1, carrying out voice endpoint detection on the call audio to obtain a voice set.
In detail, the call audio in the embodiment of the present invention includes audio generated in conversations between people or in environments with many voices, such as call audio generated when a call is made through a communication system, such as a telephone or instant messaging software, in an environment full of background voices. The call audio may be retrieved directly from the communication system or recalled from a database used to store voice conversation information. It is emphasized that, to further ensure the privacy and security of the call audio, the call audio may also be stored in a node of a blockchain.
Voice endpoint detection distinguishes human voice data from non-voice data (silence and environmental noise) in the call audio under noisy or otherwise interfering conditions, and determines the start and end points of the voice data so that the non-voice data can be deleted from the call audio. This reduces the subsequent computational load, improves efficiency, and provides the necessary support for later signal processing.
In a preferred embodiment of the present invention, the voice endpoint detection model may be a Deep Neural Network (DNN) based Voice Activity Detection (VAD) model.
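The disclosure names a DNN-based VAD model but does not specify its architecture. For orientation, the sketch below substitutes a much simpler energy-based endpoint detector; the frame length, hop length, and energy threshold are illustrative assumptions, not values from the patent, and a production system would instead threshold the DNN model's frame-level speech/non-speech posteriors.

```python
import numpy as np

def energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Toy energy-based voice activity detector (stand-in for the DNN VAD).

    Returns a list of (start_sample, end_sample) segments judged to be voice.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop_len)]
    # Frame log-energy relative to the loudest frame
    energy = np.array([float(np.sum(np.asarray(f, dtype=np.float64) ** 2)) + 1e-12
                       for f in frames])
    log_energy = 10.0 * np.log10(energy / energy.max())
    active = log_energy > threshold_db

    # Merge runs of active frames into (start, end) endpoints
    segments, start = [], None
    for i, is_voice in enumerate(active):
        if is_voice and start is None:
            start = i * hop_len
        elif not is_voice and start is not None:
            segments.append((start, i * hop_len + frame_len))
            start = None
    if start is not None:
        segments.append((start, len(signal)))
    return segments
```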
And S2, performing voice feature extraction on the voice set to obtain a voice feature set.
In detail, referring to fig. 2, the S2 includes:
S21, pre-emphasis, framing and windowing are carried out on the voice set to obtain a voice frame sequence;
Here, pre-emphasis applies a high-pass filter to boost the high-frequency part of the speech signals in the human voice set, flattening their spectrum. Framing weights the signal with a sliding window of finite length to divide it into short segments, so that each segment can be treated as stationary. Windowing makes the aperiodic speech signal exhibit some of the properties of a periodic function, which facilitates the subsequent Fourier analysis.
S22, obtaining a corresponding frequency spectrum for each frame of voice in the voice frame sequence through fast Fourier transform;
Preferably, because the characteristics of a speech signal are usually difficult to observe in the time domain, the signal is transformed into an energy distribution in the frequency domain, where different energy distributions can represent the characteristics of different voices.
S23, converting the frequency spectrum into a Mel frequency spectrum through a Mel filter bank;
the Mel (Mel) filter bank is a set of Mel-scale triangular filter bank, the Mel filter bank can convert the frequency spectrum into Mel frequency spectrum, and the Mel frequency can accurately reflect the auditory characteristic of human ears.
S24, performing cepstrum analysis on the Mel spectrum to obtain the voice feature set corresponding to the human voice set.
Further, the cepstrum analysis consists of taking the logarithm and applying the discrete cosine transform, then outputting the feature vectors. The voice feature set comprises the feature vectors output for the speech frame sequence after cepstrum analysis.
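As a concrete illustration of S21–S24, the snippet below computes MFCC features with librosa, whose mfcc routine internally performs the framing/windowing, FFT, Mel filter bank, logarithm, and DCT steps described above. The pre-emphasis coefficient, frame sizes, and 13-coefficient count are common defaults assumed here, not values fixed by the patent.

```python
import numpy as np
import librosa

def extract_voice_features(voice_audio, sample_rate=8000, n_mfcc=13):
    """MFCC extraction matching the S21-S24 pipeline:
    pre-emphasis -> framing/windowing -> FFT -> Mel filter bank -> log + DCT.
    Returns one feature vector per frame, shape (num_frames, n_mfcc)."""
    # S21 (pre-emphasis): first-order high-pass filter; 0.97 is a common choice
    emphasized = librosa.effects.preemphasis(voice_audio, coef=0.97)
    # S21-S24: librosa handles framing, windowing, FFT, Mel filtering, log, DCT
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sample_rate, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256)
    return mfcc.T.astype(np.float32)  # one row per speech frame
```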
S3, intercepting from the voice feature set, in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, to obtain a plurality of to-be-detected voice feature sets; clustering each to-be-detected voice feature set; and scoring the obtained clustering results with a preset evaluation algorithm to obtain the score value of each to-be-detected voice feature set.
In the embodiment of the invention, each time the accumulated duration of the voice feature set reaches the preset duration threshold, one detection calculation is performed, and the feature set accumulated at that point is called the to-be-detected voice feature set.
In detail, the clustering process of each speech feature set to be detected includes:
step a, randomly selecting two feature vectors in the voice feature set to be detected as a category center;
and b, for each feature vector in the voice feature set to be detected, clustering the feature vector and the nearest class center by calculating the distance from the feature vector to each class center to obtain two initial classes.
In detail, in the embodiment of the present invention, the distance between the feature vector and each category center is calculated with the following distance algorithm:

$$L(X, Y_i) = \sqrt{\sum_{j=1}^{d} \left(X_j - Y_{i,j}\right)^2}$$

where $L(X, Y_i)$ is the distance value, $X$ is the category center, and $Y_i$ is a feature vector in the to-be-detected voice feature set.
Step c, updating the category centers of the two initial categories;
preferably, the embodiment of the present invention calculates a mean value of all feature vectors in each of the initial classes, and updates the mean value to the class center of the class.
And d, repeating the step b and the step c until the iteration times reach a preset time threshold value, and obtaining two standard categories.
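Steps a–d amount to a two-class k-means run for a fixed number of iterations. A minimal NumPy sketch under that reading (Euclidean distance, mean-updated centers, and the iteration cap as the stopping rule) follows:

```python
import numpy as np

def cluster_two_categories(features, n_iters=10, seed=0):
    """Two-class clustering of one to-be-detected voice feature set.

    features: (num_frames, dim) array of frame feature vectors.
    Returns (labels, centers): per-frame 0/1 labels and the two
    category centers (the "standard categories" above).
    """
    rng = np.random.default_rng(seed)
    # Step a: randomly pick two feature vectors as the category centers
    centers = features[rng.choice(len(features), size=2, replace=False)].astype(float)
    for _ in range(n_iters):           # step d: stop at the preset iteration count
        # Step b: assign each vector to its nearest category center
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step c: update each category center to the mean of its members
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels, centers
```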
Further, the two obtained standard categories are scored with a preset evaluation algorithm to obtain their score value. Preferably, in the embodiment of the present invention, the evaluation algorithm is:

$$\mathrm{Score}(n_1, n_2) = \log \frac{P(n_1, n_2 \mid H_s)}{P(n_1 \mid H_d)\, P(n_2 \mid H_d)}$$

where $n_1$ and $n_2$ are the category centers of the two standard categories; $H_s$ is the hypothesis that the standard categories belong to the same category, and $H_d$ is the hypothesis that they belong to different categories; $P(n_1, n_2 \mid H_s)$ is the likelihood that $n_1$ and $n_2$ come from the same space; and $P(n_1 \mid H_d)$ and $P(n_2 \mid H_d)$ are the likelihoods that $n_1$ and $n_2$ come from different spaces. A likelihood function is a function of the parameters of a statistical model, used to test whether a given hypothesis holds.
Preferably, the higher the score value is, the higher the possibility that the voices corresponding to the two standard categories belong to the same speaker is; the lower the score value is, the less likely that the voices corresponding to the two standard categories belong to the same speaker.
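The score is thus a log-likelihood ratio between the same-speaker and different-speaker hypotheses. The sketch below evaluates it under a deliberately simple two-covariance Gaussian model; the isotropic speaker and noise variances b and w are illustrative assumptions, where a real system would use trained PLDA parameters:

```python
import numpy as np

def llr_score(n1, n2, b=1.0, w=1.0):
    """Log-likelihood-ratio score for two category centers n1 and n2.

    Assumed model (illustrative, not the patent's trained model): each
    center = speaker mean mu + noise, with mu ~ N(0, b*I) and noise
    ~ N(0, w*I). Under Hs the two centers share one mu; under Hd their
    mus are independent. Returns log P(n1,n2|Hs) - log[P(n1|Hd)P(n2|Hd)],
    so larger values favour the same-speaker hypothesis.
    """
    n1 = np.asarray(n1, dtype=float)
    n2 = np.asarray(n2, dtype=float)
    d = n1.size
    v = b + w            # per-dimension marginal variance of each center
    c = b                # per-dimension cross-covariance under Hs
    det = v * v - c * c  # determinant of the per-dimension 2x2 covariance
    # log P(n1, n2 | Hs): a 2x2 Gaussian per dimension, summed over dimensions
    quad_s = (v * (n1 ** 2 + n2 ** 2) - 2.0 * c * n1 * n2) / det
    log_ps = -0.5 * (quad_s.sum() + d * (2.0 * np.log(2.0 * np.pi) + np.log(det)))
    # log P(n1 | Hd) + log P(n2 | Hd): independent N(0, v*I) factors
    quad_d = (n1 ** 2 + n2 ** 2).sum() / v
    log_pd = -0.5 * (quad_d + 2.0 * d * (np.log(2.0 * np.pi) + np.log(v)))
    return log_ps - log_pd
```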
S4, dividing the human voice set into a first speaker voice and a second speaker voice according to the score values.
In detail, referring to fig. 3, the S4 includes:
S40, selecting one of the voice feature sets to be detected, and acquiring a corresponding score value;
S41, comparing the score value with a preset score threshold;
when the score value is larger than a preset score threshold value, S42 is executed, two standard categories of the selected voice feature set to be detected are merged into a single voice category, a category center of the single voice category is calculated, and a first speaker voice is generated according to the single voice category and the category center;
wherein the first speaker voice comprises a voice feature and a duration, the voice feature comprises the single voice category and a category center, and the duration comprises a frame number of the single voice category.
When the score value is smaller than or equal to a preset score threshold value, executing S43, and generating a first speaker voice and a second speaker voice according to the two standard categories;
similarly, the first and second speaker voices include a speech characteristic and a duration, the speech characteristic includes the standard category and a category center, and the duration includes a frame number of the standard category.
S44, selecting the next voice feature set to be detected, acquiring the corresponding score value of the next voice feature set to be detected, and classifying two standard categories in the voice feature set to be detected into the first speaker voice or the second speaker voice according to the score value of the next voice feature set to be detected;
and S45, judging whether each voice feature set to be detected is completely selected, and repeating the S44 until each voice feature set to be detected is completely selected to obtain a first speaker voice and a second speaker voice.
In detail, the classifying the two standard categories in the speech feature set to be detected into the first speaker voice or the second speaker voice includes:
in one embodiment of the present invention, if the score value of the voice feature set to be detected is greater than the score threshold, the two standard classes of the voice feature set to be detected are merged into a single voice class, the class center of the single voice class is calculated, the cosine distance between the class center of the single voice class and the class centers of the first speaker voice and the second speaker voice is calculated, and the single voice class is classified into the first speaker voice or the second speaker voice according to the cosine distance.
For example, if the category center of the single voice category is closer, by cosine distance, to the category center of the first speaker voice, the single voice category is classified into the first speaker voice; if it is closer to the category center of the second speaker voice, it is classified into the second speaker voice.
The classification comprises the following steps: combining the single voice category with the first speaker voice or the second speaker voice, and recalculating a combined category center; and accumulating the frame number of the single voice category and the duration of the first speaker voice or the second speaker voice.
In another embodiment of the present invention, if the score value is smaller than or equal to the score threshold, the two standard categories are classified into the first speaker voice and the second speaker voice according to the cosine distance by calculating the cosine distance between the category center of each standard category in the to-be-detected speech feature set and the category center of the first speaker voice and the second speaker voice.
For example, if the category center of standard category A is closer, by cosine distance, to the category center of the first speaker voice, and the category center of standard category B is closer to that of the second speaker voice, then standard category A is classified into the first speaker voice and standard category B into the second speaker voice.
Similarly, the categorizing includes: combining standard category A and standard category B with the first speaker voice and the second speaker voice respectively, and recalculating the combined category centers; and accumulating the frame counts of standard category A and standard category B onto the durations of the first speaker voice and the second speaker voice.
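Both branches above reduce to the same routing rule: send a category center to whichever running speaker voice is nearer by cosine distance, then merge. The helper below sketches that rule; the dict-based speaker state and the frame-weighted-mean center update are illustrative assumptions, since the patent only says the combined center is recalculated.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: smaller means more similar."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def categorize(center, n_frames, speakers):
    """Attach one category (center, frame count) to the nearer speaker voice.

    speakers: list of two dicts {"center": vector, "frames": int} holding the
    running first / second speaker voices; updated in place.
    """
    center = np.asarray(center, dtype=float)
    idx = min((0, 1), key=lambda i: cosine_distance(center, speakers[i]["center"]))
    s = speakers[idx]
    total = s["frames"] + n_frames
    # Recalculate the combined category center (frame-weighted mean: an
    # assumption) and accumulate the duration in frames
    s["center"] = (np.asarray(s["center"], dtype=float) * s["frames"]
                   + center * n_frames) / total
    s["frames"] = total
    return idx
```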
S5, calculating the durations of the first speaker voice and the second speaker voice, determining the background speaker voice in the human voice set according to those durations, and deleting the background speaker voice from the human voice set.
Preferably, in call audio the target speaker typically speaks for longer than the background speaker, so the embodiment of the present invention treats whichever of the first speaker voice and the second speaker voice has the longer duration as the target speaker voice, and the other as the background speaker voice.
In detail, the deleting the background voice from the voice speech set includes:
calculating the time length proportion of the background voice in the call by using a preset time length algorithm;
comparing the duration proportion with a preset proportion threshold;
and when the duration proportion is greater than the proportion threshold value, deleting the background voice from the voice set, and removing the background voice in the call audio.
Wherein, the time length algorithm is as follows:
R=t/T
where R is the proportion of the call occupied by the background speaker voice, t is the duration of the background speaker voice, and T is the total call duration, i.e. the sum of the durations of the target speaker voice and the background speaker voice.
Preferably, when the duration proportion is smaller than the proportion threshold, the background voice noise interferes little with the call and the call audio need not be processed; when the duration proportion is greater than the proportion threshold, the call suffers serious background voice noise interference, and deleting the background speaker voice from the human voice set reduces the misrecognition caused by background voices and improves voice call quality.
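The removal decision therefore reduces to the ratio R = t/T. A minimal sketch follows; the 0.2 proportion threshold is a placeholder, since the patent leaves the preset value open:

```python
def should_remove_background(t_background, t_target, ratio_threshold=0.2):
    """Apply R = t / T from the disclosure.

    t_background: accumulated duration of the background speaker voice.
    t_target: accumulated duration of the target speaker voice.
    Returns True when the background speaker voice should be deleted.
    """
    total = t_background + t_target      # T: total call duration
    if total == 0:
        return False
    ratio = t_background / total         # R = t / T
    return ratio > ratio_threshold
```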
The embodiment of the invention performs voice endpoint detection on the call audio and deletes the non-human-voice noise in it, reducing the subsequent computational load; performs voice feature extraction on the human voice set to obtain a voice feature set, making it easier to separate out background voices in the call audio later; intercepts from the voice feature set, in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, obtaining a plurality of to-be-detected voice feature sets, clusters each to-be-detected voice feature set, and scores the clustering results with a preset evaluation algorithm to obtain a score value for each to-be-detected voice feature set, so that fragmented, blurred, and low-volume background voices can be detected through clustering and scoring; divides the human voice set into a first speaker voice and a second speaker voice according to the score values, allowing the audio characteristics of the speakers and of the background voices to be stored and dynamically updated in real time; and calculates the durations of the first speaker voice and the second speaker voice, determines the background speaker voice in the human voice set according to those durations, and deletes the background speaker voice from the human voice set, thereby improving voice call quality. Therefore, the noise elimination method and device for voice calls and the computer-readable storage medium provided by the invention can delete background human voices in a voice call and improve the success rate of a conversation system.
Fig. 4 is a functional block diagram of the noise canceling device for voice call according to the present invention.
The noise canceling device 100 for voice call according to the present invention may be installed in an electronic apparatus. According to the implemented functions, the noise elimination apparatus for voice call may include a voice endpoint detection module 101, a voice feature extraction module 102, a cluster scoring module 103, a voice classification module 104, and a background voice removal module 105. A module according to the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the voice endpoint detection module 101 is configured to perform voice endpoint detection on a call audio to obtain a voice set.
In detail, the call audio in the embodiment of the present invention includes audio generated in conversations between people or in environments with many voices, such as call audio generated when a call is made through a communication system, such as a telephone or instant messaging software, in an environment full of background voices. The call audio may be retrieved directly from the communication system or recalled from a database used to store voice conversation information. It is emphasized that, to further ensure the privacy and security of the call audio, the call audio may also be stored in a node of a blockchain.
Voice endpoint detection distinguishes human voice data from non-voice data (silence and environmental noise) in the call audio under noisy or otherwise interfering conditions, and determines the start and end points of the voice data so that the non-voice data can be deleted from the call audio. This reduces the subsequent computational load, improves efficiency, and provides the necessary support for later signal processing.
In a preferred embodiment of the present invention, the voice endpoint detection model may be a Deep Neural Network (DNN) based Voice Activity Detection (VAD) model.
The voice feature extraction module 102 is configured to perform voice feature extraction on the human voice set to obtain a voice feature set.
In detail, the speech feature extraction module 102 specifically executes:
pre-emphasis, framing and windowing are carried out on the voice speech set to obtain a speech frame sequence;
obtaining a corresponding frequency spectrum for each frame of voice in the voice frame sequence through fast Fourier transform;
converting the spectrum into a mel-frequency spectrum by a mel-filter bank;
and performing cepstrum analysis on the Mel spectrum to obtain the voice feature set corresponding to the human voice set. Here, pre-emphasis applies a high-pass filter to boost the high-frequency part of the speech signals in the human voice set, flattening their spectrum. Framing weights the signal with a sliding window of finite length to divide it into short segments, so that each segment can be treated as stationary. Windowing makes the aperiodic speech signal exhibit some of the properties of a periodic function, which facilitates the subsequent Fourier analysis.
Preferably, because the characteristics of a speech signal are usually difficult to observe in the time domain, the signal is transformed into an energy distribution in the frequency domain, where different energy distributions can represent the characteristics of different voices.
The Mel filter bank is a set of triangular filters on the Mel scale. It converts the spectrum into a Mel spectrum, and Mel frequencies closely reflect the auditory characteristics of the human ear.
Further, the cepstrum analysis consists of taking the logarithm and applying the discrete cosine transform, then outputting the feature vectors. The voice feature set comprises the feature vectors output for the speech frame sequence after cepstrum analysis.
The cluster scoring module 103 is configured to intercept from the voice feature set, in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, obtaining a plurality of to-be-detected voice feature sets, cluster each to-be-detected voice feature set, and score the clustering results with a preset evaluation algorithm to obtain a score value for each to-be-detected voice feature set.
In the embodiment of the invention, each time the accumulated duration of the voice feature set reaches the preset duration threshold, one detection calculation is performed, and the feature set accumulated at that point is called the to-be-detected voice feature set.
In detail, the clustering process of each speech feature set to be detected includes:
step a, randomly selecting two feature vectors in the voice feature set to be detected as a category center;
and b, for each feature vector in the voice feature set to be detected, clustering the feature vector and the nearest class center by calculating the distance from the feature vector to each class center to obtain two initial classes.
In detail, in the embodiment of the present invention, the distance between the feature vector and each category center is calculated with the following distance algorithm:

$$L(X, Y_i) = \sqrt{\sum_{j=1}^{d} \left(X_j - Y_{i,j}\right)^2}$$

where $L(X, Y_i)$ is the distance value, $X$ is the category center, and $Y_i$ is a feature vector in the to-be-detected voice feature set.
Step c, updating the category centers of the two initial categories;
preferably, the embodiment of the present invention calculates a mean value of all feature vectors in each of the initial classes, and updates the mean value to the class center of the class.
And d, repeating the step b and the step c until the iteration times reach a preset time threshold value, and obtaining two standard categories.
Further, the two obtained standard categories are scored with a preset evaluation algorithm to obtain their score value. Preferably, in the embodiment of the present invention, the evaluation algorithm is:

$$\mathrm{Score}(n_1, n_2) = \log \frac{P(n_1, n_2 \mid H_s)}{P(n_1 \mid H_d)\, P(n_2 \mid H_d)}$$

where $n_1$ and $n_2$ are the category centers of the two standard categories; $H_s$ is the hypothesis that the standard categories belong to the same category, and $H_d$ is the hypothesis that they belong to different categories; $P(n_1, n_2 \mid H_s)$ is the likelihood that $n_1$ and $n_2$ come from the same space; and $P(n_1 \mid H_d)$ and $P(n_2 \mid H_d)$ are the likelihoods that $n_1$ and $n_2$ come from different spaces. A likelihood function is a function of the parameters of a statistical model, used to test whether a given hypothesis holds.
Preferably, the higher the score value is, the higher the possibility that the voices corresponding to the two standard categories belong to the same speaker is; the lower the score value is, the less likely that the voices corresponding to the two standard categories belong to the same speaker.
The voice classification module 104 is configured to classify the voice speech set into a first speaker voice and a second speaker voice according to the score value.
In detail, the human voice classification module 104 is specifically configured to:
selecting one of the voice feature sets to be detected, and acquiring a score value corresponding to the voice feature set;
comparing the scoring value with a preset scoring threshold;
when the score value is smaller than or equal to a preset score threshold value, generating a first speaker voice and a second speaker voice according to the two standard categories;
selecting a next voice feature set to be detected, acquiring a corresponding score value of the next voice feature set to be detected, and classifying two standard categories in the voice feature set to be detected into the first speaker voice or the second speaker voice according to the score value of the next voice feature set to be detected;
and judging whether each voice feature set to be detected is selected completely or not until each voice feature set to be detected is selected completely, and obtaining a first speaker voice and a second speaker voice.
When the score value is larger than a preset score threshold value, the voice classification module 104 combines two standard categories of the selected voice feature set to be detected into a single voice category, calculates a category center of the single voice category, and generates a first speaker voice according to the single voice category and the category center;
wherein the first speaker voice comprises a voice feature and a duration, the voice feature comprises the single voice category and a category center, and the duration comprises a frame number of the single voice category.
Similarly, the first and second speaker voices include a speech characteristic and a duration, the speech characteristic includes the standard category and a category center, and the duration includes a frame number of the standard category.
In detail, the classifying the two standard categories in the speech feature set to be detected into the first speaker voice or the second speaker voice includes:
in one embodiment of the present invention, if the score value of the voice feature set to be detected is greater than the score threshold, the two standard classes of the voice feature set to be detected are merged into a single voice class, the class center of the single voice class is calculated, the cosine distance between the class center of the single voice class and the class centers of the first speaker voice and the second speaker voice is calculated, and the single voice class is classified into the first speaker voice or the second speaker voice according to the cosine distance.
For example, if the category center of the single voice category is closer, by cosine distance, to the category center of the first speaker voice, the single voice category is classified into the first speaker voice; if it is closer to the category center of the second speaker voice, it is classified into the second speaker voice.
The classification comprises the following steps: combining the single voice category with the first speaker voice or the second speaker voice, and recalculating a combined category center; and accumulating the frame number of the single voice category and the duration of the first speaker voice or the second speaker voice.
In another embodiment of the present invention, if the score value is smaller than or equal to the score threshold, the two standard categories are classified into the first speaker voice and the second speaker voice according to the cosine distance by calculating the cosine distance between the category center of each standard category in the to-be-detected speech feature set and the category center of the first speaker voice and the second speaker voice.
For example, if the category center of standard category A is closer, by cosine distance, to the category center of the first speaker voice, and the category center of standard category B is closer to that of the second speaker voice, then standard category A is classified into the first speaker voice and standard category B into the second speaker voice.
Similarly, the categorizing includes: combining standard category A and standard category B with the first speaker voice and the second speaker voice respectively, and recalculating the combined category centers; and accumulating the frame counts of standard category A and standard category B onto the durations of the first speaker voice and the second speaker voice.
The background voice removing module 105 is configured to calculate the durations of the first and second speaker voices, determine a background voice in the voice set according to the durations of the first and second speaker voices, and delete the background voice from the voice set.
Preferably, in call audio the target speaker typically speaks for longer than the background speaker, so the embodiment of the present invention treats whichever of the first speaker voice and the second speaker voice has the longer duration as the target speaker voice, and the other as the background speaker voice.
In detail, the background voice removing module 105 deletes the background voice from the voice speech set by the following method, including:
calculating the time length proportion of the background voice in the call by using a preset time length algorithm;
comparing the duration proportion with a preset proportion threshold;
and when the duration proportion is greater than the proportion threshold value, deleting the background voice from the voice set, and removing the background voice in the call audio.
Wherein, the time length algorithm is as follows:
R=t/T
where R is the proportion of the call occupied by the background speaker voice, t is the duration of the background speaker voice, and T is the total call duration, i.e. the sum of the durations of the target speaker voice and the background speaker voice.
Preferably, when the duration proportion is smaller than the proportion threshold, the background voice noise interferes little with the call and the call audio need not be processed; when the duration proportion is greater than the proportion threshold, the call suffers serious background voice noise interference, and deleting the background speaker voice from the human voice set reduces the misrecognition caused by background voices and improves voice call quality.
Fig. 5 is a schematic structural diagram of an electronic device implementing the noise cancellation method for voice call according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a noise cancellation program 12 for a voice call, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as a code of the noise canceling program 12 for voice call, etc., but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by operating or executing programs or modules (e.g., a noise canceling program for performing a voice call, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 5 only shows an electronic device with components, and it will be understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The noise cancellation program 12 for voice call stored in the memory 11 of the electronic device 1 is a combination of a plurality of instructions, which when executed in the processor 10, can realize:
carrying out voice endpoint detection on the call audio to obtain a voice set;
carrying out voice feature extraction on the voice set to obtain a voice feature set;
intercepting from the voice feature set, in time order, to-be-detected voice feature sets whose accumulated duration reaches a preset duration threshold, to obtain a plurality of to-be-detected voice feature sets; clustering each to-be-detected voice feature set; and scoring the obtained clustering results with a preset evaluation algorithm to obtain a score value for each to-be-detected voice feature set;
dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
and calculating the voice time of the first speaker and the voice time of the second speaker, judging the background voice in the voice set according to the voice time of the first speaker and the voice time of the second speaker, and deleting the background voice from the voice set.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description; all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims should not be construed as limiting the claim concerned.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, and the like are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of those technical solutions.

Claims (10)

1. A method for eliminating noise in a voice call, the method comprising:
performing voice endpoint detection on call audio to obtain a human voice set;
performing voice feature extraction on the human voice set to obtain a voice feature set;
intercepting, from the voice feature set in time order, voice feature sets to be detected whose accumulated duration reaches a preset duration threshold, thereby obtaining a plurality of voice feature sets to be detected; clustering each voice feature set to be detected; and scoring the resulting clustering results with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
dividing the human voice set into a first speaker voice and a second speaker voice according to the score values;
and calculating the durations of the first speaker voice and the second speaker voice, determining the background voice in the human voice set according to those durations, and deleting the background voice from the human voice set.
2. The method for eliminating noise in a voice call according to claim 1, wherein said performing voice feature extraction on the human voice set to obtain a voice feature set comprises:
performing pre-emphasis, framing, and windowing on the human voice set to obtain a speech frame sequence;
obtaining a corresponding frequency spectrum for each frame of speech in the speech frame sequence through a fast Fourier transform;
converting the frequency spectrum into a Mel frequency spectrum through a Mel filter bank;
and performing cepstral analysis on the Mel frequency spectrum to obtain the voice feature set corresponding to the human voice set.
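By way of illustration and not limitation, the feature extraction recited in claim 2 may be sketched with librosa; the patent names no toolkit, and the FFT size, hop length, and 13 cepstral coefficients are assumptions.

import librosa

def extract_mfcc(wav_path, n_mfcc=13, n_fft=512, hop_length=160):
    # Load the call audio at its native sampling rate.
    y, sr = librosa.load(wav_path, sr=None)
    # Pre-emphasis; framing, Hamming windowing, the FFT, the Mel filter
    # bank, and the cepstral analysis (log + DCT) all happen inside
    # librosa.feature.mfcc.
    y = librosa.effects.preemphasis(y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length,
                                window="hamming")
    times = librosa.times_like(mfcc, sr=sr, hop_length=hop_length)
    return mfcc.T, times   # (frames, n_mfcc) feature set and frame times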
3. The method for eliminating noise in a voice call according to claim 1, wherein said clustering each voice feature set to be detected comprises:
step a, randomly selecting two feature vectors in the voice feature set to be detected as category centers;
step b, for each feature vector in the voice feature set to be detected, calculating the distance from the feature vector to each category center and attaching the feature vector to the nearest category center, obtaining two initial categories;
step c, updating the category centers of the two initial categories;
and step d, repeating step b and step c until the number of iterations reaches a preset iteration threshold, obtaining two standard categories.
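Steps a to d amount to a k-means procedure with k = 2 and a fixed iteration budget. A minimal NumPy sketch follows, with the iteration count and random seed as assumptions.

import numpy as np

def two_means(features, n_iter=10, seed=0):
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Step a: randomly select two feature vectors as category centers.
    centers = features[rng.choice(len(features), size=2, replace=False)]
    for _ in range(n_iter):  # Step d: repeat for the preset iteration count.
        # Step b: attach each vector to its nearest category center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step c: update the category centers of the two categories.
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels, centers  # the two standard categories and their centers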
4. The method for eliminating noise in a voice call according to claim 1, wherein said dividing the human voice set into a first speaker voice and a second speaker voice according to the score values comprises:
selecting one of the voice feature sets to be detected and acquiring its corresponding score value;
comparing the score value with a preset score threshold;
when the score value is greater than the preset score threshold, combining the two standard categories of the selected voice feature set to be detected into a single voice category, calculating a category center of the single voice category, and generating a first speaker voice according to the single voice category and the category center;
when the score value is less than or equal to the preset score threshold, generating a first speaker voice and a second speaker voice according to the two standard categories;
and selecting the next voice feature set to be detected, acquiring its corresponding score value, and classifying the two standard categories in that voice feature set to be detected into the first speaker voice or the second speaker voice according to the score value.
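By way of illustration, the division recited in claim 4 may be sketched as follows for the first voice feature set to be detected; the 0.8 score threshold is an assumption, and chunk and labels have the shapes produced by the clustering sketch after claim 1.

import numpy as np

def divide_first_chunk(chunk, labels, score, threshold=0.8):
    """Seed the first and second speaker voices from the first chunk."""
    chunk = np.asarray(chunk)
    if score > threshold:
        # The two standard categories merge into a single voice category.
        return {"first": {"feats": chunk, "center": chunk.mean(axis=0)},
                "second": None}
    c0, c1 = chunk[labels == 0], chunk[labels == 1]
    return {"first": {"feats": c0, "center": c0.mean(axis=0)},
            "second": {"feats": c1, "center": c1.mean(axis=0)}}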
5. The method according to claim 4, wherein said classifying the two standard categories in the voice feature set to be detected into the first speaker voice or the second speaker voice according to the score value comprises:
if the score value is greater than the score threshold, combining the two standard categories of the voice feature set to be detected into a single voice category, calculating the category center of the single voice category, and classifying the single voice category into the first speaker voice or the second speaker voice according to the cosine distances between the category center of the single voice category and the category centers of the first speaker voice and the second speaker voice;
and if the score value is less than or equal to the score threshold, classifying the two standard categories into the first speaker voice and the second speaker voice respectively according to the cosine distances between the category centers of the two standard categories in the voice feature set to be detected and the category centers of the first speaker voice and the second speaker voice.
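A minimal sketch of the cosine-distance attribution in claim 5; the speaker dictionaries follow the structure used in the claim 4 sketch above, and the names are illustrative.

import numpy as np

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_speaker(category_center, speakers):
    """Return the key ('first' or 'second') of the speaker whose category
    center is nearest to category_center by cosine distance."""
    candidates = {k: cosine_distance(category_center, v["center"])
                  for k, v in speakers.items() if v is not None}
    return min(candidates, key=candidates.get)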
6. The method for eliminating noise in a voice call according to claim 5, wherein said classifying comprises:
combining the single voice category with the first speaker voice or the second speaker voice, recalculating the combined category center, and adding the duration corresponding to the frame count of the single voice category to the duration of the first speaker voice or the second speaker voice; or
combining the two standard categories with the first speaker voice and the second speaker voice respectively, recalculating the combined category centers, and adding the durations corresponding to the frame counts of the standard categories to the durations of the first speaker voice and the second speaker voice.
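The merge step of claim 6 may be sketched as follows; the 10 ms frame step used to turn frame counts into durations is an assumption.

import numpy as np

FRAME_STEP_S = 0.01  # assumed hop between feature frames

def merge_category(speaker, category_feats):
    """Merge a category into the attributed speaker: recompute the combined
    category center and accumulate the speaker's duration."""
    feats = np.vstack([speaker["feats"], category_feats])
    return {"feats": feats,
            "center": feats.mean(axis=0),                     # combined center
            "duration": speaker.get("duration", 0.0)
                        + len(category_feats) * FRAME_STEP_S}  # accumulated time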
7. The method according to any one of claims 1 to 6, wherein said deleting the background voice from the human voice set comprises:
calculating the duration proportion of the background voice in the call by using a preset duration algorithm;
and when the duration proportion is greater than a preset proportion threshold, deleting the background voice from the human voice set, thereby removing the background voice from the call audio.
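A sketch of the duration test in claim 7. Treating the shorter speaker as the background voice and the 0.1 proportion threshold are assumptions; the claim fixes only the rule that the background voice is deleted when its proportion exceeds the threshold.

def should_delete_background(first_s, second_s, ratio_threshold=0.1):
    """Return True when the background voice's share of the total speech
    time exceeds the preset proportion threshold."""
    background_s = min(first_s, second_s)   # assumed background voice
    total_s = first_s + second_s
    ratio = background_s / total_s if total_s else 0.0
    return ratio > ratio_threshold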
8. An apparatus for eliminating noise in a voice call, the apparatus comprising:
a voice endpoint detection module, configured to perform voice endpoint detection on call audio to obtain a human voice set;
a voice feature extraction module, configured to perform voice feature extraction on the human voice set to obtain a voice feature set;
a cluster scoring module, configured to intercept, from the voice feature set in time order, voice feature sets to be detected whose accumulated duration reaches a preset duration threshold, thereby obtaining a plurality of voice feature sets to be detected, to cluster each voice feature set to be detected, and to score the resulting clustering results with a preset evaluation algorithm to obtain a score value for each voice feature set to be detected;
a voice classification module, configured to divide the human voice set into a first speaker voice and a second speaker voice according to the score values;
and a background voice removing module, configured to calculate the durations of the first speaker voice and the second speaker voice, determine the background voice in the human voice set according to those durations, and delete the background voice from the human voice set.
9. An electronic device, characterized in that the electronic device comprises:
a memory storing at least one instruction; and
a processor executing instructions stored in the memory to perform the method of noise cancellation for voice calls according to any one of claims 1 to 5.
10. A computer-readable storage medium comprising a data storage area storing created data and a program storage area storing a computer program, wherein the computer program, when executed by a processor, implements the method for eliminating noise in a voice call according to any one of claims 1 to 5.
CN202010570483.4A 2020-06-19 2020-06-19 Noise elimination method and device for voice call, electronic equipment and storage medium Pending CN111754982A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010570483.4A CN111754982A (en) 2020-06-19 2020-06-19 Noise elimination method and device for voice call, electronic equipment and storage medium
PCT/CN2020/121571 WO2021151310A1 (en) 2020-06-19 2020-10-16 Voice call noise cancellation method, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010570483.4A CN111754982A (en) 2020-06-19 2020-06-19 Noise elimination method and device for voice call, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111754982A 2020-10-09

Family

ID=72675687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010570483.4A Pending CN111754982A (en) 2020-06-19 2020-06-19 Noise elimination method and device for voice call, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111754982A (en)
WO (1) WO2021151310A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700790A (en) * 2020-12-11 2021-04-23 广州市申迪计算机系统有限公司 IDC machine room sound processing method, system, equipment and computer storage medium
WO2021151310A1 (en) * 2020-06-19 2021-08-05 平安科技(深圳)有限公司 Voice call noise cancellation method, apparatus, electronic device, and storage medium
CN113255362A (en) * 2021-05-19 2021-08-13 平安科技(深圳)有限公司 Method and device for filtering and identifying human voice, electronic device and storage medium
CN113572908A (en) * 2021-06-16 2021-10-29 云茂互联智能科技(厦门)有限公司 Method, device and system for reducing noise in VoIP (Voice over Internet protocol) call
CN114070935A (en) * 2022-01-12 2022-02-18 百融至信(北京)征信有限公司 Intelligent outbound interruption method and system
CN115394310A (en) * 2022-08-19 2022-11-25 中邮消费金融有限公司 Neural network-based background voice removing method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070033027A1 (en) * 2005-08-03 2007-02-08 Texas Instruments, Incorporated Systems and methods employing stochastic bias compensation and bayesian joint additive/convolutive compensation in automatic speech recognition
CN108962237B (en) * 2018-05-24 2020-12-04 腾讯科技(深圳)有限公司 Hybrid speech recognition method, device and computer readable storage medium
CN109065028B (en) * 2018-06-11 2022-12-30 平安科技(深圳)有限公司 Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN109147798B (en) * 2018-07-27 2023-06-09 北京三快在线科技有限公司 Speech recognition method, device, electronic equipment and readable storage medium
CN111199741A (en) * 2018-11-20 2020-05-26 阿里巴巴集团控股有限公司 Voiceprint identification method, voiceprint verification method, voiceprint identification device, computing device and medium
CN110136749B (en) * 2019-06-14 2022-08-16 思必驰科技股份有限公司 Method and device for detecting end-to-end voice endpoint related to speaker
CN111754982A (en) * 2020-06-19 2020-10-09 平安科技(深圳)有限公司 Noise elimination method and device for voice call, electronic equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021151310A1 (en) * 2020-06-19 2021-08-05 平安科技(深圳)有限公司 Voice call noise cancellation method, apparatus, electronic device, and storage medium
CN112700790A (en) * 2020-12-11 2021-04-23 广州市申迪计算机系统有限公司 IDC machine room sound processing method, system, equipment and computer storage medium
CN113255362A (en) * 2021-05-19 2021-08-13 平安科技(深圳)有限公司 Method and device for filtering and identifying human voice, electronic device and storage medium
CN113255362B (en) * 2021-05-19 2024-02-02 平安科技(深圳)有限公司 Method and device for filtering and identifying human voice, electronic device and storage medium
CN113572908A (en) * 2021-06-16 2021-10-29 云茂互联智能科技(厦门)有限公司 Method, device and system for reducing noise in VoIP (Voice over Internet protocol) call
CN114070935A (en) * 2022-01-12 2022-02-18 百融至信(北京)征信有限公司 Intelligent outbound interruption method and system
CN115394310A (en) * 2022-08-19 2022-11-25 中邮消费金融有限公司 Neural network-based background voice removing method and system
CN115394310B (en) * 2022-08-19 2023-04-07 中邮消费金融有限公司 Neural network-based background voice removing method and system

Also Published As

Publication number Publication date
WO2021151310A1 (en) 2021-08-05

Similar Documents

Publication Publication Date Title
CN111754982A (en) Noise elimination method and device for voice call, electronic equipment and storage medium
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
WO2019227583A1 (en) Voiceprint recognition method and device, terminal device and storage medium
CN112949708A (en) Emotion recognition method and device, computer equipment and storage medium
CN111862951B (en) Voice endpoint detection method and device, storage medium and electronic equipment
CN106033669B (en) Audio recognition method and device
CN113327586B (en) Voice recognition method, device, electronic equipment and storage medium
CN112669822B (en) Audio processing method and device, electronic equipment and storage medium
CN113707173B (en) Voice separation method, device, equipment and storage medium based on audio segmentation
CN113903363A (en) Violation detection method, device, equipment and medium based on artificial intelligence
CN113112992B (en) Voice recognition method and device, storage medium and server
CN112382309A (en) Emotion recognition model training method, device, equipment and storage medium
CN113593597A (en) Voice noise filtering method and device, electronic equipment and medium
CN109688271A (en) The method, apparatus and terminal device of contact information input
CN115394318A (en) Audio detection method and device
US10910000B2 (en) Method and device for audio recognition using a voting matrix
CN112289311B (en) Voice wakeup method and device, electronic equipment and storage medium
CN111640450A (en) Multi-person audio processing method, device, equipment and readable storage medium
CN111552832A (en) Risk user identification method and device based on voiceprint features and associated map data
CN116364107A (en) Voice signal detection method, device, equipment and storage medium
CN111985231B (en) Unsupervised role recognition method and device, electronic equipment and storage medium
CN114420136A (en) Method and device for training voiceprint recognition model and storage medium
KR101449856B1 (en) Method for estimating user emotion based on call speech
CN112216286B (en) Voice wakeup recognition method and device, electronic equipment and storage medium
CN112614492A (en) Voiceprint recognition method, system and storage medium based on time-space information fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination