CN112151052A - Voice enhancement method and device, computer equipment and storage medium - Google Patents

Voice enhancement method and device, computer equipment and storage medium

Info

Publication number
CN112151052A
Authority
CN
China
Prior art keywords
voice
enhancement
data
voice data
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011153521.2A
Other languages
Chinese (zh)
Other versions
CN112151052B (en)
Inventor
罗剑
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202011153521.2A
Priority to PCT/CN2020/136364 (published as WO2021189979A1)
Publication of CN112151052A
Application granted
Publication of CN112151052B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Navigation (AREA)

Abstract

The invention discloses a voice enhancement method and device, computer equipment, and a storage medium, relating to the technical field of artificial intelligence. Its main aim is to automatically select, from a pre-constructed voice enhancement parameter set, the voice enhancement parameters matched to the surrounding environment, so that voice recognition accuracy is maximized after the voice data to be recognized is enhanced with those parameters. The method comprises the following steps: acquiring voice data to be processed; extracting a first voice feature corresponding to the voice data, determining the target environment of the voice data according to the first voice feature, and selecting the target voice enhancement parameter corresponding to that environment from a pre-constructed voice enhancement parameter set; and performing voice enhancement on the voice data according to the target voice enhancement parameter to obtain the enhanced voice data. The invention is mainly suitable for voice enhancement processing of voice data.

Description

Voice enhancement method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technology, and in particular, to a speech enhancement method, apparatus, computer device, and storage medium.
Background
In recent years, with the rapid development and rise of intelligent wearable devices, voice-controlled consumer electronics have become the latest trend. Voice intelligence requires a highly reliable, highly accurate automatic voice recognition system as its support, and front-end voice enhancement technology is the most critical link in that system.
At present, when noise is handled with front-end voice enhancement technology, the parameters of the voice enhancement module are generally tuned according to the surrounding environment and expert experience, in order to achieve a better recognition result. However, tuning voice enhancement parameters by expert experience can only adapt to the surrounding environment to a limited extent; it improves recognition somewhat, but cannot guarantee that voice recognition accuracy is maximized.
Disclosure of Invention
The invention provides a voice enhancement method and device, computer equipment, and a storage medium. Their main aim is to automatically select, from a pre-constructed voice enhancement parameter set, the voice enhancement parameters matched to the surrounding environment, so that after the voice data to be recognized is enhanced with those parameters, recognition accuracy is maximized and the best voice recognition result can be achieved in any environment.
According to a first aspect of the present invention, there is provided a speech enhancement method comprising:
acquiring voice data to be processed;
extracting a first voice feature corresponding to the voice data, determining a target environment where the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters under different environments, and the voice enhancement parameters are used for enhancing the voice recognition accuracy under different environments;
and performing voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain voice data after the voice enhancement processing.
According to a second aspect of the present invention, there is provided a speech enhancement apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voice data to be processed;
the selecting unit is used for extracting a first voice feature corresponding to the voice data, determining a target environment where the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters under different environments, and the voice enhancement parameters are used for enhancing the voice recognition accuracy under different environments;
and the processing unit is used for carrying out voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain the voice data after the voice enhancement processing.
According to a third aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring voice data to be processed;
extracting a first voice feature corresponding to the voice data, determining a target environment where the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters under different environments, and the voice enhancement parameters are used for enhancing the voice recognition accuracy under different environments;
and performing voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain voice data after the voice enhancement processing.
According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the program:
acquiring voice data to be processed;
extracting a first voice feature corresponding to the voice data, determining a target environment where the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters under different environments, and the voice enhancement parameters are used for enhancing the voice recognition accuracy under different environments;
and performing voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain voice data after the voice enhancement processing.
Compared with the current practice of adjusting voice enhancement module parameters by expert experience, the voice enhancement method and device, computer equipment, and storage medium provided here acquire the voice data to be processed; extract a first voice feature corresponding to the voice data, determine the target environment of the voice data according to the first voice feature, and select the target voice enhancement parameter corresponding to that environment from a pre-constructed voice enhancement parameter set, where the set contains voice enhancement parameters for different environments and the parameters are used to maximize recognition accuracy in those environments; and then perform voice enhancement on the voice data according to the target parameter to obtain the enhanced voice data. Because the target environment of the voice data to be processed is determined first, the matching target voice enhancement parameter can be selected automatically from the parameter set and used for enhancement, so the voice enhancement effect in the target environment is improved while recognition accuracy in that environment is maximized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of a speech enhancement method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speech enhancement method provided by an embodiment of the present invention;
fig. 3 is a schematic structural diagram illustrating a speech enhancement apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of another speech enhancement apparatus provided in an embodiment of the present invention;
fig. 5 shows a physical structure diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
At present, when noise is handled with front-end voice enhancement technology, the parameters of the voice enhancement module are generally tuned according to the surrounding environment and expert experience, in order to achieve a better recognition result. However, tuning voice enhancement parameters by expert experience can only adapt to the surrounding environment to a limited extent; it improves recognition somewhat, but cannot guarantee that voice recognition accuracy is maximized.
In order to solve the above problem, an embodiment of the present invention provides a voice enhancement method, as shown in fig. 1, the method including:
101. and acquiring voice data to be processed.
For the embodiment of the present invention, in order to overcome the prior-art defect of adjusting voice enhancement parameters by expert experience, a voice enhancement parameter set is pre-constructed, and the matched voice enhancement parameters are selected automatically from it according to the target environment of the voice data to be processed. In this way the voice enhancement effect can be improved in any environment, and voice recognition accuracy can be maximized. The embodiment is suitable for voice enhancement of voice data; its execution subject is a device or piece of equipment capable of performing voice enhancement on voice data, which may be deployed on the client side or the server side.
Specifically, a segment of a user's voice data in a certain scene is acquired. Before voice enhancement is performed, the voice data must be preprocessed, specifically by pre-emphasis, framing, and windowing, to obtain the preprocessed voice data.
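The pre-emphasis, framing, and windowing steps described above can be sketched as follows. This is a minimal numpy illustration, not the patent's implementation; the frame length, hop size, and pre-emphasis coefficient (400 samples, 160 samples, 0.97) are typical values assumed here rather than taken from the text.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing, and windowing of a 1-D voice signal."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1] boosts high frequencies
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split into overlapping frames of frame_len samples
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: taper each frame with a Hamming window
    return frames * np.hamming(frame_len)

frames = preprocess(np.sin(np.arange(16000) / 16.0))  # one second at 16 kHz
```

With a 16000-sample input this yields 98 frames of 400 windowed samples each, ready for feature extraction.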
102. Extracting a first voice feature corresponding to the voice data, determining a target environment where the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set.
The voice enhancement parameter set contains voice enhancement parameters for different environments, and the parameters are used to maximize voice recognition accuracy in those environments. For the embodiment of the present invention, sample voice data collected in different environments is stored in a preset sample library. To determine the environment of each piece of sample voice data, the sample voice data is clustered, yielding sample voice data grouped by environment. The sample voice data of each environment is then used to train a voice enhancement model; that is, the initial voice enhancement parameters in the model are optimized and adjusted until the enhanced sample voice data, when input into a pre-constructed voice recognition model for recognition, reaches the highest recognition accuracy. In this way the voice enhancement parameters for the different environments are obtained and the voice enhancement parameter set is constructed. When voice data is captured in a certain environment, the voice enhancement parameter corresponding to that environment is used to enhance it, and the enhanced voice data is input into the pre-constructed voice recognition model, so that its recognition accuracy is maximized.
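The clustering of the sample voice data into environment groups is not specified further in the text; a plain k-means pass over utterance feature vectors is one natural reading. The sketch below assumes k-means and synthetic 13-dimensional features purely for illustration.

```python
import numpy as np

def kmeans(features, k, iters=50, seed=0):
    """Plain k-means over utterance feature vectors; each cluster plays
    the role of one recording environment."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign each feature vector to its nearest center
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (keep it if empty)
        centers = np.stack([features[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

# Toy "sample voice data": two well-separated 13-dimensional environments
rng = np.random.default_rng(1)
feats = np.vstack([rng.standard_normal((50, 13)) + 5.0,
                   rng.standard_normal((50, 13)) - 5.0])
labels, centers = kmeans(feats, k=2)
```

The returned `centers` are exactly the per-environment feature centers the method later compares against.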
For the embodiment of the present invention, before voice enhancement is performed, the target environment of the voice data to be processed must be determined. Specifically, the first voice feature corresponding to the voice data to be processed is extracted, and the second voice features corresponding to the sample voice data of the different cluster types (different environments) are extracted as well. The feature center of each cluster type is then calculated from its second voice features. Because voice data collected in the same environment has relatively similar voice features, the distances between the first voice feature and the different feature centers are calculated, the cluster type whose sample voice data the voice data to be processed belongs to is determined, and the target environment of the voice data to be processed is thereby obtained.
Further, the target voice enhancement parameter corresponding to the target environment is selected from the pre-constructed voice enhancement parameter set, voice enhancement is performed on the voice data with that parameter, and the enhanced voice data is input into the pre-constructed voice recognition model for recognition, so that the recognition accuracy of the voice data is maximized. In this way the target environment of the voice data to be processed is determined from its voice features, the matching voice enhancement parameter is selected automatically from the parameter set, and the voice data is enhanced accordingly, which improves the voice enhancement effect while maximizing the recognition accuracy of the enhanced voice data.
103. And performing voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain voice data after the voice enhancement processing.
For the embodiment of the invention, voice enhancement mainly refers to noise reduction of the noise in the voice data to be processed, and an LMS adaptive-filter denoising algorithm may be used. Specifically, a voice activity detection (VAD) algorithm first removes silence from the voice signal, yielding a suitable speech spectral feature sequence X = (x1, x2, …, xn). A multi-channel wiener filtering operation, including a beamforming step, then produces Y = (y1, y2, …, yn), and power spectral density (PSD) estimation is used to reduce the residual noise component and obtain the wiener filter input components (the first is given in the original only as an equation image) and Φ_V(ω, τ). A post-filter input parameter vector G_Wiener(ω, τ) is then obtained by wiener filter calculation, the post-filter yields the output signal Z(ω, τ) = G_Wiener(ω, τ) · Y, and after signal compression or expansion the enhanced voice data is obtained, in a form adapted to the input of the voice recognition model.
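The multi-channel wiener pipeline described above (VAD, beamforming, PSD estimation, post-filtering) cannot be reproduced compactly; the sketch below reduces it to a single-channel wiener gain per STFT bin, with the noise PSD estimated from the first few frames as a stand-in for a VAD-based estimate. All parameter values are assumptions, not the patent's.

```python
import numpy as np

def wiener_denoise(noisy, frame_len=512, hop=256, noise_frames=5):
    """Per-bin Wiener gain G = SNR/(1+SNR) over an STFT, with the noise
    PSD estimated from the first few frames (assumed speech-free)."""
    window = np.hanning(frame_len)
    n = 1 + (len(noisy) - frame_len) // hop
    spec = np.stack([np.fft.rfft(window * noisy[i*hop:i*hop+frame_len])
                     for i in range(n)])
    noise_psd = np.mean(np.abs(spec[:noise_frames]) ** 2, axis=0)
    # A-posteriori SNR minus one approximates the a-priori SNR
    snr = np.maximum(np.abs(spec) ** 2 / (noise_psd + 1e-12) - 1.0, 0.0)
    spec *= snr / (1.0 + snr)
    # Overlap-add resynthesis
    out = np.zeros(len(noisy))
    for i in range(n):
        out[i*hop:i*hop+frame_len] += window * np.fft.irfft(spec[i], n=frame_len)
    return out

# Toy signal: a 440 Hz tone starting at 0.25 s, buried in white noise
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
noisy = np.sin(2 * np.pi * 440 * t) * (t > 0.25) + 0.3 * rng.standard_normal(16000)
enhanced = wiener_denoise(noisy)
```

On the toy signal, the speech-free leading region is strongly attenuated while the tone is largely preserved, which is the qualitative behavior the wiener post-filter aims for.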
Compared with the current practice of adjusting voice enhancement module parameters by expert experience, the voice enhancement method provided by the embodiment of the invention acquires the voice data to be processed; extracts its first voice feature, determines the target environment of the voice data according to that feature, and selects the target voice enhancement parameter for that environment from a pre-constructed voice enhancement parameter set, where the set contains voice enhancement parameters that maximize recognition accuracy in different environments; and then performs voice enhancement on the voice data according to the target parameter to obtain the enhanced voice data. Because the target environment is determined first, the matching target voice enhancement parameter can be selected automatically from the parameter set and used for enhancement, improving the voice enhancement effect in the target environment while maximizing recognition accuracy there.
Further, in order to better explain the above process of performing speech enhancement processing on speech data, as a refinement and extension to the above embodiment, another speech enhancement method is provided in an embodiment of the present invention, as shown in fig. 2, and the method includes:
201. and acquiring voice data to be processed.
For the embodiment of the present invention, in order to automatically select the voice enhancement parameters matched to the environment of the voice data to be processed, and to maximize the recognition accuracy of the voice data, the voice enhancement parameters for the different environments must be constructed in advance. To that end the method includes: performing voice enhancement on the sample voice data of the different environments with initial voice enhancement parameters, obtaining enhanced sample voice data for each environment; constructing the voice recognition accuracy functions for the different environments from the sample voice data; and optimizing and adjusting the initial voice enhancement parameters according to the accuracy functions to obtain the voice enhancement parameters for the different environments, from which the voice enhancement parameter set is constructed. Further, constructing the accuracy functions from the sample voice data includes: performing voice recognition on the enhanced sample voice data with a pre-constructed voice recognition model to obtain the recognition results for the different environments, and constructing the accuracy function for each environment from its recognition results. The pre-constructed voice recognition model may be a neural network voice recognition model.
For example, an initial voice enhancement parameter is given, and it is used to enhance sample voice data recorded in a factory environment, yielding enhanced factory sample data. That data is input into a pre-constructed voice recognition model, producing the voice recognition results for the factory samples, and a voice recognition accuracy function for the factory environment is constructed from those results. The function is then solved for the parameter that maximizes recognition accuracy; when searching for the optimal solution, a genetic algorithm may be used to find the voice enhancement parameters for the different environments. The specific formula is as follows:
θ_i = argmax_θ T(θ)

where T(θ) is the voice recognition accuracy in the factory environment and θ_i is the voice enhancement parameter for the factory environment. By continuously optimizing and adjusting the initial voice enhancement parameter, the parameter θ_i that maximizes recognition accuracy in the factory environment is obtained. Applying the same procedure to each environment yields the voice enhancement parameters for the different environments, from which the voice enhancement parameter set {θ_i} is constructed, so that recognition accuracy in each environment is maximized.
For the embodiment of the invention, after the voice enhancement parameter set is constructed, the voice data to be processed can be obtained, and the corresponding voice enhancement parameters are selected from the voice enhancement parameter set for voice enhancement processing by determining the target environment of the voice data to be processed.
202. Extracting a first voice feature corresponding to the voice data, determining a target environment where the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set.
The voice enhancement parameter set contains voice enhancement parameters for different environments, and the parameters are used to maximize voice recognition accuracy in those environments. For the embodiment of the present invention, in order to determine the target environment of the voice data to be processed, step 202 specifically includes: acquiring sample voice data from different environments and extracting the second voice features corresponding to it; calculating, from the second voice features, the feature center corresponding to the sample voice data of each environment; and determining the target environment of the voice data according to the feature centers and the first voice feature. Further, determining the target environment according to the feature centers and the first voice feature includes: calculating the Euclidean distance between the first voice feature and each feature center with a preset Euclidean distance algorithm, screening out the minimum Euclidean distance, and determining the environment of the sample voice data corresponding to that minimum distance as the target environment. When extracting the voice features corresponding to the voice data to be processed and to the sample voice data, a preset mel cepstrum algorithm may be used to compute the mel cepstrum coefficients of each, and the computed coefficients are taken as their respective voice features.
For example, suppose the feature center of the sample voice data from the street environment is A, that of the factory environment is B, and that of the airport environment is C. Because voice data from the same environment has relatively similar voice features, the Euclidean distances between the first voice feature of the voice data to be processed and the feature centers A, B, and C are calculated, and the minimum distance is selected. If the distance to feature center B is the smallest, the voice data to be processed is most similar to the factory samples, so it is determined to be in the factory environment. The target environment of the voice data to be processed can thus be determined in this manner.
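The nearest-feature-center decision in this example can be sketched directly. The environment names and two-dimensional centers below are hypothetical stand-ins for the computed mel-cepstral feature centers A, B, and C.

```python
import numpy as np

def nearest_environment(first_feature, centers):
    """Return the environment whose feature center has the smallest
    Euclidean distance to the utterance's feature vector."""
    names = list(centers)
    dists = [np.linalg.norm(first_feature - centers[name]) for name in names]
    return names[int(np.argmin(dists))]

# Hypothetical 2-D feature centers standing in for A (street), B (factory),
# C (airport); real centers would be means of mel-cepstral feature vectors.
centers = {"street": np.array([0.0, 0.0]),
           "factory": np.array([5.0, 5.0]),
           "airport": np.array([-5.0, 5.0])}
env = nearest_environment(np.array([4.2, 4.8]), centers)
```

Here the utterance's feature lies closest to center B, so the factory environment (and its enhancement parameter) would be selected.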
203. And performing voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain voice data after the voice enhancement processing.
For the embodiment of the present invention, in order to perform speech enhancement processing on speech data, step 203 specifically includes: and according to the target filtering noise reduction parameter, carrying out filtering noise reduction processing on the voice data to obtain noise-reduced voice data. Specifically, the method of performing noise reduction processing on the speech data by using the target filtering noise reduction parameter is completely the same as that in step 103, and is not described herein again.
204. And performing feature extraction on the voice data after the voice enhancement processing to obtain a third voice feature corresponding to the voice data, and determining a voice recognition result corresponding to the voice data according to the third voice feature.
For the embodiment of the present invention, after voice enhancement, voice recognition must further be performed on the enhanced voice data. Specifically, a pre-established voice recognition model, which may be a neural network voice recognition model, is used: the enhanced voice data is input into the model, a hidden layer of the model extracts the third voice feature corresponding to the voice data, and recognition is performed according to that feature, yielding the voice recognition result corresponding to the voice data. At this point the accuracy of the recognition result is maximized.
Compared with the current practice of adjusting voice enhancement module parameters by expert experience, the other voice enhancement method provided by the embodiment of the invention acquires the voice data to be processed; extracts its first voice feature, determines the target environment of the voice data according to that feature, and selects the target voice enhancement parameter for that environment from a pre-constructed voice enhancement parameter set, where the set contains voice enhancement parameters that maximize recognition accuracy in different environments; and then performs voice enhancement on the voice data according to the target parameter to obtain the enhanced voice data. Because the target environment is determined first, the matching target voice enhancement parameter can be selected automatically from the parameter set and used for enhancement, improving the voice enhancement effect in the target environment while maximizing recognition accuracy there.
Further, as a specific implementation of fig. 1, an embodiment of the present invention provides a speech enhancement apparatus, as shown in fig. 3, the apparatus includes: an acquisition unit 31, a selection unit 32 and a processing unit 33.
The acquiring unit 31 may be configured to acquire voice data to be processed. The acquiring unit 31 is a main functional module in the present apparatus for acquiring voice data to be processed.
The selecting unit 32 may be configured to extract a first voice feature corresponding to the voice data, determine a target environment in which the voice data is located according to the first voice feature, and select a target voice enhancement parameter corresponding to the target environment from a pre-established voice enhancement parameter set, where the voice enhancement parameter set includes voice enhancement parameters in different environments, and the voice enhancement parameters are used to maximize voice recognition accuracy in different environments. The selecting unit 32 is the main functional module, and also the core module, of the apparatus for extracting the first voice feature, determining the target environment, and selecting the target voice enhancement parameter.
The processing unit 33 may be configured to perform speech enhancement processing on the speech data according to the target speech enhancement parameter, so as to obtain speech data after the speech enhancement processing. The processing unit 33 is a main functional module of the device that performs speech enhancement processing on the speech data according to the target speech enhancement parameter to obtain speech data after the speech enhancement processing.
Further, in order to determine the target environment where the voice data is located, as shown in fig. 4, the selecting unit 32 includes an extracting module 321, a calculating module 322, and a determining module 323.
The extracting module 321 may be configured to obtain sample voice data in different environments, and extract a second voice feature corresponding to the sample voice data.
The calculating module 322 may be configured to calculate, according to the second speech feature, a feature center corresponding to the sample speech data in the different environments.
The determining module 323 may be configured to determine a target environment in which the voice data is located according to the feature center and the first voice feature.
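A minimal sketch of the extracting and calculating modules follows. The exact features and environments are unspecified in the patent; the element-wise mean is assumed here as one natural choice of feature center.

```python
import numpy as np

def feature_centers(samples):
    """For each environment, reduce the second voice features of its
    sample voice data to a single center: the element-wise mean."""
    return {env: feats.mean(axis=0) for env, feats in samples.items()}

samples = {  # two toy feature vectors per environment
    "office": np.array([[0.1, 0.2], [0.3, 0.2]]),
    "street": np.array([[0.8, 0.9], [1.0, 0.7]]),
}
centers = feature_centers(samples)
# centers["office"] -> [0.2, 0.2]; centers["street"] -> [0.9, 0.8]
```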
Further, in order to determine the target environment where the voice data is located, the determining module 323 includes: a calculation submodule and a determination submodule.
The calculating submodule can be configured to calculate the Euclidean distances between the first speech feature and different feature centers by using a preset Euclidean distance algorithm.
The determining submodule may be configured to screen a minimum euclidean distance from the calculated euclidean distances, and determine an environment where the sample voice data corresponding to the minimum euclidean distance is located as the target environment.
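The two submodules amount to a nearest-center rule, sketched below with hypothetical feature values; any standard Euclidean-distance implementation would serve.

```python
import numpy as np

def nearest_environment(first_feature, centers):
    """Compute the Euclidean distance from the first voice feature to
    each feature center and screen out the minimum."""
    distances = {env: float(np.linalg.norm(first_feature - c))
                 for env, c in centers.items()}
    return min(distances, key=distances.get)

centers = {"office": np.array([0.2, 0.2]), "street": np.array([0.9, 0.8])}
target_env = nearest_environment(np.array([0.85, 0.75]), centers)
# target_env == "street": that center lies closest to the feature.
```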
Further, to construct a set of speech enhancement parameters, the apparatus further comprises: a unit 34 is constructed.
The processing unit 33 may further be configured to perform speech enhancement processing on the sample speech data in different environments by using the initial speech enhancement parameter, so as to obtain sample speech data after the speech enhancement processing in different environments.
The constructing unit 34 may be configured to construct speech recognition accuracy functions under different environments according to the sample speech data.
The constructing unit 34 may be further configured to optimize and adjust the initial speech enhancement parameter according to the accuracy function to obtain speech enhancement parameters under different environments, and construct the speech enhancement parameter set based on the speech enhancement parameters under different environments.
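One way to read the optimization step is a per-environment search over candidate parameter values for the maximizer of that environment's accuracy function. The quadratic accuracy curves below are purely illustrative; the patent does not specify the optimizer.

```python
def optimize_params(candidates, accuracy_fns):
    """For each environment, keep the candidate enhancement parameter
    that maximizes that environment's recognition-accuracy function."""
    return {env: max(candidates, key=fn) for env, fn in accuracy_fns.items()}

accuracy_fns = {  # illustrative accuracy-vs-parameter curves
    "office": lambda a: 1.0 - (a - 1.5) ** 2,  # peaks at 1.5
    "street": lambda a: 1.0 - (a - 3.0) ** 2,  # peaks at 3.0
}
candidates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
param_set = optimize_params(candidates, accuracy_fns)
# param_set == {"office": 1.5, "street": 3.0}
```

The resulting dictionary is the "voice enhancement parameter set": one tuned parameter per environment.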
Further, in order to construct the speech recognition accuracy function under different environments, the constructing unit 34 includes: an identification module 341 and a construction module 342.
The recognition module 341 may be configured to perform speech recognition on the sample speech data after the speech enhancement processing by using a pre-established speech recognition model, so as to obtain speech recognition results in different environments.
The constructing module 342 may be configured to construct a speech recognition accuracy function under different environments according to the speech recognition results under different environments.
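A minimal accuracy function over the recognition results might be the fraction of sample utterances recognized exactly; a production system would more likely score word error rate, so exact match is a deliberate simplification here.

```python
def recognition_accuracy(hypotheses, references):
    """Fraction of sample utterances whose recognized text matches the
    reference transcript exactly."""
    correct = sum(h == r for h, r in zip(hypotheses, references))
    return correct / len(references)

acc = recognition_accuracy(["turn on", "stop", "play"],
                           ["turn on", "stop", "pause"])
# acc == 2/3: two of three utterances match their references.
```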
Further, in order to perform voice recognition on voice data, the apparatus further includes: an extraction unit 35 and a determination unit 36.
The extracting unit 35 may be configured to perform feature extraction on the voice data after the voice enhancement processing, so as to obtain a third voice feature corresponding to the voice data.
The determining unit 36 may be configured to determine a speech recognition result corresponding to the speech data according to the third speech feature.
Further, in order to perform speech enhancement processing on the speech data, the processing unit 33 may be specifically configured to perform filtering and noise reduction processing on the speech data according to the target filtering and noise reduction parameter, so as to obtain noise-reduced speech data.
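As one hedged example of parameterized filtering and noise reduction: the patent does not name a particular filter, but spectral subtraction with an environment-specific over-subtraction factor is a common choice that fits the role of the target filtering noise-reduction parameter.

```python
import numpy as np

def spectral_subtraction(magnitudes, noise_estimate, factor, floor=0.01):
    """Subtract a scaled noise-magnitude estimate from each frame;
    `factor` plays the role of the target filtering noise-reduction
    parameter, and `floor` prevents negative magnitudes."""
    cleaned = magnitudes - factor * noise_estimate
    return np.maximum(cleaned, floor * magnitudes)

frames = np.array([[1.0, 0.5], [0.8, 0.4]])  # toy spectral magnitudes
noise = np.array([0.2, 0.1])                 # per-bin noise estimate
denoised = spectral_subtraction(frames, noise, factor=2.0)
# denoised == [[0.6, 0.3], [0.4, 0.2]]
```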
It should be noted that other corresponding descriptions of the functional modules related to the speech enhancement device provided in the embodiment of the present invention may refer to the corresponding description of the method shown in fig. 1, and are not described herein again.
Based on the method shown in fig. 1, correspondingly, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps: acquiring voice data to be processed; extracting a first voice feature corresponding to the voice data, determining a target environment where the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters under different environments, and the voice enhancement parameters are used for enhancing the voice recognition accuracy under different environments; and performing voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain voice data after the voice enhancement processing.
Based on the above embodiments of the method shown in fig. 1 and the apparatus shown in fig. 3, an embodiment of the present invention further provides an entity structure diagram of a computer device, as shown in fig. 5, where the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor, wherein the memory 42 and the processor 41 are both arranged on a bus 43 such that when the processor 41 executes the program, the following steps are performed: acquiring voice data to be processed; extracting a first voice feature corresponding to the voice data, determining a target environment where the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters under different environments, and the voice enhancement parameters are used for enhancing the voice recognition accuracy under different environments; and performing voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain voice data after the voice enhancement processing.
Through the above technical solution, the voice processing method and apparatus can acquire the voice data to be processed; extract a first voice feature corresponding to the voice data, determine the target environment in which the voice data is located according to the first voice feature, and select a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, where the voice enhancement parameter set comprises voice enhancement parameters for different environments and the parameters are used to maximize voice recognition accuracy in those environments; and then perform voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain the enhanced voice data. Because the target environment of the voice data to be processed is determined, the target voice enhancement parameter corresponding to that environment can be selected automatically from the voice enhancement parameter set and used to enhance the voice data, which improves the voice enhancement effect in the target environment while ensuring the highest voice recognition accuracy in that environment.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. Alternatively, they may be separately fabricated as individual integrated circuit modules, or multiple of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of speech enhancement, comprising:
acquiring voice data to be processed;
extracting a first voice feature corresponding to the voice data, determining a target environment where the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters under different environments, and the voice enhancement parameters are used for enhancing the voice recognition accuracy under different environments;
and performing voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain voice data after the voice enhancement processing.
2. The method of claim 1, wherein determining the target environment in which the voice data is located based on the first voice feature comprises:
acquiring sample voice data under different environments, and extracting second voice features corresponding to the sample voice data;
calculating a feature center corresponding to the sample voice data in different environments according to the second voice feature;
and determining the target environment in which the voice data is positioned according to the feature center and the first voice feature.
3. The method of claim 2, wherein determining the target environment in which the speech data is located based on the feature center and the first speech feature comprises:
calculating Euclidean distances between the first voice feature and different feature centers by using a preset Euclidean distance algorithm;
and screening out the minimum Euclidean distance from the calculated Euclidean distances, and determining the environment where the sample voice data corresponding to the minimum Euclidean distance is located as the target environment.
4. The method of claim 1, wherein prior to said obtaining voice data to be processed, the method comprises:
carrying out voice enhancement processing on the sample voice data under different environments by using the initial voice enhancement parameters to obtain sample voice data subjected to voice enhancement processing under different environments;
constructing voice recognition accuracy functions under different environments according to the sample voice data;
and optimizing and adjusting the initial voice enhancement parameters according to the accuracy function to obtain voice enhancement parameters under different environments, and constructing the voice enhancement parameter set based on the voice enhancement parameters under the different environments.
5. The method of claim 4, wherein constructing the speech recognition accuracy function for different environments from the sample speech data comprises:
carrying out voice recognition on the sample voice data subjected to the voice enhancement processing by utilizing a pre-constructed voice recognition model to obtain voice recognition results under different environments;
and constructing a voice recognition accuracy function under different environments according to the voice recognition results under different environments.
6. The method according to claim 1, wherein after performing the speech enhancement processing on the speech data according to the target speech enhancement parameter to obtain speech-enhanced speech data, the method further comprises:
performing feature extraction on the voice data after the voice enhancement processing to obtain a third voice feature corresponding to the voice data;
and determining a voice recognition result corresponding to the voice data according to the third voice characteristic.
7. The method according to any one of claims 1 to 6, wherein the target speech enhancement parameter is a target filtering noise reduction parameter, and performing speech enhancement processing on the speech data according to the target speech enhancement parameter to obtain speech data after speech enhancement processing comprises:
and according to the target filtering noise reduction parameter, carrying out filtering noise reduction processing on the voice data to obtain noise-reduced voice data.
8. A speech enhancement apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voice data to be processed;
the selecting unit is used for extracting a first voice feature corresponding to the voice data, determining a target environment where the voice data is located according to the first voice feature, and selecting a target voice enhancement parameter corresponding to the target environment from a pre-constructed voice enhancement parameter set, wherein the voice enhancement parameter set comprises voice enhancement parameters under different environments, and the voice enhancement parameters are used for enhancing the voice recognition accuracy under different environments;
and the processing unit is used for carrying out voice enhancement processing on the voice data according to the target voice enhancement parameter to obtain the voice data after the voice enhancement processing.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
10. A computer arrangement comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by the processor.
CN202011153521.2A 2020-10-26 2020-10-26 Speech enhancement method, device, computer equipment and storage medium Active CN112151052B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011153521.2A CN112151052B (en) 2020-10-26 2020-10-26 Speech enhancement method, device, computer equipment and storage medium
PCT/CN2020/136364 WO2021189979A1 (en) 2020-10-26 2020-12-15 Speech enhancement method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011153521.2A CN112151052B (en) 2020-10-26 2020-10-26 Speech enhancement method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112151052A true CN112151052A (en) 2020-12-29
CN112151052B CN112151052B (en) 2024-06-25

Family

ID=73955013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011153521.2A Active CN112151052B (en) 2020-10-26 2020-10-26 Speech enhancement method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112151052B (en)
WO (1) WO2021189979A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539262A (en) * 2021-07-09 2021-10-22 广东金鸿星智能科技有限公司 Sound enhancement and recording method and system for voice control of electrically operated gate

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114512136B (en) * 2022-03-18 2023-09-26 北京百度网讯科技有限公司 Model training method, audio processing method, device, equipment, storage medium and program

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013037177A (en) * 2011-08-08 2013-02-21 Nippon Telegr & Teleph Corp <Ntt> Speech enhancement device, and method and program thereof
CN103456305A (en) * 2013-09-16 2013-12-18 东莞宇龙通信科技有限公司 Terminal and speech processing method based on multiple sound collecting units
CN104575509A (en) * 2014-12-29 2015-04-29 乐视致新电子科技(天津)有限公司 Voice enhancement processing method and device
KR20190037867A (en) * 2017-09-29 2019-04-08 주식회사 케이티 Device, method and computer program for removing noise from noisy speech data
CN110503974A (en) * 2019-08-29 2019-11-26 泰康保险集团股份有限公司 Fight audio recognition method, device, equipment and computer readable storage medium
CN111698629A (en) * 2019-03-15 2020-09-22 北京小鸟听听科技有限公司 Calibration method and apparatus for audio playback device, and computer storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8082148B2 (en) * 2008-04-24 2011-12-20 Nuance Communications, Inc. Testing a grammar used in speech recognition for reliability in a plurality of operating environments having different background noise
CN101593522B (en) * 2009-07-08 2011-09-14 清华大学 Method and equipment for full frequency domain digital hearing aid
CN101710490B (en) * 2009-11-20 2012-01-04 安徽科大讯飞信息科技股份有限公司 Method and device for compensating noise for voice assessment
CN110473568B (en) * 2019-08-08 2022-01-07 Oppo广东移动通信有限公司 Scene recognition method and device, storage medium and electronic equipment
CN110648680B (en) * 2019-09-23 2024-05-14 腾讯科技(深圳)有限公司 Voice data processing method and device, electronic equipment and readable storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539262A (en) * 2021-07-09 2021-10-22 广东金鸿星智能科技有限公司 Sound enhancement and recording method and system for voice control of electrically operated gate
CN113539262B (en) * 2021-07-09 2023-08-22 广东金鸿星智能科技有限公司 Sound enhancement and recording method and system for voice control of electric door

Also Published As

Publication number Publication date
CN112151052B (en) 2024-06-25
WO2021189979A1 (en) 2021-09-30

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN108281146B (en) Short voice speaker identification method and device
CN109599109B (en) Confrontation audio generation method and system for white-box scene
Zhou et al. A compact representation of visual speech data using latent variables
CN108922544B (en) Universal vector training method, voice clustering method, device, equipment and medium
WO2018005858A1 (en) Speech recognition
CN110556103A (en) Audio signal processing method, apparatus, system, device and storage medium
CN108597505B (en) Voice recognition method and device and terminal equipment
CN109427328B (en) Multichannel voice recognition method based on filter network acoustic model
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN111242005B Heart sound classification method based on improved wolf's swarm optimization support vector machine
CN112151052B (en) Speech enhancement method, device, computer equipment and storage medium
CN110211599A (en) Using awakening method, device, storage medium and electronic equipment
CN109147798B (en) Speech recognition method, device, electronic equipment and readable storage medium
CN113205803A (en) Voice recognition method and device with adaptive noise reduction capability
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN111489763A (en) Adaptive method for speaker recognition in complex environment based on GMM model
CN113077779A (en) Noise reduction method and device, electronic equipment and storage medium
CN117496998A (en) Audio classification method, device and storage medium
CN112489678B (en) Scene recognition method and device based on channel characteristics
CN114220430A (en) Multi-sound-zone voice interaction method, device, equipment and storage medium
CN114023336A (en) Model training method, device, equipment and storage medium
CN114495903A (en) Language category identification method and device, electronic equipment and storage medium
CN112201270B (en) Voice noise processing method and device, computer equipment and storage medium
Kumar et al. Improving the performance of speech recognition feature selection using northern goshawk optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant