CN115101082A

CN115101082A - Speech enhancement method, apparatus, device, storage medium and program product

Info

Publication number: CN115101082A
Application number: CN202210639667.0A
Authority: CN
Inventors: 梁俊斌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-06-07
Filing date: 2022-06-07
Publication date: 2022-09-23

Abstract

The application discloses a voice enhancement method, a device, equipment, a storage medium and a program product, and relates to the field of voice technical processing. The method comprises the following steps: performing frequency band segmentation on a target audio to be subjected to voice enhancement along a frequency domain dimension to obtain at least two sub-frequency bands; acquiring sub-band energy data respectively corresponding to at least two sub-bands; analyzing the sub-band energy data respectively corresponding to the at least two sub-bands along a time domain dimension to obtain sub-band energy distribution data respectively corresponding to the at least two sub-bands; and when the sub-band energy distribution data corresponding to the specified sub-band meets the adjustment condition, adjusting the sub-band energy data of the specified sub-band to obtain the target enhanced audio. Through the method, the sub-band energy data which do not accord with the adjustment conditions in the target audio can be selectively adjusted, and the quality of voice enhancement is improved while the characteristics of the target audio are fully considered. The method and the device can be applied to various scenes such as cloud technology, artificial intelligence and intelligent traffic.

Description

Speech enhancement method, apparatus, device, storage medium and program product

Technical Field

The present invention relates to the field of speech technology processing, and in particular, to a method, an apparatus, a device, a storage medium, and a program product for speech enhancement.

Background

The speech is inevitably interfered by surrounding environment and noise inside the communication equipment in the transmission process, so that the original speech with purity as much as possible needs to be extracted from a noisy speech signal through a speech enhancement technology, and the speech enhancement technology plays an important role in the fields of speech processing, speech recognition, speech detection and the like.

In the related art, the speech signal is usually processed by methods such as noise suppression, echo cancellation, volume adjustment, etc., for example: and (3) restraining noise components in the voice signals by a deep learning method, and outputting the voice signals with enhanced signal-to-noise ratio.

Although the voice signal obtained by the above method reduces the noise interference to a certain extent, when the receiver receives the voice signal for a long time, it may produce hearing fatigue to the voice signal, which affects the subsequent processing process of the voice signal.

Disclosure of Invention

Embodiments of the present application provide a method, an apparatus, a device, a storage medium, and a program product for speech enhancement, which can selectively adjust subband energy data that does not meet adjustment conditions in a target audio, and improve speech enhancement quality while fully considering characteristics of the target audio. The technical scheme is as follows.

In one aspect, a method for speech enhancement is provided, the method comprising:

acquiring a target audio, wherein the target audio is audio data to be subjected to voice enhancement;

performing band segmentation on the target audio along a frequency domain dimension to obtain at least two sub-bands;

acquiring sub-band energy data corresponding to the at least two sub-bands respectively, wherein the sub-band energy data is used for indicating the frequency change condition of an audio frame in the target audio along the frequency domain dimension in the sub-bands;

analyzing the sub-band energy data respectively corresponding to the at least two sub-bands along a time domain dimension to obtain sub-band energy distribution data respectively corresponding to the at least two sub-bands, wherein the sub-band energy distribution data is used for indicating the frequency distribution condition of the target audio on the at least two sub-bands;

and determining an adjusting parameter based on the sub-band energy distribution data corresponding to the specified sub-band under the condition that the sub-band energy distribution data corresponding to the specified sub-band in the at least two sub-bands meets an adjusting condition, and adjusting the sub-band energy data of the specified sub-band to obtain the target enhanced audio.

In another aspect, an apparatus for speech enhancement is provided, the apparatus comprising:

the audio acquisition module is used for acquiring a target audio, wherein the target audio is audio data to be subjected to voice enhancement;

the frequency band segmentation module is used for carrying out frequency band segmentation on the target audio along the dimension of a frequency domain to obtain at least two sub-frequency bands;

a data obtaining module, configured to obtain sub-band energy data corresponding to the at least two sub-bands, where the sub-band energy data is used to indicate a frequency variation condition of an audio frame in the target audio along a frequency domain dimension in the sub-bands;

the data analysis module is configured to analyze the sub-band energy data corresponding to the at least two sub-bands along a time domain dimension to obtain sub-band energy distribution data corresponding to the at least two sub-bands, where the sub-band energy distribution data is used to indicate frequency distribution conditions of the target audio on the at least two sub-bands;

and the energy adjusting module is used for determining an adjusting parameter based on the sub-band energy distribution data corresponding to the specified sub-band under the condition that the sub-band energy distribution data corresponding to the specified sub-band in the at least two sub-bands meets an adjusting condition, and adjusting the sub-band energy data of the specified sub-band to obtain the target enhanced audio.

In another aspect, a computer device is provided, which comprises a processor and a memory, wherein at least one instruction, at least one program, set of codes, or set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the speech enhancement method according to any of the embodiments of the present application.

In another aspect, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement a speech enhancement method as described in any of the embodiments of the present application.

In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the speech enhancement method described in any of the above embodiments.

The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:

performing frequency band segmentation on a target audio to be subjected to voice enhancement along a frequency domain dimension to obtain at least two sub-bands, acquiring sub-band energy data and sub-band energy distribution data corresponding to different sub-bands, determining an adjustment parameter based on the sub-band energy distribution data corresponding to a specified sub-band when the sub-band energy distribution data corresponding to the specified sub-band meets the adjustment condition, and adjusting the sub-band energy data of the specified sub-band by using the adjustment parameter, thereby effectively avoiding performing no-destination adjustment on all voice signals of the target audio, selectively adjusting the sub-band energy data of the specified sub-band which does not meet the adjustment condition by judging whether the sub-band energy distribution data of the sub-band meets the adjustment condition, and further adjusting the sub-band energy data of the specified sub-band by using the adjustment parameter determined by the sub-band energy distribution data corresponding to the specified sub-band, and based on the adjusted sub-band energy data, selectively enhanced target enhanced audio is obtained, and the quality of voice enhancement is improved while the characteristics of the target audio are fully considered.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;

FIG. 2 is a flow chart of a method of speech enhancement provided by an exemplary embodiment of the present application;

FIG. 3 is a schematic diagram of band slicing provided by an exemplary embodiment of the present application;

FIG. 4 is a flow chart of a method of speech enhancement provided by another exemplary embodiment of the present application;

FIG. 5 is a schematic illustration of target audio provided by an exemplary embodiment of the present application;

FIG. 6 is a flow chart of a method of speech enhancement provided by yet another exemplary embodiment of the present application;

FIG. 7 is a diagram illustrating subband energy data scaling provided in an exemplary embodiment of the present application;

FIG. 8 is a process flow diagram of a method for speech enhancement provided by an exemplary embodiment of the present application;

FIG. 9 is a flow chart of a speech enhancement method provided by yet another exemplary embodiment of the present application;

FIG. 10 is a block diagram of a speech enhancement apparatus provided in an exemplary embodiment of the present application;

fig. 11 is a block diagram of a server according to an exemplary embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

In the related art, the speech signal is usually processed by methods such as noise suppression, echo cancellation, volume adjustment, etc., for example: and (3) suppressing noise components in the voice signals by a deep learning method, and outputting the voice signals with enhanced signal-to-noise ratio. Although the voice signal obtained by the method reduces the noise interference to a certain extent, when the receiver receives the voice signal for a long time, the receiver may generate hearing fatigue to the voice signal, and the subsequent processing process of the voice signal is affected.

The embodiment of the application provides a voice enhancement method, which can selectively adjust sub-band energy data which do not accord with adjustment conditions in target audio, and improves the quality of voice enhancement while fully considering the characteristics of the target audio. The voice enhancement method obtained by training of the application comprises a voice call enhancement scene, an audio enhancement scene and other scenes during application.

It should be noted that the above application scenarios are only illustrative examples, and the speech enhancement method provided in this embodiment may also be applied to other speech scenarios, which is not limited in this embodiment of the present application.

It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are all authorized by the user or sufficiently authorized by each party, and the collection, use, and processing of the relevant data is in compliance with relevant laws and regulations and standards in relevant countries and regions. For example, speech data, target audio, etc. referred to in this application are obtained with sufficient authorization.

Next, an implementation environment related to the embodiment of the present application is described, and please refer to fig. 1 schematically, in which a terminal 110 and a server 120 are related, and the terminal 110 and the server 120 are connected through a communication network 130.

In some embodiments, an application having an audio acquisition function is installed in the terminal 110. In some embodiments, the terminal 110 is configured to transmit the target audio to the server 120. The server 120 may perform voice enhancement on the target audio through the voice enhancement model 121, and feed the enhanced target enhanced audio back to the terminal 110 for playing.

The application process of the speech enhancement model 121 is as follows: performing band segmentation on the obtained target audio along the frequency domain dimension to obtain at least two sub-bands (sub-band 1, sub-band 2, and sub-band n …), then obtaining sub-band energy data (sub-band energy data 1, sub-band energy data 2, and sub-band energy data n …) corresponding to the at least two sub-bands respectively, wherein the sub-band energy data is indicative of a frequency variation of an audio frame in the target audio along a frequency domain dimension within a sub-band, then, sub-band energy distribution data (sub-band energy distribution data 1, sub-band energy distribution data 2, … sub-band energy distribution data n) corresponding to at least two sub-bands are obtained by analyzing the sub-band energy data corresponding to the at least two sub-bands along the time domain dimension, through the sub-band energy distribution data, the frequency distribution situation of the target audio on at least two sub-bands can be determined. And when the sub-band energy distribution data corresponding to the designated sub-band in the at least two sub-bands meet the adjustment condition, determining an adjustment parameter based on the sub-band energy distribution data corresponding to the designated sub-band. For example: if the sub-band energy distribution data 1 meets the adjustment condition, the sub-band 1 is a designated sub-band, an adjustment parameter corresponding to the sub-band 1 is determined based on the sub-band energy analysis data 1, the sub-band energy data (sub-band energy data 1) of the designated sub-band (sub-band 1) is adjusted by the adjustment parameter, so that adjusted sub-band energy data (sub-band energy data 1) corresponding to the designated sub-band (sub-band 1) is obtained, and the target enhanced audio is obtained according to the adjusted sub-band energy data. The above process is an example of a non-exclusive case of the application process of the speech enhancement model 121.

It should be noted that the above terminals include, but are not limited to, mobile terminals such as mobile phones, tablet computers, portable laptop computers, intelligent voice interaction devices, intelligent home appliances, and vehicle-mounted terminals, and can also be implemented as desktop computers; the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, domain name service, security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.

The Cloud technology (Cloud technology) is a hosting technology for unifying series resources such as hardware, application programs, networks and the like in a wide area network or a local area network to realize calculation, storage, processing and sharing of data. The cloud technology is based on the general names of network technology, information technology, integration technology, management platform technology, application technology and the like applied in the cloud computing business model, can form a resource pool, can be used as required, and is flexible and convenient.

In some embodiments, the servers described above may also be implemented as nodes in a blockchain system.

In conjunction with the above noun introduction and application scenario, the speech enhancement method provided in this application is described, taking the application of this method to a server as an example, as shown in fig. 2, this method includes the following steps 210 to 250.

Step 210, a target audio is obtained.

The target audio is audio data to be subjected to voice enhancement.

Illustratively, audio is used to indicate data having audio information, such as: a piece of music, a piece of voice message, etc. Optionally, the audio is acquired by using a terminal, a recorder and other devices which are internally or externally connected with the voice acquisition assembly. For example: acquiring audio by adopting a terminal provided with a microphone, a microphone array or a sound pickup; alternatively, the audio is synthesized using an audio synthesis application to obtain the audio, and so on. Optionally, the target audio is audio data obtained by the above-mentioned acquisition method or synthesis method.

In an alternative embodiment, the target audio is audio data obtained in real-time. Illustratively, in a scenario of a two-person call or a multi-person call, for example: telephone, Voice over Internet Protocol (VoIP), conference call, etc. And taking the voice generated in real time as target audio, namely, the target audio is audio data and the like in the communication process.

Optionally, the target audio to be subjected to voice enhancement is taken as a song, and the song comprises musical instrument audio data corresponding to multiple musical instrument sound sources and human voice audio data corresponding to human voice sound sources; or, the target audio to be subjected to voice enhancement is a multi-person real-time call audio, and the real-time call audio includes human voice audio data corresponding to a plurality of human voice sound sources and background sound audio data (such as audio data of environmental sounds, noises and the like) corresponding to background sound sources.

Step 220, performing band segmentation on the target audio along the frequency domain dimension to obtain at least two sub-bands.

The frequency domain dimension is used for describing the dimension condition of the target audio in frequency, and the target audio is analyzed in the frequency domain dimension, so that the oscillation information of the target audio in the frequency domain dimension can be provided.

Optionally, as shown in fig. 3, after the target audio is obtained, the frequency band corresponding to the target audio 310 is band-sliced along the frequency domain dimension, so as to obtain at least two sub-frequency bands 320.

Illustratively, when the input target audio 310 is band-split along the frequency domain dimension, the band corresponding to the target audio 310 is split into K sub-bands, and the dimension of each sub-band is F _k K is 1, … K, and satisfies

Optionally, K is a randomly generated number; alternatively, K is a preset number or the like. Illustratively, the target audio 310 is band-split in the same frequency bandwidth (dimension), and then the frequency bandwidths of K sub-bands are the same; or, the target audio 310 is divided in different frequency bandwidths, and the frequency bandwidths of K sub-bands are different, for example: the frequency bandwidth of the K sub-bands is sequentially increased in an increasing mode, the frequency bandwidth of the K sub-bands is randomly selected, and the like.

Step 230, obtaining sub-band energy data corresponding to at least two sub-bands respectively.

Illustratively, after performing band segmentation on the target audio to obtain at least two sub-bands, the sub-band energy data corresponding to the at least two sub-bands are respectively determined. Wherein the sub-band energy data is used to indicate a frequency variation of an audio frame in the target audio along a frequency domain dimension within the sub-band.

The audio frame is a measurement unit corresponding to the audio. Optionally, the target audio is framed according to a distribution of an audio signal corresponding to the target audio, an oscillation of audio data, and the like. For example: and taking a section of audio data with stable frequency change as an audio frame. Optionally, a fixed time interval is used as a time interval corresponding to one audio frame, such as: 32 milliseconds are used as one audio frame. Wherein, one audio frame takes 20-50 milliseconds.

Based on the audio frame division standard, frame division processing is carried out on the target audio, and the small section of audio data obtained by interception is used as an audio frame, so that a plurality of audio frames correspondingly included in the target audio are determined.

Optionally, when the sub-band energy data corresponding to the at least two sub-bands are obtained, the sub-band energy data corresponding to the at least two sub-bands in different audio frames are obtained by taking the audio frame as a unit, that is, the sub-band energy data corresponding to the at least two sub-bands in different audio frames are different.

In an alternative embodiment, at least two sub-bands under any one audio frame are analyzed.

Illustratively, a plurality of sampling points corresponding to the target audio are obtained after the target audio is sampled, and when the sub-band energy data of a certain sub-band is determined, the sub-band energy data corresponding to the sub-band is determined according to the energy data corresponding to the plurality of sampling points, for example: integrating energy data corresponding to the plurality of sampling points in the sub-band to obtain sub-band energy data corresponding to the sub-band; or, when determining the sub-band energy data of a certain sub-band, first, band-pass filtering is performed on a plurality of sampling points, and then, energy data corresponding to the plurality of filtered sampling points in the sub-band is integrated, so as to obtain sub-band energy data and the like corresponding to the sub-band.

Optionally, based on the method for determining subband energy data, subband energy data corresponding to a plurality of subbands in the same audio frame are determined, and subband energy data corresponding to a plurality of subbands in other audio frames are determined, so as to obtain subband energy data corresponding to different subbands in different audio frames.

And 240, analyzing the sub-band energy data corresponding to the at least two sub-bands along the time domain dimension to obtain sub-band energy distribution data corresponding to the at least two sub-bands.

Illustratively, the time domain dimension is a dimension condition in which a time scale is used to record a temporal change of the target audio, and optionally, the audio frame is used to describe the change condition of the time domain dimension, that is, a process of analyzing sub-band energy data corresponding to at least two sub-bands along the time domain dimension, that is, a process of analyzing sub-band energy data corresponding to at least two sub-bands under different audio frames.

Optionally, after obtaining the sub-band energy data corresponding to the at least two sub-bands, the frequency variation condition of the same sub-band in different audio frames is analyzed, and the sub-band energy distribution data corresponding to the at least two sub-bands is determined. For example: the target audio includes a subband a and a subband B, and in an audio frame 1, subband energy data a corresponding to the subband a and subband energy data B corresponding to the subband B are obtained, and in an audio frame 2, subband energy data a 'corresponding to the subband a and subband energy data B' corresponding to the subband B are obtained, where the audio frame 1 and the audio frame 2 are different audio frames distributed along a time domain dimension.

Schematically, taking the sub-band a as an example for analysis, determining sub-band energy distribution data corresponding to the sub-band a according to a frequency variation condition between the sub-band energy data a of the sub-band a in the audio frame 1 and the sub-band energy data a' of the sub-band a in the audio frame 2; or, taking the sub-band B as an example for analysis, determining the sub-band energy distribution data corresponding to the sub-band B according to the frequency variation between the sub-band energy data B of the sub-band B in the audio frame 1 and the sub-band energy data B' of the sub-band B in the audio frame 2. That is, the sub-band energy distribution data is used to indicate the frequency distribution of the target audio over at least two sub-bands.

Optionally, taking a certain sub-band as an example, since the sub-band energy distribution data is data obtained based on different audio frames in the sub-band, for a sub-band, the sub-band energy distribution data respectively corresponding to different audio frames is correspondingly included, and one sub-band energy distribution data represents a relationship between the sub-band energy data in the audio frame and the sub-band energy data in other audio frames, and the like.

And 250, determining an adjusting parameter based on the sub-band energy distribution data corresponding to the specified sub-band and adjusting the sub-band energy data of the specified sub-band to obtain the target enhanced audio under the condition that the sub-band energy distribution data corresponding to the specified sub-band in the at least two sub-bands meets the adjusting condition.

Optionally, the adjustment condition is a preset condition, for example: adjusting the condition to be a preset energy threshold value; alternatively, the adjustment condition is a condition determined in real time according to the plurality of sub-band energy distribution data, for example: the adjustment conditions are as follows: the sub-band energy distribution data of the specified sub-band is larger than the average value of the plurality of sub-band energy distribution data, and the like.

Illustratively, after the frequency domain dimension analysis and the time domain dimension analysis are performed on the target audio by the method, the sub-band energy distribution data corresponding to at least two sub-bands are determined. Optionally, the subband energy distribution data corresponding to the at least two subbands is compared with a preset adjustment condition, so as to determine whether to adjust the subband energy data corresponding to the at least two subbands. For example: and comparing the sub-band energy distribution data respectively corresponding to a certain sub-band under different audio frames with the adjustment condition.

In an optional embodiment, when there is subband energy distribution data corresponding to a certain subband that meets an adjustment condition, adjusting the subband energy data of the subband; or when the sub-band energy distribution data corresponding to a certain sub-band does not meet the adjustment condition, the sub-band energy data of the sub-band is not adjusted.

Illustratively, the sub-bands meeting the adjustment condition are taken as the designated sub-bands, that is, the designated sub-bands are used to indicate the sub-bands meeting the adjustment condition in at least two sub-bands. For example: sub-band energy distribution data corresponding to the sub-bands are specified to be larger than a preset energy threshold; or, the sub-band energy distribution data corresponding to the specified sub-band is less than 3 times of a preset energy threshold value, and the like.

Alternatively, the specific sub-band may be one sub-band of the at least two sub-bands or a plurality of sub-bands of the at least two sub-bands (for example, at least two sub-bands are both the specific sub-bands) based on the band characteristics of the specific sub-band (the sub-band energy distribution data of the specific sub-band meets the adjustment condition).

In an optional embodiment, when the subband energy data of the designated subband is adjusted, an adjustment parameter is determined by the subband energy distribution data corresponding to the designated subband, so that the subband energy data of the designated subband is adjusted by the adjustment parameter.

Illustratively, when acquiring sub-band energy distribution data corresponding to a specific sub-band, the sub-band energy distribution data corresponding to the specific sub-band in different audio frames is obtained based on a time domain relationship corresponding to the sub-band energy distribution data, for example: taking a specific subband as subband a as an example (energy distribution data corresponding to subband a meets an adjustment condition), when acquiring subband energy distribution data corresponding to subband a, subband energy distribution data 1 corresponding to subband a in audio frame 1 and subband energy distribution data 2 corresponding to subband a in audio frame 2 are obtained, and subband energy distribution data 1 and subband energy distribution data 2 are taken as subband energy distribution data corresponding to subband a.

Optionally, according to the variation condition of the sub-band energy distribution data 1 and the sub-band energy distribution data 2 corresponding to the sub-band a, determining an adjustment parameter corresponding to the sub-band a, and adjusting the sub-band energy data of the sub-band a by using the adjustment parameter.

Illustratively, according to the method for determining the adjustment parameter, the adjustment parameters corresponding to the different specified sub-frequency bands are determined, and the sub-band energy data of the corresponding specified sub-frequency band is adjusted by the adjustment parameters corresponding to the different specified sub-frequency bands, so as to obtain the adjusted sub-band energy data corresponding to the different specified sub-frequency bands.

In an alternative embodiment, the target enhanced audio is obtained based on the adjusted subband energy data corresponding to the specified subband.

Illustratively, taking any audio frame as an example for analysis, according to the relationship between the specified sub-band and at least two sub-bands under the audio frame, when obtaining the target enhanced audio according to the adjusted sub-band energy data, at least one of the following situations is included.

(1) Some of the at least two sub-bands are designated sub-bands.

Illustratively, when the designated sub-band is a part of sub-bands of at least two sub-bands, that is, the sub-band energy distribution data corresponding to the part of sub-bands meet the adjustment condition, the corresponding adjustment parameter is determined according to the sub-band energy distribution data corresponding to the part of sub-bands, and when the sub-band energy data of the designated sub-band is adjusted by the adjustment parameter, the sub-band energy data of the designated sub-band corresponding to the adjustment parameter is adjusted by different adjustment parameters, so that the sub-band energy data of the part of sub-bands which become the designated sub-band is adjusted. However, the subband energy distribution data of the remaining subbands except the designated subband in the at least two subbands do not meet the adjustment condition, so the remaining subbands do not need to determine the adjustment parameter and perform the adjustment process of the subband energy data.

Namely: for at least two sub-bands, adjusting sub-band energy data corresponding to part of sub-bands (designated sub-bands) to obtain adjusted sub-band energy data corresponding to different designated sub-bands; there is also no adjustment of the subband energy data corresponding to the remaining subbands (the subbands other than the designated subband among the at least two subbands).

Optionally, when obtaining the target enhanced audio based on the adjusted subband energy data, considering the band regions of the remaining unadjusted subbands in the target audio, comprehensively considering the adjusted subband energy data corresponding to the adjusted specified subband and the subband energy data corresponding to the remaining unadjusted subbands, thereby obtaining the enhanced target enhanced audio.

In an optional embodiment, the subband energy data of the designated subband is adjusted to obtain the adjusted subband energy data corresponding to the designated subband. The at least two sub-bands further include a candidate discarded sub-band without energy adjustment, that is, the candidate discarded sub-band is used to indicate a sub-band of the at least two sub-bands other than the designated sub-band.

Optionally, for an audio frame in the target audio, in response to that the frequency band ratio of the energy-adjusted specified sub-band in the at least two sub-bands exceeds a preset ratio threshold, and the candidate discarded sub-band is outside the frequency domain range corresponding to the human voice, the sub-band energy data of the energy-adjusted specified sub-band is retained, and the sub-band energy data of the candidate discarded sub-band which is not energy-adjusted is discarded, so as to obtain the target enhanced audio.

Illustratively, after determining that the frequency band proportion of the designated sub-frequency band in the at least two sub-frequency bands reaches the preset proportion threshold, determining whether the candidate discarding sub-frequency bands except the designated sub-frequency band in the at least two sub-frequency bands are outside a frequency domain range corresponding to the human voice, where the frequency domain range corresponding to the human voice is 20HZ to 20 KHZ.

When the candidate discarding sub-band is outside the frequency domain range corresponding to the voice, the candidate discarding sub-band is taken as the discarding sub-band, namely the candidate discarding sub-band outside the frequency domain range corresponding to the voice is discarded; when the candidate discarding sub-band is in the frequency domain range corresponding to the voice, the candidate discarding sub-band is not discarded, that is, the candidate discarding sub-band in the frequency domain range corresponding to the voice is reserved. For example: the preset proportion threshold value is 95%, when 95% of the at least two sub-bands are designated sub-bands, the designated sub-bands are used as candidate discarding sub-bands, whether the candidate discarding sub-bands are out of the frequency domain range corresponding to the human voice is judged, and therefore discarding operation or reserving operation performed on the candidate discarding sub-bands is determined.

Illustratively, when the candidate discarding sub-band is a sub-band outside the frequency domain range corresponding to the human voice, the candidate discarding sub-band is discarded, so as to determine the target enhanced audio based on the adjusted sub-band energy data corresponding to the specified sub-band; or, when the candidate discarding sub-band is a sub-band in the frequency domain range corresponding to the human voice, the candidate discarding sub-band is retained, so that the target enhanced audio and the like are determined based on the adjusted sub-band energy data corresponding to the specified sub-band and the sub-band energy data corresponding to the retained sub-band.

(2) All of the at least two subbands are designated subbands.

Illustratively, when all sub-bands in the at least two sub-bands are designated sub-bands, determining adjustment parameters corresponding to the at least two sub-bands according to sub-band energy distribution data corresponding to the at least two sub-bands, and adjusting the sub-band energy data of the corresponding sub-bands by using the adjustment parameters, thereby obtaining adjusted sub-band energy data corresponding to the at least two sub-bands. Optionally, the target enhanced audio is obtained based on the adjusted subband energy data corresponding to the at least two subbands respectively.

Schematically, an energy distribution curve is determined according to the adjusted sub-band energy data corresponding to at least two sub-bands, a sound signal is obtained by restoring according to the energy distribution curve, and a target enhanced audio and the like are determined through the sound signal obtained by restoring.

It should be noted that the above is only an illustrative example, and the embodiments of the present application are not limited thereto.

In summary, band segmentation is performed on a target audio to be subjected to voice enhancement along a frequency domain dimension to obtain at least two sub-bands, sub-band energy data and sub-band energy distribution data corresponding to different sub-bands are obtained, when there is sub-band energy distribution data corresponding to a specified sub-band that meets an adjustment condition, an adjustment parameter is determined based on the sub-band energy distribution data corresponding to the specified sub-band, and the sub-band energy data of the specified sub-band is adjusted by using the adjustment parameter, so as to effectively avoid performing unintended adjustment on all voice signals of the target audio, by judging whether the sub-band energy distribution data of the sub-band meets the adjustment condition, sub-band energy data corresponding to the specified sub-band that does not meet the adjustment condition is selectively adjusted, and further, the adjustment parameter determined by the sub-band energy distribution data corresponding to the specified sub-band is used, the sub-band energy data of the appointed sub-band is adjusted, the enhanced target enhanced audio is obtained based on the adjusted sub-band energy data, and the quality of voice enhancement is improved while the characteristics of the target audio are fully considered.

In an alternative embodiment, it is determined whether to adjust the sub-band energy data based on the sub-band energy long-term distribution data included in the sub-band energy distribution data. That is, the sub-band energy distribution data includes sub-band energy long-term distribution data indicating a change in sub-band energy data of two adjacent frames of audio frames. Illustratively, as shown in fig. 4, the embodiment shown in fig. 2 described above can also be implemented as the following steps 410 to 470.

Step 410, a target audio is obtained.

The target audio is audio data to be subjected to voice enhancement.

Illustratively, audio is used to indicate data having audio information, such as: a piece of music, a piece of voice message, etc. Optionally, a terminal, a recorder and other devices which are internally or externally connected with a voice acquisition assembly are adopted to acquire the target audio; or the target audio is audio data in the call process, and the like.

Step 420, performing band segmentation on the target audio along the frequency domain dimension to obtain at least two sub-bands.

Illustratively, according to the frequency distribution range corresponding to the target audio, the target audio is subjected to frequency band segmentation along the frequency domain dimension.

In an alternative embodiment, the target audio is band sliced with a fixed frequency bandwidth.

For example: the frequency distribution range corresponding to the target audio is 100HZ-399HZ, when the target audio is subjected to frequency band division, each frequency band width is 100HZ as a division standard, and 3 sub-frequency bands are obtained, wherein the sub-frequency bands comprise sub-frequency bands with the frequency band range of 100HZ-199HZ, sub-frequency bands with the frequency band range of 200HZ-299HZ and sub-frequency bands with the frequency band range of 300HZ-399 HZ.

In an alternative embodiment, the target audio is band sliced using a critical band division criterion.

Here, the critical frequency band is used to indicate a frequency bandwidth of an auditory filter generated due to a cochlear structure, that is, the critical frequency band is a sound frequency band in which the perceptibility of a first tone is disturbed by the auditory masking of a second tone, thereby dividing 24 critical frequency bands, and the 24 critical frequency bands are generally expressed in a sound domain (Bark domain) corresponding to 24 frequency points where the human ear structure can generate resonance. That is, the critical band division standard is a preset band division standard.

For example: the frequency distribution range corresponding to the target audio is 20HZ to 400HZ, and when the target audio is band-split, 4 sub-bands are obtained according to the division standard of the critical band, including a sub-band (critical band 1) having a band range of 20HZ to 100HZ (band bandwidth of 80, band center of 50), a sub-band (critical band 2) having a band range of 100HZ to 200HZ (band bandwidth of 100, band center of 150), a sub-band (critical band 3) having a band range of 200HZ to 300HZ (band bandwidth of 100, band center of 250), and a sub-band (critical band 4) having a band range of 300HZ to 400HZ (band bandwidth of 100, band center of 350).

It should be noted that the above are merely exemplary, and the embodiments of the present application are not limited thereto.

Step 430, obtaining sub-band energy data corresponding to at least two sub-bands respectively.

Wherein the sub-band energy data is used to indicate a frequency variation of an audio frame in the target audio along a frequency domain dimension within the sub-band.

Schematically, sampling a target audio to obtain a plurality of sampling points corresponding to the target audio, and when determining sub-band energy data corresponding to at least two sub-bands, integrating the energy data corresponding to the plurality of sampling points, so as to determine the sub-band energy data corresponding to the sub-bands; or, band-pass filtering is performed on a plurality of sampling points, and then energy data corresponding to a plurality of filtered sampling points in the sub-band is integrated, so as to obtain sub-band energy data and the like corresponding to the sub-band.

In an optional embodiment, the subband energy values corresponding to different subbands are determined by using the method, the subband energy values are subjected to logarithm operation, and the result after the logarithm operation is used as subband energy data.

Illustratively, the subband energy value corresponding to the current frame audio frame is obtained as x, and then the subband energy value is logarithmized, for example: log10(x), the logarithmized value is used as the sub-band energy data, that is, the sub-band energy data is log10 (x). Optionally, the subband energy data of the k subband corresponding to the i audio frame is indicated by Eb (i, k).

Step 440, obtaining sub-band energy data corresponding to at least two sub-bands in the i-th frame of audio frame and sub-band energy distribution data corresponding to at least two sub-bands in the i-1-th frame of audio frame.

The ith frame audio frame and the (i-1) th frame audio frame are two adjacent audio frames distributed along the time domain dimension, and i is a positive integer larger than 1.

Illustratively, the subband energy data corresponding to at least two subbands in the ith frame of audio frame are obtained by using the subband energy obtaining method. Optionally, subband energy distribution data corresponding to the i-1 th frame of audio frame is obtained, wherein the subband energy distribution data is used for indicating the frequency distribution condition of the target audio on at least two subbands.

That is, when determining the sub-band energy distribution data corresponding to at least two sub-bands in the current frame of audio, the iterative determination method is adopted to determine the sub-band energy distribution data corresponding to at least two sub-bands in the previous frame of audio and the sub-band energy distribution data corresponding to at least two sub-bands in the current frame of audio together.

Illustratively, the subband energy distribution data corresponding to different frames of audio frames is determined by using the method.

The 1 st frame audio frame is the first audio frame of the target audio in the time domain dimension. Optionally, when determining the sub-band energy distribution data corresponding to different sub-bands in the 1 st frame of audio frame, the sub-band energy data corresponding to different sub-bands in the 1 st frame of audio frame is used as the sub-band energy distribution data corresponding thereto. For example: in the 1 st frame of audio frame, if the sub-band energy data corresponding to the sub-band a is x, the sub-band energy data x corresponding to the sub-band a in the 1 st frame of audio frame is taken as the sub-band energy distribution data corresponding to the sub-band a in the 1 st frame of audio frame.

Or when determining the sub-band energy distribution data corresponding to different sub-bands in the 1 st frame of audio frame, respectively processing the sub-band energy data corresponding to different sub-bands in the 1 st frame of audio frame, and using the processed data as the sub-band energy distribution data corresponding to different sub-bands. For example: in the 1 st frame of audio frame, the sub-band energy data corresponding to the sub-band A is x, the preset parameter is multiplied by the sub-band energy data x, and the product is used as the sub-band energy distribution data corresponding to the sub-band A in the 1 st frame of audio frame, and the like.

Optionally, when determining subband energy distribution data corresponding to different subbands in other audio frames except the audio frame 1, determining the subband energy distribution data in a preset energy data calculation manner.

And step 450, performing weighted fusion on the sub-band energy data respectively corresponding to at least two sub-bands in the ith frame of audio frame and the sub-band energy long-term distribution data respectively corresponding to at least two sub-bands in the (i-1) th frame of audio frame by using a first preset weight, and determining the sub-band energy long-term distribution data respectively corresponding to at least two sub-bands in the ith frame of audio frame.

Optionally, after obtaining sub-band energy long-term distribution data corresponding to the i-th frame of audio frame and sub-band energy long-term distribution data corresponding to the i-1 th frame of audio frame, determining the sub-band energy long-term distribution data corresponding to the i-th frame of audio frame according to the first preset weight. The first preset weight is a predetermined weight value and is used for balancing the weights of the sub-band energy data corresponding to the ith frame of audio frame and the sub-band energy long-term distribution data corresponding to the (i-1) th frame of audio frame when the sub-band energy long-term distribution data corresponding to the ith frame of audio frame is determined.

Illustratively, the following calculation formula of the sub-band energy distribution data is adopted to determine the long-term distribution data of the sub-band energy corresponding to the i-th frame of audio frame.

Eb_lt(i,k)＝a*Eb_lt(i-1,k)+(1-a)*Eb(i,k)a＝0.993

Wherein i is used to indicate an audio frame; k is used to indicate a sub-band; a is used for indicating a first preset weight; eb _ lt (i, k) is used for indicating sub-band energy long-term distribution data corresponding to a kth frequency band in an ith frame of audio frame, and lt is an abbreviation of long-term (long-term) and is used for describing an iterative relationship of the sub-band energy long-term distribution data corresponding to different audio frames in a time domain dimension; eb _ lt (i-1, k) is used for indicating sub-band energy long-term distribution data corresponding to a kth frequency band in an i-1 frame audio frame; eb (i, k) is used to indicate the sub-band energy data corresponding to the k frequency band in the i frame audio frame.

Optionally, the sub-band energy long-term distribution data corresponding to different sub-bands in different audio frames is determined by using the above calculation formula of the sub-band energy long-term distribution data.

Step 460, in response to that the sub-band energy long-term distribution data corresponding to the specified sub-band exists in the at least two sub-bands and reaches a preset hearing threshold, determining an adjustment parameter.

Optionally, taking any one audio frame as an example for description, after determining long-term distribution data of sub-band energies corresponding to at least two sub-bands of the audio frame, comparing the long-term distribution data of sub-band energies corresponding to at least two sub-bands with a preset auditory threshold, and determining whether to adjust the long-term distribution data of sub-band energies corresponding to at least two sub-bands according to a comparison result.

The preset auditory threshold is used for indicating the adjustment condition, is a preset numerical condition and is used for avoiding the occurrence of auditory fatigue.

The main cause of auditory fatigue is persistent energy concentration and overload of the sound signal in one or some sub-bands in the frequency domain dimension, as shown in fig. 5, which is a schematic frequency distribution diagram of the target audio, wherein the horizontal axis 510 is used to indicate the time dimension, and the vertical axis 520 is used to indicate the frequency distribution, as shown in fig. 5, there is a clear bright streak (representing too much energy of the target audio at low frequencies) in the low frequency part 530, and the listener receives signal stimulation from the frequency band (low frequency band) for a long time, resulting in the auditory threshold (hearing threshold) being shifted upwards. At this time, the audio information of other frequency bands is not sensitive due to the adjustment of the hearing threshold, so that the situation that the user cannot hear clearly appears, and the situation that the user feels tired after long-term listening and cannot concentrate on the hearing, that is, the phenomenon of hearing fatigue is caused.

In an alternative embodiment, different preset auditory thresholds are determined for different sub-bands according to the auditory equal loudness curve principle.

The auditory equal loudness curve is a curve with equal loudness and subjective perception (loudness level) obtained by measuring the real sense of the human ear sound, and different sub-bands correspond to different frequency ranges, so when the preset auditory thresholds corresponding to different sub-bands are determined, the auditory thresholds corresponding to the frequency ranges are determined, that is, the preset auditory thresholds corresponding to different sub-bands are determined.

Optionally, reaching a preset hearing threshold is used for indicating that the sub-band energy long-term distribution data is greater than the preset hearing threshold; or, the reaching of the preset hearing threshold is used for indicating that the sub-band energy long-term distribution data is greater than or equal to the preset hearing threshold.

Illustratively, the sub-band corresponding to the sub-band energy long-term distribution data reaching the preset hearing threshold is used as the designated sub-band. For example: and the sub-band energy long-term distribution data corresponding to the sub-band is greater than a preset hearing threshold.

And 470, adjusting the sub-band energy data of the designated sub-band based on the adjustment parameter to obtain the target enhanced audio.

Optionally, the same sub-band corresponds to different adjustment parameters under different audio frames, and an arbitrary audio frame is taken as an example for description. Illustratively, after determining the adjustment parameters corresponding to different sub-bands, the sub-band energy data of the corresponding sub-band is adjusted by the adjustment parameters corresponding to the different sub-bands.

In an optional embodiment, the subband energy data corresponding to the specified subband is adjusted by the adjustment parameter, and the energy adjustment gain corresponding to the specified subband is determined.

Wherein the energy adjustment gain is used for indicating the data scaling condition of the sub-band energy data.

Illustratively, after the adjustment parameters are obtained, the adjustment parameters are converted to obtain energy adjustment gains corresponding to the specified sub-frequency bands, and the target audio is subjected to voice enhancement based on the energy adjustment gains to obtain a target enhanced audio.

In summary, at least two sub-bands corresponding to the target audio are obtained through band splitting, sub-band energy data and sub-band energy distribution data corresponding to different sub-bands are obtained, a specified sub-band not meeting an adjustment condition is determined through the sub-band energy distribution data of the sub-bands, adjustment parameters corresponding to different specified sub-bands are determined through the sub-band energy distribution data corresponding to the specified sub-bands, the sub-band energy data of the specified sub-bands are selectively adjusted through the adjustment parameters, an enhanced target enhanced audio is obtained based on the adjusted sub-band energy data, and the quality of voice enhancement is improved while the characteristics of the target audio are fully considered.

In the embodiment of the present application, the sub-band energy long-term distribution data included in the sub-band energy distribution data is described, by means of the sub-band energy long-time distribution data, the change situation of the sub-band energy data of two adjacent frames of audio frames can be determined, and then comparing the sub-band energy long-term distribution data with a preset hearing threshold by means of the sub-band energy long-term distribution data with an iterative relation on a time domain dimension, when the sub-band energy long-term distribution data reaches a preset hearing threshold, adjusting the sub-band energy data corresponding to the appointed sub-band through adjusting parameters, therefore, the sub-band energy data corresponding to the designated sub-band is limited below a certain value, the situations of auditory fatigue and the like possibly caused by overlarge sub-band energy data corresponding to the designated sub-band are avoided, the fatigue feeling of the object when the object listens to the target audio is reduced, and the sub-band energy data corresponding to the target audio is selectively adjusted.

In an alternative embodiment, the adjustment parameter is determined based on the sub-band energy high-order distribution data and the sub-band energy low-order distribution data included in the sub-band energy distribution data. Illustratively, as shown in fig. 6, the embodiment shown in fig. 2 can also be implemented as the following steps 610 to 690.

Step 610, obtain the target audio.

The target audio is audio data to be subjected to voice enhancement.

Step 610 is already described in step 210, and will not be described herein.

Step 620, performing band segmentation on the target audio along the frequency domain dimension to obtain at least two sub-bands.

In an optional embodiment, based on a preset band division standard, the target audio is band-split along the frequency domain dimension to obtain at least two sub-bands. Illustratively, the target audio is band-sliced using a predetermined critical band-splitting criterion to obtain at least two sub-bands.

Step 620 is already described in step 220 and step 420, and is not described herein again.

Step 630, obtaining sub-band energy data corresponding to at least two sub-bands respectively.

Illustratively, the subband energy values corresponding to different frames of audio frames are obtained as x, then logarithms are respectively taken for the different subband energy values, and the logarithmized values are taken as subband energy data, so that subband energy data corresponding to different subbands under different audio frames are obtained.

Step 630 is already explained in step 230 and step 430, and is not described herein again.

Step 640, obtaining sub-band energy data corresponding to at least two sub-bands in the i-th frame of audio frame and sub-band energy distribution data corresponding to at least two sub-bands in the i-1-th frame of audio frame.

Wherein i is a positive integer greater than 1, and the sub-band energy distribution data is used to indicate the frequency distribution of the target audio on at least two sub-bands.

Illustratively, when determining the sub-band energy distribution data corresponding to the current frame (i-th frame) audio frame, determining the sub-band energy distribution data corresponding to the current frame audio frame jointly by using the sub-band energy distribution data corresponding to the previous frame (i-1-th frame) audio frame and the sub-band energy data corresponding to the current frame audio frame (i-th frame) in an iterative determination manner.

Step 650, determining the sub-band energy long-term distribution data corresponding to the at least two sub-bands in the ith frame audio frame based on the sub-band energy data corresponding to the at least two sub-bands in the ith frame audio frame, the sub-band energy long-term distribution data corresponding to the at least two sub-bands in the ith-1 frame audio frame, and the first preset weight.

Optionally, the subband energy data corresponding to at least two subbands in the ith frame of audio frame and the subband energy long-term distribution data corresponding to at least two subbands in the i-1 th frame of audio frame are weighted and fused by a predetermined first preset weight, so as to determine the subband energy long-term distribution data corresponding to at least two subbands in the ith frame of audio frame.

Illustratively, the sub-band energy distribution data corresponding to the kth sub-band in the ith frame of audio frame is determined by using the calculation formula of the sub-band energy distribution data shown in step 450. Similarly, the sub-band energy distribution data corresponding to different sub-bands under different audio frames are determined by adopting a calculation formula of the sub-band energy distribution data.

Step 660, obtaining the sub-band energy high-order distribution data corresponding to the at least two sub-bands in the i-th frame audio frame based on the sub-band energy data corresponding to the at least two sub-bands in the i-th frame audio frame, the sub-band energy high-order distribution data corresponding to the at least two sub-bands in the i-1-th frame audio frame, and the second preset weight.

The sub-band energy distribution data comprises sub-band energy high-bit distribution data which is used for indicating the data comparison condition of the sub-band energy data of two adjacent audio frames of the ith frame audio frame and the sub-band energy high-bit distribution data of the ith-1 frame audio frame, wherein the ith frame audio frame and the ith frame audio frame are adjacent audio frames.

Optionally, after obtaining sub-band energy data corresponding to at least two sub-bands in the i-th frame of audio frame, the sub-band energy data corresponding to at least two sub-bands in the i-th frame of audio frame is compared with sub-band energy high-order distribution data corresponding to at least two sub-bands in the i-1-th frame of audio frame, and the preset weight for performing the weighting fusion process is determined according to the comparison result.

Illustratively, the subband energy data of the kth frequency band in the audio frame of the ith frame is denoted as Eb (i, k), and the subband energy high-bit distribution data corresponding to the kth frequency band in the audio frame of the ith-1 frame is denoted as Eb _ up (i-1, k), which is used for describing the iterative relationship of the subband energy high-bit distribution data corresponding to different audio frames in the time domain dimension, wherein up is used for indicating high bits.

Alternatively, the analysis is performed by taking the specified sub-band as an example, and Eb (i, k) and Eb are obtained _up(i-1，k) Then Eb (i, k) and Eb are mixed _up(i-1，k) And comparing, and performing weighted fusion on the sub-band energy data corresponding to the i-th frame of audio frame and the sub-band energy high-order distribution data corresponding to the i-1-th frame of audio frame by adopting a second preset weight according to the data comparison result, thereby obtaining the sub-band energy high-order distribution data corresponding to the i-th frame of audio frame.

Optionally, obtaining Eb (i, k) and Eb _up(i-1，k) Then, Eb (i, k) and Eb are mixed _up(i-1，k) And comparing, and performing weighted fusion on the sub-band energy data corresponding to the i-th frame of audio frame and the sub-band energy high-order distribution data corresponding to the i-1-th frame of audio frame by adopting different second preset weights according to the data comparison result, thereby obtaining the sub-band energy high-order distribution data corresponding to the i-th frame of audio frame. I.e. according to Eb (i, k) and Eb _up(i-1，k) The numerical value of the second preset weight is different.

Illustratively, the subband energy high-bit distribution data corresponding to the ith frame of audio frame is determined by using the following calculation formula of the subband energy high-bit distribution data.

Eb _up(i，k) ＝b*Eb _up(i-1，k) +(1-b)*Eb(i，k)

Wherein i is used to indicate an audio frame; k is used to indicate a sub-band; b is used for indicating a second preset weight; eb _up(i，k) Sub-band energy high-bit distribution data corresponding to a kth frequency band in the ith frame of audio frame; eb _up(i-1，k) Sub-band energy high-bit distribution data corresponding to a kth frequency band in an i-1 frame audio frame; eb (i, k) is used to indicate the sub-band energy data corresponding to the k frequency band in the i frame audio frame.

Optionally according to Eb (i, k) and Eb _up(i-1，k) The numerical comparison relationship of (a) and (b) is different, and schematically, the second preset weight b is shown as follows.

Wherein when Eb (i, k) is less than Eb _up(i-1，k) When the value of the second preset weight b is 0.999, that is, when the high-order distribution data of the sub-band energy corresponding to the kth frequency band in the ith frame of audio frame is determined, the weight influence of the high-order distribution data of the sub-band energy corresponding to the kth frequency band in the ith-1 frame of audio frame is large; when Eb (i, k) is not less than (not less than) Eb _up(i-1，k) In the process, the value of the second preset weight b is 0.95, that is, when the sub-band energy high-order distribution data corresponding to the kth frequency band in the i-th frame audio frame is determined, the weight influence of the sub-band energy high-order distribution data corresponding to the kth frequency band in the i-1 th frame audio frame is small.

Step 670, obtaining sub-band energy low-order distribution data corresponding to at least two sub-bands in the i-th frame audio frame based on the sub-band energy data corresponding to the at least two sub-bands in the i-th frame audio frame, the sub-band energy low-order distribution data corresponding to the at least two sub-bands in the i-1-th frame audio frame, and the third preset weight.

And the sub-band energy distribution data comprises sub-band energy low-order distribution data, and the sub-band energy low-order distribution data is used for indicating the data comparison condition of the sub-band energy data of the ith frame of audio frame and the sub-band energy low-order distribution data of the ith-1 st frame of audio frame.

Optionally, after obtaining the sub-band energy data corresponding to at least two sub-bands in the i-th frame of audio frame, the sub-band energy data corresponding to at least two sub-bands in the i-th frame of audio frame is compared with the sub-band energy low-order distribution data corresponding to at least two sub-bands in the i-1-th frame of audio frame, and the preset weight for performing the weighting fusion process is determined according to the comparison result.

Schematically, the sub-band energy data of the k-th frequency band in the i-th frame audio frame is represented as Eb (i, k), and the sub-band energy corresponding to the k-th frequency band in the i-1-th frame audio frameThe lower order distribution data is expressed as Eb _dw(i，k) And dw is used for indicating low bits and describing an iterative relationship of sub-band energy low bit distribution data corresponding to different audio frames in a time domain dimension.

Alternatively, the analysis is performed by taking the specified sub-band as an example, and Eb (i, k) and Eb are obtained _dw(i-1，k) Then Eb (i, k) and Eb are mixed _dw(i-1，k) And comparing, and performing weighted fusion on the sub-band energy data corresponding to the i-th frame of audio frame and the sub-band energy low-order distribution data corresponding to the i-1-th frame of audio frame by using a third preset weight according to the data comparison result, thereby obtaining the sub-band energy low-order distribution data corresponding to the i-th frame of audio frame.

Optionally, obtaining Eb (i, k) and Eb _dw(i-1，k) Then, Eb (i, k) and Eb are mixed _dw(i-1，k) And comparing, and performing weighted fusion on the sub-band energy data corresponding to the i-th frame of audio frame and the sub-band energy low-order distribution data corresponding to the i-1-th frame of audio frame by adopting different third preset weights according to the data comparison result, thereby obtaining the sub-band energy low-order distribution data corresponding to the i-th frame of audio frame. I.e. according to Eb (i, k) and Eb _dw(i-1，k) The numerical value of the third preset weight is different.

Illustratively, the following calculation formula of the sub-band energy low-order distribution data is adopted to determine the sub-band energy low-order distribution data corresponding to the i-th frame of audio frame.

Eb _dw(i，k) ＝c*Eb _dw(i-1，k) +(1-c)*Eb(i，k)

Wherein i is used to indicate an audio frame; k is used to indicate a sub-band; c is used for indicating a third preset weight; eb _dw(i，k) The subband energy low-order distribution data is used for indicating the corresponding subband energy low-order distribution data of the kth frequency band in the ith frame of audio frame; eb _dw(i-1，k) The subband energy low-order distribution data is used for indicating the corresponding subband energy low-order distribution data of the kth frequency band in the i-1 frame audio frame; eb (i, k) is used to indicate the sub-band energy data corresponding to the k-th frequency band in the i-th frame audio frame.

Optionally according to Eb (i, k) and Eb _dw(i-1，k) The numerical comparison of the third preset weights c, which are different from each other, schematically,the third preset weight c is expressed as follows.

Wherein, when Eb (i, k) is larger than Eb _dw(i-1，k) When the value of the third preset weight c is 0.999, that is, when the sub-band energy low-order distribution data corresponding to the kth frequency band in the ith frame of audio frame is determined, the weight influence of the sub-band energy low-order distribution data corresponding to the kth frequency band in the ith-1 frame of audio frame is large; when Eb (i, k) is not more than (not more than) Eb _dw(i-1，k) And then, the value of the third preset weight c is 0.95, that is, when the sub-band energy low-order distribution data corresponding to the kth frequency band in the ith frame of audio frame is determined, the weight influence of the sub-band energy low-order distribution data corresponding to the kth frequency band in the ith-1 frame of audio frame is small.

Step 680, determining an adjustment parameter corresponding to the designated sub-band based on the sub-band energy high-order distribution data corresponding to the designated sub-band, the sub-band energy low-order distribution data corresponding to the designated sub-band, and a preset auditory threshold.

The preset auditory threshold is used for assisting in limiting the data range of the sub-band energy data corresponding to the specified sub-band.

Optionally, the sub-band energy high-bit distribution data Eb corresponding to the specified sub-band is determined _up(i，k) And sub-band energy low-order distribution data Eb corresponding to the designated sub-band _dw(i，k) Then, according to preset hearing threshold Eb _up(i，k) And Eb _dw(i，k) And determining the adjustment parameters corresponding to the specified sub-bands.

Schematically, the calculation formula of the adjustment parameter is as follows.

Where Thrd is used to indicate a preset hearing threshold. Optionally, an adjustment parameter corresponding to the specified sub-band is determined by a preset auditory threshold, and when the sub-band energy data corresponding to the specified sub-band is adjusted by the adjustment parameter, the sub-band energy data corresponding to the adjusted specified sub-band is limited within a value of the preset auditory threshold.

And 690, adjusting the sub-band energy data of the designated sub-band based on the adjustment parameter to obtain the target enhanced audio.

Schematically, as shown in fig. 7, the diagram is a schematic diagram for adjusting sub-band energy data Eb (i, k) corresponding to a kth frequency band in an i-th frame of audio frame, where a horizontal axis represents the sub-band energy data Eb (i, k) corresponding to the kth frequency band in the i-th frame of audio frame; the vertical axis represents the subband energy data Eb' (i, k) corresponding to the kth frequency band in the adjusted ith frame audio frame, wherein the point a 710 is used to indicate the subband energy low-order distribution data Eb corresponding to the subband _dw(i，k) (ii) a B point 720 is used for indicating the sub-band energy high-bit distribution data Eb corresponding to the sub-band _up(i，k) (ii) a The C point 730 is configured to indicate sub-band energy data Eb '(i, k) corresponding to the kth frequency band in the largest adjusted ith frame of audio frame, and is used to define sub-band energy long-term data based on a preset auditory threshold, so that the adjusted sub-band energy data Eb' (i, k) is smaller than the preset auditory threshold Thrd; the D point 740 is used to indicate the sub-band energy data Eb' (i, k) corresponding to the kth frequency band in the minimum ith frame audio frame after adjustment, where the a point 710, the B point 720, the C point 730, and the D point 740 are on a slope line, the slope of the slope line is the above adjustment parameter, and the slope line expression of the slope line is as shown below.

Schematically, the input Eb (i, k) is adjusted by the adjustment rule of the oblique line with the horizontal axis Eb (i, k) as the input and the vertical axis Eb '(i, k) as the output, so as to obtain the output Eb' (i, k). That is, at point A710, the input Eb (i, k) is Eb _dw(i，k) The output Eb' (i, k) is determined to be 0.2 according to the above-described oblique line expressionThrd; at point B720, the input Eb (i, k) is Eb _up(i，k) Determining the output Eb' (i, k) to be 0.8 × Thrd according to the oblique line expression; at point C730, the input Eb (i, k) is (4 × Eb) _up -Eb _dw ) (ii)/3, determining the output Eb' (i, k) as Thrd according to the oblique line expression; at point D740, the input Eb (i, k) is (4 × Eb) _dw -Eb _up ) And/3, determining that the output Eb' (i, k) is 0 and the like according to the oblique line expression.

Illustratively, the oblique line expression is adopted, and a corresponding output Eb' (i, k) is determined according to different input Eb (i, k), so as to implement an adjustment process for adjusting the sub-band energy data. Optionally, an adjustment parameter corresponding to the specified sub-band is determined by a preset auditory threshold, and when the sub-band energy data corresponding to the specified sub-band is adjusted by the adjustment parameter, the sub-band energy data corresponding to the adjusted specified sub-band is limited within a value of the preset auditory threshold.

In an alternative embodiment, after the adjusted sub-band energy data Eb '(i, k) is obtained, the energy adjustment gains corresponding to different sub-bands are determined according to the adjusted sub-band energy data Eb' (i, k) and the sub-band energy data Eb (i, k) before adjustment, where an expression of the sub-band energy gain is shown below.

Wherein, gain (i, k) is used to indicate the sub-band energy gain corresponding to the kth sub-band in the ith frame of audio frame; sqrt is used to refer to the calculation function of the positive square root.

In an optional embodiment, gain conversion is performed on the energy adjustment gain corresponding to the designated sub-band, and frequency point gain corresponding to the intermediate frequency point in the designated sub-band is determined; and obtaining the target enhanced audio based on the product of the frequency point gain corresponding to the intermediate frequency point of the designated sub-frequency band and the frequency point amplitude corresponding to the intermediate frequency point of the designated sub-frequency band.

Illustratively, after the energy adjustment gain corresponding to the specified sub-band is obtained, the energy adjustment gain corresponding to the specified sub-band is inversely transformed through a Bark domain, and the frequency point gain corresponding to the frequency point in the specified sub-band is determined. The frequency points are distributed points in the frequency band. Optionally, a sampling point corresponding to a designated sub-band in the sub-band energy data calculation process is taken as a frequency point; alternatively, any one point or multiple points in the designated sub-band may be used as the frequency points. And the frequency point gain is used for indicating the proportion adjustment condition of the sub-band energy data corresponding to the frequency point.

In an optional embodiment, time domain conversion is performed on the product of the frequency point gain corresponding to the designated sub-band intermediate frequency point and the frequency point amplitude corresponding to the designated sub-band intermediate frequency point, and the adjusted designated sub-band is determined; and performing band splicing operation on at least one adjusted designated sub-band to determine the target enhanced audio.

Illustratively, after obtaining the frequency point gain corresponding to the designated sub-band intermediate frequency point, time domain conversion is performed on the product of the frequency point gain corresponding to the designated sub-band intermediate frequency point and the frequency point amplitude corresponding to the designated sub-band intermediate frequency point.

The frequency point amplitude corresponding to the frequency point is the frequency point amplitude corresponding to the sub-frequency band in the target audio, that is, the frequency point amplitude is a value before energy adjustment.

Optionally, a time domain transform is used to indicate the conversion of the specified subbands from the frequency domain dimension to the time domain dimension. Illustratively, after a product of a frequency point gain corresponding to the intermediate frequency point of the designated sub-band and a frequency point amplitude corresponding to the intermediate frequency point of the designated sub-band is obtained by adopting inverse Fourier transform, inverse Fourier transform is performed on a product result, and thus the target enhanced audio represented on the time domain dimension is obtained.

In the embodiment of the present application, subband energy high-order distribution data and subband energy low-order distribution data included in subband energy distribution data are introduced, the subband energy high-order distribution data and the subband energy low-order distribution data are used to indicate a data comparison condition of subband energy data of two adjacent frames of audio frames, subband energy high-order distribution data and subband energy low-order distribution data corresponding to different subbands are differentially determined according to the subband energy data corresponding to different subbands under different audio frames, an adjustment parameter is determined by using the subband energy high-order distribution data and the subband energy low-order distribution data, and a process of adjusting the subband energy data of different specified subbands is further implemented by adjusting the parameter, so that the subband energy data corresponding to at least two subbands can be limited within a preset auditory threshold, and speech enhancement is performed by a band energy dynamic range control method, thereby better improving the hearing fatigue problem.

In an alternative embodiment, the technical principle of the speech enhancement method is explained, and as shown in fig. 8, the speech enhancement method includes the following three processing portions, respectively: a pre-processing section 810; (II) a processing section 820; (III) post-processing section 830.

(one) preprocessing section 810

Step 811, microphone recording or sound signal decoding.

Schematically, based on the auditory fatigue phenomenon, a method based on auditory fatigue frequency domain feature extraction and sub-band dynamic range control is provided. Optionally, obtaining a target audio by using a microphone recording device; or, the audio data obtained by network transmission is used as the obtained target audio, and the like.

Step 812, inputting a sound signal.

Illustratively, after the target audio is obtained, compressed data decoding is performed on the target audio to obtain a sound signal corresponding to the target audio.

Step 813 fourier transform.

Illustratively, after obtaining the sound signal, the sound signal is subjected to a frequency domain transform (fast fourier transform) to implement a process of transforming the sound signal from a time domain dimensional representation to a frequency domain dimensional representation, thereby obtaining the sound signal represented in a frequency domain.

Step 814, Bark domain transformation.

Illustratively, a sound signal represented in the frequency domain is sub-band divided according to a Bark domain division standard.

(II) processing section 820

In step 821, power spectra and related feature detection values corresponding to different sub-bands are determined.

Illustratively, after the sound signal is sub-band divided, a plurality of sub-bands corresponding to the target audio are obtained, such as: sub-band 1, sub-band 2, sub-band N, etc.

Optionally, power spectrums, related feature detection values, and the like corresponding to different sub-bands are respectively determined. Illustratively, subband energy data (taking logarithm, for example, log10(x), where x is a subband energy value) of the current frame of each subband is obtained, and subband energy long-term distribution data, subband energy high-order distribution data, and subband energy low-order distribution data are calculated. Namely, determining sub-band energy long-term distribution data, sub-band energy high-order distribution data and sub-band energy low-order distribution data corresponding to the 1 st sub-band; determining sub-band energy long-term distribution data, sub-band energy high-order distribution data and sub-band energy low-order distribution data corresponding to the 2 nd sub-band; and determining sub-band energy long-term distribution data, sub-band energy high-order distribution data, sub-band energy low-order distribution data and the like corresponding to the Nth sub-band.

Determining the sub-band energy long-term distribution data according to the power spectrum corresponding to the sub-band; the sub-band energy high-order distribution data and the sub-band energy low-order distribution data are determined through the sub-band energy long-term distribution data, and the sub-band energy long-term distribution data, the sub-band energy high-order distribution data and the sub-band energy low-order distribution data are used for indicating the related characteristic detection values.

Step 822, hearing fatigue determination.

Schematically, after obtaining sub-band energy long-term distribution data, comparing the sub-band energy long-term distribution data with a preset auditory fatigue threshold (different values are set for different sub-bands according to an auditory equal loudness curve principle), if the sub-band energy long-term distribution data exceeds the threshold, judging that the sub-band can cause auditory fatigue, and entering a Dynamic Range Control (DRC) processing flow to obtain a sub-band gain value; conversely, if the hearing fatigue threshold is not reached, no DRC processing is required (i.e., subband gain value of 1).

Illustratively, the preset hearing fatigue threshold value is denoted as Thrd, and the subband energy long-term distribution data is denoted as Eb _ lt (i, k), that is, the subband decision condition of hearing fatigue is: eb _ lt (i, k) > Thrd, and entering step 823 when the sub-band energy length of the kth sub-band at the ith frame audio frame meets the above sub-band decision condition; when the sub-band energy length of the k sub-band at the ith frame audio frame is not consistent with the above sub-band decision condition, the process proceeds to step 831.

Step 823, DRC parameter adjustment.

Illustratively, as shown in fig. 7, the abscissa is the input value of the DRC procedure, i.e., the unprocessed subband energy data Eb (i, k), and the ordinate is the output value of the DRC procedure, i.e.,: sub-band energy data Eb' (i, k) after DRC processing. The four ABCD points are on the same slope, and the slopes of the slopes are as follows.

Point C represents the maximum output value (this value is denoted as Thrd), which corresponds to the input of (4 × Eb _ up (i, k) -Eb _ dw (i, k))/3, and when the input value is greater than point C, the maximum output value Thrd is output; point D is the minimum output value, optionally set to 0 here, corresponding to an input of (4 × Eb _ dw (i, k) -Eb _ up (i, k))/3, and the minimum value of 0 is output when the input value is less than point D. After DRC processing, the subband output energy can be controlled within the hearing fatigue range.

At step 824, DRC controls the gain.

Alternatively, after the above DRC processing, output processed subband energy data Eb '(i, k) is obtained, and then, based on the processed subband energy data Eb' (i, k) and unprocessed subband energy data Eb (i, k), a subband gain after DRC processing is obtained, which is as follows.

(III) post-processing part 830

Step 831, Bark domain gain inverse transform.

Illustratively, after obtaining a subband gain (i, k) corresponding to a kth subband in an ith frame of audio frame, the subband gain is converted into a linear domain gain after inverse transformation in a Bark domain, so as to obtain frequency point gains corresponding to a plurality of frequency points in the kth subband.

Optionally, taking the kth sub-band as an example for description, the gain of the frequency point corresponding to the different frequency points in the kth sub-band is multiplied by the power spectrum of the original sound signal, that is, the gain of the frequency point corresponding to the different frequency points in the kth sub-band is multiplied by the amplitude of the frequency point corresponding to the original sound signal.

Step 832, inverse fourier transform.

Optionally, the multiplied values corresponding to different frequency points are subjected to inverse fourier transform, so as to implement a process of converting the sound signal from the frequency domain dimensional representation to the time domain dimensional representation.

Step 833, outputting the audio signal.

Schematically, after inverse fourier transformation, a processed sound signal represented in the time domain is obtained.

Step 834, speaker playing or server transcoding.

Alternatively, the sound signal obtained after the inverse fourier transform is a signal represented in a time domain, so the output sound signal can be directly played through a loudspeaker, or compressed and transmitted to a terminal through network coding for decoding and playing, and the like.

In an optional embodiment, the voice enhancement method is applied to a voice call scenario, and the method is applied to a terminal for example. Illustratively, as shown in fig. 9, the above-mentioned speech enhancement method can also be implemented as the following steps 910 to 950.

Step 910, obtaining a call audio.

Optionally, in a real-time call scenario, the terminal obtains a call audio of the object in real time, where the call audio is used to indicate voice data with audio information. Illustratively, the real-time call scenario includes a two-person call scenario, a multi-person call scenario, and the like.

Step 920, performing band segmentation on the call audio along the frequency domain dimension to obtain at least two sub-bands.

The frequency domain dimension is used for describing the dimension condition of the target audio in frequency, and the oscillation information of the call audio in the frequency domain dimension can be provided by analyzing the call audio in the frequency domain dimension.

Step 920 is already described in step 220 and step 420, and is not described herein again.

Step 930, obtaining sub-band energy data corresponding to at least two sub-bands respectively.

The sub-band energy data is used for indicating the frequency variation of the audio frame in the call audio along the frequency domain dimension in the sub-band.

Illustratively, the terminal acquires sub-band energy data corresponding to at least two sub-bands in real time according to the frequency variation condition of an audio frame in a call audio along the frequency domain dimension in the sub-bands, so that the sub-band energy data corresponding to the at least two sub-bands are utilized to perform energy data analysis.

Step 930 is already explained in step 230 and step 430, and is not described herein again.

And 940, analyzing the sub-band energy data respectively corresponding to the at least two sub-bands along the time domain dimension to obtain sub-band energy distribution data respectively corresponding to the at least two sub-bands.

Illustratively, the time domain dimension is the dimensional case where changes in the target audio over time are recorded using a time scale.

Optionally, the sub-band energy data corresponding to the at least two sub-bands are analyzed along the time domain dimension to obtain sub-band energy long-term distribution data, sub-band energy high-order distribution data, and sub-band energy low-order distribution data corresponding to the at least two sub-bands.

Illustratively, the terminal obtains, according to the sub-band energy data corresponding to the at least two sub-bands obtained in real time, sub-band energy distribution data corresponding to the at least two sub-bands in different audio frames, for example: and obtaining sub-band energy long-term distribution data, sub-band energy high-order distribution data, sub-band energy low-order distribution data and the like corresponding to at least two sub-bands respectively.

Step 940 has been described in the above embodiments, and is not described herein again.

Step 950, determining an adjustment parameter based on the sub-band energy distribution data corresponding to the designated sub-band when the sub-band energy distribution data corresponding to the designated sub-band reaches a preset hearing threshold in the at least two sub-bands, and adjusting the sub-band energy data of the designated sub-band to obtain a voice enhanced audio.

Optionally, the adjustment condition is a preset condition, for example: adjusting the condition to be a preset energy threshold value; alternatively, the adjustment condition is a condition determined in real time according to the plurality of sub-band energy distribution data, for example: the adjustment condition is an average value of a plurality of subband energy distribution data, or the like.

Illustratively, after acquiring sub-band energy distribution data corresponding to at least two sub-bands respectively in real time, a terminal compares the sub-band energy long-term distribution data corresponding to the at least two sub-bands with a preset auditory threshold, and determines an adjustment parameter corresponding to a specified sub-band according to the sub-band energy high-order distribution data, the sub-band energy low-order distribution data and the preset auditory threshold corresponding to the specified sub-band when the sub-band energy long-term distribution data corresponding to the specified sub-band reaches the preset auditory threshold, so as to adjust the sub-band energy data of the specified sub-band by the adjustment parameter, so that the sub-band energy data of the specified sub-band, in which the sub-band energy long-term distribution data exceeds the preset auditory threshold, is limited within the preset auditory threshold, and further, the sub-band energy data corresponding to the at least two sub-bands are all limited within the preset auditory threshold, thereby obtaining the call enhancement audio.

In summary, at least two sub-bands corresponding to the call audio are obtained by band splitting, and sub-band energy data and sub-band energy distribution data corresponding to different sub-bands are obtained, determining a specific sub-band not meeting the adjustment condition through the sub-band energy distribution data of the sub-band, further, the sub-band energy distribution data corresponding to the specified sub-bands are used to determine the adjustment parameters corresponding to different specified sub-bands, thereby selectively adjusting the sub-band energy data of the designated sub-band by using the adjustment parameter, so that the sub-band energy data corresponding to at least two sub-bands are limited within the preset hearing threshold, under the limitation of a preset auditory threshold, the phenomenon of auditory fatigue is avoided, the conversation quality of the object during conversation is improved, the discomfort of the object during long-time conversation is reduced, and the conversation effect is effectively improved.

Fig. 10 is a speech enhancement apparatus according to an exemplary embodiment of the present application, and as shown in fig. 10, the apparatus includes the following components:

an audio acquisition module 1010, configured to acquire a target audio, where the target audio is audio data to be subjected to voice enhancement;

a band splitting module 1020, configured to perform band splitting on the target audio along a frequency domain dimension to obtain at least two sub-bands;

a data obtaining module 1030, configured to obtain subband energy data corresponding to the at least two subbands, where the subband energy data is used to indicate a frequency variation condition of an audio frame in the target audio along a frequency domain dimension in the subbands;

a data analysis module 1040, configured to analyze the sub-band energy data corresponding to the at least two sub-bands along a time domain dimension to obtain sub-band energy distribution data corresponding to the at least two sub-bands, where the sub-band energy distribution data is used to indicate frequency distribution conditions of the target audio on the at least two sub-bands;

the energy adjusting module 1050 is configured to determine an adjustment parameter based on the sub-band energy distribution data corresponding to the specified sub-band when the sub-band energy distribution data corresponding to the specified sub-band meets an adjustment condition in the at least two sub-bands, and adjust the sub-band energy data of the specified sub-band to obtain the target enhanced audio.

In an optional embodiment, the data analysis module 1040 is further configured to obtain sub-band energy data corresponding to at least two sub-bands in the ith frame of audio frame and sub-band energy distribution data corresponding to at least two sub-bands in the (i-1) th frame of audio frame, where i is a positive integer greater than 1; and obtaining sub-band energy distribution data respectively corresponding to at least two sub-bands in the ith frame of audio frame based on the sub-band energy data respectively corresponding to at least two sub-bands in the ith frame of audio frame, the sub-band energy distribution data respectively corresponding to at least two sub-bands in the ith-1 frame of audio frame, and the preset weight.

In an optional embodiment, the sub-band energy distribution data comprises sub-band energy long-term distribution data, and the sub-band energy long-term distribution data is used for indicating the change condition of the sub-band energy data of two adjacent frames of audio frames;

the data analysis module 1040 is further configured to perform weighted fusion on the subband energy data corresponding to at least two subbands in the ith frame of audio frame and the subband energy long-term distribution data corresponding to at least two subbands in the i-1 th frame of audio frame with a first preset weight, and determine the subband energy long-term distribution data corresponding to at least two subbands in the ith frame of audio frame.

In an optional embodiment, the energy adjustment module 1050 is further configured to determine the adjustment parameter in response to that the sub-band energy long-term distribution data corresponding to the specified sub-band in the at least two sub-bands reaches a preset auditory threshold, where the preset auditory threshold is used to indicate the adjustment condition.

In an alternative embodiment, the sub-band energy distribution data comprises sub-band energy high bit distribution data indicating a data comparison of the sub-band energy data of the i-th frame audio frame with the sub-band energy high bit distribution data of the i-1 th frame audio frame;

the data analysis module 1040 is further configured to obtain subband energy high-order distribution data corresponding to the ith frame of audio frame based on the subband energy data corresponding to the ith frame of audio frame, the subband energy high-order distribution data corresponding to the i-1 th frame of audio frame, and a second preset weight.

In an alternative embodiment, the sub-band energy distribution data comprises sub-band energy lower bit distribution data indicating a comparison of the sub-band energy data of the i-th frame audio frame with the sub-band energy lower bit distribution data of the i-1 th frame audio frame;

the data analysis module 1040 is further configured to obtain sub-band energy low-order distribution data corresponding to the i-th frame audio frame based on the sub-band energy data corresponding to the i-th frame audio frame, the sub-band energy low-order distribution data corresponding to the i-1-th frame audio frame, and a third preset weight.

In an optional embodiment, the energy adjustment module 1050 is further configured to determine an adjustment parameter corresponding to the specific sub-band based on the sub-band energy high-order distribution data corresponding to the specific sub-band, the sub-band energy low-order distribution data corresponding to the specific sub-band, and a preset auditory threshold, where the preset auditory threshold is used to assist in limiting a data range of the sub-band energy data corresponding to the specific sub-band.

In an optional embodiment, the energy adjustment module 1050 is further configured to determine the adjustment parameter based on the sub-band energy distribution data corresponding to the specific sub-band; adjusting the sub-band energy data corresponding to the designated sub-band through the adjustment parameters, and determining the energy adjustment gain corresponding to the designated sub-band, wherein the energy adjustment gain is used for indicating the data proportion adjustment condition of the sub-band energy data; and performing voice enhancement on the target audio based on the energy adjustment gain to obtain the target enhanced audio.

In an optional embodiment, the energy adjustment module 1050 is further configured to perform gain conversion on the energy adjustment gain corresponding to the specified sub-band, and determine a frequency point gain corresponding to the intermediate frequency point in the specified sub-band; and obtaining the target enhanced audio based on the product of the frequency point gain corresponding to the designated sub-frequency band intermediate frequency point and the frequency point amplitude corresponding to the designated sub-frequency band intermediate frequency point.

In an optional embodiment, the energy adjustment module 1050 is further configured to perform time domain transformation on a product of a frequency point gain corresponding to the designated sub-band intermediate frequency point and a frequency point amplitude corresponding to the designated sub-band intermediate frequency point, and determine an adjusted designated sub-band; and performing band splicing operation on at least one adjusted designated sub-band, and determining the target enhanced audio.

In an optional embodiment, the band splitting module 1020 is further configured to perform band splitting on the target audio along the frequency domain dimension based on a preset band splitting standard to obtain the at least two sub-bands.

In an optional embodiment, the energy adjustment module 1050 is further configured to adjust the sub-band energy data of the specified sub-band to obtain adjusted sub-band energy data corresponding to the specified sub-band, where the at least two sub-bands further include a candidate discarded sub-band for which energy is not adjusted; and for the audio frame in the target audio, in response to that the frequency band ratio of the energy-adjusted designated sub-band in at least two sub-bands exceeds a preset ratio threshold, and the candidate discarding sub-band is out of the frequency domain range corresponding to the human voice, retaining the sub-band energy data of the energy-adjusted designated sub-band, discarding the sub-band energy data of the candidate discarding sub-band which is not subjected to energy adjustment, and obtaining the target enhanced audio.

In summary, with the above apparatus, after performing band splitting to obtain at least two sub-bands corresponding to a target audio, sub-band energy data and sub-band energy distribution data corresponding to different sub-bands are obtained, a specified sub-band that does not meet an adjustment condition is determined by the sub-band energy distribution data of the sub-bands, and then adjustment parameters corresponding to different specified sub-bands are determined by the sub-band energy distribution data corresponding to the specified sub-bands, so that the sub-band energy data of the specified sub-bands are selectively adjusted by the adjustment parameters, and an enhanced target enhanced audio is obtained based on the adjusted sub-band energy data, and while fully considering characteristics of the target audio, quality of speech enhancement is improved.

It should be noted that: the voice enhancement device provided in the foregoing embodiment is only illustrated by the division of the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the speech enhancement device and the speech enhancement method provided by the above embodiments belong to the same concept, and the specific implementation process thereof is described in the method embodiments, which is not described herein again.

Fig. 11 shows a schematic structural diagram of a server according to an exemplary embodiment of the present application. The server 1100 includes a Central Processing Unit (CPU) 1101, a system Memory 1104 including a Random Access Memory (RAM) 1102 and a Read Only Memory (ROM) 1103, and a system bus 1105 connecting the system Memory 1104 and the Central Processing Unit 1101. The server 1100 also includes a mass storage device 1106 for storing an operating system 1113, application programs 1114, and other program modules 1115.

The mass storage device 1106 is connected to the central processing unit 1101 through a mass storage controller (not shown) connected to the system bus 1105. The mass storage device 1106 and its associated computer-readable media provide non-volatile storage for the server 1100. That is, the mass storage device 1106 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read Only Memory (CD-ROM) drive.

Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1104 and mass storage device 1106 described above may collectively be referred to as memory.

The server 1100 may also operate in accordance with various embodiments of the application through remote computers connected to a network, such as the internet. That is, the server 1100 may connect to the network 1112 through the network interface unit 1111 that is coupled to the system bus 1105, or may connect to other types of networks or remote computer systems (not shown) using the network interface unit 1111.

The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.

Embodiments of the present application further provide a computer device, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the speech enhancement method provided by the above-mentioned method embodiments.

Embodiments of the present application further provide a computer-readable storage medium, on which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the speech enhancement method provided by the above-mentioned method embodiments.

Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions to cause the computer device to perform the speech enhancement method described in any of the above embodiments.

Optionally, the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), Solid State Drive (SSD), or optical disc, etc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.

The above description is intended only to illustrate the alternative embodiments of the present application, and should not be construed as limiting the present application, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method for speech enhancement, the method comprising:

performing frequency band segmentation on the target audio along a frequency domain dimension to obtain at least two sub-frequency bands;

and determining an adjustment parameter based on the sub-band energy distribution data corresponding to the specified sub-band under the condition that the sub-band energy distribution data corresponding to the specified sub-band in the at least two sub-bands meet the adjustment condition, and adjusting the sub-band energy data of the specified sub-band to obtain the target enhanced audio.

2. The method according to claim 1, wherein the analyzing the sub-band energy data corresponding to the at least two sub-bands along the time domain dimension to obtain sub-band energy distribution data corresponding to the at least two sub-bands comprises:

acquiring sub-band energy data respectively corresponding to at least two sub-bands in an ith frame of audio frame and sub-band energy distribution data respectively corresponding to at least two sub-bands in an (i-1) th frame of audio frame, wherein i is a positive integer greater than 1;

and obtaining sub-band energy distribution data respectively corresponding to at least two sub-bands in the ith frame of audio frame based on the sub-band energy data respectively corresponding to at least two sub-bands in the ith frame of audio frame, the sub-band energy distribution data respectively corresponding to at least two sub-bands in the ith-1 frame of audio frame, and the preset weight.

3. The method according to claim 2, wherein the sub-band energy distribution data comprises sub-band energy long-term distribution data indicating a change of sub-band energy data of two adjacent frames of audio frames;

the obtaining of the sub-band energy distribution data corresponding to at least two sub-bands in the ith frame of audio frame based on the sub-band energy data corresponding to at least two sub-bands in the ith frame of audio frame, the sub-band energy distribution data corresponding to at least two sub-bands in the ith-1 frame of audio frame, and the preset weights includes:

and performing weighted fusion on the sub-band energy data respectively corresponding to at least two sub-bands in the ith frame of audio frame and the sub-band energy long-term distribution data respectively corresponding to at least two sub-bands in the ith-1 frame of audio frame by using a first preset weight, and determining the sub-band energy long-term distribution data respectively corresponding to at least two sub-bands in the ith frame of audio frame.

4. The method according to claim 3, wherein in a case that there is sub-band energy distribution data corresponding to a specific sub-band among the at least two sub-bands that meets an adjustment condition, determining an adjustment parameter based on the sub-band energy distribution data corresponding to the specific sub-band comprises:

and determining the adjustment parameter in response to the sub-band energy long-term distribution data corresponding to the specified sub-band in the at least two sub-bands reaching a preset auditory threshold, wherein the preset auditory threshold is used for indicating the adjustment condition.

5. The method of claim 2, wherein the sub-band energy distribution data comprises sub-band energy high bit distribution data indicating a data comparison of the sub-band energy data of the i-th frame audio frame with the sub-band energy high bit distribution data of the i-1 th frame audio frame;

the obtaining sub-band energy distribution data corresponding to at least two sub-bands in the i-th frame audio frame based on the sub-band energy data corresponding to at least two sub-bands in the i-th frame audio frame, the sub-band energy distribution data corresponding to at least two sub-bands in the i-1-th frame audio frame, and the preset weights includes:

and obtaining sub-band energy high-order distribution data corresponding to at least two sub-bands in the ith frame of audio frame based on the sub-band energy data corresponding to the at least two sub-bands in the ith frame of audio frame, the sub-band energy high-order distribution data corresponding to the at least two sub-bands in the ith-1 th frame of audio frame, and a second preset weight.

6. The method according to claim 2, wherein the sub-band energy distribution data includes sub-band energy lower bit distribution data indicating a data comparison of the sub-band energy data of the i-th frame audio frame with the sub-band energy lower bit distribution data of the i-1 th frame audio frame;

and obtaining sub-band energy low-order distribution data respectively corresponding to at least two sub-bands in the ith frame of audio frame based on the sub-band energy data respectively corresponding to at least two sub-bands in the ith frame of audio frame, the sub-band energy low-order distribution data respectively corresponding to at least two sub-bands in the ith-1 frame of audio frame, and a third preset weight.

7. The method according to any one of claims 1 to 6, wherein the sub-band energy distribution data includes sub-band energy high-order distribution data and sub-band energy low-order distribution data;

the determining an adjustment parameter based on the sub-band energy distribution data corresponding to the specific sub-band comprises:

and determining an adjusting parameter corresponding to the designated sub-frequency band based on the sub-band energy high-order distribution data corresponding to the designated sub-frequency band, the sub-band energy low-order distribution data corresponding to the designated sub-frequency band and a preset hearing threshold, wherein the preset hearing threshold is used for assisting in limiting the data range of the sub-band energy data corresponding to the designated sub-frequency band.

8. The method according to any one of claims 1 to 6, wherein the determining an adjustment parameter based on the sub-band energy distribution data corresponding to the specific sub-band, and adjusting the sub-band energy data of the specific sub-band to obtain the target enhanced audio includes:

determining the adjustment parameter based on the sub-band energy distribution data corresponding to the specified sub-band;

adjusting the sub-band energy data corresponding to the specified sub-band through the adjustment parameters, and determining an energy adjustment gain corresponding to the specified sub-band, wherein the energy adjustment gain is used for indicating the data proportion adjustment condition of the sub-band energy data;

performing gain conversion on the energy adjustment gain corresponding to the designated sub-frequency band, and determining the frequency point gain corresponding to the intermediate frequency point of the designated sub-frequency band;

and obtaining the target enhanced audio based on the product of the frequency point gain corresponding to the designated sub-band intermediate frequency point and the frequency point amplitude corresponding to the designated sub-band intermediate frequency point.

9. The method according to any one of claims 1 to 6, wherein the adjusting the sub-band energy data of the specific sub-band to obtain the target enhanced audio comprises:

adjusting the sub-band energy data of the specified sub-band to obtain adjusted sub-band energy data corresponding to the specified sub-band, wherein the at least two sub-bands further comprise candidate discarded sub-bands without energy adjustment;

and for the audio frame in the target audio, in response to that the frequency band ratio of the energy-adjusted designated sub-band in at least two sub-bands exceeds a preset ratio threshold, and the candidate discarding sub-band is out of the frequency domain range corresponding to the human voice, retaining the sub-band energy data of the energy-adjusted designated sub-band, discarding the sub-band energy data of the candidate discarding sub-band which is not subjected to energy adjustment, and obtaining the target enhanced audio.

10. A speech enhancement apparatus, characterized in that the apparatus comprises:

11. A computer device, characterized in that it comprises a processor and a memory, in which at least one program is stored, which at least one program is loaded and executed by the processor to implement the speech enhancement method according to any one of claims 1 to 9.

12. A computer-readable storage medium, in which at least one program is stored, which is loaded and executed by a processor to implement the speech enhancement method according to any one of claims 1 to 9.

13. A computer program product comprising a computer program or instructions which, when executed by a processor, implements a speech enhancement method as claimed in any one of claims 1 to 9.