CN111785292B - Speech reverberation intensity estimation method and device based on image recognition and storage medium - Google Patents
- Publication number
- CN111785292B (application CN202010426246.0A)
- Authority
- CN
- China
- Prior art keywords
- reverberation
- intensity
- image recognition
- voice
- energy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L21/14—Transforming into visible information by displaying frequency domain information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/56—Extraction of image or video features relating to colour
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/08—Feature extraction
- G06F2218/10—Feature extraction by analysing the shape of a waveform, e.g. extracting parameters relating to peaks
Abstract
The invention discloses a speech reverberation intensity estimation method, device and storage medium based on image recognition. The reverberant speech is converted into a three-dimensional spectrogram; image detection is performed on the spectrogram to obtain the trailing segments of the reverberant speech; the energy intensity of each trailing segment is calculated and taken as an initial estimate of the reverberation intensity; finally, the initial estimates of two or more trailing segments are smoothed to obtain a final estimate, which serves as the measure of the reverberation intensity of the reverberant speech, greatly improving the interference resistance and accuracy of reverberation intensity measurement.
Description
Technical Field
The invention relates to the technical field of voice processing, in particular to a voice reverberation intensity estimation method based on image recognition, a voice reverberation intensity estimation device based on image recognition and a computer readable storage medium.
Background
The reverberation effect is an important phenomenon of indoor acoustics, and is generated by multiple reflections of sound in a closed space. In applications such as hands-free telephones, video teleconferencing systems, hearing aids, man-machine dialog systems, reverberation effects are an important factor affecting the intelligibility of speech signals; meanwhile, it is also an important factor affecting the binaural effect in applications such as stereo theaters, stereo car sound systems, and the like.
However, in practice there are few ways to measure the reverberation strength; the commonly used reverberation strength estimation methods mainly include:
(1) Estimating the reverberation strength according to the reverberation time:
Reverberation time (denoted RT60) is defined as the time it takes, from the moment the acoustic excitation ceases, for the residual acoustic energy in a particular room space to decay, after multiple reflections, to 60 dB below the energy at the initial observation. Reverberation time is an important index for measuring the reverberation characteristics of a specific room space, and is closely related to the estimation of Late-Reverberation power in dereverberation algorithms.
However, blind estimation of the reverberation time is still an open research problem; in particular, with only a single channel it is difficult to obtain the reverberation time accurately in an arbitrary environment.
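The RT60 definition above can be illustrated with a short sketch: given a room impulse response, Schroeder backward integration yields the energy decay curve, and extrapolating its slope gives the time to decay by 60 dB. This is a hedged illustration of the definition only; the patent does not estimate RT60 this way, and the synthetic impulse response, sample rate, and fitting range here are assumptions.

```python
import numpy as np

def rt60_from_impulse_response(h, fs):
    """Estimate RT60 via Schroeder backward integration of a room
    impulse response, extrapolating the -5..-25 dB decay slope."""
    energy = np.asarray(h, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]          # remaining energy per sample
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-300)
    t = np.arange(len(energy)) / fs
    mask = (edc_db <= -5.0) & (edc_db >= -25.0)  # linear decay region
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)
    return -60.0 / slope                         # seconds to fall 60 dB

# Synthetic impulse response: decaying noise with a known ~0.4 s RT60
fs = 16000
rng = np.random.default_rng(0)
t = np.arange(fs) / fs
h = rng.standard_normal(fs) * np.exp(-6.9 * t / 0.4)
print(round(rt60_from_impulse_response(h, fs), 2))  # ~0.4
```

The recovered value matches the decay rate built into the synthetic response, which is the defining property RT60 captures.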
(2) Estimating the reverberation strength from the SRMR values:
The speech-to-reverberation modulation energy ratio (SRMR) estimates the reverberation strength by computing the ratio of speech-modulation energy to reverberation-modulation energy. However, SRMR is text dependent and is affected by the vowels in the speech; it may return a high reverberation strength even when no reverberation is present.
Disclosure of Invention
The invention mainly aims to provide a method, a device and a storage medium for estimating the reverberation intensity of voice based on image recognition, and aims to solve the technical problem that the reverberation intensity is difficult to measure accurately.
In order to achieve the above object, the present invention provides a method for estimating the reverberation strength of speech based on image recognition, which comprises the following steps:
step a, converting the reverberation voice into a three-dimensional spectrogram;
b, detecting the image of the three-dimensional spectrogram to obtain a tail section of the reverberation voice in the three-dimensional spectrogram;
step c, calculating the energy intensity of the trailing section, and taking the energy intensity as an initial estimation value of the reverberation intensity;
and d, smoothing the initial estimation values of more than two tail sections to obtain a final estimation value, and taking the final estimation value as the measurement of the reverberation intensity of the reverberation voice.
Preferably, in the step a, the three-dimensional spectrogram is color-labeled according to the intensity of the spectrogram energy; in the step c, the energy intensity of the trailing segment is calculated according to the color depth in the color mark.
Further, the color mark means that the stronger the speech spectrum energy is, the darker the color is, and the weaker the speech spectrum energy is, the lighter the color is.
Preferably, in the step b, identifying the hangover segment according to the energy loss law of the reverberant speech includes:
b1. searching for more than one frequency point over a preset time interval and a preset frequency band;
b2. determining, among the more than one frequency points, the frequency point with the highest amplitude;
b3. moving along the time axis, and searching on the preset frequency band for more than one frequency point whose amplitude is lower than that of the highest-amplitude frequency point, to obtain low-amplitude frequency points;
b4. judging whether the low-amplitude frequency points conform to the energy loss law; if so, the time range corresponding to the low-amplitude frequency points is judged to be a reverberation period, and the reverberation period is the hangover segment.
Preferably, in the step b, the three-dimensional spectrogram is used as an input of a neural network, and a trailing segment of the reverberation voice in the three-dimensional spectrogram is obtained through an image detection function of the neural network.
Further, the neural network adopts a TDNN neural network or a CNN neural network.
Preferably, in the step d, a log1p function is adopted for smoothing; the calculation method is as follows:
log1p=log(x+1);
wherein x is an initial estimation value of the trailing segment.
Furthermore, to achieve the above object, the present invention further provides an apparatus including a memory, a processor and an image recognition based voice reverberation strength estimation program stored on the memory and executable on the processor, wherein the image recognition based voice reverberation strength estimation program, when executed by the processor, implements the steps of the image recognition based voice reverberation strength estimation method as described above.
Furthermore, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon an image recognition-based speech reverberation intensity estimation program, which when executed by a processor implements the steps of the image recognition-based speech reverberation intensity estimation method as described above.
The invention has the beneficial effects that:
(1) The method converts the reverberant speech into a three-dimensional spectrogram; performs image detection on the spectrogram to obtain the trailing segments of the reverberant speech; calculates the energy intensity of each trailing segment as an initial estimate of the reverberation intensity; and finally smooths the initial estimates of two or more trailing segments to obtain a final estimate, taken as the measure of the reverberation intensity of the reverberant speech, which can greatly improve the interference resistance and accuracy of reverberation intensity measurement;
(2) The energy intensity is calculated by adopting the color depth in the color mark based on image recognition, so that the method is more intuitive;
(3) The trailing segment is identified, based on image recognition, by using the amplitudes of frequency points to determine the highest-amplitude frequency point and then searching for low-amplitude frequency points relative to it, so that the trailing segment can be located quickly and accurately;
(4) The smoothing algorithm of the invention can ensure the validity of data, thereby improving the accuracy of the calculated result of the reverberation intensity.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions will be clearly and completely described below with reference to specific embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The spectrum of reverberant speech shows obvious trailing in the gaps between speech; when rendered on a spectrogram, these trailing segments differ clearly from spectrogram patterns produced by other causes. Moreover, the stronger the reverberation, the larger the energy of the trailing segment. These trailing segments can therefore be found through image recognition, and the energy intensity of each segment calculated as the measure of the reverberation intensity of the reverberant speech, which yields the technical scheme of the present invention.
Specifically, the method for estimating the speech reverberation intensity based on image recognition comprises the following steps:
step a, converting the reverberation voice into a three-dimensional spectrogram;
b, carrying out image detection on the three-dimensional spectrogram to obtain a tail section of the reverberation voice in the three-dimensional spectrogram;
step c, calculating the energy intensity of the trailing section, and taking the energy intensity as an initial estimation value of the reverberation intensity;
and d, smoothing the initial estimation values of more than two tail sections to obtain a final estimation value, and taking the final estimation value as the measurement of the reverberation intensity of the reverberation voice.
In the step a, the three-dimensional spectrogram is a time-frequency amplitude three-dimensional map, and the frame number (time) is taken as an x-axis, the frequency is taken as a y-axis, and the amplitude is taken as a z-axis. In the embodiment, the three-dimensional spectrogram is subjected to color marking according to the intensity of the spectrogram energy; the color mark means that the stronger the speech spectrum energy is, the darker the color is, and the weaker the speech spectrum energy is, the lighter the color is. In this embodiment, the intensity of energy in the spectrogram is represented by red, and deeper red indicates greater energy.
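A minimal sketch of this conversion, assuming a Hann window and a 512-sample frame with 50% overlap (the patent does not fix these parameters): the magnitude STFT gives exactly the frame-frequency-amplitude array described above, which a plotting library would then render with a color map.

```python
import numpy as np

def three_d_spectrogram(x, frame_len=512, hop=256):
    """Magnitude STFT: rows = frequency bins (y-axis), columns = frame
    index / time (x-axis), values = amplitude (z-axis / color depth)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.empty((frame_len // 2 + 1, n_frames))
    for i in range(n_frames):
        frame = x[i * hop:i * hop + frame_len] * window
        spec[:, i] = np.abs(np.fft.rfft(frame))
    return spec

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)        # 1 s, 440 Hz test tone
spec = three_d_spectrogram(x)
print(spec.shape)                      # (257, 61)
```

For the test tone, every column peaks near bin 14 (440 Hz / 31.25 Hz per bin), illustrating how stationary energy appears as a horizontal stripe in the image.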
In the step b, identifying the hangover segment according to the energy loss law of the reverberant speech specifically includes:
b1. searching for more than one frequency point over a preset time interval and a preset frequency band;
b2. determining, among the more than one frequency points, the frequency point with the highest amplitude;
b3. moving along the time axis, and searching on the preset frequency band for more than one frequency point whose amplitude is lower than that of the highest-amplitude frequency point, to obtain low-amplitude frequency points;
b4. judging whether the low-amplitude frequency points conform to the energy loss law; if so, the time range corresponding to the low-amplitude frequency points is judged to be a reverberation period, and the reverberation period is the hangover segment.
Speech reverberation is the result of multiple reflections; each reflection loses energy, which appears as smearing in the three-dimensional spectrogram. After finding the frequency point with the highest amplitude over a continuous time span, points with smaller amplitude at the same frequency are located in subsequent frames; these smaller-amplitude points constitute the trailing segment of the reverberant speech.
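Steps b1-b4 can be sketched as a rule-based search over the magnitude spectrogram. This is a hedged stand-in for the neural-network detector of the embodiment; the band, decay tolerance, and energy floor are assumptions, not values from the patent.

```python
import numpy as np

def find_hangover(spec, band, start=0, max_len=20, decay_tol=0.2, floor=1e-3):
    """Rule-based sketch of steps b1-b4: locate the highest-amplitude frame
    in a frequency band (b1/b2), then walk forward in time collecting frames
    whose amplitude keeps decaying below the peak (b3/b4)."""
    frame_amp = spec[band, :].max(axis=0)
    peak_frame = int(np.argmax(frame_amp[start:])) + start
    peak_amp = frame_amp[peak_frame]
    tail, prev = [], peak_amp
    for f in range(peak_frame + 1, min(peak_frame + 1 + max_len, len(frame_amp))):
        amp = frame_amp[f]
        if amp < peak_amp * floor:               # energy has died out
            break
        if amp < prev * (1.0 + decay_tol) and amp < peak_amp:
            tail.append(f)                       # consistent with the decay law
            prev = amp
        else:
            break
    return peak_frame, tail

# Toy spectrogram: a strong frame at t=5 followed by an exponential tail
spec = np.zeros((10, 30))
spec[4, 5] = 1.0
for k in range(1, 8):
    spec[4, 5 + k] = 0.5 ** k
peak, tail = find_hangover(spec, slice(0, 10))
print(peak, len(tail))                           # 5 7
```

The decay check of b4 is what separates a genuine hangover from an abrupt offset: frames that rise again, or that fall below the energy floor, terminate the segment.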
In this embodiment, the three-dimensional spectrogram is used as an input of a neural network, and a trailing segment of the reverberation voice in the three-dimensional spectrogram is obtained through an image detection function of the neural network. Preferably, the neural network is a TDNN neural network or a CNN neural network. In the embodiment, the three-dimensional spectrogram and the color marks thereof are used as the input of a neural network, and the trailing segment of the reverberation voice is output; and meanwhile, outputting the characteristics of the frequency, the amplitude and the like corresponding to the reverberation section.
The TDNN is a Time-Delay Neural Network that extends the output of each hidden layer in the time domain: the input received by each hidden layer is not only the output of the previous layer at the current time, but also the output of the previous layer at several times before and after it. The TDNN is multi-layered; each layer has strong abstraction capacity over the features, can express the temporal relations of the speech features, and is time-invariant. By sharing weights along the time dimension, the TDNN reduces learning complexity, is suitable for processing speech and time-series signals, and is well matched to the delay characteristic of reverberant speech.
The CNN is a Convolutional Neural Network, whose design was influenced by the time-delay neural network used in speech signal processing. A convolutional neural network is a type of artificial neural network whose weight-sharing structure resembles a biological neural network, reducing the complexity of the network model and the number of weights. This advantage is more pronounced when the network input is a multi-dimensional image: the image can be fed directly into the network, avoiding the complex feature extraction and data reconstruction of traditional recognition algorithms. Convolutional networks are multi-layer perceptrons specifically designed to recognize two-dimensional shapes, and their structure is highly invariant to translation, scaling, tilting, and other forms of deformation.
In the step c, the energy intensity of the trailing segment is calculated according to the color depth in the color mark. And, the longer the tail, the lower the energy and the smaller the amplitude. The formation of longer trailing segments may be a result of 2 to 3 reflections.
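Step c can then be sketched by reading the amplitudes of the detected trailing frames directly, assuming squared STFT magnitude as a stand-in for the color depth of the rendered spectrogram (the rendering itself is not needed for the computation):

```python
import numpy as np

def hangover_energy(spec, band, frames):
    """Mean energy (squared magnitude) over the hangover patch; serves as
    the initial reverberation-intensity estimate of step c."""
    patch = spec[band, :][:, frames]
    return float(np.mean(patch ** 2))

spec = np.zeros((10, 30))
spec[4, 6:9] = [0.5, 0.25, 0.125]      # a short trailing segment
print(round(hangover_energy(spec, slice(4, 5), [6, 7, 8]), 4))  # 0.1094
```

Longer tails with lower amplitudes naturally produce smaller patch energies, consistent with the observation above that longer trailing segments carry less energy.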
In the step d, a log1p function is adopted for smoothing; the calculation method is as follows:
log1p=log(x+1);
wherein x is an initial estimation value of the trailing segment.
The log1p function can transform data with large skewness so that it better follows a Gaussian distribution. Moreover, the log1p function preserves the numerical validity of x: when x is very small (for example, the subtraction of two nearly equal values yields x = 10^-16), it is too small for log(x) to be computed reliably, whereas log1p still yields a small but non-zero result, which improves the accuracy of the computed reverberation intensity.
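A minimal sketch of step d, assuming the final measure is the mean of the log1p-smoothed initial estimates (the patent specifies only log1p = log(x + 1), not the combination rule):

```python
import numpy as np

def final_reverberation_estimate(initial_estimates):
    """Smooth each per-segment initial estimate with log1p, then average."""
    smoothed = np.log1p(np.asarray(initial_estimates, dtype=float))
    return float(np.mean(smoothed))

# log1p keeps tiny estimates valid and non-zero, unlike log:
print(np.log1p(1e-16) > 0.0)                                     # True
print(round(final_reverberation_estimate([0.2, 0.35, 0.1]), 4))  # 0.1926
```

numpy's `log1p` is computed accurately even for arguments far below machine epsilon of the `x + 1` sum, which is exactly the validity property the description relies on.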
The reverberation intensity estimation method of the invention can be applied to inferring the speaking environment. Specifically, a speaking-environment estimation method can be provided: acquire the reverberant speech of the speaking environment, compute its reverberation intensity with the image recognition-based speech reverberation intensity estimation method above, and input the intensity into a neural network model to predict the corresponding speaking environment; the reverberation intensity and other parameter features corresponding to each speaking environment are preset in the neural network model.
Alternatively, the reverberation intensity estimation method of the invention can be applied to, but is not limited to, evaluating whether places such as recording studios and concert halls meet required acoustic standards.
In addition, the present invention also provides an apparatus, such as a mobile phone, digital camera, or tablet computer, that has a photographing function, an image recognition-based speech reverberation intensity estimation function, or an image display function together with a speech processing function. The apparatus may include components such as a memory, a processor, an input unit, a display unit, and a power supply.
The memory may be used to store software programs and modules, and the processor may execute various functional applications and data processing by operating the software programs and modules stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (e.g., an image playing function, etc.) required by at least one function, and the like; the storage data area may store data created according to the use of the device, and the like. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory may further include a memory controller to provide the processor and the input unit access to the memory.
The input unit may be used to receive input numeric or character or image information, voice information, and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. Specifically, the input unit of the present embodiment may include a microphone and other input devices in addition to the camera.
The display unit may be used to display information input by or provided to a user and various graphical user interfaces of the apparatus, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit may include a Display panel, and optionally, the Display panel may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or the like.
An embodiment of the present invention further provides a computer-readable storage medium, which may be the computer-readable storage medium contained in the memory in the foregoing embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium has stored therein at least one instruction that is loaded and executed by a processor to implement a method for speech reverberation strength estimation based on image recognition. The computer readable storage medium may be a read-only memory, a magnetic or optical disk, or the like.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the apparatus embodiment and the storage medium embodiment, since they are substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
Also, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
While the above description shows and describes the preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein, but is not to be construed as excluding other embodiments and is capable of use in various other combinations, modifications, and environments and is capable of changes within the scope of the inventive concept as expressed herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (6)
1. A speech reverberation strength estimation method based on image recognition is characterized by comprising the following steps:
step a, converting the reverberation voice into a three-dimensional spectrogram; the three-dimensional spectrogram is subjected to color marking according to the intensity of the spectrogram energy;
b, carrying out image detection on the three-dimensional spectrogram to obtain a tail section of the reverberation voice in the three-dimensional spectrogram;
step c, calculating the energy intensity of the tailing section according to the color depth in the color mark, and taking the energy intensity as an initial estimation value of the reverberation intensity;
step d, smoothing the initial estimation values of more than two tail sections to obtain a final estimation value, and taking the final estimation value as the measurement of the reverberation intensity of the reverberation voice;
in the step b, the three-dimensional spectrogram is used as the input of a neural network, and the trailing section of the reverberation voice in the three-dimensional spectrogram is obtained through the image detection function of the neural network; and, identifying the hangover segment according to the energy loss law of the reverberant speech, specifically including:
b1. searching for more than one frequency point over a preset time interval and a preset frequency band;
b2. determining, among the more than one frequency points, the frequency point with the highest amplitude;
b3. moving along the time axis, and searching on the preset frequency band for more than one frequency point whose amplitude is lower than that of the highest-amplitude frequency point, to obtain low-amplitude frequency points;
b4. judging whether the low-amplitude frequency points conform to the energy loss law; if so, the time range corresponding to the low-amplitude frequency points is judged to be a reverberation period, and the reverberation period is the hangover segment.
2. The method of claim 1, wherein the method comprises: the color mark means that the stronger the speech spectrum energy is, the darker the color is, and the weaker the speech spectrum energy is, the lighter the color is.
3. The method of claim 1, wherein the method comprises: the neural network adopts TDNN neural network or CNN neural network.
4. The method of claim 1, wherein the method comprises: in the step d, a log1p function is adopted for smoothing; the calculation method is as follows:
log1p=log(x+1);
wherein x is an initial estimation value of the trailing segment.
5. An image recognition-based speech reverberation strength estimation device, characterized in that the device comprises a memory, a processor and an image recognition-based speech reverberation strength estimation program stored on the memory and executable on the processor, wherein the image recognition-based speech reverberation strength estimation program, when executed by the processor, implements the steps of the image recognition-based speech reverberation strength estimation method according to any of the claims 1 to 4.
6. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon an image recognition-based speech reverberation strength estimation program, which when executed by a processor implements the steps of the image recognition-based speech reverberation strength estimation method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010426246.0A CN111785292B (en) | 2020-05-19 | 2020-05-19 | Speech reverberation intensity estimation method and device based on image recognition and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010426246.0A CN111785292B (en) | 2020-05-19 | 2020-05-19 | Speech reverberation intensity estimation method and device based on image recognition and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111785292A CN111785292A (en) | 2020-10-16 |
CN111785292B true CN111785292B (en) | 2023-03-31 |
Family
ID=72754277
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010426246.0A Active CN111785292B (en) | 2020-05-19 | 2020-05-19 | Speech reverberation intensity estimation method and device based on image recognition and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111785292B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114822589B (en) * | 2022-04-02 | 2023-07-04 | 中科猷声(苏州)科技有限公司 | Indoor acoustic parameter determination method, model construction method, device and electronic equipment |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8036767B2 (en) * | 2006-09-20 | 2011-10-11 | Harman International Industries, Incorporated | System for extracting and changing the reverberant content of an audio input signal |
JP5895529B2 (en) * | 2011-12-28 | 2016-03-30 | ヤマハ株式会社 | Reverberation analysis apparatus and reverberation analysis method |
EP2733700A1 (en) * | 2012-11-16 | 2014-05-21 | Nederlandse Organisatie voor toegepast -natuurwetenschappelijk onderzoek TNO | Method of and apparatus for evaluating intelligibility of a degraded speech signal |
JP6261043B2 (en) * | 2013-08-30 | 2018-01-17 | 本田技研工業株式会社 | Audio processing apparatus, audio processing method, and audio processing program |
CN103440869B (en) * | 2013-09-03 | 2017-01-18 | 大连理工大学 | Audio-reverberation inhibiting device and inhibiting method thereof |
CN104658543A (en) * | 2013-11-20 | 2015-05-27 | 大连佑嘉软件科技有限公司 | Method for eliminating indoor reverberation |
US10382880B2 (en) * | 2014-01-03 | 2019-08-13 | Dolby Laboratories Licensing Corporation | Methods and systems for designing and applying numerically optimized binaural room impulse responses |
US9558757B1 (en) * | 2015-02-20 | 2017-01-31 | Amazon Technologies, Inc. | Selective de-reverberation using blind estimation of reverberation level |
CN107680603B (en) * | 2016-08-02 | 2021-08-31 | 电信科学技术研究院 | Reverberation time estimation method and device |
CN109637553A (en) * | 2019-01-08 | 2019-04-16 | 电信科学技术研究院有限公司 | A kind of method and device of speech dereverbcration |
CN110827821B (en) * | 2019-12-04 | 2022-04-12 | 三星电子(中国)研发中心 | Voice interaction device and method and computer readable storage medium |
- 2020-05-19: CN application CN202010426246.0A filed; granted as patent CN111785292B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN111785292A (en) | 2020-10-16 |
Similar Documents
Publication | Title |
---|---|
CN111161752B (en) | Echo cancellation method and device | |
CN110491403B (en) | Audio signal processing method, device, medium and audio interaction equipment | |
CN109087669B (en) | Audio similarity detection method and device, storage medium and computer equipment | |
CN107481718B (en) | Audio recognition method, device, storage medium and electronic equipment | |
CN111179961B (en) | Audio signal processing method and device, electronic equipment and storage medium | |
CN110808063A (en) | Voice processing method and device for processing voice | |
CN111312273A (en) | Reverberation elimination method, apparatus, computer device and storage medium | |
CN105308679A (en) | Method and system for identifying location associated with voice command to control home appliance | |
US10297251B2 (en) | Vehicle having dynamic acoustic model switching to improve noisy speech recognition | |
Wang et al. | Recurrent deep stacking networks for supervised speech separation | |
CN110047470A (en) | A kind of sound end detecting method | |
JP2009271359A (en) | Processing unit, speech recognition apparatus, speech recognition system, speech recognition method, and speech recognition program | |
CN108899047A (en) | The masking threshold estimation method, apparatus and storage medium of audio signal | |
CN108206027A (en) | A kind of audio quality evaluation method and system | |
CN112309417B (en) | Method, device, system and readable medium for processing audio signal with wind noise suppression | |
CN111785292B (en) | Speech reverberation intensity estimation method and device based on image recognition and storage medium | |
KR20210137146A (en) | Speech augmentation using clustering of queues | |
JPWO2011122521A1 (en) | Information display system, information display method and program | |
JP6265903B2 (en) | Signal noise attenuation | |
Chan et al. | Speech enhancement strategy for speech recognition microcontroller under noisy environments | |
CN113744730A (en) | Sound detection method and device | |
CN111755029B (en) | Voice processing method, device, storage medium and electronic equipment | |
JP2009276365A (en) | Processor, voice recognition device, voice recognition system and voice recognition method | |
CN112967731B (en) | Method, device and computer readable medium for eliminating voice echo | |
US11769486B2 (en) | System and method for data augmentation and speech processing in dynamic acoustic environments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |