CN116129926A

CN116129926A - Natural language interaction information processing method for intelligent equipment

Info

Publication number: CN116129926A
Application number: CN202310422056.5A
Authority: CN
Inventors: 林皓; 王留芳
Original assignee: Beijing VRV Software Corp Ltd
Current assignee: Beijing VRV Software Corp Ltd
Priority date: 2023-04-19
Filing date: 2023-04-19
Publication date: 2023-05-16
Anticipated expiration: 2043-04-19
Also published as: CN116129926B

Abstract

The invention relates to the technical field of voice enhancement, in particular to a natural language interaction information processing method of intelligent equipment, which comprises the following steps: obtaining a spectrogram of a voice signal in the interaction process of intelligent equipment, segmenting the voice signal to obtain a time period range, performing super-pixel segmentation on the spectrogram of the voice signal to obtain a frequency range corresponding to a super-pixel region, analyzing the frequency range under each time period range to obtain a fitting reference weight value, performing self-adaptive same-frequency curve fitting according to the fitting reference weight value to obtain a same-frequency curve, performing inverse Fourier transform on the same-frequency curve to obtain an expanded interaction voice signal, and decomposing and denoising the obtained expanded interaction voice signal. The invention avoids the end effect problem in the traditional EMD algorithm, so that the obtained decomposition result is more accurate, and the denoising enhancement effect on the voice is improved.

Description

Natural language interaction information processing method for intelligent equipment

Technical Field

The invention relates to the technical field of voice enhancement, in particular to a natural language interaction information processing method of intelligent equipment.

Background

Natural language interactive information processing is a very important part in the field of artificial intelligence, and the main purpose of the natural language interactive information processing is to solve the man-machine conversation problem and enable intelligent equipment to understand content expressed by human beings. In the natural language interactive information processing process, the processing process needs multiple discipline theoretical knowledge and belongs to the cross disciplines. The natural language processing of the intelligent equipment still cannot meet the 'natural' interaction requirement at present, and one main reason is that the audio information environment in the interaction process is chaotic and is easy to be interfered by noise, so that the voice recognition rate is lower, and further, the intelligent equipment is caused to recognize by mistake and further cause to understand the content expressed by human in the natural language interaction process;

Because the audio information is easily interfered by noise, the recognized voice information often needs to be subjected to denoising processing in the preprocessing process of the natural language interaction information processing process. The EMD algorithm is an algorithm suitable for nonlinear and non-stationary signal processing, and is also applied to the processing process of audio information by decomposing a complex signal into a plurality of connotation modal components (IMF components) representing different frequencies and processing the different connotation modal components to achieve the purpose of denoising. In the decomposition process, the EMD algorithm decomposes with local extremum points, so that edge local extremum points appear at the end points, and then the end point effect is caused, so that the subsequent denoising effect is poor due to poor decomposition effect. The most common solution of the end effect is to expand the end information through data fitting, and the spectrogram is the information representation of the audio information, and has better information representation capability compared with the acquired sound wave image, so that the end effect problem in the EMD algorithm is solved by combining the audio information expression capability in the spectrogram and the data fitting algorithm, and the acquired decomposition result is more accurate, and the denoising effect is improved.

Disclosure of Invention

The invention provides a natural language interaction information processing method of intelligent equipment, which aims to solve the existing problems.

The intelligent device natural language interaction information processing method adopts the following technical scheme:

the invention provides a natural language interaction information processing method of intelligent equipment, which comprises the following steps:

obtaining a basic value according to the spectrogram of the historical voice signal, and obtaining a super-pixel region according to the spectrogram of the interactive voice signal;

executing the segmentation operation of the interactive voice signal to obtain a segmentation interval, comprising:

segmenting the interactive voice signal according to a preset interactive time period range to obtain a plurality of interactive time period ranges; obtaining the frequency range contained in each interaction time period range from the range contained in the maximum value and the minimum value of the frequency in the super pixel region, obtaining self-correlation according to the energy value difference corresponding to the adjacent frequency in the same frequency range at the same time point, and obtaining adjacent correlation according to the energy value difference corresponding to the same frequency range contained in the two interaction time period ranges respectively; the self-correlation and the adjacent correlation are subjected to weight fusion by utilizing the difference between the base value and the range size of the time period range to obtain the selection degree; under a plurality of preset time period ranges, a first interaction time period range with the largest selection degree is obtained and is recorded as a segmentation interval;

Intercepting the interactive voice signal in the first segmented section, and repeatedly executing the segmentation operation on the rest of the intercepted interactive voice signals until the interactive voice signal cannot be segmented any more, so as to obtain a plurality of segmented sections of the interactive voice signal;

the frequency range obtained by equally dividing is recorded as a sub-frequency range, a local range is obtained according to the minimum value between the frequency number in the sub-frequency range in all the segmented intervals and the interval length of all the segmented intervals, an energy curve is built in the local range according to the energy value difference of adjacent time points, and a fitting reference weight value is obtained according to the ratio of the number of the maximum value points of the energy curve to the number of the energy curve and the energy value average value in the local range;

performing curve fitting by using the energy value corresponding to the maximum fitting reference weight value to obtain a same-frequency curve, and transforming and expanding the same-frequency curve to obtain an expanded interactive voice signal;

and decomposing and denoising the expanded interactive voice signals to realize denoising enhancement on the interactive voice signals.

Further, the obtaining a basic value according to the spectrogram of the historical voice signal includes the following specific steps:

acquiring a spectrogram of the historical voice signal, and acquiring a time segment segmentation point according to the spectrogram, so as to obtain the first speech signal in the historical voice signal

The time points are the preference degrees of the time segment segmentation points

The acquisition method of (1) comprises the following steps:

in the method, in the process of the invention,

representing the first of the historical speech signals

The maximum frequency at each time point, x represents any frequency in any frequency interval,

represent the first

The corresponding energy value at frequency x for each point in time,

represent the first

Corresponding energy value of each time point under the frequency x, S (a) represents the first place on the spectrogram

Frequency intervals corresponding to the time points;

the degree of preference corresponding to any one time point

When the time point is larger than a preset preference degree threshold value, the time point is taken as a time segment segmentation point, the obtained multiple time segment segmentation points are utilized to segment the historical voice signal, the time segment range of the multiple segmented historical voice signals is obtained and is recorded as a historical time segment range, the range size average value of the multiple historical time segment ranges is obtained and is recorded as a basic value。

Further, the method for obtaining the super-pixel region according to the spectrogram of the interactive voice signal comprises the following steps:

and segmenting the spectrogram of the interactive voice signal by using a super-pixel segmentation algorithm, and uniformly distributing a preset number of initial seed points in the spectrogram contained in any time range to obtain a plurality of super-pixel areas.

Further, the self-correlation is obtained by the following method:

in the method, in the process of the invention,

representing the self-correlation of the interactive voice signal within the u-th interactive time period,

representing the number of frequency ranges contained within the u-th interaction period range,

representing the ith interaction period in the range of the ith interaction period

The frequency range of the frequency band is set,

The number of frequencies of the frequency ranges;

right endpoint lower than the right endpoint representing the ith interaction period range

The energy value corresponding to the jth frequency in the frequency range,

right endpoint lower-th of the right-hand side ith interaction period range representing the interaction period range

The energy value corresponding to the j+1th frequency in the frequency range;

an exponential function based on a natural constant is represented.

Further, the adjacent correlation is obtained by the following method:

in the method, in the process of the invention,

representing the adjacent correlation of the interactive voice signal within the u-th interactive period,

The frequency range of the frequency band is set,

The number of frequencies of the frequency ranges;

The jth frequency pair in the frequency rangeThe amount of energy to be used,

the left end point of the range representing the (u+1) th interaction period is lower than the (u+1) th interaction period

The energy value corresponding to the j-th frequency in the frequency range.

Further, the selection degree is obtained by the following steps:

the method comprises the steps of presetting initial range size of a time range, taking preset fixed step length as an increment value of successive iteration of the initial range size, taking a time range corresponding to the time range after the initial range size is iteratively increased as a plurality of preset time range, obtaining self-correlation and adjacent correlation of interactive voice signal selection degree in the plurality of preset time range, and carrying out weight fusion on the self-correlation and the adjacent correlation by utilizing a basic value to obtain the selection degree, wherein the selection degree is as follows:

in the method, in the process of the invention,

the range size representing the range of the u-th interaction period,

the base value is represented by a value of,

a range size representing a range of the ith interaction period is

The degree of selection;

an exponential function based on a natural constant is represented.

Further, the energy curve is obtained by the following steps:

starting from the point at the lower left corner of the local range, taking the abscissa as the direction, acquiring the point which has the smallest energy value difference with the starting point and the nearest frequency and is smaller than the preset energy value difference threshold value, if the condition is not met, not connecting, starting to connect with the last point at the lower left corner again, and the like, acquiring the connection sequence, and acquiring the energy curve according to the connection sequence and the energy value of each point.

Further, the fitting reference weight value is obtained by the following steps:

acquiring a peak point, namely a maximum point, of an energy curve of a local range where any point is located; equally dividing each frequency range into a plurality of sub-frequency ranges, analyzing each range as the same frequency, and obtaining the first frequency in any segment interval

The first of the sub-frequency ranges

Fitting reference weight values of the individual points:

in the method, in the process of the invention,

represent the first

The first of the sub-frequency ranges

The number of different time points corresponding to the maximum value points of the points in the local range,

represent the first

The first of the sub-frequency ranges

The number of curves of the energy curve in the local range of the individual points,

represent the first

The first of the sub-frequency ranges

The energy mean value in the local range of the individual points,

represent the first

The first of the sub-frequency ranges

Fitting of individual points refers to the weight values.

Further, the implementation of denoising enhancement to the interactive voice signal specifically includes the following steps:

calculating a fitting reference weight value of each point in a spectrogram of the interactive voice signal, selecting an energy value of the maximum fitting reference weight value on each time point in a sub-frequency range to perform same-frequency curve fitting, obtaining fitted same-frequency curves for the energy values of the maximum fitting reference weight values on each time point in all the sub-frequency ranges, expanding the same-frequency curves obtained by fitting to the outside of the initial time point, further obtaining same-frequency curves under different frequencies after expansion, and performing inverse Fourier transform on the obtained expanded same-frequency curves to obtain expanded interactive voice signals;

EMD decomposition is carried out on the obtained expanded interactive voice signals to obtain a plurality of voice IMF components, and each voice IMF component is subjected to denoising by adopting a wavelet threshold denoising algorithm, so that denoising enhancement on the interactive voice signals is realized, and the interactive voice signals after denoising enhancement are obtained.

The technical scheme of the invention has the beneficial effects that: and the acquired audio information is expanded by adopting the same-frequency curve fitting and combining with the distribution characteristics in the spectrogram of the audio information, so that IMF components decomposed by an EMD algorithm are more accurate. The method comprises the steps of combining the distribution characteristics of audio information in a spectrogram, dividing the audio information into a plurality of time period ranges in the same-frequency curve fitting process, determining a sub-frequency range in each time period range, calculating fitting reference weight values of each point according to the regularity characteristics of energy value distribution in each sub-frequency range, further carrying out same-frequency curve fitting, expanding the audio information according to the obtained fitting curve, avoiding the end point effect problem in the traditional EMD algorithm, and enabling the obtained decomposition result to improve the voice denoising enhancement effect more accurately.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of steps of a natural language interaction information processing method of an intelligent device.

Detailed Description

In order to further explain the technical means and effects adopted by the invention to achieve the preset aim, the following is a detailed description of the specific implementation, structure, characteristics and effects of the data management method for the secure operation and maintenance system according to the invention, namely the intelligent device natural language interaction information processing method. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The following specifically describes a specific scheme of the intelligent device natural language interaction information processing method provided by the invention with reference to the accompanying drawings.

Referring to fig. 1, a flowchart of steps of a method for processing natural language interaction information of an intelligent device according to an embodiment of the present invention is shown, where the method includes the following steps:

step S001, acquiring a history voice signal and an interaction voice signal in a man-machine interaction process by using a voice sensor in the intelligent equipment and acquiring a corresponding spectrogram.

The method comprises the steps that a voice sensor in the intelligent equipment is used for collecting and acquiring historical voice signals and interactive voice signals in the interaction process of a user and the intelligent equipment, the collected historical voice signals and the interactive voice signals are collectively called as voice signals, the model of the voice sensor is not set in the embodiment, and the meaning modal components of a plurality of voice signals can be obtained through EMD decomposition after adaptive same-frequency curve fitting in the follow-up step of the interactive voice signals collected by the voice sensor and are recorded as voice IMF components.

Converting the collected historical voice signals and the collected interactive voice signals into corresponding spectrograms: because the power spectrum of the voice signal decreases along with the increase of the frequency, most of the energy of the voice is concentrated in the low frequency part, so that the signal to noise ratio of the high frequency part is very low, a first-order high-pass filter is generally used for improving the signal to noise ratio of the signal in the high frequency part, after pre-emphasis is carried out on the voice, then framing and windowing operation is carried out, short-time Fourier transform processing is carried out on the voice signal with the set frame length and sampling frequency, so that a spectrogram is generated, wherein the specific process for generating the spectrogram is a known technology, and is not repeated in the embodiment. The sampling frequency is set to be 4kH, the frame length is set to be 20ms (namely, the corresponding narrow-band spectrogram), the window function is a hamming window function, and the empirical reference value is given according to the specific implementation situation of an implementer. The horizontal axis of the spectrogram is time, the vertical axis is frequency, and the pixel value of each point in the spectrogram is an energy value, in this embodiment, a linear normalization function is adopted to normalize all the energy values, and the normalized values are multiplied by 255, i.e. the energy values are in the range of 0-255.

Step S002, obtaining self-adaptive segmentation points according to the distribution characteristics of the spectrogram, and carrying out segmentation processing on the interactive voice signals to obtain the interactive voice signals with different time range sizes.

Because the collected voice signals are easily interfered by noise, each voice IMF component decomposed by the EMD algorithm is subjected to denoising processing in the traditional method. In the processing process, the EMD decomposition algorithm is decomposed by using local extreme points, so that edge local extreme points appear at the end points, and then the end point effect is caused, so that the subsequent denoising effect is poor due to poor decomposition effect. Therefore, the present embodiment uses the same-frequency curve fitting (Frequency Warping) algorithm to expand the voice signal, so as to avoid the end benefit.

In addition, the spectrogram is a characteristic feature of the voice signal on the frequency distribution, so that the corresponding distribution feature of the voice signal of the spectrogram can be combined to perform the same-frequency curve fitting, and in the same-frequency curve fitting process, a plurality of same-frequency curves exist, so that how to acquire a precise same-frequency curve for representing the voice signal feature determines the fitting effect, namely the solving effect of the end point effect in the subsequent EMD decomposition process.

Because the voice signal has intermittent characteristics (namely, a word exists or a word corresponds to some voice signals under the noiseless distribution of the voice signal), the voice signal is required to be divided into a plurality of time segment ranges in the same-frequency curve fitting process, and the same-frequency curve fitting is carried out on each time segment range.

After the voice signal is converted into the spectrogram, the intermittent characteristic of the voice signal and the same spectrogram signal of the signal represented by the local signal are required to be analyzed for representation when the same-frequency curve fitting is carried out, so that the spectrogram converted from the voice signal is required to be divided into a plurality of time segment ranges, and the voice signal in each time segment range is subjected to respective same-frequency curve fitting, and the length of each divided time segment range is acquired according to the distribution characteristic of the spectrogram because the spectrogram characteristics of the voice signal are similar in the same time segment.

In the use process of the intelligent device, the speaking modes or habits of each person often have different speaking habits, and each corresponding person has a speaking habit which is unique to the person, so that the speaking habit of the user can be quantified through the historical voice signal (the voice signal which is not existed or is affected by noise to a small extent can be recorded in a quiet environment as the historical voice signal, and thus the noise is not influenced in the default historical voice signal), and the basic value of the time period range is quantified according to the speaking habit, and the specific process is as follows:

Analyzing the spectrogram obtained by converting the historical voice signals, analyzing the spectrogram frequency in any time range (namely analyzing the spectrogram frequency in a single time vertical to the transverse axis), wherein the voice signals in the same time range are characterized by obvious differences between two adjacent sides of a straight line characterized by a certain time point vertical to the transverse axis, so that the voice signals are divided into a plurality of time ranges by acquiring time segment segmentation points;

the abscissa of each pixel point on the spectrogram represents a time point, the ordinate represents a frequency, the gray value represents an energy value, and the spectrogram can be known

A time point corresponds to a frequency interval, denoted S (a), any one of the frequencies in the interval being denoted x,

S(a)。

then the first of the historical voice signals

The acquisition method of (1) comprises the following steps:

in the method, in the process of the invention,

representing the first of the historical speech signals

Maximum frequency of each time point, S (a) represents the first on the spectrogram

A frequency interval corresponding to each time point, x represents an arbitrary frequency in the frequency interval,

represent the first

The corresponding energy value at frequency x for each point in time,

represent the first

Corresponding energy values of the time points at the frequency x;

Wherein, if the first

Adjacent time points at each time point (i.e

And

the larger the average energy value difference of the time points) under different frequencies, the larger the energy value difference of the spectrograms of the adjacent time points of the time points is, and the corresponding adjacent time points are characterized by different audio information.

Calculating the preference degree for all time points of the historical data, performing linear normalization processing on the preference degree of all time points, and setting a preference degree threshold value

(depending on the implementation of the implementation, the embodiment is given as an empirical reference value), if the preference degree of a certain time point is greater than the preference degree threshold value, the time point is indicated to be a time segment segmentation point for dividing the historical voice signal into a plurality of time segment ranges, and a plurality of time segment segmentation points are obtained through threshold value judgment.

Segmenting the historical voice signals according to the obtained time segment segmentation points of the historical voice signals to obtain time segment ranges of a plurality of historical voice signals, namely the historical time segment ranges for short, obtaining the average value of the lengths of all the time segment ranges (namely the time length of the interval corresponding to the two adjacent time segment segmentation points), and recording the average value as a basic value

。

In addition, the spectrogram of the interactive voice signal is segmented by using the superpixel segmentation algorithm, and the obtained frequency range of the superpixel region is a range formed by the minimum value and the maximum value of the ordinate, wherein the number of initial seed points of the superpixel segmentation algorithm is set to 15, the spectrogram contained in any time range is evenly distributed, a plurality of superpixel regions are obtained, the frequency in each superpixel region is similar, and each superpixel region is formed by the corresponding maximum frequency and minimum frequency, namely, one superpixel region corresponds to one frequency range, it is required to be noted that the time range is obtained by segmenting the time corresponding to the abscissa, the corresponding time range contains all the vertical axis information, namely, the frequency in the range, and the superpixel segmentation is a known technology, which is not repeated in the embodiment.

When dividing the interactive voice signal, taking an initial value W=5 as the initial range size of the interactive time period range of the interactive voice signal, taking the time period range of the interactive voice signal as the interactive time period range for short, setting a value 2 as the step size of the interactive time period range for carrying out iterative increment, taking the time point of the initial part (namely the leftmost side) of the interactive voice signal as the beginning, dividing the interactive voice signal by taking the initial range size W=5 of the interactive time period range to obtain the divided initial interactive time period range, carrying out iterative adjustment on the initial interactive time period range by utilizing the following step (2) and step (3), obtaining a first interactive time period range after the adjustment is completed, recording the first interactive time period range as a segmented section, intercepting the corresponding interactive voice signal of the first segmented section, taking the time point of the leftmost side of the intercepted residual voice signal as the beginning again, and so on to obtain all interactive time period ranges and range sizes of the interactive voice signal;

Step (2), when the initial range size W of the time period range of the interactive voice signal is increased in successive iterations with step size 2, adjusting the iteration process of the initial range size W:

it should be noted that, the energy values and frequencies of the voice signals in the same time period in the spectrogram should be similar, the energy values in one time period are similar and concentrated in the spectrogram, the adjacent time periods have larger difference in voice signals in the adjacent time period because of the difference and the non-concentration of the energy values, but in the segmentation process of the spectrogram of the currently acquired audio information, the voice signals in the voice interaction process of the user and the intelligent device are segmented according to the obtained segmentation points of the time period, so that larger errors can occur, and the embodiment iteratively adjusts the size of the time period range according to the self-correlation of the interactive voice signals in the same time period and the adjacent correlation in the adjacent time period to obtain the selection degree of the size of the time period range obtained by segmenting the acquired interactive voice signals so as to realize better segmentation processing of the interactive voice signals;

In addition, the time range of each super pixel area on the abscissa is obtained, all the super pixel areas, which are intersected with the u-th interaction time period range, of the time range corresponding to the super pixel area are obtained, and the frequency ranges of all the intersected super pixel areas are recorded as the frequency ranges contained in the u-th interaction time period range; in this embodiment, the rightmost corresponding time point of the u-th interaction time period range of the horizontal axis (i.e., the time point on the leftmost side of the time axis is 1 and gradually increases to the right) in the spectrogram of the interaction speech signal is marked as the right end point of the u-th interaction time period range, and the leftmost corresponding time point of the u-th interaction time period range is marked as the left end point of the u-th interaction time period range; there is also a portion where intersections exist with the plurality of super pixel regions at one time point, and the frequency ranges corresponding to the portions where intersections exist with the plurality of super pixel regions at one time point are recorded as the left end point or the right lower end point of the interaction time period range

A frequency range.

The self-correlation of the interactive voice signal in the interactive time period range is as follows:

in the method, in the process of the invention,

representing the ith frequency range within the ith interaction period range,

The number of frequencies of the frequency ranges;

The energy value corresponding to the jth frequency in the frequency range,

The energy value corresponding to the j+1th frequency in the frequency range;

an exponential function that is based on a natural constant;

because the voice information has different expressive ability at different frequencies, there are different energy values expressed in a plurality of frequency ranges, so when analyzing the degree of self-correlation, it is necessary to analyze the difference value between adjacent frequencies according to each frequency range, if the difference value between adjacent frequencies in the same frequency range is smaller, it indicates that the degree of self-correlation of the interactive voice signal is larger in the range of the interactive time period.

Adjacent correlation of interactive voice signals within an interactive time period:

in the method, in the process of the invention,

The frequency range of the frequency band is set,

The number of frequencies of the frequency ranges;

The energy value corresponding to the jth frequency in the frequency range,

The energy value corresponding to the j-th frequency in the frequency range;

according to the above analysis, after the analysis time period ranges are divided in the adjacent frequency ranges, the difference of the energy values between the same frequencies between the adjacent time points indicates that the adjacent correlation degree of the interactive voice signal corresponding to the interactive time period range is smaller if the difference of the energy values between the same frequencies between the adjacent time points is larger.

Step (3), when the initial range size W of the range of the ith interaction time period is increased by taking the value size 2 as an increasing step length, obtaining a selection degree once every time the range size is increased to judge whether the corresponding range size is selected asThe final range size of the u-th interaction period range is iteratively increased to be greater than or equal to the initial range size W of the u-th interaction period range

（

Representing the base value), stopping the increase; the degree of selection

：

In the method, in the process of the invention,

the range size representing the range of the u-th interaction period,

the base value is represented by a value of,

a range size representing a range of the ith interaction period is

The degree of selection to be made is that,

an exponential function based on a natural constant is represented.

According to the analysis of the content, the degree of selection

After the initial range size representing the u-th interaction time period range is iteratively increased, the corresponding range size may be selected as a degree of range size of the final interaction time period range

The larger the interactive voice signal is, the larger the self-correlation degree of the corresponding interactive voice signal is in the interactive time period range, and the smaller the adjacent correlation degree is, the complete voice signal characteristics are contained in the interactive time period range, and the distinction between the interactive voice signal and other adjacent voice signals is obvious.

If the range size of the u-th interaction time period range and the basic value obtained according to the historical voice signal

The smaller the difference value is, the corresponding interactive voice signal in the interactive time period accords with the speaking habit of the user, so that the degree of self-correlation of the interactive voice signal in the u-th interactive time period is needed to be considered, the degree of adjacent correlation is prevented from being considered too much, and the influence of noise is avoided too much; the larger the difference value between the size of the u-th interaction time period range and the basic value is, the more the difference value indicates that the time period of the interaction voice signal in the current time period range does not accord with the speaking habit of the user, so that the adjacent correlation degree of the corresponding interaction voice signal in the current interaction time period range needs to be considered, and the situation that the difference or the lack of the complete voice information characteristic is contained due to the fact that the divided interaction time period range is too large and too small is avoided, so that the segmentation effect is poor is avoided.

When the interactive voice signal is segmented by utilizing the initial range size which is increased continuously and iteratively, the termination condition of the iteration is that

Wherein

Representing the base value when

When the iteration increase is stopped, the iteration increase number is counted as A times, and it should be noted that when the initial range size cannot be iteratively increased to be similar to that of

When the interactive voice signal is stored, the part of the interactive voice signal is reserved for subsequent calculation; and calculating the corresponding selection degree of the initial range size W after each iteration increase, namely, the range size obtained by each iteration increase corresponds to one selection degree, obtaining the corresponding selection degree of A in the iteration increase A times, obtaining the largest selection degree of A selection degrees, taking the range size corresponding to the largest selection degree as the final range size of the u-th interaction time range for dividing the interaction voice signal, and recording the final u-th interaction time range as a segmentation interval, so as to obtain the u-th segmentation interval and interval size obtained by dividing the interaction voice signal, and only keeping the first segmentation interval and interval size when u=1.

And (4) intercepting the interactive voice signal corresponding to the first segmentation interval, dividing the intercepted residual interactive voice signal again, taking the leftmost time point of the intercepted residual interactive voice signal as a starting point, and repeating the steps similar to the operations of the step (1), the step (2) and the step (3) until the acquired interactive voice signal is divided into a plurality of interactive voice signals of different segmentation intervals, so as to realize the self-adaptive division of the interactive voice signal.

So far, according to the distribution characteristics of the spectrogram, the self-adaptive interaction time period range is obtained, and the self-adaptive interaction time period range is utilized to segment the interaction voice signals, so that the interaction voice signals of a plurality of different segmentation intervals are obtained.

And step S003, performing same-frequency curve fitting on the interactive voice signals in each segmented interval, performing interactive voice signal expansion according to the fitting result, and performing EMD decomposition to obtain a plurality of voice IMF components.

And calculating the adaptively divided interactive voice signals according to the steps, wherein the interactive voice signals in each segment interval represent the characteristics of the same voice signal. In each time period, the frequency of the voice signal has very remarkable energy value change in all frequencies, and in any time period, the energy value in a local range of the frequency has obvious regularity characteristic at a certain time point, for example, the peak point of the energy at different frequencies appears at the same time point, and a regular peak characteristic is presented, so that the point which has remarkable energy in the same-frequency curve fitting process and has stronger regularity is larger in the fitting process, and the fitting reference weight value is larger.

And utilizing the interactive voice signals to carry out self-adaptive division on the segmented intervals, wherein each segmented interval represents the signal characteristics of the same voice, and carrying out the same-frequency curve fitting in a single time period. According to the obtained frequency range (super-pixel division obtaining frequency range) in each segment interval, the frequency range is equally divided into 30 sub-frequency ranges (which can be given as experience reference values according to the specific implementation situation of an implementer), and each range is used as the same frequency for analysis.

Acquiring the first of any segmented intervals

The first of the sub-frequency ranges

Fitting reference weight values of the individual points:

in the first place

The first of the sub-frequency ranges

The local range is built by taking the points as the central points, wherein the local range can be obtained according to the range sizes of the sub-frequency range and the time period, and the size of the local range is set

The specific acquisition method comprises the following steps:

wherein, the liquid crystal display device comprises a liquid crystal display device,

the representation is rounded down and up,

represent the first

The first of the sub-frequency ranges

The number of frequencies in the sub-frequency range in which the point is located,

represent the first

The first of the sub-frequency ranges

The range size of the time period range in which the individual points are located,

represent the first

The first of the sub-frequency ranges

The size of the local range of the individual points;

The representation takes the minimum value of the value,

indicating that the local range in odd-numbered sizes is obtained from the rounded-down values.

Within the local rangeConstructing the energy curve with the same value, wherein the point at the left lower corner of the local range is taken as a starting point, the point closest to the minimum frequency of the energy value difference of the starting point and smaller than the energy value difference threshold (energy value difference threshold) is obtained by taking the abscissa as the direction (namely, the point connected with a certain frequency at the next moment)

) If the condition is not met, connection is not carried out, connection is started at another point (the last point at the lower left corner is started), and the connection sequence is obtained, and an energy curve is obtained according to the connection sequence and the energy value of each point; after the calculation of the point with the smallest time point in the local range (the leftmost point in the range) is completed, the energy curve is constructed for the starting point of the point which is not connected subsequently. The peak point, i.e. the maximum point, of each energy curve is obtained in all the energy curves.

The corresponding first segment of the interactive voice signal in the section of the ith segment of the spectrogram

The first of the sub-frequency ranges

Fitting reference weight values of individual points

The acquisition method of (1) comprises the following steps:

In the method, in the process of the invention,

represent the first

The first of the sub-frequency ranges

Different corresponding times of peak points of points in local rangeThe number of the intermediate points is equal to the number of the intermediate points,

represent the first

The first of the sub-frequency ranges

represent the first

The first of the sub-frequency ranges

The energy mean value in the local range of the individual points,

represent the first

The first of the sub-frequency ranges

Fitting of individual points refers to the weight values.

The energy average value of all the points in the local range of the point is taken as the basic value of the fitting reference weight value, wherein the larger the energy average value is, the more the characteristic signals of the voices contained in the local range of the point are indicated, the more important the corresponding point is, namely, the fitting reference weight value is larger, but the energy distribution regularity in the local range of the point is used for representing the credibility degree due to the influence of noise, and if the energy distribution signal is irregular (namely, the position with the largest corresponding equivalent energy value does not appear on the same time point) in the local range of the point, the irregular energy distribution signal in the local range of the corresponding point is indicated, and the lower the credibility degree of the point is indicated, and the smaller the corresponding fitting reference weight value is indicated.

Calculating fitting reference weight values of all points in all sub-frequency ranges, selecting energy values of the maximum fitting reference weight values on all time points in the sub-frequency ranges to perform same-frequency curve fitting, performing similar operation, obtaining fitted same-frequency curves in all the sub-frequency ranges, expanding the fitted same-frequency curves out of the initial time points (it is to be noted that when expanding, curve expression equations of the same-frequency curves obtained by known fitting are input into independent variables-1, -2, …, A, and corresponding energy values are output, so that expansion out of the initial time points is achieved, wherein A represents the degree of expansion, A= -5 is preset according to experience, the same-frequency curves of the expanded time points in different frequencies are obtained, and inverse Fourier transformation is performed on the obtained expanded same-frequency curves to obtain expanded interactive voice signals. The same frequency curve fitting and the inverse fourier transform are known techniques, and are not described in detail in this embodiment.

EMD decomposition is performed on the obtained expanded interactive voice signal to obtain a plurality of voice IMF components for subsequent denoising, wherein the EMD decomposition is a known technique, and is not repeated in the embodiment.

And performing same-frequency curve fitting on the interactive voice signals in each segmented interval, performing signal expansion according to the result of the same-frequency curve fitting to obtain expanded interactive voice signals, and performing EMD decomposition to obtain a plurality of voice IMF components.

Step S004, denoising each voice IMF component, and reconstructing the denoised voice IMF component to obtain denoised interactive voice signals.

According to the voice IMF components of the expanded interactive voice signals obtained through calculation in the steps, denoising each voice IMF component by adopting a wavelet threshold denoising algorithm, wherein the number of layers of wavelet decomposition is set to be 5, a threshold function of the wavelet threshold denoising algorithm adopts an existing soft threshold function, wherein wavelet threshold denoising is a known technology, and details are omitted in the embodiment. And reconstructing each voice IMF component after wavelet threshold denoising (reconstructing in an EMD algorithm) to obtain the interactive voice signal after denoising and enhancing.

Finally, it should be noted that, the energy value in this embodiment is the gray value of each point (i.e. pixel point) in the spectrogram corresponding to the voice signal.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. The natural language interaction information processing method of the intelligent equipment is characterized by comprising the following steps of:

2. The method for processing natural language interaction information of intelligent equipment according to claim 1, wherein the obtaining a basic value according to a spectrogram of a historical voice signal comprises the following specific steps:

The time points are the preference degree +. >

The acquisition method of (1) comprises the following steps:

in the method, in the process of the invention,

representing the +.>

Maximum frequency of each time point, x represents any frequency in any frequency interval, +.>

Indicate->

Energy values corresponding to the time points at frequency x, < >>

Indicate->

The corresponding energy value of the time points at the frequency x, S (a) represents +.>

Frequency intervals corresponding to the time points;

the degree of preference corresponding to any one time point

When the time point is larger than a preset preference degree threshold value, the time point is used as a time period segmentation point, the obtained multiple time period segmentation points are utilized to carry out segmentation processing on the historical voice signals, the time period range of the multiple segmented historical voice signals is obtained and is recorded as a historical time period range, and the range size average value of the multiple historical time period ranges is obtained and is recorded as a basic value.

3. The method for processing natural language interaction information of intelligent equipment according to claim 1, wherein the method for obtaining the super-pixel region according to the spectrogram of the interaction voice signal is as follows:

4. The method for processing natural language interaction information of intelligent equipment according to claim 1, wherein the self-correlation is obtained by the following steps:

in the method, in the process of the invention,

representing the self-correlation of the interactive voice signal within the u-th interactive time period,/for the interactive voice signal>

Representing the number of frequency ranges contained in the u-th interaction period range,/for the number of frequency ranges contained in the u-th interaction period range>

Representing the +.>

The frequency range of the frequency band is set,

representing the +.>

The number of frequencies of the frequency ranges; />

The right side end point representing the u-th interaction period range is down +.>

Energy value corresponding to the jth frequency in the frequency range,/->

Right end point of the right side ith interaction period range representing the right side ith interaction period range of the interaction period range +.>

The j+1th frequency in the frequency range corresponds toAn energy value; />

An exponential function based on a natural constant is represented.

5. The method for processing natural language interaction information of intelligent equipment according to claim 1, wherein the method for acquiring the adjacent correlation is as follows:

in the method, in the process of the invention,

representing the adjacent correlation of the interactive voice signal within the u-th interactive time period,/for the interactive voice signal>

Representing the number of frequency ranges contained in the u-th interaction period range,/for the number of frequency ranges contained in the u-th interaction period range >

Representing the +.>

The frequency range of the frequency band is set,

representing the +.>

The number of frequencies of the frequency ranges; />

Energy value corresponding to the jth frequency in the frequency range,/->

Left end point of the range representing the (u+1) th interaction period is lower than the (n)>

The energy value corresponding to the j-th frequency in the frequency range.

6. The method for processing natural language interaction information of intelligent equipment according to claim 1, wherein the selection degree is obtained by the following steps:

in the method, in the process of the invention,

a range size indicating a range of the u-th interaction period,/->

Represents a base value- >

The range size representing the range of the u-th interaction period is +.>

The degree of selection; />

An exponential function based on a natural constant is represented.

7. The method for processing natural language interaction information of intelligent equipment according to claim 1, wherein the energy curve is obtained by the following steps:

8. The method for processing natural language interaction information of intelligent equipment according to claim 1, wherein the fitting reference weight value is obtained by the following steps:

The +.>

Fitting reference weight values of individual points：

In the method, in the process of the invention,

indicate->

The +.>

The number of different time points corresponding to the maximum value points of the individual points in their local range,/->

Indicate->

The +.>

The number of curves of the energy curve in the local area of the individual points, +.>

Indicate->

The +.>

Energy mean value in local range of individual points, < >>

Indicate->

The +.>

Fitting of individual points refers to the weight values.

9. The method for processing natural language interaction information of intelligent equipment according to claim 1, wherein the implementation of denoising enhancement on the interaction voice signal comprises the following steps: