CN104299611A

CN104299611A - Chinese tone recognition method based on time frequency crest line-Hough transformation

Info

Publication number: CN104299611A
Application number: CN201410509560.XA
Authority: CN
Inventors: 于凤芹
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2014-09-28
Filing date: 2014-09-28
Publication date: 2015-01-21

Abstract

The invention provides a Chinese tone recognition method based on time-frequency ridge (crest) lines and the Hough transform. Tone recognition is recast as classifying the trend of a line segment in a time-frequency distribution image. First, the final (the vowel part of a syllable), which carries the tone, is represented by its SPWVD time-frequency distribution, in which the tone information appears as a group of nearly parallel time-frequency ridge lines. Second, because the main ridge line is the region of highest energy in the image and reflects the trend of the tone, the time-frequency image is binarized, thresholded and thinned to reduce computation, yielding the center line segment of the main ridge line. Third, the Hough transform is applied to the image containing this center line to obtain its intercept and angle parameters. Finally, the tone type is determined from the intercept and angle of the line segment together with the coordinates of its start and end points.

Description

Chinese tone recognition method based on time-frequency ridge lines and the Hough transform

Technical field

The invention belongs to the technical field of tone recognition in speech synthesis and speech recognition. The final of a Chinese syllable, which carries the tone, is represented by a time-frequency distribution, so that the tone information appears as the trend of the ridge lines in the time-frequency image. The image is preprocessed by binarization, thresholding and thinning to obtain line segments that reflect the tone trend; the Hough transform is applied to these line segments, and the tone is identified from the Hough transform parameters.

Background art

Besides the non-stationarity common to all speech signals, Chinese speech also has tones. Tone is one of the essential attributes of Chinese: it helps form words, distinguishes meaning and improves expressiveness. About 30% of Chinese is made up of homophones that differ only in tone, so tone cannot be ignored in Chinese speech processing and plays an important role in both recognition and synthesis. Speech recognition that uses tone features achieves a higher recognition rate, and speech synthesis that takes tone into account sounds less mechanical and more natural.

Each Chinese character is a single syllable, so the syllable can serve as the basic unit of Chinese speech processing. A syllable consists of an initial and a final, and the tone is carried by the final. Standard Chinese is a tonal language whose tones are generally divided into four classes: the first tone (high level), the second tone (rising), the third tone (falling-rising) and the fourth tone (falling). Each tone has a pitch contour of a characteristic shape, which reflects the pitch pattern of a normal syllable and has an arch-like form.

Current methods for extracting tone features fall mainly into time-domain and frequency-domain methods. Time-domain methods extract the fundamental frequency using linear prediction, the autocorrelation function and similar tools; frequency-domain methods apply cepstral analysis to the linear prediction residual to locate the fundamental frequency precisely. Time-domain methods are computationally cheap but have poor noise robustness and are prone to frequency-doubling and frequency-halving errors, while frequency-domain methods that combine the Hilbert-Huang transform with the cepstrum are computationally complex. Whatever method is used, the extracted fundamental frequency track never coincides exactly with the true one. In addition, the extracted tone features are generally classified with support vector machines, Gaussian mixture models or neural networks, which require a training stage before a tone can be identified, so the algorithms are complex and the computation time is long.

Summary of the invention

(1) Optimal time-frequency representation of Chinese finals

Speech is a typical non-stationary signal, and time-frequency distributions are a powerful tool for analyzing non-stationary signals. The Wigner-Ville distribution (WVD) has the best time-frequency localization, but for multicomponent signals it contains cross terms that distort the true time-frequency distribution of the signal. The smoothed pseudo Wigner-Ville distribution (SPWVD) suppresses the cross terms of the WVD by smoothing in the time domain and applying a window function in the frequency domain, balancing time-frequency localization against cross-term suppression. The SPWVD is defined as:

$$ \mathrm{SPWVD}_{z}(t, f) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} z\left(t - u + \frac{\tau}{2}\right) z^{*}\left(t - u - \frac{\tau}{2}\right) g(u)\, h(\tau)\, e^{-j 2 \pi \tau f}\, du\, d\tau \tag{1} $$

where g(u) and h(τ) are two real, even window functions with g(0) = h(0) = 1.
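As an illustrative, non-limiting sketch, equation (1) can be discretized as shown below. The raised-cosine windows, the window half-lengths and the function names are assumptions made for illustration only and are not specified in the text. The lag τ is sampled as 2m samples, so frequency bin k corresponds to k/(2·n_freq) cycles per sample. The input z is assumed to be a complex (analytic) signal.

```python
import cmath
import math


def raised_cosine_window(half_len):
    """Real, even window on lags -half_len..half_len with value 1 at lag 0 (assumed shape)."""
    return [0.5 * (1.0 + math.cos(math.pi * k / (half_len + 1)))
            for k in range(-half_len, half_len + 1)]


def spwvd(z, n_freq, g_half, h_half):
    """Discrete sketch of equation (1); returns a matrix indexed as tfr[time][freq_bin]."""
    g = raised_cosine_window(g_half)   # time-smoothing window g(u)
    h = raised_cosine_window(h_half)   # lag window h(tau)
    n = len(z)
    tfr = [[0.0] * n_freq for _ in range(n)]
    for t in range(n):
        # Time-smoothed local autocorrelation for lag tau = 2*m samples.
        r = {}
        for m in range(-h_half, h_half + 1):
            acc = 0j
            for l in range(-g_half, g_half + 1):
                a, b = t - l + m, t - l - m
                if 0 <= a < n and 0 <= b < n:
                    acc += g[l + g_half] * z[a] * z[b].conjugate()
            r[m] = h[m + h_half] * acc
        # Fourier transform over the lag variable.
        for k in range(n_freq):
            s = sum(r[m] * cmath.exp(-2j * math.pi * m * k / n_freq) for m in r)
            tfr[t][k] = s.real  # imaginary part cancels because g and h are real and even
    return tfr
```

The direct double loop is chosen for readability; a practical implementation would normally replace the inner sum over k with an FFT.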

The tone of Chinese is carried by the final, that is, the tone information lies in the voiced segment of speech. The invention applies the SPWVD to the tone-bearing final so that the variation of its instantaneous frequency over time is displayed clearly on the time-frequency plane. A time-frequency ridge line in the image traces the instantaneous frequency over time and marks the region where the signal energy is most concentrated. The SPWVD ridge lines clearly show how each tone evolves: for the same final, different tones produce ridge lines that change differently along the time axis. The SPWVDs of the final "o" in the four tones are shown in Fig. 1.

(2) Main time-frequency ridge line extraction and thinning preprocessing

Because Chinese finals are voiced sounds, they exhibit harmonics, so one or several ridge lines appear in the time-frequency image. The trends of these ridge lines are essentially the same, so only one main ridge line needs to be extracted for tone recognition, which requires thresholding the SPWVD time-frequency matrix. Because the SPWVD suppresses cross terms by smoothing the WVD with a time window and a frequency window, its time-frequency localization is degraded, the ridge lines are thicker, and the extracted ridge line has a certain width. Applying the Hough transform directly to the SPWVD would therefore increase the computation time. The SPWVD image is instead preprocessed by binarization, thresholding and thinning to extract the center line of the ridge line. The center lines of the SPWVD ridge lines of the final "o" in the four tones are shown in Fig. 2.
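The preprocessing described above can be illustrated by the following simplified sketch. It is not the morphological thinning named in the embodiment: as a stand-in, each time slice is binarized against a relative threshold and the middle of the thresholded run that contains the strongest cell is kept. The relative threshold of 0.5, the function names and the `tfr[time][freq_bin]` layout are assumptions made for illustration.

```python
def binarize(tfr, rel_threshold=0.5):
    """Mark cells whose energy is at least rel_threshold times the global maximum."""
    peak = max(max(row) for row in tfr)
    return [[1 if v >= rel_threshold * peak else 0 for v in row] for row in tfr]


def ridge_centerline(tfr, rel_threshold=0.5):
    """Return one (time_index, freq_index) point per time slice on the main ridge.

    Simplified substitute for thinning: in each time slice, locate the strongest
    cell, grow the run of above-threshold cells around it, and keep its middle.
    """
    mask = binarize(tfr, rel_threshold)
    points = []
    for t, row in enumerate(tfr):
        k_max = max(range(len(row)), key=row.__getitem__)
        if not mask[t][k_max]:
            continue  # nothing above the threshold in this time slice
        lo = hi = k_max
        while lo > 0 and mask[t][lo - 1]:
            lo -= 1
        while hi < len(row) - 1 and mask[t][hi + 1]:
            hi += 1
        points.append((t, (lo + hi) // 2))
    return points
```

Weaker harmonic ridges fall below the threshold or outside the run around the strongest cell, so only the main ridge contributes points.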

(3) Obtaining the line-segment parameter space by the Hough transform

The Hough transform is applied to the extracted center line of the SPWVD ridge line to obtain the intercept and angle parameters of the line segment and the coordinates of its start and end points. The Hough transform maps each straight line in the image to a peak at the position in parameter space that corresponds to the parameters of that line; the number and positions of the peaks then give the straight lines in image space and their parameters.

The basic idea of the Hough transform is point-line duality: the image lies in image space before the transform and in parameter space after it. In image space, every straight line through the point (x, y) satisfies the equation:

$$ y = px + q \tag{2} $$

where p is the slope and q is the intercept. The equation can also be written as:

$$ q = -px + y \tag{3} $$

Equation (3) represents a straight line through the point (p, q) in parameter space. Two points (x₁, y₁) and (x₂, y₂) on the same straight line in image space both satisfy equation (2), which in parameter space can be written as q = -px₁ + y₁ and q = -px₂ + y₂. These are two different straight lines in parameter space, but because the two points share the same slope and intercept in image space, the two lines intersect at the point (p, q) in parameter space, as shown in Fig. 3(a) and (b). Thus collinear points in image space correspond to intersecting lines in parameter space, and conversely all lines that intersect at one point in parameter space correspond to collinear points in image space. By this point-line duality, given a set of edge points in image space, the Hough transform determines the straight line connecting them. It converts the line detection problem in image space into a point detection problem in parameter space: accumulating votes at the intersection points in parameter space accomplishes both line detection and parameter estimation.
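The duality can be checked numerically with a small example. In the sketch below, the function name and the sample points are hypothetical; the points are chosen on the line y = 2x + 1, and the parameter-space lines of equation (3) for any two of them are intersected.

```python
def parameter_space_intersection(pt1, pt2):
    """Intersect the parameter-space lines q = -p*x1 + y1 and q = -p*x2 + y2.

    Returns the (p, q) at which both lines meet, i.e. the slope and intercept
    of the image-space line through the two points.
    """
    (x1, y1), (x2, y2) = pt1, pt2
    if x1 == x2:
        raise ValueError("vertical line: slope p is unbounded in this form")
    p = (y2 - y1) / (x2 - x1)
    q = y1 - p * x1
    return p, q


# Three collinear points on y = 2x + 1 (illustrative values).
points = [(1, 3), (2, 5), (4, 9)]
p, q = parameter_space_intersection(points[0], points[1])
# The third point's parameter-space line passes through the same (p, q).
third_on_same_peak = (q == -p * points[2][0] + points[2][1])
```

The `ValueError` branch also illustrates why the slope-intercept form is inconvenient for vertical lines.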

When a straight line is close to vertical, the values of p and q approach infinity and the amount of computation increases. To avoid this problem, the line can instead be expressed in polar form:

$$ \rho = x \cos\theta + y \sin\theta = \sqrt{x^{2} + y^{2}}\, \sin\left(\theta + \arctan\frac{x}{y}\right) \tag{4} $$

Here ρ is the normal distance from the origin to the line and θ is the angle between this normal and the positive X axis, as shown in Fig. 4(a). Under this equation, a point in the original image space corresponds to a sinusoidal curve in the new parameter space. That is, in moving from Cartesian to polar parameters, the Hough transform changes from a point-line duality into a point-sinusoid duality, as shown in Fig. 4(b). Detecting a straight line in image space then requires detecting the intersection of sinusoids in parameter space, and the line is described by the normal distance ρ and the normal angle θ.
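A minimal accumulator for the (ρ, θ) form of equation (4) might look as follows. The 1° angular resolution, the θ range of [-90°, 90°), the ρ bin width and the function name are assumptions made for illustration. The sketch returns the strongest line together with the first and last supporting points, which stand in for the start and end points of the line segment.

```python
import math


def strongest_line(points, n_theta=180, rho_step=1.0):
    """Vote in (rho, theta) space and return (rho, theta, start_point, end_point).

    points: iterable of (x, y) pixel coordinates on the thinned ridge.
    theta is in radians in [-pi/2, pi/2); rho is the centre of the winning bin.
    """
    thetas = [math.pi * i / n_theta - math.pi / 2 for i in range(n_theta)]
    votes = {}
    for x, y in points:
        # Each point traces one sinusoid rho(theta) through the accumulator.
        for i, th in enumerate(thetas):
            rho = x * math.cos(th) + y * math.sin(th)
            votes.setdefault((round(rho / rho_step), i), []).append((x, y))
    # The cell with the most votes is the intersection of the most sinusoids.
    (rho_idx, theta_idx), support = max(votes.items(), key=lambda kv: len(kv[1]))
    support = sorted(support)
    return rho_idx * rho_step, thetas[theta_idx], support[0], support[-1]
```

Because ρ is quantized, neighbouring θ bins can collect the same number of votes for a short segment, so the returned angle should be read as accurate only to about the chosen resolution.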

(4) Identifying the tone from the line-segment parameters obtained by the Hough transform

The tone type is determined from the values of the normal distance ρ and the normal angle θ together with the start and end coordinates of the corresponding line. The range of θ, supplemented by the coordinates of the two end points of the line segment, distinguishes the four tones. If θ is positive, or the ordinate of the end point is greater than that of the start point, the tone is the second tone. If θ is negative, or the ordinate of the end point is less than that of the start point, the tone is the fourth tone. If θ is small, close to 0, the tone is the first tone. All other cases are the third tone.
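One possible reading of the decision rule is sketched below. The text gives no numeric bound for "close to 0" and does not say how the "other cases" arise, so the 5° flatness tolerance, the handling of several detected segments (mixed trends are treated as the third tone) and the function names are all assumptions. Whether a positive θ or a larger end ordinate corresponds to rising pitch also depends on the image coordinate convention, which the sketch simply takes as stated in the rule.

```python
def segment_trend(theta_deg, start_y, end_y, flat_tol_deg=5.0):
    """Classify one segment as flat (0), rising (+1) or falling (-1)."""
    if abs(theta_deg) < flat_tol_deg:       # theta close to 0: flat
        return 0
    if theta_deg > 0 or end_y > start_y:    # positive angle or higher end point
        return 1
    return -1                               # negative angle or lower end point


def classify_tone(segments, flat_tol_deg=5.0):
    """Map detected segments to a tone number 1..4.

    segments: list of (theta_deg, start_ordinate, end_ordinate) tuples,
    one per detected line segment (hypothetical input format).
    """
    if not segments:
        raise ValueError("at least one segment is required")
    trends = {segment_trend(th, y0, y1, flat_tol_deg) for th, y0, y1 in segments}
    if trends == {0}:
        return 1    # first tone: flat
    if trends == {1}:
        return 2    # second tone: rising
    if trends == {-1}:
        return 4    # fourth tone: falling
    return 3        # third tone: all other cases, e.g. falling then rising
```

The flatness test is applied first because otherwise every nonzero θ would already match the positive or negative branch.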

Brief description of the drawings

Fig. 1 shows the SPWVD time-frequency distributions of the final "o" in the four tones: Fig. 1(a) is the first tone, Fig. 1(b) the second tone, Fig. 1(c) the third tone and Fig. 1(d) the fourth tone.

Fig. 2 shows the center lines of the time-frequency ridge lines extracted by thresholding and thinning the SPWVDs of the final "o" in the four tones.

Fig. 3 is a schematic diagram of the Hough transform mapping from image space to the slope-intercept parameter space.

Fig. 4 is a schematic diagram of the Hough transform mapping from image space to the parameter space of normal distance and angle.

Fig. 5 is the overall block diagram of the invention.

Detailed description of the embodiment

Step 1: Speech signal preprocessing and initial-final segmentation. The signal is first filtered and pre-emphasized. Endpoint detection based on the short-time average magnitude difference function, the zero-crossing rate and similar measures then removes the unvoiced segments, and initial-final segmentation locates the tone-bearing final.

Step 2: Compute the SPWVD time-frequency distribution of the final. The SPWVD is applied to the final signal to obtain a time-frequency image. The time-frequency ridge lines are the high-energy regions of the image, and the ridge lines of different tones change differently along the time axis. Because the final has strong harmonics, several ridge lines can appear in the image at the same time.

Step 3: Binarize, threshold and thin the time-frequency image to obtain the main ridge line. Binarization and thresholding of the SPWVD image extract one main ridge line. The extracted ridge line has a certain width, so it is thinned with the bwmorph function to obtain the center line of the main ridge line.

Step 4: Apply the Hough transform to the time-frequency image containing the center line of the main ridge line to obtain the line segment of the center line and its intercept and angle parameters, that is, the Hough matrix indexed by ρ and θ. For a given threshold, the Hough matrix is searched for entries greater than or equal to the threshold, the corresponding values of ρ and θ are returned, and the start and end coordinates of the corresponding line are saved.

Step 5: Determine the tone type from the values of ρ and θ and the start and end coordinates of the corresponding line. The range of the extracted θ, supplemented by the coordinates of the two end points of the line segment, distinguishes the four tones. If θ is positive, or the ordinate of the end point is greater than that of the start point, the tone is the second tone. If θ is negative, or the ordinate of the end point is less than that of the start point, the tone is the fourth tone. If θ is small, close to 0, the tone is the first tone. All other cases are the third tone.
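The frame-level measures named in Step 1 can be sketched as follows. The pre-emphasis coefficient of 0.97 and the function names are assumptions made for illustration; the text does not give the filter, the frame length or the decision thresholds used for endpoint detection, so only the measures themselves are shown.

```python
import math


def pre_emphasis(x, alpha=0.97):
    """First-order high-pass y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def amdf(frame, lag):
    """Short-time average magnitude difference at the given lag (in samples)."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


# Illustrative frame: a sinusoid with a period of 20 samples.
frame = [math.sin(2 * math.pi * i / 20) for i in range(200)]
zcr = zero_crossing_rate(frame)
valley = amdf(frame, 20)   # lag equal to the period: difference nearly vanishes
ridge = amdf(frame, 10)    # lag of half a period: difference is large
```

A voiced frame typically shows a low zero-crossing rate and a deep AMDF valley at the pitch period, which is what makes these two measures useful for locating the voiced final.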

Claims

1. A Chinese tone recognition method based on time-frequency ridge lines and the Hough transform, characterized in that:

The final of a Chinese syllable, which carries the tone, is represented by a time-frequency distribution, so that the tone information appears as the trend of the ridge lines in the time-frequency image; the image is preprocessed by binarization, thresholding and thinning to obtain line segments that reflect the tone trend; the Hough transform is applied to these line segments; and the tone is identified from the Hough transform parameters.

2. The Chinese tone recognition method based on time-frequency ridge lines and the Hough transform according to claim 1, characterized in that the SPWVD is applied to the tone-bearing final so that the variation of the instantaneous frequency of the final speech signal over time is displayed clearly on the time-frequency plane.

Speech is a typical non-stationary signal, and time-frequency distributions are a powerful tool for analyzing non-stationary signals. The Wigner-Ville distribution (WVD) has the best time-frequency localization, but for multicomponent signals it contains cross terms that distort the true time-frequency distribution of the signal. The smoothed pseudo Wigner-Ville distribution (SPWVD) suppresses the cross terms of the WVD by smoothing in the time domain and applying a window function in the frequency domain, balancing time-frequency localization against cross-term suppression.

The tone of Chinese is carried by the final, that is, the tone information lies in the voiced segment of speech. The invention applies the SPWVD to the tone-bearing final so that the variation of its instantaneous frequency over time is displayed clearly on the time-frequency plane. A time-frequency ridge line in the image traces the instantaneous frequency over time and marks the region where the signal energy is most concentrated. The SPWVD ridge lines clearly show how each tone evolves: for the same final, different tones produce ridge lines that change differently along the time axis.

3. The Chinese tone recognition method based on time-frequency ridge lines and the Hough transform according to claim 1, characterized in that the SPWVD image is preprocessed by binarization, thresholding and thinning to extract the center line of the time-frequency ridge line.

Because Chinese finals are voiced sounds, they exhibit harmonics, so one or several ridge lines appear in the time-frequency image. The trends of these ridge lines are essentially the same, so only one main ridge line needs to be extracted for tone recognition, which requires thresholding the SPWVD time-frequency matrix. Because the SPWVD suppresses cross terms by smoothing the WVD with a time window and a frequency window, its time-frequency localization is degraded, the ridge lines are thicker, and the extracted ridge line has a certain width. Applying the Hough transform directly to the SPWVD would therefore increase the computation time. The SPWVD image is instead preprocessed by binarization, thresholding and thinning to extract the center line of the ridge line.

4. The Chinese tone recognition method based on time-frequency ridge lines and the Hough transform according to claim 1, characterized in that the Hough transform is applied to the extracted center line of the SPWVD ridge line to obtain the parameter-space values of the intercept and angle of the line segment and the coordinates of its start and end points.

The Hough transform maps each straight line in the image to a peak at the position in parameter space that corresponds to the parameters of that line; the number and positions of the peaks then give the straight lines in image space and their parameters. The basic idea of the Hough transform is point-line duality: the image lies in image space before the transform and in parameter space after it. Collinear points in image space correspond to intersecting lines in parameter space, and conversely all lines that intersect at one point in parameter space correspond to collinear points in image space. By this point-line duality, given a set of edge points in image space, the Hough transform determines the straight line connecting them. It converts the line detection problem in image space into a point detection problem in parameter space: accumulating votes at the intersection points in parameter space accomplishes both line detection and parameter estimation.

5. The Chinese tone recognition method based on time-frequency ridge lines and the Hough transform according to claim 1, characterized in that the tone is identified from the line-segment parameters obtained by the Hough transform.

The tone type is determined from the values of the normal distance ρ and the normal angle θ together with the start and end coordinates of the corresponding line. The range of θ, supplemented by the coordinates of the two end points of the line segment, distinguishes the four tones. If θ is positive, or the ordinate of the end point is greater than that of the start point, the tone is the second tone. If θ is negative, or the ordinate of the end point is less than that of the start point, the tone is the fourth tone. If θ is small, close to 0, the tone is the first tone. All other cases are the third tone.