CN115359782A - Ancient poetry reading evaluation method based on quality and rhythm feature fusion - Google Patents
- Publication number
- CN115359782A CN115359782A CN202210989714.4A CN202210989714A CN115359782A CN 115359782 A CN115359782 A CN 115359782A CN 202210989714 A CN202210989714 A CN 202210989714A CN 115359782 A CN115359782 A CN 115359782A
- Authority
- CN
- China
- Prior art keywords
- quality
- rhythm
- characteristic
- rsd
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/01—Assessment or evaluation of speech recognition systems
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/1807—Speech classification or search using natural language modelling using prosody or stress
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention provides an ancient poetry reading evaluation method based on the fusion of quality and rhythm features. The method comprises: establishing an objective voice quality evaluation model based on the MOS (Mean Opinion Score), extracting Mel spectrum features, extracting high-dimensional signal features with a mask_res residual convolution network, and aggregating the MOS score of a single ancient poetry recitation in a UnMask output module; establishing a rhythm evaluation model based on feature fusion, extracting basic signal features such as fundamental frequency, energy, and zero-crossing rate, calculating stress, intonation, and rhythm characteristic parameters through a multi-feature analysis model, establishing a rhythm scoring function, and mapping to an actual rhythm score; and establishing a comprehensive measurement system based on polynomial fitting, constructing a reference-free evaluation model based on the fusion of quality and prosodic features with the objectives of an optimal solution and a minimized model.
Description
Technical Field
The invention belongs to the technical field of speech signal processing and particularly relates to an ancient poetry reading evaluation method based on the fusion of quality and rhythm characteristics.
Background
Classical Chinese poetry takes several forms, such as ancient-style verse and regulated (tonal-pattern) verse, but these forms share common characteristics: the syllables are mostly arranged uniformly and neatly according to certain rules, with careful attention to level and oblique tones and to rhyme, so that a poem read aloud carries a cadenced rise and fall, which makes it well suited as reading material. Reading aloud conveys the content and emotion that the poetry expresses more effectively. When evaluating the reading of ancient poetry, speech quality and pronunciation rhythm are two important evaluation dimensions, namely the aspects of "pronunciation" and "rhyme". The former means that the pronunciation is clear, and is the most basic level at which reading quality is evaluated; the latter refers to the rhythm, stress, tone, intonation, and so on exhibited during recitation.
Reading ancient poetry aloud is an important component of learning, yet the evaluation technologies in wide use today are limited to the correctness of specific phonemes and lack multi-dimensional quality evaluation, or simply compare the spoken voice with a reference voice to produce a score, which imposes many limitations on flexibility and coverage. It is therefore necessary to provide a quantitative, objective, and reference-free method for evaluating the reading of ancient poems by analyzing the recorded speech signal in combination with acoustic characteristic parameters.
Disclosure of Invention
In view of the above, the invention aims to provide an ancient poetry reading evaluation method based on the fusion of quality and prosodic features, which evaluates Chinese classical poetry recitation on prosody, signal-to-noise ratio, and clarity, taking acoustic and perceptual features as key indexes. By extracting the pitch frequency of the recitation, a reference evaluation function is quantized to obtain a deviation-based prosody score. The predicted score correlates well with human scores and effectively reflects both the recitation level of the reader and the quality of the audio.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
as shown in fig. 1, the invention provides an ancient poetry reading evaluation method based on fusion of quality and prosodic features, which comprises the following steps:
(1) Establishing an objective voice quality evaluation model based on the MOS (Mean Opinion Score): extracting Mel spectrum features, extracting high-dimensional signal features with a mask_res residual convolution network, and aggregating the MOS score of a single ancient poetry recitation in the UnMask output module.
(2) Establishing a prosody evaluation model based on feature fusion: extracting basic signal features such as fundamental frequency, energy, and zero-crossing rate, converting them into stress, intonation, and rhythm prosodic features according to a multi-feature analysis method, and mapping the features to an actual prosody score through a prosody scoring function.
(3) Establishing a comprehensive measurement system based on polynomial fitting: for the two scoring models obtained in steps 1 and 2, and with the objectives of an optimal solution and a minimized model, constructing a reference-free ancient poetry reading evaluation mapping function g() based on the fusion of quality and rhythm features:
S = g(w1·S_R, w2·S_MOS)
where S_R is the prosody score from prosodic feature fusion, S_MOS is the quality model score, and w1, w2 are the weights of the evaluation models, determined by a polynomial regression equation.
Further, the step (1) comprises the following steps:
(11) Feature extraction: calculating Mel subframes from the input signal, dividing them into overlapping segments, and padding speech segments of different lengths;
(12) Performing quality analysis on the Mel features obtained in step (11), taking the Mel subframes as input for feature dimension reduction and predicting the speech sequence, specifically: extracting high-dimensional features with a residual convolution network, performing downward convolution in BasicBlock to realize three successive feature dimension reductions, then outputting through a fully connected layer with the output feature dimension set to 20, and flattening the output via a view operation.
(13) Performing UnMask output on the high-dimensional features obtained in step (12), restoring the feature length to the actual speech duration and aggregating features to estimate a single MOS value, specifically: first, according to the original length recorded earlier, an UnMask mask is obtained and multiplied with the values at the corresponding positions of the feature vector, completing the zero-removal operation and recovering the actual speech segment length. Then, for each valid feature vector, the maximum over all feature values is taken through a max-pooling layer to obtain the MOS score output for the single utterance.
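As an illustrative sketch of steps (11) to (13), the unmask-then-max-pool aggregation can be written as follows (function and variable names are assumptions for illustration, not taken from the patent):

```python
def aggregate_mos(feature_frames, original_length):
    """Illustrative UnMask-then-max-pool aggregation: drop the zero-padded
    frames using the recorded original length, then take the maximum over
    the remaining feature values as the single-utterance MOS estimate."""
    valid = feature_frames[:original_length]          # unmask: keep real frames only
    pooled = [max(column) for column in zip(*valid)]  # max over time, per feature dim
    return max(pooled)                                # max over feature dimensions

# three real frames followed by two zero-padded frames
frames = [[3.1, 2.9], [4.0, 3.5], [3.8, 3.2], [0.0, 0.0], [0.0, 0.0]]
mos = aggregate_mos(frames, original_length=3)
```

Without the unmask step, the zero padding would dilute the pooled statistics; dropping padded frames first keeps the estimate tied to the actual speech segment.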
Further, the step (2) comprises the following steps:
(21) Prosodic feature extraction: framing the input with a rectangular window of length N equal to 0.05 times the sampling rate, calculating the short-time average amplitude function and pitch curve of the recitation, and extracting each peak in the function curve to obtain the relative standard deviation of the peaks. The fundamental frequency is calculated and the cepstrum of each frame is estimated. Mean filtering is used to smooth the fundamental frequency curve, and the threshold parameters are fine-tuned to label the main peaks.
(22) Multi-feature analysis: calculating characteristic parameters from the prosodic features obtained in step (21). The standard deviation of the peaks of the short-time average amplitude reflects the stress variation of the voice; the relative standard deviation of the time intervals between adjacent peaks reflects the speech rhythm; the relative standard deviation of the peak values reflects the reader's handling of intonation; the relative standard deviation of the syllable length of each word in the poem reflects syllable pauses or extensions; and the silence duration reflects whether the pauses in the reading are reasonable.
(23) A prosody scoring model: according to the characteristic parameters obtained in step (22), each parameter θ_i ∈ {σ, RSD_p, RSD_t, RSD, t_s} is mapped to an actual prosody evaluation score by a scoring formula that takes the quantized value of the corresponding characteristic parameter, with λ the magnification factor of the mapped score.
The characteristic parameters of a reading sample are converted into percentage scores, with reference values set according to the experimental values of the best reading samples. Each feature of the sample is scored, and the weighted average of the feature scores is taken as the final score.
Compared with the prior art, the ancient poetry reading evaluation method based on the fusion of the quality and the rhythm characteristics has the following advantages:
the invention combines a traditional rhythm evaluation method with a voice quality evaluation method based on a neural network, provides a no-reference evaluation method for Chinese classical poetry, which respectively evaluates rhythm, signal-to-noise ratio and definition by taking acoustic characteristics and perception characteristics as important indexes, and quantifies a reference evaluation function to obtain a rhythm score based on deviation degree by extracting the pitch frequency of the poetry. The prediction score has good correlation with the human score, and effectively reflects the reading level of the readers of the Chinese classical poetry and the quality of the audio. In one aspect, the basic spectro-temporal structure of target speech and noise is captured. On the other hand, the technology analyzes the acoustic characteristic parameters of the voice, gives a weighted prosody score, and has certain reference value and application prospect. Through the objective subjective evaluation standard, the overall quality of the audio frequency of the Chinese classical poetry is reasonably quantified from the aspects of audibility and aesthetic appreciation, the objective psychological rule of audience evaluation reading quality is further disclosed, and some new ideas can be provided for theoretical research. A comprehensive scoring model combining a deep learning method and a prosody analysis theory is combined to fit human subjective perception from the two aspects, so that a feasible evaluation system is obtained.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic structural diagram of an ancient poetry reading evaluation model of the present invention;
fig. 2 is a graph showing the results of the optimal evaluation model.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in the orientations and positional relationships indicated in the drawings, which are based on the orientations and positional relationships indicated in the drawings, and are used for convenience in describing the present invention and for simplicity in description, but do not indicate or imply that the device or element so referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus should not be construed as limiting the present invention. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art through specific situations.
The present invention will be described in detail below with reference to embodiments and the attached drawings.
The invention provides an ancient poetry reading evaluation method based on quality and rhythm feature fusion, which comprises the following steps:
1. establishing an objective voice quality evaluation model based on the MOS.
Perceptual features are used to evaluate perceived voice quality such as the signal-to-noise ratio. The model is divided into three parts: feature extraction, quality analysis, and UnMask output. The feature extraction module calculates Mel subframes from the input signal, divides them into overlapping segments, and pads speech segments of different lengths; the quality analysis module takes the Mel subframes as input for feature dimension reduction and predicts the speech sequence; and the UnMask output module restores the feature length to the actual speech duration and aggregates features to estimate a single MOS value.
1. The feature extraction module.
The Mel spectrogram is used to express the signal-to-noise characteristics of the speech: spectrograms differ markedly under different additive noises, and these characteristics are learned by a neural network. The number of Mel bands is set to 24; subframes of length 7 are divided with a frame shift of 2, giving an overlap rate of 71.4% between adjacent subframes, which exploits the short-time stationarity of speech to smooth the feature changes. Since different utterances generally differ in duration, zeros are appended during batch processing so that the feature vectors of all utterances have the same length.
The extraction steps are as follows:
(1) Firstly, pre-processing the speech signal by pre-emphasis, framing, and windowing;
(2) Performing Discrete Fourier Transform (DFT) on each frame of voice signal, squaring the coefficient of DFT of each frame to obtain short-time spectrum energy, and arranging the frames according to time sequence to obtain an energy spectrogram;
(3) Performing Mel filtering by a weighted summation of the Mel filter bank with the energy spectrogram. The triangular Mel filter bank is given by equation (1):
H_m(k) = 0, for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) < k ≤ f(m+1)
H_m(k) = 0, for k > f(m+1)
and the weighted summation by equation (2):
E(m) = Σ_k H_m(k) · |X(k)|²
where f(m) is the center frequency of the m-th Mel filter, m is the filter index, k is the FFT bin index, and |X(k)|² is the spectral energy;
(4) Filtering the spectrum features after the DFT to obtain the filter bank energies, and finally taking the logarithm to obtain the Mel spectrogram.
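The filter-bank construction of equations (1) and (2) can be sketched in Python as follows (a minimal illustration with assumed function names; the patent's exact implementation is not given):

```python
import math

def hz_to_mel(f):
    """Standard Hz-to-Mel mapping."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, Mel to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular Mel filters per equation (1); rows are H_m(k)."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [low + i * (high - low) / (n_filters + 1)
                  for i in range(n_filters + 2)]
    # map the evenly spaced Mel center frequencies f(m) to FFT bin indices
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = [[0.0] * (n_fft // 2 + 1) for _ in range(n_filters)]
    for m in range(1, n_filters + 1):
        f_lo, f_c, f_hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(f_lo, f_c):                      # rising slope
            bank[m - 1][k] = (k - f_lo) / (f_c - f_lo)
        for k in range(f_c, f_hi):                      # falling slope
            bank[m - 1][k] = (f_hi - k) / (f_hi - f_c)
    return bank

def apply_filterbank(bank, power_spectrum):
    """Equation (2): E(m) = sum_k H_m(k) * |X(k)|^2, then log."""
    return [math.log(max(sum(h * p for h, p in zip(row, power_spectrum)), 1e-10))
            for row in bank]
```

With 24 Mel bands as in the text, `mel_filterbank(24, 512, 16000)` yields 24 triangular filters over the 257 one-sided FFT bins.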
2. The quality analysis module.
According to the spectrogram features extracted by the feature extraction module, high-dimensional features are extracted with a residual convolution network, specifically: downward convolution is performed in BasicBlock to realize three successive feature dimension reductions; the output then passes through a fully connected layer with the output feature dimension set to 20, and is flattened via a view operation.
3. The UnMask output module.
According to the high-dimensional features obtained by the quality analysis module, a zero-removal operation is first performed in the UnMask layer to obtain the actual length of each speech segment: according to the original length recorded earlier, an UnMask mask is obtained and multiplied with the values at the corresponding positions of the feature vector, so that the zero-padded positions take a minimal value. Then, for each valid feature vector, the maximum over all feature values is taken through a max-pooling layer and output as the MOS score of the single utterance.
2. And establishing a prosody evaluation model based on feature fusion.
Using acoustic features, the invention designs an evaluation method for classical poetry recitation quality from three aspects: stress, intonation, and rhythm. These are important feature dimensions that determine prosodic perception. Intonation control embodies the rise and fall of pitch in the recited audio; the rhythm features reflect changes in tempo density and pause length; and stress variation embodies the reader's command of stressed and unstressed reading.
1. Prosodic feature extraction.
The short-time average amplitude function and the smoothed pitch frequency curve of the recitation are calculated, each peak in the curves is extracted, and the relative standard deviation of the peaks is obtained as a value reflecting the variation of both. Then the relative standard deviation of the peak-to-peak intervals and the ratio of the total silence duration to the total duration are calculated, reflecting the rhythm characteristics of the sample. The input is framed, with the length of each frame set to 0.02 times the sampling rate. The fundamental frequency is calculated and the cepstrum of each frame is estimated. Mean filtering is used to smooth the fundamental frequency curve, and the threshold parameters are fine-tuned to label the main peaks. The short-time average amplitude curve is calculated using a rectangular window with N equal to 0.05 times the sampling rate.
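The framing and peak extraction described above can be sketched as follows (a simplified illustration: the function names are assumptions, and a plain local-maximum test stands in for the patent's tuned peak labeling):

```python
def short_time_amplitude(signal, sample_rate, frame_factor=0.05):
    """Short-time average amplitude with a rectangular window of
    N = frame_factor * sample_rate samples (non-overlapping frames)."""
    n = max(1, int(frame_factor * sample_rate))
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    return [sum(abs(s) for s in f) / n for f in frames]

def find_peaks(curve):
    """Indices of simple local maxima in an amplitude or pitch curve."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]]
```

The peaks of the amplitude curve then feed the stress and rhythm statistics of the multi-feature analysis.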
2. Multi-feature analysis.
Characteristic parameters are calculated from the obtained prosodic features. The standard deviation of the peaks of the short-time average amplitude gives σ, reflecting the stress variation of the voice; the relative standard deviation of the time intervals between adjacent peaks gives RSD_t, reflecting the speech rhythm; the relative standard deviation of the peak values gives RSD_p, reflecting the reader's handling of intonation; the relative standard deviation of the syllable length of each word in the poem gives RSD, reflecting syllable pauses or extensions; and dividing the total silence duration by the total audio duration gives the silence ratio t_s, reflecting whether the pauses in the reading are reasonable.
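The parameter set {σ, RSD_p, RSD_t, RSD, t_s} reduces to one helper plus a few ratios; a minimal sketch (the numeric values are purely illustrative, not measured data):

```python
import statistics

def relative_std(values):
    """Relative standard deviation (RSD): population std divided by mean."""
    m = statistics.mean(values)
    return statistics.pstdev(values) / m if m else 0.0

# illustrative values for one recitation sample
peak_amplitudes = [0.8, 1.0, 0.9, 1.1]      # short-time amplitude peaks
sigma = statistics.pstdev(peak_amplitudes)  # stress variation (sigma)
peak_intervals = [0.42, 0.40, 0.45, 0.41]   # seconds between adjacent peaks
rsd_t = relative_std(peak_intervals)        # rhythm feature RSD_t
pitch_peaks = [180.0, 210.0, 195.0]         # Hz, pitch curve peaks
rsd_p = relative_std(pitch_peaks)           # intonation feature RSD_p
t_s = 1.8 / 12.0                            # silence ratio: silence / total duration
```

Low RSD values indicate an even, controlled delivery, while large values flag erratic stress, rhythm, or intonation.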
3. The prosody scoring model.
The actual prosody evaluation score is mapped from the obtained characteristic parameters. The characteristic parameters of a reading sample are converted into percentage scores, with module parameters set according to the experimental values of the best reading samples. Each feature of the sample is scored, and the weighted average of the feature scores is taken as the final score. The scoring formula is a function that is highest at the reference value and decreases to both sides; formula (3) converts a single parameter into a score, taking the quantized value of the corresponding characteristic parameter θ_i ∈ {σ, RSD_p, RSD_t, RSD, t_s}, with λ the magnification factor of the mapped score.
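A sketch of such a scoring function follows. The exact shape of formula (3) is not reproduced in the source text, so a Gaussian-style penalty on relative deviation from the reference value is assumed here; only the stated properties (peak at the reference, decrease to both sides, magnification factor λ, weighted average) are taken from the patent:

```python
import math

def feature_score(theta, theta_ref, lam=100.0):
    """Score that is highest at the reference value theta_ref and decreases
    to both sides. A Gaussian-style penalty on relative deviation is an
    assumption here; lam is the magnification factor of the mapped score."""
    dev = abs(theta - theta_ref) / theta_ref if theta_ref else abs(theta)
    return lam * math.exp(-dev * dev)

def prosody_score(params, refs, weights):
    """Weighted average of the per-feature scores as the final prosody score."""
    scores = {k: feature_score(params[k], refs[k]) for k in params}
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in scores) / total
```

A sample matching the reference values exactly maps to the full score of 100; any deviation lowers the mapped percentage.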
3. Establishing a comprehensive measurement system based on polynomial fitting
For the two scoring models obtained in steps 1 and 2, an overall scoring model S is considered; with the objective of a minimized model, maximum reliability and validity are obtained under the optimal solution:
S = g(w1·S_R, w2·S_MOS)
where S_R is the prosody score from prosodic feature fusion, S_MOS is the quality model score, and w1, w2 are the weights of the evaluation models, determined by a polynomial regression equation.
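Since the experiments below settle on a second-order polynomial, g() can be sketched as a full quadratic in the two weighted scores (function name and coefficient layout are illustrative; real coefficients would come from the regression fit against human scores):

```python
def fuse_scores(s_r, s_mos, w1, w2, coeffs):
    """Fused score S = g(w1*S_R, w2*S_MOS) with g a second-order polynomial
    in x = w1*S_R and y = w2*S_MOS. The coefficient values and weights
    would be fitted by polynomial regression against human scores; the
    values used here are purely illustrative."""
    x, y = w1 * s_r, w2 * s_mos
    c0, c1, c2, c3, c4, c5 = coeffs
    return c0 + c1 * x + c2 * y + c3 * x * y + c4 * x * x + c5 * y * y
```

With the degenerate coefficients (0, 1, 1, 0, 0, 0), g reduces to the plain weighted sum w1·S_R + w2·S_MOS, so the polynomial strictly generalizes linear fusion.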
The invention combines the traditional prosody evaluation method with the neural-network-based voice quality evaluation method. We completed a search over 6 network structures for evaluating signal-to-noise characteristics, 4 prosody scoring functions, and the optimal polynomial regression. Module performance is measured by the root mean square error (RMSE) and the Pearson correlation coefficient R; lower RMSE and higher R indicate better performance. Table 1 gives the performance indexes of the signal-to-noise network and Table 2 those of the comprehensive evaluation model. RMSE and R are defined as follows:
RMSE = sqrt((1/N)·Σ_i (x_i - X_i)²)
R = Σ_i (x_i - x̄)(X_i - X̄) / sqrt(Σ_i (x_i - x̄)² · Σ_i (X_i - X̄)²)
where x_i is the average target MOS for a single input, X_i is the corresponding subjective MOS, X̄ is the mean of all X_i, and x̄ is the mean of all x_i; RMSE and R reflect, respectively, the deviation and the correlation between the subjective and objective MOS.
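These two definitions translate directly into code (a straightforward sketch; function names are illustrative):

```python
import math

def rmse(pred, target):
    """Root mean square error between objective and subjective MOS lists."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def pearson_r(pred, target):
    """Pearson correlation coefficient R between the two score lists."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in target))
    return cov / (sp * st)
```

Perfectly correlated predictions give R = 1 even when their scale differs, which is why RMSE is reported alongside R to capture absolute deviation.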
In Table 2, α and β denote the polynomial orders of the quality model and the prosody model respectively, and the larger of α and β is taken as the order of the fitted polynomial. As the order increases, the prediction improves at the cost of greater system complexity, with a growing risk of overfitting. By determining the optimal orders of the two input parts, the parameters and polynomial order are designed to avoid overfitting while balancing system complexity against prediction accuracy. Based on the combination of resnet-18, max_unmask, a linear scoring function, and a second-order polynomial, the optimal model achieves R = 0.90 and RMSE = 0.39, on a par with existing single-quality evaluation methods, while additionally incorporating the prosody evaluation system. Fig. 2 is a visual two-dimensional distribution of the model, reflecting the overall poetry quality score as a function of the MOS value and the prosody score: the x-axis is the predicted objective quality score MOS_pred; the y-axis is the predicted prosody score Prosody_pred; the z-axis is the overall feature-fused recitation score Score. Each axis maps to values between 0 and 5. On the one hand, the model captures the basic spectro-temporal structure of the target speech and noise; on the other hand, it analyzes the acoustic characteristic parameters of the voice and gives a weighted prosody score, which has reference value and application prospects.
Through this objective rendering of a subjective evaluation standard, the overall quality of Chinese classical poetry audio is reasonably quantified in terms of audibility and aesthetics, further revealing the objective psychological rules by which listeners evaluate reading quality; the method is distinctive and innovative and can provide new ideas for research on the theory of Chinese classical poetry recitation.
Table 1
Table 2
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.
Claims (4)
1. An ancient poetry reading evaluation method based on the fusion of quality and rhythm features, characterized in that it comprises the following steps:
(1) Establishing an objective voice quality evaluation model based on the MOS, extracting Mel spectrum features, extracting high-dimensional signal features with a mask_res residual convolution network, and aggregating the MOS score of a single ancient poetry recitation in the UnMask output module;
(2) Establishing a prosody evaluation model based on feature fusion, extracting basic signal features, converting them into stress, intonation, and rhythm prosodic features according to a multi-feature analysis method, and mapping the features to an actual prosody score through a prosody scoring function;
(3) Based on the two evaluation models, establishing a reference-free ancient poetry reading evaluation model that fuses quality and prosodic features via polynomial fitting, and scoring with this model.
2. The ancient poetry reading evaluation method based on quality and prosodic feature fusion as claimed in claim 1, characterized in that: the step (1) specifically comprises the following steps:
(11) Feature extraction:
calculating Mel subframes from the input signal, dividing them into overlapping segments, padding different speech segments to the same length, and learning spectrogram features through a neural network;
(12) Quality analysis:
performing quality analysis on the spectrogram features obtained in step (11), taking the Mel subframes as input for feature dimension reduction, and predicting the speech sequence, specifically: extracting high-dimensional features with a residual convolutional network, performing downward convolution in BasicBlock to achieve three feature-dimension reductions, then outputting through a fully connected layer with the output feature dimension set to 20, and flattening the output via view;
(13) UnMask output:
performing UnMask output on the high-dimensional features obtained in step (12), aggregating the features over the speech duration according to the restored feature length, and estimating a single MOS value, specifically: first, according to the recorded original length, obtaining an UnMask mask and multiplying it with the feature vector at the corresponding positions to complete the zero-removal operation and recover the actual speech segment length; then, through a maximum pooling layer, taking the maximum of all feature values for each valid feature vector to obtain the MOS score of the single utterance.
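The UnMask aggregation above — masking out padded frames, then max-pooling each utterance's valid feature vectors — can be sketched in numpy. This is a simplified stand-in for the network's output stage; the array shapes and function name are assumptions, not from the patent.

```python
import numpy as np

def unmask_max_pool(features, lengths):
    """Masked max pooling over padded feature sequences.
    features: (batch, max_frames, dim) float array with padding frames;
    lengths:  number of valid frames per utterance.
    Padded positions are masked out before taking the per-dimension
    maximum over each utterance's valid frames."""
    batch, max_frames, dim = features.shape
    mask = np.arange(max_frames)[None, :] < np.asarray(lengths)[:, None]
    masked = np.where(mask[:, :, None], features, -np.inf)  # ignore padding
    return masked.max(axis=1)
```

The max over valid frames plays the role of the maximum pooling layer that produces the per-utterance MOS output.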
3. The ancient poetry reading evaluation method based on quality and prosodic feature fusion as claimed in claim 1, characterized in that: the step (2) specifically comprises the following steps:
(21) Prosodic feature extraction:
framing the input with a rectangular window, taking N as 0.05 times the sampling rate, calculating the short-time average amplitude function and pitch curve of the ancient poem, extracting each peak of the function curves and obtaining the relative standard deviation of the peaks, calculating the fundamental frequency by estimating the cepstrum of each frame, smoothing the fundamental frequency curve with mean filtering, and fine-tuning the threshold parameter to mark the main peaks;
(22) Multi-feature analysis:
calculating characteristic parameters from the prosodic features obtained in step (21): calculating the standard deviation σ of the peaks of the short-time average amplitude to reflect stress changes in the speech; calculating the relative standard deviation RSD_t of adjacent peak time intervals to reflect the rhythm of the speech; calculating the relative standard deviation RSD_p of the peaks to reflect the reader's handling of intonation; calculating the relative standard deviation RSD of the syllable lengths of the words in the poem to reflect pauses or extensions of syllables; and calculating the silence time t_s to reflect whether the pauses in the reading are reasonable;
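The parameter set {σ, RSD_p, RSD_t, RSD, t_s} of step (22) reduces to plain standard-deviation and relative-standard-deviation computations. A minimal numpy sketch follows; the input quantities (peak times, peak amplitudes, syllable lengths, silence segments) are assumed to come from the feature extraction of step (21).

```python
import numpy as np

def relative_std(values):
    """Relative standard deviation (RSD): std divided by mean."""
    values = np.asarray(values, dtype=float)
    return float(values.std() / values.mean())

def rhythm_features(peak_times, peak_amps, syllable_lengths, silence):
    """Compute the claimed parameter set {sigma, RSD_t, RSD_p, RSD, t_s}."""
    intervals = np.diff(np.asarray(peak_times, dtype=float))
    return {
        "sigma": float(np.std(peak_amps)),      # stress variation
        "RSD_t": relative_std(intervals),       # rhythm regularity
        "RSD_p": relative_std(peak_amps),       # intonation handling
        "RSD": relative_std(syllable_lengths),  # syllable pause/extension
        "t_s": float(np.sum(silence)),          # total silence time
    }
```

A perfectly regular reading (equal peak spacing and equal peak heights) yields zero RSD_t and RSD_p, so larger values indicate more variation in rhythm and intonation.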
(23) Prosody scoring model:
mapping the characteristic parameters obtained in step (22) to the actual prosody evaluation score by a scoring formula over
θ_i ∈ {σ, RSD_p, RSD_t, RSD, t_s}
where the quantized value of each corresponding characteristic parameter θ_i is used, and λ is the magnification factor of the mapping score;
and converting the characteristic parameters of the reading sample into percentage scores, formulating reference values according to the experimental values of the best reading sample, scoring different characteristics of the sample, and taking the weighted average of the characteristics as a final score.
4. The ancient poetry reading evaluation method based on the fusion of quality and prosodic features as claimed in claim 1, characterized in that: in step (3), for the two scoring models obtained in steps (1) and (2), an overall scoring model S is considered, and with minimum model error as the objective, under the optimal solution, the reference-free ancient poetry reading evaluation mapping function g(·) based on the fusion of quality and prosodic features is constructed as:
S = g(w_1 S_R, w_2 S_MOS)
where S_R is the prosody score from prosodic feature fusion, S_MOS is the quality model score, and w_1, w_2 are the weights of the two evaluation models, determined by a polynomial regression equation.
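A minimal instance of determining w_1 and w_2 by regression can be sketched by assuming, as the simplest case, that g is linear in its arguments; the patent's actual g is a fitted polynomial, so this is an illustrative reduction, not the claimed model.

```python
import numpy as np

def fit_fusion_weights(s_r, s_mos, s_true):
    """Least-squares fit of w1, w2 in S = w1*S_R + w2*S_MOS,
    a minimal linear instance of the claimed regression fusion.
    s_r, s_mos: per-sample prosody and quality scores;
    s_true: ground-truth overall scores."""
    X = np.column_stack([s_r, s_mos])
    w, *_ = np.linalg.lstsq(X, s_true, rcond=None)
    return w
```

Given noiseless scores generated from known weights, least squares recovers them exactly; with real annotations the fit balances the two sub-models.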
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210989714.4A CN115359782B (en) | 2022-08-18 | 2022-08-18 | Ancient poetry reading evaluation method based on fusion of quality and rhythm characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115359782A true CN115359782A (en) | 2022-11-18 |
CN115359782B CN115359782B (en) | 2024-05-14 |
Family
ID=84003368
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118714377A (en) * | 2024-08-27 | 2024-09-27 | 深圳市致尚信息技术有限公司 | OTT platform content quality assessment method and system based on data analysis |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1938756A (en) * | 2004-03-05 | 2007-03-28 | 莱塞克技术公司 | Prosodic speech text codes and their use in computerized speech systems |
CN102237081A (en) * | 2010-04-30 | 2011-11-09 | 国际商业机器公司 | Method and system for estimating rhythm of voice |
US20120245942A1 (en) * | 2011-03-25 | 2012-09-27 | Klaus Zechner | Computer-Implemented Systems and Methods for Evaluating Prosodic Features of Speech |
CN104240717A (en) * | 2014-09-17 | 2014-12-24 | 河海大学常州校区 | Voice enhancement method based on combination of sparse code and ideal binary system mask |
US20190385480A1 (en) * | 2018-06-18 | 2019-12-19 | Pearson Education, Inc. | System to evaluate dimensions of pronunciation quality |
Non-Patent Citations (1)
Title |
---|
CHEN NAN: "Design Research on Ancient Poetry Recitation Games Based on Speech Evaluation Technology", China Excellent Master's Theses Electronic Journal Network, 15 February 2020 (2020-02-15), pages 1-66 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||