CN115359782A - An evaluation method for reading aloud ancient poems based on the fusion of quality and prosodic features - Google Patents
An evaluation method for reading aloud ancient poems based on the fusion of quality and prosodic features
- Publication number
- CN115359782A (application CN202210989714.4A)
- Authority
- CN
- China
- Prior art keywords
- quality
- features
- fusion
- prosodic
- rhythm
- Prior art date: 2022-08-18
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L15/01 — Assessment or evaluation of speech recognition systems
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/1807 — Speech classification or search using natural language modelling using prosody or stress
- Y02P90/30 — Computing systems specially adapted for manufacturing
Description
Technical Field
The invention belongs to the technical field of speech signal processing, and in particular relates to a method for evaluating the reading aloud of ancient Chinese poems based on the fusion of quality and prosodic features.
Background Art
Ancient Chinese poetry is the collective name for classical poems (shi) and lyric verse (ci), and classical poems are further divided into several forms, such as yuefu poems and metrical poems. These poems nevertheless share some common characteristics: the syllables are mostly arranged in even, regular patterns, and the verse observes tonal patterns (level and oblique tones) and rhyme, so it has a natural cadence when read aloud and is well suited as reading material. Reading aloud conveys the content and emotion of a poem more effectively. When evaluating the reading of ancient poems, speech quality and pronunciation prosody are two important evaluation dimensions, in short, the two levels of "sound" and "rhyme". The former refers to whether the pronunciation is clear and distinguishable, the most basic criterion for judging read-aloud speech; the latter refers to the rhythm, stress, tone, intonation, and so on exhibited during recitation.
Reading ancient poems aloud is an important part of the learning content. However, the evaluation technologies currently in wide use are limited to the correctness of specific phonemes and fall short of the multi-dimensional quality evaluation that poetry reading requires. Alternatively, a score is given simply by comparing the read speech with a reference recording, which severely limits flexibility and coverage. In summary, it is necessary to propose a quantitative, objective, reference-free evaluation method for the reading of ancient poems through analysis of the read speech signal combined with acoustic feature parameters.
Summary of the Invention
In view of this, the present invention aims to propose a method for evaluating the reading of ancient poems based on the fusion of quality and prosodic features. Using acoustic and perceptual features as the key indicators, it separately evaluates the prosody and the signal-to-noise ratio and clarity of readings of classical Chinese poetry. By extracting the pitch frequency of the reading, a reference evaluation function is quantified to obtain a prosody score based on the degree of deviation. The predicted scores correlate well with human ratings and effectively reflect both the reader's level and the quality of the audio itself.
To achieve the above object, the technical solution of the present invention is realized as follows:
As shown in Figure 1, the present invention provides a method for evaluating the reading of ancient poems based on the fusion of quality and prosodic features, comprising the following steps:
(1) Establish an objective speech quality evaluation model based on the MOS (mean opinion score). Extract mel-spectrogram features, use a masked residual convolutional network (mask_res) to extract high-dimensional signal features, and aggregate the MOS score of a single poetry recitation in the UnMask output module.
(2) Establish a prosody evaluation model based on feature fusion. Extract basic signal features such as fundamental frequency, energy, and zero-crossing rate; convert them into stress, intonation, and rhythm prosodic features by multi-feature analysis; and map them to an actual prosody score through a prosody scoring function.
(3) Establish a comprehensive measurement system based on polynomial fitting. For the two scoring models obtained in steps (1) and (2), and with the goals of an optimal solution and a minimal model, construct the reference-free mapping function g(·) for poetry-reading evaluation based on the fusion of quality and prosodic features:
$S = g(w_1 S_R, w_2 S_{MOS})$
where $S_R$ is the prosody score from prosodic-feature fusion, $S_{MOS}$ is the quality-model score, and $w_1$, $w_2$ are the weights of the evaluation models, determined by a polynomial regression equation.
Further, step (1) includes:
(11) Feature extraction: compute mel subframes from the input signal, divide them into overlapping segments, and pad the lengths of different speech segments to be equal;
(12) Quality analysis: perform quality analysis on the features obtained in step (11), taking the mel subframes as input for feature dimension reduction and predicting over the speech sequence. Specifically, a residual convolutional network extracts high-dimensional features, with downsampling convolutions in the BasicBlocks realizing three successive feature dimension reductions; the result is then passed through a fully connected layer with the output feature dimension set to 20, and the output is flattened via a view operation.
(13) UnMask output: from the high-dimensional features obtained in step (12), restore the feature length of each utterance, aggregate the features, and estimate a single MOS value. Specifically, build the UnMask mask from the originally recorded lengths and multiply it with the values at the corresponding positions of the feature vectors to undo the zero padding and recover the actual speech-segment length; then, through a max-pooling layer, take the maximum over all feature values of each valid feature vector to obtain the MOS score output of a single utterance.
Further, step (2) includes:
(21) Prosodic feature extraction: frame the input using a rectangular window, taking N as 0.05 times the sampling rate; compute the short-time average magnitude function and the pitch curve of the poem reading, and extract each peak of the curves to obtain the relative standard deviation of the peaks; compute the fundamental frequency and estimate the cepstrum of each frame; smooth the fundamental-frequency curve with mean filtering and fine-tune the threshold parameter to mark the main peaks.
(22) Multi-feature analysis: compute feature parameters from the prosodic features obtained in step (21). Compute the standard deviation of the peaks of the short-time average magnitude, reflecting variations in stress; the relative standard deviation of the intervals between adjacent peaks, reflecting speech rhythm; the relative standard deviation of the peak values, reflecting how the reader handles intonation; the relative standard deviation of the syllable length of each word in the poem, reflecting the pausing or lengthening of syllables; and the silence time, reflecting whether the pauses in the reading are reasonable.
(23) Prosody scoring model: map the feature parameters obtained in step (22) to the actual prosody evaluation score using the scoring formula, over
$\theta_i \in \{\sigma, RSD_p, RSD_t, RSD, t_s\}$
where the score is computed from the quantized value of the corresponding feature parameter $\theta_i$, and $\lambda$ is the amplification coefficient of the mapped score.
Convert the feature parameters of a reading sample into percentile scores, with reference values determined from the experimental values of the best reading samples. Score the different features of the sample and take their weighted average as the final score.
Compared with the prior art, the method for evaluating the reading of ancient poems based on the fusion of quality and prosodic features according to the present invention has the following advantages:
The present invention combines a traditional prosody evaluation method with a neural-network-based speech quality evaluation method and proposes a reference-free evaluation method for classical Chinese poetry reading that uses acoustic and perceptual features as key indicators to separately evaluate prosody and signal-to-noise ratio and clarity. By extracting the pitch frequency of the reading, a reference evaluation function is quantified to obtain a prosody score based on the degree of deviation. The predicted scores correlate well with human ratings and effectively reflect both the reader's level and the quality of the audio itself. On the one hand, the method captures the basic spectral-temporal structure of the target speech and noise; on the other hand, it analyzes the acoustic feature parameters of the speech and produces a weighted prosody score, which has reference value and application prospects. By objectifying subjective evaluation criteria, the overall quality of classical Chinese poetry audio is reasonably quantified from the perspectives of audibility and aesthetics, further revealing the objective psychological laws by which listeners judge reading quality, which can provide new ideas for theoretical research. The comprehensive scoring model, which combines deep learning methods with prosodic analysis theory, jointly fits human subjective perception from these two aspects and yields a feasible evaluation system.
Brief Description of the Drawings
The drawings, which form a part of the present invention, are provided for further understanding of the invention; the illustrative embodiments of the invention and their description are used to explain the invention and do not constitute an improper limitation of it. In the drawings:
Figure 1 is a structural diagram of the poetry-reading evaluation model of the present invention;
Figure 2 shows the results of the best evaluation model.
Detailed Description of the Embodiments
It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with each other.
In the description of the present invention, it should be understood that the orientation or positional relationships indicated by terms such as "center", "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings, are merely for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore cannot be construed as limiting the invention. In addition, the terms "first", "second", etc. are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature defined as "first", "second", etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, unless otherwise specified, "plurality" means two or more.
In the description of the present invention, it should be noted that, unless otherwise expressly specified and limited, the terms "installed", "connected", and "coupled" are to be understood broadly: for example, a connection may be fixed, detachable, or integral; mechanical or electrical; direct or indirect through an intermediary; or an internal communication between two elements. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific situation.
The present invention will be described in detail below with reference to the drawings and in conjunction with the embodiments.
The present invention proposes a method for evaluating the reading of ancient poems based on the fusion of quality and prosodic features, comprising the following steps:
I. Establish an objective speech quality evaluation model based on the MOS.
Perceptual features are used to evaluate perceivable aspects of speech quality such as the signal-to-noise ratio. The model consists of three parts: feature extraction, quality analysis, and UnMask output. The feature extraction module computes mel subframes from the input signal, divides them into overlapping segments, and pads the lengths of different speech segments; the quality analysis module takes the mel subframes as input, performs feature dimension reduction, and predicts over the speech sequence; the UnMask output module restores the feature length of each utterance, aggregates the features, and estimates a single MOS value.
1. Feature extraction module.
The mel spectrogram is used to express the signal-to-noise characteristics of speech: spectrograms differ markedly under different additive noises, and a neural network can learn these features. The number of mel bands is set to 24 units; subframes of length 7 units are split out with a frame shift of 2 units, so adjacent subframes overlap by 71.4%, exploiting the short-time stationarity of speech to keep feature changes smooth. Since different utterances generally have different durations, zero padding is applied during batching so that the feature vectors of all utterances have the same length.
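As an illustration, this subframing and zero-padding step could look like the following sketch (the use of librosa and all helper names are assumptions; the patent does not prescribe a particular implementation):

```python
import numpy as np
import librosa

def mel_subframes(wav, sr, n_mels=24, subframe_len=7, shift=2):
    """Split a 24-band log-mel spectrogram into overlapping subframes of
    7 mel frames with a shift of 2 (71.4% overlap between neighbours)."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                       # (n_mels, T)
    starts = range(0, logmel.shape[1] - subframe_len + 1, shift)
    return np.stack([logmel[:, s:s + subframe_len] for s in starts])

def pad_batch(feats):
    """Zero-pad variable-length subframe sequences to a common length and
    keep the original lengths for the later UnMask step."""
    lengths = np.array([f.shape[0] for f in feats])
    padded = np.zeros((len(feats), lengths.max()) + feats[0].shape[1:])
    for i, f in enumerate(feats):
        padded[i, :f.shape[0]] = f
    return padded, lengths
```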
The mel spectrogram expresses the signal-to-noise characteristics of speech well: spectrograms under different additive noises differ markedly, and a neural network can represent these features well. The extraction steps are as follows:
(1) First, pre-emphasize, frame, and window the speech signal;
(2) Apply the discrete Fourier transform (DFT) to each frame of the speech signal, square the DFT coefficients of each frame to obtain the short-time spectral energy, and arrange the frames in time order to obtain the energy spectrogram;
(3) Perform mel filtering as a weighted sum of the mel filter bank and the energy spectrogram, with the mel filter bank given by formula (1) and the weighted sum by formula (2), where f(m) is the center frequency of the mel filter bank, m is the order of the mel filter, k is the index of the FFT bin, and |X(k)|² denotes the spectral energy;
(4) Filter the spectral features after the DFT to obtain the energies of the m filter banks, and finally take the logarithm to obtain the mel spectrogram.
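A compact sketch of steps (1)-(4); the Hamming window, the pre-emphasis coefficient of 0.97, and librosa's triangular mel filter bank standing in for formulas (1) and (2) are all assumptions, as the patent text does not fix these constants:

```python
import numpy as np
import librosa

def log_mel_spectrogram(x, sr, n_fft=512, hop=128, n_mels=24, pre=0.97):
    # (1) pre-emphasis, then framing and windowing via the STFT
    x = np.append(x[0], x[1:] - pre * x[:-1])
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop, window="hamming")
    # (2) short-time spectral energy |X(k)|^2, frames in time order
    energy = np.abs(X) ** 2
    # (3) weighted sum with the mel filter bank (formulas (1) and (2))
    fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energy = fb @ energy
    # (4) log compression yields the mel spectrogram
    return np.log(mel_energy + 1e-10)
```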
2. Quality analysis module.
From the spectrogram features extracted by the feature extraction module, a residual convolutional network extracts high-dimensional features. The specific method is: downsampling convolutions in the BasicBlocks realize three successive feature dimension reductions; the result is then output through a fully connected layer with the output feature dimension set to 20, and the output is flattened via a view operation.
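A minimal PyTorch sketch of such a quality-analysis network; the channel widths, the spatial pooling before the fully connected layer, and the reuse of torchvision's BasicBlock are assumptions, since the text only fixes three downsampling stages, an output dimension of 20, and flattening via view:

```python
import torch.nn as nn
from torchvision.models.resnet import BasicBlock

class QualityNet(nn.Module):
    def __init__(self, out_dim=20):
        super().__init__()
        def stage(cin, cout):  # one BasicBlock that halves the feature map
            down = nn.Sequential(nn.Conv2d(cin, cout, 1, stride=2),
                                 nn.BatchNorm2d(cout))
            return BasicBlock(cin, cout, stride=2, downsample=down)
        self.body = nn.Sequential(stage(1, 16), stage(16, 32), stage(32, 64))
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):        # x: (batch, 1, n_mels, subframe_len)
        h = self.body(x)         # three successive dimension reductions
        h = h.mean(dim=(2, 3))   # pool away the remaining spatial extent
        h = self.fc(h)           # output feature dimension 20
        return h.view(h.size(0), -1)   # flatten the output via view
```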
3. UnMask output module.
From the high-dimensional features produced by the quality analysis module, a zero-removal operation is first performed in the UnMask layer to obtain the actual length of each speech segment. The implementation is: build the UnMask mask from the originally recorded lengths and multiply it with the values at the corresponding positions of the feature vectors, so that the zero-padded positions receive a very small value. Then, through a max-pooling layer, the maximum over all feature values of each valid feature vector is taken as the MOS score output of a single utterance.
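A sketch of this UnMask-and-pool step under the same assumptions, pushing the zero-padded positions to a very small value so that the max-pooling sees only valid feature vectors:

```python
import torch

def unmask_max_pool(feats, lengths):
    """feats: (batch, max_frames, feat_dim); lengths: (batch,) counts of
    valid frames per utterance. Returns one MOS estimate per utterance."""
    idx = torch.arange(feats.size(1), device=feats.device)
    valid = (idx[None, :] < lengths[:, None]).unsqueeze(-1)  # UnMask mask
    masked = feats.masked_fill(~valid, -1e9)   # padded slots become tiny
    return masked.amax(dim=(1, 2))   # max over frames and feature numbers
```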
II. Establish a prosody evaluation model based on feature fusion.
The present invention uses acoustic features to design an evaluation method for the quality of ancient-poem recitation along three dimensions: stress, intonation, and rhythm. These are important feature dimensions that determine the perception of prosody. Intonation control reflects the rise and fall of pitch in the read audio; the rhythm features reflect the density of the rhythm and the lengths of pauses; and the variation in stress reflects the reader's command of light and heavy stress.
1. Prosodic feature extraction.
Compute the short-time average magnitude function and the smoothed pitch-frequency curve of the poem reading, extract each peak of the curves, and take the relative standard deviation of the peaks as a value reflecting the variation of both. Then compute the relative standard deviation of the peak-to-peak intervals and the ratio of the total silent duration to the total duration, reflecting the rhythmic characteristics of the sample. The input is framed, with the length of each frame set to 0.02 times the sampling rate. The fundamental frequency is computed and the cepstrum of each frame is estimated. The fundamental-frequency curve is smoothed with mean filtering, and the threshold parameter is fine-tuned to mark the main peaks. The short-time average magnitude curve is computed with a rectangular window, taking N as 0.05 times the sampling rate.
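A sketch of this extraction; the cepstral pitch search range of 60-400 Hz and the 5-point mean filter are illustrative choices that the patent does not specify:

```python
import numpy as np

def prosody_curves(x, sr):
    """Cepstral pitch track (frame length 0.02*sr) and short-time average
    magnitude (rectangular window, N = 0.05*sr)."""
    hop = int(0.02 * sr)
    f0 = []
    for s in range(0, len(x) - hop, hop):
        frame = x[s:s + hop]
        cep = np.fft.irfft(np.log(np.abs(np.fft.rfft(frame)) + 1e-10))
        lo, hi = int(sr / 400), int(sr / 60)   # assumed 60-400 Hz range
        f0.append(sr / (lo + np.argmax(cep[lo:hi])))
    f0 = np.convolve(f0, np.ones(5) / 5, mode="same")  # mean filtering
    N = int(0.05 * sr)   # rectangular window, 0.05 times the sample rate
    mag = np.array([np.abs(x[s:s + N]).mean()
                    for s in range(0, len(x) - N, N)])
    return f0, mag
```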
2. Multi-feature analysis.
Feature parameters are computed from the obtained prosodic features. The standard deviation of the peaks of the short-time average magnitude gives σ, reflecting variations in stress; the relative standard deviation of the intervals between adjacent peaks gives RSD_t, reflecting speech rhythm; the relative standard deviation of the peak values gives RSD_p, reflecting how the reader handles intonation; the relative standard deviation of the syllable length of each word in the poem gives RSD, reflecting the pausing or lengthening of syllables; and the total silence duration divided by the total audio duration gives the silence time t_s, reflecting whether the pauses in the reading are reasonable.
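For concreteness, the five parameters could be computed as below, assuming the peak positions, pitch peaks, and syllable lengths are already available from the previous step:

```python
import numpy as np

def rsd(v):
    """Relative standard deviation: standard deviation over the mean."""
    v = np.asarray(v, dtype=float)
    return v.std() / (v.mean() + 1e-10)

def prosody_params(mag_peaks, f0_peaks, peak_times, syllable_lens,
                   silence_dur, total_dur):
    return {
        "sigma": np.std(mag_peaks),         # stress variation
        "RSD_t": rsd(np.diff(peak_times)),  # rhythm: adjacent peak intervals
        "RSD_p": rsd(f0_peaks),             # intonation: pitch peak values
        "RSD":   rsd(syllable_lens),        # syllable pausing / lengthening
        "t_s":   silence_dur / total_dur,   # silence ratio
    }
```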
3. Prosody scoring model.
The actual prosody evaluation score is mapped from the obtained feature parameters. The feature parameters of a reading sample are converted into percentile scores, and the module parameters are set from the experimental values of the best reading samples. The different features of a sample are scored, and their weighted average is taken as the final score. The scoring formula is chosen as a function that is highest at the reference value and decreases on both sides; formula 3 converts a single parameter into a score, over
$\theta_i \in \{\sigma, RSD_p, RSD_t, RSD, t_s\}$
where the score is computed from the quantized value of the corresponding feature parameter $\theta_i$, and $\lambda$ is the amplification coefficient of the mapped score.
III. Establish a comprehensive measurement system based on polynomial fitting.
For the two scoring models obtained in steps I and II, consider the overall scoring model S; with the goal of a minimal model, maximum model reliability and validity are obtained at the optimal solution:
$S = g(w_1 S_R, w_2 S_{MOS})$
where $S_R$ is the prosody score from prosodic-feature fusion, $S_{MOS}$ is the quality-model score, and $w_1$, $w_2$ are the weights of the evaluation models, determined by a polynomial regression equation.
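A sketch of fitting g(·) as a polynomial regression over the two sub-scores; scikit-learn is an assumed tooling choice, and absorbing the weights w1 and w2 into the learned coefficients is one possible reading of the text:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

def fit_fusion(s_r, s_mos, human_scores, degree=2):
    """Fit S = g(w1*S_R, w2*S_MOS) against human ratings; second order is
    the best trade-off reported in the text."""
    X = np.column_stack([s_r, s_mos])
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, human_scores)
    return model

# usage: model = fit_fusion(sr_train, mos_train, y_train)
#        overall = model.predict(np.column_stack([sr_test, mos_test]))
```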
The present invention combines the traditional prosody evaluation method with the neural-network-based speech quality evaluation method. We explored six network structures for evaluating the signal-to-noise characteristics, four prosody scoring functions, and an optimal polynomial regression. Module performance is measured with the root mean square error (RMSE) and the Pearson correlation coefficient R; lower RMSE and higher R indicate better performance. As shown in Tables 1 and 2, Table 1 gives the performance indicators of the signal-to-noise network, and Table 2 those of the comprehensive evaluation model. RMSE and R are defined as follows:
$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i - x_i)^2}$, $R = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{N}(X_i - \bar{X})^2}\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}}$
where $x_i$ is the average target MOS of a single input and $X_i$ is the corresponding subjective MOS; $\bar{X}$ is the mean of all $X_i$ and $\bar{x}$ is the mean of all $x_i$. The two metrics reflect, respectively, the deviation between the subjective and objective MOS and their correlation.
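The two metrics in code, a direct NumPy rendering of the standard definitions above:

```python
import numpy as np

def rmse_and_pearson(x, X):
    """x: objective (predicted) MOS per input; X: subjective MOS."""
    x, X = np.asarray(x, float), np.asarray(X, float)
    rmse = np.sqrt(np.mean((X - x) ** 2))
    r = np.corrcoef(x, X)[0, 1]   # Pearson correlation coefficient
    return rmse, r
```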
In Table 2, α and β denote the polynomial orders of the quality model and the prosody model, respectively, and the larger of α and β is used as the order of the fitted polynomial. As the order increases, prediction improves at the cost of system complexity, but this also brings the risk of premature overfitting. As the order of the regression polynomial increases, the performance of the model improves. The parameters and the polynomial order are designed by determining the optimal orders of the two input parts, balancing system complexity and prediction accuracy while avoiding overfitting. Based on the combination of resnet-18, max_UnMask, a linear scoring function, and a second-order polynomial, the best-performing model achieves overall good results of R = 0.90 and RMSE = 0.39, close to existing single-quality evaluation methods on these indicators, while additionally integrating the prosody evaluation system. Figure 2 is a visualized distribution plot of this model, showing the overall poetry-reading score as a function of the MOS value and the prosody score. In Figure 2, the x-axis is the predicted objective quality score MOS_pred, the y-axis is the predicted prosody score Prosody_pred, and the z-axis is the overall score Score of the poem reading under feature fusion; each axis is mapped to the range 0-5. On the one hand, the model captures the basic spectral-temporal structure of the target speech and noise; on the other hand, the acoustic feature parameters of the speech are analyzed and a weighted prosody score is given, which has reference value and application prospects. By objectifying subjective evaluation criteria, the overall quality of classical Chinese poetry audio is reasonably quantified from the perspectives of audibility and aesthetics, further revealing the objective psychological laws by which listeners judge reading quality; this is unique and innovative and can provide new ideas for theoretical research on the reading of ancient Chinese poems.
Table 1
Table 2
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210989714.4A CN115359782B (en) | 2022-08-18 | 2022-08-18 | A method for evaluating ancient poetry reading based on the fusion of quality and rhythmic features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210989714.4A CN115359782B (en) | 2022-08-18 | 2022-08-18 | A method for evaluating ancient poetry reading based on the fusion of quality and rhythmic features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115359782A true CN115359782A (en) | 2022-11-18 |
CN115359782B CN115359782B (en) | 2024-05-14 |
Family
ID=84003368
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210989714.4A Active CN115359782B (en) | 2022-08-18 | 2022-08-18 | A method for evaluating ancient poetry reading based on the fusion of quality and rhythmic features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115359782B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1938756A (en) * | 2004-03-05 | 2007-03-28 | 莱塞克技术公司 | Prosodic speech text codes and their use in computerized speech systems |
CN102237081A (en) * | 2010-04-30 | 2011-11-09 | 国际商业机器公司 | Method and system for estimating rhythm of voice |
US20120245942A1 (en) * | 2011-03-25 | 2012-09-27 | Klaus Zechner | Computer-Implemented Systems and Methods for Evaluating Prosodic Features of Speech |
CN104240717A (en) * | 2014-09-17 | 2014-12-24 | 河海大学常州校区 | Voice enhancement method based on combination of sparse code and ideal binary system mask |
US20190385480A1 (en) * | 2018-06-18 | 2019-12-19 | Pearson Education, Inc. | System to evaluate dimensions of pronunciation quality |
Non-Patent Citations (1)
Title |
---|
陈楠 (Chen Nan): "Research on the design of ancient-poem recitation games based on speech evaluation technology" (基于语音评测技术的古诗朗诵游戏设计研究), China Excellent Master's Theses Electronic Journal Network, 15 February 2020 (2020-02-15), pages 1-66 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118714377A (en) * | 2024-08-27 | 2024-09-27 | 深圳市致尚信息技术有限公司 | A method and system for evaluating content quality of OTT platforms based on data analysis |
CN118714377B (en) * | 2024-08-27 | 2024-12-13 | 深圳市致尚信息技术有限公司 | OTT platform content quality assessment method and system based on data analysis |
CN119181382A (en) * | 2024-09-09 | 2024-12-24 | 长沙翊丰汽车科技有限公司 | Pronunciation correction method for deaf-mute |
CN119694321A (en) * | 2025-02-27 | 2025-03-25 | 山东浪潮科学研究院有限公司 | Identity recognition method, device, electronic device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN115359782B (en) | 2024-05-14 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN115359782A (en) | An evaluation method for reading aloud ancient poems based on the fusion of quality and prosodic features | |
Narendra et al. | Automatic assessment of intelligibility in speakers with dysarthria from coded telephone speech using glottal features | |
CN107610715A (en) | A kind of similarity calculating method based on muli-sounds feature | |
CN102881289B (en) | Hearing perception characteristic-based objective voice quality evaluation method | |
Alku et al. | Closed phase covariance analysis based on constrained linear prediction for glottal inverse filtering | |
CN104361894A (en) | A Method of Objective Speech Quality Assessment Based on Output | |
CN110648684B (en) | Bone conduction voice enhancement waveform generation method based on WaveNet | |
Dubey et al. | Non-intrusive speech quality assessment using several combinations of auditory features | |
CN101527141A (en) | Method of converting whispered voice into normal voice based on radial group neutral network | |
CN102610236A (en) | Method for improving voice quality of throat microphone | |
EP1995723A1 (en) | Neuroevolution training system | |
Narendra et al. | Estimation of the glottal source from coded telephone speech using deep neural networks | |
CN116230018A (en) | A Synthesized Speech Quality Evaluation Method for Speech Synthesis System | |
Jokinen et al. | Intelligibility enhancement of telephone speech using Gaussian process regression for normal-to-Lombard spectral tilt conversion | |
Jokinen et al. | Estimating the spectral tilt of the glottal source from telephone speech using a deep neural network | |
US7801725B2 (en) | Method for speech quality degradation estimation and method for degradation measures calculation and apparatuses thereof | |
Janbakhshi et al. | Automatic pathological speech intelligibility assessment exploiting subspace-based analyses | |
CN113053398A (en) | Speaker recognition system and method based on MFCC (Mel frequency cepstrum coefficient) and BP (Back propagation) neural network | |
RU2013119828A (en) | METHOD FOR DETERMINING THE RISK OF DEVELOPMENT OF INDIVIDUAL DISEASES BY ITS VOICE AND HARDWARE AND SOFTWARE COMPLEX FOR IMPLEMENTING THE METHOD | |
CN115240680B (en) | A method, system and device for converting fuzzy whispered speech | |
Castillo-Guerra et al. | Automatic modeling of acoustic perception of breathiness in pathological voices | |
Dubey et al. | Hypernasality detection using zero time windowing | |
Arun Sankar et al. | Design of MELPe-based variable-bit-rate speech coding with mel scale approach using low-order linear prediction filter and representing excitation signal using glottal closure instants | |
Villavicencio et al. | Extending efficient spectral envelope modeling to mel-frequency based representation | |
Mahdi et al. | New single-ended objective measure for non-intrusive speech quality evaluation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |