CN115223588B - Child voice phrase matching method based on pinyin distance and sliding window - Google Patents


Info

Publication number
CN115223588B
Authority
CN
China
Prior art keywords
distance
target text
pinyin
phrase
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210292844.2A
Other languages
Chinese (zh)
Other versions
CN115223588A (en)
Inventor
杨静
王佳镐
徐岸冲
张郡航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN202210292844.2A
Publication of CN115223588A
Application granted
Publication of CN115223588B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation
    • G06F40/166: Editing, e.g. inserting or deleting
    • G06F40/169: Annotation, e.g. comment data or footnotes
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination


Abstract

The invention discloses a method for matching children's voice phrases based on pinyin distance and a sliding window, comprising the following steps: collect children's phrase audio, obtain transcribed text through a speech recognition model, and annotate each recording against the given target text phrases; convert the target text phrase and the transcribed text into corresponding pinyin sequences, and use a sliding window to find the minimum pinyin distance between the target text phrase and the pinyin sequence of the transcribed text; and compute an optimal judgment interval from the annotation results and the set of minimum distances, where a phrase match succeeds when the minimum distance is smaller than the left endpoint of the interval, fails when it is greater than or equal to the right endpoint, and is submitted to manual judgment when it falls within the interval. The invention takes into account the ambiguity of children's pronunciation and the uncertainty of sentence length, combines the ideas of pinyin distance and sliding windows, and uses manual judgment as an aid, which helps improve the accuracy of target text phrase matching, allows children's cognitive level to be judged more accurately, and is practical.

Description

A children's speech phrase matching method based on pinyin distance and a sliding window

Technical Field

The present invention relates to the field of natural language processing, and in particular to a method for matching children's speech phrases based on pinyin distance and a sliding window.

Background Art

Today, the assessment of children's cognitive abilities is one direction of brain science research. One approach is to ask children to give a short phrase describing a picture or scene and to judge cognitive correctness by matching the description against target text phrases. Because young children lack literacy skills, the evaluation often has to be based on their speech, which involves collecting, transcribing, and judging audio and greatly increases the volunteers' workload. To address this, machines can take part in the transcription and judgment steps to save labor costs. With the development of speech recognition technology, recognition accuracy for adult speech now exceeds 95%, and related products are widely used. Children, however, may speak indistinctly, and existing speech recognition models have difficulty correcting ambiguously expressed parts, which makes matching target text phrases difficult and increases the number of recordings misjudged as cognitive errors.

From the perspective of pinyin, if two completely different Chinese characters are pronounced similarly, their pinyin representations are also similar to some degree. Measuring pinyin distance, and thereby allowing characters with similar pronunciations to match within a certain tolerance, addresses the problem above well. At present, pinyin distance is usually expressed as the edit distance between the Latin-letter strings of the two pinyin syllables. This is practicable, but it ignores how similar the pronunciations of the initials or finals actually are.
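The conventional formulation criticized above can be sketched in Python; this is a generic Levenshtein distance over pinyin letter strings, not code from the patent:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two pinyin letter strings."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

# "chi" and "ci" differ by one deleted letter, so their plain edit
# distance is 1, even though ch/c sound far more alike than many
# other letter pairs that would also score 1.
print(edit_distance("chi", "ci"))  # → 1
```

This uniform scoring is exactly the limitation the method targets: every letter edit costs the same regardless of how close the corresponding sounds are.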

During audio collection, given the cognitive abilities of young children, it is difficult to constrain the length of what a child says, and many redundant words often appear, which interferes with matching the target text phrase. In a longer transcription of a child's speech, it is therefore necessary to search for possibly matching target text phrases, for which a sliding window strategy is practicable.

Summary of the Invention

In view of this, the purpose of the present invention is to provide a method for matching children's speech phrases based on pinyin distance and a sliding window, so as to find possibly matching target text phrases in a child's speech and reduce the adverse effects of ambiguous pronunciation.

To achieve the above object, the present invention adopts the following technical solution:

A method for matching children's speech phrases based on pinyin distance and a sliding window comprises the following steps:

Step 1: Given target text phrases, collect the child's phrase audio, obtain its transcribed text through a speech recognition model, and annotate each recording according to whether its content includes a target text phrase.

Step 2: Convert the target text phrase and the transcribed text into corresponding pinyin sequences. In the pinyin sequence of the transcribed text, use a sliding window algorithm to find the subsequence with the smallest pinyin distance to the target text phrase and record that minimum distance. Specifically:

2.1) Ignoring tones, convert the target text phrase and the transcribed text into their corresponding pinyin sequences.

2.2) Use a sliding window whose size equals the number of characters in the target text phrase, sliding rightward one character at a time over the pinyin sequence of the transcribed text, to find the subsequence (of length equal to the window size) with the smallest pinyin distance to the target text phrase, and record that minimum distance. If there are multiple target text phrases, perform this operation for each one to obtain a set of minimum distances, with one element per target text phrase; finally, take the smallest value in this set as the minimum distance between the transcribed text and the target text phrases.

2.3) For two pinyin sequences S = {s1, s2, …, sn} and Q = {q1, q2, …, qn}:

d(S, Q) = [d(s1, q1) + d(s2, q2) + … + d(sn, qn)] ÷ n

where d is the pinyin distance. For the pinyin si and qi of two individual characters, split si and qi each into an initial part and a final part; then:

d(si, qi) = initial distance(si, qi) + final distance(si, qi)

initial distance(si, qi) = initial edit distance(si, qi) × initial weight(si, qi)

where the initial weight(si, qi) is designed manually according to the pronunciation similarity of the initials of si and qi, with values in the range [0.5, 1.5]; the final distance(si, qi) is computed in the same way as the initial distance.
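A minimal Python sketch of the per-character distance defined in step 2.3. The initial/final split table and the weight entries below are illustrative assumptions; the patent's full weight matrix is not reproduced here, and only the idea (similar-sounding pairs get a smaller multiplier, others default to 1.0) is taken from the text:

```python
# Multi-letter initials must be tried before their single-letter prefixes.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

# Illustrative weight-matrix entries in [0.5, 1.5]; unlisted pairs get 1.0.
WEIGHTS = {frozenset(["ch", "c"]): 0.5,
           frozenset(["zh", "z"]): 0.5,
           frozenset(["sh", "s"]): 0.5}

def split_pinyin(p: str):
    """Split a toneless syllable into (initial, final); e.g. 'chi' -> ('ch', 'i')."""
    for ini in INITIALS:
        if p.startswith(ini):
            return ini, p[len(ini):]
    return "", p  # zero-initial syllable such as 'an'

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two letter strings (one-row version)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def weight(x: str, y: str) -> float:
    return WEIGHTS.get(frozenset([x, y]), 1.0)

def char_distance(s: str, q: str) -> float:
    """d(si, qi) = weighted initial distance + weighted final distance."""
    s_ini, s_fin = split_pinyin(s)
    q_ini, q_fin = split_pinyin(q)
    return (edit_distance(s_ini, q_ini) * weight(s_ini, q_ini)
            + edit_distance(s_fin, q_fin) * weight(s_fin, q_fin))

def seq_distance(S, Q):
    """d(S, Q): mean per-character distance of two equal-length sequences."""
    return sum(char_distance(s, q) for s, q in zip(S, Q)) / len(S)

print(char_distance("chi", "ci"))                 # 1 * 0.5 + 0 = 0.5
print(seq_distance(["ya", "chi"], ["ya", "ci"]))  # (0 + 0.5) / 2 = 0.25
```

Keeping the weights symmetric (the `frozenset` key) matches the intuition that the distance between two syllables should not depend on their order.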

Step 3: For all annotated data from step 1, compute the minimum distances using the method of step 2 to obtain a set of minimum distances, and derive a judgment interval according to a set proportion of manual participation. For each minimum distance: if it is smaller than the left endpoint of the interval, the target text phrase matches successfully; if it is greater than or equal to the right endpoint, the match fails; if it lies within the interval (including the left endpoint but excluding the right endpoint), a human decides whether the target text phrase matches. According to the annotation results, for each set manual participation proportion, use a sliding window algorithm to find the judgment interval that maximizes accuracy. Specifically:

3.1) Let the judgment interval be [left, right). If the minimum distance < left, the target text phrase matches successfully; if the minimum distance ≥ right, the match fails; if left ≤ minimum distance < right, a human decides whether the target text phrase matches.

3.2) The set proportions of manual participation form the sequence {0, k1%, k2%, …, kt%}. Sort the m minimum distances of all annotated data computed in step 2 in ascending order to obtain an ordered array a = {d1, d2, …, di, …, dj, …, dm}. When the manual proportion is kr%, apply a sliding window algorithm to a with window size m × kr%; letting the current window be (di, dj), then j - i + 1 = m × kr%, and the judgment interval [left, right) is determined as follows:

For each candidate judgment interval, judge all data by the rules of step 3.1), filter out the data requiring manual judgment, compare the remaining judgments with the annotations, and compute the accuracy of the current judgment. When using the sliding window algorithm, start with i = 0 and move the window rightward one unit at a time; take the interval that maximizes the judgment accuracy as the optimal judgment interval for manual proportion kr%.
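The interval search of steps 3.1) and 3.2) can be sketched as follows. The data are synthetic, and the endpoint formula (mean of a window endpoint and its adjacent sorted values) is an assumption consistent with the method as a whole rather than a formula stated at this point in the text:

```python
def neighbour_mean(a, k):
    """Mean of a[k] and its immediate neighbours, clamped at the array ends."""
    lo, hi = max(k - 1, 0), min(k + 1, len(a) - 1)
    return sum(a[lo:hi + 1]) / (hi - lo + 1)

def best_interval(dists, labels, manual_ratio):
    """Search judgment intervals [left, right) over the sorted distances.

    dists  : minimum distances of the annotated recordings
    labels : True where the recording really contains the target phrase
    Returns ((left, right), accuracy), maximising accuracy over the
    automatically judged items; items inside the interval are filtered
    out as requiring manual judgment.
    """
    a = sorted(dists)
    m = len(a)
    w = max(1, round(m * manual_ratio))  # window size = m * kr%
    best_acc, best_lr = -1.0, (a[0], a[0])
    for i in range(m - w + 1):           # slide one unit at a time
        j = i + w - 1
        left, right = neighbour_mean(a, i), neighbour_mean(a, j)
        correct = total = 0
        for d, y in zip(dists, labels):
            if left <= d < right:
                continue                 # sent to manual judgment
            total += 1
            correct += int((d < left) == y)  # predict "match" iff d < left
        acc = correct / total if total else 0.0
        if acc > best_acc:
            best_acc, best_lr = acc, (left, right)
    return best_lr, best_acc

# Synthetic toy data: the small distances are the true matches.
(lr, acc) = best_interval([0.1, 0.2, 2.0, 2.1],
                          [True, True, False, False], 0.5)
print(lr, acc)
```

On this toy set the search finds an interval that separates matches from non-matches perfectly, so the reported accuracy is 1.0.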

Compared with the prior art, the present invention has the following technical effects:

The present invention matches children's speech phrases using pinyin distance and a sliding window. Compared with previous approaches that compute pinyin similarity using only the edit distance of pinyin strings, it takes the pronunciation similarity of initials and finals into account and builds a weight matrix over the edit distances between initials and between finals, further refining the computation of pinyin distance. In addition, the judgment interval is determined from a large amount of data and is therefore statistically meaningful.

The present invention takes into account the ambiguity of children's pronunciation and the redundancy in what they say, improves the accuracy of target text phrase matching, judges children's cognitive level more accurately, and is practicable.

Brief Description of the Drawings

FIG. 1 is a schematic flow chart of an embodiment of the present invention.

Detailed Description

The present invention is further described below with reference to specific embodiments and the accompanying drawing.

Example

Referring to FIG. 1, the present invention is a method for matching children's speech phrases based on pinyin distance and a sliding window, comprising the following steps:

Step 1: Obtain, through a speech recognition model, the transcribed text {这是牙此} ("this is tooth-this") of a child's ambiguously pronounced speech, and let the given target text phrase be {牙齿} ("teeth").

Step 2: Convert the target text phrase and the transcribed text into corresponding pinyin sequences. In the pinyin sequence of the transcribed text, use a sliding window algorithm to find the subsequence with the smallest pinyin distance to the target text phrase and record that minimum distance. Specifically:

2.1) Ignoring tones, convert the target text phrase and the transcribed text into the corresponding pinyin sequences {ya, chi} and {zhe, shi, ya, ci}, respectively.

2.2) Use a sliding window whose size equals the number of characters in the target text phrase (here, 2), sliding rightward one character at a time over the pinyin sequence {zhe, shi, ya, ci} of the transcribed text, to find the subsequence (of length equal to the window size) with the smallest pinyin distance to the target text phrase, and record the minimum distance dmin = min{d({ya, chi}, {zhe, shi}), d({ya, chi}, {shi, ya}), d({ya, chi}, {ya, ci})}, where d is the pinyin distance. If there are multiple target text phrases, perform this operation for each one to obtain a set of minimum distances, with one element per target text phrase; finally, take the smallest value in this set as the minimum distance between the transcribed text and the target text phrases.

2.3) The distance between the two pinyin sequences S = {ya, chi} and Q = {ya, ci} is:

d(S, Q) = [d(ya, ya) + d(chi, ci)] ÷ 2 (d is the pinyin distance)

For the pinyin chi and ci of the two individual characters, split chi and ci into the initial parts ch and c and the final parts i and i, respectively; then:

d(chi, ci) = d(ch, c) + d(i, i)

d(ch, c) = edit distance(ch, c) × weight(ch, c)

where the weight(ch, c) is designed manually according to the pronunciation similarity of ch and c; with a value of 0.5, d(ch, c) = 1 × 0.5 = 0.5 and d(i, i) = 0 × 1.0 = 0.
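The whole of step 2 for this example can be checked with a short script. The syllable split table is an assumption; the only non-default weight used is the ch/c value 0.5 given above (all other pairs default to 1.0):

```python
# Multi-letter initials must be tried before their single-letter prefixes.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]
WEIGHTS = {frozenset(["ch", "c"]): 0.5}  # from the example; others 1.0

def split_pinyin(p):
    for ini in INITIALS:
        if p.startswith(ini):
            return ini, p[len(ini):]
    return "", p  # zero-initial syllable

def edit_distance(a, b):
    """Levenshtein distance between two letter strings (one-row version)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def char_distance(s, q):
    """Weighted initial distance + weighted final distance."""
    (si, sf), (qi, qf) = split_pinyin(s), split_pinyin(q)
    return (edit_distance(si, qi) * WEIGHTS.get(frozenset([si, qi]), 1.0)
            + edit_distance(sf, qf) * WEIGHTS.get(frozenset([sf, qf]), 1.0))

def min_window_distance(target, transcript):
    """Slide a window of len(target) syllables; return the smallest mean distance."""
    n = len(target)
    return min(sum(char_distance(s, q)
                   for s, q in zip(transcript[k:k + n], target)) / n
               for k in range(len(transcript) - n + 1))

# {ya, chi} against {zhe, shi, ya, ci}: the best window is {ya, ci},
# giving (0 + 0.5) / 2 = 0.25, matching the computation in step 2.3.
print(min_window_distance(["ya", "chi"], ["zhe", "shi", "ya", "ci"]))  # → 0.25
```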

Step 3: For all annotated data from step 1, compute the minimum distances using the method of step 2 to obtain a set of minimum distances, and derive a judgment interval according to a set proportion of manual participation. For each minimum distance: if it is smaller than the left endpoint of the interval, the target text phrase matches successfully; if it is greater than or equal to the right endpoint, the match fails; if it lies within the interval (including the left endpoint but excluding the right endpoint), a human decides whether the target text phrase matches. According to the annotation results, for each set manual participation proportion, use a sliding window algorithm to find the judgment interval that maximizes accuracy. Specifically:

3.1) Let the judgment interval be [left, right). If the minimum distance < left, the target text phrase matches successfully; if the minimum distance ≥ right, the match fails; if left ≤ minimum distance < right, a human decides whether the target text phrase matches.

3.2) The set proportions of manual participation form the sequence {0, 5%, 10%, …, 50%}. Sort the m = 5000 minimum distances of all annotated data computed in step 2 in ascending order to obtain the ordered array a = {d1, d2, …, di-1, di, di+1, …, dj-1, dj, dj+1, …, dm} = {0, 0, …, 1.4, 1.5, 1.5, …, 1.9, 1.9, 1.9, …, 4.0}. When the manual proportion is 5%, apply a sliding window algorithm to a with window size 5000 × 5% = 250, so that i and j satisfy j - i + 1 = 250. There exist i and j such that the window is (di, dj) = (1.5, 1.9), and the judgment interval [left, right) is determined as:

left = (1.4 + 1.5 + 1.5) ÷ 3 ≈ 1.47

right = (1.9 + 1.9 + 1.9) ÷ 3 = 1.9

For each candidate judgment interval, judge all data by the rules of step 3.1), filter out the data requiring manual judgment, compare the remaining judgments with the annotations, and compute the accuracy of the current judgment. When using the sliding window algorithm, start with i = 0 and move the window rightward one unit at a time; the interval [1.5, 1.9), which maximizes the judgment accuracy at 89.29%, is taken as the optimal judgment interval for a manual proportion of 5%.

The above is only a preferred embodiment of the present invention; certain modifications may be made to it within the scope defined by the claims of the present invention, and all such modifications will fall within the protection scope of the present invention.

Claims (3)

1. A method for matching children's voice phrases based on pinyin distance and a sliding window, characterized in that the method comprises the following steps:
Step 1: given target text phrases, collecting the child's phrase audio, obtaining a transcription text of the audio through a voice recognition model, and marking each recording according to whether the content it expresses comprises a target text phrase;
Step 2: converting the target text phrase and the transcription text into corresponding pinyin sequences, searching, with a sliding window algorithm, the pinyin sequence of the transcription text for the subsequence having the smallest pinyin distance to the target text phrase, and recording the minimum distance;
Step 3: computing the minimum distance of all marked data from step 1 by the method of step 2 to obtain a set of minimum distances, and obtaining a judgment interval according to the set proportion of manual participation; for each minimum distance, if it is smaller than the left endpoint of the interval, the target text phrase is matched successfully; if it is greater than or equal to the right endpoint, the match fails; and if it lies within the interval, i.e. including the left endpoint but excluding the right endpoint, whether the target text phrase is matched is judged manually; and, according to the marking results, for each set manual participation proportion, searching with a sliding window algorithm for the judgment interval at which the accuracy reaches its maximum.
2. The method for matching children's voice phrases according to claim 1, wherein step 2 specifically comprises:
2.1) regardless of pinyin tones, converting the target text phrase and the transcription text into corresponding pinyin sequences;
2.2) using a sliding window algorithm, wherein the window size equals the number of characters in the target text phrase, sliding the window rightward one character at a time, traversing the pinyin sequence of the transcription text, and searching for the subsequence (subsequence length = window size) with the smallest pinyin distance to the target text phrase, and recording the minimum distance; if a plurality of target text phrases exist, performing this operation for each target text phrase to obtain a set of minimum distances, the number of elements in the set being the number of target text phrases, and finally taking the minimum value in the set as the minimum distance between the transcription text and the plurality of target text phrases;
2.3) for two pinyin sequences S = {s1, s2, …, sn} and Q = {q1, q2, …, qn}, there is:
d(S, Q) = [d(s1, q1) + d(s2, q2) + … + d(sn, qn)] ÷ n
where d is the pinyin distance, and for the pinyin si, qi of two individual characters, si and qi are each split into an initial part and a final part, so that:
d(si, qi) = initial distance(si, qi) + final distance(si, qi)
initial distance(si, qi) = initial edit distance(si, qi) × initial weight(si, qi)
wherein the initial weight(si, qi) is designed manually according to the pronunciation similarity of the initials of si and qi, the weight range is [0.5, 1.5], and the final distance(si, qi) is computed in the same manner as the initial distance.
3. The method for matching children's voice phrases according to claim 1, wherein step 3 specifically comprises:
3.1) letting the judgment interval be [left, right): if the minimum distance < left, the target text phrase is matched successfully; if the minimum distance ≥ right, the match fails; and if left ≤ minimum distance < right, whether the target text phrase is matched is judged manually;
3.2) the set proportions of manual participation form the sequence {0, k1%, k2%, …, kt%}; the m minimum distances of all marked data computed in step 2 are sorted in ascending order to obtain an ordered array a = {d1, d2, …, di, …, dj, …, dm}; when the manual proportion is kr%, a sliding window algorithm is applied to the ordered array a with m × kr% as the window size; letting the current window be (di, dj), then j - i + 1 = m × kr%, and the judgment interval [left, right) is determined as follows:
for each candidate judgment interval, all data are judged by the rule of step 3.1), the data requiring manual judgment are filtered out, the remaining judgments are compared with the marked data, and the accuracy of the current judgment is computed; when the sliding window algorithm is used, the window is moved rightward one unit at a time, and the judgment interval at which the accuracy of the judgment is maximized is taken as the optimal judgment interval for a manual proportion of kr%.
CN202210292844.2A 2022-03-24 2022-03-24 Child voice phrase matching method based on pinyin distance and sliding window Active CN115223588B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210292844.2A CN115223588B (en) 2022-03-24 2022-03-24 Child voice phrase matching method based on pinyin distance and sliding window

Publications (2)

Publication Number Publication Date
CN115223588A CN115223588A (en) 2022-10-21
CN115223588B true CN115223588B (en) 2024-08-13

Family

ID=83606923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210292844.2A Active CN115223588B (en) 2022-03-24 2022-03-24 Child voice phrase matching method based on pinyin distance and sliding window

Country Status (1)

Country Link
CN (1) CN115223588B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118627465B (en) * 2024-08-14 2025-02-14 江西风向标智能科技有限公司 A method and system for segmenting science test paper text

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2003900584A0 (en) * 2003-02-11 2003-02-27 Telstra New Wave Pty Ltd System for predicting speech recognition accuracy and development for a dialog system
CN107967916A (en) * 2016-10-20 2018-04-27 谷歌有限责任公司 Determine voice relation

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9418152B2 (en) * 2011-02-09 2016-08-16 Nice-Systems Ltd. System and method for flexible speech to text search mechanism
EP3837681A1 (en) * 2018-09-04 2021-06-23 Google LLC Reading progress estimation based on phonetic fuzzy matching and confidence interval
CN109256152A (en) * 2018-11-08 2019-01-22 上海起作业信息科技有限公司 Speech assessment method and device, electronic equipment, storage medium
CN112149406B (en) * 2020-09-25 2023-09-08 中国电子科技集团公司第十五研究所 Chinese text error correction method and system
CN112509609B (en) * 2020-12-16 2022-06-10 北京乐学帮网络技术有限公司 Audio processing method and device, electronic equipment and storage medium
CN113486155B (en) * 2021-07-28 2022-05-20 国际关系学院 Chinese naming method fusing fixed phrase information



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant