CN109817205B - Text confirmation method and device based on semantic analysis and terminal equipment
- Publication number: CN109817205B
- Application number: CN201811502282.XA
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention is applicable to the technical field of data processing, and provides a text confirmation method, a text confirmation device, terminal equipment and a computer readable storage medium based on semantic analysis, comprising the following steps: acquiring at least two voice annotation texts corresponding to a target voice, and segmenting the target voice according to the voice annotation text with the largest word count to obtain at least two segments of text voice; determining the partial texts that differ between the voice annotation texts as difference texts, and judging whether the difference voices corresponding to the difference texts have the unvoiced sound attribute; if a difference voice has the unvoiced sound attribute, judging whether an association relationship exists between the corresponding difference text and the unvoiced sound attribute; and adding the voice annotation texts corresponding to the difference texts associated with the unvoiced sound attribute to an annotation set, and outputting the voice annotation text corresponding to the difference text with the highest repetition rate in the annotation set as the confirmation result. The method and the device judge whether a voice annotation text is correct based on the unvoiced sound attribute, thereby improving the accuracy of voice annotation.
Description
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a text confirmation method and device based on semantic analysis, terminal equipment and a computer readable storage medium.
Background
With the development of information technology, the analysis of voice signals has become a popular research direction. One important branch of voice analysis is voice annotation, that is, annotating the text corresponding to a voice signal; voice annotation can be performed manually or by an algorithm.
Affected by factors such as unclear voice signals or inaccurate annotation algorithms, the text annotated from a voice signal may contain errors. In the prior art there is no effective method for confirming the annotated text, so the accuracy of voice annotation is low and the annotated text easily turns out to be inconsistent with the voice signal.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a text confirmation method, apparatus, terminal device and computer readable storage medium based on semantic parsing, so as to solve the problem of low accuracy of voice annotation in the prior art.
A first aspect of an embodiment of the present invention provides a text confirmation method based on semantic parsing, including:
acquiring at least two voice labeling texts corresponding to a target voice, and segmenting the target voice according to the voice labeling text with the largest word number to obtain at least two sections of text voice, wherein different voice labeling texts are generated by different labeling parties;
Determining partial texts with differences among different voice mark texts as difference texts, determining the text voices corresponding to the difference texts as difference voices, and judging whether the difference voices have unvoiced sound attributes or not;
if the difference voice has the unvoiced sound attribute, judging whether the difference text and the unvoiced sound attribute have an association relation or not;
and adding the voice annotation text corresponding to the difference text with the association relation with the unvoiced attribute to an annotation set, determining the difference text with the highest repetition rate in the annotation set, and outputting the voice annotation text corresponding to the difference text with the highest repetition rate as a confirmation result, wherein the repetition rate refers to the ratio between the occurrence number of the difference text in the annotation set and the number of the voice annotation texts in the annotation set.
A second aspect of an embodiment of the present invention provides a text confirmation device based on semantic parsing, including:
the segmentation unit is used for acquiring at least two voice marking texts corresponding to the target voice, and segmenting the target voice according to the voice marking text with the largest word number to obtain at least two sections of text voice, wherein different voice marking texts are generated by different marking parties;
The first judging unit is used for determining partial texts with differences among different voice mark texts as difference texts, determining the text voices corresponding to the difference texts as difference voices and judging whether the difference voices have unvoiced sound attributes or not;
the second judging unit is used for judging whether the association relation exists between the difference text and the unvoiced sound attribute if the difference text has the unvoiced sound attribute;
and the output unit is used for adding the voice annotation text corresponding to the difference text with the association relation with the unvoiced attribute to an annotation set, determining the difference text with the highest repetition rate in the annotation set, and outputting the voice annotation text corresponding to the difference text with the highest repetition rate as a confirmation result, wherein the repetition rate refers to the ratio between the occurrence times of the difference text in the annotation set and the number of the voice annotation texts in the annotation set.
A third aspect of an embodiment of the present invention provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
Acquiring at least two voice labeling texts corresponding to a target voice, and segmenting the target voice according to the voice labeling text with the largest word number to obtain at least two sections of text voice, wherein different voice labeling texts are generated by different labeling parties;
determining partial texts with differences among different voice mark texts as difference texts, determining the text voices corresponding to the difference texts as difference voices, and judging whether the difference voices have unvoiced sound attributes or not;
if the difference voice has the unvoiced sound attribute, judging whether the difference text and the unvoiced sound attribute have an association relation or not;
and adding the voice annotation text corresponding to the difference text with the association relation with the unvoiced attribute to an annotation set, determining the difference text with the highest repetition rate in the annotation set, and outputting the voice annotation text corresponding to the difference text with the highest repetition rate as a confirmation result, wherein the repetition rate refers to the ratio between the occurrence number of the difference text in the annotation set and the number of the voice annotation texts in the annotation set.
A fourth aspect of the embodiments of the present invention provides a computer readable storage medium storing a computer program which when executed by a processor performs the steps of:
acquiring at least two voice labeling texts corresponding to a target voice, and segmenting the target voice according to the voice labeling text with the largest word number to obtain at least two sections of text voice, wherein different voice labeling texts are generated by different labeling parties;
determining partial texts with differences among different voice mark texts as difference texts, determining the text voices corresponding to the difference texts as difference voices, and judging whether the difference voices have unvoiced sound attributes or not;
if the difference voice has the unvoiced sound attribute, judging whether the difference text and the unvoiced sound attribute have an association relation or not;
and adding the voice annotation text corresponding to the difference text with the association relation with the unvoiced attribute to an annotation set, determining the difference text with the highest repetition rate in the annotation set, and outputting the voice annotation text corresponding to the difference text with the highest repetition rate as a confirmation result, wherein the repetition rate refers to the ratio between the occurrence number of the difference text in the annotation set and the number of the voice annotation texts in the annotation set.
Compared with the prior art, the embodiment of the invention has the beneficial effects that:
according to the embodiment of the invention, the voice mark text corresponding to the difference voice with the unvoiced attribute (the part voice corresponding to the difference text) is determined by analyzing the difference text with the difference between at least two voice mark texts, the voice mark text corresponding to the difference text associated with the unvoiced attribute is further added into the mark set, and the voice mark text corresponding to the difference text with the highest repetition rate in the mark set is output as a confirmation result. According to the embodiment of the invention, at least two voice labeling texts are comprehensively compared, and whether the voice labeling texts are correct or not is judged through the unvoiced attribute, so that the accuracy of voice labeling is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a text confirmation method based on semantic parsing according to an embodiment of the present invention;
fig. 2 is a flowchart of a text confirmation method based on semantic parsing according to a second embodiment of the present invention;
FIG. 3 is a flowchart of a text confirmation method based on semantic parsing according to a third embodiment of the present invention;
fig. 4 is a flowchart of an implementation of a text confirmation method based on semantic parsing according to a fourth embodiment of the present invention;
FIG. 5 is a flowchart of a text confirmation method based on semantic parsing according to a fifth embodiment of the present invention;
fig. 6 is a block diagram of a text confirming device based on semantic parsing according to a sixth embodiment of the present invention;
fig. 7 is a schematic diagram of a terminal device according to a seventh embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
Fig. 1 shows an implementation flow of a text confirmation method based on semantic parsing according to an embodiment of the present invention, which is described in detail below:
in S101, at least two voice labeling texts corresponding to the target voice are obtained, and the target voice is segmented according to the voice labeling text with the largest number of words to obtain at least two text voices, wherein different voice labeling texts are generated by different labeling parties.
Speech tagging is the analysis of speech by a human or specific speech recognition model (e.g., hidden Markov model) to label text corresponding to the speech. Because the voice may have the problems of unclear pronunciation or accent interference, the voice label is carried out by manual work or a voice recognition model, and the obtained voice label text may have errors. In order to improve accuracy of voice annotation, in the embodiment of the invention, at least two voice annotation texts related to the same target voice are obtained first, wherein different voice annotation texts are generated by different annotation parties, and the annotation parties refer to a main body for generating the voice annotation texts, for example, the annotation parties can be users or third party annotation software and the like.
In the embodiment of the invention, the voice annotation text with the largest number of words is determined among all the obtained voice annotation texts, and the target voice is segmented according to that text to obtain at least two text voices that together form the target voice, each text voice corresponding to a different word of the annotation text. Segmentation is performed according to the voice annotation text with the largest word count because the resulting text voices are shorter and more numerous, which facilitates subsequent analysis without losing information. Specifically, the segmentation operation splits the target voice according to the words of the voice annotation text, and may be an average (equal) division: the duration of the target voice is divided equally by the word count of the annotation text, and the partial voice within each equal interval is taken as one text voice. For example, if the voice annotation text with the largest word count is "eat more vegetables", consisting of four words, and the target voice lasts 4 seconds, the duration is divided into four equal parts, and the partial voices from second 0 to 1, 1 to 2, 2 to 3 and 3 to 4 of the target voice are each taken as a single text voice. Besides average division, the user or third-party software may freely segment the target voice in other ways to generate text voices corresponding to the words of the annotation text; the embodiment of the invention does not limit the specific segmentation mode. It should be noted that the voice annotation text recognized from the target voice may contain compound words; to save computing resources, the partial voice corresponding to a compound word may be treated as a single text voice during segmentation. Continuing the example above, if "vegetables" is a compound word, the segmentation yields three text voices corresponding to "more", "eat" and "vegetables" respectively; the compound words in a voice annotation text can be determined based on an open-source compound word library. In addition, if the voice annotation text is English, the target voice is likewise segmented according to the voice annotation text with the largest word count to obtain at least two text voices, each corresponding to a different word of that text.
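As an illustrative sketch only (the function name, word tokens and durations below are assumptions, not identifiers from the patent), the average segmentation described above can be expressed as follows:

```python
def split_by_annotation(duration_s, annotation_words):
    """Divide a target voice of duration_s seconds into one text voice per
    word of the voice annotation text with the largest word count, by equal
    time division."""
    step = duration_s / len(annotation_words)
    return [(word, i * step, (i + 1) * step)
            for i, word in enumerate(annotation_words)]

# A 4-second target voice annotated with four words yields four 1-second text voices.
print(split_by_annotation(4.0, ["w1", "w2", "w3", "w4"]))
# [('w1', 0.0, 1.0), ('w2', 1.0, 2.0), ('w3', 2.0, 3.0), ('w4', 3.0, 4.0)]
```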
In S102, a portion of text having a difference between different voice mark texts is determined as a difference text, the text voice corresponding to the difference text is determined as a difference voice, and whether the difference voice has an unvoiced sound attribute is determined.
All the obtained voice annotation texts are compared with one another. If they are completely identical, any of them is output directly as the confirmation result. If there are differences between the voice annotation texts, the partial texts where they differ are determined as difference texts, and the text voices corresponding to the difference texts are determined as difference voices. Since the text voices are obtained by segmenting according to the voice annotation text with the largest word count, after the difference texts are obtained, the difference voices are determined according to the relative positions of the difference texts with respect to that annotation text; at least one difference voice is determined. For example, suppose the voice annotation text Text_A is "eating more vegetables" and the voice annotation text with the largest word count, Text_B, is "eat more vegetables". Because Text_A and Text_B differ at the third word relative to the annotation text with the largest word count, the text voice corresponding to the third word is taken as the difference voice.
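A minimal sketch of this comparison, under the simplifying assumption that the annotation texts are already aligned word by word to the segments of the longest annotation text (all names below are illustrative):

```python
def find_difference_texts(annotations):
    """annotations: list of annotation texts, each given as a list of aligned words.
    Returns {position: set of differing words (the difference texts)}."""
    longest = max(annotations, key=len)
    diffs = {}
    for pos in range(len(longest)):
        words = {ann[pos] for ann in annotations if pos < len(ann)}
        if len(words) > 1:          # the annotation texts disagree here
            diffs[pos] = words      # the difference voice is the text voice at pos
    return diffs

print(find_difference_texts([["waste", "my", "time"],
                             ["voice", "my", "time"],
                             ["vans",  "my", "time"]]))
# {0: {'waste', 'voice', 'vans'}} (set order may vary)
```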
Generally, sounds are classified into voiced sounds and unvoiced sounds: voiced sounds are produced with vocal cord vibration, while unvoiced sounds are produced without it, so the energy of voiced sound is high relative to unvoiced sound. Because most phonetic symbols in ordinary language correspond to voiced sounds and are therefore hard to distinguish from one another, in the embodiment of the invention, after a difference voice is obtained, it is first judged whether the difference voice has the unvoiced sound attribute, and the accuracy of the voice annotation text is then judged according to the unvoiced sound attribute. Specifically, the energy of the difference voice is obtained; if it is determined from the energy that the difference voice contains unvoiced sound, the difference voice is determined to have the unvoiced sound attribute, and if not, the difference voice is determined not to have the unvoiced sound attribute. The specific judging process is described later.
In S103, if the difference voice has the unvoiced sound attribute, it is determined whether there is an association between the difference text and the unvoiced sound attribute.
If the difference voice does not have the unvoiced sound attribute, the accuracy of the voice annotation text cannot be judged through the unvoiced sound attribute, so a prompt indicating that confirmation is impossible can be output directly. If the difference voice has the unvoiced sound attribute, it is further judged whether an association relationship exists between the difference text corresponding to the difference voice and the unvoiced sound attribute; the specific judging method is described later. It should be noted that, since the difference texts are obtained by mutual comparison of at least two voice annotation texts, at least two difference texts correspond to each difference voice. For example, suppose the voice annotation text Text_C is "waste my time", Text_D is "voice my time" and Text_E is "vans my time". The difference voice is the text voice corresponding to the first word, and if this difference voice has the unvoiced sound attribute, the difference texts corresponding to it are determined to include "waste", "voice" and "vans".
In S104, adding the voice annotation text corresponding to the difference text having the association relation with the unvoiced attribute to an annotation set, determining the difference text with the highest repetition rate in the annotation set, and outputting the voice annotation text corresponding to the difference text with the highest repetition rate as a confirmation result, wherein the repetition rate refers to a ratio between the occurrence number of the difference text in the annotation set and the number of the voice annotation texts in the annotation set.
If there are difference texts having an association relationship with the unvoiced sound attribute, the voice annotation texts corresponding to those difference texts are added to an annotation set. Among the voice annotation texts in the annotation set, the difference text with the highest repetition rate is determined, and the voice annotation text corresponding to it is output as the confirmation result, which improves the accuracy of the confirmation result. The annotation set is only used to indicate that specific voice annotation texts are classified separately; it does not indicate a specific storage format. The repetition rate is the ratio between the number of occurrences of a difference text in the annotation set and the number of voice annotation texts in the annotation set.
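The selection rule of this step can be sketched as follows, under the simplifying assumption that each voice annotation text in the annotation set contributes exactly one difference text (all names are illustrative):

```python
from collections import Counter

def confirm(annotation_set):
    """annotation_set: (voice annotation text, its difference text) pairs whose
    difference text is associated with the unvoiced sound attribute."""
    counts = Counter(diff for _, diff in annotation_set)
    n = len(annotation_set)
    # repetition rate = occurrences of the difference text / number of texts in the set
    best = max(counts, key=lambda d: counts[d] / n)
    return next(text for text, diff in annotation_set if diff == best)

print(confirm([("waste my time", "waste"),
               ("waste my time", "waste"),
               ("vans my time", "vans")]))   # -> "waste my time"
```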
As can be seen from the embodiment shown in fig. 1, in the embodiment of the present invention, at least two voice labeling texts corresponding to a target voice are obtained, the target voice is segmented to obtain at least two text voices, a part of texts having differences between different voice labeling texts is determined as a difference text, when the difference text corresponds to the difference voice with an unvoiced attribute, the voice labeling text corresponding to the difference text having an association relationship with the unvoiced attribute is added to a labeling set, and finally, the voice labeling text corresponding to the difference text with the highest repetition rate in the labeling set is output as a confirmation result. According to the embodiment of the invention, when at least two voice labeling texts exist, judgment is carried out through the unvoiced sound attribute, and the confirmation result is output, so that the accuracy of voice labeling is improved.
Fig. 2 shows a method of refining the process of determining whether the difference speech has unvoiced sound attribute on the basis of the first embodiment of the present invention. The embodiment of the invention provides a realization flow chart of a text confirmation method based on semantic analysis, as shown in fig. 2, the text confirmation method can comprise the following steps:
in S201, the difference speech is divided into at least two segments of sub-speech according to a preset scale duration, and after each segment of sub-speech is multiplied by a preset reduction coefficient, an attribute measurement value of each segment of sub-speech is obtained, where the attribute measurement value is used to indicate the energy level of the sub-speech.
In order to improve the accuracy of judging whether the difference voice has the unvoiced sound attribute when the target voice is a continuous voice signal, in the embodiment of the invention the difference voice is split into at least two sub-voices according to a preset scale duration, where the duration of each sub-voice equals the scale duration. Specifically, considering that a voice signal is stable only over a short period, the scale duration is preferably less than 40 milliseconds. After the scale duration is set, one interception is performed every scale duration starting from the beginning of the difference voice, and each intercepted part is taken as one sub-voice. For example, if the preset scale duration is 30 milliseconds and the difference voice lasts 120 milliseconds, 4 sub-voices can be intercepted.
Optionally, a preset loss-prevention duration is obtained; after one sub-voice is intercepted, the starting point of the next interception is set at the loss-prevention duration before the end of that sub-voice, and the next sub-voice is intercepted according to the scale duration. In the embodiment of the invention, since the difference voice is a continuous signal, the loss-prevention duration is preset to prevent loss of dynamic information, and the difference voice is intercepted according to both the scale duration and the loss-prevention duration, where the loss-prevention duration is smaller than the scale duration. After interception, each sub-voice overlaps the previous one, and the duration of the overlapping region is always the loss-prevention duration. For example, if the scale duration is 30 milliseconds, the loss-prevention duration is 10 milliseconds and the difference voice lasts 120 milliseconds, then the first sub-voice covers 0 to 30 milliseconds of the difference voice, the second covers 20 to 50 milliseconds, the third covers 40 to 70 milliseconds, and so on. In this way, breaks between successive sub-voices are avoided and the continuity of the intercepted sub-voices is improved.
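A sketch of this interception with overlap (the sample rate and names are assumptions; the hop equals the scale duration minus the loss-prevention duration):

```python
def frame_difference_voice(samples, sample_rate, scale_ms=30, loss_prevention_ms=10):
    frame_len = int(sample_rate * scale_ms / 1000)
    hop = int(sample_rate * (scale_ms - loss_prevention_ms) / 1000)  # 20 ms hop
    frames, start = [], 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop   # next sub-voice begins loss_prevention_ms before the previous end
    return frames
```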
On this basis, in order to increase the continuity of the left and right ends of each sub-voice, each sub-voice is multiplied by a preset reduction coefficient so that its left and right ends are reduced (weakened). The reduction coefficient is derived from a preset reduction space, which generates a reduction coefficient ω(n) for each time instant n.
Here n is the n-th time instant at which the sub-voice sits within the reduction space, and N is the width (duration) of the reduction space, which can be set freely according to the scale duration in the actual application scenario. The multiplication is x_new(n) = x(n)·ω(n), where x(n) is the sub-voice at the n-th time instant within the reduction space and x_new(n) is the sub-voice at the n-th time instant after the multiplication. It should be noted that, when the reduction is performed, a signal curve is essentially constructed from the sub-voice (the horizontal axis of its coordinate system is time, and the vertical axis may be the voice amplitude or another voice signal unit), the signal curve is passed through the reduction space, and the voice amplitude (or other voice signal unit) at each time instant of the curve is multiplied by the reduction coefficient generated by the reduction space at that instant.
After each sub-voice has been multiplied by the preset reduction coefficient, for each sub-voice the energy value at its middle instant (if the starting instant of the sub-voice is 0, the middle instant is the scale duration divided by 2) can be taken as the attribute measurement value of that sub-voice; alternatively, at least two sampling instants can be set within the scale duration of the sub-voice, and the average of the energy values at all sampling instants is taken as the attribute measurement value of the sub-voice. The energy value is preferably the short-time average energy.
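A sketch of the reduction and of the attribute measurement value; since the patent's own reduction-space formula is not reproduced above, a Hamming-shaped coefficient is used here purely as a stand-in assumption, and the short-time average energy of the reduced sub-voice serves as the attribute measurement value:

```python
import numpy as np

def attribute_metric(sub_voice):
    """sub_voice: 1-D array of samples for one sub-voice (one frame)."""
    N = len(sub_voice)
    n = np.arange(N)
    omega = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # assumed reduction coefficients
    x_new = sub_voice * omega                               # x_new(n) = x(n) * omega(n)
    return float(np.mean(x_new ** 2))                       # short-time average energy
```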
In S202, classifying at least two segments of continuous sub-voices corresponding to the attribute metric value falling in the preset target metric value interval into sub-voice sets, and obtaining the number of sub-voices of each obtained sub-voice set.
Because the energy of voiced sound is higher than that of unvoiced sound, a first threshold and a second threshold are preset, the interval between them is set as the target measurement value interval, and at least two consecutive sub-voices whose attribute measurement values fall within the target measurement value interval are classified into a sub-voice set. The first and second thresholds can be set according to the energy definitions of unvoiced and voiced sound in the actual application scenario; the first threshold corresponds to unvoiced sound, the second threshold corresponds to voiced sound, and the first threshold is smaller than the second threshold. After the thresholds are set, sub-voices whose attribute measurement values are higher than the second threshold are determined to be voiced, sub-voices whose values are lower than the first threshold are determined to be blank (their energy is too low to be regarded as human speech), and sub-voices whose values lie between the first and second thresholds are determined to be unvoiced. The classifying operation groups at least two consecutive unvoiced sub-voices into a sub-voice set; because the unvoiced sub-voices may not be consecutive, the number of sub-voice sets obtained after the classifying operation may be zero or at least one. If no sub-voice set is obtained after the classifying operation, a prompt indicating that confirmation is impossible is output directly; if at least one sub-voice set is obtained, the number of sub-voices in each sub-voice set is acquired.
In S203, if any sub-voice set contains more than a preset number of sub-voices, it is determined that the difference voice has the unvoiced sound attribute.
Because the scale duration is short, a single sub-voice is not representative of the whole difference voice; the difference voice is determined to have the unvoiced sound attribute only when the number of consecutive unvoiced sub-voices exceeds a certain number. Specifically, if a sub-voice set containing more than the preset number of sub-voices exists, the difference voice is determined to have the unvoiced sound attribute; if no such set exists, the difference voice is determined not to have the unvoiced sound attribute. The preset number may be determined according to the total number of sub-voices obtained by splitting according to the scale duration; preferably, the preset number is set to at least half of that total. For example, if 30 sub-voices are obtained, the preset number may be set to 70% of the total number, that is, 21.
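A sketch of the classification and decision in S202 and S203; the thresholds and the preset number are placeholders to be set per application scenario:

```python
def has_unvoiced_attribute(metrics, first_threshold, second_threshold, preset_number):
    """metrics: attribute measurement values of the sub-voices, in time order.
    Returns True if some run of consecutive unvoiced sub-voices (values inside
    the target measurement value interval) is longer than preset_number."""
    run = longest = 0
    for m in metrics:
        if first_threshold <= m <= second_threshold:   # unvoiced sub-voice
            run += 1
            longest = max(longest, run)
        else:                                           # voiced or blank sub-voice
            run = 0
    return longest > preset_number
```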
As can be seen from the embodiment shown in fig. 2, in the embodiment of the present invention, the difference speech is divided into at least two segments of sub-speech according to a preset scale duration, and after each segment of sub-speech is multiplied by a preset reduction coefficient, an attribute scale value of each segment of sub-speech is obtained, then at least two segments of continuous sub-speech corresponding to the attribute scale value falling into a preset target scale value interval are classified into sub-speech sets, the number of sub-speech of each obtained sub-speech set is obtained, and if the number of sub-speech exceeding the preset number exists, the difference speech is determined to have the unvoiced attribute. According to the embodiment of the invention, whether the sub-voices are unvoiced or not is judged by calculating the attribute measurement value, so that the accuracy of judging whether the difference voices have unvoiced attributes is improved.
Fig. 3 shows a method of refining a process of determining whether there is an association between a difference text and unvoiced sound attribute on the basis of the second embodiment of the present invention. The embodiment of the invention provides a realization flow chart of a text confirmation method based on semantic analysis, as shown in fig. 3, the text confirmation method can comprise the following steps:
in S301, the difference text is compared with all target words in a preset target word library, where the target words are words with phonetic symbols corresponding to the unvoiced sound attribute.
Because the unvoiced sound attribute usually corresponds to specific phonetic symbols in a language, when judging whether an association relationship exists between the difference text and the unvoiced sound attribute, the difference text can be compared with all target words in a preset target word library, where a target word is a word (or character) whose transcription contains a phonetic symbol corresponding to the unvoiced sound attribute. Taking English as an example, the English phonetic symbols with the unvoiced sound attribute include /p/, /t/, /k/, /f/, /θ/, /s/, /ʃ/, /tʃ/, /ts/, /tr/ and /h/, so all English words containing any of these phonetic symbols can be added to the target word library in advance.
In S302, if the difference text contains the target word, it is determined that the difference text has an association relationship with the unvoiced attribute.
If the difference text contains any target word in the target word library, it is determined that an association relationship exists between the difference text and the unvoiced sound attribute; if the difference text contains no target word from the library, it is determined that no association relationship exists. Alternatively, the phonetic symbols of each word (or character) in the difference text can be analysed directly, and it is judged whether the analysed phonetic symbols contain at least one phonetic symbol corresponding to the unvoiced sound attribute; if so, it is determined that the association relationship exists between the difference text and the unvoiced sound attribute, and if not, it is determined that the association relationship does not exist.
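A sketch of the comparison against the target word library; the phoneme list follows the English unvoiced consonants mentioned above, while the tiny word library and transcriptions are illustrative placeholders only:

```python
UNVOICED_SYMBOLS = {"p", "t", "k", "f", "θ", "s", "ʃ", "tʃ", "ts", "tr", "h"}

# Placeholder target word library: words whose transcription contains an unvoiced symbol.
TARGET_WORDS = {
    word for word, phonemes in {"waste": ["w", "eɪ", "s", "t"],
                                "vans":  ["v", "æ", "n", "z"]}.items()
    if any(p in UNVOICED_SYMBOLS for p in phonemes)
}

def is_associated_with_unvoiced(difference_text):
    return any(word in TARGET_WORDS for word in difference_text.split())

print(is_associated_with_unvoiced("waste"))   # True
print(is_associated_with_unvoiced("vans"))    # False
```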
As can be seen from the embodiment shown in FIG. 3, in the embodiment of the invention, the objectivity and the accuracy of judging the association relationship are improved by analyzing whether the difference text contains the target word and determining that the association relationship exists between the difference text and the unvoiced attribute when the difference text contains the target word.
Fig. 4 shows a method of refining a process of determining that there is an association relationship between a difference text and unvoiced sound attributes if the difference text contains a target word on the basis of the third embodiment of the present invention. The embodiment of the invention provides a realization flow chart of a text confirmation method based on semantic analysis, as shown in fig. 4, the text confirmation method can comprise the following steps:
In S401, a phonetic symbol ratio interval of the phonetic symbols corresponding to the unvoiced sound attribute in the difference text, relative to all phonetic symbols in the difference text, is calculated, and a first sounding interval is calculated according to the phonetic symbol ratio interval and the duration of the difference voice, where the first sounding interval is the pronunciation period that the phonetic symbols corresponding to the unvoiced sound attribute are expected to occupy in the difference voice.
Generally speaking, the individual phonetic symbols of a word or character are pronounced at a uniform speed. Therefore, in the embodiment of the invention, after it is determined that the difference text contains a target word, the phonetic symbol ratio interval of the phonetic symbols corresponding to the unvoiced sound attribute in the difference text, relative to all phonetic symbols in the difference text, is calculated; this interval is the region occupied by the phonetic symbols corresponding to the unvoiced sound attribute within all phonetic symbols of the difference text. For example, suppose the difference text Text_F is "waste", whose full American phonetic transcription is [weɪst], and the phonetic symbol corresponding to the unvoiced sound attribute is /t/. Since /t/ occupies the last quarter of all phonetic symbols, the phonetic symbol ratio interval of Text_F is [75%, 100%]. The two endpoints of the phonetic symbol ratio interval are then each multiplied by the duration of the difference voice, and the results are combined into the first sounding interval, i.e. the pronunciation period that the phonetic symbols corresponding to the unvoiced sound attribute are expected to occupy in the difference voice. Continuing the Text_F example, if the duration of the difference voice is 2 seconds, the first sounding interval is [1.5 seconds, 2 seconds].
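The computation of the first sounding interval can be sketched as follows, using the Text_F figures from the example above (names are illustrative):

```python
def first_sounding_interval(symbol_ratio_interval, difference_voice_duration_s):
    lo, hi = symbol_ratio_interval
    return (lo * difference_voice_duration_s, hi * difference_voice_duration_s)

# /t/ is the last of four phonetic symbols -> ratio interval [0.75, 1.0];
# a 2-second difference voice gives an expected period of [1.5 s, 2.0 s].
print(first_sounding_interval((0.75, 1.0), 2.0))   # (1.5, 2.0)
```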
In S402, the sub-speech set corresponding to the number of sub-speech exceeding the preset number is determined, and a second pronunciation section occupied by all the sub-speech in the sub-speech set in the differential speech is determined.
While determining the pronunciation period that the phonetic symbols corresponding to the unvoiced sound attribute are expected to occupy in the difference voice, the pronunciation period they actually occupy in the difference voice is also determined. Specifically, the sub-voice sets containing more than the preset number of sub-voices are determined. If only one such sub-voice set exists, the pronunciation period occupied in the difference voice by all sub-voices of that set is taken as the second sounding interval; if more than one such sub-voice set exists, the pronunciation period occupied in the difference voice by all sub-voices of all such sets is taken as the second sounding interval.
In S403, if the overlap ratio between the first sounding interval and the second sounding interval exceeds a preset overlap ratio threshold, it is determined that the difference text has an association relationship with the unvoiced attribute.
After the first sounding interval and the second sounding interval are obtained, the intersection and the union of the two intervals are calculated, and the ratio of the length of the intersection to the length of the union is taken as the coincidence degree between the first and second sounding intervals. For example, if the first sounding interval is [1.5 seconds, 2 seconds] and the second sounding interval is [1.75 seconds, 2 seconds], the coincidence degree is (2 - 1.75)/(2 - 1.5) = 50%. It is then judged whether the calculated coincidence degree exceeds a preset coincidence degree threshold. If it does, the pronunciation period actually occupied by the phonetic symbols corresponding to the unvoiced sound attribute matches the expectation, and it is determined that the association relationship exists between the difference text and the unvoiced sound attribute; if it does not, it is determined that the association relationship does not exist. The coincidence degree threshold can be set according to the actual application scenario, for example 50%.
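A sketch of the coincidence degree as intersection length over union length (written for the overlapping intervals considered here; names are illustrative):

```python
def coincidence_degree(first_interval, second_interval):
    inter = max(0.0, min(first_interval[1], second_interval[1])
                     - max(first_interval[0], second_interval[0]))
    union = max(first_interval[1], second_interval[1]) - min(first_interval[0], second_interval[0])
    return inter / union if union > 0 else 0.0

print(coincidence_degree((1.5, 2.0), (1.75, 2.0)))   # 0.5, i.e. 50%
```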
As can be seen from the embodiment shown in fig. 4, in the embodiment of the invention, the phonetic symbol ratio interval of the phonetic symbols corresponding to the unvoiced sound attribute in the difference text, relative to all phonetic symbols in the difference text, is calculated, and the first sounding interval is calculated according to this ratio interval and the duration of the difference voice. Meanwhile, the sub-voice sets containing more than the preset number of sub-voices are determined, and the second sounding interval occupied in the difference voice by all sub-voices of those sets is determined. If the coincidence degree between the first and second sounding intervals exceeds the preset coincidence degree threshold, it is determined that the association relationship exists between the difference text and the unvoiced sound attribute. By comparing the expected pronunciation period with the actual pronunciation period, the embodiment of the invention further improves the accuracy of judging the association relationship.
Fig. 5 shows, on the basis of the first embodiment and on the premise that the voice annotation texts in the annotation set contain at least two difference texts, a method for refining the process of determining the difference text with the highest repetition rate in the annotation set and outputting the corresponding voice annotation text as the confirmation result. The embodiment of the invention provides an implementation flowchart of a text confirmation method based on semantic analysis; as shown in fig. 5, the text confirmation method may include the following steps:
In S501, a preset basic value corresponding to each difference text in the labeling set is obtained, and the repetition rate corresponding to each difference text in the voice labeling text is weighted and summed based on the preset basic value, so as to obtain a text scoring value.
In the embodiment of the invention, if the voice annotation texts in the annotation set contain at least two difference texts, the preset basic value corresponding to each difference text in the annotation set is obtained. For each voice annotation text in the annotation set, the preset basic value of each of its difference texts is used as the weight of the repetition rate of that difference text, and a weighted sum is calculated to obtain the text score value of that voice annotation text. The preset basic values of different difference texts may be the same or different and can be set freely according to the actual application scenario. For example, suppose the annotation set contains the voice annotation texts Text_G, Text_H and Text_I: Text_G is "waste my time", with difference texts "waste" and "my"; Text_H is "waste mine time", with difference texts "waste" and "mine"; and Text_I is "vans my time", with difference texts "vans" and "my". Assuming the preset basic values of "waste", "my", "mine" and "vans" are all 1, the text score value of Text_G is 1 × (2/3) + 1 × (2/3) = 4/3, the text score value of Text_H is 1 × (2/3) + 1 × (1/3) = 1, and the text score value of Text_I is 1 × (1/3) + 1 × (2/3) = 1.
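A sketch of the weighted scoring, reproducing the worked example above (all base values taken as 1; the names are illustrative):

```python
from collections import Counter

def text_score(diff_texts, occurrences, set_size, base_values):
    return sum(base_values.get(d, 1) * occurrences[d] / set_size for d in diff_texts)

annotation_set = {"waste my time":   ["waste", "my"],
                  "waste mine time": ["waste", "mine"],
                  "vans my time":    ["vans", "my"]}
occurrences = Counter(d for diffs in annotation_set.values() for d in diffs)
scores = {text: text_score(diffs, occurrences, len(annotation_set), {})
          for text, diffs in annotation_set.items()}
print(max(scores, key=scores.get), scores)
# waste my time {'waste my time': 1.33..., 'waste mine time': 1.0, 'vans my time': 1.0}
```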
Optionally, the error annotation record of the annotating party corresponding to a voice annotation text is obtained, where the error annotation record contains texts that the annotating party has previously recognized incorrectly. If the error annotation record contains a difference text of that voice annotation text, the preset basic value of the difference text is set to a first preset value; if it does not, the preset basic value of the difference text is set to a second preset value, where the second preset value is larger than the first preset value. In the embodiment of the invention, since different voice annotation texts are generated by different annotating parties, the error annotation record of each annotating party whose voice annotation text is in the annotation set can be obtained. The incorrectly recognized texts can be obtained, when a known text and its corresponding known voice exist, by having the annotating party recognize the known voice and comparing the recognition result with the known text. By taking the annotating party as the object and reducing the weight of difference texts that it has previously recognized incorrectly, the accuracy of calculating the text score value is improved.
In S502, the voice markup text corresponding to the text score value with the highest value is output as the confirmation result.
After the text score value corresponding to each voice label text in the label set is calculated, outputting the voice label text corresponding to the text score value with the highest numerical value as a confirmation result.
As can be seen from the embodiment shown in fig. 5, in the embodiment of the present invention, a preset basic value corresponding to each difference text in the labeling set is obtained, and the repetition rate corresponding to each difference text in the voice labeling text is weighted and summed based on the preset basic value to obtain a text grading value, and then the voice labeling text corresponding to the text grading value with the highest numerical value is output as a confirmation result.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Corresponding to the text confirmation method based on semantic parsing described in the above embodiments, fig. 6 shows a block diagram of a text confirmation device based on semantic parsing according to an embodiment of the present invention, and referring to fig. 6, the text confirmation device includes:
a segmentation unit 61, configured to obtain at least two voice markup texts corresponding to a target voice, and segment the target voice according to the voice markup text with the largest number of words to obtain at least two text voices, where different voice markup texts are generated by different markup parties;
a first judging unit 62, configured to determine a portion of text having a difference between different voice markup texts as a difference text, determine the text voice corresponding to the difference text as a difference voice, and judge whether the difference voice has an unvoiced attribute;
a second judging unit 63, configured to judge whether an association exists between the difference text and the unvoiced sound attribute if the difference text has the unvoiced sound attribute;
and an output unit 64, configured to add the voice markup text corresponding to the difference text having an association relationship with the unvoiced attribute to a markup set, determine the difference text with the highest repetition rate in the markup set, and output the voice markup text corresponding to the difference text with the highest repetition rate as a confirmation result, where the repetition rate is a ratio between the number of occurrences of the difference text in the markup set and the number of the voice markup texts in the markup set.
Alternatively, the first judging unit 62 includes:
the splitting unit is used for splitting the difference voice evenly into at least two sections of sub-voices according to the preset scale duration, and obtaining an attribute measurement value of each section of sub-voice after multiplying each section of sub-voice by a preset reduction coefficient, wherein the attribute measurement value is used for indicating the energy level of the sub-voice;
the classifying unit is used for classifying at least two sections of continuous sub-voices corresponding to the attribute measurement value falling into a preset target measurement value interval into sub-voice sets, and acquiring the number of sub-voices of each obtained sub-voice set;
and the determining unit is used for determining that the difference voice has the unvoiced sound attribute if the number of the sub-voices exceeding the preset number exists.
Alternatively, the second judging unit 63 includes:
the comparison unit is used for comparing the difference text with all target words in a preset target word stock, wherein the target words are words with phonetic symbols corresponding to the unvoiced sound attribute;
and the association unit is used for determining that the association relation exists between the difference text and the unvoiced sound attribute if the difference text contains the target word.
Optionally, the association determining unit includes:
a first interval calculation unit, configured to calculate a phonetic symbol duty cycle interval of a phonetic symbol corresponding to the unvoiced sound attribute in the difference text relative to all phonetic symbols in the difference text, and calculate a first sounding interval according to the phonetic symbol duty cycle interval and a duration of the difference voice, where the first sounding interval is a sounding period occupied by a phonetic symbol corresponding to the unvoiced sound attribute expected in the difference voice;
a second interval calculation unit, configured to determine the sub-speech set corresponding to the number of sub-speech exceeding the preset number, and determine a second pronunciation interval occupied by all the sub-speech in the sub-speech set in the differential speech;
and the determining association subunit is used for determining that the association relationship exists between the difference text and the unvoiced sound attribute if the coincidence degree between the first sounding interval and the second sounding interval exceeds a preset coincidence degree threshold value.
Alternatively, if there are at least two different texts among the voice markup texts in the markup set, the output unit 64 includes:
the weighting unit is used for acquiring a preset basic value corresponding to each difference text in the labeling set, and carrying out weighted summation on the repetition rate corresponding to each difference text in the voice labeling text based on the preset basic value to obtain a text scoring value;
And the output subunit is used for outputting the voice marking text corresponding to the text grading value with the highest numerical value as the confirmation result.
Optionally, the weighting unit includes:
the record acquisition unit is used for acquiring an error annotation record of the annotating party corresponding to the voice annotation text, wherein the error annotation record comprises a text which is recognized to be in error by the annotating party;
the first setting unit is used for setting the preset basic value corresponding to the difference text as a first preset value if the error labeling record contains the difference text in the voice labeling text;
and the second setting unit is used for setting the preset basic value corresponding to the difference text as a second preset value if the error labeling record does not contain the difference text in the voice labeling text, wherein the second preset value is larger than the first preset value.
Therefore, the text confirmation device based on semantic analysis provided by the embodiment of the invention judges whether the voice labeling text is correct or not based on the unvoiced sound attribute, thereby improving the accuracy of voice labeling.
Fig. 7 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 7, the terminal device 7 of this embodiment includes: a processor 70, a memory 71 and a computer program 72 stored in said memory 71 and executable on said processor 70, for example a text confirmation program based on semantic parsing. The processor 70, when executing the computer program 72, implements the steps of the respective semantic parsing based text confirmation method embodiments described above, such as steps S101 to S104 shown in fig. 1. Alternatively, the processor 70, when executing the computer program 72, performs the functions of the units in the above-described embodiments of the text confirming device based on semantic parsing, such as the functions of the units 61 to 64 shown in fig. 6.
By way of example, the computer program 72 may be divided into one or more units, which are stored in the memory 71 and executed by the processor 70 to accomplish the present invention. The one or more units may be a series of computer program instruction segments capable of performing a specific function for describing the execution of the computer program 72 in the terminal device 7. For example, the computer program 72 may be divided into a segmentation unit, a first judgment unit, a second judgment unit, and an output unit, each unit specifically functioning as follows:
the segmentation unit is used for acquiring at least two voice annotation texts corresponding to the target voice, and segmenting the target voice according to the voice annotation text with the largest word count to obtain at least two sections of text voice, wherein different voice annotation texts are generated by different annotating parties;
the first judging unit is used for determining the partial texts that differ among the different voice annotation texts as difference texts, determining the text voices corresponding to the difference texts as difference voices, and judging whether the difference voices have the unvoiced sound attribute;
the second judging unit is used for judging whether an association relationship exists between the difference text and the unvoiced sound attribute if the difference voice has the unvoiced sound attribute;
and the output unit is used for adding the voice annotation text corresponding to the difference text having the association relationship with the unvoiced sound attribute to an annotation set, determining the difference text with the highest repetition rate in the annotation set, and outputting the voice annotation text corresponding to the difference text with the highest repetition rate as a confirmation result, wherein the repetition rate refers to the ratio between the number of occurrences of the difference text in the annotation set and the number of voice annotation texts in the annotation set.
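By way of example, the repetition-rate selection performed by the output unit may be sketched as follows. The Python data layout, a mapping from each voice annotation text to the difference texts it contains that are associated with the unvoiced sound attribute, is an assumption made only for this sketch.

```python
from collections import Counter

def confirm_text(annotation_set):
    """annotation_set maps each voice annotation text to its associated difference texts."""
    num_texts = len(annotation_set)
    occurrences = Counter(diff for diffs in annotation_set.values() for diff in diffs)

    # Repetition rate: occurrences of a difference text / number of voice annotation texts in the set.
    repetition_rate = {diff: count / num_texts for diff, count in occurrences.items()}
    best_diff = max(repetition_rate, key=repetition_rate.get)

    # Output, as the confirmation result, a voice annotation text containing that difference text.
    for text, diffs in annotation_set.items():
        if best_diff in diffs:
            return text
```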
The terminal device 7 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The terminal device may include, but is not limited to, the processor 70 and the memory 71. It will be appreciated by those skilled in the art that Fig. 7 is merely an example of the terminal device 7 and does not constitute a limitation of the terminal device 7, which may include more or fewer components than illustrated, combine certain components, or have different components; for example, the terminal device may further include an input-output device, a network access device, a bus, and the like.
The processor 70 may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or a memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the terminal device 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the terminal device 7. The memory 71 is used for storing the computer program as well as other programs and data required by the terminal device, and may also be used for temporarily storing data that has been output or is to be output.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, only the above division of functional units is illustrated; in practical applications, the above functions may be allocated to different functional units as required, that is, the internal structure of the terminal device may be divided into different functional units to perform all or part of the functions described above. The functional units in the embodiment may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units are only used to distinguish them from each other and are not intended to limit the protection scope of the present application. For the specific working process of the units in the above system, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated herein.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed terminal device and method may be implemented in other manners. For example, the above-described terminal device embodiments are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, which may be stored in a computer readable storage medium; when executed by a processor, the computer program implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the protection scope of the present invention.
Claims (9)
1. A text confirmation method based on semantic analysis, comprising:
acquiring at least two voice annotation texts corresponding to a target voice, and segmenting the target voice according to the voice annotation text with the largest word count to obtain at least two sections of text voice, wherein different voice annotation texts are generated by different annotating parties;
determining the partial texts that differ among the different voice annotation texts as difference texts, determining the text voices corresponding to the difference texts as difference voices, and judging whether the difference voices have an unvoiced sound attribute;
if the difference voice has the unvoiced sound attribute, judging whether an association relationship exists between the difference text and the unvoiced sound attribute;
adding the voice annotation text corresponding to the difference text having the association relationship with the unvoiced sound attribute to an annotation set, determining the difference text with the highest repetition rate in the annotation set, and outputting the voice annotation text corresponding to the difference text with the highest repetition rate as a confirmation result, wherein the repetition rate refers to the ratio between the number of occurrences of the difference text in the annotation set and the number of voice annotation texts in the annotation set;
wherein, if at least two difference texts exist among the voice annotation texts in the annotation set, the determining the difference text with the highest repetition rate in the annotation set and outputting the voice annotation text corresponding to the difference text with the highest repetition rate as the confirmation result comprises:
acquiring a preset basic value corresponding to each difference text in the annotation set, and carrying out weighted summation on the repetition rate corresponding to each difference text in the voice annotation text based on the preset basic value to obtain a text scoring value, wherein the preset basic values of different difference texts are freely set according to the actual application scenario;
and outputting, as the confirmation result, the voice annotation text corresponding to the text scoring value with the highest value.
2. The text confirmation method of claim 1, wherein the judging whether the difference voices have an unvoiced sound attribute comprises:
dividing the difference voice evenly into at least two sections of sub-voice according to a preset scale duration, and obtaining an attribute measurement value of each section of sub-voice after multiplying each section of sub-voice by a preset reduction coefficient, wherein the attribute measurement value is used for indicating the energy level of the sub-voice;
classifying at least two consecutive sections of sub-voice whose attribute measurement values fall within a preset target measurement value interval into a sub-voice set, and obtaining the number of sub-voices in each sub-voice set;
and if a sub-voice set whose number of sub-voices exceeds a preset number exists, determining that the difference voice has the unvoiced sound attribute.
3. The text confirmation method of claim 2, wherein the judging whether an association relationship exists between the difference text and the unvoiced sound attribute comprises:
comparing the difference text with all target words in a preset target word library, wherein the target words are words whose phonetic symbols correspond to the unvoiced sound attribute;
and if the difference text contains the target word, determining that an association relationship exists between the difference text and the unvoiced sound attribute.
4. The text confirmation method of claim 3, wherein, if the difference text contains the target word, the determining that an association relationship exists between the difference text and the unvoiced sound attribute comprises:
calculating a phonetic symbol proportion interval of the phonetic symbols corresponding to the unvoiced sound attribute in the difference text relative to all phonetic symbols in the difference text, and calculating a first pronunciation interval according to the phonetic symbol proportion interval and the duration of the difference voice, wherein the first pronunciation interval is the pronunciation period that the phonetic symbols corresponding to the unvoiced sound attribute are expected to occupy in the difference voice;
determining the sub-voice set whose number of sub-voices exceeds the preset number, and determining a second pronunciation interval occupied in the difference voice by all the sub-voices in the sub-voice set;
and if the coincidence degree between the first pronunciation interval and the second pronunciation interval exceeds a preset coincidence degree threshold value, determining that an association relationship exists between the difference text and the unvoiced sound attribute.
5. The text confirmation method of claim 1, wherein the acquiring a preset basic value corresponding to each difference text in the annotation set comprises:
acquiring an error annotation record of the annotating party corresponding to the voice annotation text, wherein the error annotation record comprises text that has been identified as annotated in error by the annotating party;
if the error annotation record contains the difference text in the voice annotation text, setting the preset basic value corresponding to the difference text to a first preset value;
and if the error annotation record does not contain the difference text in the voice annotation text, setting the preset basic value corresponding to the difference text to a second preset value, wherein the second preset value is greater than the first preset value.
6. A text confirmation device based on semantic analysis, comprising:
the segmentation unit, used for acquiring at least two voice annotation texts corresponding to a target voice, and segmenting the target voice according to the voice annotation text with the largest word count to obtain at least two sections of text voice, wherein different voice annotation texts are generated by different annotating parties;
the first judging unit, used for determining the partial texts that differ among the different voice annotation texts as difference texts, determining the text voices corresponding to the difference texts as difference voices, and judging whether the difference voices have an unvoiced sound attribute;
the second judging unit, used for judging whether an association relationship exists between the difference text and the unvoiced sound attribute if the difference voice has the unvoiced sound attribute;
and the output unit, used for adding the voice annotation text corresponding to the difference text having the association relationship with the unvoiced sound attribute to an annotation set, determining the difference text with the highest repetition rate in the annotation set, and outputting the voice annotation text corresponding to the difference text with the highest repetition rate as a confirmation result, wherein the repetition rate refers to the ratio between the number of occurrences of the difference text in the annotation set and the number of voice annotation texts in the annotation set;
wherein, if at least two difference texts exist among the voice annotation texts in the annotation set, the output unit comprises:
the weighting unit, used for acquiring a preset basic value corresponding to each difference text in the annotation set, and carrying out weighted summation on the repetition rate corresponding to each difference text in the voice annotation text based on the preset basic value to obtain a text scoring value, wherein the preset basic values of different difference texts are freely set according to the actual application scenario;
and the output subunit, used for outputting, as the confirmation result, the voice annotation text corresponding to the text scoring value with the highest value.
7. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring at least two voice annotation texts corresponding to a target voice, and segmenting the target voice according to the voice annotation text with the largest word count to obtain at least two sections of text voice, wherein different voice annotation texts are generated by different annotating parties;
determining the partial texts that differ among the different voice annotation texts as difference texts, determining the text voices corresponding to the difference texts as difference voices, and judging whether the difference voices have an unvoiced sound attribute;
if the difference voice has the unvoiced sound attribute, judging whether an association relationship exists between the difference text and the unvoiced sound attribute;
adding the voice annotation text corresponding to the difference text having the association relationship with the unvoiced sound attribute to an annotation set, determining the difference text with the highest repetition rate in the annotation set, and outputting the voice annotation text corresponding to the difference text with the highest repetition rate as a confirmation result, wherein the repetition rate refers to the ratio between the number of occurrences of the difference text in the annotation set and the number of voice annotation texts in the annotation set;
wherein, if at least two difference texts exist among the voice annotation texts in the annotation set, the determining the difference text with the highest repetition rate in the annotation set and outputting the voice annotation text corresponding to the difference text with the highest repetition rate as the confirmation result comprises:
acquiring a preset basic value corresponding to each difference text in the annotation set, and carrying out weighted summation on the repetition rate corresponding to each difference text in the voice annotation text based on the preset basic value to obtain a text scoring value, wherein the preset basic values of different difference texts are freely set according to the actual application scenario;
and outputting, as the confirmation result, the voice annotation text corresponding to the text scoring value with the highest value.
8. The terminal device of claim 7, wherein the judging whether the difference voices have an unvoiced sound attribute comprises:
dividing the difference voice evenly into at least two sections of sub-voice according to a preset scale duration, and obtaining an attribute measurement value of each section of sub-voice after multiplying each section of sub-voice by a preset reduction coefficient, wherein the attribute measurement value is used for indicating the energy level of the sub-voice;
classifying at least two consecutive sections of sub-voice whose attribute measurement values fall within a preset target measurement value interval into a sub-voice set, and obtaining the number of sub-voices in each sub-voice set;
and if a sub-voice set whose number of sub-voices exceeds a preset number exists, determining that the difference voice has the unvoiced sound attribute.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the text confirmation method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811502282.XA CN109817205B (en) | 2018-12-10 | 2018-12-10 | Text confirmation method and device based on semantic analysis and terminal equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109817205A CN109817205A (en) | 2019-05-28 |
CN109817205B true CN109817205B (en) | 2024-03-22 |
Family
ID=66601904
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811502282.XA Active CN109817205B (en) | 2018-12-10 | 2018-12-10 | Text confirmation method and device based on semantic analysis and terminal equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109817205B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102881282A (en) * | 2011-07-15 | 2013-01-16 | 富士通株式会社 | Method and system for obtaining prosodic boundary information |
CN103530282A (en) * | 2013-10-23 | 2014-01-22 | 北京紫冬锐意语音科技有限公司 | Corpus tagging method and equipment |
CN104795077A (en) * | 2015-03-17 | 2015-07-22 | 北京航空航天大学 | Voice annotation quality consistency detection method |
CN105551481A (en) * | 2015-12-21 | 2016-05-04 | 百度在线网络技术(北京)有限公司 | Rhythm marking method of voice data and apparatus thereof |
CN107578769A (en) * | 2016-07-04 | 2018-01-12 | 科大讯飞股份有限公司 | Speech data mask method and device |
CN108389577A (en) * | 2018-02-12 | 2018-08-10 | 广州视源电子科技股份有限公司 | Method, system, device and storage medium for optimizing speech recognition acoustic model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102651218A (en) * | 2011-02-25 | 2012-08-29 | 株式会社东芝 | Method and equipment for creating voice tag |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||