CN117648450A - Corpus labeling method and device, electronic equipment and storage medium


Info

Publication number: CN117648450A
Authority: CN (China)
Prior art keywords: corpus, type, position information, annotated, text
Legal status: Pending
Application number: CN202311482699.5A
Other languages: Chinese (zh)
Inventors: 彭霖铠, 孙艳庆, 李璐, 李佳威, 王强, 张润楠
Current Assignee: Netease Youdao Information Technology Beijing Co Ltd
Original Assignee: Netease Youdao Information Technology Beijing Co Ltd
Application filed by Netease Youdao Information Technology Beijing Co Ltd
Priority application: CN202311482699.5A


Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The embodiments of the invention provide a corpus labeling method and device, an electronic device, and a storage medium. The method comprises the following steps: acquiring a corpus to be labeled and the corpus type corresponding to the corpus to be labeled; determining a labeling system matched with the corpus type; and identifying suprasegmental features in the corpus to be labeled by using the labeling system matched with the corpus type, then labeling the corpus according to the identification result, wherein the suprasegmental features characterize the pronunciation of the corpus to be labeled. With this method, the labeling system matched with the corpus type can identify the suprasegmental features in the corpus, which are then labeled automatically according to the identification result. The labeling is therefore not limited by the content or type of the corpus to be labeled and requires no manual work, which significantly enlarges the range of corpus content that can be labeled, reduces the cost of manual labeling, and brings a better experience to users.

Description

Corpus labeling method and device, electronic equipment and storage medium
Technical Field
The embodiments of the invention relate to the technical field of natural language processing, and in particular to a corpus labeling method and device, an electronic device, and a storage medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
With the continuous development of natural language processing technology, corpora are widely used in many fields. For example, in fields such as language assessment and language teaching, labeling the segmental and suprasegmental features of the corpora in a corpus can effectively help users with assessment scoring and language learning.
In the prior art, however, when labeling suprasegmental features, some selected fixed corpora are usually labeled manually, so the content of the labeled corpora is limited and the cost of manual labeling is high.
Disclosure of Invention
Because automatic labeling of suprasegmental features has not been realized, the prior art usually labels some selected fixed corpora manually when labeling suprasegmental features.
Therefore, the prior art suffers from limited labeled-corpus content and high manual labeling cost, which makes labeling a very tedious process.
An improved corpus labeling method is therefore highly needed, so that suprasegmental features can be labeled automatically, the range of labeled corpus content is enlarged, and the cost of manual labeling is reduced.
In this context, the embodiments of the invention provide a corpus labeling method and device, an electronic device, and a storage medium.
In a first aspect of the embodiments of the present invention, there is provided a corpus labeling method, comprising:
acquiring a corpus to be labeled and the corpus type corresponding to the corpus to be labeled;
determining a labeling system matched with the corpus type;
identifying suprasegmental features in the corpus to be labeled by using the labeling system matched with the corpus type, and labeling the corpus to be labeled according to the identification result, wherein the suprasegmental features characterize the pronunciation of the corpus to be labeled.
In one embodiment of the present invention, the corpus type is a first corpus type, a second corpus type, or a third corpus type, wherein the first corpus type includes only corpus in text form, the second corpus type includes only corpus in speech form, and the third corpus type includes corpus in both text form and speech form;
the identifying of suprasegmental features in the corpus to be labeled by using the labeling system matched with the corpus type comprises the following steps:
when the corpus to be labeled is of the first corpus type, identifying first suprasegmental features of the original text of the corpus to be labeled by using a first labeling system, wherein the first labeling system is used for labeling corpus in text form; or,
when the corpus to be labeled is of the second corpus type, identifying second suprasegmental features of the original speech of the corpus to be labeled by using a second labeling system, wherein the second labeling system is used for labeling corpus in speech form; or,
when the corpus to be labeled is of the third corpus type, identifying first suprasegmental features of the original text of the corpus to be labeled by using the first labeling system, identifying second suprasegmental features of the original speech of the corpus to be labeled by using the second labeling system, and determining third suprasegmental features in the corpus to be labeled according to the first suprasegmental features and the second suprasegmental features.
In another embodiment of the present invention, the identifying, by using the first labeling system, first suprasegmental features of the original text of the corpus to be labeled comprises:
inputting the original text of the corpus to be labeled into the first labeling system to obtain first position information where a pause is needed, second position information where stress is needed, third position information where loss of plosion is needed, and/or fourth position information where linking is needed in the original text;
and determining the first position information, the second position information, the third position information and/or the fourth position information as the first suprasegmental features.
In yet another embodiment of the present invention, the identifying, by using the second labeling system, second suprasegmental features of the original speech of the corpus to be labeled comprises:
inputting the original speech of the corpus to be labeled into the second labeling system, and recognizing it to obtain the recognized text of the corpus to be labeled;
aligning the original speech with the recognized text to obtain alignment information, wherein the alignment information characterizes the starting and ending pronunciation times, in the original speech, of each text unit in the recognized text;
determining speech features of the corpus to be labeled according to the alignment information and the original speech, wherein the speech features comprise at least one of the phoneme probability of each speech frame in the original speech and the fundamental frequency, energy, mel-frequency cepstral coefficients, and average phoneme duration corresponding to each text unit;
determining, according to the alignment information and the speech features, fifth position information where a pause is needed, sixth position information where stress is needed, seventh position information where loss of plosion is needed, and/or eighth position information where linking is needed in the original speech;
and determining the fifth position information, the sixth position information, the seventh position information and/or the eighth position information as the second suprasegmental features.
In still another embodiment of the present invention, the determining, according to the alignment information and the speech features, fifth position information where a pause is needed, sixth position information where stress is needed, seventh position information where loss of plosion is needed, and/or eighth position information where linking is needed in the original speech comprises:
determining, according to the alignment information, the pronunciation interval between any two adjacent text units in the original speech, and determining the position information between two adjacent text units whose pronunciation interval is longer than a preset duration as the fifth position information;
determining the pronunciation duration of each text unit in the original speech according to the alignment information, and determining the sixth position information according to the pronunciation duration of each text unit and the speech features;
determining, according to the speech features, whether each text unit in the original speech meets a preset loss-of-plosion condition, and determining the position information of text units meeting that condition as the seventh position information, wherein the preset loss-of-plosion condition is that the last pronunciation phoneme of the former of two adjacent text units is a plosive, the first pronunciation phoneme of the latter text unit is a consonant, and the mean of the phoneme probabilities of all speech frames within the duration of the plosive falls within a first preset range;
determining, according to the speech features, whether each text unit in the original speech meets a preset linking condition, and determining the position information of text units meeting that condition as the eighth position information, wherein the preset linking condition is that the last pronunciation phoneme of the former of two adjacent text units and the first pronunciation phoneme of the latter text unit satisfy a preset correspondence, and the mean of the phoneme probabilities of all speech frames within the duration of the former text unit's last pronunciation phoneme falls within a second preset range.
In still another embodiment of the present invention, the determining, according to the first suprasegmental features and the second suprasegmental features, third suprasegmental features in the corpus to be labeled comprises:
when suprasegmental features meeting a preset fusion rule exist among the first suprasegmental features and the second suprasegmental features, determining the suprasegmental features meeting the preset fusion rule as the third suprasegmental features, wherein the preset fusion rule is that the distribution positions of suprasegmental features of the same type in the first suprasegmental features and the second suprasegmental features must coincide.
In still another embodiment of the present invention, the labeling of the corpus to be labeled according to the recognition result comprises:
determining, according to the recognition result, the types and positions of the suprasegmental features contained in the corpus to be labeled, wherein the types of suprasegmental features comprise at least one of a first feature type characterizing whether a pause is needed, a second feature type characterizing whether stress is needed, a third feature type characterizing whether loss of plosion is needed, and a fourth feature type characterizing whether linking is needed;
and labeling the corpus to be labeled according to the types and positions of the suprasegmental features.
In a second aspect of the embodiments of the present invention, there is provided a corpus labeling apparatus, comprising: an acquisition module, used for acquiring a corpus to be labeled and the corpus type corresponding to the corpus to be labeled;
a determining module, used for determining a labeling system matched with the corpus type;
and an identifying and labeling module, used for identifying suprasegmental features in the corpus to be labeled by using the labeling system matched with the corpus type, and labeling the corpus to be labeled according to the identification result, wherein the suprasegmental features characterize the pronunciation prosody of the corpus to be labeled.
In a third aspect of the embodiments of the present invention, there is provided an electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
and a processor, used for implementing the steps of the corpus labeling method according to any one of the first aspect when executing the program stored in the memory.
In a fourth aspect of the embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the corpus labeling method of any of the first aspects.
According to the corpus labeling method and device, the electronic device, and the storage medium of the embodiments of the invention, the labeling system matched with the corpus type corresponding to the corpus to be labeled can be used to identify the suprasegmental features in the corpus, which are then labeled automatically according to the identification result. The labeling is not limited by the content or type of the corpus and requires no manual work, which significantly enlarges the range of corpus content that can be labeled, reduces the cost of manual labeling, and brings a better experience to users.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 schematically shows an application scenario diagram according to an embodiment of the invention;
FIG. 2 schematically shows a schematic diagram of a corpus to be annotated according to an embodiment of the invention;
FIG. 3 schematically illustrates a labeling corpus schematic diagram according to an embodiment of the invention;
FIG. 4 schematically shows a flow chart of a corpus labeling method according to an embodiment of the invention;
FIG. 5 schematically illustrates a pause prediction process diagram of an original text according to an embodiment of the present invention;
FIG. 6 schematically shows a flow chart of a corpus labeling method according to yet another embodiment of the invention;
FIG. 7 schematically illustrates a functional block diagram of a corpus labeling apparatus according to an embodiment of the present invention;
fig. 8 schematically shows a functional block diagram of an electronic device according to an embodiment of the invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and practice the invention and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Summary of the Invention
The inventors have found that existing approaches to labeling suprasegmental features in corpora generally label some selected fixed corpora manually, so the content of the labeled corpora is limited and the cost of manual labeling is high.
On this basis, the corpus to be labeled and the corpus type corresponding to it are obtained, a labeling system matched with the corpus type is determined, the suprasegmental features in the corpus to be labeled are then identified by using that labeling system, and the corpus is labeled according to the identification result. In this way, the labeling system matched with the corpus type can identify the suprasegmental features in the corpus to be labeled, which are then labeled automatically according to the identification result. The labeling is not limited by the content or type of the corpus and requires no manual work, which significantly enlarges the range of corpus content that can be labeled, reduces the cost of manual labeling, and brings a better experience to users.
Having described the basic principles of the present invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
First, referring to fig. 1, an application scenario of the corpus labeling method and device, electronic device, and storage medium according to an embodiment of the present invention is described in detail.
Fig. 1 schematically shows an application scenario diagram according to an embodiment of the invention. It should be noted that fig. 1 is only an example of an application scenario to which an embodiment of the present invention may be applied, given to help those skilled in the art understand the technical content of the invention; it does not mean that the embodiments of the present invention cannot be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, in the application scenarios of speech evaluation and speech teaching, a system architecture to which the corpus labeling method of the embodiment of the present invention applies may include a terminal device 101 and a server 102. The terminal device 101 may be provided with a speech corpus input device (such as a recorder) and/or a text corpus input device (such as a keyboard) so as to acquire the corpus to be labeled in real time. After the terminal device 101 obtains the corpus to be labeled, it can send the corpus to the server 102, and the server 102 labels it. Alternatively, after the terminal device 101 obtains the corpus to be labeled, the terminal device 101 may label the corpus itself. Optionally, after the terminal device 101 obtains the corpus to be labeled, the terminal device 101 and the server 102 may label the corpus cooperatively.
The terminal device 101 may be a variety of electronic devices including, but not limited to, smart phones, tablet computers, laptop computers, desktop computers, and intelligent learning desk lamps, among others.
The server 102 may interact with the terminal device 101 through a network to receive or send messages or the like. For example, the server 102 may receive the corpus to be annotated sent by the terminal device 101, and annotate the corpus to be annotated.
It should be noted that, each step in the corpus labeling method in the embodiment of the present invention may be performed by the terminal device 101 or the server 102, which is not limited in the embodiment of the present invention.
When the terminal device 101 receives a corpus in text form, as shown in fig. 2, as the corpus to be labeled, the user can have the corpus labeled by the terminal device 101 and/or the server 102, finally obtaining the labeled corpus shown in fig. 3. Specifically, marks for suprasegmental features can be added to the text of the corpus to be labeled, which assist the user in pronunciation practice. The bolded portions of text represent syllables that need to be stressed, i.e., what needs to be emphasized in the sentence. The stress is marked directly on syllables rather than on whole words, which prevents a user who is unfamiliar with the position of the accent within a word from misreading it. The underlined portions indicate where loss of plosion occurs. "|" marks a sense group boundary: the user is expected to pause briefly at this position while speaking. A linking mark indicates linking (liaison), which covers phenomena such as elision (swallowed sounds), consonant-vowel linking, and intrusive sounds.
Exemplary method
The method according to an exemplary embodiment of the present invention is described below with reference to the drawings, in conjunction with the application scenario above. It should be noted that the above application scenario is shown only for the convenience of understanding the spirit and principle of the present invention, and the embodiments of the present invention are not limited in this regard. Rather, embodiments of the invention may be applied to any applicable scenario.
Referring first to FIG. 4, a flow chart of a corpus labeling method according to an embodiment of the present invention is schematically shown. As shown in fig. 4, the corpus labeling method may include:
step 402, obtaining the corpus to be annotated and the corpus types corresponding to the corpus to be annotated.
Specifically, the corpus to be labeled can be a corpus in any form, such as English dialogue in a TV series, English lines, text in an English textbook, English listening material in test materials, and so on. The corpus type corresponding to the corpus to be labeled can be pure text, pure speech, or both text and speech; the embodiments of the application are not specifically limited in this respect.
When obtaining the corpus to be labeled, corpus information input by the user can be received in real time and used as the corpus to be labeled. After the corpus to be labeled is obtained, the corpus type corresponding to it can then be determined.
Step 404, determining the labeling system matched with the corpus type.
In the embodiments of the application, multiple labeling systems can be provided, with different labeling systems used to label corpora of different corpus types. The labeling system matched with the corpus type of the corpus to be labeled is therefore determined from that corpus type, so that a suitable labeling system is used for labeling.
Step 406, identifying suprasegmental features in the corpus to be labeled by using the labeling system matched with the corpus type, and labeling the corpus to be labeled according to the identification result, wherein the suprasegmental features characterize the pronunciation of the corpus to be labeled.
The evaluation of pronunciation generally covers two levels: the segmental level and the suprasegmental level. A segment can be thought of as a phoneme, the actual pronunciation of a phonetic symbol. Suprasegmental features are acoustic features that span multiple segments, such as intonation, stress, and prosodic rhythm. The suprasegmental features here may include, but are not limited to, the pause, stress, loss-of-plosion, and linking types.
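As a minimal sketch (not part of the patent text), the four suprasegmental feature types and their positions could be represented as follows; the names are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum

class SuprasegmentalType(Enum):
    PAUSE = "pause"              # brief stop between sense groups
    STRESS = "stress"            # emphasized syllable/word
    LOSS_OF_PLOSION = "plosion"  # unreleased plosive before a consonant
    LINKING = "linking"          # liaison between adjacent words

@dataclass
class SuprasegmentalFeature:
    type: SuprasegmentalType
    position: int  # index of the text unit (e.g., word) the mark attaches to

# Example: "The survey | found that ..." with a pause after "survey"
features = [SuprasegmentalFeature(SuprasegmentalType.PAUSE, 1)]
```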
In the embodiments of the application, the labeling system matched with the corpus type of the corpus to be labeled can be used to identify the suprasegmental features in the corpus, which are then labeled automatically according to the identification result. The labeling is not limited by the content or type of the corpus and needs no manual work, which significantly enlarges the range of corpus content that can be labeled, reduces the cost of manual labeling, and brings a better experience to users.
In some embodiments, the corpus type is a first corpus type, a second corpus type, or a third corpus type, wherein the first corpus type includes only corpus in text form, the second corpus type includes only corpus in speech form, and the third corpus type includes corpus in both text form and speech form.
Identifying suprasegmental features in the corpus to be labeled by using the labeling system matched with the corpus type comprises the following steps:
when the corpus to be labeled is of the first corpus type, identifying first suprasegmental features of the original text of the corpus by using a first labeling system, wherein the first labeling system is used for labeling corpus in text form; or,
when the corpus to be labeled is of the second corpus type, identifying second suprasegmental features of the original speech of the corpus by using a second labeling system, wherein the second labeling system is used for labeling corpus in speech form; or,
when the corpus to be labeled is of the third corpus type, identifying the first suprasegmental features of the original text by using the first labeling system, identifying the second suprasegmental features of the original speech by using the second labeling system, and determining the third suprasegmental features in the corpus according to the first and second suprasegmental features.
Specifically, the first labeling system is text-based and is used for labeling corpus in text form. The second labeling system is speech-based and is used for labeling corpus in speech form.
When the corpus input by the user is of the first corpus type (i.e., it is in pure text form), the first labeling system matched with the first corpus type can be determined, and the first labeling system is used to identify the first suprasegmental features of the original text of the corpus to be labeled.
When the corpus input by the user is of the second corpus type (i.e., it is in pure speech form), the second labeling system matched with the second corpus type can be determined, and the second labeling system is used to identify the second suprasegmental features of the original speech of the corpus to be labeled. The second suprasegmental features may be the same as or different from the first suprasegmental features.
When the corpus input by the user is of the third corpus type (i.e., it contains corpus in both text form and speech form), the first labeling system can be used to identify the first suprasegmental features of the original text, the second labeling system can be used to identify the second suprasegmental features of the original speech, and the third suprasegmental features in the corpus are then determined from the first and second suprasegmental features. The third suprasegmental features can be understood here as the intersection of the first and second suprasegmental features.
In the embodiments of the application, whether the corpus to be labeled is in text form, in speech form, or a combination of the two, the first and second labeling systems can accurately identify its suprasegmental features, which improves the flexibility of corpus input and enlarges the range of speech learning.
In some embodiments, identifying, with the first labeling system, the first suprasegmental features of the original text of the corpus to be labeled comprises:
inputting the original text of the corpus to be labeled into the first labeling system to obtain first position information where a pause is needed, second position information where stress is needed, third position information where loss of plosion is needed, and/or fourth position information where linking is needed in the original text;
and determining the first position information, the second position information, the third position information and/or the fourth position information as the first suprasegmental features.
In the embodiments of the application, after the original text of the corpus to be labeled is input into the first labeling system, different models in the first labeling system can process the input text. Specifically, a tokenizer and a preset vocabulary may be used to obtain the token corresponding to each text unit (e.g., each English word) in the original text; the tokens are then fed into a Transformer Encoder model to predict the first position information where pauses are needed, and finally a label is output for each text unit. For example, when the original text is "The survey found that the statistical …", a Yes label is output after the word "that", meaning a pause is required after that word; a No label is output after the remaining words, meaning no pause is required after them, as shown in fig. 5. As an alternative embodiment, the Transformer Encoder model may consist of 6 Transformer Encoder layers, i.e., a pure encoder structure without a decoder, with 768 hidden neurons per layer.
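A minimal PyTorch sketch of such a pause predictor follows, under the assumptions stated above (6 encoder layers, 768 hidden units, a per-token Yes/No output); the vocabulary size and head count are assumptions:

```python
import torch
import torch.nn as nn

class PausePredictor(nn.Module):
    """Per-token binary classifier: does a pause follow this text unit?"""
    def __init__(self, vocab_size=30000, d_model=768, n_layers=6, n_heads=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # encoder only, no decoder
        self.head = nn.Linear(d_model, 2)  # logits for No / Yes

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        h = self.encoder(self.embed(token_ids))
        return self.head(h)                # (batch, seq_len, 2)

model = PausePredictor()
tokens = torch.randint(0, 30000, (1, 6))   # e.g. "The survey found that the statistical"
pause_after = model(tokens).argmax(-1)     # 1 where a pause is predicted
```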
Meanwhile, the positions that need stress can be predicted by a generative large language model in the first labeling system. "Stress" here refers to highlighting, at the sentence level, the stressed syllables of a text unit (e.g., an English word) for the purpose of expressing a particular meaning, i.e., making a certain text unit sound more prominent. For example, a mischievous child returns home and is asked by the parents, "Did you skip class?", and the child mutters angrily: "I did go to school today." Stressing "did" in this sentence expresses that he really did go to school today. This kind of sentence-level prediction of stressed words or phrases therefore requires very advanced understanding and context-analysis capabilities. In some alternative embodiments, the text units that need stress can be automatically identified by a generative large language model. For example, the original text "The survey found that the statistical …" may be fed to a generative large language model to predict the second position information, such as the words "survey" and "statistical" that need to be stressed.
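A sketch of what such LLM-based stress prediction might look like; the prompt wording and the `complete` helper are hypothetical stand-ins for any generative LLM client:

```python
def predict_stress(sentence: str, complete) -> list[str]:
    """Return the words in `sentence` that should be stressed.
    `complete` is any Callable[[str], str] wrapping a generative LLM."""
    prompt = (
        "For the English sentence below, list the words that should be "
        "stressed (emphasized) when reading it aloud, one per line.\n"
        f"Sentence: {sentence}"
    )
    reply = complete(prompt)
    return [w.strip() for w in reply.splitlines() if w.strip()]

# predict_stress("The survey found that the statistical ...", complete)
# might return ["survey", "statistical"]
```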
Meanwhile, loss of plosion means that between the final pronunciation phoneme of one text unit (e.g., an English word) and the first pronunciation phoneme of the next, if a preset loss-of-plosion rule is met, the final plosive of the first unit is not released during pronunciation. Specifically, when one of the plosives /t/, /d/, /p/, /b/, /k/, /g/ occurs before a consonant, the plosive is not fully released: at the moment it should be articulated, the release is suppressed and pronunciation moves immediately to the following consonant. When obtaining the third position information for loss of plosion, the pronunciation table of each word can be looked up to obtain the phoneme sequence of the original text, and whether loss of plosion is needed between words is then determined by analyzing the phonetic symbols between them. For example, for "found that" in the original text, since the last phoneme of "found" is the plosive /d/ and the first phoneme of the following word is the consonant /ð/, the loss-of-plosion condition is met, and the position of /d/ can be marked as third position information where loss of plosion is needed.
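A text-side rule check for loss of plosion might look like the following sketch; the phoneme notation and the dictionary format are assumptions:

```python
PLOSIVES = {"t", "d", "p", "b", "k", "g"}
VOWELS = {"a", "e", "i", "o", "u"}  # coarse check; a real system would use IPA classes

def plosion_loss_positions(words, pron_dict):
    """Return indices i where word i's final plosive is unreleased before word i+1.
    pron_dict maps a word to its phoneme list, e.g. {"found": ["f","aʊ","n","d"], ...}."""
    positions = []
    for i in range(len(words) - 1):
        last = pron_dict[words[i]][-1]
        first = pron_dict[words[i + 1]][0]
        if last in PLOSIVES and first not in VOWELS:  # plosive followed by a consonant
            positions.append(i)
    return positions

# plosion_loss_positions(["found", "that"],
#     {"found": ["f", "aʊ", "n", "d"], "that": ["ð", "æ", "t"]})  -> [0]
```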
Linking is similar to loss of plosion in that where it occurs in sentences is also relatively fixed. When obtaining the fourth position information where linking is needed, the pronunciation table of each word can be looked up to obtain the phoneme sequence of the original text, and whether adjacent words need to be linked is then determined by analyzing the phonetic symbols between them.
In the embodiments of the application, the first labeling system can accurately identify the first position information where a pause is needed, the second position information where stress is needed, the third position information where loss of plosion is needed, and/or the fourth position information where linking is needed in the original text, thereby realizing the labeling of corpus in pure text form.
In some embodiments, identifying, with the second labeling system, the second suprasegmental features of the original speech of the corpus to be labeled comprises:
inputting the original speech of the corpus to be labeled into the second labeling system, and recognizing it to obtain the recognized text of the corpus to be labeled;
aligning the original speech with the recognized text to obtain alignment information, wherein the alignment information characterizes the starting and ending pronunciation times, in the original speech, of each text unit in the recognized text;
determining speech features of the corpus to be labeled according to the alignment information and the original speech, wherein the speech features comprise at least one of the phoneme probability of each speech frame in the original speech and the fundamental frequency, energy, mel-frequency cepstral coefficients, and average phoneme duration corresponding to each text unit;
determining, according to the alignment information and the speech features, fifth position information where a pause is needed, sixth position information where stress is needed, seventh position information where loss of plosion is needed, and/or eighth position information where linking is needed in the original speech;
and determining the fifth position information, the sixth position information, the seventh position information and/or the eighth position information as the second suprasegmental features.
In the embodiments of the application, after the original speech of the corpus to be labeled is input into the second labeling system, different models in the second labeling system can process it. Specifically, a speech recognition model (such as Whisper) may first be used to recognize the original speech and obtain the recognized text. The original speech and the recognized text may then be input into a Kaldi ASR system (an open-source automatic speech recognition project), which aligns them to obtain the alignment information. The alignment information here characterizes the starting and ending pronunciation time of each text unit of the recognized text within the original speech. For example, assuming the original speech corresponds to "The survey found that the statistical …", the alignment information may indicate that the word "The" is pronounced from 0.08s to 0.34s, the word "survey" from 0.48s to 0.56s, and so on.
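A sketch of the recognition and alignment step follows. The Whisper calls follow the openai-whisper package; `forced_align` is a hypothetical placeholder for a Kaldi-based forced aligner:

```python
import whisper

def forced_align(wav_path, text):
    """Placeholder for a Kaldi-based forced aligner. A real implementation
    would return (word, start_s, end_s) triples such as
    [("The", 0.08, 0.34), ("survey", 0.48, 0.56), ...]."""
    raise NotImplementedError("hook up a Kaldi forced aligner here")

def recognize_and_align(wav_path):
    model = whisper.load_model("base")                    # openai-whisper ASR model
    recognized_text = model.transcribe(wav_path)["text"]  # recognized text
    alignment = forced_align(wav_path, recognized_text)   # word-level timestamps
    return recognized_text, alignment
```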
In addition, on the basis of the alignment information and the original speech, the speech features of the corpus to be labeled can be obtained, such as the phoneme probability of each speech frame in the original speech and the fundamental frequency, energy, mel-frequency cepstral coefficients, and average phoneme duration corresponding to each text unit. The phoneme probability of a speech frame is the probability of each phoneme given that frame; for example, assuming each speech frame lasts 20 ms, the probability that the frames from 0.08s to 0.10s realize a given phoneme may be 0.9 (normalized), while the probability of another phoneme such as /b/ may be around 1e-3. The fundamental frequency corresponding to each text unit is the frequency carrying the most energy in the complex wave, and can be computed from the original speech by the autocorrelation method. The energy corresponding to each text unit is the per-frame energy in the time domain, i.e., the square of the amplitude; it may be taken as the average of the energy of the frames belonging to that text unit in the original speech. Fundamental frequency and energy have one value per speech frame, and their mean and maximum/minimum over the duration of a word are computed as that word's fundamental-frequency and energy features. The mel-frequency cepstral coefficients (Mel-scale Frequency Cepstral Coefficients, MFCC for short) corresponding to each text unit are cepstral parameters extracted on the mel-scale frequency domain, obtained by pre-emphasis, framing and windowing, short-time Fourier transform, and mel-scale transform of the signal, keeping the first 13 coefficients and their delta coefficients. The average phoneme duration of each text unit is a single number: since the number of phonemes in a text unit is known, the unit's total pronunciation duration divided by its phoneme count gives the average phoneme duration.
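The word-level acoustic features described above could be extracted along the following lines; this is a sketch using librosa, and the frame parameters and F0 search range are assumptions:

```python
import numpy as np
import librosa

def word_features(y, sr, start_s, end_s, n_phonemes):
    """Acoustic features for one word spanning [start_s, end_s] in waveform y."""
    seg = y[int(start_s * sr):int(end_s * sr)]
    # Fundamental frequency per frame (yin is an autocorrelation-style estimator)
    f0 = librosa.yin(seg, fmin=60, fmax=400, sr=sr)
    # Per-frame energy in the time domain: mean squared amplitude per frame
    frames = librosa.util.frame(seg, frame_length=int(0.02 * sr),
                                hop_length=int(0.02 * sr))
    energy = (frames ** 2).mean(axis=0)
    # First 13 MFCCs plus their delta coefficients
    mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13)
    mfcc = np.vstack([mfcc, librosa.feature.delta(mfcc)])
    return {
        "f0_mean": f0.mean(), "f0_max": f0.max(), "f0_min": f0.min(),
        "energy_mean": energy.mean(),
        "mfcc": mfcc,                                   # (26, n_frames)
        "avg_phoneme_dur": (end_s - start_s) / n_phonemes,
    }
```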
It should be noted that there are about 70 phonemes in the English corpus. Each phoneme can be passed through an embedding layer to obtain a 768-dimensional feature, used to identify the phonemes in the speech. The 39-dimensional MFCC features may be converted by a linear layer into 768-dimensional features and concatenated with the phoneme embeddings; the concatenated features enter a Transformer Encoder model, whose output is max-pooled over time (a max pooling layer) into a single 768-dimensional feature. This feature is then concatenated with the speech features (such as fundamental frequency, energy, word duration, and average phoneme duration within the word), and a final linear layer produces the prediction result (i.e., the second suprasegmental features).
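A PyTorch sketch of that architecture follows; the number of output classes and the exact set of utterance-level features are assumptions:

```python
import torch
import torch.nn as nn

class AcousticLabeler(nn.Module):
    """Phoneme embeddings + MFCC frames -> Transformer -> max-pool -> prediction."""
    def __init__(self, n_phonemes=70, d_model=768, n_word_feats=4, n_classes=2):
        super().__init__()
        self.phone_embed = nn.Embedding(n_phonemes, d_model)
        self.mfcc_proj = nn.Linear(39, d_model)  # 39-dim MFCC features -> 768
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.out = nn.Linear(d_model + n_word_feats, n_classes)

    def forward(self, phone_ids, mfcc, word_feats):
        # phone_ids: (B, P); mfcc: (B, T, 39); word_feats: (B, n_word_feats)
        seq = torch.cat([self.phone_embed(phone_ids), self.mfcc_proj(mfcc)], dim=1)
        h = self.encoder(seq)              # (B, P+T, 768)
        pooled = h.max(dim=1).values       # max pooling over time -> (B, 768)
        return self.out(torch.cat([pooled, word_feats], dim=-1))
```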
In the embodiments of the application, the second labeling system can accurately identify the fifth position information where a pause is needed, the sixth position information where stress is needed, the seventh position information where loss of plosion is needed, and/or the eighth position information where linking is needed in the original speech, thereby realizing the labeling of corpus in pure speech form.
In some embodiments, determining, according to the alignment information and the speech features, fifth position information where a pause is needed, sixth position information where stress is needed, seventh position information where loss of plosion is needed, and/or eighth position information where linking is needed in the original speech comprises:
determining, according to the alignment information, the pronunciation interval between any two adjacent text units in the original speech, and determining the position information between two adjacent text units whose pronunciation interval is longer than a preset duration as the fifth position information;
determining the pronunciation duration of each text unit in the original speech according to the alignment information, and determining the sixth position information according to the pronunciation duration of each text unit and the speech features;
determining, according to the speech features, whether each text unit in the original speech meets a preset loss-of-plosion condition, and determining the position information of text units meeting that condition as the seventh position information, wherein the preset loss-of-plosion condition is that the last pronunciation phoneme of the former of two adjacent text units is a plosive, the first pronunciation phoneme of the latter text unit is a consonant, and the mean of the phoneme probabilities of all speech frames within the duration of the plosive falls within a first preset range;
determining, according to the speech features, whether each text unit in the original speech meets a preset linking condition, and determining the position information of text units meeting that condition as the eighth position information, wherein the preset linking condition is that the last pronunciation phoneme of the former of two adjacent text units and the first pronunciation phoneme of the latter text unit satisfy a preset correspondence, and the mean of the phoneme probabilities of all speech frames within the duration of the former text unit's last pronunciation phoneme falls within a second preset range.
In the embodiments of the application, the pronunciation interval between any two adjacent text units in the original speech can be obtained from the alignment information. For example, if the alignment information indicates that the word "The" is pronounced from 0.08s to 0.34s and the word "survey" from 0.48s to 0.56s, the pronunciation interval between "The" and "survey" is 0.48 - 0.34 = 0.14s, i.e., 140 ms. Based on phonetic statistics and human listening, positions whose pronunciation interval exceeds a preset duration (e.g., 300 ms) can be marked as fifth position information where a pause is needed.
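As a sketch, detecting pauses from alignment gaps reduces to a threshold check (the 300 ms default follows the example above):

```python
def pause_positions(alignment, min_gap_s=0.3):
    """alignment: list of (word, start_s, end_s) triples. Returns indices i
    where a pause should be marked between word i and word i+1."""
    return [
        i for i in range(len(alignment) - 1)
        if alignment[i + 1][1] - alignment[i][2] > min_gap_s
    ]

# pause_positions([("The", 0.08, 0.34), ("survey", 0.48, 0.56)]) -> []  (gap 0.14s)
```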
From the alignment information, the pronunciation duration of each text unit in the original speech and the speech waveform corresponding to each text unit can be obtained. The speech features including the fundamental frequency, pronunciation duration, average phoneme duration, energy, mel-frequency cepstral coefficients, and phonemes corresponding to each text unit are then extracted from the waveforms, concatenated together, and input into the Transformer Encoder model for prediction, so as to determine the sixth position information where stress is needed.
According to the speech features, the plosives /t/, /d/, /p/, /b/, /k/, /g/ can be searched for; if such a plosive occurs before a consonant, whether loss of plosion is required between this text unit and the next is determined by analyzing the phoneme probabilities. For example, for original speech containing "found that", the last pronunciation phoneme of the word "found" is the plosive /d/ and the first pronunciation phoneme of the word "that" is the consonant /ð/, so the loss-of-plosion condition is met textually. Whether plosion is actually lost in the speech, however, also requires checking that the mean of the phoneme probabilities of all speech frames within the duration of the plosive falls within the first preset range. Specifically, attention is paid to the time position of the plosive /d/; if it lies between 1.30s and 1.36s, the probabilities that the frames within this period correspond to /d/ are averaged, and if the mean falls within the first preset range, the loss-of-plosion condition in the speech can be considered met, and the seventh position information is determined accordingly. The first preset range may be set according to the actual situation and is not specifically limited in this embodiment. Because the plosive is unreleased, the mean of the phoneme probabilities of the frames within its duration will not be very high; but since the articulation gesture is partly made, the mean will not be very low either. Based on production practice, a mean in the range 0.3 to 0.7 can be taken to indicate that loss of plosion is required.
According to the speech features, whether each text unit in the original speech meets the preset linking condition can also be determined. For example, original speech containing "come on" is usually linked and pronounced "komon", i.e., the "m" and "on" are connected smoothly. That is, if the last pronunciation phoneme of the word "come" and the first pronunciation phoneme of the word "on" satisfy the preset correspondence, and the mean of the phoneme probabilities of all speech frames within the duration of the last pronunciation phoneme of "come" falls within the second preset range, the preset linking condition in the speech can be considered met, and the eighth position information where linking is needed is determined accordingly. The second preset range may be set according to the actual situation and is not specifically limited in this embodiment. Because of the linking, the articulation is not as clear and complete as it would otherwise be, so the mean of the phoneme probabilities of the frames over the duration of the previous text unit's last phoneme will not be very high; but the articulation is still carried out, so the mean will not be very low. Based on production practice, a mean in the range 0.5 to 0.9 can be taken to indicate that linking is required.
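A sketch of the acoustic confirmation step for both phenomena; the posterior format (one phoneme-to-probability dict per frame) and the default ranges follow the text above:

```python
def mean_phone_prob(posteriors, phone, start_frame, end_frame):
    """posteriors: list of dicts mapping phoneme -> probability, one per frame."""
    span = posteriors[start_frame:end_frame]
    return sum(fr.get(phone, 0.0) for fr in span) / max(len(span), 1)

def plosion_lost(posteriors, phone, start_frame, end_frame, lo=0.3, hi=0.7):
    """Plosive partly articulated but unreleased: mean probability mid-range."""
    return lo <= mean_phone_prob(posteriors, phone, start_frame, end_frame) <= hi

def linked(posteriors, phone, start_frame, end_frame, lo=0.5, hi=0.9):
    """Linked phoneme articulated but less distinct: mean probability mid-high."""
    return lo <= mean_phone_prob(posteriors, phone, start_frame, end_frame) <= hi
```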
It should be noted that, in the preset linking condition above, the last pronunciation phoneme of the former text unit and the first pronunciation phoneme of the latter text unit must satisfy a preset correspondence. The correspondences are as follows:
1. Consonant-vowel linking
In a sentence, where the preceding word ends with a consonant and the following word starts with a vowel, the consonant and vowel can be linked, for example: hold on, come on.
2. Identical or similar sounds meeting
In a sentence, when the final sound of the preceding word and the initial sound of the following word are similar or identical, they can be merged into one pronunciation, which needs to be slightly prolonged.
3. Consonant linking with elision
In a sentence, when /t/ or /d/ falls between consonants, it may be left unpronounced.
4. Vowel linking with an added /r/
For two adjacent words in a sentence, when the former word ends with a vowel (such as /ə/, /ɑː/, or /ɔː/) and the latter word starts with a vowel, a slight /r/ is added between the two vowels.
5. Vowel linking with an added /j/
For two adjacent words in a sentence, when the former word ends with a vowel (such as /iː/ or a diphthong ending in /ɪ/) and the latter word starts with a vowel, a slight /j/ is added between the two vowels.
6. Vowel linking with an added /w/
For two adjacent words in a sentence, when the former word ends with a vowel (such as /uː/ or a diphthong ending in /ʊ/) and the latter word starts with a vowel, a slight /w/ is added between the two vowels.
In the embodiments of the application, the positions of the suprasegmental features in the original speech can be accurately determined from the alignment information and the speech features, making the labeling of speech corpora more accurate.
In some embodiments, determining the third suprasegmental features in the corpus to be labeled according to the first suprasegmental features and the second suprasegmental features comprises:
when suprasegmental features meeting a preset fusion rule exist among the first and second suprasegmental features, determining those features as the third suprasegmental features, wherein the preset fusion rule is that the distribution positions of suprasegmental features of the same type in the first suprasegmental features and the second suprasegmental features must coincide.
In the embodiments of the application, when the corpus to be labeled is of the third corpus type, the first labeling system can identify the first suprasegmental features of the original text, and the second labeling system can identify the second suprasegmental features of the original speech. It is then judged whether the distribution positions of same-type suprasegmental features in the identified first and second suprasegmental features coincide; if they coincide, those suprasegmental features are taken as the third suprasegmental features, and if they do not, those features are discarded. Specifically, the first and second suprasegmental features may be fused by requiring agreement. That is, for a given corpus position, the system recognizes that a pause is needed there only if both the first suprasegmental features and the second suprasegmental features mark it as a pause. Similarly, the system recognizes that linking is needed at a position only if both the first and the second suprasegmental features mark it as linking. For example, for one corpus to be labeled, the first labeling system may output "the survey found× that | the statistical" and the second labeling system "the survey found that | the statistical", where × marks loss of plosion and | marks a pause. According to the fusion rule, the final result is "the survey found that | the statistical", i.e., the loss-of-plosion decision is overruled.
In this way, the first and second labeling systems can be combined to accurately identify the third suprasegmental features in a corpus containing both text-form and speech-form corpus, thereby realizing the labeling of corpora that contain corpus in both text form and speech form.
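A sketch of that agreement-based fusion, reusing the hypothetical SuprasegmentalFeature structure from the earlier sketch:

```python
def fuse(first_feats, second_feats):
    """Keep only the suprasegmental features whose (type, position) appear in
    both the text-side and the speech-side predictions."""
    keys = {(f.type, f.position) for f in first_feats}
    return [f for f in second_feats if (f.type, f.position) in keys]

# Text side predicts loss of plosion at word 2 and a pause after word 3;
# speech side predicts only the pause -> the loss-of-plosion decision is dropped.
```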
In some embodiments, labeling the corpus to be labeled according to the recognition result comprises:
determining, according to the recognition result, the types and positions of the suprasegmental features contained in the corpus to be labeled, wherein the types of suprasegmental features comprise at least one of a first feature type characterizing whether a pause is needed, a second feature type characterizing whether stress is needed, a third feature type characterizing whether loss of plosion is needed, and a fourth feature type characterizing whether linking is needed;
and labeling the corpus to be labeled according to the types and positions of the suprasegmental features.
In the embodiments of the application, the corpus to be labeled can be labeled according to the recognition result. Specifically, the types and positions of the suprasegmental features contained in the corpus are determined from the recognition result, and the corpus is then labeled according to those types and positions. In this way, the suprasegmental features in the corpus are marked by type and position, which ensures the labeling quality and makes the marks easy for the user to recognize.
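As a sketch, applying the labels to text (reusing the types from the earlier sketch) might look like the following; the plain-text markers stand in for the bold/underline marks shown in fig. 3:

```python
def render_labels(words, features):
    """Render labeled text: '*word*' = stress, '[word]' = loss of plosion,
    '|' after a word = pause. The markers are illustrative stand-ins."""
    out = []
    for i, w in enumerate(words):
        for f in features:
            if f.position == i:
                if f.type is SuprasegmentalType.STRESS:
                    w = f"*{w}*"
                elif f.type is SuprasegmentalType.LOSS_OF_PLOSION:
                    w = f"[{w}]"
        out.append(w)
        if any(f.position == i and f.type is SuprasegmentalType.PAUSE
               for f in features):
            out.append("|")
    return " ".join(out)
```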
In one example, the corpus labeling process provided by the embodiments of the application may be as shown in fig. 6. A decision is first made based on the user input. To obtain better quality, information from both the text and speech modalities can be used for the decision where conditions permit. If the user inputs corpus in pure text form, it enters the text-based labeling system (i.e., the first labeling system above). Because there is no real pronunciation, a reasonable prosodic pattern is derived. If the user inputs corpus in pure speech form, the recognized text may be obtained by a speech recognition system, and the recognized text and the original speech are then fed into the hybrid text-and-speech labeling system (i.e., the second labeling system above). As an alternative implementation, the recognized text produced by the speech recognition system can be post-processed by a generative large language model to improve the recognition quality. In the hybrid text-and-speech labeling system, the speech and the text are each fed into the corresponding labeling system; a fusion system then considers their outputs simultaneously and fuses them into the labeling result of the corpus to be labeled.
Exemplary apparatus
Having described the method of an exemplary embodiment of the present invention, next, a corpus labeling apparatus of an exemplary embodiment of the present invention will be described with reference to fig. 7.
Fig. 7 schematically shows a functional block diagram of a corpus labeling apparatus according to an embodiment of the invention. As shown in fig. 7, the apparatus 700 may include:
an obtaining module 702, configured to obtain a corpus to be labeled and the corpus type corresponding to the corpus to be labeled;
a determining module 704, configured to determine a labeling system matching the corpus type;
and an identifying and labeling module 706, configured to identify, using the labeling system matching the corpus type, suprasegmental features in the corpus to be labeled, and to label the corpus to be labeled according to the recognition result, wherein the suprasegmental features characterize the pronunciation prosody of the corpus to be labeled.
In one embodiment of the present invention, the corpus type is a first corpus type, a second corpus type or a third corpus type, wherein the first corpus type includes only a text-form corpus, the second corpus type includes only a speech-form corpus, and the third corpus type includes both a text-form corpus and a speech-form corpus;
the identifying and labeling module 706 includes:
a first identifying sub-module, configured to identify, when the corpus to be labeled is of the first corpus type, first suprasegmental features of the original text of the corpus to be labeled using a first labeling system, wherein the first labeling system labels corpora in text form; or,
a second identifying sub-module, configured to identify, when the corpus to be labeled is of the second corpus type, second suprasegmental features of the original speech of the corpus to be labeled using a second labeling system, wherein the second labeling system labels corpora in speech form; or,
a third identifying sub-module, configured to identify, when the corpus to be labeled is of the third corpus type, first suprasegmental features of the original text of the corpus to be labeled using the first labeling system, identify second suprasegmental features of the original speech using the second labeling system, and determine third suprasegmental features in the corpus to be labeled according to the first suprasegmental features and the second suprasegmental features.
In one embodiment of the invention, the first identifying sub-module includes:
a first input unit, configured to input the original text of the corpus to be labeled into the first labeling system to obtain first position information where a pause is required, second position information where stress is required, third position information where loss of plosion is required, and/or fourth position information where linking is required in the original text;
and a first determining unit, configured to determine the first position information, the second position information, the third position information and/or the fourth position information as the first suprasegmental features.
In one embodiment of the invention, the second identifying sub-module includes:
a second input unit, configured to input the original speech of the corpus to be labeled into the second labeling system to obtain, by recognition, a recognized text of the corpus to be labeled;
an alignment unit, configured to align the original speech with the recognized text to obtain alignment information, wherein the alignment information characterizes the start and end pronunciation times, in the original speech, of each text unit in the recognized text;
a second determining unit, configured to determine speech features of the corpus to be labeled according to the alignment information and the original speech, wherein the speech features include at least one of the phoneme probability of each speech frame in the original speech, the corresponding fundamental frequency, energy and Mel-frequency cepstral coefficients, and the average phoneme duration of each text unit;
a third determining unit, configured to determine, according to the alignment information and the speech features, fifth position information where a pause is required, sixth position information where stress is required, seventh position information where loss of plosion is required, and/or eighth position information where linking is required in the original speech;
and a fourth determining unit, configured to determine the fifth position information, the sixth position information, the seventh position information and/or the eighth position information as the second suprasegmental features.
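For concreteness, the alignment information described above could be carried by a structure such as the following (a sketch; the field names are assumptions, not the embodiment's data model):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AlignedUnit:
    text: str            # one text unit, e.g. a word of the recognized text
    start: float         # start pronunciation time in the original speech (s)
    end: float           # end pronunciation time in the original speech (s)
    phonemes: List[str]  # phoneme sequence pronounced for this unit
```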
In one embodiment of the invention, the third determining unit is specifically configured to:
determine, according to the alignment information, the pronunciation interval duration between any two adjacent text units in the original speech, and determine the position information between two adjacent text units whose pronunciation interval duration exceeds a preset duration as the fifth position information;
determine, according to the alignment information, the pronunciation duration of each text unit in the original speech, and determine the sixth position information according to the pronunciation duration of each text unit and the speech features;
determine, according to the speech features, whether each text unit in the original speech satisfies a preset loss-of-plosion condition, and determine the position information of text units satisfying the condition as the seventh position information, wherein the preset loss-of-plosion condition is that, for two adjacent text units, the last phoneme of the preceding unit is a plosive, the first phoneme of the following unit is a consonant, and the mean of the phoneme probabilities of the speech frames within the plosive's duration falls within a first preset range;
and determine, according to the speech features, whether each text unit in the original speech satisfies a preset linking condition, and determine the position information of text units satisfying the condition as the eighth position information, wherein the preset linking condition is that, for two adjacent text units, the last phoneme of the preceding unit and the first phoneme of the following unit satisfy a preset correspondence, and the mean of the phoneme probabilities of the speech frames within the duration of the preceding unit's last phoneme falls within a second preset range.
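As a sketch of two of these rules, the snippet below derives pause positions from inter-unit gaps and candidate loss-of-plosion positions from phoneme identities and probabilities, reusing the AlignedUnit structure from the sketch above. The threshold, phoneme inventory and probability bound are assumed values; the embodiment only specifies that they are presets.

```python
from typing import Callable, List

PAUSE_GAP_S = 0.25                  # assumed preset pause threshold (s)
PLOSIVES = {"p", "b", "t", "d", "k", "g"}
VOWELS = {"a", "e", "i", "o", "u"}  # crude stand-in for a real inventory

def pause_positions(units: List[AlignedUnit]) -> List[int]:
    """Fifth position information: mark a pause between adjacent text
    units whose pronunciation gap exceeds the preset duration."""
    return [i for i in range(len(units) - 1)
            if units[i + 1].start - units[i].end > PAUSE_GAP_S]

def loss_of_plosion_positions(
    units: List[AlignedUnit],
    plosive_prob_mean: Callable[[int], float],  # mean phoneme probability
) -> List[int]:                                 # over the plosive's frames
    """Seventh position information: the preceding unit ends in a plosive,
    the following unit starts with a consonant, and the mean phoneme
    probability over the plosive's duration falls in a preset range
    (here: below an assumed bound, i.e. the plosive is not released)."""
    hits = []
    for i in range(len(units) - 1):
        last = units[i].phonemes[-1]
        first = units[i + 1].phonemes[0]
        if (last in PLOSIVES and first not in VOWELS
                and plosive_prob_mean(i) < 0.4):
            hits.append(i)
    return hits
```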
In one embodiment of the invention, the third identifying sub-module includes:
a fifth determining unit, configured to determine, when the first and second suprasegmental features contain features satisfying a preset fusion rule, those features as the third suprasegmental features, wherein the preset fusion rule requires that same-type features among the first and second suprasegmental features overlap in their distribution positions.
In one embodiment of the invention, the identifying and labeling module 706 further includes:
a determining sub-module, configured to determine, according to the recognition result, the types and positions of the suprasegmental features contained in the corpus to be labeled, wherein the types of suprasegmental features include at least one of a first feature type indicating whether a pause is required, a second feature type indicating whether stress is required, a third feature type indicating whether loss of plosion is required, and a fourth feature type indicating whether linking is required;
and a labeling sub-module, configured to label the corpus to be labeled according to the types and positions of the suprasegmental features.
The apparatus according to the embodiments of the present invention has been described and explained in detail above in connection with the method, and will not be described again here.
In a third aspect of the embodiments of the present invention, there is provided an electronic device, as shown in fig. 8, including a processor 811, a communication interface 812, a memory 813 and a communication bus 814, wherein the processor 811, the communication interface 812 and the memory 813 communicate with each other through the communication bus 814;
a memory 813 for storing a computer program;
in one embodiment of the present application, the processor 811 is configured to implement the steps of the corpus labeling method provided in any of the foregoing method embodiments when executing the program stored in the memory 813.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the corpus labeling method provided in any of the foregoing method embodiments.
It should be noted that although several devices or sub-devices of the electronic apparatus are mentioned in the detailed description above, this division is not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more of the devices described above may be embodied in a single device. Conversely, the features and functions of one device described above may be further divided among multiple devices.
Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one, and/or one step may be decomposed into multiple steps.
It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Use of the verbs "comprise" and "include" and their conjugations in this application does not exclude the presence of elements or steps other than those stated. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
While the spirit and principles of the present invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, nor does the division into aspects imply that features in those aspects cannot be usefully combined; this division is merely for convenience of presentation. The invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims, the scope of which is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims (10)

1. A corpus labeling method, characterized in that the method comprises:
acquiring a corpus to be labeled and a corpus type corresponding to the corpus to be labeled;
determining a labeling system matching the corpus type;
identifying suprasegmental features in the corpus to be labeled using the labeling system matching the corpus type, and labeling the corpus to be labeled according to the recognition result, wherein the suprasegmental features characterize the pronunciation prosody of the corpus to be labeled.
2. The method of claim 1, wherein the corpus type is a first corpus type, a second corpus type or a third corpus type, wherein the first corpus type includes only a text-form corpus, the second corpus type includes only a speech-form corpus, and the third corpus type includes both a text-form corpus and a speech-form corpus;
the identifying suprasegmental features in the corpus to be labeled using the labeling system matching the corpus type comprises:
when the corpus to be labeled is of the first corpus type, identifying first suprasegmental features of the original text of the corpus to be labeled using a first labeling system, wherein the first labeling system labels corpora in text form; or,
when the corpus to be labeled is of the second corpus type, identifying second suprasegmental features of the original speech of the corpus to be labeled using a second labeling system, wherein the second labeling system labels corpora in speech form; or,
when the corpus to be labeled is of the third corpus type, identifying first suprasegmental features of the original text of the corpus to be labeled using the first labeling system, identifying second suprasegmental features of the original speech of the corpus to be labeled using the second labeling system, and determining third suprasegmental features in the corpus to be labeled according to the first suprasegmental features and the second suprasegmental features.
3. The method of claim 2, wherein the identifying, using the first labeling system, first suprasegmental features of the original text of the corpus to be labeled comprises:
inputting the original text of the corpus to be labeled into the first labeling system to obtain first position information where a pause is required, second position information where stress is required, third position information where loss of plosion is required, and/or fourth position information where linking is required in the original text;
and determining the first position information, the second position information, the third position information and/or the fourth position information as the first suprasegmental features.
4. The method of claim 2, wherein the identifying, using a second labeling system, second suprasegmental features of the original speech of the corpus to be labeled comprises:
inputting the original speech of the corpus to be labeled into the second labeling system to obtain, by recognition, a recognized text of the corpus to be labeled;
aligning the original speech with the recognized text to obtain alignment information, wherein the alignment information characterizes the start and end pronunciation times, in the original speech, of each text unit in the recognized text;
determining speech features of the corpus to be labeled according to the alignment information and the original speech, wherein the speech features include at least one of the phoneme probability of each speech frame in the original speech, the corresponding fundamental frequency, energy and Mel-frequency cepstral coefficients, and the average phoneme duration of each text unit;
determining, according to the alignment information and the speech features, fifth position information where a pause is required, sixth position information where stress is required, seventh position information where loss of plosion is required, and/or eighth position information where linking is required in the original speech;
and determining the fifth position information, the sixth position information, the seventh position information and/or the eighth position information as the second suprasegmental features.
5. The method of claim 4, wherein the determining, according to the alignment information and the speech features, fifth position information where a pause is required, sixth position information where stress is required, seventh position information where loss of plosion is required, and/or eighth position information where linking is required in the original speech comprises:
determining, according to the alignment information, the pronunciation interval duration between any two adjacent text units in the original speech, and determining the position information between two adjacent text units whose pronunciation interval duration exceeds a preset duration as the fifth position information;
determining, according to the alignment information, the pronunciation duration of each text unit in the original speech, and determining the sixth position information according to the pronunciation duration of each text unit and the speech features;
determining, according to the speech features, whether each text unit in the original speech satisfies a preset loss-of-plosion condition, and determining the position information of text units satisfying the preset loss-of-plosion condition as the seventh position information, wherein the preset loss-of-plosion condition is that, for two adjacent text units, the last phoneme of the preceding text unit is a plosive, the first phoneme of the following text unit is a consonant, and the mean of the phoneme probabilities of the speech frames within the plosive's duration falls within a first preset range;
and determining, according to the speech features, whether each text unit in the original speech satisfies a preset linking condition, and determining the position information of text units satisfying the preset linking condition as the eighth position information, wherein the preset linking condition is that, for two adjacent text units, the last phoneme of the preceding text unit and the first phoneme of the following text unit satisfy a preset correspondence, and the mean of the phoneme probabilities of the speech frames within the duration of the preceding text unit's last phoneme falls within a second preset range.
6. The method of claim 2, wherein the determining third suprasegmental features in the corpus to be labeled according to the first suprasegmental features and the second suprasegmental features comprises:
when features satisfying a preset fusion rule exist among the first suprasegmental features and the second suprasegmental features, determining the features satisfying the preset fusion rule as the third suprasegmental features, wherein the preset fusion rule requires that same-type features among the first suprasegmental features and the second suprasegmental features overlap in their distribution positions.
7. The method of claim 1, wherein the labeling the corpus to be labeled according to the recognition result comprises:
determining, according to the recognition result, the types and positions of the suprasegmental features contained in the corpus to be labeled, wherein the types of suprasegmental features include at least one of a first feature type indicating whether a pause is required, a second feature type indicating whether stress is required, a third feature type indicating whether loss of plosion is required, and a fourth feature type indicating whether linking is required;
and labeling the corpus to be labeled according to the types and positions of the suprasegmental features.
8. A corpus labeling apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a corpus to be labeled and a corpus type corresponding to the corpus to be labeled;
a determining module, configured to determine a labeling system matching the corpus type;
and an identifying and labeling module, configured to identify suprasegmental features in the corpus to be labeled using the labeling system matching the corpus type, and to label the corpus to be labeled according to the recognition result, wherein the suprasegmental features characterize the pronunciation prosody of the corpus to be labeled.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the steps of the corpus labeling method according to any one of claims 1-7 when executing the program stored in the memory.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the corpus labeling method according to any one of claims 1-7.
CN202311482699.5A 2023-11-08 2023-11-08 Corpus labeling method and device, electronic equipment and storage medium Pending CN117648450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311482699.5A CN117648450A (en) 2023-11-08 2023-11-08 Corpus labeling method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311482699.5A CN117648450A (en) 2023-11-08 2023-11-08 Corpus labeling method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117648450A true CN117648450A (en) 2024-03-05

Family

ID=90044167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311482699.5A Pending CN117648450A (en) 2023-11-08 2023-11-08 Corpus labeling method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117648450A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination