CN115862584A - Prosodic information labeling method and related device - Google Patents

Prosodic information labeling method and related device

Info

Publication number
CN115862584A
CN115862584A (application CN202111124499.3A)
Authority
CN
China
Prior art keywords: text, information, prosodic, labeled, label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111124499.3A
Other languages
Chinese (zh)
Inventor
陈飞扬
李太松
陈珊珊
王喆锋
李明磊
怀宝兴
袁晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd
Priority to CN202111124499.3A
Priority to PCT/CN2022/099389 (WO2023045433A1)
Publication of CN115862584A
Legal status: Pending


Classifications

    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L15/26 Speech to text systems


Abstract

The embodiments of this application disclose a prosodic information labeling method and a related device, which are used to improve labeling efficiency. The method in the embodiments of this application includes the following steps: a computer device obtains audio information and first text information. The computer device labels the prosodic words and prosodic phrases in the first text information to obtain a first labeled text, where the prosodic phrases in the first labeled text need to be labeled based on the audio information. The computer device then labels the intonation phrases in the first labeled text based on the prosodic words labeled in the first labeled text, the prosodic phrases labeled in the first labeled text, and the audio information, to obtain a second labeled text.

Description

Prosodic information labeling method and related device
Technical Field
The embodiments of this application relate to the field of speech synthesis, and in particular to a prosodic information labeling method and a related device.
Background
A text-to-speech (TTS) technique obtains natural and fluent speech information based on a text and the labeling information in the text, where the labeling information includes labels for indicating prosodic phrases, labels for indicating prosodic words, and labels for indicating intonation phrases; the labeling information can represent the pauses and tone changes of the text. In speech synthesis, a text including labeling information is input into a model so that the model outputs the corresponding speech information, and to make the speech information output by the model more natural, a large amount of text including labeling information is required to train the model.
At present, prosodic phrases, prosodic words, and intonation phrases in texts are mainly labeled by professional annotators, which takes a long time. Texts containing labeling information are therefore difficult to obtain, which is not conducive to quickly training the model to an ideal effect.
Disclosure of Invention
The embodiments of this application provide a prosodic information labeling method and a related device, which are used to improve labeling efficiency.
A first aspect of an embodiment of the present application provides a prosodic information labeling method:
The computer device obtains audio information and first text information that have a correspondence, and labels the prosodic words and prosodic phrases in the first text information to obtain a first labeled text, where the prosodic phrases need to be labeled based on the audio information. After the prosodic words and prosodic phrases in the first text information are labeled, the computer device labels the intonation phrases in the first labeled text based on the labeled prosodic words, the labeled prosodic phrases, and the audio information, to obtain a second labeled text.
In this embodiment of the application, the computer device may collect the first text information and the corresponding audio information from the network, so a large amount of data can be collected. The prosodic words, prosodic phrases, and intonation phrases in the first text information are labeled by the computer device without manual labeling, which improves the labeling efficiency of prosodic words, prosodic phrases, and intonation phrases; and because the prosodic words and prosodic phrases are combined when labeling the intonation phrases, the accuracy of intonation phrase labeling is also improved.
In one possible implementation, the computer device may further correct the prosodic words, prosodic phrases, and intonation phrases labeled in the second labeled text in response to an operation instruction of the user.
In the embodiment of the application, after the computer device labels the prosodic words, the prosodic phrases and the intonation phrases, the user can correct the related labels, so that the accuracy of the labels is further improved.
In a possible implementation manner, the computer device may further obtain a third labeled text, where the text information of the third labeled text is consistent with the text information of the second labeled text, while the labeling information in the third labeled text is different from the labeling information in the second labeled text; the labeling information is indicated by labels for indicating prosodic words, labels for indicating prosodic phrases, and labels for indicating intonation phrases. The computer device may further determine a target label in the third labeled text based on the labeling information of the second labeled text and the labeling information of the third labeled text, where the target label includes at least one of a label for indicating a prosodic word, a label for indicating a prosodic phrase, and a label for indicating an intonation phrase.
In this embodiment of the application, the third labeled text may be obtained by manually labeling the prosodic words, prosodic phrases, and intonation phrases in the first text information. The third labeled text can be verified based on the second labeled text to determine target labels that are possibly labeled incorrectly in the third labeled text, so that the third labeled text can be corrected and the labeling accuracy improved.
In a possible implementation manner, the computer device may further correct the target label in response to an operation instruction of the user.
In a possible implementation manner, the computer device obtaining the audio information and the first text information may specifically be: first obtaining the audio information, and then obtaining the first text information corresponding to the audio information based on a speech recognition technology. Alternatively, the computer device obtains video information, then obtains the audio information in the video information and second text information corresponding to the subtitle information in the video information, obtains third text information corresponding to the audio information based on a speech recognition technology, and determines the first text information based on the second text information and the third text information.
A second aspect of the embodiments of the present application provides a computer device, where the computer device includes a plurality of functional modules, and the functional modules interact with each other to implement the method in the first aspect and the embodiments thereof. The functional modules may be implemented based on software, hardware or a combination of software and hardware, and may be arbitrarily combined or divided based on specific implementations.
A third aspect of embodiments of the present application provides a computer device, comprising a processor coupled with a memory, the memory being configured to store instructions that, when executed by the processor, cause the computer device to perform the method as described in the first aspect.
A fourth aspect of embodiments of the present application provides a computer program product comprising code which, when run on a computer, causes the computer to perform the method according to the first aspect.
A fifth aspect of embodiments of the present application provides a computer-readable storage medium having a computer program or instructions stored thereon which, when executed, cause a computer to perform the method according to the first aspect.
Drawings
FIG. 1a and FIG. 1b are schematic diagrams of system architectures in embodiments of the present application;
FIG. 2 is a schematic flow chart illustrating a prosodic information labeling method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating a prosodic information labeling method according to an embodiment of the present application;
FIG. 4 is a diagram illustrating prosodic words, prosodic phrases, and intonation phrases in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a computer device in the embodiment of the present application;
FIG. 6 is another schematic structural diagram of a computer device in the embodiment of the present application.
Detailed Description
Embodiments of the present application will now be described with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely illustrative of some, but not all, embodiments of the present application. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments of the present application provide a prosodic information labeling method and a related device, which are used to improve the labeling efficiency of prosodic information.
For the sake of understanding, the following description is made about the related concepts related to the embodiments of the present application:
syllable: the voice unit is the most easily distinguished in hearing and the most natural voice unit in the voice stream.
Rhythm word: is a group of syllables that are closely related in the actual stream of speech, often pronounced together.
Prosodic phrases: and an intermediate rhythm block between the prosodic words and the intonation phrases.
Intonation phrases: a phrase that connects several prosodic phrases according to a certain intonation pattern.
The embodiments of the present application can be applied to speech synthesis and text emotion recognition. In speech synthesis, a text labeled with prosodic phrases, prosodic words, and intonation phrases is input into a model, and the model outputs fluent and natural speech; this is widely used in technologies such as audiobooks, digital humans, voice assistants, and smart speakers. The model can output fluent and natural speech only after being trained with a large amount of text labeled with prosodic phrases, prosodic words, and intonation phrases; if such labeled text can be obtained quickly, the training efficiency of the model can be improved, and the naturalness and fluency of the output speech further improved. Emotion recognition, also known as tendency analysis, is the process of analyzing, processing, summarizing, and reasoning about subjective text with emotional coloring. Emotion recognition is very widely used in comment analysis and decision-making, e-commerce review classification, and public opinion monitoring. Emotion recognition usually analyzes a text directly; if the prosodic phrases, prosodic words, and intonation phrases in the text can be labeled, the accuracy of emotion recognition is improved.
The embodiments of the present application can be applied to the system architecture shown in fig. 1a or fig. 1b, which are described below. As shown in fig. 1a, the system architecture includes a data acquisition module, a labeling module, and a verification module. The data acquisition module collects video information or audio information from the network. For video information, it extracts the audio information in the video, determines the corresponding second text information according to the subtitles in the video, and performs speech recognition on the audio information to determine third text information; if the second text information does not differ from the third text information, the second text information or the third text information is determined as the first text information, and if they differ, the data acquisition module may correct the second text information or the third text information based on an operation instruction of the user, the corrected text information being the first text information. For audio information, the first text information corresponding to the audio information is obtained directly through a speech recognition technology. It should be understood that the second text information, the third text information, and the first text information may each be embodied in a document file; for example, the second text information may be embodied as a document 1 containing the second text information, the third text information as a document 2 containing the third text information, and the first text information as a document 3 containing the first text information.
The data acquisition module sends the audio information and the first text information to the labeling module, for example, the audio information and document 3. The labeling module labels the prosodic phrases and prosodic words in document 3 based on corresponding models, algorithms, or rules in combination with the audio information to obtain a first labeled text, then labels the intonation phrases in the first labeled text based on the prosodic phrases and prosodic words labeled in the first labeled text in combination with the audio information to obtain a second labeled text, and sends the second labeled text to the verification module. The verification module corrects the second labeled text based on the operation instructions of the user. It should be noted that, in one mode, the labeling module may first send document 3 labeled with prosodic words to the verification module; the verification module corrects the prosodic words labeled in document 3 based on the operation instructions of the user, and the corrected document 3 is then sent back to the labeling module for subsequent labeling, finally yielding the second labeled text.
As shown in fig. 1b, the data acquisition module may further create a document 4 containing the first text information and send document 4 and the audio information to the manual labeling module; the manual labeling module obtains the user's operation instructions to label the prosodic phrases, prosodic words, and intonation phrases in document 4, obtaining a third labeled text. The labeling module and the manual labeling module respectively send the second labeled text and the third labeled text to the screening module; the screening module compares the labeling information in the second labeled text and the third labeled text and determines the target labels in the third labeled text. It should be noted that a target label can be understood as a possibly erroneous label. After the target labels are determined, the screening module sends the third labeled text to the verification module, and the user corrects the target labels through the verification module.
It should be noted that the modules shown in fig. 1a and fig. 1b may be respectively located on different computer devices, or may be located on the same computer device, or a part of the modules is located on the same computer device, and another part is located on another computer device, which is not limited herein.
Referring to fig. 2, a flow of a method for labeling prosodic information in the embodiment of the present application is described as follows:
201. The computer device obtains audio information and first text information;
the computer device collects video information and audio information through downloading on a network, extracts the audio information in the video information for the collected video information, and identifies subtitles in the video information through an optical character recognition technology so as to determine corresponding second text information, wherein the subtitles in the video information have a corresponding relationship with the audio information, so that the second text information determined according to the subtitles also has a corresponding relationship with the audio information. Then, voice recognition is carried out on the audio information so as to determine third text information, and if the second text information is not different from the third text information, the second text information or the third text information is determined to be the first text information; if the second text information is different from the third text information, the user can correct the second text information or the third text information on the computer equipment, and the corrected text information is the first text information. And for the collected audio information, determining text information corresponding to the audio information directly through a voice recognition technology, wherein the text information is the first text information.
202. The computer device labels the prosodic words and prosodic phrases in the first text information to obtain a first labeled text;
After the first text information is obtained, the computer device labels the prosodic words in the first text information based on a corresponding model, algorithm, or rule; for example, the prosodic words in the first text information may be labeled according to a coarse-grained word segmentation model. Optionally, after the computer device labels the prosodic words based on the corresponding model, algorithm, or rule, it may further correct the labeled prosodic words based on an operation instruction of the user.
Then, the computer device labels the prosodic phrases in the first text information according to the audio information. Specifically, the pronunciation duration and tone information of each word in the audio information can be extracted through a neural network or a machine learning algorithm, and the prosodic phrases in the first text information are labeled according to this information.
It should be noted that the order of labeling the prosodic words and prosodic phrases is not limited, and the prosodic words may be labeled first, or the prosodic phrases may be labeled first.
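As an illustration of step 202, the following Python sketch labels prosodic words by inserting "#1" after each segment produced by a coarse-grained segmenter. Using jieba as a stand-in for the unspecified "model, algorithm or rule" is an assumption, and the "#1" convention follows the labeling scheme described with fig. 4 below.

```python
import jieba  # off-the-shelf coarse-grained Chinese segmenter (stand-in choice)

def label_prosodic_words(text: str) -> str:
    # Treat each coarse segment as one prosodic word and mark its right
    # boundary with "#1".
    return "".join(word + "#1" for word in jieba.cut(text))

print(label_prosodic_words("今天天气很好"))
# e.g. 今天#1天气#1很#1好#1 (exact segmentation depends on the dictionary)
```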
203. The computer device labels the intonation phrases in the first labeled text based on the prosodic words labeled in the first labeled text, the prosodic phrases labeled in the first labeled text, and the audio information, to obtain a second labeled text.
Because prosodic words and prosodic phrases are strongly associated with intonation phrases, the computer device can label the intonation phrases in the first labeled text by combining the prosodic words and prosodic phrases labeled in the first labeled text with the pronunciation duration of the characters and the tone information of the characters in the audio information, so as to obtain the second labeled text.
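One plausible reading of this step, sketched below in Python: boundaries already labeled as prosodic phrases (level 2) are promoted to intonation phrases (level 3) when the silence following the boundary, taken from a forced alignment of the audio, is long enough. The pause-only rule and the 0.30 s threshold are illustrative assumptions; the description only says that the labeled prosodic words and prosodic phrases are combined with duration and tone information.

```python
def label_intonation_phrases(boundaries, pauses, pause_threshold=0.30):
    """boundaries: one level per inter-word position, 1 (#1) or 2 (#2);
    pauses: silence in seconds after each position, e.g. from forced alignment."""
    return [
        3 if level == 2 and pause >= pause_threshold else level
        for level, pause in zip(boundaries, pauses)
    ]

print(label_intonation_phrases([1, 2, 1, 2], [0.05, 0.42, 0.08, 0.12]))
# [1, 3, 1, 2]: only the prosodic-phrase boundary with a long pause is promoted.
```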
204. The computer device corrects at least one of prosodic words, prosodic phrases, and intonation phrases labeled in the second labeled text in response to an operation instruction of a user.
After the second labeled text is obtained, the computer device corrects at least one of the prosodic words, prosodic phrases and intonation phrases labeled in the second labeled text based on an operation instruction of a user.
In this embodiment of the application, the computer device may collect the first text information and the corresponding audio information from the network, so a large amount of data can be collected. The prosodic words, prosodic phrases, and intonation phrases in the first text information are labeled by the computer device without manual labeling, which improves the labeling efficiency of prosodic words, prosodic phrases, and intonation phrases; and because the prosodic words and prosodic phrases are combined when labeling the intonation phrases, the accuracy of intonation phrase labeling is also improved.
Optionally, on the basis of the embodiment shown in fig. 2, after the prosodic words, prosodic phrases, and intonation phrases in the first text information are labeled, the user may also correct the labeled prosodic words, prosodic phrases, and intonation phrases through the computer device, where the user may be a relevant professional.
Referring to fig. 3, another flow of the prosodic information labeling method in the embodiment of the present application is described as follows:
steps 301 to 303 in this embodiment are similar to steps 201 to 203 in the embodiment shown in fig. 2, and are not described again here.
304. The computer device obtains a third labeled text;
The computer device obtains a third labeled text in which prosodic words, prosodic phrases, and intonation phrases are labeled. The text information of the third labeled text is consistent with that of the second labeled text, while the labeling information of the third labeled text differs from that of the second labeled text. It can be understood that the text information includes only the text itself, and the labeling information is embodied as the position of each label in the text and the type of each label, where the label types include labels for indicating prosodic word boundaries, labels for indicating prosodic phrases, and labels for indicating intonation phrases.
It should be understood that, in the present embodiment, the second labeled text may be obtained by labeling, by the computer device, prosodic words, prosodic phrases, and intonation phrases in the document 3 including the first text information, and the third labeled text may be obtained by labeling, by the computer device, prosodic words, prosodic phrases, and intonation phrases in the document 4 including the first text information based on an operation instruction of the user.
It should be noted that, the execution sequence of step 304 is not limited herein, and it is only necessary to ensure that the step is executed after step 301 and before step 305.
305. The computer equipment determines a target label in the third labeled text based on the label information of the second labeled text and the label information of the third labeled text;
and the computer equipment analyzes the difference of the labeling information between the two texts by combining the labeling information of the second labeled text and the labeling information of the third labeled text. Referring to fig. 4, fig. 4 is a schematic diagram of labeling text with prosodic words, prosodic phrases, and intonation phrases, as shown in fig. 4, in one implementation, a label for indicating a prosodic word boundary, a label for indicating a prosodic phrase, or a label for indicating an intonation phrase may be added after each word in a sentence, for example, a word or word before "#1" and after the previous "#1", "#2", or "#3" is labeled as a prosodic word by "# 1"; words or words before "#2" and after the last "#1" are also labeled as prosodic words by "#2", and all prosodic words before "#2" and after the last "#2" or "#3" are labeled as prosodic phrases; a word or word before "#3" and after the last "#1" is also labeled as a prosodic word by "#3", all prosodic words before "#3" and after the last "#2 are labeled as prosodic phrases, and all prosodic phrases or all prosodic words before" #3 "and after the last" #3 "or all are labeled as intonation phrases. For example, the sentence "continuously speaking to the technical expert's Beijing" is labeled as "continuously speaking #1 to the technical #3, the expert #1 to the Beijing #3", which means "continuously", "having", "speaking", "technical", "expert", "existing", and "Beijing" are different prosodic words, the "continuous biographical phrase", "technical expert" and "Beijing in place" are different prosodic phrases, respectively, and the "continuous biographical phrase" and "Beijing in place of technical expert" are different intonation phrases, respectively.
In one case, the second labeled text has the label "#a" at a certain position, where a may be 0, 1, 2, or 3, and the third labeled text has the label "#b" at the same position, where b may be 0, 1, 2, or 3. If the difference between a and b is greater than or equal to 2, the label of the third labeled text at that position is possibly wrong, and the computer device determines that label as a target label. For example, if a position is labeled "#1" in the second labeled text but "#3" in the third labeled text, or "#3" in the second labeled text but "#1" in the third labeled text, the difference is 2 in both cases, and the corresponding labels in the third labeled text are determined as target labels.
Note that if no label is given after a certain character or word, the unlabeled position corresponds to "#0". For example, within a prosodic word that ends in "#3" in the second labeled text, each internal character can be further understood as being followed by "#0"; if the third labeled text carries "#2" at one of these internal positions, the difference between "#2" and "#0" is 2, so that "#2" is also a target label. A sketch of this screening rule follows.
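Building on the parse_annotations helper sketched above, the screening rule can be written in a few lines of Python. Reading the rule as an absolute difference of at least 2 (so that "#1" versus "#3" is flagged in either direction) is an interpretation of the example above, not a statement from the source.

```python
def find_target_annotations(second_labeled: str, third_labeled: str):
    """Flag positions where machine (a) and manual (b) levels differ by >= 2."""
    second = parse_annotations(second_labeled)  # machine-labeled (second) text
    third = parse_annotations(third_labeled)    # manually labeled (third) text
    return [
        (pos, char, a, b)                       # flagged for manual review
        for pos, ((char, a), (_, b)) in enumerate(zip(second, third))
        if abs(b - a) >= 2
    ]

# A position labeled "#1" by the machine but "#3" by the annotator is flagged:
print(find_target_annotations("今天#1天气#3", "今天#3天气#3"))  # [(1, '天', 1, 3)]
```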
306. The computer device corrects the target labels in response to an operation instruction of the user.
After the target labels are determined, the computer device displays them, and the user can correct the target labels through the computer device.
In this embodiment of the application, the manually labeled third labeled text may contain improper labels, and verifying the entire third labeled text manually would consume a lot of time. The computer device therefore determines the target labels in the third labeled text based on the second labeled text, and only the target labels are then verified manually, which improves verification efficiency.
With reference to fig. 5, the computer device that performs the prosodic information labeling method in the embodiment of the present application is described as follows:
as shown in fig. 5, the computer device 500 in the embodiment of the present application includes a processing unit 501.
The processing unit 501 is configured to obtain audio information and first text information, where the audio information and the first text information have a corresponding relationship.
The processing unit 501 is further configured to label prosodic words and prosodic phrases in the first text information to obtain a first labeled text, where the prosodic phrases in the first text information need to be labeled based on the audio information.
The processing unit 501 is further configured to label the intonation phrases in the first labeled text based on the prosodic words labeled in the first labeled text, the prosodic phrases labeled in the first labeled text, and the audio information, to obtain a second labeled text.
In one implementation,
the processing unit 501 is further configured to correct at least one of a prosodic word, a prosodic phrase, and an intonation phrase labeled in the second labeled text in response to an operation instruction of the user.
In one implementation,
the processing unit 501 is further configured to obtain a third labeled text, where the text information of the third labeled text is consistent with the text information of the second labeled text, while the labeling information in the third labeled text is different from the labeling information in the second labeled text; the labeling information is indicated by labels for indicating prosodic words, labels for indicating prosodic phrases, and labels for indicating intonation phrases.
The processing unit 501 is further configured to determine a target label in the third labeled text based on the labeling information of the second labeled text and the labeling information of the third labeled text, where the target label includes at least one of a label for indicating a prosodic word, a label for indicating a prosodic phrase, and a label for indicating an intonation phrase.
In one implementation,
the processing unit 501 is further configured to correct the target label in response to an operation instruction of a user.
In one implementation,
the processing unit 501 is specifically configured to obtain audio information.
The processing unit 501 is further configured to determine first text information corresponding to the audio information based on speech recognition.
Alternatively,
the processing unit 501 is specifically configured to obtain video information.
The processing unit 501 is further configured to obtain audio information in the video information and second text information corresponding to subtitle information in the video information.
The processing unit 501 is further configured to determine third text information corresponding to the audio information based on the speech recognition.
The processing unit 501 is further configured to determine the first text information based on the second text information and the third text information.
Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application, where the computer device 600 may include one or more Central Processing Units (CPUs) 601 and a memory 605, and the memory 605 stores one or more application programs or data therein.
The memory 605 may be volatile storage or persistent storage. The program stored in the memory 605 may include one or more modules, each of which may include a series of instruction operations on the computer device. Furthermore, the central processing unit 601 may be configured to communicate with the memory 605 and execute the series of instruction operations in the memory 605 on the computer device 600.
The computer device 600 may also include one or more power supplies 602, one or more wired or wireless network interfaces 603, one or more input/output interfaces 604, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The central processing unit 601 may perform the operations performed by the computer device in the embodiments shown in fig. 2 and fig. 3, which are not described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.

Claims (13)

1. A prosodic information labeling method is characterized by comprising the following steps:
a computer device obtains audio information and first text information, wherein the audio information and the first text information have a correspondence;
the computer device labels prosodic words and prosodic phrases in the first text information to obtain a first labeled text, wherein the prosodic phrases in the first labeled text need to be labeled based on the audio information;
and the computer device labels the intonation phrases in the first labeled text based on the prosodic words labeled in the first labeled text, the prosodic phrases labeled in the first labeled text, and the audio information, to obtain a second labeled text.
2. The method of claim 1, further comprising:
and the computer device corrects, in response to an operation instruction of a user, at least one of the prosodic words labeled in the second labeled text, the prosodic phrases labeled in the second labeled text, and the intonation phrases labeled in the second labeled text.
3. The method of claim 1, further comprising:
the computer device obtains a third labeled text, wherein the text information of the third labeled text is consistent with the text information of the second labeled text, the labeling information in the third labeled text is different from the labeling information in the second labeled text, and the labeling information is indicated by labels for indicating prosodic words, labels for indicating prosodic phrases, and labels for indicating intonation phrases;
and the computer device determines a target label in the third labeled text based on the labeling information of the second labeled text and the labeling information of the third labeled text, wherein the target label comprises at least one of the labels for indicating prosodic words, the labels for indicating prosodic phrases, and the labels for indicating intonation phrases.
4. The method of claim 3, further comprising:
and the computer device corrects the target label in response to an operation instruction of a user.
5. The method of any of claims 1-4, wherein the computer device obtaining audio information and first text information comprises:
the computer device obtains the audio information;
the computer device determines the first text information corresponding to the audio information based on speech recognition;
or,
the computer device obtains video information;
the computer device obtains the audio information in the video information and second text information corresponding to subtitle information in the video information;
the computer device determines third text information based on speech recognition and the audio information;
and the computer device determines the first text information based on the second text information and the third text information.
6. A computer device, comprising:
the processing unit is used for acquiring audio information and first text information, and the audio information and the first text information have a corresponding relation;
the processing unit is further configured to label prosodic words and prosodic phrases in the first text information to obtain a first labeled text, where the prosodic phrases in the first text information need to be labeled based on the audio information;
the processing unit is further configured to label the intonation phrases in the first labeled text based on the prosodic words labeled in the first labeled text, the prosodic phrases labeled in the first labeled text, and the audio information, so as to obtain a second labeled text.
7. The apparatus of claim 6,
the processing unit is further configured to correct at least one of the prosodic words labeled in the second labeled text, the prosodic phrases labeled in the second labeled text, and the intonation phrases labeled in the second labeled text in response to an operation instruction of a user.
8. The apparatus of claim 6,
the processing unit is further configured to obtain a third labeled text, wherein the text information of the third labeled text is consistent with the text information of the second labeled text, the labeling information in the third labeled text is different from the labeling information in the second labeled text, and the labeling information is indicated by labels for indicating prosodic words, labels for indicating prosodic phrases, and labels for indicating intonation phrases;
the processing unit is further configured to determine a target label in the third labeled text based on the labeling information of the second labeled text and the labeling information of the third labeled text, wherein the target label comprises at least one of the labels for indicating prosodic words, the labels for indicating prosodic phrases, and the labels for indicating intonation phrases.
9. The apparatus of claim 8,
the processing unit is further used for responding to an operation instruction of a user and correcting the target label.
10. The apparatus according to any one of claims 6 to 9,
the processing unit is specifically configured to acquire the audio information;
the processing unit is further configured to determine the first text information corresponding to the audio information based on speech recognition;
or,
the processing unit is specifically used for acquiring video information;
the processing unit is further configured to obtain the audio information in the video information and second text information corresponding to subtitle information in the video information;
the processing unit is further configured to determine third text information based on speech recognition and the audio information;
the processing unit is further configured to determine the first text information based on the second text information and the third text information.
11. A computer device comprising a processor coupled with a memory, the memory being configured to store instructions that, when executed by the processor, cause the computer device to perform the method of any of claims 1 to 5.
12. A computer program product comprising code which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 5.
13. A computer-readable storage medium having stored thereon computer instructions or a program, which when executed, cause a computer to perform the method of any of claims 1 to 5.
CN202111124499.3A 2021-09-24 2021-09-24 Prosodic information labeling method and related device Pending CN115862584A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111124499.3A CN115862584A (en) 2021-09-24 2021-09-24 Prosodic information labeling method and related device
PCT/CN2022/099389 WO2023045433A1 (en) 2021-09-24 2022-06-17 Prosodic information labeling method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111124499.3A CN115862584A (en) 2021-09-24 2021-09-24 Prosodic information labeling method and related device

Publications (1)

Publication Number Publication Date
CN115862584A (en) 2023-03-28

Family

ID=85653183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111124499.3A Pending CN115862584A (en) 2021-09-24 2021-09-24 Prosodic information labeling method and related device

Country Status (2)

Country Link
CN (1) CN115862584A (en)
WO (1) WO2023045433A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012178A (en) * 2023-07-31 2023-11-07 支付宝(杭州)信息技术有限公司 Prosody annotation data generation method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070055526A1 (en) * 2005-08-25 2007-03-08 International Business Machines Corporation Method, apparatus and computer program product providing prosodic-categorical enhancement to phrase-spliced text-to-speech synthesis
CN106601228B (en) * 2016-12-09 2020-02-04 百度在线网络技术(北京)有限公司 Sample labeling method and device based on artificial intelligence rhythm prediction
CN109326281B (en) * 2018-08-28 2020-01-07 北京海天瑞声科技股份有限公司 Rhythm labeling method, device and equipment
CN111226275A (en) * 2019-12-31 2020-06-02 深圳市优必选科技股份有限公司 Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction
CN111274807B (en) * 2020-02-03 2022-05-10 华为技术有限公司 Text information processing method and device, computer equipment and readable storage medium


Also Published As

Publication number Publication date
WO2023045433A1 (en) 2023-03-30


Legal Events

Date Code Title Description
PB01 Publication