WO2023045433A1 - Prosodic information labeling method and related device - Google Patents

Prosodic information labeling method and related device Download PDF

Info

Publication number
WO2023045433A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
information
prosodic
marked
phrases
Prior art date
Application number
PCT/CN2022/099389
Other languages
French (fr)
Chinese (zh)
Inventor
陈飞扬
李太松
陈珊珊
王喆锋
李明磊
怀宝兴
袁晶
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2023045433A1 publication Critical patent/WO2023045433A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Definitions

  • the embodiments of the present application relate to the field of speech synthesis, and in particular, to a prosodic information labeling method and related equipment.
  • Speech synthesis (text to speech, TTS) technology obtains natural and fluent speech based on text and the annotation information in the text.
  • the annotation information includes annotations indicating prosodic phrases, annotations indicating prosodic words, and annotations indicating intonation phrases; it can represent pauses and changes of tone in the text.
  • Speech synthesis technology inputs text containing annotation information into the model, so that the model outputs corresponding speech information. In order to make the speech information output by the model more natural, it is necessary to use a large amount of text containing annotation information to train the model.
  • Embodiments of the present application provide a prosodic information labeling method and related equipment, which are used to improve labeling efficiency.
  • the computer equipment acquires the corresponding audio information and the first text information, and marks prosodic words and prosodic phrases in the first text information to obtain the first marked text, wherein the prosodic phrases need to be marked based on the audio information.
  • the computer device marks the intonation phrases in the first text information based on the prosodic words and prosodic phrases marked in the first marked text, so as to obtain the second marked text.
  • the first text information and the corresponding audio information may be collected by the computer device over a network, so a large amount of data can be collected. The prosodic words, prosodic phrases and intonation phrases in the first text information are marked by the computer device rather than by manual annotation, which improves the labeling efficiency of prosodic words, prosodic phrases and intonation phrases; and because intonation phrases are labeled by combining the prosodic words and prosodic phrases, the accuracy of intonation phrase labeling is improved.
  • the computer device may also correct the prosodic words, prosodic phrases and intonation phrases marked in the second marked text in response to the user's operation instruction.
  • the user can also correct the related tags, so as to further improve the accuracy of tagging.
  • the computer device may also obtain a third annotated text, where the text information of the third annotated text is consistent with the text information of the second annotated text, the annotation information in the third annotated text differs from the annotation information in the second annotated text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases.
  • the computer device may also determine target annotations in the third annotated text based on the annotation information of the second annotated text and the annotation information of the third annotated text, where a target annotation includes at least one of an annotation indicating a prosodic word, an annotation indicating a prosodic phrase, and an annotation indicating an intonation phrase.
  • the third annotated text may be the first text information after prosodic words, prosodic phrases and intonation phrases have been annotated manually; the third annotated text can be checked against the second annotated text to determine target annotations that may have been mislabeled, so that the third annotated text can be corrected and the accuracy of annotation improved.
  • the computer device may also correct the target annotation in response to the user's operation instruction.
  • the computer device acquiring the audio information and the first text information may specifically involve first acquiring the audio information, and then acquiring, based on speech recognition technology, the first text information corresponding to the audio information.
  • alternatively, it may involve acquiring video information, then acquiring the audio information in the video information and second text information corresponding to the subtitle information in the video information, acquiring third text information corresponding to the audio information based on speech recognition technology, and determining the first text information based on the second text information and the third text information.
  • the second aspect of the embodiments of the present application provides a computer device, where the computer device includes a plurality of functional modules, and the plurality of functional modules interact to implement the method in the above first aspect and various implementation manners thereof.
  • Multiple functional modules can be implemented based on software, hardware, or a combination of software and hardware, and the multiple functional modules can be combined or divided arbitrarily based on specific implementations.
  • the third aspect of the embodiments of the present application provides a computer device, including a processor coupled with a memory, where the memory is used to store instructions; when the instructions are executed by the processor, the display device is caused to perform the method described in the first aspect.
  • the fourth aspect of the embodiments of the present application provides a computer program product, including codes, and when the codes are run on a computer, the computer is made to execute the method described in the aforementioned first aspect.
  • the fifth aspect of the embodiments of the present application provides a computer-readable storage medium on which computer programs or instructions are stored; when the computer programs or instructions are executed, the computer is caused to execute the method described in the first aspect.
  • FIG 1a and Figure 1b are schematic diagrams of the system architecture in the embodiment of the present application.
  • Fig. 2 is a schematic flow chart of the prosodic information labeling method in the embodiment of the present application
  • Fig. 3 is another schematic flow chart of the prosodic information labeling method in the embodiment of the present application.
  • Fig. 4 is a schematic diagram of labeling prosodic words, prosodic phrases and intonation phrases in the embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of a computer device in an embodiment of the present application.
  • FIG. 6 is another schematic structural diagram of a computer device in an embodiment of the present application.
  • Embodiments of the present application provide a prosodic information labeling method and related equipment, which are used to improve the efficiency of prosodic information labeling.
  • Syllable: the phonetic unit that is most easily distinguished by ear, and the most natural phonetic unit in the flow of speech.
  • Prosodic word: a group of syllables that are closely related in the actual flow of speech and are often pronounced together.
  • Prosodic phrase: a medium-sized rhythmic chunk between a prosodic word and an intonation phrase.
  • Intonation phrase: a phrase that connects several prosodic phrases according to a certain intonation pattern.
  • the embodiments of the present application can be applied to speech synthesis and text emotion recognition.
  • in speech synthesis, text that has been marked with prosodic phrases, prosodic words and intonation phrases is input into a model, and the model outputs fluent, natural speech; this is widely used in technologies such as audio novels, digital humans, voice assistants and smart speakers.
  • Speech synthesis requires a large amount of text already marked with prosodic phrases, prosodic words and intonation phrases to train the model, so that the model can output fluent and natural speech. If such annotated text can be obtained quickly, the training efficiency of the model can be improved, which in turn improves the naturalness and fluency of the output speech.
  • Emotion recognition, also known as sentiment analysis, is the process of analyzing, processing, summarizing and reasoning over subjective, emotionally colored text. It is widely applied in comment analysis and decision-making, e-commerce review classification and public opinion monitoring. Emotion recognition usually analyzes the text directly; if the prosodic phrases, prosodic words and intonation phrases in the text can be marked, the accuracy of emotion recognition can be improved.
  • the system architecture includes a data collection module, a labeling module, and a verification module.
  • the data acquisition module is used to collect video information or audio information on the network.
  • for video information, the data acquisition module extracts the audio information in the video information, determines corresponding second text information from the subtitles in the video information, and performs speech recognition on the audio information to determine third text information; if there is no difference between the second text information and the third text information, the second text information or the third text information is determined to be the first text information; if there is a difference, the data acquisition module can correct the second text information or the third text information based on the user's operation instruction, and the corrected text information is the first text information; for audio information, the first text information corresponding to the audio information is obtained directly through speech recognition technology.
  • it should be understood that the second text information, the third text information and the first text information may specifically take the form of document files; for example, the second text information may take the form of a document 1 including the second text information, the third text information may take the form of a document 2 including the third text information, and the first text information may take the form of a document 3 including the first text information.
  • the data acquisition module sends the audio information and the first text information to the labeling module, for example, sends the audio information and the document 3 to the labeling module.
  • the tagging module tags prosodic phrases and prosodic words in document 3 based on corresponding models, algorithms or rules combined with the audio information to obtain the first annotated text, then tags intonation phrases in the first annotated text based on the prosodic phrases and prosodic words tagged in the first annotated text combined with the audio information to obtain the second annotated text, and sends the second annotated text to the verification module.
  • the verification module corrects the second annotated text based on the user's operation instruction.
  • in one approach, the tagging module may first send document 3 with the prosodic words annotated to the verification module; after the verification module corrects the prosodic words marked in document 3 based on the user's operation instruction, the corrected document 3 is sent back to the tagging module for subsequent tagging, and the second annotated text is finally obtained.
  • the data acquisition module can also create a document 4 including the first text information and send document 4 and the audio information to the manual labeling module; the manual labeling module obtains the user's operation instructions to mark the prosodic phrases, prosodic words and intonation phrases in document 4, thereby obtaining the third annotated text.
  • the tagging module and the manual labeling module send the second annotated text and the third annotated text, respectively, to the screening module; the screening module compares the annotation information in the second annotated text and the third annotated text and determines the target annotations in the third annotated text. It should be noted that a target annotation can be understood as an annotation that may be wrong.
  • the screening module sends the third annotated text to the verification module, and the user corrects the target annotation through the verification module.
  • the modules shown in FIG. 1a and FIG. 1b may each be located on different computer devices, may all be located on the same computer device, or some may be located on the same computer device while the others are located on other computer devices; this is not specifically limited here.
  • FIG. 2 Please refer to FIG. 2 , the following is an introduction to a flow of the method for labeling prosodic information in the embodiment of the present application:
  • the computer device acquires audio information and first text information
  • the computer device collects video information and audio information by downloading from the network. For collected video information, the computer device extracts the audio information in the video information and recognizes the subtitles in the video information through optical character recognition, so as to determine corresponding second text information; since the subtitles in the video information correspond to the audio information, the second text information determined from the subtitles also corresponds to the audio information.
  • for collected audio information, the text information corresponding to the audio information is determined directly through speech recognition technology, and that text information is the first text information.
  • the computer equipment marks the prosodic words and prosodic phrases in the first text information to obtain the first marked text;
  • after acquiring the first text information, the computer device marks the prosodic words in the first text information based on corresponding models, algorithms or rules; for example, it can mark the prosodic words in the first text information according to a coarse-grained word segmentation model. Alternatively, after the computer device marks the prosodic words in the first text information based on corresponding models, algorithms or rules, it can also correct the marked prosodic words based on the user's operation instruction.
  • the computer equipment marks the prosodic phrases in the first text information according to the audio information.
  • the pronunciation duration of each character and the pitch of each character in the audio information can be extracted through a neural network or a machine learning algorithm, and the prosodic phrases in the first text information are marked according to this information.
  • the order of annotating prosodic words and prosodic phrases is not limited: prosodic words may be annotated first, or prosodic phrases may be annotated first.
  • the computer device annotates intonation phrases in the first annotated text based on the annotated prosodic words, prosodic phrases and audio information in the first annotated text, to obtain a second annotated text.
  • the computer device can combine the prosodic words and prosodic phrases marked in the first text with the pronunciation duration and pitch of each character in the audio information to mark the intonation phrases in the first text, thereby obtaining the second annotated text.
  • the computer device corrects at least one of prosodic words, prosodic phrases, and intonation phrases marked in the second marked text in response to the user's operation instruction.
  • after obtaining the second annotated text, the computer device corrects at least one of the prosodic words, prosodic phrases and intonation phrases marked in the second annotated text based on the user's operation instruction.
  • the first text information and the corresponding audio information may be collected by the computer device over a network, so a large amount of data can be collected. The prosodic words, prosodic phrases and intonation phrases in the first text information are marked by the computer device rather than by manual annotation, which improves the labeling efficiency of prosodic words, prosodic phrases and intonation phrases; and because intonation phrases are labeled by combining the prosodic words and prosodic phrases, the accuracy of intonation phrase labeling is improved.
  • after the prosodic words, prosodic phrases and intonation phrases in the first text information are annotated, the user can also correct, through the computer device, the prosodic words, prosodic phrases and intonation phrases marked in the first text information, where the user may be a relevant professional.
  • Steps 301 to 303 in this embodiment are similar to steps 201 to 203 in the embodiment shown in FIG. 2 , and will not be repeated here.
  • the computer device acquires the third annotated text
  • the computer device obtains the third annotated text, in which prosodic words, prosodic phrases and intonation phrases have already been marked; the text information of the third annotated text is consistent with the text information of the second annotated text, and the annotation information of the third annotated text differs from the annotation information of the second annotated text.
  • the text information includes only the textual content of the text, while the annotation information is embodied in the position of each annotation in the text and the type of each annotation, where the annotation types include annotations indicating prosodic word boundaries, annotations indicating prosodic phrases, and annotations indicating intonation phrases.
  • the second annotated text may be obtained by the computer device annotating the prosodic words, prosodic phrases and intonation phrases in document 3, which includes the first text information, and the third annotated text may be obtained by the computer device annotating, based on the user's operation instruction, the prosodic words, prosodic phrases and intonation phrases in document 4, which includes the first text information.
  • the execution order of step 304 is not limited here; it only needs to be executed after step 301 and before step 305.
  • the computer device determines the target annotation in the third annotated text based on the annotation information of the second annotated text and the annotation information of the third annotated text;
  • the computer device combines the annotation information of the second annotated text and the annotation information of the third annotated text, and analyzes the difference in the annotation information between the two texts.
  • Figure 4 is a schematic diagram of marking prosodic words, prosodic phrases and intonation phrases on text.
  • for example, if position 1 in the second annotated text is marked "#a", where "a" can be 0, 1, 2 or 3, and position 1 in the third annotated text is marked "#b", where "b" can be 0, 1, 2 or 3, and b minus a is greater than or equal to 2, this indicates that the annotation at position 1 may be erroneous, and the computer device determines this annotation as a target annotation.
  • the computer device corrects the target label in response to the user's operation instruction.
  • after the computer device determines the target annotation, it displays the target annotation, and the user can correct the target annotation through the computer device.
  • the computer device determines the target annotations in the third annotated text based on the second annotated text, and the target annotations are then checked manually, which improves the efficiency of the check.
  • a computer device 500 in this embodiment of the present application includes a processing unit 501 .
  • the processing unit 501 is configured to acquire audio information and first text information, where the audio information has a corresponding relationship with the first text information.
  • the processing unit 501 is further configured to mark prosodic words and prosodic phrases in the first text information to obtain the first marked text, and the prosodic phrases in the first text information need to be marked based on audio information.
  • the processing unit 501 is further configured to annotate intonation phrases in the first annotated text based on prosodic words annotated in the first annotated text, prosodic phrases annotated in the first annotated text, and audio information to obtain a second annotated text.
  • the processing unit 501 is further configured to correct at least one of the prosodic words, prosodic phrases and intonation phrases marked in the second marked text in response to the user's operation instruction.
  • the processing unit 501 is further configured to acquire a third annotated text, where the text information of the third annotated text is consistent with the text information of the second annotated text, the annotation information in the third annotated text differs from the annotation information in the second annotated text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases.
  • the processing unit 501 is further configured to determine target annotations in the third annotated text based on the annotation information of the second annotated text and the annotation information of the third annotated text, where a target annotation includes at least one of an annotation indicating a prosodic word, an annotation indicating a prosodic phrase, and an annotation indicating an intonation phrase.
  • the processing unit 501 is further configured to correct the target label in response to the user's operation instruction.
  • the processing unit 501 is specifically configured to acquire audio information.
  • the processing unit 501 is further configured to determine first text information corresponding to the audio information based on speech recognition.
  • the processing unit 501 is specifically configured to acquire video information.
  • the processing unit 501 is further configured to acquire audio information in the video information and second text information corresponding to subtitle information in the video information.
  • the processing unit 501 is further configured to determine third text information corresponding to the audio information based on speech recognition.
  • the processing unit 501 is further configured to determine the first text information based on the second text information and the third text information.
  • the computer device 600 may include one or more central processing units (CPU) 601 and a memory 605, and one or more applications or data are stored in the memory 605.
  • the storage 605 may be a volatile storage or a persistent storage.
  • the program stored in the memory 605 may include one or more modules, and each module may include a series of instructions to operate on the server.
  • the central processing unit 601 may be configured to communicate with the memory 605 , and execute a series of instruction operations in the memory 605 on the computer device 600 .
  • the computer device 600 can also include one or more power supplies 602, one or more wired or wireless network interfaces 603, one or more input and output interfaces 604, and/or, one or more operating systems, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the central processing unit 601 may perform the operations performed by the computer device in the foregoing embodiments shown in FIG. 2 and FIG. 3 , and details are not repeated here.
  • the disclosed system, device and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Provided is a prosodic information labeling method, comprising: a computer device acquiring audio information and first text information (201); the computer device labeling prosodic words and prosodic phrases in the first text information to obtain a first labeled text (202), wherein the prosodic phrases need to be labeled on the basis of the audio information; and the computer device labeling intonation phrases in the first labeled text on the basis of the prosodic words labeled in the first labeled text, the prosodic phrases labeled in the first labeled text, and the audio information, so as to obtain a second labeled text (203). Further provided are a computer device, a computer program product, and a computer-readable storage medium.

Description

Prosodic information labeling method and related device
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on September 24, 2021, with application number 202111124499.3 and entitled "Prosodic Information Labeling Method and Related Device", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the field of speech synthesis, and in particular, to a prosodic information labeling method and related device.
Background
Speech synthesis (text to speech, TTS) technology obtains natural and fluent speech information based on text and the annotation information in the text, where the annotation information includes annotations indicating prosodic phrases, annotations indicating prosodic words, and annotations indicating intonation phrases; the annotation information can represent pauses and changes of tone in the text. Speech synthesis technology inputs text containing annotation information into a model so that the model outputs corresponding speech information. To make the speech information output by the model more natural, a large amount of text containing annotation information is needed to train the model.
At present, prosodic phrases, prosodic words and intonation phrases are mainly annotated in text by professionals, which takes a long time; it is therefore difficult to obtain text containing annotation information, which is not conducive to quickly training the model to the desired effect.
Summary
Embodiments of the present application provide a prosodic information labeling method and related device, which are used to improve labeling efficiency.
A first aspect of the embodiments of the present application provides a prosodic information labeling method:
A computer device acquires audio information and first text information that correspond to each other, and marks prosodic words and prosodic phrases in the first text information to obtain a first annotated text, where the prosodic phrases need to be marked based on the audio information. After marking the prosodic words and prosodic phrases in the first text information, the computer device marks intonation phrases in the first text information based on the prosodic words and prosodic phrases marked in the first annotated text, thereby obtaining a second annotated text.
In this embodiment of the present application, the first text information and the corresponding audio information can be collected over a network, so a large amount of data can be collected. The prosodic words, prosodic phrases and intonation phrases in the first text information are marked by the computer device rather than by manual annotation, which improves the labeling efficiency of prosodic words, prosodic phrases and intonation phrases; and because intonation phrases are labeled by combining the prosodic words and prosodic phrases, the accuracy of intonation phrase labeling is improved.
In a possible implementation, the computer device may further correct, in response to a user's operation instruction, the prosodic words, prosodic phrases and intonation phrases marked in the second annotated text.
In this embodiment of the present application, after the computer device annotates the prosodic words, prosodic phrases and intonation phrases, the user can also correct the related annotations, thereby further improving the accuracy of annotation.
In a possible implementation, the computer device may further obtain a third annotated text, where the text information of the third annotated text is consistent with the text information of the second annotated text, the annotation information in the third annotated text differs from the annotation information in the second annotated text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases. The computer device may further determine target annotations in the third annotated text based on the annotation information of the second annotated text and the annotation information of the third annotated text, where a target annotation includes at least one of an annotation indicating a prosodic word, an annotation indicating a prosodic phrase, and an annotation indicating an intonation phrase.
In this embodiment of the present application, the third annotated text may be the first text information after prosodic words, prosodic phrases and intonation phrases have been annotated manually; the third annotated text can be checked against the second annotated text to determine target annotations that may have been mislabeled in the third annotated text, so that the third annotated text can be corrected and the accuracy of annotation improved.
In a possible implementation, the computer device may further correct the target annotation in response to a user's operation instruction.
In a possible implementation, the computer device acquiring the audio information and the first text information may specifically involve first acquiring the audio information, and then acquiring, based on speech recognition technology, the first text information corresponding to the audio information; or it may involve acquiring video information, then acquiring the audio information in the video information and second text information corresponding to the subtitle information in the video information, acquiring third text information corresponding to the audio information based on speech recognition technology, and determining the first text information based on the second text information and the third text information.
A second aspect of the embodiments of the present application provides a computer device, where the computer device includes a plurality of functional modules that interact with each other to implement the method of the first aspect and its implementations. The functional modules may be implemented in software, hardware, or a combination of software and hardware, and may be combined or divided arbitrarily based on the specific implementation.
A third aspect of the embodiments of the present application provides a computer device, including a processor coupled with a memory, where the memory is used to store instructions; when the instructions are executed by the processor, the display device is caused to perform the method of the first aspect.
A fourth aspect of the embodiments of the present application provides a computer program product including code which, when run on a computer, causes the computer to execute the method of the first aspect.
A fifth aspect of the embodiments of the present application provides a computer-readable storage medium on which computer programs or instructions are stored; when the computer programs or instructions are executed, the computer is caused to execute the method of the first aspect.
Brief Description of the Drawings
FIG. 1a and FIG. 1b are schematic diagrams of the system architecture in embodiments of the present application;
FIG. 2 is a schematic flowchart of the prosodic information labeling method in an embodiment of the present application;
FIG. 3 is another schematic flowchart of the prosodic information labeling method in an embodiment of the present application;
FIG. 4 is a schematic diagram of labeling prosodic words, prosodic phrases and intonation phrases in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a computer device in an embodiment of the present application;
FIG. 6 is another schematic structural diagram of a computer device in an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings. Apparently, the described embodiments are only some of the embodiments of the present application, not all of them. A person of ordinary skill in the art knows that, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
The terms "first", "second" and the like in the specification, claims and accompanying drawings of the present application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
Embodiments of the present application provide a prosodic information labeling method and related device, which are used to improve the efficiency of prosodic information labeling.
For ease of understanding, related concepts involved in the embodiments of the present application are first introduced below:
Syllable: the phonetic unit that is most easily distinguished by ear, and the most natural phonetic unit in the flow of speech.
Prosodic word: a group of syllables that are closely related in the actual flow of speech and are often pronounced together.
Prosodic phrase: a medium-sized rhythmic chunk between a prosodic word and an intonation phrase.
Intonation phrase: a phrase that connects several prosodic phrases according to a certain intonation pattern.
Embodiments of the present application can be applied to speech synthesis and text emotion recognition. In speech synthesis, text that has been marked with prosodic phrases, prosodic words and intonation phrases is input into a model, and the model outputs fluent, natural speech; this is widely used in technologies such as audio novels, digital humans, voice assistants and smart speakers. Speech synthesis requires a large amount of text already marked with prosodic phrases, prosodic words and intonation phrases to train the model, so that the model can output fluent and natural speech; if such annotated text can be obtained quickly, the training efficiency of the model can be improved, which in turn improves the naturalness and fluency of the output speech. Emotion recognition, also known as sentiment analysis, is the process of analyzing, processing, summarizing and reasoning over subjective, emotionally colored text. It is widely applied in comment analysis and decision-making, e-commerce review classification and public opinion monitoring. Emotion recognition usually analyzes the text directly; if the prosodic phrases, prosodic words and intonation phrases in the text can be marked, the accuracy of emotion recognition can be improved.
Embodiments of the present application can be applied to the system architecture shown in FIG. 1a or FIG. 1b, which are introduced separately below. As shown in FIG. 1a, the system architecture includes a data acquisition module, a labeling module and a verification module. The data acquisition module is used to collect video information or audio information from the network. For video information, it extracts the audio information in the video information, determines corresponding second text information from the subtitles in the video information, and performs speech recognition on the audio information to determine third text information. If there is no difference between the second text information and the third text information, the second text information or the third text information is determined to be the first text information; if there is a difference, the data acquisition module can correct the second text information or the third text information based on the user's operation instruction, and the corrected text information is the first text information. For audio information, the first text information corresponding to the audio information is obtained directly through speech recognition technology. It should be understood that the second text information, the third text information and the first text information may specifically take the form of document files; for example, the second text information may take the form of a document 1 including the second text information, the third text information may take the form of a document 2 including the third text information, and the first text information may take the form of a document 3 including the first text information.
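As an illustration of how the data acquisition module may decide on the first text information, the following minimal sketch compares the subtitle-derived text with the speech-recognition text and flags disagreements for manual correction. The data structure and function names are illustrative assumptions, not APIs named in the application.

```python
# Minimal sketch of the data-acquisition consistency check described above.
# The upstream extraction steps (audio extraction, subtitle OCR, speech
# recognition) are assumed to have run already and are not shown here.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Sample:
    audio_path: str               # audio extracted from the video, or collected directly
    second_text: Optional[str]    # text recovered from subtitles via OCR (video only)
    third_text: str               # text recovered from the audio via speech recognition


def determine_first_text(sample: Sample) -> Tuple[str, bool]:
    """Return (first_text, needs_manual_correction)."""
    if sample.second_text is None:
        # Audio-only sample: the speech-recognition result is used directly.
        return sample.third_text, False
    if sample.second_text.strip() == sample.third_text.strip():
        # Subtitle text and speech-recognition text agree: either one can serve
        # as the first text information.
        return sample.second_text, False
    # They disagree: keep the speech-recognition text provisionally and flag the
    # sample so a user can correct it before labeling.
    return sample.third_text, True
```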
The data acquisition module sends the audio information and the first text information to the labeling module, for example, sends the audio information and document 3 to the labeling module. The labeling module annotates prosodic phrases and prosodic words in document 3 based on corresponding models, algorithms or rules combined with the audio information to obtain a first annotated text, then annotates intonation phrases in the first annotated text based on the prosodic phrases and prosodic words annotated in the first annotated text combined with the audio information to obtain a second annotated text, and sends the second annotated text to the verification module. The verification module corrects the second annotated text based on the user's operation instruction. It should be noted that, in one approach, the labeling module may first send document 3 with the prosodic words annotated to the verification module; after the verification module corrects the prosodic words marked in document 3 based on the user's operation instruction, the corrected document 3 is sent back to the labeling module for subsequent annotation, and the second annotated text is finally obtained.
As shown in FIG. 1b, the data acquisition module can also create a document 4 including the first text information and send document 4 and the audio information to a manual labeling module; the manual labeling module obtains the user's operation instructions to mark the prosodic phrases, prosodic words and intonation phrases in document 4, thereby obtaining a third annotated text. The labeling module and the manual labeling module send the second annotated text and the third annotated text, respectively, to a screening module; the screening module compares the annotation information in the second annotated text and the third annotated text and determines the target annotations in the third annotated text. It should be noted that a target annotation can be understood as an annotation that may be wrong. After the target annotations are determined, the screening module sends the third annotated text to the verification module, and the user corrects the target annotations through the verification module.
It should be noted that the modules shown in FIG. 1a and FIG. 1b may each be located on different computer devices, may all be located on the same computer device, or some may be located on the same computer device while the others are located on other computer devices; this is not specifically limited here.
Referring to FIG. 2, a flow of the prosodic information labeling method in an embodiment of the present application is introduced below:
201. The computer device acquires audio information and first text information.
The computer device collects video information and audio information by downloading from the network. For collected video information, the computer device extracts the audio information in the video information and recognizes the subtitles in the video information through optical character recognition, so as to determine corresponding second text information; since the subtitles in the video information correspond to the audio information, the second text information determined from the subtitles also corresponds to the audio information. Speech recognition is then performed on the audio information to determine third text information. If there is no difference between the second text information and the third text information, the second text information or the third text information is determined to be the first text information; if there is a difference, the user can correct the second text information or the third text information on the computer device, and the corrected text information is the first text information. For collected audio information, the text information corresponding to the audio information is determined directly through speech recognition technology, and that text information is the first text information.
202. The computer device marks prosodic words and prosodic phrases in the first text information to obtain a first annotated text.
After acquiring the first text information, the computer device marks the prosodic words in the first text information based on corresponding models, algorithms or rules; for example, it can mark the prosodic words in the first text information according to a coarse-grained word segmentation model. Alternatively, after the computer device marks the prosodic words in the first text information based on corresponding models, algorithms or rules, it can also correct the marked prosodic words based on the user's operation instruction.
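The following sketch illustrates how prosodic-word boundaries could be derived from a coarse-grained word segmenter. The `segment` callable is a placeholder for whatever segmentation model an implementation uses; it is not an API named in the application.

```python
# Illustrative sketch of prosodic-word marking via coarse-grained word
# segmentation. `segment` is a stand-in for the segmentation model.
from typing import Callable, List


def mark_prosodic_words(text: str, segment: Callable[[str], List[str]]) -> List[int]:
    """Return character indices after which a prosodic-word boundary is marked."""
    boundaries = []
    pos = 0
    for word in segment(text):
        pos += len(word)
        boundaries.append(pos - 1)  # boundary sits after the last character of the word
    return boundaries


# Example with a toy segmenter that splits every two characters:
toy_segment = lambda s: [s[i:i + 2] for i in range(0, len(s), 2)]
print(mark_prosodic_words("不断有传言称", toy_segment))  # [1, 3, 5]
```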
The computer device then marks the prosodic phrases in the first text information according to the audio information. Specifically, the pronunciation duration of each character and the pitch of each character in the audio information can be extracted through a neural network or a machine learning algorithm, and the prosodic phrases in the first text information are marked according to this information.
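The sketch below assumes that per-character durations and pitch values have already been extracted from the audio (for example by a forced aligner); the pause and pitch thresholds, and the boundary rule itself, are illustrative assumptions rather than rules stated in the application.

```python
# Illustrative rule for prosodic-phrase marking: a prosodic-phrase boundary is
# placed at a prosodic-word boundary that is followed by a long pause or an
# upward pitch reset. Thresholds are assumptions for illustration only.
from typing import List


def mark_prosodic_phrases(durations: List[float],      # seconds per character
                          pitches: List[float],        # mean F0 per character (Hz)
                          word_boundaries: List[int],  # indices after which a prosodic word ends
                          pause_threshold: float = 0.25,
                          pitch_reset: float = 30.0) -> List[int]:
    """Return character indices after which a prosodic-phrase boundary is marked."""
    phrase_boundaries = []
    for i in word_boundaries:                # phrase breaks only occur at word breaks
        if i + 1 >= len(durations):
            continue
        long_pause = durations[i] > pause_threshold
        reset = (pitches[i + 1] - pitches[i]) > pitch_reset
        if long_pause or reset:
            phrase_boundaries.append(i)
    return phrase_boundaries
```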
It should be noted that the order of annotating prosodic words and prosodic phrases is not limited: prosodic words may be annotated first, or prosodic phrases may be annotated first.
203. The computer device annotates intonation phrases in the first annotated text based on the prosodic words and prosodic phrases annotated in the first annotated text and the audio information, to obtain a second annotated text.
Because intonation phrases are strongly related to prosodic words and prosodic phrases, the computer device can combine the prosodic words and prosodic phrases marked in the first text with the pronunciation duration and pitch of each character in the audio information to mark the intonation phrases in the first text, thereby obtaining the second annotated text.
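Continuing the sketch above, intonation-phrase boundaries can be selected from the prosodic-phrase boundaries using longer pauses and larger pitch falls; the thresholds are again illustrative assumptions, not values given in the application.

```python
# Illustrative rule for intonation-phrase marking: promote a prosodic-phrase
# boundary to an intonation-phrase boundary when it coincides with a very long
# pause or a strong pitch fall. Thresholds are assumptions for illustration.
from typing import List


def mark_intonation_phrases(durations: List[float],
                            pitches: List[float],
                            phrase_boundaries: List[int],
                            long_pause: float = 0.5,
                            large_fall: float = 60.0) -> List[int]:
    """Return character indices after which an intonation-phrase boundary is marked."""
    intonation_boundaries = []
    for i in phrase_boundaries:              # intonation breaks only occur at phrase breaks
        very_long_pause = durations[i] > long_pause
        strong_fall = i > 0 and (pitches[i - 1] - pitches[i]) > large_fall
        if very_long_pause or strong_fall:
            intonation_boundaries.append(i)
    return intonation_boundaries
```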
204、计算机设备响应于用户的操作指令,对第二标注后文本中标注的韵律词、韵律短语以及语调短语中的至少一种进行校正。204. The computer device corrects at least one of prosodic words, prosodic phrases, and intonation phrases marked in the second marked text in response to the user's operation instruction.
得到第二标注后文本之后,计算机设备基于用户的操作指令对第二标注后文本中标注的韵律词、韵律短语以及语调短语中的至少一种进行校正。After obtaining the second marked text, the computer device corrects at least one of the prosodic words, prosodic phrases and intonation phrases marked in the second marked text based on the user's operation instruction.
本申请实施例中,计算机设备收集第一文本信息以及对应的音频信息的方式可以是在网络收集,因此可以收集到较多的数据量。并由计算机设备标注第一文本信息中的韵律词、韵律短语以及语调短语,不再需要通过人工标注的方式,从而提高韵律词、韵律短语以及语调短语的标注效率,并且由于结合了韵律词以及韵律短语标注语调短语,提高了语调短语标注的准确性。In the embodiment of the present application, the manner in which the computer device collects the first text information and the corresponding audio information may be collected on the network, so a large amount of data may be collected. And mark the prosodic words, prosodic phrases and intonation phrases in the first text information by computer equipment, no longer need to pass through the mode of artificial labeling, thereby improve the labeling efficiency of prosodic words, prosodic phrases and intonation phrases, and because combine prosodic words and Prosodic phrases are used to mark intonation phrases, which improves the accuracy of intonation phrase labeling.
Optionally, on the basis of the embodiment shown in FIG. 2, after the prosodic words, prosodic phrases, and intonation phrases in the first text information are marked, the user may further correct, by means of the computer device, the prosodic words, prosodic phrases, and intonation phrases marked in the first text information, where the user may be a relevant professional.
Referring to FIG. 3, another procedure of the prosodic information labeling method in the embodiments of the present application is described below.
Steps 301 to 303 in this embodiment are similar to steps 201 to 203 in the embodiment shown in FIG. 2 and are not described again here.
304. The computer device acquires a third marked text.
The computer device acquires a third marked text in which prosodic words, prosodic phrases, and intonation phrases have already been marked; the text information of the third marked text is consistent with the text information of the second marked text, while the annotation information of the third marked text differs from that of the second marked text. It can be understood that the text information includes only the characters of the text, whereas the annotation information consists of the position of each annotation in the text and the type of each annotation, the types including annotations indicating prosodic word boundaries, annotations indicating prosodic phrases, and annotations indicating intonation phrases.
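As a purely illustrative sketch, the annotation information described here could be represented as a list of (position, type) records; the names below are assumptions, not terms used by this application.

```python
from dataclasses import dataclass

# Annotation types referred to in this description.
PROSODIC_WORD, PROSODIC_PHRASE, INTONATION_PHRASE = 1, 2, 3

@dataclass
class Annotation:
    position: int  # index of the character after which the mark appears
    level: int     # PROSODIC_WORD, PROSODIC_PHRASE, or INTONATION_PHRASE
```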
It should be understood that, in this embodiment, the second marked text may be obtained by the computer device marking the prosodic words, prosodic phrases, and intonation phrases in document 3, which contains the first text information, and the third marked text may be obtained by the computer device marking, based on a user's operation instructions, the prosodic words, prosodic phrases, and intonation phrases in document 4, which also contains the first text information.
It should be noted that the execution order of step 304 is not limited here; it only needs to be executed after step 301 and before step 305.
305. The computer device determines a target annotation in the third marked text based on the annotation information of the second marked text and the annotation information of the third marked text.
The computer device combines the annotation information of the second marked text with that of the third marked text and analyzes the differences between the two. Referring to FIG. 4, FIG. 4 is a schematic diagram of marking prosodic words, prosodic phrases, and intonation phrases in a text. As shown in FIG. 4, in one implementation, an annotation indicating a prosodic word boundary, a prosodic phrase, or an intonation phrase may be appended after a character in a sentence. For example, "#1" marks the character or word preceding it (and following the previous "#1", "#2", or "#3") as a prosodic word; "#2" likewise marks the character or word preceding it (and following the previous "#1") as a prosodic word, and additionally marks all prosodic words preceding it back to the previous "#2" or "#3" as a prosodic phrase; "#3" likewise marks the character or word preceding it (and following the previous "#1") as a prosodic word, marks all prosodic words preceding it back to the previous "#2" as a prosodic phrase, and marks all prosodic phrases or prosodic words preceding it back to the previous "#3" as an intonation phrase. For example, if the sentence "不断有传言称技术专家现身北京" ("there are constant rumors that technical experts have appeared in Beijing") is annotated as "不断#1有#1传言称#3技术#1专家#1现身北京#3", this indicates that "不断", "有", "传言称", "技术", "专家", "现身", and "北京" are each prosodic words, that "不断有传言称", "技术专家", and "现身北京" are each prosodic phrases, and that "不断有传言称" and "技术专家现身北京" are each intonation phrases.
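The following Python sketch, given only as an illustration and not as this application's implementation, parses a sentence annotated in this "#1/#2/#3" style into prosodic words, prosodic phrases, and intonation phrases. Note that the demo string adds the internal "#2" and "#1" markers needed for the output to reproduce the word and phrase split listed above.

```python
import re
from typing import List, Tuple

def parse_prosody(annotated: str) -> Tuple[List[str], List[str], List[str]]:
    """Parse a '#1/#2/#3'-annotated sentence into prosody units.

    Returns (prosodic_words, prosodic_phrases, intonation_phrases).
    Characters not followed by a marker behave like '#0' (no boundary).
    """
    pieces = re.findall(r"([^#]+)#([0-3])", annotated)
    words, phrases, intonations = [], [], []
    word_buf = phrase_buf = intonation_buf = ""
    for text, level in pieces:
        level = int(level)
        word_buf += text
        phrase_buf += text
        intonation_buf += text
        if level >= 1:
            words.append(word_buf)
            word_buf = ""
        if level >= 2:
            phrases.append(phrase_buf)
            phrase_buf = ""
        if level >= 3:
            intonations.append(intonation_buf)
            intonation_buf = ""
    return words, phrases, intonations

w, p, i = parse_prosody("不断#1有#1传言称#3技术#1专家#2现身#1北京#3")
# w -> ['不断', '有', '传言称', '技术', '专家', '现身', '北京']
# p -> ['不断有传言称', '技术专家', '现身北京']
# i -> ['不断有传言称', '技术专家现身北京']
```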
In one case, the annotation of the second marked text at position 1 is "#a", where a may be 0, 1, 2, or 3, and the annotation of the third marked text at position 1 is "#b", where b may be 0, 1, 2, or 3. If b minus a is greater than or equal to 2, the annotation at position 1 may be erroneous, and the computer device determines the annotation at that position as a target annotation. To illustrate with the above example, if the annotation information of a sentence in the second marked text is "不断#1有#1传言称#3技术#1专家#1现身北京#3" and the annotation information of the same sentence in the third marked text is "不断#3有#1传言称#3技术#1专家#1现身北京#1", then clearly, in the third marked text, the annotations at "不断#3" and "现身北京#1" are target annotations.
It should be noted that, if no annotation follows a character or word, the unmarked position corresponds to "#0". For example, "现身北京#3" in the second marked text may be further understood as "现#0身#0北#0京#3"; if the annotation information of the third marked text at that position is "现#2身北京#3", then "现#2" is also a target annotation.
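A minimal sketch of this comparison follows; the threshold of 2 comes from the case above, the symmetric (absolute) difference is inferred from the examples, and everything else is an assumption.

```python
import re
from typing import List

def annotation_levels(annotated: str, text_len: int) -> List[int]:
    """Expand an annotated string into one level (0-3) per character.

    Unmarked positions stay at level 0, matching the '#0' convention.
    """
    levels, pos = [0] * text_len, 0
    for chunk, level in re.findall(r"([^#]+)#([0-3])", annotated):
        pos += len(chunk)
        levels[pos - 1] = int(level)
    return levels

def find_target_positions(second: str, third: str, text_len: int) -> List[int]:
    """Flag character positions whose annotation levels differ by 2 or more.

    The examples above treat a difference in either direction as suspicious,
    so the absolute difference is used here.
    """
    a = annotation_levels(second, text_len)
    b = annotation_levels(third, text_len)
    return [i for i in range(text_len) if abs(b[i] - a[i]) >= 2]

# Using the 14-character sentence above:
find_target_positions("不断#1有#1传言称#3技术#1专家#1现身北京#3",
                      "不断#3有#1传言称#3技术#1专家#1现身北京#1", 14)
# -> [1, 13], i.e. the positions of "不断#3" and "现身北京#1" in the third text
```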
306. The computer device corrects the target annotation in response to the user's operation instruction.
After determining the target annotation, the computer device displays it, and the user can correct the target annotation through the computer device.
In this embodiment of the present application, the manually annotated third marked text may contain improper annotations, and verifying the third marked text entirely by hand would take a great deal of time. The computer device therefore determines the target annotations in the third marked text based on the second marked text, and the target annotations are then verified manually, which improves verification efficiency.
The prosodic information labeling method in the embodiments of the present application has been described above. Referring to FIG. 5, the computer device in the embodiments of the present application is described below.
As shown in FIG. 5, the computer device 500 in this embodiment of the present application includes a processing unit 501.
The processing unit 501 is configured to acquire audio information and first text information, where the audio information corresponds to the first text information.
The processing unit 501 is further configured to mark the prosodic words and prosodic phrases in the first text information to obtain a first marked text, where the prosodic phrases in the first text information need to be marked based on the audio information.
The processing unit 501 is further configured to mark the intonation phrases in the first marked text based on the prosodic words marked in the first marked text, the prosodic phrases marked in the first marked text, and the audio information, to obtain a second marked text.
In one implementation,
the processing unit 501 is further configured to correct, in response to a user's operation instruction, at least one of the prosodic words, prosodic phrases, and intonation phrases marked in the second marked text.
In one implementation,
the processing unit 501 is further configured to acquire a third marked text, where the text information of the third marked text is consistent with the text information of the second marked text, the annotation information in the third marked text differs from the annotation information in the second marked text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases.
The processing unit 501 is further configured to determine a target annotation in the third marked text based on the annotation information of the second marked text and the annotation information of the third marked text, where the target annotation includes at least one of an annotation indicating a prosodic word, an annotation indicating a prosodic phrase, and an annotation indicating an intonation phrase.
In one implementation,
the processing unit 501 is further configured to correct the target annotation in response to a user's operation instruction.
In one implementation,
the processing unit 501 is specifically configured to acquire the audio information.
The processing unit 501 is further configured to determine the first text information corresponding to the audio information based on speech recognition.
Or,
the processing unit 501 is specifically configured to acquire video information.
The processing unit 501 is further configured to acquire the audio information in the video information and second text information corresponding to subtitle information in the video information.
The processing unit 501 is further configured to determine third text information corresponding to the audio information based on speech recognition.
The processing unit 501 is further configured to determine the first text information based on the second text information and the third text information.
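One conceivable way to determine the first text information from the subtitle text (second text information) and the speech-recognition transcript (third text information), which this application does not specify, is to keep only the subtitle lines that the transcript closely confirms; a minimal sketch:

```python
from difflib import SequenceMatcher
from typing import List

def determine_first_text(subtitle_lines: List[str],
                         asr_lines: List[str],
                         min_ratio: float = 0.9) -> str:
    """Keep a subtitle line only when the ASR transcript closely confirms it.

    How the two sources are actually reconciled is not specified in this
    application; the 0.9 similarity threshold is an assumption.
    """
    kept = []
    for line in subtitle_lines:
        best = max((SequenceMatcher(None, line, cand).ratio()
                    for cand in asr_lines), default=0.0)
        if best >= min_ratio:
            kept.append(line)
    return "\n".join(kept)
```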
FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 600 may include one or more central processing units (CPUs) 601 and a memory 605, and one or more applications or sets of data are stored in the memory 605.
The memory 605 may be volatile storage or persistent storage. A program stored in the memory 605 may include one or more modules, and each module may include a series of instruction operations on the server. Further, the central processing unit 601 may be configured to communicate with the memory 605 and to execute, on the computer device 600, the series of instruction operations in the memory 605.
The computer device 600 may further include one or more power supplies 602, one or more wired or wireless network interfaces 603, one or more input/output interfaces 604, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
The central processing unit 601 may perform the operations performed by the computer device in the embodiments shown in FIG. 2 and FIG. 3, and details are not repeated here.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a logical functional division; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (13)

  1. A prosodic information labeling method, characterized in that the method comprises:
    acquiring, by a computer device, audio information and first text information, wherein the audio information corresponds to the first text information;
    marking, by the computer device, prosodic words and prosodic phrases in the first text information to obtain a first marked text, wherein the prosodic phrases in the first marked text need to be marked based on the audio information; and
    marking, by the computer device, intonation phrases in the first marked text based on the prosodic words marked in the first marked text, the prosodic phrases marked in the first marked text, and the audio information, to obtain a second marked text.
  2. The method according to claim 1, characterized in that the method further comprises:
    correcting, by the computer device in response to a user's operation instruction, at least one of the prosodic words marked in the second marked text, the prosodic phrases marked in the second marked text, and the intonation phrases marked in the second marked text.
  3. The method according to claim 1, characterized in that the method further comprises:
    acquiring, by the computer device, a third marked text, wherein text information of the third marked text is consistent with text information of the second marked text, annotation information in the third marked text differs from annotation information in the second marked text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases; and
    determining, by the computer device, a target annotation in the third marked text based on the annotation information of the second marked text and the annotation information of the third marked text, wherein the target annotation comprises at least one of the annotations indicating prosodic words, the annotations indicating prosodic phrases, and the annotations indicating intonation phrases.
  4. The method according to claim 3, characterized in that the method further comprises:
    correcting, by the computer device, the target annotation in response to a user's operation instruction.
  5. The method according to any one of claims 1 to 4, characterized in that the acquiring, by the computer device, audio information and first text information comprises:
    acquiring, by the computer device, the audio information; and
    determining, by the computer device, based on speech recognition, the first text information corresponding to the audio information;
    or,
    acquiring, by the computer device, video information;
    acquiring, by the computer device, the audio information in the video information and second text information corresponding to subtitle information in the video information;
    determining, by the computer device, third text information based on speech recognition and the audio information; and
    determining, by the computer device, the first text information based on the second text information and the third text information.
  6. A computer device, characterized in that it comprises:
    a processing unit, configured to acquire audio information and first text information, wherein the audio information corresponds to the first text information;
    wherein the processing unit is further configured to mark prosodic words and prosodic phrases in the first text information to obtain a first marked text, and the prosodic phrases in the first text information need to be marked based on the audio information; and
    the processing unit is further configured to mark intonation phrases in the first marked text based on the prosodic words marked in the first marked text, the prosodic phrases marked in the first marked text, and the audio information, to obtain a second marked text.
  7. The device according to claim 6, characterized in that
    the processing unit is further configured to correct, in response to a user's operation instruction, at least one of the prosodic words marked in the second marked text, the prosodic phrases marked in the second marked text, and the intonation phrases marked in the second marked text.
  8. The device according to claim 6, characterized in that
    the processing unit is further configured to acquire a third marked text, wherein text information of the third marked text is consistent with text information of the second marked text, annotation information in the third marked text differs from annotation information in the second marked text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases; and
    the processing unit is further configured to determine a target annotation in the third marked text based on the annotation information of the second marked text and the annotation information of the third marked text, wherein the target annotation comprises at least one of the annotations indicating prosodic words, the annotations indicating prosodic phrases, and the annotations indicating intonation phrases.
  9. The device according to claim 8, characterized in that
    the processing unit is further configured to correct the target annotation in response to a user's operation instruction.
  10. The device according to any one of claims 6 to 9, characterized in that
    the processing unit is specifically configured to acquire the audio information;
    the processing unit is further configured to determine, based on speech recognition, the first text information corresponding to the audio information;
    or,
    the processing unit is specifically configured to acquire video information;
    the processing unit is further configured to acquire the audio information in the video information and second text information corresponding to subtitle information in the video information;
    the processing unit is further configured to determine third text information based on speech recognition and the audio information; and
    the processing unit is further configured to determine the first text information based on the second text information and the third text information.
  11. A computer device, characterized in that it comprises a processor, wherein the processor is coupled to a memory, the memory is configured to store instructions, and when the instructions are executed by the processor, the computer device is caused to perform the method according to any one of claims 1 to 5.
  12. A computer program product, comprising code, characterized in that, when the code is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 5.
  13. A computer-readable storage medium having computer instructions or a program stored thereon, characterized in that, when the computer instructions or the program are executed, a computer is caused to perform the method according to any one of claims 1 to 5.
PCT/CN2022/099389 2021-09-24 2022-06-17 Prosodic information labeling method and related device WO2023045433A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111124499.3 2021-09-24
CN202111124499.3A CN115862584A (en) 2021-09-24 2021-09-24 Rhythm information labeling method and related equipment

Publications (1)

Publication Number Publication Date
WO2023045433A1 true WO2023045433A1 (en) 2023-03-30

Family

ID=85653183

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099389 WO2023045433A1 (en) 2021-09-24 2022-06-17 Prosodic information labeling method and related device

Country Status (2)

Country Link
CN (1) CN115862584A (en)
WO (1) WO2023045433A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012178A (en) * 2023-07-31 2023-11-07 支付宝(杭州)信息技术有限公司 Prosody annotation data generation method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070055526A1 (en) * 2005-08-25 2007-03-08 International Business Machines Corporation Method, apparatus and computer program product providing prosodic-categorical enhancement to phrase-spliced text-to-speech synthesis
CN106601228A (en) * 2016-12-09 2017-04-26 百度在线网络技术(北京)有限公司 Sample marking method and device based on artificial intelligence prosody prediction
CN109326281A (en) * 2018-08-28 2019-02-12 北京海天瑞声科技股份有限公司 Prosodic labeling method, apparatus and equipment
CN111226275A (en) * 2019-12-31 2020-06-02 深圳市优必选科技股份有限公司 Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction
CN111274807A (en) * 2020-02-03 2020-06-12 华为技术有限公司 Text information processing method and device, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN115862584A (en) 2023-03-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871494

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE