WO2023045433A1 - Prosodic information labeling method and related device - Google Patents

Prosodic information labeling method and related device Download PDF

Info

Publication number
WO2023045433A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
information
prosodic
marked
phrases
Prior art date
Application number
PCT/CN2022/099389
Other languages
French (fr)
Chinese (zh)
Inventor
陈飞扬
李太松
陈珊珊
王喆锋
李明磊
怀宝兴
袁晶
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为云计算技术有限公司
Publication of WO2023045433A1 publication Critical patent/WO2023045433A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Definitions

  • the embodiments of the present application relate to the field of speech synthesis, and in particular, to a prosodic information labeling method and related equipment.
  • Speech synthesis (text to speech, TTS) technology obtains natural and fluent speech based on text and the annotation information in the text.
  • the annotation information includes annotations indicating prosodic phrases, annotations indicating prosodic words, and annotations indicating intonation phrases; it can represent pauses and changes of tone in the text.
  • Speech synthesis technology inputs text containing annotation information into the model, so that the model outputs corresponding speech information. In order to make the speech information output by the model more natural, it is necessary to use a large amount of text containing annotation information to train the model.
  • Embodiments of the present application provide a prosodic information labeling method and related equipment, which are used to improve labeling efficiency.
  • the computer equipment acquires the corresponding audio information and the first text information, and marks prosodic words and prosodic phrases in the first text information to obtain the first marked text, wherein the prosodic phrases need to be marked based on the audio information.
  • the computer device marks the intonation phrases in the first text information based on the prosodic words and prosodic phrases marked in the first marked text, so as to obtain the second marked text.
  • the first text information and the corresponding audio information may be collected by the computer device over a network, so a large amount of data can be collected. The prosodic words, prosodic phrases and intonation phrases in the first text information are marked by the computer device rather than by manual annotation, which improves the labeling efficiency of prosodic words, prosodic phrases and intonation phrases; and because intonation phrases are labeled by combining the prosodic words and prosodic phrases, the accuracy of intonation phrase labeling is improved.
  • the computer device may also correct the prosodic words, prosodic phrases and intonation phrases marked in the second marked text in response to the user's operation instruction.
  • the user can also correct the related tags, so as to further improve the accuracy of tagging.
  • the computer device may also obtain a third annotated text, where the text information of the third annotated text is consistent with the text information of the second annotated text, the annotation information in the third annotated text differs from the annotation information in the second annotated text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases.
  • the computer device may also determine target annotations in the third annotated text based on the annotation information of the second annotated text and the annotation information of the third annotated text, where a target annotation includes at least one of an annotation indicating a prosodic word, an annotation indicating a prosodic phrase, and an annotation indicating an intonation phrase.
  • the third annotated text may be the first text information after prosodic words, prosodic phrases and intonation phrases have been annotated manually; the third annotated text can be checked against the second annotated text to determine target annotations that may have been mislabeled, so that the third annotated text can be corrected and the accuracy of annotation improved.
  • the computer device may also correct the target annotation in response to the user's operation instruction.
  • the computer device acquiring the audio information and the first text information may specifically involve first acquiring the audio information, and then acquiring, based on speech recognition technology, the first text information corresponding to the audio information.
  • alternatively, it may involve acquiring video information, then acquiring the audio information in the video information and second text information corresponding to the subtitle information in the video information, acquiring third text information corresponding to the audio information based on speech recognition technology, and determining the first text information based on the second text information and the third text information.
  • the second aspect of the embodiments of the present application provides a computer device, where the computer device includes a plurality of functional modules, and the plurality of functional modules interact to implement the method in the above first aspect and various implementation manners thereof.
  • Multiple functional modules can be implemented based on software, hardware, or a combination of software and hardware, and the multiple functional modules can be combined or divided arbitrarily based on specific implementations.
  • the third aspect of the embodiments of the present application provides a computer device, including a processor coupled with a memory, where the memory is used to store instructions; when the instructions are executed by the processor, the display device is caused to perform the method described in the first aspect.
  • the fourth aspect of the embodiments of the present application provides a computer program product, including codes, and when the codes are run on a computer, the computer is made to execute the method described in the aforementioned first aspect.
  • the fifth aspect of the embodiments of the present application provides a computer-readable storage medium on which computer programs or instructions are stored; when the computer programs or instructions are executed, the computer is caused to execute the method described in the first aspect.
  • FIG 1a and Figure 1b are schematic diagrams of the system architecture in the embodiment of the present application.
  • Fig. 2 is a schematic flow chart of the prosodic information labeling method in the embodiment of the present application
  • Fig. 3 is another schematic flow chart of the prosodic information labeling method in the embodiment of the present application.
  • Fig. 4 is a schematic diagram of labeling prosodic words, prosodic phrases and intonation phrases in the embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of a computer device in an embodiment of the present application.
  • FIG. 6 is another schematic structural diagram of a computer device in an embodiment of the present application.
  • Embodiments of the present application provide a prosodic information labeling method and related equipment, which are used to improve the efficiency of prosodic information labeling.
  • Syllable: the phonetic unit that is most easily distinguished by ear, and the most natural phonetic unit in the flow of speech.
  • Prosodic word: a group of syllables that are closely related in the actual flow of speech and are often pronounced together.
  • Prosodic phrase: a medium-sized rhythmic chunk between a prosodic word and an intonation phrase.
  • Intonation phrase: a phrase that connects several prosodic phrases according to a certain intonation pattern.
  • the embodiments of the present application can be applied to speech synthesis and text emotion recognition.
  • in speech synthesis, text that has been marked with prosodic phrases, prosodic words and intonation phrases is input into a model, and the model outputs fluent, natural speech; this is widely used in technologies such as audio novels, digital humans, voice assistants and smart speakers.
  • Speech synthesis requires a large amount of text already marked with prosodic phrases, prosodic words and intonation phrases to train the model, so that the model can output fluent and natural speech. If such annotated text can be obtained quickly, the training efficiency of the model can be improved, which in turn improves the naturalness and fluency of the output speech.
  • Emotion recognition, also known as sentiment analysis, is the process of analyzing, processing, summarizing and reasoning over subjective, emotionally colored text. It is widely applied in comment analysis and decision-making, e-commerce review classification and public opinion monitoring. Emotion recognition usually analyzes the text directly; if the prosodic phrases, prosodic words and intonation phrases in the text can be marked, the accuracy of emotion recognition can be improved.
  • the system architecture includes a data collection module, a labeling module, and a verification module.
  • the data acquisition module is used to collect video information or audio information on the network.
  • for video information, the data acquisition module extracts the audio information in the video information, determines corresponding second text information from the subtitles in the video information, and performs speech recognition on the audio information to determine third text information; if there is no difference between the second text information and the third text information, the second text information or the third text information is determined to be the first text information; if there is a difference, the data acquisition module can correct the second text information or the third text information based on the user's operation instruction, and the corrected text information is the first text information; for audio information, the first text information corresponding to the audio information is obtained directly through speech recognition technology.
  • it should be understood that the second text information, the third text information and the first text information may specifically take the form of document files; for example, the second text information may take the form of a document 1 including the second text information, the third text information may take the form of a document 2 including the third text information, and the first text information may take the form of a document 3 including the first text information.
  • the data acquisition module sends the audio information and the first text information to the labeling module, for example, sends the audio information and the document 3 to the labeling module.
  • the tagging module tags prosodic phrases and prosodic words in document 3 based on corresponding models, algorithms or rules combined with the audio information to obtain the first annotated text, then tags intonation phrases in the first annotated text based on the prosodic phrases and prosodic words tagged in the first annotated text combined with the audio information to obtain the second annotated text, and sends the second annotated text to the verification module.
  • the verification module corrects the second annotated text based on the user's operation instruction.
  • in one approach, the tagging module may first send document 3 with the prosodic words annotated to the verification module; after the verification module corrects the prosodic words marked in document 3 based on the user's operation instruction, the corrected document 3 is sent back to the tagging module for subsequent tagging, and the second annotated text is finally obtained.
  • the data acquisition module can also create a document 4 including the first text information and send document 4 and the audio information to the manual labeling module; the manual labeling module obtains the user's operation instructions to mark the prosodic phrases, prosodic words and intonation phrases in document 4, thereby obtaining the third annotated text.
  • the tagging module and the manual labeling module send the second annotated text and the third annotated text, respectively, to the screening module; the screening module compares the annotation information in the second annotated text and the third annotated text and determines the target annotations in the third annotated text. It should be noted that a target annotation can be understood as an annotation that may be wrong.
  • the screening module sends the third annotated text to the verification module, and the user corrects the target annotation through the verification module.
  • the modules shown in FIG. 1a and FIG. 1b may each be located on different computer devices, may all be located on the same computer device, or some may be located on the same computer device while the others are located on other computer devices; this is not specifically limited here.
  • FIG. 2 Please refer to FIG. 2 , the following is an introduction to a flow of the method for labeling prosodic information in the embodiment of the present application:
  • the computer device acquires audio information and first text information
  • the computer device collects video information and audio information by downloading from the network. For collected video information, the computer device extracts the audio information in the video information and recognizes the subtitles in the video information through optical character recognition, so as to determine corresponding second text information; since the subtitles in the video information correspond to the audio information, the second text information determined from the subtitles also corresponds to the audio information.
  • for collected audio information, the text information corresponding to the audio information is determined directly through speech recognition technology, and that text information is the first text information.
  • the computer equipment marks the prosodic words and prosodic phrases in the first text information to obtain the first marked text;
  • after acquiring the first text information, the computer device marks the prosodic words in the first text information based on corresponding models, algorithms or rules; for example, it can mark the prosodic words in the first text information according to a coarse-grained word segmentation model. Alternatively, after the computer device marks the prosodic words in the first text information based on corresponding models, algorithms or rules, it can also correct the marked prosodic words based on the user's operation instruction.
  • the computer equipment marks the prosodic phrases in the first text information according to the audio information.
  • the pronunciation duration of each character and the pitch of each character in the audio information can be extracted through a neural network or a machine learning algorithm, and the prosodic phrases in the first text information are marked according to this information.
  • the order of annotating prosodic words and prosodic phrases is not limited: prosodic words may be annotated first, or prosodic phrases may be annotated first.
  • the computer device annotates intonation phrases in the first annotated text based on the annotated prosodic words, prosodic phrases and audio information in the first annotated text, to obtain a second annotated text.
  • the computer device can combine the prosodic words and prosodic phrases marked in the first text with the pronunciation duration and pitch of each character in the audio information to mark the intonation phrases in the first text, thereby obtaining the second annotated text.
  • the computer device corrects at least one of prosodic words, prosodic phrases, and intonation phrases marked in the second marked text in response to the user's operation instruction.
  • after obtaining the second annotated text, the computer device corrects at least one of the prosodic words, prosodic phrases and intonation phrases marked in the second annotated text based on the user's operation instruction.
  • the first text information and the corresponding audio information may be collected by the computer device over a network, so a large amount of data can be collected. The prosodic words, prosodic phrases and intonation phrases in the first text information are marked by the computer device rather than by manual annotation, which improves the labeling efficiency of prosodic words, prosodic phrases and intonation phrases; and because intonation phrases are labeled by combining the prosodic words and prosodic phrases, the accuracy of intonation phrase labeling is improved.
  • after the prosodic words, prosodic phrases and intonation phrases in the first text information are annotated, the user can also correct, through the computer device, the prosodic words, prosodic phrases and intonation phrases marked in the first text information, where the user may be a relevant professional.
  • Steps 301 to 303 in this embodiment are similar to steps 201 to 203 in the embodiment shown in FIG. 2 , and will not be repeated here.
  • the computer device acquires the third annotated text
  • the computer device obtains the third annotated text, in which prosodic words, prosodic phrases and intonation phrases have already been marked; the text information of the third annotated text is consistent with the text information of the second annotated text, and the annotation information of the third annotated text differs from the annotation information of the second annotated text.
  • the text information includes only the textual content of the text, while the annotation information is embodied in the position of each annotation in the text and the type of each annotation, where the annotation types include annotations indicating prosodic word boundaries, annotations indicating prosodic phrases, and annotations indicating intonation phrases.
  • the second annotated text may be obtained by the computer device annotating the prosodic words, prosodic phrases and intonation phrases in document 3, which includes the first text information, and the third annotated text may be obtained by the computer device annotating, based on the user's operation instruction, the prosodic words, prosodic phrases and intonation phrases in document 4, which includes the first text information.
  • the execution order of step 304 is not limited here; it only needs to be executed after step 301 and before step 305.
  • the computer device determines the target annotation in the third annotated text based on the annotation information of the second annotated text and the annotation information of the third annotated text;
  • the computer device combines the annotation information of the second annotated text and the annotation information of the third annotated text, and analyzes the difference in the annotation information between the two texts.
  • Figure 4 is a schematic diagram of marking prosodic words, prosodic phrases and intonation phrases on text.
  • for example, if position 1 in the second annotated text is marked "#a", where "a" can be 0, 1, 2 or 3, and position 1 in the third annotated text is marked "#b", where "b" can be 0, 1, 2 or 3, and b minus a is greater than or equal to 2, this indicates that the annotation at position 1 may be erroneous, and the computer device determines this annotation as a target annotation.
  • the computer device corrects the target label in response to the user's operation instruction.
  • after the computer device determines the target annotation, it displays the target annotation, and the user can correct the target annotation through the computer device.
  • the computer device determines the target annotations in the third annotated text based on the second annotated text, and the target annotations are then checked manually, which improves the efficiency of the check.
  • a computer device 500 in this embodiment of the present application includes a processing unit 501 .
  • the processing unit 501 is configured to acquire audio information and first text information, where the audio information has a corresponding relationship with the first text information.
  • the processing unit 501 is further configured to mark prosodic words and prosodic phrases in the first text information to obtain the first marked text, and the prosodic phrases in the first text information need to be marked based on audio information.
  • the processing unit 501 is further configured to annotate intonation phrases in the first annotated text based on prosodic words annotated in the first annotated text, prosodic phrases annotated in the first annotated text, and audio information to obtain a second annotated text.
  • the processing unit 501 is further configured to correct at least one of the prosodic words, prosodic phrases and intonation phrases marked in the second marked text in response to the user's operation instruction.
  • the processing unit 501 is further configured to acquire a third annotated text, where the text information of the third annotated text is consistent with the text information of the second annotated text, the annotation information in the third annotated text differs from the annotation information in the second annotated text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases.
  • the processing unit 501 is further configured to determine target annotations in the third annotated text based on the annotation information of the second annotated text and the annotation information of the third annotated text, where a target annotation includes at least one of an annotation indicating a prosodic word, an annotation indicating a prosodic phrase, and an annotation indicating an intonation phrase.
  • the processing unit 501 is further configured to correct the target label in response to the user's operation instruction.
  • the processing unit 501 is specifically configured to acquire audio information.
  • the processing unit 501 is further configured to determine first text information corresponding to the audio information based on speech recognition.
  • the processing unit 501 is specifically configured to acquire video information.
  • the processing unit 501 is further configured to acquire audio information in the video information and second text information corresponding to subtitle information in the video information.
  • the processing unit 501 is further configured to determine third text information corresponding to the audio information based on speech recognition.
  • the processing unit 501 is further configured to determine the first text information based on the second text information and the third text information.
  • the computer device 600 may include one or more central processing units (CPU) 601 and a memory 605, and one or more applications or data are stored in the memory 605.
  • the storage 605 may be a volatile storage or a persistent storage.
  • the program stored in the memory 605 may include one or more modules, and each module may include a series of instructions to operate on the server.
  • the central processing unit 601 may be configured to communicate with the memory 605 , and execute a series of instruction operations in the memory 605 on the computer device 600 .
  • the computer device 600 can also include one or more power supplies 602, one or more wired or wireless network interfaces 603, one or more input and output interfaces 604, and/or, one or more operating systems, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
  • the central processing unit 601 may perform the operations performed by the computer device in the foregoing embodiments shown in FIG. 2 and FIG. 3 , and details are not repeated here.
  • the disclosed system, device and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • based on this understanding, the technical solution of the present application, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Provided is a prosodic information labeling method, comprising: a computer device acquiring audio information and first text information (201); the computer device labeling prosodic words and prosodic phrases in the first text information to obtain a first labeled text (202), wherein the prosodic phrases need to be labeled on the basis of the audio information; and the computer device labeling intonation phrases in the first labeled text on the basis of the prosodic words labeled in the first labeled text, the prosodic phrases labeled in the first labeled text, and the audio information, so as to obtain a second labeled text (203). Further provided are a computer device, a computer program product, and a computer-readable storage medium.

Description

Prosodic information labeling method and related device
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on September 24, 2021, with application number 202111124499.3 and entitled "Prosodic Information Labeling Method and Related Device", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the field of speech synthesis, and in particular, to a prosodic information labeling method and related device.
Background
Speech synthesis (text to speech, TTS) technology obtains natural and fluent speech information based on text and the annotation information in the text, where the annotation information includes annotations indicating prosodic phrases, annotations indicating prosodic words, and annotations indicating intonation phrases; the annotation information can represent pauses and changes of tone in the text. Speech synthesis technology inputs text containing annotation information into a model so that the model outputs corresponding speech information. To make the speech information output by the model more natural, a large amount of text containing annotation information is needed to train the model.
At present, prosodic phrases, prosodic words and intonation phrases are mainly annotated in text by professionals, which takes a long time; it is therefore difficult to obtain text containing annotation information, which is not conducive to quickly training the model to the desired effect.
Summary
Embodiments of the present application provide a prosodic information labeling method and related device, which are used to improve labeling efficiency.
A first aspect of the embodiments of the present application provides a prosodic information labeling method:
A computer device acquires audio information and first text information that correspond to each other, and marks prosodic words and prosodic phrases in the first text information to obtain a first annotated text, where the prosodic phrases need to be marked based on the audio information. After marking the prosodic words and prosodic phrases in the first text information, the computer device marks intonation phrases in the first text information based on the prosodic words and prosodic phrases marked in the first annotated text, thereby obtaining a second annotated text.
In this embodiment of the present application, the first text information and the corresponding audio information can be collected over a network, so a large amount of data can be collected. The prosodic words, prosodic phrases and intonation phrases in the first text information are marked by the computer device rather than by manual annotation, which improves the labeling efficiency of prosodic words, prosodic phrases and intonation phrases; and because intonation phrases are labeled by combining the prosodic words and prosodic phrases, the accuracy of intonation phrase labeling is improved.
In a possible implementation, the computer device may further correct, in response to a user's operation instruction, the prosodic words, prosodic phrases and intonation phrases marked in the second annotated text.
In this embodiment of the present application, after the computer device annotates the prosodic words, prosodic phrases and intonation phrases, the user can also correct the related annotations, thereby further improving the accuracy of annotation.
In a possible implementation, the computer device may further obtain a third annotated text, where the text information of the third annotated text is consistent with the text information of the second annotated text, the annotation information in the third annotated text differs from the annotation information in the second annotated text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases. The computer device may further determine target annotations in the third annotated text based on the annotation information of the second annotated text and the annotation information of the third annotated text, where a target annotation includes at least one of an annotation indicating a prosodic word, an annotation indicating a prosodic phrase, and an annotation indicating an intonation phrase.
In this embodiment of the present application, the third annotated text may be the first text information after prosodic words, prosodic phrases and intonation phrases have been annotated manually; the third annotated text can be checked against the second annotated text to determine target annotations that may have been mislabeled in the third annotated text, so that the third annotated text can be corrected and the accuracy of annotation improved.
In a possible implementation, the computer device may further correct the target annotation in response to a user's operation instruction.
In a possible implementation, the computer device acquiring the audio information and the first text information may specifically involve first acquiring the audio information, and then acquiring, based on speech recognition technology, the first text information corresponding to the audio information; or it may involve acquiring video information, then acquiring the audio information in the video information and second text information corresponding to the subtitle information in the video information, acquiring third text information corresponding to the audio information based on speech recognition technology, and determining the first text information based on the second text information and the third text information.
A second aspect of the embodiments of the present application provides a computer device, where the computer device includes a plurality of functional modules that interact with each other to implement the method of the first aspect and its implementations. The functional modules may be implemented in software, hardware, or a combination of software and hardware, and may be combined or divided arbitrarily based on the specific implementation.
A third aspect of the embodiments of the present application provides a computer device, including a processor coupled with a memory, where the memory is used to store instructions; when the instructions are executed by the processor, the display device is caused to perform the method of the first aspect.
A fourth aspect of the embodiments of the present application provides a computer program product including code which, when run on a computer, causes the computer to execute the method of the first aspect.
A fifth aspect of the embodiments of the present application provides a computer-readable storage medium on which computer programs or instructions are stored; when the computer programs or instructions are executed, the computer is caused to execute the method of the first aspect.
Brief Description of the Drawings
FIG. 1a and FIG. 1b are schematic diagrams of the system architecture in embodiments of the present application;
FIG. 2 is a schematic flowchart of the prosodic information labeling method in an embodiment of the present application;
FIG. 3 is another schematic flowchart of the prosodic information labeling method in an embodiment of the present application;
FIG. 4 is a schematic diagram of labeling prosodic words, prosodic phrases and intonation phrases in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a computer device in an embodiment of the present application;
FIG. 6 is another schematic structural diagram of a computer device in an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings. Apparently, the described embodiments are only some of the embodiments of the present application, not all of them. A person of ordinary skill in the art knows that, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
The terms "first", "second" and the like in the specification, claims and accompanying drawings of the present application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
Embodiments of the present application provide a prosodic information labeling method and related device, which are used to improve the efficiency of prosodic information labeling.
For ease of understanding, related concepts involved in the embodiments of the present application are first introduced below:
Syllable: the phonetic unit that is most easily distinguished by ear, and the most natural phonetic unit in the flow of speech.
Prosodic word: a group of syllables that are closely related in the actual flow of speech and are often pronounced together.
Prosodic phrase: a medium-sized rhythmic chunk between a prosodic word and an intonation phrase.
Intonation phrase: a phrase that connects several prosodic phrases according to a certain intonation pattern.
Embodiments of the present application can be applied to speech synthesis and text emotion recognition. In speech synthesis, text that has been marked with prosodic phrases, prosodic words and intonation phrases is input into a model, and the model outputs fluent, natural speech; this is widely used in technologies such as audio novels, digital humans, voice assistants and smart speakers. Speech synthesis requires a large amount of text already marked with prosodic phrases, prosodic words and intonation phrases to train the model, so that the model can output fluent and natural speech; if such annotated text can be obtained quickly, the training efficiency of the model can be improved, which in turn improves the naturalness and fluency of the output speech. Emotion recognition, also known as sentiment analysis, is the process of analyzing, processing, summarizing and reasoning over subjective, emotionally colored text. It is widely applied in comment analysis and decision-making, e-commerce review classification and public opinion monitoring. Emotion recognition usually analyzes the text directly; if the prosodic phrases, prosodic words and intonation phrases in the text can be marked, the accuracy of emotion recognition can be improved.
Embodiments of the present application can be applied to the system architecture shown in FIG. 1a or FIG. 1b, which are introduced separately below. As shown in FIG. 1a, the system architecture includes a data acquisition module, a labeling module and a verification module. The data acquisition module is used to collect video information or audio information from the network. For video information, it extracts the audio information in the video information, determines corresponding second text information from the subtitles in the video information, and performs speech recognition on the audio information to determine third text information. If there is no difference between the second text information and the third text information, the second text information or the third text information is determined to be the first text information; if there is a difference, the data acquisition module can correct the second text information or the third text information based on the user's operation instruction, and the corrected text information is the first text information. For audio information, the first text information corresponding to the audio information is obtained directly through speech recognition technology. It should be understood that the second text information, the third text information and the first text information may specifically take the form of document files; for example, the second text information may take the form of a document 1 including the second text information, the third text information may take the form of a document 2 including the third text information, and the first text information may take the form of a document 3 including the first text information.
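As an illustration of how the data acquisition module may decide on the first text information, the following minimal sketch compares the subtitle-derived text with the speech-recognition text and flags disagreements for manual correction. The data structure and function names are illustrative assumptions, not APIs named in the application.

```python
# Minimal sketch of the data-acquisition consistency check described above.
# The upstream extraction steps (audio extraction, subtitle OCR, speech
# recognition) are assumed to have run already and are not shown here.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Sample:
    audio_path: str               # audio extracted from the video, or collected directly
    second_text: Optional[str]    # text recovered from subtitles via OCR (video only)
    third_text: str               # text recovered from the audio via speech recognition


def determine_first_text(sample: Sample) -> Tuple[str, bool]:
    """Return (first_text, needs_manual_correction)."""
    if sample.second_text is None:
        # Audio-only sample: the speech-recognition result is used directly.
        return sample.third_text, False
    if sample.second_text.strip() == sample.third_text.strip():
        # Subtitle text and speech-recognition text agree: either one can serve
        # as the first text information.
        return sample.second_text, False
    # They disagree: keep the speech-recognition text provisionally and flag the
    # sample so a user can correct it before labeling.
    return sample.third_text, True
```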
The data acquisition module sends the audio information and the first text information to the labeling module, for example, sends the audio information and document 3 to the labeling module. The labeling module annotates prosodic phrases and prosodic words in document 3 based on corresponding models, algorithms or rules combined with the audio information to obtain a first annotated text, then annotates intonation phrases in the first annotated text based on the prosodic phrases and prosodic words annotated in the first annotated text combined with the audio information to obtain a second annotated text, and sends the second annotated text to the verification module. The verification module corrects the second annotated text based on the user's operation instruction. It should be noted that, in one approach, the labeling module may first send document 3 with the prosodic words annotated to the verification module; after the verification module corrects the prosodic words marked in document 3 based on the user's operation instruction, the corrected document 3 is sent back to the labeling module for subsequent annotation, and the second annotated text is finally obtained.
As shown in FIG. 1b, the data acquisition module can also create a document 4 including the first text information and send document 4 and the audio information to a manual labeling module; the manual labeling module obtains the user's operation instructions to mark the prosodic phrases, prosodic words and intonation phrases in document 4, thereby obtaining a third annotated text. The labeling module and the manual labeling module send the second annotated text and the third annotated text, respectively, to a screening module; the screening module compares the annotation information in the second annotated text and the third annotated text and determines the target annotations in the third annotated text. It should be noted that a target annotation can be understood as an annotation that may be wrong. After the target annotations are determined, the screening module sends the third annotated text to the verification module, and the user corrects the target annotations through the verification module.
It should be noted that the modules shown in FIG. 1a and FIG. 1b may each be located on different computer devices, may all be located on the same computer device, or some may be located on the same computer device while the others are located on other computer devices; this is not specifically limited here.
Referring to FIG. 2, a flow of the prosodic information labeling method in an embodiment of the present application is introduced below:
201. The computer device acquires audio information and first text information.
The computer device collects video information and audio information by downloading from the network. For collected video information, the computer device extracts the audio information in the video information and recognizes the subtitles in the video information through optical character recognition, so as to determine corresponding second text information; since the subtitles in the video information correspond to the audio information, the second text information determined from the subtitles also corresponds to the audio information. Speech recognition is then performed on the audio information to determine third text information. If there is no difference between the second text information and the third text information, the second text information or the third text information is determined to be the first text information; if there is a difference, the user can correct the second text information or the third text information on the computer device, and the corrected text information is the first text information. For collected audio information, the text information corresponding to the audio information is determined directly through speech recognition technology, and that text information is the first text information.
202. The computer device marks prosodic words and prosodic phrases in the first text information to obtain a first annotated text.
After acquiring the first text information, the computer device marks the prosodic words in the first text information based on corresponding models, algorithms or rules; for example, it can mark the prosodic words in the first text information according to a coarse-grained word segmentation model. Alternatively, after the computer device marks the prosodic words in the first text information based on corresponding models, algorithms or rules, it can also correct the marked prosodic words based on the user's operation instruction.
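The following sketch illustrates how prosodic-word boundaries could be derived from a coarse-grained word segmenter. The `segment` callable is a placeholder for whatever segmentation model an implementation uses; it is not an API named in the application.

```python
# Illustrative sketch of prosodic-word marking via coarse-grained word
# segmentation. `segment` is a stand-in for the segmentation model.
from typing import Callable, List


def mark_prosodic_words(text: str, segment: Callable[[str], List[str]]) -> List[int]:
    """Return character indices after which a prosodic-word boundary is marked."""
    boundaries = []
    pos = 0
    for word in segment(text):
        pos += len(word)
        boundaries.append(pos - 1)  # boundary sits after the last character of the word
    return boundaries


# Example with a toy segmenter that splits every two characters:
toy_segment = lambda s: [s[i:i + 2] for i in range(0, len(s), 2)]
print(mark_prosodic_words("不断有传言称", toy_segment))  # [1, 3, 5]
```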
The computer device then marks the prosodic phrases in the first text information according to the audio information. Specifically, the pronunciation duration of each character and the pitch of each character in the audio information can be extracted through a neural network or a machine learning algorithm, and the prosodic phrases in the first text information are marked according to this information.
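The sketch below assumes that per-character durations and pitch values have already been extracted from the audio (for example by a forced aligner); the pause and pitch thresholds, and the boundary rule itself, are illustrative assumptions rather than rules stated in the application.

```python
# Illustrative rule for prosodic-phrase marking: a prosodic-phrase boundary is
# placed at a prosodic-word boundary that is followed by a long pause or an
# upward pitch reset. Thresholds are assumptions for illustration only.
from typing import List


def mark_prosodic_phrases(durations: List[float],      # seconds per character
                          pitches: List[float],        # mean F0 per character (Hz)
                          word_boundaries: List[int],  # indices after which a prosodic word ends
                          pause_threshold: float = 0.25,
                          pitch_reset: float = 30.0) -> List[int]:
    """Return character indices after which a prosodic-phrase boundary is marked."""
    phrase_boundaries = []
    for i in word_boundaries:                # phrase breaks only occur at word breaks
        if i + 1 >= len(durations):
            continue
        long_pause = durations[i] > pause_threshold
        reset = (pitches[i + 1] - pitches[i]) > pitch_reset
        if long_pause or reset:
            phrase_boundaries.append(i)
    return phrase_boundaries
```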
It should be noted that the order of annotating prosodic words and prosodic phrases is not limited: prosodic words may be annotated first, or prosodic phrases may be annotated first.
203. The computer device annotates intonation phrases in the first annotated text based on the prosodic words and prosodic phrases annotated in the first annotated text and the audio information, to obtain a second annotated text.
Because intonation phrases are strongly related to prosodic words and prosodic phrases, the computer device can combine the prosodic words and prosodic phrases marked in the first text with the pronunciation duration and pitch of each character in the audio information to mark the intonation phrases in the first text, thereby obtaining the second annotated text.
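Continuing the sketch above, intonation-phrase boundaries can be selected from the prosodic-phrase boundaries using longer pauses and larger pitch falls; the thresholds are again illustrative assumptions, not values given in the application.

```python
# Illustrative rule for intonation-phrase marking: promote a prosodic-phrase
# boundary to an intonation-phrase boundary when it coincides with a very long
# pause or a strong pitch fall. Thresholds are assumptions for illustration.
from typing import List


def mark_intonation_phrases(durations: List[float],
                            pitches: List[float],
                            phrase_boundaries: List[int],
                            long_pause: float = 0.5,
                            large_fall: float = 60.0) -> List[int]:
    """Return character indices after which an intonation-phrase boundary is marked."""
    intonation_boundaries = []
    for i in phrase_boundaries:              # intonation breaks only occur at phrase breaks
        very_long_pause = durations[i] > long_pause
        strong_fall = i > 0 and (pitches[i - 1] - pitches[i]) > large_fall
        if very_long_pause or strong_fall:
            intonation_boundaries.append(i)
    return intonation_boundaries
```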
204、计算机设备响应于用户的操作指令,对第二标注后文本中标注的韵律词、韵律短语以及语调短语中的至少一种进行校正。204. The computer device corrects at least one of prosodic words, prosodic phrases, and intonation phrases marked in the second marked text in response to the user's operation instruction.
得到第二标注后文本之后,计算机设备基于用户的操作指令对第二标注后文本中标注的韵律词、韵律短语以及语调短语中的至少一种进行校正。After obtaining the second marked text, the computer device corrects at least one of the prosodic words, prosodic phrases and intonation phrases marked in the second marked text based on the user's operation instruction.
本申请实施例中,计算机设备收集第一文本信息以及对应的音频信息的方式可以是在网络收集,因此可以收集到较多的数据量。并由计算机设备标注第一文本信息中的韵律词、韵律短语以及语调短语,不再需要通过人工标注的方式,从而提高韵律词、韵律短语以及语调短语的标注效率,并且由于结合了韵律词以及韵律短语标注语调短语,提高了语调短语标注的准确性。In the embodiment of the present application, the manner in which the computer device collects the first text information and the corresponding audio information may be collected on the network, so a large amount of data may be collected. And mark the prosodic words, prosodic phrases and intonation phrases in the first text information by computer equipment, no longer need to pass through the mode of artificial labeling, thereby improve the labeling efficiency of prosodic words, prosodic phrases and intonation phrases, and because combine prosodic words and Prosodic phrases are used to mark intonation phrases, which improves the accuracy of intonation phrase labeling.
Optionally, on the basis of the embodiment shown in FIG. 2, after the prosodic words, prosodic phrases, and intonation phrases in the first text information are marked, the user may further correct, by means of the computer device, the prosodic words, prosodic phrases, and intonation phrases marked in the first text information, where the user may be a relevant professional.
Referring to FIG. 3, another procedure of the prosodic information labeling method in the embodiments of the present application is described below.
Steps 301 to 303 in this embodiment are similar to steps 201 to 203 in the embodiment shown in FIG. 2 and are not described again here.
304. The computer device acquires a third marked text.
The computer device acquires a third marked text in which prosodic words, prosodic phrases, and intonation phrases have already been marked; the text information of the third marked text is consistent with the text information of the second marked text, while the annotation information of the third marked text differs from that of the second marked text. It can be understood that the text information includes only the characters of the text, whereas the annotation information consists of the position of each annotation in the text and the type of each annotation, the types including annotations indicating prosodic word boundaries, annotations indicating prosodic phrases, and annotations indicating intonation phrases.
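As a purely illustrative sketch, the annotation information described here could be represented as a list of (position, type) records; the names below are assumptions, not terms used by this application.

```python
from dataclasses import dataclass

# Annotation types referred to in this description.
PROSODIC_WORD, PROSODIC_PHRASE, INTONATION_PHRASE = 1, 2, 3

@dataclass
class Annotation:
    position: int  # index of the character after which the mark appears
    level: int     # PROSODIC_WORD, PROSODIC_PHRASE, or INTONATION_PHRASE
```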
It should be understood that, in this embodiment, the second marked text may be obtained by the computer device marking the prosodic words, prosodic phrases, and intonation phrases in document 3, which contains the first text information, and the third marked text may be obtained by the computer device marking, based on a user's operation instructions, the prosodic words, prosodic phrases, and intonation phrases in document 4, which also contains the first text information.
It should be noted that the execution order of step 304 is not limited here; it only needs to be executed after step 301 and before step 305.
305. The computer device determines a target annotation in the third marked text based on the annotation information of the second marked text and the annotation information of the third marked text.
The computer device combines the annotation information of the second marked text with that of the third marked text and analyzes the differences between the two. Referring to FIG. 4, FIG. 4 is a schematic diagram of marking prosodic words, prosodic phrases, and intonation phrases in a text. As shown in FIG. 4, in one implementation, an annotation indicating a prosodic word boundary, a prosodic phrase, or an intonation phrase may be appended after a character in a sentence. For example, "#1" marks the character or word preceding it (and following the previous "#1", "#2", or "#3") as a prosodic word; "#2" likewise marks the character or word preceding it (and following the previous "#1") as a prosodic word, and additionally marks all prosodic words preceding it back to the previous "#2" or "#3" as a prosodic phrase; "#3" likewise marks the character or word preceding it (and following the previous "#1") as a prosodic word, marks all prosodic words preceding it back to the previous "#2" as a prosodic phrase, and marks all prosodic phrases or prosodic words preceding it back to the previous "#3" as an intonation phrase. For example, if the sentence "不断有传言称技术专家现身北京" ("there are constant rumors that technical experts have appeared in Beijing") is annotated as "不断#1有#1传言称#3技术#1专家#1现身北京#3", this indicates that "不断", "有", "传言称", "技术", "专家", "现身", and "北京" are each prosodic words, that "不断有传言称", "技术专家", and "现身北京" are each prosodic phrases, and that "不断有传言称" and "技术专家现身北京" are each intonation phrases.
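The following Python sketch, given only as an illustration and not as this application's implementation, parses a sentence annotated in this "#1/#2/#3" style into prosodic words, prosodic phrases, and intonation phrases. Note that the demo string adds the internal "#2" and "#1" markers needed for the output to reproduce the word and phrase split listed above.

```python
import re
from typing import List, Tuple

def parse_prosody(annotated: str) -> Tuple[List[str], List[str], List[str]]:
    """Parse a '#1/#2/#3'-annotated sentence into prosody units.

    Returns (prosodic_words, prosodic_phrases, intonation_phrases).
    Characters not followed by a marker behave like '#0' (no boundary).
    """
    pieces = re.findall(r"([^#]+)#([0-3])", annotated)
    words, phrases, intonations = [], [], []
    word_buf = phrase_buf = intonation_buf = ""
    for text, level in pieces:
        level = int(level)
        word_buf += text
        phrase_buf += text
        intonation_buf += text
        if level >= 1:
            words.append(word_buf)
            word_buf = ""
        if level >= 2:
            phrases.append(phrase_buf)
            phrase_buf = ""
        if level >= 3:
            intonations.append(intonation_buf)
            intonation_buf = ""
    return words, phrases, intonations

w, p, i = parse_prosody("不断#1有#1传言称#3技术#1专家#2现身#1北京#3")
# w -> ['不断', '有', '传言称', '技术', '专家', '现身', '北京']
# p -> ['不断有传言称', '技术专家', '现身北京']
# i -> ['不断有传言称', '技术专家现身北京']
```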
In one case, the annotation of the second marked text at position 1 is "#a", where a may be 0, 1, 2, or 3, and the annotation of the third marked text at position 1 is "#b", where b may be 0, 1, 2, or 3. If b minus a is greater than or equal to 2, the annotation at position 1 may be erroneous, and the computer device determines the annotation at that position as a target annotation. To illustrate with the above example, if the annotation information of a sentence in the second marked text is "不断#1有#1传言称#3技术#1专家#1现身北京#3" and the annotation information of the same sentence in the third marked text is "不断#3有#1传言称#3技术#1专家#1现身北京#1", then clearly, in the third marked text, the annotations at "不断#3" and "现身北京#1" are target annotations.
It should be noted that, if no annotation follows a character or word, the unmarked position corresponds to "#0". For example, "现身北京#3" in the second marked text may be further understood as "现#0身#0北#0京#3"; if the annotation information of the third marked text at that position is "现#2身北京#3", then "现#2" is also a target annotation.
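A minimal sketch of this comparison follows; the threshold of 2 comes from the case above, the symmetric (absolute) difference is inferred from the examples, and everything else is an assumption.

```python
import re
from typing import List

def annotation_levels(annotated: str, text_len: int) -> List[int]:
    """Expand an annotated string into one level (0-3) per character.

    Unmarked positions stay at level 0, matching the '#0' convention.
    """
    levels, pos = [0] * text_len, 0
    for chunk, level in re.findall(r"([^#]+)#([0-3])", annotated):
        pos += len(chunk)
        levels[pos - 1] = int(level)
    return levels

def find_target_positions(second: str, third: str, text_len: int) -> List[int]:
    """Flag character positions whose annotation levels differ by 2 or more.

    The examples above treat a difference in either direction as suspicious,
    so the absolute difference is used here.
    """
    a = annotation_levels(second, text_len)
    b = annotation_levels(third, text_len)
    return [i for i in range(text_len) if abs(b[i] - a[i]) >= 2]

# Using the 14-character sentence above:
find_target_positions("不断#1有#1传言称#3技术#1专家#1现身北京#3",
                      "不断#3有#1传言称#3技术#1专家#1现身北京#1", 14)
# -> [1, 13], i.e. the positions of "不断#3" and "现身北京#1" in the third text
```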
306. The computer device corrects the target annotation in response to the user's operation instruction.
After determining the target annotation, the computer device displays it, and the user can correct the target annotation through the computer device.
In this embodiment of the present application, the manually annotated third marked text may contain improper annotations, and verifying the third marked text entirely by hand would take a great deal of time. The computer device therefore determines the target annotations in the third marked text based on the second marked text, and the target annotations are then verified manually, which improves verification efficiency.
The prosodic information labeling method in the embodiments of the present application has been described above. Referring to FIG. 5, the computer device in the embodiments of the present application is described below.
As shown in FIG. 5, the computer device 500 in this embodiment of the present application includes a processing unit 501.
The processing unit 501 is configured to acquire audio information and first text information, where the audio information corresponds to the first text information.
The processing unit 501 is further configured to mark the prosodic words and prosodic phrases in the first text information to obtain a first marked text, where the prosodic phrases in the first text information need to be marked based on the audio information.
The processing unit 501 is further configured to mark the intonation phrases in the first marked text based on the prosodic words marked in the first marked text, the prosodic phrases marked in the first marked text, and the audio information, to obtain a second marked text.
In one implementation,
the processing unit 501 is further configured to correct, in response to a user's operation instruction, at least one of the prosodic words, prosodic phrases, and intonation phrases marked in the second marked text.
In one implementation,
the processing unit 501 is further configured to acquire a third marked text, where the text information of the third marked text is consistent with the text information of the second marked text, the annotation information in the third marked text differs from the annotation information in the second marked text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases.
The processing unit 501 is further configured to determine a target annotation in the third marked text based on the annotation information of the second marked text and the annotation information of the third marked text, where the target annotation includes at least one of an annotation indicating a prosodic word, an annotation indicating a prosodic phrase, and an annotation indicating an intonation phrase.
In one implementation,
the processing unit 501 is further configured to correct the target annotation in response to a user's operation instruction.
In one implementation,
the processing unit 501 is specifically configured to acquire the audio information.
The processing unit 501 is further configured to determine the first text information corresponding to the audio information based on speech recognition.
Or,
the processing unit 501 is specifically configured to acquire video information.
The processing unit 501 is further configured to acquire the audio information in the video information and second text information corresponding to subtitle information in the video information.
The processing unit 501 is further configured to determine third text information corresponding to the audio information based on speech recognition.
The processing unit 501 is further configured to determine the first text information based on the second text information and the third text information.
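One conceivable way to determine the first text information from the subtitle text (second text information) and the speech-recognition transcript (third text information), which this application does not specify, is to keep only the subtitle lines that the transcript closely confirms; a minimal sketch:

```python
from difflib import SequenceMatcher
from typing import List

def determine_first_text(subtitle_lines: List[str],
                         asr_lines: List[str],
                         min_ratio: float = 0.9) -> str:
    """Keep a subtitle line only when the ASR transcript closely confirms it.

    How the two sources are actually reconciled is not specified in this
    application; the 0.9 similarity threshold is an assumption.
    """
    kept = []
    for line in subtitle_lines:
        best = max((SequenceMatcher(None, line, cand).ratio()
                    for cand in asr_lines), default=0.0)
        if best >= min_ratio:
            kept.append(line)
    return "\n".join(kept)
```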
FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 600 may include one or more central processing units (CPUs) 601 and a memory 605, and one or more applications or sets of data are stored in the memory 605.
The memory 605 may be volatile storage or persistent storage. A program stored in the memory 605 may include one or more modules, and each module may include a series of instruction operations on the server. Further, the central processing unit 601 may be configured to communicate with the memory 605 and to execute, on the computer device 600, the series of instruction operations in the memory 605.
The computer device 600 may further include one or more power supplies 602, one or more wired or wireless network interfaces 603, one or more input/output interfaces 604, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
The central processing unit 601 may perform the operations performed by the computer device in the embodiments shown in FIG. 2 and FIG. 3, and details are not repeated here.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is merely a logical functional division; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (13)

  1. A prosodic information labeling method, characterized in that the method comprises:
    acquiring, by a computer device, audio information and first text information, wherein the audio information corresponds to the first text information;
    marking, by the computer device, prosodic words and prosodic phrases in the first text information to obtain a first marked text, wherein the prosodic phrases in the first marked text need to be marked based on the audio information; and
    marking, by the computer device, intonation phrases in the first marked text based on the prosodic words marked in the first marked text, the prosodic phrases marked in the first marked text, and the audio information, to obtain a second marked text.
  2. The method according to claim 1, characterized in that the method further comprises:
    correcting, by the computer device in response to a user's operation instruction, at least one of the prosodic words marked in the second marked text, the prosodic phrases marked in the second marked text, and the intonation phrases marked in the second marked text.
  3. The method according to claim 1, characterized in that the method further comprises:
    acquiring, by the computer device, a third marked text, wherein text information of the third marked text is consistent with text information of the second marked text, annotation information in the third marked text differs from annotation information in the second marked text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases; and
    determining, by the computer device, a target annotation in the third marked text based on the annotation information of the second marked text and the annotation information of the third marked text, wherein the target annotation comprises at least one of the annotations indicating prosodic words, the annotations indicating prosodic phrases, and the annotations indicating intonation phrases.
  4. The method according to claim 3, characterized in that the method further comprises:
    correcting, by the computer device, the target annotation in response to a user's operation instruction.
  5. The method according to any one of claims 1 to 4, characterized in that the acquiring, by the computer device, audio information and first text information comprises:
    acquiring, by the computer device, the audio information; and
    determining, by the computer device, based on speech recognition, the first text information corresponding to the audio information;
    or,
    acquiring, by the computer device, video information;
    acquiring, by the computer device, the audio information in the video information and second text information corresponding to subtitle information in the video information;
    determining, by the computer device, third text information based on speech recognition and the audio information; and
    determining, by the computer device, the first text information based on the second text information and the third text information.
  6. A computer device, characterized in that it comprises:
    a processing unit, configured to acquire audio information and first text information, wherein the audio information corresponds to the first text information;
    wherein the processing unit is further configured to mark prosodic words and prosodic phrases in the first text information to obtain a first marked text, and the prosodic phrases in the first text information need to be marked based on the audio information; and
    the processing unit is further configured to mark intonation phrases in the first marked text based on the prosodic words marked in the first marked text, the prosodic phrases marked in the first marked text, and the audio information, to obtain a second marked text.
  7. The device according to claim 6, characterized in that
    the processing unit is further configured to correct, in response to a user's operation instruction, at least one of the prosodic words marked in the second marked text, the prosodic phrases marked in the second marked text, and the intonation phrases marked in the second marked text.
  8. The device according to claim 6, characterized in that
    the processing unit is further configured to acquire a third marked text, wherein text information of the third marked text is consistent with text information of the second marked text, annotation information in the third marked text differs from annotation information in the second marked text, and the annotation information is indicated by annotations indicating prosodic words, annotations indicating prosodic phrases, and annotations indicating intonation phrases; and
    the processing unit is further configured to determine a target annotation in the third marked text based on the annotation information of the second marked text and the annotation information of the third marked text, wherein the target annotation comprises at least one of the annotations indicating prosodic words, the annotations indicating prosodic phrases, and the annotations indicating intonation phrases.
  9. The device according to claim 8, characterized in that
    the processing unit is further configured to correct the target annotation in response to a user's operation instruction.
  10. The device according to any one of claims 6 to 9, characterized in that
    the processing unit is specifically configured to acquire the audio information;
    the processing unit is further configured to determine, based on speech recognition, the first text information corresponding to the audio information;
    or,
    the processing unit is specifically configured to acquire video information;
    the processing unit is further configured to acquire the audio information in the video information and second text information corresponding to subtitle information in the video information;
    the processing unit is further configured to determine third text information based on speech recognition and the audio information; and
    the processing unit is further configured to determine the first text information based on the second text information and the third text information.
  11. A computer device, characterized in that it comprises a processor, wherein the processor is coupled to a memory, the memory is configured to store instructions, and when the instructions are executed by the processor, the computer device is caused to perform the method according to any one of claims 1 to 5.
  12. A computer program product, comprising code, characterized in that, when the code is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 5.
  13. A computer-readable storage medium having computer instructions or a program stored thereon, characterized in that, when the computer instructions or the program are executed, a computer is caused to perform the method according to any one of claims 1 to 5.
PCT/CN2022/099389 2021-09-24 2022-06-17 Prosodic information labeling method and related device WO2023045433A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111124499.3 2021-09-24
CN202111124499.3A CN115862584A (en) 2021-09-24 2021-09-24 Rhythm information labeling method and related equipment

Publications (1)

Publication Number Publication Date
WO2023045433A1 true WO2023045433A1 (en) 2023-03-30

Family

ID=85653183

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099389 WO2023045433A1 (en) 2021-09-24 2022-06-17 Prosodic information labeling method and related device

Country Status (2)

Country Link
CN (1) CN115862584A (en)
WO (1) WO2023045433A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012178A (en) * 2023-07-31 2023-11-07 支付宝(杭州)信息技术有限公司 Prosody annotation data generation method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070055526A1 (en) * 2005-08-25 2007-03-08 International Business Machines Corporation Method, apparatus and computer program product providing prosodic-categorical enhancement to phrase-spliced text-to-speech synthesis
CN106601228A (en) * 2016-12-09 2017-04-26 百度在线网络技术(北京)有限公司 Sample marking method and device based on artificial intelligence prosody prediction
CN109326281A (en) * 2018-08-28 2019-02-12 北京海天瑞声科技股份有限公司 Prosodic labeling method, apparatus and equipment
CN111226275A (en) * 2019-12-31 2020-06-02 深圳市优必选科技股份有限公司 Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction
CN111274807A (en) * 2020-02-03 2020-06-12 华为技术有限公司 Text information processing method and device, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN115862584A (en) 2023-03-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22871494

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE