CN116451658A - Text labeling method, text labeling device, computer equipment and storage medium


Info

Publication number
CN116451658A
CN116451658A (application CN202310331495.5A)
Authority
CN
China
Prior art keywords
audio
target
text
determining
dialogue
Prior art date
Legal status
Pending
Application number
CN202310331495.5A
Other languages
Chinese (zh)
Inventor
闫影
文博龙
李娜
陈海涛
李海
刘俊晖
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202310331495.5A
Publication of CN116451658A
Legal status: Pending

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/60 Information retrieval of audio data
                        • G06F16/63 Querying
                            • G06F16/635 Filtering based on additional data, e.g. user or group profiles
                        • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
                            • G06F16/683 Retrieval using metadata automatically derived from the content
                • G06F18/00 Pattern recognition
                    • G06F18/20 Analysing
                        • G06F18/22 Matching criteria, e.g. proximity measures
                • G06F40/00 Handling natural language data
                    • G06F40/10 Text processing
                        • G06F40/103 Formatting, i.e. changing of presentation of documents
                        • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
                        • G06F40/166 Editing, e.g. inserting or deleting
                            • G06F40/169 Annotation, e.g. comment data or footnotes
                    • G06F40/20 Natural language analysis
                        • G06F40/205 Parsing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
                • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a text labeling method, a text labeling device, computer equipment and a storage medium, wherein the method comprises the following steps: acquiring audio to be processed corresponding to a target video resource, wherein the audio to be processed is dialogue audio of each role in the target video resource; determining target objects in the roles, and determining corresponding first audio of the target objects in the audio to be processed; obtaining a dialogue text corresponding to the target video resource, and determining a target text corresponding to the first audio in the dialogue text; and labeling the target text in the dialogue text based on the object name of the target object.

Description

Text labeling method, text labeling device, computer equipment and storage medium
Technical Field
The disclosure relates to the technical field of computers, and in particular relates to a text labeling method, a text labeling device, computer equipment and a storage medium.
Background
With the vigorous development of the film and television drama market, the demand for dubbed translation is also increasing. Dubbed translation refers to translating the dialogue or narration of an original film into another language, and then dubbing the film in that language or mixing or superimposing that language onto the film. Based on this, it is necessary to determine a dialogue script of the film to be translated and instruct dubbing personnel to dub the film to be translated based on the dialogue script.
However, in the related scheme for determining the dialogue script of the film to be translated, the roles are split manually according to the subtitles of the film to be translated, and every line is associated with its role one by one so as to generate the dialogue script of the film to be translated, which results in low efficiency of determining the dialogue script and high time cost and labor cost.
Disclosure of Invention
The embodiment of the disclosure at least provides a text labeling method, a text labeling device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a text labeling method, including:
acquiring audio to be processed corresponding to a target video resource, wherein the audio to be processed is dialogue audio of each role in the target video resource;
determining target objects in the roles, and determining corresponding first audio of the target objects in the audio to be processed;
obtaining a dialogue text corresponding to the target video resource, and determining a target text corresponding to the first audio in the dialogue text;
and labeling the target text in the dialogue text based on the object name of the target object.
In an optional implementation manner, the determining the corresponding first audio of the target object in the audio to be processed includes:
Preprocessing the audio to be processed to obtain a plurality of sub-audios, wherein each sub-audio corresponds to at least one sentence of dialogue audio;
determining a target voiceprint feature of the target object;
calculating the matching degree of the audio features of each sub audio and the target voiceprint feature, and determining the audio to be confirmed, the matching degree of which meets the matching degree condition;
and determining a video frame corresponding to the audio to be confirmed in the target video resource, and determining the audio to be confirmed, which is matched with the target object, of the corresponding video frame as the first audio.
In an alternative embodiment, the determining, as the first audio, the audio to be confirmed whose corresponding video frame matches the target object includes:
when the number of the video frames is a plurality, performing image recognition processing based on each frame of video frame, and determining whether a target image frame including the target object is recognized;
and in the case that the target image frame is identified in any video frame, determining the audio to be confirmed as the first audio.
In an optional implementation manner, the determining the corresponding first audio of the target object in the audio to be processed includes:
Determining a second audio in the plurality of sub-audios, wherein the second audio is an audio corresponding to a plurality of dialogue audios;
sentence processing is carried out based on the second audio to obtain multi-sentence dialogue audio;
and determining dialogue audio with the matching degree of the voiceprint feature and the target voiceprint feature meeting the matching degree condition in the multiple sentences of dialogue audio, and determining the dialogue audio as the first audio.
In an optional implementation manner, the sentence processing based on the second audio to obtain multi-sentence dialogue audio includes:
identifying a mute segment in the second audio, wherein the mute segment is an audio segment that does not include dialogue content;
and carrying out sentence processing on the second audio based on the silence segments to obtain dialogue audio between adjacent silence segments.
In an alternative embodiment, the determining the target voiceprint feature of the target object includes:
identifying the marking point positions in the audio to be processed to obtain third audio, wherein the third audio is audio which is marked in advance through the marking point positions and is matched with the target object;
and extracting based on the voiceprint feature vector of the third audio to obtain the target voiceprint feature.
In an alternative embodiment, the determining the target text corresponding to the first audio in the dialogue text comprises:
determining corresponding time information of the first audio in the target film and television resource;
and determining the dialogue matched with the time information in the dialogue text as the target text.
In a second aspect, an embodiment of the present disclosure further provides a text labeling device, including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring audio to be processed corresponding to a target video resource, and the audio to be processed is used for indicating dialogue audio of each role in the target video resource;
the first determining unit is used for determining target objects in the roles and determining first audios corresponding to the target objects in the audios to be processed;
the second determining unit is used for obtaining the dialogue text corresponding to the target video resource and determining the target text corresponding to the first audio in the dialogue text;
and the labeling unit is used for labeling the target text in the dialogue text based on the object name of the target object.
In a third aspect, embodiments of the present disclosure further provide a computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect, or any of the possible implementations of the first aspect.
In a fourth aspect, the presently disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the first aspect, or any of the possible implementations of the first aspect.
The disclosure provides a text labeling method, a text labeling device, computer equipment and a storage medium. In the embodiment of the present disclosure, first, the audio to be processed corresponding to a target video resource may be obtained, where the audio to be processed may be the dialogue audio of the characters in the target video resource. Then, a target object may be determined among the characters, and a first audio corresponding to the target object may be determined in the audio to be processed. After that, the target text corresponding to the first audio may be determined in the dialogue text corresponding to the target video resource, and the target text may be labeled in the dialogue text based on the object name of the target object, so as to generate the dialogue script corresponding to the target video resource. In this way, there is no need to manually associate every line in the audio to be processed with its character one by one; the dialogue script is determined automatically once the target character is determined, which reduces the amount of manual work, improves the efficiency of determining the dialogue script, and greatly reduces time cost and labor cost.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the embodiments are briefly described below. These drawings, which are incorporated in and constitute a part of the specification, show embodiments consistent with the present disclosure and, together with the description, serve to illustrate the technical solutions of the present disclosure. It is to be understood that the following drawings illustrate only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope, since a person of ordinary skill in the art may obtain other related drawings from them without inventive effort.
FIG. 1 illustrates a flow chart of a text labeling method provided by an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of determining a corresponding first audio of a target object in audio to be processed provided by an embodiment of the present disclosure;
FIG. 3 illustrates an architecture diagram of a text labeling system provided by embodiments of the present disclosure;
FIG. 4 shows a schematic diagram of a text labeling device provided by an embodiment of the present disclosure;
Fig. 5 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The term "and/or" is used herein to describe only one relationship, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
It has been found that, with the vigorous development of the film and television drama market, the demand for dubbed translation is also increasing. Dubbed translation refers to translating the dialogue or narration of an original film into another language, and then dubbing the film in that language or mixing or superimposing that language onto the film. Based on this, it is necessary to determine a dialogue script of the film to be translated and instruct dubbing personnel to dub the film to be translated based on the dialogue script.
However, in the related scheme for determining the dialogue script of the film to be translated, the roles are split manually according to the subtitles of the film to be translated, and every line is associated with its role one by one so as to generate the dialogue script of the film to be translated, which results in low efficiency of determining the dialogue script and high time cost and labor cost.
Based on the above study, the present disclosure provides a text labeling method, apparatus, computer device, and storage medium. In the embodiment of the present disclosure, first, the audio to be processed corresponding to a target video resource may be obtained, where the audio to be processed may be the dialogue audio of the characters in the target video resource. Then, a target object may be determined among the characters, and a first audio corresponding to the target object may be determined in the audio to be processed. After that, the target text corresponding to the first audio may be determined in the dialogue text corresponding to the target video resource, and the target text may be labeled in the dialogue text based on the object name of the target object, so as to generate the dialogue script corresponding to the target video resource. In this way, there is no need to manually associate every line in the audio to be processed with its character one by one; the dialogue script is determined automatically once the target character is determined, which reduces the amount of manual work, improves the efficiency of determining the dialogue script, and greatly reduces time cost and labor cost.
For the sake of understanding the present embodiment, first, a detailed description will be given of a text labeling method disclosed in an embodiment of the present disclosure, where an execution body of the text labeling method provided in the embodiment of the present disclosure is generally a computer device with a certain computing capability. In some possible implementations, the text labeling method may be implemented by way of a processor invoking computer readable instructions stored in a memory.
Referring to fig. 1, a flowchart of a text labeling method according to an embodiment of the disclosure is shown, where the method includes steps S101 to S107, where:
s101: and obtaining the audio to be processed corresponding to the target video resource, wherein the audio to be processed is the dialogue audio of each role in the target video resource.
In the embodiment of the disclosure, the target video resource may be a film or television work to be dubbed, and the audio to be processed corresponding to the target video resource may be the whole dialogue track of the target work, that is, audio containing the dialogue between the characters. When the whole dialogue track is determined, audio such as scene sound in the target work can be filtered out, so as to obtain a track that only includes the dialogue between the characters.
S103: and determining target objects in the roles, and determining corresponding first audio of the target objects in the audio to be processed.
In the embodiment of the present disclosure, in consideration of dubbing cost and target text generation efficiency, a target object to be dubbed may be determined among the characters in the target work, where the target object is a character with a relatively large amount of dialogue in the target video resource, for example, a leading role in the target video resource.
Next, a first audio corresponding to the target object may be determined from the audio to be processed, and specifically, the first audio corresponding to the target object may be determined by a voiceprint similarity calculation module and a face recognition module.
In an optional implementation manner, the voiceprint similarity calculation module may calculate the audio to be confirmed that is matched with the voiceprint feature of the target object, and identify, by the face recognition module, a video frame corresponding to the audio to be confirmed in the target movie resource, and if a face of the target object is identified in the video frame, determine the audio to be confirmed as the first audio, and a manner of determining the first audio is described below, which is not repeated herein.
In another alternative embodiment, after the voiceprint similarity calculation module calculates the audio to be confirmed matched with the voiceprint feature of the target object, the face recognition module may identify the video frame corresponding to the audio to be confirmed in the target video resource.
And then, calculating the voiceprint similarity X between the audio to be confirmed and the standard audio of the target object through a voiceprint similarity calculation module, and scoring the video frame through a face recognition module to obtain the face similarity Y, wherein the higher the score is, the higher the similarity between the face recognized in the video frame and the target object is.
Then, a first weight x and a second weight y may be acquired, and an audio score Z of the audio to be confirmed may be calculated based on the voiceprint similarity X, the face similarity Y, the first weight x and the second weight y, where Z = xX + yY. It should be understood that the higher the audio score Z, the higher the matching degree between the audio to be confirmed and the target object; therefore, the audio to be confirmed whose audio score Z exceeds a score threshold can be determined as the first audio.
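For illustration only, the weighted scoring above can be sketched as follows; the concrete weight values, the threshold and the function names are assumptions and are not specified by the disclosure.

```python
def audio_score(voiceprint_sim: float, face_sim: float,
                weight_x: float = 0.6, weight_y: float = 0.4) -> float:
    """Combine voiceprint similarity X and face similarity Y into Z = xX + yY.

    The weights x = 0.6 and y = 0.4 are illustrative assumptions only.
    """
    return weight_x * voiceprint_sim + weight_y * face_sim


SCORE_THRESHOLD = 0.75  # assumed value; the disclosure does not give a number


def is_first_audio(voiceprint_sim: float, face_sim: float) -> bool:
    # The audio to be confirmed is taken as first audio when its score
    # exceeds the score threshold.
    return audio_score(voiceprint_sim, face_sim) > SCORE_THRESHOLD
```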
S105: and acquiring a text corresponding to the target video resource, and determining a target text corresponding to the first audio in the text.
In the embodiment of the present disclosure, the dialogue text corresponding to the target video resource may be obtained first, where the dialogue text includes the lines spoken between the characters in the target video resource. It should be understood that the dialogue text may be translated so that dubbing personnel dub the target video resource based on the translated text; for example, when the dialogue text is English text, it may be translated into Chinese text through a translation operation.
After the dialogue text is determined, a time node corresponding to the first audio in the target video resource can be obtained, and the dialogue matched with the time node is determined in the dialogue text, so that the dialogue is determined to be the target text corresponding to the first audio in the dialogue text.
S107: and labeling the target text in the dialogue text based on the object name of the target object.
In the embodiment of the disclosure, the object name of the target object can be obtained first; the object name can be the name of the character played by the target object in the target video resource. Then, the object name can be attached to the target text in the dialogue text through a labeling operation, so as to distinguish the lines belonging to the target object from the lines of the other characters in the dialogue text.
In addition, as is clear from the above, the target object is a character with a large amount of dialogue in the target video resource, for example, a leading role in the target video resource. Therefore, for other characters that are not the target object in the target work, such as supporting roles, the supporting roles can also be labeled in the above dialogue text; for example, when a supporting role has few lines, such characters can be uniformly labeled as a supporting role, so as to save data labeling workload, and a fixed or non-fixed dubbing person can be instructed to dub the non-target objects based on the labeled supporting roles.
After the target object and the non-target objects are labeled in the dialogue text, the labeled text can be used as a dialogue script, so that dubbing personnel dub the target video resource based on the dialogue script.
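A minimal sketch of this labeling step is given below, assuming the dialogue text is a plain list of line strings and that every non-target line is attributed to a generic supporting role; the data layout, the example lines and the function name are illustrative assumptions rather than part of the claimed method.

```python
def label_script(dialogue_lines, target_lines, object_name):
    """Label each line of the dialogue text with a speaker name.

    dialogue_lines: list of line strings from the dialogue text (assumed layout).
    target_lines: set of lines already matched to the target object's first audio.
    object_name: character name of the target object.
    Non-target lines are labeled uniformly as "supporting role", mirroring the
    uniform labeling described above.
    """
    labeled = []
    for line in dialogue_lines:
        speaker = object_name if line in target_lines else "supporting role"
        labeled.append(f"[{speaker}] {line}")
    return labeled


# Hypothetical usage:
script = label_script(
    ["Where have you been?", "Looking for you."],
    {"Looking for you."},
    "Alice",
)
# -> ["[supporting role] Where have you been?", "[Alice] Looking for you."]
```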
As can be seen from the foregoing description, in the embodiment of the present disclosure, first, the audio to be processed corresponding to the target video resource may be obtained, where the audio to be processed may be the dialogue audio of the characters in the target video resource. Then, a target object may be determined among the characters, and a first audio corresponding to the target object may be determined in the audio to be processed. After that, the target text corresponding to the first audio may be determined in the dialogue text corresponding to the target video resource, and the target text may be labeled in the dialogue text based on the object name of the target object, so as to generate the dialogue script corresponding to the target video resource. In this way, there is no need to manually associate every line in the audio to be processed with its character one by one; the dialogue script is determined automatically once the target character is determined, which reduces the amount of manual work, improves the efficiency of determining the dialogue script, and greatly reduces time cost and labor cost.
In an alternative embodiment, referring to fig. 2, which is a flowchart illustrating the step S103, determining a first audio corresponding to the target object in the audio to be processed, the step S103 specifically includes the following processes:
S1031: and preprocessing the audio to be processed to obtain a plurality of sub-audios, wherein each sub-audio corresponds to at least one sentence of dialogue audio.
In the embodiment of the disclosure, the clause module may preprocess the audio to be processed to obtain sub-audios each containing a single sentence or multiple sentences, where the preprocessing may be sentence-splitting processing. Specifically, the clause module may identify the sentence head and the sentence tail of a sub-audio; for example, a time node at which dialogue content resumes after no dialogue content has been detected for a preset period may be determined as the sentence head, and a time node after which no dialogue content is detected for a preset period may be determined as the sentence tail. Then, the audio between the sentence head and the sentence tail may be determined as the sub-audio.
In addition, the audio to be processed may be preprocessed based on the subtitle file, and specifically, the audio to be processed may be preprocessed based on a single sentence marked in the subtitle file, to obtain sub-audio including the single sentence dialogue audio.
S1032: and determining target voiceprint features of the target object.
S1033: and calculating the matching degree of the audio features of each sub audio and the target voiceprint feature, and determining the audio to be confirmed, the matching degree of which meets the matching degree condition.
In the embodiment of the disclosure, the matching degree between the audio feature of each sub-audio and the target voiceprint feature may be calculated by the voiceprint similarity calculation module, where the target voiceprint feature may be used to indicate the voiceprint feature vector of the target object. When calculating the matching degree between the audio feature of a sub-audio and the target voiceprint feature, the cosine similarity between the voiceprint feature vector of the target object and the voiceprint feature vector of the sub-audio may be calculated, and the cosine similarity is determined as the matching degree; the specific way of calculating the cosine similarity is not described in this disclosure.
Next, a matching degree condition may be acquired, a matching degree threshold may be determined based on the matching degree condition, a sub-audio having a matching degree with the target voiceprint feature higher than the matching degree threshold may be determined, and the sub-audio may be determined as the audio to be confirmed.
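The voiceprint matching described above can be sketched as follows; the threshold value, the function names and the use of NumPy are illustrative assumptions, not details fixed by the disclosure.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def select_audio_to_confirm(sub_audio_features, target_feature, threshold=0.8):
    """Return indices of sub-audios whose matching degree with the target
    voiceprint feature exceeds the matching-degree threshold (assumed 0.8)."""
    return [i for i, feat in enumerate(sub_audio_features)
            if cosine_similarity(feat, target_feature) > threshold]
```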
S1034: and determining a video frame corresponding to the audio to be confirmed in the target video resource, and determining the audio to be confirmed, which is matched with the target object, of the corresponding video frame as the first audio.
In the embodiment of the present disclosure, a time node corresponding to the audio to be confirmed in the target movie resource may be obtained, where the time node may include a sentence head and a sentence tail. Then, the video frames between the head and the tail of the sentence can be determined as the video frames corresponding to the audio to be confirmed in the target video resource.
After the video frame is determined, the face recognition module can be used for carrying out face recognition processing on the video frame, when the face of the target object is recognized, the video frame is determined to be matched with the target object, and the audio to be confirmed is determined to be the first audio.
In the embodiment of the disclosure, the audio to be confirmed with the audio characteristics matched with the target voiceprint characteristics can be determined through the voiceprint similarity calculation module, and the audio to be confirmed is subjected to auxiliary verification through the face recognition module, so that the first audio is determined, and the accuracy of the determined first audio is improved.
In an optional embodiment, step S1034, determining, as the first audio, the audio to be confirmed whose corresponding video frame matches the target object, specifically includes the following steps:
s11: when the number of video frames is plural, image recognition processing is performed on a per-frame video frame basis, and it is determined whether or not a target image frame including the target object is recognized.
S12: and in the case that the target image frame is identified in any video frame, determining the audio to be confirmed as the first audio.
In the embodiment of the present disclosure, considering the duration of the audio to be confirmed, the audio to be confirmed may correspond to multiple video frames in the target video resource; for example, if the duration of the audio to be confirmed is 2 s, the number of corresponding video frames may be 48. However, within those 2 s the shot of the target video resource may switch, that is, the picture may move from the target object corresponding to the audio to be confirmed to other characters, so the 48 video frames do not necessarily all include the target object.
Based on this, the above face recognition module can perform image recognition processing on each video frame, and after the face of the target object is recognized, the face of the target object can be enclosed by a target image frame. Thus, once the target object is identified in a video frame, the target object may be considered to be included in the picture presented by that video frame.
When the target image frame is identified in any video frame corresponding to the audio to be confirmed, it can be determined that the target video resource displays the target object within the duration of the audio to be confirmed, and the possibility that the audio to be confirmed is the dialogue of the target object is extremely high. Therefore, in the case where the target image frame is recognized in any video frame corresponding to the audio to be confirmed, the audio to be confirmed may be determined as the first audio.
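A minimal sketch of this per-frame check follows; `detect_target_face` stands in for the face recognition module and is an assumed callable, not an API defined by the disclosure.

```python
def contains_target(video_frames, detect_target_face) -> bool:
    """Return True as soon as the target object's face is detected in any of
    the video frames corresponding to the audio to be confirmed."""
    return any(detect_target_face(frame) for frame in video_frames)


def confirm_first_audio(candidate_audio, video_frames, detect_target_face):
    # The audio to be confirmed is taken as first audio if the target object
    # appears in at least one corresponding video frame; otherwise it is dropped.
    if contains_target(video_frames, detect_target_face):
        return candidate_audio
    return None
```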
In the embodiment of the present disclosure, considering the duration of the audio to be confirmed, the audio to be confirmed may have a plurality of corresponding video frames in the target video resource, so that the audio to be confirmed may be determined as the first audio when the target image frame is identified in any video frame, thereby improving the application range of the present disclosure and perfecting the algorithm logic for determining the audio to be confirmed in the present disclosure.
In an optional embodiment, in step S1034, determining, as the first audio, the audio to be confirmed whose corresponding video frame matches the target object further includes the following steps:
s21: and determining a second audio in the plurality of sub-audio, wherein the second audio is audio corresponding to the multi-sentence dialogue audio.
In the embodiment of the present disclosure, it is considered that the target video resource may contain dialogue scenes in which multiple characters speak at the same time, for example, a scene in which several characters quarrel. For such a dialogue scene, the clause module may identify a sub-audio containing multiple sentences of dialogue audio; for example, the sub-audio may include the audio of object A and the audio of object B.
Based on this, the second audio, which includes multiple sentences of dialogue audio, may be identified among the plurality of sub-audios; specifically, if more than one voiceprint feature is identified in a sub-audio, that sub-audio may be regarded as the second audio.
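One possible sketch of this check is given below, assuming the sub-audio is cut into short windows and that a window belonging to a different speaker is signalled by a low matching degree against the first window; the windowing scheme and threshold are assumptions for illustration.

```python
def is_second_audio(window_voiceprints, similarity, threshold=0.8):
    """Decide whether a sub-audio is a 'second audio', i.e. whether more than
    one voiceprint feature is identified in it.

    window_voiceprints: voiceprint vectors extracted from short windows of the
    sub-audio (assumed windowing scheme).
    similarity: matching-degree function, e.g. a cosine similarity.
    threshold: assumed matching-degree value below which two windows are
    treated as different speakers.
    """
    if len(window_voiceprints) < 2:
        return False
    reference = window_voiceprints[0]
    return any(similarity(reference, vp) < threshold
               for vp in window_voiceprints[1:])
```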
S22: and performing sentence dividing processing based on the second audio to obtain multi-sentence dialogue audio.
S23: and determining dialogue audio with the matching degree of the voiceprint feature and the target voiceprint feature meeting the matching degree condition in the multiple sentences of dialogue audio, and determining the dialogue audio as the first audio.
After the second audio is determined, the sentence processing may be further performed on the second audio by using the sentence module to obtain a plurality of dialogue audios including a single sentence, where a specific sentence processing manner is described below, and is not described herein.
Next, a dialogue audio in which the matching degree between the voiceprint feature and the target voiceprint feature satisfies the matching degree condition may be determined by the voiceprint similarity calculation module in the multiple-sentence dialogue audio, so as to determine the dialogue audio as the first audio, where the manner of performing the voiceprint feature matching by the voiceprint similarity calculation module is described in the embodiment corresponding to step S103, and is not repeated herein.
In the embodiment of the disclosure, considering that the target video resource may contain dialogue scenes in which multiple characters speak at the same time, for example a scene in which several characters quarrel, the clause module may identify sub-audios that include multiple sentences of dialogue audio for such scenes. Based on this, the second audio including multiple sentences of dialogue audio can be identified among the plurality of sub-audios and subjected to clause processing, so that the first audio is determined among the resulting dialogue audios. In this way, the content of the finally generated target text is more complete, and the possibility of missing lines of the target object is reduced.
In an optional embodiment, step S22, performing sentence processing based on the second audio to obtain a multi-sentence dialogue audio, specifically includes the following steps:
(1) Identifying a mute segment in the second audio, wherein the mute segment is an audio segment that does not include dialogue content;
(2) And carrying out sentence processing on the second audio based on the silence segments to obtain dialogue audio between adjacent silence segments.
In the embodiment of the disclosure, the second audio is segmented in a dynamic clause manner in which the clause length is dynamically variable. Specifically, a clause operation can first be performed on the second audio according to the duration indicated by a preset clause length, and when that clause length ends, whether the ending node falls within a silence segment can be identified.
If the ending node is not within a silence segment, the clause operation corresponding to the next clause length is started, and the clause operation ends once a silence segment is identified, yielding the dialogue audio. If the ending node is within a silence segment, the clause operation can be ended directly and the dialogue audio obtained. In this way, the dialogue audio between adjacent silence segments can be obtained from the second audio.
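One reading of this dynamic clause procedure is sketched below; the frame-based representation, the extension granularity after the preset clause length, and the default length are assumptions, since the disclosure does not fix these details.

```python
def dynamic_clause_split(frames, is_silent, clause_len=200):
    """Split the second audio into per-sentence dialogue audio.

    frames: list of short audio frames (assumed representation).
    is_silent(frame): True when a frame contains no dialogue content.
    clause_len: assumed preset clause length, in frames.
    After each preset clause length, if the ending node is not inside a
    silence segment, the split point keeps moving forward until silence is
    reached; if it is inside silence, the clause ends there.
    """
    segments = []
    start = 0
    while start < len(frames):
        end = min(start + clause_len, len(frames))
        # Extend until the ending node falls inside a silence segment.
        while end < len(frames) and not is_silent(frames[end]):
            end += 1
        segments.append(frames[start:end])
        # Skip over the silence segment before the next sentence begins.
        while end < len(frames) and is_silent(frames[end]):
            end += 1
        start = end
    return segments
```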
In the embodiment of the disclosure, the second audio may be divided based on a silence segment that does not include dialogue content in the second audio, so as to obtain dialogue audio including a single dialogue, thereby providing a technical basis for the implementation of calculating the voiceprint feature and the target voiceprint feature of each dialogue audio by the voiceprint similarity calculation module.
In an optional embodiment, step S1032, the determining the target voiceprint feature of the target object specifically includes the following steps:
s31: and identifying the marking point positions in the audio to be processed to obtain third audio, wherein the third audio is audio which is marked in advance through the marking point positions and is matched with the target object.
S32: and extracting based on the voiceprint feature vector of the third audio to obtain the target voiceprint feature.
In the embodiment of the present disclosure, after determining a target object, the audio to be processed may be labeled based on the target object, where the dialog of the target object may be labeled by the labeling point, and specifically, the dialog audio of the target object may be labeled by the starting point and the ending point, so as to obtain the third audio. Here, a preset number of dialogue audios may be marked, and the present disclosure is not limited to a specific number of dialogue audios.
Based on this, the labeling points in the audio to be processed may be identified to obtain the dialogue audio of the target object, that is, the third audio; the voiceprint similarity calculation module may then be trained based on the third audio, and specifically, the voiceprint similarity calculation module may extract the voiceprint feature vector of the third audio to obtain the target voiceprint feature of the target object.
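A minimal sketch of obtaining the target voiceprint feature from the pre-labeled points follows; the sample-index layout, the `embed` extractor and the averaging of per-segment vectors are illustrative assumptions rather than details given by the disclosure.

```python
import numpy as np


def extract_target_voiceprint(audio, label_points, embed):
    """Obtain the target voiceprint feature from the pre-labeled third audio.

    audio: the audio to be processed, indexable by sample (assumed layout).
    label_points: list of (start, end) labeling points marking the target
        object's lines, i.e. the third audio.
    embed: voiceprint feature extractor returning a vector per segment
        (assumed callable standing in for the voiceprint module).
    """
    segments = [audio[start:end] for start, end in label_points]
    vectors = [embed(segment) for segment in segments]
    # Averaging the per-segment vectors is an assumed aggregation strategy.
    return np.mean(vectors, axis=0)
```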
In the embodiment of the disclosure, the identification can be performed based on the labeling point positions in the audio to be processed to obtain the third audio of the dialogue containing the target object, so that the voiceprint similarity calculation module is trained based on the third audio to improve the accuracy of the voiceprint similarity calculation module in calculating the matching degree of each sub-audio and the third audio.
In an optional embodiment, the step S105, determining the target text corresponding to the first audio in the text, specifically includes the following steps:
s1051: and determining corresponding time information of the first audio in the target film and television resource.
S1052: and determining the dialogue matched with the time information in the dialogue text as the target text.
In the embodiment of the present disclosure, the time information may include a start time and an end time of the first audio, where the time information is determined with respect to a playing progress of the target movie resource, for example, in the time information, the playing progress corresponding to the start time may be 17:28, and the playing progress corresponding to the end time may be 17:30.
After the time information is determined, the line matched with the time information can be determined in the dialogue text. Specifically, the dialogue text includes the lines of the characters and the time corresponding to each line, so that the line whose time matches the time information can be determined and taken as the target text.
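The time-based matching can be sketched as follows; the tuple layout of the dialogue text and the function name are illustrative assumptions.

```python
def find_target_text(dialogue_text, start_time, end_time):
    """Pick the lines whose time falls within the first audio's time span.

    dialogue_text: list of (line_start, line_end, line) tuples, with times
        measured against the playback progress of the target video resource
        (assumed layout).
    start_time, end_time: the time information of the first audio.
    """
    return [line for line_start, line_end, line in dialogue_text
            if line_start >= start_time and line_end <= end_time]
```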
In the embodiment of the disclosure, based on the time information corresponding to the first audio in the target video resource, the dialogue matched with the time information can be determined in the dialogue text, and the dialogue is determined as the target text, so that the object name corresponding to the target text is marked in the dialogue text, and the dubbing by a dubbing person is facilitated.
In summary, in the embodiment of the present disclosure, first, the audio to be processed corresponding to a target video resource may be obtained, where the audio to be processed may be the dialogue audio of the characters in the target video resource. Then, a target object may be determined among the characters, and a first audio corresponding to the target object may be determined in the audio to be processed. After that, the target text corresponding to the first audio may be determined in the dialogue text corresponding to the target video resource, and the target text may be labeled in the dialogue text based on the object name of the target object, so as to generate the dialogue script corresponding to the target video resource. In this way, there is no need to manually associate every line in the audio to be processed with its character one by one; the dialogue script is determined automatically once the target character is determined, which reduces the amount of manual work, improves the efficiency of determining the dialogue script, and greatly reduces time cost and labor cost.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
Referring to fig. 3, an architecture diagram of a text labeling system 30 according to an embodiment of the disclosure is provided, where the text labeling system 30 includes: a clause module 31, a voiceprint similarity calculation module 32 and a face recognition module 33.
The sentence module 31 performs preprocessing based on the audio to be processed to obtain a plurality of sub-audios, where each sub-audio corresponds to at least one sentence of dialogue audio.
In the embodiment of the present disclosure, the process of preprocessing the audio to be processed by the clause module is described in the embodiment corresponding to step S103, which is not described herein. In addition, the sentence module may further determine a second audio in the plurality of sub-audios, where the second audio is an audio corresponding to the plurality of sentence dialogue audios, and perform sentence processing based on the second audio to obtain the plurality of sentence dialogue audios, and a process of performing sentence processing on the second audio is described in the embodiment corresponding to step S1034.
The voiceprint similarity calculation module 32 determines a target voiceprint feature of the target object, calculates a matching degree between the audio feature of each sub-audio and the target voiceprint feature, and determines an audio to be confirmed whose matching degree satisfies a matching degree condition.
In the embodiment of the present disclosure, the process of determining the audio to be confirmed is described in the embodiment corresponding to step S103, which is not described herein.
The face recognition module 33 determines a video frame corresponding to the audio to be confirmed in the target video resource, and determines the audio to be confirmed, which is matched with the target object, as the first audio.
In the embodiment of the present disclosure, the process of performing face recognition on the video frame corresponding to the audio to be confirmed in the target movie resource is described in the embodiment corresponding to step S103, which is not described herein again.
In the embodiment of the present disclosure, first, the audio to be processed corresponding to a target video resource may be obtained, where the audio to be processed may be the dialogue audio of the characters in the target video resource. Then, a target object may be determined among the characters, and a first audio corresponding to the target object may be determined in the audio to be processed. After that, the target text corresponding to the first audio may be determined in the dialogue text corresponding to the target video resource, and the target text may be labeled in the dialogue text based on the object name of the target object, so as to generate the dialogue script corresponding to the target video resource. In this way, there is no need to manually associate every line in the audio to be processed with its character one by one; the dialogue script is determined automatically once the target character is determined, which reduces the amount of manual work, improves the efficiency of determining the dialogue script, and greatly reduces time cost and labor cost.
Based on the same inventive concept, the embodiments of the present disclosure further provide a text labeling device corresponding to the text labeling method, and since the principle of solving the problem by the device in the embodiments of the present disclosure is similar to that of the text labeling method in the embodiments of the present disclosure, the implementation of the device may refer to the implementation of the method, and the repetition is omitted.
The process flow of each module in the system and the interaction flow between each module may be described with reference to the related description in the above method embodiment, which is not described in detail herein.
Referring to fig. 4, a schematic diagram of a text labeling device according to an embodiment of the disclosure is shown, where the device includes: an acquisition unit 41, a first determining unit 42, a second determining unit 43 and a labeling unit 44; wherein:
an obtaining unit 41, configured to obtain audio to be processed corresponding to a target video resource, where the audio to be processed is used to indicate dialogue audio of each character in the target video resource;
a first determining unit 42, configured to determine a target object in each character, and determine a first audio corresponding to the target object in the audio to be processed;
a second determining unit 43, configured to obtain the dialogue text corresponding to the target video resource, and determine the target text corresponding to the first audio in the dialogue text;
And the labeling unit 44 is used for labeling the target text in the dialog text based on the object name of the target object.
In the embodiment of the present disclosure, first, the audio to be processed corresponding to a target video resource may be obtained, where the audio to be processed may be the dialogue audio of the characters in the target video resource. Then, a target object may be determined among the characters, and a first audio corresponding to the target object may be determined in the audio to be processed. After that, the target text corresponding to the first audio may be determined in the dialogue text corresponding to the target video resource, and the target text may be labeled in the dialogue text based on the object name of the target object, so as to generate the dialogue script corresponding to the target video resource. In this way, there is no need to manually associate every line in the audio to be processed with its character one by one; the dialogue script is determined automatically once the target character is determined, which reduces the amount of manual work, improves the efficiency of determining the dialogue script, and greatly reduces time cost and labor cost.
In a possible implementation manner, the first determining unit 42 is further configured to:
preprocessing the audio to be processed to obtain a plurality of sub-audios, wherein each sub-audio corresponds to at least one sentence of dialogue audio;
Determining a target voiceprint feature of the target object;
calculating the matching degree of the audio features of each sub audio and the target voiceprint feature, and determining the audio to be confirmed, the matching degree of which meets the matching degree condition;
and determining a video frame corresponding to the audio to be confirmed in the target video resource, and determining the audio to be confirmed, which is matched with the target object, of the corresponding video frame as the first audio.
In a possible implementation manner, the first determining unit 42 is further configured to:
when the number of the video frames is a plurality, performing image recognition processing based on each frame of video frame, and determining whether a target image frame including the target object is recognized;
and in the case that the target image frame is identified in any video frame, determining the audio to be confirmed as the first audio.
In a possible implementation manner, the first determining unit 42 is further configured to:
determining a second audio in the plurality of sub-audios, wherein the second audio is an audio corresponding to a plurality of dialogue audios;
sentence processing is carried out based on the second audio to obtain multi-sentence dialogue audio;
and determining dialogue audio with the matching degree of the voiceprint feature and the target voiceprint feature meeting the matching degree condition in the multiple sentences of dialogue audio, and determining the dialogue audio as the first audio.
In a possible implementation manner, the first determining unit 42 is further configured to:
identifying a mute segment in the second audio, wherein the mute segment is an audio segment that does not include dialogue content;
and carrying out sentence processing on the second audio based on the silence segments to obtain dialogue audio between adjacent silence segments.
In a possible implementation manner, the first determining unit 42 is further configured to:
identifying the marking point positions in the audio to be processed to obtain third audio, wherein the third audio is audio which is marked in advance through the marking point positions and is matched with the target object;
and extracting based on the voiceprint feature vector of the third audio to obtain the target voiceprint feature.
In a possible embodiment, the second determining unit 43 is further configured to:
determining corresponding time information of the first audio in the target film and television resource;
and identifying the content of the first audio, and labeling the identification result based on the time information to obtain the text information.
The process flow of each unit in the apparatus and the interaction flow between units may be described with reference to the related descriptions in the above method embodiments, which are not described in detail herein.
Corresponding to the text labeling method in fig. 1, the embodiment of the present disclosure further provides a computer device 500, as shown in fig. 5, which is a schematic structural diagram of the computer device 500 provided in the embodiment of the present disclosure, including:
a processor 51, a memory 52, and a bus 53; memory 52 is used to store execution instructions, including memory 521 and external storage 522; the memory 521 is also referred to as an internal memory, and is used for temporarily storing operation data in the processor 51 and data exchanged with the external memory 522 such as a hard disk, and the processor 51 exchanges data with the external memory 522 through the memory 521, and when the computer device 500 is operated, the processor 51 and the memory 52 communicate with each other through the bus 53, so that the processor 51 executes the following instructions:
acquiring audio to be processed corresponding to a target video resource, wherein the audio to be processed is dialogue audio of each role in the target video resource;
determining target objects in the roles, and determining corresponding first audio of the target objects in the audio to be processed;
obtaining a dialogue text corresponding to the target video resource, and determining a target text corresponding to the first audio in the dialogue text;
And labeling the target text in the dialogue text based on the object name of the target object.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the text labeling method described in the method embodiments above. Wherein the storage medium may be a volatile or nonvolatile computer readable storage medium.
Embodiments of the present disclosure further provide a computer program product, where the computer program product carries program code, where instructions included in the program code may be used to perform steps of the text labeling method described in the foregoing method embodiments, and specific reference may be made to the foregoing method embodiments, which are not described herein.
Wherein the above-mentioned computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the foregoing embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed herein, the technical solutions described in the foregoing embodiments may still be modified, changes may readily be conceived, or equivalent substitutions may be made for some of the technical features; such modifications, changes, or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure and shall all be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A method for labeling text, comprising:
acquiring audio to be processed corresponding to a target video resource, wherein the audio to be processed is dialogue audio of each role in the target video resource;
determining a target object among the roles, and determining first audio corresponding to the target object in the audio to be processed;
obtaining a dialogue text corresponding to the target video resource, and determining a target text corresponding to the first audio in the dialogue text;
and labeling the target text in the dialogue text based on the object name of the target object.
2. The method of claim 1, wherein the determining first audio corresponding to the target object in the audio to be processed comprises:
preprocessing the audio to be processed to obtain a plurality of sub-audios, wherein each sub-audio corresponds to at least one sentence of dialogue audio;
determining a target voiceprint feature of the target object;
calculating a matching degree between the audio feature of each sub-audio and the target voiceprint feature, and determining, as audio to be confirmed, the sub-audio whose matching degree meets a matching degree condition;
and determining a video frame corresponding to the audio to be confirmed in the target video resource, and determining the audio to be confirmed whose corresponding video frame matches the target object as the first audio.
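A minimal numeric sketch of the matching-degree comparison in claim 2, assuming the matching degree is the cosine similarity between voiceprint feature vectors and that the matching degree condition is a fixed threshold; neither choice is prescribed by the claim.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Matching degree between two voiceprint feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def audio_to_be_confirmed(sub_audio_features, target_feature, threshold=0.75):
    """Indices of sub-audios whose matching degree meets the matching degree condition."""
    return [i for i, feat in enumerate(sub_audio_features)
            if cosine_similarity(feat, target_feature) >= threshold]

# Toy 4-dimensional "voiceprints": the first sub-audio resembles the target speaker.
target = np.array([0.9, 0.1, 0.0, 0.4])
sub_audios = [np.array([0.88, 0.12, 0.05, 0.41]),
              np.array([0.05, 0.95, 0.30, 0.02])]
print(audio_to_be_confirmed(sub_audios, target))   # -> [0]
```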
3. The method of claim 2, wherein the determining the audio to be confirmed whose corresponding video frame matches the target object as the first audio comprises:
when there are a plurality of the video frames, performing image recognition processing on each video frame, and determining whether a target image frame including the target object is recognized;
and in the case that the target image frame is identified in any video frame, determining the audio to be confirmed as the first audio.
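One way to read the confirmation step of claims 2 and 3 in code: the audio to be confirmed is kept as the first audio only if the target object is recognized in at least one of its corresponding video frames. The contains_target callback stands in for whatever image (for example face) recognition model is used, which the claims do not specify.

```python
from typing import Any, Callable, Iterable

def confirm_as_first_audio(video_frames: Iterable[Any],
                           contains_target: Callable[[Any], bool]) -> bool:
    """True if a target image frame including the target object is recognized in any frame."""
    return any(contains_target(frame) for frame in video_frames)

# Toy usage: frames represented as captions, with a trivial stand-in recognizer.
frames = ["crowd scene", "close-up of the hero", "empty street"]
print(confirm_as_first_audio(frames, lambda f: "hero" in f))   # -> True
```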
4. The method of claim 2, wherein the determining first audio corresponding to the target object in the audio to be processed comprises:
determining second audio among the plurality of sub-audios, wherein the second audio is a sub-audio corresponding to a plurality of dialogue audios;
performing sentence segmentation on the second audio to obtain multiple sentences of dialogue audio;
and determining, among the multiple sentences of dialogue audio, the dialogue audio whose voiceprint feature has a matching degree with the target voiceprint feature that meets the matching degree condition, and determining that dialogue audio as the first audio.
5. The method of claim 4, wherein the performing sentence segmentation on the second audio to obtain multiple sentences of dialogue audio comprises:
identifying a silence segment in the second audio, wherein the silence segment is an audio segment that does not include dialogue content;
and performing sentence segmentation on the second audio based on the silence segments to obtain the dialogue audio between adjacent silence segments.
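Claims 4 and 5 split the second audio into sentences of dialogue audio at silence segments. The sketch below uses a simple frame-energy heuristic for silence detection; the frame length, energy threshold, and minimum silence duration are assumptions, and a production system would more likely use a trained voice-activity detector.

```python
import numpy as np

def split_on_silence(samples: np.ndarray, sample_rate: int, frame_ms: int = 20,
                     energy_threshold: float = 1e-4, min_silence_ms: int = 300):
    """Return (start_sample, end_sample) spans of dialogue audio between silence segments."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    energies = np.array([np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)
                         for i in range(n_frames)])
    silent = energies < energy_threshold
    min_silent_frames = max(1, min_silence_ms // frame_ms)

    segments, start, silent_run = [], None, 0
    for i, is_silent in enumerate(silent):
        if is_silent:
            silent_run += 1
            # Close the current dialogue segment once the silence is long enough.
            if silent_run >= min_silent_frames and start is not None:
                segments.append((start * frame_len, (i - silent_run + 1) * frame_len))
                start = None
        else:
            if start is None:
                start = i            # a new dialogue segment begins
            silent_run = 0
    if start is not None:            # trailing dialogue not followed by silence
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```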
6. The method of claim 2, wherein the determining the target voiceprint feature of the target object comprises:
identifying marking points in the audio to be processed to obtain third audio, wherein the third audio is audio that is marked in advance through the marking points and matches the target object;
and extracting a voiceprint feature vector from the third audio to obtain the target voiceprint feature.
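A short sketch of one way to realize the extraction step of claim 6: average the voiceprint feature vectors extracted from the pre-marked (third) audio segments and normalize the result. Averaging and L2-normalization are assumptions made here for illustration; the claim only requires that the target voiceprint feature be obtained from the third audio.

```python
import numpy as np
from typing import List

def target_voiceprint(third_audio_embeddings: List[np.ndarray]) -> np.ndarray:
    """Aggregate per-segment voiceprint feature vectors into one target voiceprint feature."""
    mean = np.stack(third_audio_embeddings).mean(axis=0)
    return mean / np.linalg.norm(mean)   # unit length keeps cosine matching well-behaved

# Toy example: two segments marked as spoken by the target object.
print(target_voiceprint([np.array([0.9, 0.1, 0.3]), np.array([0.8, 0.2, 0.4])]))
```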
7. The method of claim 1, wherein the determining a target text corresponding to the first audio in the dialogue text comprises:
determining time information corresponding to the first audio in the target video resource;
and determining the dialogue in the dialogue text that matches the time information as the target text.
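A minimal sketch of the time matching in claim 7, assuming the dialogue text is a list of (start, end, text) entries with timestamps in seconds and that "matched with the time information" means the time ranges overlap; both assumptions are illustrative rather than required by the claim.

```python
from typing import List, Tuple

Subtitle = Tuple[float, float, str]   # (start_seconds, end_seconds, dialogue line)

def match_target_text(dialogue_text: List[Subtitle],
                      first_audio_spans: List[Tuple[float, float]]) -> List[str]:
    """Dialogue lines whose time information overlaps the time spans of the first audio."""
    return [text for start, end, text in dialogue_text
            if any(start < span_end and span_start < end
                   for span_start, span_end in first_audio_spans)]

subs = [(12.0, 14.5, "Where have you been?"),
        (15.0, 17.0, "I was at the harbour.")]
print(match_target_text(subs, [(14.8, 17.2)]))   # -> ['I was at the harbour.']
```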
8. A text labeling device, comprising:
an acquisition unit, used for acquiring audio to be processed corresponding to a target video resource, wherein the audio to be processed is dialogue audio of each role in the target video resource;
a first determining unit, used for determining a target object among the roles and determining first audio corresponding to the target object in the audio to be processed;
a second determining unit, used for obtaining the dialogue text corresponding to the target video resource and determining the target text corresponding to the first audio in the dialogue text;
and a labeling unit, used for labeling the target text in the dialogue text based on the object name of the target object.
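The unit division of claim 8 can be mirrored by a simple class skeleton; the method bodies below are placeholders, since the claim fixes only the responsibilities of the units, not their implementation.

```python
class TextLabelingDevice:
    """Skeleton mirroring the units of claim 8; all bodies are placeholders."""

    def acquire_audio(self, target_video_resource):
        """Acquisition unit: obtain the audio to be processed (dialogue audio of each role)."""
        raise NotImplementedError

    def determine_first_audio(self, audio_to_process, target_object):
        """First determining unit: find the first audio corresponding to the target object."""
        raise NotImplementedError

    def determine_target_text(self, target_video_resource, first_audio):
        """Second determining unit: obtain the dialogue text and locate the target text."""
        raise NotImplementedError

    def label(self, dialogue_text, target_text, object_name):
        """Labeling unit: label the target text with the object name of the target object."""
        raise NotImplementedError
```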
9. A computer device, comprising: a processor, a memory and a bus, said memory storing machine readable instructions executable by said processor, said processor and said memory communicating over the bus when the computer device is running, said machine readable instructions when executed by said processor performing the steps of the text labeling method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the text labeling method of any of claims 1 to 7.
CN202310331495.5A 2023-03-30 2023-03-30 Text labeling method, text labeling device, computer equipment and storage medium Pending CN116451658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310331495.5A CN116451658A (en) 2023-03-30 2023-03-30 Text labeling method, text labeling device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310331495.5A CN116451658A (en) 2023-03-30 2023-03-30 Text labeling method, text labeling device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116451658A true CN116451658A (en) 2023-07-18

Family

ID=87124880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310331495.5A Pending CN116451658A (en) 2023-03-30 2023-03-30 Text labeling method, text labeling device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116451658A (en)

Similar Documents

Publication Publication Date Title
US11322154B2 (en) Diarization using linguistic labeling
CN108536654B (en) Method and device for displaying identification text
CN109887497B (en) Modeling method, device and equipment for speech recognition
US7676373B2 (en) Displaying text of speech in synchronization with the speech
CN107562760B (en) Voice data processing method and device
CN109686383B (en) Voice analysis method, device and storage medium
US9588967B2 (en) Interpretation apparatus and method
US9818450B2 (en) System and method of subtitling by dividing script text into two languages
EP3779971A1 (en) Method for recording and outputting conversation between multiple parties using voice recognition technology, and device therefor
CN110750996B (en) Method and device for generating multimedia information and readable storage medium
US9251808B2 (en) Apparatus and method for clustering speakers, and a non-transitory computer readable medium thereof
CN111785275A (en) Voice recognition method and device
CN109102824B (en) Voice error correction method and device based on man-machine interaction
Kopparapu Non-linguistic analysis of call center conversations
CN111883137A (en) Text processing method and device based on voice recognition
US20240020489A1 (en) Providing subtitle for video content in spoken language
CN111881297A (en) Method and device for correcting voice recognition text
CN108831503B (en) Spoken language evaluation method and device
CN116451658A (en) Text labeling method, text labeling device, computer equipment and storage medium
CN110428668B (en) Data extraction method and device, computer system and readable storage medium
CN114203180A (en) Conference summary generation method and device, electronic equipment and storage medium
CN108959163B (en) Subtitle display method for audio electronic book, electronic device and computer storage medium
CN113763921B (en) Method and device for correcting text
CN113051985B (en) Information prompting method, device, electronic equipment and storage medium
US20240185844A1 (en) Context-aware end-to-end asr fusion of context, acoustic and text presentations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination