CN115914742A - Method, device and equipment for identifying characters of video subtitles and storage medium
- Publication number: CN115914742A
- Application number: CN202211582799.0A
- Authority: CN (China)
- Legal status: Granted
Abstract
Embodiments of the invention provide a method, a device, equipment and a storage medium for identifying the characters of video subtitles, applied to the technical field of video processing. The method comprises the following steps: performing character recognition on a video to be recognized to obtain a plurality of subtitle texts; for each subtitle text, determining the corresponding characters contained in the video frame set to which the subtitle text belongs, and determining a first probability that the subtitle text belongs to each corresponding character; performing voiceprint segmentation and clustering on the audio in the video to be recognized to obtain at least one voiceprint cluster; determining the correspondence in time sequence between each subtitle text and each voiceprint fragment; regarding the voiceprint fragments in the same voiceprint cluster as voiceprint fragments of the same character, and determining the character corresponding to each voiceprint cluster based on the first probabilities and the correspondence; and determining the character corresponding to each subtitle text according to the characters corresponding to the voiceprint clusters and the correspondence. By applying the embodiments of the invention, the character corresponding to each subtitle text can be accurately identified.
Description
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for identifying a person in a video subtitle.
Background
Subtitles are non-video content such as dialogue displayed in text form in television, film, and stage works. In general, subtitles in a video do not include the names of the speakers. When video data containing subtitles is processed, for example when intelligent titles are generated for a clipped film or television episode, the subtitle information only reveals the events occurring in the video data; the person involved in each event cannot be accurately located. This lack of character information in video data hinders the application of algorithms such as Natural Language Processing (NLP) in video processing scenarios. It is therefore important to identify the speaker corresponding to each subtitle in a video.
In the existing method for identifying the speaker corresponding to subtitles in a video, the audio data in the video data is extracted and cut into a plurality of audio segments, and voiceprint recognition is performed on each audio segment with a voiceprint recognition model to obtain the character corresponding to the voiceprint features of each segment. The voiceprint recognition model is trained on sample audio data and the character labels corresponding to that sample audio data. Training such a model requires a large amount of labeled audio data, so the labeling workload is heavy. Moreover, the characters differ from one film or television series to another, so recognizing specific characters with a voiceprint recognition model requires training a different model for each series, which is cumbersome and difficult to realize.
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, a device, and a storage medium for identifying the characters of video subtitles, which can identify the characters of video subtitles without labeling large amounts of audio data and without training different voiceprint recognition models for different films and television series. The specific technical scheme is as follows:
in a first aspect of the present invention, a method for identifying a person in a video subtitle is provided, including:
performing character recognition on a video to be recognized to obtain a plurality of subtitle texts, wherein the subtitle texts correspond to subtitles in the video to be recognized one by one;
for each subtitle text, determining a corresponding character contained in a video frame set to which the subtitle text belongs, and determining a first probability that the subtitle text belongs to the corresponding character;
carrying out voiceprint segmentation clustering on the audio in the video to be identified to obtain at least one voiceprint cluster; each voiceprint cluster comprises at least one section of voiceprint fragment, and the voiceprint fragment is obtained by carrying out voiceprint segmentation on the audio;
determining the corresponding relation of each subtitle text and each voiceprint fragment in time sequence;
regarding the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same person, and determining the person corresponding to each voiceprint cluster based on the first probability and the corresponding relation;
and respectively determining the characters corresponding to the subtitle texts according to the characters corresponding to the voiceprint clusters and the corresponding relation.
In a possible implementation manner, for each subtitle text, determining a corresponding character included in a video frame set to which the subtitle text belongs, and determining a first probability that the subtitle text belongs to the corresponding character includes:
for each subtitle text, determining a video frame set corresponding to the display time period of the subtitle text;
determining the number of the characters contained in the video frame set and the occurrence frequency of each character;
and determining a first probability that the subtitle text belongs to the corresponding character based on the number of the characters and the occurrence frequency of each character.
In a possible implementation manner, the performing voiceprint segmentation clustering on the audio in the video to be identified to obtain at least one voiceprint cluster includes:
acquiring audio data in the video to be identified;
denoising the audio data to obtain denoised audio data;
carrying out voiceprint segmentation on the denoised audio data to obtain a plurality of voiceprint segments;
for each voiceprint fragment, converting the voiceprint fragment into an embedded voiceprint feature;
and clustering the embedded voiceprint features to obtain at least one voiceprint cluster.
In a possible implementation manner, regarding the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same person, and determining the person corresponding to each voiceprint cluster based on the first probability and the correspondence relationship includes:
for each voiceprint fragment, based on the corresponding relation, taking a first probability that the subtitle text corresponding to the voiceprint fragment belongs to the corresponding person as a second probability that the voiceprint fragment belongs to the corresponding person;
and taking as a constraint condition that the voiceprint fragments in the same voiceprint cluster are voiceprint fragments of the same character, matching the characters corresponding to the voiceprint clusters according to the second probability, and determining the character corresponding to each voiceprint cluster.
In a possible implementation manner, the determining, according to the person corresponding to each voiceprint cluster and the corresponding relationship, the person corresponding to each subtitle text respectively includes:
determining the character corresponding to each voiceprint fragment according to the character corresponding to each voiceprint cluster and the voiceprint fragments contained in each voiceprint cluster;
and determining the characters corresponding to the subtitle texts according to the characters corresponding to the voiceprint fragments and the corresponding relation.
In a second aspect of the present invention, there is also provided an apparatus for recognizing a person in a video subtitle, including:
the text recognition module is used for performing character recognition on a video to be recognized to obtain a plurality of subtitle texts, and the subtitle texts correspond one to one to subtitles in the video to be recognized;
the first determining module is used for determining a corresponding character contained in a video frame set to which the subtitle text belongs and determining a first probability of the subtitle text belonging to the corresponding character for each subtitle text;
the segmentation clustering module is used for carrying out voiceprint segmentation clustering on the audio frequency in the video to be identified to obtain at least one voiceprint cluster; each voiceprint cluster comprises at least one section of voiceprint fragment, and the voiceprint fragment is obtained by carrying out voiceprint segmentation on the audio;
a second determining module, configured to determine a time-sequential correspondence between each subtitle text and each voiceprint fragment;
a third determining module, configured to regard the voiceprint fragments in the same voiceprint cluster as voiceprint fragments of the same person, and determine, based on the first probability and the correspondence, the person corresponding to each voiceprint cluster;
and the fourth determining module is used for respectively determining the characters corresponding to the subtitle texts according to the characters corresponding to the voiceprint clusters and the corresponding relations.
In one possible implementation, the first determining module includes:
the first determining unit is used for determining a video frame set corresponding to the display time period of each subtitle text;
the second determining unit is used for determining the number of the people contained in the video frame set and the number of times of the occurrence of each person;
and a third determining unit, configured to determine a first probability that the subtitle text belongs to the corresponding person based on the number of persons and the number of times each of the persons appears.
In a possible implementation, the segmentation clustering module includes:
the acquisition unit is used for acquiring audio data in the video to be identified;
the denoising unit is used for denoising the audio data to obtain denoised audio data;
the segmentation unit is used for carrying out voiceprint segmentation on the denoised audio data to obtain a plurality of voiceprint fragments;
a conversion unit, configured to convert each voiceprint fragment into an embedded voiceprint feature;
and the clustering unit is used for clustering the embedded voiceprint characteristics to obtain at least one voiceprint cluster.
In a possible implementation manner, the third determining module is specifically configured to:
for each voiceprint fragment, based on the corresponding relation, taking a first probability that the subtitle text corresponding to the voiceprint fragment belongs to the corresponding person as a second probability that the voiceprint fragment belongs to the corresponding person;
and taking as a constraint condition that the voiceprint fragments in the same voiceprint cluster are voiceprint fragments of the same character, matching the characters corresponding to the voiceprint clusters according to the second probability, and determining the character corresponding to each voiceprint cluster.
In one possible implementation, the fourth determining module includes:
a fourth determining unit, configured to determine a person corresponding to each voiceprint fragment according to the person corresponding to each voiceprint cluster and the voiceprint fragment contained in each voiceprint cluster;
and a fifth determining unit, configured to determine, according to the person corresponding to each voiceprint fragment and the corresponding relationship, a person corresponding to each subtitle text.
In another aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and a processor for implementing any of the above-described methods for recognizing a character of a video caption when executing a program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method for recognizing a character of a video subtitle as described above.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method for identifying the characters of video subtitles as described in any one of the above.
With the method for identifying the characters of video subtitles provided by the embodiment of the invention, a plurality of subtitle texts contained in the video to be recognized can be accurately obtained by performing character recognition on the video to be recognized. For each subtitle text, the corresponding characters contained in the video frame set to which the subtitle text belongs are determined, together with a first probability that the subtitle text belongs to each corresponding character. Voiceprint segmentation and clustering are performed on the audio in the video to be recognized to obtain at least one voiceprint cluster, each comprising at least one voiceprint fragment. The correspondence in time sequence between each subtitle text and each voiceprint fragment is then determined, the voiceprint fragments in the same voiceprint cluster are regarded as the voiceprint fragments of the same character, the character corresponding to each voiceprint cluster is determined based on the first probabilities and the correspondence, and the character corresponding to each subtitle text is determined according to the characters corresponding to the voiceprint clusters and the correspondence. In the embodiment of the invention, the subtitle texts in the video to be recognized are obtained by character recognition, the audio in the video to be recognized undergoes voiceprint segmentation and clustering, and the character corresponding to each subtitle text is accurately identified by further combining the subtitle text recognition result, the voiceprint segmentation clustering result, and the correspondence in time sequence between each subtitle text and each voiceprint fragment.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It should be understood that the drawings in the following description are merely examples of the application, and that those skilled in the art may devise other embodiments from them.
Fig. 1 is a flowchart illustrating a method for recognizing a character of a video subtitle according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a matching process of a voiceprint cluster and a person according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating determining a first probability that a subtitle belongs to a corresponding person according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of voiceprint segmentation and clustering according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart illustrating a process of determining a character corresponding to a subtitle according to an embodiment of the present invention;
fig. 6 is a schematic diagram illustrating a first structure of an apparatus for identifying the characters of video subtitles according to an embodiment of the present application;
fig. 7 is a schematic diagram of a second structure of the apparatus for identifying the characters of video subtitles in an embodiment of the present application;
fig. 8 is a schematic diagram illustrating a third structure of the apparatus for identifying the characters of video subtitles according to an embodiment of the present application;
fig. 9 is a schematic diagram illustrating a fourth structure of the apparatus for identifying the characters of video subtitles according to an embodiment of the present application;
fig. 10 is a schematic diagram of an electronic device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
In order to realize the identification of the character information corresponding to each subtitle in a video, the embodiment of the invention provides a method, a device, equipment and a storage medium for identifying characters of video subtitles. The embodiment of the invention provides a character recognition method for video subtitles, which comprises the following steps:
performing character recognition on a video to be recognized to obtain a plurality of subtitle texts, wherein the subtitle texts correspond to subtitles in the video to be recognized one by one;
for each subtitle text, determining a corresponding character contained in a video frame set to which the subtitle text belongs, and determining a first probability that the subtitle text belongs to the corresponding character;
carrying out voiceprint segmentation clustering on the audio in the video to be identified to obtain at least one voiceprint cluster; each voiceprint cluster comprises at least one section of voiceprint fragment, and the voiceprint fragment is obtained by carrying out voiceprint segmentation on the audio;
determining the corresponding relation of each subtitle text and each voiceprint fragment in time sequence;
regarding the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same person, and determining the person corresponding to each voiceprint cluster based on the first probability and the corresponding relation;
and respectively determining the characters corresponding to the subtitle texts according to the characters corresponding to the voiceprint clusters and the corresponding relation.
With the method for identifying the characters of video subtitles provided by the embodiment of the invention, a plurality of subtitle texts contained in the video to be recognized can be accurately obtained by performing character recognition on the video to be recognized. For each subtitle text, the corresponding characters contained in the video frame set to which the subtitle text belongs are determined, together with a first probability that the subtitle text belongs to each corresponding character. Voiceprint segmentation and clustering are performed on the audio in the video to be recognized to obtain at least one voiceprint cluster, each comprising at least one voiceprint fragment. The correspondence in time sequence between each subtitle text and each voiceprint fragment is then determined, the voiceprint fragments in the same voiceprint cluster are regarded as the voiceprint fragments of the same character, the character corresponding to each voiceprint cluster is determined based on the first probabilities and the correspondence, and the character corresponding to each subtitle text is determined according to the characters corresponding to the voiceprint clusters and the correspondence. In the embodiment of the invention, the subtitle texts in the video to be recognized are obtained by character recognition, the audio in the video to be recognized undergoes voiceprint segmentation and clustering, and the character corresponding to each subtitle text is accurately identified by further combining the subtitle text recognition result, the voiceprint segmentation clustering result, and the correspondence in time sequence between each subtitle text and each voiceprint fragment.
The following describes a method for identifying a character of a video subtitle according to an embodiment of the present invention in detail:
as shown in fig. 1, fig. 1 is a schematic flowchart of a method for identifying a person in a video subtitle according to an embodiment of the present invention, where the method includes the following steps:
and step S110, performing character recognition on the video to be recognized to obtain a plurality of subtitle texts.
In the embodiment of the invention, the video to be recognized may be a film or television episode, a popular video clip, and the like. Neither the spoken language nor the subtitle language of the video to be recognized is specifically limited, nor is its storage format. The video to be recognized may be obtained from a network platform, or captured by an electronic device with a video recording function.
In one embodiment, each video frame of the video to be recognized may be scanned using Optical Character Recognition (OCR) to obtain the text information contained in each frame, thereby obtaining a plurality of subtitle texts.
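For illustration, the following Python sketch shows one way such frame scanning could look, assuming OpenCV and pytesseract with a Chinese language pack are available; the file name, sampling stride, and subtitle band position are illustrative assumptions, and in practice consecutive identical results would still be merged into one subtitle text with its display time period.

```python
import cv2
import pytesseract

# Minimal OCR scan sketch; file name, stride and band are assumptions.
cap = cv2.VideoCapture("video_to_identify.mp4")   # hypothetical input file
subtitle_texts = []
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 10 == 0:                       # sample every 10th frame
        h = frame.shape[0]
        band = frame[int(0.8 * h):, :]            # assumed subtitle region
        gray = cv2.cvtColor(band, cv2.COLOR_BGR2GRAY)
        text = pytesseract.image_to_string(gray, lang="chi_sim").strip()
        if text:
            subtitle_texts.append((frame_idx, text))
    frame_idx += 1
cap.release()
```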
Step S120, for each subtitle text, determining a corresponding character included in the video frame set to which the subtitle text belongs, and determining a first probability that the subtitle text belongs to the corresponding character.
The video frame set to which a subtitle text belongs is the set of video frames played during the time period in which the subtitle text appears. The characters appearing in the video frame set corresponding to each subtitle text's display period can be detected by person recognition or target detection, thereby determining the characters contained in the video frame set to which each subtitle text belongs. Further, the number of times each character appears in that video frame set is counted. After the characters contained in the video frame set to which a subtitle text belongs and the number of times each character appears have been determined, the probability of each character appearing in that video frame set may, for example, be taken as the first probability that the subtitle text belongs to the corresponding character.
Exemplary methods for person recognition or target detection may include, but are not limited to: face image acquisition, face localization, facial feature extraction, Local Feature Analysis (LFA), neural network models, and the like.
Step S130, performing voiceprint segmentation and clustering on the audio in the video to be recognized to obtain at least one voiceprint cluster.
In one example, voiceprint segmentation and voiceprint clustering are performed on the audio in the video to be recognized by a voiceprint segmentation clustering method to obtain at least one voiceprint cluster. Each resulting voiceprint cluster contains at least one voiceprint fragment, and each voiceprint fragment is obtained by voiceprint segmentation of the audio.
Illustratively, the voiceprint segmentation clustering may be agglomerative hierarchical clustering (AHC) based on voiceprint features, variational Bayes HMM clustering on x-vectors (VBx), and the like.
Step S140, determining a time-sequential correspondence between each subtitle text and each voiceprint fragment.
Subtitle texts and voiceprint fragments in the video to be recognized are associated in time; for example, the timestamp at which a subtitle text begins to appear corresponds to the timestamp at which a voiceprint fragment begins. The timestamps of each subtitle text and each voiceprint fragment can therefore be obtained and matched against each other, thereby determining the correspondence in time sequence between each subtitle text and each voiceprint fragment.
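A minimal sketch of this time-sequence matching follows, assuming each subtitle text and each voiceprint fragment carries (start, end) timestamps in seconds; the function and variable names are illustrative, and since the patent does not prescribe a particular matching rule, maximal temporal overlap is used here as one reasonable choice.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the time interval shared by the two spans."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def match_by_time(subtitles, fragments):
    """subtitles, fragments: lists of (start_sec, end_sec, id).
    Pairs each subtitle text with the voiceprint fragment whose time span
    overlaps it the most."""
    pairs = {}
    for s_start, s_end, s_id in subtitles:
        best = max(fragments, key=lambda f: overlap(s_start, s_end, f[0], f[1]))
        pairs[s_id] = best[2]
    return pairs
```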
Step S150, the voiceprint fragments in the same voiceprint cluster are regarded as the voiceprint fragments of the same person, and the person corresponding to each voiceprint cluster is determined based on the first probability and the corresponding relation.
When voiceprint segmentation and clustering are performed on the audio in the video to be recognized, each resulting voiceprint cluster should contain the voiceprint fragments bearing the voiceprint features of a single speaker. In practice, the clustering result may contain slight errors and is not one hundred percent accurate. The embodiment of the invention ignores these possible slight errors and regards the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same character, that is, one voiceprint cluster corresponds to one character.
Since one voiceprint cluster corresponds to one character, each voiceprint cluster contains at least one voiceprint fragment, each subtitle text corresponds in time sequence to a voiceprint fragment, and the first probability that each subtitle text belongs to each corresponding character has been determined, the optimal matching between the voiceprint clusters and the characters can be determined.
In an embodiment, the Hungarian algorithm may be used to optimally match the voiceprint segmentation clustering result, the first probability that each subtitle text belongs to each corresponding character, and the correspondence in time sequence between each subtitle text and each voiceprint fragment, so as to determine the optimal assignment of characters to voiceprint clusters.
The correspondence between each subtitle text and each voiceprint cluster can be determined from the correspondence in time sequence between each subtitle text and each voiceprint fragment together with the voiceprint cluster to which each voiceprint fragment belongs. Illustratively, as shown in fig. 2, subtitle text 1 and subtitle text 2 correspond to voiceprint cluster 1, and subtitle text 3, subtitle text 4 and subtitle text 5 correspond to voiceprint cluster 2. Suppose the following first probabilities have been determined: subtitle text 1 belongs to character A or to character B with certain probabilities, subtitle text 2 belongs to character A, subtitle text 3 belongs to character A or to character B, subtitle text 4 belongs to character B or to character C, and subtitle text 5 belongs to character B or to character C. Regarding the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same character, the Hungarian algorithm optimally matches the first probabilities of the subtitle texts against the voiceprint clusters to obtain the best matching combination, that is, the character corresponding to each voiceprint cluster: voiceprint cluster 1 corresponds to character A, and voiceprint cluster 2 corresponds to character B.
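As a sketch of the fig. 2 matching, the assignment can be solved with SciPy's Hungarian-style solver; the score values below are illustrative, standing in for the per-cluster sums of first probabilities derived from the correspondence.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# score[c][k]: summed probability that cluster c's subtitle texts belong to
# candidate character k (values are illustrative, not from the patent).
score = np.array([[1.5, 0.4, 0.0],   # voiceprint cluster 1 vs A, B, C
                  [0.6, 1.7, 0.7]])  # voiceprint cluster 2 vs A, B, C

rows, cols = linear_sum_assignment(-score)  # negate: the solver minimises
for c, k in zip(rows, cols):
    print(f"voiceprint cluster {c + 1} -> character {'ABC'[k]}")
# voiceprint cluster 1 -> character A, voiceprint cluster 2 -> character B
```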
Step S160, determining the character corresponding to each subtitle text according to the character corresponding to each voiceprint cluster and the correspondence.
The character information corresponding to each subtitle text can be determined through the character corresponding to each voiceprint cluster, the voiceprint fragments contained in each voiceprint cluster, and the corresponding relation of each voiceprint fragment and each subtitle text in time sequence.
With the method for identifying the characters of video subtitles provided by the embodiment of the invention, a plurality of subtitle texts contained in the video to be recognized can be accurately obtained by performing character recognition on the video to be recognized. For each subtitle text, the corresponding characters contained in the video frame set to which the subtitle text belongs are determined, together with a first probability that the subtitle text belongs to each corresponding character. Voiceprint segmentation and clustering are performed on the audio in the video to be recognized to obtain at least one voiceprint cluster, each comprising at least one voiceprint fragment. The correspondence in time sequence between each subtitle text and each voiceprint fragment is then determined, the voiceprint fragments in the same voiceprint cluster are regarded as the voiceprint fragments of the same character, the character corresponding to each voiceprint cluster is determined based on the first probabilities and the correspondence, and the character corresponding to each subtitle text is determined according to the characters corresponding to the voiceprint clusters and the correspondence. In the embodiment of the invention, the subtitle texts in the video to be recognized are obtained by character recognition, the audio in the video to be recognized undergoes voiceprint segmentation and clustering, and the character corresponding to each subtitle text is accurately identified by further combining the subtitle text recognition result, the voiceprint segmentation clustering result, and the correspondence in time sequence between each subtitle text and each voiceprint fragment.
In a possible implementation manner, based on the embodiment shown in fig. 1, as shown in fig. 3, fig. 3 is a schematic diagram of a first probability implementation manner for determining that subtitle text belongs to a corresponding character in the embodiment of the present invention. The step S120 may include the steps of:
step S121, determining, for each subtitle text, a video frame set corresponding to the subtitle text display time period.
Target detection or frame capture is performed, using a target detection or video capture method, on the video corresponding to each subtitle text's display time period, so as to determine the video frame set corresponding to that period. For example, the video frame set corresponding to a subtitle text may include one or more video frames.
In one example, the video frame set corresponding to the subtitle text display period may be detected or identified by video clipping software.
Step S122, determining the number of characters contained in the video frame set and the number of times each character appears.
The number of times a character appears in the video frame set, that is, the number of frames in which the character appears, is equivalent to the duration of the character's appearance in the video frame set. In a possible implementation manner, the faces contained in the video frame set corresponding to a subtitle text's display time period may be identified by a face recognition model to obtain the characters contained in that video frame set, and the number of times each character appears in it may then be counted. Once the characters contained in the video frame set are obtained, the number of characters in it is also known.
Illustratively, suppose the video frame set corresponding to a subtitle text's display time period contains 10 video frames, and the faces contained in the video frame set are identified by a face recognition model, the result including character A and character B; that is, the number of characters in this video frame set is 2. Character A appears in 7 video frames and character B appears in 4 video frames, so the statistics are: in the video frame set corresponding to the subtitle text display period, character A appears 7 times and character B appears 4 times.
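A minimal sketch of this counting step is given below, assuming the face_recognition library and one reference face encoding per candidate character (for example, from cast photos); all names are illustrative, and the patent does not mandate this particular library.

```python
import face_recognition

def count_appearances(frame_set, known_encodings, tolerance=0.6):
    """frame_set: RGB frames of the subtitle text's video frame set.
    known_encodings: {character name: reference face encoding}.
    Returns {character name: number of frames the character appears in}."""
    counts = {name: 0 for name in known_encodings}
    for frame in frame_set:
        for enc in face_recognition.face_encodings(frame):
            hits = face_recognition.compare_faces(
                list(known_encodings.values()), enc, tolerance=tolerance)
            for name, hit in zip(known_encodings, hits):
                if hit:
                    counts[name] += 1
    return counts

# For the example above this would yield counts like {"A": 7, "B": 4}.
```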
Step S123, determining a first probability that the subtitle text belongs to the corresponding person based on the number of the persons and the number of times of occurrence of each person.
In one possible implementation, the first probability that the subtitle text belongs to the corresponding character may be determined by the following expression:

p = x_i / (x_1 + x_2 + ... + x_n)

wherein p represents the first probability that the subtitle text belongs to the i-th character, n represents the number of characters contained in the video frame set corresponding to the subtitle text display time period, and x_i represents the number of times the i-th character appears in the video frame set corresponding to the subtitle text display time period.
By determining the number of characters contained in the video frame set corresponding to the subtitle text's display time period and the number of times each character appears, the first probability that the subtitle text belongs to each corresponding character can be accurately calculated.
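A one-function sketch of this calculation follows, under the normalisation reading of the expression above (each count divided by the total over all n characters); this reading is an assumption, since the original expression survives only through its variable definitions.

```python
def first_probability(appearance_counts):
    """appearance_counts: {character: frames it appears in within the
    subtitle text's video frame set}. Returns {character: first probability},
    i.e. x_i normalised by the sum of x_1..x_n."""
    total = sum(appearance_counts.values())
    return {c: x / total for c, x in appearance_counts.items()}

print(first_probability({"A": 7, "B": 4}))
# {'A': 0.6363..., 'B': 0.3636...}
```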
In a possible implementation manner, based on the embodiment shown in fig. 1, as shown in fig. 4, fig. 4 is a schematic diagram of an implementation manner of voiceprint segmentation and clustering in the embodiment of the present invention. The step S130 may include the steps of:
step S131, audio data in the video to be identified are obtained.
In a possible implementation manner, the video format conversion function of ffmpeg may be used to convert the video to be recognized into audio, so as to obtain the audio data corresponding to the video to be recognized. ffmpeg is a suite of open-source computer programs that can record and convert digital audio and video, and turn them into streams.
In a possible implementation, the audio data in the video to be identified may also be extracted by video processing software such as Premiere.
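A minimal sketch of the ffmpeg route, invoked from Python; the file names are illustrative, and 16 kHz mono PCM is an assumed target format commonly used for voiceprint models, not one prescribed by the patent.

```python
import subprocess

# -vn drops the video stream; -ac 1 -ar 16000 resamples to 16 kHz mono.
subprocess.run(
    ["ffmpeg", "-i", "video_to_identify.mp4",
     "-vn", "-ac", "1", "-ar", "16000", "audio.wav"],
    check=True)
```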
Step S132, denoising the audio data to obtain denoised audio data.
The audio data may be mixed with a large amount of noise such as background music and the sound of wind and rain, and video data shot with electronic devices may contain sound information unrelated to human voices, such as car horns and signal noise. Denoising the audio data before voiceprint segmentation retains the audio information containing human voices and improves the accuracy of the extracted voiceprint features. Illustratively, the denoising may be implemented with a deep learning network, spectral subtraction, or the like.
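A minimal spectral-subtraction sketch follows; it assumes the first half second of the clip contains no speech and uses it as the noise estimate, a simplifying assumption for illustration rather than the patent's prescribed denoiser.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(audio, sr, noise_seconds=0.5, nperseg=512):
    """Subtract an estimated noise magnitude spectrum from every frame.
    Assumes the first noise_seconds of audio are speech-free noise."""
    _, _, Z = stft(audio, fs=sr, nperseg=nperseg)
    hop = nperseg // 2                                # scipy's default hop
    noise_frames = max(1, int(noise_seconds * sr / hop))
    noise_mag = np.abs(Z[:, :noise_frames]).mean(axis=1, keepdims=True)
    mag, phase = np.abs(Z), np.angle(Z)
    clean_mag = np.maximum(mag - noise_mag, 0.05 * noise_mag)  # floor
    _, clean = istft(clean_mag * np.exp(1j * phase), fs=sr, nperseg=nperseg)
    return clean
```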
Step S133, performing voiceprint segmentation on the denoised audio data to obtain a plurality of voiceprint segments.
Voiceprint segmentation cuts the audio data, that is, the audio information containing one or more human utterances, into a plurality of voiceprint fragments, each of which is a part of the audio data.
In one example, a fixed-length division may be adopted: the audio data is divided according to a set duration to obtain a plurality of voiceprint fragments of equal duration, and the set duration may be chosen according to actual needs, such as 0.5 second, 1 second, or 2 seconds. Alternatively, the denoised audio data may be segmented by a Speaker Change Detection (SCD) model built on a voiceprint encoder, which detects speaker change points and segments the audio at those points to obtain a plurality of voiceprint fragments.
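The fixed-length option reduces to slicing the waveform, as in the sketch below (the set duration defaults to 1 second purely for illustration):

```python
def split_fixed(audio, sr, seg_seconds=1.0):
    """Cut a mono waveform into voiceprint fragments of the set duration
    (the final shorter remainder, if any, is kept as its own fragment)."""
    seg_len = int(seg_seconds * sr)
    return [audio[i:i + seg_len] for i in range(0, len(audio), seg_len)]
```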
Step S134, for each voiceprint fragment, converting the voiceprint fragment into an embedded voiceprint feature.
In one example, each voiceprint fragment can be mapped to an embedding; that is, each voiceprint fragment can be encoded to obtain the corresponding embedded voiceprint feature. For example, a Gaussian Mixture Model-Universal Background Model (GMM-UBM) can be used to fit the embedded encoding of each speaker's voiceprint information; a convolutional neural network (CNN) or recurrent neural network (RNN) model may also be used to obtain a fixed-length embedded vector (d-vector) for each audio segment, and the embedded encoding of each speaker may, for example, be the average of all of that speaker's audio embedding vectors.
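As a stand-in illustration of "one fixed-length vector per fragment" (not the GMM-UBM or d-vector encoders named above, which require trained models), mean-pooled MFCCs give the right shape of output:

```python
import librosa

def embed(segment, sr):
    """A crude stand-in for a learned voiceprint encoder: mean-pooled MFCCs
    yield one fixed-length embedded voiceprint feature per fragment."""
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)  # shape (20,)
```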
Step S135, clustering the embedded voiceprint features to obtain at least one voiceprint cluster.
Feature clustering is performed on the embedded voiceprint features of all voiceprint fragments, so that the voiceprint fragments whose embedded voiceprint features belong to the same speaker are, as far as possible, clustered into the same cluster, yielding at least one voiceprint cluster. In one example, the embedded voiceprint features can be clustered by K-means clustering, Gaussian mixture model hierarchical clustering, spectral clustering, and the like.
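A minimal K-means sketch of this step; embeddings is assumed to be a (num_fragments, dim) array, and the number of speakers is assumed known here, whereas in practice it may itself have to be estimated.

```python
from sklearn.cluster import KMeans

def cluster_voiceprints(embeddings, num_speakers):
    """labels[i] is the voiceprint cluster index of fragment i."""
    return KMeans(n_clusters=num_speakers, n_init=10).fit_predict(embeddings)
```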
Denoising the audio data avoids the influence of noise on the target voiceprints, so the obtained voiceprint features are more accurate; further, clustering the embedded voiceprint features corresponding to the voiceprint fragments accurately yields the voiceprint clusters corresponding to different characters.
In a possible embodiment, the step S150 regards the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same person, and determines the person corresponding to each voiceprint cluster based on the first probability and the corresponding relationship, including:
step one, aiming at each voiceprint fragment, based on the corresponding relation, taking a first probability that a subtitle text corresponding to the voiceprint fragment belongs to a corresponding person as a second probability that the voiceprint fragment belongs to the corresponding person;
and step two, matching the characters corresponding to the voiceprint clusters according to a second probability by taking the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same character as constraint conditions, and determining the characters corresponding to the voiceprint clusters.
That is, based on the correspondence, the first probability that the subtitle text corresponding to a voiceprint fragment belongs to a character is taken as the second probability that the voiceprint fragment belongs to that character.
In one example, a brute-force matching algorithm may be used: with the constraint condition that the voiceprint fragments in the same voiceprint cluster are the voiceprint fragments of the same character, all possible matchings between voiceprint clusters and characters are enumerated, and the matching whose sum of second probabilities is highest is selected according to the second probability that each voiceprint fragment belongs to each character, thereby obtaining the character corresponding to each voiceprint cluster.
In one example, the Hungarian matching algorithm can be used: with the constraint condition that the voiceprint fragments in the same voiceprint cluster are the voiceprint fragments of the same character, the correspondence between voiceprint clusters and characters is matched according to the second probability that each voiceprint fragment belongs to each character. The Hungarian matching algorithm yields the optimal matching between voiceprint clusters and characters, thereby obtaining the character corresponding to each voiceprint cluster.
In a possible implementation manner, based on the embodiment shown in fig. 1, as shown in fig. 5, fig. 5 is a schematic diagram of an implementation manner of determining a character corresponding to a subtitle text in an embodiment of the present invention. The step S160 may include the steps of:
step S161, determining the persons corresponding to the voiceprint fragments according to the persons corresponding to the voiceprint clusters and the voiceprint fragments contained in the voiceprint clusters.
Illustratively, voiceprint cluster 1 corresponds to character A and voiceprint cluster 2 corresponds to character B; voiceprint cluster 1 comprises voiceprint fragment 1 and voiceprint fragment 2, and voiceprint cluster 2 comprises voiceprint fragment 3, voiceprint fragment 4 and voiceprint fragment 5. Correspondingly, voiceprint fragments 1 and 2 correspond to character A, and voiceprint fragments 3, 4 and 5 correspond to character B.
Step S162, determining the character corresponding to each subtitle text according to the character corresponding to each voiceprint fragment and the correspondence.
Illustratively, character A corresponds to voiceprint fragments 1 and 2, character B corresponds to voiceprint fragments 3, 4 and 5, and the voiceprint fragments and the subtitle texts are in one-to-one correspondence in time sequence. Accordingly, it can be determined that character A corresponds to subtitle text 1 and subtitle text 2, and character B corresponds to subtitle text 3, subtitle text 4 and subtitle text 5.
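The whole propagation from clusters through fragments to subtitle texts is two dictionary lookups, as sketched below with illustrative names:

```python
def label_subtitles(cluster_character, fragment_cluster, subtitle_fragment):
    """cluster_character: {cluster id: character};
    fragment_cluster: {fragment id: cluster id};
    subtitle_fragment: {subtitle text id: fragment id} (time-sequence pairs).
    Returns {subtitle text id: character}."""
    fragment_character = {f: cluster_character[c]
                          for f, c in fragment_cluster.items()}
    return {s: fragment_character[f] for s, f in subtitle_fragment.items()}
```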
According to the technical scheme provided by the embodiment of the invention, the character corresponding to each voiceprint fragment is determined from the character corresponding to each voiceprint cluster and the voiceprint fragments contained in each voiceprint cluster, and the character corresponding to each subtitle text can then be accurately determined from the character corresponding to each voiceprint fragment and the correspondence.
In existing applications, since the video frames corresponding to subtitles in films and television series mostly contain multiple people, identifying the speaker only by optical character recognition can recognize the subtitle content in the video frames but cannot determine which person is speaking. Identifying specific speaking characters through a voiceprint recognition model requires a large amount of labeling, and the model must be retrained for each series, so the whole process is difficult to realize.
According to the technical scheme provided by the embodiment of the invention, a plurality of subtitle texts contained in the video to be recognized are accurately obtained by performing character recognition on the video to be recognized. For each subtitle text, the corresponding characters contained in the video frame set to which the subtitle text belongs are determined, together with the first probability that the subtitle text belongs to each corresponding character. Voiceprint segmentation and clustering are performed on the audio in the video to be recognized to obtain at least one voiceprint cluster, each containing at least one voiceprint fragment; the correspondence in time sequence between each subtitle text and each voiceprint fragment is determined; the voiceprint fragments in the same voiceprint cluster are regarded as the voiceprint fragments of the same character; the character corresponding to each voiceprint cluster is determined based on the first probabilities and the correspondence; and the character corresponding to each subtitle text is determined according to the characters corresponding to the voiceprint clusters and the correspondence. By recognizing the subtitle texts through character recognition, performing voiceprint segmentation and clustering on the audio, and combining the subtitle text recognition result, the voiceprint segmentation clustering result, and the correspondence in time sequence between each subtitle text and each voiceprint fragment, the character corresponding to each subtitle text is accurately identified. Compared with character recognition alone, this improves the confidence of the characters assigned to the subtitle texts; and compared with labeling audio data on a large scale and training a different voiceprint recognition model for each film or television series, it is convenient to implement, simple to operate, reduces the workload, and saves manpower.
Based on the same inventive concept, and corresponding to the above method for identifying the characters of video subtitles, an apparatus for identifying the characters of video subtitles is provided. As shown in fig. 6, fig. 6 is a schematic view of a first structure of the apparatus for identifying the characters of video subtitles in the embodiment of the present application, and the apparatus includes:
the text recognition module 610 is configured to perform text recognition on a video to be recognized to obtain a plurality of subtitle texts, where the subtitle texts correspond to subtitles in the video to be recognized one by one;
a first determining module 620, configured to determine, for each subtitle text, a corresponding person included in a video frame set to which the subtitle text belongs, and determine a first probability that the subtitle text belongs to the corresponding person;
the segmentation clustering module 630 is configured to perform voiceprint segmentation clustering on the audio in the video to be identified to obtain at least one voiceprint cluster; each voiceprint cluster comprises at least one section of voiceprint fragment, and the voiceprint fragment is obtained by carrying out voiceprint segmentation on audio;
a second determining module 640, configured to determine a time-sequence correspondence between each subtitle text and each voiceprint fragment;
a third determining module 650, configured to determine a person corresponding to each voiceprint cluster based on the first probability and the correspondence, by regarding the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same person;
the fourth determining module 660 is configured to determine the character corresponding to each subtitle text according to the character corresponding to each voiceprint cluster and the correspondence.
With the apparatus for identifying the characters of video subtitles provided by the embodiment of the invention, a plurality of subtitle texts contained in the video to be recognized can be accurately obtained by performing character recognition on the video to be recognized. For each subtitle text, the corresponding characters contained in the video frame set to which the subtitle text belongs are determined, together with a first probability that the subtitle text belongs to each corresponding character. Voiceprint segmentation and clustering are performed on the audio in the video to be recognized to obtain at least one voiceprint cluster, each comprising at least one voiceprint fragment. The correspondence in time sequence between each subtitle text and each voiceprint fragment is then determined, the voiceprint fragments in the same voiceprint cluster are regarded as the voiceprint fragments of the same character, the character corresponding to each voiceprint cluster is determined based on the first probabilities and the correspondence, and the character corresponding to each subtitle text is determined according to the characters corresponding to the voiceprint clusters and the correspondence. In the embodiment of the invention, the subtitle texts in the video to be recognized are obtained by character recognition, the audio in the video to be recognized undergoes voiceprint segmentation and clustering, and the character corresponding to each subtitle text is accurately identified by further combining the subtitle text recognition result, the voiceprint segmentation clustering result, and the correspondence in time sequence between each subtitle text and each voiceprint fragment.
In one possible implementation, referring to fig. 7, the first determining module 620 may include:
a first determining unit 621, configured to determine, for each subtitle text, a video frame set corresponding to the subtitle text display time period;
a second determining unit 622, configured to determine the number of people included in the video frame set and the number of times of appearance of each person;
a third determining unit 623 configured to determine a first probability that the subtitle text belongs to the corresponding person based on the number of persons and the number of times each person appears.
In one possible implementation, as shown in fig. 8, the segmentation clustering module 630 may include:
an obtaining unit 631, configured to obtain audio data in a video to be identified;
the denoising unit 632 is configured to perform denoising processing on the audio data to obtain denoised audio data;
a segmentation unit 633, configured to perform voiceprint segmentation on the denoised audio data to obtain a plurality of voiceprint segments;
a conversion unit 634, configured to convert each voiceprint fragment into an embedded voiceprint feature;
the clustering unit 635 is configured to cluster the embedded voiceprint features to obtain at least one voiceprint cluster.
In a possible implementation manner, the third determining module 650 is specifically configured to:
for each voiceprint fragment, based on the corresponding relation, taking a first probability that the subtitle text corresponding to the voiceprint fragment belongs to the corresponding character as a second probability that the voiceprint fragment belongs to the corresponding character;
and matching the characters corresponding to the voiceprint clusters according to the second probability by taking the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same character as the constraint condition, and determining the characters corresponding to the voiceprint clusters.
In one possible implementation, as shown in fig. 9, the fourth determining module 660 may include:
a fourth determining unit 661, configured to determine a person corresponding to each voiceprint fragment according to the person corresponding to each voiceprint cluster and the voiceprint fragment included in each voiceprint cluster;
a fifth determining unit 662, configured to determine a person corresponding to each subtitle text according to the person corresponding to each voiceprint fragment and the corresponding relationship.
An embodiment of the present invention further provides an electronic device, as shown in fig. 10, comprising a processor 101, a communication interface 102, a memory 103, and a communication bus 104, wherein the processor 101, the communication interface 102, and the memory 103 communicate with each other through the communication bus 104.
a memory 103 for storing a computer program;
the processor 101 is configured to implement the following steps when executing the program stored in the memory 103:
performing character recognition on a video to be recognized to obtain a plurality of subtitle texts, wherein the subtitle texts correspond to subtitles in the video to be recognized one by one;
for each subtitle text, determining a corresponding character contained in a video frame set to which the subtitle text belongs, and determining a first probability that the subtitle text belongs to the corresponding character;
carrying out voiceprint segmentation clustering on the audio in the video to be identified to obtain at least one voiceprint cluster; each voiceprint cluster comprises at least one voiceprint fragment, and the voiceprint fragments are obtained by carrying out voiceprint segmentation on the audio;
determining the corresponding relation of each subtitle text and each voiceprint fragment in time sequence;
regarding the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same character, and determining the character corresponding to each voiceprint cluster based on the first probability and the corresponding relation;
and respectively determining the characters corresponding to the subtitle texts according to the characters corresponding to the voiceprint clusters and the corresponding relation.
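Among these steps, the correspondence in time sequence has not yet been illustrated. A minimal sketch, assuming each subtitle text and each voiceprint fragment carries start and end timestamps, is to pair every subtitle with the fragment whose time span overlaps its display period the most:

```python
def time_correspondence(subtitles, fragments):
    """subtitles: list of (subtitle_id, start, end) display periods
    fragments: list of (fragment_id, start, end) voiceprint time spans
    Returns subtitle_id -> fragment_id with the largest temporal overlap.
    (Input format assumed for illustration.)
    """
    mapping = {}
    for sid, s0, s1 in subtitles:
        best, best_overlap = None, 0.0
        for fid, f0, f1 in fragments:
            overlap = min(s1, f1) - max(s0, f0)
            if overlap > best_overlap:
                best, best_overlap = fid, overlap
        if best is not None:
            mapping[sid] = best
    return mapping
```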
In the embodiment of the invention, the subtitle texts in the video to be recognized are obtained by character recognition, the audio in the video to be recognized is processed by voiceprint segmentation clustering, and the character corresponding to each subtitle text is accurately identified by further combining the subtitle text recognition result, the voiceprint segmentation clustering result, and the correspondence in time sequence between each subtitle text and each voiceprint fragment.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In still another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored; the computer program, when executed by a processor, implements the method for identifying the characters of video subtitles described in any of the above embodiments.
In yet another embodiment of the present invention, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute the method for identifying the characters of video subtitles described in any of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (such as a floppy disk, hard disk, or magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a correlated manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device and electronic apparatus embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant points, reference may be made to the corresponding descriptions of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A method for identifying the characters of video subtitles, the method comprising:
performing character recognition on a video to be recognized to obtain a plurality of subtitle texts, wherein the subtitle texts correspond one-to-one to the subtitles in the video to be recognized;
for each subtitle text, determining a corresponding character contained in a video frame set to which the subtitle text belongs, and determining a first probability that the subtitle text belongs to the corresponding character;
carrying out voiceprint segmentation clustering on the audio in the video to be identified to obtain at least one voiceprint cluster; each voiceprint cluster comprises at least one voiceprint fragment, and the voiceprint fragments are obtained by carrying out voiceprint segmentation on the audio;
determining the corresponding relation of each subtitle text and each voiceprint fragment in time sequence;
regarding the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same character, and determining the character corresponding to each voiceprint cluster based on the first probability and the corresponding relation;
and respectively determining the characters corresponding to the subtitle texts according to the characters corresponding to the voiceprint clusters and the corresponding relation.
2. The method of claim 1, wherein the determining, for each subtitle text, a corresponding character contained in the video frame set to which the subtitle text belongs, and determining a first probability that the subtitle text belongs to the corresponding character comprises:
determining, for each subtitle text, a video frame set corresponding to the display time period of the subtitle text;
determining the number of the characters contained in the video frame set and the occurrence frequency of each character;
and determining a first probability that the subtitle text belongs to the corresponding character based on the number of the characters and the occurrence frequency of each character.
3. The method according to claim 1, wherein the performing voiceprint segmentation clustering on the audio in the video to be identified to obtain at least one voiceprint cluster comprises:
acquiring audio data in the video to be identified;
denoising the audio data to obtain denoised audio data;
carrying out voiceprint segmentation on the denoised audio data to obtain a plurality of voiceprint segments;
for each voiceprint fragment, converting the voiceprint fragment into an embedded voiceprint feature;
and clustering the embedded voiceprint features to obtain at least one voiceprint cluster.
4. The method of claim 1, wherein the regarding the voiceprint fragments in the same voiceprint cluster as the voiceprint fragments of the same character, and determining the character corresponding to each voiceprint cluster based on the first probability and the corresponding relation, comprises:
for each voiceprint fragment, based on the corresponding relation, taking a first probability that the subtitle text corresponding to the voiceprint fragment belongs to the corresponding character as a second probability that the voiceprint fragment belongs to the corresponding character;
and matching characters to the voiceprint clusters according to the second probabilities, under the constraint that the voiceprint fragments in the same voiceprint cluster are voiceprint fragments of the same character, thereby determining the character corresponding to each voiceprint cluster.
5. The method of claim 1, wherein the respectively determining the characters corresponding to the subtitle texts according to the characters corresponding to the voiceprint clusters and the corresponding relation comprises:
determining the character corresponding to each voiceprint fragment according to the character corresponding to each voiceprint cluster and the voiceprint fragments contained in each voiceprint cluster;
and determining the characters corresponding to the subtitle texts according to the characters corresponding to the voiceprint fragments and the corresponding relation.
6. An apparatus for identifying the characters of video subtitles, the apparatus comprising:
the system comprises a character recognition module, a text recognition module and a recognition module, wherein the character recognition module is used for performing character recognition on a video to be recognized to obtain a plurality of subtitle texts, and the subtitle texts correspond to subtitles in the video to be recognized one by one;
the first determining module is used for determining a corresponding character contained in a video frame set to which the subtitle text belongs and determining a first probability of the subtitle text belonging to the corresponding character for each subtitle text;
the segmentation clustering module is used for carrying out voiceprint segmentation clustering on the audio frequency in the video to be identified to obtain at least one voiceprint cluster; each voiceprint cluster comprises at least one section of voiceprint fragment, and the voiceprint fragment is obtained by carrying out voiceprint segmentation on the audio;
a second determining module, configured to determine a time-sequential correspondence between each subtitle text and each voiceprint fragment;
a third determining module, configured to regard the voiceprint fragments in the same voiceprint cluster as voiceprint fragments of the same character, and determine, based on the first probability and the correspondence, the character corresponding to each voiceprint cluster;
and a fourth determining module, configured to determine, according to the corresponding character of each voiceprint cluster and the corresponding relationship, a character corresponding to each subtitle text.
7. The apparatus of claim 6, wherein the first determining module comprises:
the first determining unit is used for determining a video frame set corresponding to the display time period of each subtitle text;
the second determining unit is used for determining the number of characters contained in the video frame set and the number of times each character appears;
and the third determining unit is used for determining a first probability that the subtitle text belongs to the corresponding character based on the number of characters and the number of times each character appears.
8. The apparatus of claim 6, wherein the segmentation clustering module comprises:
the acquisition unit is used for acquiring audio data in the video to be identified;
the denoising unit is used for denoising the audio data to obtain denoised audio data;
the segmentation unit is used for carrying out voiceprint segmentation on the denoised audio data to obtain a plurality of voiceprint fragments;
a conversion unit, configured to convert each voiceprint fragment into an embedded voiceprint feature;
and the clustering unit is used for clustering the embedded voiceprint characteristics to obtain at least one voiceprint cluster.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 5 when executing a program stored in the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211582799.0A CN115914742B (en) | 2022-12-08 | 2022-12-08 | Character recognition method, device and equipment for video captions and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115914742A true CN115914742A (en) | 2023-04-04 |
CN115914742B CN115914742B (en) | 2024-10-29 |
Family
ID=86483785
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211582799.0A Active CN115914742B (en) | 2022-12-08 | 2022-12-08 | Character recognition method, device and equipment for video captions and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115914742B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118690029A (en) * | 2024-08-26 | 2024-09-24 | 山东浪潮科学研究院有限公司 | Video question-answering method, system and medium based on multi-mode information fusion |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101719144A (en) * | 2009-11-04 | 2010-06-02 | 中国科学院声学研究所 | Method for segmenting and indexing scenes by combining captions and video image information |
US20160217793A1 (en) * | 2015-01-26 | 2016-07-28 | Verint Systems Ltd. | Acoustic signature building for a speaker from multiple sessions |
US20180075877A1 (en) * | 2016-09-13 | 2018-03-15 | Intel Corporation | Speaker segmentation and clustering for video summarization |
CN109686382A (en) * | 2018-12-29 | 2019-04-26 | 平安科技(深圳)有限公司 | A kind of speaker clustering method and device |
Non-Patent Citations (2)
Title |
---|
周雷; 龙艳花; 魏浩然: "A Novel Text-Dependent Speaker Recognition Method", Journal of Shanghai Normal University (Natural Sciences), no. 02, 15 April 2017 (2017-04-15) *
韦向峰; 袁毅; 张全; 池毓焕: "Research on the Alignment of Speech and Text Content in a Rich Media Environment", Technology Intelligence Engineering, no. 02, 15 April 2019 (2019-04-15) *
Also Published As
Publication number | Publication date |
---|---|
CN115914742B (en) | 2024-10-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10878824B2 (en) | Speech-to-text generation using video-speech matching from a primary speaker | |
US10304458B1 (en) | Systems and methods for transcribing videos using speaker identification | |
US11276407B2 (en) | Metadata-based diarization of teleconferences | |
US10108709B1 (en) | Systems and methods for queryable graph representations of videos | |
US6718303B2 (en) | Apparatus and method for automatically generating punctuation marks in continuous speech recognition | |
CN105405439B (en) | Speech playing method and device | |
US20040143434A1 (en) | Audio-Assisted segmentation and browsing of news videos | |
CN111797820B (en) | Video data processing method and device, electronic equipment and storage medium | |
US7046300B2 (en) | Assessing consistency between facial motion and speech signals in video | |
EP1531478A1 (en) | Apparatus and method for classifying an audio signal | |
CN101937268A (en) | Device control based on the identification of vision lip | |
CN112132030B (en) | Video processing method and device, storage medium and electronic equipment | |
CN111488487B (en) | Advertisement detection method and detection system for all-media data | |
JP2004520756A (en) | Method for segmenting and indexing TV programs using multimedia cues | |
CN111028842A (en) | Method and equipment for triggering voice interaction response | |
CN111904429A (en) | Human body falling detection method and device, electronic equipment and storage medium | |
CN115914742B (en) | Character recognition method, device and equipment for video captions and storage medium | |
US20220157322A1 (en) | Metadata-based diarization of teleconferences | |
CN113923521B (en) | Video scripting method | |
CN116074574A (en) | Video processing method, device, equipment and storage medium | |
CN116567351B (en) | Video processing method, device, equipment and medium | |
CN112584238A (en) | Movie and television resource matching method and device and smart television | |
CN116781856A (en) | Audio-visual conversion control method, system and storage medium based on deep learning | |
CN113490027A (en) | Short video production generation processing method and equipment and computer storage medium | |
CN118248147B (en) | Audio-visual voice recognition method, equipment and storage medium based on self-supervision learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |