CN113033190A - Subtitle generating method, device, medium and electronic equipment - Google Patents


Info

Publication number: CN113033190A
Application number: CN202110420704.4A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 王晶冰, 郝卓琳
Applicant/Assignee: Beijing Youzhuju Network Technology Co Ltd
Legal status: Pending
Prior art keywords: information, description information, candidate, original, phrase

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F 40/00 Handling natural language data › G06F 40/20 Natural language analysis › G06F 40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06F ELECTRIC DIGITAL DATA PROCESSING › G06F 40/00 Handling natural language data › G06F 40/20 Natural language analysis › G06F 40/279 Recognition of textual entities › G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking


Abstract

The disclosure relates to a subtitle generating method, apparatus, medium, and electronic device. The method includes: acquiring original object description information of a target object to be introduced in a video, where the original object description information is used for describing original inherent characteristics of the target object; refining the content of the original object description information to obtain brief description information of the target object; generating, from the original object description information, an object characteristic description phrase for describing the characteristics of the target object; and generating subtitle information corresponding to the target object in the video from the brief description information and the object characteristic description phrase. In this way, the subtitle information about the target object to be introduced in the video can be generated automatically while its readability and accuracy are guaranteed, making it convenient for users to quickly learn about the target object through the video.

Description

Subtitle generating method, device, medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a medium, and an electronic device for generating subtitles.
Background
With the rapid development of the internet and the explosive growth of the amount of information on it, users are increasingly accustomed to obtaining the information they need from the internet. In daily life in particular, for any unfamiliar object, a user can quickly find the content of interest by searching the internet for material related to that object.
At present, information on the internet generally exists in the form of text, images, audio, video, and the like. Video has the advantage of rich, visual, and intuitive content and is often the main way for users to learn about things. However, during video production, the introductory text about an object is usually used directly as the subtitle, or used as the subtitle after only light polishing. Such introductory text is generally lengthy and poorly readable, making it inconvenient for users to quickly learn about the object through the video.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method for generating subtitles, including:
acquiring original object description information of a target object to be introduced in a video, wherein the original object description information is used for describing original inherent characteristics of the target object;
extracting the content of the original object description information to obtain brief description information of the target object;
generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information;
and generating subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase.
In a second aspect, the present disclosure provides a subtitle generating apparatus comprising:
an acquisition module, configured to acquire original object description information of a target object to be introduced in a video, where the original object description information is used to describe original inherent features of the target object;
the refining module is used for refining the content of the original object description information acquired by the acquisition module to obtain brief description information of the target object;
a phrase generating module, configured to generate an object characteristic description phrase for describing characteristics of the target object according to the original object description information acquired by the acquiring module;
and the subtitle generating module is used for generating subtitle information corresponding to the target object in the video according to the brief description information obtained by the refining module and the object characteristic description phrase generated by the phrase generating module.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method provided by the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method provided by the first aspect of the present disclosure.
In the above technical solution, after original object description information of a target object to be introduced in a video, which describes the original inherent characteristics of the target object, is acquired, the brief description information of the target object and an object characteristic description phrase describing its characteristics are determined from the original object description information; subtitle information corresponding to the target object in the video is then generated from these two pieces of information. In this way, the subtitle information about the target object to be introduced in the video can be generated automatically while guaranteeing its readability and accuracy, making it convenient for users to quickly learn about the target object through the video.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
fig. 1 is a flowchart illustrating a subtitle generating method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating a method of generating subtitle information corresponding to a target object in a video according to brief description information and an object characterization phrase, according to an example embodiment.
Fig. 3 is a flowchart illustrating a method of generating subtitle information corresponding to a target object in a video according to brief description information and an object characterization phrase, according to another exemplary embodiment.
Fig. 4 is a block diagram illustrating a subtitle generating apparatus according to an example embodiment.
FIG. 5 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will understand them to mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flowchart illustrating a subtitle generating method according to an exemplary embodiment. As shown in fig. 1, the method may include S101 to S104.
In S101, original object description information of a target object to be introduced in a video is acquired.
In the present disclosure, the target object may be, for example, a flower, a bird, a commodity, a piece of furniture, or the like. The original object description information is used to describe original inherent characteristics of the target object, such as its style, specification, name, and material. Illustratively, the target object is a wardrobe, and its original object description information is: "high-end bedroom combination wardrobe, simple, modern, European style, atmospheric; assembled from solid-wood particle boards, push-pull sliding assembly; large capacity, usable as a multifunctional storage cabinet; white and oak color scheme, 1.8 meters high, with a top cabinet".
In S102, content refinement is performed on the original object description information to obtain brief description information of the target object.
Illustratively, based on the original object description information of the target object "wardrobe" in the above example, the generated brief description information is: "push-pull sliding assembly combination wardrobe".
In S103, an object characteristic description phrase for describing the characteristics of the target object is generated from the original object description information.
In the present disclosure, the object characteristic description phrases can highlight the unique features of the target object, and the number of the generated object characteristic description phrases may be one or more, which is not particularly limited in the present disclosure.
Illustratively, based on the original object description information of the target object "wardrobe" in the above example, 6 object characteristic description phrases describing the characteristics of the wardrobe are generated: (1) European style, simple and atmospheric; (2) solid wood construction, high-end atmosphere; (3) multifunctional solid wood, large capacity, foldable; (4) simple and atmospheric, waterproof, insect-proof, and dust-proof; (5) European wood grain, simple but not plain; (6) simple, modern, European-style large capacity.
In S104, subtitle information corresponding to the target object in the video is generated based on the brief description information and the object characteristic description phrase.
In the above technical solution, after original object description information of a target object to be introduced in a video, which describes the original inherent characteristics of the target object, is acquired, the brief description information of the target object and an object characteristic description phrase describing its characteristics are determined from the original object description information; subtitle information corresponding to the target object in the video is then generated from these two pieces of information. In this way, the subtitle information about the target object to be introduced in the video can be generated automatically while guaranteeing its readability and accuracy, making it convenient for users to quickly learn about the target object through the video.
The following describes in detail a specific embodiment of refining the content of the original object description information to obtain the brief description information of the target object in S102.
Specifically, the brief description information of the target object can be obtained from its original object description information in various ways. In one embodiment, keywords may be extracted from the original object description information statistically, for example by taking words whose frequency of occurrence in the original object description information exceeds a preset frequency threshold as keywords; the keywords are then spliced together to obtain the brief description information.
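For illustration, a minimal sketch of this statistical embodiment follows; the jieba word-segmentation library, the threshold value, and keeping keywords in order of first appearance are assumptions of the sketch, not requirements of the disclosure:

```python
from collections import Counter

import jieba  # assumed Chinese word-segmentation library; any tokenizer would do


def brief_description(original_description: str, freq_threshold: int = 2) -> str:
    """Splice together words whose frequency exceeds a preset threshold."""
    words = [w for w in jieba.cut(original_description) if w.strip()]
    counts = Counter(words)
    seen, keywords = set(), []
    for w in words:  # preserve order of first appearance when splicing
        if counts[w] > freq_threshold and w not in seen:
            seen.add(w)
            keywords.append(w)
    return "".join(keywords)
```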
In another embodiment, the original object description information may be input into a pre-trained content refinement model, which refines the content of the original object description information to obtain the brief description information of the target object. In this way, the brief description information can be extracted from the original object description information automatically, conveniently, and quickly.
In the present disclosure, the content refinement model may be, for example, a Transformer model, a Long Short-Term Memory (LSTM) model, or the like.
The content refinement model can be obtained by training through the following steps (1) to (7):
(1) Acquire first reference object description information of a first reference object, where the first reference object description information describes original inherent characteristics of the first reference object.
(2) Segment the first reference object description information into words to obtain a first word sequence, where the words in the first word sequence are arranged in the order in which they appear in the first reference object description information.
(3) For each word in the first word sequence other than the first reference object, delete that word from the first word sequence to obtain a new first word sequence, and connect the words of each new first word sequence in order to obtain candidate brief description information of the first reference object.
(4) Determine whether any new first word sequence still contains words other than the first reference object.
If any new first word sequence contains words other than the first reference object, perform the following step (5); otherwise, perform the following step (6).
(5) For each new first word sequence, take that new first word sequence as the first word sequence and return to step (3).
(6) Take the candidate brief description information with the highest fluency as the reference brief description information.
In the present disclosure, the fluency of candidate brief description information may be determined by a pre-trained language model (e.g., a GPT-2 model).
(7) Perform model training with the first reference object description information as the input of the content refinement model and the reference brief description information as the target output of the content refinement model, to obtain the trained content refinement model.
In this training process, the reference brief description information is determined directly from the first reference object description information, which avoids limited training samples degrading training performance, keeps the reference brief description information as fluent as possible, and thereby improves the content refinement performance of the content refinement model.
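A sketch of how steps (1) to (7) could construct a single training pair is given below. It assumes the Hugging Face transformers library with a GPT-2 checkpoint as the fluency scorer (for Chinese text a Chinese GPT-2 checkpoint would be substituted); note that exhaustive one-word-at-a-time deletion is exponential in the description length, so this is practical only for short descriptions:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative fluency scorer: higher score = lower LM loss = more fluent.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def fluency(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return -lm(ids, labels=ids).loss.item()


def candidate_briefs(words: list[str], reference_object: str) -> set[str]:
    """Steps (3)-(5): recursively delete one non-reference word at a time."""
    candidates, frontier = set(), [tuple(words)]
    while frontier:
        seq = frontier.pop()
        for i, w in enumerate(seq):
            if w == reference_object:
                continue  # the reference object itself is never deleted
            new_seq = seq[:i] + seq[i + 1:]
            brief = "".join(new_seq)
            if brief not in candidates:
                candidates.add(brief)
                frontier.append(new_seq)
    return candidates


def reference_brief(words: list[str], reference_object: str) -> str:
    """Step (6): the most fluent candidate becomes the target output."""
    return max(candidate_briefs(words, reference_object), key=fluency)
```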
A detailed description will be given below with respect to a specific embodiment of generating an object characteristic description phrase for describing the characteristic of the target object from the original object description information in S103 described above.
Specifically, an object characteristic description phrase describing the characteristics of the target object may be obtained from the original object description information in various ways. In one embodiment, descriptors (usually adjectives) related to the characteristics of the target object are extracted from the original object description information and then combined and spliced into the object characteristic description phrase.
In another embodiment, descriptors related to the characteristics of the target object are extracted from the original object description information; meanwhile, a target characteristic descriptor corresponding to the target object is determined according to a pre-stored correspondence between objects and characteristic descriptors, where this correspondence may be built from a corpus. The extracted descriptors and the target characteristic descriptors determined from the correspondence are then combined and spliced into the object characteristic description phrase describing the characteristics of the target object.
In yet another embodiment, the original object description information may be input into a pre-trained phrase generation model to obtain the object characteristic description phrase, so that the phrase can be generated automatically, conveniently, and quickly. The phrase generation model may be, for example, a Transformer model, an LSTM model, or the like.
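As an illustration of this embodiment, a minimal inference sketch follows; the Hugging Face transformers API and the checkpoint name are assumptions (the checkpoint name is hypothetical), and the disclosure only requires some pre-trained phrase generation model such as a Transformer or LSTM:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# "phrase-gen-model" is a hypothetical fine-tuned seq2seq checkpoint.
tok = AutoTokenizer.from_pretrained("phrase-gen-model")
model = AutoModelForSeq2SeqLM.from_pretrained("phrase-gen-model")


def generate_phrases(original_description: str, n: int = 6) -> list[str]:
    """Generate n candidate object characteristic description phrases."""
    ids = tok(original_description, return_tensors="pt").input_ids
    outputs = model.generate(
        ids,
        num_beams=n,
        num_return_sequences=n,  # must not exceed num_beams
        max_new_tokens=16,       # phrases are short by design
    )
    return [tok.decode(o, skip_special_tokens=True) for o in outputs]
```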
The phrase generation model can be obtained by training through the following steps 1) to 5):
1) Acquire second reference object description information and third reference object description information of a second reference object, where both describe original inherent characteristics of the second reference object and the similarity between them is greater than a preset similarity threshold.
2) Segment the second reference object description information into words to obtain a second word sequence, and segment the third reference object description information into words to obtain a third word sequence.
3) For each modifier in the second word sequence that describes the second reference object, extract a phrase containing that modifier from the second reference object description information as a first candidate object characteristic description phrase; likewise, for each modifier in the third word sequence that describes the second reference object, extract a phrase containing that modifier from the third reference object description information as a second candidate object characteristic description phrase.
4) Determine each first candidate object characteristic description phrase and each second candidate object characteristic description phrase as a reference object characteristic description phrase.
5) Perform model training with the second reference object description information as the input of the phrase generation model and the reference object characteristic description phrases as the target output of the phrase generation model, to obtain the trained phrase generation model.
In this training process, the reference object characteristic description phrases corresponding to the second reference object description information are determined from both the second reference object description information and the similar third reference object description information, so the reference phrases reflect the characteristics of the second reference object as fully as possible. This improves the phrase generation performance of the model, allows the generated object characteristic description phrases to fully reflect the characteristics of the target object, and thus improves the accuracy of the subtitle information. In addition, because the reference object characteristic description phrases are generated automatically, the problem of limited training samples degrading training performance is avoided, further improving the phrase generation performance of the phrase generation model.
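A sketch of the training-data construction in steps 1) to 5) follows; approximating "modifiers" by adjectives found with jieba's part-of-speech tagger, and extracting a fixed window around each modifier, are both illustrative choices, since the disclosure does not fix how modifiers are located or how long the extracted phrase is:

```python
import jieba.posseg as pseg  # assumed POS tagger; jieba's 'a*' flags mark adjectives


def candidate_phrases(description: str, window: int = 2) -> set[str]:
    """Extract a short phrase around every adjective (stand-in for 'modifier')."""
    pairs = [(p.word, p.flag) for p in pseg.cut(description)]
    phrases = set()
    for i, (word, flag) in enumerate(pairs):
        if flag.startswith("a"):
            lo, hi = max(0, i - window), min(len(pairs), i + window + 1)
            phrases.add("".join(w for w, _ in pairs[lo:hi]))
    return phrases


def reference_phrases(second_desc: str, third_desc: str) -> set[str]:
    """Step 4): merge the first and second candidate phrase sets."""
    return candidate_phrases(second_desc) | candidate_phrases(third_desc)
```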
The following is a detailed description of a specific embodiment of generating subtitle information corresponding to a target object in a video according to the brief description information and the object characteristic description phrase in S104.
In one embodiment, a single object characteristic description phrase is generated in S103. In this case, the brief description information obtained in S102 and that object characteristic description phrase may be directly combined to obtain the subtitle information corresponding to the target object in the video.
Illustratively, based on the original object description information of the target object "wardrobe" in the above example, the generated brief description information is "push-pull sliding assembly combination wardrobe" and the generated object characteristic description phrase is "solid wood construction, high-end atmosphere"; combining the two yields the subtitle information corresponding to the target object "wardrobe" in the video: "push-pull sliding assembly combination wardrobe, solid wood construction, high-end atmosphere".
In another embodiment, a plurality of object characteristic description phrases are generated in S103. In this case, the subtitle information corresponding to the target object in the video may be generated through S1041 and S1042 shown in fig. 2.
In S1041, the brief description information is combined with each object characteristic description phrase, respectively, to obtain a plurality of candidate subtitle information.
In S1042, target subtitle information is determined from the plurality of candidate subtitle information as subtitle information corresponding to the target object in the video.
Specifically, for each candidate subtitle information, the similarity between the candidate subtitle information and the original object description information may be determined; then, the candidate subtitle information having the highest similarity with the original object description information is determined as the target subtitle information. The similarity may be, for example, cosine similarity, euclidean distance similarity, or the like.
In addition, the similarity between candidate subtitle information and the original object description information may be determined as follows: first, semantic information corresponding to the candidate subtitle information and semantic information corresponding to the original object description information are determined; then, the similarity between the two pieces of semantic information is calculated and used as the similarity between the candidate subtitle information and the original object description information. Specifically, for each candidate subtitle, a semantic vector (i.e., semantic information) corresponding to the candidate subtitle information and a semantic vector corresponding to the original object description information are generated through a subtitle vector representation service, and the cosine similarity or Euclidean distance similarity between the two semantic vectors is computed and used as the similarity between the candidate subtitle information and the original object description information.
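A sketch of the combination in S1041 and the similarity-based selection in S1042 follows; the sentence-transformers library and the multilingual MiniLM checkpoint stand in for the "subtitle vector representation service", which the disclosure does not specify, and the comma used when splicing candidates is likewise an assumption:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")


def build_candidates(brief: str, phrases: list[str]) -> list[str]:
    """S1041: splice the brief description with each characteristic phrase."""
    return [f"{brief}, {p}" for p in phrases]


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pick_target_subtitle(candidates: list[str], original_description: str) -> str:
    """S1042: keep the candidate whose semantic vector is closest to the original."""
    doc_vec = encoder.encode(original_description)
    return max(candidates, key=lambda c: cosine(encoder.encode(c), doc_vec))
```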
Illustratively, based on the original object description information of the target object "wardrobe" in the above example, the generated brief description information is "push-pull sliding assembly combination wardrobe", and the generated object characteristic description phrases are: (1) European style, simple and atmospheric; (2) solid wood construction, high-end atmosphere; (3) multifunctional solid wood, large capacity, foldable; (4) simple and atmospheric, waterproof, insect-proof, and dust-proof; (5) European wood grain, simple but not plain; (6) simple, modern, European-style large capacity. Combining the brief description information "push-pull sliding assembly combination wardrobe" with each of these 6 object characteristic description phrases yields the following 6 pieces of candidate subtitle information:
(1) push-pull sliding assembly combination wardrobe, European style, simple and atmospheric;
(2) push-pull sliding assembly combination wardrobe, solid wood construction, high-end atmosphere;
(3) push-pull sliding assembly combination wardrobe, multifunctional solid wood, large capacity, foldable;
(4) push-pull sliding assembly combination wardrobe, simple and atmospheric, waterproof, insect-proof, and dust-proof;
(5) push-pull sliding assembly combination wardrobe, European wood grain, simple but not plain;
(6) push-pull sliding assembly combination wardrobe, simple, modern, European-style large capacity.
Among these, the candidate subtitle information with the highest similarity to the original object description information is candidate subtitle information (2), so candidate subtitle information (2) is determined as the target subtitle information; that is, the subtitle information corresponding to the target object "wardrobe" in the video is "push-pull sliding assembly combination wardrobe, solid wood construction, high-end atmosphere".
In addition, the object characteristic description phrases generated in S103 are not necessarily all derived verbatim from the original object description information, so the candidate subtitle information generated from them may contain information that does not appear in the original object description information, and the accuracy of the generated subtitle information cannot then be guaranteed. Therefore, before the target subtitle information is determined in S1042, the candidate subtitle information obtained in S1041 needs to be filtered, to ensure the accuracy of the subsequently generated subtitle information and the efficiency of the subsequent processing. Specifically, as shown in fig. 3, S104 further includes S1043.
In S1043, for each candidate subtitle information, at least one object characteristic description word for describing a characteristic of the target object is extracted from the candidate subtitle information, and if there is an object characteristic description word that does not appear in the original object description information, the candidate subtitle information is filtered out.
In the present disclosure, the object characteristic description word may be a modifier that describes the material, class, style, and the like of the target object. After filtering the candidate subtitles in S1043, in S1042, the target subtitle information may be determined from the candidate subtitle information remaining after the filtering. Specifically, the similarity between the candidate subtitle information and the original object description information may be determined for each candidate subtitle information remaining after the filtering operation; then, the candidate subtitle information having the highest similarity with the original object description information is determined as the target subtitle information.
A detailed description is given below of a specific embodiment of extracting, in S1043, at least one object characteristic description word describing the characteristics of the target object from the candidate subtitle information. Specifically, named entity recognition (NER) may be performed on the candidate subtitle information to obtain at least one named entity, and the at least one named entity is used as the at least one object characteristic description word describing the characteristics of the target object.
Here, NER is used to identify object characteristic description entities in the candidate subtitle information, such as material, category, and style, that describe the characteristics of the target object.
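A sketch of the filtering in S1043 follows; the descriptor extractor is passed in as a function because the disclosure assumes an NER model for this role without specifying one, so any concrete extractor here would be an assumption:

```python
from typing import Callable, Iterable


def filter_candidates(
    candidates: Iterable[str],
    original_description: str,
    extract_descriptors: Callable[[str], list[str]],  # stand-in for the NER model
) -> list[str]:
    """Drop any candidate containing a descriptor absent from the original text."""
    kept = []
    for subtitle in candidates:
        if all(d in original_description for d in extract_descriptors(subtitle)):
            kept.append(subtitle)
    return kept
```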
Illustratively, based on the target object "wardrobe" in the above example, 6 pieces of candidate subtitle information are obtained through the combination in S1041, as follows:
(1) push-pull sliding assembly combination wardrobe, European style, simple and atmospheric;
(2) push-pull sliding assembly combination wardrobe, solid wood construction, high-end atmosphere;
(3) push-pull sliding assembly combination wardrobe, multifunctional solid wood, large capacity, foldable;
(4) push-pull sliding assembly combination wardrobe, simple and atmospheric, waterproof, insect-proof, and dust-proof;
(5) push-pull sliding assembly combination wardrobe, European wood grain, simple but not plain;
(6) push-pull sliding assembly combination wardrobe, simple, modern, European-style large capacity.
For candidate subtitle information (1), the object characteristic description words extracted from it, describing the characteristics of the target object "wardrobe", are "European style", "simple", and "atmospheric"; all three appear in the original object description information, so candidate subtitle information (1) is retained.
For candidate subtitle information (2), the extracted object characteristic description words are "solid wood", "high-end", and "atmosphere"; all three appear in the original object description information, so candidate subtitle information (2) is retained.
For candidate subtitle information (3), the extracted object characteristic description words are "multifunctional", "solid wood", "large capacity", and "foldable"; "foldable" does not appear in the original object description information, i.e., candidate subtitle information (3) contains an object characteristic description word absent from the original object description information, so it is filtered out of the 6 candidates.
For candidate subtitle information (4), the extracted object characteristic description words are "simple", "atmospheric", "waterproof", "insect-proof", and "dust-proof"; "waterproof", "insect-proof", and "dust-proof" do not appear in the original object description information, so candidate subtitle information (4) is filtered out.
For candidate subtitle information (5), the extracted object characteristic description words are "European style", "wood grain", "simple", and "not plain"; "wood grain" and "not plain" do not appear in the original object description information, so candidate subtitle information (5) is filtered out.
For candidate subtitle information (6), the extracted object characteristic description words are "simple", "modern", "European style", and "large capacity"; all four appear in the original object description information, so candidate subtitle information (6) is retained.
Thus, the candidate subtitle information remaining after the filtering operation comprises candidate subtitle information (1), (2), and (6), namely: "push-pull sliding assembly combination wardrobe, European style, simple and atmospheric"; "push-pull sliding assembly combination wardrobe, solid wood construction, high-end atmosphere"; and "push-pull sliding assembly combination wardrobe, simple, modern, European-style large capacity". The target subtitle information may then be determined from these three candidates in S1042.
Based on the same inventive concept, the disclosure also provides a subtitle generating apparatus. As shown in fig. 4, the apparatus 400 includes: an obtaining module 401, configured to obtain original object description information of a target object to be introduced in a video, where the original object description information is used to describe original inherent features of the target object; a refining module 402, configured to refine the content of the original object description information obtained by the obtaining module 401 to obtain brief description information of the target object; a phrase generating module 403, configured to generate an object characteristic description phrase for describing characteristics of the target object according to the original object description information obtained by the obtaining module 401; and a subtitle generating module 404, configured to generate subtitle information corresponding to the target object in the video according to the brief description information obtained by the refining module 402 and the object characteristic description phrase generated by the phrase generating module 403.
In the above technical solution, after original object description information of a target object to be introduced in a video, which describes the original inherent characteristics of the target object, is acquired, the brief description information of the target object and an object characteristic description phrase describing its characteristics are determined from the original object description information; subtitle information corresponding to the target object in the video is then generated from these two pieces of information. In this way, the subtitle information about the target object to be introduced in the video can be generated automatically while guaranteeing its readability and accuracy, making it convenient for users to quickly learn about the target object through the video.
In one embodiment, the refining module 402 is configured to extract keywords from the original object description information statistically and then splice the keywords to obtain the brief description information.
In another embodiment, the refining module 402 is configured to input the original object description information into a pre-trained content refinement model to refine its content and obtain the brief description information of the target object. In this way, the brief description information can be extracted from the original object description information automatically, conveniently, and quickly.
In an embodiment, the phrase generating module 403 is configured to extract descriptors related to characteristics of the target object from the original object description information, and then combine and concatenate the descriptors to obtain an object characteristic description phrase for describing characteristics of the target object.
In another embodiment, the phrase generation module 403 is configured to: extracting descriptors related to the characteristics of the target object from the original object description information; meanwhile, determining a target characteristic descriptor corresponding to the target object according to a pre-stored corresponding relationship between the object and the characteristic descriptor, wherein the corresponding relationship can be formed according to a corpus; and then combining and splicing the extracted descriptors related to the target object characteristics and the target characteristic descriptors determined according to the corresponding relation to obtain an object characteristic description phrase for describing the characteristics of the target object.
In yet another embodiment, the phrase generating module 403 is configured to input the original object description information into a phrase generating model trained in advance, so as to obtain an object characteristic description phrase for describing characteristics of the target object. In this way, the object characteristic description phrase for describing the characteristic of the target object can be automatically generated through the phrase generation model, and the method is convenient and quick.
Optionally, there is a single object characteristic description phrase, and the subtitle generating module 404 is configured to directly combine the brief description information with the object characteristic description phrase to obtain the subtitle information corresponding to the target object in the video.
Optionally, there are a plurality of object characteristic description phrases, and the subtitle generating module 404 includes: a combination sub-module, configured to combine the brief description information with each object characteristic description phrase to obtain a plurality of pieces of candidate subtitle information; and a determining sub-module, configured to determine target subtitle information from the plurality of pieces of candidate subtitle information as the subtitle information corresponding to the target object in the video.
Optionally, the subtitle generating module 404 further includes: a filtering submodule for: for each candidate subtitle information, extracting at least one object characteristic description word for describing the characteristic of the target object from the candidate subtitle information; if object characteristic description words which do not appear in the original object description information exist, filtering the candidate subtitle information; the determining sub-module is used for determining the target subtitle information from the candidate subtitle information left after the filtering operation.
Optionally, the determining sub-module includes: the similarity determining submodule is used for determining the similarity between the candidate subtitle information and the original object description information aiming at each candidate subtitle information left after the filtering operation; and the subtitle determining submodule is used for determining the candidate subtitle information with the highest similarity with the original object description information as the target subtitle information.
Optionally, the filtering sub-module is configured to perform named entity identification on the candidate subtitle information to obtain at least one named entity, and use the at least one named entity as at least one object characteristic description word for describing a characteristic of the target object.
Optionally, the determining sub-module includes: the similarity determining submodule is used for determining the similarity between the candidate subtitle information and the original object description information aiming at each candidate subtitle information; and the subtitle determining submodule is used for determining the candidate subtitle information with the highest similarity with the original object description information as the target subtitle information.
Optionally, the similarity determination sub-module includes: the semantic information determining submodule is used for determining semantic information corresponding to the candidate subtitle information and semantic information corresponding to the original object description information; and the calculating submodule is used for calculating the similarity between the semantic information corresponding to the candidate subtitle information and the semantic information corresponding to the original object description information, and taking the similarity as the similarity between the candidate subtitle information and the original object description information.
Optionally, the refining module 402 is configured to input the original object description information into a pre-trained content refinement model to refine its content and obtain the brief description information of the target object, where the content refinement model is obtained through training by a first model training device. The first model training device includes: a first description information acquisition module, configured to acquire first reference object description information of a first reference object, where the first reference object description information describes original inherent characteristics of the first reference object; a first word segmentation module, configured to segment the first reference object description information into a first word sequence; a first determining module, configured to delete, for each word in the first word sequence other than the first reference object, that word from the first word sequence to obtain a new first word sequence, and connect the words of each new first word sequence in order to obtain candidate brief description information of the first reference object; a judging module, configured to judge whether any new first word sequence still contains words other than the first reference object; a triggering module, configured to, if any new first word sequence contains words other than the first reference object, take each new first word sequence as the first word sequence and trigger the first determining module again; a second determining module, configured to, if no new first word sequence contains words other than the first reference object, take the candidate brief description information with the highest fluency as the reference brief description information; and a first model training module, configured to perform model training with the first reference object description information as the input of the content refinement model and the reference brief description information as the target output of the content refinement model, to obtain the content refinement model.
Optionally, the phrase generating module 403 is configured to input the original object description information into a phrase generating model trained in advance, so as to obtain an object characteristic description phrase for describing characteristics of the target object; wherein, the phrase generating model is obtained by training through a second model training device, wherein, the second model training device comprises: a second description information obtaining module, configured to obtain second reference object description information and third reference object description information of a second reference object, where the second reference object description information and the third reference object description information are both used to describe an original inherent feature of the second reference object, and a similarity between the second reference object description information and the third reference object description information is greater than a preset similarity threshold; the second word segmentation module is used for segmenting the second reference object description information to obtain a second word sequence, and segmenting the third reference object description information to obtain a third word sequence; a third determination module to: for each modifier used for describing the second reference object in the second word sequence, extracting a phrase containing the modifier from the second reference object description information as a first candidate object characteristic description phrase; for each modifier used for describing the second reference object in the third word sequence, extracting a phrase containing the modifier from the third reference object description information as a second candidate object characteristic description phrase; a fourth determining module for determining each of the first candidate object property description phrases and each of the second candidate object property description phrases as reference object property description phrases; and the second model training module is used for performing model training by taking the second reference object description information as the input of the phrase generation model and taking the reference object characteristic description phrase as the target output of the phrase generation model to obtain the phrase generation model.
It should be noted that the first model training device may be independent of the subtitle generating apparatus 400 or integrated into it, and likewise the second model training device may be independent of the subtitle generating apparatus 400 or integrated into it; the disclosure is not limited in this respect.
The present disclosure also provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, implements the steps of the above-mentioned subtitle generating method provided by the present disclosure.
Referring now to fig. 5, a schematic diagram of an electronic device (e.g., a terminal device or a server) 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, electronic device 500 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 502 or a program loaded from a storage means 508 into a random access memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic device 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a liquid crystal display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication device 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the client and the server may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring original object description information of a target object to be introduced in a video, wherein the original object description information is used for describing original inherent characteristics of the target object; extracting the content of the original object description information to obtain brief description information of the target object; generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information; and generating subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase.
Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or by hardware. In some cases, the name of a module does not constitute a limitation of the module itself; for example, the acquisition module may also be described as a "module that acquires original object description information of a target object to be introduced in a video".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a subtitle generating method according to one or more embodiments of the present disclosure, including: acquiring original object description information of a target object to be introduced in a video, wherein the original object description information is used for describing original inherent characteristics of the target object; extracting the content of the original object description information to obtain brief description information of the target object; generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information; and generating subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase.
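By way of illustration only, the end-to-end flow of Example 1 may be sketched in Python as follows. Here refine_content and generate_phrases stand for the pre-trained content refinement and phrase generation models of Examples 8 and 9, and the remaining helpers refer to the per-step sketches given after Examples 2, 3, 5, and 7 below; every name is a hypothetical placeholder, not an identifier defined by this disclosure.

    # End-to-end sketch of the Example 1 pipeline, assuming the per-step
    # sketches below are in scope. All helper names are hypothetical.
    def generate_subtitle(original_desc: str) -> str:
        brief_info = refine_content(original_desc)   # content refinement (Example 8)
        phrases = generate_phrases(original_desc)    # characteristic phrases (Example 9)
        candidates = build_candidates(brief_info, phrases)          # Example 2
        kept = filter_candidates(candidates, original_desc,
                                 extract_descriptors)               # Examples 3 and 5
        return pick_target(kept, original_desc, embed)              # Examples 4, 6, and 7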
Example 2 provides the method of example 1, wherein there are a plurality of object characteristic description phrases, and generating the subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrases includes: respectively combining the brief description information with each object characteristic description phrase to obtain a plurality of candidate subtitle information; and determining target subtitle information from the plurality of candidate subtitle information as the subtitle information corresponding to the target object in the video.
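By way of illustration only, the combination step of Example 2 may be sketched as follows; since the disclosure does not fix the manner of combination, plain concatenation with a separator is assumed, and the sample inputs are hypothetical.

    # Sketch of the Example 2 combination step: one candidate subtitle per
    # object characteristic description phrase. The separator is an
    # assumption, not a requirement of the disclosure.
    def build_candidates(brief_info: str, phrases: list) -> list:
        return [f"{phrase}, {brief_info}" for phrase in phrases]

    # Hypothetical usage:
    candidates = build_candidates(
        "X-brand laptop",
        ["ultra-light and portable", "all-day battery life"],
    )
    # -> ["ultra-light and portable, X-brand laptop",
    #     "all-day battery life, X-brand laptop"]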
Example 3 provides the method of example 2, wherein generating the subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrases further includes: for each candidate subtitle information, extracting from the candidate subtitle information at least one object characteristic description word for describing a characteristic of the target object, and if there is any object characteristic description word that does not appear in the original object description information, filtering out the candidate subtitle information; and the determining target subtitle information from the plurality of candidate subtitle information includes: determining the target subtitle information from the candidate subtitle information remaining after the filtering operation.
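A minimal sketch of this filter is given below; extract_descriptors is a hypothetical stand-in for the named-entity step of Example 5, and simple substring membership is assumed as the test for whether a descriptor word appears in the original description.

    # Sketch of the Example 3 filter: a candidate survives only if every
    # one of its descriptor words is grounded in the original description.
    def filter_candidates(candidates: list, original_desc: str,
                          extract_descriptors) -> list:
        kept = []
        for cand in candidates:
            descriptors = extract_descriptors(cand)
            if all(word in original_desc for word in descriptors):
                kept.append(cand)  # no fabricated characteristics
        return kept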
Example 4 provides, according to one or more embodiments of the present disclosure, the method of example 3, wherein the determining the target subtitle information from the candidate subtitle information remaining after the filtering operation includes: determining, for each candidate subtitle information remaining after the filtering operation, a similarity between the candidate subtitle information and the original object description information; and determining the candidate subtitle information with the highest similarity to the original object description information as the target subtitle information.
Example 5 provides the method of example 3, wherein the extracting of at least one object characteristic description word for describing a characteristic of the target object from the candidate subtitle information includes: performing named entity recognition on the candidate subtitle information to obtain at least one named entity, and using the at least one named entity as the at least one object characteristic description word for describing a characteristic of the target object.
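One plausible realization of this step uses spaCy's pretrained Chinese pipeline, as sketched below; the choice of spaCy and of the zh_core_web_sm model is an assumption, since the disclosure does not name a particular named entity recognition tool.

    import spacy

    # Assumed NER backend: spaCy's small pretrained Chinese pipeline
    # (requires: python -m spacy download zh_core_web_sm).
    nlp = spacy.load("zh_core_web_sm")

    def extract_descriptors(candidate_subtitle: str) -> list:
        # Named entities serve as the object characteristic description words.
        return [ent.text for ent in nlp(candidate_subtitle).ents]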
Example 6 provides, according to one or more embodiments of the present disclosure, the method of example 2, wherein the determining target subtitle information from the plurality of candidate subtitle information includes: determining, for each candidate subtitle information, a similarity between the candidate subtitle information and the original object description information; and determining the candidate subtitle information with the highest similarity to the original object description information as the target subtitle information.
Example 7 provides the method of example 4 or 6, wherein determining the similarity between the candidate subtitle information and the original object description information includes: determining semantic information corresponding to the candidate subtitle information and semantic information corresponding to the original object description information; and calculating the similarity between the semantic information corresponding to the candidate subtitle information and the semantic information corresponding to the original object description information, and taking the similarity as the similarity between the candidate subtitle information and the original object description information.
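By way of illustration only, the similarity-based selection of Examples 4, 6, and 7 may be sketched as follows; embed stands for any sentence-level semantic encoder producing a fixed-size vector (the disclosure does not prescribe one), and cosine similarity is assumed as the comparison between the two pieces of semantic information.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Sketch of Examples 4/6/7: score each candidate against the original
    # description in a shared semantic space and keep the best one.
    def pick_target(candidates: list, original_desc: str, embed) -> str:
        ref = embed(original_desc)
        scores = [cosine_similarity(embed(c), ref) for c in candidates]
        return candidates[int(np.argmax(scores))]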
Example 8 provides the method of example 1, wherein content refining the original object description information to obtain the brief description information of the target object includes: inputting the original object description information into a pre-trained content refinement model to refine the content of the original object description information and obtain the brief description information of the target object; wherein the content refinement model is trained by: acquiring first reference object description information of a first reference object, the first reference object description information being used for describing original inherent characteristics of the first reference object; performing word segmentation on the first reference object description information to obtain a first word sequence; for each word in the first word sequence other than the first reference object, deleting that word from the first word sequence to obtain a new first word sequence, and sequentially connecting the words in the new first word sequence to obtain candidate brief description information of the first reference object; judging whether any new first word sequence contains words other than the first reference object; if so, taking each new first word sequence as the first word sequence and repeating the steps from the deleting step through the judging step; if not, taking the candidate brief description information with the highest fluency as the reference brief description information; and performing model training with the first reference object description information as the input of the content refinement model and the reference brief description information as the target output of the content refinement model, to obtain the content refinement model.
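The candidate enumeration in this training procedure amounts to a breadth-first, one-word-at-a-time deletion over the segmented sequence, as sketched below; the sketch assumes the first reference object is a single token and that the resulting candidates are then ranked by a fluency function (for example, a language model score), which the disclosure leaves open. Note that the candidate set grows combinatorially, which is tolerable only for short descriptions.

    # Sketch of the Example 8 candidate enumeration: repeatedly delete one
    # non-object word at a time, keeping every intermediate sequence as a
    # candidate, until only the object token itself remains. Words are
    # joined without spaces, as in Chinese text.
    def enumerate_brief_candidates(tokens: list, object_token: str) -> list:
        seen = set()
        frontier = [tuple(tokens)]
        candidates = []
        while frontier:
            next_frontier = []
            for seq in frontier:
                for i, tok in enumerate(seq):
                    if tok == object_token:
                        continue  # the object itself is never deleted
                    new_seq = seq[:i] + seq[i + 1:]
                    if new_seq in seen:
                        continue
                    seen.add(new_seq)
                    candidates.append("".join(new_seq))
                    if any(t != object_token for t in new_seq):
                        next_frontier.append(new_seq)
            frontier = next_frontier
        return candidates

The candidate with the highest fluency score then serves as the reference brief description information, i.e. the training target of the content refinement model.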
Example 9 provides the method of example 1 or 8, wherein generating an object characteristic description phrase for describing characteristics of the target object according to the original object description information includes: inputting the original object description information into a pre-trained phrase generation model to obtain an object characteristic description phrase for describing the characteristics of the target object; wherein the phrase generation model is trained by: acquiring second reference object description information and third reference object description information of a second reference object, wherein the second reference object description information and the third reference object description information are both used for describing original inherent characteristics of the second reference object, and the similarity between the second reference object description information and the third reference object description information is greater than a preset similarity threshold; performing word segmentation on the second reference object description information to obtain a second word sequence, and performing word segmentation on the third reference object description information to obtain a third word sequence; for each modifier used for describing the second reference object in the second word sequence, extracting a phrase containing the modifier from the second reference object description information as a first candidate object characteristic description phrase; for each modifier used for describing the second reference object in the third word sequence, extracting a phrase containing the modifier from the third reference object description information as a second candidate object characteristic description phrase; determining each first candidate object characteristic description phrase and each second candidate object characteristic description phrase as a reference object characteristic description phrase; and performing model training with the second reference object description information as the input of the phrase generation model and the reference object characteristic description phrases as the target output of the phrase generation model, to obtain the phrase generation model.
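One plausible realization of the modifier-phrase extraction used to build this training data is sketched below with jieba's part-of-speech tagger; the choice of jieba, the treatment of 'a'-tagged (adjectival) tokens as modifiers, and the two-token phrase window are all assumptions, since the disclosure specifies neither a segmenter nor a phrase boundary.

    import jieba.posseg as pseg

    # Sketch of the modifier-phrase extraction in Example 9. Adjectival
    # tokens are taken as modifiers of the reference object, and each one
    # is paired with its following token to form a minimal candidate
    # object characteristic description phrase.
    def extract_candidate_phrases(description: str) -> list:
        tagged = [(pair.word, pair.flag) for pair in pseg.cut(description)]
        phrases = []
        for i, (word, flag) in enumerate(tagged):
            if flag.startswith("a") and i + 1 < len(tagged):  # 'a*' = adjective
                phrases.append(word + tagged[i + 1][0])
        return phrases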
Example 10 provides, in accordance with one or more embodiments of the present disclosure, a subtitle generating apparatus including: an acquisition module configured to acquire original object description information of a target object to be introduced in a video, the original object description information being used for describing original inherent features of the target object; a refining module configured to refine the content of the original object description information acquired by the acquisition module to obtain brief description information of the target object; a phrase generating module configured to generate, according to the original object description information acquired by the acquisition module, an object characteristic description phrase for describing characteristics of the target object; and a subtitle generating module configured to generate subtitle information corresponding to the target object in the video according to the brief description information obtained by the refining module and the object characteristic description phrase generated by the phrase generating module.
Example 11 provides, in accordance with one or more embodiments of the present disclosure, a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the steps of the method of any one of examples 1-9.
Example 12 provides, in accordance with one or more embodiments of the present disclosure, an electronic device including: a storage device having a computer program stored thereon; and a processing apparatus configured to execute the computer program in the storage device to implement the steps of the method of any one of examples 1-9.
The foregoing description is merely a description of preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, a technical solution formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (12)

1. A subtitle generating method, comprising:
acquiring original object description information of a target object to be introduced in a video, wherein the original object description information is used for describing original inherent characteristics of the target object;
extracting the content of the original object description information to obtain brief description information of the target object;
generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information;
and generating subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase.
2. The method of claim 1, wherein there are a plurality of object characteristic description phrases; and
generating the subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrases comprises:
respectively combining the brief description information with each object characteristic description phrase to obtain a plurality of candidate subtitle information;
and determining target subtitle information from the candidate subtitle information as the subtitle information corresponding to the target object in the video.
3. The method of claim 2, wherein generating the subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrases further comprises:
for each candidate subtitle information, extracting from the candidate subtitle information at least one object characteristic description word for describing a characteristic of the target object; and if there is any object characteristic description word that does not appear in the original object description information, filtering out the candidate subtitle information;
the determining target subtitle information from the plurality of candidate subtitle information includes:
and determining the target subtitle information from the candidate subtitle information left after the filtering operation.
4. The method of claim 3, wherein the determining the target subtitle information from the candidate subtitle information remaining after the filtering operation comprises:
determining similarity between the candidate subtitle information and the original object description information for each of the candidate subtitle information remaining after the filtering operation;
and determining the candidate subtitle information with the highest similarity to the original object description information as the target subtitle information.
5. The method of claim 3, wherein the extracting at least one object characteristic description word for describing a characteristic of the target object from the candidate subtitle information comprises:
performing named entity recognition on the candidate subtitle information to obtain at least one named entity, and using the at least one named entity as the at least one object characteristic description word for describing a characteristic of the target object.
6. The method of claim 2, wherein the determining target subtitle information from the plurality of candidate subtitle information comprises:
for each candidate subtitle information, determining the similarity between the candidate subtitle information and the original object description information;
and determining the candidate subtitle information with the highest similarity to the original object description information as the target subtitle information.
7. The method according to claim 4 or 6, wherein the determining the similarity between the candidate subtitle information and the original object description information comprises:
determining semantic information corresponding to the candidate subtitle information and semantic information corresponding to the original object description information;
and calculating the similarity between the semantic information corresponding to the candidate subtitle information and the semantic information corresponding to the original object description information, and taking the similarity as the similarity between the candidate subtitle information and the original object description information.
8. The method of claim 1, wherein content refining the original object description information to obtain the brief description information of the target object comprises:
inputting the original object description information into a pre-trained content refinement model to refine the content of the original object description information and obtain the brief description information of the target object;
wherein the content refinement model is trained by:
acquiring first reference object description information of a first reference object, wherein the first reference object description information is used for describing original inherent characteristics of the first reference object;
performing word segmentation on the first reference object description information to obtain a first word sequence;
for each word in the first word sequence other than the first reference object, deleting that word from the first word sequence to obtain a new first word sequence, and sequentially connecting the words in the new first word sequence to obtain candidate brief description information of the first reference object;
judging whether any new first word sequence contains words other than the first reference object;
if any new first word sequence contains words other than the first reference object, taking each new first word sequence as the first word sequence and repeating the steps from the deleting step through the judging step;
if no new first word sequence contains words other than the first reference object, taking the candidate brief description information with the highest fluency as the reference brief description information;
and performing model training by taking the first reference object description information as an input of the content refinement model and taking the reference brief description information as a target output of the content refinement model to obtain the content refinement model.
9. The method according to claim 1 or 8, wherein generating an object characteristic description phrase for describing characteristics of the target object according to the original object description information comprises:
inputting the original object description information into a pre-trained phrase generation model to obtain an object characteristic description phrase for describing the characteristics of the target object;
wherein the phrase generation model is obtained by training in the following way:
acquiring second reference object description information and third reference object description information of a second reference object, wherein the second reference object description information and the third reference object description information are both used for describing original inherent characteristics of the second reference object, and the similarity between the second reference object description information and the third reference object description information is greater than a preset similarity threshold;
performing word segmentation on the second reference object description information to obtain a second word sequence, and performing word segmentation on the third reference object description information to obtain a third word sequence;
for each modifier used for describing the second reference object in the second word sequence, extracting a phrase containing the modifier from the second reference object description information as a first candidate object characteristic description phrase; for each modifier used for describing the second reference object in the third word sequence, extracting a phrase containing the modifier from the third reference object description information as a second candidate object characteristic description phrase;
determining each first candidate object characteristic description phrase and each second candidate object characteristic description phrase as a reference object characteristic description phrase;
and performing model training by taking the second reference object description information as the input of the phrase generation model and taking the reference object characteristic description phrase as the target output of the phrase generation model to obtain the phrase generation model.
10. A subtitle generating apparatus, comprising:
an acquisition module configured to acquire original object description information of a target object to be introduced in a video, wherein the original object description information is used for describing original inherent features of the target object;
a refining module configured to refine the content of the original object description information acquired by the acquisition module to obtain brief description information of the target object;
a phrase generating module configured to generate, according to the original object description information acquired by the acquisition module, an object characteristic description phrase for describing characteristics of the target object; and
a subtitle generating module configured to generate subtitle information corresponding to the target object in the video according to the brief description information obtained by the refining module and the object characteristic description phrase generated by the phrase generating module.
11. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processing apparatus, implements the steps of the method of any one of claims 1-9.
12. An electronic device, comprising:
a storage device having a computer program stored thereon; and
a processing apparatus configured to execute the computer program in the storage device to implement the steps of the method according to any one of claims 1 to 9.
CN202110420704.4A 2021-04-19 2021-04-19 Subtitle generating method, device, medium and electronic equipment Pending CN113033190A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110420704.4A CN113033190A (en) 2021-04-19 2021-04-19 Subtitle generating method, device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110420704.4A CN113033190A (en) 2021-04-19 2021-04-19 Subtitle generating method, device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113033190A true CN113033190A (en) 2021-06-25

Family

ID=76456858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110420704.4A Pending CN113033190A (en) 2021-04-19 2021-04-19 Subtitle generating method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113033190A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107577663A (en) * 2017-08-24 2018-01-12 北京奇艺世纪科技有限公司 A kind of key-phrase extraction method and apparatus
CN111723566A (en) * 2019-03-21 2020-09-29 阿里巴巴集团控股有限公司 Method and device for reconstructing product information
CN111859940A (en) * 2019-04-23 2020-10-30 北京嘀嘀无限科技发展有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN110765244A (en) * 2019-09-18 2020-02-07 平安科技(深圳)有限公司 Method and device for acquiring answering, computer equipment and storage medium
CN110928981A (en) * 2019-11-18 2020-03-27 佰聆数据股份有限公司 Method, system and storage medium for establishing and perfecting iteration of text label system
CN111859930A (en) * 2020-07-27 2020-10-30 北京字节跳动网络技术有限公司 Title generation method and device, electronic equipment and storage medium
CN112446208A (en) * 2020-12-09 2021-03-05 北京有竹居网络技术有限公司 Method, device and equipment for generating advertisement title and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222193A (en) * 2021-12-03 2022-03-22 北京影谱科技股份有限公司 Video subtitle time alignment model training method and system
CN114222193B (en) * 2021-12-03 2024-01-05 北京影谱科技股份有限公司 Video subtitle time alignment model training method and system

Similar Documents

Publication Publication Date Title
CN108833973B (en) Video feature extraction method and device and computer equipment
CN109886326B (en) Cross-modal information retrieval method and device and storage medium
CN112037792B (en) Voice recognition method and device, electronic equipment and storage medium
CN107527619B (en) Method and device for positioning voice control service
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN108073606B (en) News recommendation method and device for news recommendation
CN110990598B (en) Resource retrieval method and device, electronic equipment and computer-readable storage medium
CN106407268B (en) Content retrieval method and system based on coverage optimization method
CN113889113A (en) Sentence dividing method and device, storage medium and electronic equipment
CN112929746A (en) Video generation method and device, storage medium and electronic equipment
CN111274819A (en) Resource acquisition method and device
CN112954453B (en) Video dubbing method and device, storage medium and electronic equipment
CN113033190A (en) Subtitle generating method, device, medium and electronic equipment
CN110232181B (en) Comment analysis method and device
CN114168798A (en) Text storage management and retrieval method and device
CN114445754A (en) Video processing method and device, readable medium and electronic equipment
CN111708911B (en) Searching method, searching device, electronic equipment and computer-readable storage medium
CN113987128A (en) Related article searching method and device, electronic equipment and storage medium
CN110598049A (en) Method, apparatus, electronic device and computer readable medium for retrieving video
CN110377842A (en) Voice remark display methods, system, medium and electronic equipment
CN111767259A (en) Content sharing method and device, readable medium and electronic equipment
CN111783433A (en) Text retrieval error correction method and device
CN116186545A (en) Training and application methods and devices of pre-training model, electronic equipment and medium
CN111782895B (en) Retrieval processing method and device, readable medium and electronic equipment
CN110413603B (en) Method and device for determining repeated data, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination