CN113033190B - Subtitle generation method, subtitle generation device, medium and electronic equipment


Info

Publication number
CN113033190B
CN113033190B
Authority
CN
China
Prior art keywords
information
description information
candidate
original
phrase
Prior art date
Legal status
Active
Application number
CN202110420704.4A
Other languages
Chinese (zh)
Other versions
CN113033190A (en)
Inventor
王晶冰
郝卓琳
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110420704.4A priority Critical patent/CN113033190B/en
Publication of CN113033190A publication Critical patent/CN113033190A/en
Application granted granted Critical
Publication of CN113033190B publication Critical patent/CN113033190B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Studio Circuits (AREA)

Abstract

The disclosure relates to a subtitle generation method, a subtitle generation device, a medium, and an electronic device. The method includes: acquiring original object description information of a target object to be introduced in a video, where the original object description information is used to describe original inherent characteristics of the target object; performing content refinement on the original object description information to obtain brief description information of the target object; generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information; and generating caption information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase. In this way, subtitle information related to the target object to be introduced in the video can be generated automatically while ensuring its readability and accuracy, so that a user can quickly learn about the target object through the video.

Description

Subtitle generation method, subtitle generation device, medium and electronic equipment
Technical Field
The disclosure relates to the field of computer technology, and in particular to a subtitle generation method, a subtitle generation device, a medium, and an electronic device.
Background
With the rapid development of the internet, the amount of information on the internet has exploded, and users are increasingly accustomed to obtaining all kinds of information from it. In particular, in daily life, for anything unfamiliar, a user can quickly look up what he or she wants to know by searching the internet for content related to that thing.
At present, information on the internet generally exists in the form of text, images, audio, video, and the like. Among these, video, with its rich content and visual, intuitive presentation, often serves as the main way for users to learn about things. However, in the video production process, the introduction text related to an object is usually used directly as the subtitle, or is only lightly polished before being used as the subtitle. Such introduction text is usually long and poorly readable, making it inconvenient for users to quickly learn about the object through the video.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a subtitle generating method, including:
acquiring original object description information of a target object to be introduced in a video, wherein the original object description information is used for describing original inherent characteristics of the target object;
content refining is carried out on the original object description information, and brief description information of the target object is obtained;
generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information; and
generating caption information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase.
In a second aspect, the present disclosure provides a subtitle generating apparatus, including:
an acquisition module, configured to acquire original object description information of a target object to be introduced in a video, where the original object description information is used to describe original inherent characteristics of the target object;
a refinement module, configured to perform content refinement on the original object description information acquired by the acquisition module, to obtain brief description information of the target object;
a phrase generation module, configured to generate an object characteristic description phrase for describing the characteristics of the target object according to the original object description information acquired by the acquisition module; and
a caption generation module, configured to generate caption information corresponding to the target object in the video according to the brief description information obtained by the refinement module and the object characteristic description phrase generated by the phrase generation module.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method provided by the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon; and
a processing device configured to execute the computer program in the storage device to carry out the steps of the method provided by the first aspect of the present disclosure.
In the above technical solution, after the original object description information describing the original inherent characteristics of the target object to be introduced in the video is obtained, the brief description information of the target object and the object characteristic description phrase for describing the characteristics of the target object are determined according to the original object description information; then, the subtitle information corresponding to the target object in the video is generated according to the brief description information and the object characteristic description phrase. In this way, subtitle information related to the target object to be introduced in the video can be generated automatically while ensuring its readability and accuracy, so that a user can quickly learn about the target object through the video.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
fig. 1 is a flowchart illustrating a subtitle generating method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating a method of generating subtitle information corresponding to a target object in a video according to brief description information and an object property description phrase according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating a method of generating subtitle information corresponding to a target object in a video according to brief description information and an object property description phrase according to another exemplary embodiment.
Fig. 4 is a block diagram illustrating a subtitle generating apparatus according to an exemplary embodiment.
Fig. 5 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Fig. 1 is a flowchart illustrating a subtitle generating method according to an exemplary embodiment. As shown in fig. 1, the method may include S101 to S104.
In S101, original object description information of a target object to be introduced in a video is acquired.
In the present disclosure, the target object may be, for example, a flower, a bird, a daily necessity, a piece of furniture, or the like. The original object description information is used to describe the original inherent characteristics of the target object, such as its style, specification, name, and material. Illustratively, the target object is a wardrobe, and its original object description information is: a high-end bedroom combined clothes cabinet, simple, modern and European in style, assembled from solid wood particle board, with push-pull sliding doors; it has a large capacity and can double as a storage cabinet; the color scheme is atmospheric white with oak; it is 1.8 meters high and includes a top cabinet.
In S102, the original object description information is content-refined to obtain brief description information of the target object.
Illustratively, based on the original object description information of the target object "wardrobe" in the above example, the generated brief description information is: push-pull sliding assembled combined clothes cabinet.
In S103, an object property description phrase for describing the property of the target object is generated from the original object description information.
In the present disclosure, the object property description phrase can highlight a feature unique to the target object, and the number of generated object property description phrases may be one or more, which is not specifically limited in the present disclosure.
Illustratively, based on the original object description information of the target object "wardrobe" in the above example, 6 object property description phrases for describing properties of the wardrobe are generated as follows: ① European style, simple atmosphere; ② solid wood construction, high-end atmosphere; ③ multifunctional solid wood, large capacity, foldable; ④ simple atmosphere, waterproof, insect-proof and dust-proof; ⑤ European wood grain, simple but not simple; ⑥ simple and modern, European style, large capacity.
In S104, caption information corresponding to the target object in the video is generated from the brief description information and the object property description phrase.
In the above technical solution, after the original object description information describing the original inherent characteristics of the target object to be introduced in the video is obtained, the brief description information of the target object and the object characteristic description phrase for describing the characteristics of the target object are determined according to the original object description information; then, the subtitle information corresponding to the target object in the video is generated according to the brief description information and the object characteristic description phrase. In this way, subtitle information related to the target object to be introduced in the video can be generated automatically while ensuring its readability and accuracy, so that a user can quickly learn about the target object through the video.
The following describes in detail the embodiment of the brief description information of the target object obtained by content-refining the description information of the original object in S102.
Specifically, the brief description information of the target object can be obtained from the original object description information in various ways. In one embodiment, keywords may be extracted from the original object description information statistically; for example, words whose occurrence frequency in the original object description information exceeds a preset frequency threshold are determined to be keywords. The keywords are then spliced together to obtain the brief description information.
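The statistical embodiment above admits a very compact implementation. The following is a minimal sketch, assuming the jieba segmenter and an illustrative frequency threshold; neither choice is mandated by the disclosure:

    from collections import Counter

    import jieba  # assumed segmenter; any Chinese tokenizer could be substituted

    def brief_description(original_description: str, freq_threshold: int = 1) -> str:
        # Segment the description and count word occurrences.
        words = [w for w in jieba.lcut(original_description) if w.strip()]
        counts = Counter(words)
        # Words whose frequency exceeds the preset threshold are keywords;
        # keep them in first-appearance order, then splice them together.
        keywords = list(dict.fromkeys(w for w in words if counts[w] > freq_threshold))
        return "".join(keywords)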
In another embodiment, the original object description information may be input into a pre-trained content refinement model, which performs content refinement on the original object description information to obtain the brief description information of the target object. In this way, the brief description information of the target object can be extracted from the original object description information automatically, conveniently, and quickly.
In the present disclosure, the content refinement model may be, for example, a Transformer model, a Long Short-Term Memory (LSTM) network, or the like.
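For illustration, inference with such a pre-trained refinement model might look as follows, assuming a HuggingFace-style encoder-decoder interface; the checkpoint name "content-refine-zh" is a placeholder, not part of the disclosure:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # "content-refine-zh" is a hypothetical checkpoint name used for illustration.
    tokenizer = AutoTokenizer.from_pretrained("content-refine-zh")
    model = AutoModelForSeq2SeqLM.from_pretrained("content-refine-zh")

    def refine(original_description: str) -> str:
        # Encode the original object description and decode a short brief description.
        inputs = tokenizer(original_description, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_length=32, num_beams=4)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)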
The content refinement model can be obtained through training in the following steps (1) to (7):
(1) First reference object description information of a first reference object is acquired, wherein the first reference object description information is used for describing original inherent characteristics of the first reference object.
(2) The first reference object description information is segmented to obtain a first word sequence, in which the words are arranged in the order in which they appear in the first reference object description information.
(3) For each word in the first word sequence other than the first reference object, that word is deleted to obtain a new first word sequence, and the words in each new first word sequence are sequentially connected to obtain candidate brief description information of the first reference object.
(4) It is determined whether any new first word sequence contains words other than the first reference object.
If any new first word sequence contains words other than the first reference object, the following step (5) is executed; if no new first word sequence contains words other than the first reference object, the following step (6) is performed.
(5) Each new first word sequence is taken as the first word sequence, and the process returns to step (3).
(6) The candidate brief description information with the highest fluency is taken as reference brief description information.
In the present disclosure, the fluency of candidate brief description information may be determined by a pre-trained language model (e.g., a GPT-2 model).
(7) Model training is performed with the first reference object description information as the input of the content refinement model and the reference brief description information as its target output, to obtain the content refinement model.
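A sketch of the candidate-construction and selection logic of steps (3) to (6) is given below, with GPT-2 perplexity standing in for the fluency score of step (6); the English GPT-2 checkpoint is only an illustrative stand-in, and a Chinese language model would be used in practice. The exhaustive enumeration is kept literal for clarity, and a real pipeline would prune or cap it:

    import math

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    # Illustrative stand-in for the pre-trained language model of step (6).
    lm_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2")
    lm.eval()

    def fluency(text: str) -> float:
        # Score fluency as negative perplexity: higher is more fluent.
        ids = lm_tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss
        return -math.exp(loss.item())

    def candidate_briefs(words: list[str], object_name: str) -> set[str]:
        # Steps (3)-(5): repeatedly delete one non-object word at a time,
        # collecting the spliced remainder of every new word sequence.
        results = set()

        def recurse(seq):
            deletable = [i for i, w in enumerate(seq) if w != object_name]
            for i in deletable:
                new_seq = seq[:i] + seq[i + 1:]
                results.add("".join(new_seq))
                recurse(new_seq)

        recurse(words)
        return results

    def reference_brief(words: list[str], object_name: str) -> str:
        # Step (6): the most fluent candidate becomes the reference brief description.
        # Assumes the word sequence contains at least one word besides the object.
        return max(candidate_briefs(words, object_name), key=fluency)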
In the training process of the content refinement model, the reference brief description information can be derived directly from the first reference object description information, which avoids the problem of limited training samples affecting model training, keeps the reference brief description information as fluent as possible, and thereby improves the content refinement performance of the model.
The following describes in detail a specific embodiment of generating an object property description phrase for describing the property of the target object from the original object description information in S103 described above.
Specifically, the object characteristic description phrase for describing the characteristics of the target object may be obtained from the original object description information in various ways. In one embodiment, description words (typically adjectives) related to the characteristics of the target object may be extracted from the original object description information, and the description words may then be combined and spliced to obtain the object characteristic description phrase.
In another embodiment, description words related to the characteristics of the target object may be extracted from the original object description information; meanwhile, target characteristic description words corresponding to the target object may be determined according to a pre-stored correspondence between objects and characteristic description words, where the correspondence may be built from a corpus. The extracted description words and the target characteristic description words determined according to the correspondence are then combined and spliced to obtain the object characteristic description phrase for describing the characteristics of the target object.
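A minimal sketch of this descriptor-extraction embodiment follows, assuming jieba's part-of-speech tagger for adjective detection and an in-memory dict as the pre-stored object-to-descriptor correspondence; both choices, and the FEATURE_LEXICON contents, are illustrative:

    import jieba.posseg as pseg

    # Hypothetical corpus-derived object-to-descriptor correspondence.
    FEATURE_LEXICON = {"wardrobe": ["high-end", "atmospheric"]}

    def property_phrases(original_description: str, target_object: str) -> list[str]:
        # Treat adjectives (jieba POS tags starting with "a") as feature descriptors.
        extracted = [w for w, flag in pseg.cut(original_description) if flag.startswith("a")]
        looked_up = FEATURE_LEXICON.get(target_object, [])
        descriptors = list(dict.fromkeys(extracted + looked_up))
        # Combine and splice adjacent descriptors into short phrases.
        return ["，".join(descriptors[i:i + 2]) for i in range(0, len(descriptors), 2)]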
In yet another embodiment, the original object description information may be input into a pre-trained phrase generation model to obtain the object property description phrase for describing the properties of the target object. In this way, object characteristic description phrases can be generated automatically, conveniently, and quickly. The phrase generation model may be, for example, a Transformer model, an LSTM model, or the like.
The phrase generating model can be obtained through training in the following steps 1) to 5):
1) Acquiring second reference object description information and third reference object description information of a second reference object, where both are used to describe original inherent characteristics of the second reference object, and the similarity between the second reference object description information and the third reference object description information is greater than a preset similarity threshold.
2) Performing word segmentation on the second reference object description information to obtain a second word sequence, and performing word segmentation on the third reference object description information to obtain a third word sequence.
3) For each modifier in the second word sequence that describes the second reference object, extracting a phrase containing the modifier from the second reference object description information as a first candidate object characteristic description phrase; for each modifier in the third word sequence that describes the second reference object, extracting a phrase containing the modifier from the third reference object description information as a second candidate object characteristic description phrase.
4) Determining each first candidate object characteristic description phrase and each second candidate object characteristic description phrase as a reference object characteristic description phrase.
5) Performing model training with the second reference object description information as the input of the phrase generation model and the reference object characteristic description phrases as its target output, to obtain the phrase generation model.
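The phrase harvesting of steps 2) to 4) might be sketched as follows, again assuming jieba part-of-speech tags for modifier detection and approximating a "phrase containing the modifier" by the comma-delimited clause in which it occurs; neither choice is fixed by the disclosure:

    import re

    import jieba.posseg as pseg

    def reference_property_phrases(second_desc: str, third_desc: str) -> set[str]:
        # Steps 2)-4): harvest, from both similar descriptions, every clause that
        # contains a modifier (here: an adjective) describing the reference object.
        phrases = set()
        for desc in (second_desc, third_desc):
            clauses = [c for c in re.split(r"[，。；,;]", desc) if c]
            modifiers = {w for w, flag in pseg.cut(desc) if flag.startswith("a")}
            for m in modifiers:
                phrases.update(c for c in clauses if m in c)
        return phrases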
In the training process of the phrase generation model, the reference object characteristic description phrases corresponding to the second reference object description information are determined based on both the second reference object description information and the similar third reference object description information, so that the reference phrases reflect the characteristics of the second reference object as fully as possible. This improves the phrase generation performance of the model, enables the generated object characteristic description phrases to fully reflect the characteristics of the target object, and thereby improves the accuracy of the subtitle information. In addition, since the reference object characteristic description phrases are generated automatically, the problem of limited training samples affecting model training is avoided, which further improves the phrase generation performance of the model.
The following describes in detail the specific embodiment of generating the subtitle information corresponding to the target object in the video according to the brief description information and the object property description phrase in S104.
In one embodiment, a single object property description phrase is generated in S103. In this case, the brief description information generated in S102 and the object property description phrase generated in S103 may be directly combined to obtain the subtitle information corresponding to the target object in the video.
Illustratively, based on the original object description information of the target object "wardrobe" in the above example, the generated brief description information is "push-pull sliding assembled combined clothes cabinet", and the generated object characteristic description phrase is "solid wood construction, high-end atmosphere". Combining the two yields the subtitle information corresponding to the target object "wardrobe" in the video: "push-pull sliding assembled combined clothes cabinet, solid wood construction, high-end atmosphere".
In another embodiment, a plurality of object property description phrases are generated in S103. In this case, subtitle information corresponding to the target object in the video may be generated through S1041 and S1042 shown in fig. 2.
In S1041, the brief description information is respectively combined with each object property description phrase to obtain a plurality of candidate subtitle information.
In S1042, target subtitle information is determined from among the plurality of candidate subtitle information as subtitle information corresponding to the target object in the video.
Specifically, for each piece of candidate subtitle information, a similarity between the candidate subtitle information and the original object description information may be determined; then, the candidate subtitle information with the highest similarity to the original object description information is determined as the target subtitle information. The similarity may be, for example, cosine similarity or Euclidean distance similarity.
In addition, the similarity between the candidate subtitle information and the original object description information may be determined as follows: first, semantic information corresponding to the candidate subtitle information and semantic information corresponding to the original object description information are determined; then, the similarity between the two pieces of semantic information is calculated and used as the similarity between the candidate subtitle information and the original object description information. Specifically, for each piece of candidate subtitle information, a semantic vector (i.e., semantic information) corresponding to the candidate subtitle information and a semantic vector corresponding to the original object description information are generated through a title vector characterization service, and the cosine similarity or Euclidean distance similarity between the two semantic vectors is then calculated and used as the similarity between the candidate subtitle information and the original object description information.
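As an illustration, the selection step can be sketched with a sentence encoder standing in for the title vector characterization service; the sentence-transformers model name below is an illustrative choice, not part of the disclosure:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Illustrative multilingual encoder standing in for the title vector
    # characterization service mentioned above.
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    def pick_caption(candidates: list[str], original_description: str) -> str:
        # Embed all candidate captions plus the original description in one batch.
        vecs = encoder.encode(candidates + [original_description])
        doc_vec = vecs[-1]
        # Cosine similarity between each candidate vector and the description vector.
        sims = [
            float(np.dot(v, doc_vec) / (np.linalg.norm(v) * np.linalg.norm(doc_vec)))
            for v in vecs[:-1]
        ]
        return candidates[int(np.argmax(sims))]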
Illustratively, based on the original object description information of the target object "wardrobe" in the above example, the generated brief description information is "push-pull sliding assembled combined clothes cabinet", and the generated object characteristic description phrases for describing the characteristics of the wardrobe are: ① European style, simple atmosphere; ② solid wood construction, high-end atmosphere; ③ multifunctional solid wood, large capacity, foldable; ④ simple atmosphere, waterproof, insect-proof and dust-proof; ⑤ European wood grain, simple but not simple; ⑥ simple and modern, European style, large capacity. Combining the brief description information "push-pull sliding assembled combined clothes cabinet" with each of the 6 object characteristic description phrases yields the following 6 pieces of candidate caption information:
(1) Push-pull sliding assembled combined clothes cabinet, European style, simple atmosphere;
(2) Push-pull sliding assembled combined clothes cabinet, solid wood construction, high-end atmosphere;
(3) Push-pull sliding assembled combined clothes cabinet, multifunctional solid wood, large capacity, foldable;
(4) Push-pull sliding assembled combined clothes cabinet, simple atmosphere, waterproof, insect-proof and dust-proof;
(5) Push-pull sliding assembled combined clothes cabinet, European wood grain, simple but not simple;
(6) Push-pull sliding assembled combined clothes cabinet, simple and modern, European style, large capacity.
Among these, the candidate caption information with the highest similarity to the original object description information is candidate caption information (2), so candidate caption information (2) is determined as the target caption information; that is, the caption information corresponding to the target object "wardrobe" in the video is "push-pull sliding assembled combined clothes cabinet, solid wood construction, high-end atmosphere".
In addition, the object property description phrases generated in S103 are not necessarily all derived from the original object description information, so subtitle information generated from them may contain information that does not appear in the original object description information, and the accuracy of the generated subtitle information cannot be guaranteed. For this reason, before the target subtitle information is determined from the plurality of candidate subtitle information in S1042, the plurality of candidate subtitle information obtained in S1041 needs to be filtered to ensure the accuracy and efficiency of subsequent subtitle generation. Specifically, as shown in fig. 3, S104 further includes S1043.
In S1043, for each piece of candidate subtitle information, at least one object property description word for describing the properties of the target object is extracted from the candidate subtitle information; if any extracted object property description word does not appear in the original object description information, the candidate subtitle information is filtered out.
In the present disclosure, the object property description word may be a modifier describing the material, category, style, or the like of the target object. After the filtering operation of S1043 is performed on the plurality of candidate subtitle information, S1042 may determine the target subtitle information from the candidate subtitle information remaining after the filtering operation. Specifically, for each piece of candidate subtitle information remaining after the filtering operation, a similarity between the candidate subtitle information and the original object description information may be determined; then, the candidate subtitle information with the highest similarity to the original object description information is determined as the target subtitle information.
A specific embodiment of extracting, in S1043, at least one object property description word for describing the properties of the target object from the candidate subtitle information is described in detail below. Specifically, named entity recognition (NER) may be performed on the candidate subtitle information to obtain at least one named entity, and the at least one named entity is used as the at least one object property description word for describing the properties of the target object.
Here, NER is used to identify object property description entities in the candidate subtitle information, such as material, category, and style entities, that describe properties of the target object.
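A minimal sketch of the S1043 filter follows, with a part-of-speech-based descriptor extractor standing in for the NER step; any entity recognizer covering material, category, and style words could be substituted:

    import jieba.posseg as pseg

    def property_words(caption: str) -> list[str]:
        # Stand-in for the NER step: adjectives serve as property descriptors.
        return [w for w, flag in pseg.cut(caption) if flag.startswith("a")]

    def filter_candidates(candidates: list[str], original_description: str) -> list[str]:
        # Keep only captions whose every extracted descriptor appears verbatim
        # in the original object description information.
        return [
            c for c in candidates
            if all(word in original_description for word in property_words(c))
        ]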
For example, based on the target object "wardrobe" in the above example, 6 pieces of candidate subtitle information are obtained by combining in S1041, specifically as follows:
(1) Push-pull sliding assembled combined clothes cabinet, European style, simple atmosphere;
(2) Push-pull sliding assembled combined clothes cabinet, solid wood construction, high-end atmosphere;
(3) Push-pull sliding assembled combined clothes cabinet, multifunctional solid wood, large capacity, foldable;
(4) Push-pull sliding assembled combined clothes cabinet, simple atmosphere, waterproof, insect-proof and dust-proof;
(5) Push-pull sliding assembled combined clothes cabinet, European wood grain, simple but not simple;
(6) Push-pull sliding assembled combined clothes cabinet, simple and modern, European style, large capacity.
For the candidate subtitle information (1), the object property description words extracted from it for describing the properties of the target object "wardrobe" are "European style", "simple", and "atmosphere"; all three appear in the original object description information, so candidate subtitle information (1) is retained.
For the candidate subtitle information (2), the extracted object property description words are "solid wood", "high-end", and "atmosphere"; all three appear in the original object description information, so candidate subtitle information (2) is retained.
For the candidate subtitle information (3), the extracted object property description words are "multifunctional", "solid wood", "large capacity", and "foldable"; "foldable" does not appear in the original object description information, i.e., candidate subtitle information (3) contains an object property description word that does not appear in the original object description information, so candidate subtitle information (3) is filtered out of the 6 candidates.
For the candidate subtitle information (4), the extracted object property description words are "simple", "atmosphere", "waterproof", "insect-proof", and "dust-proof"; "waterproof", "insect-proof", and "dust-proof" do not appear in the original object description information, so candidate subtitle information (4) is filtered out of the 6 candidates.
For the candidate subtitle information (5), the extracted object property description words are "European", "wood grain", "simple", and "not simple"; "wood grain" and "not simple" do not appear in the original object description information, so candidate subtitle information (5) is filtered out of the 6 candidates.
For the candidate subtitle information (6), the extracted object property description words are "simple", "modern", "European style", and "large capacity"; all four appear in the original object description information, so candidate subtitle information (6) is retained.
Thus, the candidate caption information remaining after the filtering operation includes candidate caption information (1), candidate caption information (2), and candidate caption information (6), namely: "push-pull sliding assembled combined clothes cabinet, European style, simple atmosphere"; "push-pull sliding assembled combined clothes cabinet, solid wood construction, high-end atmosphere"; and "push-pull sliding assembled combined clothes cabinet, simple and modern, European style, large capacity". After that, the target caption information may be determined in S1042 from among these three candidates.
Based on the same inventive concept, the present disclosure also provides a subtitle generating apparatus. As shown in fig. 4, the apparatus 400 includes: an acquisition module 401, configured to acquire original object description information of a target object to be introduced in a video, where the original object description information is used to describe original inherent characteristics of the target object; a refinement module 402, configured to perform content refinement on the original object description information acquired by the acquisition module 401, to obtain brief description information of the target object; a phrase generation module 403, configured to generate an object characteristic description phrase for describing the characteristics of the target object according to the original object description information acquired by the acquisition module 401; and a caption generation module 404, configured to generate caption information corresponding to the target object in the video according to the brief description information obtained by the refinement module 402 and the object characteristic description phrase generated by the phrase generation module 403.
In the above technical solution, after the original object description information describing the original inherent characteristics of the target object to be introduced in the video is obtained, the brief description information of the target object and the object characteristic description phrase for describing the characteristics of the target object are determined according to the original object description information; then, the subtitle information corresponding to the target object in the video is generated according to the brief description information and the object characteristic description phrase. In this way, subtitle information related to the target object to be introduced in the video can be generated automatically while ensuring its readability and accuracy, so that a user can quickly learn about the target object through the video.
In one embodiment, the refinement module 402 is configured to statistically extract keywords from the original object description information, and then to splice the keywords to obtain the brief description information.
In another embodiment, the refinement module 402 is configured to input the original object description information into a pre-trained content refinement model to perform content refinement on the original object description information, to obtain the brief description information of the target object. In this way, the brief description information of the target object can be extracted from the original object description information automatically, conveniently, and quickly.
In one embodiment, the phrase generation module 403 is configured to extract, from the original object description information, description words related to the characteristics of the target object, and then to combine and splice the description words to obtain the object characteristic description phrase for describing the characteristics of the target object.
In another embodiment, the phrase generation module 403 is configured to: extract description words related to the characteristics of the target object from the original object description information; determine, according to a pre-stored correspondence between objects and characteristic description words (which may be built from a corpus), target characteristic description words corresponding to the target object; and combine and splice the extracted description words and the target characteristic description words determined according to the correspondence, to obtain the object characteristic description phrase for describing the characteristics of the target object.
In yet another embodiment, the phrase generation module 403 is configured to input the original object description information into a pre-trained phrase generation model to obtain the object property description phrase for describing the properties of the target object. In this way, object characteristic description phrases can be generated automatically, conveniently, and quickly.
Optionally, there is one object property description phrase, and the caption generation module 404 is configured to directly combine the brief description information and the object property description phrase to obtain the caption information corresponding to the target object in the video.
Optionally, there are a plurality of object property description phrases, and the subtitle generation module 404 includes: a combination sub-module, configured to combine the brief description information with each object property description phrase to obtain a plurality of pieces of candidate subtitle information; and a determining sub-module, configured to determine target subtitle information from the plurality of pieces of candidate subtitle information as the subtitle information corresponding to the target object in the video.
Optionally, the subtitle generation module 404 further includes a filtering sub-module, configured to: extract, for each piece of candidate subtitle information, at least one object property description word for describing the properties of the target object from the candidate subtitle information; and if any extracted object property description word does not appear in the original object description information, filter out the candidate subtitle information. The determining sub-module is then configured to determine the target subtitle information from the candidate subtitle information remaining after the filtering operation.
Optionally, the determining submodule includes: a similarity determining submodule, configured to determine, for each piece of candidate subtitle information remaining after the filtering operation, a similarity between the candidate subtitle information and the original object description information; and the subtitle determining sub-module is used for determining the candidate subtitle information with the highest similarity with the original object description information as the target subtitle information.
Optionally, the filtering sub-module is configured to perform named entity recognition on the candidate subtitle information to obtain at least one named entity, and to use the at least one named entity as the at least one object property description word for describing the properties of the target object.
Optionally, the determining submodule includes: a similarity determining sub-module, configured to determine, for each piece of candidate subtitle information, a similarity between the candidate subtitle information and the original object description information; and the subtitle determining sub-module is used for determining the candidate subtitle information with the highest similarity with the original object description information as the target subtitle information.
Optionally, the similarity determining submodule includes: the semantic information determining submodule is used for determining semantic information corresponding to the candidate subtitle information and semantic information corresponding to the original object description information; and the calculating sub-module is used for calculating the similarity between the semantic information corresponding to the candidate subtitle information and the semantic information corresponding to the original object description information, and taking the similarity as the similarity between the candidate subtitle information and the original object description information.
Optionally, the refinement module 402 is configured to input the original object description information into a pre-trained content refinement model to perform content refinement on the original object description information, to obtain the brief description information of the target object. The content refinement model is trained by a first model training device, which includes: a first description information acquisition module, configured to acquire first reference object description information of a first reference object, where the first reference object description information is used to describe original inherent characteristics of the first reference object; a first word segmentation module, configured to segment the first reference object description information to obtain a first word sequence; a first determining module, configured to delete, for each word in the first word sequence other than the first reference object, that word to obtain a new first word sequence, and to sequentially connect the words in each new first word sequence to obtain candidate brief description information of the first reference object; a judging module, configured to judge whether any new first word sequence contains words other than the first reference object; a triggering module, configured to, if any new first word sequence contains words other than the first reference object, take each new first word sequence as the first word sequence and trigger the first determining module to repeat the deletion and connection to obtain further candidate brief description information; a second determining module, configured to, if no new first word sequence contains words other than the first reference object, take the candidate brief description information with the highest fluency as reference brief description information; and a first model training module, configured to perform model training with the first reference object description information as the input of the content refinement model and the reference brief description information as its target output, to obtain the content refinement model.
Optionally, the phrase generation module 403 is configured to input the original object description information into a pre-trained phrase generation model to obtain the object characteristic description phrase for describing the characteristics of the target object. The phrase generation model is trained by a second model training device, which includes: a second description information acquisition module, configured to acquire second reference object description information and third reference object description information of a second reference object, where both are used to describe original inherent characteristics of the second reference object, and the similarity between the second reference object description information and the third reference object description information is greater than a preset similarity threshold; a second word segmentation module, configured to segment the second reference object description information to obtain a second word sequence, and to segment the third reference object description information to obtain a third word sequence; a third determining module, configured to: extract, for each modifier in the second word sequence that describes the second reference object, a phrase containing the modifier from the second reference object description information as a first candidate object characteristic description phrase; and extract, for each modifier in the third word sequence that describes the second reference object, a phrase containing the modifier from the third reference object description information as a second candidate object characteristic description phrase; a fourth determining module, configured to determine each first candidate object characteristic description phrase and each second candidate object characteristic description phrase as a reference object characteristic description phrase; and a second model training module, configured to perform model training with the second reference object description information as the input of the phrase generation model and the reference object characteristic description phrases as its target output, to obtain the phrase generation model.
Note that the first model training device may be independent of the subtitle generating apparatus 400 or integrated into it, and likewise the second model training device may be independent of the subtitle generating apparatus 400 or integrated into it; the present disclosure does not specifically limit this.
The present disclosure also provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the above-described subtitle generating method provided by the present disclosure.
Referring now to fig. 5, a schematic diagram of an electronic device (e.g., a terminal device or server) 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 5, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; output devices 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 508 including, for example, magnetic tape, hard disk, and the like; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate wirelessly or by wire with other devices to exchange data. While fig. 5 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided; more or fewer means may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire original object description information of a target object to be introduced in a video, wherein the original object description information is used for describing original inherent characteristics of the target object; perform content refinement on the original object description information to obtain brief description information of the target object; generate an object characteristic description phrase for describing the characteristics of the target object according to the original object description information; and generate subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module does not, in some cases, constitute a limitation of the module itself; for example, the acquisition module may also be described as "a module that acquires original object description information of a target object to be introduced in a video".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example 1 provides a subtitle generating method, including: acquiring original object description information of a target object to be introduced in a video, wherein the original object description information is used for describing original inherent characteristics of the target object; performing content refinement on the original object description information to obtain brief description information of the target object; generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information; and generating subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase.
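For illustration only, the four steps of example 1 can be sketched in Python as follows; generate_subtitle, content_refine, and generate_phrases are hypothetical names standing in for the pre-trained models of examples 8 and 9, and joining the parts with a comma is an assumed combination strategy, not a requirement of the disclosure.

```python
def generate_subtitle(original_description: str, content_refine, generate_phrases) -> list:
    """Non-limiting sketch of example 1: refine the original object
    description information, generate object characteristic description
    phrases, and combine them into candidate subtitle information."""
    brief = content_refine(original_description)      # brief description information
    phrases = generate_phrases(original_description)  # characteristic phrases
    # Combine the brief description with each phrase (see example 2).
    return [brief + "，" + phrase for phrase in phrases]
```

With stub models such as content_refine = lambda s: "小米手机" and generate_phrases = lambda s: ["拍照清晰", "续航持久"], the sketch would yield candidates such as "小米手机，拍照清晰".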
According to one or more embodiments of the present disclosure, example 2 provides the method of example 1, wherein there are a plurality of the object characteristic description phrases; and the generating subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase includes: combining the brief description information with each object characteristic description phrase respectively to obtain a plurality of pieces of candidate subtitle information; and determining target subtitle information from the plurality of pieces of candidate subtitle information as the subtitle information corresponding to the target object in the video.
According to one or more embodiments of the present disclosure, example 3 provides the method of example 2, wherein the generating subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase further includes: extracting, for each piece of candidate subtitle information, at least one object characteristic description word for describing a characteristic of the target object from the candidate subtitle information, and if any object characteristic description word that does not appear in the original object description information exists, filtering out that candidate subtitle information; and the determining target subtitle information from the plurality of pieces of candidate subtitle information includes: determining the target subtitle information from the candidate subtitle information remaining after the filtering operation.
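A minimal sketch of the consistency filter of example 3 follows, assuming a pluggable extract_characteristic_words function (one possible implementation follows example 5) and assuming that "appears in the original object description information" is tested as a simple substring match:

```python
def filter_candidates(candidates, original_description, extract_characteristic_words):
    """Keep only the candidate subtitle information whose extracted
    characteristic words all appear in the original description."""
    kept = []
    for subtitle in candidates:
        words = extract_characteristic_words(subtitle)
        # Drop the candidate if any characteristic word is absent from
        # the original object description information.
        if all(word in original_description for word in words):
            kept.append(subtitle)
    return kept
```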
In accordance with one or more embodiments of the present disclosure, example 4 provides the method of example 3, wherein the determining the target subtitle information from the candidate subtitle information remaining after the filtering operation includes: for each piece of candidate subtitle information remaining after the filtering operation, determining a similarity between the candidate subtitle information and the original object description information; and determining the candidate subtitle information having the highest similarity to the original object description information as the target subtitle information.
According to one or more embodiments of the present disclosure, example 5 provides the method of example 3, wherein the extracting at least one object characteristic description word for describing a characteristic of the target object from the candidate subtitle information includes: performing named entity recognition on the candidate subtitle information to obtain at least one named entity, and using the at least one named entity as the at least one object characteristic description word for describing a characteristic of the target object.
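One way to realize the named entity recognition of example 5 is with spaCy's Chinese pipeline; the disclosure does not name a specific NER tool, so the library and model choice here are assumptions (the model must first be installed with python -m spacy download zh_core_web_sm):

```python
import spacy

nlp = spacy.load("zh_core_web_sm")  # assumed Chinese NER model

def extract_characteristic_words(candidate_subtitle: str) -> list:
    """Run named entity recognition on one piece of candidate subtitle
    information and return the entity texts as the object
    characteristic description words."""
    doc = nlp(candidate_subtitle)
    return [ent.text for ent in doc.ents]
```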
In accordance with one or more embodiments of the present disclosure, example 6 provides the method of example 2, wherein the determining target subtitle information from the plurality of pieces of candidate subtitle information includes: for each piece of candidate subtitle information, determining a similarity between the candidate subtitle information and the original object description information; and determining the candidate subtitle information having the highest similarity to the original object description information as the target subtitle information.
According to one or more embodiments of the present disclosure, example 7 provides the method of example 4 or 6, wherein the determining a similarity between the candidate subtitle information and the original object description information includes: determining semantic information corresponding to the candidate subtitle information and semantic information corresponding to the original object description information; and calculating a similarity between the semantic information corresponding to the candidate subtitle information and the semantic information corresponding to the original object description information as the similarity between the candidate subtitle information and the original object description information.
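Example 7 leaves the form of the "semantic information" open; one plausible reading uses sentence embeddings and cosine similarity, as in the sketch below. The sentence-transformers model name is an assumption, and any multilingual sentence encoder could be substituted.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed encoder

def pick_target_subtitle(candidates: list, original_description: str) -> str:
    """Return the candidate subtitle information whose semantics are
    closest to the original object description information
    (examples 4 and 6)."""
    cand_vecs = model.encode(candidates, convert_to_tensor=True)
    orig_vec = model.encode(original_description, convert_to_tensor=True)
    sims = util.cos_sim(cand_vecs, orig_vec).squeeze(1)  # cosine similarities
    return candidates[int(sims.argmax())]
```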
In accordance with one or more embodiments of the present disclosure, example 8 provides the method of example 1, wherein the performing content refinement on the original object description information to obtain the brief description information of the target object includes: inputting the original object description information into a pre-trained content extraction model to perform content refinement on the original object description information, so as to obtain the brief description information of the target object; wherein the content extraction model is trained in the following manner: acquiring first reference object description information of a first reference object, wherein the first reference object description information is used for describing original inherent characteristics of the first reference object; performing word segmentation on the first reference object description information to obtain a first word sequence; deleting, for each word other than the first reference object in the first word sequence, that word to obtain a new first word sequence, and sequentially connecting the words in each new first word sequence to obtain candidate brief description information of the first reference object; judging whether a word other than the first reference object exists in any new first word sequence; if a word other than the first reference object exists in any new first word sequence, taking that new first word sequence as the first word sequence and repeating the steps from the deleting step to the judging step; if no word other than the first reference object exists in any new first word sequence, taking the candidate brief description information with the highest fluency as reference brief description information; and performing model training with the first reference object description information as an input of the content extraction model and the reference brief description information as a target output of the content extraction model, so as to obtain the content extraction model.
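The candidate-enumeration loop of example 8 amounts to exploring every word sequence reachable by deleting non-target words one at a time, which the following sketch makes concrete. The enumeration is exponential in sentence length, so a practical data-construction pipeline would cap it; the fluency scorer is a placeholder, since the disclosure does not fix a fluency measure (a language-model score is one plausible choice).

```python
def candidate_briefs(words: list, target: str) -> set:
    """Enumerate candidate brief description information per example 8:
    starting from the segmented first word sequence, repeatedly delete
    one word other than the target object and reconnect the rest."""
    original = tuple(words)
    results, frontier, seen = set(), [original], {original}
    while frontier:
        seq = frontier.pop()
        if seq != original:
            # every sequence reached by at least one deletion is a candidate
            results.add("".join(seq))
        for i, word in enumerate(seq):
            if word == target:
                continue  # the target object itself is never deleted
            shorter = seq[:i] + seq[i + 1:]
            if shorter not in seen:
                seen.add(shorter)
                frontier.append(shorter)
    return results

def reference_brief(words: list, target: str, fluency) -> str:
    """Pick the most fluent candidate as the reference brief description."""
    return max(candidate_briefs(words, target), key=fluency)
```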
According to one or more embodiments of the present disclosure, example 9 provides the method of example 1 or 8, wherein the generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information includes: inputting the original object description information into a pre-trained phrase generation model to obtain the object characteristic description phrase for describing the characteristics of the target object; wherein the phrase generation model is trained in the following manner: acquiring second reference object description information and third reference object description information of a second reference object, wherein the second reference object description information and the third reference object description information are used for describing original inherent characteristics of the second reference object, and a similarity between the second reference object description information and the third reference object description information is greater than a preset similarity threshold; performing word segmentation on the second reference object description information to obtain a second word sequence, and performing word segmentation on the third reference object description information to obtain a third word sequence; extracting, for each modifier in the second word sequence that describes the second reference object, a phrase containing that modifier from the second reference object description information as a first candidate object characteristic description phrase; extracting, for each modifier in the third word sequence that describes the second reference object, a phrase containing that modifier from the third reference object description information as a second candidate object characteristic description phrase; determining each first candidate object characteristic description phrase and each second candidate object characteristic description phrase as reference object characteristic description phrases; and performing model training with the second reference object description information as an input of the phrase generation model and the reference object characteristic description phrases as target outputs of the phrase generation model, so as to obtain the phrase generation model.
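The modifier-phrase mining of example 9 can be sketched with jieba's part-of-speech tags; treating adjective-tagged words (flags beginning with "a") as modifiers, and pairing each modifier with the word that follows it, are assumptions, since the disclosure does not define how modifiers or their phrases are delimited.

```python
import jieba.posseg as pseg

def candidate_characteristic_phrases(description: str) -> list:
    """Extract phrases containing each adjective-like modifier from one
    piece of reference object description information."""
    pairs = [(word, flag) for word, flag in pseg.cut(description)]
    phrases = []
    for i, (word, flag) in enumerate(pairs):
        if flag.startswith("a"):  # adjective family in jieba's tagset
            # take the modifier together with the word it modifies, if any
            tail = pairs[i + 1][0] if i + 1 < len(pairs) else ""
            phrases.append(word + tail)
    return phrases
```

Running the extractor over both the second and the third reference object description information, and pooling the results, would yield the reference object characteristic description phrases used as training targets.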
According to one or more embodiments of the present disclosure, example 10 provides a subtitle generating apparatus, including: an acquisition module for acquiring original object description information of a target object to be introduced in a video, wherein the original object description information is used for describing original inherent characteristics of the target object; an extraction module for performing content refinement on the original object description information acquired by the acquisition module to obtain brief description information of the target object; a phrase generating module for generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information acquired by the acquisition module; and a subtitle generating module for generating subtitle information corresponding to the target object in the video according to the brief description information obtained by the extraction module and the object characteristic description phrase generated by the phrase generating module.
According to one or more embodiments of the present disclosure, example 11 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of any of examples 1-9.
According to one or more embodiments of the present disclosure, example 12 provides an electronic device, comprising: a storage device having a computer program stored thereon; and a processing device for executing the computer program in the storage device to implement the steps of the method of any one of examples 1-9.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to herein is not limited to technical solutions formed by the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method, and will not be elaborated here.

Claims (11)

1. A subtitle generating method, comprising:
acquiring original object description information of a target object to be introduced in a video, wherein the original object description information is used for describing original inherent characteristics of the target object;
inputting the original object description information into a pre-trained content extraction model to extract the content of the original object description information so as to obtain the brief description information of the target object;
generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information;
generating subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase;
wherein the content extraction model is trained in the following manner:
acquiring first reference object description information of a first reference object, wherein the first reference object description information is used for describing original inherent characteristics of the first reference object;
performing word segmentation on the first reference object description information to obtain a first word sequence;
deleting, for each word other than the first reference object in the first word sequence, that word to obtain a new first word sequence, and sequentially connecting the words in each new first word sequence to obtain candidate brief description information of the first reference object;
judging whether a word other than the first reference object exists in any new first word sequence;
if a word other than the first reference object exists in any new first word sequence, taking that new first word sequence as the first word sequence and repeating the steps from the deleting step to the judging step;
if no word other than the first reference object exists in any new first word sequence, taking the candidate brief description information with the highest fluency as reference brief description information; and
performing model training with the first reference object description information as an input of the content extraction model and the reference brief description information as a target output of the content extraction model, so as to obtain the content extraction model.
2. The method of claim 1, wherein there are a plurality of the object characteristic description phrases;
and the generating subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase comprises:
combining the brief description information with each object characteristic description phrase respectively to obtain a plurality of pieces of candidate subtitle information;
and determining target subtitle information from the plurality of pieces of candidate subtitle information as the subtitle information corresponding to the target object in the video.
3. The method of claim 2, wherein the generating subtitle information corresponding to the target object in the video according to the brief description information and the object characteristic description phrase further comprises:
extracting, for each piece of candidate subtitle information, at least one object characteristic description word for describing a characteristic of the target object from the candidate subtitle information, and if any object characteristic description word that does not appear in the original object description information exists, filtering out that candidate subtitle information;
and the determining target subtitle information from the plurality of pieces of candidate subtitle information comprises:
determining the target subtitle information from the candidate subtitle information remaining after the filtering operation.
4. The method of claim 3, wherein the determining the target subtitle information from the candidate subtitle information remaining after the filtering operation comprises:
for each piece of candidate subtitle information remaining after the filtering operation, determining a similarity between the candidate subtitle information and the original object description information;
and determining the candidate subtitle information having the highest similarity to the original object description information as the target subtitle information.
5. The method of claim 3, wherein the extracting at least one object characteristic description word for describing a characteristic of the target object from the candidate subtitle information comprises:
performing named entity recognition on the candidate subtitle information to obtain at least one named entity, and using the at least one named entity as the at least one object characteristic description word for describing a characteristic of the target object.
6. The method of claim 2, wherein the determining target subtitle information from the plurality of pieces of candidate subtitle information comprises:
for each piece of candidate subtitle information, determining a similarity between the candidate subtitle information and the original object description information;
and determining the candidate subtitle information having the highest similarity to the original object description information as the target subtitle information.
7. The method according to claim 4 or 6, wherein the determining a similarity between the candidate subtitle information and the original object description information comprises:
determining semantic information corresponding to the candidate subtitle information and semantic information corresponding to the original object description information;
and calculating a similarity between the semantic information corresponding to the candidate subtitle information and the semantic information corresponding to the original object description information as the similarity between the candidate subtitle information and the original object description information.
8. The method of claim 1, wherein the generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information comprises:
inputting the original object description information into a pre-trained phrase generation model to obtain the object characteristic description phrase for describing the characteristics of the target object;
wherein the phrase generation model is trained in the following manner:
acquiring second reference object description information and third reference object description information of a second reference object, wherein the second reference object description information and the third reference object description information are used for describing original inherent characteristics of the second reference object, and a similarity between the second reference object description information and the third reference object description information is greater than a preset similarity threshold;
performing word segmentation on the second reference object description information to obtain a second word sequence, and performing word segmentation on the third reference object description information to obtain a third word sequence;
extracting, for each modifier in the second word sequence that describes the second reference object, a phrase containing that modifier from the second reference object description information as a first candidate object characteristic description phrase; and extracting, for each modifier in the third word sequence that describes the second reference object, a phrase containing that modifier from the third reference object description information as a second candidate object characteristic description phrase;
determining each first candidate object characteristic description phrase and each second candidate object characteristic description phrase as reference object characteristic description phrases;
and performing model training with the second reference object description information as an input of the phrase generation model and the reference object characteristic description phrases as target outputs of the phrase generation model, so as to obtain the phrase generation model.
9. A subtitle generating apparatus, comprising:
The acquisition module is used for acquiring original object description information of a target object to be introduced in the video, wherein the original object description information is used for describing original inherent characteristics of the target object;
The extraction module is used for inputting the original object description information into a pre-trained content extraction model so as to extract the content of the original object description information acquired by the acquisition module and obtain the brief description information of the target object;
The phrase generating module is used for generating an object characteristic description phrase for describing the characteristics of the target object according to the original object description information acquired by the acquiring module;
The subtitle generating module is used for generating subtitle information corresponding to the target object in the video according to the brief description information obtained by the extracting module and the object characteristic description phrase generated by the phrase generating module;
wherein the content extraction model is trained by a first model training device comprising:
The first description information acquisition module is used for acquiring first reference object description information of a first reference object, wherein the first reference object description information is used for describing original inherent characteristics of the first reference object;
The first word segmentation module is used for performing word segmentation on the first reference object description information to obtain a first word sequence;
The first determining module is used for deleting, for each word other than the first reference object in the first word sequence, that word to obtain a new first word sequence, and sequentially connecting the words in each new first word sequence to obtain candidate brief description information of the first reference object;
The judging module is used for judging whether a word other than the first reference object exists in any new first word sequence;
The triggering module is used for, if a word other than the first reference object exists in any new first word sequence, taking that new first word sequence as the first word sequence and triggering the first determining module and the judging module to repeat their operations;
The second determining module is used for, if no word other than the first reference object exists in any new first word sequence, taking the candidate brief description information with the highest fluency as reference brief description information;
The first model training module is used for performing model training with the first reference object description information as an input of the content extraction model and the reference brief description information as a target output of the content extraction model, so as to obtain the content extraction model.
10. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-8.
11. An electronic device, comprising:
a storage device having a computer program stored thereon; and
a processing device for executing the computer program in the storage device to carry out the steps of the method according to any one of claims 1-8.
CN202110420704.4A 2021-04-19 2021-04-19 Subtitle generation method, subtitle generation device, medium and electronic equipment Active CN113033190B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110420704.4A CN113033190B (en) 2021-04-19 2021-04-19 Subtitle generation method, subtitle generation device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110420704.4A CN113033190B (en) 2021-04-19 2021-04-19 Subtitle generation method, subtitle generation device, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113033190A CN113033190A (en) 2021-06-25
CN113033190B (en) 2024-05-17

Family

ID=76456858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110420704.4A Active CN113033190B (en) 2021-04-19 2021-04-19 Subtitle generation method, subtitle generation device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113033190B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222193B (en) * 2021-12-03 2024-01-05 北京影谱科技股份有限公司 Video subtitle time alignment model training method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107577663A (en) * 2017-08-24 2018-01-12 北京奇艺世纪科技有限公司 A kind of key-phrase extraction method and apparatus
CN110765244A (en) * 2019-09-18 2020-02-07 平安科技(深圳)有限公司 Method and device for acquiring answering, computer equipment and storage medium
CN110928981A (en) * 2019-11-18 2020-03-27 佰聆数据股份有限公司 Method, system and storage medium for establishing and perfecting iteration of text label system
CN111723566A (en) * 2019-03-21 2020-09-29 阿里巴巴集团控股有限公司 Method and device for reconstructing product information
CN111859940A (en) * 2019-04-23 2020-10-30 北京嘀嘀无限科技发展有限公司 Keyword extraction method and device, electronic equipment and storage medium
CN111859930A (en) * 2020-07-27 2020-10-30 北京字节跳动网络技术有限公司 Title generation method and device, electronic equipment and storage medium
CN112446208A (en) * 2020-12-09 2021-03-05 北京有竹居网络技术有限公司 Method, device and equipment for generating advertisement title and storage medium


Also Published As

Publication number Publication date
CN113033190A (en) 2021-06-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant