CN113132781A - Video generation method and apparatus, electronic device, and computer-readable storage medium


Info

Publication number
CN113132781A
Authority
CN
China
Prior art keywords
image data
data
text
text segment
video
Prior art date
Legal status
Granted
Application number
CN201911412228.0A
Other languages
Chinese (zh)
Other versions
CN113132781B (en)
Inventor
金莉
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201911412228.0A
Priority to PCT/CN2020/141204 (published as WO2021136334A1)
Publication of CN113132781A
Application granted
Publication of CN113132781B
Status: Active


Classifications

    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD], including the following subgroups:
    • H04N21/2343 Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234336 Reformatting by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
    • H04N21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations; client middleware
    • H04N21/4307 Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440236 Reformatting by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
    • H04N21/472 End-user interface for requesting content, additional data or services; end-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N21/475 End-user interface for inputting end-user data, e.g. personal identification number [PIN], preference data

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Processing Or Creating Images (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a video generation method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises the following steps: acquiring at least one first image data; performing text analysis processing on the first image data to acquire text information contained in the first image data; acquiring at least one first text segment; determining matching data of the text information of the first image data and the first text segment; processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and generating first video data from the plurality of second image data. In this way, an image-text video in which image data and text segments correspond to each other can be generated automatically from various image-text materials, greatly improving the efficiency of image-text video generation.

Description

Video generation method and apparatus, electronic device, and computer-readable storage medium
Technical Field
The present application relates to the field of video generation, and in particular, to a video generation method and apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of internet technology, more and more users learn by watching videos, so a large number of image-text videos are required. Such an image-text video differs from ordinary videos in that it is essentially an aggregate of materials, such as pictures and text, related to specific content. In the prior art, therefore, image-text videos are usually produced only by manual editing: various image and text materials, such as pictures and articles, must be collected and sorted, and then manually arranged to finally form an image-text video. However, this manual production method requires a large amount of labor, so the cost is high and the production efficiency is low.
Therefore, a scheme for automatically producing image-text videos from various image-text materials is needed to reduce production cost and improve efficiency.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The embodiments of the present application provide a video generation method and apparatus, an electronic device, and a computer-readable storage medium, which can automatically analyze the text in an image and establish a matching relationship with audio, thereby generating a video in which text and audio are presented synchronously.
In order to achieve the above object, an embodiment of the present application provides a video generation method, including:
acquiring at least one first image data;
performing text analysis processing on the first image data to acquire text information contained in the first image data;
acquiring at least one first text segment;
determining matching data of the text information of the first image data and the first text segment;
processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
generating first video data from the plurality of second image data.
The embodiment of the present application further provides a video generation method, including:
acquiring at least one first image data;
performing text analysis processing on the first image data to acquire text information contained in the first image data;
inputting the text information into a search engine to obtain at least one first text segment;
determining matching data of the text information of the first image data and the first text segment;
processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
generating first video data from the plurality of second image data.
The embodiment of the present application further provides a video generation method, including:
receiving at least one first image data and at least one first text segment input by a user;
performing text analysis processing on the first image data to acquire text information contained in the first image data;
determining matching data of the text information of the first image data and the first text segment;
presenting the matching data to the user;
receiving indication data input by the user for the matching data, wherein the indication data at least comprises instruction information of a first presentation effect for the matching data;
processing the first image data according to the indication data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has the first presentation effect; and
generating first video data from the plurality of second image data.
An embodiment of the present application further provides a video generating apparatus, including:
an acquisition component for acquiring at least one first image data and at least one first text segment;
an analysis component for performing text analysis processing on the first image data to acquire text information contained in the first image data;
a matching component for determining matching data of the text information of the first image data and the first text segment;
an image data generation component for processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
a video data generation component for generating first video data from the plurality of second image data.
An embodiment of the present application further provides a video generating apparatus, including:
an image data acquisition component for acquiring at least one first image data;
an analysis component for performing text analysis processing on the first image data to acquire text information contained in the first image data;
a text segment acquisition component for inputting the text information into a search engine to acquire at least one first text segment;
a matching component for determining matching data of the text information of the first image data and the first text segment;
an image data generation component for processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
a video data generation component for generating first video data from the plurality of second image data.
An embodiment of the present application further provides a video generating apparatus, including:
a receiving component for receiving at least one first image data and at least one first text segment input by a user;
an analysis component for performing text analysis processing on the first image data to acquire text information contained in the first image data;
a matching component for determining matching data of the text information of the first image data and the first text segment; and
a presentation component for presenting the matching data to the user,
wherein the receiving component is further configured to receive indication data for the matching data input by the user, the indication data comprising at least instruction information for a first presentation effect of the matching data;
the video generation apparatus further includes: an image data generation component for processing the first image data according to the indication data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has the first presentation effect; and
a video data generation component for generating first video data from the second image data.
An embodiment of the present application further provides an electronic device, including:
a processing unit; and
a memory coupled to the processing unit and containing instructions stored thereon that, when executed by the processing unit, cause the electronic device to perform acts comprising:
acquiring at least one first image data;
performing text analysis processing on the first image data to acquire text information contained in the first image data;
acquiring at least one first text segment;
determining matching data of the text information of the first image data and the first text segment;
processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
generating first video data from the plurality of second image data.
Embodiments of the present application further provide a computer-readable storage medium having instructions stored thereon, the instructions comprising:
acquiring at least one first image data;
performing text analysis processing on the first image data to acquire text information contained in the first image data;
acquiring at least one first text segment;
determining matching data of the text information of the first image data and the first text segment;
processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
generating first video data from the plurality of second image data.
According to the video generation method and apparatus, the electronic device, and the computer-readable storage medium of the embodiments of the present application, the text information contained in image data is acquired from the first image data, and the matching data between this text information and at least one text segment is determined, so that a matching relationship between the image data and the at least one text segment can be established. The image data can then be processed based on this matching relationship to generate a plurality of second image data corresponding to the text segments, and video data in which the text information matched with a text segment has the first presentation effect is finally generated from the second image data. In this way, image-text videos in which image data and text segments correspond to each other can be generated automatically from various image-text materials, greatly improving the generation efficiency of image-text videos.
The foregoing is only an overview of the technical solutions of the present disclosure. The embodiments of the present disclosure are described below so that the technical means of the disclosure, as well as its above and other objects, features, and advantages, can be understood more clearly.
Drawings
Figs. 1A to 1C are schematic diagrams illustrating application scenarios of a video generation scheme according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of a video generation method according to an embodiment of the present application;
Fig. 3 is a schematic diagram illustrating an application scenario of a video generation method according to an embodiment of the present application;
Fig. 4 is a further schematic flow chart of a video generation method according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a video generation apparatus according to an embodiment of the present application;
Fig. 6 is a schematic diagram of a video generation apparatus according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a video generation apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
With the development of internet technology, more and more users learn by watching videos, so a large number of image-text videos are required. Such image-text videos differ from ordinary videos in that they are actually aggregates of various materials, such as pictures and text, related to specific content. For example, when certain content needs to be explained, the explanation is based on a picture combined with a piece of explanatory text; that is, a video needs to be produced from one or more pictures together with text or audio corresponding to the content of those pictures. In this case, the display of the picture must be synchronized with the text being explained, and the text may further be converted into audio, thereby providing an audio explanation of the picture content.
In the prior art, however, such image-text videos can usually only be produced manually. For example, various materials, such as pictures, articles, and video clips, must be collected and sorted, and then manually edited and arranged, for example by collecting texts corresponding to the picture contents or writing such texts by hand and generating matching image-text data, to finally form an image-text video. Such manual production requires a large amount of labor, so it is not only costly but also inefficient.
Therefore, a scheme for automatically producing image-text videos from various materials is needed to reduce production cost and improve efficiency.
Figs. 1A to 1C are schematic diagrams illustrating application scenarios of a video generation scheme according to an embodiment of the present application. As shown in Fig. 1A, in one application scenario according to embodiments of the present application, various materials may be obtained as input data from various data sources. These materials may include, for example, pictures, text, mixed data of pictures and text, video, audio, and so on. The pictures and text segments in these materials may be used directly as the picture 11 and text segment 12 for the subsequent video generation processing; alternatively, if material data such as mixed picture-text data, video, or audio is acquired, that data may be processed in advance, and the resulting pictures and text are used as the picture 11 and text segment 12 shown in Fig. 1A. After the picture 11 is acquired, extraction processing may be performed on it to extract the text information it contains. For example, as shown in Fig. 1A, text information 111 to 11n may be extracted from the picture 11. The extracted text information 111 to 11n may be, for example, sub-headings of the sub-contents covered by the picture. Matching data between the text information 111 to 11n and the text segment 12 can then be determined. The matching data referred to here is the correspondence between each of the text information 111 to 11n and the corresponding content in the text segment 12. For example, as shown in Fig. 1A, the text segment 12 may include sub-text segments 121 to 12n, which respectively correspond to the text information 111 to 11n in the picture 11 and may be further descriptions of the sub-contents that the text information 111 to 11n denote. Accordingly, after the matching data between the text segment and the text information of the picture 11 is determined, the picture 11 may be processed according to the matching data to generate a plurality of pictures 131 to 13n for the video. Each of these pictures may include the content of the picture 11 together with one of the sub-text segments of the text segment 12, and a specific presentation effect may be applied to the text information of the picture 11 that corresponds to the included sub-text segment. For example, if the text information 111 of the picture 11 corresponds to the sub-text segment 121 of the text segment 12, that is, the sub-text segment 121 describes and explains the text information 111, the picture 11 may be processed based on this matching data to generate a picture 131 for the video that includes the entire content of the picture 11 and the sub-text segment 121, with a presentation effect applied to the text information 111. For example, as shown in Fig. 1A, the text information 111 may be highlighted, so that a user watching this portion of the video can clearly see that the sub-text segment 121 is describing and explaining the text information 111. By analogy, once the plurality of pictures 131 to 13n have been generated, a video can be generated from them.
Referring next to Figs. 1A and 2, a video generation method according to an embodiment of the present application will be described in detail. Fig. 2 is a schematic flow chart of a video generation method according to an embodiment of the present application. As shown in Fig. 2, the video generation method according to the embodiment of the present application includes the following steps:
S201, acquiring at least one first image data.
in the embodiment of the present application, material used for generating a video can be acquired as first image data from various data sources. For example, pictures containing characters may be acquired as the first image data from various web pages, or pictures containing characters may be directly input as the first image data by a user on a user interface. Of course, in the embodiment of the present application, various pictures may be obtained from various data sources, and then the obtained pictures may be preprocessed to identify the picture containing the text as the first image data. Or any material can be input by the user through the user interface, and the picture input by the user is processed to identify the picture containing the characters as the first image data.
S202, performing a text analysis process on the first image data to obtain text information included in the first image data.
In the embodiment of the present application, after the first image data is acquired, it may be analyzed using various text analysis processes, for example a text extraction technique, so that the text sentences and segments contained in the image can be acquired as the text information. For example, when the picture 11 shown in Fig. 1A is acquired as the first image data, all the sentence fragments contained in the image, namely the text information "TOP 1" 111, "high conversion" 112, and "S-level promotion" 113, may be acquired through image processing.
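To make the step concrete, the following is a minimal sketch of such a text extraction process in Python. The patent does not name any particular tool; pytesseract (a wrapper around the open-source Tesseract OCR engine) is used here purely as one possible implementation, and the function name extract_text_info is our own.

```python
# A minimal sketch of step S202, assuming pytesseract/Tesseract as the OCR
# backend; the patent itself does not prescribe a library.
from PIL import Image
import pytesseract

def extract_text_info(image_path: str) -> list[dict]:
    """Return the text items found in an image together with their boxes."""
    img = Image.open(image_path)
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    items = []
    for i, word in enumerate(data["text"]):
        if word.strip():  # skip empty OCR cells
            items.append({
                "text": word,
                "box": (data["left"][i], data["top"][i],
                        data["width"][i], data["height"][i]),
            })
    return items
```

For the picture 11 above, such a call would ideally return the three text items "TOP 1", "high conversion", and "S-level promotion" together with their positions, which the later steps use for matching and highlighting.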
S203, at least one first text segment is obtained.
In the present embodiment, after the first image data is acquired, at least one first text segment 12 may further be acquired. The at least one first text segment may be obtained from the first image data, or various text segments may be obtained from a text data source as candidate first text segments for subsequent processing. The first text segment 12 may contain a plurality of first sub-text segments; for example, it may be a text segment containing a plurality of text fields or a plurality of text sentences. The at least one first text segment acquired in step S203 may be a text segment related to the content of the first image data, a text segment containing text related to the content of the first image data, or any text segment.
In particular, the embodiment of the present application places no restriction on the order of acquiring the first image data and acquiring the first text segment. For example, step S201 may be performed first to acquire the first image data and then step S203 to acquire the first text segment; step S203 may be performed first and then step S201; or steps S201 and S203 may be performed at the same time.
S204, determining the matching data of the text information of the first image data and the first text segment.
In the embodiment of the application, after the text information 111 to 113 and the first text segment are obtained, the matching data between the text information 111 to 113 of the first image data and the first text segment may be determined. For example, the plurality of sub-text segments included in the first text segment may be compared with the text information 111 to 113 one by one to determine their correspondence. When a plurality of first text segments are obtained, each first text segment may be compared with the text information in turn to determine the matching data. For example, as shown in Fig. 1A, it may be determined through step S204 that the text information "TOP 1" 111, "high conversion" 112, and "S-level promotion" 113 correspond respectively to the sub-text segments "first position distinguished strength" 121, "intelligent preferred placement hosting" 122, and "infinite subject difference marketing" 123 in the text segment 12, and these correspondences constitute the matching data. There may also be a plurality of first text segments: for example, besides the first text segment 12 containing the sub-text segments 121 to 123 above, there may be another first text segment 12 containing sub-text segments such as "first 30 days of search results" and "preferred conversion improved by first-place S-level promotion, treasure of the town, and the like". In this case, it may be determined in step S204 that the text information "TOP 1" 111, "high conversion" 112, and "S-level promotion" 113 correspond respectively to the sub-text segments 121 to 123 in the one text segment 12 and also correspond respectively to the sub-text segments in the other text segment 12. Such a two-layer correspondence may then constitute the matching data between the text information of the first image data and the plurality of first text segments.
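As an illustration of how such matching data might be computed, the sketch below pairs each extracted text item with its most similar sub-text segment using fuzzy string similarity from the Python standard library. This is only one simple matching criterion chosen for the example; the patent does not specify the comparison method, and a practical system might use semantic rather than literal similarity.

```python
# A minimal sketch of step S204: one possible way to build matching data.
# Fuzzy string similarity (stdlib difflib) is an illustrative criterion,
# not the patent's prescribed method.
from difflib import SequenceMatcher

def match_segments(text_items: list[str],
                   sub_segments: list[str]) -> dict[str, str]:
    """Map each text item extracted from the image to its best sub-text segment."""
    matches = {}
    for item in text_items:
        best = max(sub_segments,
                   key=lambda seg: SequenceMatcher(None, item, seg).ratio())
        matches[item] = best
    return matches
```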
S205, the first image data is processed according to the matching data to generate a plurality of second image data.
In the embodiment of the present application, after the matching data between the text information of the first image data and the first text segment is determined in step S204, the first image data may be processed based on the matching data. For example, the processing may include embedding each corresponding sub-text segment in the first image data and, for each embedded sub-text segment, giving the text information corresponding to it a specific presentation effect. For example, the text information may be highlighted, its font may be bolded, or a graphic mark may be added to it. Thus, when the user views the plurality of second image data formed in this way, the object described by the currently displayed sub-text segment can be identified from the text information carrying the presentation effect.
For example, referring to Fig. 1A, after it is determined in step S204 that the text information "TOP 1" 111, "high conversion" 112, and "S-level promotion" 113 correspond respectively to the sub-text segments "first position distinguished strength" 121, "intelligent preferred placement hosting" 122, and "infinite subject difference marketing" 123 in the text segment 12, forming the matching data, the picture 11 may be processed based on this matching data to form pictures 131 to 133, in each of which one of the three sub-text segments is embedded on the picture 11, and in each of which the text information corresponding to the embedded sub-text segment has a specific presentation effect. For example, in the picture 131, which embeds the sub-text segment "first position distinguished strength" 121, the corresponding text information "TOP 1" 111 may be highlighted to indicate to the viewing user that the content currently being explained is the text information "TOP 1" 111.
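The following Pillow sketch shows one way such second image data could be rendered: the matched text region is outlined as the first presentation effect, and the sub-text segment is embedded as a caption strip. The concrete effect (red outline, bottom caption, default font) is an illustrative assumption; the patent leaves the effect open (highlighting, bolding, graphic marks, and so on).

```python
# A minimal sketch of step S205, assuming Pillow; the highlight style and
# caption placement are illustrative choices, not mandated by the patent.
from PIL import Image, ImageDraw

def render_frame(base: Image.Image, box: tuple[int, int, int, int],
                 sub_segment: str) -> Image.Image:
    frame = base.copy()
    draw = ImageDraw.Draw(frame)
    left, top, width, height = box
    # First presentation effect: outline the matched text information.
    draw.rectangle([left - 4, top - 4, left + width + 4, top + height + 4],
                   outline="red", width=3)
    # Embed the corresponding sub-text segment as a caption strip.
    draw.rectangle([0, frame.height - 40, frame.width, frame.height],
                   fill="black")
    # Uses Pillow's default font for simplicity; real captions would load
    # a font that covers the target language.
    draw.text((10, frame.height - 32), sub_segment, fill="white")
    return frame
```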
S206, the first video data is generated from the plurality of second image data.
When the plurality of pictures 131 to 133, each embedding one of the sub-text segments of the text segment 12, have been generated as described above, video data can be generated from these pictures. Thus, according to the video generation method of the embodiment of the present application, the text information contained in the image data is acquired from the first image data, and the matching data between this text information and at least one text segment is determined, so that a matching relationship between the image data and the at least one text segment can be established. The image data is then processed based on this relationship to generate a plurality of second image data corresponding to the text segments, and video data in which the text information matched with a text segment has the first presentation effect is finally generated from the second image data. In this way, an image-text video in which image data and text segments correspond to each other can be generated automatically from various image-text materials, greatly improving the efficiency of image-text video generation.
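A minimal sketch of this final assembly step is given below, using imageio (with its ffmpeg backend) as one possible writer; the patent does not mandate a container format, encoder, or display duration, so the three-second hold per picture is an illustrative assumption.

```python
# A minimal sketch of step S206: concatenate the generated frames into a
# video file. imageio + ffmpeg is one possible backend, not a requirement.
import numpy as np
import imageio.v2 as imageio

def assemble_video(frames, out_path: str = "image_text_video.mp4",
                   seconds_per_frame: int = 3, fps: int = 25) -> str:
    with imageio.get_writer(out_path, fps=fps) as writer:
        for frame in frames:  # PIL images produced in step S205
            data = np.asarray(frame.convert("RGB"))
            for _ in range(seconds_per_frame * fps):  # hold each still frame
                writer.append_data(data)
    return out_path
```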
Fig. 3 is a schematic diagram illustrating another application scenario of a video generation method according to an embodiment of the present application. As shown in Fig. 3, in this scenario the picture 11 is first acquired and the text information in it is extracted, for example the text information 111 to 11n. The text segment 12 may then be retrieved, and audio data 181 to 18n corresponding to each of the sub-text segments 121 to 12n in the text segment 12 may be generated or retrieved based on the text segment 12. Thereafter, as in Fig. 1A, after the matching data between the text information 111 to 11n and the text segment 12 is determined, the picture 11 may be processed according to the matching data to generate a plurality of pictures 131 to 13n for the video. Each of these pictures may include the content of the picture 11 together with one of the sub-text segments of the text segment 12, and a specific presentation effect may be applied to the text information of the picture 11 corresponding to the included sub-text segment. For example, if the text information 111 corresponds to the sub-text segment 121, that is, the sub-text segment 121 describes and explains the text information 111, the picture 11 may be processed based on this matching data to generate a picture 131 for the video that includes the entire content of the picture 11 and the sub-text segment 121, with the text information 111 highlighted so that the user can clearly see, while viewing this portion of the video, that the sub-text segment 121 describes and explains the text information 111. By analogy, in the application scenario shown in Fig. 3, when the pictures 131 to 13n have been generated, a video may be generated from these pictures and the audio 181 to 18n corresponding to the sub-text segments 121 to 12n embedded in them. In this scenario, therefore, the generated video includes not only the sub-text segment corresponding to the text information currently being explained but also the audio data corresponding to that sub-text segment, so that the user can listen to the spoken explanation while reading the text that explains the text information.
A video generation method according to an embodiment of the present application is described below with reference to Fig. 4, which is another schematic flow chart of a video generation method according to an embodiment of the present application. As shown in Fig. 4, a video generation method according to another embodiment of the present application includes the following steps:
S401, acquiring first data containing images and text.
S402, performing image extraction processing on the first data to extract at least one image from the first data as the first image data.
S403, performing text extraction processing on the first data to extract at least one text segment from the first data as the first text segment.
In the embodiment of the present application, material used for generating a video can be acquired as the first image data from various data sources. For example, pictures containing text may be acquired from various web pages as the first image data, or a user may directly input pictures containing text on a user interface as the first image data. Alternatively, various pictures may be obtained from various data sources and then preprocessed so that the pictures containing text are identified as the first image data; or the user may input arbitrary material through the user interface, and the pictures input by the user are processed to identify those containing text as the first image data.
In practical applications, there is also a wide variety of mixed data containing both images and the corresponding text, for example news reports on a certain event. In such mixed data, a picture and a text passage explaining the picture already exist together. Therefore, as in the embodiment of the present application, after such mixed data is acquired, the first image data and the first text segment may be obtained by performing image extraction processing and text extraction processing on the mixed data, respectively.
In particular, the embodiment of the present application places no restriction on the order of acquiring the first image data and acquiring the first text segment. For example, the first image data may be acquired in step S402 and then the first text segment in step S403, the first text segment may be acquired in step S403 and then the first image data in step S402, or steps S402 and S403 may be performed at the same time.
S404, performing a text analysis process on the first image data to obtain text information included in the first image data.
In the embodiment of the present application, after the first image data is acquired, it may be analyzed using various text analysis processes, for example a text extraction technique, so that the sentence fragments contained in the image can be acquired as the text information. For example, when the picture 11 shown in Fig. 1A is acquired as the first image data, all the sentence fragments contained in the image, namely the text information "TOP 1" 111, "high conversion" 112, and "S-level promotion" 113, may be acquired through image processing.
In the embodiment of the present application, the text analysis process may use any technique in the art to obtain the text information contained in the first image data. For example, blank regions in the first image data may be identified, text regions surrounded by the blank regions may be determined based on the identified blank regions, and text extraction processing may then be applied to the text regions to generate the text information.
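The blank-region idea can be illustrated with a simple projection-profile heuristic: rows whose pixels are almost entirely background are treated as blank, and the runs of non-blank rows between them become candidate text regions. This sketch assumes dark text on a light background and is only one of many ways to realize the step; the patent fixes no particular algorithm.

```python
# A minimal sketch of blank-region text localization, assuming dark text
# on a light background; thresholds are illustrative.
import numpy as np
from PIL import Image

def find_text_rows(image_path: str, ink_threshold: int = 128,
                   blank_ratio: float = 0.99) -> list[tuple[int, int]]:
    gray = np.asarray(Image.open(image_path).convert("L"))
    # A row is "blank" if nearly all of its pixels are lighter than the ink.
    is_blank = (gray > ink_threshold).mean(axis=1) >= blank_ratio
    regions, start = [], None
    for y, blank in enumerate(is_blank):
        if not blank and start is None:
            start = y                      # a text band begins
        elif blank and start is not None:
            regions.append((start, y))     # band bounded by blank rows
            start = None
    if start is not None:
        regions.append((start, len(is_blank)))
    return regions  # (top, bottom) row ranges to pass to text extraction
```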
In the present embodiment, after the first image data is acquired, at least one first text segment 12 may further be acquired. The at least one first text segment may be obtained from the first image data, or various text segments may be obtained from a text data source as candidate first text segments for subsequent processing. The first text segment 12 may contain a plurality of first sub-text segments; for example, it may be a text segment containing a plurality of text fields or a plurality of text sentences. The acquired at least one first text segment may be a text segment related to the content of the first image data, a text segment containing text related to the content of the first image data, or any text segment.
Further, as shown in the scenario of Fig. 1B, in the embodiment of the present application at least one keyword may also be generated from the text information contained in the first image data, and the first text segment may be acquired based on the generated keyword. For example, a search may be performed on the internet using a search algorithm with the generated keyword to acquire a text segment related to the text information contained in the first image data. As another example, a text generation algorithm may be used to generate, directly from the generated keyword, a text segment matching the text information to which the keyword corresponds.
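A sketch of this keyword-driven retrieval is shown below. The keyword selection here (keeping the longest distinct items) is a deliberately naive stand-in, and search_text_segments is a hypothetical placeholder for whatever search engine or text-generation service is actually used; the patent names no concrete service.

```python
# A minimal, hypothetical sketch of the Fig. 1B variant. The helper
# search_text_segments is a placeholder for an external search engine or
# text-generation service and is not defined here.
def build_query(text_items: list[str], max_keywords: int = 5) -> str:
    # Naive keyword selection: keep the longest distinct text items.
    keywords = sorted(set(text_items), key=len, reverse=True)[:max_keywords]
    return " ".join(keywords)

def fetch_candidate_segments(text_items: list[str]) -> list[str]:
    query = build_query(text_items)
    return search_text_segments(query)  # hypothetical search-engine call
```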
S405, determining matching data of the text information of the first image data and the first text segment.
In the embodiment of the application, after the text information 111 to 113 and the first text segment are obtained, the matching data between the text information 111 to 113 of the first image data and the first text segment may be determined. For example, the plurality of sub-text segments included in the first text segment may be compared with the text information 111 to 113 one by one to determine their correspondence. When a plurality of first text segments are obtained, each first text segment may be compared with the text information in turn to determine the matching data. For example, as shown in Fig. 3, it may be determined through step S405 that the text information "TOP 1" 111, "high conversion" 112, and "S-level promotion" 113 correspond respectively to the sub-text segments "first position distinguished strength" 121, "intelligent preferred placement hosting" 122, and "infinite subject difference marketing" 123 in the text segment 12, and these correspondences constitute the matching data. There may also be a plurality of first text segments, for example the first text segment 12 containing the sub-text segments 121 to 123 above and another first text segment 12 containing sub-text segments such as "first 30 days of search results" and "preferred conversion improved by first-place S-level promotion, treasure of the town, and the like". In this case, the text information "TOP 1" 111, "high conversion" 112, and "S-level promotion" 113 may be determined to correspond respectively to the sub-text segments in each of the two text segments 12, and such a two-layer correspondence may constitute the matching data between the text information of the first image data and the plurality of first text segments.
S406, the first image data is processed according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect.
In the embodiment of the present application, after the matching data between the text information of the first image data and the first text segment is determined in step S405, the first image data may be processed based on the matching data. For example, the processing may include embedding each corresponding sub-text segment in the first image data and, for each embedded sub-text segment, giving the text information corresponding to it a specific presentation effect. For example, the text information may be highlighted, its font may be bolded, or a graphic mark may be added to it. Thus, when the user views the plurality of second image data formed in this way, the object described by the currently displayed sub-text segment can be identified from the text information carrying the presentation effect.
For example, referring to Fig. 3, after it is determined in step S405 that the text information "TOP 1" 111, "high conversion" 112, and "S-level promotion" 113 correspond respectively to the sub-text segments "first position distinguished strength" 121, "intelligent preferred placement hosting" 122, and "infinite subject difference marketing" 123 in the text segment 12, forming the matching data, the picture 11 may be processed based on this matching data to form pictures 131 to 133, in each of which one of the three sub-text segments is embedded on the picture 11, and in each of which the text information corresponding to the embedded sub-text segment has a specific presentation effect. For example, in the picture 131, which embeds the sub-text segment "first position distinguished strength" 121, the corresponding text information "TOP 1" 111 may be highlighted to indicate to the viewing user that the content currently being explained is the text information "TOP 1" 111.
S407, generating a first audio segment according to the first text segment, wherein the first audio segment includes audio data corresponding to the first text segment.
According to the embodiment of the application, the audio data 181 to 18n corresponding to each sub-text segment 121 to 12n in the text segment 12 can further be generated or obtained based on the text segment 12. In the present embodiment, the audio data may be generated from the text segment 12 in various ways. For example, TTS (Text-To-Speech) technology may be employed to convert the text segment 12 into the audio data 181 to 18n, or other algorithms may be employed to generate content-related audio data based on the content of the text segment 12.
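As one way to realize step S407, the sketch below uses pyttsx3, an offline TTS library for Python; the patent only requires "TTS technology", so any speech-synthesis service could be substituted.

```python
# A minimal sketch of step S407, assuming pyttsx3 as the TTS engine.
import pyttsx3

def synthesize_segments(sub_segments: list[str]) -> list[str]:
    """Render each sub-text segment 12i to an audio clip 18i (a WAV file)."""
    engine = pyttsx3.init()
    paths = []
    for i, segment in enumerate(sub_segments):
        path = f"segment_{i}.wav"
        engine.save_to_file(segment, path)
        paths.append(path)
    engine.runAndWait()  # runs all queued synthesis jobs
    return paths
```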
S408, generating the first video data according to the second image data and the first audio segment.
As shown in Fig. 3, when the pictures 131 to 13n have been generated, a video may further be generated from these pictures and the audio 181 to 18n corresponding to the sub-text segments 121 to 12n embedded in them. In the application scenario shown in Fig. 3, therefore, the generated video includes not only the sub-text segment corresponding to the text information currently being explained but also the audio data corresponding to that sub-text segment, so that the user can listen to the spoken explanation while reading the text that explains the text information.
In addition, in the embodiment of the present application, the audio data generated from the text segment may further include time information corresponding to the text. For example, the audio information generated from the sub-text segment "first position distinguished strength" 121 may further include the playing time required for the audio corresponding to the characters contained in that sub-text segment, which can be determined from a preset speech rate and the length of the text. Therefore, in the video generation method according to the present application, the audio data may be synchronized, according to this time information, with the second image data generated from the first image data. For example, the second image data 131 generated by embedding the sub-text segment "first position distinguished strength" 121 may be synchronized with the audio data corresponding to that sub-text segment, and video data may then be generated from the synchronized second image data and audio data, ensuring that each second image data stays on screen long enough for its corresponding audio data to finish playing.
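The timing rule described above can be sketched as follows: each second image is held on screen for a duration derived from the length of its sub-text segment and a preset speech rate, so the picture never disappears before its audio finishes. The rate of four characters per second is an illustrative assumption, not a value from the patent.

```python
# A minimal sketch of the audio/image synchronization rule; the speech
# rate is an assumed example value.
def display_seconds(sub_segment: str, chars_per_second: float = 4.0) -> float:
    """Playing time implied by the text length and a preset speech rate."""
    return len(sub_segment) / chars_per_second

def frame_hold_counts(sub_segments: list[str], fps: int = 25) -> list[int]:
    """Number of video frames to hold each second image so its audio can finish."""
    return [max(1, round(display_seconds(s) * fps)) for s in sub_segments]
```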
Further, in the embodiment of the present application, before the video data is generated, the user may also specify the playing time of the audio clip corresponding to each second image data. After the user confirms the playing time, the second image data is synchronized with the audio clip according to the specified playing time, and the video data is generated from the second image data and audio clip thus synchronized.
Therefore, according to the video generation method of the embodiment of the present application, the text information contained in the image data is acquired from the first image data, and the matching data between this text information and at least one text segment is determined, so that a matching relationship between the image data and the at least one text segment can be established. The image data is then processed based on this relationship to generate a plurality of second image data corresponding to the text segments. Finally, based on the second image data and the audio data corresponding to the matched text segments, video data is generated in which the text information matched with a text segment has the first presentation effect and is accompanied by a synchronized audio explanation. In this way, an image-text video in which the image data, text segments, and audio correspond to one another can be generated automatically from various image-text materials, greatly improving the efficiency of image-text video generation.
Fig. 5 is a schematic structural diagram of a video generation apparatus according to an embodiment of the present application. Referring to fig. 5, the video generation apparatus of the present application may include: an acquisition component 501, an analysis component 502, a matching component 503, an image data generation component 504, and a video data generation component 505.
In particular, the acquisition component 501 is configured to acquire at least one first image data and at least one first text segment.
In the embodiment of the present application, the acquisition component 501 may acquire material for generating a video as the first image data from various data sources. For example, pictures containing text may be acquired from various web pages as the first image data, or pictures containing text may be input directly by the user on a user interface. Alternatively, various pictures may be obtained from various data sources and preprocessed to identify those containing text as the first image data, or the user may input arbitrary material through the user interface, which is then processed to identify the pictures containing text as the first image data; a sketch of this filtering follows. Furthermore, the acquisition component 501 may acquire at least one first text segment from the first image data, or may acquire various text segments from a text data source as candidate first text segments for subsequent processing. The first text segment 12 may contain a plurality of first sub-text segments; for example, it may contain a plurality of text fields or text sentences. In this embodiment, the at least one first text segment acquired by the acquisition component 501 may be a text segment related to the content of the first image data, a text segment containing text related to that content, or any text segment.
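As a non-limiting sketch of the preprocessing mentioned above, the following Python code keeps only the pictures that actually contain text as candidate first image data. The use of Pillow and pytesseract is an assumption of this sketch; the present application does not prescribe a particular OCR library.

```python
# Sketch only: filter acquired pictures, keeping those that contain
# recognizable text as the first image data.
from PIL import Image
import pytesseract

def select_first_image_data(picture_paths):
    """Return the subset of pictures that contain text."""
    selected = []
    for path in picture_paths:
        text = pytesseract.image_to_string(Image.open(path)).strip()
        if text:  # the picture contains recognizable text
            selected.append(path)
    return selected
```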
Furthermore, the acquisition component 501 may also acquire first data containing both an image and text, and then obtain the first image data and the first text segment by performing image extraction and text extraction, respectively, on the first data.
In practical applications, there are also various kinds of mixed data containing both images and the corresponding text, for example a news story about a certain event. In such mixed data, a picture and a text passage explaining that picture already coexist. Therefore, as shown in the embodiment of the present application, after such mixed data is acquired, the first image data and the first text segment may be obtained by performing image extraction processing and text extraction processing on the mixed data, respectively.
The analysis component 502 is configured to perform text analysis processing on the first image data to obtain the text information contained in the first image data. In the embodiment of the present application, after the first image data is acquired, the analysis component 502 may analyze the first image data acquired by the acquisition component 501 using various text analysis techniques, such as text extraction, so as to obtain the text sentences and paragraphs contained in the image as the text information. For example, when the acquisition component 501 acquires the picture 11 shown in fig. 1 as the first image data, the analysis component 502 may obtain, by image processing, all the text paragraphs contained in the image as the text information "TOP 1" 111, "high conversion" 112, and "S-level promotion" 113 (see the sketch below).
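A non-limiting sketch of this analysis step is given below: it extracts the text paragraphs contained in the image together with their bounding boxes, so that a presentation effect can later be applied at the correct position. Grouping OCR words by block number is an illustrative choice of this sketch, not a requirement of the present application.

```python
# Sketch only: obtain the text information contained in the first
# image data, one entry per text paragraph, with a bounding box.
from collections import defaultdict

from PIL import Image
import pytesseract

def extract_text_information(path):
    data = pytesseract.image_to_data(
        Image.open(path), output_type=pytesseract.Output.DICT
    )
    blocks = defaultdict(list)
    for i, word in enumerate(data["text"]):
        if word.strip():
            box = (data["left"][i], data["top"][i],
                   data["left"][i] + data["width"][i],
                   data["top"][i] + data["height"][i])
            blocks[data["block_num"][i]].append((word, box))
    result = []
    for words in blocks.values():
        text = " ".join(w for w, _ in words)
        boxes = [b for _, b in words]
        merged = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                  max(b[2] for b in boxes), max(b[3] for b in boxes))
        result.append((text, merged))  # e.g. ("TOP 1", (40, 32, 180, 70))
    return result
```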
In the embodiment of the present application, after the acquisition component 501 acquires the first image data, it may further acquire at least one first text segment 12. The acquisition component 501 may obtain the first text segment according to the first image data, or may obtain various text segments from a text data source as candidate first text segments for subsequent processing. As noted above, the first text segment 12 may contain a plurality of first sub-text segments, such as a plurality of text fields or text sentences, and may be a text segment related to the content of the first image data, a text segment containing text related to that content, or any text segment.
The matching component 503 is configured to determine matching data between the text information of the first image data and the first text segment. In the embodiment of the present application, after the analysis component 502 obtains the text information 111-113 and the acquisition component 501 obtains the first text segment, the matching component 503 may determine matching data between the text information 111-113 in the first image data and the first text segment.
For example, the matching component 503 may compare the plurality of sub-text segments contained in the first text segment with the text information 111-113 to determine the matching data, as sketched below. In this embodiment of the application, when the acquisition component 501 obtains a plurality of first text segments, the matching component 503 may compare each first text segment with the text information one by one to determine the matching data. For example, as shown in fig. 1, the matching component 503 may determine that the text information "TOP 1" 111, "high conversion" 112, and "S-level promotion" 113 correspond respectively to the sub-text segments "first place manifests strength" 121, "intelligent preferred placement hosting" 122, and "infinite topic difference marketing" 123 in the text segment 12, and these correspondences constitute the matching data. In the embodiment of the present application, there may be a plurality of first text segments: in addition to the first text segment 12 containing the above sub-text segments 121-123, there may be another first text segment containing sub-text segments such as "scene of the first 30 days of search results", "improving preferred conversion", and "first-class S-level promotion, treasure of the shop, and the like". In this case, the matching component 503 may determine that the text information 111, 112, and 113 correspond respectively to the sub-text segments 121, 122, and 123 in the text segment 12, and also correspond respectively to the sub-text segments in the other first text segment. These two sets of correspondences together constitute the matching data between the text information of the first image data and the plurality of first text segments.
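A minimal sketch of such a comparison follows; scoring each pair by string similarity with the standard library's difflib, and the 0.3 threshold, are assumptions of the sketch rather than details of the present application.

```python
# Sketch only: determine matching data by comparing each piece of
# text information with every sub-text segment and keeping the most
# similar one above a threshold.
from difflib import SequenceMatcher

def determine_matching_data(text_infos, sub_text_segments, threshold=0.3):
    matching_data = {}
    for info in text_infos:
        scored = [
            (SequenceMatcher(None, info, seg).ratio(), seg)
            for seg in sub_text_segments
        ]
        score, best = max(scored)
        if score >= threshold:
            matching_data[info] = best  # e.g. "TOP 1" -> segment 121
    return matching_data
```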
The image data generation component 504 is configured to process the first image data according to the matching data to generate a plurality of second image data, in which the text information matched with the first text segment has a first presentation effect. After the matching component 503 determines the matching data, the image data generation component 504 can process the first image data based on it. For example, the processing may include embedding each matched sub-text segment in the first image data and giving the text information corresponding to the embedded sub-text segment a specific presentation effect: the text information may be highlighted, its font may be bolded, or a graphic mark may be added to it, as sketched below. Thus, when viewing the plurality of second image data formed in this way, the user can tell from the text information having the presentation effect which object the currently displayed sub-text segment describes.
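The following Python sketch illustrates one possible form of this processing, using Pillow: it frames the matched text information (the first presentation effect) and embeds the corresponding sub-text segment into the picture. The font path and colors are illustrative assumptions.

```python
# Sketch only: generate one second image datum by highlighting the
# matched text information and embedding the sub-text segment.
from PIL import Image, ImageDraw, ImageFont

def generate_second_image(src_path, dst_path, sub_text_segment, box):
    image = Image.open(src_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    left, top, right, bottom = box  # position of the text information
    # first presentation effect: a red frame around the matched text
    draw.rectangle([left, top, right, bottom],
                   outline=(255, 0, 0), width=3)
    # embed the corresponding sub-text segment below the highlight
    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", 28)  # assumed font
    draw.text((left, bottom + 10), sub_text_segment,
              fill=(255, 0, 0), font=font)
    image.save(dst_path)
```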
The video data generation component 505 is configured to generate first video data from the plurality of second image data. When the image data generation component 504 generates the plurality of pictures 131 to 133 in which the sub-text segments of the text segment 12 are embedded as described above, the video data generation component 505 may generate video data from these pictures, for example as sketched below.
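As a non-limiting sketch, the following Python code assembles the pictures and their audio clips into a single video using moviepy (v1 API), which is an assumed third-party dependency; the present application does not name a specific toolkit.

```python
# Sketch only: build the first video data from (picture, audio) pairs,
# keeping each picture on screen until its audio finishes playing.
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

def generate_video(pairs, out_path="output.mp4"):
    """pairs: list of (second_image_path, audio_clip_path)."""
    clips = []
    for image_path, audio_path in pairs:
        audio = AudioFileClip(audio_path)
        clips.append(ImageClip(image_path)
                     .set_duration(audio.duration)
                     .set_audio(audio))
    concatenate_videoclips(clips, method="compose").write_videofile(
        out_path, fps=24)
```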
Therefore, according to the video generation apparatus of the embodiment of the present application, text information contained in the image data is obtained from the first image data, and matching data between this text information and at least one text segment is determined, so that a matching relationship between the image data and the at least one text segment can be established. The image data can then be processed based on this matching relationship to generate a plurality of second image data corresponding to the text segment, and video data in which the text information matched with the text segment has the first presentation effect is finally generated based on these second image data. In this way, an image-text video in which image data and text segments correspond to each other can be generated automatically from various image-text materials, thereby greatly improving the efficiency of generating such videos.
Fig. 6 is a schematic structural diagram of a video generation apparatus according to an embodiment of the present application. Referring to fig. 6, the video generation apparatus of the present application may include: an image data acquisition component 601, an analysis component 602, a text segment acquisition component 603, a matching component 604, an image data generation component 605, and a video data generation component 606. The image data acquisition component 601 may acquire material for generating a video as the first image data from various data sources. For example, pictures containing text may be acquired from various web pages as the first image data, or pictures containing text may be input directly by the user on a user interface. Alternatively, various pictures may be obtained from various data sources and preprocessed to identify those containing text as the first image data, or the user may input arbitrary material through the user interface, which is then processed to identify the pictures containing text as the first image data.
The analysis component 602 is configured to perform text analysis processing on the first image data to obtain the text information contained in the first image data. In the embodiment of the present application, after the image data acquisition component 601 acquires the first image data, the analysis component 602 may analyze it using various text analysis techniques, such as text extraction, so as to obtain the text sentences and paragraphs contained in the image as the text information.
The text segment acquisition component 603 is configured to input the text information into a search engine to acquire at least one first text segment. In the embodiment of the present application, the text segment acquisition component 603 may generate at least one keyword from the text information contained in the first image data acquired by the image data acquisition component 601, and acquire the first text segment based on the generated keyword. For example, a search may be performed on the internet using a search algorithm based on the generated keyword, so as to acquire a text segment related to the text information contained in the first image data, as sketched below. As another example, a text generation algorithm may be used to generate, directly from the keyword, a text segment matching the text information that corresponds to the keyword.
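A sketch of this acquisition step is given below. The keyword heuristic, the endpoint URL, and the response field are hypothetical placeholders for whatever search service is actually used; none of them is specified by the present application.

```python
# Sketch only: generate keywords from the text information and query
# a search service for candidate first text segments.
import requests

def acquire_first_text_segments(text_infos, max_keywords=3):
    # naive keyword generation: take the longest text paragraphs
    keywords = sorted(text_infos, key=len, reverse=True)[:max_keywords]
    resp = requests.get(
        "https://search.example.com/api",      # hypothetical endpoint
        params={"q": " ".join(keywords)},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("text_segments", [])  # hypothetical field
```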
The matching component 604 is configured to determine matching data between the text information of the first image data and the first text segment. In the embodiment of the present application, after the image data acquisition component 601 acquires the first image data and the text segment acquisition component 603 acquires the first text segment, the matching component 604 may determine matching data between the text information in the first image data and the first text segment.
The image data generation component 605 is configured to process the first image data according to the matching data to generate a plurality of second image data, in which the text information matched with the first text segment has a first presentation effect.
After the matching component 604 determines the matching data between the text information of the first image data and the first text segment, the image data generation component 605 may process the first image data based on that matching data. For example, the processing may include embedding each matched sub-text segment in the first image data and giving the text information corresponding to the embedded sub-text segment a specific presentation effect: the text information may be highlighted, its font may be bolded, or a graphic mark may be added to it. Thus, when viewing the plurality of second image data formed in this way, the user can tell from the text information having the presentation effect which object the currently displayed sub-text segment describes.
The video data generation component 606 is configured to generate first video data from the plurality of second image data.
Therefore, according to the video generation apparatus of the embodiment of the present application, the first text segment can be acquired from the internet using a search algorithm based on the text information in the first image data. Text information contained in the image data is obtained from the first image data, and matching data between this text information and at least one text segment is determined, so that a matching relationship between the image data and the at least one text segment can be established. The image data can then be processed based on this matching relationship to generate a plurality of second image data corresponding to the text segment, and video data in which the text information matched with the text segment has the first presentation effect is finally generated based on these second image data. In this way, an image-text video in which image data and text segments correspond to each other can be generated automatically from various image-text materials, thereby greatly improving the efficiency of generating such videos.
Fig. 7 is a schematic structural diagram of a video generation apparatus according to an embodiment of the present application. Referring to fig. 7, the video generation apparatus of the present application may include: a receiving component 701, an analysis component 702, a matching component 703, a presentation component 704, an image data generation component 705, and a video data generation component 706. The receiving component 701 is configured to receive at least one first image data and at least one first text segment input by a user. In the embodiment of the present application, the user may input various image data and text segments as the first image data and the first text segment, for example on a user interface. The user may also input, through the receiving component, the first image data together with text written for that image data as the first text segment.
The analysis component 702 is configured to perform text analysis processing on the first image data to obtain the text information contained in the first image data. After the user inputs the first image data through the receiving component 701, the analysis component 702 may analyze it using various text analysis techniques, such as text extraction, so as to obtain the text sentences contained in the image as the text information.
The matching component 703 is configured to determine matching data between the text information of the first image data and the first text segment. In this embodiment, after the receiving component 701 receives the first image data and the first text segment input by the user, the matching component 703 may determine the matching data between the text information in the first image data and the first text segment.
The presentation component 704 is configured to present the matching data to the user.
In an embodiment of the present application, as shown in fig. 1C, after the receiving component receives the first image data and the first text segment input by the user, and especially after the user inputs a plurality of image data and various text segments, the user often wants to see the matching result between the input image data and text segments in order to decide on the desired presentation effect. Therefore, in the embodiment of the present application, the result of the matching processing performed by the matching component 703 on the image data and text segments input by the user can be presented to the user through the presentation component 704. The receiving component 701 may thus be further configured to receive indication data for the matching data input by the user, wherein the indication data includes at least instruction information for a first presentation effect of the matching data. For example, as shown in figs. 1 and 3, when the matching component 703 determines that the text information "TOP 1" 111, "high conversion" 112, and "S-level promotion" 113 correspond respectively to the sub-text segments "first place manifests strength" 121, "intelligent preferred placement hosting" 122, and "infinite topic difference marketing" 123 in the text segment 12, the presentation component 704 may present this result to the user, so that the user may decide to apply a bolding effect to the text information 111, an underlining effect to the text information 112, and a bubble graphic to the text information 113, thereby allowing the presentation effect to be customized by the user.
The image data generation component 705 is configured to process the first image data according to the indication data to generate a plurality of second image data, in which the text information matched with the first text segment has the first presentation effect. In the embodiment of the present application, after the user inputs instructions for the presentation effects of the text information through the receiving component 701, the image data generation component 705 may process the image data according to those instructions to generate a plurality of second image data meeting the user's requirements; a minimal sketch follows. When the user specifies a presentation effect for only part of the text information, the image data generation component 705 may automatically assign presentation effects to the remaining text information, or output the automatically assigned effects to the user and receive further user instructions through the receiving component 701.
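A minimal sketch of processing according to such indication data follows; the effect names and the dispatch table are illustrative assumptions of the sketch.

```python
# Sketch only: apply the user-chosen presentation effect to each
# matched piece of text information.
from PIL import Image, ImageDraw

EFFECTS = {
    "frame":     lambda d, b: d.rectangle(b, outline=(255, 0, 0), width=4),
    "underline": lambda d, b: d.line([(b[0], b[3]), (b[2], b[3])],
                                     fill=(255, 0, 0), width=3),
}

def apply_indication_data(image_path, indication_data, boxes, out_path):
    """indication_data maps text information to an effect name,
    e.g. {"TOP 1": "frame", "high conversion": "underline"}."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for text_info, effect in indication_data.items():
        EFFECTS[effect](draw, boxes[text_info])  # boxes: (l, t, r, b)
    image.save(out_path)
```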
The video data generation component 706 is configured to generate first video data from the plurality of second image data.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
The electronic device of the present disclosure may be a mobile terminal device, or a computing device with limited or no mobility. The electronic device has at least a processing unit and a memory; the memory stores instructions, and the processing unit obtains the instructions from the memory and executes them to cause the electronic device to perform actions.
The above describes a video generation method and a video generation apparatus, which may be implemented as an electronic device. As shown in fig. 8, the electronic device includes a memory 801 and a processor 802.
The memory 801 stores programs. In addition to the programs described above, the memory 801 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 801 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The processor 802 is not limited to a Central Processing Unit (CPU), and may also be a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), an embedded neural network processor (NPU), an Artificial Intelligence (AI) chip, or the like. The processor 802, coupled to the memory 801, executes the programs stored in the memory 801 to:
acquiring at least one first image data;
performing text analysis processing on the first image data to acquire text information contained in the first image data;
acquiring at least one first text segment;
determining matching data of the text information of the first image data and the first text segment;
processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
generating first video data from the plurality of second image data.
Further, as shown in fig. 8, the electronic device may further include: communication component 803, power component 804, audio component 805, display 806, and other components. Only some of the components are schematically shown in the figure and it is not meant that the electronic device comprises only the components shown in the figure.
The communication component 803 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 4G or 5G, or a combination thereof. In an exemplary embodiment, the communication component 803 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 803 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply component 804 provides power to the various components of the electronic device. The power components 804 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for an electronic device.
The audio component 805 is configured to output and/or input audio signals. For example, the audio component 805 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 801 or transmitted via the communication component 803. In some embodiments, the audio component 805 also includes a speaker for outputting audio signals.
The display 806 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (17)

1. A method of video generation, comprising:
acquiring at least one first image data;
performing text analysis processing on the first image data to acquire text information contained in the first image data;
acquiring at least one first text segment;
determining matching data of the text information of the first image data and the first text segment;
processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
generating first video data from the plurality of second image data.
2. The video generation method according to claim 1, characterized in that the video generation method further comprises:
generating a first audio clip from the first text segment, wherein the first audio clip includes audio data corresponding to the first text segment, and
the generating first video data from the second image data further comprises:
generating the first video data from the second image data and the first audio clip.
3. The video generation method of claim 2, wherein the audio clip further comprises time information corresponding to the text contained in the first text segment, the method further comprising: synchronizing the audio data with the second image data according to the time information.
4. The video generation method of claim 1, wherein prior to said acquiring at least one first image data, the video generation method further comprises:
acquiring first data comprising an image and text,
the acquiring at least one first image data comprises: performing image extraction processing on the first data to extract at least one image from the first data as the first image data, and
the acquiring at least one first text segment comprises: performing text extraction processing on the first data to extract at least one text segment from the first data as the first text segment.
5. The video generation method of claim 1, wherein said acquiring at least one first text segment comprises:
generating at least one first keyword according to the text information;
obtaining the first text segment from at least one data source using the first keyword.
6. The video generation method of claim 1, wherein said acquiring at least one first text segment comprises:
generating at least one first keyword according to the text information;
generating the first text segment using at least one preset text generation algorithm based on the first keyword.
7. The video generation method according to claim 1, wherein the performing text analysis processing on the first image data to obtain text information included in the first image data includes:
identifying at least one blank region in the first image data;
determining at least one text region in the first image data based on the identified at least one blank region; and
performing text extraction processing on the at least one text region to obtain the text information.
8. The video generation method of claim 1, wherein the method further comprises: determining a video template according to a result of the text analysis processing, and
The generating first video data from the second image data comprises:
generating the first video data from the second image data based on the determined video template.
9. The video generation method of claim 1, wherein the first presentation effect comprises a color of the text in the text information.
10. The video generation method of claim 1, wherein the processing the first image data according to the matching data to generate a plurality of second image data comprises:
acquiring at least one graphic data according to the text information;
determining the associated position of the text information in the second image data according to the matching data;
generating at least one of the plurality of second image data based on the matching data, the at least one graphic data, and the associated position, wherein each of the at least one second image data contains at least one of the at least one graphic data located at the associated position.
11. A method of video generation, comprising:
acquiring at least one first image data;
performing text analysis processing on the first image data to acquire text information contained in the first image data;
inputting the text information into a search engine to obtain at least one first text segment;
determining matching data of the text information of the first image data and the first text segment;
processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
generating first video data from the second image data.
12. A method of video generation, comprising:
receiving at least one first image data and at least one first text segment input by a user;
performing text analysis processing on the first image data to acquire text information contained in the first image data;
determining matching data of the text information of the first image data and the first text segment;
presenting the matching data to the user;
receiving indication data input by a user for the matching data, wherein the indication data at least comprises instruction information of a first presentation effect for the matching data;
processing the first image data according to the indication data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has the first presentation effect; and
generating first video data from the second image data.
13. A video generation apparatus, comprising:
an acquisition component for acquiring at least one first image data and at least one first text segment;
an analysis component for performing text analysis processing on the first image data to acquire text information contained in the first image data;
a matching component for determining matching data of the text information of the first image data and the first text segment;
an image data generation component for processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
a video data generation component for generating first video data from the plurality of second image data.
14. A video generation apparatus, comprising:
an image data acquisition component for acquiring at least one first image data;
an analysis component for performing text analysis processing on the first image data to acquire text information contained in the first image data;
a text segment acquisition component for inputting the text information into a search engine to acquire at least one first text segment;
a matching component for determining matching data of the text information of the first image data and the first text segment;
an image data generation component for processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
a video data generation component for generating first video data from the plurality of second image data.
15. A video generation apparatus, comprising:
a receiving component for receiving at least one first image data and at least one first text segment input by a user;
an analysis component for performing text analysis processing on the first image data to acquire text information contained in the first image data;
a matching component for determining matching data of the text information of the first image data and the first text segment;
a presentation component for presenting the matching data to the user, and
the receiving component is further configured to receive indication data for the matching data input by the user, wherein the indication data comprises at least instruction information for a first presentation effect of the matching data;
the video generation apparatus further includes: an image data generation component for processing the first image data according to the indication data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has the first presentation effect; and
a video data generation component for generating first video data from the second image data.
16. An electronic device, comprising:
a processing unit; and
a memory coupled to the processing unit and having instructions stored thereon that, when executed by the processing unit, cause the apparatus to perform acts comprising:
acquiring at least one first image data;
performing text analysis processing on the first image data to acquire text information contained in the first image data;
acquiring at least one first text segment;
determining matching data of the text information of the first image data and the first text segment;
processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
generating first video data from the plurality of second image data.
17. A computer-readable storage medium having instructions stored thereon, the instructions comprising:
acquiring at least one first image data;
performing text analysis processing on the first image data to acquire text information contained in the first image data;
acquiring at least one first text segment;
determining matching data of the text information of the first image data and the first text segment;
processing the first image data according to the matching data to generate a plurality of second image data, wherein the text information matched with the first text segment in the second image data has a first presentation effect; and
generating first video data from the plurality of second image data.
CN201911412228.0A 2019-12-31 2019-12-31 Video generation method and apparatus, electronic device, and computer-readable storage medium Active CN113132781B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911412228.0A CN113132781B (en) 2019-12-31 2019-12-31 Video generation method and apparatus, electronic device, and computer-readable storage medium
PCT/CN2020/141204 WO2021136334A1 (en) 2019-12-31 2020-12-30 Video generating method and apparatus, electronic device, and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911412228.0A CN113132781B (en) 2019-12-31 2019-12-31 Video generation method and apparatus, electronic device, and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN113132781A true CN113132781A (en) 2021-07-16
CN113132781B CN113132781B (en) 2023-04-18

Family

ID=76686533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911412228.0A Active CN113132781B (en) 2019-12-31 2019-12-31 Video generation method and apparatus, electronic device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN113132781B (en)
WO (1) WO2021136334A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222196A (en) * 2022-01-04 2022-03-22 阿里巴巴新加坡控股有限公司 Method and device for generating short video of plot commentary and electronic equipment


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5526193B2 (en) * 2011-06-20 2014-06-18 富士フイルム株式会社 Image processing apparatus, image processing method, and image processing program
CN104965921A (en) * 2015-07-10 2015-10-07 陈包容 Information matching method
CN107071554B (en) * 2017-01-16 2019-02-26 腾讯科技(深圳)有限公司 Method for recognizing semantics and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915549A (en) * 2011-08-05 2013-02-06 联想(北京)有限公司 Image file processing method and device
CN107943839A (en) * 2017-10-30 2018-04-20 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and storage medium based on picture and word generation video
CN107885430A (en) * 2017-11-07 2018-04-06 广东欧珀移动通信有限公司 A kind of audio frequency playing method, device, storage medium and electronic equipment
CN109614537A (en) * 2018-12-06 2019-04-12 北京百度网讯科技有限公司 For generating the method, apparatus, equipment and storage medium of video
CN110489674A (en) * 2019-07-02 2019-11-22 百度在线网络技术(北京)有限公司 Page processing method, device and equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304146A (en) * 2023-05-22 2023-06-23 荣耀终端有限公司 Image processing method and related device
CN116304146B (en) * 2023-05-22 2023-10-20 荣耀终端有限公司 Image processing method and related device

Also Published As

Publication number Publication date
CN113132781B (en) 2023-04-18
WO2021136334A1 (en) 2021-07-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant