WO2024091080A1 - Automatic video generation method and automatic video generation server - Google Patents

Automatic video generation method and automatic video generation server

Info

Publication number
WO2024091080A1
WO2024091080A1 PCT/KR2023/016935 KR2023016935W WO2024091080A1 WO 2024091080 A1 WO2024091080 A1 WO 2024091080A1 KR 2023016935 W KR2023016935 W KR 2023016935W WO 2024091080 A1 WO2024091080 A1 WO 2024091080A1
Authority
WO
WIPO (PCT)
Prior art keywords
token
scene
assigned
script
extracting
Prior art date
Application number
PCT/KR2023/016935
Other languages
English (en)
Korean (ko)
Inventor
권석면
김유석
Original Assignee
주식회사 일만백만
Priority date
Filing date
Publication date
Application filed by 주식회사 일만백만 filed Critical 주식회사 일만백만
Publication of WO2024091080A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/268 Morphological analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/84 Generation or processing of descriptive data, e.g. content descriptors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 Assembly of content; Generation of multimedia applications
    • H04N 21/854 Content authoring
    • H04N 21/8541 Content authoring involving branching, e.g. to different story endings
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 Assembly of content; Generation of multimedia applications
    • H04N 21/854 Content authoring
    • H04N 21/8545 Content authoring for generating interactive applications

Definitions

  • This disclosure relates to a method for automatically generating a video and an automatic video generation server. More specifically, it relates to an automatic video generation method and an automatic video generation server that can generate a script according to the user's request and then automatically generate the video by synthesizing reference scenes and environmental data according to the script.
  • the problem that this disclosure seeks to solve is to provide an automatic video generation method and an automatic video generation server that can generate a script according to the user's request and then automatically generate the video by synthesizing reference scenes and environmental data according to the script.
  • the automatic video generation method includes: dividing the collected images into scene units to generate a plurality of reference scenes, assigning different types of tags to each of the plurality of reference scenes according to characteristic information of the plurality of reference scenes, and storing them in a reference scene database; generating a script using image generation reference information received from a customer terminal; generating a scenario composed of standard scenes based on the script and then extracting keywords from the script; extracting, from among the plurality of reference scenes, a reference scene to which a tag matching the keyword is assigned; and generating an image by combining the extracted reference scene and pre-generated environmental data according to the scenario.
  • Extracting the keyword from the script includes: extracting words from the text of the script based on spaces; generating tokens by performing morphological analysis on the words, where a token includes a pair of a word and a morpheme value and is assigned a label indicating the frequency value of the word; assigning different weights to the word of each token according to the word of each token and the label of each token; and extracting keywords composed of the tokens to which the different weights are assigned.
  • the step of extracting a reference scene to which a tag matching the keyword is assigned includes: when the morpheme value of a token constituting the keyword is a noun, selecting an object attribute tag from the plurality of tags assigned to each of the plurality of reference scenes and calculating a similarity score between the object attribute tag and the token; when the morpheme value of the token constituting the keyword is an adjective, selecting a screen attribute tag and a situation attribute tag from the plurality of tags respectively assigned to the plurality of reference scenes and calculating similarity scores between the screen attribute tag and the token and between the situation attribute tag and the token, respectively; and extracting, from the reference scene database, a reference scene to which a tag whose similarity score is greater than or equal to a specific score is assigned.
  • An automatic video generation server includes one or more processors and a memory including instructions configured to cause the one or more processors to execute operations, wherein the operations include: dividing the collected images into scene units to generate a plurality of reference scenes, assigning different types of tags to each of the plurality of reference scenes according to characteristic information of the plurality of reference scenes, and storing them in a reference scene database; generating a script using image generation reference information received from the customer terminal; generating a scenario composed of standard scenes based on the script and then extracting keywords from the script; extracting, from among the plurality of reference scenes, a reference scene to which a tag matching the keyword is assigned; and generating an image by combining the extracted reference scene and pre-generated environment data according to the scenario.
  • Extracting the keyword from the script includes: extracting words from the text of the script based on spaces; generating tokens by performing morphological analysis on the words, where a token includes a pair of a word and a morpheme value and is assigned a label indicating the frequency value of the word; assigning different weights to the word of each token according to the word of each token and the label of each token; and extracting keywords composed of the tokens to which the different weights are assigned.
  • Extracting a reference scene to which a tag matching the keyword is assigned includes: when the morpheme value of the token constituting the keyword is a noun, selecting an object attribute tag from the plurality of tags assigned to each of the plurality of reference scenes and calculating a similarity score between the object attribute tag and the token; when the morpheme value of the token constituting the keyword is an adjective, selecting a screen attribute tag and a situation attribute tag from the plurality of tags respectively assigned to the plurality of reference scenes and calculating similarity scores between the screen attribute tag and the token and between the situation attribute tag and the token, respectively; and extracting, from the reference scene database, a reference scene to which a tag whose similarity score is greater than or equal to a specific score is assigned.
  • According to the automatic video generation method and the automatic video generation server, it is possible to automatically generate a video by generating a script according to a user's request and then synthesizing reference scenes and environmental data according to the script.
  • FIG. 1 is a diagram illustrating an automatic video generation system according to an embodiment of the present disclosure.
  • Figure 2 is a diagram illustrating an automatic video generation server according to an embodiment of the present disclosure.
  • Figure 3 is a flow chart illustrating a method for automatically generating a video according to an embodiment of the present disclosure.
  • Figures 4 to 7 are diagrams for explaining the operation of an automatic video generation device according to an embodiment of the present disclosure.
  • FIG. 1 is a diagram illustrating an automatic video generation system according to an embodiment of the present disclosure.
  • the automatic video generation system may include an automatic video generation server 100, one or more customer terminals 400, and one or more user terminals 500.
  • Customer terminal 400 may refer to an electronic device used by customers such as advertisers.
  • the user terminal 500 may refer to an electronic device used by general users other than advertisers.
  • the customer can input the video generation reference information needed to automatically generate the video into the customer terminal 400, and the customer terminal 400 can transmit the video generation reference information entered by the customer to the automatic video creation server 100.
  • the image generation reference information may be a keyword in word units.
  • the automatic video generation server 100 can automatically generate videos such as advertising videos according to customer requests.
  • the automatic video generation server 100 may collect images and build a reference scene database based on the collected images.
  • the automatic video generation server 100 may collect images (e.g., videos) and divide the collected images into scenes to generate a plurality of reference scenes.
  • tags can be extracted from the reference scene by applying the reference scene to a machine learning-based video analysis model.
  • a reference scene database can be built by assigning the extracted tags to reference scenes and then storing them.
  • the automatic video generation server 100 can generate a script using the received video generation reference information and a pre-generated script database.
  • the script database may store one or more attributes related to a keyword and text matching each attribute.
  • one or more properties related to a keyword include object properties of the object corresponding to the keyword, screen properties of the scene matching the object, situation properties of the scene matching the object, and highlight properties of the scene matching the object.
  • the automatic video generation server 100 may generate a script using text that matches an attribute that is determined, from among the one or more attributes related to the keyword, based on user behavior information regarding customer-related content.
  • the automatic video generation server 100 may generate a scenario consisting of a standard scene based on the script.
  • the automatic video generation server 100 can extract keywords from the script. More specifically, the automatic video generation server 100 may extract words from the text of the script based on spaces. And, based on a database of frequency values for each word created in advance, the frequency values of the extracted words can be measured.
  • a token may include a pair of a word and a morpheme value, and may be assigned a label indicating a frequency value.
  • For example, the automatic video generation server 100 can create tokens such as (frequency value: 1000, (word, morpheme value)), (frequency value: 234, (word, morpheme value)), (frequency value: 2541, (word, morpheme value)), and (frequency value: 2516, (word, morpheme value)).
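  • A minimal Python sketch of this token structure is shown below, assuming an externally supplied morphological analyzer and a pre-built word-frequency database; the names and types are illustrative and not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed morphological analyzer: returns (word, morpheme value) pairs for an input string.
PosTagger = Callable[[str], List[Tuple[str, str]]]

@dataclass
class Token:
    word: str             # surface word extracted from the script text
    morpheme: str         # morpheme value (e.g. noun, adjective) from the analyzer
    frequency_label: int  # label indicating the word's frequency value

def build_tokens(script_text: str,
                 pos_tagger: PosTagger,
                 frequency_db: Dict[str, int]) -> List[Token]:
    """Split the script on spaces, run morphological analysis on each word,
    and attach a frequency-value label taken from a pre-built database."""
    tokens = []
    for word in script_text.split():                   # space-based word extraction
        for surface, morpheme in pos_tagger(word):
            freq = frequency_db.get(surface, 0)        # pre-created word-frequency database
            tokens.append(Token(word=surface, morpheme=morpheme, frequency_label=freq))
    return tokens
```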
  • the automatic video generation server 100 may assign different weights to each token according to the word and/or label of each token.
  • the automatic video generation server 100 may assign different weights to each token according to the type of language in which the word of the token is written (e.g., English, Chinese, Korean, etc.), the position of the word within the text of the script, and/or the label assigned to the token.
  • the automatic video generation server 100 may calculate the first weight using the total number of tokens generated from the text of the script and the order of each token.
  • the automatic video generation server 100 may calculate the first weight by quantifying the order of the current token based on the total number of tokens generated from the text of the script and an importance value predetermined according to the type of language. For example, if the total number of tokens is 12 and the current token is the 4th token, the total of 12 can be normalized to '1' and 1 can be divided by 4 to calculate '0.25'. The first weight can then be calculated by reflecting the importance value predetermined according to the type of language in the value calculated in this way.
  • the importance value may change depending on the order of the current token. Specifically, if the language is one in which important words appear at the end of a sentence, the importance value reflected may increase as the order of the current token increases. If the language is one in which important words appear at the beginning of a sentence, the importance value reflected may decrease as the order of the current token increases.
  • the automatic video generation server 100 can calculate a second weight for the current token using the frequency value indicated by the label of the current token, the frequency value indicated by the label of the previous token, and the frequency value indicated by the label of the next token.
  • the automatic video generation server 100 may assign a final weight to the current token using the first weight and the second weight. Then, keywords consisting of tokens with final weights can be extracted.
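  • The weighting described above can be sketched as follows, reusing the worked example (12 tokens treated as '1', the 4th token yielding 0.25). The exact second-weight formula and the way the two weights are combined are not specified, so those choices are assumptions.

```python
def first_weight(token_order: int, total_tokens: int, importance: float) -> float:
    """Positional weight: the example in the text normalizes the total token count
    to '1' and divides 1 by the token's order (4th of 12 tokens -> 0.25), then
    reflects a language-dependent importance value. The multiplication is assumed."""
    positional = 1.0 / token_order          # e.g. 4th token -> 0.25
    return positional * importance

def second_weight(prev_freq: int, cur_freq: int, next_freq: int) -> float:
    """Frequency weight from the labels of the previous, current and next tokens.
    Comparing the current frequency with the neighbours' average is one plausible
    choice; the disclosure does not give the formula."""
    neighbour_avg = (prev_freq + next_freq) / 2 or 1
    return cur_freq / neighbour_avg

def final_weight(w1: float, w2: float) -> float:
    """How the two weights are combined is unspecified; a simple product is assumed."""
    return w1 * w2
```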
  • the automatic video generation server 100 may calculate a similarity score between the extracted keyword and the tag assigned to the reference scene.
  • the similarity score is a score expressing the degree to which the extracted keyword matches the tag assigned to the reference scene.
  • the automatic video generation server 100 selects a tag that matches the morpheme value of the token constituting the keyword from among the plurality of tags assigned to the reference scene, and compares the selected tag with the word of the token. A similarity score can be calculated.
  • the automatic video generation server 100 may select an object attribute tag from a plurality of tags assigned to the reference scene. Additionally, the similarity score between the object attribute tag and the words in the token can be calculated.
  • the automatic video generation server 100 may select a screen attribute tag and a situation attribute tag from a plurality of tags assigned to the reference scene. Additionally, the similarity score between the screen attribute tag and the words in the token can be calculated, and the similarity score between the situation attribute tag and the words in the token can be calculated. Similarity score calculation can be performed on all reference scenes stored in the reference scene database.
  • the automatic video generation server 100 may extract a reference scene assigned a tag with a similarity score of a certain score or more from the reference scene database.
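  • The tag selection and similarity-based extraction can be sketched as below; the tag keys, the pluggable word-similarity function, and the threshold value are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

Similarity = Callable[[str, str], float]   # assumed word-level similarity (e.g. embedding cosine)

def extract_reference_scenes(keyword_tokens: List[Tuple[str, str]],   # (word, morpheme value) pairs
                             scene_db: List[Dict],
                             similarity: Similarity,
                             threshold: float = 0.7) -> List[Dict]:
    """Select tags according to each token's morpheme value and keep reference
    scenes whose similarity score reaches the threshold."""
    selected = []
    for scene in scene_db:                        # every scene stored in the reference scene database
        best = 0.0
        for word, morpheme in keyword_tokens:
            if morpheme == "noun":                # noun -> compare with the object attribute tag
                best = max(best, similarity(word, scene["object_tag"]))
            elif morpheme == "adjective":         # adjective -> screen and situation attribute tags
                best = max(best,
                           similarity(word, scene["screen_tag"]),
                           similarity(word, scene["situation_tag"]))
        if best >= threshold:
            selected.append(scene)                # a tag with a score above the threshold is assigned
    return selected
```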
  • the automatic video generation server 100 may generate an image by combining the extracted reference scene and pre-generated environment data. To this end, the automatic video generation server 100 may select sound data according to a scenario and convert text data corresponding to the scenario into voice data. And, the automatic video generation server 100 can generate an AI actor according to the above scenario.
  • the automatic video generation server 100 can collect images (e.g., videos). Then, the collected video can be decoded to obtain the frames that make up the video, and then the frames can be sampled at playback time intervals.
  • the automatic video generation server 100 may list the sampled frames in the order of playback time and calculate the degree of similarity between adjacent frames. When the similarity is calculated for all the listed frames, the automatic video generation server 100 groups the frames based on the similarity, thereby generating a plurality of reference scenes divided by scene.
  • the automatic video generation server 100 may perform feature matching on adjacent frames and calculate the degree of similarity between adjacent frames. Specifically, the automatic video generation server 100 compares the keypoints between adjacent frames, and if the similarity is greater than the reference value, groups the frames into one scene, thereby generating one reference scene. If, as a result of comparing feature points between adjacent frames, the similarity is less than the reference value, it can be determined that the scene has been switched, and different reference scenes can be generated by grouping the corresponding frames into different scenes.
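  • One way to realize this keypoint comparison is sketched below with OpenCV ORB features; the library choice, the similarity normalization, and the threshold are assumptions rather than values from the disclosure.

```python
import cv2

def _gray(frame):
    # convert to grayscale if the frame still has colour channels
    return cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) if frame.ndim == 3 else frame

def frame_similarity(frame_a, frame_b) -> float:
    """Keypoint-based similarity between adjacent frames (ORB features matched with
    a brute-force Hamming matcher), normalized by the larger keypoint count."""
    orb = cv2.ORB_create()
    kp_a, des_a = orb.detectAndCompute(_gray(frame_a), None)
    kp_b, des_b = orb.detectAndCompute(_gray(frame_b), None)
    if des_a is None or des_b is None:
        return 0.0
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    return len(matches) / max(len(kp_a), len(kp_b), 1)

def split_into_scenes(frames, threshold: float = 0.3):
    """Group frames listed in playback order: when the similarity to the previous
    frame drops below the threshold, a scene change is assumed and a new reference
    scene starts."""
    scenes, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if frame_similarity(prev, cur) >= threshold:
            current.append(cur)
        else:
            scenes.append(current)    # scene switch: close the current reference scene
            current = [cur]
    scenes.append(current)
    return scenes
```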
  • the automatic video generation server 100 may extract objects for each listed frame and then determine whether to change the scene based on a change in the number of extracted objects. Additionally, a reference scene can be created based on the point in time when the number of extracted objects changes or the point in time when the number of extracted objects changes beyond the standard value.
  • the automatic video generation server 100 may determine whether the background changes based on the change in pixel value between pixels at the same position in adjacent frames, and may determine whether there is a scene change based on the determination result. Next, a reference scene can be created based on the point in time when the background changes.
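  • A minimal sketch of this pixel-wise background-change test, assuming a mean absolute difference and an illustrative threshold:

```python
import numpy as np

def background_changed(frame_a: np.ndarray, frame_b: np.ndarray, threshold: float = 30.0) -> bool:
    """Compare pixels at the same positions in adjacent frames; if the mean absolute
    difference exceeds the threshold, treat it as a background (scene) change."""
    diff = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16))
    return float(diff.mean()) > threshold
```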
  • the automatic video generation server 100 may determine whether to change the scene based on a change in the content of audio data and/or subtitle data constituting the video. Additionally, a reference scene can be created based on the point in time when new content appears in the audio data and/or subtitle data.
  • the automatic video generation server 100 may extract objects for each listed frame and then determine whether a scene change occurs based on a change in the type of the extracted object. Additionally, a reference scene can be created based on the point in time when a previously extracted object disappears and/or when a new object appears.
  • the automatic video generation server 100 can analyze the plurality of reference scenes and extract characteristic information of the reference scene. And, depending on the extracted feature information, different types of tags can be assigned to each reference scene. For example, depending on the extracted feature information, one of an object attribute tag, a screen attribute tag, a situation attribute tag, and a highlight attribute tag can be assigned.
  • the automatic video creation server 100 may detect the characteristic area of the object in the reference scene (Interest Point Detection).
  • the feature area refers to the main area from which a feature descriptor that describes the characteristics of an object is extracted.
  • Feature descriptors may also be referred to as descriptors, feature vectors, or vector values, and may be used to determine whether objects are identical or similar.
  • feature areas may include the contours of the object, corners among the contours, blobs that are distinct from the surrounding area, areas that are invariant or covariant under deformation of the reference scene, and/or extreme points that are darker or brighter than their surroundings.
  • the feature area may target a patch (piece) of the reference scene or the entire reference scene.
  • the automatic video generation server 100 may extract feature information of the object from the detected feature area. Additionally, a feature descriptor expressing the extracted feature information as a vector value can be extracted. And object attribute tags can be assigned to the reference scene according to the feature descriptor.
  • the automatic video generation server 100 may detect a feature area of a reference scene. And the feature information of the reference scene can be extracted from the feature area of the detected reference scene. Additionally, a feature descriptor expressing the extracted feature information as a vector value can be extracted. And screen attribute tags can be assigned to the reference scene according to the feature descriptor.
  • the above-mentioned feature descriptor may be calculated using the location of the feature area and the brightness, color, sharpness, gradient, scale, and/or pattern information of the feature area in the reference scene. For example, the feature descriptor may be calculated by converting the brightness value, brightness change value, and/or distribution value of the feature area into a vector.
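  • A simple local-descriptor sketch along these lines is shown below; the particular statistics chosen are assumptions, since the text only states that brightness, brightness-change, and distribution values are vectorized.

```python
import numpy as np

def local_descriptor(patch: np.ndarray) -> np.ndarray:
    """Express a feature area (image patch) as a vector of brightness statistics."""
    gray = patch.mean(axis=2) if patch.ndim == 3 else patch
    gy, gx = np.gradient(gray.astype(np.float32))    # brightness change (gradient)
    return np.array([
        gray.mean(),          # average brightness of the feature area
        gray.std(),           # distribution (spread) of brightness values
        np.abs(gx).mean(),    # horizontal brightness change
        np.abs(gy).mean(),    # vertical brightness change
    ])
```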
  • the feature descriptor may be expressed not only as a local descriptor based on the feature area as described above, but also as a global descriptor, a frequency descriptor, a binary descriptor, or a neural network descriptor.
  • the global descriptor can convert the brightness, color, sharpness, gradient, scale, and/or pattern information of the entire reference scene, of each area into which the reference scene is divided by an arbitrary standard, or of each feature area into vector values.
  • the frequency descriptor can convert the number of times pre-classified feature descriptors are included in a reference scene and/or the number of times they include global features such as a conventionally defined color table into a vector value.
  • a binary descriptor can be used by extracting in bits whether each descriptor is included and/or whether the size of each element value constituting the descriptor is larger or smaller than a specific value, and then converting it to an integer type.
  • a neural network descriptor can extract image information used for learning or classification from the layers of a neural network.
  • the automatic video generation server 100 may apply a reference scene to a scene type analysis model.
  • a scene type analysis model may refer to a model learned to receive a scene as input and output the scene type. Additionally, the scene type may refer to the type of situation being expressed in the scene.
  • the automatic video generation server 100 may assign a situation attribute tag to the reference scene according to the type of the extracted situation.
  • the automatic video generation server 100 may build a scene type analysis model as a CNN (Convolution Neural Network) model, one of the deep learning models, and learn the above-described data set.
  • the CNN model can be designed to include two convolutional layers, a relu layer, a max pooling layer, and one fully connected layer.
  • the automatic video generation server 100 may use the RCNN technique to construct feature sequences in map order from the convolution feature maps calculated by the CNN model, and then perform learning by feeding each feature sequence into a Long Short-Term Memory network (LSTM).
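  • A sketch of such a model in PyTorch is given below; the channel sizes, the 224x224 input resolution, the number of scene types, and the LSTM hidden size are illustrative assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn

class SceneTypeCNN(nn.Module):
    """Two convolution layers, a ReLU layer, a max-pooling layer and one fully
    connected layer, plus an RCNN-style readout that feeds the convolution feature
    maps to an LSTM as a sequence in map order."""
    def __init__(self, num_scene_types: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution layer 1
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # convolution layer 2
            nn.ReLU(),                                    # relu layer
            nn.MaxPool2d(2),                              # max pooling layer
        )
        self.fc = nn.Linear(32 * 112 * 112, num_scene_types)  # one fully connected layer
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

    def forward(self, x: torch.Tensor):
        fmap = self.features(x)                  # (B, 32, 112, 112) for 224x224 input
        logits = self.fc(fmap.flatten(1))        # scene-type prediction
        seq = fmap.flatten(2).permute(0, 2, 1)   # feature sequence in map order
        _, (h_n, _) = self.lstm(seq)             # sequence summarized by the LSTM
        return logits, h_n[-1]
```

  • For example, `SceneTypeCNN()(torch.randn(1, 3, 224, 224))` returns the scene-type logits together with the LSTM summary of the feature sequence.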
  • the automatic video creation server 100 may extract a highlight portion from the video.
  • the highlight portion may refer to the section containing the most important information in the video. For example, when the content of the video is organized into four narrative sections, the section corresponding to the turning point of the story may be considered the highlight section. Highlights can be extracted manually or automatically.
  • the automatic video creation server 100 may assign a highlight attribute tag to the reference scene corresponding to the highlight portion.
  • the automatic video generation server 100 as described above may be implemented as included in, for example, a web service providing server.
  • the web service providing server can provide various contents to the user terminal 500.
  • the type of content provided to the user terminal 500 may vary depending on the type of application used by the user terminal 500 to access the web service providing server.
  • This web service providing server may be implemented as an online shopping mall server or a search engine server.
  • the customer terminal 400 may include an application for accessing a web service providing server. Accordingly, when the application is selected and executed by the customer, the customer terminal 400 can access the automatic video creation server 100 through the application. Thereafter, when the customer inputs video generation reference information into the customer terminal 400, the customer terminal 400 may request automatic video generation by providing the input video generation reference information to the automatic video generation server 100.
  • the user terminal 500 may include an application for accessing a web service providing server. Accordingly, when the application is selected and executed by the user, the user terminal 500 can access the web service providing server through the application.
  • the user terminal 500 can display a web page provided from a web service providing server through an application.
  • a web page may include a screen loaded on an electronic device and/or content within the screen so that it can be immediately displayed on the screen according to a user's scroll input.
  • the entire application execution screen that extends horizontally or vertically and is displayed as the user scrolls may be included in the concept of a web page.
  • the camera roll screen can also be included in the concept of a web page.
  • the user terminal 500 may include an application (e.g., software, a neural network model, etc.) for analyzing user interests. Accordingly, the user terminal 500 may collect and store log records and/or engagement records and determine the user's interests by analyzing the log records and/or engagement records through the application for user interest analysis.
  • the user terminal 500 may extract content by analyzing log records and/or engagement records stored in the user terminal 500, and may extract a label indicating the type of the extracted content.
  • Log records may be created by recording events that occur while the operating system or software of the user terminal 500 is running.
  • Engagement records can be created by recording a set of committed actions that result in a user becoming interested, participating, and engaging.
  • User behavior information may include not only actions such as the user viewing content through a web browser, the user tagging content with a 'like' through social networks, and the user viewing images or text on a homepage, but also the objects of these actions, the times at which these actions occurred, and the durations for which these actions were maintained.
  • a label indicating the type of extracted content may indicate, for example, whether the extracted content corresponds to the user's interests or not.
  • a label indicating the type of extracted content may be extracted by analyzing log records and/or engagement records, or may be extracted from labels stored in advance.
  • the user terminal 500 may be equipped with a crawler, a parser, and an indexer, through which web pages viewed by the user may be collected.
  • the crawler can collect data related to item information by collecting a list of web addresses that users browse, checking websites, and tracking links.
  • the parser can interpret web pages collected during the crawling process and extract item information such as images, item prices, and item names included in the page.
  • the indexer can index the location and meaning of the extracted item information.
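  • A minimal crawler/parser/indexer sketch using requests and BeautifulSoup is shown below; the CSS selectors for the item name, price, and image are hypothetical placeholders, since real pages differ.

```python
import requests
from bs4 import BeautifulSoup

def crawl_item_info(start_urls, name_sel=".item-name", price_sel=".item-price", img_sel="img"):
    """Fetch the pages a user browses, parse item information (image, item name,
    item price), and build a simple index keyed by page location."""
    index = {}
    for url in start_urls:
        html = requests.get(url, timeout=5).text          # crawler: fetch the page
        soup = BeautifulSoup(html, "html.parser")         # parser: interpret the page
        name = soup.select_one(name_sel)
        price = soup.select_one(price_sel)
        image = soup.select_one(img_sel)
        index[url] = {                                    # indexer: location and meaning
            "item_name": name.get_text(strip=True) if name else None,
            "item_price": price.get_text(strip=True) if price else None,
            "image": image["src"] if image and image.has_attr("src") else None,
        }
    return index
```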
  • Figure 2 is a diagram illustrating an automatic video creation server according to an embodiment of the present disclosure.
  • the automatic video generation server 100 may include an image generating device 200 and a reference scene providing device 300.
  • the video generating device 200 can automatically generate videos, such as advertising videos, according to customer requests.
  • the image generation device 200 may include a script generation unit 210, a scenario generation unit 220, a keyword extraction unit 230, a reference scene extraction unit 240, an environment data generation unit 250, and an image synthesis unit 260.
  • the script generator 210 may generate a script using the received image generation reference information and a pre-generated script database. Specifically, the script generator 210 may search the script database for a keyword included in the image generation reference information, and then create a script using text that matches an attribute determined based on user behavior information regarding customer-related content, from among the object properties of the object corresponding to the searched keyword, the screen properties of the scene matching the object, the situation properties of the scene matching the object, and the highlight properties of the scene matching the object.
  • the scenario generator 220 may generate a scenario composed of a standard scene based on the script generated by the script generator 210. According to embodiments, the scenario may further include sound effects and/or atmosphere in addition to the reference scene.
  • the keyword extraction unit 230 may extract keywords from the script generated by the script creation unit 210. More specifically, the keyword extractor 230 may extract words from the text of the script based on spaces. And, based on a database of frequency values for each word created in advance, the frequency values of the extracted words can be measured.
  • the keyword extraction unit 230 may generate a token by performing morphological analysis on each of the extracted words.
  • a token may include a pair of words and morpheme values, and may be assigned a label indicating a frequency value.
  • For example, the keyword extraction unit 230 can generate tokens such as (frequency value: 1000, (word, morpheme value)), (frequency value: 234, (word, morpheme value)), (frequency value: 2541, (word, morpheme value)), and (frequency value: 2516, (word, morpheme value)).
  • the keyword extractor 230 may assign different weights to each token according to the word and/or label of each token.
  • the keyword extraction unit 230 may assign different weights to each token according to the type of language in which the word of the token is written (e.g., English, Chinese, Korean, etc.), the position of the word within the text of the script, and/or the label assigned to the token.
  • the keyword extractor 230 may calculate the first weight using the total number of tokens generated from the text of the script and the order of each token.
  • the keyword extraction unit 230 may calculate the first weight for the current token by quantifying the order of the current token based on the total number of tokens generated from the text of the script and an importance value predetermined according to the type of language. For example, if the total number of tokens is 12 and the current token is the 4th token, the keyword extraction unit 230 may normalize the total of 12 to '1' and divide 1 by 4 to calculate '0.25'.
  • the first weight can be calculated by reflecting the importance value predetermined according to the type of language in the value calculated in this way. According to an embodiment, the importance value may change depending on the order of the current token. If the language is one in which important words appear at the end of a sentence, the importance value reflected may increase as the order of the current token increases. If the language is one in which important words appear at the beginning of a sentence, the importance value reflected may decrease as the order of the current token increases.
  • the keyword extraction unit 230 may calculate the second weight using the frequency value indicated by the label of the current token, the frequency value indicated by the label of the previous token, and the frequency value indicated by the label of the next token.
  • the keyword extraction unit 230 may assign a final weight to the current token using the first weight and the second weight. Then, keywords consisting of tokens with final weights can be extracted.
  • the reference scene extractor 240 may calculate a similarity score between the extracted keyword and the tag assigned to the reference scene.
  • the similarity score is a score expressing the degree to which the extracted keyword matches the tag assigned to the reference scene.
  • the reference scene extractor 240 selects a tag that matches the morpheme value of the token constituting the keyword from among the plurality of tags assigned to the reference scene, and compares the selected tag with the word of the token. A similarity score can be calculated.
  • the reference scene extractor 240 may select an object attribute tag from a plurality of tags assigned to the reference scene. Additionally, the similarity score between the object attribute tag and the words in the token can be calculated.
  • the reference scene extractor 240 may select a screen attribute tag and a situation attribute tag from a plurality of tags assigned to the reference scene. Additionally, the similarity score between the screen attribute tag and the words in the token can be calculated, and the similarity score between the situation attribute tag and the words in the token can be calculated. Similarity score calculation can be performed on all reference scenes stored in the reference scene database.
  • the reference scene extractor 240 may extract a reference scene assigned a tag with a similarity score of a certain score or more from the reference scene database.
  • the environmental data generator 250 may select sound data according to the scenario. And text data corresponding to the above scenario can be converted into voice data. Furthermore, an AI actor can be created according to the above scenario.
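  • As one possible realization of the text-to-voice step only, the sketch below uses the gTTS library; the library, the language code, and the output format are assumptions, and sound-data selection and AI-actor generation are not covered here.

```python
from gtts import gTTS  # one off-the-shelf TTS library; the actual engine is not specified in the text

def scenario_text_to_voice(scenario_text: str, out_path: str = "narration.mp3", lang: str = "ko") -> str:
    """Convert the text data corresponding to the scenario into voice data and
    return the path of the generated audio file."""
    gTTS(text=scenario_text, lang=lang).save(out_path)
    return out_path
```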
  • the image synthesis unit 260 may generate an image by combining the extracted reference scene and the environment data previously generated by the environment data generation unit 250.
  • the reference scene providing device 300 may build a reference scene database and provide the reference scene extracted from the reference scene database to the image generating device 200.
  • the reference scene providing device 300 may include an image segmentation unit 310, a tag allocation unit 320, a reference scene database 330, and a reference scene recommendation unit 340.
  • the image segmentation unit 310 may collect images (e.g., videos). Then, the collected video can be decoded to obtain the frames that make up the video, and then the frames can be sampled at playback time intervals.
  • the image segmentation unit 310 may arrange the sampled frames in the order of playback time and calculate the degree of similarity between adjacent frames.
  • the automatic video generation server 100 groups the frames based on the similarity, thereby generating a plurality of reference scenes divided by scene.
  • the image segmentation unit 310 may perform feature matching on adjacent frames to calculate the degree of similarity between adjacent frames. Specifically, the automatic video generation server 100 compares the keypoints between adjacent frames, and if the similarity is greater than the reference value, groups the frames into one scene, thereby generating one reference scene. If, as a result of comparing feature points between adjacent frames, the similarity is less than the reference value, it can be determined that the scene has been switched, and different reference scenes can be generated by grouping the corresponding frames into different scenes.
  • the image segmentation unit 310 may extract objects for each listed frame and then determine whether to change the scene based on a change in the number of extracted objects. Additionally, a reference scene can be created based on the point in time when the number of extracted objects changes or the point in time when the number of extracted objects changes beyond the standard value.
  • the image segmentation unit 310 determines whether the background changes based on the change in pixel value between pixels of the same position among pixels of adjacent frames, and determines whether there is a scene change based on the determination result. can do. Next, a reference scene can be created based on the point in time when the background changes.
  • the video segmentation unit 310 may determine whether to change the scene based on a change in the content of the audio data and/or subtitle data constituting the video. Additionally, a reference scene can be created based on the point in time when new content appears in the audio data and/or subtitle data.
  • the image segmentation unit 310 may extract objects for each listed frame and then determine the beginning of a scene change based on a change in the type of the extracted object. Additionally, a reference scene can be created based on the point in time when a previously extracted object disappears and/or when a new object appears.
  • the tag allocator 320 may analyze a plurality of reference scenes and extract characteristic information of the reference scenes. And, depending on the extracted feature information, different types of tags can be assigned to each reference scene. For example, depending on the extracted feature information, one of an object attribute tag, a screen attribute tag, a situation attribute tag, and a highlight attribute tag can be assigned.
  • the automatic video creation server 100 may detect the characteristic area of the object in the reference scene (Interest Point Detection).
  • the feature area refers to the main area from which a feature descriptor that describes the characteristics of an object is extracted.
  • Feature descriptors may also be referred to as descriptors, feature vectors, or vector values, and may be used to determine whether objects are identical or similar.
  • feature areas may include the contours of the object, corners among the contours, blobs that are distinct from the surrounding area, areas that are invariant or covariant under deformation of the reference scene, and/or extreme points that are darker or brighter than their surroundings.
  • the feature area may target a patch (piece) of the reference scene or the entire reference scene.
  • the automatic video generation server 100 may extract feature information of the object from the detected feature area. Additionally, a feature descriptor expressing the extracted feature information as a vector value can be extracted. And object attribute tags can be assigned to the reference scene according to the feature descriptor.
  • the automatic video generation server 100 may detect a feature area of a reference scene. And the feature information of the reference scene can be extracted from the feature area of the detected reference scene. Additionally, a feature descriptor expressing the extracted feature information as a vector value can be extracted. And screen attribute tags can be assigned to the reference scene according to the feature descriptor.
  • the above-mentioned feature descriptor may be calculated using the location of the feature area, brightness, color, sharpness, gradient, scale and/or pattern information of the feature area in the reference scene.
  • For example, the feature descriptor may be calculated by converting the brightness value, brightness change value, and/or distribution value of the feature area into a vector.
  • the tag allocation unit 320 may apply the reference scene to the scene type analysis model.
  • a scene type analysis model may refer to a model learned to receive a scene as input and output the scene type. Additionally, the scene type may refer to the type of situation being expressed in the scene.
  • the automatic video generation server 100 may assign a situation attribute tag to the reference scene according to the type of the extracted situation.
  • the tag allocator 320 may build a scene type analysis model as a CNN (Convolution Neural Network) model, which is one of the deep learning models, and learn the above-described data set.
  • the CNN model can be designed to include two convolutional layers, a relu layer, a max pooling layer, and one fully connected layer.
  • the tag allocation unit 320 may use the RCNN technique to construct feature sequences in map order from the convolution feature maps calculated by the CNN, and then perform learning by feeding each feature sequence into a Long Short-Term Memory network (LSTM).
  • the tag allocation unit 320 may extract a highlight portion from the video.
  • the highlight portion may refer to the section containing the most important information in the video. For example, when the content of the video is organized into four narrative sections, the section corresponding to the turning point of the story may be considered the highlight section. Highlights can be extracted manually or automatically.
  • the tag allocation unit 320 may assign a highlight attribute tag to the reference scene corresponding to the highlight portion.
  • the reference image to which a tag is assigned by the tag allocation unit 320 may be stored in the reference scene database 330.
  • the reference scene database 330 may store the start time of the reference scene, the end time of the reference scene, and one or more tags assigned to the reference scene in a table format.
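  • A sketch of such a table using SQLite is shown below; the column names and the comma-separated tag encoding are assumptions about one possible layout.

```python
import sqlite3

# Reference scene table: start time, end time and the tags assigned to each scene.
conn = sqlite3.connect("reference_scenes.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS reference_scene (
    scene_id    INTEGER PRIMARY KEY,
    start_time  REAL,     -- start time of the reference scene (seconds)
    end_time    REAL,     -- end time of the reference scene (seconds)
    tags        TEXT      -- one or more tags assigned to the scene
)
""")
conn.execute(
    "INSERT INTO reference_scene (start_time, end_time, tags) VALUES (?, ?, ?)",
    (12.5, 18.0, "object:sneaker,screen:close-up,situation:unboxing"),
)
conn.commit()
```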
  • Figure 3 is a flow chart illustrating a method for automatically generating a video according to an embodiment of the present disclosure.
  • the automatic video generation server 100 may generate a script using the video generation reference information received from the customer terminal 400 (S310).
  • the automatic video generation server 100 may generate a scenario consisting of a standard scene based on the script and then extract keywords from the script (S320).
  • step S320 may include: the automatic video generation server 100 extracting words from the text of the script based on spaces; measuring the frequency values of the extracted words based on a pre-created database of frequency values for each word; performing morphological analysis on each of the extracted words to generate tokens; assigning different weights to each token according to the word of each token and/or the label of each token; and extracting, as keywords, the words of tokens whose weight is greater than or equal to a reference value.
  • the automatic video generation server 100 may extract from the reference scene database 330 a reference scene to which a tag matching the keyword is assigned among a plurality of reference scenes stored in the reference scene database (S330).
  • the automatic video generation server 100 may generate an image by combining the pre-generated environmental data and the extracted reference scene according to the scenario created in step S320 (S340).
  • Figures 4 to 7 are diagrams for explaining the operation of an automatic video generation device according to an embodiment of the present disclosure.
  • the automatic video generation device may collect video 410 to automatically generate a video according to a customer's request.
  • the collected image 410 may be provided to the image segmentation unit.
  • the image segmentation unit may divide the input image into scenes to create a plurality of reference scenes (420_1 to 420_4).
  • a plurality of reference scenes may be input to the tag allocation unit.
  • the tag allocation unit may assign tags to each reference scene (420_1 to 420_4).
  • Reference scenes 420_1 to 420_4 to which tags are assigned may be stored in the reference scene database 430.
  • the image segmentation unit may decode the input image 410 to obtain frames constituting the image, and then sample the frames at playback time intervals.
  • the image segmentation unit may calculate the similarity between adjacent frames among the sampled frames and group the frames based on the similarity, thereby generating a plurality of reference scenes divided on a scene basis.
  • the tag allocation unit can analyze the plurality of reference scenes (420_1 to 420_4) to extract characteristic information of each reference scene and assign different types of tags to each reference scene (420_1 to 420_4) according to the extracted characteristic information. For example, depending on the extracted feature information, one of an object attribute tag, a screen attribute tag, a situation attribute tag, and a highlight attribute tag can be assigned.
  • the tag allocation unit may detect the feature area of the object in the reference scene and extract feature information of the object from the detected feature area. Additionally, a feature descriptor expressing the extracted feature information as a vector value can be extracted. And object attribute tags can be assigned to the reference scene according to the feature descriptor.
  • the tag allocator may analyze the reference scene 420_3 and detect the feature area of the object (Interest Point Detection). And as shown in FIG. 6(b), the object and its characteristic information can be extracted from the detected feature area. Afterwards, the feature information of the object can be extracted by expressing it as a vector value. Next, as shown in FIG. 6(c), an object attribute tag can be assigned to the reference scene 420_3 according to the characteristic information of the object.
  • programs for various operations of the automatic video generation server 100 may be stored in the memory of the automatic video generation server 100.
  • the processor of the automatic video generation server 100 can load and execute a program stored in memory.
  • the processor may be implemented as an application processor (AP), central processing unit (CPU), microcontroller unit (MCU), or similar devices, depending on hardware, software, or a combination thereof.
  • hardware may be provided in the form of an electronic circuit that processes electrical signals to perform a control function
  • software may be provided in the form of a program or code that drives the hardware circuit.
  • the disclosed embodiments may be implemented in the form of a recording medium that stores instructions executable by a computer. Instructions may be stored in the form of program code, and when executed by a processor, may create program modules to perform operations of the disclosed embodiments.
  • the recording medium may be implemented as a computer-readable recording medium.
  • Computer-readable recording media include all types of recording media storing instructions that can be decoded by a computer. For example, there may be read only memory (ROM), random access memory (RAM), magnetic tape, magnetic disk, flash memory, optical data storage, etc.
  • computer-readable recording media may be provided in the form of non-transitory storage media.
  • a 'non-transitory storage medium' only means that it is a tangible device and does not contain signals (e.g., electromagnetic waves); this term does not distinguish between cases where data is stored semi-permanently in a storage medium and cases where it is stored temporarily.
  • a 'non-transitory storage medium' may include a buffer where data is temporarily stored.
  • methods according to various embodiments disclosed in this document may be included and provided in a computer program product.
  • Computer program products are commodities and can be traded between sellers and buyers.
  • the computer program product may be distributed in the form of a machine-readable recording medium (e.g., compact disc read only memory (CD-ROM)), distributed through an application store (e.g., Play Store™), or distributed directly or online (e.g., downloaded or uploaded) between two user devices (e.g., smartphones).
  • in the case of online distribution, at least part of the computer program product (e.g., a downloadable app) may be temporarily stored or created in a machine-readable recording medium such as the memory of a manufacturer's server, an application store's server, or a relay server.
  • the automatic video generation method and automatic video generation server as described above can be applied to the video production field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An automatic video generation method according to an embodiment of the present disclosure may comprise the steps of: generating a script using image generation reference information received from a customer terminal; generating a scenario comprising standard scenes on the basis of the script and then extracting a keyword from the script; extracting, from among a plurality of reference scenes, a reference scene to which a tag matching the keyword is assigned; and combining the extracted reference scene and pre-generated environmental data according to the scenario so as to generate an image.
PCT/KR2023/016935 2022-10-27 2023-10-27 Procédé de génération de vidéo automatique, et serveur de génération de vidéo automatique WO2024091080A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020220140176A KR102560609B1 (ko) 2022-10-27 2022-10-27 동영상 자동 생성 방법 및 이를 실행하는 서버
KR10-2022-0140176 2022-10-27

Publications (1)

Publication Number Publication Date
WO2024091080A1 true WO2024091080A1 (fr) 2024-05-02

Family

ID=87433039

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/016935 WO2024091080A1 (fr) 2022-10-27 2023-10-27 Procédé de génération de vidéo automatique, et serveur de génération de vidéo automatique

Country Status (2)

Country Link
KR (1) KR102560609B1 (fr)
WO (1) WO2024091080A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102560609B1 (ko) * 2022-10-27 2023-07-27 주식회사 일만백만 동영상 자동 생성 방법 및 이를 실행하는 서버
CN117749960B (zh) * 2024-02-07 2024-06-14 成都每经新视界科技有限公司 一种视频合成方法、装置和电子设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20110070386A (ko) * 2009-12-18 2011-06-24 주식회사 케이티 영상 ars 자동 제작 시스템 및 그 방법
KR20130032653A (ko) * 2011-09-23 2013-04-02 브로드밴드미디어주식회사 동영상 자막을 키워드로 이용한 영상 검색 시스템 및 방법
KR20170099492A (ko) * 2016-02-24 2017-09-01 (주)핑거플러스 영상 컨텐츠의 정지영상을 이용하여 커머스 광고 정보를 제공하는 방법
KR102217262B1 (ko) * 2020-07-20 2021-02-18 주식회사 파파플랜트 라이브커머스 서비스 제공 시스템 및 방법
KR20220083091A (ko) * 2020-12-11 2022-06-20 주식회사 케이티 커머스 영상을 제공하는 서버, 판매자 단말 및 영상을 시청하는 시청자 단말
KR102560609B1 (ko) * 2022-10-27 2023-07-27 주식회사 일만백만 동영상 자동 생성 방법 및 이를 실행하는 서버

Also Published As

Publication number Publication date
KR102560609B1 (ko) 2023-07-27

Similar Documents

Publication Publication Date Title
WO2024091080A1 (fr) Procédé de génération de vidéo automatique, et serveur de génération de vidéo automatique
WO2016013914A1 (fr) Procédé, appareil, système et programme d'ordinateur permettant de fournir et d'afficher des informations de produit
WO2020080606A1 (fr) Procédé et système de génération automatique de métadonnées intégrées à un contenu vidéo à l'aide de métadonnées de vidéo et de données de script
US20210406549A1 (en) Method and apparatus for detecting information insertion region, electronic device, and storage medium
WO2018135881A1 (fr) Gestion de l'intelligence de vision destinée à des dispositifs électroniques
WO2024091084A1 (fr) Procédé de recommandation de scène de référence et dispositif de recommandation de scène de référence destinés à la génération de vidéo automatique
WO2010119996A1 (fr) Procédé et dispositif de fourniture d'annonces publicitaires à images mobiles
WO2016013915A1 (fr) Procédé, appareil et programme d'ordinateur d'affichage d'informations de recherche
CN109474847A (zh) 基于视频弹幕内容的搜索方法、装置、设备及存储介质
WO2020251174A1 (fr) Procédé permettant de faire la publicité d'un article de mode personnalisé pour l'utilisateur et serveur exécutant celle-ci
CN111314732A (zh) 确定视频标签的方法、服务器及存储介质
US10257563B2 (en) Automatic generation of network pages from extracted media content
CN113761253A (zh) 视频标签确定方法、装置、设备及存储介质
CN104102683A (zh) 用于增强视频显示的上下文查询
EP3942510A1 (fr) Procédé et système de fourniture d'objets multimodaux personnalisés en temps réel
CN116415017A (zh) 基于人工智能的广告敏感内容审核方法及系统
WO2022105507A1 (fr) Procédé et appareil de mesure de définition vidéo d'enregistrement de texte, dispositif informatique et support de stockage
Jin et al. Network video summarization based on key frame extraction via superpixel segmentation
WO2024091085A1 (fr) Procédé de génération de scène de référence et dispositif de génération de scène de référence, qui sont basés sur une image
WO2014178498A1 (fr) Procédé pour produire une image publicitaire et système de production correspondant, et système pour produire un fichier de film comprenant une image publicitaire et procédé permettant d'obtenir un fichier de film
WO2024107000A1 (fr) Procédé et serveur pour générer une image d'examen personnalisée à l'aide de données de rétroaction
WO2024106993A1 (fr) Procédé de génération de vidéo de commerce et serveur utilisant des données d'examen
WO2024091086A1 (fr) Procédé de fourniture de fonction de saut d'image et appareil de fourniture de fonction de saut d'image
WO2017222226A1 (fr) Procédé d'enregistrement d'un produit publicitaire sur un contenu d'image, et serveur pour l'exécution du procédé
WO2021107556A1 (fr) Procédé pour fournir un article recommandé sur la base d'informations d'événement d'utilisateur et dispositif pour l'exécuter

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23883184

Country of ref document: EP

Kind code of ref document: A1