CN117714771A - Video interaction method, device, equipment and readable storage medium - Google Patents

Info

Publication number: CN117714771A
Application number: CN202311719185.7A
Prior art keywords: content, video, target video, information, trigger signal
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 程星星
Assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Application filed by China Mobile Communications Group Co Ltd and MIGU Culture Technology Co Ltd

Abstract

The application discloses a video interaction method, device, equipment and readable storage medium, relating to the technical field of video processing. The method comprises the following steps: acquiring a trigger signal, wherein the trigger signal is generated when a target video is played and is used for triggering the display of an interactive video; acquiring content generation intention information associated with the trigger signal in response to the trigger signal, wherein the content generation intention information is related to a target video clip, and the target video clip comprises the video content being played when the trigger signal is acquired; acquiring an interactive video according to the content generation intention information, wherein the content of the interactive video is different from the content of the target video; and displaying the interactive video. By displaying the interactive video, the scheme provides a mode of interaction between the user and the video content and solves the problem that the existing video interaction effect is monotonous.

Description

Video interaction method, device, equipment and readable storage medium
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video interaction method, apparatus, device, and readable storage medium.
Background
Video interaction is a general term for all interactive operations performed by a user while watching a video; interaction can bring viewers an immersive sense of participation. Currently, the ways a user interacts while watching a video can be summarized as the following three modes: 1) text bullet screen (the most common bullet-screen interaction mode): the user enters text in an input box below the screen and publishes it, and the text is displayed over the video picture; 2) chat room: the chat room is another common interaction mode, mainly used in live-streaming scenarios, in which users can send text, emojis, images and the like to interact in various ways; 3) graphic bullet screen (a relatively rarely used mode): the user publishes images, emojis and the like as a barrage, and they are displayed over the video picture.
These interaction modes serve as communication channels among users, who can freely publish their comments on a video. However, they are all forms of interaction between users; what is lacking is interaction between the user and the video content itself.
Disclosure of Invention
The purpose of the present invention is to provide a video interaction method, device, equipment and readable storage medium, so as to solve the problems in the prior art that the video interaction effect is monotonous and a mode of interaction between the user and the video content is lacking.
In order to achieve the above object, an embodiment of the present application provides a video interaction method, including:
acquiring a trigger signal, wherein the trigger signal is generated when a target video is played, and the trigger signal is used for triggering the display of the interactive video;
acquiring content generation intention information associated with the trigger signal in response to the trigger signal; the content generation intention information is related to a target video clip, and the target video clip comprises video content played when the trigger signal is acquired;
acquiring an interactive video according to the content generation intention information, wherein the content of the interactive video is different from the content of the target video;
and displaying the interactive video.
Optionally, the trigger signal is generated in at least one of the following cases:
the target video is played to a preset position or preset time;
the user inputs instruction information.
Optionally, before the acquiring the content generation intention information associated with the trigger signal, the method further includes:
and generating the content generation intention information at a first moment or a second moment, wherein the first moment is the moment when the target video is monitored to be played to the preset position or the preset time, and the second moment is the moment before the target video is played.
Optionally, the generating the content generation intention information includes:
determining character information corresponding to the audio data in the target video clip;
determining text content in the target video segment according to the audio data and/or the image data in the target video segment;
determining character information corresponding to the text content according to the time or the position of the text content in the target video segment and the character information corresponding to the audio data;
and determining content generation intention information based on the text content, character information corresponding to the text content and the target video segment, wherein the content generation intention is represented by triple information, and the triple information comprises a character list, a generation scene and generation content.
Optionally, the determining the character information corresponding to the audio data in the target video clip includes:
dividing the audio data in the target video segment into a plurality of audio segments, wherein the coincidence degree between two adjacent audio segments is a first numerical value;
extracting original characteristic information of the audio in each audio fragment;
identifying a first target audio fragment with dialogue content in each audio fragment according to the audio original characteristic information;
Extracting audio fingerprint features from the first target audio fragment, and classifying the audio fingerprint features according to the similarity of the audio fingerprint features;
and determining character information corresponding to various audio fingerprint features according to the corresponding relation between the pre-stored characters and the audio fingerprint features.
Optionally, determining the character information corresponding to the text content according to the time or the position of the text content in the target video segment and the character information corresponding to the audio data includes:
acquiring a second target audio segment according to the time or the position of the text content in the target video segment;
acquiring target audio fingerprint characteristics corresponding to the text content from the audio fingerprint characteristics of the second target audio fragment;
and determining the character information corresponding to the text content according to the character information corresponding to the target audio fingerprint characteristics.
Optionally, the determining text content corresponding to the target video segment according to the audio data and/or the image data in the target video segment includes any one of the following:
under the condition that the target video segment comprises the audio data, recognizing the audio data by adopting a voice recognition algorithm to acquire text content corresponding to the target video segment;
And under the condition that the target video segment comprises the audio data and the image data, identifying the audio data to obtain first text content corresponding to the audio data, identifying the image data to obtain second text content corresponding to the image data, checking the first text content by using the second text content, and obtaining text content corresponding to the video segment according to a checking result.
Optionally, the determining the content generating intention information based on the text content, the character information corresponding to the text content and the target video clip includes:
based on the text content and character information corresponding to the text content, utilizing a natural language processing algorithm to acquire four-tuple information, wherein the four-tuple information comprises the number of characters, the name of the characters, the relation of the characters and the emotion of the characters;
determining a generated scene according to the image data corresponding to the text content in the target video segment;
determining a character list and generating contents according to the four-tuple information;
and determining the content generation intention information according to the generation scene, the character list and the generation content.
Optionally, acquiring the content generation intention information associated with the trigger signal includes:
acquiring character information, a generation scene and generation content in instruction information input by a user;
generating the content generation intention information according to a character list, the generation scene and the generation content, wherein the character list comprises the character information acquired from instruction information input by the user.
Optionally, before the acquiring the interactive video according to the content generation intention information, the method further includes:
acquiring character materials corresponding to a character list in the content generation intention in the target video segment;
and synthesizing the interactive video according to the character materials and the generation scene and generation content in the content generation intention information.
Optionally, the method further comprises:
determining position information of each person in the target video in each image data of the target video;
identifying the person identity of each person in the target video by using a person identity identification algorithm;
determining the role names corresponding to the identities of the characters according to the role table of the target video;
The acquiring, in the target video clip, the character materials corresponding to the character list in the content generation intention includes:
and acquiring the character materials corresponding to the character list according to the position information, the character identity and the character name corresponding to the character identity.
In order to achieve the above object, an embodiment of the present application provides a video interaction device, including:
the first acquisition module is used for acquiring a trigger signal, wherein the trigger signal is generated when a target video is played, and the trigger signal is used for triggering the display of the interactive video;
the second acquisition module is used for responding to the trigger signal and acquiring content generation intention information associated with the trigger signal; the content generation intention information is related to a target video clip, and the target video clip comprises video content played when the trigger signal is acquired;
the third acquisition module is used for acquiring an interactive video according to the content generation intention information, wherein the content of the interactive video is different from the content of the target video;
and the display module is used for displaying the interactive video.
In order to achieve the above object, an embodiment of the present application provides a video interaction device, including: a transceiver, a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor implements the video interaction method according to the first aspect.
In order to achieve the above object, an embodiment of the present application provides a readable storage medium, where a program is stored, the program implementing the video interaction method according to the first aspect when executed by a processor.
The technical scheme of the application has at least the following beneficial effects:
in the video interaction method, firstly, a trigger signal is obtained, wherein the trigger signal is generated when a target video is played, and the trigger signal is used for triggering the display of the interactive video; secondly, responding to the trigger signal, content generation intention information associated with the trigger signal is acquired, wherein the content generation intention information is related to a target video clip, and the target video clip comprises the video content being played when the trigger signal is acquired; thirdly, an interactive video is acquired according to the content generation intention information, wherein the content of the interactive video is different from the content of the target video; and finally, the interactive video is displayed. Therefore, when a user watches the target video, an interactive video different from the target video can be displayed, which provides a mode of interaction between the user and the video content and solves the problem that the existing video interaction effect is monotonous.
Drawings
FIG. 1 is a flow chart of a video interaction method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating the identification and classification of audio fingerprint features according to an embodiment of the present application;
FIG. 3 is a schematic diagram of recognizing text content in an embodiment of the present application;
FIG. 4 is a second flowchart of a video interaction method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram showing an interactive video in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a video interaction device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a video interaction device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
An embodiment of the present application provides a video interaction method, as shown in fig. 1, including:
step 101, acquiring a trigger signal, wherein the trigger signal is generated when a target video is played, and the trigger signal is used for triggering the display of an interactive video; here, the trigger signal may be specifically a trigger event, so this step may also be described as monitoring the trigger event during the playing process of the target video, where the trigger event is used to trigger the display of the interactive video;
step 102, responding to the trigger signal, and acquiring content generation intention information associated with the trigger signal; the content generation intention information is related to a target video clip, and the target video clip comprises the video content being played when the trigger signal is acquired; correspondingly, step 102 may also be described as obtaining content generation intention information associated with a monitored trigger event in response to the monitored trigger event. Here, the content generation intention information is used to generate or select an interactive video, that is, it guides the generation or selection of the interactive video; optionally, the content generation intention may include the key elements constituting the interactive video. The acquiring of the content generation intention information associated with the trigger signal covers the following two cases. One is acquiring pre-generated content generation intention information corresponding to the trigger signal, where the pre-generated content generation intention information may be content generation intention information generated by the relevant video producers before the target video is played, or content generation intention information generated when other users watched the target video. The other is acquiring content generation intention information generated in real time for the trigger event;
Step 103, acquiring an interactive video according to the content generation intention information, wherein the content of the interactive video is different from the content of the target video; specifically, taking a television drama or a short video as an example, the difference may lie in the direction of the content/plot, for example: the plot of the target video takes a tragic turn (for example, the characters quarrel), while the plot of the interactive video takes a comedic turn (for example, the characters make up). By obtaining an interactive video whose content differs from that of the target video, the content of the target video can be expanded and the user's regret about the target video can be made up for;
and 104, displaying the interactive video.
In the video interaction method, firstly, a trigger signal is obtained, wherein the trigger signal is generated when a target video is played, and the trigger signal is used for triggering the display of the interactive video; secondly, responding to the trigger signal, content generation intention information associated with the trigger signal is acquired, wherein the content generation intention information is related to a target video clip, and the target video clip comprises the video content being played when the trigger signal is acquired; thirdly, an interactive video is acquired according to the content generation intention information, wherein the content of the interactive video is different from the content of the target video; and finally, the interactive video is displayed. Therefore, when a user watches the target video, an interactive video different from the target video can be displayed, which provides a mode of interaction between the user and the video content and solves the problem that the existing video interaction effect is monotonous.
Here, it should be noted that, when the target video is played in full screen, the interactive video may be displayed in a popup window; when the target video is not played in full screen, the interactive video may be displayed outside the playing area of the target video, that is, the display area of the interactive video is located outside the playing area of the target video. Taking fig. 5 as an example, the interactive videos related to the content of the target video played in the current time period can be scrolled on the right side of the playing area (the playing window of the target video), and the user can select, based on personal preference, which of the displayed interactive videos to play. Because the target video and the interactive videos are displayed in different areas, interactive videos whose content differs from that of the target video can be recommended to the user without affecting the playback of the target video; the user can choose the interactive video he or she wants to watch based on his or her own emotional needs, which makes up for the user's regret about the content of the target video and improves the user's sense of participation and enjoyment.
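For illustration only, the four steps above can be sketched as the following minimal Python data flow; every identifier in this sketch (Trigger, ContentIntent, build_intent, make_interactive_video) is an assumption of the illustration and not a name from the disclosure.

```python
"""Minimal sketch of steps 101-104; every identifier here is an assumption
of this illustration, not a name from the disclosure."""
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trigger:                 # step 101: preset position/time reached, or user instruction received
    time_s: float
    instruction: Optional[str] = None


@dataclass
class ContentIntent:           # step 102: <character list, generation scene, generation content>
    characters: List[str]
    scene: str
    content: str


def build_intent(trigger: Trigger, clip_text: str, clip_characters: List[str]) -> ContentIntent:
    # Passive mode: derive the intent from the clip; active mode: derive it from the instruction.
    if trigger.instruction:
        return ContentIntent(clip_characters, "from instruction", trigger.instruction)
    return ContentIntent(clip_characters, "clip background", f"alternative ending for: {clip_text}")


def make_interactive_video(intent: ContentIntent) -> str:
    # Step 103 stand-in: real synthesis would composite character materials into the generation scene.
    return f"[interactive video: {', '.join(intent.characters)} | {intent.scene} | {intent.content}]"


trigger = Trigger(time_s=1834.0)                                  # playback reached a preset "intense plot" moment
intent = build_intent(trigger, "the characters quarrel", ["role A", "role B"])
print(make_interactive_video(intent))                             # step 104 stand-in: display the result
```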
As an alternative implementation, the trigger signal is generated in at least one of the following cases:
the target video is played to a preset position or preset time;
The user inputs instruction information.
Since the trigger signal is equivalent to a trigger event, this optional implementation may also be described as follows:
under the condition that the target video is monitored to be played to a preset position or preset time, determining that the trigger event is detected;
and determining that the trigger event is detected under the condition that the user input instruction information is monitored.
Namely: the trigger signal is a signal/trigger event generated when the target video is played to a preset position or preset time. Taking the target video being a television drama as an example, the preset position or preset time is, for example, a position or time where the plot is relatively intense. The trigger event in the embodiment of the present application may also be the user inputting instruction information; that is, the trigger signal is a signal generated when user-input instruction information is received. In other words, during the playback of the target video, if it is monitored that the target video has been played to a preset position or preset time, or that the user has input instruction information, it is determined that an interaction trigger event is detected/a trigger signal is generated. The implementation in which, when the target video is monitored to reach a preset position or preset time, the corresponding content generation intention information is acquired and the interactive video is further acquired and displayed may be referred to as the passive interaction mode for short; the implementation in which, when user-input instruction information is monitored, the corresponding content generation intention information is acquired and the interactive video is further acquired and displayed may be referred to as the active interaction mode for short.
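A minimal Python sketch of the two trigger conditions follows; the preset times and tolerance are assumptions of this illustration, not values from the disclosure.

```python
# Sketch of trigger-signal generation: playback reaching a preset position/time
# (passive mode) or user-input instruction information (active mode).
from typing import Optional

PRESET_TRIGGER_TIMES_S = [305.0, 1834.0]     # assumed "intense plot" moments
TOLERANCE_S = 0.5

def detect_trigger(playback_time_s: float, user_instruction: Optional[str]) -> Optional[str]:
    if user_instruction:                                      # active interaction mode
        return "active"
    if any(abs(playback_time_s - t) <= TOLERANCE_S for t in PRESET_TRIGGER_TIMES_S):
        return "passive"                                      # passive interaction mode
    return None

print(detect_trigger(1834.2, None))              # -> "passive"
print(detect_trigger(42.0, "let them hug"))      # -> "active"
```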
As described above, the interactive video in the embodiment of the present application may be displayed in a popup window, or may be displayed in an area other than the playing area of the target video. On this basis, if the interaction mode is the passive interaction mode, the interactive video can be displayed in a popup window, that is, when the trigger signal is acquired, the interactive video corresponding to the trigger signal is displayed in a popup window (at this time the target video may or may not be played in full screen). It should be noted that, to avoid a suddenly displayed popup window affecting the user's viewing experience of the target video, the user may enable the passive interaction trigger mode through a first operation before watching the target video, where the first operation may be clicking an on-screen button indicating that the passive interaction trigger mode should be enabled, or selecting "yes" in a displayed window asking whether to enable the passive interaction trigger mode. If the interaction mode is the active interaction mode, the interactive video can be displayed in an area other than the target video playing area, and the user selects a video of interest from the displayed interactive videos.
Still further, as an alternative implementation, before step 102, the method further includes:
And generating the content generation intention information at a first moment or a second moment, wherein the first moment is the moment when the target video is monitored to be played to the preset position or the preset time, and the second moment is the moment before the target video is played. The time before playing the target video may be the time after generating the target video and before playing the target video for the first time (for example, the time during post-production of the target video), or may be the time when other users view a preset position or preset time of the target video, where the time when other users view the preset position or preset time is earlier than the time when the current users view the preset position or preset time; here, the case of generating the content generation intention information at the first time is an implementation of generating the content generation intention information in real time, and the case of generating the content generation intention information at the second time is an implementation of generating the content generation intention information in advance.
As a specific implementation manner, the generating the content generation intention information includes:
determining character information corresponding to the audio data in the target video clip; the length of the target video clip can be determined according to a preset rule; in addition, the step can determine characters corresponding to the audio data with different characteristics based on the characteristics of the audio data;
Determining text content in the target video segment according to the audio data and/or the image data in the target video segment; here, text content in the target video clip may be identified using voice recognition techniques and/or image-text recognition techniques;
determining character information corresponding to the text content according to the time or the position of the text content in the target video segment and the character information corresponding to the audio data; the step may specifically be to match the time or position of occurrence of the text content with the time or position corresponding to the audio data, so as to determine the corresponding relationship between the text content and the audio data, so that the character information corresponding to the text content may be determined based on the character information corresponding to the audio data;
and determining content generation intention information based on the text content, character information corresponding to the text content and the target video segment, wherein the content generation intention is represented by triple information, and the triple information comprises a character list, a generation scene and generation content. For example, the generation scene in this step may be determined based on the content picture in the target video clip, the character list may be determined based on the character information, and the generation content may be determined based on the character information and the text content.
As a more specific implementation manner, the determining the character information corresponding to the audio data in the target video clip includes:
dividing the audio data in the target video segment into a plurality of audio segments, wherein the coincidence degree between two adjacent audio segments is a first numerical value; for example, the first value is 10%, where by setting the coincidence ratio on two adjacent audio pieces, the timing relationship between the different audio pieces can be accurately determined;
extracting original characteristic information of the audio in each audio fragment; for example, this step may extract audio raw feature information in each audio clip based on mel spectrum;
identifying a first target audio fragment with dialogue content in each audio fragment according to the audio original characteristic information; this step may specifically be based on a pre-trained dialogue event detection model, where the dialogue event detection model is a model constructed based on a neural network algorithm. Here, the audio segments are divided into speaking audio segments and non-speaking audio segments, and the subsequent recognition can be performed only on the speaking audio segments, which improves the accuracy and efficiency of text content recognition;
Extracting audio fingerprint characteristics from the audio original characteristic information of the first target audio fragment, and classifying the audio fingerprint characteristics according to the similarity of the audio fingerprint characteristics; the step may specifically use an audio fingerprint extraction algorithm to extract audio fingerprint features in the first target audio piece;
determining character information corresponding to various audio fingerprint features according to the corresponding relation between the pre-stored characters and the audio fingerprint features; here, the correspondence between the person and the audio fingerprint feature may be a correspondence between a role player and the audio fingerprint feature, or may be a correspondence between a role name and the audio fingerprint feature in the target video, based on which, in the embodiment of the present application, the correspondence between the person and the audio fingerprint feature needs to be obtained in advance, where the step of obtaining the correspondence may be: carrying out identity recognition on the person in the target video by adopting a person identity recognition algorithm, determining each role player, and determining the role name corresponding to each role player based on the cast of the target video; then, based on the audio features of the different persons stored in the speech recognition library, the correspondence between the audio features and the role player/role names is determined.
In this specific implementation manner, firstly, audio data in a target video clip is divided into a plurality of audio clips, and audio original feature information is extracted for each audio clip, so that the audio clips are classified into speaking audio clips (audio clips with dialogue content) and non-speaking audio clips (audio clips without dialogue content) based on the audio original feature information, and secondly, audio fingerprint feature recognition is performed for the clips with dialogue content, so that characters corresponding to various audio fingerprint features are determined.
As another more specific implementation manner, determining the character information corresponding to the text content according to the time or the position of the text content in the target video segment and the character information corresponding to the audio data includes:
acquiring a second target audio segment according to the time or the position of the text content in the target video segment; here, the second target audio piece is an audio piece corresponding to the text content;
acquiring target audio fingerprint characteristics corresponding to the text content from the audio fingerprint characteristics of the second target audio fragment, wherein the time or the position corresponding to the target audio fingerprint characteristics is the same as the time or the position corresponding to the text content;
According to the character information corresponding to the target audio fingerprint characteristics, determining character information corresponding to the text content; here, different text contents in the target video clip correspond to different characters.
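A small Python sketch of this time-based matching follows, under an assumed segment format of (start, end, speaker label); it is an illustration, not the method of the disclosure.

```python
# Match a piece of recognised text to a speaker: find the audio segment whose
# time span covers the text's timestamp and reuse that segment's speaker label.
from typing import List, Optional, Tuple

AudioSeg = Tuple[float, float, str]        # (start_s, end_s, speaker_label) - assumed format

def speaker_for_text(text_time_s: float, segments: List[AudioSeg]) -> Optional[str]:
    for start, end, speaker in segments:
        if start <= text_time_s <= end:
            return speaker
    return None

segments = [(10.0, 13.0, "P1"), (12.7, 15.7, "P2")]   # 3 s segments with 10% overlap
print(speaker_for_text(14.0, segments))               # -> "P2"
```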
As still another specific implementation manner, the determining text content corresponding to the target video segment according to the audio data and/or the image data in the target video segment includes any one of the following:
under the condition that the target video segment comprises the audio data, recognizing the audio data by adopting a voice recognition algorithm to acquire text content corresponding to the target video segment; that is, in the case where only audio data is included in the target video clip and no subtitle is included, text content may be determined according to the audio recognition result;
and under the condition that the target video segment comprises the audio data and the image data, identifying the audio data to obtain first text content corresponding to the audio data, identifying the image data to obtain second text content corresponding to the image data, checking the first text content by using the second text content, and obtaining text content corresponding to the video segment according to a checking result, wherein a text character string matching method can be used for checking to improve the accuracy of text content identification.
As still another specific implementation manner, the determining the content generation intention information based on the text content, the character information corresponding to the text content, and the target video clip includes:
based on the text content and character information corresponding to the text content, utilizing a natural language processing algorithm to acquire four-tuple information, wherein the four-tuple information comprises the number of characters, the character names, the character relationship and the character emotion; here, the number of characters is the total number of characters corresponding to the text content, a character name can be a role name or the name of the actor playing the role, and when there are multiple characters, there is a character relationship among them (such as friends, family, colleagues, enemies, and the like);
determining a generated scene according to the image data corresponding to the text content in the target video segment; as previously described, the generated scene may be a content picture in the target video clip;
determining a character list and generating contents according to the four-tuple information; the character list can be determined based on character information corresponding to text content, and the generated content can be determined based on a preset scenario scene represented by the four-element group;
Determining content generation intention information according to the generation scene, the character list and the generation content, namely: the content generation intention information is represented by a triplet composed of a person list, a generation scene, and a generation content.
The implementation of the foregoing alternative implementations (implementation of passive interaction) is illustrated below.
(I) Determining character information corresponding to the audio data in a target video clip
(1) Audio data preprocessing: splitting the audio data into segments with the length of 3 seconds, wherein the adjacent segments have the overlap ratio of 10%;
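A minimal sketch of this preprocessing step is given below (NumPy only); the exact framing beyond the 3-second length and 10% overlap is an assumption of the illustration.

```python
# Sketch of step (1): split mono audio samples into 3-second windows with a
# 10% overlap between adjacent windows (hop = 2.7 s).
import numpy as np

def split_audio(samples: np.ndarray, sr: int, win_s: float = 3.0, overlap: float = 0.10):
    win = int(win_s * sr)
    hop = int(win * (1.0 - overlap))          # 10% overlap -> hop of 2.7 s
    return [samples[i:i + win] for i in range(0, max(len(samples) - win, 0) + 1, hop)]

sr = 16_000
audio = np.random.randn(sr * 10)              # 10 s of dummy audio
chunks = split_audio(audio, sr)
print(len(chunks), [len(c) / sr for c in chunks[:3]])   # segment count and lengths in seconds
```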
(2) Audio clip classification: firstly, a Mel spectrum is used to extract the audio original characteristic information of each audio fragment; secondly, according to the audio original characteristic information, it is identified whether speech with dialogue content exists in the audio clip (specifically, the audio original characteristic information is input into a pre-trained dialogue event detection model, and the dialogue event detection model identifies, from the input audio original characteristic information, whether speech with dialogue content exists in the audio clip); finally, the audio fragments are divided into speaking audio fragments and non-speaking audio fragments according to the identification result. In this way, irrelevant audio scenes (scenes without dialogue) can be filtered out, improving the accuracy and efficiency of text content recognition;
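A hedged sketch of this classification step follows: the Mel features use librosa, while the dialogue event detection model is replaced by a simple energy threshold, since the trained neural-network model itself is not published.

```python
# Sketch of step (2): extract a Mel spectrogram per segment and feed it to a
# dialogue-event detector (here a stand-in energy threshold).
import numpy as np
import librosa

def mel_features(segment: np.ndarray, sr: int) -> np.ndarray:
    mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_mels=64)
    return librosa.power_to_db(mel, ref=np.max)           # audio original characteristic information

def has_dialogue(mel_db: np.ndarray, energy_threshold_db: float = -40.0) -> bool:
    # Placeholder for the pre-trained dialogue event detection model.
    return float(mel_db.mean()) > energy_threshold_db

sr = 16_000
segment = 0.1 * np.random.randn(sr * 3)                    # dummy 3 s segment
print(has_dialogue(mel_features(segment, sr)))             # speaking vs. non-speaking decision
```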
(3) Speaker identity matching and identification: specifically, an audio fingerprint extraction algorithm is used for extracting 512-dimensional audio fingerprint features from each audio segment, and the speaker identity grouping is realized by calculating the similarity of the audio fingerprint features in the target video segment; the specific implementation of this identification process is described below with reference to fig. 2:
a) Initialize the speaker identity set to be empty, and store the voice features of the first recognized speaker in feature set P1;
b) For each subsequent speaker's audio fragment, if the similarity between its extracted audio fingerprint feature and feature set P1 satisfies the similarity condition, the audio fragment belongs to the same speaker identity as the P1 set, and the extracted fingerprint feature is added to the P1 set;
c) If step b) does not match any established speaker identity set, that is, the audio fingerprint feature of a given sound fragment and the set P1 do not satisfy the similarity condition, the sound fragment belongs to another speaker identity, and its audio fingerprint feature is stored in set P2;
d) Steps b) and c) are executed in sequence for the remaining fragments; in the end, the audio fragments whose audio fingerprint features fall into the same set Pn belong to the same speaker identity, and different Pn sets correspond to different speaker identities.
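A sketch of this incremental grouping is given below; the 512-dimensional fingerprints are from the text above, while cosine similarity and the 0.8 threshold are assumptions, as the disclosure does not specify its similarity condition.

```python
# Sketch of steps a)-d): assign each dialogue segment's fingerprint to an
# existing speaker set Pn if it is similar enough, otherwise open a new set.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def group_speakers(fingerprints: list, threshold: float = 0.8) -> list:
    groups = []                                  # groups[n] holds segment indices for set P(n+1)
    centroids = []
    for idx, fp in enumerate(fingerprints):
        scores = [cosine(fp, c) for c in centroids]
        if scores and max(scores) >= threshold:  # step b): similar enough -> same speaker, expand Pn
            n = int(np.argmax(scores))
            groups[n].append(idx)
            centroids[n] = np.mean([fingerprints[i] for i in groups[n]], axis=0)
        else:                                    # step c): no match -> a new speaker identity set
            groups.append([idx])
            centroids.append(fp.copy())
    return groups

rng = np.random.default_rng(0)
a, b = rng.normal(size=512), rng.normal(size=512)
prints = [a, a + 0.05 * rng.normal(size=512), b]   # two segments of one speaker, one of another
print(group_speakers(prints))                      # -> [[0, 1], [2]]
```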
(4) Video text content extraction and calibration: specifically, an image character recognition technology and a voice recognition technology are adopted to extract text content in a video from two dimensions of an image and a voice, and the voice recognition result is calibrated according to the image recognition result, so that the accuracy of text recognition is improved. Calibrating a text recognition result based on the image recognition and voice recognition results, wherein the method mainly comprises the following steps:
a) If no speech content is recognized, the speaker identity cannot be identified from the voice features, and this scene is not processed;
b) If no subtitle content is recognized, the text content is taken mainly from the speech recognition result;
c) If both speech and subtitle content are recognized, the speech recognition result is calibrated with the image recognition result based on a text string matching method (e.g., common substring matching), as shown in fig. 3. The speaker identity corresponding to each piece of text in this scene is based on the speech recognition result.
Here, it should be noted that, during text recognition, the video time corresponding to each piece of text is recorded and associated with the speaker identity recognition result, so as to obtain the speaker identity corresponding to each piece of text. The speaker identity is then correlated with the previously executed person identity recognition result to obtain the person identity corresponding to each piece of text.
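A sketch of the calibration rule in case c) follows, using longest-common-substring matching from the Python standard library; the rule of preferring the subtitle text when the overlap is large is an assumed simplification of the calibration described above.

```python
# Sketch of step (4)c): calibrate the speech-recognition (ASR) text with the
# subtitle (OCR) text via longest-common-substring matching.
from difflib import SequenceMatcher

def calibrate(asr_text: str, ocr_text: str, min_overlap: float = 0.5) -> str:
    match = SequenceMatcher(None, asr_text, ocr_text).find_longest_match(
        0, len(asr_text), 0, len(ocr_text))
    overlap = match.size / max(len(asr_text), 1)
    return ocr_text if overlap >= min_overlap else asr_text

print(calibrate("we will meat at dusk", "we will meet at dusk"))      # -> subtitle text wins
print(calibrate("completely different line", "unrelated subtitle"))   # -> ASR text kept when overlap is weak
```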
(5) Extracting content generation intent based on scenario changes: specifically, based on the text content extracted in the previous step, character names, character relationships and emotional changes are extracted using a natural language processing algorithm, and quadruples of <number of characters, character names, character relationship, character emotion> are constructed; corresponding content generation intents are defined for the different scenario scenes represented by different quadruples, and content generation is then carried out according to the generation intent. The following key scenario event scenes can be predefined, and the scene recognition range can be expanded without limit as the algorithm's recognition capability grows:
number of people: the number of people contained in the scene is determined according to the speaker identification results in the context; for example, the embodiment of the present application may cover single-person and two-person scenes;
character relationship: the character relationships of the embodiments of the present application may include, but are not limited to, single person, friend, family, lover, hostile, and the like relationships;
character emotion: the emotion of a person identified by embodiments of the present application may include, but is not limited to, surprise, anger, sadness, joy, fear, mind, disgust, and the like.
According to the number of persons, the relationship between persons, and the emotion of persons, the content generation intention extraction rules shown in the following table 1 are defined:
TABLE 1
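Table 1 itself is not reproduced in this text; the rules in the sketch below are therefore a purely illustrative stand-in for mapping a quadruple to a content generation intention triple, not the rules actually defined in the application.

```python
# Illustrative stand-in for Table 1: hand-written rules from the
# <number of characters, names, relationship, emotion> quadruple to a
# <character list, generation scene, generation content> triple.
def intent_from_quadruple(count: int, names: list, relation: str, emotion: str) -> dict:
    if count == 2 and relation == "lover" and emotion == "sadness":
        content = "the two characters reconcile and hug"          # comedic twist on a sad scene
    elif count == 2 and emotion == "anger":
        content = "the two characters shake hands and make peace"
    else:
        content = "the characters celebrate together"
    return {"character list": names,
            "generation scene": "picture from the current clip",
            "generation content": content}

print(intent_from_quadruple(2, ["role A", "role B"], "lover", "sadness"))
```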
(6) The content generation intention information extracted by the above flow is represented by a <character list, generation scene, generation content> triple; the triple is stored, and content is generated based on the triple information. In the passive mode, a relevant content picture in the video is selected as the generation background for the generation scene.
Here, it should be noted that, as described above, the interactive video in the passive interaction mode may be a pre-generated interactive video (different plot directions generated by the video producer or other viewers), or an interactive video generated in real time according to the playing position or time of the target video. Among the generated interactive videos there may therefore be one that meets the current viewer's emotional needs, but the accuracy, flexibility and degree of participation offered to the current viewer may still fall short to some extent. Based on this, as another alternative implementation, the obtaining of the content generation intention information associated with the trigger signal includes:
acquiring character information, a generation scene and generation content from instruction information input by a user. Specifically: based on the instruction information input by the user, the character information contained in the instruction is extracted, so that the extracted characters serve as the main objects of subsequent content generation and guide the subsequent content generation operation; the requirements on the generated content background contained in the instruction information are extracted, such as rain, snow, dusk, night and the like; and the requirements on the generated content contained in the instruction information are extracted, such as smiling, dancing, celebrating, crying, slowing down, trembling, hugging, kissing, holding hands, patting the head, gazing, quarrelling and the like;
Generating the content generation intention information according to a character list, the generation scene and the generation content, wherein the character list comprises the character information acquired from instruction information input by the user.
That is to say, when the interaction trigger event is the user inputting instruction information, the user can control the video content through the instruction information, for example controlling a character's expression and actions, and content generation intention information related to the interactive video the user expects is generated according to the instruction. The generated interactive video therefore matches the user's real preferences, and the user's sense of interactive participation is stronger.
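A minimal sketch of active-mode intent extraction from a free-text instruction is given below; the keyword lists and example names are assumptions of this illustration, whereas the application itself relies on named-entity extraction backed by prior person identification.

```python
# Sketch of active-mode parsing: pull character names, a scene requirement and
# a content requirement out of a user instruction.
SCENES = ["rain", "snow", "dusk", "night"]
ACTIONS = ["hug", "kiss", "dance", "celebrate", "hold hands", "smile"]

def parse_instruction(instruction: str, known_characters: list) -> dict:
    text = instruction.lower()
    characters = [c for c in known_characters if c.lower() in text]
    scene = next((s for s in SCENES if s in text), "current clip background")
    content = next((a for a in ACTIONS if a in text), "free interaction")
    return {"character list": characters, "generation scene": scene, "generation content": content}

print(parse_instruction("Let Li Lei and Han Meimei hug at dusk",
                        known_characters=["Li Lei", "Han Meimei"]))
```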
Further, as an optional implementation, before step 103, the method further includes:
acquiring, in the target video segment, the character materials corresponding to the character list in the content generation intention; this step may be executed after the content generation intention information associated with the interaction trigger event has been generated at the first moment or the second moment. Specifically, this step may, based on the pre-identified person identities, person roles and positions of the persons in the picture, perform person matting in the target video segment according to the character list, and collect the character materials on which content generation depends;
and synthesizing the interactive video according to the character materials and the generation scene and generation content in the content generation intention information, wherein the interactive video may be an image, a short video, an animated picture or the like.
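A heavily simplified OpenCV sketch of material collection and synthesis follows: the listed characters are "matted" by copying their bounding-box regions onto the generation-scene background. The box coordinates are assumed inputs from person detection; this is an illustration, not the synthesis pipeline of the disclosure.

```python
# Sketch: paste listed characters from a source frame onto a background frame.
import numpy as np
import cv2

def composite(frame: np.ndarray, boxes: dict, characters: list, background: np.ndarray) -> np.ndarray:
    canvas = background.copy()
    for name in characters:
        x, y, w, h = boxes[name]                             # position info from person detection
        canvas[y:y + h, x:x + w] = frame[y:y + h, x:x + w]   # crude "matting": copy the person region
    return canvas

frame = np.full((360, 640, 3), 128, np.uint8)                # dummy source frame
background = np.zeros((360, 640, 3), np.uint8)               # dummy generation-scene background
out = composite(frame, {"role A": (100, 80, 120, 200)}, ["role A"], background)
cv2.imwrite("interactive_frame.png", out)                    # one synthesised frame of the interactive content
```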
Further, as an alternative implementation, the method further includes:
determining position information of each person in the target video in each image data of the target video; the step can be based on an image recognition algorithm, and the position information of each person in the image is recognized;
identifying the person identity of each person in the target video by using a person identity identification algorithm; here, the person identity recognition algorithm is, for example, face recognition, portrait recognition, etc., and the person identity recognized here is, for example, the name of the role player;
determining the role names corresponding to the identities of the characters according to the role table of the target video;
The above steps can determine the correspondence between role names and role actors in the target video, thus providing a basis for determining the specific character information (role name/role actor) corresponding to the text content or the audio content; in addition, they can determine the position information of each person in each frame image, thus providing support for the subsequent acquisition of character materials.
Based on the above steps, the acquiring, in the target video clip, of the character materials corresponding to the character list in the content generation intention includes:
and acquiring the character materials corresponding to the character list according to the position information, the character identity and the character name corresponding to the character identity.
Here, it should be noted that the person identity and person role recognition steps in this alternative implementation may be performed before the content generation intention information is generated; they may be performed in advance, for example after the target video has been produced or before the target video is played; or they may be performed in real time, for example, when the interaction trigger event is detected, the person identity and person role are recognized, the content generation intention information is then generated, and the interactive video is further generated and displayed according to the content generation intention information.
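A sketch of the identity-to-role mapping used when collecting materials is given below; the cast table and detection results are assumed inputs of the illustration, produced in the application by image recognition and person identity recognition.

```python
# Sketch: map detected actors to role names via the role (cast) table, then
# keep only the materials for characters in the intent's character list.
CAST_TABLE = {"Actor X": "Role A", "Actor Y": "Role B"}     # role table of the target video (assumed)

def materials_for_character_list(detections: list, character_list: list) -> list:
    materials = []
    for det in detections:                                   # det: {"actor": ..., "frame": ..., "box": ...}
        role = CAST_TABLE.get(det["actor"])
        if role in character_list:
            materials.append({"role": role, "frame": det["frame"], "box": det["box"]})
    return materials

detections = [{"actor": "Actor X", "frame": 1201, "box": (100, 80, 120, 200)},
              {"actor": "Actor Y", "frame": 1201, "box": (300, 90, 110, 190)}]
print(materials_for_character_list(detections, ["Role A"]))
```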
In the video interaction method of the embodiment of the present application, as shown in fig. 4, an active interaction mode and a passive interaction mode are provided. In the passive interaction mode, the video text content is extracted (e.g., by speech recognition, subtitle recognition, etc. in fig. 4), scenario changes are identified using a natural language processing algorithm, key information such as core characters, events and emotions is extracted, the content generation scene is defined (corresponding to the video scenario analysis and content generation intention steps in fig. 4), and content generation is performed according to the generation intent extracted from the scenario content (corresponding to the content generation step in fig. 4). In the active interaction mode, the user sends a content generation instruction; the person names, role names, generation scene, content generation requirements and the like contained in the user instruction are extracted (corresponding to the interaction instruction analysis and named entity extraction steps in fig. 4, where named entity extraction relies on the result of the previously performed person identification), and the generation intent is extracted to perform content generation (corresponding to the content generation step in fig. 4). After the content (interactive video) has been generated, the content generation result is presented. In the passive interaction mode, by analysing the scenario content, content recommendations that conform to the plot are generated for the user at highlight moments, realizing a passive interaction experience. In the active interaction mode, the user actively sends relevant instructions to control the video content, such as controlling character expressions and actions, and relevant content is generated according to the instructions, such as characters hugging or celebrating; the generated content better matches the user's real preferences, and interactive participation is stronger.
The video interaction method combines audio and video processing, image recognition, natural language processing, deep learning and other methods. The scenario content can be analysed algorithmically and a content generation intent extracted to generate content, or the user can send an instruction and content is generated according to the instruction content. The generated content can be an expansion of the scenario content and an expression of the user's emotions; it can make up for the user's regret about the plot and improve the user's sense of participation and enjoyment.
The embodiment of the application further provides a video interaction device, as shown in fig. 6, including:
the first obtaining module 601 is configured to obtain a trigger signal, where the trigger signal is generated when a target video is played, and the trigger signal is used to trigger display of an interactive video;
a second obtaining module 602, configured to obtain content generation intention information associated with the trigger signal in response to the trigger signal; the content generation intention information is related to a target video clip, and the target video clip comprises video content played when the trigger signal is acquired;
a third obtaining module 603, configured to obtain an interactive video according to the content generation intention information, where the content of the interactive video is different from the content of the target video;
And the display module 604 is used for displaying the interactive video.
Optionally, the trigger signal is generated in at least one of the following cases:
the target video is played to a preset position or preset time;
the user inputs instruction information.
Optionally, the apparatus further comprises:
the generation module is used for generating the content generation intention information at a first moment or a second moment, wherein the first moment is the moment when the target video is monitored to be played to a preset position or a preset time, and the second moment is the moment before the target video is played.
Optionally, the generating module includes:
a first determining submodule, configured to determine character information corresponding to audio data in the target video clip;
a second determining submodule, configured to determine text content in the target video segment according to the audio data and/or the image data in the target video segment;
a third determining submodule, configured to determine character information corresponding to the text content according to time or position of occurrence of the text content in the target video segment and character information corresponding to the audio data;
and a fourth determining sub-module, configured to determine, based on the text content, character information corresponding to the text content, and the target video clip, content generation intention information, where the content generation intention is represented by triple information, and the triple information includes a character list, a generation scene, and a generation content.
Optionally, the first determining submodule includes:
the dividing unit is used for dividing the audio data in the target video segment into a plurality of audio segments, wherein the coincidence ratio between two adjacent audio segments is a first numerical value;
the extraction unit is used for extracting the original characteristic information of the audio in each audio fragment;
an identifying unit, configured to identify a first target audio segment having dialogue content in each of the audio segments according to the audio original feature information;
the processing unit is used for extracting audio fingerprint characteristics from the first target audio fragment and classifying the audio fingerprint characteristics according to the similarity of the audio fingerprint characteristics;
and the first determining unit is used for determining character information corresponding to various audio fingerprint characteristics according to the corresponding relation between the pre-stored characters and the audio fingerprint characteristics.
Optionally, the third determining submodule includes:
a first obtaining unit, configured to obtain a second target audio segment according to a time or a position of the text content in the target video segment;
the second acquisition unit is used for acquiring target audio fingerprint characteristics corresponding to the text content from the audio fingerprint characteristics of the second target audio fragment;
And the second determining unit is used for determining the character information corresponding to the text content according to the character information corresponding to the target audio fingerprint characteristics.
Optionally, the second determining submodule includes:
the first recognition unit is used for recognizing the audio data by adopting a voice recognition algorithm under the condition that the target video segment comprises the audio data, and acquiring text content corresponding to the target video segment;
and the second recognition unit is used for recognizing the audio data to obtain first text content corresponding to the audio data, recognizing the image data to obtain second text content corresponding to the image data, checking the first text content by using the second text content and obtaining the text content corresponding to the video segment according to a checking result when the audio data and the image data are included in the target video segment.
Optionally, the third determining submodule includes:
a third obtaining unit, configured to obtain, based on the text content and character information corresponding to the text content, four-tuple information by using a natural language processing algorithm, where the four-tuple information includes a number of characters, a name of the characters, a relationship of the characters, and emotion of the characters;
A third determining unit, configured to determine a generated scene according to image data corresponding to the text content in the target video segment;
a fourth determining unit, configured to determine a character list and generate content according to the four-tuple information;
and a fifth determining unit configured to determine the content generation intention information according to the generation scene, the character list, and the generation content.
Optionally, the second obtaining module 602 includes:
the acquisition sub-module is used for acquiring character information, a generated scene and generated content in the instruction information input by the user;
and the generation submodule is used for generating the content generation intention information according to a character list, the generation scene and the generation content, wherein the character list comprises the character information acquired from instruction information input by the user.
Optionally, the apparatus further comprises:
a fourth acquisition module, configured to acquire, in the target video clip, character materials corresponding to a character list in the content generation intention;
and the synthesis module is used for synthesizing the interactive video according to the character materials, and the generation scene and generation content in the content generation intention information.
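As a rough illustration of the synthesis step, the sketch below composites character material clips over a scene clip and burns the generation content in as a caption using moviepy 1.x. The library choice, layout and file paths are assumptions; the application does not prescribe a particular synthesis technique beyond what is stated above.

```python
# Illustrative compositing only (assumes moviepy 1.x and, for TextClip,
# an ImageMagick installation); not the application's generator.
from moviepy.editor import VideoFileClip, CompositeVideoClip, TextClip


def synthesize_interactive_video(scene_path: str,
                                 character_paths: list[str],
                                 generation_content: str,
                                 out_path: str = "interactive.mp4") -> None:
    background = VideoFileClip(scene_path)
    overlays = []
    for i, path in enumerate(character_paths):
        clip = (VideoFileClip(path)
                .resize(height=background.h // 3)
                .set_position((i * background.w // 4, background.h * 2 // 3))
                .set_duration(background.duration))
        overlays.append(clip)
    caption = (TextClip(generation_content, fontsize=36, color="white")
               .set_position(("center", "top"))
               .set_duration(background.duration))
    CompositeVideoClip([background, *overlays, caption]).write_videofile(out_path)
```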
Optionally, the apparatus further comprises:
a first determining module, configured to determine position information of each person in the target video in each image data of the target video;
the identification module is used for identifying the person identity of each person in the target video by utilizing a person identity identification algorithm;
the second determining module is used for determining the role names corresponding to the identities of the characters according to the role table of the target video;
the fourth acquisition module is specifically configured to: acquire the character materials corresponding to the character list according to the position information, the character identities and the character names corresponding to the character identities.
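The sketch below illustrates one possible realization of this pipeline: detect people per frame, match them against reference face encodings to recognize identities, and map each identity to a role name through the role table. The face_recognition library, the 0.6 distance threshold and the data layout are assumptions of this sketch.

```python
# Illustrative per-frame character localization and naming; `frame` is an RGB
# image as a numpy array, `reference_encodings` maps actor identity to a known
# face encoding, and `cast_table` maps actor identity to role name.
import numpy as np
import face_recognition


def locate_characters(frame: np.ndarray,
                      reference_encodings: dict[str, np.ndarray],
                      cast_table: dict[str, str]) -> list[dict]:
    results = []
    boxes = face_recognition.face_locations(frame)
    encodings = face_recognition.face_encodings(frame, boxes)
    for box, enc in zip(boxes, encodings):
        best_name, best_dist = None, float("inf")
        for identity, ref in reference_encodings.items():
            dist = float(np.linalg.norm(ref - enc))
            if dist < best_dist:
                best_name, best_dist = identity, dist
        if best_name is not None and best_dist < 0.6:  # library's usual tolerance
            results.append({
                "position": box,  # (top, right, bottom, left)
                "identity": best_name,
                "role": cast_table.get(best_name, best_name),
            })
    return results
```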
It should be noted that the video interaction device provided in this embodiment of the present application can implement all of the method steps of the video interaction method embodiment and achieve the same technical effects; details and beneficial effects that are identical to those of the method embodiment are not repeated here.
The embodiment of the application further provides a video interaction device, as shown in fig. 7, including a transceiver 710, a processor 700, a memory 720, and a program or instructions stored on the memory 720 and executable on the processor 700; the processor 700 implements the video interaction method described above when executing the program or instructions.
The transceiver 710 is configured to receive and transmit data under the control of the processor 700.
In fig. 7, the bus architecture may comprise any number of interconnected buses and bridges, linking together one or more processors represented by the processor 700 and various memory circuits represented by the memory 720. The bus architecture may also link together various other circuits, such as peripheral devices, voltage regulators and power management circuits, which are well known in the art and therefore are not described further herein. The bus interface provides an interface. The transceiver 710 may be a plurality of elements, i.e. including a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium. For different user equipment, the user interface 730 may also be an interface capable of externally or internally connecting to required devices, including but not limited to a keypad, a display, a speaker, a microphone, a joystick, and the like.
The processor 700 is responsible for managing the bus architecture and general processing, and the memory 720 may store data used by the processor 700 in performing operations.
The embodiment of the application further provides a readable storage medium on which a program is stored; when the program is executed by a processor, each process of the video interaction method embodiment described above is implemented and the same technical effects can be achieved, which is not repeated here to avoid repetition. The readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Finally, it should also be noted that relational terms such as first and second are used herein solely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between such entities or operations. Moreover, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article or terminal device. Without further limitation, an element defined by the phrase "comprising one …" does not exclude the presence of other identical elements in the process, method, article or terminal device that comprises the element.
The foregoing describes the preferred embodiments of the present application. It should be noted that those skilled in the art may make various improvements and modifications without departing from the principles set forth herein, and such improvements and modifications shall also fall within the scope of protection of the present application.

Claims (10)

1. A method of video interaction, comprising:
acquiring a trigger signal, wherein the trigger signal is generated when a target video is played, and the trigger signal is used for triggering the display of the interactive video;
acquiring content generation intention information associated with the trigger signal in response to the trigger signal; the content generation intention information is related to a target video clip, and the target video clip comprises video content played when the trigger signal is acquired;
acquiring an interactive video according to the content generation intention information, wherein the content of the interactive video is different from the content of the target video;
and displaying the interactive video.
2. The method of claim 1, wherein the trigger signal is generated in at least one of the following cases:
the target video is played to a preset position or preset time;
the user inputs instruction information.
3. The method of claim 2, wherein before the acquiring content generation intention information associated with the trigger signal, the method further comprises:
and generating the content generation intention information at a first moment or a second moment, wherein the first moment is the moment when the target video is monitored to be played to the preset position or the preset time, and the second moment is the moment before the target video is played.
4. The method of claim 3, wherein the generating the content generation intention information comprises:
determining character information corresponding to the audio data in the target video clip;
determining text content in the target video segment according to the audio data and/or the image data in the target video segment;
determining character information corresponding to the text content according to the time or the position of the text content in the target video segment and the character information corresponding to the audio data;
and determining content generation intention information based on the text content, character information corresponding to the text content and the target video segment, wherein the content generation intention is represented by triple information, and the triple information comprises a character list, a generation scene and generation content.
5. The method of claim 4, wherein the determining the content generation intention information based on the text content, the character information corresponding to the text content, and the target video segment comprises:
acquiring four-tuple information by using a natural language processing algorithm based on the text content and the character information corresponding to the text content, wherein the four-tuple information comprises the number of characters, the names of the characters, the relations of the characters and the emotions of the characters;
determining a generation scene according to the image data corresponding to the text content in the target video segment;
determining a character list and generation content according to the four-tuple information;
and determining the content generation intention information according to the generation scene, the character list and the generation content.
6. The method of claim 2, wherein the acquiring content generation intention information associated with the trigger signal comprises:
acquiring character information, a generation scene and generation content in instruction information input by a user;
generating the content generation intention information according to a character list, the generation scene and the generation content, wherein the character list comprises the character information acquired from instruction information input by the user.
7. The method of claim 1, wherein before the acquiring an interactive video according to the content generation intention information, the method further comprises:
acquiring, in the target video segment, character materials corresponding to the character list in the content generation intention information;
and synthesizing the interactive video according to the character materials, and the generation scene and generation content in the content generation intention information.
8. A video interactive apparatus, comprising:
the first acquisition module is used for acquiring a trigger signal, wherein the trigger signal is generated when a target video is played, and the trigger signal is used for triggering the display of the interactive video;
the second acquisition module is used for responding to the trigger signal and acquiring content generation intention information associated with the trigger signal; the content generation intention information is related to a target video clip, and the target video clip comprises video content played when the trigger signal is acquired;
the third acquisition module is used for acquiring an interactive video according to the content generation intention information, wherein the content of the interactive video is different from the content of the target video;
and the display module is used for displaying the interactive video.
9. A video interactive apparatus, comprising: a transceiver, a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor implements the video interaction method of any of claims 1 to 7.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a program, which when executed by a processor, implements the video interaction method according to any of claims 1 to 7.
CN202311719185.7A 2023-12-14 2023-12-14 Video interaction method, device, equipment and readable storage medium Pending CN117714771A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311719185.7A CN117714771A (en) 2023-12-14 2023-12-14 Video interaction method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311719185.7A CN117714771A (en) 2023-12-14 2023-12-14 Video interaction method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN117714771A true CN117714771A (en) 2024-03-15

Family

ID=90147393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311719185.7A Pending CN117714771A (en) 2023-12-14 2023-12-14 Video interaction method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN117714771A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination