CN116962787A - Interaction method, device, equipment and storage medium based on video information

Publication number: CN116962787A
Authority: CN (China)
Prior art keywords: video, text, vectorized, target, data
Legal status: Pending
Application number: CN202310949052.2A
Other languages: Chinese (zh)
Inventors: Gao Yan (高岩), Zhang Zheng (张铮), Guo Dongsheng (郭冬升), Jiang Kai (姜凯), Wang Guangxin (王光鑫)
Current assignee: Shandong Inspur Science Research Institute Co Ltd
Original assignee: Shandong Inspur Science Research Institute Co Ltd
Application filed by Shandong Inspur Science Research Institute Co Ltd; priority to CN202310949052.2A

Classifications

    • H04N 21/431 - Generation of visual interfaces for content selection or interaction; content or additional data rendering
    • H04N 21/4394 - Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/44008 - Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • G06F 16/7834 - Retrieval of video data using metadata automatically derived from the content, using audio features
    • G06F 16/7844 - Retrieval of video data using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F 16/7847 - Retrieval of video data using metadata automatically derived from the content, using low-level visual features of the video content
    • G06V 10/761 - Proximity, similarity or dissimilarity measures

Abstract

The application discloses an interaction method, device, equipment and storage medium based on video information, relating to the field of natural language processing, and comprising the following steps: separating a video file to be processed, and preprocessing the resulting audio data and video picture data to obtain a plurality of audio paragraph texts and a plurality of video picture texts; vectorizing the audio paragraph texts and the video picture texts, and storing the vectorized data in a preset vector database; vectorizing a received question text, and performing similarity matching over the vectorized data based on the vectorized question text to determine target vectorized data; and inputting the target vectorized data and the vectorized question text into a preset language model to generate an answer text, which is presented in a video playing window of the video file to be processed. In this way, question-and-answer interaction based on video information can be realized, improving both the experience and the efficiency with which users watch videos and acquire information.

Description

Interaction method, device, equipment and storage medium based on video information
Technical Field
The present invention relates to the field of natural language processing, and in particular, to an interaction method, apparatus, device, and storage medium based on video information.
Background
With the development of video websites and self-media, the number of course-learning videos and conference lecture videos is growing rapidly, and a large number of users learn professional course knowledge and follow cutting-edge industry technology through such videos. With the development of artificial intelligence technology, and especially the recent breakthroughs in speech recognition, natural language processing and large-scale language models, the structuring of video information has become broadly feasible, accelerating the transmission of video information. In the prior art, however, the way video information is processed can leave the video knowledge used in question-and-answer interaction low in density and poorly structured, creating a bottleneck in information transfer.
Disclosure of Invention
Accordingly, the present invention aims to provide an interaction method, apparatus, device and storage medium based on video information that can match question text input by a user against text data obtained from the video, obtain an answer text corresponding to the question text, and present that answer to the user. Question-and-answer interaction based on video information can thus be realized, improving the experience and efficiency with which users watch videos and acquire information. The specific scheme is as follows:
In a first aspect, the application discloses an interaction method based on video information, which is applied to a video playing client, and comprises the following steps:
performing a separation operation on a video file to be processed, and preprocessing the audio data and video picture data obtained through the separation operation to obtain a plurality of audio paragraph texts corresponding to the audio data and a plurality of video picture texts corresponding to the video picture data;
vectorizing the audio paragraph texts and the video picture texts, and storing the vectorized data in a preset vector database;
vectorizing the received question text, and performing similarity matching on the vectorized data stored in the preset vector database based on the obtained vectorized question text, so as to determine target vectorized data corresponding to the vectorized question text;
and inputting the target vectorized data and the vectorized question text into a preset language model to generate an answer text corresponding to the question text, and presenting the answer text in a video playing window of the video file to be processed.
Optionally, preprocessing the audio data includes:
performing speech recognition on the audio data to obtain an audio text corresponding to the audio data;
performing paragraph division on the audio text by using a preset language model to obtain a plurality of audio paragraph texts, and determining a plurality of paragraph summaries and start-stop times corresponding to the plurality of audio paragraph texts, where each start-stop time is the start-stop time corresponding to one of the plurality of audio paragraph texts.
Optionally, preprocessing the video picture data includes:
performing video segmentation on the video picture data based on the start-stop times, so as to obtain a plurality of video picture data corresponding to the plurality of audio paragraph texts;
extracting adjacent video frames from each of the plurality of video picture data based on a preset time interval, and calculating the picture repetition rate of the adjacent video frames, where the adjacent video frames are two video frames separated by the preset time interval;
if the picture repetition rate is greater than a preset repetition rate threshold, removing one frame from the adjacent video frames to obtain a plurality of target video frames;
and extracting the text in the plurality of target video frames based on optical character recognition, so as to obtain a plurality of video picture texts.
Optionally, the interaction method based on video information further includes:
creating a first jump interface based on the paragraph summaries, and judging, through the first jump interface, whether a first video jump instruction corresponding to a target paragraph summary among the paragraph summaries is received;
if so, switching the video picture data currently played in the video playing window to first target video picture data corresponding to a first target start-stop time, based on the first target start-stop time corresponding to the target paragraph summary.
Optionally, the interaction method based on video information further includes:
judging whether a language transcription instruction is received, and if so, converting the plurality of audio paragraph texts and the plurality of video picture texts, based on the language type in the language transcription instruction, into a plurality of target audio paragraph texts and a plurality of target video picture texts in that language;
and overlaying the target audio paragraph texts on a preset first video area of the video playing window, and overlaying the target video picture texts on a preset second video area of the video playing window.
Optionally, the vectorizing of the received question text and the similarity matching of the vectorized data stored in the preset vector database based on the obtained vectorized question text, so as to determine target vectorized data corresponding to the vectorized question text, includes:
judging whether an input question text is received, and if so, vectorizing the question text to obtain a vectorized question text;
calculating the similarity between the vectorized question text and the vectorized data stored in the preset vector database, and determining the highest calculated similarity as the target similarity;
and determining the vectorized audio paragraph text corresponding to the target similarity as target vectorized audio paragraph text, and determining the vectorized video picture text corresponding to the target similarity as target vectorized video picture text.
Optionally, after inputting the target vectorized data and the vectorized question text into a preset language model to generate an answer text corresponding to the question text and presenting the answer text in a video playing window of the video file to be processed, the method further includes:
Determining a second target start-stop time corresponding to the target vectorized audio paragraph text and the target vectorized video picture text;
creating a second jump interface corresponding to the answer text, and judging, based on the second jump interface, whether a second video jump instruction corresponding to the answer text is received;
and if so, jumping the video picture data currently played in the video playing window to second target video picture data corresponding to the second target start-stop time.
In a second aspect, the present application discloses an interaction device based on video information, which is applied to a video playing client, and includes:
the video separation module is used for separating the video file to be processed, preprocessing the audio data and the video picture data obtained through the separation operation to obtain a plurality of audio paragraph texts corresponding to the audio data and a plurality of video picture texts corresponding to the video picture data;
the data vectorization module is used for vectorizing the audio paragraph texts and the video picture texts, and storing the obtained vectorized data into a preset vector database;
the question matching module is used for vectorizing the received question text, and performing similarity matching on the vectorized data stored in the preset vector database based on the obtained vectorized question text, so as to determine target vectorized data corresponding to the vectorized question text;
and the answer presentation module is used for inputting the target vectorized data and the vectorized question text into a preset language model to generate an answer text corresponding to the question text, and presenting the answer text in a video playing window of the video file to be processed.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
and a processor for executing the computer program to implement the aforementioned video information-based interaction method.
In a fourth aspect, the present application discloses a computer readable storage medium storing a computer program which, when executed by a processor, implements the aforementioned video information based interaction method.
In the method, a video file to be processed is first separated, and the audio data and video picture data obtained through the separation operation are preprocessed to obtain a plurality of audio paragraph texts corresponding to the audio data and a plurality of video picture texts corresponding to the video picture data; the audio paragraph texts and the video picture texts are vectorized, and the vectorized data are stored in a preset vector database; the received question text is vectorized, and similarity matching is performed on the vectorized data stored in the preset vector database based on the obtained vectorized question text to determine target vectorized data corresponding to the vectorized question text; finally, the target vectorized data and the vectorized question text are input into a preset language model to generate an answer text corresponding to the question text, and the answer text is presented in a video playing window of the video file to be processed. The interaction method based on video information can thus separate the video file to be processed into audio data and video picture data, perform speech recognition on the audio data to convert it into a plurality of audio paragraph texts, recognize the text in the video picture data to obtain a plurality of video picture texts, vectorize the audio paragraph texts and the video picture texts, and vectorize a received question text so as to obtain an answer text corresponding to the question text and display it in the video playing window of the video file to be processed. In this way, question-and-answer interaction based on video information can be realized, improving the experience and efficiency with which users watch videos and acquire information.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of an interaction method based on video information provided by the application;
FIG. 2 is a flowchart of a specific video information based interaction method according to the present application;
FIG. 3 is a schematic diagram of an interactive structure based on video information according to the present application;
fig. 4 is a block diagram of an electronic device according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of protection of the present application.
With the development of artificial intelligence technology, and especially the recent breakthroughs in speech recognition, natural language processing and large-scale language models, the structuring of video information has become broadly feasible, accelerating the transmission of video information. In the prior art, however, the way video information is processed can leave the video knowledge used in question-and-answer interaction low in density and poorly structured, creating a bottleneck in information transfer.
To solve this technical problem, the present application provides an interaction method, device, equipment and storage medium based on video information that can match question text input by a user against text data obtained from the video, obtain an answer text corresponding to the question text, and present that answer to the user, so that question-and-answer interaction based on video information can be realized and the experience and efficiency with which users watch videos and acquire information are improved.
Referring to fig. 1, an embodiment of the present application discloses an interaction method based on video information, which is applied to a video playing client, and includes:
step S11, separating the video file to be processed, and preprocessing the audio data and the video picture data obtained through the separating operation to obtain a plurality of audio paragraph texts corresponding to the audio data and a plurality of video picture texts corresponding to the video picture data.
In this embodiment, a separation operation is performed on a video file to be processed, and the audio data and video picture data obtained through the separation operation are preprocessed to obtain a plurality of audio paragraph texts corresponding to the audio data and a plurality of video picture texts corresponding to the video picture data. That is, in order to extract the complete information in a video to be processed, the video may be separated into audio data and video picture data. Taking an educational video as an example, the audio data contains the lecturer's narration, and the video pictures contain the lecturer's blackboard-writing information; the educational video can therefore be decomposed into audio data and video picture data, the audio data preprocessed to convert it into text data, and the video picture data preprocessed to extract the text information it contains.
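As a concrete illustration of the separation operation, the following minimal sketch splits a file into an audio track and a silent video stream with ffmpeg; the file names, sample rate and codec options are illustrative assumptions, not part of the patent.

```python
# Illustrative sketch only: one way to split a video into an audio track and a
# frame-only stream with ffmpeg. Paths and codec options are assumptions.
import subprocess

def separate_video(src: str, audio_out: str = "audio.wav", video_out: str = "frames.mp4") -> None:
    # Extract the audio track as 16 kHz mono PCM (a common ASR input format).
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000", audio_out],
        check=True,
    )
    # Copy the video stream without its audio track.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", video_out],
        check=True,
    )

separate_video("lecture.mp4")
```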
The preprocessing of the audio data includes: performing speech recognition on the audio data to obtain an audio text corresponding to the audio data; performing paragraph division on the audio text by using a preset language model to obtain a plurality of audio paragraph texts, and determining a plurality of paragraph summaries and start-stop times corresponding to the audio paragraph texts, where each start-stop time is the start-stop time corresponding to one audio paragraph text. That is, after the video file to be processed is separated into audio data and video picture data, speech recognition needs to be performed on the audio data to convert it into audio text, i.e. the spoken text of the speaker in the video file to be processed. After the audio text is obtained, a semantic segmentation model, such as an FCN (Fully Convolutional Networks) model or a SegNet (real-time semantic segmentation) model, is used to segment the audio text into a plurality of audio paragraph texts. It should be noted that when the audio text is segmented by the semantic segmentation model, a paragraph summary is generated for each audio paragraph text as it is obtained, and the start-stop time of each audio paragraph text in the video to be processed, i.e. its start time and end time, is marked. After the audio text has been fully segmented, a list containing the audio paragraph texts, their paragraph summaries and their start-stop times is output in JSON (JavaScript Object Notation) format.
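A minimal sketch of this step follows, under stated assumptions: Whisper stands in for the unspecified speech recognizer, segments are grouped by a fixed time window as a naive placeholder for the patent's semantic paragraph segmentation, and the "summary" here is just the paragraph's first sentence where the patent would use a language model.

```python
import json
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
result = model.transcribe("audio.wav")

WINDOW = 60.0  # seconds per pseudo-paragraph (placeholder heuristic)
paragraphs, current = [], []

def flush(segments: list) -> None:
    # Emit one paragraph record in the JSON layout the patent describes:
    # paragraph text, a paragraph summary, and the start-stop time.
    text = " ".join(s["text"].strip() for s in segments)
    paragraphs.append({
        "text": text,
        "summary": text.split(".")[0],  # placeholder for the model-generated summary
        "start": segments[0]["start"],
        "end": segments[-1]["end"],
    })

for seg in result["segments"]:
    current.append(seg)
    if seg["end"] - current[0]["start"] >= WINDOW:
        flush(current)
        current = []
if current:
    flush(current)

print(json.dumps(paragraphs, ensure_ascii=False, indent=2))
```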
It should be further noted that preprocessing the video picture data includes: segmenting the video picture data based on the start-stop times to obtain a plurality of video picture data corresponding to the plurality of audio paragraph texts; extracting adjacent video frames from each of the plurality of video picture data based on a preset time interval, and calculating the picture repetition rate of the adjacent video frames, where the adjacent video frames are two video frames separated by the preset time interval; if the picture repetition rate is greater than a preset repetition rate threshold, removing one frame from the adjacent video frames to obtain a plurality of target video frames; and extracting the text in the target video frames based on optical character recognition to obtain a plurality of video picture texts. That is, after the video file to be processed is separated into audio data and video picture data, the video picture data may contain pictures with blackboard writing, which may be text-bearing PPT (PowerPoint slide) pages used by the lecturer in the video, or handwritten text. Because audio and video pictures are played synchronously, the audio data and video picture data obtained from the video to be processed are consistent on the time axis, so the video picture data can be divided at the time nodes determined when the audio data was divided, i.e. by the start-stop times, to obtain a plurality of video picture data corresponding to the plurality of audio paragraph texts. After the video picture data are obtained, the text displayed in them must be extracted. Video frames are sampled from each divided video picture data at a preset time interval; for example, with the preset time interval set to 1 second, one frame is extracted every second. Because the time interval is short, the blackboard writing in two adjacent sampled frames may be unchanged, so the repetition rate between adjacent frames must be calculated. Before doing so, the region containing the blackboard writing can be determined and selected as the repetition-rate calculation region. The repetition rate is then computed for each pair of adjacent frames; if it is greater than the preset repetition rate threshold, the blackboard writing in the two frames is deemed identical, and one of the two frames can be removed. After all video frames of all the video picture data have been processed and a plurality of target video frames obtained, the blackboard-writing data in the target video frames, i.e. the video picture text, can be extracted by OCR (optical character recognition) technology.
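The following sketch illustrates one possible realization of this step: sample one frame per second, drop a sampled frame whose picture repetition rate against the previously sampled frame exceeds a threshold, then OCR the survivors. The 1-second interval, the 0.99 threshold and the use of Tesseract are assumptions; the patent only requires "a preset time interval", "a preset repetition rate threshold" and optical character recognition.

```python
import cv2
import numpy as np
import pytesseract  # pip install pytesseract (the tesseract binary must also be installed)

def repetition_rate(a: np.ndarray, b: np.ndarray) -> float:
    # Fraction of pixels that are (nearly) unchanged between two frames.
    diff = cv2.absdiff(cv2.cvtColor(a, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(b, cv2.COLOR_BGR2GRAY))
    return float(np.mean(diff < 10))

cap = cv2.VideoCapture("frames.mp4")
step = int(round(cap.get(cv2.CAP_PROP_FPS))) or 25  # ~1 sampled frame per second
texts, prev, idx = [], None, 0

ok, frame = cap.read()
while ok:
    if idx % step == 0:
        # Keep the frame only if the blackboard writing changed versus the
        # previously sampled frame (repetition rate at or below the threshold).
        if prev is None or repetition_rate(prev, frame) <= 0.99:
            texts.append(pytesseract.image_to_string(frame, lang="chi_sim"))
        prev = frame
    ok, frame = cap.read()
    idx += 1
cap.release()
```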
The interaction method based on video information further includes: creating a first jump interface based on the paragraph summaries, and judging, through the first jump interface, whether a first video jump instruction corresponding to a target paragraph summary among the paragraph summaries is received; if so, switching the video picture data currently played in the video playing window to first target video picture data corresponding to the first target start-stop time, based on the first target start-stop time corresponding to the target paragraph summary. That is, after the paragraph summaries corresponding to the audio text are obtained, they may be displayed in the video playing window; for example, a sidebar attached to the video playing window is created, and all the paragraph summaries are displayed in it. Because each audio paragraph text obtained by dividing the audio text comes with its paragraph summary and start-stop time, all the paragraph summaries can be displayed in the sidebar together with their start-stop times. A first jump interface can then be created for the paragraph summaries, and when a first video jump instruction corresponding to a target paragraph summary is received through this interface, the picture in the video playing window is switched directly to the video picture at the start-stop time corresponding to that summary. For example, when a user clicks a paragraph summary, the picture can jump to the video clip corresponding to the clicked summary, as shown in the sketch below.
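A schematic sketch of the jump logic, where `seek` is a hypothetical stand-in for whatever seek call the playing client exposes:

```python
from typing import Callable

def make_jump_handler(paragraphs: list[dict], seek: Callable[[float], None]):
    # Map each paragraph summary to the paragraph's start time.
    by_summary = {p["summary"]: p["start"] for p in paragraphs}

    def on_summary_clicked(summary: str) -> None:
        # A click on a sidebar summary is the "first video jump instruction".
        if summary in by_summary:
            seek(by_summary[summary])  # switch the window to the target start time

    return on_summary_clicked

# Usage with a stand-in player whose seek just prints the target time:
handler = make_jump_handler(
    [{"summary": "Intro to gradient descent", "start": 0.0}],
    seek=lambda t: print(f"seek to {t:.1f}s"),
)
handler("Intro to gradient descent")
```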
And step S12, vectorizing the audio paragraph texts and the video picture texts, and storing the vectorized data into a preset vector database.
In this embodiment, the plurality of audio paragraph texts and the plurality of video picture texts are vectorized, and the resulting vectorized data are stored in a preset vector database. That is, after all the audio paragraph texts and video picture texts are obtained, they need to be vectorized: they can be processed by a word embedding model to convert the text data into numeric vectors, and the resulting vectorized data are stored in the preset vector database. Converting text data into vector data in this way facilitates analysis and processing by a computer and improves the efficiency of the interaction method based on video information.
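The patent leaves the embedding model and the vector database unspecified; as a minimal sketch, Sentence-Transformers and FAISS are assumed substitutes, and the example chunks are made up:

```python
import faiss  # pip install faiss-cpu
from sentence_transformers import SentenceTransformer

# Example chunks standing in for the audio paragraph texts and video picture texts.
chunks = [
    "Paragraph 1: the lecturer introduces gradient descent ...",
    "Slide text: loss = mean squared error ...",
]

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
vectors = encoder.encode(chunks, normalize_embeddings=True)

# With normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
```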
It should be noted that the interaction method based on video information further includes: judging whether a language transcription instruction is received, and if so, converting the plurality of audio paragraph texts and the plurality of video picture texts, based on the language type in the instruction, into a plurality of target audio paragraph texts and a plurality of target video picture texts in that language; and overlaying the target audio paragraph texts on a preset first video area of the video playing window, and the target video picture texts on a preset second video area of the video playing window. That is, a language transcription interface may be provided in the video playing window. After a language transcription instruction input by the user through this interface is received, the audio paragraph texts and video picture texts are transcribed into the language corresponding to the instruction. For example, if the user selects English through the language transcription interface, then upon receiving the instruction the audio paragraph texts and video picture texts are transcribed into English text; the display area of the audio paragraph text and the display area of the blackboard writing are determined, the transcribed target audio paragraph texts are overlaid on the audio-paragraph-text display area of the video playing window, and the transcribed target video picture texts are overlaid on the blackboard-writing area of the video playing window.
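A schematic of this step, where `translate` is a hypothetical placeholder for any machine-translation call and the two overlay regions are assumptions; the patent fixes only that transcribed audio paragraph texts go to a first video area and transcribed video picture texts to a second:

```python
def translate(text: str, to: str) -> str:
    # Hypothetical placeholder: a real client would call an MT model or service.
    return f"[{to}] {text}"

def on_transcription_instruction(lang, audio_texts, frame_texts, overlay):
    target_audio = [translate(t, to=lang) for t in audio_texts]
    target_frames = [translate(t, to=lang) for t in frame_texts]
    overlay("first video area", target_audio)    # audio-paragraph-text region
    overlay("second video area", target_frames)  # blackboard-writing region

on_transcription_instruction(
    "en", ["音频段落文本"], ["板书文本"],
    overlay=lambda region, lines: print(region, lines),
)
```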
And S13, vectorizing the received question text, and performing similarity matching on the vectorized data stored in the preset vector database based on the obtained vectorized question text, so as to determine target vectorized data corresponding to the vectorized question text.
In this embodiment, the received question text is vectorized, and similarity matching is performed on the vectorized data stored in the preset vector database based on the obtained vectorized question text, so as to determine target vectorized data corresponding to the vectorized question text. That is, the user may input a question text through the video playing window; after the question text input by the user is received, it can be vectorized, similarity matching can be performed on the vectorized data stored in the preset vector database based on the obtained vectorized question text, and the vectorized data of the paragraph with the highest similarity is determined as the target vectorized data corresponding to the vectorized question text.
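Continuing the sketch shown after step S12 (reusing its `encoder`, `index` and `chunks`), the question is embedded with the same encoder and the highest-similarity stored chunk is taken as the target vectorized data; the example question is made up:

```python
question = "Which loss function does the lecturer use?"
q_vec = encoder.encode([question], normalize_embeddings=True)
scores, ids = index.search(q_vec, k=1)  # highest-similarity match
target_chunk = chunks[ids[0][0]]
print(f"match (cosine {scores[0][0]:.2f}): {target_chunk}")
```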
And S14, inputting the target vectorized data and the vectorized question text into a preset language model to generate an answer text corresponding to the question text, and presenting the answer text in a video playing window of the video file to be processed.
In this embodiment, the target vectorized data and the vectorized question text are input into a preset language model to generate an answer text corresponding to the question text, and the answer text is presented in the video playing window of the video file to be processed. The obtained target vectorized data and vectorized question text are input into the preset language model, i.e. a word embedding model, so as to generate an answer text corresponding to the question text, which is presented in the video playing window. For example, after the answer text is generated, it may be scroll-played in the video playing window as a scrolling caption at a scroll speed selected by the user.
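A sketch of the answer generation step follows. The patent only requires "a preset language model"; a small instruction-tuned model served through the Hugging Face pipeline API is an assumed stand-in, and the prompt wording is ours (`target_chunk` and `question` come from the retrieval sketch above):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2-0.5B-Instruct")

prompt = (
    "Answer the question using only the video excerpt below.\n"
    f"Excerpt: {target_chunk}\n"
    f"Question: {question}\n"
    "Answer:"
)
answer_text = generator(prompt, max_new_tokens=128)[0]["generated_text"]
print(answer_text)  # the playing client would render this in the video window
```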
It should be noted that, after the target vectorized data and the vectorized question text are input into the preset language model to generate the answer text corresponding to the question text and the answer text is presented in the video playing window of the video file to be processed, the method further includes: determining a second target start-stop time corresponding to the target vectorized audio paragraph text and the target vectorized video picture text; creating a second jump interface corresponding to the answer text, and judging, based on the second jump interface, whether a second video jump instruction corresponding to the answer text is received; and if so, jumping the video picture data currently played in the video playing window to second target video picture data corresponding to the second target start-stop time. That is, the generated answer text corresponds to the target vectorized data, which consists of the target vectorized audio paragraph text and the target vectorized video picture text. After these are determined, their corresponding start-stop time can be determined and taken as the start-stop time corresponding to the answer text. A second jump interface corresponding to the answer text can then be created; a second jump instruction is received through this interface, and based on it the picture currently played in the video playing window is jumped to the video picture at the start-stop time corresponding to the answer text. For example, while the answer text is scroll-played in the video playing window, a jump interface may be created over the area where the answer text scrolls, and when the user clicks the answer text, the picture jumps to the video picture at the start-stop time corresponding to the answer text.
It can be seen that in this embodiment, a separation operation is performed on the video file to be processed, and the audio data and video picture data obtained through the separation operation are preprocessed to obtain a plurality of audio paragraph texts corresponding to the audio data and a plurality of video picture texts corresponding to the video picture data; the audio paragraph texts and the video picture texts are vectorized, and the vectorized data are stored in a preset vector database; the received question text is vectorized, and similarity matching is performed on the vectorized data stored in the preset vector database based on the obtained vectorized question text to determine target vectorized data corresponding to the vectorized question text; finally, the target vectorized data and the vectorized question text are input into a preset language model to generate an answer text corresponding to the question text, and the answer text is presented in the video playing window of the video file to be processed. The interaction method based on video information can thus separate the video file to be processed into audio data and video picture data, perform speech recognition on the audio data to convert it into a plurality of audio paragraph texts, recognize the text in the video picture data to obtain a plurality of video picture texts, vectorize the audio paragraph texts and video picture texts, and vectorize a received question text so as to obtain an answer text corresponding to the question text and display it in the video playing window. In this way, on the one hand, converting text data into vector data facilitates analysis and processing by a computer and improves the efficiency of the interaction method based on video information; on the other hand, question-and-answer interaction based on video information can be realized, improving the user's experience of watching videos and acquiring information.
Based on the foregoing embodiments, it can be seen that after an input question text is received, the question text must be vectorized and similarity matching performed with the obtained vectorized question text against a preset vector database to determine the target vectorized data. This embodiment therefore describes in detail how the similarity matching is performed based on the input question text. Referring to fig. 2, an embodiment of the present invention discloses an interaction method based on video information, including:
step S21, separating the video file to be processed, and preprocessing the audio data and the video picture data obtained through the separating operation to obtain a plurality of audio paragraph texts corresponding to the audio data and a plurality of video picture texts corresponding to the video picture data.
Step S22, vectorizing the audio paragraph texts and the video picture texts, and storing the vectorized data to a preset vector database.
And S23, judging whether an input question text is received, and if so, vectorizing the question text to obtain a vectorized question text.
In this embodiment, it is judged whether an input question text is received; if so, the question text is vectorized to obtain a vectorized question text. That is, it is necessary to determine whether a question text input by the user in the video playing window is received; if so, the question text can be vectorized by a preset word embedding model to obtain the vectorized question text corresponding to the input question text.
And S24, calculating the similarity between the vectorized question text and the vectorized data stored in the preset vector database, and determining the highest calculated similarity as the target similarity.
In this embodiment, the similarity between the vectorized question text and the vectorized data stored in the preset vector database is calculated, and the highest calculated similarity is determined as the target similarity. That is, after the vectorized question text is obtained, similarity calculation is performed between it and the vectorized data stored in the preset vector database. Since the database stores every vectorized audio paragraph text and every vectorized video picture text of the video file to be processed, the similarity between the vectorized question text and each vectorized audio paragraph text and each vectorized video picture text must be calculated, and the highest calculated similarity is determined as the target similarity.
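The patent does not fix a similarity measure; for concreteness, cosine similarity between the vectorized question text q and each stored vector d_i is a common choice, with the target similarity being the maximum:

```latex
\mathrm{sim}(q, d_i) = \frac{q \cdot d_i}{\lVert q \rVert \, \lVert d_i \rVert},
\qquad
i^{*} = \operatorname*{arg\,max}_{i} \ \mathrm{sim}(q, d_i)
```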
Step S25, determining the vectorized audio paragraph text corresponding to the target similarity as a target vectorized audio paragraph text, and determining the vectorized video picture text corresponding to the target similarity as a target vectorized video picture text.
In this embodiment, the vectorized audio paragraph text corresponding to the target similarity is determined as the target vectorized audio paragraph text, and the vectorized video picture text corresponding to the target similarity is determined as the target vectorized video picture text. That is, after the target similarity is obtained by calculation, the vectorized audio paragraph text in the preset vector database corresponding to the target similarity may be determined as a target vectorized audio paragraph text, and the vectorized video picture text in the preset vector database corresponding to the target similarity may be determined as a target vectorized video picture text, so as to obtain an answer text through the target vectorized audio paragraph text, the target vectorized video picture text and the vectorized question text.
Step S26, inputting the target vectorized audio paragraph text, the target vectorized video picture text and the vectorized question text into a preset language model to generate an answer text corresponding to the question text, and presenting the answer text in the video playing window of the video file to be processed.
It can be seen that in this embodiment, after the vectorized data are stored in the preset vector database, it is first judged whether an input question text is received; if so, the question text is vectorized to obtain a vectorized question text. The similarity between the vectorized question text and the vectorized data stored in the preset vector database is then calculated, and the highest calculated similarity is determined as the target similarity. Finally, the vectorized audio paragraph text corresponding to the target similarity is determined as the target vectorized audio paragraph text, the vectorized video picture text corresponding to the target similarity is determined as the target vectorized video picture text, and the target vectorized audio paragraph text, the target vectorized video picture text and the vectorized question text are input into a preset language model to generate an answer text corresponding to the question text, which is presented in the video playing window of the video file to be processed. In this way, the target vectorized data with the highest similarity to the question text can be determined in the preset vector database from the input question text, and the determined target vectorized data and the vectorized question text are input into the preset language model, so that an answer text corresponding to the user's question text is obtained and interaction with the user is realized during video playing.
Referring to fig. 3, an embodiment of the present invention discloses an interaction device based on video information, including:
the video separation module 11 is configured to perform a separation operation on a video file to be processed, and perform preprocessing on audio data and video frame data obtained through the separation operation, so as to obtain a plurality of audio paragraph texts corresponding to the audio data and a plurality of video frame texts corresponding to the video frame data;
the data vectorization module 12 is configured to vectorize the plurality of audio paragraph texts and the plurality of video frame texts, and store the obtained vectorized data in a preset vector database;
the question matching module 13 is configured to vectorize the received question text, and to perform similarity matching on the vectorized data stored in the preset vector database based on the obtained vectorized question text, so as to determine target vectorized data corresponding to the vectorized question text;
the answer presenting module 14 is configured to input the target vectorized data and the vectorized question text into a preset language model to generate an answer text corresponding to the question text, and to present the answer text in the video playing window of the video file to be processed.
It can be seen that in this embodiment, a separation operation is performed on the video file to be processed, and the audio data and video picture data obtained through the separation operation are preprocessed to obtain a plurality of audio paragraph texts corresponding to the audio data and a plurality of video picture texts corresponding to the video picture data; the audio paragraph texts and the video picture texts are vectorized, and the vectorized data are stored in a preset vector database; the received question text is vectorized, and similarity matching is performed on the vectorized data stored in the preset vector database based on the obtained vectorized question text to determine target vectorized data corresponding to the vectorized question text; finally, the target vectorized data and the vectorized question text are input into a preset language model to generate an answer text corresponding to the question text, and the answer text is presented in the video playing window of the video file to be processed. The interaction device based on video information can thus separate the video file to be processed into audio data and video picture data, perform speech recognition on the audio data to convert it into a plurality of audio paragraph texts, recognize the text in the video picture data to obtain a plurality of video picture texts, vectorize the audio paragraph texts and video picture texts, and vectorize a received question text so as to obtain an answer text corresponding to the question text and display it in the video playing window. In this way, on the one hand, converting text data into vector data facilitates analysis and processing by a computer and improves the efficiency of the interaction based on video information; on the other hand, question-and-answer interaction based on video information can be realized, improving the user's experience of watching videos and acquiring information.
In some embodiments, the video separation module 11 may specifically include:
the speech conversion unit is used for performing speech recognition on the audio data, so as to obtain an audio text corresponding to the audio data;
the data determining unit is used for performing paragraph division on the audio text by using a preset language model to obtain a plurality of audio paragraph texts, and determining a plurality of paragraph summaries and start-stop times corresponding to the plurality of audio paragraph texts, where each start-stop time is the start-stop time corresponding to one of the plurality of audio paragraph texts.
In some embodiments, the video separation module 11 may specifically include:
the video segmentation unit is used for carrying out video segmentation on the video picture data based on the start-stop time so as to obtain a plurality of video picture data corresponding to the plurality of audio paragraph texts;
the repetition rate calculating unit is used for respectively extracting adjacent video frames in each video picture data in the plurality of video picture data based on a preset time interval and calculating the picture repetition rate of the adjacent video frames; the adjacent video frames are two frames of video pictures based on the preset time interval;
the video frame removal unit is used for removing, if the picture repetition rate is greater than a preset repetition rate threshold, one frame from the adjacent video frames to obtain a plurality of target video frames;
and the text extraction unit is used for extracting texts in the plurality of target video frames based on an optical character recognition technology so as to obtain a plurality of video picture texts.
In some embodiments, the interaction device based on video information may further include:
the first instruction receiving unit is used for creating a first jump interface based on the paragraph summaries and judging whether a first video jump instruction corresponding to a target paragraph summary in the paragraph summaries is received through the first jump interface;
and the first picture switching unit is used for switching, if such a first video jump instruction is received, the video picture data currently played in the video playing window to the first target video picture data corresponding to the first target start-stop time, based on the first target start-stop time corresponding to the target paragraph summary.
In some embodiments, the interaction device based on video information may further include:
the text conversion unit is used for judging whether a language transcription instruction is received, and if so, converting the plurality of audio paragraph texts and the plurality of video picture texts, based on the language type in the language transcription instruction, into a plurality of target audio paragraph texts and a plurality of target video picture texts in that language;
and the text overlay unit is used for overlaying the target audio paragraph texts on the preset first video area of the video playing window, and overlaying the target video picture texts on the preset second video area of the video playing window.
In some embodiments, the question matching module 13 may specifically include:
the text vectorization unit is used for judging whether an input question text is received, and if so, vectorizing the question text to obtain a vectorized question text;
the similarity calculation unit is used for calculating the similarity between the vectorized question text and the vectorized data stored in the preset vector database, and determining the highest calculated similarity as the target similarity;
and the text determining unit is used for determining the vectorized audio paragraph text corresponding to the target similarity as a target vectorized audio paragraph text and determining the vectorized video picture text corresponding to the target similarity as a target vectorized video picture text.
In some embodiments, the interaction device based on video information may further include:
a start-stop time determining unit, configured to determine a second target start-stop time corresponding to the target vectorized audio paragraph text and the target vectorized video picture text;
The second instruction receiving unit is used for creating a second jump interface corresponding to the answer text and judging whether a second video jump instruction corresponding to the answer text is received based on the second jump interface;
and the second picture jumping unit is used for jumping, if such a second video jump instruction is received, the video picture data currently played in the video playing window to the second target video picture data corresponding to the second target start-stop time.
Further, the embodiment of the present application further discloses an electronic device, and fig. 4 is a block diagram of an electronic device 20 according to an exemplary embodiment, where the content of the diagram is not to be considered as any limitation on the scope of use of the present application.
Fig. 4 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein the memory 22 is configured to store a computer program that is loaded and executed by the processor 21 to implement relevant steps in the video information based interaction method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol to be followed is any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring external input data or outputting external output data, and the specific interface type thereof may be selected according to the specific application requirement, which is not limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling the hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, Netware, Unix, Linux, etc. In addition to the computer program that performs the interaction method based on video information executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for performing other specific tasks.
Further, the application also discloses a computer readable storage medium for storing a computer program; wherein the computer program, when executed by a processor, implements the previously disclosed video information based interaction method. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, which are not repeated here.
In this specification, the embodiments are described in a progressive manner, with each embodiment focusing on its differences from the others, so the same or similar parts of the embodiments may be cross-referenced. Since the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and relevant details can be found in the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The interaction method, device, equipment, and storage medium based on video information provided by the present application have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present application, and the above description of the embodiments is intended only to help in understanding the method of the present application and its core ideas. Meanwhile, those skilled in the art may vary the specific embodiments and the application scope in accordance with the ideas of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (10)

1. An interaction method based on video information, characterized by being applied to a video playing client and comprising the following steps:
performing a separation operation on a video file to be processed, and preprocessing the audio data and video picture data obtained through the separation operation to obtain a plurality of audio paragraph texts corresponding to the audio data and a plurality of video picture texts corresponding to the video picture data;
vectorizing the plurality of audio paragraph texts and the plurality of video picture texts, and storing the obtained vectorized data in a preset vector database;
vectorizing the received question text, and performing similarity matching on the vectorized data stored in the preset vector database based on the obtained vectorized question text to determine target vectorized data corresponding to the vectorized question text;
and inputting the target vectorized data and the vectorized question text into a preset language model to generate an answer text corresponding to the question text, and presenting the answer text in a video playing window of the video file to be processed.
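For orientation, a minimal end-to-end sketch of this claimed pipeline follows. It is an assumption-laden illustration, not the claimed implementation: transcription, OCR, and the language model are stubbed or omitted, embed() is a hypothetical placeholder for a real embedding model, and the "preset vector database" is reduced to an in-memory list; only the ffmpeg demultiplexing call reflects a real tool.

# Minimal sketch of the claimed pipeline; all model calls are hypothetical stubs.
import subprocess
import numpy as np

def separate(video_path: str, audio_path: str = "audio.wav") -> str:
    # Demultiplex the audio track with ffmpeg (-vn drops the video stream).
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", audio_path], check=True)
    return audio_path

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Hypothetical embedding: a deterministic hash-seeded vector, for illustration only.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def build_index(paragraph_texts, picture_texts):
    # "Preset vector database": here just (vector, text) pairs in memory.
    return [(embed(t), t) for t in paragraph_texts + picture_texts]

def match(index, question: str):
    q = embed(question)
    # Cosine similarity; vectors are unit-norm, so the dot product suffices.
    return max(index, key=lambda item: float(item[0] @ q))

def answer(question: str, context: str) -> str:
    # Hypothetical language-model call: condition the answer on retrieved context.
    return f"[answer generated from context: {context[:40]}...]"

index = build_index(["paragraph about model training", "paragraph about deployment"],
                    ["slide text: loss curve"])
_, best = match(index, "how is the model trained?")
print(answer("how is the model trained?", best))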
2. The video information based interaction method of claim 1, wherein preprocessing the audio data comprises:
performing voice recognition on the audio data to obtain an audio text corresponding to the audio data;
performing paragraph division on the audio text by using a preset language model to obtain the plurality of audio paragraph texts, and determining a plurality of paragraph summaries and start-stop times corresponding to the plurality of audio paragraph texts; wherein each start-stop time is the start-stop time corresponding to one of the plurality of audio paragraph texts.
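To make the start-stop bookkeeping of this claim concrete: if speech recognition yields timed segments and the preset language model picks paragraph boundaries, each paragraph's start-stop time falls out of its first and last segments. The segment format and boundary indices below are illustrative assumptions, not the patented representation.

# Sketch: derive per-paragraph start-stop times from timed ASR segments.
segments = [  # (text, start_s, end_s) -- assumed ASR output format
    ("welcome to the lecture", 0.0, 3.2),
    ("today we cover retrieval", 3.2, 7.9),
    ("first, the embedding step", 8.4, 12.1),
    ("then similarity search", 12.1, 15.0),
]
# Paragraph boundaries as segment indices, e.g. as chosen by a language model.
boundaries = [0, 2]  # paragraph 1 = segments 0-1, paragraph 2 = segments 2-3

paragraphs = []
for i, start_idx in enumerate(boundaries):
    end_idx = boundaries[i + 1] if i + 1 < len(boundaries) else len(segments)
    chunk = segments[start_idx:end_idx]
    paragraphs.append({
        "text": " ".join(s[0] for s in chunk),
        "start": chunk[0][1],   # start time of the first segment
        "end": chunk[-1][2],    # end time of the last segment
    })
print(paragraphs)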
3. The video information-based interaction method according to claim 2, wherein preprocessing the video picture data comprises:
performing video segmentation on the video picture data based on the start-stop times to obtain a plurality of video picture data corresponding to the plurality of audio paragraph texts;
extracting, based on a preset time interval, adjacent video frames from each of the plurality of video picture data, and calculating a picture repetition rate of the adjacent video frames; wherein the adjacent video frames are two video frames separated by the preset time interval;
if the picture repetition rate is greater than a preset repetition rate threshold, removing one frame from the adjacent video frames to obtain a plurality of target video frames;
and extracting text from the plurality of target video frames based on an optical character recognition technique to obtain the plurality of video picture texts.
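A sketch of this claim's frame-deduplication step follows. The claim does not define how the picture repetition rate is computed; the fraction of near-equal pixels is used here as one plausible reading, and ocr() is a hypothetical stand-in for a real optical character recognition engine.

# Sketch: drop near-duplicate adjacent frames before OCR.
import numpy as np

def repetition_rate(a: np.ndarray, b: np.ndarray, tol: int = 8) -> float:
    # Fraction of pixels whose values differ by at most `tol`.
    return float(np.mean(np.abs(a.astype(int) - b.astype(int)) <= tol))

def dedupe(frames, threshold: float = 0.95):
    kept = [frames[0]]
    for frame in frames[1:]:
        if repetition_rate(kept[-1], frame) <= threshold:
            kept.append(frame)  # keep only frames that changed enough
    return kept

def ocr(frame: np.ndarray) -> str:
    return "<hypothetical OCR text>"  # stand-in for a real OCR engine

rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, (4, 4), dtype=np.uint8)] * 3  # duplicates
frames.append(rng.integers(0, 256, (4, 4), dtype=np.uint8))  # a changed frame
targets = dedupe(frames)
texts = [ocr(f) for f in targets]
print(len(frames), "frames ->", len(targets), "target frames")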
4. The video information-based interaction method according to claim 2, further comprising:
creating a first jump interface based on the plurality of paragraph summaries, and judging, through the first jump interface, whether a first video jump instruction corresponding to a target paragraph summary among the plurality of paragraph summaries is received;
if yes, switching the video picture data currently played in the video playing window to first target video picture data corresponding to a first target start-stop time, based on the first target start-stop time corresponding to the target paragraph summary.
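Illustratively, the first jump interface of this claim can be pictured as a chapter menu mapping each paragraph summary to its start time; the rendering and input handling below are hypothetical simplifications, not the patented interface.

# Sketch: a "first jump interface" as a chapter menu of summary -> start time.
chapters = [  # (paragraph summary, start_s) -- from the claim 2 preprocessing
    ("Introduction and goals", 0.0),
    ("Retrieval pipeline", 95.0),
    ("Results and outlook", 412.0),
]

def render_menu(chapters):
    for i, (summary, start_s) in enumerate(chapters):
        print(f"[{i}] {summary} ({start_s:.0f}s)")

def on_first_jump_instruction(choice: int) -> float:
    # Returns the first target start time the player should switch to.
    return chapters[choice][1]

render_menu(chapters)
print("seek to", on_first_jump_instruction(1), "s")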
5. The video information-based interaction method according to claim 1, further comprising:
judging whether a language transcription instruction is received, and if yes, converting, based on the language type in the language transcription instruction, the plurality of audio paragraph texts and the plurality of video picture texts respectively into a plurality of target audio paragraph texts and a plurality of target video picture texts corresponding to the language type;
and overlaying the plurality of target audio paragraph texts on a preset first video area of the video playing window, and overlaying the plurality of target video picture texts on a preset second video area of the video playing window.
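As a hedged sketch of this claim, a language transcription instruction can be modeled as a translation pass over both text sets followed by assignment to the two preset overlay regions; translate() is a hypothetical stand-in for an actual translation service.

# Sketch of the language-transcription step; translate() is a hypothetical stub.
def translate(text: str, language: str) -> str:
    return f"[{language}] {text}"  # stand-in for a real translation service

def on_transcription_instruction(language, paragraph_texts, picture_texts):
    overlays = {
        "first_video_area": [translate(t, language) for t in paragraph_texts],
        "second_video_area": [translate(t, language) for t in picture_texts],
    }
    return overlays

print(on_transcription_instruction("en", ["第一段"], ["幻灯片文字"]))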
6. The video information based interaction method according to any one of claims 1 to 5, wherein vectorizing the received question text, and performing similarity matching on the vectorized data stored in the preset vector database based on the obtained vectorized question text to determine the target vectorized data corresponding to the vectorized question text, comprises:
judging whether an input question text is received, and if yes, vectorizing the question text to obtain the vectorized question text;
calculating similarities between the vectorized question text and the vectorized data stored in the preset vector database, and determining the highest calculated similarity as a target similarity;
and determining the vectorized audio paragraph text corresponding to the target similarity as the target vectorized audio paragraph text, and determining the vectorized video picture text corresponding to the target similarity as the target vectorized video picture text.
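One plausible reading of this matching step, shown for illustration only, computes cosine similarity between the vectorized question text and every stored vector and takes the argmax as the target similarity; the stored vectors here are random placeholders.

# Sketch of the claim 6 matching step: cosine similarity plus argmax.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
stored = rng.standard_normal((5, 16))    # vectorized paragraph/picture texts
question_vec = rng.standard_normal(16)   # vectorized question text

sims = np.array([cosine(v, question_vec) for v in stored])
target_idx = int(np.argmax(sims))        # index of the target similarity
print("target similarity:", sims[target_idx], "at row", target_idx)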
7. The method according to claim 6, further comprising, after inputting the target vectorized data and the vectorized question text into the preset language model to generate the answer text corresponding to the question text and presenting the answer text in the video playing window of the video file to be processed:
determining a second target start-stop time corresponding to the target vectorized audio paragraph text and the target vectorized video picture text;
creating a second jump interface corresponding to the answer text, and judging whether a second video jump instruction corresponding to the answer text is received based on the second jump interface;
if yes, jumping the video picture data currently played in the video playing window to second target video picture data corresponding to the second target start-stop time.
8. An interaction device based on video information, applied to a video playing client, the device comprising:
the video separation module is used for performing a separation operation on a video file to be processed, and preprocessing the audio data and video picture data obtained through the separation operation to obtain a plurality of audio paragraph texts corresponding to the audio data and a plurality of video picture texts corresponding to the video picture data;
the data vectorization module is used for vectorizing the plurality of audio paragraph texts and the plurality of video picture texts, and storing the obtained vectorized data in a preset vector database;
the question matching module is used for vectorizing the received question text, and performing similarity matching on the vectorized data stored in the preset vector database based on the obtained vectorized question text to determine target vectorized data corresponding to the vectorized question text;
and the answer presentation module is used for inputting the target vectorized data and the vectorized question text into a preset language model to generate an answer text corresponding to the question text, and presenting the answer text in a video playing window of the video file to be processed.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the video information based interaction method of any of claims 1 to 7.
10. A computer readable storage medium for storing a computer program which, when executed by a processor, implements the video information based interaction method of any of claims 1 to 7.
CN202310949052.2A 2023-07-31 2023-07-31 Interaction method, device, equipment and storage medium based on video information Pending CN116962787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310949052.2A CN116962787A (en) 2023-07-31 2023-07-31 Interaction method, device, equipment and storage medium based on video information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310949052.2A CN116962787A (en) 2023-07-31 2023-07-31 Interaction method, device, equipment and storage medium based on video information

Publications (1)

Publication Number Publication Date
CN116962787A true CN116962787A (en) 2023-10-27

Family

ID=88461636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310949052.2A Pending CN116962787A (en) 2023-07-31 2023-07-31 Interaction method, device, equipment and storage medium based on video information

Country Status (1)

Country Link
CN (1) CN116962787A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117634431A (en) * 2024-01-26 2024-03-01 山东浪潮科学研究院有限公司 Method and system for evaluating text style conversion quality

Similar Documents

Publication Publication Date Title
CN110033659B (en) Remote teaching interaction method, server, terminal and system
CN106331893B (en) Real-time caption presentation method and system
CN107993665B (en) Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
US20180130496A1 (en) Method and system for auto-generation of sketch notes-based visual summary of multimedia content
CN110929094B (en) Video title processing method and device
CN109583443B (en) Video content judgment method based on character recognition
CN113035199B (en) Audio processing method, device, equipment and readable storage medium
CN111711834B (en) Recorded broadcast interactive course generation method and device, storage medium and terminal
CN109754783A (en) Method and apparatus for determining the boundary of audio sentence
CN111522970A (en) Exercise recommendation method, exercise recommendation device, exercise recommendation equipment and storage medium
JP2012181358A (en) Text display time determination device, text display system, method, and program
CN110880324A (en) Voice data processing method and device, storage medium and electronic equipment
CN112329451B (en) Sign language action video generation method, device, equipment and storage medium
CN116962787A (en) Interaction method, device, equipment and storage medium based on video information
CN111614986A (en) Bullet screen generation method, system, equipment and storage medium based on online education
CN112399269A (en) Video segmentation method, device, equipment and storage medium
CN110867187B (en) Voice data processing method and device, storage medium and electronic equipment
CN113411674A (en) Video playing control method and device, electronic equipment and storage medium
CN114339285A (en) Knowledge point processing method, video processing method and device and electronic equipment
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN110059224B (en) Video retrieval method, device and equipment of projector equipment and storage medium
CN113779345B (en) Teaching material generation method and device, computer equipment and storage medium
CN110570838B (en) Voice stream processing method and device
CN110297965B (en) Courseware page display and page set construction method, device, equipment and medium
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination