CN114979540A - Multi-modal content switching method, device, equipment and storage medium - Google Patents

Multi-modal content switching method, device, equipment and storage medium Download PDF

Info

Publication number
CN114979540A
Authority
CN
China
Prior art keywords
modal
file
sequence
topic
playing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210403998.4A
Other languages
Chinese (zh)
Inventor
周东谕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Migu Cultural Technology Co Ltd, China Mobile Communications Group Co Ltd filed Critical Migu Cultural Technology Co Ltd
Priority to CN202210403998.4A priority Critical patent/CN114979540A/en
Publication of CN114979540A publication Critical patent/CN114979540A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/01Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N7/0125Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level one of the standards being a high definition standard
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/268Signal distribution or switching
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/015High-definition television systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a multi-modal content switching method, apparatus, device, and storage medium. The method includes: detecting the playing state of a first file while the first file is being played, the first file being an audio file or a video file; when the playing state is determined to meet a switching condition, acquiring associated modal content of the first file, the file type of the associated modal content being different from that of the first file; and playing the associated modal content. The method and device solve the technical problem that, when the network signal is extremely weak, the terminal device cannot load the video and the user experience suffers.

Description

Multi-modal content switching method, device, equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for multi-modal content switching.
Background
When a user watches film and television works such as TV dramas on a video website with a mobile phone, fluctuations in the communication network environment often cause the video to stall and play unsmoothly, which affects viewing, for example when a high-speed train passes through a tunnel, or when the user is in a weak-signal area such as an elevator or a basement. At present, to improve the viewing experience, multiple versions of a video at different definitions are usually preset; when the user's network fluctuates, the user is prompted to switch to a lower-definition version so that playback can continue as the user moves from a strong network environment to a weak one. However, when the user is in a scene where the network signal is extremely weak or there is no network at all, the terminal device cannot load the video, and the user's viewing experience is degraded.
Disclosure of Invention
The main purpose of the application is to provide a multi-modal content switching method, apparatus, device, and storage medium, aiming to solve the technical problem in the prior art that, when the network signal is extremely weak, the terminal device cannot load the video and the user experience suffers.
In order to achieve the above object, the present application provides a multimodal content switching method, including:
detecting the playing state of a first file in the playing process of the first file; the first file is an audio file or a video file;
when the playing state is determined to meet the switching condition, acquiring associated modal content of the first file, wherein the file type of the associated modal content is different from the file type of the first file;
and playing the associated modal content.
The present application further provides a multi-modal content switching apparatus, where the multi-modal content switching apparatus is a virtual apparatus, and the multi-modal content switching apparatus includes:
the device comprises a detection module, a playing module and a playing module, wherein the detection module is used for detecting the playing state of a first file in the playing process of the first file; the first file is an audio file or a video file;
an obtaining module, configured to obtain associated modal content of the first file when it is determined that the playing state meets a switching condition, where a file type of the associated modal content is different from a file type of the first file;
and the playing module is used for playing the associated modal content.
The present application further provides a multimodal content switching apparatus, the multimodal content switching apparatus being a physical apparatus, the multimodal content switching apparatus including: a memory, a processor, and a multi-modal content switching program stored on the memory, wherein the multi-modal content switching program, when executed by the processor, implements the multi-modal content switching method described above.
The present application also provides a storage medium, which is a computer-readable storage medium, on which a multi-modal content switching program is stored, and the multi-modal content switching program is executed by a processor to implement the steps of the multi-modal content switching method as described above.
Compared with the prior-art scheme in which only switching between different video definitions is possible, so that a video simply stalls when the user cannot load it on a weak network, the multi-modal content switching method, apparatus, device, and storage medium of the present application first detect the playing state of a first file while the first file is being played, the first file being an audio file or a video file. Then, when the playing state is determined to meet a switching condition, the associated modal content of the first file is acquired, where the file type of the associated modal content is different from that of the first file, and the associated modal content is played. In this way, playback of the first file is switched to the corresponding associated modal content; that is, the currently played video or audio is switched to content of another modality. Therefore, when a user is watching a film or television work (video) and cannot continue because playback stalls, the user can jump to associated content of another modality, which improves the fluency with which the user enjoys the work.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below; it is obvious that other drawings can be obtained by those skilled in the art from these drawings without inventive effort.
FIG. 1 is a schematic flow chart diagram of a first embodiment of a multimodal content switching method of the present application;
fig. 2 is a schematic structural diagram of video definition switching according to the present application;
FIG. 3 is a schematic diagram illustrating a modal content switch according to the present application;
FIG. 4 is a flowchart illustrating a second embodiment of the multi-modal content switching method according to the present application;
FIG. 5 is a flowchart illustrating a third embodiment of the multi-modal content switching method according to the present application;
FIG. 6 is a schematic flow chart of the subject model training of the present application;
FIG. 7 is a flowchart illustrating a fourth embodiment of the multi-modal content switching method according to the present application;
fig. 8 is a schematic structural diagram of querying the topic sequence corresponding to a playing position in the present application;
FIG. 9 is a flowchart illustrating a fifth embodiment of the multi-modal content switching method according to the present application;
FIG. 10 is a diagram illustrating the structure of a matching candidate subsequence of the present application;
FIG. 11 is a diagram illustrating the structure of a target subsequence determined in the present application;
fig. 12 is a schematic structural diagram of a multi-modal content switching apparatus in a hardware operating environment according to an embodiment of the present application;
fig. 13 is a functional block diagram of the multi-modal content switching apparatus according to the present application.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In a first embodiment of the multimodal content switching method of the present application, specifically referring to fig. 1, the multimodal content switching method includes:
step S10, in the process of playing a first file, detecting the playing status of the first file; the first file is an audio file or a video file;
in this embodiment, specifically, during the playing process of the first file, the current network state is detected in real time, so as to determine whether to switch to other associated modality content according to the current network state.
Additionally, after step S10 (detecting the playing state of the first file during playback of the first file), the method further includes:
step a1, judging whether the first file has associated modal content;
step a2, if yes, proceeding to step S20: when the playing state is determined to meet the switching condition, acquiring the associated modal content of the first file;
step a3, if not, prompting the target user to perform a definition switching operation.
In this embodiment, it should be noted that, since not all works are provided with multiple modal contents, when it is detected that the first file has no associated modal content, the user is prompted that the current network is poor and may switch the definition of the current playback, so that the first file can continue playing at a reduced definition. Specifically, referring to fig. 2, fig. 2 is a schematic structural diagram of video definition switching in the present application; for example, if a video is being played at 720P ultra-high definition, its definition can be reduced to 280P high definition or 270P standard definition to continue playing.
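A minimal sketch of the branch described in steps a1-a3, assuming the player exposes a lookup of associated modal content (all names and return strings are hypothetical illustrations):

```python
def decide_fallback(first_file_id: str,
                    associated_contents: dict,
                    switch_needed: bool) -> str:
    """Steps a1-a3 as one decision: prefer switching modality; otherwise
    prompt a definition switch so the first file can keep playing."""
    if not switch_needed:
        return "keep-playing"
    candidates = associated_contents.get(first_file_id, [])   # step a1
    if candidates:                                            # step a2 -> step S20
        return f"switch-to:{candidates[0]}"
    return "prompt-definition-switch"                         # step a3 (e.g. 720P -> lower definition)

# Example: a video with no associated modalities falls back to definition switching.
print(decide_fallback("ep01.mp4", {}, switch_needed=True))    # prompt-definition-switch
```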
Step S20, when it is determined that the playing status satisfies the switching condition, acquiring associated modal content of the first file, where a file type of the associated modal content is different from a file type of the first file.
In this embodiment, it should be noted that, at present, a large number of literary works are adapted into video content such as films and TV dramas, and are also produced as audio books (audio playback content) and electronic books (text content). The same literary work may therefore have modal contents of three types, namely video, audio, and text, at the same time. Further, the file of the associated modal content and the first file belong to the same work; that is, in this embodiment, the first file is switched to content of another modality belonging to the same work. The switching condition may include: the playing state being a stalled state; detecting whether the current network is sufficient to stably play video or audio; the target user issuing a click-to-switch instruction on the terminal device; or recognizing a switching instruction generated from a voice command of the target user; and so on.
As one implementation, when the first file currently played on the user's mobile terminal enters a stalled state during playback because of network signal fluctuation, the other associated modal contents of the work corresponding to the first file are acquired. Specifically, the different types of content of each work can be stored in a work database in advance in an associated manner; for example, if the work is Water Margin (Shui Hu Zhuan), the corresponding video file, audio file, and text file are stored in association. Then, during playback of the first file, when it is determined that switching is needed, the other associated modal contents of that work are queried in the work database based on the work to which the first file belongs. In this way, even when the network signal is poor, the user can continue enjoying the work by switching to content of another modality that can be played at a low network speed.
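The "work database" described above can be pictured as a mapping from each work to its files per modality. The sketch below is a minimal in-memory stand-in; the file names and layout are illustrative assumptions, not part of the application:

```python
# Each work stores its video, audio and text files in association.
WORK_DATABASE = {
    "water_margin": {
        "video": "water_margin_ep01.mp4",
        "audio": "water_margin_ch01.mp3",
        "text":  "water_margin.epub",
    },
}

# Reverse index: file -> (work, modality) it belongs to.
FILE_TO_WORK = {path: (work, modality)
                for work, files in WORK_DATABASE.items()
                for modality, path in files.items()}

def get_associated_modal_contents(first_file: str) -> dict:
    """Return the other-modality files of the work the first file belongs to."""
    work, modality = FILE_TO_WORK[first_file]
    return {m: p for m, p in WORK_DATABASE[work].items() if m != modality}

# get_associated_modal_contents("water_margin_ep01.mp4")
#   -> {"audio": "water_margin_ch01.mp3", "text": "water_margin.epub"}
```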
Additionally, to improve the user experience, if the current first file is being played at a low network speed, the current playing state is detected in real time; when the network state is detected to be good, the associated modal content of the current first file is automatically acquired and the content is switched, so that when the network is excellent the content can automatically be switched back to a richer modality, for example from a text file to a video file.
As another implementation, to keep the storyline continuous, the content may be switched to a position aligned with the playing position of the first file. Specifically, when the playing state is determined to meet the switching condition, the playing position of the first file is acquired, and the complete topic sequence corresponding to the first file is queried from a preset multi-modal topic library. It should be noted that the preset multi-modal topic library stores the multi-modal topic sequence corresponding to each work, for example the complete topic sequences corresponding to its video, audio, and text. The multi-modal topic sequence is generated by a multi-modal topic model from the text information of the different modal contents of each work. The multi-modal topic model is an LDA topic model: LDA (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model, comprising a three-layer structure of words, topics, and documents. In this embodiment, based on the text information of the different modal contents of the same work, topic sequences of the different modalities of that work are generated by the multi-modal topic model, where the topic sequences of the different modalities include a video topic sequence, a text topic sequence, and an audio topic sequence, and the multi-modal topic sequences corresponding to the different works are stored in the preset multi-modal topic library. Further, based on the playing position, the topic sequence corresponding to the playing position is determined from the complete topic sequence corresponding to the first file, so that, based on that topic sequence, the target content matching the topic sequence is determined among the plurality of second files of the same work as the associated modal content. In this way, the storyline remains continuous after the content is switched.
Additionally, during switching, if it is detected that the currently played first file has a plurality of associated modal contents, the user's mobile terminal may display the plurality of associated modal contents for the user to select. Referring to fig. 3, fig. 3 is a schematic structural diagram of modal content switching in the present application; if the user selects the electronic-book modality, playback is switched to the modal content corresponding to the electronic book.
And step S30, playing the associated modal content.
Compared with the prior-art scheme in which only switching between different video definitions is possible, so that a video simply pauses when the user cannot load it on a weak network, the multi-modal content switching method, apparatus, device, and storage medium of this embodiment first detect the playing state of a first file while the first file is being played, the first file being an audio file or a video file. Then, when the playing state is determined to meet a switching condition, the associated modal content of the first file is acquired, where the file type of the associated modal content is different from that of the first file, and the associated modal content is played. Playback of the first file is thus switched to the corresponding associated modal content; that is, the currently played video or audio is switched to content of another modality. Therefore, when a user is watching a film or television work (video) and cannot continue because playback stalls, the user can jump to associated content of another modality, which improves the fluency with which the user enjoys the work.
Further, referring to fig. 4, according to the second embodiment of the present application, step S20 (acquiring the associated modal content of the first file) specifically includes:
step S21, obtaining the theme sequence of the first file;
step S22, determining, from a plurality of second files, that the target content matched with the topic sequence is the associated modal content, where file types of the plurality of second files are different from a file type of the first file, and the first file and the second file belong to the same work.
In this embodiment, specifically, the playing position of the first file is acquired, and the first complete topic sequence corresponding to the current first file is queried from a preset multi-modal topic library. It should be noted that the preset multi-modal topic library contains the multi-modal topic sequence corresponding to each work; the multi-modal topic sequence is obtained by converting the text information of the different modalities of that work through the multi-modal topic model. The text information of the different modal contents includes, for example, the audio-book reading script, the e-book text, and the screenplay text corresponding to the video; correspondingly, the multi-modal topic sequence includes an audio topic sequence, a text topic sequence, a video topic sequence, and so on. Further, within the first complete topic sequence, the topic sequence corresponding to the playing position is found according to a preset topic search method: for example, the topic corresponding to the current playing position is determined, and then a preset number of topics are taken forwards or backwards in the first complete topic sequence to form the topic sequence. Further, the second complete topic sequences corresponding to the plurality of second files that belong to the same work as the first file are looked up in the preset multi-modal topic library, an associated topic sequence matching the topic sequence is found in each second complete topic sequence, and the associated modal content corresponding to the associated topic sequence is determined from the second files.
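One way to picture the preset multi-modal topic library and the lookups used in steps S21/S22 is sketched below; the dictionary layout, work name, and topic ids are illustrative assumptions:

```python
# Library: (work, modality) -> complete topic sequence for that modality.
PRESET_TOPIC_LIBRARY = {
    ("water_margin", "video"): ["Z1", "Z1", "Z2", "Z2", "Z3", "Z4"],
    ("water_margin", "text"):  ["Z1", "Z2", "Z2", "Z3", "Z3", "Z4", "Z4"],
    ("water_margin", "audio"): ["Z1", "Z2", "Z3", "Z3", "Z4"],
}

def first_complete_sequence(work: str, modality: str) -> list:
    """Complete topic sequence of the currently played (first) file."""
    return PRESET_TOPIC_LIBRARY[(work, modality)]

def second_complete_sequences(work: str, modality: str) -> dict:
    """Complete topic sequences of the other-modality (second) files of the same work."""
    return {m: seq for (w, m), seq in PRESET_TOPIC_LIBRARY.items()
            if w == work and m != modality}
```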
With this scheme, a multi-modal topic model is obtained by training on contents of different modalities; the text information of the different modal contents of each work is converted by the multi-modal topic model into the corresponding multi-modal topic sequence, and the multi-modal topic sequence of each work is stored in the preset multi-modal topic library. Based on this library, the topic sequence of the first file can be looked up quickly, and the modal topic sequences of the other second files belonging to the same work can be found, which improves the efficiency of modal content switching.
Further, referring to fig. 5, based on the second embodiment of the present application, before the step of determining the theme sequence corresponding to the playing position based on the playing position and the preset multi-modal theme library, the method further includes:
step A10, acquiring text information of different modal contents corresponding to each work;
in this embodiment, it should be noted that, for the same work, text information of different modalities of the work is obtained, for example, video script text information, audio text information, and electronic book text information of the work information are obtained.
Step A20, paragraph dividing is carried out on the text information of different modal contents of each work respectively to obtain a multi-modal text segment sequence of each work;
in this embodiment, it should be noted that the multi-modal text segment sequence is a text segment sequence including a plurality of different modalities, specifically, the text information of the different modalities is divided into a plurality of text segments according to paragraphs, and the text segment sequence corresponding to each modality is formed based on each text segment, for example, the text information of an electronic book is divided into a plurality of text segments according to paragraphs to obtain a text segment sequence of the electronic book, which is defined as D w ={d w1 ,d w2 ,...,d wn And similarly, obtaining an audio text segment sequence of the audio book reading manuscript: d m ={d m1 ,d m2 ,...,d mn D and text segment sequence of video script v ={d v1 ,d v2 ,...,d vn D represents a text segment sequence, and D represents a text segment.
Step A30, merging the multi-modal text segment sequences in the same work to obtain merged text segment sequences;
in this embodiment, it should be noted that, in order to enable the model to recognize the narration modes of text segments in different modalities, the multi-modal text segment sequences in the same work are merged to obtain merged text segment sequences, for example, in watery dynasty, an original book may be more inclined to a language, but a transcript of a film that is filmed is followed by a white language, and in order to enable the model to recognize the 2 narration modes, the original book and the transcript of the filmed film are merged and trained, so that the accuracy of model theme recognition is improved.
Step A40, performing iterative training on a topic model to be trained based on each combined text segment sequence to obtain the multi-modal topic model, and outputting a multi-modal topic sequence of each work, wherein the multi-modal topic sequence comprises a video topic sequence, a text topic sequence and an audio topic sequence;
in this embodiment, it should be noted that the multimodal topic model is an LDA topic model, and LDA (latent Dirichlet allocation) is a document topic generation model, also called a three-layer bayesian probability model, and includes three-layer structures of words, topics, and documents, and a generation process of a text segment has the following rules: in the embodiment, based on text information of different modal contents in the same work, theme sequences of different modalities of the same work are generated through a multi-modal theme model.
Specifically, each merged text segment sequence is input into the topic model to be trained, topic prediction is performed on each text segment in the multi-modal text segment sequence by the topic model to be trained, and the topic probability set of each text segment in the multi-modal text segment sequence is obtained as follows:
P(z|d) = {P(z_1|d), P(z_2|d), ..., P(z_k|d)},
where d denotes a text segment, z denotes a topic, and P(z|d) denotes the probability of a topic given the text segment; the topic probabilities in the topic probability set of each text segment sum to 1, i.e.
P(z_1|d) + P(z_2|d) + ... + P(z_k|d) = 1.
Further, for each text segment in the multi-modal text segment sequence, the topic with the highest probability in its topic probability set is selected as the representative topic of the text segment:
z_i = max({P(z_1|d), P(z_2|d), ..., P(z_k|d)}),
where z_i denotes the topic with the maximum topic probability and max() denotes selecting, from the topic probability set, the topic with the maximum probability as the predicted topic. Then, based on the predicted topic and the preset topic label of each text segment, the model loss between the predicted topic and the topic label is calculated; the topic model to be trained is iteratively optimized based on the model loss, and it is judged whether the optimized topic model to be trained meets the training end condition, where the training end condition includes iteration convergence or the number of iterations reaching a preset training-count threshold. If not, the process returns to the step of iteratively training the topic model to be trained based on each merged text segment sequence, and training continues; if so, the optimized topic model to be trained is taken as the multi-modal topic model. Topic prediction is then performed, based on the multi-modal topic model, on the text information of the different modal contents of the same work to obtain the multi-modal topic sequence of that work, where the multi-modal topic sequence includes a video topic sequence, a text topic sequence, and an audio topic sequence. For example, following the example of step A20, the text segment sequences D_w, D_m, and D_v are converted into topic sequences, denoted Z_w, Z_m, and Z_v; a topic sequence differs from a text segment sequence in that different text segments may be mapped to the same topic. In this way an optimal multi-modal topic model is obtained.
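The sketch below trains an LDA model on a merged segment sequence and converts each modality's segments into a topic sequence stored in a topic library. It uses scikit-learn's LatentDirichletAllocation only as a stand-in for the topic model described above (the application does not prescribe a library, and scikit-learn's LDA is fit by unsupervised variational inference rather than the label-and-loss loop described in the text); the toy corpus, tokenization, topic count, and work name are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy merged corpus: (modality, text segment) pairs for one work. Real inputs
# would be the paragraph segments of the e-book, audio script and screenplay.
merged = [
    ("text",  "heroes gather at the marsh and swear brotherhood"),
    ("text",  "the magistrate sends soldiers to arrest the outlaws"),
    ("audio", "the narrator tells how the heroes gather at the marsh"),
    ("audio", "soldiers arrive to arrest the outlaws at night"),
    ("video", "EXT. MARSH - the heroes gather and swear an oath"),
    ("video", "INT. YAMEN - the magistrate orders the arrest"),
]

texts = [segment for _, segment in merged]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)                 # bag-of-words per segment

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                    # P(z|d) per segment, rows sum to ~1

# Representative topic per segment: the topic with the maximum probability, as above.
representative = doc_topic.argmax(axis=1)

# Split back into per-modality topic sequences (Z_w, Z_m, Z_v) and store them
# in a preset multi-modal topic library keyed by (work, modality).
library = {}
for (modality, _), topic in zip(merged, representative):
    library.setdefault(("water_margin", modality), []).append(f"Z{topic}")

print(library)   # e.g. {("water_margin", "text"): ["Z0", "Z1"], ...}
```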
Step A50, forming the preset multi-modal topic library based on the multi-modal topic sequence of each of the works.
In this embodiment, specifically, the multi-modal topic sequences of the same work are stored in association in the preset multi-modal topic library, so that the topic sequence corresponding to the currently played file can be queried through the preset multi-modal topic library and it can be determined whether the currently played file has topic sequences of other modalities.
Further, referring to fig. 6, fig. 6 is a schematic flow chart of topic model training in the present application. Specifically, the e-book text, the audio-book reading script text (audio text), and the video screenplay text of the same work are obtained and divided into text segments by paragraph, yielding an e-book text segment sequence, an audio-book text segment sequence, and a video text segment sequence. These are merged into one text segment sequence, on which the model is trained to obtain an optimal multi-modal topic model. Then, using the multi-modal topic model, the e-book text segment sequence is converted into an e-book topic sequence, the audio-book text segment sequence into an audio topic sequence, and the video text segment sequence into a video topic sequence.
With this scheme, the text information of the different modal contents of each work is obtained; the text information of the different modal contents of each work is divided by paragraph to obtain the multi-modal text segment sequence of each work; the multi-modal text segment sequences of the same work are merged to obtain a merged text segment sequence; the topic model to be trained is iteratively trained on each merged text segment sequence to obtain the multi-modal topic model, which outputs the multi-modal topic sequence of each work, including a video topic sequence, a text topic sequence, and an audio topic sequence; and the preset multi-modal topic library is formed from the multi-modal topic sequences of the works. Because the topic model is trained on the multi-modal text segment sequences of the same work, the model can learn the text information of the different modalities of that work; converting the multi-modal text segment sequences into multi-modal topic sequences via the topic model and storing them in the preset multi-modal topic library makes it possible to quickly query the topic sequence corresponding to the currently played file and to determine whether topic sequences of other modalities exist for it.
Further, referring to fig. 7, based on the second embodiment of the present application, in another embodiment of the present application, step S21 (obtaining the theme sequence of the first file) includes:
step S211, acquiring the playing position of the first file;
step S212, determining a theme sequence corresponding to the playing position based on the playing position and a preset multi-mode theme library; the preset multi-modal theme library stores a multi-modal theme sequence generated by a multi-modal theme model based on text information of different modal contents in each work, and the multi-modal theme model is obtained by performing iterative training based on pre-collected text information of different modal contents in each work.
In step S212, determining the theme sequence corresponding to the playing position based on the playing position and a preset multi-modal theme library specifically includes:
step S2121, inquiring a first complete topic sequence corresponding to the first file in the preset multi-modal topic library;
step S2122, determining a theme position corresponding to the playing position based on the first complete theme sequence;
step S2123, reversely inquiring a preset number of target topics in the first complete topic sequence based on the topic positions;
step S2124, forming the theme sequence based on each target theme.
In this embodiment, specifically, the first complete topic sequence corresponding to the first file is queried in the preset multi-modal topic library, and the progress position corresponding to the playing position is determined in the first complete topic sequence. After the progress position in the target video topic sequence is determined from the playing position, the topic sequence corresponding to that progress position is queried in the target video topic sequence according to a preset query algorithm, where the preset query algorithm includes querying a preset number of topics forwards or backwards from the progress position. For example, referring to fig. 8, fig. 8 is a schematic structural diagram of querying the topic sequence corresponding to the playing position in the present application. Specifically, the playing position of the video is R1, and the progress position R2 in the target video topic sequence (the first complete topic sequence) is determined from the playing position R1; then, starting from R2, the last 3 topics are taken in the reverse direction (toward the beginning of the first complete topic sequence) to form the topic sequence, which is {Z1, Z2, Z2}.
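A sketch of steps S2121-S2124, mapping the playing position (expressed as a fraction of the total duration) to a topic index and taking the preceding topics; the position-to-index mapping is an assumption, and the preset number 3 follows the example above:

```python
def topic_sequence_at_position(complete_sequence: list,
                               play_fraction: float,
                               preset_count: int = 3) -> list:
    """Locate the topic position matching the playing position, then query the
    preset number of topics in the reverse direction to form the topic sequence."""
    # Topic position corresponding to the playing position (step S2122).
    index = min(int(play_fraction * len(complete_sequence)), len(complete_sequence) - 1)
    start = max(0, index - preset_count + 1)
    return complete_sequence[start:index + 1]            # steps S2123-S2124

video_topics = ["Z0", "Z1", "Z2", "Z2", "Z3", "Z4"]
print(topic_sequence_at_position(video_topics, play_fraction=0.5))  # -> ['Z1', 'Z2', 'Z2']
```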
With this scheme, the first complete topic sequence corresponding to the first file is quickly queried in the preset multi-modal topic library, and the topic sequence corresponding to the playing position is found within it; based on this topic sequence, the playing position of the second file can be aligned with the playing position of the first file, so that the playing progress of the work remains consistent after the content is switched.
Further, referring to fig. 9, based on the second embodiment of the present application, in another embodiment of the present application, step S22 (determining, from a plurality of second files, the target content matching the topic sequence as the associated modal content) includes:
step S221, based on a preset multi-mode theme base, querying a second complete theme sequence corresponding to each of the plurality of second files;
step S222, in each second complete topic sequence, querying an associated topic sequence matched with the topic sequence;
in step S222, in each second complete topic sequence, the querying a related topic sequence matched with the topic sequence includes:
step S2221, querying each candidate subsequence that is the same as the topic sequence in each second complete topic sequence;
step S2222, determining the progress value of each candidate subsequence in each second complete subject sequence;
step S2223, for each second complete topic sequence, respectively comparing the progress value of each candidate subsequence with the progress value of the topic sequence, and determining, based on the comparison result, an associated topic sequence corresponding to each second complete topic sequence.
Step S223, determining associated modal content corresponding to the associated topic sequence from the plurality of second files.
In this embodiment, specifically, first, based on a preset multi-modal topic library, second complete topic sequences corresponding to the plurality of second files are queried, and then the following steps are performed for each second complete topic sequence:
using a character-string search method, the subsequences identical to the topic sequence are queried in the second complete topic sequence to obtain at least one candidate subsequence, and the progress value of each candidate subsequence is recorded. For example, referring to fig. 10, fig. 10 is a schematic structural diagram of matching candidate subsequences in the present application: the topic sequence is {Z1, Z2, Z2}, and all subsequences identical to {Z1, Z2, Z2} are queried in the e-book topic sequence (a second complete topic sequence) to obtain the candidate subsequences.
It should be further noted that, because the complete topic sequence is formed from the topics corresponding to the paragraphs of the whole work, each topic has a progress value relative to the whole topic sequence; for example, with the complete topic sequence taken as 100%, each topic in the complete topic sequence is assigned a progress value in the range [0%, 100%] according to its position in the sequence.
After the candidate subsequences are obtained, the progress value of each candidate subsequence in the second complete topic sequence is determined, the difference between the progress value of each candidate subsequence and the progress value of the topic sequence is calculated, the candidate subsequence with the smallest difference is selected as the target subsequence of that modal topic sequence, and the progress value of the target subsequence is taken as the target progress position. The distance is computed as:
Dis = min(|R_1 - R_n|),
where R_1 denotes the progress value corresponding to the topic sequence in the first complete topic sequence, R_n denotes the progress value of each candidate subsequence in the second complete topic sequence, |R_1 - R_n| denotes the distance between the two, and min denotes selecting the candidate with the smallest |R_1 - R_n|. For example, referring to fig. 11, fig. 11 is a schematic structural diagram of determining the associated topic sequence in the present application: the progress value at the video stall is R1 = 20%, and the progress values of the matching candidate subsequences in the e-book topic sequence are R2 = 25% and R3 = 40%; since R2 - R1 = 5% is smaller than R3 - R1 = 20%, the candidate subsequence with progress value R2 is selected as the associated topic sequence. The progress position in the second complete topic sequence is thereby aligned with the playing position at which the video stalled, so that when jumping across modal contents, playback can start at the progress position closest to the video playing position; the playing progress after switching to another modality is aligned with the playing progress of the first file, the storyline remains continuous, and the user can keep enjoying the work.
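A sketch of the matching described above: find every candidate subsequence equal to the topic sequence in a second complete topic sequence, compute its progress value, and keep the one whose progress value is closest to R1. Taking the progress value as the start index divided by the sequence length is an illustrative convention, not fixed by the application:

```python
def find_candidate_subsequences(second_sequence: list, topic_sequence: list) -> list:
    """Return the progress value of every subsequence equal to `topic_sequence`
    (a plain sliding-window string search over topic ids)."""
    n, k = len(second_sequence), len(topic_sequence)
    return [i / n for i in range(n - k + 1)
            if second_sequence[i:i + k] == topic_sequence]

def best_aligned_progress(second_sequence: list, topic_sequence: list, r1: float):
    """Pick the candidate whose progress value minimises |R1 - Rn|, i.e.
    Dis = min(|R1 - Rn|) over all candidate subsequences."""
    candidates = find_candidate_subsequences(second_sequence, topic_sequence)
    return min(candidates, key=lambda rn: abs(r1 - rn)) if candidates else None

ebook_topics = ["Z0", "Z1", "Z2", "Z2", "Z3", "Z1", "Z2", "Z2", "Z4", "Z5"]
print(best_aligned_progress(ebook_topics, ["Z1", "Z2", "Z2"], r1=0.20))  # -> 0.1
```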
With this scheme, candidate subsequences identical to the topic sequence at the progress position are matched in the second complete topic sequence, and the progress position of the second complete topic sequence is aligned with the playing position at which the video stalled, so that when jumping across modal contents, playback can start at the progress position closest to the playing position of the first file; the storyline remains continuous, the user can keep enjoying the work, and the viewing experience is improved.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a multimodal content switching apparatus in a hardware operating environment according to an embodiment of the present application.
As shown in fig. 12, the multimodal content switching apparatus may include: a processor 1001, such as a CPU, a memory 1005, and a communication bus 1002. The communication bus 1002 is used for realizing connection communication between the processor 1001 and the memory 1005. The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a memory device separate from the processor 1001 described above.
Optionally, the multimodal content switching apparatus may further include a rectangular user interface, a network interface, a camera, RF (Radio Frequency) circuitry, sensors, audio circuitry, a WiFi module, and the like. The rectangular user interface may comprise a Display screen (Display), an input sub-module such as a Keyboard (Keyboard), and the optional rectangular user interface may also comprise a standard wired interface, a wireless interface. The network interface may optionally include a standard wired interface, a wireless interface (e.g., WIFI interface).
Those skilled in the art will appreciate that the multimodal content switching apparatus structure shown in fig. 12 does not constitute a limitation of the multimodal content switching apparatus, which may include more or fewer components than shown, a combination of some components, or a different arrangement of components.
As shown in fig. 12, the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, and a multimodal content switching program. The operating system is a program that manages and controls the hardware and software resources of the multimodal content switching apparatus and supports the running of the multimodal content switching program as well as other software and/or programs. The network communication module is used to implement communication among the components inside the memory 1005 and between them and other hardware and software in the multimodal content switching apparatus.
In the multimodal content switching apparatus shown in fig. 12, the processor 1001 is configured to execute a multimodal content switching program stored in the memory 1005, and implement the steps of the multimodal content switching method described in any one of the above.
The specific implementation of the multimodal content switching apparatus of the present application is substantially the same as that of each embodiment of the multimodal content switching method, and is not described herein again.
In addition, referring to fig. 13, fig. 13 is a schematic functional module diagram of the multi-modal content switching apparatus according to the present application, and the present application further provides a multi-modal content switching apparatus, including:
the device comprises a detection module, a playing module and a playing module, wherein the detection module is used for detecting the playing state of a first file in the playing process of the first file; the first file is an audio file or a video file;
an obtaining module, configured to obtain associated modal content of the first file when it is determined that the playing state meets a switching condition, where a file type of the associated modal content is different from a file type of the first file;
and the playing module is used for playing the associated modal content.
Optionally, the obtaining module is further configured to:
acquiring a theme sequence of the first file;
and determining target content matched with the theme sequence as the associated modal content from a plurality of second files, wherein the file types of the plurality of second files are different from the file type of the first file, and the first file and the second file belong to the same work.
Optionally, the obtaining module is further configured to:
acquiring the playing position of the first file;
determining a theme sequence corresponding to the playing position based on the playing position and a preset multi-mode theme library;
the preset multi-modal theme library stores a multi-modal theme sequence generated by a multi-modal theme model based on text information of different modal contents in each work, and the multi-modal theme model is obtained by performing iterative training based on pre-collected text information of different modal contents in each work.
Optionally, the multimodal content switching apparatus is further configured to:
acquiring text information of different modal contents corresponding to each work;
respectively carrying out paragraph division on text information of different modal contents of each work to obtain a multi-modal text segment sequence of each work;
combining the multi-mode text segment sequences in the same work to obtain combined text segment sequences;
performing iterative training on a topic model to be trained based on each combined text segment sequence to obtain the multi-modal topic model, and outputting a multi-modal topic sequence of each work, wherein the multi-modal topic sequence comprises a video topic sequence, a text topic sequence and an audio topic sequence;
and forming the preset multi-modal theme library based on the multi-modal theme sequence of each work.
Optionally, the multimodal content switching apparatus is further configured to:
inquiring a first complete topic sequence corresponding to the first file in the preset multi-modal topic library;
determining a theme position corresponding to the playing position based on the first complete theme sequence;
reversely inquiring a preset number of target topics in the first complete topic sequence based on the topic positions;
forming the topic sequence based on each of the target topics.
Optionally, the multimodal content switching apparatus is further configured to:
querying second complete topic sequences respectively corresponding to the second files based on a preset multi-mode topic library;
respectively inquiring the associated topic sequences matched with the topic sequences in each second complete topic sequence;
and determining the associated modal content corresponding to the associated topic sequence from a plurality of second files.
Optionally, the multimodal content switching apparatus is further configured to:
respectively inquiring each candidate subsequence which is the same as the topic sequence in each second complete topic sequence;
determining a progress value of each candidate subsequence in each second complete subject sequence;
and respectively comparing the progress value of each candidate subsequence with the progress value of the topic sequence aiming at each second complete topic sequence, and determining an associated topic sequence corresponding to each second complete topic sequence based on the comparison result.
The specific implementation of the multi-modal content switching apparatus of the present application is substantially the same as the embodiments of the multi-modal content switching method, and is not described herein again.
The present application provides a storage medium, which is a computer-readable storage medium, and the computer-readable storage medium stores one or more programs, which can be further executed by one or more processors for implementing the steps of the multimodal content switching method as described in any one of the above.
The specific implementation of the computer-readable storage medium of the present application is substantially the same as the embodiments of the multi-modal content switching method, and is not described herein again.
The above description is only a preferred embodiment of the present application and is not intended to limit the scope of the present application. Any equivalent structure or equivalent process made using the content of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present application.

Claims (10)

1. A multimodal content switching method, characterized in that the multimodal content switching method comprises:
detecting the playing state of a first file in the playing process of the first file; the first file is an audio file or a video file;
when the playing state is determined to meet the switching condition, acquiring associated modal content of the first file, wherein the file type of the associated modal content is different from the file type of the first file;
and playing the associated modal content.
2. The multimodal content switching method of claim 1, wherein said obtaining the associated modal content of the first document comprises:
acquiring a theme sequence of the first file;
and determining target content matched with the theme sequence as the associated modal content from a plurality of second files, wherein the file types of the plurality of second files are different from the file type of the first file, and the first file and the second file belong to the same work.
3. The multimodal content switching method of claim 2, wherein said obtaining the subject sequence of the first document comprises:
acquiring the playing position of the first file;
determining a theme sequence corresponding to the playing position based on the playing position and a preset multi-mode theme library;
the preset multi-mode theme bank stores a multi-mode theme sequence generated by a multi-mode theme model based on text information of different modal contents in each work, and the multi-mode theme model is obtained by performing iterative training based on the pre-collected text information of the different modal contents in each work.
4. The method of claim 3, wherein prior to the step of determining the theme sequence corresponding to the playback position based on the playback position and a preset multi-modal theme library, the method further comprises:
acquiring text information of different modal contents corresponding to each work;
respectively carrying out paragraph division on text information of different modal contents of each work to obtain a multi-modal text segment sequence of each work;
combining the multi-mode text segment sequences in the same work to obtain combined text segment sequences;
performing iterative training on a topic model to be trained based on each combined text segment sequence to obtain the multi-modal topic model, and outputting a multi-modal topic sequence of each work, wherein the multi-modal topic sequence comprises a video topic sequence, a text topic sequence and an audio topic sequence;
and forming the preset multi-modal subject library based on the multi-modal subject sequence of each work.
5. The method according to claim 3, wherein the determining the theme sequence corresponding to the playing position based on the playing position and a preset multi-modal theme library comprises:
inquiring a first complete topic sequence corresponding to the first file in the preset multi-modal topic library;
determining a theme position corresponding to the playing position based on the first complete theme sequence;
reversely inquiring a preset number of target topics in the first complete topic sequence based on the topic positions;
forming the topic sequence based on each of the target topics.
6. The method of multimodal content switching according to claim 2, wherein said determining from a plurality of second documents that the target content matching the subject sequence is the associated modal content comprises:
querying second complete topic sequences respectively corresponding to the second files based on a preset multi-mode topic library;
respectively inquiring the associated topic sequences matched with the topic sequences in each second complete topic sequence;
and determining the associated modal content corresponding to the associated topic sequence from a plurality of second files.
7. The multimodal content switching method according to claim 6, wherein said querying, in each of said second complete topic sequences, an associated topic sequence that matches said topic sequence, respectively, comprises:
respectively inquiring each candidate subsequence which is the same as the topic sequence in each second complete topic sequence;
determining a progress value of each candidate subsequence in each second complete subject sequence;
and respectively comparing the progress value of each candidate subsequence with the progress value of the topic sequence aiming at each second complete topic sequence, and determining an associated topic sequence corresponding to each second complete topic sequence based on the comparison result.
8. A multimodal content switching apparatus, wherein the multimodal content switching apparatus comprises:
the device comprises a detection module, a playing module and a playing module, wherein the detection module is used for detecting the playing state of a first file in the playing process of the first file; the first file is an audio file or a video file;
an obtaining module, configured to obtain associated modal content of the first file when it is determined that the playing state meets a switching condition, where a file type of the associated modal content is different from a file type of the first file;
and the playing module is used for playing the associated modal content.
9. A multimodal content switching apparatus, wherein the multimodal content switching apparatus comprises: a memory, a processor, and a multimodal content switching program stored on the memory,
the multi-modal content switching program is executed by the processor to implement the multi-modal content switching method as recited in any one of claims 1 to 7.
10. A storage medium which is a computer-readable storage medium, characterized in that a multi-modal content switching program is stored on the computer-readable storage medium, and the multi-modal content switching program is executed by a processor to implement the multi-modal content switching method according to any one of claims 1 to 7.
CN202210403998.4A 2022-04-18 2022-04-18 Multi-modal content switching method, device, equipment and storage medium Pending CN114979540A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210403998.4A CN114979540A (en) 2022-04-18 2022-04-18 Multi-modal content switching method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210403998.4A CN114979540A (en) 2022-04-18 2022-04-18 Multi-modal content switching method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114979540A true CN114979540A (en) 2022-08-30

Family

ID=82976619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210403998.4A Pending CN114979540A (en) 2022-04-18 2022-04-18 Multi-modal content switching method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114979540A (en)

Similar Documents

Publication Publication Date Title
JP7080999B2 (en) Search page Interaction methods, devices, terminals and storage media
US10642892B2 (en) Video search method and apparatus
US10417498B2 (en) Method and system for multi-modal fusion model
US9412363B2 (en) Model based approach for on-screen item selection and disambiguation
WO2020007012A1 (en) Method and device for displaying search page, terminal, and storage medium
US11381879B2 (en) Voice recognition system, voice recognition server and control method of display apparatus for providing voice recognition function based on usage status
US9298287B2 (en) Combined activation for natural user interface systems
US9148619B2 (en) Music soundtrack recommendation engine for videos
US9317468B2 (en) Personal content streams based on user-topic profiles
US20190349641A1 (en) Content providing server, content providing terminal and content providing method
CN107527619B (en) Method and device for positioning voice control service
US10783885B2 (en) Image display device, method for driving the same, and computer readable recording medium
US20130346066A1 (en) Joint Decoding of Words and Tags for Conversational Understanding
CN110532433B (en) Entity identification method and device for video scene, electronic equipment and medium
KR20210000326A (en) Mobile video search
US11463776B2 (en) Video playback processing method, terminal device, server, and storage medium
US20200236421A1 (en) Extracting Session Information From Video Content To Facilitate Seeking
CN117370603A (en) Method, system, and medium for modifying presentation of video content on a user device based on a consumption mode of the user device
US20190332345A1 (en) Electronic device and control method thereof
CN112291612A (en) Video and audio matching method and device, storage medium and electronic equipment
CN112242140A (en) Intelligent device control method and device, electronic device and storage medium
CN114979540A (en) Multi-modal content switching method, device, equipment and storage medium
CN111539202A (en) Method, equipment and system for disambiguating natural language content title
CN111079422A (en) Keyword extraction method, device and storage medium
CN113535969B (en) Corpus expansion method, corpus expansion device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination