CN117035004A - Text, picture and video generation method and system based on multi-modal learning technology - Google Patents

Text, picture and video generation method and system based on multi-modal learning technology

Info

Publication number
CN117035004A
Authority
CN
China
Prior art keywords
text
picture
modal
video
task
Prior art date
Legal status
Granted
Application number
CN202310912800.XA
Other languages
Chinese (zh)
Other versions
CN117035004B (en)
Inventor
江何
周鑫
史普力
周训游
刘权震
韩立群
Current Assignee
Beijing Testor Technology Co ltd
Original Assignee
Beijing Testor Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Testor Technology Co ltd filed Critical Beijing Testor Technology Co ltd
Priority to CN202310912800.XA
Publication of CN117035004A
Application granted
Publication of CN117035004B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a text, picture and video generation method and system based on a multi-modal learning technology. The method comprises the following steps: respectively acquiring multi-modal data corresponding to text, pictures and video, and generating learning features from the multi-modal data; training a preset neural network learning model on the respective learning features of the text, pictures and video to generate a multi-modal learning model; receiving a decision task issued by a user, and parsing the task vector and decision instruction of the decision task; and adaptively processing the task vector with the multi-modal learning model, based on the decision instruction, to generate real-time text, a real-time picture or a real-time video. By separately acquiring the multi-modal features of text, pictures and video, the learning model can model subsequently submitted vector objects and present them at multiple levels, which improves users' cognition of different things, raises learning efficiency, and enhances the user experience.

Description

Text, picture and video generation method and system based on multi-modal learning technology
Technical Field
The application relates to the technical field of data processing, in particular to a method and a system for generating texts, pictures and videos based on a multi-modal learning technology.
Background
Multi-modal learning is machine learning that uses multi-modal information. A modality is a form in which something is presented; put another way, a modality is a particular type of information, or a particular representation of information. Visual, auditory and tactile information each constitute a modality, and the human brain can capture information from multiple modalities simultaneously, then process and integrate it to complete cognition and execution tasks. The same thing can be presented in multiple modalities, for example as text, pictures or video. At present, however, there is no method for generating text, pictures or video of a thing through multi-modal learning techniques; people's cognition of things is therefore limited, and the user experience suffers.
Disclosure of Invention
In view of the above-mentioned problems, the present application provides a method and a system for generating text, pictures and video based on a multi-modal learning technique, which address the problem noted in the background art: no method yet exists for generating text, pictures or video of things through multi-modal learning techniques, so people's cognition of things is limited and the user experience suffers.
A text, picture and video generation method based on a multi-modal learning technology comprises the following steps:
respectively acquiring multi-mode data corresponding to a text, a picture and a video, and generating learning characteristics according to the multi-mode data;
training a preset neural network learning model according to respective learning characteristics of the text, the picture and the video to generate a multi-modal learning model;
receiving a decision task issued by a user, and analyzing a task vector and a decision instruction of the decision task;
and carrying out self-adaptive processing on the task vector by utilizing a multi-mode learning model based on the decision instruction so as to generate real-time text/real-time picture or real-time video.
Preferably, the acquiring multi-modal data corresponding to the text, the picture and the video respectively, generating learning features according to the multi-modal data, includes:
determining respective expansion modes of a text, a picture and a video, and acquiring standard mode data vector expression of the expansion modes;
acquiring the modal coding vectors of the text, the picture and the video on different expansion modes based on the standard modal data vector expression;
integrating the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
and extracting morphological characteristics of the multi-modal data corresponding to the text, the picture and the video, and acquiring multi-modal learning characteristics of the text, the picture and the video according to the morphological characteristics.
Preferably, the training the preset neural network learning model according to the learning characteristics of each of the text, the picture and the video to generate the multi-modal learning model includes:
generating a training set according to respective learning characteristics of the text, the picture and the video;
acquiring a preset neural network learning model, setting initial model parameters of the model, and training the preset neural network learning model to generate a first learning model after the setting is finished;
acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, and selectively optimizing the first learning model based on the model precision to acquire a second learning model;
and confirming the second learning model as a multi-modal learning model.
Preferably, the receiving the decision task issued by the user, analyzing the task vector and the decision instruction of the decision task, includes:
receiving a decision task issued by a user through a task platform, acquiring a link data packet corresponding to the decision task, and decompressing the link data packet;
analyzing the decompressed link data packet to obtain a task vector;
acquiring a task instruction corresponding to the decision task, and determining a decision mode for a task vector according to the task instruction;
and calling a preset mode instruction sample, and filling the decision mode into the preset mode instruction sample to generate a decision instruction.
Preferably, the adaptive processing of task vectors based on decision instructions using a multi-modal learning model to generate real-time text/real-time pictures or real-time video includes:
determining a modal expression processing mode for the task vector based on the decision instruction, and acquiring modal parameters corresponding to the modal expression processing mode;
acquiring multimode corpus corresponding to the task vector;
calling a multi-modal learning model to model the multi-modal corpus based on the modal parameters, and obtaining a modeling result;
and generating real-time text/real-time pictures or real-time videos according to the modeling result.
A text, picture, video generation system based on multimodal learning technique, the system comprising:
the first generation module is used for respectively acquiring multi-mode data corresponding to the text, the picture and the video and generating learning characteristics according to the multi-mode data;
the second generation module is used for training a preset neural network learning model according to the respective learning characteristics of the text, the picture and the video so as to generate a multi-modal learning model;
the analysis module is used for receiving a decision task issued by a user and analyzing a task vector and a decision instruction of the decision task;
and the third generation module is used for carrying out self-adaptive processing on the task vector by utilizing the multi-mode learning model based on the decision instruction so as to generate real-time text/real-time picture or real-time video.
Preferably, the first generating module includes:
the first determining submodule is used for determining respective expansion modes of texts, pictures and videos and obtaining standard mode data vector expression of the expansion modes;
the first acquisition submodule is used for acquiring the modal coding vectors of the text, the picture and the video on different expansion modes respectively based on the standard modal data vector expression;
the first generation sub-module is used for integrating the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
the second obtaining submodule is used for extracting morphological characteristics of the multi-mode data corresponding to the text, the picture and the video respectively, and obtaining multi-mode learning characteristics of the text, the picture and the video according to the morphological characteristics.
Preferably, the second generating module includes:
the second generation submodule is used for generating a training set according to the respective learning characteristics of the text, the picture and the video;
the third generation sub-module is used for acquiring a preset neural network learning model and setting initial model parameters of the preset neural network learning model, and after the setting is finished, the preset neural network learning model is trained to generate a first learning model;
the optimization sub-module is used for acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, selectively optimizing the first learning model based on the model precision, and acquiring a second learning model;
and the confirmation sub-module is used for confirming the second learning model as a multi-mode learning model.
Preferably, the parsing module includes:
the third acquisition sub-module is used for receiving a decision task issued by a user through the task platform, acquiring a link data packet corresponding to the decision task and decompressing the link data packet;
the analysis sub-module is used for analyzing the decompressed link data packet to obtain a task vector;
the second determining submodule is used for acquiring a task instruction corresponding to the decision task and determining a decision mode for a task vector according to the task instruction;
and the fourth generation sub-module is used for retrieving a preset mode instruction sample and filling the decision mode into the preset mode instruction sample to generate a decision instruction.
Preferably, the third generating module includes:
the third determining submodule is used for determining a mode expression processing mode for the task vector based on the decision instruction and acquiring mode parameters corresponding to the mode expression processing mode;
the third acquisition sub-module is used for acquiring the multimode corpus corresponding to the task vector;
the modeling module is used for calling a multi-mode learning model to model the multi-mode corpus based on the mode parameters, and obtaining a modeling result;
and the fifth generation sub-module is used for generating real-time text/real-time pictures or real-time videos according to the modeling result.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical scheme of the application is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate the application and together with the embodiments of the application, serve to explain the application.
FIG. 1 is a workflow diagram of a method for generating text, pictures and video based on a multi-modal learning technique;
FIG. 2 is another workflow diagram of a text, picture, video generation method based on a multi-modal learning technique provided by the present application;
fig. 3 is a schematic structural diagram of a text, picture and video generating system based on a multi-mode learning technology provided by the application;
fig. 4 is a schematic structural diagram of a first generating module in a text, picture and video generating system based on a multi-mode learning technology.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Multi-modal learning is machine learning that uses multi-modal information. A modality is a form in which something is presented; put another way, a modality is a particular type of information, or a particular representation of information. People subconsciously associate modalities with the multi-channel senses (vision, hearing, touch and so on). People perceive the world through senses such as vision, touch, hearing and smell, and different senses reflect the intrinsic properties of the same thing from different sides. Visual, auditory and tactile information each constitute a modality, and the human brain can capture information from multiple modalities simultaneously, then process and integrate it to complete cognition and execution tasks. The same thing can be presented in multiple modalities, for example as text, pictures or video. At present, however, there is no method for generating text, pictures or video of a thing through multi-modal learning techniques; people's cognition of things is therefore limited, and the user experience suffers. To solve the above problems, this embodiment discloses a text, picture and video generation method based on a multi-modal learning technique.
A text, picture and video generation method based on a multi-modal learning technology, as shown in figure 1, comprises the following steps:
step S101, respectively acquiring multi-mode data corresponding to texts, pictures and videos, and generating learning features according to the multi-mode data;
step S102, training a preset neural network learning model according to respective learning characteristics of texts, pictures and videos to generate a multi-modal learning model;
step S103, receiving a decision task issued by a user, and analyzing a task vector and a decision instruction of the decision task;
step S104, performing self-adaptive processing on the task vector by utilizing a multi-mode learning model based on the decision instruction to generate a real-time text/real-time picture or a real-time video.
In this embodiment, the multi-modal data is represented as modal data in which text, pictures, and video are each converted into other formats;
in this embodiment, the learning features are represented as steady-state learning features of respective modalities of text, picture, and video;
in this embodiment, the decision task is represented as a decision task for performing modal transformation on the real-time parameter vector;
in this embodiment, the task vector is expressed as an entity parameter vector corresponding to the task;
in this embodiment, the decision instruction is represented as a decision instruction issued by the user to perform modal transformation on the entity parameter vector.
The working principle of the technical scheme is as follows: respectively acquiring multi-mode data corresponding to a text, a picture and a video, and generating learning characteristics according to the multi-mode data; training a preset neural network learning model according to respective learning characteristics of the text, the picture and the video to generate a multi-modal learning model; receiving a decision task issued by a user, and analyzing a task vector and a decision instruction of the decision task; and carrying out self-adaptive processing on the task vector by utilizing a multi-mode learning model based on the decision instruction so as to generate real-time text/real-time picture or real-time video.
The beneficial effects of the technical scheme are as follows: by separately acquiring the multi-modal features of text, pictures and video, the learning model can model subsequently submitted vector objects and present them at multiple levels. This improves users' cognition of different things and their learning efficiency, enhances the experience, and solves the problem in the prior art that no method exists for generating text, pictures or video of things through multi-modal learning technology, which limits people's cognition of things and degrades the user experience.
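For readers who prefer code to prose, steps S101 to S104 can be read as the following minimal Python sketch. Every name in it (acquire_multimodal_data, DecisionTask, and so on) is a hypothetical stand-in chosen for illustration; the patent itself does not prescribe an implementation.

```python
# Hypothetical end-to-end sketch of steps S101-S104; names are illustrative only.
from dataclasses import dataclass

@dataclass
class DecisionTask:
    task_vector: list          # entity parameter vector parsed from the link data packet
    decision_instruction: str  # e.g. "render as real-time video"

def acquire_multimodal_data(source):
    """S101: collect modal data for text, picture and video and derive learning features."""
    return {"text": [0.1, 0.2], "picture": [0.3, 0.4], "video": [0.5, 0.6]}  # placeholder features

def train_multimodal_model(features):
    """S102: train a preset neural-network learning model on the per-modality features."""
    return lambda vec, instruction: f"{instruction}: generated from {vec}"   # stand-in model

def parse_decision_task(raw_packet):
    """S103: parse the task vector and decision instruction from a user-issued task."""
    return DecisionTask(task_vector=[1.0, 2.0], decision_instruction="real-time picture")

def adaptive_generate(model, task):
    """S104: adaptively process the task vector to emit real-time text/picture/video."""
    return model(task.task_vector, task.decision_instruction)

features = acquire_multimodal_data(source=None)
model = train_multimodal_model(features)
task = parse_decision_task(raw_packet=b"...")
print(adaptive_generate(model, task))
```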
In one embodiment, the obtaining multi-modal data corresponding to the text, the picture and the video respectively, generating the learning feature according to the multi-modal data, includes:
determining respective expansion modes of a text, a picture and a video, and acquiring standard mode data vector expression of the expansion modes;
acquiring the modal coding vectors of the text, the picture and the video on different expansion modes based on the standard modal data vector expression;
integrating the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
and extracting morphological characteristics of the multi-modal data corresponding to the text, the picture and the video, and acquiring multi-modal learning characteristics of the text, the picture and the video according to the morphological characteristics.
In this embodiment, the expansion mode is expressed as a representation of each of text, picture, and video in other expansion forms.
The beneficial effects of the technical scheme are as follows: determining the respective expansion modalities of text, pictures and video establishes which modalities each medium can be converted into, so the acquired multi-modal data has better reference value and the data precision and accuracy are improved. Furthermore, deriving the learning features from the morphological characteristics of the multi-modal data allows the features to be acquired quickly and in a targeted manner based on the morphological parameters of the different modalities, which improves working efficiency.
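As a rough illustration of this sub-flow, the sketch below encodes each medium against its expansion modalities, integrates the encodings, and summarises them with toy "morphological" statistics. The modality lists, vector dimensions and statistics are assumptions, not part of the claimed method.

```python
import numpy as np

# Hypothetical expansion modalities per medium; the patent does not enumerate them.
EXPANSION_MODES = {
    "text":    ["picture", "video"],
    "picture": ["text", "video"],
    "video":   ["text", "picture"],
}

def standard_modal_expression(mode, dim=8):
    """Stand-in for the 'standard modal data vector expression' of an expansion modality."""
    rng = np.random.default_rng(abs(hash(mode)) % (2**32))
    return rng.normal(size=dim)

def modal_encoding_vectors(medium):
    """Encode one medium against each of its expansion modalities."""
    return [standard_modal_expression(m) for m in EXPANSION_MODES[medium]]

def integrate(vectors):
    """Integrate the per-modality encodings into one multi-modal data vector."""
    return np.concatenate(vectors)

def morphological_features(multimodal_vec):
    """Toy 'morphological' summary: mean, spread and extremes of the integrated vector."""
    return np.array([multimodal_vec.mean(), multimodal_vec.std(),
                     multimodal_vec.min(), multimodal_vec.max()])

learning_features = {
    medium: morphological_features(integrate(modal_encoding_vectors(medium)))
    for medium in EXPANSION_MODES
}
print(learning_features["text"])
```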
In one embodiment, the training the preset neural network learning model according to the respective learning characteristics of the text, the picture and the video to generate the multi-modal learning model includes:
generating a training set according to respective learning characteristics of the text, the picture and the video;
acquiring a preset neural network learning model, setting initial model parameters of the model, and training the preset neural network learning model to generate a first learning model after the setting is finished;
acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, and selectively optimizing the first learning model based on the model precision to acquire a second learning model;
and confirming the second learning model as a multi-modal learning model.
The beneficial effects of the technical scheme are as follows: testing the precision of the learning model guarantees the model's accuracy, improves its stability, and lays the foundation for the subsequent modeling of target objects.
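A minimal sketch of the train-test-selectively-optimize loop follows, using a generic PyTorch MLP as the "preset neural network learning model". The random data, the 0.9 precision threshold and the retraining schedule are illustrative assumptions only.

```python
import torch
from torch import nn

def train(model, x, y, epochs=50, lr=1e-3):
    """Train the preset model on the training set built from the learning features."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model

def accuracy(model, x, y):
    """Model precision measured on the preset test sample."""
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

x_train, y_train = torch.randn(256, 12), torch.randint(0, 3, (256,))   # stand-in training set
x_test,  y_test  = torch.randn(64, 12),  torch.randint(0, 3, (64,))    # stand-in preset test sample

model = nn.Sequential(nn.Linear(12, 32), nn.ReLU(), nn.Linear(32, 3))  # initial model parameters
first_model = train(model, x_train, y_train)                           # first learning model

# Selective optimisation: refine only when the measured precision misses an assumed target.
if accuracy(first_model, x_test, y_test) < 0.9:
    second_model = train(first_model, x_train, y_train, epochs=100, lr=5e-4)
else:
    second_model = first_model   # confirmed as the multi-modal learning model
```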
In one embodiment, as shown in fig. 2, the receiving the decision task issued by the user, analyzing the task vector and the decision instruction of the decision task, includes:
step S201, receiving a decision task issued by a user through a task platform, acquiring a link data packet corresponding to the decision task, and decompressing the link data packet;
step S202, analyzing the decompressed link data packet to obtain a task vector;
step 203, acquiring a task instruction corresponding to the decision task, and determining a decision mode for a task vector according to the task instruction;
step S204, a preset mode instruction sample is called, and the decision mode is filled into the preset mode instruction sample to generate a decision instruction.
The beneficial effects of the technical scheme are as follows: obtaining the task vector from the link data packet determines the subject matter to be modeled, which improves working efficiency. Furthermore, generating the decision instruction quickly by retrieving a preset modal instruction sample saves instruction-generation time, further improving working efficiency and practicability.
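The parsing flow can be illustrated as below, assuming, purely for demonstration, that the link data packet is zlib-compressed JSON and that the preset modal instruction sample is a string template; the patent does not specify the wire format.

```python
import json
import zlib

# Hypothetical preset modal instruction sample; the decision mode is filled into it.
INSTRUCTION_TEMPLATE = "render task vector as {mode} in real time"

def parse_decision_task(link_packet: bytes):
    payload = json.loads(zlib.decompress(link_packet))   # decompress, then parse the packet
    task_vector = payload["task_vector"]                 # entity parameter vector
    # Map the task instruction to a decision mode (codes are illustrative assumptions).
    decision_mode = {"T": "text", "P": "picture", "V": "video"}[payload["task_instruction"]]
    decision_instruction = INSTRUCTION_TEMPLATE.format(mode=decision_mode)
    return task_vector, decision_instruction

packet = zlib.compress(json.dumps({"task_vector": [0.4, 1.7], "task_instruction": "V"}).encode())
print(parse_decision_task(packet))   # ([0.4, 1.7], 'render task vector as video in real time')
```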
In this embodiment, before receiving the decision task issued by the user through the task platform, the method further includes:
receiving a task execution request sent by a user through a task platform, responding to the task execution request, and obtaining the equipment type of the task execution request sent by the user;
acquiring a plurality of protocol buffers and annotation files thereof according to the equipment type, and feeding back the plurality of protocol buffers and the annotation files thereof to user equipment;
receiving a target protocol buffer fed back by user equipment, and obtaining a buffer file corresponding to the target protocol buffer;
determining a protocol security event according to the buffer file, and determining a plurality of attack chain directions and attack stage parameters of each attack chain direction based on the protocol security event;
acquiring network attributes of user equipment, and determining a security connection strategy for the user equipment according to the network attributes, a plurality of attack chain directions and attack stage parameters of each attack chain direction;
configuring target protocol parameters of a target protocol buffer according to a secure connection policy for user equipment;
the user equipment is connected through the configured target protocol buffer, a plurality of data files transmitted by the user equipment are received, virus detection is carried out on the data files, and a detection result is obtained;
if the detection result shows that the data file is virus-free, receiving, through the task platform, the decision task with its attachment issued by the user;
if the detection result shows that the data file contains a virus, setting up an intermediate database based on a data sharing mechanism and a security protection mechanism;
and constructing a link between the task platform and the intermediate database, receiving, through the task platform, the decision task with its attachment issued by the user, and transmitting the attachment with priority to the intermediate database for subsequent retrieval.
In this embodiment, the device type represents the device medium from which the user issues the task execution request, for example: a mobile phone, a notebook computer, a tablet computer, etc.;
In this embodiment, a protocol buffer and its annotation file represent a network protocol available for a data connection with the user equipment, together with an annotation file describing the protocol's connection conditions;
In this embodiment, a protocol security event is represented as a security event affecting the communication protocol;
In this embodiment, an attack chain direction is expressed as a direction from which an attack chain targets the server hosting the task platform, for example: software or hardware;
In this embodiment, the attack stage parameters are expressed as the distribution of the stages of an overall attack across the different attack directions, together with the attack element set parameters of each stage;
In this embodiment, the network attribute is represented as the network type attribute of the network currently used by the user equipment, for example: a public network or an operator network.
The beneficial effects of the technical scheme are as follows: configuring the protocol parameters of the data connection protocol used with the user equipment ensures the security and stability of the data transmission process. Furthermore, building an intermediate database to store the attachments carried by tasks prevents a virus-infected attachment from causing damage, while allowing the attachments to be security-checked through the intermediate database, which improves practicability and stability.
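The virus-screening branch can be sketched as follows. scan_attachment() is a placeholder for whatever detection engine is actually deployed, and QUARANTINE_DB stands in for the intermediate database with its sharing and protection mechanisms; both are assumptions for illustration.

```python
# Hedged sketch of the pre-reception safeguards; all names here are hypothetical.
QUARANTINE_DB = []   # stand-in for the intermediate database

def scan_attachment(data: bytes) -> bool:
    """Placeholder detector: returns True when the payload looks clean."""
    return b"EICAR" not in data   # toy heuristic, not a real virus scanner

def receive_task(task: dict, attachment: bytes) -> dict:
    if scan_attachment(attachment):
        task["attachment"] = attachment                    # clean: accept the attachment directly
    else:
        QUARANTINE_DB.append(attachment)                   # infected: divert to the intermediate database
        task["attachment_ref"] = len(QUARANTINE_DB) - 1    # keep a reference for subsequent retrieval
    return task

print(receive_task({"name": "demo"}, b"harmless bytes"))
```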
In one embodiment, adaptively processing task vectors to generate real-time text/real-time pictures or real-time video using a multi-modal learning model based on decision instructions includes:
determining a modal expression processing mode for the task vector based on the decision instruction, and acquiring modal parameters corresponding to the modal expression processing mode;
acquiring multimode corpus corresponding to the task vector;
calling a multi-modal learning model to model the multi-modal corpus based on the modal parameters, and obtaining a modeling result;
and generating real-time text/real-time pictures or real-time videos according to the modeling result.
The beneficial effects of the technical scheme are as follows: modeling the multi-modal corpus allows the modeling parameters to be obtained quickly, which improves modeling and generation efficiency, ensures that the modeling result better matches expectations, and further improves practicability and stability.
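As an illustration, the dispatch from decision instruction to modal parameters and output might look like the sketch below; the parameter tables and the model_corpus() stand-in are assumptions, not the patented model.

```python
# Illustrative modal parameters per modal expression processing mode (values assumed).
MODAL_PARAMS = {
    "text":    {"max_tokens": 128},
    "picture": {"resolution": (512, 512)},
    "video":   {"resolution": (512, 512), "fps": 24, "seconds": 5},
}

def model_corpus(task_vector, corpus, params):
    """Stand-in for modelling the multi-modal corpus under the given modal parameters."""
    return {"vector": task_vector, "corpus_size": len(corpus), "params": params}

def generate(decision_instruction, task_vector, corpus):
    mode = next(m for m in MODAL_PARAMS if m in decision_instruction)   # modal expression mode
    result = model_corpus(task_vector, corpus, MODAL_PARAMS[mode])      # modeling result
    return f"real-time {mode} built from {result}"

print(generate("render task vector as video in real time", [0.4, 1.7], ["clip_a", "clip_b"]))
```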
This embodiment further discloses a text, picture and video generation system based on the multi-modal learning technology, as shown in fig. 3. The system includes:
the first generating module 301 is configured to respectively obtain multimodal data corresponding to a text, a picture and a video, and generate learning features according to the multimodal data;
a second generating module 302, configured to train a preset neural network learning model according to respective learning features of the text, the picture and the video to generate a multi-modal learning model;
the parsing module 303 is configured to receive a decision task issued by a user, parse a task vector and a decision instruction of the decision task;
the third generating module 304 is configured to adaptively process the task vector by using a multi-modal learning model based on the decision instruction to generate a real-time text/real-time picture or a real-time video.
The working principle of the technical scheme is as follows: firstly, respectively acquiring multi-modal data corresponding to texts, pictures and videos through a first generation module, and generating learning characteristics according to the multi-modal data; secondly, training a preset neural network learning model by using a second generation module according to respective learning characteristics of the text, the picture and the video so as to generate a multi-modal learning model; then, based on the analysis module, receiving a decision task issued by a user, and analyzing a task vector and a decision instruction of the decision task; and finally, carrying out self-adaptive processing on the task vector by utilizing a third generation module based on the decision instruction and utilizing a multi-mode learning model so as to generate a real-time text/real-time picture or a real-time video.
The beneficial effects of the technical scheme are as follows: by separately acquiring the multi-modal features of text, pictures and video, the learning model can model subsequently submitted vector objects and present them at multiple levels, thereby improving users' cognition of different things and their learning efficiency, and enhancing the experience.
In one embodiment, as shown in fig. 4, the first generating module 301 includes:
the first determining submodule 3011 is used for determining respective expansion modes of the text, the picture and the video and obtaining standard mode data vector expression of the expansion modes;
the first obtaining submodule 3012 is used for obtaining the modal coding vectors of the text, the picture and the video on different expansion modes respectively based on the standard modal data vector expression;
the first generation sub-module 3013 is configured to integrate the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
the second obtaining sub-module 3014 is configured to extract morphological features of multimodal data corresponding to the text, the picture, and the video, and obtain multimodal learning features of the text, the picture, and the video according to the morphological features.
The beneficial effects of the technical scheme are as follows: determining the respective expansion modalities of text, pictures and video establishes which modalities each medium can be converted into, so the acquired multi-modal data has better reference value and the data precision and accuracy are improved. Furthermore, deriving the learning features from the morphological characteristics of the multi-modal data allows the features to be acquired quickly and in a targeted manner based on the morphological parameters of the different modalities, which improves working efficiency.
In one embodiment, the second generating module includes:
the second generation submodule is used for generating a training set according to the respective learning characteristics of the text, the picture and the video;
the third generation sub-module is used for acquiring a preset neural network learning model and setting initial model parameters of the preset neural network learning model, and after the setting is finished, the preset neural network learning model is trained to generate a first learning model;
the optimization sub-module is used for acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, selectively optimizing the first learning model based on the model precision, and acquiring a second learning model;
and the confirmation sub-module is used for confirming the second learning model as a multi-mode learning model.
The beneficial effects of the technical scheme are as follows: testing the precision of the learning model guarantees the model's accuracy, improves its stability, and lays the foundation for the subsequent modeling of target objects.
In one embodiment, the parsing module includes:
the third acquisition sub-module is used for receiving a decision task issued by a user through the task platform, acquiring a link data packet corresponding to the decision task and decompressing the link data packet;
the analysis sub-module is used for analyzing the decompressed link data packet to obtain a task vector;
the second determining submodule is used for acquiring a task instruction corresponding to the decision task and determining a decision mode for a task vector according to the task instruction;
and the fourth generation sub-module is used for retrieving a preset mode instruction sample and filling the decision mode into the preset mode instruction sample to generate a decision instruction.
The beneficial effects of the technical scheme are as follows: obtaining the task vector from the link data packet determines the subject matter to be modeled, which improves working efficiency. Furthermore, generating the decision instruction quickly by retrieving a preset modal instruction sample saves instruction-generation time, further improving working efficiency and practicability.
In one embodiment, the third generation module includes:
the third determining submodule is used for determining a mode expression processing mode for the task vector based on the decision instruction and acquiring mode parameters corresponding to the mode expression processing mode;
the third acquisition sub-module is used for acquiring the multimode corpus corresponding to the task vector;
the modeling module is used for calling a multi-mode learning model to model the multi-mode corpus based on the mode parameters, and obtaining a modeling result;
and the fifth generation sub-module is used for generating real-time text/real-time pictures or real-time videos according to the modeling result.
The beneficial effects of the technical scheme are as follows: modeling the multi-modal corpus allows the modeling parameters to be obtained quickly, which improves modeling and generation efficiency, ensures that the modeling result better matches expectations, and further improves practicability and stability.
It will be appreciated by those skilled in the art that the "first" and "second" referred to in the present application merely denote different stages of the application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A text, picture and video generation method based on a multi-modal learning technology is characterized by comprising the following steps:
respectively acquiring multi-mode data corresponding to a text, a picture and a video, and generating learning characteristics according to the multi-mode data;
training a preset neural network learning model according to respective learning characteristics of the text, the picture and the video to generate a multi-modal learning model;
receiving a decision task issued by a user, and analyzing a task vector and a decision instruction of the decision task;
and carrying out self-adaptive processing on the task vector by utilizing a multi-mode learning model based on the decision instruction so as to generate real-time text/real-time picture or real-time video.
2. The method for generating text, picture and video based on multi-modal learning technology according to claim 1, wherein the steps of respectively acquiring multi-modal data corresponding to the text, picture and video, generating learning features according to the multi-modal data include:
determining respective expansion modes of a text, a picture and a video, and acquiring standard mode data vector expression of the expansion modes;
acquiring the modal coding vectors of the text, the picture and the video on different expansion modes based on the standard modal data vector expression;
integrating the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
and extracting morphological characteristics of the multi-modal data corresponding to the text, the picture and the video, and acquiring multi-modal learning characteristics of the text, the picture and the video according to the morphological characteristics.
3. The method for generating text, pictures and videos based on the multi-modal learning technology according to claim 1, wherein training the preset neural network learning model according to the respective learning characteristics of the text, the pictures and the videos to generate the multi-modal learning model comprises:
generating a training set according to respective learning characteristics of the text, the picture and the video;
acquiring a preset neural network learning model, setting initial model parameters of the model, and training the preset neural network learning model to generate a first learning model after the setting is finished;
acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, and selectively optimizing the first learning model based on the model precision to acquire a second learning model;
and confirming the second learning model as a multi-modal learning model.
4. The method for generating text, pictures and videos based on the multi-modal learning technology according to claim 1, wherein the steps of receiving a decision task issued by a user, analyzing a task vector and a decision instruction of the decision task, and include:
receiving a decision task issued by a user through a task platform, acquiring a link data packet corresponding to the decision task, and decompressing the link data packet;
analyzing the decompressed link data packet to obtain a task vector;
acquiring a task instruction corresponding to the decision task, and determining a decision mode for a task vector according to the task instruction;
and calling a preset mode instruction sample, and filling the decision mode into the preset mode instruction sample to generate a decision instruction.
5. The method for generating text, pictures and videos based on the multi-modal learning technology according to claim 1, wherein the adaptive processing of task vectors based on decision instructions by using a multi-modal learning model to generate real-time text/real-time pictures or real-time videos comprises:
determining a modal expression processing mode for the task vector based on the decision instruction, and acquiring modal parameters corresponding to the modal expression processing mode;
acquiring multimode corpus corresponding to the task vector;
calling a multi-modal learning model to model the multi-modal corpus based on the modal parameters, and obtaining a modeling result;
and generating real-time text/real-time pictures or real-time videos according to the modeling result.
6. A text, picture, video generation system based on a multimodal learning technique, the system comprising:
the first generation module is used for respectively acquiring multi-mode data corresponding to the text, the picture and the video and generating learning characteristics according to the multi-mode data;
the second generation module is used for training a preset neural network learning model according to the respective learning characteristics of the text, the picture and the video so as to generate a multi-modal learning model;
the analysis module is used for receiving a decision task issued by a user and analyzing a task vector and a decision instruction of the decision task;
and the third generation module is used for carrying out self-adaptive processing on the task vector by utilizing the multi-mode learning model based on the decision instruction so as to generate real-time text/real-time picture or real-time video.
7. The multi-modal learning technology based text, picture, video generation system of claim 6, wherein the first generation module comprises:
the first determining submodule is used for determining respective expansion modes of texts, pictures and videos and obtaining standard mode data vector expression of the expansion modes;
the first acquisition submodule is used for acquiring the modal coding vectors of the text, the picture and the video on different expansion modes respectively based on the standard modal data vector expression;
the first generation sub-module is used for integrating the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
the second obtaining submodule is used for extracting morphological characteristics of the multi-mode data corresponding to the text, the picture and the video respectively, and obtaining multi-mode learning characteristics of the text, the picture and the video according to the morphological characteristics.
8. The multi-modal learning technology based text, picture, video generation system of claim 6, wherein the second generation module comprises:
the second generation submodule is used for generating a training set according to the respective learning characteristics of the text, the picture and the video;
the third generation sub-module is used for acquiring a preset neural network learning model and setting initial model parameters of the preset neural network learning model, and after the setting is finished, the preset neural network learning model is trained to generate a first learning model;
the optimization sub-module is used for acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, selectively optimizing the first learning model based on the model precision, and acquiring a second learning model;
and the confirmation sub-module is used for confirming the second learning model as a multi-mode learning model.
9. The multi-modal learning technology based text, picture, video generation system of claim 6, wherein the parsing module comprises:
the third acquisition sub-module is used for receiving a decision task issued by a user through the task platform, acquiring a link data packet corresponding to the decision task and decompressing the link data packet;
the analysis sub-module is used for analyzing the decompressed link data packet to obtain a task vector;
the second determining submodule is used for acquiring a task instruction corresponding to the decision task and determining a decision mode for a task vector according to the task instruction;
and the fourth generation sub-module is used for retrieving a preset mode instruction sample and filling the decision mode into the preset mode instruction sample to generate a decision instruction.
10. The text, picture, video generation system based on the multi-modal learning technique of claim 6, wherein the third generation module comprises:
the third determining submodule is used for determining a mode expression processing mode for the task vector based on the decision instruction and acquiring mode parameters corresponding to the mode expression processing mode;
the third acquisition sub-module is used for acquiring the multimode corpus corresponding to the task vector;
the modeling module is used for calling a multi-mode learning model to model the multi-mode corpus based on the mode parameters, and obtaining a modeling result;
and the fifth generation sub-module is used for generating real-time text/real-time pictures or real-time videos according to the modeling result.
CN202310912800.XA, filed 2023-07-24 (priority date 2023-07-24): Text, picture and video generation method and system based on multi-modal learning technology. Granted as CN117035004B. Status: Active.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310912800.XA CN117035004B (en) 2023-07-24 2023-07-24 Text, picture and video generation method and system based on multi-modal learning technology


Publications (2)

Publication Number Publication Date
CN117035004A 2023-11-10
CN117035004B CN117035004B (en) 2024-07-23

Family

ID=88627185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310912800.XA Active CN117035004B (en) 2023-07-24 2023-07-24 Text, picture and video generation method and system based on multi-modal learning technology

Country Status (1)

Country Link
CN (1) CN117035004B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015562A (en) * 2020-10-27 2020-12-01 北京淇瑀信息科技有限公司 Resource allocation method and device based on transfer learning and electronic equipment
CN113539233A (en) * 2020-04-16 2021-10-22 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
CN114880441A (en) * 2022-07-06 2022-08-09 北京百度网讯科技有限公司 Visual content generation method, device, system, equipment and medium
WO2022228958A1 (en) * 2021-04-28 2022-11-03 Bayer Aktiengesellschaft Method and apparatus for processing of multi-modal data
CN115393678A (en) * 2022-08-01 2022-11-25 北京理工大学 Multi-modal data fusion decision-making method based on image type intermediate state
US20230089566A1 (en) * 2020-05-30 2023-03-23 Huawei Technologies Co., Ltd. Video generation method and related apparatus


Also Published As

Publication number Publication date
CN117035004B (en) 2024-07-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant