CN117035004A - Text, picture and video generation method and system based on multi-modal learning technology - Google Patents

Text, picture and video generation method and system based on multi-modal learning technology

Info

Publication number
CN117035004A
Authority
CN
China
Prior art keywords
text
picture
modal
video
task
Prior art date
Legal status
Granted
Application number
CN202310912800.XA
Other languages
Chinese (zh)
Other versions
CN117035004B (en)
Inventor
江何
周鑫
史普力
周训游
刘权震
韩立群
Current Assignee
Beijing Testor Technology Co ltd
Original Assignee
Beijing Testor Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Testor Technology Co ltd filed Critical Beijing Testor Technology Co ltd
Priority to CN202310912800.XA
Publication of CN117035004A
Application granted
Publication of CN117035004B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a text, picture and video generation method and system based on a multi-modal learning technology. The method comprises the following steps: respectively acquiring multi-modal data corresponding to text, pictures and video, and generating learning features from the multi-modal data; training a preset neural network learning model on the respective learning features of the text, pictures and video to generate a multi-modal learning model; receiving a decision task issued by a user, and parsing the task vector and decision instruction of the decision task; and adaptively processing the task vector with the multi-modal learning model, based on the decision instruction, to generate real-time text, a real-time picture or a real-time video. By separately acquiring the multi-modal features of text, pictures and video, the learning model can model subsequently submitted vector objects and present them at multiple levels, which improves users' cognition of different things, raises learning efficiency, and enhances the user experience.

Description

Text, picture and video generation method and system based on multi-modal learning technology
Technical Field
The application relates to the technical field of data processing, in particular to a method and a system for generating texts, pictures and videos based on a multi-modal learning technology.
Background
Multi-modal learning is machine learning that uses multi-modal information. A modality is a form in which something is presented; put another way, a modality is a particular type of information, or a particular representation of information. Visual, auditory and tactile information each constitute a modality, and the human brain can capture information from multiple modalities simultaneously, then process and integrate it to complete cognition and execution tasks. The same thing can be presented in multiple modalities, for example as text, pictures or video. At present, however, there is no method for generating text, pictures or video of a thing through multi-modal learning techniques; people's cognition of things is therefore limited, and the user experience suffers.
Disclosure of Invention
In view of the above-mentioned problems, the present application provides a method and a system for generating text, pictures and video based on a multi-modal learning technique, which address the problem noted in the background art: no method yet exists for generating text, pictures or video of things through multi-modal learning techniques, so people's cognition of things is limited and the user experience suffers.
A text, picture and video generation method based on a multi-modal learning technology comprises the following steps:
respectively acquiring multi-mode data corresponding to a text, a picture and a video, and generating learning characteristics according to the multi-mode data;
training a preset neural network learning model according to respective learning characteristics of the text, the picture and the video to generate a multi-modal learning model;
receiving a decision task issued by a user, and analyzing a task vector and a decision instruction of the decision task;
and carrying out self-adaptive processing on the task vector by utilizing a multi-mode learning model based on the decision instruction so as to generate real-time text/real-time picture or real-time video.
Preferably, the acquiring multi-modal data corresponding to the text, the picture and the video respectively, generating learning features according to the multi-modal data, includes:
determining respective expansion modes of a text, a picture and a video, and acquiring standard mode data vector expression of the expansion modes;
acquiring the modal coding vectors of the text, the picture and the video on different expansion modes based on the standard modal data vector expression;
integrating the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
and extracting morphological characteristics of the multi-modal data corresponding to the text, the picture and the video, and acquiring multi-modal learning characteristics of the text, the picture and the video according to the morphological characteristics.
Preferably, the training the preset neural network learning model according to the learning characteristics of each of the text, the picture and the video to generate the multi-modal learning model includes:
generating a training set according to respective learning characteristics of the text, the picture and the video;
acquiring a preset neural network learning model, setting initial model parameters of the model, and training the preset neural network learning model to generate a first learning model after the setting is finished;
acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, and selectively optimizing the first learning model based on the model precision to acquire a second learning model;
and confirming the second learning model as a multi-modal learning model.
Preferably, the receiving the decision task issued by the user, analyzing the task vector and the decision instruction of the decision task, includes:
receiving a decision task issued by a user through a task platform, acquiring a link data packet corresponding to the decision task, and decompressing the link data packet;
analyzing the decompressed link data packet to obtain a task vector;
acquiring a task instruction corresponding to the decision task, and determining a decision mode for a task vector according to the task instruction;
and calling a preset mode instruction sample, and filling the decision mode into the preset mode instruction sample to generate a decision instruction.
Preferably, the adaptive processing of task vectors based on decision instructions using a multi-modal learning model to generate real-time text/real-time pictures or real-time video includes:
determining a modal expression processing mode for the task vector based on the decision instruction, and acquiring modal parameters corresponding to the modal expression processing mode;
acquiring multimode corpus corresponding to the task vector;
calling a multi-modal learning model to model the multi-modal corpus based on the modal parameters, and obtaining a modeling result;
and generating real-time text/real-time pictures or real-time videos according to the modeling result.
A text, picture, video generation system based on multimodal learning technique, the system comprising:
the first generation module is used for respectively acquiring multi-mode data corresponding to the text, the picture and the video and generating learning characteristics according to the multi-mode data;
the second generation module is used for training a preset neural network learning model according to the respective learning characteristics of the text, the picture and the video so as to generate a multi-modal learning model;
the analysis module is used for receiving a decision task issued by a user and analyzing a task vector and a decision instruction of the decision task;
and the third generation module is used for carrying out self-adaptive processing on the task vector by utilizing the multi-mode learning model based on the decision instruction so as to generate real-time text/real-time picture or real-time video.
Preferably, the first generating module includes:
the first determining submodule is used for determining respective expansion modes of texts, pictures and videos and obtaining standard mode data vector expression of the expansion modes;
the first acquisition submodule is used for acquiring the modal coding vectors of the text, the picture and the video on different expansion modes respectively based on the standard modal data vector expression;
the first generation sub-module is used for integrating the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
the second obtaining submodule is used for extracting morphological characteristics of the multi-mode data corresponding to the text, the picture and the video respectively, and obtaining multi-mode learning characteristics of the text, the picture and the video according to the morphological characteristics.
Preferably, the second generating module includes:
the second generation submodule is used for generating a training set according to the respective learning characteristics of the text, the picture and the video;
the third generation sub-module is used for acquiring a preset neural network learning model and setting initial model parameters of the preset neural network learning model, and after the setting is finished, the preset neural network learning model is trained to generate a first learning model;
the optimization sub-module is used for acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, selectively optimizing the first learning model based on the model precision, and acquiring a second learning model;
and the confirmation sub-module is used for confirming the second learning model as a multi-mode learning model.
Preferably, the parsing module includes:
the third acquisition sub-module is used for receiving a decision task issued by a user through the task platform, acquiring a link data packet corresponding to the decision task and decompressing the link data packet;
the analysis sub-module is used for analyzing the decompressed link data packet to obtain a task vector;
the second determining submodule is used for acquiring a task instruction corresponding to the decision task and determining a decision mode for a task vector according to the task instruction;
and the fourth generation sub-module is used for retrieving a preset mode instruction sample and filling the decision mode into the preset mode instruction sample to generate a decision instruction.
Preferably, the third generating module includes:
the third determining submodule is used for determining a mode expression processing mode for the task vector based on the decision instruction and acquiring mode parameters corresponding to the mode expression processing mode;
the third acquisition sub-module is used for acquiring the multimode corpus corresponding to the task vector;
the modeling module is used for calling a multi-mode learning model to model the multi-mode corpus based on the mode parameters, and obtaining a modeling result;
and the fifth generation sub-module is used for generating real-time text/real-time pictures or real-time videos according to the modeling result.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical scheme of the application is further described in detail through the drawings and the embodiments.
Drawings
The accompanying drawings are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate the application and together with the embodiments of the application, serve to explain the application.
FIG. 1 is a workflow diagram of a method for generating text, pictures and video based on a multi-modal learning technique;
FIG. 2 is another workflow diagram of a text, picture, video generation method based on a multi-modal learning technique provided by the present application;
fig. 3 is a schematic structural diagram of a text, picture and video generating system based on a multi-mode learning technology provided by the application;
fig. 4 is a schematic structural diagram of a first generating module in a text, picture and video generating system based on a multi-mode learning technology.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Multi-modal learning is machine learning that uses multi-modal information. A modality is a form in which something is presented; put another way, a modality is a particular type of information, or a particular representation of information. People subconsciously associate modalities with the multi-channel senses (vision, hearing, touch and so on). People perceive the world through senses such as vision, touch, hearing and smell, and different senses reflect the intrinsic properties of the same thing from different sides. Visual, auditory and tactile information each constitute a modality, and the human brain can capture information from multiple modalities simultaneously, then process and integrate it to complete cognition and execution tasks. The same thing can be presented in multiple modalities, for example as text, pictures or video. At present, however, there is no method for generating text, pictures or video of a thing through multi-modal learning techniques; people's cognition of things is therefore limited, and the user experience suffers. To solve the above problems, this embodiment discloses a text, picture and video generation method based on a multi-modal learning technique.
A text, picture and video generation method based on a multi-modal learning technology, as shown in figure 1, comprises the following steps:
step S101, respectively acquiring multi-mode data corresponding to texts, pictures and videos, and generating learning features according to the multi-mode data;
step S102, training a preset neural network learning model according to respective learning characteristics of texts, pictures and videos to generate a multi-modal learning model;
step S103, receiving a decision task issued by a user, and analyzing a task vector and a decision instruction of the decision task;
step S104, performing self-adaptive processing on the task vector by utilizing a multi-mode learning model based on the decision instruction to generate a real-time text/real-time picture or a real-time video.
In this embodiment, the multi-modal data is represented as modal data in which text, pictures, and video are each converted into other formats;
in this embodiment, the learning features are represented as steady-state learning features of respective modalities of text, picture, and video;
in this embodiment, the decision task is represented as a decision task for performing modal transformation on the real-time parameter vector;
in this embodiment, the task vector is expressed as an entity parameter vector corresponding to the task;
in this embodiment, the decision instruction is represented as a decision instruction issued by the user to perform modal transformation on the entity parameter vector.
The working principle of the technical scheme is as follows: respectively acquiring multi-mode data corresponding to a text, a picture and a video, and generating learning characteristics according to the multi-mode data; training a preset neural network learning model according to respective learning characteristics of the text, the picture and the video to generate a multi-modal learning model; receiving a decision task issued by a user, and analyzing a task vector and a decision instruction of the decision task; and carrying out self-adaptive processing on the task vector by utilizing a multi-mode learning model based on the decision instruction so as to generate real-time text/real-time picture or real-time video.
The beneficial effects of the technical scheme are as follows: by separately acquiring the multi-modal features of text, pictures and video, the learning model can model subsequently submitted vector objects and present them at multiple levels. This improves users' cognition of different things and their learning efficiency, enhances the experience, and solves the problem in the prior art that no method exists for generating text, pictures or video of things through multi-modal learning technology, which limits people's cognition of things and degrades the user experience.
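For readers who prefer code to prose, steps S101 to S104 can be read as the following minimal Python sketch. Every name in it (acquire_multimodal_data, DecisionTask, and so on) is a hypothetical stand-in chosen for illustration; the patent itself does not prescribe an implementation.

```python
# Hypothetical end-to-end sketch of steps S101-S104; names are illustrative only.
from dataclasses import dataclass

@dataclass
class DecisionTask:
    task_vector: list          # entity parameter vector parsed from the link data packet
    decision_instruction: str  # e.g. "render as real-time video"

def acquire_multimodal_data(source):
    """S101: collect modal data for text, picture and video and derive learning features."""
    return {"text": [0.1, 0.2], "picture": [0.3, 0.4], "video": [0.5, 0.6]}  # placeholder features

def train_multimodal_model(features):
    """S102: train a preset neural-network learning model on the per-modality features."""
    return lambda vec, instruction: f"{instruction}: generated from {vec}"   # stand-in model

def parse_decision_task(raw_packet):
    """S103: parse the task vector and decision instruction from a user-issued task."""
    return DecisionTask(task_vector=[1.0, 2.0], decision_instruction="real-time picture")

def adaptive_generate(model, task):
    """S104: adaptively process the task vector to emit real-time text/picture/video."""
    return model(task.task_vector, task.decision_instruction)

features = acquire_multimodal_data(source=None)
model = train_multimodal_model(features)
task = parse_decision_task(raw_packet=b"...")
print(adaptive_generate(model, task))
```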
In one embodiment, the obtaining multi-modal data corresponding to the text, the picture and the video respectively, generating the learning feature according to the multi-modal data, includes:
determining respective expansion modes of a text, a picture and a video, and acquiring standard mode data vector expression of the expansion modes;
acquiring the modal coding vectors of the text, the picture and the video on different expansion modes based on the standard modal data vector expression;
integrating the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
and extracting morphological characteristics of the multi-modal data corresponding to the text, the picture and the video, and acquiring multi-modal learning characteristics of the text, the picture and the video according to the morphological characteristics.
In this embodiment, the expansion mode is expressed as a representation of each of text, picture, and video in other expansion forms.
The beneficial effects of the technical scheme are as follows: determining the respective expansion modalities of text, pictures and video establishes which modalities each medium can be converted into, so the acquired multi-modal data has better reference value and the data precision and accuracy are improved. Furthermore, deriving the learning features from the morphological characteristics of the multi-modal data allows the features to be acquired quickly and in a targeted manner based on the morphological parameters of the different modalities, which improves working efficiency.
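As a rough illustration of this sub-flow, the sketch below encodes each medium against its expansion modalities, integrates the encodings, and summarises them with toy "morphological" statistics. The modality lists, vector dimensions and statistics are assumptions, not part of the claimed method.

```python
import numpy as np

# Hypothetical expansion modalities per medium; the patent does not enumerate them.
EXPANSION_MODES = {
    "text":    ["picture", "video"],
    "picture": ["text", "video"],
    "video":   ["text", "picture"],
}

def standard_modal_expression(mode, dim=8):
    """Stand-in for the 'standard modal data vector expression' of an expansion modality."""
    rng = np.random.default_rng(abs(hash(mode)) % (2**32))
    return rng.normal(size=dim)

def modal_encoding_vectors(medium):
    """Encode one medium against each of its expansion modalities."""
    return [standard_modal_expression(m) for m in EXPANSION_MODES[medium]]

def integrate(vectors):
    """Integrate the per-modality encodings into one multi-modal data vector."""
    return np.concatenate(vectors)

def morphological_features(multimodal_vec):
    """Toy 'morphological' summary: mean, spread and extremes of the integrated vector."""
    return np.array([multimodal_vec.mean(), multimodal_vec.std(),
                     multimodal_vec.min(), multimodal_vec.max()])

learning_features = {
    medium: morphological_features(integrate(modal_encoding_vectors(medium)))
    for medium in EXPANSION_MODES
}
print(learning_features["text"])
```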
In one embodiment, the training the preset neural network learning model according to the respective learning characteristics of the text, the picture and the video to generate the multi-modal learning model includes:
generating a training set according to respective learning characteristics of the text, the picture and the video;
acquiring a preset neural network learning model, setting initial model parameters of the model, and training the preset neural network learning model to generate a first learning model after the setting is finished;
acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, and selectively optimizing the first learning model based on the model precision to acquire a second learning model;
and confirming the second learning model as a multi-modal learning model.
The beneficial effects of the technical scheme are as follows: testing the precision of the learning model guarantees the model's accuracy, improves its stability, and lays the foundation for the subsequent modeling of target objects.
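A minimal sketch of the train-test-selectively-optimize loop follows, using a generic PyTorch MLP as the "preset neural network learning model". The random data, the 0.9 precision threshold and the retraining schedule are illustrative assumptions only.

```python
import torch
from torch import nn

def train(model, x, y, epochs=50, lr=1e-3):
    """Train the preset model on the training set built from the learning features."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model

def accuracy(model, x, y):
    """Model precision measured on the preset test sample."""
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

x_train, y_train = torch.randn(256, 12), torch.randint(0, 3, (256,))   # stand-in training set
x_test,  y_test  = torch.randn(64, 12),  torch.randint(0, 3, (64,))    # stand-in preset test sample

model = nn.Sequential(nn.Linear(12, 32), nn.ReLU(), nn.Linear(32, 3))  # initial model parameters
first_model = train(model, x_train, y_train)                           # first learning model

# Selective optimisation: refine only when the measured precision misses an assumed target.
if accuracy(first_model, x_test, y_test) < 0.9:
    second_model = train(first_model, x_train, y_train, epochs=100, lr=5e-4)
else:
    second_model = first_model   # confirmed as the multi-modal learning model
```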
In one embodiment, as shown in fig. 2, the receiving the decision task issued by the user, analyzing the task vector and the decision instruction of the decision task, includes:
step S201, receiving a decision task issued by a user through a task platform, acquiring a link data packet corresponding to the decision task, and decompressing the link data packet;
step S202, analyzing the decompressed link data packet to obtain a task vector;
step 203, acquiring a task instruction corresponding to the decision task, and determining a decision mode for a task vector according to the task instruction;
step S204, a preset mode instruction sample is called, and the decision mode is filled into the preset mode instruction sample to generate a decision instruction.
The beneficial effects of the technical scheme are as follows: obtaining the task vector from the link data packet determines the subject matter to be modeled, which improves working efficiency. Furthermore, generating the decision instruction quickly by retrieving a preset modal instruction sample saves instruction-generation time, further improving working efficiency and practicability.
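The parsing flow can be illustrated as below, assuming, purely for demonstration, that the link data packet is zlib-compressed JSON and that the preset modal instruction sample is a string template; the patent does not specify the wire format.

```python
import json
import zlib

# Hypothetical preset modal instruction sample; the decision mode is filled into it.
INSTRUCTION_TEMPLATE = "render task vector as {mode} in real time"

def parse_decision_task(link_packet: bytes):
    payload = json.loads(zlib.decompress(link_packet))   # decompress, then parse the packet
    task_vector = payload["task_vector"]                 # entity parameter vector
    # Map the task instruction to a decision mode (codes are illustrative assumptions).
    decision_mode = {"T": "text", "P": "picture", "V": "video"}[payload["task_instruction"]]
    decision_instruction = INSTRUCTION_TEMPLATE.format(mode=decision_mode)
    return task_vector, decision_instruction

packet = zlib.compress(json.dumps({"task_vector": [0.4, 1.7], "task_instruction": "V"}).encode())
print(parse_decision_task(packet))   # ([0.4, 1.7], 'render task vector as video in real time')
```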
In this embodiment, before receiving the decision task issued by the user through the task platform, the method further includes:
receiving a task execution request sent by a user through a task platform, responding to the task execution request, and obtaining the equipment type of the task execution request sent by the user;
acquiring a plurality of protocol buffers and annotation files thereof according to the equipment type, and feeding back the plurality of protocol buffers and the annotation files thereof to user equipment;
receiving a target protocol buffer fed back by user equipment, and obtaining a buffer file corresponding to the target protocol buffer;
determining a protocol security event according to the buffer file, and determining a plurality of attack chain directions and attack stage parameters of each attack chain direction based on the protocol security event;
acquiring network attributes of user equipment, and determining a security connection strategy for the user equipment according to the network attributes, a plurality of attack chain directions and attack stage parameters of each attack chain direction;
configuring target protocol parameters of a target protocol buffer according to a secure connection policy for user equipment;
the user equipment is connected through the configured target protocol buffer, a plurality of data files transmitted by the user equipment are received, virus detection is carried out on the data files, and a detection result is obtained;
if the detection result shows that the data file is virus-free, receiving, through the task platform, the decision task with its attachment issued by the user;
if the detection result shows that the data file contains a virus, setting up an intermediate database based on a data sharing mechanism and a security protection mechanism;
and constructing a link between the task platform and the intermediate database, receiving, through the task platform, the decision task with its attachment issued by the user, and transmitting the attachment with priority to the intermediate database for subsequent retrieval.
In this embodiment, the device type represents the device medium from which the user issues the task execution request, for example: a mobile phone, a notebook computer, a tablet computer, etc.;
In this embodiment, a protocol buffer and its annotation file represent a network protocol available for a data connection with the user equipment, together with an annotation file describing the protocol's connection conditions;
In this embodiment, a protocol security event is represented as a security event affecting the communication protocol;
In this embodiment, an attack chain direction is expressed as a direction from which an attack chain targets the server hosting the task platform, for example: software or hardware;
In this embodiment, the attack stage parameters are expressed as the distribution of the stages of an overall attack across the different attack directions, together with the attack element set parameters of each stage;
In this embodiment, the network attribute is represented as the network type attribute of the network currently used by the user equipment, for example: a public network or an operator network.
The beneficial effects of the technical scheme are as follows: configuring the protocol parameters of the data connection protocol used with the user equipment ensures the security and stability of the data transmission process. Furthermore, building an intermediate database to store the attachments carried by tasks prevents a virus-infected attachment from causing damage, while allowing the attachments to be security-checked through the intermediate database, which improves practicability and stability.
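The virus-screening branch can be sketched as follows. scan_attachment() is a placeholder for whatever detection engine is actually deployed, and QUARANTINE_DB stands in for the intermediate database with its sharing and protection mechanisms; both are assumptions for illustration.

```python
# Hedged sketch of the pre-reception safeguards; all names here are hypothetical.
QUARANTINE_DB = []   # stand-in for the intermediate database

def scan_attachment(data: bytes) -> bool:
    """Placeholder detector: returns True when the payload looks clean."""
    return b"EICAR" not in data   # toy heuristic, not a real virus scanner

def receive_task(task: dict, attachment: bytes) -> dict:
    if scan_attachment(attachment):
        task["attachment"] = attachment                    # clean: accept the attachment directly
    else:
        QUARANTINE_DB.append(attachment)                   # infected: divert to the intermediate database
        task["attachment_ref"] = len(QUARANTINE_DB) - 1    # keep a reference for subsequent retrieval
    return task

print(receive_task({"name": "demo"}, b"harmless bytes"))
```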
In one embodiment, adaptively processing task vectors to generate real-time text/real-time pictures or real-time video using a multi-modal learning model based on decision instructions includes:
determining a modal expression processing mode for the task vector based on the decision instruction, and acquiring modal parameters corresponding to the modal expression processing mode;
acquiring multimode corpus corresponding to the task vector;
calling a multi-modal learning model to model the multi-modal corpus based on the modal parameters, and obtaining a modeling result;
and generating real-time text/real-time pictures or real-time videos according to the modeling result.
The beneficial effects of the technical scheme are as follows: modeling the multi-modal corpus allows the modeling parameters to be obtained quickly, which improves modeling and generation efficiency, ensures that the modeling result better matches expectations, and further improves practicability and stability.
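As an illustration, the dispatch from decision instruction to modal parameters and output might look like the sketch below; the parameter tables and the model_corpus() stand-in are assumptions, not the patented model.

```python
# Illustrative modal parameters per modal expression processing mode (values assumed).
MODAL_PARAMS = {
    "text":    {"max_tokens": 128},
    "picture": {"resolution": (512, 512)},
    "video":   {"resolution": (512, 512), "fps": 24, "seconds": 5},
}

def model_corpus(task_vector, corpus, params):
    """Stand-in for modelling the multi-modal corpus under the given modal parameters."""
    return {"vector": task_vector, "corpus_size": len(corpus), "params": params}

def generate(decision_instruction, task_vector, corpus):
    mode = next(m for m in MODAL_PARAMS if m in decision_instruction)   # modal expression mode
    result = model_corpus(task_vector, corpus, MODAL_PARAMS[mode])      # modeling result
    return f"real-time {mode} built from {result}"

print(generate("render task vector as video in real time", [0.4, 1.7], ["clip_a", "clip_b"]))
```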
This embodiment further discloses a text, picture and video generation system based on the multi-modal learning technology, as shown in fig. 3. The system includes:
the first generating module 301 is configured to respectively obtain multimodal data corresponding to a text, a picture and a video, and generate learning features according to the multimodal data;
a second generating module 302, configured to train a preset neural network learning model according to respective learning features of the text, the picture and the video to generate a multi-modal learning model;
the parsing module 303 is configured to receive a decision task issued by a user, parse a task vector and a decision instruction of the decision task;
the third generating module 304 is configured to adaptively process the task vector by using a multi-modal learning model based on the decision instruction to generate a real-time text/real-time picture or a real-time video.
The working principle of the technical scheme is as follows: firstly, respectively acquiring multi-modal data corresponding to texts, pictures and videos through a first generation module, and generating learning characteristics according to the multi-modal data; secondly, training a preset neural network learning model by using a second generation module according to respective learning characteristics of the text, the picture and the video so as to generate a multi-modal learning model; then, based on the analysis module, receiving a decision task issued by a user, and analyzing a task vector and a decision instruction of the decision task; and finally, carrying out self-adaptive processing on the task vector by utilizing a third generation module based on the decision instruction and utilizing a multi-mode learning model so as to generate a real-time text/real-time picture or a real-time video.
The beneficial effects of the technical scheme are as follows: by separately acquiring the multi-modal features of text, pictures and video, the learning model can model subsequently submitted vector objects and present them at multiple levels, thereby improving users' cognition of different things and their learning efficiency, and enhancing the experience.
In one embodiment, as shown in fig. 4, the first generating module 301 includes:
the first determining submodule 3011 is used for determining respective expansion modes of the text, the picture and the video and obtaining standard mode data vector expression of the expansion modes;
the first obtaining submodule 3012 is used for obtaining the modal coding vectors of the text, the picture and the video on different expansion modes respectively based on the standard modal data vector expression;
the first generation sub-module 3013 is configured to integrate the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
the second obtaining sub-module 3014 is configured to extract morphological features of multimodal data corresponding to the text, the picture, and the video, and obtain multimodal learning features of the text, the picture, and the video according to the morphological features.
The beneficial effects of the technical scheme are as follows: determining the respective expansion modalities of text, pictures and video establishes which modalities each medium can be converted into, so the acquired multi-modal data has better reference value and the data precision and accuracy are improved. Furthermore, deriving the learning features from the morphological characteristics of the multi-modal data allows the features to be acquired quickly and in a targeted manner based on the morphological parameters of the different modalities, which improves working efficiency.
In one embodiment, the second generating module includes:
the second generation submodule is used for generating a training set according to the respective learning characteristics of the text, the picture and the video;
the third generation sub-module is used for acquiring a preset neural network learning model and setting initial model parameters of the preset neural network learning model, and after the setting is finished, the preset neural network learning model is trained to generate a first learning model;
the optimization sub-module is used for acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, selectively optimizing the first learning model based on the model precision, and acquiring a second learning model;
and the confirmation sub-module is used for confirming the second learning model as a multi-mode learning model.
The beneficial effects of the technical scheme are as follows: testing the precision of the learning model guarantees the model's accuracy, improves its stability, and lays the foundation for the subsequent modeling of target objects.
In one embodiment, the parsing module includes:
the third acquisition sub-module is used for receiving a decision task issued by a user through the task platform, acquiring a link data packet corresponding to the decision task and decompressing the link data packet;
the analysis sub-module is used for analyzing the decompressed link data packet to obtain a task vector;
the second determining submodule is used for acquiring a task instruction corresponding to the decision task and determining a decision mode for a task vector according to the task instruction;
and the fourth generation sub-module is used for retrieving a preset mode instruction sample and filling the decision mode into the preset mode instruction sample to generate a decision instruction.
The beneficial effects of the technical scheme are as follows: obtaining the task vector from the link data packet determines the subject matter to be modeled, which improves working efficiency. Furthermore, generating the decision instruction quickly by retrieving a preset modal instruction sample saves instruction-generation time, further improving working efficiency and practicability.
In one embodiment, the third generation module includes:
the third determining submodule is used for determining a mode expression processing mode for the task vector based on the decision instruction and acquiring mode parameters corresponding to the mode expression processing mode;
the third acquisition sub-module is used for acquiring the multimode corpus corresponding to the task vector;
the modeling module is used for calling a multi-mode learning model to model the multi-mode corpus based on the mode parameters, and obtaining a modeling result;
and the fifth generation sub-module is used for generating real-time text/real-time pictures or real-time videos according to the modeling result.
The beneficial effects of the technical scheme are as follows: modeling the multi-modal corpus allows the modeling parameters to be obtained quickly, which improves modeling and generation efficiency, ensures that the modeling result better matches expectations, and further improves practicability and stability.
It will be appreciated by those skilled in the art that the "first" and "second" referred to in the present application merely denote different stages of the application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A text, picture and video generation method based on a multi-modal learning technology is characterized by comprising the following steps:
respectively acquiring multi-mode data corresponding to a text, a picture and a video, and generating learning characteristics according to the multi-mode data;
training a preset neural network learning model according to respective learning characteristics of the text, the picture and the video to generate a multi-modal learning model;
receiving a decision task issued by a user, and analyzing a task vector and a decision instruction of the decision task;
and carrying out self-adaptive processing on the task vector by utilizing a multi-mode learning model based on the decision instruction so as to generate real-time text/real-time picture or real-time video.
2. The method for generating text, picture and video based on multi-modal learning technology according to claim 1, wherein the steps of respectively acquiring multi-modal data corresponding to the text, picture and video, generating learning features according to the multi-modal data include:
determining respective expansion modes of a text, a picture and a video, and acquiring standard mode data vector expression of the expansion modes;
acquiring the modal coding vectors of the text, the picture and the video on different expansion modes based on the standard modal data vector expression;
integrating the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
and extracting morphological characteristics of the multi-modal data corresponding to the text, the picture and the video, and acquiring multi-modal learning characteristics of the text, the picture and the video according to the morphological characteristics.
3. The method for generating text, pictures and videos based on the multi-modal learning technology according to claim 1, wherein training the preset neural network learning model according to the respective learning characteristics of the text, the pictures and the videos to generate the multi-modal learning model comprises:
generating a training set according to respective learning characteristics of the text, the picture and the video;
acquiring a preset neural network learning model, setting initial model parameters of the model, and training the preset neural network learning model to generate a first learning model after the setting is finished;
acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, and selectively optimizing the first learning model based on the model precision to acquire a second learning model;
and confirming the second learning model as a multi-modal learning model.
4. The method for generating text, pictures and videos based on the multi-modal learning technology according to claim 1, wherein the steps of receiving a decision task issued by a user, analyzing a task vector and a decision instruction of the decision task, and include:
receiving a decision task issued by a user through a task platform, acquiring a link data packet corresponding to the decision task, and decompressing the link data packet;
analyzing the decompressed link data packet to obtain a task vector;
acquiring a task instruction corresponding to the decision task, and determining a decision mode for a task vector according to the task instruction;
and calling a preset mode instruction sample, and filling the decision mode into the preset mode instruction sample to generate a decision instruction.
5. The method for generating text, pictures and videos based on the multi-modal learning technology according to claim 1, wherein the adaptive processing of task vectors based on decision instructions by using a multi-modal learning model to generate real-time text/real-time pictures or real-time videos comprises:
determining a modal expression processing mode for the task vector based on the decision instruction, and acquiring modal parameters corresponding to the modal expression processing mode;
acquiring multimode corpus corresponding to the task vector;
calling a multi-modal learning model to model the multi-modal corpus based on the modal parameters, and obtaining a modeling result;
and generating real-time text/real-time pictures or real-time videos according to the modeling result.
6. A text, picture, video generation system based on a multimodal learning technique, the system comprising:
the first generation module is used for respectively acquiring multi-mode data corresponding to the text, the picture and the video and generating learning characteristics according to the multi-mode data;
the second generation module is used for training a preset neural network learning model according to the respective learning characteristics of the text, the picture and the video so as to generate a multi-modal learning model;
the analysis module is used for receiving a decision task issued by a user and analyzing a task vector and a decision instruction of the decision task;
and the third generation module is used for carrying out self-adaptive processing on the task vector by utilizing the multi-mode learning model based on the decision instruction so as to generate real-time text/real-time picture or real-time video.
7. The multi-modal learning technology based text, picture, video generation system of claim 6, wherein the first generation module comprises:
the first determining submodule is used for determining respective expansion modes of texts, pictures and videos and obtaining standard mode data vector expression of the expansion modes;
the first acquisition submodule is used for acquiring the modal coding vectors of the text, the picture and the video on different expansion modes respectively based on the standard modal data vector expression;
the first generation sub-module is used for integrating the modal coding vectors of the text, the picture and the video on different expansion modes to generate multi-modal data corresponding to the text, the picture and the video;
the second obtaining submodule is used for extracting morphological characteristics of the multi-mode data corresponding to the text, the picture and the video respectively, and obtaining multi-mode learning characteristics of the text, the picture and the video according to the morphological characteristics.
8. The multi-modal learning technology based text, picture, video generation system of claim 6, wherein the second generation module comprises:
the second generation submodule is used for generating a training set according to the respective learning characteristics of the text, the picture and the video;
the third generation sub-module is used for acquiring a preset neural network learning model and setting initial model parameters of the preset neural network learning model, and after the setting is finished, the preset neural network learning model is trained to generate a first learning model;
the optimization sub-module is used for acquiring a preset test sample, testing the first learning model, determining model precision according to a test result, selectively optimizing the first learning model based on the model precision, and acquiring a second learning model;
and the confirmation sub-module is used for confirming the second learning model as a multi-mode learning model.
9. The multi-modal learning technology based text, picture, video generation system of claim 6, wherein the parsing module comprises:
the third acquisition sub-module is used for receiving a decision task issued by a user through the task platform, acquiring a link data packet corresponding to the decision task and decompressing the link data packet;
the analysis sub-module is used for analyzing the decompressed link data packet to obtain a task vector;
the second determining submodule is used for acquiring a task instruction corresponding to the decision task and determining a decision mode for a task vector according to the task instruction;
and the fourth generation sub-module is used for retrieving a preset mode instruction sample and filling the decision mode into the preset mode instruction sample to generate a decision instruction.
10. The text, picture, video generation system based on the multi-modal learning technique of claim 6, wherein the third generation module comprises:
the third determining submodule is used for determining a mode expression processing mode for the task vector based on the decision instruction and acquiring mode parameters corresponding to the mode expression processing mode;
the third acquisition sub-module is used for acquiring the multimode corpus corresponding to the task vector;
the modeling module is used for calling a multi-mode learning model to model the multi-mode corpus based on the mode parameters, and obtaining a modeling result;
and the fifth generation sub-module is used for generating real-time text/real-time pictures or real-time videos according to the modeling result.
CN202310912800.XA, filed 2023-07-24 (priority date 2023-07-24): Text, picture and video generation method and system based on multi-modal learning technology. Granted as CN117035004B. Status: Active.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310912800.XA CN117035004B (en) 2023-07-24 2023-07-24 Text, picture and video generation method and system based on multi-modal learning technology


Publications (2)

Publication Number Publication Date
CN117035004A 2023-11-10
CN117035004B CN117035004B (en) 2024-07-23

Family

ID=88627185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310912800.XA Active CN117035004B (en) 2023-07-24 2023-07-24 Text, picture and video generation method and system based on multi-modal learning technology

Country Status (1)

Country Link
CN (1) CN117035004B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015562A (en) * 2020-10-27 2020-12-01 北京淇瑀信息科技有限公司 Resource allocation method and device based on transfer learning and electronic equipment
CN113539233A (en) * 2020-04-16 2021-10-22 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
CN114880441A (en) * 2022-07-06 2022-08-09 北京百度网讯科技有限公司 Visual content generation method, device, system, equipment and medium
WO2022228958A1 (en) * 2021-04-28 2022-11-03 Bayer Aktiengesellschaft Method and apparatus for processing of multi-modal data
CN115393678A (en) * 2022-08-01 2022-11-25 北京理工大学 Multi-modal data fusion decision-making method based on image type intermediate state
US20230089566A1 (en) * 2020-05-30 2023-03-23 Huawei Technologies Co., Ltd. Video generation method and related apparatus


Also Published As

Publication number Publication date
CN117035004B (en) 2024-07-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant