CN117349427A - Artificial intelligence multi-mode content generation system for public opinion event coping - Google Patents
- Publication number
- CN117349427A (application CN202311282149.9A)
- Authority
- CN
- China
- Prior art keywords
- model
- image
- text
- generation
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/335—Information retrieval of unstructured textual data; filtering based on additional data, e.g. user or group profiles
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/3344—Query execution using natural language analysis
- G06F16/35—Clustering; classification of unstructured textual data
- G06F40/237—Natural language analysis; lexical tools
- G06N3/04—Neural networks; architecture, e.g. interconnection topology
- G06V10/24—Aligning, centring, orientation detection or correction of the image
- G06V10/764—Image or video recognition using classification, e.g. of video objects
- G06V10/774—Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/82—Image or video recognition using neural networks
Abstract
The invention discloses an artificial intelligence multi-modal content generation system for public opinion event response, which applies artificial intelligence multi-modal content generation technology to the scenario of responding to public opinion events. The system comprises three stages: data processing, content generation, and content quality and credibility evaluation. It replaces the traditional manual approach to public opinion response, handling large volumes of content-generation tasks automatically through intelligent algorithms. The system can quickly and accurately generate public opinion content that meets user requirements, improving the efficiency and quality of content generation, providing more comprehensive and diversified response capability, and saving the time and cost of manual processing. Because its algorithms and models are data-driven, the system can adapt to public opinion data and user feedback, producing more accurate, comprehensive, and diversified content and thereby improving both the effectiveness and the efficiency of public opinion event response.
Description
Technical Field
The invention relates to the field of IT applications, and in particular to an artificial intelligence multi-modal content generation system for responding to public opinion events.
Background
Most existing public opinion systems are monitoring systems: they focus on collecting, analyzing, and monitoring public opinion information to help users understand and track public opinion dynamics. When a public opinion event actually occurs, however, a capability for writing multi-modal manuscripts is required. The artificial intelligence multi-modal content generation system for public opinion event response described here focuses instead on generating multi-modal content to respond to and guide public opinion events. Using artificial intelligence multi-modal content generation technology to automate the response, so that public opinion can be addressed in a timely manner, is a difficult problem of real practical significance for public opinion manuscript writing.
1. Existing public opinion response usually requires manual participation. When a large amount of information must be processed, processing may be slow, so the response to public opinion lags; it also consumes time and resources and is subject to human subjectivity and skill level.
2. Conventional manuscript generation for public opinion is usually text-centered, responding or issuing statements through written words. However, much news content requires multi-modal information: text alone may fail to convey the real situation of an event completely or to trigger resonance in the audience.
3. Most existing public opinion systems are used for public opinion monitoring, focusing on collecting, analyzing, and tracking public opinion information. They lack an automated step that responds to the public opinion analysis by generating content, and existing public opinion event response methods are limited to generating content about objective events.
Disclosure of Invention
The invention aims to provide an artificial intelligence multi-modal content generation system for public opinion event response, so as to solve the problems identified in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
An artificial intelligence multi-modal content generation system for public opinion event response applies artificial intelligence multi-modal content generation technology to the scenario of responding to public opinion events, and comprises:
Data processing: first, multi-modal data related to public opinion events, including text, images, audio, and video, are collected and organized. Data related to the events, including news reports, social media posts, and user comments, are acquired from various sources through web crawler technology. The organized data are then preprocessed to facilitate subsequent model training and generation: the data are cleaned and denoised, and useless or erroneous data are eliminated to ensure data quality;
Content generation: first, the data of each modality related to a public opinion event must be understood and analyzed to obtain their semantic information and features. The features of the multiple modalities are then fused and aligned to establish associations between modalities. Finally, multi-modal content generation is performed: based on the fused feature representation, a generative model such as a generative adversarial network (GAN) produces the multi-modal content;
Content quality and credibility evaluation: the quality and trustworthiness of the generated content are evaluated using metrics and evaluation methods, for example checking whether the generated content is internally consistent and conforms to language and logic rules.
As a further scheme of the invention: in data processing, the different modal data such as text, images, audio, and video are collected with corresponding methods such as API interfaces and crawler tools, and the collected data are organized and classified according to their relevance to the public opinion event. Text data undergo word segmentation, stop-word removal, and part-of-speech tagging to facilitate subsequent text analysis and generation; image data undergo cropping, size normalization, and color-space conversion to facilitate subsequent image generation; audio data undergo segmentation, noise reduction, and feature extraction to facilitate subsequent audio analysis and generation; and video data undergo clipping, frame extraction, and key-frame selection to facilitate subsequent video generation.
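A minimal sketch of the text branch of this preprocessing is shown below. The stop-word list and regular expressions are illustrative assumptions, not part of the patent; a production system would use a dedicated segmenter and curated stop-word lists.

```python
import re

# Illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "is"}

def preprocess_text(raw: str) -> list:
    """Strip HTML remnants, tokenize, lowercase, and drop stop words."""
    cleaned = re.sub(r"<[^>]+>", " ", raw)               # remove HTML tags
    tokens = re.findall(r"[a-z]+|\d+", cleaned.lower())  # crude tokenizer
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess_text("<p>The crisis response team posted 3 updates</p>"))
# → ['crisis', 'response', 'team', 'posted', '3', 'updates']
```

Real public opinion text (especially Chinese social media data) would need a proper word-segmentation library rather than this whitespace-level tokenizer.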
As a further scheme of the invention: content generation includes text content generation, image content generation, audio content generation, and video content generation. The data of each modality undergo feature extraction and representation learning so as to be converted into a machine-understandable form: for image data, features are extracted with a convolutional neural network; for text data, with a word-embedding or text-encoding model; for audio data, with a spectrogram or another audio feature-extraction method. A fusion model, such as a multi-modal neural network or an image-text alignment model, is then used so that information from the different modalities supplements and aligns with each other, facilitating the subsequent generation tasks. For text generation, a recurrent neural network or Transformer model handles tasks such as generating news headlines and social media posts; for image generation, a generative adversarial network or variational autoencoder model handles tasks such as generating image descriptions and image synthesis; for audio generation, tasks include speech synthesis and music synthesis; for video generation, a generative adversarial network or spatiotemporal generation model handles tasks such as generating video clips and video descriptions. The overall method is composable and diffusion-based: it can generate output modalities in any combination of language, image, video, or audio from any combination of input modalities.
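The fusion step above can be illustrated with a simple late-fusion sketch: each modality's feature vector is L2-normalized and the results are concatenated into one joint representation. The two-dimensional features are toy values; real systems use learned high-dimensional embeddings.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (guarding against the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def fuse(text_feat, image_feat, audio_feat):
    """Late fusion: normalize each modality's features, then concatenate."""
    return l2_normalize(text_feat) + l2_normalize(image_feat) + l2_normalize(audio_feat)

joint = fuse([3.0, 4.0], [1.0, 0.0], [0.0, 2.0])
print(joint)  # → [0.6, 0.8, 1.0, 0.0, 0.0, 1.0]
```

Normalizing before concatenation keeps one modality's feature scale from dominating the joint representation, which is one common motivation for this design.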
As a further scheme of the invention: the technical basis of text content generation comprises:
Natural language processing: in the text content generation task, natural language processing technology is used for the preprocessing steps of word segmentation, part-of-speech tagging, syntactic analysis, and semantic understanding, and for the key links of language-model modeling and evaluation;
Language model: a language model is a probability distribution model that predicts the next word or character from the preceding text sequence; examples include the n-gram model, the recurrent neural network model, and the Transformer model. The language model is an important foundation for generating text content and is used to produce coherent sentences and passages;
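The simplest of these, the n-gram model, can be sketched in a few lines: a bigram model estimated by counting and then queried for the most probable next word. The toy corpus is an assumption; real systems train on large datasets.

```python
from collections import Counter, defaultdict

class BigramModel:
    """Bigram language model: P(next | prev) estimated by maximum likelihood counts."""

    def __init__(self, corpus):
        self.counts = defaultdict(Counter)
        for sentence in corpus:
            words = sentence.split()
            for prev, nxt in zip(words, words[1:]):
                self.counts[prev][nxt] += 1

    def predict(self, prev):
        """Return the most frequent successor of `prev`, or None if unseen."""
        successors = self.counts.get(prev)
        return successors.most_common(1)[0][0] if successors else None

lm = BigramModel(["the event was resolved", "the event escalated quickly"])
print(lm.predict("the"))  # → event
```

A recurrent network or Transformer replaces these raw counts with learned parameters, but the task is the same: model the distribution of the next token given its context.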
Sequence-to-sequence model: the sequence-to-sequence model is a text-generation framework consisting of an encoder and a decoder. The encoder encodes an input sequence into a fixed-dimension vector representation, and the decoder uses this vector to generate a target sequence. Sequence-to-sequence models are used for machine translation and dialogue-generation tasks;
Attention mechanism: the attention mechanism assigns different weights to different parts of the input during generation, helping the model focus on the important parts of the input sequence and improving the accuracy and fluency of generation. The self-attention mechanism in the Transformer model is the most widely applied attention mechanism at present;
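In its scaled dot-product form, the attention mechanism just described reduces to a weighted average of value vectors, with weights given by a softmax over query-key similarities. The two-dimensional vectors below are toy values chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention: softmax(q.k / sqrt(d)) weighted sum of values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# The query matches the first key, so the output leans toward the first value vector.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(out)
```

Because the weights sum to one, the output always lies inside the convex hull of the value vectors; self-attention simply derives queries, keys, and values from the same sequence.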
These technical foundations are combined and applied to the text content generation task to produce different types of text content, such as public opinion summaries, public opinion comments, and public opinion news.
As a further scheme of the invention: image content generation refers to the process of generating an image, within a single modality or across modalities, from given data using artificial intelligence technology. Depending on the task target and input modality, this includes image synthesis, generating new images from existing pictures, and generating images that conform to the semantics of a text description. The technical foundations of image content generation include:
diffusion model:
Implementation principle: a diffusion model defines a Markov chain of diffusion steps that continuously adds random noise to the data until only pure Gaussian noise remains, then learns the reverse diffusion process and generates an image by inverse denoising inference. Because the model systematically perturbs the data distribution and then restores it, the whole process is gradually optimized, which ensures the stability and controllability of the model;
Model advantages and disadvantages: the advantage of the diffusion model is that its forward and reverse diffusion processes, based on a Markov chain, can restore real data more accurately and preserve image detail better, so the generated images are more realistic; in particular, diffusion models achieve good results in image inpainting and molecular-graph generation. However, because of the complexity of its computation steps, the diffusion model suffers from slow sampling and from weaker generalization across data types;
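The forward half of this process can be written down directly. With a linear beta schedule, the closed form q(x_t | x_0) = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise mixes the data with Gaussian noise, and the signal coefficient decays toward zero as t grows. A scalar "image" is used purely for illustration, and the schedule constants follow common DDPM-style defaults but are assumptions here.

```python
import math
import random

def alpha_bar(t, T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) under a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0): scaled signal plus Gaussian noise."""
    ab = alpha_bar(t)
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
# Early steps keep nearly all of the signal; by t = T almost pure noise remains.
print(alpha_bar(1), alpha_bar(500), alpha_bar(1000))
```

The generative direction, learning a network that predicts and removes the added noise step by step, is what the patent's "inverse denoising inference" refers to; this sketch only shows the fixed forward corruption.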
CLIP: Contrastive Language-Image Pre-training
Principle: CLIP is a contrastive-learning-based text-image cross-modal pre-training model. Its training principle is to use encoders to extract features from text and images respectively, map both into the same representation space, and train the model by computing the similarity and difference of text-image pairs, so that an image conforming to a given text description can be generated;
Model advantages and disadvantages: thanks to its multi-modal contrastive learning and pre-training process, CLIP can align text features with image features without the data having to be labeled in advance. It performs excellently on zero-shot image-text classification tasks, grasps text descriptions and image styles more accurately, can vary inessential image details without losing accuracy, and yields more diverse generated images. However, CLIP is essentially an image classification model, so its performance on complex and abstract scenes is limited; for example, image generation is poor in tasks that involve time-series data or require reasoning. In addition, CLIP's training effect depends on a large-scale text-image pair dataset, so training resource consumption is relatively high.
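The shared representation space that makes CLIP useful can be illustrated at its simplest: once text and images are embedded into the same space (the hand-picked vectors below are hypothetical), matching reduces to cosine similarity between the text embedding and each candidate image embedding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical embeddings already mapped into a shared text-image space.
text_emb = [1.0, 0.2]
image_embs = {"crowd_photo": [0.9, 0.1], "landscape": [-0.5, 1.0]}

best = max(image_embs, key=lambda name: cosine(text_emb, image_embs[name]))
print(best)  # → crowd_photo
```

CLIP's contrastive training objective pushes matching text-image pairs toward high cosine similarity and mismatched pairs toward low similarity, which is why this retrieval-by-similarity picture works after training.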
As a further scheme of the invention: audio content generation refers to the process of synthesizing sound waveforms from input data. It includes synthesizing speech from text, converting speech between different languages, producing spoken descriptions of visual content (images or videos), and generating melodies and music. The technical foundations of audio content generation include:
Tacotron2:
Implementation principle: Tacotron2 is an end-to-end speech synthesis model that combines WaveNet and Tacotron, consisting of a spectrogram prediction network and a vocoder. The sequence-to-sequence prediction network extracts text features and feeds them into the model, the predicted values form a mel spectrogram, and the vocoder generates a time-domain waveform from the predicted sequence;
model advantages and disadvantages: tacotron2 introduces an attention mechanism to replace a traditional voice synthesis duration model, extracts structural features through a neural network, learns the corresponding relation between text and acoustic features, has the advantages that the gradient vanishing problem is optimized through improvement of the attention mechanism, the tone quality generated by audio content is good, the input text data is good in robustness, but has the disadvantages that the autoregressive model using a cyclic neural network structure is low in synthesis speed, the pronunciation of complex words is difficult, the generated voice lacks emotion colors, the training time and cost for a large data set are high, and the model lacks controllability;
Transformer-TTS:
Implementation principle: Transformer-TTS is an end-to-end speech generation model that applies the Transformer structure to a TTS system. Specifically, Transformer-TTS builds an encoder-decoder structure that introduces a multi-head attention mechanism to improve training efficiency, takes a phoneme sequence as input to generate a mel spectrogram, and outputs waveforms through a WaveNet vocoder;
Model advantages and disadvantages: a speech model with the Transformer structure trains faster, solving Tacotron2's slow training and its difficulty in modeling long-range dependencies, and because the Transformer builds on an understanding of semantics and relations, the generated audio sounds more natural. As an autoregressive model, however, it still suffers from slow inference and from model deviation caused by the accumulation of autoregressive errors;
FastSpeech:
Implementation principle: FastSpeech is a non-autoregressive sequence-to-sequence speech synthesis model. It takes a phoneme sequence as input, outputs a mel spectrogram based on the alignment produced by a length regulator, and improves synthesis speed through a parallelizable network structure;
Model advantages and disadvantages: FastSpeech generates the mel spectrogram in parallel through non-autoregressive decoding, which markedly improves computation speed, while its duration model keeps phonemes aligned with mel features, improving both synthesis speed and voice quality and giving good controllability over the generated audio. Its disadvantage is that training with knowledge distillation loses some information, which can make synthesis results inaccurate;
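The length regulator at the heart of FastSpeech's non-autoregressive decoding can be sketched in a few lines: each phoneme's hidden state is repeated according to its predicted duration so that the expanded sequence aligns one-to-one with mel-spectrogram frames. String placeholders stand in for the hidden-state vectors.

```python
def length_regulate(phoneme_hiddens, durations):
    """Expand per-phoneme hidden states to per-frame states by repetition."""
    frames = []
    for hidden, duration in zip(phoneme_hiddens, durations):
        frames.extend([hidden] * duration)
    return frames

# Placeholder hidden states for three phonemes with predicted durations 2, 1, 3.
hiddens = ["h_a", "h_b", "h_c"]
print(length_regulate(hiddens, [2, 1, 3]))
# → ['h_a', 'h_a', 'h_b', 'h_c', 'h_c', 'h_c']
```

Because every output frame is known up front, the decoder can produce all frames in parallel, which is exactly why this design avoids the slow step-by-step decoding of autoregressive TTS.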
DeepVoice3:
Implementation principle: DeepVoice3 is a speech system based on a fully convolutional architecture. It converts various text features into vocoder parameters through fully parallel computation and generates speech by feeding those parameters into a waveform-synthesis model;
Model advantages and disadvantages: DeepVoice3 scales speech-synthesis training to larger datasets, can quickly be applied to training on new kinds of datasets, and suits multi-speaker speech synthesis tasks. Because the model extracts text features in a fully convolutional manner, it markedly improves training speed and GPU utilization and reduces training cost;
AudioLM:
Implementation principle: AudioLM applies the training principle of a language model, modeling and training semantic tokens and acoustic tokens with a Transformer structure, so that it can reason about semantic information from an audio prompt and generate the subsequent speech or piano music;
Model advantages and disadvantages: AudioLM needs no training on labeled data, can preserve the speaker characteristics or musical style of the original prompt audio, and generates new audio that is semantically and stylistically consistent, with good naturalness and coherence in the generated sound.
As a further scheme of the invention: video content generation refers to the automatic generation, by a trained artificial intelligence, of descriptive, high-fidelity video content from given text, image, or video data in a single modality or in multiple modalities. The technical foundations of video content generation include:
Imagen Video:
Implementation principle: Imagen Video is a text-conditional video generation model developed from the Imagen image model. Through a combination of several diffusion models, it generates an initial video from a text prompt and then progressively increases the resolution and frame count to produce the final video;
Model advantages and disadvantages: the generated videos have high fidelity, controllability, and world knowledge; the model supports generating diverse videos and text animations in various artistic styles and shows an understanding of 3D objects. However, the parallel training required by the cascaded models demands substantial computational resources;
Gen:
Implementation principle: the Gen model learns text-image features through a latent diffusion model and generates a new video from a given text prompt or reference image, or performs tasks such as video style transfer from an original video and a driving image;
Model advantages and disadvantages: the model performs well in video rendering and style transfer, produces videos with strong artistry and good preservation of image structure, and adapts well to customization requirements, but the Gen model is still limited in the stability of its generated results;
CogVideo:
Implementation principle: CogVideo is a large-scale text-to-video generation model based on an autoregressive method. It applies the image generation model CogView2 to text-to-video generation for efficient learning, and generates a video by recursively predicting each frame from the previous ones and splicing them together;
Model advantages and disadvantages: the model supports Chinese prompts, and its multi-frame-rate hierarchical training method better captures the relation between text and video, so the generated video looks more natural; however, the model limits the length of the input sequence.
As still further aspects of the invention: in the content quality and credibility evaluation, natural language processing techniques such as language models and semantic analysis are used to evaluate the consistency of the generated content. If the generated content is based on a specific knowledge base or database, a knowledge graph or domain expertise is used to check whether the content agrees with the information in the knowledge base. Whether the emotion expressed in the generated content is reasonable and consistent is also evaluated, using emotion-analysis techniques such as emotion dictionaries and emotion classifiers to assess emotional tendency. If the generated content contains images or audio, image-processing and audio-processing techniques can evaluate their authenticity and reliability, for example by detecting whether an image has been edited or synthesized, or by analyzing the acoustic characteristics of the audio for verification.
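The emotion-dictionary part of this evaluation can be sketched as lexicon-based scoring: count positive and negative words in the generated text and report a tendency in [-1, 1]. The word lists are illustrative assumptions, not part of the patent.

```python
# Illustrative emotion dictionaries (assumptions, not from the patent).
POSITIVE = {"calm", "resolved", "transparent", "apology", "support"}
NEGATIVE = {"panic", "outrage", "coverup", "crisis", "anger"}

def sentiment_score(tokens):
    """Return emotional tendency in [-1, 1]; 0.0 when no emotion words occur."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(sentiment_score("the team issued a transparent apology and stayed calm".split()))
# → 1.0
```

An emotion classifier trained on labeled data would replace these fixed word lists, but the evaluation contract is the same: map generated text to an emotional tendency and flag content whose tendency is unreasonable for the event.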
Compared with the prior art, the invention has the beneficial effects that:
1. The system replaces the traditional manual mode of public opinion response, handling large volumes of content-generation tasks automatically through intelligent algorithms. It can quickly and accurately generate public opinion content that meets user requirements, improves the efficiency and quality of content generation, provides more comprehensive and diversified public opinion event response capability, and saves the time and cost of manual processing. Because its algorithms and models are data-driven, the system can adapt to public opinion data and user feedback, providing more accurate, comprehensive, and diversified content and thereby improving the effectiveness and efficiency of public opinion event response.
2. The invention abandons the traditional text-only mode of public opinion response; through multi-modal content generation technology, the system can combine text, images, video, and audio to respond to public opinion in a more comprehensive and diversified way.
3. Besides generating content about the objective event, the invention can analyze the sentiment and emotion information in a public opinion event and generate corresponding content according to the result of sentiment analysis to express different sentiments and emotions, providing high controllability and editability and achieving accuracy and pertinence of the content.
Drawings
Fig. 1 is a system configuration diagram of an artificial intelligence multi-modal content generation system for public opinion event coping.
Fig. 2 is a system configuration diagram of content production in an artificial intelligence multi-modal content generation system for public opinion event coping.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following examples in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, well known methods, procedures, and components have not been described in detail so as not to obscure the subject matter of the present application.
Example 1
Referring to fig. 1-2, an artificial intelligence multi-modal content generation system for public opinion event coping applies artificial intelligence multi-modal content generation technology to the application scenario of public opinion event coping, and includes:
and (3) data processing: firstly, collecting and sorting multi-mode data related to public opinion events, including texts, images, audios and videos, acquiring the data related to the public opinion events from various sources through a web crawler technology, including news reports, social media information and user comments, and preprocessing the sorted data to facilitate subsequent model training and generation, cleaning and denoising the data, eliminating useless or erroneous data and ensuring data quality;
content generation: first, for the multi-modal data related to public opinion events, the data of different modalities must be understood and analyzed to acquire their semantic information and characteristics; then, the features of multiple modalities are fused and aligned to establish associations between the different modalities; finally, multi-modal content generation is carried out: based on the fused feature representation, a generation model such as a generative adversarial network is used to generate multi-modal content;
content quality credibility assessment: evaluating the quality and trustworthiness of the generated content may use metrics and evaluation methods to assess whether the context of the generated content is coherent and conforms to linguistic and logical rules.
Specifically, the system of the invention focuses on generating multi-modal content to respond to and guide public opinion events, using artificial intelligence multi-modal content generation technology to cope with public opinion automatically. When a public opinion event occurs, a certain capability for writing public opinion manuscripts is required, and one's image is optimized by writing and publishing articles or online comments on platforms such as Tieba (post bars), forums, blogs, microblogs and WeChat.
Preferably, in the data processing, different modal data such as text, image, audio and video are collected using corresponding methods such as API interfaces and crawler tools, and the collected data are organized and classified according to their relevance to the public opinion event. Text data undergo word segmentation, stop-word removal and part-of-speech tagging to facilitate subsequent text analysis and generation; image data undergo cropping, size unification and color-space conversion to facilitate subsequent image generation; audio data undergo segmentation, noise reduction and feature extraction to facilitate subsequent audio analysis and generation; and video data undergo clipping, frame extraction and key-frame selection to facilitate subsequent video generation.
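The text-cleaning step described above can be illustrated with a minimal Python sketch. The regex tokenizer and the tiny stop-word list are illustrative assumptions; a production system would use a proper word segmenter and a full domain stop-word dictionary.

```python
import re

# Illustrative stop-word list; a real system would load a full
# public-opinion-domain stop-word dictionary (an assumption here).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def preprocess_text(raw: str) -> list[str]:
    """Clean raw crawled text: strip markup remnants, lowercase,
    tokenize on alphanumeric runs, and drop stop words."""
    no_markup = re.sub(r"<[^>]+>", " ", raw)  # remove HTML tag remnants
    tokens = re.findall(r"[a-z0-9]+", no_markup.lower())
    return [t for t in tokens if t not in STOP_WORDS]

cleaned = preprocess_text("<p>The event is trending on social media</p>")
```

The same pattern (clean, normalize, tokenize, filter) applies to comments and news text alike before model training.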
Preferably, the content generation includes: text content generation, image content generation, audio content generation, and video content generation. In content generation, the data of each modality are subjected to feature extraction and representation learning to be converted into a machine-understandable form: for image data, feature extraction is performed using a convolutional neural network; for text data, feature extraction is performed using word embedding or a text encoding model; for audio data, spectrograms or other audio feature extraction methods are used. A fusion model, including a multi-modal neural network or an image-text alignment model, is then applied with the aim of mutually supplementing and aligning information between the different modalities to facilitate subsequent generation tasks. For text generation, a recurrent neural network or a Transformer model is used for tasks such as generating news headlines and social media posts; for image generation, a generative adversarial network or a variational auto-encoder model is used for tasks such as generating image descriptions and image synthesis; for audio generation, tasks such as speech synthesis and music synthesis are performed; for video generation, a generative adversarial network or a spatiotemporal generation model is used for tasks such as generating video clips and video descriptions. The method is combinable and diffusive, capable of generating output modalities of any combination of language, image, video or audio from any combination of input modalities.
Preferably, the technical basis of text content generation includes:
natural language processing: in the text content generation task, natural language processing technology is used in preprocessing steps such as word segmentation, part-of-speech tagging, syntactic analysis and semantic understanding, and in key links such as the modeling and evaluation of the language model;
language model: the language model is a probability distribution model that predicts the next word or character according to the preceding text sequence, and includes the n-gram model, the recurrent neural network model and the Transformer model; the language model is an important basis for generating text content and is used for generating coherent sentences and chapters;
sequence-to-sequence model: the sequence-to-sequence model is a text generation framework consisting of an encoder and a decoder: the encoder encodes an input sequence into a fixed-dimension vector representation, and the decoder uses this vector to generate the target sequence; the sequence-to-sequence model is used for tasks such as machine translation and dialog generation;
attention mechanism: the attention mechanism gives different weights to different parts of the input during generation; it helps the model better attend to the important parts of the input sequence and improves the accuracy and fluency of generation; the self-attention mechanism in the Transformer model is the most widely applied attention mechanism at present;
These technical foundations are combined and applied to the text content generation task to realize the generation of different types of text content, such as public opinion summaries, public opinion comments and public opinion news.
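The attention mechanism described above can be sketched as scaled dot-product attention for a single query. This pure-Python toy (the vectors and their 2-d dimension are illustrative stand-ins for real hidden states) shows how softmax weights over query-key scores blend the value vectors:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(query, keys, values):
    """Scaled dot-product attention for one query:
    weight each value vector by softmax(q . k / sqrt(d))."""
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the first key, so the output leans toward the first value.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
```

In a Transformer, the same computation runs for every position in parallel, with learned projections producing the queries, keys and values.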
Preferably, the image content generation refers to the process of generating an image, in a single modality or across modalities, from given data using artificial intelligence technology; according to different task targets and input modes, it includes image synthesis, generating a new image from an existing picture, and generating an image conforming to the semantics of a text description, wherein the technical basis of image content generation includes:
diffusion model:
the realization principle is as follows: the diffusion model is characterized in that a Markov chain of a diffusion step is defined, random noise is continuously added to data until pure Gaussian noise data is obtained, then an inverse diffusion process is learned, an image is generated through inverse noise reduction inference, the diffusion model systematically perturbs the distribution in the data, and then the data distribution is restored, so that the whole process presents a gradually optimized property, and the stability and the controllability of the model are ensured;
model advantages and disadvantages: the advantage of the diffusion model is that its forward and backward diffusion processes based on a Markov chain can restore real data more accurately, with a stronger capacity to preserve image detail, so that the generated images are more realistic; in particular, the diffusion model achieves good results in applications such as image inpainting and molecular graph generation; however, due to the complexity of its calculation steps, the diffusion model also suffers from slower sampling speed and weaker generalization across data types;
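The forward diffusion process described above, a Markov chain that repeatedly mixes the data with Gaussian noise, can be sketched in a few lines. The noise schedule and the 8-element "image" below are toy assumptions, not a real training configuration:

```python
import math
import random

def forward_diffuse(x0, betas, rng):
    """Forward diffusion (Markov chain): at each step,
    x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,
    gradually pushing the data toward pure Gaussian noise."""
    x = list(x0)
    for beta in betas:
        keep, noise = math.sqrt(1.0 - beta), math.sqrt(beta)
        x = [keep * xi + noise * rng.gauss(0.0, 1.0) for xi in x]
    return x

rng = random.Random(0)                       # fixed seed for reproducibility
noised = forward_diffuse([1.0] * 8, [0.02] * 100, rng)
```

A trained diffusion model learns the reverse of this chain, denoising step by step from pure noise back to an image.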
CLIP (Contrastive Language-Image Pre-training):
Principle: CLIP is a text-image cross-modal pre-training model based on contrastive learning; its training principle is to extract features of text and image with separate encoders, map them into the same representation space, and train the model through similarity and difference calculations over text-image pairs, so that an image conforming to a description can be generated from a given text;
model advantages and disadvantages: the advantage of the CLIP model is that its multi-modal contrastive learning and pre-training process aligns text features with image features, so data need not be labeled in advance, and the model excels at zero-shot image-text classification tasks. It also grasps text descriptions and image styles more accurately, can vary inessential image details without losing accuracy, and generates more diverse images. However, the CLIP model is essentially an image classification model, so its performance on complex and abstract scenes is limited: for example, image generation is poor in tasks containing time-series data or requiring reasoning. In addition, the training effect of CLIP depends on a large-scale text-image pair dataset, and its training resource consumption is relatively large.
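The shared-representation-space idea behind CLIP can be illustrated with a toy retrieval step: given already-encoded embeddings (the 2-d vectors below are hypothetical stand-ins for real encoder outputs, which in CLIP are 512-d), the best-matching image is the one whose embedding has the highest cosine similarity to the text embedding:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def best_match(text_vec, image_vecs):
    """CLIP-style retrieval: pick the index of the image embedding
    most similar to the text embedding in the shared space."""
    sims = [cosine(text_vec, iv) for iv in image_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy embeddings: the second image points the same way as the text.
idx = best_match([0.9, 0.1], [[0.1, 0.9], [0.8, 0.2]])
```

During contrastive training, matching text-image pairs are pushed toward high similarity and mismatched pairs toward low similarity, which is what makes this retrieval step meaningful.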
Preferably, the audio content generation refers to the process of synthesizing corresponding sound waveforms from input data, including synthesizing speech from text, performing speech conversion between different languages, producing spoken descriptions of visual content (images or videos), and generating melodies and music, wherein the technical basis of audio content generation includes:
Tacotron2:
the realization principle is as follows: Tacotron2 is an end-to-end speech synthesis model that combines WaveNet and Tacotron, consisting of a spectrogram prediction network and a vocoder: the sequence-to-sequence prediction network extracts text features and predicts mel-spectrogram frames, and the vocoder generates the time-domain waveform from the predicted spectrogram sequence;
model advantages and disadvantages: tacotron2 introduces an attention mechanism to replace a traditional voice synthesis duration model, extracts structural features through a neural network, learns the corresponding relation between text and acoustic features, has the advantages that the gradient vanishing problem is optimized through improvement of the attention mechanism, the tone quality generated by audio content is good, the input text data is good in robustness, but has the disadvantages that the autoregressive model using a cyclic neural network structure is low in synthesis speed, the pronunciation of complex words is difficult, the generated voice lacks emotion colors, the training time and cost for a large data set are high, and the model lacks controllability;
Transformer-TTS:
The realization principle is as follows: Transformer-TTS is an end-to-end speech generation model that applies the Transformer structure to a TTS system; specifically, Transformer-TTS constructs an encoder-decoder structure by introducing a multi-head attention mechanism to improve training efficiency, takes a phoneme sequence as input to generate a mel spectrogram, and outputs the waveform through a WaveNet vocoder;
model advantages and disadvantages: a speech model with the Transformer structure trains faster, solving Tacotron2's problems of slow training and difficulty in modeling long-range dependencies, and because the Transformer builds on an understanding of semantics and relations, the generated audio sounds more natural; however, as an autoregressive model it still suffers from slower inference and from model deviation caused by accumulated autoregressive errors;
FastSpeech:
the realization principle is as follows: FastSpeech is a non-autoregressive sequence-to-sequence speech synthesis model; its working principle is to take a phoneme sequence as input, output a mel spectrogram through the alignment produced by a length regulator, and improve speech synthesis speed through a parallelizable network structure;
model advantages and disadvantages: the advantage of FastSpeech is that the mel spectrogram is generated in parallel by non-autoregressive decoding, which markedly improves computation speed, while the duration model ensures that phonemes correspond to mel features, improving both synthesis speed and voice quality, with good controllability of the generated audio; its disadvantage is that training with knowledge distillation incurs information loss, which can make synthesis results inaccurate;
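The length regulator at the heart of FastSpeech's parallel decoding can be sketched as follows; the phoneme symbols and integer durations below are illustrative placeholders for real phoneme feature vectors and predicted frame counts:

```python
def length_regulator(phoneme_feats, durations):
    """FastSpeech-style length regulator: repeat each phoneme's feature
    by its predicted duration so the expanded sequence aligns with the
    mel-spectrogram frame count, enabling fully parallel decoding."""
    frames = []
    for feat, dur in zip(phoneme_feats, durations):
        frames.extend([feat] * dur)
    return frames

# Two phonemes with predicted durations of 2 and 3 frames -> 5 frames total.
frames = length_regulator(["ni", "hao"], [2, 3])
```

Because the output length is fixed up front by the durations, the decoder no longer needs to generate frames one by one, which is where the speedup over autoregressive models comes from.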
DeepVoice3:
The realization principle is as follows: DeepVoice3 is a speech system based on a fully convolutional architecture; it converts various text features into vocoder parameters through fully parallel computation and generates speech by using these parameters as input to a waveform synthesis model;
model advantages and disadvantages: DeepVoice3 scales speech synthesis training to larger datasets, can be rapidly applied to training on different new datasets, and is suitable for multi-speaker speech synthesis tasks; at the same time, the model extracts text features in a fully convolutional manner, which significantly improves training speed and GPU utilization and reduces training cost;
AudioLM:
the realization principle is as follows: based on the training principle of a language model, AudioLM models and trains semantic tokens and acoustic tokens through a Transformer structure, so that semantic reasoning is carried out from an audio prompt and the subsequent speech or piano music is generated;
model advantages and disadvantages: AudioLM does not need to be trained on labeled data, can preserve the speaker characteristics or musical style of the original prompt audio, and generates new audio with consistent semantics and style; the generated sound has good naturalness and consistency.
Preferably, the video content generation means that video content is automatically generated, through artificial intelligence training, from given text, image or video data in a single modality or multiple modalities, wherein the technical basis of video content generation includes:
Imagen-Video:
The realization principle is as follows: Imagen-Video is a text-conditional video generation model developed from the Imagen image model; through a combination of multiple diffusion models, it generates an initial video from a text prompt and then progressively increases the resolution and frame count of the video to produce the final result;
model advantages and disadvantages: the generated videos have high fidelity, controllability and world knowledge; the model supports generating diverse videos and text animations in various artistic styles and has some ability to understand 3D objects; however, the parallel training mode adopted by the cascaded models requires substantial computing resources;
Gen:
the realization principle is as follows: the Gen model learns text-image features through a latent diffusion model and generates a new video from a given text prompt or reference image, or performs tasks such as video style transfer from an original video and a driving image;
model advantages and disadvantages: the model performs well in video rendering and style transfer, the generated videos have strong artistry and strong preservation of image structure, and it adapts well to model customization requirements; however, the Gen model is still limited in the stability of its generated results;
CogVideo:
the realization principle is as follows: CogVideo is a large-scale text-to-video generation model based on an autoregressive method; it applies the image generation model CogView2 to text-to-video generation for efficient learning, and generates video by recursively predicting subsequent frames from previous frames and splicing them together;
Model advantages and disadvantages: the model supports Chinese prompts, and its multi-frame-rate hierarchical training method better captures the relation between text and video, so the generated video looks more natural; however, the model is limited in the length of its input sequence.
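The recursive frame-prediction scheme described for CogVideo reduces to a generic autoregressive loop. The integer "frames" and the stand-in predictor below are hypothetical placeholders for real frame tensors and a learned next-frame model:

```python
def generate_video(first_frame, predict_next, num_frames):
    """Autoregressive video generation sketch: each new frame is
    predicted from the previous one and spliced onto the sequence,
    mirroring the recursive frame prediction described above."""
    frames = [first_frame]
    for _ in range(num_frames - 1):
        frames.append(predict_next(frames[-1]))
    return frames

# Stand-in predictor: a real system would call a learned network here.
video = generate_video(0, lambda f: f + 1, 4)
```

The loop also makes the stated limitation concrete: because each frame conditions on the previous one, the sequence length the model can handle is bounded by its context window.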
Preferably, in the evaluation of content quality and credibility, natural language processing techniques such as language models and semantic analysis are used to evaluate the coherence of the generated content. If the generated content is based on a specific knowledge base or database, a knowledge graph or professional knowledge in the related field is used to evaluate whether the generated content is consistent with the information in the knowledge base. Whether the emotion expressed in the generated content is reasonable and consistent is also evaluated: emotion analysis techniques such as an emotion dictionary and an emotion classifier are used to assess the emotional tendency of the generated content. If the generated content contains images or audio, image processing and audio processing techniques can be used to evaluate their authenticity and reliability, for example by detecting whether an image has been edited or synthesized, or by analyzing the acoustic characteristics of the audio for verification.
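A minimal sketch of the emotion-dictionary technique mentioned above: the tiny lexicon here is an illustrative assumption, whereas real systems load large weighted sentiment lexicons and often combine them with a trained classifier.

```python
# Tiny illustrative emotion dictionary mapping words to polarities
# (an assumption; production systems use weighted domain lexicons).
LEXICON = {"good": 1, "calm": 1, "trust": 1,
           "bad": -1, "angry": -1, "panic": -1}

def emotion_score(tokens):
    """Dictionary-based emotional tendency: sum of per-word polarities.
    Positive total -> positive tendency; negative -> negative tendency."""
    return sum(LEXICON.get(t, 0) for t in tokens)

score = emotion_score(["public", "response", "was", "calm", "and", "good"])
```

In the quality-assessment step, such a score for the generated content can be compared with the intended emotional tendency to flag inconsistent output.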
The specific explanation is as follows: the invention uses an artificial intelligence multi-modal content generation system to make the automated flow of public opinion processing more complete, improving the efficiency and accuracy of public opinion response and providing users with better public opinion analysis and decision support. It combines the text, image, video and audio modalities to respond to public opinion in a more comprehensive and diversified manner, and through multi-modal content generation it can communicate information more intuitively. It abandons the traditional manual mode of public opinion response: intelligent algorithms and technology automatically handle a large volume of content generation tasks, quickly and accurately generating public opinion content that meets user requirements, improving the efficiency and quality of content generation, providing more comprehensive and diversified public opinion event coping capability, and saving the time and cost of manual processing. With high controllability and editability, it can analyze the sentiment and emotion information in public opinion events and generate corresponding content according to the results of sentiment analysis to express different sentiments, and it can customize content according to the characteristics and goals of the user, achieving accuracy and customization of the content.
Interpretation of commonly used terms:
multi-modal content generation: multi-modal digital content generation generally refers to synthesis techniques that use AI generation technology to produce image, video, speech, text and music content.
Public opinion: public opinion is the abbreviation of "public opinion situation". It refers to the social attitudes held by the public, as subjects, toward social managers, enterprises, individuals and other kinds of organizations and their political, social and moral tendencies, arising around the occurrence, development and change of intermediary social events within a certain social space; it is the sum of the beliefs, attitudes, opinions and emotions expressed by many people about the various phenomena and problems in society.
Emotion recognition technology: emotion recognition is a key technology for enabling machines to understand human emotion; researchers try to fuse more emotional signals, from text, speech, facial expressions and body movements to physiological signals inside the body, so that recognition becomes more accurate and human-machine interaction becomes more natural, smooth and warm.
Prompt engineering: prompt engineering is the technique of crafting appropriate prompts for large-model applications so that the large model produces better generation results.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only a single technical solution; this manner of description is adopted for clarity only. The specification should be taken as a whole, and the technical solutions in the various embodiments may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.
Claims (8)
1. An artificial intelligence multi-mode content generating system facing public opinion event coping is characterized by applying an artificial intelligence multi-mode content generating technology to an application scene of public opinion event coping, and comprising the following steps:
And (3) data processing: firstly, collecting and sorting multi-mode data related to public opinion events, including texts, images, audios and videos, acquiring the data related to the public opinion events from various sources through a web crawler technology, including news reports, social media information and user comments, and preprocessing the sorted data to facilitate subsequent model training and generation, cleaning and denoising the data, eliminating useless or erroneous data and ensuring data quality;
content generation: first, for the multi-modal data related to public opinion events, the data of different modalities must be understood and analyzed to acquire their semantic information and characteristics; then, the features of multiple modalities are fused and aligned to establish associations between the different modalities; finally, multi-modal content generation is carried out: based on the fused feature representation, a generation model such as a generative adversarial network is used to generate multi-modal content;
content quality credibility assessment: evaluating the quality and trustworthiness of the generated content may use metrics and evaluation methods to assess whether the context of the generated content is coherent and conforms to linguistic and logical rules.
2. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 1, wherein in the data processing, different modal data such as text, image, audio and video are collected by corresponding methods such as API interfaces and crawler tools; the collected data are organized and classified according to their relevance to the public opinion event; text data undergo word segmentation, stop-word removal and part-of-speech tagging for subsequent text analysis and generation; image data undergo cropping, size unification and color-space conversion for subsequent image generation; audio data undergo segmentation, noise reduction and feature extraction for subsequent audio analysis and generation; and video data undergo clipping, frame extraction and key-frame selection for subsequent video generation.
3. The public opinion event coping-oriented artificial intelligence multi-modal content generation system of claim 1, wherein the content generation comprises: text content generation, image content generation, audio content generation, and video content generation; in content generation, the data of each modality are subjected to feature extraction and representation learning to be converted into a machine-understandable form: for image data, feature extraction is performed using a convolutional neural network; for text data, feature extraction is performed using word embedding or a text encoding model; for audio data, a spectrogram or another audio feature extraction method is used; a fusion model, comprising a multi-modal neural network or an image-text alignment model, is applied with the aim of mutually supplementing and aligning information between the different modalities to facilitate subsequent generation tasks; for text generation, a recurrent neural network or a Transformer model is used for tasks such as generating news headlines and social media posts; for image generation, a generative adversarial network or a variational auto-encoder model is used for tasks such as generating image descriptions and image synthesis; for audio generation, tasks such as speech synthesis and music synthesis are performed; for video generation, a generative adversarial network or a spatiotemporal generation model is used for tasks such as generating video clips and video descriptions; and the method is combinable and diffusive, capable of generating output modalities of any combination of language, image, video or audio from any combination of input modalities.
4. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein the technical base of text content generation comprises:
natural language processing: in the text content generation task, natural language processing technology is used in preprocessing steps such as word segmentation, part-of-speech tagging, syntactic analysis and semantic understanding, and in key links such as the modeling and evaluation of the language model;
language model: the language model is a probability distribution model that predicts the next word or character according to the preceding text sequence, and includes the n-gram model, the recurrent neural network model and the Transformer model; the language model is an important basis for generating text content and is used for generating coherent sentences and chapters;
sequence-to-sequence model: the sequence-to-sequence model is a text generation framework consisting of an encoder and a decoder: the encoder encodes an input sequence into a fixed-dimension vector representation, and the decoder uses this vector to generate the target sequence; the sequence-to-sequence model is used for tasks such as machine translation and dialog generation;
attention mechanism: the attention mechanism gives different weights to different parts of the input during generation; it helps the model better attend to the important parts of the input sequence and improves the accuracy and fluency of generation; the self-attention mechanism in the Transformer model is the most widely applied attention mechanism at present;
These technical foundations are combined and applied to the text content generation task to realize the generation of different types of text content, such as public opinion summaries, public opinion comments and public opinion news.
5. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein the image content generation means the process of generating an image, in a single modality or across modalities, from given data using artificial intelligence technology; according to different task targets and input modes, it includes image synthesis, generating a new image from an existing picture, and generating an image conforming to the semantics of a text description, wherein the technical basis of image content generation includes:
diffusion model:
the realization principle is as follows: the diffusion model is characterized in that a Markov chain of a diffusion step is defined, random noise is continuously added to data until pure Gaussian noise data is obtained, then an inverse diffusion process is learned, an image is generated through inverse noise reduction inference, the diffusion model systematically perturbs the distribution in the data, and then the data distribution is restored, so that the whole process presents a gradually optimized property, and the stability and the controllability of the model are ensured;
Model advantages and disadvantages: the advantage of the diffusion model is that its forward and backward diffusion processes based on a Markov chain can restore real data more accurately, with a stronger capacity to preserve image detail, so that the generated images are more realistic; in particular, the diffusion model achieves good results in applications such as image inpainting and molecular graph generation; however, due to the complexity of its calculation steps, the diffusion model also suffers from slower sampling speed and weaker generalization across data types;
CLIP (Contrastive Language-Image Pre-training):
principle: CLIP is a text-image cross-modal pre-training model based on contrastive learning; its training principle is to extract features of text and image with separate encoders, map them into the same representation space, and train the model through similarity and difference calculations over text-image pairs, so that an image conforming to a description can be generated from a given text;
model advantages and disadvantages: the advantage of the CLIP model is that the multi-modal contrastive learning and pre-training process aligns text features with image features, so the data need not be labeled in advance; CLIP excels at zero-shot image-text classification tasks, captures text descriptions and image styles more accurately, and can vary inessential image details without sacrificing accuracy, yielding more diverse generated images; however, CLIP is essentially an image classification model, so its performance on complex and abstract scenes is limited, for example poor image generation in tasks that contain time-series data or require inferential computation; in addition, CLIP's training effect depends on a large-scale text-image pair dataset, and training resource consumption is relatively large.
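The text-image similarity computation at the core of CLIP's training can be illustrated with a toy example. The embeddings and temperature value below are hypothetical stand-ins for real encoder outputs; only the scoring scheme (L2-normalise, cosine similarity, temperature-scaled softmax) reflects the contrastive setup described above.

```python
import numpy as np

def clip_similarity(text_emb, image_emb, temperature=0.07):
    """CLIP-style scoring: L2-normalise both sets of embeddings, then take a
    temperature-scaled cosine-similarity matrix (rows: texts, cols: images)."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature
    # Softmax over images: for each text, a distribution over candidate images.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy 2-pair batch: matching pairs point in similar directions,
# so the diagonal of the probability matrix dominates.
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
images = np.array([[0.9, 0.1], [0.1, 0.9]])
probs = clip_similarity(texts, images)
```

During training, the contrastive loss pushes the diagonal (matching pairs) toward 1 and the off-diagonal (mismatched pairs) toward 0.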
6. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein audio content generation means a process of synthesizing corresponding sound waveforms from input data, including synthesizing speech from text, performing voice conversion between different languages, producing spoken descriptions of visual content such as images or videos, and generating melodies and music, wherein the technical basis of the audio content generation includes:
Tacotron2:
the realization principle is as follows: Tacotron2 is an end-to-end speech synthesis model that combines WaveNet and Tacotron and consists of a spectrogram prediction network and a vocoder; the sequence-to-sequence prediction network extracts text features and feeds them into the model, the predicted values are superimposed onto a mel spectrogram, and the vocoder generates a time-domain waveform from the predicted sequence;
model advantages and disadvantages: Tacotron2 introduces an attention mechanism to replace the traditional speech synthesis duration model, extracts structural features through a neural network, and learns the correspondence between text and acoustic features; its advantages are that the improved attention mechanism mitigates the vanishing gradient problem, the generated audio quality is good, and robustness to input text data is good; its disadvantages are that the autoregressive model built on a recurrent neural network structure synthesizes slowly, complex words are difficult to pronounce, the generated speech lacks emotional color, training time and cost on large datasets are high, and the model lacks controllability;
Transformer-TTS:
the realization principle is as follows: Transformer-TTS is an end-to-end speech generation model that applies the Transformer structure to a TTS system; specifically, Transformer-TTS builds an encoder-decoder structure with a multi-head attention mechanism to improve training efficiency, takes a phoneme sequence as input to generate a mel spectrogram, and outputs the waveform through a WaveNet vocoder;
model advantages and disadvantages: a speech model with the Transformer structure accelerates training, solving Tacotron2's problems of slow training and difficulty in modeling long-range dependencies; because the Transformer builds on an understanding of semantics and relations, the generated audio sounds more natural; however, as an autoregressive model it still suffers from slow inference and from model bias caused by the accumulation of autoregressive errors;
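The multi-head attention mechanism mentioned above reduces, per head, to scaled dot-product attention, which is what lets every phoneme position relate to every other position in parallel. A minimal single-head sketch with toy shapes (not the model's actual dimensions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Each row of the weight matrix is a distribution over all key positions,
    so all positions are processed simultaneously rather than step by step."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# 4 toy phoneme positions, model width 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention runs several such heads on learned projections of Q, K, V and concatenates the results; that projection step is omitted here.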
FastSpeech:
the realization principle is as follows: FastSpeech is a non-autoregressive sequence-to-sequence speech synthesis model; its working principle is to take a phoneme sequence as input, output a mel spectrogram according to the alignment produced by a length regulator, and raise the speech synthesis speed through a parallelizable network structure;
model advantages and disadvantages: the advantage of FastSpeech is that the mel spectrogram is generated in parallel by non-autoregressive decoding, which markedly increases computation speed, while a duration model keeps phonemes aligned with mel features, improving both synthesis speed and voice quality and giving good controllability over the generated audio; its disadvantage is that training with knowledge distillation incurs information loss, which can make the synthesis results inaccurate;
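The length regulator that aligns phonemes to mel frames can be sketched as a simple repeat operation: each phoneme's hidden vector is duplicated according to its predicted duration so that the expanded sequence matches the target number of spectrogram frames. The hidden states and durations below are toy values.

```python
import numpy as np

def length_regulator(phoneme_hidden, durations):
    """Expand each phoneme's hidden vector `durations[i]` times along the time
    axis, so the sequence length equals the total mel-frame count."""
    return np.repeat(phoneme_hidden, durations, axis=0)

# 3 phonemes with hidden size 4; predicted durations 2, 1, 3 frames -> 6 mel frames.
hidden = np.arange(12, dtype=float).reshape(3, 4)
frames = length_regulator(hidden, np.array([2, 1, 3]))
```

Because the expanded sequence length is known up front, the decoder can produce all frames in one parallel pass, which is the source of FastSpeech's speed advantage.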
DeepVoice3:
the realization principle is as follows: DeepVoice3 is a speech synthesis system based on a fully convolutional architecture; it converts various text features into vocoder parameters in a fully parallel computing manner and generates speech by feeding those parameters into a waveform synthesis model;
model advantages and disadvantages: DeepVoice3 scales up the dataset size for speech synthesis training, can be quickly adapted to training on different new datasets, and is suitable for multi-speaker speech synthesis tasks; at the same time, because the model extracts text features in a fully convolutional manner, training speed and GPU utilization are markedly improved and training cost is reduced;
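The parallelism of a fully convolutional text encoder can be illustrated with a toy one-dimensional convolution: every output position depends only on a local window of the input, so all positions can be computed independently, unlike an RNN's sequential recurrence. The scalar feature sequence and filter here are hypothetical.

```python
import numpy as np

def conv1d_same(x, kernel):
    """'Same'-padded 1-D convolution over a feature sequence. Each output
    position is an independent dot product over a local window, so the loop
    below could be evaluated entirely in parallel on a GPU."""
    k = len(kernel)
    pad = k // 2
    padded = np.concatenate([np.zeros(pad), x, np.zeros(pad)])
    return np.array([padded[i:i + k] @ kernel for i in range(len(x))])

# Toy character-feature sequence and a 3-tap smoothing filter.
seq = np.array([1.0, 2.0, 3.0, 4.0])
feat = conv1d_same(seq, np.array([0.25, 0.5, 0.25]))
```

A real DeepVoice3-style encoder stacks many such convolutions over multi-channel embeddings with gating; this sketch only shows the positional independence that enables parallel training.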
AudioLM:
the realization principle is as follows: AudioLM, following the training principle of a language model, models and trains semantic tokens and acoustic tokens with a Transformer structure, so that it performs semantic inference from an audio prompt and generates a continuation of speech or piano music;
model advantages and disadvantages: AudioLM needs no training on labeled data, can retain the speaker characteristics or musical style of the original prompt audio, and generates new audio that is semantically and stylistically consistent; the generated sound has good naturalness and coherence.
7. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein video content generation means that, through training, the artificial intelligence can automatically generate high-fidelity video content conforming to a description from given text, image, or video data, whether single-modal or multi-modal, wherein the technical basis of the video content generation comprises:
Imagen-Video:
the realization principle is as follows: Imagen-Video is a text-conditioned video generation model developed from the Imagen image model; through a combination of multiple diffusion models, it generates an initial video from the text prompt and then progressively raises the resolution and frame count to produce the final video;
model advantages and disadvantages: the generated videos have high fidelity, controllability, and world knowledge; the model supports generating diverse videos and text animations in various artistic styles and shows an understanding of 3D objects; however, the parallel training mode adopted by the cascaded models demands considerable computational resources;
Gen:
the realization principle is as follows: the Gen model learns text-image features through a latent diffusion model and generates a new video from a given text prompt or reference image, or performs multi-task video style conversion from an original video and a driving image;
model advantages and disadvantages: the model performs well in video rendering and style conversion, the generated videos show strong artistry and strong preservation of image structure, and the model adapts well to customization requirements; however, the Gen model is still limited in the stability of its generated results;
CogVideo:
the realization principle is as follows: CogVideo is a large-scale text-to-video generation model based on an autoregressive method; it applies the image generation model CogView2 to text-to-video generation to achieve efficient learning, and generates video in a recursive manner by predicting each frame from the previous frames and continuously splicing them together;
model advantages and disadvantages: the model supports Chinese prompts, and its multi-frame-rate hierarchical training method better captures the relation between text and video, so the generated video looks more natural; however, the model limits the length of the input sequence.
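The recursive frame-by-frame generation described above can be sketched as an autoregressive rollout. Here `predict_next` is a placeholder for the real frame-prediction model; the toy lambda used below is purely illustrative.

```python
def generate_video(first_frame, predict_next, num_frames):
    """Autoregressive rollout: each new frame is predicted from all frames
    generated so far, appended to the sequence, and fed back in.
    `predict_next` stands in for the real conditional frame predictor."""
    frames = [first_frame]
    while len(frames) < num_frames:
        frames.append(predict_next(frames))
    return frames

# Toy stand-in "model": the next frame is the previous frame plus one.
video = generate_video(0, lambda fs: fs[-1] + 1, 5)
```

This recursion is also where autoregressive video models accumulate error: each predicted frame conditions all later ones, so mistakes compound with sequence length.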
8. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 1, wherein in the content quality reliability assessment, natural language processing techniques such as language models and semantic analysis are used to assess the consistency of the generated content; if the generated content is based on a specific knowledge base or database, a knowledge graph or professional knowledge in the related field is used to assess whether the generated content is consistent with the information in the knowledge base; emotion analysis techniques such as an emotion dictionary and an emotion classifier are used to assess whether the emotional tendency expressed in the generated content is reasonable and consistent; and if the generated content contains images or audio, image processing and audio processing techniques may be used to assess their authenticity and reliability, for example by detecting whether an image has been edited or synthesized, or by analyzing the sound characteristics of the audio for verification.
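The emotion-dictionary assessment mentioned in the claim can be illustrated with a minimal polarity-summing scorer. The lexicon entries below are hypothetical toy values, not a real emotion dictionary; a production system would use a full lexicon or a trained emotion classifier.

```python
def lexicon_sentiment(tokens, lexicon):
    """Emotion-dictionary scoring: sum the per-word polarity values of the
    tokens; the sign of the total gives the overall emotional tendency."""
    score = sum(lexicon.get(t, 0) for t in tokens)
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score

# Hypothetical toy lexicon with signed polarity weights.
lexicon = {"calm": 1, "resolved": 2, "panic": -2, "rumor": -1}
label, s = lexicon_sentiment("the event was calm and resolved".split(), lexicon)
```

Such a scorer could flag generated response text whose emotional tendency conflicts with the intended tone before publication.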
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311282149.9A CN117349427A (en) | 2023-10-07 | 2023-10-07 | Artificial intelligence multi-mode content generation system for public opinion event coping |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117349427A true CN117349427A (en) | 2024-01-05 |
Family
ID=89358750
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117708746A (en) * | 2024-02-04 | 2024-03-15 | 北京长河数智科技有限责任公司 | Risk prediction method based on multi-mode data fusion |
CN117994610A (en) * | 2024-04-03 | 2024-05-07 | 江西虔安电子科技有限公司 | Chart generation method and system |
CN117994610B (en) * | 2024-04-03 | 2024-06-04 | 江西虔安电子科技有限公司 | Chart generation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||