CN117349427A - Artificial intelligence multi-mode content generation system for public opinion event coping - Google Patents


Info

Publication number
CN117349427A
CN117349427A (application CN202311282149.9A)
Authority
CN
China
Prior art keywords
model
image
text
generation
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311282149.9A
Other languages
Chinese (zh)
Inventor
范静如
范文庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China filed Critical Communication University of China
Priority to CN202311282149.9A
Publication of CN117349427A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an artificial intelligence multimodal content generation system for public opinion event response, which applies artificial intelligence multimodal content generation technology to the scenario of responding to public opinion events and comprises three parts: data processing, content generation, and content quality and credibility evaluation. The system abandons the traditional manual approach to public opinion response and handles large volumes of content generation tasks automatically through intelligent algorithms, so it can quickly and accurately generate public opinion content that meets user requirements, improves the efficiency and quality of content generation, provides more comprehensive and diversified response capability for public opinion events, and saves the time and cost of manual processing. At the same time, its intelligent algorithms and models can be driven by public opinion data and user feedback to provide more accurate, comprehensive and diversified content, thereby effectively improving both the effect and the efficiency of public opinion event response.

Description

Artificial intelligence multimodal content generation system for public opinion event response
Technical Field
The invention relates to the field of IT applications, and in particular to an artificial intelligence multimodal content generation system for public opinion event response.
Background
Existing public opinion systems are mostly monitoring systems that focus on collecting, analyzing and monitoring public opinion information to help users understand and track public opinion dynamics. When a public opinion event occurs, however, a certain multimodal manuscript-writing capability is required. An artificial intelligence multimodal content generation system for public opinion event response instead focuses on generating multimodal content to respond to and guide public opinion events, using artificial intelligence multimodal content generation technology to handle public opinion automatically; responding to public opinion in a timely manner is therefore a difficult and practically significant requirement for public opinion manuscript writing.
1. Existing public opinion response often requires manual participation. When large amounts of information must be processed, processing speed may be slow, so the response to public opinion lags behind, time and resources are consumed, and the outcome is subject to the subjective judgment and skill level of the personnel involved.
2. Conventional public opinion manuscript generation is usually text-centric, responding or issuing statements in written form, but much news content requires multimodal information; text alone cannot fully convey the real situation of an event or evoke resonance in the audience.
3. Most existing public opinion systems are built for monitoring, focusing on collecting, analyzing and tracking public opinion information, but lack an automated step that responds to public opinion analysis by generating content; existing public opinion event response methods are also limited to generating content about objective events.
Disclosure of Invention
The invention aims to provide an artificial intelligence multimodal content generation system for public opinion event response, so as to solve the problems described in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
An artificial intelligence multimodal content generation system for public opinion event response applies artificial intelligence multimodal content generation technology to the scenario of responding to public opinion events, and comprises:
Data processing: first, multimodal data related to the public opinion event, including text, images, audio and video, is collected and organized; data related to the event is acquired from various sources, including news reports, social media posts and user comments, through web crawler technology, and the organized data is preprocessed for subsequent model training and generation, with cleaning and denoising to eliminate useless or erroneous data and ensure data quality;
Content generation: first, the multimodal data related to the public opinion event is understood and analyzed modality by modality to obtain its semantic information and features; the features of the multiple modalities are then fused and aligned to establish associations between the different modalities; finally, multimodal content generation is performed, where a generative model such as a generative adversarial network (GAN) produces the multimodal content based on the fused feature representation;
Content quality and credibility assessment: the quality and trustworthiness of the generated content are evaluated using metrics and evaluation methods that check whether the generated content is internally consistent and conforms to linguistic and logical rules.
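The following is a minimal structural sketch, not part of the original disclosure, of how the three stages above (data processing, content generation, quality and credibility evaluation) could be chained; every class name, function name and score is an illustrative assumption rather than the system's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class OpinionEvent:
    """Raw multimodal material collected for one public opinion event."""
    texts: List[str] = field(default_factory=list)
    image_paths: List[str] = field(default_factory=list)
    audio_paths: List[str] = field(default_factory=list)
    video_paths: List[str] = field(default_factory=list)

def process_data(event: OpinionEvent) -> OpinionEvent:
    # Cleaning/denoising stand-in: drop empty and duplicate text items.
    seen, cleaned = set(), []
    for t in event.texts:
        t = t.strip()
        if t and t not in seen:
            seen.add(t)
            cleaned.append(t)
    return OpinionEvent(cleaned, event.image_paths, event.audio_paths, event.video_paths)

def generate_content(event: OpinionEvent) -> Dict[str, str]:
    # Placeholder for the fused-feature generative models (GAN / diffusion / Transformer).
    summary = " ".join(event.texts)[:200]
    return {"text": f"Draft response based on: {summary}", "image": "", "audio": "", "video": ""}

def evaluate_content(content: Dict[str, str]) -> Dict[str, float]:
    # Placeholder quality/credibility scores; concrete metrics are discussed later in the text.
    return {"coherence": 1.0 if content["text"] else 0.0, "credibility": 0.5}

if __name__ == "__main__":
    event = OpinionEvent(texts=["Breaking report about the event.", "User comment on the event."])
    content = generate_content(process_data(event))
    print(content["text"], evaluate_content(content))
```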
As a further scheme of the invention: in data processing, data of different modalities such as text, images, audio and video is collected with corresponding methods such as API interfaces and crawler tools, and the collected data is organized and classified according to its relevance to the public opinion event. Text data undergoes word segmentation, stop-word removal and part-of-speech tagging to support subsequent text analysis and generation; image data undergoes cropping, size normalization and color space conversion to support subsequent image generation; audio data undergoes segmentation, noise reduction and feature extraction to support subsequent audio analysis and generation; and video data undergoes clipping, frame extraction and key-frame selection to support subsequent video generation.
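As an illustration only, the sketch below shows one possible form of the text and image preprocessing steps just described; the choice of jieba for Chinese word segmentation, Pillow for image handling, and the tiny stop-word list are assumptions of this sketch, not requirements of the patent.

```python
import jieba                      # Chinese word segmentation (assumed library choice)
from PIL import Image             # image cropping / resizing / color conversion

STOP_WORDS = {"的", "了", "和", "是"}   # tiny illustrative stop-word list

def preprocess_text(text: str) -> list[str]:
    # Segment into words and drop stop words so later analysis and generation
    # operate on informative tokens only.
    return [w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]

def preprocess_image(path: str, size: tuple[int, int] = (256, 256)) -> Image.Image:
    # Unify color space and size so all images share one input format.
    img = Image.open(path).convert("RGB")       # color space conversion
    w, h = img.size
    side = min(w, h)                             # center crop to a square
    box = ((w - side) // 2, (h - side) // 2, (w + side) // 2, (h + side) // 2)
    return img.crop(box).resize(size)

if __name__ == "__main__":
    print(preprocess_text("这是一条关于舆情事件的新闻报道"))
```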
As a further scheme of the invention: the content generation includes text content generation, image content generation, audio content generation and video content generation. In content generation, the data of each modality undergoes feature extraction and representation learning so that it is converted into a machine-understandable form: convolutional neural networks are used to extract features from image data, word embedding or text encoding models are used for text data, and spectrograms or other audio feature extraction methods are used for audio data. Fusion models, including multimodal neural networks and image-text alignment models, are then used so that information from the different modalities complements and aligns with one another, supporting the subsequent generation tasks. For text generation, recurrent neural networks or Transformer models are used for tasks such as generating news headlines and social media posts; for image generation, generative adversarial networks or variational autoencoder models are used for tasks such as generating image descriptions and image synthesis; for audio generation, tasks such as speech synthesis and music synthesis are performed; for video generation, generative adversarial networks or spatiotemporal generation models are used for tasks such as generating video clips and video descriptions. The method is composable and diffusion-based, able to generate output modalities in any combination of language, image, video or audio from any combination of input modalities.
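A minimal PyTorch sketch of the fusion step described above is given here for illustration: a small CNN encodes the image, an embedding layer with mean pooling encodes the text, and a linear layer projects both into one shared representation. The architecture and all dimensions are assumptions of this sketch, not the patented models.

```python
import torch
import torch.nn as nn

class SimpleFusionModel(nn.Module):
    def __init__(self, vocab_size=10000, text_dim=128, img_dim=128, fused_dim=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, text_dim)
        self.image_encoder = nn.Sequential(            # tiny CNN stand-in
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, img_dim),
        )
        self.fuse = nn.Linear(text_dim + img_dim, fused_dim)

    def forward(self, token_ids, image):
        text_feat = self.text_embed(token_ids).mean(dim=1)   # (B, text_dim)
        img_feat = self.image_encoder(image)                  # (B, img_dim)
        return torch.relu(self.fuse(torch.cat([text_feat, img_feat], dim=-1)))

# Usage with dummy data: a batch of 2 texts (10 tokens each) and 2 images (3x64x64).
model = SimpleFusionModel()
fused = model(torch.randint(0, 10000, (2, 10)), torch.randn(2, 3, 64, 64))
print(fused.shape)   # torch.Size([2, 256])
```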
As a further scheme of the invention: the technical basis of text content generation comprises:
Natural language processing: in text content generation tasks, natural language processing techniques are used for preprocessing steps such as word segmentation, part-of-speech tagging, syntactic analysis and semantic understanding, as well as for the key steps of language model building and evaluation;
Language model: a language model is a probability distribution model that predicts the next word or character from the preceding text sequence, and includes n-gram models, recurrent neural network models and Transformer models; the language model is an important basis for text content generation and is used to produce coherent sentences and passages;
Sequence-to-sequence model: the sequence-to-sequence model is a text generation framework consisting of an encoder, which encodes an input sequence into a fixed-dimension vector representation, and a decoder, which uses that vector to generate a target sequence; sequence-to-sequence models are used for tasks such as machine translation and dialogue generation;
Attention mechanism: the attention mechanism assigns different weights to different parts of the input during generation, helping the model focus on the important parts of the input sequence and improving the accuracy and fluency of generation; the self-attention mechanism in the Transformer model is the most widely used attention mechanism at present (a worked sketch of this computation follows below);
These technical foundations are combined and applied to text content generation tasks to produce different types of text content, such as public opinion summaries, public opinion comments and public opinion news articles.
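For illustration, the following sketch (not part of the original disclosure) computes the scaled dot-product self-attention described in the list above, which underlies the Transformer language models used for text generation; the sequence length and dimensions are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); the three projection matrices map it to queries/keys/values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)   # pairwise similarity of tokens
    weights = F.softmax(scores, dim=-1)                        # attention weights per token
    return weights @ v                                         # weighted sum of value vectors

d_model = 16
x = torch.randn(5, d_model)                                    # 5 tokens
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)                  # torch.Size([5, 16])
```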
As a further scheme of the invention: image content generation refers to the process of generating images, within a single modality or across modalities, from given data using artificial intelligence technology; depending on the task objective and input modality, it includes image synthesis, generating new images from existing pictures, and generating semantically faithful images from text descriptions. The technical basis of image content generation comprises:
diffusion model:
Implementation principle: the diffusion model defines a Markov chain of diffusion steps in which random noise is added to the data repeatedly until it becomes pure Gaussian noise, then learns the reverse diffusion process and generates an image by denoising inference; the diffusion model systematically perturbs the data distribution and then restores it, so the whole process is gradually optimized, which ensures the stability and controllability of the model (a minimal sketch of the forward noising process is given after this subsection);
Advantages and disadvantages: the forward and reverse diffusion processes based on the Markov chain can recover real data more accurately and preserve image details better, so the generated images are more photorealistic; the diffusion model works particularly well in image inpainting and molecular graph generation, but the complexity of its computation steps also leads to slower sampling and weaker generalization across data types;
CLIP (Contrastive Language-Image Pre-training):
Implementation principle: CLIP is a text-image cross-modal pre-training model based on contrastive learning; encoders extract features from the text and the image separately and map them into the same representation space, and the model is trained by computing the similarities and differences of text-image pairs, so that an image matching a description can be generated from a given text;
Advantages and disadvantages: through multimodal contrastive learning and pre-training, CLIP aligns text features with image features, so the data does not need to be labeled in advance; it excels at zero-shot image-text classification, grasps text descriptions and image styles more accurately, can vary non-essential image details without losing accuracy, and therefore generates more diverse images. However, CLIP is essentially an image classification model and its performance in complex and abstract scenarios is limited, for example image generation is poor in tasks involving time-series data or requiring reasoning; in addition, its training effect depends on large-scale text-image pair datasets, so training consumes considerable resources.
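The sketch below, added for illustration and not part of the original disclosure, shows the forward diffusion process referenced above: noise is added step by step along a Markov chain until the data approaches pure Gaussian noise. The linear beta schedule and the closed-form sampling of q(x_t | x_0) follow the standard DDPM formulation; the specific values are assumptions of this sketch.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(3, 64, 64)                      # stand-in for a normalized image
# Early steps stay close to the image; by the last step the sample is essentially noise.
print(q_sample(x0, torch.tensor(10)).std(), q_sample(x0, torch.tensor(999)).std())
```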
As a further scheme of the invention: audio content generation refers to the process of synthesizing the corresponding sound waveform from input data, and includes synthesizing speech from text, performing speech conversion between different languages, producing spoken descriptions of visual content such as images or videos, and generating melodies and music. The technical basis of audio content generation comprises:
Tacotron2:
Implementation principle: Tacotron2 is an end-to-end speech synthesis model that combines WaveNet and Tacotron and consists of a spectrogram prediction network and a vocoder; the sequence-to-sequence prediction network extracts text features and feeds them into the model, the predicted values are accumulated into a mel spectrogram, and the vocoder generates a time-domain waveform from the predicted sequence;
Advantages and disadvantages: Tacotron2 introduces an attention mechanism in place of the traditional duration model for speech synthesis, extracts structural features with a neural network and learns the correspondence between text and acoustic features; its advantages are that the improved attention mechanism mitigates the vanishing gradient problem, the generated audio has good sound quality, and it is robust to the input text; its disadvantages are that, as an autoregressive model with a recurrent neural network structure, its synthesis speed is low, complex words are hard to pronounce, the generated speech lacks emotional color, training time and cost on large datasets are high, and the model lacks controllability;
Transformer-TTS:
Implementation principle: Transformer-TTS is an end-to-end speech generation model that applies the Transformer architecture to a TTS system; specifically, it builds an encoder-decoder structure with a multi-head attention mechanism to improve training efficiency, takes a phoneme sequence as input to generate a mel spectrogram, and outputs the waveform through a WaveNet vocoder;
Advantages and disadvantages: a speech model built on the Transformer architecture trains faster, solving Tacotron2's problems of slow training and difficulty in modeling long-range dependencies, and because the Transformer understands semantics and relations, the generated audio sounds more natural; however, as an autoregressive model it still suffers from slow inference and from model drift caused by the accumulation of autoregressive errors;
FastSpeech:
Implementation principle: FastSpeech is a non-autoregressive sequence-to-sequence speech synthesis model; it takes a phoneme sequence as input, outputs a mel spectrogram based on the alignment produced by a length regulator, and improves speech synthesis speed through a parallelizable network structure (the length-regulator idea is sketched after this subsection);
Advantages and disadvantages: FastSpeech generates the mel spectrogram in parallel through non-autoregressive decoding, which significantly improves computation speed, while a duration model keeps phonemes aligned with mel features, improving both synthesis speed and voice quality and giving good controllability over the generated audio; its disadvantage is that training with knowledge distillation loses some information, which can make the synthesis results inaccurate;
DeepVoice3:
Implementation principle: DeepVoice3 is a speech system based on a fully convolutional architecture; it converts various text features into vocoder parameters in a fully parallel manner and generates speech by feeding those parameters into a waveform synthesis model;
Advantages and disadvantages: DeepVoice3 scales speech synthesis training to larger datasets, can be quickly adapted to new datasets, and suits multi-speaker speech synthesis tasks; at the same time, extracting text features in a fully convolutional manner significantly improves training speed and GPU utilization and reduces training cost;
AudioLM:
Implementation principle: AudioLM follows the training principle of a language model, modeling and training on semantic tokens and acoustic tokens with a Transformer architecture, so that it can reason about semantic information from an audio prompt and generate the continuation of speech or piano music;
Advantages and disadvantages: AudioLM does not need to be trained on labeled data, can preserve the speaker characteristics or musical style of the original prompt audio, and generates new audio that is semantically and stylistically consistent, with good naturalness and coherence of the generated sound.
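As an aside for illustration (not part of the original disclosure), the following sketch shows the FastSpeech "length regulator" idea referenced above: each phoneme's hidden state is repeated according to its predicted duration so that the expanded sequence lines up with the mel-spectrogram frames. The dimensions and duration values are arbitrary assumptions.

```python
import torch

def length_regulate(phoneme_hidden, durations):
    # phoneme_hidden: (num_phonemes, hidden_dim); durations: (num_phonemes,) frame counts.
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

hidden = torch.randn(4, 8)                        # 4 phonemes, 8-dim hidden states
durations = torch.tensor([3, 5, 2, 4])            # predicted frames per phoneme
mel_aligned = length_regulate(hidden, durations)
print(mel_aligned.shape)                          # torch.Size([14, 8]) -> 14 mel frames
```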
As a further scheme of the invention: video content generation refers to automatically generating descriptive, high-fidelity video content from given unimodal or multimodal data, such as text, images or video, through trained artificial intelligence models. The technical basis of video content generation comprises:
Imagen-Video:
Implementation principle: Imagen Video is a text-conditioned video generation model developed on the basis of an image generation model; through a cascade of multiple diffusion models, it generates an initial video from the text prompt and then progressively increases the video's resolution and frame count to produce the final video;
Advantages and disadvantages: the generated videos have high fidelity, controllability and world knowledge; it supports generating diverse videos and text animations in various artistic styles and shows understanding of 3D objects, but the parallel training scheme used by the cascaded models requires substantial computing resources;
Gen:
Implementation principle: the Gen model learns text-image features through a latent diffusion model and generates a new video from a given text prompt or reference image, or performs tasks such as video style transfer from an original video and a driving image;
Advantages and disadvantages: the model performs well in video rendering and style transfer, the generated videos are more artistic and preserve image structure better, and it adapts well to customization requirements, but the Gen model is still limited in the stability of its generated results;
CogVideo:
Implementation principle: CogVideo is a large-scale text-to-video generation model based on an autoregressive method; it applies the image generation model CogView2 to text-to-video generation for efficient learning, and produces a video by recursively predicting subsequent frames from previous frames and stitching them together (this recursion is sketched conceptually after this subsection);
Advantages and disadvantages: the model supports Chinese prompts, and its multi-frame-rate hierarchical training method better captures the relation between text and video, so the generated videos look more natural, but the model limits the length of the input sequence.
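The conceptual sketch below, added for illustration only, mimics the autoregressive frame-by-frame generation just described: each new frame is predicted from the frames generated so far and appended to the clip. The predictor here is a trivial placeholder, not a trained text- or video-conditioned model.

```python
import torch

def predict_next_frame(previous_frames: torch.Tensor) -> torch.Tensor:
    # Placeholder for a trained conditioned model; here: last frame plus a little noise.
    return previous_frames[-1] + 0.01 * torch.randn_like(previous_frames[-1])

def generate_clip(first_frame: torch.Tensor, num_frames: int) -> torch.Tensor:
    frames = [first_frame]
    for _ in range(num_frames - 1):
        frames.append(predict_next_frame(torch.stack(frames)))
    return torch.stack(frames)                     # (num_frames, C, H, W)

clip = generate_clip(torch.zeros(3, 32, 32), num_frames=8)
print(clip.shape)                                  # torch.Size([8, 3, 32, 32])
```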
As a still further aspect of the invention: in the content quality and credibility evaluation, natural language processing techniques such as language models and semantic analysis are used to evaluate the consistency of the generated content; if the generated content is based on a specific knowledge base or database, knowledge graphs or domain expertise are used to evaluate whether it agrees with the information in the knowledge base; whether the emotion expressed in the generated content is reasonable and consistent is evaluated with sentiment analysis techniques such as sentiment lexicons and sentiment classifiers; and if the generated content contains images or audio, image processing and audio processing techniques can be used to assess their authenticity and reliability, for example by detecting whether an image has been edited or synthesized, or by analyzing the acoustic characteristics of the audio for verification.
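For illustration, the sketch below shows a lexicon-based sentiment score and a crude repetition-based coherence check of the kind mentioned above; the word lists and thresholds are assumptions made for this example only, and a real system would use trained sentiment classifiers and language-model-based consistency metrics.

```python
POSITIVE = {"支持", "满意", "positive", "good"}
NEGATIVE = {"抗议", "不满", "negative", "bad"}

def sentiment_score(tokens: list[str]) -> float:
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total      # -1 (negative) .. +1 (positive)

def repetition_ratio(tokens: list[str]) -> float:
    # Heavily repeated tokens often indicate degenerate generated text.
    return 1.0 - len(set(tokens)) / max(len(tokens), 1)

draft = ["the", "response", "is", "good", "good", "good"]
print(sentiment_score(draft), repetition_ratio(draft))
```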
Compared with the prior art, the invention has the beneficial effects that:
1. The system abandons the traditional manual approach to public opinion response and handles large volumes of content generation tasks automatically through intelligent algorithms, so it can quickly and accurately generate public opinion content that meets user requirements, improves the efficiency and quality of content generation, provides more comprehensive and diversified response capability for public opinion events, and saves the time and cost of manual processing; at the same time, its intelligent algorithms and models can be driven by public opinion data and user feedback to provide more accurate, comprehensive and diversified content, thereby effectively improving both the effect and the efficiency of public opinion event response.
2. The invention abandons the traditional text-only approach to public opinion response; through multimodal content generation technology, the system can combine text, images, video and audio to respond to public opinion in a more comprehensive and diversified way.
3. In addition to generating content about the objective event, the invention can analyze the sentiment and emotional information in a public opinion event and generate corresponding content that expresses different sentiments and emotions according to the sentiment analysis results, giving high controllability and editability and making the content accurate and targeted.
Drawings
Fig. 1 is a system configuration diagram of the artificial intelligence multimodal content generation system for public opinion event response.
Fig. 2 is a system configuration diagram of the content generation part of the artificial intelligence multimodal content generation system for public opinion event response.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following examples in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, well known methods, procedures, and components have not been described in detail so as not to obscure the subject matter of the present application.
Example 1
Referring to figs. 1-2, an artificial intelligence multimodal content generation system for public opinion event response applies artificial intelligence multimodal content generation technology to the scenario of responding to public opinion events, and comprises:
Data processing: first, multimodal data related to the public opinion event, including text, images, audio and video, is collected and organized; data related to the event is acquired from various sources, including news reports, social media posts and user comments, through web crawler technology, and the organized data is preprocessed for subsequent model training and generation, with cleaning and denoising to eliminate useless or erroneous data and ensure data quality;
Content generation: first, the multimodal data related to the public opinion event is understood and analyzed modality by modality to obtain its semantic information and features; the features of the multiple modalities are then fused and aligned to establish associations between the different modalities; finally, multimodal content generation is performed, where a generative model such as a generative adversarial network (GAN) produces the multimodal content based on the fused feature representation;
Content quality and credibility assessment: the quality and trustworthiness of the generated content are evaluated using metrics and evaluation methods that check whether the generated content is internally consistent and conforms to linguistic and logical rules.
Specifically, the system of the invention focuses on generating multimodal content to respond to and guide public opinion events, using artificial intelligence multimodal content generation technology to handle public opinion automatically; when a public opinion event occurs, a certain public opinion manuscript-writing capability is required, and the party's image is improved by writing and publishing articles or online comments on Tieba, forums, blogs, Weibo and WeChat platforms.
Preferably, in the data processing, data of different modalities such as text, images, audio and video is collected with corresponding methods such as API interfaces and crawler tools, and the collected data is organized and classified according to its relevance to the public opinion event. Text data undergoes word segmentation, stop-word removal and part-of-speech tagging to support subsequent text analysis and generation; image data undergoes cropping, size normalization and color space conversion to support subsequent image generation; audio data undergoes segmentation, noise reduction and feature extraction to support subsequent audio analysis and generation; and video data undergoes clipping, frame extraction and key-frame selection to support subsequent video generation.
Preferably, the content generation includes text content generation, image content generation, audio content generation and video content generation. In content generation, the data of each modality undergoes feature extraction and representation learning so that it is converted into a machine-understandable form: convolutional neural networks are used to extract features from image data, word embedding or text encoding models are used for text data, and spectrograms or other audio feature extraction methods are used for audio data. Fusion models, including multimodal neural networks and image-text alignment models, are then used so that information from the different modalities complements and aligns with one another, supporting the subsequent generation tasks. For text generation, recurrent neural networks or Transformer models are used for tasks such as generating news headlines and social media posts; for image generation, generative adversarial networks or variational autoencoder models are used for tasks such as generating image descriptions and image synthesis; for audio generation, tasks such as speech synthesis and music synthesis are performed; for video generation, generative adversarial networks or spatiotemporal generation models are used for tasks such as generating video clips and video descriptions. The method is composable and diffusion-based, able to generate output modalities in any combination of language, image, video or audio from any combination of input modalities.
Preferably, the technical basis of text content generation includes:
Natural language processing: in text content generation tasks, natural language processing techniques are used for preprocessing steps such as word segmentation, part-of-speech tagging, syntactic analysis and semantic understanding, as well as for the key steps of language model building and evaluation;
Language model: a language model is a probability distribution model that predicts the next word or character from the preceding text sequence, and includes n-gram models, recurrent neural network models and Transformer models; the language model is an important basis for text content generation and is used to produce coherent sentences and passages;
Sequence-to-sequence model: the sequence-to-sequence model is a text generation framework consisting of an encoder, which encodes an input sequence into a fixed-dimension vector representation, and a decoder, which uses that vector to generate a target sequence; sequence-to-sequence models are used for tasks such as machine translation and dialogue generation;
Attention mechanism: the attention mechanism assigns different weights to different parts of the input during generation, helping the model focus on the important parts of the input sequence and improving the accuracy and fluency of generation; the self-attention mechanism in the Transformer model is the most widely used attention mechanism at present;
These technical foundations are combined and applied to text content generation tasks to produce different types of text content, such as public opinion summaries, public opinion comments and public opinion news articles.
Preferably, image content generation refers to the process of generating images, within a single modality or across modalities, from given data using artificial intelligence technology; depending on the task objective and input modality, it includes image synthesis, generating new images from existing pictures, and generating semantically faithful images from text descriptions. The technical basis of image content generation comprises:
diffusion model:
Implementation principle: the diffusion model defines a Markov chain of diffusion steps in which random noise is added to the data repeatedly until it becomes pure Gaussian noise, then learns the reverse diffusion process and generates an image by denoising inference; the diffusion model systematically perturbs the data distribution and then restores it, so the whole process is gradually optimized, which ensures the stability and controllability of the model;
Advantages and disadvantages: the forward and reverse diffusion processes based on the Markov chain can recover real data more accurately and preserve image details better, so the generated images are more photorealistic; the diffusion model works particularly well in image inpainting and molecular graph generation, but the complexity of its computation steps also leads to slower sampling and weaker generalization across data types;
CLIP (Contrastive Language-Image Pre-training):
Implementation principle: CLIP is a text-image cross-modal pre-training model based on contrastive learning; encoders extract features from the text and the image separately and map them into the same representation space, and the model is trained by computing the similarities and differences of text-image pairs, so that an image matching a description can be generated from a given text;
Advantages and disadvantages: through multimodal contrastive learning and pre-training, CLIP aligns text features with image features, so the data does not need to be labeled in advance; it excels at zero-shot image-text classification, grasps text descriptions and image styles more accurately, can vary non-essential image details without losing accuracy, and therefore generates more diverse images. However, CLIP is essentially an image classification model and its performance in complex and abstract scenarios is limited, for example image generation is poor in tasks involving time-series data or requiring reasoning; in addition, its training effect depends on large-scale text-image pair datasets, so training consumes considerable resources.
Preferably, audio content generation refers to the process of synthesizing the corresponding sound waveform from input data, and includes synthesizing speech from text, performing speech conversion between different languages, producing spoken descriptions of visual content such as images or videos, and generating melodies and music. The technical basis of audio content generation comprises:
Tacotron2:
Implementation principle: Tacotron2 is an end-to-end speech synthesis model that combines WaveNet and Tacotron and consists of a spectrogram prediction network and a vocoder; the sequence-to-sequence prediction network extracts text features and feeds them into the model, the predicted values are accumulated into a mel spectrogram, and the vocoder generates a time-domain waveform from the predicted sequence;
Advantages and disadvantages: Tacotron2 introduces an attention mechanism in place of the traditional duration model for speech synthesis, extracts structural features with a neural network and learns the correspondence between text and acoustic features; its advantages are that the improved attention mechanism mitigates the vanishing gradient problem, the generated audio has good sound quality, and it is robust to the input text; its disadvantages are that, as an autoregressive model with a recurrent neural network structure, its synthesis speed is low, complex words are hard to pronounce, the generated speech lacks emotional color, training time and cost on large datasets are high, and the model lacks controllability;
Transformer-TTS:
Implementation principle: Transformer-TTS is an end-to-end speech generation model that applies the Transformer architecture to a TTS system; specifically, it builds an encoder-decoder structure with a multi-head attention mechanism to improve training efficiency, takes a phoneme sequence as input to generate a mel spectrogram, and outputs the waveform through a WaveNet vocoder;
Advantages and disadvantages: a speech model built on the Transformer architecture trains faster, solving Tacotron2's problems of slow training and difficulty in modeling long-range dependencies, and because the Transformer understands semantics and relations, the generated audio sounds more natural; however, as an autoregressive model it still suffers from slow inference and from model drift caused by the accumulation of autoregressive errors;
FastSpeech:
Implementation principle: FastSpeech is a non-autoregressive sequence-to-sequence speech synthesis model; it takes a phoneme sequence as input, outputs a mel spectrogram based on the alignment produced by a length regulator, and improves speech synthesis speed through a parallelizable network structure;
Advantages and disadvantages: FastSpeech generates the mel spectrogram in parallel through non-autoregressive decoding, which significantly improves computation speed, while a duration model keeps phonemes aligned with mel features, improving both synthesis speed and voice quality and giving good controllability over the generated audio; its disadvantage is that training with knowledge distillation loses some information, which can make the synthesis results inaccurate;
DeepVoice3:
Implementation principle: DeepVoice3 is a speech system based on a fully convolutional architecture; it converts various text features into vocoder parameters in a fully parallel manner and generates speech by feeding those parameters into a waveform synthesis model;
Advantages and disadvantages: DeepVoice3 scales speech synthesis training to larger datasets, can be quickly adapted to new datasets, and suits multi-speaker speech synthesis tasks; at the same time, extracting text features in a fully convolutional manner significantly improves training speed and GPU utilization and reduces training cost;
AudioLM:
Implementation principle: AudioLM follows the training principle of a language model, modeling and training on semantic tokens and acoustic tokens with a Transformer architecture, so that it can reason about semantic information from an audio prompt and generate the continuation of speech or piano music;
Advantages and disadvantages: AudioLM does not need to be trained on labeled data, can preserve the speaker characteristics or musical style of the original prompt audio, and generates new audio that is semantically and stylistically consistent, with good naturalness and coherence of the generated sound.
Preferably, video content generation refers to automatically generating descriptive, high-fidelity video content from given unimodal or multimodal data, such as text, images or video, through trained artificial intelligence models. The technical basis of video content generation comprises:
Imagen-Video:
Implementation principle: Imagen Video is a text-conditioned video generation model developed on the basis of an image generation model; through a cascade of multiple diffusion models, it generates an initial video from the text prompt and then progressively increases the video's resolution and frame count to produce the final video;
Advantages and disadvantages: the generated videos have high fidelity, controllability and world knowledge; it supports generating diverse videos and text animations in various artistic styles and shows understanding of 3D objects, but the parallel training scheme used by the cascaded models requires substantial computing resources;
Gen:
Implementation principle: the Gen model learns text-image features through a latent diffusion model and generates a new video from a given text prompt or reference image, or performs tasks such as video style transfer from an original video and a driving image;
Advantages and disadvantages: the model performs well in video rendering and style transfer, the generated videos are more artistic and preserve image structure better, and it adapts well to customization requirements, but the Gen model is still limited in the stability of its generated results;
CogVideo:
Implementation principle: CogVideo is a large-scale text-to-video generation model based on an autoregressive method; it applies the image generation model CogView2 to text-to-video generation for efficient learning, and produces a video by recursively predicting subsequent frames from previous frames and stitching them together;
Advantages and disadvantages: the model supports Chinese prompts, and its multi-frame-rate hierarchical training method better captures the relation between text and video, so the generated videos look more natural, but the model limits the length of the input sequence.
Preferably, in the content quality and credibility evaluation, natural language processing techniques such as language models and semantic analysis are used to evaluate the consistency of the generated content; if the generated content is based on a specific knowledge base or database, knowledge graphs or domain expertise are used to evaluate whether it agrees with the information in the knowledge base; whether the emotion expressed in the generated content is reasonable and consistent is evaluated with sentiment analysis techniques such as sentiment lexicons and sentiment classifiers; and if the generated content contains images or audio, image processing and audio processing techniques can be used to assess their authenticity and reliability, for example by detecting whether an image has been edited or synthesized, or by analyzing the acoustic characteristics of the audio for verification.
The specific explanation is as follows: the invention uses an artificial intelligence multimodal content generation system to make the automated public opinion processing workflow more complete, improving the efficiency and accuracy of public opinion response and providing users with better public opinion analysis and decision support. It combines text, image, video and audio modalities to respond to public opinion in a more comprehensive and diversified way, and conveys information more intuitively through multimodal content generation. It abandons the traditional manual approach to public opinion response, handles large volumes of content generation tasks automatically through intelligent algorithms, can quickly and accurately generate public opinion content that meets user requirements, improves the efficiency and quality of content generation, provides more comprehensive and diversified response capability for public opinion events, and saves the time and cost of manual processing. It has high controllability and editability: it can analyze the sentiment and emotional information in public opinion events, generate corresponding content according to the sentiment analysis results to express different sentiments, and customize the content to the user's characteristics and goals, achieving accuracy and customization of the content.
Interpretation of commonly used terms:
Multimodal content generation: multimodal digital content generation generally refers to synthesis techniques that use AI generation technology to produce image, video, speech, text and music content.
Public opinion: "public opinion" here is short for "public opinion situation", which refers to the social attitudes that the public, as the subject, holds toward social administrators, enterprises, individuals and other organizations, and toward their political, social and moral orientations, around the occurrence, development and change of social events within a certain social space; it is the sum of the beliefs, attitudes, opinions and emotions expressed by many people about various phenomena and problems in society.
Emotion recognition technology: emotion recognition is a key technology for enabling machines to understand human emotion; researchers try to fuse more emotional signals, from text, speech, facial expressions and body movements to physiological signals in the body, so that recognition becomes more accurate and human-computer interaction becomes more natural, smooth and warm.
Prompt engineering: prompt engineering is the technique of crafting suitable prompts for large-model applications so that the large model produces better generation results.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted for clarity only, the specification should be taken as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.

Claims (8)

1. An artificial intelligence multimodal content generation system for public opinion event response, characterized in that it applies artificial intelligence multimodal content generation technology to the scenario of responding to public opinion events and comprises:
Data processing: first, multimodal data related to the public opinion event, including text, images, audio and video, is collected and organized; data related to the event is acquired from various sources, including news reports, social media posts and user comments, through web crawler technology, and the organized data is preprocessed for subsequent model training and generation, with cleaning and denoising to eliminate useless or erroneous data and ensure data quality;
Content generation: first, the multimodal data related to the public opinion event is understood and analyzed modality by modality to obtain its semantic information and features; the features of the multiple modalities are then fused and aligned to establish associations between the different modalities; finally, multimodal content generation is performed, where a generative model such as a generative adversarial network (GAN) produces the multimodal content based on the fused feature representation;
Content quality and credibility assessment: the quality and trustworthiness of the generated content are evaluated using metrics and evaluation methods that check whether the generated content is internally consistent and conforms to linguistic and logical rules.
2. The artificial intelligence multimodal content generation system for public opinion event response according to claim 1, wherein, in the data processing, data of different modalities such as text, images, audio and video is collected with corresponding methods such as API interfaces and crawler tools, and the collected data is organized and classified according to its relevance to the public opinion event; text data undergoes word segmentation, stop-word removal and part-of-speech tagging for subsequent text analysis and generation; image data undergoes cropping, size normalization and color space conversion for subsequent image generation; audio data undergoes segmentation, noise reduction and feature extraction for subsequent audio analysis and generation; and video data undergoes clipping, frame extraction and key-frame selection for subsequent video generation.
3. The artificial intelligence multimodal content generation system for public opinion event response according to claim 1, wherein the content generation comprises text content generation, image content generation, audio content generation and video content generation; in the content generation, the data of each modality undergoes feature extraction and representation learning so that it is converted into a machine-understandable form: convolutional neural networks are used to extract features from image data, word embedding or text encoding models are used for text data, and spectrograms or other audio feature extraction methods are used for audio data; fusion models, including multimodal neural networks and image-text alignment models, are then used so that information from the different modalities complements and aligns with one another, supporting the subsequent generation tasks; for text generation, recurrent neural networks or Transformer models are used for tasks such as generating news headlines and social media posts; for image generation, generative adversarial networks or variational autoencoder models are used for tasks such as generating image descriptions and image synthesis; for audio generation, tasks such as speech synthesis and music synthesis are performed; for video generation, generative adversarial networks or spatiotemporal generation models are used for tasks such as generating video clips and video descriptions; and the method is composable and diffusion-based, able to generate output modalities in any combination of language, image, video or audio from any combination of input modalities.
4. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein the technical basis of text content generation comprises:
natural language processing: in the text content generation task, natural language processing technology is used for preprocessing steps such as word segmentation, part-of-speech tagging, syntactic analysis and semantic understanding, as well as for the key steps of modeling and evaluating a language model;
language model: a language model is a probability distribution model that predicts the next word or character from the preceding text sequence, and includes n-gram models, recurrent neural network models and Transformer models; the language model is an important basis of text content generation and is used to generate coherent sentences and passages;
sequence-to-sequence model: the sequence-to-sequence model is a text generation framework consisting of an encoder and a decoder; the encoder encodes an input sequence into a fixed-dimension vector representation, and the decoder uses this vector to generate the target sequence; the sequence-to-sequence model is used for tasks such as machine translation and dialogue generation;
attention mechanism: the attention mechanism assigns different weights to different parts of the input during generation, helping the model focus on the important parts of the input sequence and improving the accuracy and fluency of generation; the self-attention mechanism in the Transformer model is the attention mechanism most widely applied at present (a minimal sketch of the core computation is given after this claim);
the above technical foundations are combined and applied to the text content generation task to generate different types of text content, such as public opinion summaries, public opinion comments and public opinion news.
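For illustration, the attention mechanism referred to above reduces to the scaled dot-product computation sketched below (PyTorch assumed); shapes and inputs are arbitrary examples.

```python
# Core scaled dot-product attention: weight the values by query-key similarity.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # query-key similarity
    weights = F.softmax(scores, dim=-1)            # normalised attention weights
    return weights @ v                             # weighted sum of values

q = torch.randn(1, 5, 64)   # 5 query positions, 64-dimensional
k = torch.randn(1, 7, 64)   # 7 key/value positions
v = torch.randn(1, 7, 64)
out = scaled_dot_product_attention(q, k, v)        # shape (1, 5, 64)
```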
5. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein image content generation refers to the process of using artificial intelligence technology to generate images within a single modality or across modalities from given data; depending on the task target and input modality, image content generation includes image synthesis, generating a new image from existing pictures, and generating an image that conforms to the semantics of a text description; the technical basis of image content generation comprises:
diffusion model:
implementation principle: the diffusion model defines a Markov chain of diffusion steps in which random noise is continuously added to the data until only pure Gaussian noise remains; an inverse diffusion process is then learned, and an image is generated by reverse denoising inference (a minimal sketch of the forward noising step is given after this claim); because the diffusion model systematically perturbs the data distribution and then restores it, the whole process has a gradually optimized character, which ensures the stability and controllability of the model;
model advantages and disadvantages: the advantage of the diffusion model is that the forward and reverse diffusion processes based on the Markov chain can restore real data more accurately and preserve image details better, so the realism of the generated image is higher; in particular, the diffusion model achieves good results in applications such as image inpainting and molecular graph generation; however, owing to the complexity of its computation steps, the diffusion model suffers from slow sampling speed and weak generalization across data types;
CLIP: Contrastive Language-Image Pre-training
implementation principle: CLIP is a contrastive-learning-based text-image cross-modal pre-training model; its training principle is to use encoders to extract features from text and images respectively, map them into the same representation space, and train the model by computing the similarity and difference of text-image pairs, so that an image conforming to a description can be generated from a given text;
model advantages and disadvantages: the advantage of the CLIP model is that, through multi-modal contrastive learning and pre-training, text features and image features can be aligned without annotating the data in advance; CLIP performs excellently in zero-shot image-text classification tasks, grasps text descriptions and image styles more accurately, can alter inessential image details without losing accuracy, and generates more diverse images; however, CLIP is essentially an image classification model, so its performance on complex and abstract scenes is limited, for example image generation is poor in tasks that contain time-series data or require reasoning; in addition, the training effect of CLIP depends on a large-scale text-image pair dataset, so the consumption of training resources is relatively large.
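The forward ("noising") half of the diffusion process mentioned in this claim can be written in a few lines; the sketch below (PyTorch assumed) uses an illustrative linear noise schedule and omits the learned reverse-denoising network entirely.

```python
# Closed-form forward diffusion: perturb clean data towards pure Gaussian noise.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                # illustrative linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_t)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) directly, without simulating every step."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

x0 = torch.randn(1, 3, 64, 64)    # placeholder "clean image" tensor
x_noisy = q_sample(x0, t=500)     # heavily perturbed sample midway through the chain
```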
6. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein audio content generation refers to the process of synthesizing corresponding sound waveforms from input data, including synthesizing speech from text, performing speech conversion between different languages, producing spoken descriptions of visual content such as images or videos, and generating melodies and music; the technical basis of audio content generation comprises:
Tacotron2:
implementation principle: Tacotron2 is an end-to-end speech synthesis model that combines WaveNet and Tacotron and consists of a spectrogram prediction network and a vocoder; the sequence-to-sequence prediction network extracts text features and feeds them into the model, the predicted values form a mel spectrogram, and the vocoder generates a time-domain waveform from the predicted sequence;
model advantages and disadvantages: Tacotron2 introduces an attention mechanism to replace the traditional duration model of speech synthesis, extracts structural features through a neural network, and learns the correspondence between text and acoustic features; its advantages are that the improved attention mechanism alleviates the vanishing-gradient problem, the generated audio has good sound quality, and the model is robust to the input text; its disadvantages are that the autoregressive model with a recurrent neural network structure synthesizes slowly, has difficulty with the pronunciation of complex words, produces speech lacking emotional color, is costly and time-consuming to train on large datasets, and lacks controllability;
Transformer-TTS:
implementation principle: Transformer-TTS is an end-to-end speech generation model that applies the Transformer structure to a TTS system; specifically, Transformer-TTS builds an encoder-decoder structure with a multi-head attention mechanism to improve training efficiency, takes a phoneme sequence as input to generate a mel spectrogram, and outputs the waveform through a WaveNet vocoder;
model advantages and disadvantages: a speech model with the Transformer structure speeds up training and solves the problems in Tacotron2 of slow training and difficulty in modeling long-range dependencies; because the Transformer models semantics and relations, the generated audio content sounds more natural; however, as an autoregressive model it still suffers from slow inference and from model bias caused by the accumulation of autoregressive errors;
FastSpeech:
implementation principle: FastSpeech is a non-autoregressive sequence-to-sequence speech synthesis model; it takes a phoneme sequence as input, outputs a mel spectrogram via the alignment produced by a length regulator (a minimal sketch of the length regulator is given after this claim), and improves speech synthesis speed through a parallelizable network structure;
model advantages and disadvantages: the advantage of FastSpeech is that the mel spectrogram is generated in parallel through non-autoregressive decoding, which markedly improves computation speed, while the duration model ensures that phonemes correspond to mel features, improving both synthesis speed and speech quality and giving good controllability over the generated audio; its disadvantage is that information is lost when training with knowledge distillation, which can make the synthesis result inaccurate;
DeepVoice3:
implementation principle: DeepVoice3 is a speech system based on a fully convolutional architecture; it converts various text features into vocoder parameters in a fully parallel manner and uses these parameters as the input of a waveform synthesis model to generate speech;
model advantages and disadvantages: DeepVoice3 scales up the datasets used for speech synthesis training, can be quickly applied to training on different new datasets, and is suitable for multi-speaker speech synthesis tasks; at the same time, the model extracts text features in a fully convolutional manner, which markedly improves training speed and GPU utilization and reduces training cost;
AudioLM:
implementation principle: AudioLM is based on the training principle of language models; it models and trains semantic tokens and acoustic tokens with a Transformer structure, so that it can reason about semantic information from an audio prompt and generate the subsequent speech or piano music;
model advantages and disadvantages: AudioLM does not need to be trained on annotated data, can preserve the speaker characteristics or musical style of the original prompt audio, generates new audio that is consistent in semantics and style, and produces sound with good naturalness and coherence.
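As a small illustration of the FastSpeech-style length regulator mentioned in this claim, the sketch below (PyTorch assumed) expands phoneme-level hidden states to mel-frame level according to predicted durations; the durations and dimensions are hard-coded assumptions.

```python
# Length regulation: repeat each phoneme state for its predicted number of frames.
import torch

def length_regulate(phoneme_states, durations):
    """Expand (num_phonemes, dim) states to frame level using integer durations."""
    return torch.repeat_interleave(phoneme_states, durations, dim=0)

phonemes = torch.randn(4, 256)            # 4 phonemes, 256-d encoder outputs
durations = torch.tensor([3, 5, 2, 4])    # frames predicted per phoneme
frame_states = length_regulate(phonemes, durations)
print(frame_states.shape)                 # torch.Size([14, 256])
```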
7. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein video content generation refers to training artificial intelligence models so that they can automatically generate high-fidelity video content that conforms to a description from given single-modal or multi-modal data such as text, images and video; the technical basis of video content generation comprises:
Imagen-Video:
implementation principle: Imagen-Video is a text-conditioned video generation model developed from the Imagen image model; through a combination of several diffusion models, the model first generates an initial video from the text prompt and then progressively increases its resolution and frame count to produce the final video;
model advantages and disadvantages: the generated videos have high fidelity, controllability and world knowledge; the model supports generating diverse videos and text animations in various artistic styles and shows an understanding of 3D objects; however, the parallel training scheme adopted by the cascaded models requires relatively high computational resources;
Gen:
implementation principle: the Gen model learns text-image features through a latent diffusion model and generates a new video from a given text prompt or reference image, or performs multiple tasks such as video style transfer based on an original video and a driving image;
model advantages and disadvantages: the model performs well in video rendering and style transfer, and the generated videos have strong artistry and preserve image structure well, adapting well to customization requirements; however, the Gen model is still limited in the stability of its generated results;
CogVideo:
implementation principle: CogVideo is a large-scale text-to-video generation model based on an autoregressive method; it applies the image generation model CogView2 to text-to-video generation for efficient learning and generates video by recursively predicting frames from the preceding frames and concatenating them;
model advantages and disadvantages: the model supports Chinese prompts, and its multi-frame-rate hierarchical training method captures the relation between text and video better, so the generated videos look more natural; however, the model is limited in the length of the input sequence it can accept.
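The recursive, frame-by-frame idea behind CogVideo-style generation can be caricatured as below (PyTorch assumed): each new frame is predicted from the preceding frames and appended to the sequence. The toy convolutional predictor, context length and tensor sizes are illustrative assumptions, not the actual model.

```python
# Toy autoregressive video extension: predict the next frame from the last two frames.
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    def __init__(self, context=2):
        super().__init__()
        self.net = nn.Conv2d(3 * context, 3, kernel_size=3, padding=1)

    def forward(self, frames):               # frames: (B, context, 3, H, W)
        b, c, ch, h, w = frames.shape
        return self.net(frames.reshape(b, c * ch, h, w))

predictor = NextFramePredictor()
video = [torch.randn(1, 3, 64, 64) for _ in range(2)]   # two seed frames
with torch.no_grad():
    for _ in range(6):                                   # extend by six frames
        context = torch.stack(video[-2:], dim=1)
        video.append(predictor(context))
print(len(video), video[-1].shape)          # 8 frames, each of shape (1, 3, 64, 64)
```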
8. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 1, wherein in the content quality and credibility assessment, natural language processing techniques such as language models and semantic analysis are used to assess the consistency of the generated content; if the generated content is based on a specific knowledge base or database, a knowledge graph or professional knowledge in the related field is used to assess whether the generated content is consistent with the information in the knowledge base; emotion analysis techniques such as emotion dictionaries and emotion classifiers are used to assess whether the emotions expressed in the generated content are reasonable and consistent and to evaluate its emotional tendency; and if the generated content contains images or audio, image processing and audio processing techniques may be used to assess their authenticity and reliability, for example by detecting whether an image has been edited or synthesized or by analyzing the acoustic characteristics of the audio for verification.
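One concrete form of the emotion-tendency check in this claim is to run a pretrained sentiment classifier over candidate drafts; the Hugging Face pipeline and its default English model are assumptions for illustration, not components named by the claim.

```python
# Flagging generated drafts whose emotional tendency is strongly negative.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")   # downloads a default English model

drafts = [
    "We acknowledge the incident and will publish a full report this week.",
    "This is a disaster and nothing will ever be fixed.",
]
for d in drafts:
    result = sentiment(d)[0]                 # e.g. {"label": "NEGATIVE", "score": 0.99}
    print(result["label"], round(result["score"], 3), "-", d)
```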
CN202311282149.9A 2023-10-07 2023-10-07 Artificial intelligence multi-mode content generation system for public opinion event coping Pending CN117349427A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311282149.9A CN117349427A (en) 2023-10-07 2023-10-07 Artificial intelligence multi-mode content generation system for public opinion event coping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311282149.9A CN117349427A (en) 2023-10-07 2023-10-07 Artificial intelligence multi-mode content generation system for public opinion event coping

Publications (1)

Publication Number Publication Date
CN117349427A true CN117349427A (en) 2024-01-05

Family

ID=89358750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311282149.9A Pending CN117349427A (en) 2023-10-07 2023-10-07 Artificial intelligence multi-mode content generation system for public opinion event coping

Country Status (1)

Country Link
CN (1) CN117349427A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117708746A (en) * 2024-02-04 2024-03-15 北京长河数智科技有限责任公司 Risk prediction method based on multi-mode data fusion
CN117708746B (en) * 2024-02-04 2024-04-30 北京长河数智科技有限责任公司 Risk prediction method based on multi-mode data fusion
CN117994610A (en) * 2024-04-03 2024-05-07 江西虔安电子科技有限公司 Chart generation method and system
CN117994610B (en) * 2024-04-03 2024-06-04 江西虔安电子科技有限公司 Chart generation method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination