CN117349427A - Artificial intelligence multi-mode content generation system for public opinion event coping - Google Patents
- Publication number
- CN117349427A (application CN202311282149.9A)
- Authority
- CN
- China
- Prior art keywords
- model
- image
- text
- generation
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/335—Information retrieval of unstructured textual data; filtering based on additional data, e.g. user or group profiles
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/3344—Query execution using natural language analysis
- G06F16/35—Clustering; classification of unstructured textual data
- G06F40/237—Natural language analysis; lexical tools
- G06N3/04—Neural networks; architecture, e.g. interconnection topology
- G06V10/24—Aligning, centring, orientation detection or correction of the image
- G06V10/764—Image or video recognition using classification, e.g. of video objects
- G06V10/774—Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/82—Image or video recognition using neural networks
Abstract
The invention discloses an artificial intelligence multi-modal content generation system for public opinion event response, which applies artificial intelligence multi-modal content generation technology to the scenario of responding to public opinion events. The system comprises three stages: data processing, content generation, and content quality and credibility evaluation. It replaces the traditional manual approach to public opinion response, handling large volumes of content-generation tasks automatically through intelligent algorithms. The system can quickly and accurately generate public opinion content that meets user requirements, improving the efficiency and quality of content generation, providing more comprehensive and diversified response capability, and saving the time and cost of manual processing. Because its algorithms and models are data-driven, the system can adapt to public opinion data and user feedback, producing more accurate, comprehensive, and diversified content and thereby improving both the effectiveness and the efficiency of public opinion event response.
Description
Technical Field
The invention relates to the field of IT applications, and in particular to an artificial intelligence multi-modal content generation system for responding to public opinion events.
Background
Most existing public opinion systems are monitoring systems: they focus on collecting, analyzing, and monitoring public opinion information to help users understand and track public opinion dynamics. When a public opinion event actually occurs, however, a capability for writing multi-modal manuscripts is required. The artificial intelligence multi-modal content generation system for public opinion event response described here focuses instead on generating multi-modal content to respond to and guide public opinion events. Using artificial intelligence multi-modal content generation technology to automate the response, so that public opinion can be addressed in a timely manner, is a difficult problem of real practical significance for public opinion manuscript writing.
1. Existing public opinion response usually requires manual participation. When a large amount of information must be processed, processing may be slow, so the response to public opinion lags; it also consumes time and resources and is subject to human subjectivity and skill level.
2. Conventional manuscript generation for public opinion is usually text-centered, responding or issuing statements through written words. However, much news content requires multi-modal information: text alone may fail to convey the real situation of an event completely or to trigger resonance in the audience.
3. Most existing public opinion systems are used for public opinion monitoring, focusing on collecting, analyzing, and tracking public opinion information. They lack an automated step that responds to the public opinion analysis by generating content, and existing public opinion event response methods are limited to generating content about objective events.
Disclosure of Invention
The invention aims to provide an artificial intelligence multi-modal content generation system for public opinion event response, so as to solve the problems identified in the background art.
In order to achieve the above purpose, the present invention provides the following technical solutions:
An artificial intelligence multi-modal content generation system for public opinion event response applies artificial intelligence multi-modal content generation technology to the scenario of responding to public opinion events, and comprises:
Data processing: first, multi-modal data related to public opinion events, including text, images, audio, and video, are collected and organized. Data related to the events, including news reports, social media posts, and user comments, are acquired from various sources through web crawler technology. The organized data are then preprocessed to facilitate subsequent model training and generation: the data are cleaned and denoised, and useless or erroneous data are eliminated to ensure data quality;
Content generation: first, the data of each modality related to a public opinion event must be understood and analyzed to obtain their semantic information and features. The features of the multiple modalities are then fused and aligned to establish associations between modalities. Finally, multi-modal content generation is performed: based on the fused feature representation, a generative model such as a generative adversarial network (GAN) produces the multi-modal content;
Content quality and credibility evaluation: the quality and trustworthiness of the generated content are evaluated using metrics and evaluation methods, for example checking whether the generated content is internally consistent and conforms to language and logic rules.
As a further scheme of the invention: in data processing, the different modal data such as text, images, audio, and video are collected with corresponding methods such as API interfaces and crawler tools, and the collected data are organized and classified according to their relevance to the public opinion event. Text data undergo word segmentation, stop-word removal, and part-of-speech tagging to facilitate subsequent text analysis and generation; image data undergo cropping, size normalization, and color-space conversion to facilitate subsequent image generation; audio data undergo segmentation, noise reduction, and feature extraction to facilitate subsequent audio analysis and generation; and video data undergo clipping, frame extraction, and key-frame selection to facilitate subsequent video generation.
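A minimal sketch of the text branch of this preprocessing is shown below. The stop-word list and regular expressions are illustrative assumptions, not part of the patent; a production system would use a dedicated segmenter and curated stop-word lists.

```python
import re

# Illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "is"}

def preprocess_text(raw: str) -> list:
    """Strip HTML remnants, tokenize, lowercase, and drop stop words."""
    cleaned = re.sub(r"<[^>]+>", " ", raw)               # remove HTML tags
    tokens = re.findall(r"[a-z]+|\d+", cleaned.lower())  # crude tokenizer
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess_text("<p>The crisis response team posted 3 updates</p>"))
# → ['crisis', 'response', 'team', 'posted', '3', 'updates']
```

Real public opinion text (especially Chinese social media data) would need a proper word-segmentation library rather than this whitespace-level tokenizer.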
As a further scheme of the invention: content generation includes text content generation, image content generation, audio content generation, and video content generation. The data of each modality undergo feature extraction and representation learning so as to be converted into a machine-understandable form: for image data, features are extracted with a convolutional neural network; for text data, with a word-embedding or text-encoding model; for audio data, with a spectrogram or another audio feature-extraction method. A fusion model, such as a multi-modal neural network or an image-text alignment model, is then used so that information from the different modalities supplements and aligns with each other, facilitating the subsequent generation tasks. For text generation, a recurrent neural network or Transformer model handles tasks such as generating news headlines and social media posts; for image generation, a generative adversarial network or variational autoencoder model handles tasks such as generating image descriptions and image synthesis; for audio generation, tasks include speech synthesis and music synthesis; for video generation, a generative adversarial network or spatiotemporal generation model handles tasks such as generating video clips and video descriptions. The overall method is composable and diffusion-based: it can generate output modalities in any combination of language, image, video, or audio from any combination of input modalities.
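The fusion step above can be illustrated with a simple late-fusion sketch: each modality's feature vector is L2-normalized and the results are concatenated into one joint representation. The two-dimensional features are toy values; real systems use learned high-dimensional embeddings.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (guarding against the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def fuse(text_feat, image_feat, audio_feat):
    """Late fusion: normalize each modality's features, then concatenate."""
    return l2_normalize(text_feat) + l2_normalize(image_feat) + l2_normalize(audio_feat)

joint = fuse([3.0, 4.0], [1.0, 0.0], [0.0, 2.0])
print(joint)  # → [0.6, 0.8, 1.0, 0.0, 0.0, 1.0]
```

Normalizing before concatenation keeps one modality's feature scale from dominating the joint representation, which is one common motivation for this design.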
As a further scheme of the invention: the technical basis of text content generation comprises:
Natural language processing: in the text content generation task, natural language processing technology is used for the preprocessing steps of word segmentation, part-of-speech tagging, syntactic analysis, and semantic understanding, and for the key links of language-model modeling and evaluation;
Language model: a language model is a probability distribution model that predicts the next word or character from the preceding text sequence; examples include the n-gram model, the recurrent neural network model, and the Transformer model. The language model is an important foundation for generating text content and is used to produce coherent sentences and passages;
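The simplest of these, the n-gram model, can be sketched in a few lines: a bigram model estimated by counting and then queried for the most probable next word. The toy corpus is an assumption; real systems train on large datasets.

```python
from collections import Counter, defaultdict

class BigramModel:
    """Bigram language model: P(next | prev) estimated by maximum likelihood counts."""

    def __init__(self, corpus):
        self.counts = defaultdict(Counter)
        for sentence in corpus:
            words = sentence.split()
            for prev, nxt in zip(words, words[1:]):
                self.counts[prev][nxt] += 1

    def predict(self, prev):
        """Return the most frequent successor of `prev`, or None if unseen."""
        successors = self.counts.get(prev)
        return successors.most_common(1)[0][0] if successors else None

lm = BigramModel(["the event was resolved", "the event escalated quickly"])
print(lm.predict("the"))  # → event
```

A recurrent network or Transformer replaces these raw counts with learned parameters, but the task is the same: model the distribution of the next token given its context.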
Sequence-to-sequence model: the sequence-to-sequence model is a text-generation framework consisting of an encoder and a decoder. The encoder encodes an input sequence into a fixed-dimension vector representation, and the decoder uses this vector to generate a target sequence. Sequence-to-sequence models are used for machine translation and dialogue-generation tasks;
Attention mechanism: the attention mechanism assigns different weights to different parts of the input during generation, helping the model focus on the important parts of the input sequence and improving the accuracy and fluency of generation. The self-attention mechanism in the Transformer model is the most widely applied attention mechanism at present;
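In its scaled dot-product form, the attention mechanism just described reduces to a weighted average of value vectors, with weights given by a softmax over query-key similarities. The two-dimensional vectors below are toy values chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention: softmax(q.k / sqrt(d)) weighted sum of values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# The query matches the first key, so the output leans toward the first value vector.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(out)
```

Because the weights sum to one, the output always lies inside the convex hull of the value vectors; self-attention simply derives queries, keys, and values from the same sequence.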
These technical foundations are combined and applied to the text content generation task to produce different types of text content, such as public opinion summaries, public opinion comments, and public opinion news.
As a further scheme of the invention: image content generation refers to the process of generating an image, within a single modality or across modalities, from given data using artificial intelligence technology. Depending on the task target and input modality, this includes image synthesis, generating new images from existing pictures, and generating images that conform to the semantics of a text description. The technical foundations of image content generation include:
diffusion model:
Implementation principle: a diffusion model defines a Markov chain of diffusion steps that continuously adds random noise to the data until only pure Gaussian noise remains, then learns the reverse diffusion process and generates an image by inverse denoising inference. Because the model systematically perturbs the data distribution and then restores it, the whole process is gradually optimized, which ensures the stability and controllability of the model;
Model advantages and disadvantages: the advantage of the diffusion model is that its forward and reverse diffusion processes, based on a Markov chain, can restore real data more accurately and preserve image detail better, so the generated images are more realistic; in particular, diffusion models achieve good results in image inpainting and molecular-graph generation. However, because of the complexity of its computation steps, the diffusion model suffers from slow sampling and from weaker generalization across data types;
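The forward half of this process can be written down directly. With a linear beta schedule, the closed form q(x_t | x_0) = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise mixes the data with Gaussian noise, and the signal coefficient decays toward zero as t grows. A scalar "image" is used purely for illustration, and the schedule constants follow common DDPM-style defaults but are assumptions here.

```python
import math
import random

def alpha_bar(t, T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) under a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0): scaled signal plus Gaussian noise."""
    ab = alpha_bar(t)
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
# Early steps keep nearly all of the signal; by t = T almost pure noise remains.
print(alpha_bar(1), alpha_bar(500), alpha_bar(1000))
```

The generative direction, learning a network that predicts and removes the added noise step by step, is what the patent's "inverse denoising inference" refers to; this sketch only shows the fixed forward corruption.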
CLIP: Contrastive Language-Image Pre-training
Principle: CLIP is a contrastive-learning-based text-image cross-modal pre-training model. Its training principle is to use encoders to extract features from text and images respectively, map both into the same representation space, and train the model by computing the similarity and difference of text-image pairs, so that an image conforming to a given text description can be generated;
Model advantages and disadvantages: thanks to its multi-modal contrastive learning and pre-training process, CLIP can align text features with image features without the data having to be labeled in advance. It performs excellently on zero-shot image-text classification tasks, grasps text descriptions and image styles more accurately, can vary inessential image details without losing accuracy, and yields more diverse generated images. However, CLIP is essentially an image classification model, so its performance on complex and abstract scenes is limited; for example, image generation is poor in tasks that involve time-series data or require reasoning. In addition, CLIP's training effect depends on a large-scale text-image pair dataset, so training resource consumption is relatively high.
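The shared representation space that makes CLIP useful can be illustrated at its simplest: once text and images are embedded into the same space (the hand-picked vectors below are hypothetical), matching reduces to cosine similarity between the text embedding and each candidate image embedding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical embeddings already mapped into a shared text-image space.
text_emb = [1.0, 0.2]
image_embs = {"crowd_photo": [0.9, 0.1], "landscape": [-0.5, 1.0]}

best = max(image_embs, key=lambda name: cosine(text_emb, image_embs[name]))
print(best)  # → crowd_photo
```

CLIP's contrastive training objective pushes matching text-image pairs toward high cosine similarity and mismatched pairs toward low similarity, which is why this retrieval-by-similarity picture works after training.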
As a further scheme of the invention: audio content generation refers to the process of synthesizing sound waveforms from input data. It includes synthesizing speech from text, converting speech between different languages, producing spoken descriptions of visual content (images or videos), and generating melodies and music. The technical foundations of audio content generation include:
Tacotron2:
Implementation principle: Tacotron2 is an end-to-end speech synthesis model that combines WaveNet and Tacotron, consisting of a spectrogram prediction network and a vocoder. The sequence-to-sequence prediction network extracts text features and feeds them into the model, the predicted values form a mel spectrogram, and the vocoder generates a time-domain waveform from the predicted sequence;
model advantages and disadvantages: tacotron2 introduces an attention mechanism to replace a traditional voice synthesis duration model, extracts structural features through a neural network, learns the corresponding relation between text and acoustic features, has the advantages that the gradient vanishing problem is optimized through improvement of the attention mechanism, the tone quality generated by audio content is good, the input text data is good in robustness, but has the disadvantages that the autoregressive model using a cyclic neural network structure is low in synthesis speed, the pronunciation of complex words is difficult, the generated voice lacks emotion colors, the training time and cost for a large data set are high, and the model lacks controllability;
Transformer-TTS:
Implementation principle: Transformer-TTS is an end-to-end speech generation model that applies the Transformer structure to a TTS system. Specifically, Transformer-TTS builds an encoder-decoder structure that introduces a multi-head attention mechanism to improve training efficiency, takes a phoneme sequence as input to generate a mel spectrogram, and outputs waveforms through a WaveNet vocoder;
Model advantages and disadvantages: a speech model with the Transformer structure trains faster, solving Tacotron2's slow training and its difficulty in modeling long-range dependencies, and because the Transformer builds on an understanding of semantics and relations, the generated audio sounds more natural. As an autoregressive model, however, it still suffers from slow inference and from model deviation caused by the accumulation of autoregressive errors;
FastSpeech:
Implementation principle: FastSpeech is a non-autoregressive sequence-to-sequence speech synthesis model. It takes a phoneme sequence as input, outputs a mel spectrogram based on the alignment produced by a length regulator, and improves synthesis speed through a parallelizable network structure;
Model advantages and disadvantages: FastSpeech generates the mel spectrogram in parallel through non-autoregressive decoding, which markedly improves computation speed, while its duration model keeps phonemes aligned with mel features, improving both synthesis speed and voice quality and giving good controllability over the generated audio. Its disadvantage is that training with knowledge distillation loses some information, which can make synthesis results inaccurate;
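The length regulator at the heart of FastSpeech's non-autoregressive decoding can be sketched in a few lines: each phoneme's hidden state is repeated according to its predicted duration so that the expanded sequence aligns one-to-one with mel-spectrogram frames. String placeholders stand in for the hidden-state vectors.

```python
def length_regulate(phoneme_hiddens, durations):
    """Expand per-phoneme hidden states to per-frame states by repetition."""
    frames = []
    for hidden, duration in zip(phoneme_hiddens, durations):
        frames.extend([hidden] * duration)
    return frames

# Placeholder hidden states for three phonemes with predicted durations 2, 1, 3.
hiddens = ["h_a", "h_b", "h_c"]
print(length_regulate(hiddens, [2, 1, 3]))
# → ['h_a', 'h_a', 'h_b', 'h_c', 'h_c', 'h_c']
```

Because every output frame is known up front, the decoder can produce all frames in parallel, which is exactly why this design avoids the slow step-by-step decoding of autoregressive TTS.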
DeepVoice3:
Implementation principle: DeepVoice3 is a speech system based on a fully convolutional architecture. It converts various text features into vocoder parameters through fully parallel computation and generates speech by feeding those parameters into a waveform-synthesis model;
Model advantages and disadvantages: DeepVoice3 scales speech-synthesis training to larger datasets, can quickly be applied to training on new kinds of datasets, and suits multi-speaker speech synthesis tasks. Because the model extracts text features in a fully convolutional manner, it markedly improves training speed and GPU utilization and reduces training cost;
AudioLM:
Implementation principle: AudioLM applies the training principle of a language model, modeling and training semantic tokens and acoustic tokens with a Transformer structure, so that it can reason about semantic information from an audio prompt and generate the subsequent speech or piano music;
Model advantages and disadvantages: AudioLM needs no training on labeled data, can preserve the speaker characteristics or musical style of the original prompt audio, and generates new audio that is semantically and stylistically consistent, with good naturalness and coherence in the generated sound.
As a further scheme of the invention: video content generation refers to the automatic generation, by a trained artificial intelligence, of descriptive, high-fidelity video content from given text, image, or video data in a single modality or in multiple modalities. The technical foundations of video content generation include:
Imagen Video:
Implementation principle: Imagen Video is a text-conditional video generation model developed from the Imagen image model. Through a combination of several diffusion models, it generates an initial video from a text prompt and then progressively increases the resolution and frame count to produce the final video;
Model advantages and disadvantages: the generated videos have high fidelity, controllability, and world knowledge; the model supports generating diverse videos and text animations in various artistic styles and shows an understanding of 3D objects. However, the parallel training required by the cascaded models demands substantial computational resources;
Gen:
Implementation principle: the Gen model learns text-image features through a latent diffusion model and generates a new video from a given text prompt or reference image, or performs tasks such as video style transfer from an original video and a driving image;
Model advantages and disadvantages: the model performs well in video rendering and style transfer, produces videos with strong artistry and good preservation of image structure, and adapts well to customization requirements, but the Gen model is still limited in the stability of its generated results;
CogVideo:
Implementation principle: CogVideo is a large-scale text-to-video generation model based on an autoregressive method. It applies the image generation model CogView2 to text-to-video generation for efficient learning, and generates a video by recursively predicting each frame from the previous ones and splicing them together;
Model advantages and disadvantages: the model supports Chinese prompts, and its multi-frame-rate hierarchical training method better captures the relation between text and video, so the generated video looks more natural; however, the model limits the length of the input sequence.
As still further aspects of the invention: in the content quality and credibility evaluation, natural language processing techniques such as language models and semantic analysis are used to evaluate the consistency of the generated content. If the generated content is based on a specific knowledge base or database, a knowledge graph or domain expertise is used to check whether the content agrees with the information in the knowledge base. Whether the emotion expressed in the generated content is reasonable and consistent is also evaluated, using emotion-analysis techniques such as emotion dictionaries and emotion classifiers to assess emotional tendency. If the generated content contains images or audio, image-processing and audio-processing techniques can evaluate their authenticity and reliability, for example by detecting whether an image has been edited or synthesized, or by analyzing the acoustic characteristics of the audio for verification.
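The emotion-dictionary part of this evaluation can be sketched as lexicon-based scoring: count positive and negative words in the generated text and report a tendency in [-1, 1]. The word lists are illustrative assumptions, not part of the patent.

```python
# Illustrative emotion dictionaries (assumptions, not from the patent).
POSITIVE = {"calm", "resolved", "transparent", "apology", "support"}
NEGATIVE = {"panic", "outrage", "coverup", "crisis", "anger"}

def sentiment_score(tokens):
    """Return emotional tendency in [-1, 1]; 0.0 when no emotion words occur."""
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(sentiment_score("the team issued a transparent apology and stayed calm".split()))
# → 1.0
```

An emotion classifier trained on labeled data would replace these fixed word lists, but the evaluation contract is the same: map generated text to an emotional tendency and flag content whose tendency is unreasonable for the event.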
Compared with the prior art, the invention has the beneficial effects that:
1. The system replaces the traditional manual mode of public opinion response, handling large volumes of content-generation tasks automatically through intelligent algorithms. It can quickly and accurately generate public opinion content that meets user requirements, improves the efficiency and quality of content generation, provides more comprehensive and diversified public opinion event response capability, and saves the time and cost of manual processing. Because its algorithms and models are data-driven, the system can adapt to public opinion data and user feedback, providing more accurate, comprehensive, and diversified content and thereby improving the effectiveness and efficiency of public opinion event response.
2. The invention abandons the traditional text-only mode of public opinion response; through multi-modal content generation technology, the system can combine text, images, video, and audio to respond to public opinion in a more comprehensive and diversified way.
3. Besides generating content about the objective event, the invention can analyze the sentiment and emotion information in a public opinion event and generate corresponding content according to the result of sentiment analysis to express different sentiments and emotions, providing high controllability and editability and achieving accuracy and pertinence of the content.
Drawings
Fig. 1 is a system configuration diagram of an artificial intelligence multi-modal content generation system for public opinion event coping.
Fig. 2 is a system configuration diagram of content production in an artificial intelligence multi-modal content generation system for public opinion event coping.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following examples in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, well known methods, procedures, and components have not been described in detail so as not to obscure the subject matter of the present application.
Example 1
Referring to fig. 1-2, an artificial intelligence multi-modal content generation system for public opinion event coping applies artificial intelligence multi-modal content generation technology to the application scenario of public opinion event coping, and includes:
and (3) data processing: firstly, collecting and sorting multi-mode data related to public opinion events, including texts, images, audios and videos, acquiring the data related to the public opinion events from various sources through a web crawler technology, including news reports, social media information and user comments, and preprocessing the sorted data to facilitate subsequent model training and generation, cleaning and denoising the data, eliminating useless or erroneous data and ensuring data quality;
content generation: first, for the multi-modal data related to public opinion events, the data of different modalities must be understood and analyzed to acquire their semantic information and characteristics; then, the features of multiple modalities are fused and aligned to establish associations between the different modalities; finally, multi-modal content generation is carried out: based on the fused feature representation, a generation model such as a generative adversarial network is used to generate multi-modal content;
content quality credibility assessment: evaluating the quality and trustworthiness of the generated content may use metrics and evaluation methods to assess whether the context of the generated content is coherent and conforms to linguistic and logical rules.
Specifically, the system of the invention focuses on generating multi-modal content to respond to and guide public opinion events, using artificial intelligence multi-modal content generation technology to cope with public opinion automatically. When a public opinion event occurs, a certain capability for writing public opinion manuscripts is required, and one's image is optimized by writing and publishing articles or online comments on platforms such as Tieba (post bars), forums, blogs, microblogs and WeChat.
Preferably, in the data processing, different modal data such as text, image, audio and video are collected using corresponding methods such as API interfaces and crawler tools, and the collected data are organized and classified according to their relevance to the public opinion event. Text data undergo word segmentation, stop-word removal and part-of-speech tagging to facilitate subsequent text analysis and generation; image data undergo cropping, size unification and color-space conversion to facilitate subsequent image generation; audio data undergo segmentation, noise reduction and feature extraction to facilitate subsequent audio analysis and generation; and video data undergo clipping, frame extraction and key-frame selection to facilitate subsequent video generation.
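The text-cleaning step described above can be illustrated with a minimal Python sketch. The regex tokenizer and the tiny stop-word list are illustrative assumptions; a production system would use a proper word segmenter and a full domain stop-word dictionary.

```python
import re

# Illustrative stop-word list; a real system would load a full
# public-opinion-domain stop-word dictionary (an assumption here).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def preprocess_text(raw: str) -> list[str]:
    """Clean raw crawled text: strip markup remnants, lowercase,
    tokenize on alphanumeric runs, and drop stop words."""
    no_markup = re.sub(r"<[^>]+>", " ", raw)  # remove HTML tag remnants
    tokens = re.findall(r"[a-z0-9]+", no_markup.lower())
    return [t for t in tokens if t not in STOP_WORDS]

cleaned = preprocess_text("<p>The event is trending on social media</p>")
```

The same pattern (clean, normalize, tokenize, filter) applies to comments and news text alike before model training.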
Preferably, the content generation includes: text content generation, image content generation, audio content generation, and video content generation. In content generation, the data of each modality are subjected to feature extraction and representation learning to be converted into a machine-understandable form: for image data, feature extraction is performed using a convolutional neural network; for text data, feature extraction is performed using word embedding or a text encoding model; for audio data, spectrograms or other audio feature extraction methods are used. A fusion model, including a multi-modal neural network or an image-text alignment model, is then applied with the aim of mutually supplementing and aligning information between the different modalities to facilitate subsequent generation tasks. For text generation, a recurrent neural network or a Transformer model is used for tasks such as generating news headlines and social media posts; for image generation, a generative adversarial network or a variational auto-encoder model is used for tasks such as generating image descriptions and image synthesis; for audio generation, tasks such as speech synthesis and music synthesis are performed; for video generation, a generative adversarial network or a spatiotemporal generation model is used for tasks such as generating video clips and video descriptions. The method is combinable and diffusive, capable of generating output modalities of any combination of language, image, video or audio from any combination of input modalities.
Preferably, the technical basis of text content generation includes:
natural language processing: in the text content generation task, natural language processing technology is used in preprocessing steps such as word segmentation, part-of-speech tagging, syntactic analysis and semantic understanding, and in key links such as the modeling and evaluation of the language model;
language model: the language model is a probability distribution model that predicts the next word or character according to the preceding text sequence, and includes the n-gram model, the recurrent neural network model and the Transformer model; the language model is an important basis for generating text content and is used for generating coherent sentences and chapters;
sequence-to-sequence model: the sequence-to-sequence model is a text generation framework consisting of an encoder and a decoder: the encoder encodes an input sequence into a fixed-dimension vector representation, and the decoder uses this vector to generate the target sequence; the sequence-to-sequence model is used for tasks such as machine translation and dialog generation;
attention mechanism: the attention mechanism gives different weights to different parts of the input during generation; it helps the model better attend to the important parts of the input sequence and improves the accuracy and fluency of generation; the self-attention mechanism in the Transformer model is the most widely applied attention mechanism at present;
These technical foundations are combined and applied to the text content generation task to realize the generation of different types of text content, such as public opinion summaries, public opinion comments and public opinion news.
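The attention mechanism described above can be sketched as scaled dot-product attention for a single query. This pure-Python toy (the vectors and their 2-d dimension are illustrative stand-ins for real hidden states) shows how softmax weights over query-key scores blend the value vectors:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(query, keys, values):
    """Scaled dot-product attention for one query:
    weight each value vector by softmax(q . k / sqrt(d))."""
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The query matches the first key, so the output leans toward the first value.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
```

In a Transformer, the same computation runs for every position in parallel, with learned projections producing the queries, keys and values.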
Preferably, the image content generation refers to the process of generating an image, in a single modality or across modalities, from given data using artificial intelligence technology; according to different task targets and input modes, it includes image synthesis, generating a new image from an existing picture, and generating an image conforming to the semantics of a text description, wherein the technical basis of image content generation includes:
diffusion model:
the realization principle is as follows: the diffusion model is characterized in that a Markov chain of a diffusion step is defined, random noise is continuously added to data until pure Gaussian noise data is obtained, then an inverse diffusion process is learned, an image is generated through inverse noise reduction inference, the diffusion model systematically perturbs the distribution in the data, and then the data distribution is restored, so that the whole process presents a gradually optimized property, and the stability and the controllability of the model are ensured;
model advantages and disadvantages: the advantage of the diffusion model is that its forward and backward diffusion processes based on a Markov chain can restore real data more accurately, with a stronger capacity to preserve image detail, so that the generated images are more realistic; in particular, the diffusion model achieves good results in applications such as image inpainting and molecular graph generation; however, due to the complexity of its calculation steps, the diffusion model also suffers from slower sampling speed and weaker generalization across data types;
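The forward diffusion process described above, a Markov chain that repeatedly mixes the data with Gaussian noise, can be sketched in a few lines. The noise schedule and the 8-element "image" below are toy assumptions, not a real training configuration:

```python
import math
import random

def forward_diffuse(x0, betas, rng):
    """Forward diffusion (Markov chain): at each step,
    x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,
    gradually pushing the data toward pure Gaussian noise."""
    x = list(x0)
    for beta in betas:
        keep, noise = math.sqrt(1.0 - beta), math.sqrt(beta)
        x = [keep * xi + noise * rng.gauss(0.0, 1.0) for xi in x]
    return x

rng = random.Random(0)                       # fixed seed for reproducibility
noised = forward_diffuse([1.0] * 8, [0.02] * 100, rng)
```

A trained diffusion model learns the reverse of this chain, denoising step by step from pure noise back to an image.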
CLIP (Contrastive Language-Image Pre-training):
Principle: CLIP is a text-image cross-modal pre-training model based on contrastive learning; its training principle is to extract features of text and image with separate encoders, map them into the same representation space, and train the model through similarity and difference calculations over text-image pairs, so that an image conforming to a description can be generated from a given text;
model advantages and disadvantages: the advantage of the CLIP model is that its multi-modal contrastive learning and pre-training process aligns text features with image features, so data need not be labeled in advance, and the model excels at zero-shot image-text classification tasks. It also grasps text descriptions and image styles more accurately, can vary inessential image details without losing accuracy, and generates more diverse images. However, the CLIP model is essentially an image classification model, so its performance on complex and abstract scenes is limited: for example, image generation is poor in tasks containing time-series data or requiring reasoning. In addition, the training effect of CLIP depends on a large-scale text-image pair dataset, and its training resource consumption is relatively large.
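The shared-representation-space idea behind CLIP can be illustrated with a toy retrieval step: given already-encoded embeddings (the 2-d vectors below are hypothetical stand-ins for real encoder outputs, which in CLIP are 512-d), the best-matching image is the one whose embedding has the highest cosine similarity to the text embedding:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def best_match(text_vec, image_vecs):
    """CLIP-style retrieval: pick the index of the image embedding
    most similar to the text embedding in the shared space."""
    sims = [cosine(text_vec, iv) for iv in image_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy embeddings: the second image points the same way as the text.
idx = best_match([0.9, 0.1], [[0.1, 0.9], [0.8, 0.2]])
```

During contrastive training, matching text-image pairs are pushed toward high similarity and mismatched pairs toward low similarity, which is what makes this retrieval step meaningful.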
Preferably, the audio content generation refers to the process of synthesizing corresponding sound waveforms from input data, including synthesizing speech from text, performing speech conversion between different languages, producing spoken descriptions of visual content (images or videos), and generating melodies and music, wherein the technical basis of audio content generation includes:
Tacotron2:
the realization principle is as follows: Tacotron2 is an end-to-end speech synthesis model that combines WaveNet and Tacotron, consisting of a spectrogram prediction network and a vocoder: the sequence-to-sequence prediction network extracts text features and predicts mel-spectrogram frames, and the vocoder generates the time-domain waveform from the predicted spectrogram sequence;
model advantages and disadvantages: tacotron2 introduces an attention mechanism to replace a traditional voice synthesis duration model, extracts structural features through a neural network, learns the corresponding relation between text and acoustic features, has the advantages that the gradient vanishing problem is optimized through improvement of the attention mechanism, the tone quality generated by audio content is good, the input text data is good in robustness, but has the disadvantages that the autoregressive model using a cyclic neural network structure is low in synthesis speed, the pronunciation of complex words is difficult, the generated voice lacks emotion colors, the training time and cost for a large data set are high, and the model lacks controllability;
Transformer-TTS:
The realization principle is as follows: Transformer-TTS is an end-to-end speech generation model that applies the Transformer structure to a TTS system; specifically, Transformer-TTS constructs an encoder-decoder structure by introducing a multi-head attention mechanism to improve training efficiency, takes a phoneme sequence as input to generate a mel spectrogram, and outputs the waveform through a WaveNet vocoder;
model advantages and disadvantages: a speech model with the Transformer structure trains faster, solving Tacotron2's problems of slow training and difficulty in modeling long-range dependencies, and because the Transformer builds on an understanding of semantics and relations, the generated audio sounds more natural; however, as an autoregressive model it still suffers from slower inference and from model deviation caused by accumulated autoregressive errors;
FastSpeech:
the realization principle is as follows: FastSpeech is a non-autoregressive sequence-to-sequence speech synthesis model; its working principle is to take a phoneme sequence as input, output a mel spectrogram through the alignment produced by a length regulator, and improve speech synthesis speed through a parallelizable network structure;
model advantages and disadvantages: the advantage of FastSpeech is that the mel spectrogram is generated in parallel by non-autoregressive decoding, which markedly improves computation speed, while the duration model ensures that phonemes correspond to mel features, improving both synthesis speed and voice quality, with good controllability of the generated audio; its disadvantage is that training with knowledge distillation incurs information loss, which can make synthesis results inaccurate;
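The length regulator at the heart of FastSpeech's parallel decoding can be sketched as follows; the phoneme symbols and integer durations below are illustrative placeholders for real phoneme feature vectors and predicted frame counts:

```python
def length_regulator(phoneme_feats, durations):
    """FastSpeech-style length regulator: repeat each phoneme's feature
    by its predicted duration so the expanded sequence aligns with the
    mel-spectrogram frame count, enabling fully parallel decoding."""
    frames = []
    for feat, dur in zip(phoneme_feats, durations):
        frames.extend([feat] * dur)
    return frames

# Two phonemes with predicted durations of 2 and 3 frames -> 5 frames total.
frames = length_regulator(["ni", "hao"], [2, 3])
```

Because the output length is fixed up front by the durations, the decoder no longer needs to generate frames one by one, which is where the speedup over autoregressive models comes from.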
DeepVoice3:
The realization principle is as follows: DeepVoice3 is a speech system based on a fully convolutional architecture; it converts various text features into vocoder parameters through fully parallel computation and generates speech by using these parameters as input to a waveform synthesis model;
model advantages and disadvantages: DeepVoice3 scales speech synthesis training to larger datasets, can be rapidly applied to training on different new datasets, and is suitable for multi-speaker speech synthesis tasks; at the same time, the model extracts text features in a fully convolutional manner, which significantly improves training speed and GPU utilization and reduces training cost;
AudioLM:
the realization principle is as follows: based on the training principle of a language model, AudioLM models and trains semantic tokens and acoustic tokens through a Transformer structure, so that semantic reasoning is carried out from an audio prompt and the subsequent speech or piano music is generated;
model advantages and disadvantages: AudioLM does not need to be trained on labeled data, can preserve the speaker characteristics or musical style of the original prompt audio, and generates new audio with consistent semantics and style; the generated sound has good naturalness and consistency.
Preferably, the video content generation means that video content is automatically generated, through artificial intelligence training, from given text, image or video data in a single modality or multiple modalities, wherein the technical basis of video content generation includes:
Imagen-Video:
The realization principle is as follows: Imagen-Video is a text-conditional video generation model developed from the Imagen image model; through a combination of multiple diffusion models, it generates an initial video from a text prompt and then progressively increases the resolution and frame count of the video to produce the final result;
model advantages and disadvantages: the generated videos have high fidelity, controllability and world knowledge; the model supports generating diverse videos and text animations in various artistic styles and has some ability to understand 3D objects; however, the parallel training mode adopted by the cascaded models requires substantial computing resources;
Gen:
the realization principle is as follows: the Gen model learns text-image features through a latent diffusion model and generates a new video from a given text prompt or reference image, or performs tasks such as video style transfer from an original video and a driving image;
model advantages and disadvantages: the model performs well in video rendering and style transfer, the generated videos have strong artistry and strong preservation of image structure, and it adapts well to model customization requirements; however, the Gen model is still limited in the stability of its generated results;
CogVideo:
the realization principle is as follows: CogVideo is a large-scale text-to-video generation model based on an autoregressive method; it applies the image generation model CogView2 to text-to-video generation for efficient learning, and generates video by recursively predicting subsequent frames from previous frames and splicing them together;
Model advantages and disadvantages: the model supports Chinese prompts, and its multi-frame-rate hierarchical training method better captures the relation between text and video, so the generated video looks more natural; however, the model is limited in the length of its input sequence.
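The recursive frame-prediction scheme described for CogVideo reduces to a generic autoregressive loop. The integer "frames" and the stand-in predictor below are hypothetical placeholders for real frame tensors and a learned next-frame model:

```python
def generate_video(first_frame, predict_next, num_frames):
    """Autoregressive video generation sketch: each new frame is
    predicted from the previous one and spliced onto the sequence,
    mirroring the recursive frame prediction described above."""
    frames = [first_frame]
    for _ in range(num_frames - 1):
        frames.append(predict_next(frames[-1]))
    return frames

# Stand-in predictor: a real system would call a learned network here.
video = generate_video(0, lambda f: f + 1, 4)
```

The loop also makes the stated limitation concrete: because each frame conditions on the previous one, the sequence length the model can handle is bounded by its context window.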
Preferably, in the evaluation of content quality and credibility, natural language processing techniques such as language models and semantic analysis are used to evaluate the coherence of the generated content. If the generated content is based on a specific knowledge base or database, a knowledge graph or professional knowledge in the related field is used to evaluate whether the generated content is consistent with the information in the knowledge base. Whether the emotion expressed in the generated content is reasonable and consistent is also evaluated: emotion analysis techniques such as an emotion dictionary and an emotion classifier are used to assess the emotional tendency of the generated content. If the generated content contains images or audio, image processing and audio processing techniques can be used to evaluate their authenticity and reliability, for example by detecting whether an image has been edited or synthesized, or by analyzing the acoustic characteristics of the audio for verification.
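A minimal sketch of the emotion-dictionary technique mentioned above: the tiny lexicon here is an illustrative assumption, whereas real systems load large weighted sentiment lexicons and often combine them with a trained classifier.

```python
# Tiny illustrative emotion dictionary mapping words to polarities
# (an assumption; production systems use weighted domain lexicons).
LEXICON = {"good": 1, "calm": 1, "trust": 1,
           "bad": -1, "angry": -1, "panic": -1}

def emotion_score(tokens):
    """Dictionary-based emotional tendency: sum of per-word polarities.
    Positive total -> positive tendency; negative -> negative tendency."""
    return sum(LEXICON.get(t, 0) for t in tokens)

score = emotion_score(["public", "response", "was", "calm", "and", "good"])
```

In the quality-assessment step, such a score for the generated content can be compared with the intended emotional tendency to flag inconsistent output.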
The specific explanation is as follows: the invention uses an artificial intelligence multi-modal content generation system to make the automated flow of public opinion processing more complete, improving the efficiency and accuracy of public opinion response and providing users with better public opinion analysis and decision support. It combines the text, image, video and audio modalities to respond to public opinion in a more comprehensive and diversified manner, and through multi-modal content generation it can communicate information more intuitively. It abandons the traditional manual mode of public opinion response: intelligent algorithms and technology automatically handle a large volume of content generation tasks, quickly and accurately generating public opinion content that meets user requirements, improving the efficiency and quality of content generation, providing more comprehensive and diversified public opinion event coping capability, and saving the time and cost of manual processing. With high controllability and editability, it can analyze the sentiment and emotion information in public opinion events and generate corresponding content according to the results of sentiment analysis to express different sentiments, and it can customize content according to the characteristics and goals of the user, achieving accuracy and customization of the content.
Interpretation of commonly used terms:
multi-modal content generation: multi-modal digital content generation generally refers to synthesis techniques that use AI generation technology to produce image, video, speech, text and music content.
Public opinion: public opinion is the abbreviation of "public opinion situation". It refers to the social attitudes held by the public, as subjects, toward social managers, enterprises, individuals and other kinds of organizations and their political, social and moral tendencies, arising around the occurrence, development and change of intermediary social events within a certain social space; it is the sum of the beliefs, attitudes, opinions and emotions expressed by many people about the various phenomena and problems in society.
Emotion recognition technology: emotion recognition is a key technology for enabling machines to understand human emotion; researchers try to fuse more emotional signals, from text, speech, facial expressions and body movements to physiological signals inside the body, so that recognition becomes more accurate and human-machine interaction becomes more natural, smooth and warm.
Prompt engineering: prompt engineering is the technique of crafting appropriate prompts for large-model applications so that the large model produces better generation results.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although the present specification is described in terms of embodiments, not every embodiment contains only a single technical solution; this manner of description is adopted for clarity only. The specification should be taken as a whole, and the technical solutions in the various embodiments may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.
Claims (8)
1. An artificial intelligence multi-mode content generating system facing public opinion event coping is characterized by applying an artificial intelligence multi-mode content generating technology to an application scene of public opinion event coping, and comprising the following steps:
And (3) data processing: firstly, collecting and sorting multi-mode data related to public opinion events, including texts, images, audios and videos, acquiring the data related to the public opinion events from various sources through a web crawler technology, including news reports, social media information and user comments, and preprocessing the sorted data to facilitate subsequent model training and generation, cleaning and denoising the data, eliminating useless or erroneous data and ensuring data quality;
content generation: first, for the multi-modal data related to public opinion events, the data of different modalities must be understood and analyzed to acquire their semantic information and characteristics; then, the features of multiple modalities are fused and aligned to establish associations between the different modalities; finally, multi-modal content generation is carried out: based on the fused feature representation, a generation model such as a generative adversarial network is used to generate multi-modal content;
content quality credibility assessment: evaluating the quality and trustworthiness of the generated content may use metrics and evaluation methods to assess whether the context of the generated content is coherent and conforms to linguistic and logical rules.
2. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 1, wherein in the data processing, different modal data such as text, image, audio and video are collected by corresponding methods such as API interfaces and crawler tools; the collected data are organized and classified according to their relevance to the public opinion event; text data undergo word segmentation, stop-word removal and part-of-speech tagging for subsequent text analysis and generation; image data undergo cropping, size unification and color-space conversion for subsequent image generation; audio data undergo segmentation, noise reduction and feature extraction for subsequent audio analysis and generation; and video data undergo clipping, frame extraction and key-frame selection for subsequent video generation.
3. The public opinion event coping-oriented artificial intelligence multi-modal content generation system of claim 1, wherein the content generation comprises: text content generation, image content generation, audio content generation, and video content generation; in content generation, the data of each modality are subjected to feature extraction and representation learning to be converted into a machine-understandable form: for image data, feature extraction is performed using a convolutional neural network; for text data, feature extraction is performed using word embedding or a text encoding model; for audio data, a spectrogram or another audio feature extraction method is used; a fusion model, comprising a multi-modal neural network or an image-text alignment model, is applied with the aim of mutually supplementing and aligning information between the different modalities to facilitate subsequent generation tasks; for text generation, a recurrent neural network or a Transformer model is used for tasks such as generating news headlines and social media posts; for image generation, a generative adversarial network or a variational auto-encoder model is used for tasks such as generating image descriptions and image synthesis; for audio generation, tasks such as speech synthesis and music synthesis are performed; for video generation, a generative adversarial network or a spatiotemporal generation model is used for tasks such as generating video clips and video descriptions; and the method is combinable and diffusive, capable of generating output modalities of any combination of language, image, video or audio from any combination of input modalities.
4. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein the technical base of text content generation comprises:
natural language processing: in the text content generation task, natural language processing technology is used in preprocessing steps such as word segmentation, part-of-speech tagging, syntactic analysis and semantic understanding, and in key links such as the modeling and evaluation of the language model;
language model: the language model is a probability distribution model that predicts the next word or character according to the preceding text sequence, and includes the n-gram model, the recurrent neural network model and the Transformer model; the language model is an important basis for generating text content and is used for generating coherent sentences and chapters;
sequence-to-sequence model: the sequence-to-sequence model is a text generation framework consisting of an encoder and a decoder: the encoder encodes an input sequence into a fixed-dimension vector representation, and the decoder uses this vector to generate the target sequence; the sequence-to-sequence model is used for tasks such as machine translation and dialog generation;
attention mechanism: the attention mechanism gives different weights to different parts of the input during generation; it helps the model better attend to the important parts of the input sequence and improves the accuracy and fluency of generation; the self-attention mechanism in the Transformer model is the most widely applied attention mechanism at present;
These technical foundations are combined and applied to the text content generation task to realize the generation of different types of text content, such as public opinion summaries, public opinion comments and public opinion news.
5. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein the image content generation means the process of generating an image, in a single modality or across modalities, from given data using artificial intelligence technology; according to different task targets and input modes, it includes image synthesis, generating a new image from an existing picture, and generating an image conforming to the semantics of a text description, wherein the technical basis of image content generation includes:
diffusion model:
the realization principle is as follows: the diffusion model is characterized in that a Markov chain of a diffusion step is defined, random noise is continuously added to data until pure Gaussian noise data is obtained, then an inverse diffusion process is learned, an image is generated through inverse noise reduction inference, the diffusion model systematically perturbs the distribution in the data, and then the data distribution is restored, so that the whole process presents a gradually optimized property, and the stability and the controllability of the model are ensured;
Model advantages and disadvantages: the advantage of the diffusion model is that its forward and backward diffusion processes based on a Markov chain can restore real data more accurately, with a stronger capacity to preserve image detail, so that the generated images are more realistic; in particular, the diffusion model achieves good results in applications such as image inpainting and molecular graph generation; however, due to the complexity of its calculation steps, the diffusion model also suffers from slower sampling speed and weaker generalization across data types;
CLIP (Contrastive Language-Image Pre-training):
principle: CLIP is a text-image cross-modal pre-training model based on contrastive learning; its training principle is to extract features of text and image with separate encoders, map them into the same representation space, and train the model through similarity and difference calculations over text-image pairs, so that an image conforming to a description can be generated from a given text;
model advantages and disadvantages: the advantage of the CLIP model is that the multi-modal contrastive learning and pre-training process aligns text features with image features, so the data need not be labeled in advance; CLIP excels at zero-shot image-text classification tasks, captures text descriptions and image styles more accurately, and can vary inessential image details without sacrificing accuracy, yielding more diverse generated images; however, CLIP is essentially an image classification model, so its performance on complex and abstract scenes is limited, for example poor image generation in tasks that contain time-series data or require inferential computation; in addition, CLIP's training effect depends on a large-scale text-image pair dataset, and training resource consumption is relatively large.
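The text-image similarity computation at the core of CLIP's training can be illustrated with a toy example. The embeddings and temperature value below are hypothetical stand-ins for real encoder outputs; only the scoring scheme (L2-normalise, cosine similarity, temperature-scaled softmax) reflects the contrastive setup described above.

```python
import numpy as np

def clip_similarity(text_emb, image_emb, temperature=0.07):
    """CLIP-style scoring: L2-normalise both sets of embeddings, then take a
    temperature-scaled cosine-similarity matrix (rows: texts, cols: images)."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature
    # Softmax over images: for each text, a distribution over candidate images.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy 2-pair batch: matching pairs point in similar directions,
# so the diagonal of the probability matrix dominates.
texts = np.array([[1.0, 0.0], [0.0, 1.0]])
images = np.array([[0.9, 0.1], [0.1, 0.9]])
probs = clip_similarity(texts, images)
```

During training, the contrastive loss pushes the diagonal (matching pairs) toward 1 and the off-diagonal (mismatched pairs) toward 0.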
6. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein audio content generation means a process of synthesizing corresponding sound waveforms from input data, including synthesizing speech from text, performing voice conversion between different languages, producing spoken descriptions of visual content such as images or videos, and generating melodies and music, wherein the technical basis of the audio content generation includes:
Tacotron2:
the realization principle is as follows: Tacotron2 is an end-to-end speech synthesis model that combines WaveNet and Tacotron and consists of a spectrogram prediction network and a vocoder; the sequence-to-sequence prediction network extracts text features and feeds them into the model, the predicted values are superimposed onto a mel spectrogram, and the vocoder generates a time-domain waveform from the predicted sequence;
model advantages and disadvantages: Tacotron2 introduces an attention mechanism to replace the traditional speech synthesis duration model, extracts structural features through a neural network, and learns the correspondence between text and acoustic features; its advantages are that the improved attention mechanism mitigates the vanishing gradient problem, the generated audio quality is good, and robustness to input text data is good; its disadvantages are that the autoregressive model built on a recurrent neural network structure synthesizes slowly, complex words are difficult to pronounce, the generated speech lacks emotional color, training time and cost on large datasets are high, and the model lacks controllability;
Transformer-TTS:
the realization principle is as follows: Transformer-TTS is an end-to-end speech generation model that applies the Transformer structure to a TTS system; specifically, Transformer-TTS builds an encoder-decoder structure with a multi-head attention mechanism to improve training efficiency, takes a phoneme sequence as input to generate a mel spectrogram, and outputs the waveform through a WaveNet vocoder;
model advantages and disadvantages: a speech model with the Transformer structure accelerates training, solving Tacotron2's problems of slow training and difficulty in modeling long-range dependencies; because the Transformer builds on an understanding of semantics and relations, the generated audio sounds more natural; however, as an autoregressive model it still suffers from slow inference and from model bias caused by the accumulation of autoregressive errors;
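The multi-head attention mechanism mentioned above reduces, per head, to scaled dot-product attention, which is what lets every phoneme position relate to every other position in parallel. A minimal single-head sketch with toy shapes (not the model's actual dimensions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Each row of the weight matrix is a distribution over all key positions,
    so all positions are processed simultaneously rather than step by step."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# 4 toy phoneme positions, model width 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention runs several such heads on learned projections of Q, K, V and concatenates the results; that projection step is omitted here.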
FastSpeech:
the realization principle is as follows: FastSpeech is a non-autoregressive sequence-to-sequence speech synthesis model; its working principle is to take a phoneme sequence as input, output a mel spectrogram according to the alignment produced by a length regulator, and raise the speech synthesis speed through a parallelizable network structure;
model advantages and disadvantages: the advantage of FastSpeech is that the mel spectrogram is generated in parallel by non-autoregressive decoding, which markedly increases computation speed, while a duration model keeps phonemes aligned with mel features, improving both synthesis speed and voice quality and giving good controllability over the generated audio; its disadvantage is that training with knowledge distillation incurs information loss, which can make the synthesis results inaccurate;
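The length regulator that aligns phonemes to mel frames can be sketched as a simple repeat operation: each phoneme's hidden vector is duplicated according to its predicted duration so that the expanded sequence matches the target number of spectrogram frames. The hidden states and durations below are toy values.

```python
import numpy as np

def length_regulator(phoneme_hidden, durations):
    """Expand each phoneme's hidden vector `durations[i]` times along the time
    axis, so the sequence length equals the total mel-frame count."""
    return np.repeat(phoneme_hidden, durations, axis=0)

# 3 phonemes with hidden size 4; predicted durations 2, 1, 3 frames -> 6 mel frames.
hidden = np.arange(12, dtype=float).reshape(3, 4)
frames = length_regulator(hidden, np.array([2, 1, 3]))
```

Because the expanded sequence length is known up front, the decoder can produce all frames in one parallel pass, which is the source of FastSpeech's speed advantage.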
DeepVoice3:
the realization principle is as follows: DeepVoice3 is a speech synthesis system based on a fully convolutional architecture; it converts various text features into vocoder parameters in a fully parallel computing manner and generates speech by feeding those parameters into a waveform synthesis model;
model advantages and disadvantages: DeepVoice3 scales up the dataset size for speech synthesis training, can be quickly adapted to training on different new datasets, and is suitable for multi-speaker speech synthesis tasks; at the same time, because the model extracts text features in a fully convolutional manner, training speed and GPU utilization are markedly improved and training cost is reduced;
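The parallelism of a fully convolutional text encoder can be illustrated with a toy one-dimensional convolution: every output position depends only on a local window of the input, so all positions can be computed independently, unlike an RNN's sequential recurrence. The scalar feature sequence and filter here are hypothetical.

```python
import numpy as np

def conv1d_same(x, kernel):
    """'Same'-padded 1-D convolution over a feature sequence. Each output
    position is an independent dot product over a local window, so the loop
    below could be evaluated entirely in parallel on a GPU."""
    k = len(kernel)
    pad = k // 2
    padded = np.concatenate([np.zeros(pad), x, np.zeros(pad)])
    return np.array([padded[i:i + k] @ kernel for i in range(len(x))])

# Toy character-feature sequence and a 3-tap smoothing filter.
seq = np.array([1.0, 2.0, 3.0, 4.0])
feat = conv1d_same(seq, np.array([0.25, 0.5, 0.25]))
```

A real DeepVoice3-style encoder stacks many such convolutions over multi-channel embeddings with gating; this sketch only shows the positional independence that enables parallel training.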
AudioLM:
the realization principle is as follows: AudioLM, following the training principle of a language model, models and trains semantic tokens and acoustic tokens with a Transformer structure, so that it performs semantic inference from an audio prompt and generates a continuation of speech or piano music;
model advantages and disadvantages: AudioLM needs no training on labeled data, can retain the speaker characteristics or musical style of the original prompt audio, and generates new audio that is semantically and stylistically consistent; the generated sound has good naturalness and coherence.
7. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 3, wherein video content generation means that, through training, the artificial intelligence can automatically generate high-fidelity video content conforming to a description from given text, image, or video data, whether single-modal or multi-modal, wherein the technical basis of the video content generation comprises:
Imagen-Video:
the realization principle is as follows: Imagen-Video is a text-conditioned video generation model developed from the Imagen image model; through a combination of multiple diffusion models, it generates an initial video from the text prompt and then progressively raises the resolution and frame count to produce the final video;
model advantages and disadvantages: the generated videos have high fidelity, controllability, and world knowledge; the model supports generating diverse videos and text animations in various artistic styles and shows an understanding of 3D objects; however, the parallel training mode adopted by the cascaded models demands considerable computational resources;
Gen:
the realization principle is as follows: the Gen model learns text-image features through a latent diffusion model and generates a new video from a given text prompt or reference image, or performs multi-task video style conversion from an original video and a driving image;
model advantages and disadvantages: the model performs well in video rendering and style conversion, the generated videos show strong artistry and strong preservation of image structure, and the model adapts well to customization requirements; however, the Gen model is still limited in the stability of its generated results;
CogVideo:
the realization principle is as follows: CogVideo is a large-scale text-to-video generation model based on an autoregressive method; it applies the image generation model CogView2 to text-to-video generation to achieve efficient learning, and generates video in a recursive manner by predicting each frame from the previous frames and continuously splicing them together;
model advantages and disadvantages: the model supports Chinese prompts, and its multi-frame-rate hierarchical training method better captures the relation between text and video, so the generated video looks more natural; however, the model limits the length of the input sequence.
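The recursive frame-by-frame generation described above can be sketched as an autoregressive rollout. Here `predict_next` is a placeholder for the real frame-prediction model; the toy lambda used below is purely illustrative.

```python
def generate_video(first_frame, predict_next, num_frames):
    """Autoregressive rollout: each new frame is predicted from all frames
    generated so far, appended to the sequence, and fed back in.
    `predict_next` stands in for the real conditional frame predictor."""
    frames = [first_frame]
    while len(frames) < num_frames:
        frames.append(predict_next(frames))
    return frames

# Toy stand-in "model": the next frame is the previous frame plus one.
video = generate_video(0, lambda fs: fs[-1] + 1, 5)
```

This recursion is also where autoregressive video models accumulate error: each predicted frame conditions all later ones, so mistakes compound with sequence length.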
8. The artificial intelligence multi-modal content generation system for public opinion event coping according to claim 1, wherein in the content quality reliability assessment, natural language processing techniques such as language models and semantic analysis are used to assess the consistency of the generated content; if the generated content is based on a specific knowledge base or database, a knowledge graph or professional knowledge in the related field is used to assess whether the generated content is consistent with the information in the knowledge base; emotion analysis techniques such as an emotion dictionary and an emotion classifier are used to assess whether the emotional tendency expressed in the generated content is reasonable and consistent; and if the generated content contains images or audio, image processing and audio processing techniques may be used to assess their authenticity and reliability, for example by detecting whether an image has been edited or synthesized, or by analyzing the sound characteristics of the audio for verification.
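The emotion-dictionary assessment mentioned in the claim can be illustrated with a minimal polarity-summing scorer. The lexicon entries below are hypothetical toy values, not a real emotion dictionary; a production system would use a full lexicon or a trained emotion classifier.

```python
def lexicon_sentiment(tokens, lexicon):
    """Emotion-dictionary scoring: sum the per-word polarity values of the
    tokens; the sign of the total gives the overall emotional tendency."""
    score = sum(lexicon.get(t, 0) for t in tokens)
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score

# Hypothetical toy lexicon with signed polarity weights.
lexicon = {"calm": 1, "resolved": 2, "panic": -2, "rumor": -1}
label, s = lexicon_sentiment("the event was calm and resolved".split(), lexicon)
```

Such a scorer could flag generated response text whose emotional tendency conflicts with the intended tone before publication.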
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311282149.9A CN117349427A (en) | 2023-10-07 | 2023-10-07 | Artificial intelligence multi-mode content generation system for public opinion event coping |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117349427A true CN117349427A (en) | 2024-01-05 |
Family
ID=89358750
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117708746A (en) * | 2024-02-04 | 2024-03-15 | 北京长河数智科技有限责任公司 | Risk prediction method based on multi-mode data fusion |
CN117994610A (en) * | 2024-04-03 | 2024-05-07 | 江西虔安电子科技有限公司 | Chart generation method and system |
CN117994610B (en) * | 2024-04-03 | 2024-06-04 | 江西虔安电子科技有限公司 | Chart generation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||