CN117475086A - Scientific literature drawing generation method and system based on diffusion model - Google Patents

Scientific literature drawing generation method and system based on diffusion model

Info

Publication number
CN117475086A
CN117475086A CN202311773821.4A
Authority
CN
China
Prior art keywords
text
model
training
picture
literature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311773821.4A
Other languages
Chinese (zh)
Inventor
尤元岳
杜寅辰
仓浩
徐青伟
严长春
裴非
范娥媚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhiguagua Tianjin Big Data Technology Co ltd
Original Assignee
Zhiguagua Tianjin Big Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhiguagua Tianjin Big Data Technology Co ltd filed Critical Zhiguagua Tianjin Big Data Technology Co ltd
Priority to CN202311773821.4A priority Critical patent/CN117475086A/en
Publication of CN117475086A publication Critical patent/CN117475086A/en
Pending legal-status Critical Current

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a scientific literature drawing generation method and system based on a diffusion model. The method comprises: obtaining picture text descriptions and corresponding pictures in target documents and forming training data pairs; training the diffusion model with the training data; and finally, extracting each component and the relations between components from the picture description text in the training data and fusing the extracted components and relations into the picture generation process. The method can understand the content of a drawing description and generate a matching drawing, which helps improve the efficiency with which researchers retrieve, read, and analyze documents and helps them better express and present research results.

Description

Scientific literature drawing generation method and system based on diffusion model
Technical Field
The application relates to the technical field of multimodal text generated images, in particular to a scientific literature drawing generation method and system based on a diffusion model.
Background
In recent years, with the rapid development of technology, a large number of scientific papers and patents have emerged in the field of scientific research. However, efficient retrieval, reading, analysis, and understanding of these documents, as well as accurate presentation of research results, remain a challenge. In addition, drafting a technical drawing by hand costs a technician a great deal of time.
Existing text-to-image models such as diffusion models can generate rough images from text, but the images may lose specific components. For example, when generating a mechanical diagram from the description "a screw with a nut is placed on a wooden table", using the text as an instruction for the diffusion model may yield a picture with certain components missing; the screw may not be generated at all. This situation becomes more pronounced as the picture description text grows more complex.
Disclosure of Invention
The application provides a scientific literature drawing generation method and a scientific literature drawing generation system based on a diffusion model, which can greatly reduce the situation that the model loses components in the diffusion generation process.
In a first aspect, a method for generating scientific literature drawings based on a diffusion model includes S1 data processing, S2 scientific-field text-to-image diffusion model training, and S3 component relation extraction and picture generation, specifically:
S1, data processing: obtaining picture text descriptions and corresponding pictures in target documents and forming training data pairs, where the picture text description serves as the input for model training and the corresponding picture serves as the output;
S2, scientific-field text-to-image diffusion model training: constructing a text-to-image diffusion model and training it on the training data pairs formed by data processing;
S3, component relation extraction and picture generation: performing syntactic analysis on the picture text descriptions in the training data pairs and extracting each component and the relations between components; forming a text vector based on the extracted components and their relations, adjusting the attention matrix of the trained text-to-image diffusion model, and taking the adjusted model as the target scientific literature drawing generation model.
Optionally, the S1 data processing specifically includes:
downloading a portion of the relevant literature from a public scientific and technological literature database;
extracting the picture text description information of each drawing from the scientific literature using natural language processing techniques, which at least include word segmentation, part-of-speech tagging, named entity recognition, and regular expressions;
extracting the drawing matching the picture text description information from the literature using image detection techniques, which at least include object detection and semantic segmentation.
Optionally, in the S2 scientific-field text-to-image diffusion model training, constructing the text-to-image diffusion model specifically includes fine-tuning a multimodal CLIP model, building a text-to-image generation model based on a diffusion model, and then fine-tuning it in the vertical field of scientific and technological literature.
Optionally, fine-tuning the multimodal CLIP model specifically includes:
loading a data set of scientific and technical literature in batches; wherein, the data set comprises drawings of scientific literature and corresponding drawing descriptions;
inputting the drawing of the scientific and technological literature into an image encoder of the CLIP model to obtain image coding characteristics;
inputting the attached drawing description of the technical literature into a text encoder of the CLIP model to obtain text coding characteristics;
calculating cosine similarity loss between the image coding features and the text coding features;
adjusting with the goal of maximizing cosine similarity between positive samples and minimizing it between negative samples; matched drawing-description/drawing sample pairs are used as positive samples for training, and unmatched pairs as negative samples.
Optionally, building the diffusion-model-based text-to-image model after fine-tuning the multimodal CLIP model specifically comprises:
loading an open-source image-text pair data set, randomly sampling data from it, and first compressing each image into the latent space with the encoder module of a pre-trained VAE model to obtain the image vector representation;
then encoding with the text encoder module of the fine-tuned CLIP model to obtain the corresponding matched text vector;
forward training is performed by the set number of sampling steps until convergence.
Optionally, forward training is performed for the set number of sampling steps until convergence, and the optimization target is specifically:

min_θ E[ ‖ε − ε_θ(√ᾱ_t · M_0 + √(1 − ᾱ_t) · ε, t, T_0)‖² ]

where t is the sampled step, ᾱ_t belongs to a set of schedule hyper-parameters, T_0 is the text vector obtained with the Text Embedding of the pre-trained CLIP, ε is the noise sampled from the standard normal distribution N, M_0 is the image vector representation, and ε_θ predicts the noise of each step and is fitted with a U-Net network.
Optionally, the S3 component relation extraction and picture generation specifically includes:
performing syntactic analysis on the picture text description in the training data pair with a text analyzer, and extracting noun phrases and the dependency relations among them;
encoding the extracted noun phrases into vectors with the CLIP text encoder, and encoding the whole drawing description into a vector with the CLIP text encoder;
aligning the vectors corresponding to the extracted noun phrases with the vector of the whole drawing description to obtain a new text vector;
sending the new text vector to the cross-attention layer for attention calculation.
In a second aspect, a system for generating scientific literature drawings based on a diffusion model comprises a data processing module, a scientific-field text-to-image diffusion model training module, and a component relation extraction and picture generation module, specifically:
the data processing module is used for acquiring the picture text descriptions and corresponding pictures in target documents and forming training data pairs, where the picture text description serves as the input for model training and the corresponding picture serves as the output;
the scientific-field text-to-image diffusion model training module is used for constructing the text-to-image diffusion model and training it on the training data pairs formed by data processing;
the component relation extraction and picture generation module is used for performing syntactic analysis on the picture text descriptions in the training data pairs and extracting each component and the relations between components; forming a text vector based on the extracted components and their relations, adjusting the attention matrix of the trained text-to-image diffusion model, and taking the adjusted model as the target scientific literature drawing generation model.
In a third aspect, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the diffusion model-based scientific literature drawing generation method according to any one of the first aspects when the processor executes the computer program.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the diffusion model-based scientific literature drawing generation method according to any one of the above first aspects.
Compared with the prior art, the application has the following beneficial effects:
the method combines natural language processing and image processing technology, can automatically analyze abstract content of scientific and technological literature, generates corresponding high-quality drawings according to the abstract content, and provides richer and more diversified abstract for the scientific and technological literature, thereby not only improving efficiency of searching and reading analysis literature for scientific and technological researchers, but also helping the scientific and technological researchers to express and present research results better. In addition, the method also supports the generation of literature abstract drawings of multiple languages and fine adjustment in different specific vertical fields, and has good applicability and universality. The method is mainly applied to the generation of scenes of the multi-mode scientific literature attached drawings.
In abstract drawing generation for scientific and technological literature, a target abstract drawing with correct content, clear structure, and high definition can be generated from the prompt text.
In addition, compared with existing pixel-based diffusion generation models, the sampling and inference time is greatly reduced while generation quality is preserved, so the system can produce drawings faster. Meanwhile, the loss of components during the diffusion generation process is greatly reduced, ensuring the quality and integrity of the generated figures.
Drawings
FIG. 1 is an overall flow chart provided in an embodiment of the present application;
FIG. 2 is a detailed model block diagram of the CLIP model;
FIG. 3 is a view showing a specific model construction of a diffusion model in the present embodiment;
FIG. 4 is a diagram illustrating an example of a dependency tree extracted from "a screw with a nut placed on a wooden table";
FIG. 5 is a flowchart illustrating the extraction of a dependency tree generation graph;
FIG. 6 is a diagram of a prior art diffusion model;
FIG. 7 is a drawing generated by the drawing generation model of the target scientific literature obtained in the present embodiment;
FIG. 8 is a block diagram of a module architecture of a diffusion model-based scientific literature drawing generation system according to one embodiment of the present application;
fig. 9 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In the description of the present application: the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements but may include other steps or elements not expressly listed but inherent to such process, method, article, or apparatus or steps or elements added to further optimization schemes based on the inventive concepts.
Existing text-to-image models such as diffusion models can generate rough images from text, but the images may lose specific components. For example, when generating a mechanical diagram from the description "a screw with a nut is placed on a wooden table", using the text as an instruction for the diffusion model may yield a picture with certain components missing; the screw may not be generated at all, and this becomes more pronounced as the picture description grows more complex. Therefore, addressing these deficiencies of the prior art, the invention provides a diffusion model generation scheme that pays more attention to components and the associations between them: it extracts each component in the text description and the associations among the components, and controls the diffusion model to generate the image based on these features. The system combines natural language processing and image processing technology, can automatically analyze picture description texts, and generates the corresponding drawings from the description content, improving the efficiency with which researchers retrieve, read, and analyze documents and helping them better present research results.
The invention relates to a scientific literature drawing generation method and system based on a diffusion model, which can generate a matching abstract drawing from the picture descriptions (hereinafter also called drawing descriptions) extracted from scientific literature, and controls the diffusion model to generate images by extracting components and the connections among them. This can improve the efficiency with which researchers retrieve, read, and analyze literature, help them better express and present research results, and save drafting time. The method is mainly applied to multimodal scientific literature drawing generation scenarios.
In one embodiment, as shown in fig. 1, a diffusion model-based scientific literature drawing generation method is provided, and the method can be applied to a server, and comprises the steps of S1 data processing, S2 scientific field text-to-drawing diffusion model training, S3 component relation extraction and picture generation, wherein the method specifically comprises the following steps:
s1, data processing is carried out, picture text description and corresponding pictures in the target document are obtained, and training data pairs are formed.
The picture text description is the input for model training, and the corresponding picture is the output. Firstly, data processing is performed. The main content of this process is collecting data in the scientific and technological field (scientific literature or patents) and extracting the picture descriptions and corresponding pictures from them. These text contents and corresponding pictures serve as training data for the next-stage diffusion model training: the text content is the input of the text-to-image diffusion model, and the corresponding picture is its expected output.
In this step, mainly, a text description of a picture and a corresponding picture are obtained as an input/output training data pair. And the data of the part is used as the data of the next stage of the text-generated graph diffusion model training process. The detailed process of specific data acquisition can be based on the following procedure:
(1) Download a portion of the relevant literature from a public scientific and technological literature database.
(2) Extract the text of each drawing description from the literature using natural language processing techniques, including word segmentation, part-of-speech tagging, named entity recognition, and regular expressions.
(3) Detect the drawing matching each description in the corresponding literature using image detection techniques, including object detection and semantic segmentation.
Based on the above operations, a scientific literature graph-text data set is constructed, which contains the drawing description text of each document (hereinafter "drawing description") and the matching drawing, denoted DataSet = {(Text_i, Image_i) | i = 1, 2, ..., N}, where Text_i is the drawing description of the i-th training sample, Image_i is the drawing of the i-th training sample, i is the index of each training sample, and N is the size of the training set.
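The pairing step above can be sketched in a few lines. The caption regex and the `fig<n>.png` file-naming scheme below are illustrative assumptions, not the patent's actual extraction pipeline:

```python
import re

# Minimal sketch: pair figure-description sentences with figure files.
# The "FIG. n is ..." caption pattern and the fig<n>.png naming scheme
# are illustrative assumptions.
CAPTION_RE = re.compile(r"FIG\.\s*(\d+)\s+is\s+(.+?)(?:;|\.|$)")

def build_dataset(description_text, image_paths):
    """Return [(Text_i, Image_i)] pairs keyed by figure number."""
    captions = {int(n): desc.strip()
                for n, desc in CAPTION_RE.findall(description_text)}
    dataset = []
    for path in image_paths:
        m = re.search(r"fig(\d+)", path)   # e.g. "fig1.png"
        if m and int(m.group(1)) in captions:
            dataset.append((captions[int(m.group(1))], path))
    return dataset

pairs = build_dataset(
    "FIG. 1 is an overall flow chart; FIG. 2 is a detailed model block diagram.",
    ["fig1.png", "fig2.png", "fig9.png"])
```

Figures without a recoverable caption (here `fig9.png`) are simply dropped, so only well-matched (Text_i, Image_i) pairs enter the training set.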
S2, scientific-field text-to-image diffusion model training: constructing the text-to-image diffusion model and training it on the training data pairs formed by data processing.
The scientific-field text-to-image diffusion model training uses the picture description texts and corresponding pictures collected during data processing as training data to train the diffusion model, so that the text-to-image diffusion model achieves a better generation effect in the scientific and technological field. The process includes the noising and denoising stages of the diffusion model.
This step mainly fine-tunes a multimodal CLIP model, builds a text-to-image generation model based on a diffusion model, and finally fine-tunes it in the vertical field of scientific and technological literature. The fine-tuning steps of the multimodal CLIP model are as follows:
(1) Batch-load the data set DataSet = {(Text_i, Image_i)} built by the data processing module.
(2) Input the drawing Image_i into the image encoder of CLIP to obtain the image coding feature I_i.
(3) Input the drawing description Text_i into the text encoder of CLIP to obtain the text coding feature T_i.
(4) Compute the cosine similarity loss between the image features and the text features, where matched description/drawing sample pairs are used as positive samples for training and unmatched pairs as negative samples.
The goal of training is to maximize the cosine similarity between positive samples and minimize the cosine similarity between negative samples:

TrainObject ∝ Cos(T_i_pos, I_i_pos) − Cos(T_i_neg, I_i_neg)

where (T_i_pos, I_i_pos) is a positive text/drawing sample pair and (T_i_neg, I_i_neg) is a negative pair. A specific model structure diagram of the multimodal CLIP is shown in Fig. 2.
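The training target above can be sketched numerically as follows. The random feature matrices stand in for the CLIP encoder outputs T_i and I_i, so this illustrates only the shape of the objective, not a real fine-tuning run:

```python
import numpy as np

# Sketch of the contrastive fine-tuning target: for a batch of matched
# (text, image) feature pairs, maximize cosine similarity on the diagonal
# (positive pairs) and minimize it off the diagonal (negative pairs).
def contrastive_objective(T, I):
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    I = I / np.linalg.norm(I, axis=1, keepdims=True)
    sim = T @ I.T                        # Cos(T_i, I_j) for all pairs
    n = sim.shape[0]
    pos = np.trace(sim) / n              # matched pairs (diagonal)
    neg = (sim.sum() - np.trace(sim)) / (sim.size - n)  # unmatched pairs
    return pos - neg                     # quantity to maximize

rng = np.random.default_rng(0)
T = rng.normal(size=(4, 8))              # stand-in encoder outputs
objective = contrastive_objective(T, T.copy())  # identical features: best case
```

When text and image features coincide, the positive term is exactly 1, so the objective is strictly positive; training pushes real encoder outputs toward this regime.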
After obtaining the fine-tuned multimodal CLIP model, the diffusion model is constructed. The specific modeling steps are as follows:
Load a public open-source image-text pair data set and randomly sample data from it. First compress each image into the latent space with the encoder module of the pre-trained VAE model to obtain the image vector representation M_i = VAE.encoder(Image_i); then encode with the text encoder module of the fine-tuned CLIP to obtain the corresponding matched text vector T_i = CLIP.text_encoder(Text_i).
Given a set number of sampling steps T, the specific forward training procedure is:

repeat:
    sample M_0 from M_i = VAE.encoder(Image_i)
    sample a step t ~ Uniform({1, 2, 3, ..., T})
    sample noise ε ~ N(0, 1) from the standard normal distribution
    optimize the objective
        min_θ ‖ε − ε_θ(√ᾱ_t · M_0 + √(1 − ᾱ_t) · ε, t, T_0)‖²
until convergence.
Here t is the sampled step and ᾱ_t belongs to a set of schedule hyper-parameters. T_0 is the text vector obtained with the Text Embedding of the pre-trained CLIP; the text vector and the latent image features undergo a cross-attention operation (Query, Key, Value) that deeply fuses the text and image information and improves the generation model's understanding of the text. ε is the noise sampled from the standard normal distribution N, and M_0 is the image vector representation. ε_θ predicts the noise of each step and is fitted with a U-Net network; the optimization target is to minimize the loss between the actual noise and the predicted noise of that step, and training is repeated until convergence.
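One training iteration under this objective might look like the following sketch. The cosine noise schedule and the dummy predictor standing in for the U-Net ε_θ are assumptions for illustration only:

```python
import numpy as np

# One forward-diffusion training step: noise the latent M_0 to step t,
# then score a (toy) noise predictor against the true noise.
# The cosine schedule and the lambda "predictor" are illustrative
# stand-ins for the real schedule and U-Net eps_theta.
rng = np.random.default_rng(0)
T_steps = 1000
alpha_bar = np.cos(np.linspace(0, np.pi / 2, T_steps + 1)[1:]) ** 2

def training_step(M0, predictor):
    t = rng.integers(1, T_steps + 1)             # t ~ Uniform({1..T})
    eps = rng.standard_normal(M0.shape)          # eps ~ N(0, 1)
    a = alpha_bar[t - 1]
    Mt = np.sqrt(a) * M0 + np.sqrt(1 - a) * eps  # noised latent
    loss = np.mean((eps - predictor(Mt, t)) ** 2)
    return loss

M0 = rng.standard_normal(16)                     # toy image latent
loss = training_step(M0, lambda Mt, t: np.zeros_like(Mt))  # dummy eps_theta
```

The dummy predictor always outputs zero, so the loss is simply the mean squared noise; a trained U-Net would drive this quantity toward zero.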
Because the training corpus consists of scientific-field drawing descriptions and drawings, the trained text-to-image diffusion model performs better on the scientific-field text-to-image task, and the style of the generated drawings is biased toward scientific literature figures. Note that the trained diffusion model can already diffuse input text into a corresponding picture, but when the input text contains many components with complex relations, component loss or misaligned component relations frequently occur. Fig. 3 gives a specific model structure diagram of the diffusion model.
S3, extracting the component relation and generating the picture, carrying out syntactic analysis on the picture text description in the training data pair, and extracting each component and the component relation in the picture text description; and forming a text vector based on the extracted components and the relation among the components, adjusting an attention matrix of the trained literature graph diffusion model, and taking the adjusted literature graph diffusion model as a target scientific literature graph generation model.
In component relation extraction and picture generation, each component and the relations between components are extracted from the picture description text in the training data, and the extracted components and relations are fused into the picture generation process as control signals, so that the diffusion model pays more attention to each component and its relations. This produces a more accurate picture and reduces the probability of losing components during diffusion.
To solve the problem that, when the drawing description is complex, some components are lost in the images generated by the trained diffusion model, this part extracts the dependency relations in an input drawing description and generates an image that satisfies the description without losing component information.
Firstly, the drawing description is syntactically analyzed by a text analyzer ξ; the noun phrases (NPs) in the description and the dependency relations among them are extracted; the extracted noun phrases are encoded into vectors by the CLIP text encoder; and the whole drawing description is likewise encoded into a vector by the CLIP text encoder. The vectors of the extracted noun phrases are aligned with the vector of the whole drawing description to obtain a new text vector, which is sent to the cross-attention layer for attention calculation. The final attention therefore attends more strongly to the noun phrases, so they are not forgotten and the situation where a component fails to be generated in the CLIP image decoder stage is avoided. In other words, this process optimizes the denoising process described above, and the U-Net denoising network of the previous step can be completely replaced by this step. Fig. 4 shows an example of the dependency tree extracted from "a screw with a nut is placed on a wooden table".
Step 1: dependency extraction and encoding.
In this example, assume a drawing description reads "a screw with a nut is placed on a wooden table"; this description is the prompt. Word segmentation and syntactic analysis are first applied to extract the dependency relations between the words of the text; the resulting dependency tree is the one shown in Fig. 4.
To pay more attention to each component, the NPs are extracted from the obtained dependency tree. In this example the extracted NPs are "nut", "screw", "wooden table", and "screw with a nut", recorded as C = {C1, C2, ..., Ck}, where each Ci is an NP and C is the set of NPs. Here C = {C1, C2, C3, C4}, with C1 = nut, C2 = screw, C3 = wooden table, and C4 = screw with a nut.
Each noun phrase is then encoded into a vector with the CLIP text encoder: W_i = CLIPtext(C_i), i = 1, ..., k, where W_i is the vector produced by the CLIP text encoder and CLIPtext denotes that encoder. The whole sentence is also encoded by the CLIP text encoder, denoted Wp = CLIPtext(prompt). After encoding the whole drawing description, W = [Wp, W1, W2, ..., Wk] is obtained.
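Step 1 can be sketched as follows. A hash-seeded toy encoder replaces CLIPtext, and the noun-phrase list is written out by hand rather than extracted from a parsed dependency tree:

```python
import numpy as np

# Sketch of Step 1: encode the prompt and its noun phrases into
# W = [Wp, W1, ..., Wk]. toy_encode is a deterministic stand-in for
# the CLIP text encoder; the NP list is hand-written, not parsed.
def toy_encode(text, dim=8):
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(dim)

prompt = "a screw with a nut is placed on a wooden table"
noun_phrases = ["nut", "screw", "wooden table", "screw with a nut"]

Wp = toy_encode(prompt)
W = [Wp] + [toy_encode(np_) for np_ in noun_phrases]
```

The resulting list has one whole-sentence vector followed by one vector per extracted NP, matching the layout W = [Wp, W1, W2, ..., Wk] used below.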
Step 2: noun phrase vectors are realigned with figure description vectors:
Each Wi is realigned with Wp. After W = [Wp, W1, W2, ..., Wk] is obtained, every noun-phrase vector Wi is realigned with the whole-sentence text vector Wp: at the positions in the figure description occupied by the original noun phrase, the corresponding vectors are replaced with that phrase's vector. After this alignment and vector replacement, a new text code W̃ is obtained.
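A minimal sketch of the realignment in Step 2, assuming token-level prompt embeddings and known token spans for each noun phrase (the spans and dimensions below are illustrative assumptions, not taken from the document):

```python
import numpy as np

def realign(token_embeds: np.ndarray, np_spans, np_vectors) -> np.ndarray:
    """Replace the token positions occupied by each noun phrase with that
    phrase's own encoding, yielding the new text code described above."""
    new_code = token_embeds.copy()
    for (start, end), vec in zip(np_spans, np_vectors):
        new_code[start:end] = vec  # broadcast the NP vector over its token span
    return new_code

rng = np.random.default_rng(0)
tokens = "a screw with a nut is placed on a wooden table".split()
embeds = rng.standard_normal((len(tokens), 8))      # whole-sentence code, one row per token
spans = [(0, 2), (3, 5), (8, 11)]                   # "a screw", "a nut", "a wooden table"
np_vecs = [rng.standard_normal(8) for _ in spans]   # Wi for each NP
W_tilde = realign(embeds, spans, np_vecs)
print(W_tilde.shape)  # (11, 8)
```

The sequence length is preserved, so W̃ can be used wherever the original text code was, which is what allows it to feed the value projection of the cross-attention layer.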
Step 3: attention calculation:
The new text code obtained is fed into a linear layer to produce the values in the attention layer. This process can be written as Vi = fv(W̃).
Here fv(·) denotes the value mapping function, currently a linear layer. Similarly, fq(·) and fk(·) are the query and key mapping functions respectively, each corresponding to one linear layer. Wp is mapped to Kp by fk(·), and the feature map X_t of the previous time step is mapped to Q_t by fq(·). From Kp and Q_t the attention map M_t is obtained, a process denoted fM(·).
After the attention map is acquired, the output of the current time step is obtained by taking its product with the values Vi and accumulating, i.e. O_t = M_t · Vi.
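The attention computation in Step 3 can be sketched as follows; the linear maps fq, fk, fv are random matrices here, an assumption made for illustration in place of the trained linear layers:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_step(X_t, text_code, new_text_code, d_k=8, seed=0):
    """Q_t = fq(X_t), Kp = fk(Wp), Vi = fv(W~);
    M_t = softmax(Q_t Kp^T / sqrt(d_k)); the step output is O_t = M_t Vi."""
    rng = np.random.default_rng(seed)
    fq = rng.standard_normal((X_t.shape[1], d_k))
    fk = rng.standard_normal((text_code.shape[1], d_k))
    fv = rng.standard_normal((new_text_code.shape[1], d_k))
    Q_t, Kp, Vi = X_t @ fq, text_code @ fk, new_text_code @ fv
    M_t = softmax(Q_t @ Kp.T / np.sqrt(d_k))  # attention map
    return M_t @ Vi, M_t                       # O_t and M_t

rng = np.random.default_rng(1)
X_t = rng.standard_normal((4, 8))        # feature map from the previous layer
Wp_tok = rng.standard_normal((11, 8))    # whole-sentence token code (keys)
W_tilde = rng.standard_normal((11, 8))   # realigned text code (values)
O_t, M_t = attention_step(X_t, Wp_tok, W_tilde)
print(O_t.shape, M_t.shape)  # (4, 8) (4, 11)
```

Note the design choice the document describes: the keys come from the original text code while the values come from the realigned code W̃, which is how the noun-phrase information is injected into every step's output.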
The whole diffusion-model generation process is given by the following component-relation extraction and diffusion-model generation algorithm:
Input: figure description prompt, syntactic dependency analyzer ξ, image decoder ψ, trained diffusion model φ
Output: generated image x
Extract the NP set C = {C1, C2, ..., Ck} from prompt with ξ
Encode each Ci and the description prompt: Wi = CLIPtext(Ci), Wp = CLIPtext(prompt)
Denoising and diffusion process:
For t = T, T-1, ..., 1 do
    For each attention layer in diffusion model φ do
        Obtain the output X_t of the previous layer
        Q_t = fq(X_t), Kp = fk(Wp), Vi = fv(W̃)
        Obtain the attention map M_t
        Obtain O_t from M_t and Vi and pass it to the next layer of the diffusion model
    End For
End For
After the T denoising and diffusion steps, z_0 is obtained; note that z_0 is the output O_0 at t = 0, where z denotes the state of each hidden layer of the diffusion model. z_0 is then passed to the image decoder ψ to generate the image x. FIG. 5 gives a flowchart of generating a figure from the extracted dependency tree.
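The loop above, from noise z_T down to z_0, can be sketched end to end. All networks below are random stand-ins for the trained diffusion model φ and its attention layers, so this mirrors only the control flow of the algorithm, not the learned behavior:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def generate_latent(text_code, T=10, d=8, n_latent=4, seed=0):
    """Run T denoising steps: at each step map the current latent to Q_t and
    the text code to Kp and Vi, compute M_t and O_t, and apply a stand-in
    update. Returns z_0, which would be handed to the image decoder psi."""
    rng = np.random.default_rng(seed)
    fq, fk, fv = (rng.standard_normal((d, d)) for _ in range(3))
    z = rng.standard_normal((n_latent, d))        # z_T ~ N(0, I)
    for t in range(T, 0, -1):
        Q_t, Kp, Vi = z @ fq, text_code @ fk, text_code @ fv
        M_t = softmax(Q_t @ Kp.T / np.sqrt(d))    # attention map M_t
        O_t = M_t @ Vi                            # layer output O_t
        z = z - 0.1 * O_t                         # stand-in noise-removal update
    return z

text_code = np.random.default_rng(1).standard_normal((11, 8))  # realigned text code
z0 = generate_latent(text_code)
print(z0.shape)  # (4, 8)
```

In the real system the inner update is performed by every cross-attention layer of the U-Net, and z_0 is decoded by ψ into the output image x.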
FIG. 6 shows a figure generated by an existing diffusion model from the specific text description "a screw with a nut is placed on a wooden table". With this text as the instruction, that diffusion model fails to generate the screw.
As shown in FIG. 7, in the embodiments of the present application, after extracting the dependency relations and adjusting the attention matrix accordingly, the diffusion model generates a figure from the same instruction "a screw with a nut is placed on a wooden table"; compared with FIG. 6, the screw is generated.
In summary, the invention provides a scientific-literature figure generation method based on a diffusion model. The method combines natural language processing and image processing technologies, automatically analyzes the abstract content of a scientific document, and generates corresponding high-quality figures from it, providing richer and more diverse abstracts for scientific documents. This not only improves the efficiency with which researchers retrieve, read, and analyze literature, but also helps researchers better express and present their research results. In addition, the method supports generating literature abstract figures in multiple languages and fine-tuning for different specific vertical domains, so it has good applicability and universality. It is mainly applied to multimodal scientific-literature figure generation scenarios.
In abstract-figure generation for scientific and technical literature, the diffusion model is currently one of the most advanced approaches supporting the text-to-image task. Its strengths lie in the strong understanding of text semantics provided by the multimodal pre-trained Open-CLIP model and the strong figure generation capability provided by the variational autoencoder; given a prompt text, it can generate a target abstract figure with correct content, clear structure, and high definition.
In addition, the method performs diffusion over latent feature vectors; compared with pixel-based diffusion generation models, this greatly reduces the time of sampling and inference and improves the speed at which the system generates figures. Meanwhile, by extracting the dependency relations of the description text and text-encoding the noun phrases within it, the diffusion model pays more attention to the noun phrases and their corresponding components during diffusion, which greatly reduces the loss of components during generation and ensures the quality and completeness of the generated figures.
In general, the technical method provided by the present application can understand the content of figure descriptions and generate matching figures, thereby helping to improve the efficiency with which researchers retrieve, read, and analyze literature and helping researchers better express and present their research results.
In one embodiment, as shown in FIG. 8, a scientific-literature figure generation system based on a diffusion model is provided. The system includes a data processing module, a scientific-field text-to-image diffusion model training module, and a component-relation extraction and picture generation module. Specifically:
the data processing module is used to acquire the picture text descriptions and corresponding pictures in the target documents and form training data pairs, where the picture text description is the input used for model training and the corresponding picture is the output used for model training;
the scientific-field text-to-image diffusion model training module is used to construct the text-to-image diffusion model and train it on the training data pairs formed by the data processing module;
the component-relation extraction and picture generation module is used to parse the picture text descriptions in the training data pairs and extract the components and component relations in them, form a text vector based on the extracted components and their relations, adjust the attention matrix of the trained text-to-image diffusion model, and take the adjusted model as the target scientific-literature figure generation model.
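The three modules above might fit together as in the following skeletal sketch; the class and method names are hypothetical, and the bodies are placeholders for the processing described above:

```python
from dataclasses import dataclass

@dataclass
class TrainingPair:
    """One training sample: input figure description, output figure image path."""
    description: str
    image_path: str

class DataProcessingModule:
    """Acquires figure descriptions and figures from target documents."""
    def build_pairs(self, records):
        return [TrainingPair(desc, path) for desc, path in records]

class TextToImageTrainingModule:
    """Constructs the text-to-image diffusion model and trains it on the pairs."""
    def train(self, pairs):
        self.trained = bool(pairs)  # placeholder for the actual fine-tuning loop
        return self

class ComponentExtractionModule:
    """Parses descriptions and extracts the components used to adjust attention."""
    STOPWORDS = {"a", "an", "the", "is", "on", "with"}
    def extract_components(self, description):
        # placeholder: a real implementation runs a syntactic dependency parser
        return [w for w in description.split() if w not in self.STOPWORDS]

pairs = DataProcessingModule().build_pairs(
    [("a screw with a nut is placed on a wooden table", "fig4.png")]
)
model = TextToImageTrainingModule().train(pairs)
parts = ComponentExtractionModule().extract_components(pairs[0].description)
print(parts)  # ['screw', 'nut', 'placed', 'wooden', 'table']
```

The point of the sketch is the data flow: the data processing module feeds pairs to the training module, and the extraction module operates on the same descriptions to steer the trained model's attention.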
For the specific implementation of each module, reference may be made to the above limitations of the diffusion-model-based scientific-literature figure generation method, which are not repeated here.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor provides computing and control capability, the network interface communicates with external terminals over a network, and the computer device loads and runs a computer program to implement the scientific-literature figure generation method of the present application.
It will be appreciated by those skilled in the art that the structure shown in FIG. 9 is merely a block diagram of part of the structure associated with the present application and does not limit the computer devices to which the present application applies; a particular computer device may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.
In an embodiment, a computer-readable storage medium is also provided, on which a computer program is stored; when executed, the program implements all or part of the flow of the methods of the above embodiments.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination is described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this description.

Claims (10)

1. A scientific-literature figure generation method based on a diffusion model, characterized by comprising S1 data processing, S2 scientific-field text-to-image diffusion model training, and S3 component-relation extraction and picture generation, specifically comprising the following steps:
S1, data processing: acquiring the picture text descriptions and the corresponding pictures in the target documents to form training data pairs, where the picture text description is the input used for model training and the corresponding picture is the output used for model training;
S2, scientific-field text-to-image diffusion model training: constructing the text-to-image diffusion model and training it on the training data pairs formed by the data processing;
S3, component-relation extraction and picture generation: parsing the picture text descriptions in the training data pairs and extracting the components and component relations therein; forming a text vector based on the extracted components and their relations, adjusting the attention matrix of the trained text-to-image diffusion model, and taking the adjusted model as the target scientific-literature figure generation model.
2. The method according to claim 1, wherein the S1 data processing specifically comprises:
using relevant literature downloaded from publicly available scientific and technical literature databases;
extracting the picture text description information of the figures from the scientific literature using natural language processing techniques, the techniques comprising at least word segmentation, part-of-speech tagging, named entity recognition, and regular expressions;
extracting the figures matching the picture text description information from the scientific literature using image detection techniques, the techniques comprising at least object detection and semantic segmentation.
3. The method according to claim 1, wherein in the S2 scientific-field text-to-image diffusion model training, constructing the text-to-image diffusion model specifically comprises constructing a diffusion-based text-to-image model by fine-tuning a multimodal CLIP model and then fine-tuning it in the scientific-literature vertical domain.
4. The method according to claim 3, characterized in that the model construction by fine-tuning the multimodal CLIP model specifically comprises:
loading the dataset of scientific literature in batches, the dataset comprising figures of scientific literature and the corresponding figure descriptions;
inputting the figures of the scientific literature into the image encoder of the CLIP model to obtain image coding features;
inputting the figure descriptions of the scientific literature into the text encoder of the CLIP model to obtain text coding features;
calculating the cosine-similarity loss between the image coding features and the text coding features;
adjusting with the objective of maximizing the cosine similarity between positive samples and minimizing it between negative samples, wherein matched figure-description/figure pairs are taken as positive samples for training and unmatched pairs as negative samples.
5. The method according to claim 3, wherein constructing the diffusion-based text-to-image model by fine-tuning the multimodal CLIP model comprises:
loading an open-source image-text pair dataset and randomly sampling data from it; first compressing each image into the latent space with the encoder module of a pre-trained VAE model to obtain the image vector representation;
then encoding the matching text with the text encoder module of the fine-tuned CLIP model to obtain the corresponding text vector;
performing forward training over the set number of sampling steps until convergence.
6. The method according to claim 5, wherein the forward training is performed over the set number of sampling steps until convergence, with the optimization objective:

$\mathcal{L} = \mathbb{E}_{M_0,\ \epsilon \sim \mathcal{N}(0, I),\ t} \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, M_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t,\ T_0 \right) \right\|^2$

wherein t is the set sampling step and α_t is a set of hyper-parameters, with ᾱ_t their cumulative product; T_0 is the text vector obtained using the Text Embedding of the pre-trained CLIP; ε is the noise sampled from the standard normal distribution N; M_0 is the image vector representation; and ε_θ, which predicts the noise of each step, is fitted with a U-Net network.
7. The method of claim 1, wherein the S3 component relation extraction and picture generation specifically includes:
parsing the picture text descriptions in the training data pairs with a text analyzer and extracting the noun phrases and the dependency relations among them;
encoding the extracted noun phrases into vectors with the CLIP text encoder, and encoding the whole figure description into a vector with the CLIP text encoder;
aligning the vectors of the extracted noun phrases with the vector of the whole figure description to obtain a new text vector;
sending the new text vector to the cross-attention layers for attention calculation.
8. A scientific-literature figure generation system based on a diffusion model, characterized in that the system includes a data processing module, a scientific-field text-to-image diffusion model training module, and a component-relation extraction and picture generation module, specifically comprising:
the data processing module is used to acquire the picture text descriptions and corresponding pictures in the target documents and form training data pairs, where the picture text description is the input used for model training and the corresponding picture is the output used for model training;
the scientific-field text-to-image diffusion model training module is used to construct the text-to-image diffusion model and train it on the training data pairs formed by the data processing module;
the component-relation extraction and picture generation module is used to parse the picture text descriptions in the training data pairs and extract the components and component relations in them, form a text vector based on the extracted components and their relations, adjust the attention matrix of the trained text-to-image diffusion model, and take the adjusted model as the target scientific-literature figure generation model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202311773821.4A 2023-12-22 2023-12-22 Scientific literature drawing generation method and system based on diffusion model Pending CN117475086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311773821.4A CN117475086A (en) 2023-12-22 2023-12-22 Scientific literature drawing generation method and system based on diffusion model


Publications (1)

Publication Number Publication Date
CN117475086A true CN117475086A (en) 2024-01-30

Family

ID=89634951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311773821.4A Pending CN117475086A (en) 2023-12-22 2023-12-22 Scientific literature drawing generation method and system based on diffusion model

Country Status (1)

Country Link
CN (1) CN117475086A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581953A (en) * 2019-01-30 2020-08-25 武汉慧人信息科技有限公司 Method for automatically analyzing grammar phenomenon of English text
CN111897970A (en) * 2020-07-27 2020-11-06 平安科技(深圳)有限公司 Text comparison method, device and equipment based on knowledge graph and storage medium
CN113268591A (en) * 2021-04-17 2021-08-17 中国人民解放军战略支援部队信息工程大学 Air target intention evidence judging method and system based on affair atlas
CN114970513A (en) * 2022-04-22 2022-08-30 武汉轻工大学 Image generation method, device, equipment and storage medium
CN116051668A (en) * 2022-12-30 2023-05-02 北京百度网讯科技有限公司 Training method of diffusion model of draft map and image generation method based on text
CN116168411A (en) * 2022-12-30 2023-05-26 企知道科技有限公司 Patent intelligent drawing generation method and system
CN116935169A (en) * 2023-09-13 2023-10-24 腾讯科技(深圳)有限公司 Training method for draft graph model and draft graph method
CN117151098A (en) * 2023-06-09 2023-12-01 阳光保险集团股份有限公司 Relation extraction method and device and electronic equipment
CN117252957A (en) * 2023-09-14 2023-12-19 上海焕泽信息技术有限公司 Method, device and storage medium for generating picture with accurate text according to text description


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALEC RADFORD ET AL.: "Learning Transferable Visual Models From Natural Language Supervision", 《ARXIV:2103.00020V1 [CS.CV]》, 26 February 2021 (2021-02-26), pages 1 *
DEEPHUB: "Diffusion 和Stable Diffusion的数学和工作原理详细解释", 《知乎》, 2 May 2023 (2023-05-02), pages 10 *

Similar Documents

Publication Publication Date Title
WO2019118256A1 (en) Generation of text from structured data
CN112199501A (en) Scientific and technological information text classification method
CN110866129A (en) Cross-media retrieval method based on cross-media uniform characterization model
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN111859950A (en) Method for automatically generating lecture notes
CN115775349A (en) False news detection method and device based on multi-mode fusion
Liang et al. Adapting language-audio models as few-shot audio learners
Hafeth et al. Semantic representations with attention networks for boosting image captioning
WO2022087688A1 (en) System and method for text mining
CN116561325B (en) Multi-language fused media text emotion analysis method
CN115827815B (en) Keyword extraction method and device based on small sample learning
CN117034921A (en) Prompt learning training method, device and medium based on user data
CN116682110A (en) Image processing method, device, equipment and medium
CN117475086A (en) Scientific literature drawing generation method and system based on diffusion model
CN116204622A (en) Query expression enhancement method in cross-language dense retrieval
CN115292533A (en) Cross-modal pedestrian retrieval method driven by visual positioning
Pa et al. Improving Myanmar Image Caption Generation Using NASNetLarge and Bi-directional LSTM
Hu et al. Dual-spatial normalized transformer for image captioning
Xie et al. Enhancing multimodal deep representation learning by fixed model reuse
CN116383339A (en) Method and device for structuring energy text data based on relation extraction
Abdal Hafeth et al. Semantic Representations with Attention Networks for Boosting Image Captioning
Mei et al. An External Denoising Framework for Magnetic Resonance Imaging: Leveraging Anatomical Similarities Across Subjects with Fast Searches
CN114996448A (en) Text classification method and device based on artificial intelligence, terminal equipment and medium
CN115982324A (en) Purchase file inspection method based on improved natural language processing
CN118093783A (en) Semantic retrieval method and device based on natural language expression learning and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination