CN118570566A - A method and device for automatically generating Chinese image-text pair data based on LVLM - Google Patents
Info
- Publication number
- CN118570566A (application CN202411052852.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- chinese
- module
- image
- lvlm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/765—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Description
Technical Field
The present invention relates to the field of large language models, and in particular to a method and device for automatically generating Chinese image-text pair data based on an LVLM (large vision-language model).
Background Art
In the field of artificial intelligence, multimodal data fusion, especially the combination of images and text, has become a research hotspot. Such data not only enriches the input dimensions of machine learning models, but has also driven rapid progress of deep learning in applications such as visual question answering, image caption generation, and visual grounding. Chinese image-text pair datasets, as an important bridge between vision and semantics, are crucial to advancing multimodal understanding and generation research in the Chinese-language context.
Although many recent works have been devoted to creating Chinese image-text pair datasets, these datasets often suffer from the following limitations: (1) compared with the English domain, high-quality, large-scale Chinese image-text pair datasets are scarce, which limits model training effectiveness and generalization; (2) in existing datasets, some image-text pairs are poorly matched, and the text description may not correspond to the image content, causing models to learn incorrect associations; (3) the images and texts in these datasets are often concentrated in a few specific domains and lack breadth and representativeness, which harms models' comprehensive understanding and generation capabilities.
In view of these limitations, automated Chinese image-text pair data generation and cleaning technology has emerged. It aims to automatically mine, match, and clean Chinese image-text pair data from massive Internet resources by algorithm, so as to build a large-scale, high-quality, diverse, and continuously updated dataset.
Summary of the Invention
The present invention provides a method and device for automatically generating Chinese image-text pair data based on an LVLM, so as to solve the problem that existing Chinese image-text pair data is of low quality and small in quantity.
In a first aspect, the present invention provides a method for automatically generating Chinese image-text pair data based on an LVLM (CLIP together with a large vision-language model), applicable to the case where only an image is available, which specifically includes the following steps:
Step S1: acquire an image D1 and generate an instruction prompt-1.
Step S2: input D1 and prompt-1 into the LVLM to form a text T1.
Step S3: apply truncated-sample filtering to T1 to form a text T2.
Step S4: apply repetition filtering to T2 to form a text T3.
Step S5: compute the similarity S-1 between D1 and T3 with a Chinese CLIP model; if S-1 is below a threshold γ, delete D1; otherwise, retain D1.
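The threshold rule of step S5 can be sketched as follows. Here `sim_fn` is a placeholder for the Chinese CLIP image-text similarity (in practice it would encode both modalities and take the cosine similarity of the embeddings); the function name and default γ are illustrative assumptions, not fixed by the patent:

```python
def filter_by_similarity(pairs, sim_fn, gamma=0.2):
    """Keep only (image, text) pairs whose similarity reaches the threshold γ.

    sim_fn(image, text) is assumed to return a score in [0, 1]; in practice it
    would wrap a Chinese CLIP model.
    """
    kept = []
    for image, text in pairs:
        if sim_fn(image, text) < gamma:
            continue  # step S5: similarity below γ -> delete the image
        kept.append((image, text))
    return kept
```

With a stub similarity function the rule is easy to exercise: pairs scoring below γ are dropped and the rest are retained unchanged.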
Preferably, in step S1, prompt-1 is used to guide the LVLM to generate a specific type of response or perform a specific task.
Preferably, in step S3, the truncated-sample filtering of T1 to form T2 specifically includes the following steps:
Step S301: compare the character length LT1 of T1 with the character threshold max_new_tokens; if LT1 is less than max_new_tokens, set T2 to T1 and proceed to step S4; otherwise, truncate the characters of T1 beyond max_new_tokens to form T1-truncate.
Step S302: check whether the last character of T1-truncate, character-1, is a Chinese full stop; if it is, set T2 to T1-truncate; otherwise, delete the content after the last Chinese full stop in T1-truncate to form T2.
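Steps S301-S302 amount to the following length-and-sentence-boundary rule. This is a minimal sketch; it treats max_new_tokens as a character count, matching the step's comparison against the character length LT1:

```python
def truncate_filter(text, max_new_tokens=512):
    # Step S301: within the length budget -> pass through unchanged
    if len(text) < max_new_tokens:
        return text
    truncated = text[:max_new_tokens]
    # Step S302: if the cut already ends on a Chinese full stop, keep it as-is
    if truncated.endswith("。"):
        return truncated
    # Otherwise drop everything after the last complete sentence
    idx = truncated.rfind("。")
    return truncated[:idx + 1] if idx != -1 else ""
```

For example, truncating a caption mid-sentence leaves only the complete sentences before the cut; a caption with no full stop at all is discarded entirely under this sketch.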
Preferably, in step S4, the repetition filtering of T2 to form T3 specifically includes the following steps:
Step S401: split T2 into L sentences to form the text T3.
Step S402: deduplicate T3 to obtain L' sentences.
Step S403: if L' < L, delete D1; otherwise, proceed to step S5.
Preferably, in step S402, the deduplication is implemented by a set operation on the L sentences.
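The set-operation deduplication of steps S401-S403 can be sketched as below. Splitting on the Chinese full stop is an illustrative choice; the patent does not fix the sentence segmentation rule:

```python
import re

def repetition_filter(text):
    # Step S401: split the text into sentences, keeping the trailing 。
    sentences = [s for s in re.split(r"(?<=。)", text) if s]
    # Step S402: deduplicate via a set operation
    unique = set(sentences)
    # Step S403: any duplicated sentence marks the sample for deletion
    if len(unique) < len(sentences):
        return None  # caller deletes the image-text pair
    return text
```

A text that merely repeats the same sentence (a common LVLM degeneration mode) is rejected, while a text whose sentences are all distinct passes through unchanged.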
Preferably, when computing resources are insufficient, step S5 is omitted; the image D1 is retained directly and the text T3 is output.
In a second aspect, the present invention provides a method for automatically generating Chinese image-text pair data based on an LVLM, applicable to the case where an image and an English description of that image are available, which specifically includes the following steps:
Step A1: acquire an image D2 and its English description D2-description.
Step A2: compute the similarity S-2 between D2 and D2-description with a Chinese CLIP model; if S-2 is below the threshold γ, delete D2; otherwise, retain D2.
Step A3: generate an instruction prompt-2, input prompt-2 and D2-description into the LVLM, and translate D2-description into Chinese to form a text T4.
Step A4: apply truncated-sample filtering to T4 to form a text T5.
Step A5: apply repetition filtering to T5 to form a text T6.
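The second-aspect flow (A1-A5) can be composed end to end as below. `lvlm_translate` and `sim_fn` are hypothetical callables standing in for the LVLM translation call (prompt-2 plus D2-description) and the Chinese CLIP similarity; the default γ and length budget are illustrative assumptions:

```python
import re

def translate_pipeline(image, description_en, lvlm_translate, sim_fn,
                       gamma=0.2, max_new_tokens=512):
    """Sketch of steps A1-A5 under assumed lvlm_translate / sim_fn callables."""
    # A2: drop pairs whose image/English-caption similarity is below γ
    if sim_fn(image, description_en) < gamma:
        return None
    # A3: ask the LVLM to translate the English caption into Chinese -> T4
    t4 = lvlm_translate(description_en)
    # A4: truncated-sample filtering -> T5
    if len(t4) < max_new_tokens:
        t5 = t4
    else:
        t5 = t4[:max_new_tokens]
        if not t5.endswith("。"):
            idx = t5.rfind("。")
            t5 = t5[:idx + 1] if idx != -1 else ""
    # A5: repetition filtering -> T6 (any duplicated sentence rejects the pair)
    sentences = [s for s in re.split(r"(?<=。)", t5) if s]
    if len(set(sentences)) < len(sentences):
        return None
    return t5
```

With stub callables, a well-matched pair yields its Chinese caption, while a low-similarity pair or a degenerate repeated translation returns None (i.e., the pair is deleted).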
Preferably, in step A4, the truncated-sample filtering of T4 to form T5 specifically includes the following steps:
Step A401: compare the character length LT4 of T4 with the character threshold max_new_tokens; if LT4 is less than max_new_tokens, set T5 to T4 and proceed to step A5; otherwise, truncate the characters of T4 beyond max_new_tokens to form T4-truncate.
Step A402: check whether the last character of T4-truncate, character-2, is a Chinese full stop; if it is, set T5 to T4-truncate; otherwise, delete the content after the last Chinese full stop in T4-truncate to form T5.
Preferably, in step A5, the repetition filtering of T5 to form T6 specifically includes the following steps:
Step A501: split T5 into N sentences to form the text T6.
Step A502: deduplicate T6 to obtain N' sentences.
Step A503: if N' < N, delete D2; otherwise, output the text T6.
Preferably, in step A502, the deduplication is implemented by a set operation on the N sentences.
Preferably, when computing resources are insufficient, step A2 is omitted.
In a third aspect, the present invention provides a device for automatically generating Chinese image-text pair data based on an LVLM, applicable to the case where only an image is available, which specifically includes the following modules:
a first initialization module, configured to acquire an image D1 and generate an instruction prompt-1;
a first text generation module, connected to the first initialization module and configured to input D1 and prompt-1 into the LVLM to form a text T1;
a first truncated-sample filtering module, connected to the first text generation module and configured to apply truncated-sample filtering to T1 to form a text T2;
a first repetition filtering module, connected to the first truncated-sample filtering module and configured to apply repetition filtering to T2 to form a text T3;
a first sample processing module, connected to the first repetition filtering module and configured to compute the similarity S-1 between D1 and T3 with a Chinese CLIP model; if S-1 is below a threshold γ, D1 is deleted; otherwise, D1 is retained.
Preferably, in the first initialization module, prompt-1 is used to guide the LVLM to generate a specific type of response or perform a specific task.
Preferably, the first truncated-sample filtering module specifically includes the following submodules:
a first truncation submodule, configured to compare the character length LT1 of T1 with the character threshold max_new_tokens; if LT1 is less than max_new_tokens, T2 is set to T1 and the first repetition filtering module is executed; otherwise, the characters of T1 beyond max_new_tokens are truncated to form T1-truncate;
a second truncation submodule, connected to the first truncation submodule and configured to check whether the last character of T1-truncate, character-1, is a Chinese full stop; if it is, T2 is set to T1-truncate; otherwise, the content after the last Chinese full stop in T1-truncate is deleted to form T2.
Preferably, the first repetition filtering module specifically includes the following submodules:
a first repetition submodule, configured to split T2 into L sentences to form the text T3;
a second repetition submodule, connected to the first repetition submodule and configured to deduplicate T3 to obtain L' sentences;
a third repetition submodule, connected to the second repetition submodule and configured to delete D1 when L' < L, and otherwise execute the first sample processing module.
Preferably, in the second repetition submodule, the deduplication is implemented by a set operation on the L sentences.
In a fourth aspect, the present invention provides a device for automatically generating Chinese image-text pair data based on an LVLM, applicable to the case where an image and an English description of that image are available, which specifically includes the following modules:
a second initialization module, configured to acquire an image D2 and its English description D2-description;
a second sample processing module, connected to the second initialization module and configured to compute the similarity S-2 between D2 and D2-description with a Chinese CLIP model; if S-2 is below the threshold γ, D2 is deleted; otherwise, D2 is retained;
a second text generation module, connected to the second sample processing module and configured to generate an instruction prompt-2, input prompt-2 and D2-description into the LVLM, and translate D2-description into Chinese to form a text T4;
a second truncated-sample filtering module, connected to the second text generation module and configured to apply truncated-sample filtering to T4 to form a text T5;
a second repetition filtering module, connected to the second truncated-sample filtering module and configured to apply repetition filtering to T5 to form a text T6.
Preferably, the second truncated-sample filtering module specifically includes the following submodules:
a third truncation submodule, configured to compare the character length LT4 of T4 with the character threshold max_new_tokens; if LT4 is less than max_new_tokens, T5 is set to T4 and the second repetition filtering module is executed; otherwise, the characters of T4 beyond max_new_tokens are truncated to form T4-truncate;
a fourth truncation submodule, connected to the third truncation submodule and configured to check whether the last character of T4-truncate, character-2, is a Chinese full stop; if it is, T5 is set to T4-truncate; otherwise, the content after the last Chinese full stop in T4-truncate is deleted to form T5.
Preferably, the second repetition filtering module specifically includes the following submodules:
a fourth repetition submodule, configured to split T5 into N sentences to form the text T6;
a fifth repetition submodule, connected to the fourth repetition submodule and configured to deduplicate T6 to obtain N' sentences;
a sixth repetition submodule, connected to the fifth repetition submodule and configured to delete D2 when N' < N, and otherwise output the text T6.
Preferably, in the fifth repetition submodule, the deduplication is implemented by a set operation on the N sentences.
In another aspect, the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the LVLM-based automatic Chinese image-text pair data generation method described in any one of the first or second aspects of this application.
In another aspect, the present invention further provides an electronic device, including: a memory storing a computer program; and a processor communicatively connected to the memory, which, when the computer program is invoked, executes the LVLM-based automatic Chinese image-text pair data generation method described in any one of the first or second aspects of this application.
Compared with the prior art, the present invention has the following outstanding substantive features and significant advantages:
The present invention provides a method and device for automatically generating Chinese image-text pair data based on an LVLM, which solves the problem that existing Chinese image-text pair data is of low quality and small in quantity. It has the following advantages: (1) the same LVLM is used both to generate Chinese data and to translate English data, simplifying the architecture of the solution; (2) combining generation and translation broadens the sources of Chinese image-text data and increases its diversity; (3) the generated/translated samples undergo multi-stage post-processing, further improving the quality of the Chinese corpus; (4) a CLIP model is used to compare image-text similarity, further improving the quality of the Chinese corpus.
Brief Description of the Drawings
The accompanying drawings, which form part of the present invention, provide a further understanding of the invention. The exemplary embodiments and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of one case of the LVLM-based automatic Chinese image-text pair data generation method according to a preferred embodiment of the present invention.
FIG. 2 is a flowchart of another case of the LVLM-based automatic Chinese image-text pair data generation method according to a preferred embodiment of the present invention.
FIG. 3 is a schematic structural diagram of one case of the LVLM-based automatic Chinese image-text pair data generation device according to a preferred embodiment of the present invention.
FIG. 4 is a schematic structural diagram of another case of the LVLM-based automatic Chinese image-text pair data generation device according to a preferred embodiment of the present invention.
FIG. 5 is a schematic flowchart of the LVLM-based automatic Chinese image-text pair data generation method according to a preferred embodiment of the present invention.
Detailed Description
The present invention provides a method and device for automatically generating Chinese image-text pair data based on an LVLM. To make the purpose, technical solution, and effect of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are only intended to explain the present invention, not to limit it.
It should be noted that the terms "first", "second", etc. in the specification, claims, and drawings of the present invention are used to distinguish similar objects and do not necessarily describe a specific order or sequence; it should be understood that data so used are interchangeable where appropriate. In addition, the terms "including" and "having" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to that process, method, product, or device.
Embodiment
As shown in FIG. 1 and FIG. 5 (left), the LVLM-based automatic Chinese image-text pair data generation method of this embodiment, applicable to the case where only an image is available, specifically includes the following steps:
Step S1: acquire an image D1 and generate an instruction prompt-1, e.g. "请描述这张图" ("Please describe this image").
The prompt-1 is used to guide the LVLM to generate a specific type of response or perform a specific task.
Step S2: input D1 and prompt-1 into the LVLM to form a text T1.
Step S3: apply truncated-sample filtering to T1 to form a text T2.
Optionally, step S3 specifically includes the following steps:
Step S301: compare the character length LT1 of T1 with the character threshold max_new_tokens; if LT1 is less than max_new_tokens, set T2 to T1 and proceed to step S4; otherwise, truncate the characters of T1 beyond max_new_tokens to form T1-truncate.
Step S302: check whether the last character of T1-truncate, character-1, is a Chinese full stop; if it is, set T2 to T1-truncate; otherwise, delete the content after the last Chinese full stop in T1-truncate to form T2.
Step S4: apply repetition filtering to T2 to form a text T3.
Optionally, step S4 specifically includes the following steps:
Step S401: split T2 into L sentences to form the text T3.
Step S402: deduplicate T3 to obtain L' sentences.
Optionally, the deduplication is implemented by a set operation on the L sentences.
Step S403: if L' < L, delete D1; otherwise, proceed to step S5.
Step S5: compute the similarity S-1 between D1 and T3 with a Chinese CLIP model; if S-1 is below the threshold γ, delete D1; otherwise, retain D1.
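The generation path of this embodiment (S1-S5) composes the same filters around an LVLM captioning call. In the sketch below, `lvlm_caption` and `sim_fn` are assumed callables (the captioning call with prompt-1 baked in, and the Chinese CLIP similarity); γ and the length budget are illustrative:

```python
import re

def caption_pipeline(image, lvlm_caption, sim_fn, gamma=0.2, max_new_tokens=512):
    """Sketch of steps S1-S5 under assumed lvlm_caption / sim_fn callables."""
    # S1/S2: caption the image with the LVLM -> T1
    t1 = lvlm_caption(image)
    # S3: truncated-sample filtering -> T2
    if len(t1) < max_new_tokens:
        t2 = t1
    else:
        t2 = t1[:max_new_tokens]
        if not t2.endswith("。"):
            idx = t2.rfind("。")
            t2 = t2[:idx + 1] if idx != -1 else ""
    # S4: repetition filtering -> T3 (any duplicated sentence rejects the sample)
    sentences = [s for s in re.split(r"(?<=。)", t2) if s]
    if len(set(sentences)) < len(sentences):
        return None
    t3 = t2
    # S5: keep the pair only if the image-text similarity reaches γ
    if sim_fn(image, t3) < gamma:
        return None
    return t3
```

A None return corresponds to deleting D1; a string return is the retained Chinese caption T3 for the image-text pair.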
Embodiment
As shown in FIG. 2 and FIG. 5 (right), the LVLM-based automatic Chinese image-text pair data generation method of this embodiment, applicable to the case where an image and an English description of that image are available, specifically includes the following steps:
Step S6: acquire an image D2 and its English description D2-description.
Step S7: compute the similarity S-2 between D2 and D2-description with a Chinese CLIP model; if S-2 is below the threshold γ, delete D2; otherwise, retain D2.
Step S8: generate an instruction prompt-2 (e.g. "请翻译成中文", "Please translate into Chinese"), input prompt-2 and D2-description into the LVLM, and translate D2-description into Chinese to form a text T4.
Step S9: apply truncated-sample filtering to T4 to form a text T5.
Optionally, step S9 specifically includes the following steps:
Step S901: compare the character length LT4 of T4 with the character threshold max_new_tokens; if LT4 is less than max_new_tokens, set T5 to T4 and proceed to step S10; otherwise, truncate the characters of T4 beyond max_new_tokens to form T4-truncate.
Step S902: check whether the last character of T4-truncate, character-2, is a Chinese full stop; if it is, set T5 to T4-truncate; otherwise, delete the content after the last Chinese full stop in T4-truncate to form T5.
Step S10: apply repetition filtering to T5 to form a text T6.
Optionally, step S10 specifically includes the following steps:
Step S1001: split T5 into N sentences to form the text T6.
Step S1002: deduplicate T6 to obtain N' sentences.
Optionally, the deduplication is implemented by a set operation on the N sentences.
Step S1003: if N' < N, delete D2; otherwise, output the text T6.
实施例Example
As shown in FIG. 3, the LVLM-based automatic Chinese image-text pair data generation device of this embodiment specifically includes the following modules:
A first initialization module, configured to obtain image D1 and generate instruction prompt-1.
A first text generation module, connected to the first initialization module and configured to input D1 and prompt-1 into the LVLM model to form text T1.
A first truncated-sample filtering module, connected to the first text generation module and configured to perform truncated-sample filtering on T1 to form text T2.
The first truncated-sample filtering module specifically includes the following submodules:
A first truncation submodule, configured to compare the character length LT1 of T1 with the character threshold max_new_tokens; if LT1 is less than max_new_tokens, T2 is set to T1 and the first repetitive filtering module is executed; otherwise, the characters of T1 exceeding max_new_tokens are truncated to form T1-truncate.
A second truncation submodule, connected to the first truncation submodule and configured to determine whether the last character (character-1) of T1-truncate is a Chinese full stop; if character-1 is a Chinese full stop, T2 is set to T1-truncate; otherwise, the content after the last Chinese full stop in T1-truncate is deleted to form T2.
A first repetitive filtering module, connected to the first truncated-sample filtering module and configured to perform repetitive filtering on T2 to form text T3.
The first repetitive filtering module specifically includes the following submodules:
A first repetitive submodule, configured to split T2 into L sentences to form text T3.
A second repetitive submodule, connected to the first repetitive submodule and configured to deduplicate T3 to obtain L' sentences.
Optionally, the deduplication is implemented by applying a set operation to the L sentences.
A third repetitive submodule, connected to the second repetitive submodule and configured to delete D1 when L' < L, and otherwise execute the first sample processing module.
A first sample processing module, connected to the first repetitive filtering module and configured to compute the similarity S-1 between D1 and T3 using the Chinese CLIP model; if S-1 is less than the threshold γ, D1 is deleted; otherwise, D1 is retained.
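The similarity gate applied by the sample processing modules can be sketched as follows; here `similarity_fn` stands in for the Chinese CLIP model (e.g. cosine similarity of its image and text embeddings), and the function names are assumptions, not the patent's API:

```python
def filter_pairs(pairs, similarity_fn, gamma):
    """Keep an (image, text) pair only if its similarity score
    reaches the threshold gamma; pairs scoring below gamma are deleted."""
    kept = []
    for image, text in pairs:
        score = similarity_fn(image, text)  # placeholder for Chinese CLIP
        if score >= gamma:
            kept.append((image, text))      # retain the sample
        # score < gamma: the image is deleted from the dataset
    return kept
```

In practice `similarity_fn` would embed the image and the Chinese caption with a Chinese CLIP checkpoint and return their cosine similarity.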
Embodiment
As shown in FIG. 4, for the case where an image is accompanied by an English description of that image, the LVLM-based automatic Chinese image-text pair data generation device of this embodiment specifically includes the following modules:
A second initialization module, configured to obtain image D2 and its English description D2-description.
A second sample processing module, connected to the second initialization module and configured to compute the similarity S-2 between D2 and D2-description using the Chinese CLIP model; if S-2 is less than the threshold γ, D2 is deleted; otherwise, D2 is retained.
A second text generation module, connected to the second sample processing module and configured to generate instruction prompt-2, input prompt-2 and D2-description into the LVLM model, and translate D2-description into Chinese to form text T4.
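The patent does not disclose the wording of prompt-2; as an illustrative assumption, the translation instruction fed to the LVLM together with D2-description might be assembled like this:

```python
def build_translation_prompt(d2_description):
    """Assemble a hypothetical prompt-2 asking the LVLM to translate
    the English description into Chinese (the template is an assumption)."""
    return (
        "请将下面的英文图像描述翻译成流畅、准确的中文，只输出译文：\n"
        + d2_description
    )
```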
A second truncated-sample filtering module, connected to the second text generation module and configured to perform truncated-sample filtering on T4 to form text T5.
The second truncated-sample filtering module specifically includes the following submodules:
A third truncation submodule, configured to compare the character length LT4 of T4 with the character threshold max_new_tokens; if LT4 is less than max_new_tokens, T5 is set to T4 and the second repetitive filtering module is executed; otherwise, the characters of T4 exceeding max_new_tokens are truncated to form T4-truncate.
A fourth truncation submodule, connected to the third truncation submodule and configured to determine whether the last character (character-2) of T4-truncate is a Chinese full stop; if character-2 is a Chinese full stop, T5 is set to T4-truncate; otherwise, the content after the last Chinese full stop in T4-truncate is deleted to form T5.
A second repetitive filtering module, connected to the second truncated-sample filtering module and configured to perform repetitive filtering on T5 to form text T6.
The second repetitive filtering module specifically includes the following submodules:
A fourth repetitive submodule, configured to split T5 into N sentences to form text T6.
A fifth repetitive submodule, connected to the fourth repetitive submodule and configured to deduplicate T6 to obtain N' sentences.
Optionally, the deduplication is implemented by applying a set operation to the N sentences.
A sixth repetitive submodule, connected to the fifth repetitive submodule and configured to delete D2 when N' < N, and otherwise output text T6.
The specific embodiments of the present invention have been described in detail above, but they are merely examples; the present invention is not limited to the embodiments described. Any equivalent modification or substitution that a person skilled in the art may make to the present invention also falls within its scope. Accordingly, equivalent changes and modifications made without departing from the spirit and scope of the present invention shall be covered by the scope of the present invention.
Claims (10)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411052852.5A (CN118570566B) | 2024-08-02 | 2024-08-02 | LVLM-based automatic Chinese image-text pair data generation method and device |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN118570566A | 2024-08-30 |
| CN118570566B | 2024-11-22 |
Family
ID=92474961
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411052852.5A (Active) | LVLM-based automatic Chinese image-text pair data generation method and device | 2024-08-02 | 2024-08-02 |
Country Status (1)

| Country | Link |
|---|---|
| CN | CN118570566B (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023082915A1 | 2021-11-09 | 2023-05-19 | 北京有竹居网络技术有限公司 | Method and apparatus for training visual language pre-training model, and device and medium |
| CN118051600A | 2024-03-15 | 2024-05-17 | 零犀(北京)科技有限公司 | Method, device, storage medium and electronic device for constructing multimodal knowledge based on large model |
| CN118397643A | 2024-06-28 | 2024-07-26 | 浪潮电子信息产业股份有限公司 | Image processing method, device, equipment and readable storage medium |
| EP4407567A1 | 2023-01-26 | 2024-07-31 | Google LLC | Zero-shot prompt ensembling for zero-shot classification with text-image models |
2024-08-02: CN202411052852.5A patented as CN118570566B (en), active.
Also Published As
| Publication Number | Publication Date |
|---|---|
| CN118570566B | 2024-11-22 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
US10755048B2 (en) | Artificial intelligence based method and apparatus for segmenting sentence | |
CN111506696A (en) | Information extraction method and device based on small number of training samples | |
CN109255118A (en) | A kind of keyword extracting method and device | |
TW202030640A (en) | Cross-modal information retrieval method and apparatus, and storage medium | |
JP7553202B2 (en) | Text sequence generation method, device, equipment and medium | |
CN117501283A (en) | Text-to-question model system | |
JP6151404B1 (en) | Learning device, learning method, and learning program | |
CN108090400A (en) | A kind of method and apparatus of image text identification | |
JP2007234024A (en) | Method and apparatus for bilingual word alignment, method and apparatus for training bilingual word alignment model | |
CN105975497A (en) | Automatic microblog topic recommendation method and device | |
WO2022073341A1 (en) | Disease entity matching method and apparatus based on voice semantics, and computer device | |
CN113536800A (en) | A word vector representation method and device | |
CN109359308B (en) | Machine translation method, device and readable storage medium | |
Rahman | A Cross Modal Deep Learning Based Approach for Caption Prediction and Concept Detection by CS Morgan State. | |
JP7433068B2 (en) | Infer titles and sections in documents | |
CN109670047B (en) | Abstract note generation method, computer device and readable storage medium | |
US9940320B2 (en) | Plugin tool for collecting user generated document segmentation feedback | |
CN118536073B (en) | Accelerator, data processing method, device, medium, program product and system | |
CN118570566A (en) | A method and device for automatically generating Chinese image-text pair data based on LVLM | |
US20210295738A1 (en) | Providing math content for visually impaired | |
CN114154489A (en) | Triple extraction method, device, equipment and storage medium | |
WO2018120575A1 (en) | Method and device for identifying main picture in web page | |
CN117290490A (en) | Model training processing method, information processing device, model training equipment and model training medium | |
CN113010717B (en) | Image verse description generation method, device and equipment | |
CN115984857A (en) | OCR recognition data generation and training method, system, device and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |