CN118570566A - A method and device for automatically generating Chinese image-text pair data based on LVLM - Google Patents
Info
- Publication number
- CN118570566A (application CN202411052852.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- chinese
- module
- image
- lvlm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/765—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
Description
Technical Field
The present invention relates to the field of large language models, and in particular to a method and device for automatically generating Chinese image-text pair data based on an LVLM (large vision-language model).
Background Art
In the field of artificial intelligence, multimodal data fusion, especially the combination of images and text, has become a research hotspot. Such data not only enriches the input dimensions of machine learning models, but has also driven rapid progress of deep learning in applications such as visual question answering, image caption generation, and visual grounding. Chinese image-text pair datasets, as an important bridge between vision and semantics, are crucial to advancing multimodal understanding and generation research in the Chinese-language context.
Although many recent works have been devoted to creating Chinese image-text pair datasets, these datasets often suffer from the following limitations: (1) compared with the English domain, high-quality, large-scale Chinese image-text pair datasets are scarce, which limits model training effectiveness and generalization; (2) in existing datasets, some image-text pairs are poorly matched, and the text description may not correspond to the image content, causing models to learn incorrect associations; (3) the images and texts in these datasets are often concentrated in a few specific domains and lack breadth and representativeness, which harms models' comprehensive understanding and generation capabilities.
In view of these limitations, automated Chinese image-text pair data generation and cleaning technology has emerged. It aims to automatically mine, match, and clean Chinese image-text pair data from massive Internet resources by algorithm, so as to build a large-scale, high-quality, diverse, and continuously updated dataset.
Summary of the Invention
The present invention provides a method and device for automatically generating Chinese image-text pair data based on an LVLM, so as to solve the problem that existing Chinese image-text pair data is of low quality and small in quantity.
In a first aspect, the present invention provides a method for automatically generating Chinese image-text pair data based on an LVLM (CLIP together with a large vision-language model), applicable to the case where only an image is available, which specifically includes the following steps:
Step S1: acquire an image D1 and generate an instruction prompt-1.
Step S2: input D1 and prompt-1 into the LVLM to form a text T1.
Step S3: apply truncated-sample filtering to T1 to form a text T2.
Step S4: apply repetition filtering to T2 to form a text T3.
Step S5: compute the similarity S-1 between D1 and T3 with a Chinese CLIP model; if S-1 is below a threshold γ, delete D1; otherwise, retain D1.
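The threshold rule of step S5 can be sketched as follows. Here `sim_fn` is a placeholder for the Chinese CLIP image-text similarity (in practice it would encode both modalities and take the cosine similarity of the embeddings); the function name and default γ are illustrative assumptions, not fixed by the patent:

```python
def filter_by_similarity(pairs, sim_fn, gamma=0.2):
    """Keep only (image, text) pairs whose similarity reaches the threshold γ.

    sim_fn(image, text) is assumed to return a score in [0, 1]; in practice it
    would wrap a Chinese CLIP model.
    """
    kept = []
    for image, text in pairs:
        if sim_fn(image, text) < gamma:
            continue  # step S5: similarity below γ -> delete the image
        kept.append((image, text))
    return kept
```

With a stub similarity function the rule is easy to exercise: pairs scoring below γ are dropped and the rest are retained unchanged.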
Preferably, in step S1, prompt-1 is used to guide the LVLM to generate a specific type of response or perform a specific task.
Preferably, in step S3, the truncated-sample filtering of T1 to form T2 specifically includes the following steps:
Step S301: compare the character length LT1 of T1 with the character threshold max_new_tokens; if LT1 is less than max_new_tokens, set T2 to T1 and proceed to step S4; otherwise, truncate the characters of T1 beyond max_new_tokens to form T1-truncate.
Step S302: check whether the last character of T1-truncate, character-1, is a Chinese full stop; if it is, set T2 to T1-truncate; otherwise, delete the content after the last Chinese full stop in T1-truncate to form T2.
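Steps S301-S302 amount to the following length-and-sentence-boundary rule. This is a minimal sketch; it treats max_new_tokens as a character count, matching the step's comparison against the character length LT1:

```python
def truncate_filter(text, max_new_tokens=512):
    # Step S301: within the length budget -> pass through unchanged
    if len(text) < max_new_tokens:
        return text
    truncated = text[:max_new_tokens]
    # Step S302: if the cut already ends on a Chinese full stop, keep it as-is
    if truncated.endswith("。"):
        return truncated
    # Otherwise drop everything after the last complete sentence
    idx = truncated.rfind("。")
    return truncated[:idx + 1] if idx != -1 else ""
```

For example, truncating a caption mid-sentence leaves only the complete sentences before the cut; a caption with no full stop at all is discarded entirely under this sketch.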
Preferably, in step S4, the repetition filtering of T2 to form T3 specifically includes the following steps:
Step S401: split T2 into L sentences to form the text T3.
Step S402: deduplicate T3 to obtain L' sentences.
Step S403: if L' < L, delete D1; otherwise, proceed to step S5.
Preferably, in step S402, the deduplication is implemented by a set operation on the L sentences.
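The set-operation deduplication of steps S401-S403 can be sketched as below. Splitting on the Chinese full stop is an illustrative choice; the patent does not fix the sentence segmentation rule:

```python
import re

def repetition_filter(text):
    # Step S401: split the text into sentences, keeping the trailing 。
    sentences = [s for s in re.split(r"(?<=。)", text) if s]
    # Step S402: deduplicate via a set operation
    unique = set(sentences)
    # Step S403: any duplicated sentence marks the sample for deletion
    if len(unique) < len(sentences):
        return None  # caller deletes the image-text pair
    return text
```

A text that merely repeats the same sentence (a common LVLM degeneration mode) is rejected, while a text whose sentences are all distinct passes through unchanged.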
Preferably, when computing resources are insufficient, step S5 is omitted; the image D1 is retained directly and the text T3 is output.
In a second aspect, the present invention provides a method for automatically generating Chinese image-text pair data based on an LVLM, applicable to the case where an image and an English description of that image are available, which specifically includes the following steps:
Step A1: acquire an image D2 and its English description D2-description.
Step A2: compute the similarity S-2 between D2 and D2-description with a Chinese CLIP model; if S-2 is below the threshold γ, delete D2; otherwise, retain D2.
Step A3: generate an instruction prompt-2, input prompt-2 and D2-description into the LVLM, and translate D2-description into Chinese to form a text T4.
Step A4: apply truncated-sample filtering to T4 to form a text T5.
Step A5: apply repetition filtering to T5 to form a text T6.
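The second-aspect flow (A1-A5) can be composed end to end as below. `lvlm_translate` and `sim_fn` are hypothetical callables standing in for the LVLM translation call (prompt-2 plus D2-description) and the Chinese CLIP similarity; the default γ and length budget are illustrative assumptions:

```python
import re

def translate_pipeline(image, description_en, lvlm_translate, sim_fn,
                       gamma=0.2, max_new_tokens=512):
    """Sketch of steps A1-A5 under assumed lvlm_translate / sim_fn callables."""
    # A2: drop pairs whose image/English-caption similarity is below γ
    if sim_fn(image, description_en) < gamma:
        return None
    # A3: ask the LVLM to translate the English caption into Chinese -> T4
    t4 = lvlm_translate(description_en)
    # A4: truncated-sample filtering -> T5
    if len(t4) < max_new_tokens:
        t5 = t4
    else:
        t5 = t4[:max_new_tokens]
        if not t5.endswith("。"):
            idx = t5.rfind("。")
            t5 = t5[:idx + 1] if idx != -1 else ""
    # A5: repetition filtering -> T6 (any duplicated sentence rejects the pair)
    sentences = [s for s in re.split(r"(?<=。)", t5) if s]
    if len(set(sentences)) < len(sentences):
        return None
    return t5
```

With stub callables, a well-matched pair yields its Chinese caption, while a low-similarity pair or a degenerate repeated translation returns None (i.e., the pair is deleted).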
Preferably, in step A4, the truncated-sample filtering of T4 to form T5 specifically includes the following steps:
Step A401: compare the character length LT4 of T4 with the character threshold max_new_tokens; if LT4 is less than max_new_tokens, set T5 to T4 and proceed to step A5; otherwise, truncate the characters of T4 beyond max_new_tokens to form T4-truncate.
Step A402: check whether the last character of T4-truncate, character-2, is a Chinese full stop; if it is, set T5 to T4-truncate; otherwise, delete the content after the last Chinese full stop in T4-truncate to form T5.
Preferably, in step A5, the repetition filtering of T5 to form T6 specifically includes the following steps:
Step A501: split T5 into N sentences to form the text T6.
Step A502: deduplicate T6 to obtain N' sentences.
Step A503: if N' < N, delete D2; otherwise, output the text T6.
Preferably, in step A502, the deduplication is implemented by a set operation on the N sentences.
Preferably, when computing resources are insufficient, step A2 is omitted.
In a third aspect, the present invention provides a device for automatically generating Chinese image-text pair data based on an LVLM, applicable to the case where only an image is available, which specifically includes the following modules:
a first initialization module, configured to acquire an image D1 and generate an instruction prompt-1;
a first text generation module, connected to the first initialization module and configured to input D1 and prompt-1 into the LVLM to form a text T1;
a first truncated-sample filtering module, connected to the first text generation module and configured to apply truncated-sample filtering to T1 to form a text T2;
a first repetition filtering module, connected to the first truncated-sample filtering module and configured to apply repetition filtering to T2 to form a text T3;
a first sample processing module, connected to the first repetition filtering module and configured to compute the similarity S-1 between D1 and T3 with a Chinese CLIP model; if S-1 is below a threshold γ, D1 is deleted; otherwise, D1 is retained.
Preferably, in the first initialization module, prompt-1 is used to guide the LVLM to generate a specific type of response or perform a specific task.
Preferably, the first truncated-sample filtering module specifically includes the following submodules:
a first truncation submodule, configured to compare the character length LT1 of T1 with the character threshold max_new_tokens; if LT1 is less than max_new_tokens, T2 is set to T1 and the first repetition filtering module is executed; otherwise, the characters of T1 beyond max_new_tokens are truncated to form T1-truncate;
a second truncation submodule, connected to the first truncation submodule and configured to check whether the last character of T1-truncate, character-1, is a Chinese full stop; if it is, T2 is set to T1-truncate; otherwise, the content after the last Chinese full stop in T1-truncate is deleted to form T2.
Preferably, the first repetition filtering module specifically includes the following submodules:
a first repetition submodule, configured to split T2 into L sentences to form the text T3;
a second repetition submodule, connected to the first repetition submodule and configured to deduplicate T3 to obtain L' sentences;
a third repetition submodule, connected to the second repetition submodule and configured to delete D1 when L' < L, and otherwise execute the first sample processing module.
Preferably, in the second repetition submodule, the deduplication is implemented by a set operation on the L sentences.
In a fourth aspect, the present invention provides a device for automatically generating Chinese image-text pair data based on an LVLM, applicable to the case where an image and an English description of that image are available, which specifically includes the following modules:
a second initialization module, configured to acquire an image D2 and its English description D2-description;
a second sample processing module, connected to the second initialization module and configured to compute the similarity S-2 between D2 and D2-description with a Chinese CLIP model; if S-2 is below the threshold γ, D2 is deleted; otherwise, D2 is retained;
a second text generation module, connected to the second sample processing module and configured to generate an instruction prompt-2, input prompt-2 and D2-description into the LVLM, and translate D2-description into Chinese to form a text T4;
a second truncated-sample filtering module, connected to the second text generation module and configured to apply truncated-sample filtering to T4 to form a text T5;
a second repetition filtering module, connected to the second truncated-sample filtering module and configured to apply repetition filtering to T5 to form a text T6.
Preferably, the second truncated-sample filtering module specifically includes the following submodules:
a third truncation submodule, configured to compare the character length LT4 of T4 with the character threshold max_new_tokens; if LT4 is less than max_new_tokens, T5 is set to T4 and the second repetition filtering module is executed; otherwise, the characters of T4 beyond max_new_tokens are truncated to form T4-truncate;
a fourth truncation submodule, connected to the third truncation submodule and configured to check whether the last character of T4-truncate, character-2, is a Chinese full stop; if it is, T5 is set to T4-truncate; otherwise, the content after the last Chinese full stop in T4-truncate is deleted to form T5.
Preferably, the second repetition filtering module specifically includes the following submodules:
a fourth repetition submodule, configured to split T5 into N sentences to form the text T6;
a fifth repetition submodule, connected to the fourth repetition submodule and configured to deduplicate T6 to obtain N' sentences;
a sixth repetition submodule, connected to the fifth repetition submodule and configured to delete D2 when N' < N, and otherwise output the text T6.
Preferably, in the fifth repetition submodule, the deduplication is implemented by a set operation on the N sentences.
In another aspect, the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the LVLM-based automatic Chinese image-text pair data generation method described in any one of the first or second aspects of this application.
In another aspect, the present invention further provides an electronic device, including: a memory storing a computer program; and a processor communicatively connected to the memory, which, when the computer program is invoked, executes the LVLM-based automatic Chinese image-text pair data generation method described in any one of the first or second aspects of this application.
Compared with the prior art, the present invention has the following outstanding substantive features and significant advantages:
The present invention provides a method and device for automatically generating Chinese image-text pair data based on an LVLM, which solves the problem that existing Chinese image-text pair data is of low quality and small in quantity. It has the following advantages: (1) the same LVLM is used both to generate Chinese data and to translate English data, simplifying the architecture of the solution; (2) combining generation and translation broadens the sources of Chinese image-text data and increases its diversity; (3) the generated/translated samples undergo multi-stage post-processing, further improving the quality of the Chinese corpus; (4) a CLIP model is used to compare image-text similarity, further improving the quality of the Chinese corpus.
Brief Description of the Drawings
The accompanying drawings, which form part of the present invention, provide a further understanding of the invention. The exemplary embodiments and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of one case of the LVLM-based automatic Chinese image-text pair data generation method according to a preferred embodiment of the present invention.
FIG. 2 is a flowchart of another case of the LVLM-based automatic Chinese image-text pair data generation method according to a preferred embodiment of the present invention.
FIG. 3 is a schematic structural diagram of one case of the LVLM-based automatic Chinese image-text pair data generation device according to a preferred embodiment of the present invention.
FIG. 4 is a schematic structural diagram of another case of the LVLM-based automatic Chinese image-text pair data generation device according to a preferred embodiment of the present invention.
FIG. 5 is a schematic flowchart of the LVLM-based automatic Chinese image-text pair data generation method according to a preferred embodiment of the present invention.
Detailed Description
The present invention provides a method and device for automatically generating Chinese image-text pair data based on an LVLM. To make the purpose, technical solution, and effect of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are only intended to explain the present invention, not to limit it.
It should be noted that the terms "first", "second", etc. in the specification, claims, and drawings of the present invention are used to distinguish similar objects and do not necessarily describe a specific order or sequence; it should be understood that data so used are interchangeable where appropriate. In addition, the terms "including" and "having" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to that process, method, product, or device.
Embodiment
As shown in FIG. 1 and FIG. 5 (left), the LVLM-based automatic Chinese image-text pair data generation method of this embodiment, applicable to the case where only an image is available, specifically includes the following steps:
Step S1: acquire an image D1 and generate an instruction prompt-1, e.g. "请描述这张图" ("Please describe this image").
The prompt-1 is used to guide the LVLM to generate a specific type of response or perform a specific task.
Step S2: input D1 and prompt-1 into the LVLM to form a text T1.
Step S3: apply truncated-sample filtering to T1 to form a text T2.
Optionally, step S3 specifically includes the following steps:
Step S301: compare the character length LT1 of T1 with the character threshold max_new_tokens; if LT1 is less than max_new_tokens, set T2 to T1 and proceed to step S4; otherwise, truncate the characters of T1 beyond max_new_tokens to form T1-truncate.
Step S302: check whether the last character of T1-truncate, character-1, is a Chinese full stop; if it is, set T2 to T1-truncate; otherwise, delete the content after the last Chinese full stop in T1-truncate to form T2.
Step S4: apply repetition filtering to T2 to form a text T3.
Optionally, step S4 specifically includes the following steps:
Step S401: split T2 into L sentences to form the text T3.
Step S402: deduplicate T3 to obtain L' sentences.
Optionally, the deduplication is implemented by a set operation on the L sentences.
Step S403: if L' < L, delete D1; otherwise, proceed to step S5.
Step S5: compute the similarity S-1 between D1 and T3 with a Chinese CLIP model; if S-1 is below the threshold γ, delete D1; otherwise, retain D1.
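The generation path of this embodiment (S1-S5) composes the same filters around an LVLM captioning call. In the sketch below, `lvlm_caption` and `sim_fn` are assumed callables (the captioning call with prompt-1 baked in, and the Chinese CLIP similarity); γ and the length budget are illustrative:

```python
import re

def caption_pipeline(image, lvlm_caption, sim_fn, gamma=0.2, max_new_tokens=512):
    """Sketch of steps S1-S5 under assumed lvlm_caption / sim_fn callables."""
    # S1/S2: caption the image with the LVLM -> T1
    t1 = lvlm_caption(image)
    # S3: truncated-sample filtering -> T2
    if len(t1) < max_new_tokens:
        t2 = t1
    else:
        t2 = t1[:max_new_tokens]
        if not t2.endswith("。"):
            idx = t2.rfind("。")
            t2 = t2[:idx + 1] if idx != -1 else ""
    # S4: repetition filtering -> T3 (any duplicated sentence rejects the sample)
    sentences = [s for s in re.split(r"(?<=。)", t2) if s]
    if len(set(sentences)) < len(sentences):
        return None
    t3 = t2
    # S5: keep the pair only if the image-text similarity reaches γ
    if sim_fn(image, t3) < gamma:
        return None
    return t3
```

A None return corresponds to deleting D1; a string return is the retained Chinese caption T3 for the image-text pair.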
Embodiment
As shown in FIG. 2 and FIG. 5 (right), the LVLM-based automatic Chinese image-text pair data generation method of this embodiment, applicable to the case where an image and an English description of that image are available, specifically includes the following steps:
Step S6: acquire an image D2 and its English description D2-description.
Step S7: compute the similarity S-2 between D2 and D2-description with a Chinese CLIP model; if S-2 is below the threshold γ, delete D2; otherwise, retain D2.
Step S8: generate an instruction prompt-2 (e.g. "请翻译成中文", "Please translate into Chinese"), input prompt-2 and D2-description into the LVLM, and translate D2-description into Chinese to form a text T4.
Step S9: apply truncated-sample filtering to T4 to form a text T5.
Optionally, step S9 specifically includes the following steps:
Step S901: compare the character length LT4 of T4 with the character threshold max_new_tokens; if LT4 is less than max_new_tokens, set T5 to T4 and proceed to step S10; otherwise, truncate the characters of T4 beyond max_new_tokens to form T4-truncate.
Step S902: check whether the last character of T4-truncate, character-2, is a Chinese full stop; if it is, set T5 to T4-truncate; otherwise, delete the content after the last Chinese full stop in T4-truncate to form T5.
Step S10: apply repetition filtering to T5 to form a text T6.
Optionally, step S10 specifically includes the following steps:
Step S1001: split T5 into N sentences to form the text T6.
Step S1002: deduplicate T6 to obtain N' sentences.
Optionally, the deduplication is implemented by a set operation on the N sentences.
Step S1003: if N' < N, delete D2; otherwise, output the text T6.
实施例Example
As shown in FIG. 3, the LVLM-based automatic Chinese image-text pair data generation device of this embodiment specifically includes the following modules:
A first initialization module, configured to obtain image D1 and generate instruction prompt-1.
A first text generation module, connected to the first initialization module and configured to input D1 and prompt-1 into the LVLM model to form text T1.
A first truncated-sample filtering module, connected to the first text generation module and configured to perform truncated-sample filtering on T1 to form text T2.
The first truncated-sample filtering module specifically includes the following submodules:
A first truncation submodule, configured to compare the character length LT1 of T1 with the character threshold max_new_tokens; if LT1 is less than max_new_tokens, T2 is set to T1 and the first repetitive filtering module is executed; otherwise, the characters of T1 exceeding max_new_tokens are truncated to form T1-truncate.
A second truncation submodule, connected to the first truncation submodule and configured to determine whether the last character (character-1) of T1-truncate is a Chinese full stop; if character-1 is a Chinese full stop, T2 is set to T1-truncate; otherwise, the content after the last Chinese full stop in T1-truncate is deleted to form T2.
A first repetitive filtering module, connected to the first truncated-sample filtering module and configured to perform repetitive filtering on T2 to form text T3.
The first repetitive filtering module specifically includes the following submodules:
A first repetitive submodule, configured to split T2 into L sentences to form text T3.
A second repetitive submodule, connected to the first repetitive submodule and configured to deduplicate T3 to obtain L' sentences.
Optionally, the deduplication is implemented by applying a set operation to the L sentences.
A third repetitive submodule, connected to the second repetitive submodule and configured to delete D1 when L' < L, and otherwise execute the first sample processing module.
A first sample processing module, connected to the first repetitive filtering module and configured to compute the similarity S-1 between D1 and T3 using the Chinese CLIP model; if S-1 is less than the threshold γ, D1 is deleted; otherwise, D1 is retained.
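The similarity gate applied by the sample processing modules can be sketched as follows; here `similarity_fn` stands in for the Chinese CLIP model (e.g. cosine similarity of its image and text embeddings), and the function names are assumptions, not the patent's API:

```python
def filter_pairs(pairs, similarity_fn, gamma):
    """Keep an (image, text) pair only if its similarity score
    reaches the threshold gamma; pairs scoring below gamma are deleted."""
    kept = []
    for image, text in pairs:
        score = similarity_fn(image, text)  # placeholder for Chinese CLIP
        if score >= gamma:
            kept.append((image, text))      # retain the sample
        # score < gamma: the image is deleted from the dataset
    return kept
```

In practice `similarity_fn` would embed the image and the Chinese caption with a Chinese CLIP checkpoint and return their cosine similarity.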
Embodiment
As shown in FIG. 4, for the case where an image is accompanied by an English description of that image, the LVLM-based automatic Chinese image-text pair data generation device of this embodiment specifically includes the following modules:
A second initialization module, configured to obtain image D2 and its English description D2-description.
A second sample processing module, connected to the second initialization module and configured to compute the similarity S-2 between D2 and D2-description using the Chinese CLIP model; if S-2 is less than the threshold γ, D2 is deleted; otherwise, D2 is retained.
A second text generation module, connected to the second sample processing module and configured to generate instruction prompt-2, input prompt-2 and D2-description into the LVLM model, and translate D2-description into Chinese to form text T4.
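The patent does not disclose the wording of prompt-2; as an illustrative assumption, the translation instruction fed to the LVLM together with D2-description might be assembled like this:

```python
def build_translation_prompt(d2_description):
    """Assemble a hypothetical prompt-2 asking the LVLM to translate
    the English description into Chinese (the template is an assumption)."""
    return (
        "请将下面的英文图像描述翻译成流畅、准确的中文，只输出译文：\n"
        + d2_description
    )
```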
A second truncated-sample filtering module, connected to the second text generation module and configured to perform truncated-sample filtering on T4 to form text T5.
The second truncated-sample filtering module specifically includes the following submodules:
A third truncation submodule, configured to compare the character length LT4 of T4 with the character threshold max_new_tokens; if LT4 is less than max_new_tokens, T5 is set to T4 and the second repetitive filtering module is executed; otherwise, the characters of T4 exceeding max_new_tokens are truncated to form T4-truncate.
A fourth truncation submodule, connected to the third truncation submodule and configured to determine whether the last character (character-2) of T4-truncate is a Chinese full stop; if character-2 is a Chinese full stop, T5 is set to T4-truncate; otherwise, the content after the last Chinese full stop in T4-truncate is deleted to form T5.
A second repetitive filtering module, connected to the second truncated-sample filtering module and configured to perform repetitive filtering on T5 to form text T6.
The second repetitive filtering module specifically includes the following submodules:
A fourth repetitive submodule, configured to split T5 into N sentences to form text T6.
A fifth repetitive submodule, connected to the fourth repetitive submodule and configured to deduplicate T6 to obtain N' sentences.
Optionally, the deduplication is implemented by applying a set operation to the N sentences.
A sixth repetitive submodule, connected to the fifth repetitive submodule and configured to delete D2 when N' < N, and otherwise output text T6.
The specific embodiments of the present invention have been described in detail above, but they are merely examples; the present invention is not limited to the embodiments described. Any equivalent modification or substitution that a person skilled in the art may make to the present invention also falls within its scope. Accordingly, equivalent changes and modifications made without departing from the spirit and scope of the present invention shall be covered by the scope of the present invention.
Claims (10)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411052852.5A (CN118570566B) | 2024-08-02 | 2024-08-02 | LVLM-based automatic Chinese image-text pair data generation method and device |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN118570566A | 2024-08-30 |
| CN118570566B | 2024-11-22 |
Family
ID=92474961
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411052852.5A (Active) | LVLM-based automatic Chinese image-text pair data generation method and device | 2024-08-02 | 2024-08-02 |
Country Status (1)

| Country | Link |
|---|---|
| CN | CN118570566B (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023082915A1 | 2021-11-09 | 2023-05-19 | 北京有竹居网络技术有限公司 | Method and apparatus for training visual language pre-training model, and device and medium |
| CN118051600A | 2024-03-15 | 2024-05-17 | 零犀(北京)科技有限公司 | Method, device, storage medium and electronic device for constructing multimodal knowledge based on large model |
| CN118397643A | 2024-06-28 | 2024-07-26 | 浪潮电子信息产业股份有限公司 | Image processing method, device, equipment and readable storage medium |
| EP4407567A1 | 2023-01-26 | 2024-07-31 | Google LLC | Zero-shot prompt ensembling for zero-shot classification with text-image models |
2024-08-02: CN202411052852.5A patented as CN118570566B (en), active.
Also Published As
| Publication Number | Publication Date |
|---|---|
| CN118570566B | 2024-11-22 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
US10755048B2 (en) | Artificial intelligence based method and apparatus for segmenting sentence | |
CN111506696A (en) | Information extraction method and device based on small number of training samples | |
CN109255118A (en) | A kind of keyword extracting method and device | |
TW202030640A (en) | Cross-modal information retrieval method and apparatus, and storage medium | |
JP7553202B2 (en) | Text sequence generation method, device, equipment and medium | |
CN117501283A (en) | Text-to-question model system | |
JP6151404B1 (en) | Learning device, learning method, and learning program | |
CN108090400A (en) | A kind of method and apparatus of image text identification | |
JP2007234024A (en) | Method and apparatus for bilingual word alignment, method and apparatus for training bilingual word alignment model | |
CN105975497A (en) | Automatic microblog topic recommendation method and device | |
WO2022073341A1 (en) | Disease entity matching method and apparatus based on voice semantics, and computer device | |
CN113536800A (en) | A word vector representation method and device | |
CN109359308B (en) | Machine translation method, device and readable storage medium | |
Rahman | A Cross Modal Deep Learning Based Approach for Caption Prediction and Concept Detection by CS Morgan State. | |
JP7433068B2 (en) | Infer titles and sections in documents | |
CN109670047B (en) | Abstract note generation method, computer device and readable storage medium | |
US9940320B2 (en) | Plugin tool for collecting user generated document segmentation feedback | |
CN118536073B (en) | Accelerator, data processing method, device, medium, program product and system | |
CN118570566A (en) | A method and device for automatically generating Chinese image-text pair data based on LVLM | |
US20210295738A1 (en) | Providing math content for visually impaired | |
CN114154489A (en) | Triple extraction method, device, equipment and storage medium | |
WO2018120575A1 (en) | Method and device for identifying main picture in web page | |
CN117290490A (en) | Model training processing method, information processing device, model training equipment and model training medium | |
CN113010717B (en) | Image verse description generation method, device and equipment | |
CN115984857A (en) | OCR recognition data generation and training method, system, device and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |