CN118350464A - Conversational target localization method and device based on arbitrary-granularity text input
- Publication number: CN118350464A
- Application number: CN202410250372.3A
- Authority: CN (China)
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular to a conversational target localization method and device based on arbitrary-granularity text input.
Background Art
With the development of instruction-tuning techniques, Large Vision-Language Models (LVLMs) have demonstrated superior performance on a variety of visual understanding tasks. These models are increasingly deployed in real-world scenarios, for example as personal assistants on mobile devices. However, their application in domains that require fine-grained object perception and spatial localization remains limited. In addition, concerns have been raised about hallucinations caused by the lack of fine-grained supervision.
On the one hand, this limitation stems mainly from the weak multi-object understanding of LVLMs. Popular LVLMs such as Shikra, Qwen-VL, and KOSMOS-2 possess basic object perception capabilities, but they are limited to locating a single referred object and struggle in more complex scenes. Although they integrate millions of samples from the Referring Expression Comprehension (REC) task, or reformat data from other tasks into the referring-comprehension format, they cover only single-reference scenarios. In essence, the root of the problem lies not in the amount of data but in the lack of a comprehensive and well-organized training benchmark.
On the other hand, some studies attempt to localize various fine-grained objects by having a large language model invoke offline visual expert models, or by integrating task-specific visual detection heads. The former is a scheduler-like approach that manages each visual expert model through a dedicated Application Programming Interface (API). However, it does not fully exploit the comprehensive reasoning ability of the Large Language Model (LLM), and the visual expert models themselves suffer from limited generalization. The latter produces task-specific outputs through specialized heads (such as detection heads and segmentation heads). This allows more customized responses, but the number of parameters grows significantly as tasks are added. Both extra interfaces and extra task heads burden the large language model and prevent a structure unified with LVLMs. Moreover, both approaches rely on external assistance, and neither fully exploits the broad knowledge that large language models acquire through large-scale pre-training, which reduces the accuracy of conversational target localization.
Summary of the Invention
The present invention provides a conversational target localization method and device based on arbitrary-granularity text input, to overcome the defect of the prior art that both existing approaches rely on external assistance and fail to fully exploit the broad knowledge acquired by large language models through large-scale pre-training, which reduces the accuracy of conversational target localization.
The present invention provides a conversational target localization method based on arbitrary-granularity text input, comprising:
obtaining an image to be localized and an input text;
inputting the image to be localized into a visual encoder to extract image features of the image to be localized, and projecting the image features into a word embedding space to obtain image word vectors;
tokenizing the input text to obtain token vectors, and mapping the token vectors into the word embedding space to obtain text word vectors;
taking the image word vectors and the text word vectors as an image-text word vector pair, and inputting the image-text word vector pair into a large language model to obtain an answer sequence output by the large language model, the answer sequence including the target categories and target positions in the image to be localized;
wherein the large language model is trained on sample input texts, sample images to be localized, and the labeled target categories and labeled target positions in the sample images to be localized.
According to the conversational target localization method based on arbitrary-granularity text input provided by the present invention, after inputting the image-text word vector pair into the large language model to obtain the answer sequence output by the large language model, the method further comprises:
determining, based on the image to be localized, the input text, and the answer sequence preceding a given token, first conditional probabilities with which each token of the answer sequence is generated;
determining a generation probability of the answer sequence based on the first conditional probabilities.
According to the conversational target localization method based on arbitrary-granularity text input provided by the present invention, after inputting the image-text word vector pair into the large language model to obtain the answer sequence output by the large language model, the method further comprises:
determining, based on the image to be localized, the input text, and the answer sequence preceding a given token, second conditional probabilities of each target category of the image to be localized within the answer sequence;
determining, based on the second conditional probabilities, a generation probability of each target category of the image to be localized within the answer sequence.
According to the conversational target localization method based on arbitrary-granularity text input provided by the present invention, after determining the generation probability of each target category of the image to be localized within the answer sequence based on the second conditional probabilities, the method further comprises:
determining, based on the image to be localized, the input text, and the answer sequence preceding a given position, third conditional probabilities of the target positions of the image to be localized within the answer sequence;
determining, based on the third conditional probabilities, a generation probability of each target position of the image to be localized within the answer sequence.
According to the conversational target localization method based on arbitrary-granularity text input provided by the present invention, after determining the generation probability of each target position of the image to be localized within the answer sequence based on the third conditional probabilities, the method further comprises:
determining a confidence score of each target in the image to be localized based on the generation probabilities of the target positions and the generation probabilities of the target categories.
According to the conversational target localization method based on arbitrary-granularity text input provided by the present invention, the training of the large language model comprises:
determining an initial large language model, and obtaining the sample input text, the sample image to be localized, and the labeled target categories and labeled target positions in the sample image to be localized;
inputting the sample image to be localized into an initial visual encoder to extract sample image features of the sample image to be localized, and projecting the sample image features into a sample word embedding space through an initial projection layer to obtain sample image word vectors;
tokenizing the sample input text to obtain sample token vectors, and mapping the sample token vectors into the sample word embedding space to obtain sample text word vectors;
taking the sample image word vectors and the sample text word vectors as a sample image-text word vector pair, and inputting the sample image-text word vector pair into the initial large language model to obtain a predicted answer sequence output by the initial large language model;
determining a target loss based on the predicted target categories in the predicted answer sequence and the labeled target categories, and on the predicted target positions in the predicted answer sequence and the labeled target positions, and iterating the parameters of the initial large language model based on the target loss to obtain the large language model.
According to the conversational target localization method based on arbitrary-granularity text input provided by the present invention, iterating the parameters of the initial large language model based on the target loss to obtain the large language model further comprises:
back-propagating gradients from the target loss through the initial visual encoder and the initial projection layer to obtain the visual encoder and the projection layer.
The present invention further provides a conversational target localization device based on arbitrary-granularity text input, comprising:
an acquisition unit, configured to obtain an image to be localized and an input text;
a projection unit, configured to input the image to be localized into a visual encoder to extract image features of the image to be localized, and to project the image features into a word embedding space to obtain image word vectors;
a mapping unit, configured to tokenize the input text to obtain token vectors, and to map the token vectors into the word embedding space to obtain text word vectors;
an input unit, configured to take the image word vectors and the text word vectors as an image-text word vector pair, and to input the image-text word vector pair into a large language model to obtain an answer sequence output by the large language model, the answer sequence including the target categories and target positions in the image to be localized;
wherein the large language model is trained on sample input texts, sample images to be localized, and the labeled target categories and labeled target positions in the sample images to be localized.
The present invention further provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any one of the conversational target localization methods based on arbitrary-granularity text input described above.
The present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements any one of the conversational target localization methods based on arbitrary-granularity text input described above.
The present invention further provides a computer program product comprising a computer program which, when executed by a processor, implements any one of the conversational target localization methods based on arbitrary-granularity text input described above.
In the conversational target localization method and device based on arbitrary-granularity text input provided by the present invention, the image to be localized is input into a visual encoder to extract its image features, the image features are projected into a word embedding space to obtain image word vectors, the input text is tokenized into token vectors that are mapped into the word embedding space to obtain text word vectors, and finally the image word vectors and text word vectors form an image-text word vector pair that is input into a large language model to obtain an answer sequence containing the target categories and target positions in the image to be localized. The large language model is trained on sample input texts, sample images to be localized, and the labeled target categories and labeled target positions in those images. On the one hand, the method does not rely on external assistance, and the sample input texts and sample images share a unified structure with LVLMs; on the other hand, it fully exploits the broad knowledge acquired by the large language model through large-scale pre-training, giving it the ability to localize targets from text input of arbitrary granularity and further improving the accuracy and reliability of conversational target localization.
Brief Description of the Drawings
To describe the technical solutions of the present invention or the prior art more clearly, the drawings required by the embodiments or the prior-art description are briefly introduced below. The drawings described below are obviously some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is the first flow chart of the conversational target localization method based on arbitrary-granularity text input provided by the present invention;
FIG. 2 is the second flow chart of the conversational target localization method based on arbitrary-granularity text input provided by the present invention;
FIG. 3 is a flow chart of the progressive two-stage training method provided by the present invention;
FIG. 4 is a schematic structural diagram of the conversational target localization device based on arbitrary-granularity text input provided by the present invention;
FIG. 5 is a schematic structural diagram of the electronic device provided by the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only a part of the embodiments of the present invention rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terms "first", "second", and the like in the specification and claims of the present invention are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first", "second", and the like are generally of one class.
The present invention provides a conversational target localization method based on arbitrary-granularity text input. FIG. 1 and FIG. 2 are the first and second flow charts of the method. As shown in FIG. 1 and FIG. 2, the method includes:
Step 110: obtain the image to be localized and the input text.
Specifically, an image to be localized and an input text can be obtained. Here, the image to be localized is the image on which target localization is subsequently performed, and it can be denoted Xv. The image may be pre-collected by an image acquisition device, captured in real time, or downloaded or scanned from the Internet, which is not specifically limited in the embodiments of the present invention.
Further, after the image to be localized is obtained, preprocessing operations may be performed on it, including but not limited to normalization, sharpening, and denoising, which are not specifically limited in the embodiments of the present invention.
The input text can be denoted Xins and may be an instruction or a question. It may be entered directly by the user, obtained by transcribing collected audio, or obtained by capturing an image with a scanner, mobile phone, camera, tablet, or other image acquisition device and applying OCR (Optical Character Recognition), which is not specifically limited in the embodiments of the present invention.
The input text may be, for example, "Please locate the person wearing a red cotton coat in the image", "Please detect all the cars in the image", or "Is there an elephant in the image?".
Step 120: input the image to be localized into a visual encoder to extract image features of the image to be localized, and project the image features into a word embedding space to obtain image word vectors.
Step 130: tokenize the input text to obtain token vectors, and map the token vectors into the word embedding space to obtain text word vectors.
Specifically, after the image to be localized and the input text are obtained, the image to be localized can be input into the visual encoder to extract its image features, denoted Zv. After the image features are extracted, they can be projected into the word embedding space to obtain the image word vectors, denoted Hv.
It should be noted that the image word vectors Hv have the same dimension as the word embedding space.
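As a purely illustrative sketch (not part of the patent), the extraction-and-projection step could be implemented roughly as follows, assuming a PyTorch setting with a CLIP ViT-L/14 backbone as the visual encoder and a single linear projection layer as in the LLaVA setup described below; the class and variable names (VisualProjection, visual_encoder, projector) are assumptions.

```python
import torch
import torch.nn as nn

class VisualProjection(nn.Module):
    """Project visual-encoder features Z_v into the LLM word-embedding space to get H_v."""

    def __init__(self, visual_encoder: nn.Module, vision_dim: int, embed_dim: int):
        super().__init__()
        self.visual_encoder = visual_encoder                # e.g. a CLIP ViT-L/14 backbone
        self.projector = nn.Linear(vision_dim, embed_dim)   # arbitrarily initialized linear layer

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Z_v: one feature vector per image patch, shape (batch, num_patches, vision_dim)
        z_v = self.visual_encoder(image)
        # H_v: image word vectors with the same dimension as the word-embedding space
        h_v = self.projector(z_v)
        return h_v
```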
The word embedding space is a technique in Natural Language Processing (NLP) that maps semantics into a geometric space. It associates a numeric vector with every word in the vocabulary so that the distance between any two vectors reflects part of the semantic relationship between the associated words. The geometric space formed by these vectors is called the embedding space. A word embedding space carries structured semantic information: semantically similar words or phrases are embedded into nearby vectors, while semantically dissimilar ones are embedded into vectors that are farther apart.
It should be noted that the framework of the target localization method in the embodiments of the present invention can be built on a vision-language model, which may be the LLaVA (Large Language and Vision Assistant) model or the InstructBLIP (Instruction-tuned Bootstrapping Language-Image Pre-training) model, among others; the embodiments of the present invention do not specifically limit this.
The LLaVA model consists of three modules: a visual encoder, a projection layer, and a large language model. The visual encoder may be the ViT-L/14 visual encoder of the CLIP (Contrastive Language-Image Pre-training) model, the projection layer is an arbitrarily initialized linear layer, and the large language model may be the Llama2-13B model; the embodiments of the present invention do not specifically limit this.
ViT-L/14 visual encoder: ViT (Vision Transformer) is a model for image classification that splits an image into a series of small blocks (patches) and feeds them as a sequence into a Transformer architecture. ViT-L/14 refers to a large version of ViT, where "L" stands for "Large" and "14" is the patch size used by the Transformer encoder. Such a large model has more parameters and stronger representational capacity and can capture complex features in images.
Llama2-13B large language model: Llama is a family of large language models. Llama2-13B refers to a specific model in this family, where "2" indicates the second generation and "13B" indicates 13 billion parameters. A model of this scale has strong text generation and understanding capabilities and can handle complex natural language tasks.
Then, the input text is tokenized to obtain token vectors, and the token vectors are mapped into the word embedding space to obtain the text word vectors, denoted Hins.
Here, tokenizing the input text means feeding the input text into a tokenizer for tokenization.
Step 140: take the image word vectors and the text word vectors as an image-text word vector pair, and input the image-text word vector pair into the large language model to obtain an answer sequence output by the large language model; the answer sequence includes the target categories and target positions in the image to be localized.
The large language model is trained on sample input texts, sample images to be localized, and the labeled target categories and labeled target positions in the sample images to be localized.
Specifically, after the image word vectors and the text word vectors are obtained, they can be taken as an image-text word vector pair, denoted (Hv, Hins).
After the image-text word vector pair is obtained, it can be input into the large language model to obtain the answer sequence output by the model. The answer sequence can be denoted Xa, its sequence length is denoted La, each output token is denoted xa,j, and j is the index of that token in the answer sequence.
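A minimal sketch of how the image-text word vector pair could be fed to the model, assuming a Hugging Face-style causal language model interface (get_input_embeddings, generate with inputs_embeds); the function name generate_answer and that interface are assumptions, not components specified by the patent.

```python
import torch

@torch.no_grad()
def generate_answer(llm, tokenizer, h_v, instruction_text, max_new_tokens=256):
    """Concatenate image word vectors with text word vectors and decode autoregressively."""
    # Tokenize the instruction X_ins and map the tokens into the word-embedding space (H_ins)
    ids = tokenizer(instruction_text, return_tensors="pt").input_ids
    h_ins = llm.get_input_embeddings()(ids)
    # Image-text word vector pair (H_v, H_ins) as one input sequence
    inputs_embeds = torch.cat([h_v, h_ins], dim=1)
    # The LLM generates the answer sequence X_a token by token
    out = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```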
Here, the answer sequence includes the target categories and target positions in the image to be localized. A target category can be denoted R, and a target position is the candidate-box representation of an object in the image to be localized, i.e., the coordinates of its top-left and bottom-right corners, [x1, y1, x2, y2].
For example, the answer sequence Xa may be "person wearing a red cotton coat-[0.465,0.785,0.658,0.998]", or "car-[0.123,0.325,0.498,0.567]&car-[0.321,0.465,0.785,0.892]", or "there is no elephant in the image", which is not specifically limited in the embodiments of the present invention.
It should be noted that if there are multiple targets in the image to be localized, the multiple targets in the answer sequence are joined with "&", which seamlessly adapts to multi-object scenes without introducing additional special placeholders.
In addition, there are currently two main ways to represent the coordinates of a target position: one creates new tokens to represent each coordinate on the coordinate axes, and the other directly uses the numeric characters of natural language to represent each digit of the coordinates. According to Shikra's comparative experiments on the Referring Expression Comprehension (REC) task, the numeric-character method achieves comparable performance, whereas introducing extra tokens beyond the original vocabulary requires training from scratch and leads to slower convergence. Therefore, the embodiments of the present invention represent target positions with numeric characters, express a single target in the answer sequence as a target category followed by textualized normalized coordinates (the target position), and set the precision of the normalized coordinates to 0.001. This precision strikes a balance between reducing the localization error for small objects (pixel area smaller than 32×32) and the sequence length.
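Based on the answer format illustrated above (targets joined by "&", each written as a category name followed by normalized corner coordinates), a possible parsing helper is sketched below; the exact "category-[x1,y1,x2,y2]" layout and the regular expression are assumptions drawn from the examples, not a format fixed by the patent.

```python
import re

def parse_answer(answer: str):
    """Parse 'class-[x1,y1,x2,y2]&class-[x1,y1,x2,y2]' into (category, box) pairs.

    Coordinates are normalized to [0, 1] with 0.001 precision, as in the examples above.
    """
    targets = []
    for part in answer.split("&"):
        m = re.search(r"(.+?)-\[([\d.]+),([\d.]+),([\d.]+),([\d.]+)\]", part.strip())
        if m is None:
            continue  # e.g. a free-form reply such as "there is no elephant in the image"
        category = m.group(1).strip()
        box = [float(m.group(i)) for i in range(2, 6)]
        targets.append((category, box))
    return targets

# Example:
# parse_answer("car-[0.123,0.325,0.498,0.567]&car-[0.321,0.465,0.785,0.892]")
# -> [("car", [0.123, 0.325, 0.498, 0.567]), ("car", [0.321, 0.465, 0.785, 0.892])]
```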
Here, the large language model is trained on sample input texts, sample images to be localized, and the labeled target categories and labeled target positions in the sample images to be localized.
Specifically, the sample image to be localized is input into the initial visual encoder to extract its sample image features, and the sample image features are projected into the sample word embedding space through the initial projection layer to obtain sample image word vectors.
Then, the sample input text is tokenized to obtain sample token vectors, which are mapped into the sample word embedding space to obtain sample text word vectors.
Further, the sample image word vectors and sample text word vectors are taken as a sample image-text word vector pair and input into the initial large language model to obtain a predicted answer sequence output by the initial large language model.
Finally, a target loss is determined based on the differences between the predicted target categories in the predicted answer sequence and the labeled target categories, and between the predicted target positions in the predicted answer sequence and the labeled target positions; the parameters of the initial large language model are iterated based on the target loss, and the initial large language model after parameter iteration is taken as the large language model.
It can be understood that training the large language model on sample input texts, sample images to be localized, and the labeled target categories and labeled target positions in those images, on the one hand, does not rely on external assistance, and the sample input texts and sample images share a unified structure with LVLMs; on the other hand, it fully exploits the broad knowledge acquired by the large language model through large-scale pre-training, giving the model the ability to localize targets from text input of arbitrary granularity and further improving the accuracy and reliability of conversational target localization.
In the method provided by the embodiments of the present invention, the image to be localized is input into the visual encoder to extract its image features, the image features are projected into the word embedding space to obtain image word vectors, the input text is tokenized and mapped into the word embedding space to obtain text word vectors, and finally the image word vectors and text word vectors are input into the large language model as an image-text word vector pair to obtain an answer sequence containing the target categories and target positions in the image to be localized. Because the large language model is trained on sample input texts, sample images to be localized, and the labeled target categories and labeled target positions in those images, the method does not rely on external assistance, the training inputs share a unified structure with LVLMs, and the broad knowledge acquired through large-scale pre-training is fully exploited, which gives the method the ability to localize targets from text input of arbitrary granularity and further improves its accuracy and reliability.
Based on the above embodiment, after inputting the image-text word vector pair into the large language model to obtain the answer sequence output by the large language model in step 140, the method further includes:
Step 141: based on the image to be localized, the input text, and the answer sequence preceding a given token, determine the first conditional probabilities with which each token of the answer sequence is generated;
Step 142: based on the first conditional probabilities, determine the generation probability of the answer sequence.
Specifically, the first conditional probabilities of each token in the answer sequence can be determined based on the image to be localized, the input text, and the answer sequence preceding that token.
After the first conditional probabilities are obtained, the generation probability of the answer sequence can be determined as:

p(Xa | Xv, Xins) = ∏_{j=1}^{La} pθ(xa,j | Xv, Xins, Xa,<j)

where p(Xa|Xv,Xins) is the generation probability of the answer sequence, Xa is the answer sequence with output length La, Xv is the image to be localized, Xins is the input text, Xa,<j is the answer sequence preceding token j, and θ denotes all trained parameters of the framework.
That is, the generation probability of the answer sequence is the product, over all tokens in the sequence, of the conditional probability of generating each token given the input (the image to be localized and the input text) and the preceding answer sequence Xa,<j.
Therefore, at each generation step before the end-of-sequence token, the embodiments of the present invention can obtain the conditional probability pθ(xa,j | Xv, Xins, Xa,<j) of each generated token.
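The per-token conditional probabilities can be accumulated into the sequence generation probability, for example as in the following sketch; step_logits and generated_ids are assumed to be the logits and token ids recorded at each decoding step, an implementation detail not fixed by the patent.

```python
import torch

def sequence_probability(step_logits, generated_ids):
    """Multiply the conditional probabilities p_theta(x_{a,j} | X_v, X_ins, X_{a,<j}).

    step_logits:   list of logits, one tensor of shape (vocab_size,) per generated token
    generated_ids: list of the token ids actually emitted at each step
    """
    log_prob = 0.0
    for logits, token_id in zip(step_logits, generated_ids):
        log_p = torch.log_softmax(logits, dim=-1)[token_id]
        log_prob += log_p.item()                         # sum of log-probabilities
    return float(torch.exp(torch.tensor(log_prob)))      # product of probabilities
```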
In the related art, existing conversational localization frameworks with grounding capabilities are limited to single-target scenarios: the user cannot locate all targets of the same category at once, and localizing targets of multiple categories requires multiple inputs or calls to expert models, which reduces the accuracy and efficiency of conversational target localization.
To address this, the embodiments of the present invention propose a multi-object confidence scoring mechanism that requires no dedicated training. It expresses the credibility of each localized object and highlights the highest-quality objects among all localized objects. The framework produces its output in an autoregressive manner.
After inputting the image-text word vector pair into the large language model to obtain the answer sequence output by the large language model in step 140, the method further includes:
Step 210: based on the image to be localized, the input text, and the answer sequence preceding a given token, determine the second conditional probabilities of each target category of the image to be localized within the answer sequence;
Step 220: based on the second conditional probabilities, determine the generation probability of each target category of the image to be localized within the answer sequence.
Specifically, the second conditional probabilities of each target category in the image to be localized can be determined based on the image to be localized, the input text, and the answer sequence preceding the corresponding tokens.
After the second conditional probabilities of each target category are obtained, the generation probability of each target category in the answer sequence can be determined as:

p(R | Xv, Xins, Xa,<k) = ∏_{i=k}^{k+Lr-1} pθ(xa,i | Xv, Xins, Xa,<i)

where p(R|Xv,Xins,Xa,<k) is the generation probability of a target category in the answer sequence, Xa is the answer sequence with output length La, Xv is the image to be localized, Xins is the input text, Xa,<k is the answer sequence preceding position k, Lr is the length of the category description, k is the position in the answer sequence at which the description starts, and θ denotes all trained parameters of the framework.
That is, for the category description R of each localized object, assuming it has length Lr and starts at the k-th position of the answer sequence, its generation probability within the answer sequence is obtained from the Bayesian (chain-rule) probability formula.
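As a purely illustrative numerical example (the probabilities below are invented for exposition, not taken from the patent): if the category description R occupies Lr = 3 tokens whose conditional probabilities at positions k, k+1, and k+2 are 0.9, 0.8, and 0.95, then

```latex
p(R \mid X_v, X_{ins}, X_{a,<k})
  = \prod_{i=k}^{k+L_r-1} p_\theta(x_{a,i} \mid X_v, X_{ins}, X_{a,<i})
  = 0.9 \times 0.8 \times 0.95 \approx 0.684
```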
Based on the above embodiment, step 220 is further followed by:
Step 221: based on the image to be localized, the input text, and the answer sequence preceding a given position, determine the third conditional probabilities of the target positions of the image to be localized within the answer sequence;
Step 222: based on the third conditional probabilities, determine the generation probability of each target position of the image to be localized within the answer sequence.
Specifically, the third conditional probabilities of the target positions in the answer sequence can be determined based on the image to be localized, the input text, and the answer sequence preceding the corresponding positions.
After the third conditional probabilities of the target positions in the image to be localized are obtained, the generation probability of each target position in the answer sequence can be determined from the product of the third conditional probabilities of the coordinate tokens:

ploc = p(coord | Xv, Xins, Xa,<k), with coord = [x1, y1, x2, y2]

where ploc is the generation probability of a target position in the answer sequence, Xv is the image to be localized, Xins is the input text, Xa,<k is the answer sequence preceding position k, coord is the target position in the image to be localized, i.e., the predicted candidate box of the target, (x1, y1) is the top-left corner of the predicted candidate box, and (x2, y2) is its bottom-right corner.
Based on the above embodiment, step 222 is further followed by:
Step 2221: based on the generation probabilities of the target positions and the generation probabilities of the target categories, determine the confidence score of each target in the image to be localized.
Specifically, after the generation probabilities of the target positions and of the target categories are obtained, the confidence score of each target in the image to be localized can be determined as:

score = ( p(R | Xv, Xins, Xa,<k) · ploc )^(1/2)

where score is the confidence score of a target in the image to be localized, ploc is the generation probability of its target position in the answer sequence, and p(R|Xv,Xins,Xa,<k) is the generation probability of its target category in the answer sequence.
It should be noted that the confidence score of each target in the image to be localized is the geometric mean of the generation probability of its target position and the generation probability of its target category.
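A minimal sketch of the confidence computation and the threshold filtering described in the next paragraph, assuming the category and location generation probabilities have already been extracted from the decoded answer as above; the threshold value of 0.5 is illustrative only.

```python
import math

def confidence_score(p_category: float, p_location: float) -> float:
    """Geometric mean of the category and location generation probabilities."""
    return math.sqrt(p_category * p_location)

def filter_targets(targets, scores, threshold: float = 0.5):
    """Keep only located targets whose confidence score meets a user-chosen threshold."""
    return [(t, s) for t, s in zip(targets, scores) if s >= threshold]
```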
For localization scenarios with multiple categories and multiple objects, and to achieve more accurate and user-friendly localization, the method provided by the embodiments of the present invention proposes a training-free target confidence scoring mechanism. When multiple targets of different categories are output, the mechanism simultaneously outputs a confidence score for each target, and the user can filter all localized targets with a threshold according to actual needs. The framework breaks through the single-reference limitation of existing vision-language models with grounding capabilities, allows the user to input customized descriptions of objects of interest (input texts) at different granularities, and improves the flexibility of use: it can simultaneously output all detected objects of interest or reject non-existent ones according to the user input, reducing the deployment difficulty and resource consumption of introducing multiple models, and offering stronger generality and broader application prospects.
Based on the above embodiment, the training of the large language model includes:
Step 310: determine an initial large language model, and obtain the sample input text, the sample image to be localized, and the labeled target categories and labeled target positions in the sample image to be localized;
Step 320: input the sample image to be localized into an initial visual encoder to extract sample image features of the sample image to be localized, and project the sample image features into a sample word embedding space through an initial projection layer to obtain sample image word vectors;
Step 330: tokenize the sample input text to obtain sample token vectors, and map the sample token vectors into the sample word embedding space to obtain sample text word vectors;
Step 340: take the sample image word vectors and the sample text word vectors as a sample image-text word vector pair, and input the sample image-text word vector pair into the initial large language model to obtain a predicted answer sequence output by the initial large language model;
Step 350: determine a target loss based on the predicted target categories in the predicted answer sequence and the labeled target categories, and on the predicted target positions in the predicted answer sequence and the labeled target positions, and iterate the parameters of the initial large language model based on the target loss to obtain the large language model.
Specifically, to train the large language model, the following steps can be taken.
First, an initial large language model is determined, and the sample input texts, sample images to be localized, and the labeled target categories and labeled target positions in the sample images to be localized are collected in advance.
The parameters of the initial large language model here may be randomly generated or preset.
Then, the sample image to be localized can be input into the initial visual encoder to extract its sample image features, and the sample image features are projected into the sample word embedding space through the initial projection layer to obtain sample image word vectors.
The sample input text is then tokenized to obtain sample token vectors, which are mapped into the sample word embedding space to obtain sample text word vectors.
After the sample image word vectors and sample text word vectors are obtained, they can be taken as a sample image-text word vector pair and input into the initial large language model to obtain a predicted answer sequence output by the initial large language model.
After the predicted answer sequence is obtained from the initial large language model, it is compared with the labeled answer sequence, and the target loss is determined based on the difference between them.
It can be understood that the greater the differences between the predicted target categories in the predicted answer sequence and the labeled target categories, and between the predicted target positions and the labeled target positions, the larger the target loss.
After the target loss is obtained, the parameters of the initial large language model can be iterated based on the target loss, and the model after parameter iteration is taken as the large language model.
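A hedged sketch of one parameter-iteration step, under the assumption that the target loss is realized as the standard token-level cross-entropy between the predicted and labeled answer sequences (category tokens and textualized coordinate tokens alike); the patent only states that the loss is determined from the category and position differences, so this concrete form and the Hugging Face-style model interface are assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(llm, optimizer, inputs_embeds, label_ids):
    """One parameter-iteration step on the initial large language model."""
    logits = llm(inputs_embeds=inputs_embeds).logits          # (batch, seq_len, vocab)
    # Shift so that each position predicts the next label token of the answer sequence
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        label_ids[:, 1:].reshape(-1),
        ignore_index=-100,   # positions outside the answer (image/instruction) are masked out
    )
    loss.backward()          # gradients can also flow back into the visual encoder and projector
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```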
Based on the above embodiment, iterating the parameters of the initial large language model based on the target loss in step 350 to obtain the large language model further includes:
Step 351: back-propagate gradients from the target loss through the initial visual encoder and the initial projection layer to obtain the visual encoder and the projection layer.
Specifically, gradients can also be back-propagated from the target loss through the initial visual encoder and the initial projection layer, yielding a visual encoder and a projection layer with updated parameters.
Based on the above embodiments, FIG. 3 is a flow chart of the progressive two-stage training method provided by the present invention. As shown in FIG. 3, the framework of the embodiments of the present invention is trained with the designed progressive two-stage training method. Previous LVLMs with grounding capabilities add referring-expression-comprehension data to the image-text alignment stage and are then fine-tuned with the designed localization-related image and input-text pairs. By increasing the amount of data, this works for single-object referring localization. However, when handling multiple objects of different granularities, the model needs a basic understanding of the whole image scene and its context. For this reason, the pre-training plus input-text fine-tuning paradigm of large language models is adopted, and the following progressive two-stage training procedure is proposed, as shown in FIG. 3:
The first stage is basic-scene pre-training. It aims to enhance the multi-object perception capability of the framework and achieve localization in basic scenes. The framework is initialized from the weights of the LLaVA model, which endows it with image-text alignment and instruction understanding capabilities. Six million samples covering various localization scenarios are used for training. To construct the input text Xins, for each image to be localized Xv a template is randomly drawn from all generated task templates and combined with the label to form a training sample. In this stage, the embodiments of the present invention train the entire network framework, which preserves details in the visual features and is more beneficial for small-object localization and fine-grained discrimination.
第二阶段为全场景指令微调。为了更好适应全定位场景,在第一阶段训练得到的基础框架上进行指令微调(输入文本微调)。具体而言,在具备对任意粒度的文本输入进行定位的能力下,进一步优化框架对特定场景下用户意图的理解。本发明实施例用50万的输入文本数据(指令数据)进一步微调了框架的视觉编码器、投影层和大型语言模型,提升框架的用户意图理解能力。The second stage is full-scenario instruction fine-tuning. In order to better adapt to the full positioning scenario, instruction fine-tuning (input text fine-tuning) is performed on the basic framework trained in the first stage. Specifically, with the ability to locate text input of any granularity, the framework's understanding of user intent in specific scenarios is further optimized. The embodiment of the present invention further fine-tunes the framework's visual encoder, projection layer, and large language model with 500,000 input text data (instruction data) to enhance the framework's ability to understand user intent.
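The two stages described above can be summarized, purely as an illustrative sketch, by the following schedule. The sample counts are those stated in the description, while the template wording, field names and data structures are hypothetical placeholders used only to show how the input text Xins might be assembled from a randomly drawn task template and a label.

```python
import random

# Hypothetical summary of the progressive two-stage schedule described above.
STAGES = [
    {   # Stage 1: basic-scene pre-training on ~6,000,000 localization samples.
        "name": "basic_scene_pretraining",
        "num_samples": 6_000_000,
        "trainable": ["visual_encoder", "projection", "llm"],   # the whole framework
    },
    {   # Stage 2: full-scenario instruction (input-text) fine-tuning on ~500,000 samples.
        "name": "full_scenario_instruction_tuning",
        "num_samples": 500_000,
        "trainable": ["visual_encoder", "projection", "llm"],
    },
]

# Example task templates for building the input text X_ins; the wording is illustrative only.
TASK_TEMPLATES = [
    "Please locate all {category} in the image.",
    "Find every object described by: {expression}.",
    "Where is the {expression}?",
]

def build_training_sample(image, label):
    """Randomly draw one task template and combine it with the label to form a training
    sample; label["fields"] must supply every field the templates use."""
    template = random.choice(TASK_TEMPLATES)
    x_ins = template.format(**label["fields"])
    return {"image": image, "input_text": x_ins, "answer": label["answer_sequence"]}

# Usage example with hypothetical label fields.
sample = build_training_sample(
    image="example.jpg",
    label={"fields": {"category": "cars", "expression": "the red car on the left"},
           "answer_sequence": "car: [0.120, 0.450, 0.380, 0.720]"})
```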
基于上述任一实施例,本发明提供一种基于任意粒度文本输入的对话式目标定位方法,步骤如下:Based on any of the above embodiments, the present invention provides a conversational target location method based on arbitrary granularity text input, the steps are as follows:
第一步,获取待定位图像和输入文本。The first step is to obtain the image to be located and the input text.
第二步,将待定位图像输入视觉编码器中提取待定位图像的图像特征,并将图像特征投影至词嵌入空间中,得到图像词向量。In the second step, the image to be located is input into the visual encoder to extract the image features of the image to be located, and the image features are projected into the word embedding space to obtain the image word vector.
第三步,对输入文本进行分词化得到分词向量,将分词向量映射至所述词嵌入空间中,得到文本词向量。The third step is to segment the input text to obtain a segmentation vector, and map the segmentation vector to the word embedding space to obtain a text word vector.
第四步,将图像词向量和文本词向量作为图像文本词向量对,并将图像文本词向量对输入至大型语言模型中,得到大型语言模型输出的回答序列,其中,回答序列包括待定位图像中的目标类别和目标位置。In the fourth step, the image word vector and the text word vector are used as an image-text word vector pair, and the image-text word vector pair is input into a large language model to obtain an answer sequence output by the large language model, wherein the answer sequence includes the target category and target position in the image to be located.
第五步,基于待定位图像、输入文本,以及预设分词前的回答序列,确定回答序列中各分词产生的各第一条件概率,再基于各第一条件概率,确定回答序列的生成概率。The fifth step is to determine the first conditional probabilities of each word segmentation in the answer sequence based on the image to be located, the input text, and the preset answer sequence before word segmentation, and then determine the generation probability of the answer sequence based on each first conditional probability.
第六步,基于待定位图像、输入文本,以及预设分词前的回答序列,确定回答序列中待定位图像中各目标类别的各第二条件概率,再基于各第二条件概率,确定回答序列中待定位图像中各目标类别的生成概率。The sixth step is to determine the second conditional probabilities of each target category in the image to be located in the answer sequence based on the image to be located, the input text, and the preset answer sequence before word segmentation, and then determine the generation probability of each target category in the image to be located in the answer sequence based on each second conditional probability.
第七步,基于待定位图像、输入文本,以及预设位置前的回答序列,确定回答序列中所述待定位图像中目标位置的各第三条件概率,再基于各第三条件概率,确定回答序列中待定位图像中各目标位置的生成概率。The seventh step is to determine the third conditional probabilities of the target positions in the image to be located in the answer sequence based on the image to be located, the input text, and the answer sequence before the preset position, and then determine the generation probability of each target position in the image to be located in the answer sequence based on each third conditional probability.
第八步,基于各目标位置的生成概率,以及各目标类别的生成概率,确定待定位图像中各目标的置信度评分。In the eighth step, based on the generation probability of each target position and the generation probability of each target category, the confidence score of each target in the image to be located is determined.
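Purely as an illustrative sketch of the eight steps above, and assuming Hugging Face-style visual_encoder, projection, llm and tokenizer objects, the inference flow could look like the following. For brevity the sketch aggregates the conditional probabilities over the whole answer instead of per predicted object, and the digit-based token split and geometric-mean combination are placeholder choices, not the scoring rule defined by the embodiment.

```python
import math
import torch

@torch.no_grad()
def localize(visual_encoder, projection, llm, tokenizer, image, input_text):
    """Hypothetical end-to-end inference sketch for the eight steps above."""
    # Steps 1-3: image word vectors and text word vectors in the shared word-embedding space.
    image_vecs = projection(visual_encoder(image))
    text_ids = tokenizer(input_text, return_tensors="pt").input_ids
    text_vecs = llm.get_input_embeddings()(text_ids)
    prefix = torch.cat([image_vecs, text_vecs], dim=1)

    # Step 4: the LLM generates the answer sequence (categories and coordinates as plain text).
    out = llm.generate(inputs_embeds=prefix, max_new_tokens=256,
                       output_scores=True, return_dict_in_generate=True)
    gen_ids = out.sequences[0][-len(out.scores):]           # the generated answer tokens

    # Step 5: first conditional probability of each generated token; their sum in log space
    # is the (log) generation probability of the whole answer sequence.
    logprobs = [torch.log_softmax(s[0], dim=-1)[t] for s, t in zip(out.scores, gen_ids)]
    sequence_logprob = float(sum(logprobs))

    # Steps 6-7: split the per-token probabilities into category tokens and position tokens.
    # The digit/bracket heuristic below is only a stand-in for a real answer parser.
    tokens = tokenizer.convert_ids_to_tokens(gen_ids.tolist())
    is_position = [any(ch.isdigit() or ch in "[],." for ch in tok) for tok in tokens]
    category_logprob = float(sum(lp for lp, pos in zip(logprobs, is_position) if not pos))
    position_logprob = float(sum(lp for lp, pos in zip(logprobs, is_position) if pos))

    # Step 8: combine the category and position generation probabilities into a confidence
    # score; a geometric mean is used here purely as one plausible choice.
    confidence = math.exp(0.5 * (category_logprob + position_logprob))

    return {"answer_text": tokenizer.decode(gen_ids, skip_special_tokens=True),
            "sequence_logprob": sequence_logprob, "confidence": confidence}
```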
其中,大型语言模型的训练步骤,包括:Among them, the training steps of the large language model include:
S1、确定初始大型语言模型,并获取样本输入文本、样本待定位图像、样本待定位图像中的标签目标类别和标签目标位置。S1. Determine an initial large language model and obtain sample input text, sample images to be located, label target categories and label target positions in the sample images to be located.
S2、将样本待定位图像输入初始视觉编码器中提取样本待定位图像的样本图像特征,并基于初始投影层将样本图像特征投影至样本词嵌入空间中,得到样本图像词向量。S2. Input the sample image to be located into the initial visual encoder to extract the sample image features of the sample image to be located, and project the sample image features into the sample word embedding space based on the initial projection layer to obtain the sample image word vector.
S3、对样本输入文本进行分词化得到样本分词向量,将样本分词向量映射至样本词嵌入空间中,得到样本文本词向量。S3. Segment the sample input text to obtain a sample segmentation vector, map the sample segmentation vector to the sample word embedding space, and obtain the sample text word vector.
S4、将样本图像词向量和样本文本词向量作为样本图像文本词向量对,并将样本图像文本词向量对输入至初始大型语言模型中,得到初始大型语言模型输出的预测回答序列。S4. Use the sample image word vector and the sample text word vector as a sample image-text word vector pair, and input the sample image-text word vector pair into an initial large language model to obtain a predicted answer sequence output by the initial large language model.
S5、基于预测回答序列中的预测目标类别和标签目标类别,以及预测回答序列中的预测目标位置和标签目标位置,确定目标损失,并基于目标损失对初始大型语言模型进行参数迭代,得到大型语言模型。S5. Determine the target loss based on the predicted target category and the label target category in the predicted answer sequence, as well as the predicted target position and the label target position in the predicted answer sequence, and iterate the parameters of the initial large language model based on the target loss to obtain the large language model.
S6、基于目标损失对初始视觉编码器和初始投影层进行梯度反传,得到视觉编码器和投影层。S6. Perform gradient backpropagation on the initial visual encoder and the initial projection layer based on the target loss to obtain the visual encoder and the projection layer.
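Steps S1 to S6 presuppose that the label target categories and label target positions have been serialized into a label answer sequence of plain text. The sketch below shows one plausible serialization with normalized coordinates and no special placeholder tokens; the exact textual format used by the embodiment may differ.

```python
def build_label_answer(objects, image_width, image_height):
    """Serialize label target categories and positions into a plain-text answer sequence.
    'objects' is a list of dicts such as {"category": "dog", "box": (x1, y1, x2, y2)};
    coordinates are normalized to [0, 1] and written as ordinary text tokens."""
    parts = []
    for obj in objects:
        x1, y1, x2, y2 = obj["box"]
        norm = (x1 / image_width, y1 / image_height, x2 / image_width, y2 / image_height)
        parts.append(f"{obj['category']}: "
                     f"[{norm[0]:.3f}, {norm[1]:.3f}, {norm[2]:.3f}, {norm[3]:.3f}]")
    # All objects are emitted in one conversational sequence, with no special placeholder tokens.
    return "; ".join(parts)

# Example: two labeled objects in a 640x480 sample image.
label_answer = build_label_answer(
    [{"category": "person", "box": (32, 48, 320, 460)},
     {"category": "dog", "box": (350, 200, 630, 470)}],
    image_width=640, image_height=480)
```

Writing the coordinates as ordinary text tokens is what allows the same next-token cross-entropy loss to supervise both the category part and the position part of the answer sequence.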
本发明实施例提供的方法，提供一种针对任意粒度文本输入的对话式目标定位框架，使得模型具备灵活理解用户输入不同粒度的物体指代描述和精准定位的能力。本发明实施例参考了自然语言处理中的序列建模方法，将不同粒度的定位任务重新定义为基于输入文本的感兴趣物体位置描述生成问题。该对话式目标定位框架将所有目标定位场景统一为输入文本-所有物体坐标的对话式序列生成任务，且不引入任何的特殊占位符，同时为多类别目标构建了一种置信度评分机制，通过形式上的统一使得在任意文本输入下的不同粒度的定位任务可以用同一个模型进行处理。The method provided by the embodiment of the present invention offers a conversational target positioning framework for text input of arbitrary granularity, so that the model can flexibly understand object reference descriptions of different granularities entered by the user and locate the corresponding objects accurately. The embodiment of the present invention draws on sequence modeling methods from natural language processing and redefines positioning tasks of different granularities as the problem of generating location descriptions of the objects of interest based on the input text. The conversational target positioning framework unifies all target positioning scenarios into a conversational sequence generation task from input text to the coordinates of all objects, without introducing any special placeholders, and at the same time constructs a confidence scoring mechanism for multi-category targets. Through this formal unification, positioning tasks of different granularities under arbitrary text input can be handled by a single model.
下面对本发明提供的基于任意粒度文本输入的对话式目标定位装置进行描述,下文描述的基于任意粒度文本输入的对话式目标定位装置与上文描述的基于任意粒度文本输入的对话式目标定位方法可相互对应参照。The following is a description of the conversational target positioning device based on arbitrary granularity text input provided by the present invention. The conversational target positioning device based on arbitrary granularity text input described below and the conversational target positioning method based on arbitrary granularity text input described above can be referenced to each other.
基于上述任一实施例,本发明提供一种基于任意粒度文本输入的对话式目标定位装置,图4是本发明提供的基于任意粒度文本输入的对话式目标定位装置的结构示意图,如图4所示,该装置包括:Based on any of the above embodiments, the present invention provides a conversational target positioning device based on arbitrary granularity text input. FIG4 is a structural schematic diagram of the conversational target positioning device based on arbitrary granularity text input provided by the present invention. As shown in FIG4, the device includes:
获取单元410,用于获取待定位图像和输入文本;An acquisition unit 410 is used to acquire an image to be located and input text;
投影单元420,用于将所述待定位图像输入视觉编码器中提取所述待定位图像的图像特征,并将所述图像特征投影至词嵌入空间中,得到图像词向量;A projection unit 420 is used to input the image to be located into a visual encoder to extract image features of the image to be located, and project the image features into a word embedding space to obtain an image word vector;
映射单元430,用于对所述输入文本进行分词化得到分词向量,将所述分词向量映射至所述词嵌入空间中,得到文本词向量;A mapping unit 430 is used to segment the input text to obtain a segmentation vector, and map the segmentation vector to the word embedding space to obtain a text word vector;
输入单元440,用于将所述图像词向量和所述文本词向量作为图像文本词向量对,并将所述图像文本词向量对输入至大型语言模型中,得到所述大型语言模型输出的回答序列;所述回答序列包括所述待定位图像中的目标类别和目标位置;An input unit 440 is used to use the image word vector and the text word vector as an image-text word vector pair, and input the image-text word vector pair into a large language model to obtain an answer sequence output by the large language model; the answer sequence includes the target category and target position in the image to be located;
所述大型语言模型是基于样本输入文本、样本待定位图像、所述样本待定位图像中的标签目标类别和标签目标位置训练得到的。The large language model is trained based on sample input text, sample images to be located, and label target categories and label target positions in the sample images to be located.
本发明实施例提供的装置,将待定位图像输入视觉编码器中提取待定位图像的图像特征,并将图像特征投影至词嵌入空间中,得到图像词向量,对输入文本进行分词化得到分词向量,将分词向量映射至词嵌入空间中,得到文本词向量,最后,将图像词向量和文本词向量作为图像文本词向量对,并将图像文本词向量对输入至大型语言模型中,得到大型语言模型输出的回答序列,回答序列包括待定位图像中的目标类别和目标位置。大型语言模型是基于样本输入文本、样本待定位图像、样本待定位图像中的标签目标类别和标签目标位置训练得到的,一方面,不依赖于外部辅助,且样本输入文本和样本待定位图像与图文大模型享有统一的结构;另一方面,充分利用大型语言模型通过大规模预训练获得的广泛知识,从而具备对任意粒度的文本输入进行定位的能力,进一步提高了对话式目标定位方法的准确性和可靠性。The device provided by the embodiment of the present invention inputs the image to be located into the visual encoder to extract the image features of the image to be located, and projects the image features into the word embedding space to obtain the image word vector, segment the input text to obtain the word segmentation vector, and map the word segmentation vector into the word embedding space to obtain the text word vector. Finally, the image word vector and the text word vector are used as an image-text word vector pair, and the image-text word vector pair is input into the large language model to obtain the answer sequence output by the large language model, and the answer sequence includes the target category and target position in the image to be located. The large language model is trained based on sample input text, sample image to be located, and the label target category and label target position in the sample image to be located. On the one hand, it does not rely on external assistance, and the sample input text and the sample image to be located have a unified structure with the large image-text model; on the other hand, it makes full use of the extensive knowledge obtained by the large language model through large-scale pre-training, so as to have the ability to locate text input of any granularity, further improving the accuracy and reliability of the conversational target location method.
基于上述任一实施例,还包括确定回答序列的生成概率单元,确定回答序列的生成概率单元,具体用于:Based on any of the above embodiments, the method further includes a unit for determining a generation probability of an answer sequence, wherein the unit for determining a generation probability of an answer sequence is specifically configured to:
基于所述待定位图像、所述输入文本,以及预设分词前的回答序列,确定所述回答序列中各分词产生的各第一条件概率;Based on the image to be located, the input text, and a preset answer sequence before word segmentation, determining each first conditional probability of each word segmentation in the answer sequence;
基于所述各第一条件概率,确定所述回答序列的生成概率。Based on the first conditional probabilities, a generation probability of the answer sequence is determined.
基于上述任一实施例,还包括确定各目标类别的生成概率单元,确定各目标类别的生成概率单元,具体包括:Based on any of the above embodiments, it further includes a generation probability unit for determining each target category, and the generation probability unit for determining each target category specifically includes:
确定各第二条件概率单元,用于基于所述待定位图像、所述输入文本,以及预设分词前的回答序列,确定所述回答序列中所述待定位图像中各目标类别的各第二条件概率;Determine each second conditional probability unit, which is used to determine each second conditional probability of each target category in the image to be located in the answer sequence based on the image to be located, the input text, and the preset answer sequence before word segmentation;
确定第二生成概率单元,用于基于所述各第二条件概率,确定所述回答序列中所述待定位图像中各目标类别的生成概率。A second generation probability determination unit is used to determine the generation probability of each target category in the image to be located in the answer sequence based on the second conditional probabilities.
基于上述任一实施例,还包括确定各目标位置的生成概率单元,确定各目标位置的生成概率单元,具体包括:Based on any of the above embodiments, it further includes a unit for determining a generation probability of each target position, and the unit for determining a generation probability of each target position specifically includes:
确定各第三条件概率单元,用于基于所述待定位图像、所述输入文本,以及预设位置前的回答序列,确定所述回答序列中所述待定位图像中目标位置的各第三条件概率;Determine each third conditional probability unit, used for determining each third conditional probability of the target position in the image to be located in the answer sequence based on the image to be located, the input text, and the answer sequence before the preset position;
确定各目标位置的生成概率子单元,用于基于所述各第三条件概率,确定所述回答序列中所述待定位图像中各目标位置的生成概率。The subunit for determining the generation probability of each target position is used to determine the generation probability of each target position in the image to be located in the answer sequence based on the third conditional probabilities.
基于上述任一实施例,还包括确定各目标的置信度评分单元,确定各目标的置信度评分单元,具体用于:Based on any of the above embodiments, it further includes a confidence scoring unit for determining each target, wherein the confidence scoring unit for determining each target is specifically used to:
基于所述各目标位置的生成概率,以及所述各目标类别的生成概率,确定所述待定位图像中各目标的置信度评分。Based on the generation probability of each target position and the generation probability of each target category, a confidence score of each target in the image to be located is determined.
基于上述任一实施例,还包括训练单元,训练单元具体用于:Based on any of the above embodiments, a training unit is further included, and the training unit is specifically used for:
确定初始大型语言模型,并获取所述样本输入文本、所述样本待定位图像、所述样本待定位图像中的标签目标类别和标签目标位置;Determine an initial large language model, and obtain the sample input text, the sample image to be located, and the label target category and label target position in the sample image to be located;
将所述样本待定位图像输入初始视觉编码器中提取所述样本待定位图像的样本图像特征,并基于初始投影层将所述样本图像特征投影至样本词嵌入空间中,得到样本图像词向量;Input the sample image to be located into the initial visual encoder to extract sample image features of the sample image to be located, and project the sample image features into the sample word embedding space based on the initial projection layer to obtain a sample image word vector;
对所述样本输入文本进行分词化得到样本分词向量,将所述样本分词向量映射至所述样本词嵌入空间中,得到样本文本词向量;Segmenting the sample input text to obtain a sample segmentation vector, and mapping the sample segmentation vector to the sample word embedding space to obtain a sample text word vector;
将所述样本图像词向量和所述样本文本词向量作为样本图像文本词向量对,并将所述样本图像文本词向量对输入至初始大型语言模型中,得到所述初始大型语言模型输出的预测回答序列;The sample image word vector and the sample text word vector are used as a sample image text word vector pair, and the sample image text word vector pair is input into an initial large language model to obtain a predicted answer sequence output by the initial large language model;
基于所述预测回答序列中的预测目标类别和所述标签目标类别,以及所述预测回答序列中的预测目标位置和所述标签目标位置,确定目标损失,并基于所述目标损失对所述初始大型语言模型进行参数迭代,得到所述大型语言模型。Based on the predicted target category and the label target category in the predicted answer sequence, as well as the predicted target position and the label target position in the predicted answer sequence, a target loss is determined, and parameters of the initial large language model are iterated based on the target loss to obtain the large language model.
基于上述任一实施例,还包括梯度反传单元,梯度反传单元具体用于:Based on any of the above embodiments, a gradient back propagation unit is further included, and the gradient back propagation unit is specifically used for:
基于所述目标损失对所述初始视觉编码器和所述初始投影层进行梯度反传,得到所述视觉编码器和投影层。Performing gradient back propagation on the initial visual encoder and the initial projection layer based on the target loss to obtain the visual encoder and the projection layer.
图5示例了一种电子设备的实体结构示意图,如图5所示,该电子设备可以包括:处理器(processor)510、通信接口(Communications Interface)520、存储器(memory)530和通信总线540,其中,处理器510,通信接口520,存储器530通过通信总线540完成相互间的通信。处理器510可以调用存储器530中的逻辑指令,以执行基于任意粒度文本输入的对话式目标定位方法,该方法包括:获取待定位图像和输入文本;将所述待定位图像输入视觉编码器中提取所述待定位图像的图像特征,并将所述图像特征投影至词嵌入空间中,得到图像词向量;对所述输入文本进行分词化得到分词向量,将所述分词向量映射至所述词嵌入空间中,得到文本词向量;将所述图像词向量和所述文本词向量作为图像文本词向量对,并将所述图像文本词向量对输入至大型语言模型中,得到所述大型语言模型输出的回答序列;所述回答序列包括所述待定位图像中的目标类别和目标位置;所述大型语言模型是基于样本输入文本、样本待定位图像、所述样本待定位图像中的标签目标类别和标签目标位置训练得到的。Figure 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 5, the electronic device may include: a processor (processor) 510, a communication interface (Communications Interface) 520, a memory (memory) 530 and a communication bus 540, wherein the processor 510, the communication interface 520, and the memory 530 communicate with each other through the communication bus 540. The processor 510 can call the logic instructions in the memory 530 to execute a conversational target localization method based on arbitrary granularity text input, the method comprising: obtaining an image to be located and an input text; inputting the image to be located into a visual encoder to extract image features of the image to be located, and projecting the image features into a word embedding space to obtain an image word vector; segmenting the input text to obtain a segmentation vector, mapping the segmentation vector to the word embedding space to obtain a text word vector; using the image word vector and the text word vector as an image-text word vector pair, and inputting the image-text word vector pair into a large language model to obtain an answer sequence output by the large language model; the answer sequence includes a target category and a target position in the image to be located; the large language model is trained based on sample input text, sample image to be located, and labeled target categories and labeled target positions in the sample image to be located.
此外,上述的存储器530中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。In addition, the logic instructions in the above-mentioned memory 530 can be implemented in the form of a software functional unit and can be stored in a computer-readable storage medium when it is sold or used as an independent product. Based on such an understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art or the part of the technical solution, can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including a number of instructions for a computer device (which can be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in each embodiment of the present invention. The aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), disk or optical disk, etc. Various media that can store program codes.
另一方面,本发明还提供一种计算机程序产品,所述计算机程序产品包括计算机程序,计算机程序可存储在非暂态计算机可读存储介质上,所述计算机程序被处理器执行时,计算机能够执行上述各方法所提供的基于任意粒度文本输入的对话式目标定位方法,该方法包括:获取待定位图像和输入文本;将所述待定位图像输入视觉编码器中提取所述待定位图像的图像特征,并将所述图像特征投影至词嵌入空间中,得到图像词向量;对所述输入文本进行分词化得到分词向量,将所述分词向量映射至所述词嵌入空间中,得到文本词向量;将所述图像词向量和所述文本词向量作为图像文本词向量对,并将所述图像文本词向量对输入至大型语言模型中,得到所述大型语言模型输出的回答序列;所述回答序列包括所述待定位图像中的目标类别和目标位置;所述大型语言模型是基于样本输入文本、样本待定位图像、所述样本待定位图像中的标签目标类别和标签目标位置训练得到的。On the other hand, the present invention also provides a computer program product, which includes a computer program, which can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the conversational target localization method based on arbitrary granularity text input provided by the above methods, the method including: obtaining an image to be located and an input text; inputting the image to be located into a visual encoder to extract image features of the image to be located, and projecting the image features into a word embedding space to obtain an image word vector; segmenting the input text to obtain a segmentation vector, mapping the segmentation vector to the word embedding space to obtain a text word vector; using the image word vector and the text word vector as an image-text word vector pair, and inputting the image-text word vector pair into a large language model to obtain an answer sequence output by the large language model; the answer sequence includes a target category and a target position in the image to be located; the large language model is trained based on sample input text, sample image to be located, and labeled target categories and labeled target positions in the sample image to be located.
又一方面,本发明还提供一种非暂态计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现以执行上述各方法提供的基于任意粒度文本输入的对话式目标定位方法,该方法包括:获取待定位图像和输入文本;将所述待定位图像输入视觉编码器中提取所述待定位图像的图像特征,并将所述图像特征投影至词嵌入空间中,得到图像词向量;对所述输入文本进行分词化得到分词向量,将所述分词向量映射至所述词嵌入空间中,得到文本词向量;将所述图像词向量和所述文本词向量作为图像文本词向量对,并将所述图像文本词向量对输入至大型语言模型中,得到所述大型语言模型输出的回答序列;所述回答序列包括所述待定位图像中的目标类别和目标位置;所述大型语言模型是基于样本输入文本、样本待定位图像、所述样本待定位图像中的标签目标类别和标签目标位置训练得到的。On the other hand, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, is implemented to execute the conversational target localization method based on arbitrary granularity text input provided by the above-mentioned methods, the method comprising: obtaining an image to be located and an input text; inputting the image to be located into a visual encoder to extract image features of the image to be located, and projecting the image features into a word embedding space to obtain an image word vector; segmenting the input text to obtain a segmentation vector, mapping the segmentation vector to the word embedding space to obtain a text word vector; using the image word vector and the text word vector as an image-text word vector pair, and inputting the image-text word vector pair into a large language model to obtain an answer sequence output by the large language model; the answer sequence includes a target category and a target position in the image to be located; the large language model is trained based on sample input text, sample image to be located, and labeled target categories and labeled target positions in the sample image to be located.
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。The device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. Those of ordinary skill in the art may understand and implement it without creative work.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。Through the description of the above implementation methods, those skilled in the art can clearly understand that each implementation method can be implemented by means of software plus a necessary general hardware platform, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solution is essentially or the part that contributes to the prior art can be embodied in the form of a software product, and the computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a disk, an optical disk, etc., including a number of instructions for a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in each embodiment or some parts of the embodiments.
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit it. Although the present invention has been described in detail with reference to the aforementioned embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the aforementioned embodiments, or make equivalent replacements for some of the technical features therein. However, these modifications or replacements do not deviate the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410250372.3A CN118350464A (en) | 2024-03-05 | 2024-03-05 | Conversational target positioning method and device based on arbitrary granularity text input |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118350464A true CN118350464A (en) | 2024-07-16 |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113792112A (en) * | 2020-07-31 | 2021-12-14 | 北京京东尚科信息技术有限公司 | Visual language task processing system, training method, device, equipment and medium |
CN117423108A (en) * | 2023-09-28 | 2024-01-19 | 中国科学院自动化研究所 | Image fine granularity description method and system for instruction fine adjustment multi-mode large model |
Non-Patent Citations (1)
Title |
---|
YUFEI ZHAN et al.: "Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models", arXiv, 27 November 2023 (2023-11-27), pages 1-20 *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118568289A (en) * | 2024-08-02 | 2024-08-30 | 杭州海康威视数字技术股份有限公司 | Target positioning method and related equipment thereof |
CN118570481A (en) * | 2024-08-05 | 2024-08-30 | 中国科学院自动化研究所 | Generative coreference segmentation method and device based on implicit structural features |
CN118570481B (en) * | 2024-08-05 | 2024-12-06 | 中国科学院自动化研究所 | Generative coreference segmentation method and device based on implicit structural features |
CN118941872A (en) * | 2024-08-08 | 2024-11-12 | 中国医科大学 | Feature detection and classification method of region of interest in medical videos based on large language model |
CN118861214A (en) * | 2024-09-26 | 2024-10-29 | 鹏城实验室 | Visual language model training method, text generation method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||