CN117333886A - Method, device, electronic equipment and storage medium for matching rule text for an image


Info

Publication number
CN117333886A
CN117333886A
Authority
CN
China
Prior art keywords
text
rule
sample
image
matched
Prior art date
Legal status
Pending
Application number
CN202311485444.4A
Other languages
Chinese (zh)
Inventor
刘一廷
何泽文
汪翔
李亮
Current Assignee
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Shenzhen Tencent Computer Systems Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Tencent Computer Systems Co Ltd
Priority to CN202311485444.4A
Publication of CN117333886A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/19 Recognition using electronic means
    • G06V 30/19007 Matching; Proximity measures
    • G06V 30/19093 Proximity measures, i.e. similarity or distance measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The application discloses a method, a device, electronic equipment and a storage medium for matching rule text for images. The embodiments of the application relate to the technical fields of artificial intelligence, cloud technology, intelligent transportation, assisted driving and the like. The method comprises the following steps: determining a description text of an image to be matched according to the image content of the image to be matched; determining, from a plurality of preset rule texts, at least one candidate rule text whose matching degree with the description text is higher than a matching degree threshold; generating a question text according to the description text, the at least one candidate rule text and a preset prompt text; generating a reply text for the question text according to the question text and the image to be matched; and determining, from the at least one candidate rule text, a rule text matched with the image to be matched according to the reply text. By the method, rule text can be matched to the image to be matched with high accuracy.

Description

Method, device, electronic equipment and storage medium for matching rule text for an image
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to a method, an apparatus, an electronic device, and a storage medium for matching rule text for an image.
Background
Rule matching refers to selecting the rule in a rule base that is closest to a target text. At present, the rule closest to the target text can be selected from the rule base according to the similarity between the target text and each preset rule in the rule base. However, rule matching for images is difficult to achieve with existing means.
Disclosure of Invention
In view of this, the embodiments of the present application provide a method, an apparatus, an electronic device, and a storage medium for matching rule text for an image.
In a first aspect, an embodiment of the present application provides a method for matching rule text for an image, where the method includes: determining a description text of an image to be matched according to the image content of the image to be matched; determining, from a plurality of preset rule texts, at least one candidate rule text whose matching degree with the description text is higher than a matching degree threshold; generating a question text according to the description text, the at least one candidate rule text and a preset prompt text, where the prompt text is used to prompt screening of the rule text most relevant to the image to be matched from the at least one candidate rule text; generating a reply text for the question text according to the question text and the image to be matched; and determining, from the at least one candidate rule text, a rule text matched with the image to be matched according to the reply text.
In a second aspect, an embodiment of the present application provides an apparatus for matching rule text for an image, where the apparatus includes: a text determining module, configured to determine a description text of an image to be matched according to the image content of the image to be matched; a selection module, configured to determine, from a plurality of preset rule texts, at least one candidate rule text whose matching degree with the description text is higher than a matching degree threshold; a question text generation module, configured to generate a question text according to the description text, the at least one candidate rule text and a preset prompt text, where the prompt text is used to prompt screening of the rule text most relevant to the image to be matched from the at least one candidate rule text; a reply text generation module, configured to generate a reply text for the question text according to the question text and the image to be matched; and a rule matching module, configured to determine, from the at least one candidate rule text, a rule text matched with the image to be matched according to the reply text.
Optionally, the reply text comprises a candidate matching rule text determined for the image to be matched and an interpretation text output for the candidate matching rule text; the rule matching module is also used for scoring based on the candidate matching rule text and the interpretation text through the evaluation model to obtain the prediction score of the reply text; and if the prediction score reaches a scoring threshold, the candidate matching rule text is used as the rule text matched with the image to be matched.
Optionally, the reply text of the question text is generated according to the question text and the image to be matched through a multi-modal large language model; the rule matching module is further configured to generate an error prompt text if the prediction score does not reach the scoring threshold, where the error prompt text is used to prompt that the candidate matching rule text does not match the image to be matched; input the error prompt text, the question text and the image to be matched into the multi-modal large language model to obtain a new reply text output by the multi-modal large language model; and return to the step of scoring based on the candidate matching rule text and the interpretation text through the evaluation model to obtain the prediction score of the reply text, until a reply text whose prediction score reaches the scoring threshold is obtained, and take the candidate matching rule text in the reply text whose prediction score reaches the scoring threshold as the rule text matched with the image to be matched.
Optionally, the device further includes a training module, configured to obtain a sample reply text and a sample label corresponding to the sample reply text, where the sample reply text includes a sample candidate rule text and a sample interpretation text for the sample candidate rule text; the sample label is used for indicating whether the sample candidate rule text and the sample interpretation text are matched; scoring based on the sample candidate rule text and the sample interpretation text through an initial evaluation model to obtain a prediction score of the sample reply text; and adjusting parameters of the initial evaluation model according to the prediction scores of the sample reply texts and the sample labels to obtain the evaluation model.
Optionally, the selection module is further configured to encode the description text to obtain an encoding result of the description text; coding each preset rule text to obtain a coding result of each preset rule text; determining a plurality of initial candidate rule texts from a plurality of preset rule texts according to the similarity between the coding result of the descriptive text and the coding result of each preset rule text; carrying out joint coding on the description text and each initial candidate rule text to obtain a joint coding result; and determining at least one candidate rule text with the matching degree higher than a matching degree threshold value between the description text and the plurality of initial candidate rule texts according to the joint coding result of the description text and each initial candidate rule text.
Optionally, the coding result of the descriptive text is obtained by coding through a first text coder, and the coding result of the preset rule text is obtained by coding through a second text coder; the training module is further used for encoding a first sample description text corresponding to the first sample image through a pre-trained first initial text encoder to obtain an encoding result corresponding to the first sample description text; encoding each first sample rule text through a pre-trained second initial text encoder to obtain an encoding result of each first sample rule text; determining a loss value corresponding to the first sample image according to the similarity between the coding result of the first sample description text and the coding result of each first sample rule text; and adjusting parameters of the first initial text encoder and the second initial text encoder according to the loss value corresponding to the first sample image to obtain the first text encoder corresponding to the first initial text encoder and the second text encoder corresponding to the second initial text encoder.
Optionally, the plurality of first sample rule texts include positive sample rule texts and negative sample rule texts corresponding to the first sample images; the training module is further configured to determine a loss value corresponding to the first sample image according to a similarity between the encoding result of the first sample description text and the encoding result of the positive sample regular text, and a similarity between the encoding result of the first sample description text and the encoding result of the negative sample regular text.
Optionally, the selection module is further configured to splice the description text and each initial candidate rule text to obtain a spliced text corresponding to each initial candidate rule text; and inputting the spliced text into a third text encoder to obtain a joint encoding result of the description text and the initial candidate rule text in the spliced text.
Optionally, the training module is further configured to splice a second sample description text corresponding to the second sample image with each second sample rule text, so as to obtain a sample spliced text corresponding to each second sample rule text; inputting the sample spliced text corresponding to each second sample rule text into a pre-trained third initial text encoder to obtain a coding result corresponding to each sample spliced text; determining the matching degree between a second sample rule text and a second sample description text in the sample spliced text according to the coding result corresponding to the sample spliced text; determining a loss value corresponding to the second sample image according to the matching degree between the second sample rule text and the second sample description text in each sample spliced text; and adjusting parameters of the third initial text encoder through the loss value corresponding to the second sample image to obtain the third text encoder.
Optionally, the text determining module is further configured to perform visual content understanding on the image to be matched through a multi-modal large language model to obtain a first sub-description text corresponding to the image to be matched; perform text extraction on the image to be matched to obtain a second sub-description text corresponding to the image to be matched; and obtain the description text corresponding to the image to be matched according to the first sub-description text and the second sub-description text.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory; the memory has stored thereon computer readable instructions which, when executed by the processor, implement the method described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having computer-readable instructions stored thereon, which when executed by a processor, implement the above-described method.
In a fifth aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions which, when executed by a processor, implement the above-described method.
In the method, the device, the electronic equipment and the storage medium for matching rule text for images, the description text of the image to be matched is determined according to the image content of the image to be matched, at least one candidate rule text whose matching degree with the description text is higher than a matching degree threshold is determined from a plurality of preset rule texts, and a question text is then generated according to the description text, the at least one candidate rule text and a preset prompt text, where the prompt text is used to prompt screening of the rule text most relevant to the image to be matched from the at least one candidate rule text. A reply text for the question text is generated according to the question text and the image to be matched, and the rule text matched with the image to be matched is determined according to the reply text. Because the description text can accurately describe the image content of the image to be matched, and the similarity between the candidate rule text and the description text is high, the matching degree between the candidate rule text and the image content of the image to be matched is also high. The question text is then constructed from the candidate rule texts that match the image content well, the reply text is determined from the question text, and the rule text most relevant to the image to be matched is further determined from the reply text, so that the selected rule text is the candidate rule text with the highest matching degree with the image content of the image to be matched. In this way, rule text is matched to the image to be matched with high accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a schematic diagram of an application scenario applicable to an embodiment of the present application;
FIG. 2 illustrates a flow chart of a method for matching regular text for an image in accordance with one embodiment of the present application;
FIG. 3 is a flow chart of step S150 of the corresponding embodiment of FIG. 2 in one embodiment;
FIG. 4 is a flow chart of step S120 of the corresponding embodiment of FIG. 2 in one embodiment;
FIG. 5 is a schematic diagram of a rule matching process of images to be matched according to an embodiment of the present application;
FIG. 6 illustrates a block diagram of an apparatus for matching regular text for images in accordance with one embodiment of the present application;
fig. 7 shows a block diagram of an electronic device for performing a method for matching rule text for images according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In the following description, the terms "first", "second", and the like are merely used to distinguish similar objects and do not represent a particular ordering of the objects, it being understood that the "first", "second", and the like may be interchanged with one another, if permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
It should be noted that: references herein to "a plurality" means two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., a and/or B may represent: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
The application discloses a method, a device, electronic equipment and a storage medium for matching a rule text for an image, and relates to cloud technology, artificial intelligence, intelligent traffic, auxiliary driving and the like.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision technology, speech processing technology, natural language processing technology and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition and measurement on a target, and further performs graphic processing so that the computer produces an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Large-model technology has brought important innovation to the development of computer vision technology: pre-trained models in the vision field, such as Swin-Transformer, ViT, V-MoE and MAE, can be rapidly and widely applied to specific downstream tasks through fine-tuning. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing concerns natural language, i.e., the language people use in daily life, and is closely related to research in linguistics; at the same time, it involves important model-training technologies in the fields of computer science, mathematics and artificial intelligence, with pre-trained models such as large language models (Large Language Model) developed in the NLP field. Through fine-tuning, large language models can be widely applied to downstream tasks. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, knowledge graph techniques and the like.
As shown in fig. 1, an application scenario applicable to the embodiments of the present application includes a terminal 20 and a server 10, where the terminal 20 and the server 10 are connected through a wired network or a wireless network. The terminal 20 may include, but is not limited to, a cell phone, a computer, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, and other terminal devices that can match rule text for an image, or that run applications (e.g., instant messaging applications, shopping applications, search applications, gaming applications, forum applications, map and traffic applications, etc.) that can invoke the method for matching rule text for an image.
The server 10 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligent platforms, and the like. The server 10 may be used to provide services for applications running at the terminal 20.
The terminal 20 may send an image to be matched to the server 10, and the server 10 determines a description text of the image to be matched according to the image content of the image to be matched, and determines, from a plurality of preset rule texts, at least one candidate rule text whose matching degree with the description text is higher than a matching degree threshold; then, the server 10 generates a question text according to the description text, the at least one candidate rule text and a preset prompt text; next, the server 10 generates a reply text for the question text according to the question text and the image to be matched, where the reply text indicates a target rule text to be used as the rule text matched with the image to be matched. Finally, the server 10 may return the target rule text to the terminal 20.
In another embodiment, the terminal 20 may be configured to perform the method of the present application to determine a target rule text from a plurality of preset rule texts.
In the application, the description text and the rule texts may be processed by a text encoder to screen candidate rule texts, and the question text may be processed by a multi-modal large language model to obtain a reply text. After training the text encoder and the multi-modal large language model, the server 10 may store them in a distributed cloud storage system; the terminal 20 may acquire the text encoder and the multi-modal large language model from the distributed cloud storage system, and after acquiring them, use them to determine a target rule text from the plurality of preset rule texts.
For convenience of description, in the following embodiments, an example will be described in which a method for matching a rule text for an image is performed by an electronic device.
Referring to fig. 2, fig. 2 shows a flowchart of a method for matching rule text for an image according to an embodiment of the present application, where the method may be applied to an electronic device, and the electronic device may be the terminal 20 or the server 10 in fig. 1, and the method includes:
S110, determining the description text of the image to be matched according to the image content of the image to be matched.
The image to be matched may refer to an image on which rule matching is to be performed; it may be a color image or a gray image, and its image format may be RGB (R refers to red, G refers to green, and B refers to blue), YUV (Y represents luminance, and U and V represent chrominance), or the like.
The image to be matched may be an image obtained locally by the electronic device, an image obtained from another device, or an image shot by a camera of the electronic device; it may also be a video frame obtained from a video to be matched (either every video frame of the video to be matched, or one of a subset of video frames selected from it), where the video to be matched may be a video obtained locally by the electronic device, a video obtained from another device, or a video shot by a camera of the electronic device.
The image content of the image to be matched may refer to the content presented by the image to be matched, including the content displayed visually in the image and the characters in the image. Visual content understanding may be performed on the image to be matched through a multi-modal large language model to obtain the description text corresponding to the image to be matched; text extraction may be performed on the image to be matched through OCR (Optical Character Recognition) and other technologies to obtain the description text corresponding to the image to be matched; alternatively, visual content understanding may be performed on the image to be matched through the multi-modal large language model to obtain a first sub-description text corresponding to the image to be matched, text extraction may be performed on the image to be matched through OCR and other technologies to obtain a second sub-description text corresponding to the image to be matched, and finally the description text corresponding to the image to be matched is obtained according to the first sub-description text and the second sub-description text.
In addition, text extraction may be performed on the image to be matched through OCR technology to obtain an extracted text, and then the extracted text and the image to be matched are input into the multi-modal large language model for visual content understanding, so that the description text corresponding to the image to be matched is obtained.
The image to be matched and a question may be input into the multi-modal large language model to obtain the first sub-description text output by the multi-modal large language model, where the question may be a text requesting generation of a description text for the image, a text requesting a description of the image, or the like.
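As an illustration of how the two sub-description texts could be combined, a minimal sketch is given below; the functions mllm_caption and ocr_extract are hypothetical stand-ins for a multi-modal large language model interface and an OCR engine, and the combining template is only an example, none of which is prescribed by this application.

```python
from typing import Callable

# Hypothetical callables standing in for a multi-modal large language model and an OCR engine;
# neither interface is specified in this application.
MllmFn = Callable[[str, str], str]   # (image_path, question) -> caption text
OcrFn = Callable[[str], str]         # image_path -> characters extracted from the image

def build_description_text(image_path: str, mllm_caption: MllmFn, ocr_extract: OcrFn) -> str:
    """Sketch of S110: combine an MLLM caption and OCR text into one description text."""
    # First sub-description text: visual content understanding by the multi-modal LLM.
    first_sub = mllm_caption(image_path, "Please describe the content of this image.")
    # Second sub-description text: characters extracted from the image.
    second_sub = ocr_extract(image_path)
    # Combine the two sub-description texts into the description text of the image to be matched.
    return f"The image shows: {first_sub} The characters on the image are: {second_sub}"
```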
A large language model (Large Language Model, abbreviated LLM), also known as a large-scale language model, is an artificial intelligence model intended to understand and generate human language. Such models are trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, sentiment analysis and so on. LLMs are characterized by their large scale, containing billions of parameters, which help them learn complex patterns in linguistic data. These models are typically based on deep learning architectures, such as Transformers, which help them achieve impressive performance on various NLP tasks. Common large language models include GPT-3, BERT, T5 and the like.
In the field of large language models, input and output can be converted into text, and a task can be completed according to the converted text through an interface of LLM. For example, for a summary task, a document to be summarized may be input into a large language model, which may generate a summary.
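For instance, a summarization task carried out through such a text interface could be sketched as follows, where llm_complete is a hypothetical text-in/text-out callable rather than the interface of any particular model.

```python
from typing import Callable

# Hypothetical text-in/text-out interface of a large language model (not a specific API).
LlmFn = Callable[[str], str]

def summarize(document: str, llm_complete: LlmFn) -> str:
    # Convert the task into text: prepend an instruction, then let the LLM complete it.
    prompt = f"Please summarize the following document:\n{document}\nSummary:"
    return llm_complete(prompt)
```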
A multimodal large language model (Multimodal Large Language Model, abbreviated as MLLM) refers to a model that performs multimodal tasks using a powerful large language model (LLM) as its brain. MLLMs exhibit striking capabilities, such as writing stories based on images, OCR-free mathematical reasoning, and generating text from the image content of an image. In short, the input of a multimodal large language model is not limited to text; images and videos can also be used as its input.
It is worth mentioning that a large language model can only directly process natural language and cannot process images; therefore, a multi-modal large language model is introduced in the method, and visual content understanding is performed on the image to be matched through the multi-modal large language model to obtain the description text corresponding to the image to be matched.
S120, determining at least one candidate rule text with the matching degree higher than a matching degree threshold value with the description text from a plurality of preset rule texts.
Rule text refers to text that may be a rule or a criterion, for example, in the medical arts, rule text may refer to a medical solution, and for example, in the legal arts, rule text may be legal terms, and for example, rule text may be traffic rule text. The preset rule text is set according to the requirement, and the preset rule text may relate to different technical fields, for example, the preset rule text may be only specific to the legal technical field, and for example, the preset rule text may be specific to the legal technical field and the medical technical field.
It should be noted that the technical field to which the image to be matched belongs should be among the technical fields covered by the preset rule texts; for example, if the image to be matched is a medical image of a patient, the preset rule texts at least include medical solutions, and if the image to be matched is an image of a case scene, the preset rule texts at least include legal terms.
After the description text of the image to be matched is obtained, the similarity between the description text and the preset rule text can be determined to be the matching degree between the description text and the preset rule text, and then the preset rule text with the matching degree higher than the matching degree threshold value is screened from the preset rule text to be used as the candidate rule text. The matching degree threshold may be a value set based on requirements, which is not limited in this application.
For example, the description text and the preset rule text may be semantically encoded to obtain a vector representation of the description text and a vector representation of the preset rule text, and a similarity (for example, cosine similarity) between the vector representation of the description text and the vector representation of the preset rule text is calculated as a matching degree between the description text and the preset rule text.
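A minimal sketch of this matching-degree computation is shown below; it assumes the description text and each preset rule text have already been encoded into vectors by some text encoder, and the threshold value of 0.6 is an arbitrary placeholder rather than a value taken from this application.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two encoding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_candidate_rules(desc_vec: np.ndarray,
                           rule_vecs: list[np.ndarray],
                           rule_texts: list[str],
                           threshold: float = 0.6) -> list[str]:
    """Keep the preset rule texts whose matching degree with the description text exceeds the threshold."""
    return [text for vec, text in zip(rule_vecs, rule_texts)
            if cosine_similarity(desc_vec, vec) > threshold]
```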
And S130, generating a question text according to the description text, the at least one candidate rule text and a preset prompt text, wherein the prompt text is used for prompting the screening of the rule text most relevant to the image to be matched from the at least one candidate rule text.
The description text, at least one candidate rule text and a preset prompt text can be fused into one text to obtain a question text.
As an implementation manner, the description text, at least one candidate rule text and a preset prompt text can be directly spliced to obtain a question text. For example, the description text, the at least one candidate rule text and the preset prompt text are spliced in sequence according to the sequence that the description text is before, the at least one candidate rule text is behind the description text and the preset prompt text is last, so that the question text is obtained. When the candidate rule texts are multiple, the multiple candidate rule texts can be spliced according to any sequence.
As a further embodiment, the question text may be generated from the description text, the at least one candidate rule text and the preset prompt text according to a preset format, where the preset format indicates how the description text, the at least one candidate rule text and the preset prompt text are laid out. For example, the preset format may be: "The content described by the image to be matched is" + the first sub-description text + "; the characters on the image to be matched are" + the second sub-description text + "; the image to be matched may be related to one of the following rules:" + all candidate rule texts + "; please select the most relevant rule", where the content inside the quotation marks is the prompt text, and the content outside the quotation marks is the description text and the candidate rule texts to be filled in.
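A sketch of assembling the question text in such a preset format is given below; the wording of the prompt fragments is an illustrative paraphrase of the example format above, not a fixed template required by the method.

```python
def build_question_text(first_sub: str, second_sub: str, candidate_rules: list[str]) -> str:
    # Splice the prompt text fragments with the description text and candidate rule texts
    # in the preset order; the exact wording here is only an illustrative paraphrase.
    rules_block = "\n".join(f"{i + 1}. {rule}" for i, rule in enumerate(candidate_rules))
    return (
        f"The content described by the image to be matched is: {first_sub}\n"
        f"The characters on the image to be matched are: {second_sub}\n"
        f"The image to be matched may be related to one of the following rules:\n{rules_block}\n"
        "Please select the most relevant rule."
    )
```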
And S140, generating a reply text of the question text according to the question text and the image to be matched.
In this embodiment, the image to be matched and the question text may be input into the multi-modal large language model at the same time, so as to obtain a reply text output by the multi-modal large language model in response to the prompt text in the question text.
The reply text may include a candidate matching rule text selected from the candidate rule texts, and the reply text may further include a question text, that is, the reply text may be a result of the question text being spliced with the candidate matching rule text.
And S150, determining the rule text matched with the image to be matched from at least one candidate rule text according to the reply text.
After the reply text is obtained, the candidate matching rule text in the reply text can be directly used as the rule text matched with the image to be matched. From the foregoing, it is known that the candidate matching rule text is determined from at least one candidate rule text, that is, the determined rule text that matches the image to be matched is determined from at least one candidate rule text.
In this embodiment, the description text of the image to be matched is determined according to the image content of the image to be matched, at least one candidate rule text whose matching degree with the description text is higher than a matching degree threshold is determined from a plurality of preset rule texts, and a question text is then generated according to the description text, the at least one candidate rule text and a preset prompt text, where the prompt text is used to prompt screening of the rule text most relevant to the image to be matched from the at least one candidate rule text. A reply text for the question text is generated according to the question text and the image to be matched, and the rule text matched with the image to be matched is determined according to the reply text. Because the description text can accurately describe the image content of the image to be matched, and the similarity between the candidate rule text and the description text is high, the matching degree between the candidate rule text and the image content of the image to be matched is also high. The question text is then constructed from the candidate rule texts that match the image content well, the reply text is determined from the question text, and the rule text most relevant to the image to be matched is further determined from the reply text. As a result, the selected rule text is the candidate rule text with the highest matching degree with the image content of the image to be matched; the rule text matched with the image to be matched can be understood as the rule text most applicable to the image to be matched, so that the aim of matching rule text to the image to be matched with high accuracy is achieved.
In some traffic scenarios, if the image to be matched is an image of a traffic scene, a traffic rule text can be matched to the traffic scene image according to the method of the application as the traffic rule text applicable to the image content presented by the traffic scene image. For example, if the image content presented by the traffic scene image is a vehicle running a red light, the traffic rule text violated by the behavior presented in the traffic scene image can be automatically matched to the traffic scene image.
In some medical scenarios, if the image to be matched is a medical image of a patient, a medical solution text can be matched to the medical image according to the method of the application as the medical solution applicable to the image content presented by the medical image. For example, if the image content presented by the medical image shows the presence of a tumor, a medical solution for removing the tumor presented in the medical image can be automatically matched to the medical image.
In one embodiment, as shown in fig. 3, the reply text includes candidate matching rule text determined for the image to be matched and interpretation text output for the candidate matching rule text; the interpretation text is used for explaining why the candidate matching rule text is selected as the rule text matched with the image to be matched; accordingly, step S150 may include:
And S210, scoring based on the candidate matching rule text and the interpretation text through an evaluation model to obtain a predictive score of the reply text.
The evaluation model may refer to a model for evaluating the reply text. The reply text includes a candidate matching rule text determined for the image to be matched and an interpretation text output for the candidate matching rule text. The candidate matching rule text and the interpretation text are input into the evaluation model, and the evaluation model outputs a prediction score indicating the degree of matching between the candidate matching rule text and the interpretation text: the higher the prediction score output by the evaluation model, the better the candidate matching rule text and the interpretation text match; the lower the prediction score, the worse they match.
The training process of the evaluation model may include: acquiring a sample reply text and a sample label corresponding to the sample reply text, where the sample reply text includes a sample candidate rule text and a sample interpretation text for the sample candidate rule text, and the sample label is used to indicate whether the sample candidate rule text and the sample interpretation text match; scoring based on the sample candidate rule text and the sample interpretation text through an initial evaluation model to obtain a prediction score of the sample reply text; and adjusting parameters of the initial evaluation model according to the prediction score of the sample reply text and the sample label to obtain the evaluation model. The initial evaluation model may include a semantic encoding network (e.g., a Transformer encoder), a linear mapping layer and an activation layer, where the linear mapping layer may include a feed-forward neural network, and the activation layer may include a sigmoid activation function.
The sample candidate rule text may refer to any text that may be used as a rule, and the sample candidate rule text may be the same as the preset rule text, or the sample candidate rule text may be different from the preset rule text.
The label of the sample reply text may be 1 when the sample candidate rule text and the sample interpretation text in the sample reply text match, and 0 when the sample candidate rule text and the sample interpretation text in the sample reply text do not match.
The sample candidate rule text and the sample interpretation text may be input into the Transformer encoder of the initial evaluation model for semantic encoding to obtain a sample encoding result; the sample encoding result is input into the linear mapping layer of the initial evaluation model to obtain a sample mapping result; the sample mapping result is processed through the activation layer of the initial evaluation model to obtain the prediction score of the sample reply text; then a cross-entropy loss value is calculated from the prediction score of the sample reply text and the sample label to obtain the loss value of the sample reply text, and the parameters of the initial evaluation model are adjusted through the loss value of the sample reply text until the training end condition is met, so that the evaluation model is obtained.
The training ending condition of the evaluation model may include that the loss value of the sample reply text is lower than a first loss value threshold or the iteration number reaches a first preset number, which is not limited in the present application. The training process of the evaluation model can comprise a plurality of iteration processes, each iteration process can comprise a plurality of sample reply texts, the loss values of the plurality of sample replies of each iteration process are summed to obtain an iteration loss value corresponding to the iteration process, and the initial evaluation model corresponding to the iteration process is subjected to parameter adjustment once through the iteration loss value corresponding to the iteration process.
It can be understood that the evaluation model obtained by training may also include a Transformer encoder, a linear mapping layer and an activation layer. The candidate matching rule text and the interpretation text may be input into the Transformer encoder of the evaluation model to obtain the encoding result of the reply text, the encoding result of the reply text is input into the linear mapping layer of the evaluation model to obtain the mapping result of the reply text, and then the mapping result of the reply text is input into the activation layer of the evaluation model to obtain the prediction score of the reply text.
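A minimal PyTorch-style sketch of such an evaluation model is given below, assuming a pre-trained Transformer encoder loaded through the Hugging Face transformers library; the checkpoint name and the width of the feed-forward mapping layer are placeholders, not choices made in this application.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer  # assumed available; checkpoint below is a placeholder

class EvaluationModel(nn.Module):
    """Sketch: Transformer encoder + linear mapping (feed-forward) layer + sigmoid activation."""
    def __init__(self, checkpoint: str = "bert-base-chinese"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        # Linear mapping layer realized as a small feed-forward network ending in a single score.
        self.mapper = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                    # semantic encoding at the cls position
        return torch.sigmoid(self.mapper(cls)).squeeze(-1)   # prediction score in (0, 1)

def score_reply(model: EvaluationModel, tokenizer, rule_text: str, interpretation: str) -> float:
    # Feed the candidate matching rule text and the interpretation text to the model as one pair.
    enc = tokenizer(rule_text, interpretation, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return float(model(enc["input_ids"], enc["attention_mask"]))
```

During training, such a prediction score would be compared with the 0/1 sample label through a binary cross-entropy loss (e.g., torch.nn.BCELoss), corresponding to the cross-entropy loss value described above.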
And S220, if the prediction score reaches a score threshold, taking the candidate matching rule text as the rule text matched with the image to be matched.
If the predictive score of the reply text reaches a scoring threshold (which may be a value set based on requirements, for example, 0.75), the candidate matching rule text in the reply text is directly used as the rule text matched with the image to be matched.
The more the candidate matching rule text and the interpretation text are matched, the more reasonable the interpretation of the interpretation text to the candidate matching rule text is represented, the higher the possibility that the candidate matching rule text and the interpretation text are accurate, and conversely, the more the candidate matching rule text and the interpretation text are not matched, the more unreasonable the interpretation of the interpretation text to the candidate matching rule text is represented, and the lower the possibility that the candidate matching rule text and the interpretation text are accurate.
Therefore, when the prediction score reaches the score threshold, the higher the matching degree of the candidate matching rule text and the interpretation text in the reply text is, the higher the possibility that the candidate matching rule text and the interpretation text in the reply text are correct is, and the candidate matching rule text in the reply text with the prediction score reaching the score threshold can be directly obtained as the rule text matched with the image to be matched, and the rule text is accurate.
And S230, if the prediction score does not reach the score threshold, generating an error prompt text, wherein the error prompt text is used for prompting that the candidate matching rule text is not matched with the image to be matched.
If the prediction score does not reach the scoring threshold, it is determined that the matching degree between the candidate matching rule text in the reply text and the interpretation text is low, which indicates that the matching degree between the candidate matching rule text and the image to be matched is also low, and an error prompt text is generated at this time. For example, the error prompt text may state that the candidate matching rule text is a wrong rule, or that the candidate matching rule text does not match the image to be matched.
S240, inputting the error prompt text, the question text and the image to be matched into the multi-mode large language model to obtain a new reply text output by the multi-mode large language model.
After obtaining the error prompt text, the question text and the image to be matched can be input into the multi-mode large language model at the same time to obtain a new reply text predicted by the multi-mode large language model again, and the step of executing S210 is returned until a reply text with the prediction score reaching the scoring threshold is obtained, and S250 is executed.
S250, obtaining candidate matching rule texts in the reply texts with the predictive scores reaching the scoring threshold as rule texts matched with the images to be matched.
And after the reply text with the prediction score reaching the scoring threshold is obtained, the candidate matching rule text in the reply text with the prediction score reaching the scoring threshold can be used as the rule text matched with the image to be matched.
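Putting steps S210 to S250 together, the iterative loop could be sketched as follows; mllm_answer and evaluate are hypothetical callables wrapping the multi-modal large language model and the evaluation model, the threshold of 0.75 is the example value mentioned above, and the max_rounds safeguard is an addition for illustration rather than part of the described method.

```python
def match_rule_iteratively(image, question_text, mllm_answer, evaluate,
                           score_threshold: float = 0.75, max_rounds: int = 5):
    """Sketch of S210-S250: regenerate the reply text until its prediction score reaches the threshold.

    mllm_answer(image, question, error_hint) -> (candidate matching rule text, interpretation text)
    evaluate(rule_text, interpretation) -> prediction score
    Both are hypothetical callables wrapping the multi-modal LLM and the evaluation model.
    """
    error_hint = ""
    for _ in range(max_rounds):                        # round limit added as a safeguard for illustration
        rule_text, interpretation = mllm_answer(image, question_text, error_hint)
        score = evaluate(rule_text, interpretation)    # S210: prediction score of the reply text
        if score >= score_threshold:                   # S220/S250: accept the candidate matching rule text
            return rule_text
        # S230: error prompt text stating that the candidate rule text does not match the image.
        error_hint = f"The rule '{rule_text}' does not match the image; please select again."
    return None  # no reply text reached the scoring threshold within the round limit
```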
In this embodiment, the reply text is determined through the multi-modal large language model, the prediction score of the reply text is determined through the evaluation model, and the candidate matching rule text in the reply text whose prediction score reaches the scoring threshold is used as the rule text matched with the image to be matched. Because the prediction score of the reply text can accurately indicate the matching degree between the candidate matching rule text and the interpretation text in the reply text, a reply text whose prediction score reaches the scoring threshold has a high matching degree between its candidate matching rule text and its interpretation text, so the candidate matching rule text in such a reply text is more likely to be accurate, which further improves the accuracy of the rule text determined to match the image to be matched.
In an embodiment, as shown in fig. 4, step S120 may include:
s310, coding the descriptive text to obtain a coding result of the descriptive text; and coding each preset rule text to obtain a coding result of each preset rule text.
In this embodiment, the description text and each preset rule text may be encoded by a pre-trained text encoder, so as to obtain an encoding result of the description text and an encoding result of each preset rule text.
An initial retriever may be built based on the dual encoder architecture CoSENT (Cosine Sentence) and then trained by meta-learning to yield a pre-trained text encoder.
In this embodiment, the description text may be encoded by the first text encoder to obtain an encoding result of the description text, and each preset rule text may be encoded by the second text encoder to obtain an encoding result of each preset rule text. The first text encoder and the second text encoder may be text encoders having the same network structure, or text encoders having different network structures, which are not particularly limited herein.
Specifically, a cls start identifier representing the start position may be added at the beginning of the description text to obtain a preprocessed description text; the preprocessed description text is encoded by the first text encoder to obtain multi-dimensional real-valued vectors as the encoding result of the preprocessed description text; finally, the encoding result corresponding to the cls start identifier is taken from the encoding result of the preprocessed description text as the encoding result of the description text.
Similarly, a cls start identifier representing a start position can be added to a sentence head of the preset rule text to obtain a pre-processing preset rule text, the pre-processing preset rule text is encoded through a second text encoder to obtain a multi-dimensional real value vector as an encoding result of the pre-processing preset rule text, and finally, an encoding result corresponding to the cls start identifier is obtained from the encoding result of the pre-processing preset rule text and is used as an encoding result of the preset rule text.
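A sketch of extracting such a cls-position encoding with a Hugging Face-style encoder is given below; the checkpoint name is a placeholder, and in practice the tokenizer of such models prepends the [CLS] identifier automatically.

```python
import torch
from transformers import AutoModel, AutoTokenizer  # checkpoint name below is a placeholder

def encode_with_cls(text: str, checkpoint: str = "bert-base-chinese") -> torch.Tensor:
    """Encode a text and return the vector at the cls start identifier as its encoding result."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    encoder = AutoModel.from_pretrained(checkpoint)
    inputs = tokenizer(text, return_tensors="pt")      # the tokenizer prepends the [CLS] start identifier
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state    # one multi-dimensional real-valued vector per token
    return hidden[0, 0]                                 # encoding result at the cls position
```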
Wherein the training process of the first text encoder and the second text encoder may include: encoding a first sample description text corresponding to the first sample image through a pre-trained first initial text encoder to obtain an encoding result corresponding to the first sample description text; encoding each first sample rule text through a pre-trained second initial text encoder to obtain an encoding result of each first sample rule text; determining a loss value corresponding to the first sample image according to the similarity between the coding result of the first sample description text and the coding result of each first sample rule text; and adjusting parameters of the first initial text encoder and the second initial text encoder according to the loss value corresponding to the first sample image to obtain the first text encoder corresponding to the first initial text encoder and the second text encoder corresponding to the second initial text encoder.
The first sample image may be an image that may be used as a training sample, may be a color image or a gray image, and may be in RGB or YUV image format. And processing the first sample image according to the acquisition mode of the description text of the image to be matched to obtain the description text corresponding to the first sample image, wherein the description text is used as the first sample description text.
The first sample rule text may be a rule text to be matched provided for the first sample image, which may be the same as the preset rule text or may be different from the preset rule text. The technical field to which the first sample rule text belongs at least includes the technical field of the first sample image.
Similarly, a cls start identifier representing the start position may be added at the beginning of the first sample description text to obtain a preprocessed first sample description text; the preprocessed first sample description text is encoded by the first initial text encoder to obtain multi-dimensional real-valued vectors as the encoding result of the preprocessed first sample description text; finally, the encoding result corresponding to the cls start identifier is taken from the encoding result of the preprocessed first sample description text as the encoding result of the first sample description text.
Similarly, a cls start identifier representing a start position can be added to a sentence head of the first sample rule text to obtain a preprocessed first sample rule text, the preprocessed first sample rule text is encoded through a second initial text encoder to obtain a multidimensional real value vector as an encoding result of the first sample rule text, and finally, an encoding result corresponding to the cls start identifier is obtained from the encoding result of the preprocessed first sample rule text and is used as an encoding result of the first sample rule text.
Then, a vector dot product between the encoding result of the first sample description text and the encoding result of the first sample regular text is calculated as a similarity between the encoding result of the first sample description text and the encoding result of the first sample regular text, and a loss value corresponding to the first sample image is determined through the similarity between the encoding result of the first sample description text and the encoding result of each first sample regular text.
It should be noted that, the plurality of first sample rule texts include positive sample rule texts and negative sample rule texts corresponding to the first sample images, where the positive sample rule texts refer to rule texts matched with the first sample images in the plurality of first sample rule texts, the negative sample rule texts refer to rule texts not matched with the first sample images in the plurality of first sample rule texts, and at this time, determining the loss value corresponding to the first sample images according to the similarity between the encoding results of the first sample description texts and the encoding results of each first sample rule text includes: and determining a loss value corresponding to the first sample image according to the similarity between the coding result of the first sample description text and the coding result of the positive sample rule text and the similarity between the coding result of the first sample description text and the coding result of the negative sample rule text. The similarity between the encoding result of the first sample description text and the encoding result of the first sample rule text may refer to a vector dot product of the encoding result of the first sample description text and the encoding result of the first sample rule text.
In this embodiment, a rule text matched with the first sample image may be manually selected from the plurality of first sample rule texts as the positive sample rule text, and the other rule texts in the plurality of first sample rule texts are used as the negative sample rule texts. After obtaining the similarity between the encoding result of the first sample description text and the encoding result of each first sample rule text, the similarity between the encoding result of the first sample description text and the encoding result of the positive sample rule text and the similarity between the encoding result of the first sample description text and the encoding result of each negative sample rule text can be obtained directly; the loss value corresponding to the first sample image is then determined from these similarities, and the parameters of the first initial text encoder and the second initial text encoder are adjusted through the loss value to obtain the first text encoder corresponding to the first initial text encoder and the second text encoder corresponding to the second initial text encoder.
Alternatively, the loss value corresponding to the first sample image may be determined according to formula one, which is as follows:

L_1 = -log [ e^{sim(q_i, p_i^+)} / ( e^{sim(q_i, p_i^+)} + Σ_{j=1}^{m} e^{sim(q_i, p_j^-)} ) ]

wherein L_1 is the loss value of the first sample image, sim(q_i, p_i^+) is the similarity between the encoding result of the first sample description text q_i and the encoding result of the positive sample rule text p_i^+, sim(q_i, p_j^-) is the similarity between the encoding result of the first sample description text q_i and the encoding result of the j-th negative sample rule text p_j^-, m is the number of negative sample rule texts corresponding to the first sample description text q_i, and e is the natural constant.
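A minimal PyTorch sketch of this loss, assuming the contrastive form reconstructed above and dot-product similarities; tensor names and shapes are illustrative:

    import torch

    def first_sample_loss(q, p_pos, p_negs):
        """q: (d,) description encoding; p_pos: (d,) positive rule encoding; p_negs: (m, d) negatives."""
        pos = torch.exp(q @ p_pos)            # e^{sim(q_i, p_i+)}, sim is the dot product
        negs = torch.exp(p_negs @ q).sum()    # sum over j of e^{sim(q_i, p_j-)}
        return -torch.log(pos / (pos + negs))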
After obtaining the loss value corresponding to the first sample image, adjusting parameters of the first initial text encoder and the second initial text encoder through the loss value corresponding to the first sample image until the training ending condition is met, and obtaining the first text encoder corresponding to the first initial text encoder and the second text encoder corresponding to the second initial text encoder.
Wherein the training end condition of the first text encoder and the second text encoder may include at least one of: all the first sample images having been traversed, the loss value being less than a second loss value threshold, and the number of iterations reaching a second preset number. Each iteration processes a plurality of first sample images, and the sum of the loss values of the plurality of first sample images in the iteration is used as the final loss value of the iteration, with which the parameters of the first initial text encoder and the second initial text encoder are adjusted once. The second loss value threshold and the second preset number may be set based on requirements, which are not limited in this application.
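A minimal sketch of this iteration loop; the optimizer choice, learning rate, batch structure and stopping thresholds below are illustrative assumptions, not taken from the patent:

    import torch

    def train_dual_encoders(batches, desc_encoder, rule_encoder, loss_fn,
                            loss_threshold=1e-3, max_iterations=10_000):
        # desc_encoder / rule_encoder: the first / second initial text encoders
        # loss_fn(q, p_pos, p_negs): per-image contrastive loss as sketched above
        params = list(desc_encoder.parameters()) + list(rule_encoder.parameters())
        optimizer = torch.optim.Adam(params, lr=1e-4)
        for iteration, batch in enumerate(batches, start=1):
            # batch: iterable of (desc_inputs, pos_inputs, neg_inputs), one tuple per first sample image
            total = sum(loss_fn(desc_encoder(d), rule_encoder(p), rule_encoder(n))
                        for d, p, n in batch)   # sum of per-image loss values in this iteration
            optimizer.zero_grad()
            total.backward()
            optimizer.step()                    # one parameter adjustment per iteration
            if total.item() < loss_threshold or iteration >= max_iterations:
                break                           # a training end condition is met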
S320, determining a plurality of initial candidate rule texts from the plurality of preset rule texts according to the coding results of the descriptive texts and the similarity between the coding results of each preset rule text.
Initial candidate rule texts may be screened from the plurality of preset rule texts in descending order of similarity, where the number of initial candidate rule texts to be screened may be set based on requirements, for example, 10 initial candidate rule texts are screened.
Alternatively, preset rule texts whose similarity reaches a similarity threshold may be selected from the plurality of preset rule texts as initial candidate rule texts. The similarity threshold is not limited in this application.
In the present embodiment, the dot product between the encoding result E_q(q) of the descriptive text and the encoding result E_p(p) of the preset rule text can be calculated as the similarity sim(q, p) between the encoding result of the descriptive text and the encoding result of the preset rule text, i.e., sim(q, p) = E_q(q) · E_p(p).
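As a small illustration of this first-stage screening (array names and the example top-k of 10 are assumptions, matching the example above):

    import numpy as np

    def screen_initial_candidates(desc_vec, rule_vecs, top_k=10):
        """desc_vec: (d,) E_q(q); rule_vecs: (n_rules, d) E_p(p) for every preset rule text."""
        sims = rule_vecs @ desc_vec             # sim(q, p) = E_q(q) · E_p(p) for each rule text
        order = np.argsort(-sims)[:top_k]       # keep the top_k rule texts by similarity
        return order, sims[order]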
S330, carrying out joint coding on the description text and each initial candidate rule text to obtain a joint coding result; and determining at least one candidate rule text with the matching degree higher than a matching degree threshold value between the description text and the plurality of initial candidate rule texts according to the joint coding result of the description text and each initial candidate rule text.
For each initial candidate rule text, the description text and the initial candidate rule text are spliced, and the spliced text is vectorized to obtain the joint encoding result of the description text and the initial candidate rule text. Splicing the description text and the initial candidate rule text may mean that the description text is spliced before the initial candidate rule text, or that the description text is spliced after the initial candidate rule text.
After the joint coding result of the description text and the initial candidate rule text is obtained, the joint coding result of the description text and the initial candidate rule text can be processed through a full-connection layer to obtain a full-connection result corresponding to the joint coding result, and then the full-connection result corresponding to the joint coding result is normalized to obtain a normalized result as the matching degree between the description text and the initial candidate rule text; and then acquiring the initial candidate rule text with the matching degree higher than the matching degree threshold value as the candidate rule text. The matching degree threshold may be a value set based on the requirement, for example, 0.5.
As an embodiment, S330 may include: splicing the description text and each initial candidate rule text to obtain a spliced text corresponding to each initial candidate rule text; inputting the spliced text into a third text encoder to obtain a joint encoding result of the description text and the initial candidate rule text in the spliced text; and determining at least one candidate rule text with the matching degree higher than a matching degree threshold value between the description text and the plurality of initial candidate rule texts according to the joint coding result of the description text and each initial candidate rule text.
For each initial candidate rule text, a cls start identifier representing the start position is added at the beginning (sentence head) of the spliced text corresponding to the initial candidate rule text to obtain a preprocessed spliced text; the preprocessed spliced text is encoded by the third text encoder to obtain the encoding result of the preprocessed spliced text, and the encoding result corresponding to the cls start identifier in the preprocessed spliced text is taken as the joint encoding result of the description text and the initial candidate rule text in the spliced text. Then, the joint encoding result of the description text and the initial candidate rule text is linearly mapped through a Feed-forward Network (FFN) to obtain a mapping result corresponding to the joint encoding result, and the mapping result corresponding to the joint encoding result is normalized through a sigmoid function to obtain the matching degree between the initial candidate rule text and the description text.
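A minimal sketch of this matching head, assuming a single linear layer as the FFN and the example threshold of 0.5 mentioned above; dimensions and names are illustrative:

    import torch
    import torch.nn as nn

    class MatchingHead(nn.Module):
        """Maps the joint cls encoding of a spliced text to a matching degree in (0, 1)."""
        def __init__(self, dim=128):
            super().__init__()
            self.ffn = nn.Linear(dim, 1)        # feed-forward (FFN) linear mapping

        def forward(self, joint_cls_encoding):  # joint_cls_encoding: (batch, dim)
            return torch.sigmoid(self.ffn(joint_cls_encoding)).squeeze(-1)

    def select_candidates(match_degrees, threshold=0.5):
        # keep the initial candidate rule texts whose matching degree exceeds the threshold
        return [i for i, m in enumerate(match_degrees.tolist()) if m > threshold]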
Wherein the training process of the third text encoder may include: splicing the second sample description text corresponding to the second sample image with each second sample rule text to obtain a sample spliced text corresponding to each second sample rule text; inputting the sample spliced text corresponding to each second sample rule text into a pre-trained third initial text encoder to obtain a coding result corresponding to each sample spliced text; determining the matching degree between a second sample rule text and a second sample description text in the sample spliced text according to the coding result corresponding to the sample spliced text; determining a loss value corresponding to the second sample image according to the matching degree between the second sample rule text and the second sample description text in each sample spliced text; and adjusting parameters of the third initial text encoder through the loss value corresponding to the second sample image to obtain the third text encoder.
The second sample image is an image that can serve as a training sample; it may be a color image or a grayscale image, in RGB or YUV format. The second sample image is processed in the same way as the description text of the image to be matched is obtained, and the resulting description text of the second sample image is used as the second sample description text.
The second sample rule text may be a rule text for rule matching the second sample image, which may be the same as or different from the preset rule text. The technical field to which the second sample rule text belongs at least includes the technical field of the second sample image. The second sample rule text may correspond to tag information, where the tag information of the second sample rule text is used to indicate whether the second sample rule text matches the second sample image. The tag information of the second sample rule text may be in the form of a tag value, e.g., a tag value of 1 for the second sample rule text indicating that the second sample rule text matches the second sample image, and a tag value of 0 for the second sample rule text indicating that the second sample rule text does not match the second sample image.
The initial retriever may be built based on the dual encoder architecture CoSENT (Cosine Sentence) and then trained by meta-learning to yield a pre-trained third initial text encoder.
For each second sample rule text, splicing a second sample description text corresponding to a second sample image with the second sample rule text to obtain a sample spliced text corresponding to the second sample rule text, adding a cls start identifier representing a start position at the sentence head of the sample spliced text corresponding to the second sample rule text to obtain a preprocessed sample spliced text, encoding the preprocessed sample spliced text through a third initial text encoder to obtain an encoding result of the preprocessed sample spliced text, and obtaining an encoding result corresponding to a cls start character in the preprocessed sample spliced text as an encoding result of the sample spliced text (namely, a joint encoding result of the second sample description text and the second sample rule text in the sample spliced text); and then, carrying out linear mapping on the coding result of the sample spliced text through a Feed-forward Network (FFN) to obtain a mapping result corresponding to the sample spliced text, and normalizing the mapping result corresponding to the sample spliced text through a sigmoid function to obtain the matching degree between the second sample regular text and the second sample description text.
Then, a loss value corresponding to the second sample image can be determined according to the matching degree between the second sample rule text and the second sample description text in each sample spliced text through a formula II, wherein the formula II is as follows:
wherein,for the loss value corresponding to the second sample image, sim (q' i :p′ i ) Rule text p 'for the second sample' i And a second sample description text q' i The matching degree between the two is y is the second sample rule text p' i Wherein, in the second sample, the rule text p' i Descriptive text q 'for descriptive second sample' i Y is 1 in the regular text of (2), p 'in the second sample of regular text' i Not describing the second sample description text q' i Y is 0.
And after obtaining the loss value corresponding to the second sample image, adjusting the parameter of the third initial text encoder through the loss value corresponding to the second sample image until the training ending condition is met, and obtaining the third text encoder. Wherein the training end condition of the third text encoder herein may include at least one of traversing all of the second sample images, the penalty value being less than a third penalty value threshold, and the number of iterations reaching a third preset number. Wherein each iteration process includes a plurality of second sample images, and the sum of the loss values of the plurality of second sample images in each iteration process is used as a final loss value in the iteration process for performing parameter adjustment on the third initial text encoder. The third loss value threshold and the third preset times may be values set based on requirements, which are not limited in this application.
In one example, a rule matching process for images to be matched is shown in FIG. 5. Firstly, acquiring an image to be matched, analyzing the image content of the image to be matched to obtain a first sub description text and a second sub description text, and summarizing the first sub description text and the second sub description text to obtain the description text of the image to be matched.
And screening the initial candidate rule texts from the preset rule texts according to the coding result of the descriptive text of the image to be matched and the coding result of each of the preset rule texts, and then continuously screening the candidate rule texts from the initial candidate rule texts according to the joint coding result of the descriptive text of the image to be matched and each of the initial candidate rule texts.
Generating a question text according to the description text, at least one candidate rule text and a preset prompt text, processing the question text and the image to be matched through a multi-mode large language model to obtain a reply text, and scoring the reply text through an evaluation model to obtain a prediction score of the reply text.
Then, it is judged whether the predicted score of the reply text reaches the score threshold. If the predicted score reaches the score threshold, the candidate matching rule text in the reply text is used as the rule text matched with the image to be matched; if the predicted score does not reach the score threshold, an error prompt text is generated, the error prompt text, the question text and the image to be matched are input into the multi-modal large language model to obtain a new reply text, and the new reply text is scored again, until a reply text whose predicted score reaches the score threshold is obtained and the candidate matching rule text in that reply text is used as the rule text matched with the image to be matched.
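A sketch of this score-and-retry loop; the callables mllm and evaluator stand in for the multi-modal large language model and the evaluation model, and the score threshold, round limit and error-prompt wording are illustrative assumptions:

    def match_rule_for_image(image, question_text, mllm, evaluator,
                             score_threshold=0.8, max_rounds=3):
        """mllm(prompt, image) -> (candidate_rule_text, interpretation_text); evaluator(...) -> score."""
        prompt = question_text
        for _ in range(max_rounds):
            candidate, interpretation = mllm(prompt, image)
            if evaluator(candidate, interpretation) >= score_threshold:
                return candidate                # rule text matched with the image to be matched
            # build an error prompt telling the model the previous candidate did not match
            prompt = (question_text + "\nThe previously selected rule '" + candidate +
                      "' does not match the image; please select again.")
        return None                             # no reply reached the score threshold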
In this embodiment, according to the similarity between the encoding result of the descriptive text and the encoding result of each preset rule text, a rule text screening operation is performed once to obtain a plurality of initial candidate rule texts, and then, according to the joint encoding result of the descriptive text and each initial candidate rule text, at least one candidate rule text with the matching degree higher than the matching degree threshold value between the descriptive text and the at least one candidate rule text is determined from the plurality of initial candidate rule texts, wherein the candidate rule text is a rule text obtained by screening twice, and the matching degree between the candidate rule text and the descriptive text is greatly improved, so that the matching degree between the target rule text obtained according to the candidate rule text and an image to be matched is also greatly improved.
Meanwhile, the first text encoder, the second text encoder and the third text encoder are obtained based on training of the pre-trained text encoder, accurate coding of the descriptive text is achieved through the first text encoder, accurate coding of the preset regular text is achieved through the second text encoder, accordingly, accuracy of similarity between coding results of the descriptive text and coding results of each preset regular text obtained according to the first text encoder and the second text encoder is high, accuracy of screened initial candidate regular texts is improved, meanwhile, accurate coding of the spliced text is achieved through the third text encoder, matching degree between the initial candidate regular text and the descriptive text in the spliced text determined according to coding results of the spliced text is improved, and accuracy of screened candidate regular texts is improved.
In addition, in this embodiment, a two-stage filtering mode is adopted to narrow the candidate rule texts down to a small range, and the evaluation model is used to provide feedback on the output of the multi-modal large language model, so that a small, flexible and trainable evaluation model is combined with the highly capable multi-modal large language model in this scheme, which greatly reduces the training difficulty and the inference cost.
Referring to fig. 6, fig. 6 shows a block diagram of an apparatus for matching regular text for an image according to an embodiment of the present application, where the apparatus 700 includes:
a text determining module 710, configured to determine a description text of the image to be matched according to the image content of the image to be matched;
a selection module 720, configured to determine at least one candidate rule text with a matching degree with the description text higher than a matching degree threshold value from a plurality of preset rule texts;
the question text generation module 730 is configured to generate a question text according to the description text, at least one candidate rule text, and a preset prompt text, where the prompt text is used to prompt screening of rule text most relevant to the image to be matched from the at least one candidate rule text;
the reply text generation module 740 is used for generating a reply text of the question text according to the question text and the image to be matched;
The rule matching module 750 is configured to determine, from the reply text, a rule text that matches the image to be matched from at least one candidate rule text.
Optionally, the reply text comprises a candidate matching rule text determined for the image to be matched and an interpretation text output for the candidate matching rule text; the rule matching module 750 is further configured to score, through the evaluation model, based on the candidate matching rule text and the interpretation text, to obtain a prediction score of the reply text; and if the prediction score reaches a scoring threshold, the candidate matching rule text is used as the rule text matched with the image to be matched.
Optionally, generating a reply text of the question text according to the question text and the image to be matched through the multi-mode model; the rule matching module 750 is further configured to generate an error prompt text, where the error prompt text is used to prompt that the candidate matching rule text is not matched with the image to be matched if the prediction score does not reach the score threshold; inputting the error prompt text, the question text and the image to be matched into the multi-mode large language model to obtain a new reply text output by the multi-mode large language model; and returning to execute the step of scoring based on the candidate matching rule text and the interpretation text through the evaluation model to obtain the prediction score of the reply text until the reply text with the prediction score reaching the scoring threshold is obtained, and obtaining the candidate matching rule text in the reply text with the prediction score reaching the scoring threshold as the rule text matched with the image to be matched.
Optionally, the device further includes a training module, configured to obtain a sample reply text and a sample label corresponding to the sample reply text, where the sample reply text includes a sample candidate rule text and a sample interpretation text for the sample candidate rule text; the sample label is used for indicating whether the sample candidate rule text and the sample interpretation text are matched; scoring based on the sample candidate rule text and the sample interpretation text through an initial evaluation model to obtain a prediction score of the sample reply text; and adjusting parameters of the initial evaluation model according to the prediction scores of the sample reply texts and the sample labels to obtain the evaluation model.
Optionally, the selection module 720 is further configured to encode the descriptive text to obtain an encoding result of the descriptive text; coding each preset rule text to obtain a coding result of each preset rule text; determining a plurality of initial candidate rule texts from a plurality of preset rule texts according to the similarity between the coding result of the descriptive text and the coding result of each preset rule text; carrying out joint coding on the description text and each initial candidate rule text to obtain a joint coding result; and determining at least one candidate rule text with the matching degree higher than a matching degree threshold value between the description text and the plurality of initial candidate rule texts according to the joint coding result of the description text and each initial candidate rule text.
Optionally, the coding result of the descriptive text is obtained by coding through a first text coder, and the coding result of the preset rule text is obtained by coding through a second text coder; the training module is further used for encoding a first sample description text corresponding to the first sample image through a pre-trained first initial text encoder to obtain an encoding result corresponding to the first sample description text; encoding each first sample rule text through a pre-trained second initial text encoder to obtain an encoding result of each first sample rule text; determining a loss value corresponding to the first sample image according to the similarity between the coding result of the first sample description text and the coding result of each first sample rule text; and adjusting parameters of the first initial text encoder and the second initial text encoder according to the loss value corresponding to the first sample image to obtain the first text encoder corresponding to the first initial text encoder and the second text encoder corresponding to the second initial text encoder.
Optionally, the plurality of first sample rule texts include positive sample rule texts and negative sample rule texts corresponding to the first sample images; the training module is further configured to determine a loss value corresponding to the first sample image according to a similarity between the encoding result of the first sample description text and the encoding result of the positive sample regular text, and a similarity between the encoding result of the first sample description text and the encoding result of the negative sample regular text.
Optionally, the selection module 720 is further configured to splice the description text with each initial candidate rule text to obtain a spliced text corresponding to each initial candidate rule text; and inputting the spliced text into a third text encoder to obtain a joint encoding result of the description text and the initial candidate rule text in the spliced text.
Optionally, the training module is further configured to splice a second sample description text corresponding to the second sample image with each second sample rule text, so as to obtain a sample spliced text corresponding to each second sample rule text; inputting the sample spliced text corresponding to each second sample rule text into a pre-trained third initial text encoder to obtain a coding result corresponding to each sample spliced text; determining the matching degree between a second sample rule text and a second sample description text in the sample spliced text according to the coding result corresponding to the sample spliced text; determining a loss value corresponding to the second sample image according to the matching degree between the second sample rule text and the second sample description text in each sample spliced text; and adjusting parameters of the third initial text encoder through the loss value corresponding to the second sample image to obtain the third text encoder.
Optionally, the text determining module 710 is further configured to understand visual content of the image to be matched through the multi-mode large language model, so as to obtain a first sub-description text corresponding to the image to be matched; performing text extraction on the image to be matched to obtain a second sub-description text corresponding to the image to be matched; and obtaining the description text corresponding to the image to be matched according to the first sub description text and the second sub description text.
It should be noted that, in the present application, the device embodiment and the foregoing method embodiment correspond to each other, and specific principles in the device embodiment may refer to the content in the foregoing method embodiment, which is not described herein again.
Fig. 7 shows a block diagram of an electronic device for performing a method for matching rule text for images according to an embodiment of the present application. The electronic device may be the terminal 20 or the server 10 in fig. 1, and it should be noted that, the computer system 1200 of the electronic device shown in fig. 7 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 7, the computer system 1200 includes a central processing unit (Central Processing Unit, CPU) 1201 which can perform various appropriate actions and processes, such as performing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a random access Memory (Random Access Memory, RAM) 1203. In the RAM 1203, various programs and data required for the system operation are also stored. The CPU1201, ROM1202, and RAM 1203 are connected to each other through a bus 1204. An Input/Output (I/O) interface 1205 is also connected to bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output portion 1207 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker, etc.; a storage section 1208 including a hard disk or the like; and a communication section 1209 including a network interface card such as a LAN (Local Area Network ) card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the internet. The drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 1210 as needed, so that a computer program read out therefrom is installed into the storage section 1208 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1209, and/or installed from the removable media 1211. When executed by a Central Processing Unit (CPU) 1201, performs the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by means of software, or may be implemented by means of hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable storage medium that may be included in the electronic device described in the above embodiments; or may exist alone without being incorporated into the electronic device. The computer readable storage medium carries computer readable instructions which, when executed by a processor, implement the method of any of the above embodiments.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the electronic device to perform the method of any of the embodiments described above.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, in accordance with embodiments of the present application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a usb disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause an electronic device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present application, not for limiting them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. A method of matching regular text for an image, the method comprising:
determining a description text of an image to be matched according to the image content of the image to be matched;
determining at least one candidate rule text with the matching degree higher than a matching degree threshold value from a plurality of preset rule texts;
generating a question text according to the description text, the at least one candidate rule text and a preset prompt text, wherein the prompt text is used for prompting screening, from the at least one candidate rule text, of the rule text most relevant to the image to be matched;
generating a reply text of the questioning text according to the questioning text and the image to be matched;
And determining rule texts matched with the images to be matched from the at least one candidate rule text according to the reply text.
2. The method of claim 1, wherein the reply text comprises a candidate matching rule text determined for the image to be matched and an interpretation text output for the candidate matching rule text;
the determining rule text matched with the image to be matched from the at least one candidate rule text according to the reply text comprises the following steps:
scoring the candidate matching rule text and the interpretation text through an evaluation model to obtain a prediction score of the reply text;
and if the prediction score reaches a score threshold, the candidate matching rule text is used as the rule text matched with the image to be matched.
3. The method according to claim 2, wherein a reply text of the question text is generated according to the question text and the image to be matched through a multi-modal model;
after the candidate matching rule text and the interpretation text are scored through the evaluation model to obtain the prediction score of the reply text, the method further comprises:
If the prediction score does not reach the score threshold, generating an error prompt text, wherein the error prompt text is used for prompting that the candidate matching rule text is not matched with the image to be matched;
inputting the error prompt text, the question text and the image to be matched into the multi-modal large language model to obtain a new reply text output by the multi-modal large language model;
and returning to the step of executing the scoring based on the candidate matching rule text and the interpretation text through the evaluation model to obtain the predictive score of the reply text until a reply text with the predictive score reaching a scoring threshold is obtained, and obtaining the candidate matching rule text in the reply text with the predictive score reaching the scoring threshold as the rule text matched with the image to be matched.
4. The method of claim 2, wherein before scoring by an evaluation model based on the candidate matching rule text and the interpretation text to obtain a predictive score for the reply text, the method further comprises:
acquiring a sample reply text and a sample label corresponding to the sample reply text, wherein the sample reply text comprises a sample candidate rule text and a sample interpretation text aiming at the sample candidate rule text; the sample tag is used for indicating whether the sample candidate rule text and the sample interpretation text are matched;
Scoring based on the sample candidate rule text and the sample interpretation text through an initial evaluation model to obtain a prediction score of the sample reply text;
and adjusting parameters of the initial evaluation model according to the prediction scores of the sample reply texts and the sample labels to obtain the evaluation model.
5. The method of claim 1, wherein said determining at least one candidate rule text from a plurality of preset rule texts having a degree of match with the descriptive text above a degree of match threshold comprises:
coding the description text to obtain a coding result of the description text;
coding each preset rule text to obtain a coding result of each preset rule text;
determining a plurality of initial candidate rule texts from the plurality of preset rule texts according to the similarity between the coding result of the descriptive text and the coding result of each preset rule text;
carrying out joint coding on the description text and each initial candidate rule text to obtain a joint coding result;
and determining at least one candidate rule text with the matching degree higher than a matching degree threshold value from the plurality of initial candidate rule texts according to the joint coding result of the description text and each initial candidate rule text.
6. The method according to claim 5, wherein the encoding result of the descriptive text is encoded by a first text encoder, and the encoding result of the preset regular text is encoded by a second text encoder;
the method further comprises the steps of:
encoding a first sample description text corresponding to a first sample image through a pre-trained first initial text encoder to obtain an encoding result corresponding to the first sample description text;
encoding each first sample rule text through a pre-trained second initial text encoder to obtain an encoding result of each first sample rule text;
determining a loss value corresponding to the first sample image according to the similarity between the coding result of the first sample description text and the coding result of each first sample rule text;
and adjusting parameters of the first initial text encoder and the second initial text encoder according to the loss value corresponding to the first sample image to obtain a first text encoder corresponding to the first initial text encoder and a second text encoder corresponding to the second initial text encoder.
7. The method of claim 6, wherein the plurality of first sample rule texts includes positive sample rule texts and negative sample rule texts corresponding to the first sample images;
and determining a loss value corresponding to the first sample image according to the similarity between the coding result of the first sample description text and the coding result of each first sample rule text, wherein the loss value comprises the following steps:
and determining a loss value corresponding to the first sample image according to the similarity between the encoding result of the first sample description text and the encoding result of the positive sample rule text and the similarity between the encoding result of the first sample description text and the encoding result of the negative sample rule text.
8. The method of claim 5, wherein the jointly encoding the descriptive text and each of the initial candidate rule texts to obtain a joint encoding result comprises:
splicing the description text and each initial candidate rule text to obtain a spliced text corresponding to each initial candidate rule text;
and inputting the spliced text into a third text encoder to obtain a joint encoding result of the description text and the initial candidate rule text in the spliced text.
9. The method of claim 8, wherein before inputting the spliced text into a third text encoder to obtain a joint encoding result of descriptive text and initial candidate regular text in the spliced text, the method further comprises:
splicing the second sample description text corresponding to the second sample image with each second sample rule text to obtain a sample spliced text corresponding to each second sample rule text;
inputting a sample spliced text corresponding to each second sample rule text into a pre-trained third initial text encoder to obtain a coding result corresponding to each sample spliced text;
determining the matching degree between a second sample rule text in the sample spliced text and the second sample description text according to the coding result corresponding to the sample spliced text;
determining a loss value corresponding to the second sample image according to the matching degree between the second sample rule text and the second sample description text in each sample spliced text;
and adjusting parameters of the third initial text encoder through the loss value corresponding to the second sample image to obtain a third text encoder.
10. The method according to claim 1, wherein the determining the descriptive text of the image to be matched according to the image content of the image to be matched comprises:
performing visual content understanding on the image to be matched through a multi-mode large language model to obtain a first sub-description text corresponding to the image to be matched;
performing text extraction on the image to be matched to obtain a second sub-description text corresponding to the image to be matched;
and obtaining the description text corresponding to the image to be matched according to the first sub description text and the second sub description text.
11. An apparatus for matching regular text for an image, the apparatus comprising:
the text determining module is used for determining the description text of the image to be matched according to the image content of the image to be matched;
the selection module is used for determining at least one candidate rule text with the matching degree higher than a matching degree threshold value from a plurality of preset rule texts;
the question text generation module is used for generating a question text according to the description text, the at least one candidate rule text and a preset prompt text, wherein the prompt text is used for prompting screening, from the at least one candidate rule text, of the rule text most relevant to the image to be matched;
The reply text generation module is used for generating a reply text of the question text according to the question text and the image to be matched;
and the rule matching module is used for determining rule texts matched with the image to be matched from the at least one candidate rule text according to the reply text.
12. An electronic device, comprising:
a processor;
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any of claims 1-10.
13. A computer readable storage medium having computer readable instructions stored thereon, which when executed by a processor, implement the method of any of claims 1-10.
14. A computer program product or computer program comprising computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 10.

