CN118132988A - Machine learning model training method, text-based image searching method, automatic question-answering method, computing device, computer-readable storage medium, and computer program product - Google Patents

Machine learning model training method, text-based image searching method, automatic question-answering method, computing device, computer-readable storage medium, and computer program product

Info

Publication number
CN118132988A
CN118132988A (application CN202410296148.8A)
Authority
CN
China
Prior art keywords
image
text
feature
data
similarity
Prior art date
Legal status
Pending
Application number
CN202410296148.8A
Other languages
Chinese (zh)
Inventor
万超群
张巍
沈旭
叶杰平
Current Assignee
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Original Assignee
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Alibaba Cloud Feitian Information Technology Co Ltd
Priority to CN202410296148.8A
Publication of CN118132988A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present specification provide a machine learning model training method, a text-based image search method, an automatic question-answering method, a computing device, a computer-readable storage medium, and a computer program product, wherein the machine learning model training method includes: acquiring an image-text data pair set, an image data set and a text data set, and processing the image-text data pair set through a feature extraction model to obtain an image-text data feature pair set; acquiring, based on the image-text data feature pair set, at least two reference feature vectors representing the correspondence between paired image features and text features in the image-text data feature pair set; and training the feature extraction model according to each reference feature vector, the image data set and the text data set until a training stop condition of the feature extraction model is reached. The accuracy of feature extraction is improved through the image-text relationship carried in the image-text data pairs, thereby reducing the training cost of the feature extraction model.

Description

Machine learning model training method, text-based image searching method, automatic question-answering method, computing device, computer-readable storage medium, and computer program product
Technical Field
The embodiments of this specification relate to the field of computer technology, and in particular to a machine learning model training method, a machine learning model training method applied to cloud-side devices, a text-based image search method applied to cloud-side devices, and an automatic question-answering method applied to cloud-side devices.
Background
With the development of multi-modal large models comes efficient alignment between the visual content of images and the semantic content of text, together with stable generalization across a variety of scenarios. This generalization capability is particularly suited to changing real-world environments, allowing models to be used without frequent algorithm iterations when requirements change slightly.
Currently, multi-modal large models improve their understanding and application capability by learning from a large number of image-text pairs, addressing the key problem of cross-media content understanding. However, training such a model depends on a large-scale data set with requirements on the quantity, diversity and quality of the data, and constructing a large-scale data set that meets these conditions requires substantial cost, which increases the cost of training multi-modal large models. Therefore, to solve the above problems, a machine learning model training method capable of reducing the cost of multi-modal large model training is needed.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a machine learning model training method, a machine learning model training method applied to cloud-side devices, a text-based image search method applied to cloud-side devices, and an automatic question-answering method applied to cloud-side devices. One or more embodiments of the present specification are also directed to a machine learning model training apparatus, a computing device, a computer readable storage medium, and a computer program product that address the deficiencies of the prior art.
According to a first aspect of embodiments of the present specification, there is provided a machine learning model training method, including:
Acquiring an image-text data pair set, an image data set and a text data set, and processing the image-text data pair set through a feature extraction model to obtain an image-text data feature pair set;
acquiring at least two reference feature vectors based on the image-text data feature pair set, wherein the reference feature vectors are feature vectors representing the correspondence between paired image features and text features in the image-text data feature pair set;
and training the feature extraction model according to each reference feature vector, the image data set and the text data set until a training stop condition of the feature extraction model is reached.
According to a second aspect of embodiments of the present specification, there is provided a machine learning model training method applied to cloud-side equipment, including:
Receiving an image-text data pair set, an image data set and a text data set sent by an end-side device, and processing the image-text data pair set through a feature extraction model to obtain an image-text data feature pair set;
acquiring at least two reference feature vectors based on the image-text data feature pair set, wherein the reference feature vectors are feature vectors representing the correspondence between paired image features and text features in the image-text data feature pair set;
training the feature extraction model according to each reference feature vector, the image data set and the text data set until a training stop condition of the feature extraction model is reached;
and obtaining the model parameters of the trained feature extraction model and returning the model parameters to the end-side device.
According to a third aspect of embodiments of the present specification, there is provided a text-based image search method applied to a cloud-side device, including:
receiving an image search instruction sent by an end-side device, wherein the image search instruction carries a target search text;
inputting the target search text into a feature extraction model to obtain target search text feature information output by the feature extraction model, wherein the feature extraction model is obtained by training with the above machine learning model training method;
and determining an image search result corresponding to the image search instruction according to the target search text feature information, and returning the image search result to the end-side device.
According to a fourth aspect of embodiments of the present specification, there is provided an automatic question-answering method applied to cloud-side devices, including:
receiving question data sent by an end-side device, wherein the question data comprises at least one of question text data and question image data;
inputting the question data into a language processing model to obtain question data to be processed, and inputting the question data to be processed into a feature extraction model to obtain question feature data output by the feature extraction model, wherein the feature extraction model is obtained by training with the above machine learning model training method;
and generating answer data according to the question feature data, and returning the answer data to the end-side device.
According to a fifth aspect of embodiments of the present specification, there is provided a computing device comprising:
A memory and a processor;
The memory is configured to store computer executable instructions, and the processor is configured to execute the computer executable instructions, where the computer executable instructions when executed by the processor implement the steps of the machine learning model training method, the machine learning model training method applied to the cloud side device, the text-based image searching method applied to the cloud side device, the image analysis method applied to the cloud side device, and the automatic question-answering method applied to the cloud side device.
According to a sixth aspect of the embodiments of the present specification, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the above-described machine learning model training method, a machine learning model training method applied to a cloud-side device, a text-based image search method applied to a cloud-side device, an image analysis method applied to a cloud-side device, and an automatic question-answering method applied to a cloud-side device.
According to a seventh aspect of the embodiments of the present specification, there is provided a computer program product comprising computer programs/instructions which, when executed by a processor, implement the above-described machine learning model training method, machine learning model training method applied to a cloud-side device, text-based image search method applied to a cloud-side device, image analysis method applied to a cloud-side device, automatic question-answering method applied to a cloud-side device.
In one embodiment of the present specification, an image-text data pair set, an image data set and a text data set are acquired, and the image-text data pair set is processed through a feature extraction model to obtain an image-text data feature pair set; at least two reference feature vectors are acquired based on the image-text data feature pair set, the reference feature vectors being feature vectors representing the correspondence between paired image features and text features in the image-text data feature pair set; and the feature extraction model is trained according to each reference feature vector, the image data set and the text data set until a training stop condition of the feature extraction model is reached.
By applying the scheme of the embodiments of the present specification, each reference feature vector is obtained from the image-text feature data pairs, so that the reference feature vectors carry the relationship between the image features and the text features in the image-text data pairs. The model can then supplement the image features and text features it extracts with reference feature vectors carrying this image-text association, which makes the extracted image features and text features more accurate, and the feature extraction model can be trained with an image set and a text set that have not been paired, further reducing the cost of training the model.
Drawings
FIG. 1 is a flow chart of a machine learning model training method provided in one embodiment of the present disclosure;
FIG. 2 is a flowchart of a machine learning model training method applied to cloud-side devices according to one embodiment of the present disclosure;
Fig. 3 is a flowchart of a text-based image searching method applied to a cloud-side device according to an embodiment of the present disclosure;
fig. 4 is a flowchart of an image analysis method applied to a cloud-side device according to an embodiment of the present disclosure;
Fig. 5 is a flowchart of an automatic question-answering method applied to cloud-side devices according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of an automated question-answering system according to one embodiment of the present disclosure;
FIG. 7 is a flowchart of a process of a training method for a graphic feature extraction model according to one embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a machine learning model training apparatus according to one embodiment of the present disclosure;
FIG. 9 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many other forms than described herein and similarly generalized by those skilled in the art to whom this disclosure pertains without departing from the spirit of the disclosure and, therefore, this disclosure is not limited by the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second and, similarly, a second may also be referred to as a first without departing from the scope of one or more embodiments of the present description. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Furthermore, it should be noted that, user information (including, but not limited to, user equipment information, user personal information, etc.) and data (including, but not limited to, data for analysis, stored data, presented data, etc.) according to one or more embodiments of the present disclosure are information and data authorized by a user or sufficiently authorized by each party, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions, and is provided with corresponding operation entries for the user to select authorization or denial.
In one or more embodiments of the present description, a large model refers to a deep learning model with large-scale model parameters, typically including hundreds of millions, billions, trillions of model parameters, or even more. A large model can also be called a Foundation Model: it is pre-trained on large-scale unlabeled corpora to produce a pre-trained model with more than one hundred million parameters that can adapt to a wide range of downstream tasks and has good generalization ability, such as a large language model (Large Language Model, LLM) or a multi-modal pre-trained model.
In practical applications, a pre-trained large model can be adapted to different tasks with only slight fine-tuning on a small number of samples. Large models can be widely applied in fields such as natural language processing (Natural Language Processing, NLP) and computer vision, in particular to computer vision tasks such as visual question answering (Visual Question Answering, VQA), image captioning (IC) and image generation, and to natural language processing tasks such as text-based emotion classification, text summarization and machine translation. Main application scenarios of large models include digital assistants, intelligent robots, search, online education, office software, e-commerce, intelligent design, and so on.
First, terms related to one or more embodiments of the present specification will be explained.
Contrastive Language-Image Pre-training (CLIP): an advanced computer vision and natural language processing technique. It aligns image content with text descriptions by learning from a large collection of image-text pair data from the web, thereby exploiting knowledge expressed in natural language beyond predefined annotation information such as the categories used in image classification. CLIP can instantly synthesize a linear classifier by embedding the names or descriptions of the target categories and thus perform zero-shot image classification. This approach can exhibit performance comparable to supervised models even without any exposure to supervised data.
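For illustration only, the zero-shot classification behavior described above can be sketched as follows; this is a minimal sketch, assuming a pretrained text encoder callable and hypothetical class names, and is not part of the claimed method:

```python
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_names, text_encoder):
    """Hypothetical sketch of CLIP-style zero-shot classification.
    image_emb: (D,) embedding of one image; text_encoder: assumed callable
    mapping a list of prompts to an (N, D) tensor of text embeddings."""
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # (N, D)
    image_emb = F.normalize(image_emb, dim=-1)               # (D,)
    scores = text_emb @ image_emb                            # cosine similarity per class name
    return class_names[scores.argmax().item()]
```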
DALL·E: a deep learning model capable of generating corresponding images from textual descriptions. By understanding the instructions and descriptions in the text, it creates satisfactory visual content through advanced algorithms. The model demonstrates strong capability in image generation and multi-modal learning, producing rich and diverse image results from a short description.
Vision Transformer-Base (ViT-B): a deep learning model applied to image recognition tasks. It splits an image into small patches and then processes the image patches with a Transformer architecture in the same way as natural language. This allows the model to capture long-range dependencies in the image and improves the efficiency and accuracy of image classification and recognition tasks.
Residual Network (ResNet): a deep convolutional neural network. The model solves the problem of training deep networks by introducing the concept of "residual learning", allowing the network layers to learn the residual mapping between input and output instead of fitting a mapping directly. This design significantly improves the training speed and effectiveness of deep networks, so that accuracy can be improved by adding layers without causing vanishing or exploding gradients.
Bidirectional Encoder Representations from Transformers (BERT): a natural language processing model. It understands context in text through deep bidirectional pre-trained representations. The design of BERT has advanced text processing techniques, particularly for language understanding tasks such as question-answering systems, text summarization and sentiment analysis.
Sentence-BERT: the method is a deep learning model for coding sentences or text fragments, and semantic similarity between texts can be rapidly and effectively calculated. The method is optimized on the basis of BERT, and through special training skills, the model can generate sentence embedding which can be used for various natural language processing tasks, such as semantic search, text similarity comparison and the like, so that the processing speed and the processing performance are improved.
In the present specification, a machine learning model training method applied to cloud-side equipment, a text-based image searching method applied to cloud-side equipment, an image analysis method applied to cloud-side equipment, and an automatic question-answering method applied to cloud-side equipment are provided, and the present specification relates to a machine learning model training apparatus, a computing device, and a computer-readable storage medium, a computer program product, which are described in detail in the following embodiments one by one.
Image-text multi-modal large models such as CLIP (Contrastive Language-Image Pre-training) are trained on image-text data pairs of enormous scale and exhibit good, generalizable image semantic understanding. However, in specific scenarios such as medical imaging, autonomous driving and remote sensing, most of the image data and text data that can be acquired are independent of each other, and only a small amount of data can be associated into semantically matched image-text pairs. If a large number of image-text pairs had to be constructed manually in order to train image-text multi-modal large models such as CLIP for these scenarios, the cost of model training would rise, while training the model with only a small number of data pairs leads to low accuracy of the trained model in these scenarios.
Referring to fig. 1, fig. 1 shows a flowchart of a machine learning model training method according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 102: acquiring an image-text data pair set, an image data set and a text data set, and processing the image-text data pair set through a feature extraction model to obtain an image-text data feature pair set.
In practical application, the image-text data pair set is a set of images paired with their corresponding texts, the image data set is a set containing only images, the text data set is a set containing only texts, the feature extraction model is a deep learning model for extracting image features and text features, and the image-text data feature pair set is a set of paired image features and the corresponding text features.
In particular, a feature extraction model can be understood as a model that performs feature extraction on images and text. It has an image coding layer, a deep learning network layer that encodes images, and a text coding layer, a deep learning network layer that encodes text. The image coding layer and the text coding layer may be the coding layers of a model that already has both, such as CLIP or DALL·E; the image coding layer may also be taken from an image-only encoder such as ViT-B (Vision Transformer-Base) or ResNet (Residual Network), with the text coding layer taken from a text-only encoder such as BERT or Sentence-BERT; or two randomly initialized deep learning networks may serve as the image coding layer and the text coding layer, and so on, to construct the feature extraction model, which is not limited in any way in this specification.
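For concreteness, a minimal dual-encoder sketch of such a feature extraction model is given below; the backbones, projection dimensions and parameter names are illustrative assumptions rather than requirements of this specification:

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractionModel(nn.Module):
    """Illustrative dual-encoder feature extraction model: an image coding layer
    and a text coding layer projected into one shared embedding space.
    Backbones and dimensions are placeholders, not requirements."""

    def __init__(self, image_encoder, text_encoder, img_dim, txt_dim, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a ViT-B or ResNet backbone (assumed)
        self.text_encoder = text_encoder     # e.g. a BERT or Sentence-BERT backbone (assumed)
        self.image_proj = nn.Linear(img_dim, embed_dim)
        self.text_proj = nn.Linear(txt_dim, embed_dim)

    def encode_image(self, images):
        # L2-normalized image features, so cosine similarity is a dot product
        return F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)

    def encode_text(self, texts):
        # L2-normalized text features in the same embedding space
        return F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)

    def forward(self, images, texts):
        return self.encode_image(images), self.encode_text(texts)
```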
The image-text data pair set may be understood as a set of paired image and text data. The text corresponding to an image may be understood as descriptive text of that image and, similarly, the image corresponding to a text is the image described by that text; for example, an image of a cat playing with a ball on a table corresponds to the text "a gray cat playing with a ball of wool". Feature extraction on the image-text data pair set can be understood as the feature extraction model processing an image and its corresponding text at the same time to obtain paired feature data.
The image data set is a data set containing only images, and the text data set is a data set containing only texts. The image data set and the text data set are preferably data for the same item; the amounts of data in the two sets may differ, the image data and text data in the two sets need not have any correspondence, and data for the same item can be acquired directly.
In practical scenarios, for the same item, a partial image-text data pair set can be acquired directly or constructed manually, after which a large amount of image data is directly acquired as the image data set and a large amount of text data is acquired as the text data set. Preferably, the amount of data in the image data set is close to the amount of data in the text data set, and both are far greater than the amount of data in the image-text data pair set.
In one embodiment provided in this specification, an image-text data pair set of 1000 pairs is acquired, an image data set of 10,000,000 images is acquired, and a text data set of 9,874,160 texts is acquired.
It should be noted that since the feature extraction model is trained from images and text, the features it extracts carry relationships across both the text and the image aspects. That is, the image features obtained by processing an image through the feature extraction model have higher similarity to the text features of the text corresponding to that image and, likewise, the text features obtained by processing a text have higher similarity to the image features of the corresponding image. Furthermore, image-text conversion and mutual retrieval can be performed with the features extracted by the feature extraction model, so that image processing methods such as image search and image analysis can be implemented; combined with services provided by a large language model, more general methods such as automatic question answering and automatic judgment can also be implemented.
By acquiring a data pair set containing a small amount of data together with image data and text data far exceeding the data in the data pair set, the cost increase caused by acquiring a large amount of paired data for model training is avoided, further reducing the data construction cost during data acquisition.
Step 104: acquiring at least two reference feature vectors based on the image-text data feature pair set, wherein the reference feature vectors are feature vectors representing the correspondence between paired image features and text features in the image-text data feature pair set.
In practical application, the reference feature vector is representative feature information of the image-text data feature pair set, and the reference feature vector also reflects the association relationship between the image features and the text features in the image-text data feature pair set because at least two reference feature vectors exist.
Specifically, after the reference feature vector is acquired according to the image-text data feature pair set, when an image feature or text feature is received, the best matching information is found in the reference feature vector. The found information is extracted and then integrated with the feature information to form a new, richer representation. This representation can more fully reflect the content of the input information, as it contains not only the features of the original input, but also merges knowledge and information related thereto and the relationship between the current modality and another modality.
The reference feature vectors are obtained from the image-text data feature pair set, and representative knowledge of the item corresponding to the image-text data can be extracted from a small portion of image-text data. This enables the model to better understand and process input information, since the model can now make use of additional related knowledge to support decisions and predictions. Second, the reference feature vectors also help reduce the amount of information the model needs to learn and memorize directly from the training data, because the knowledge stored in the memory bank (the reference feature vectors) can be used to assist in processing and understanding new information. Training the model in combination with image and text data sets that have not been paired is thus realized, reducing the excessive cost of data construction and avoiding the situation where the model cannot be trained because no data pairs exist in a specific scenario.
Further, obtaining at least two reference feature vectors based on the image-text data feature pair set includes:
initializing at least two reference feature vectors based on a preset initialization rule;
and adjusting each reference feature vector according to the image-text data feature pair set until a reference feature vector adjustment stop condition is reached.
In practical application, the initialization rule is the rule used to initialize the reference feature vectors. The rule may be random initialization, initializing with pre-trained word embeddings or image feature vectors, or performing cluster analysis on the text data and image data in the image-text data pairs and initializing the reference feature vectors with the cluster centers, among others; in practice, the rule is chosen according to the actual requirements of the item, and this specification does not limit it in any way.
Specifically, based on the image-text data feature pair set and the initialized reference feature vectors, a reference feature vector image-text pair loss value is obtained that represents the gap between the content carried by the reference feature vectors and the data in the image-text data feature pairs; the parameters of each reference feature vector are then adjusted using this loss value, so that the reference feature vectors contain the knowledge in the image-text data feature pair set and reflect the relationship between the image data and the text data in that set.
After the reference feature vectors are adjusted, the above steps can be repeated and the reference feature vectors adjusted continuously until the adjustment stop condition is reached. In practical application, the adjustment stop condition of the reference feature vectors includes:
the calculated image-text pair loss value of the reference feature vectors is smaller than a preset threshold; and/or
the number of adjustment rounds reaches a preset number of adjustment rounds.
Specifically, in the process of adjusting the reference feature vectors, the adjustment stop condition may be set such that the image-text pair loss value of the reference feature vectors is smaller than the preset threshold, or such that the number of adjustment rounds reaches a preset number, for example 10 rounds. The preset threshold of the image-text pair loss value and/or the preset number of adjustment rounds are not specifically limited in this specification and depend on the actual application.
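The following sketch illustrates two of the initialization rules mentioned above (random initialization and cluster-center initialization); the function name, prototype count and use of scikit-learn KMeans are assumptions for illustration only:

```python
import torch

def init_reference_vectors(pair_image_feats, pair_text_feats, num_prototypes=5, use_kmeans=False):
    """Illustrative initialization of the reference (prototype) feature vectors.
    pair_image_feats / pair_text_feats: (M, D) features from the image-text pair set.
    Returns image prototypes and text prototypes, each of shape (num_prototypes, D)."""
    dim = pair_image_feats.shape[1]
    if use_kmeans:
        # One possible rule: cluster the paired features and use the cluster centers.
        from sklearn.cluster import KMeans
        img_protos = torch.tensor(
            KMeans(num_prototypes, n_init=10).fit(pair_image_feats.numpy()).cluster_centers_,
            dtype=torch.float32)
        txt_protos = torch.tensor(
            KMeans(num_prototypes, n_init=10).fit(pair_text_feats.numpy()).cluster_centers_,
            dtype=torch.float32)
    else:
        # Another possible rule: random initialization.
        img_protos = torch.randn(num_prototypes, dim)
        txt_protos = torch.randn(num_prototypes, dim)
    # The prototypes are adjusted later, so mark them as trainable.
    return img_protos.requires_grad_(True), txt_protos.requires_grad_(True)
```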
Further, adjusting each reference feature vector according to the image-text data feature pair set includes:
acquiring a target image-text feature data pair, wherein the target image-text feature data pair is any one of the image-text feature data pairs;
determining the image-text pair prototype vector similarity between the target image-text feature data pair and each reference feature vector according to the target image-text feature data pair and each reference feature vector;
and adjusting each reference feature vector according to each image-text pair prototype vector similarity.
In practical application, the target image-text feature data pair is any pair of data in the image-text feature data pair set, and the image-text pair prototype vector similarity is the similarity between that pair of data and each reference feature vector.
Specifically, the image-text pair prototype vector similarity may be obtained in different ways: the image feature and the text feature in the image-text feature data pair may be fused, and the similarity between the fused vector and each reference feature vector taken as the image-text pair prototype vector similarity of that data pair; or the similarity between the image feature in the pair and the image part of the reference feature vectors, and the similarity between the text feature in the pair and the text part of the reference feature vectors, may be calculated separately, with the two similarities together regarded as the image-text pair prototype vector similarity of that data pair.
The image-text pair prototype vector similarity can be understood as the similarity between the paired data in an image-text feature data pair and each initialized reference feature vector. A higher similarity between two vectors also indicates that the knowledge they represent is more similar, so the reference feature vectors can be adjusted through the similarity between the paired data and each initialized reference feature vector, allowing the reference feature vectors to carry more of the knowledge in the image-text data feature pair set.
Preferably, the image data and the text data are processed separately, so as to embody the knowledge of the image modality and the text modality respectively as well as the association between the image modality and the text modality within an image-text pair. Accordingly, the image-text feature data pair comprises image-text pair image feature data and the corresponding image-text pair text feature data, and a reference feature vector comprises an image prototype feature vector or a text prototype feature vector;
determining the image-text pair prototype vector similarity between the target image-text feature data pair and each reference feature vector according to the target image-text feature data pair and each reference feature vector includes:
acquiring the target image-text pair image feature data and the target image-text pair text feature data in the target image-text feature data pair;
calculating the image-text pair image prototype vector similarity between the target image-text pair image feature data and each image prototype feature vector;
and calculating the image-text pair text prototype vector similarity between the target image-text pair text feature data and each text prototype feature vector.
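A minimal sketch of the two similarity calculations above, assuming cosine similarity as the similarity measure (the specification does not fix the measure):

```python
import torch.nn.functional as F

def pair_prototype_similarities(img_feat, txt_feat, img_protos, txt_protos):
    """Cosine similarities between one target image-text feature data pair and
    each image / text prototype vector. Shapes: img_feat, txt_feat (D,);
    img_protos, txt_protos (K, D)."""
    sim_img = F.normalize(img_protos, dim=-1) @ F.normalize(img_feat, dim=-1)  # (K,)
    sim_txt = F.normalize(txt_protos, dim=-1) @ F.normalize(txt_feat, dim=-1)  # (K,)
    return sim_img, sim_txt
```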
In practical applications, a reference feature vector includes an image prototype feature vector corresponding to an image or a text prototype feature vector corresponding to a text; the reference feature vectors include at least one image prototype feature vector and at least one text prototype feature vector. An image prototype feature vector is a vector representing image-modality knowledge among the reference feature vectors, and a text prototype feature vector is a vector representing text-modality knowledge. The image-text pair image feature data are the feature data corresponding to the image in an image-text pair, and the image-text pair text feature data are the feature data corresponding to the text in the image-text pair.
Specifically, the reference feature vectors are divided into image prototype feature vectors corresponding to images and text prototype feature vectors corresponding to texts, so that a complementary and interrelated relationship exists between the stored image prototype feature vectors and text prototype feature vectors. These two kinds of reference feature vectors capture the rich semantic content of visual and linguistic information respectively, and together they support more comprehensive and deeper information understanding and processing.
Since the image prototype feature vectors and the text prototype feature vectors store information of different modalities, with the image prototype feature vectors focused on visual content and the text prototype feature vectors focused on language content, the model can understand the acquired data from two different perspectives. This complementarity ensures that the model can handle and understand a wider range of situations and concepts, and therefore allows the model, given the features of a single modality only, to obtain the corresponding features in the other modality, so that features of the two modalities can be put into correspondence with each other.
Further, adjusting each reference feature vector according to each image-text pair prototype vector similarity includes:
determining image-text pair image similarity feature information and image-text pair text similarity feature information according to each image-text pair image prototype vector similarity and each image-text pair text prototype vector similarity;
and adjusting each reference feature vector according to the image-text pair image similarity feature information and the image-text pair text similarity feature information.
In practical application, the image-text pair image similarity feature information is a vector representing the degree of similarity between the image feature data in an image-text feature data pair and each reference feature vector, and the image-text pair text similarity feature information is a vector representing the degree of similarity between the text feature data in the pair and each reference feature vector.
Specifically, the image-text pair image similarity feature information is a vector obtained from the image-text pair image prototype vector similarities. Preferably, the image prototype vector similarities are sorted in order and a preset number of them are taken as the components of the vector to form the image similarity feature information; the preset number can be equal to or smaller than the number of image prototype vectors, and, considering the resource consumption during model training, a preset number smaller than the number of image prototype vectors is preferred.
Preferably, the image-text pair text prototype vector similarities are likewise sorted in order, and a preset number of them are taken as vector components to form the text similarity feature information; the preset number can be equal to or smaller than the number of text prototype vectors, and, considering the resource consumption during model training, a preset number smaller than the number of text prototype vectors is preferred.
In one embodiment provided in this specification, there are 10 reference feature vectors, comprising 5 image prototype vectors and 5 text prototype vectors. The image-text pair image prototype vector similarities corresponding to the image prototype vectors are 0.05, 0.87, 0.64, 0.94 and 0.76, and the image-text pair text prototype vector similarities corresponding to the text prototype vectors are 0.47, 0.14, 0.88, 0.67 and 0.97. With the preset number set to 3, the similarities are sorted from large to small and the first three of each ranking form the similarity feature information, giving image similarity feature information [0.94, 0.87, 0.76] and text similarity feature information [0.97, 0.88, 0.67].
The image-text pair image similarity feature information and the image-text pair text similarity feature information together represent the similarity between the image-representing part of each reference feature vector and the image feature in the image-text pair, and the similarity between the text-representing part and the text feature. Moreover, because the image and the text in an image-text pair correspond to each other, their image features and text features are similar; therefore, based on similarity feature information representing how close the reference feature vectors are to the paired data in both modalities, the reference feature vectors can be made to carry the knowledge of the image features and text features in the image-text pairs as well as the relationship between the two.
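A sketch of forming the similarity feature information by keeping the preset number of largest similarities, mirroring the example values above (the use of top-k selection here is an assumption consistent with that example):

```python
import torch

def similarity_feature_info(sim_img, sim_txt, preset_number=3):
    """Keep the preset number of largest prototype similarities per modality.
    With the example values above this yields [0.94, 0.87, 0.76] and
    [0.97, 0.88, 0.67] respectively."""
    img_info, _ = torch.topk(sim_img, k=preset_number)  # sorted from large to small
    txt_info, _ = torch.topk(sim_txt, k=preset_number)
    return img_info, txt_info
```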
Further, according to the image-text similarity feature information and the image-text similarity feature information, adjusting each reference feature vector includes:
Determining a reference feature vector image-text pair loss value according to the image-text pair image similarity feature information and the image-text pair text similarity feature information;
And adjusting each reference feature vector according to the reference feature vector graph-text pair loss value.
In practical application, after obtaining the image-text similarity feature information and the image-text similarity feature information, the reference feature vector image-text pair loss value can be calculated according to the two similarity feature information.
Specifically, the reference feature vectors may be adjusted according to the obtained reference feature vector image-text pair loss value by adjusting the parameters of each reference feature vector through back-propagation; alternatively, the amount by which each parameter should be adjusted in order to reduce the loss value can be calculated and the parameters adjusted accordingly, and so on, which is not limited in any way in this specification.
Because the image and the text in an image-text pair correspond to each other, adjusting the parameters of the reference feature vectors through the image-text pair image similarity feature information and the image-text pair text similarity feature information enables the reference feature vectors to carry the knowledge of the image features and text features in the image-text pairs and the relationship between them.
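The loss form is not fixed by this specification; the sketch below assumes one plausible choice, a mean-squared gap between the two similarity feature vectors, with the prototypes updated by back-propagation:

```python
import torch
import torch.nn.functional as F

def reference_pair_loss(img_info, txt_info):
    """Assumed image-text pair loss for the reference feature vectors: a matched
    image and text should relate to the prototypes in a similar way, so the gap
    between their similarity feature vectors is penalised."""
    return F.mse_loss(img_info, txt_info)

# Hypothetical adjustment step (the prototypes are leaf tensors with requires_grad=True):
# optimizer = torch.optim.SGD([img_protos, txt_protos], lr=0.01)
# loss = reference_pair_loss(img_info, txt_info)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```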
Considering that adjusting the reference feature vectors only through the association between the image and the text in the image-text pairs would lead to a small knowledge coverage of the reference feature vectors, after adjusting each reference feature vector, the method further includes:
calculating the reference feature vector similarity between the reference feature vectors;
determining a reference feature vector self-loss value according to each reference feature vector similarity;
and adjusting each reference feature vector according to the reference feature vector self-loss value.
In practical application, the reference feature vector similarity is the similarity between every two reference feature vectors, and the reference feature vector self-loss value is a value representing the degree of dispersion of the reference feature vectors.
Specifically, the reference vector similarities between the reference feature vectors are calculated. Since the reference feature vectors include feature vectors of the image modality and feature vectors of the text modality, and the knowledge coverage does not need to be constrained across the two modalities, it is preferable to calculate the similarities among the image prototype vectors and the similarities among the text prototype vectors separately, and to calculate the reference feature vector self-loss value from these two groups of similarities.
The reference feature vector self-loss value can be understood as representing the knowledge coverage of the reference feature vectors within each modality: the larger and the more uniform the differences between the reference feature vectors, the wider the coverage. Therefore, by presetting a small target similarity, calculating the self-loss value of the reference feature vectors against this target within the same modality, and adjusting the parameters of the reference feature vectors according to this loss value, the problem of low model accuracy caused by an uneven distribution of the knowledge represented by the reference feature vectors can be effectively avoided.
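A sketch of one possible self-loss, under the assumption that pairwise prototype similarities within a modality are pushed toward a small preset target similarity:

```python
import torch
import torch.nn.functional as F

def prototype_self_loss(protos, target_similarity=0.1):
    """Assumed self-loss: pairwise cosine similarities between prototypes of the
    same modality are pushed toward a small preset target similarity, so the
    prototypes spread out and cover that modality's knowledge evenly."""
    p = F.normalize(protos, dim=-1)
    sim = p @ p.t()                                     # (K, K) pairwise similarities
    k = sim.shape[0]
    off_diag = sim[~torch.eye(k, dtype=torch.bool)]     # drop each vector's self-similarity
    return F.mse_loss(off_diag, torch.full_like(off_diag, target_similarity))
```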
Step 106: training the feature extraction model according to each reference feature vector, the image data set and the text data set until the training stop condition of the feature extraction model is reached.
In practical application, feature data corresponding to the images in the image data set and the text feature data corresponding to them can be obtained with the help of each reference feature vector, the image data set and the text data set; processing these image features and the corresponding text features based on the reference feature vectors yields more accurate versions of both kinds of feature data, from which a model loss value can be calculated. After the model loss value is obtained, the model parameters of the feature extraction model can be adjusted according to it; specifically, the model loss value can be back-propagated to update the model parameters of the feature extraction model in turn.
After the model parameters are adjusted, the above steps can be repeated and the feature extraction model trained continuously until the training stop condition is reached. In practical application, the training stop condition of the feature extraction model includes:
the model loss value is smaller than a preset threshold; and/or
the number of training rounds reaches a preset number of training rounds.
Specifically, in the process of training the feature extraction model, the training stop condition may be set such that the model loss value is smaller than the preset threshold, or such that the number of training rounds reaches a preset number, for example 10 rounds. The preset threshold of the loss value and/or the preset number of training rounds are not specifically limited in this specification and depend on the actual application.
By processing the image feature data and the corresponding text feature data with the reference feature vectors, more accurate versions of both kinds of feature data can be obtained and the model loss value calculated from them, so that the trained model achieves higher accuracy when processing single-modality input information.
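A minimal outer training loop reflecting the two stopping conditions above; step_fn and the numeric defaults are hypothetical placeholders:

```python
def train_feature_extraction_model(model, step_fn, max_rounds=10, loss_threshold=0.05):
    """Illustrative outer loop with the two stopping conditions described above.
    step_fn is an assumed callable that runs one training round and returns the
    model loss value; max_rounds and loss_threshold are placeholder settings."""
    for _ in range(max_rounds):              # stop after the preset number of rounds
        loss = step_fn(model)
        if loss < loss_threshold:            # or stop once the loss is small enough
            break
    return model
```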
Further, training the feature extraction model according to each reference feature vector, the image data set, and the text data set, includes:
inputting the image data set and the text data set into the feature extraction model to obtain at least one first image feature data and at least one first text feature data output by the feature extraction model;
determining at least one second image feature data and the second text feature data corresponding to each second image feature data according to each first image feature data and each first text feature data;
calculating a model loss value according to each reference feature vector, each second image feature data and second text feature data corresponding to each second image feature data;
And adjusting parameters of the feature extraction model according to the model loss value.
In practical application, the first image feature data is feature data corresponding to each image in the image data set, the first text feature data is feature data corresponding to each text in the text data set, the second image feature data is image feature data in the initially constructed paired data, the second text feature data is text feature data in the initially constructed paired data, and the model loss value represents a gap between feature data representing an image obtained by a model and feature data representing a text.
Specifically, the first image feature data and the first text feature data are obtained by processing the images and texts with the feature extraction model before adjustment. Because the model is trained by comparing whether the image features and text features it extracts are similar, the text feature most similar to each image feature must first be obtained; that is, the text feature data corresponding to the image feature data is determined, and the paired second image feature data and second text feature data are constructed.
Compared with manually constructed paired image-text data, the image-text feature pairs constructed in this way have a lower degree of matching and cannot be used directly to train the feature extraction model; however, the construction cost of this approach is far lower than that of manually constructed image-text pair data. Therefore, after the second text feature data corresponding to the image feature data are determined, the feature extraction model is trained in combination with the reference feature vectors.
Further, determining at least one second image feature data and second text feature data corresponding to each second image feature data according to each first image feature data and each first text feature data, including:
Determining the image-text similarity between each first image feature data and each first text feature data according to each first image feature data and each first text feature data;
Determining at least one second image characteristic data based on the image-text similarity, and determining second text characteristic data corresponding to each second image characteristic data; or determining at least one piece of second text feature data based on the image-text similarity, and determining second image feature data corresponding to each piece of second text feature data.
In practical applications, the image-text similarity is used to represent the similarity between the image features and the text features.
Specifically, the similarity between the image feature data and the text feature data may be determined directly from the image features and the text features, or may be determined from the characteristics of the images and texts within their respective sets. The paired second image feature data and second text feature data may be constructed either by determining the second text feature data corresponding to each second image feature data, or by determining the second image feature data corresponding to each second text feature data.
By means of the degree of similarity between the image feature data and each text feature data, the text feature data corresponding to an image feature is preliminarily confirmed, and image-feature and text-feature pairs with a certain correspondence can be preliminarily built, so that the feature extraction model can be trained in combination with the reference feature vectors, as sketched below.
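A sketch of this preliminary pairing under the assumption that, for every first image feature, the most similar first text feature is taken as its provisional partner:

```python
import torch.nn.functional as F

def build_pseudo_pairs(image_feats, text_feats):
    """For every first image feature, take the most similar first text feature as
    its provisional partner, giving the second image/text feature data pairs.
    image_feats: (N_i, D), text_feats: (N_t, D)."""
    sim = F.normalize(image_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()  # (N_i, N_t)
    best_text = sim.argmax(dim=1)               # index of the closest text per image
    return image_feats, text_feats[best_text]   # second image features and matched text features
```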
Further, determining the image-text similarity between the first image feature data and the first text feature data according to the first image feature data and the first text feature data, including:
determining first image similarity characteristic data corresponding to each first image characteristic data according to each first image characteristic data;
determining first text similarity feature data corresponding to each first text feature data according to each first text feature data;
And determining the image-text similarity between each first image feature data and each first text feature data based on each first image similarity feature data and each first text similarity feature data.
In practical application, the first image similarity feature data characterize each image feature datum within the image set, and the first text similarity feature data characterize each text feature datum within the text set.
Specifically, the image similarity feature data corresponding to the first image feature data may be obtained as follows: a deep learning model is trained on part of the data features in the image set, the trained model is then used to judge the degree of confusion corresponding to each image, and that degree of confusion is used as the image similarity feature data corresponding to the image feature data; alternatively, the calculated similarity between each piece of image feature data and the remaining image feature data may be used as the image similarity feature data corresponding to that image feature data, and so on, which is not limited in this specification.
The text similarity feature data corresponding to the first text feature data may be obtained in the same way: a deep learning model is trained on part of the data features in the text set, the trained model is then used to judge the degree of confusion corresponding to each text, and that degree of confusion is used as the text similarity feature data corresponding to the text feature data; alternatively, the calculated similarity between each piece of text feature data and the remaining text feature data may be used as the text similarity feature data corresponding to that text feature data, which is not limited in this specification.
The image-text similarity determined based on the first image similarity feature data and the first text similarity feature data may be understood as a similarity between the first image similarity feature data and the first text similarity feature data, and may be determined as an image-text similarity between the first image feature data and the first text feature data.
Because the large number of images in the image set and the large number of texts in the text set belong to the same scene, their knowledge distributions are highly similar. Therefore, by acquiring the characteristics of each image feature within the image feature space formed by all image features in the image set, and the characteristics of each text feature within the text feature space formed by all text features in the text set, and determining the paired second image features and second text features according to these characteristics in the respective feature spaces, the degree of similarity between the images and texts corresponding to the second image features and second text features can be improved.
Moreover, by acquiring the characteristics of each modality within its own feature space for the same scene, the situation is avoided in which similar data are encoded differently, and the resulting feature data therefore differ greatly, because a pre-trained multi-modal encoder is used to encode data of multiple modalities. A randomly initialized image encoder and text encoder can thus be trained to obtain a multi-modal model with a high degree of matching between multi-modal features.
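One possible reading of these intra-set "similarity feature data" is sketched below: each item is described by its sorted similarities to the other items of the same modality, truncated to k values so that image and text signatures become directly comparable. This is an illustrative assumption, not the only realization covered by the description.

```python
import numpy as np

def intra_set_signature(feats: np.ndarray, k: int) -> np.ndarray:
    """feats: (N, d), rows L2-normalized; returns (N, k) per-item similarity signatures."""
    sim = feats @ feats.T                          # similarities within one modality
    np.fill_diagonal(sim, -np.inf)                 # exclude each item's similarity to itself
    return -np.sort(-sim, axis=1)[:, :k]           # k largest similarities, descending

def image_text_similarity(img_sig: np.ndarray, txt_sig: np.ndarray) -> np.ndarray:
    """Compare image and text signatures; negative L2 distance, larger = more similar."""
    diff = img_sig[:, None, :] - txt_sig[None, :, :]
    return -np.linalg.norm(diff, axis=-1)          # (num_images, num_texts)
```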
Further, determining first image similarity feature data corresponding to each first image feature data according to each first image feature data includes:
determining target first image feature data and at least one reference first image feature data corresponding to the target first image feature data in each first image feature data;
Calculating the similarity of the image set corresponding to the target first image feature data and each reference first image feature data;
and determining first image similarity characteristic data corresponding to the target first image characteristic data based on the similarity of each image set.
In practical application, the target first image feature data is any one of the first image feature data, the reference first image feature data is the rest of the first image feature data other than the target first image feature data, and the image set similarity is the vector similarity between the target first image feature data and each of the remaining first image feature data.
Specifically, the vector similarity between the target first image feature data and the rest of the first image feature data may be understood as the feature of the target first image feature data in the image set, or may be further understood as the feature of the target image feature data in the image feature space formed by all the image feature data in the image set.
Because, for the same scene, the image feature space formed by all image feature data in the image set is similar to the text feature space formed by all text feature data in the text set, the first image similarity feature data is determined from the similarity between the target first image feature data and the reference first image feature data. The paired second image feature data and second text feature data determined from the similarity between the first image similarity feature data and the first text similarity feature data then have a higher degree of matching, which improves the training accuracy of the data processing model and makes it possible to train an accurate data processing model from randomly initialized encoders.
Further, determining first text similarity feature data corresponding to each first text feature data according to each first text feature data includes:
Determining target first text feature data and at least one reference first text feature data corresponding to the target first text feature data in each first text feature data;
calculating the similarity of the target first text feature data and the text set corresponding to each reference first text feature data;
And determining first text similarity feature data corresponding to the target first text feature data based on the similarity of each text set.
In practical application, the target first text feature data is any one of the first text feature data, the reference first text feature data is the rest of the first text feature data other than the target first text feature data, and the text set similarity is the vector similarity between the target first text feature data and each of the remaining first text feature data.
Specifically, the vector similarity between the target first text feature data and the rest of the first text feature data may be understood as a feature of the target first text feature data in the text set, or may be further understood as a feature of the target text feature data in a text feature space formed by all text feature data in the text set.
Because, for the same scene, the text feature space formed by all text feature data in the text set is similar to the image feature space formed by all image feature data in the image set, the first text similarity feature data is determined from the similarity between the target first text feature data and the reference first text feature data. The paired second image feature data and second text feature data determined from the similarity between the first text similarity feature data and the first image similarity feature data then have a higher degree of matching, which improves the training accuracy of the data processing model and makes it possible to train an accurate data processing model from randomly initialized encoders.
Further, determining at least one second image feature data based on the image-text similarity, and determining second text feature data corresponding to each second image feature data, including:
determining target second image feature data, wherein the target second image feature data is any one of the first image feature data;
and determining target second text feature data corresponding to the target second image feature data according to the image-text similarity.
Specifically, the text feature data corresponding to an image feature may be determined from the image-text similarity between the image feature and each text feature either by selecting the text feature data with the highest similarity as the text feature data corresponding to the image feature data, or by fusing several text feature data with high image-text similarity into one piece of feature data to serve as the text feature data corresponding to the image, which is not limited in this specification.
In one embodiment provided in this specification, the second image feature data is vector 0, and the first image similarity feature data corresponding to it in the image set is vector 0'. There are 5 texts in total, namely text 1: "this is a warm comfortable afternoon", text 2: "this is a fully lit afternoon", text 3: "this is a warm comfortable morning", text 4: "today is a cloudy day" and text 5: "this scene shows a starry night sky", whose corresponding similarity feature data in the text set are vector 1' to vector 5'. The similarities between vector 0' and these vectors are calculated as vector 1': 0.85, vector 2': 0.93, vector 3': 0.69, vector 4': 0.41 and vector 5': 0.13. After sorting, vector 2, which corresponds to vector 2' with the highest similarity, that is, the feature vector of text 2 "this is a fully lit afternoon", is determined as the second text feature data corresponding to the image feature data vector 0.
Through the image-text similarity between the image feature data and each text feature data, the text feature data corresponding to each image feature is preliminarily determined, and image feature and text feature pairs with a certain correspondence can be preliminarily constructed, so that the feature extraction model can subsequently be trained in combination with the reference feature vectors.
Further, determining target second text feature data corresponding to the target second image feature data according to the image-text similarity, including:
determining at least one intermediate text feature data from the first text feature data according to the image-text similarity;
And determining target second text feature data according to each intermediate text feature data.
In practical application, the intermediate text feature data are the text feature data, determined from the first text feature data, that best match the image feature. Specifically, the multiple text features may be fused directly with a vector fusion algorithm, yielding text feature data that contain the multiple text features and are close to the image feature data; alternatively, the texts behind the multiple text features may be retrieved, processed by a language processing model, and the resulting text subjected to feature extraction, yielding text feature data that fuse the multiple text features based on a large model and are close to the image feature information. The direct vector fusion may be element-wise multiplication, weighted summation, max pooling, or another fusion method that does not change the vector length, which is not limited in this specification.
In one embodiment provided in this specification, following the above example, the text feature vectors corresponding to vector 1', vector 2' and vector 3', which rank in the top three by similarity, are determined: vector 1, vector 2 and vector 3. These three text feature vectors are used as intermediate text feature data; their corresponding positions are multiplied directly and the result is normalized to serve as the second text feature data corresponding to the image feature data vector 0.
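A minimal sketch of the fusion in this example is given below, assuming the intermediate text feature vectors are rows of a NumPy array; element-wise multiplication followed by normalization is shown, and the other length-preserving fusions mentioned above would be analogous.

```python
import numpy as np

def fuse_text_features(intermediate: np.ndarray) -> np.ndarray:
    """intermediate: (k, d) intermediate text feature vectors; returns one fused (d,) vector."""
    fused = np.prod(intermediate, axis=0)          # multiply corresponding positions
    norm = np.linalg.norm(fused)
    return fused / norm if norm > 0 else fused     # normalize to unit length
```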
By acquiring a plurality of text feature data and fusing the text feature data to acquire text feature data corresponding to the second image feature data, the method can be understood as combining the text feature data corresponding to the second image feature data, and improves the accuracy of initially constructing the image feature data and text feature data pairs, so as to improve the training efficiency of training the feature extraction model by combining the reference feature vectors, and improve the accuracy of the trained feature extraction model.
Further, determining target second text feature data according to each intermediate text feature data includes:
based on each intermediate text feature data, acquiring text data corresponding to each intermediate text feature data;
and acquiring target text data according to each text data, inputting the target text data into the feature extraction model, and acquiring target second text feature data output by the feature extraction model.
In practical application, the target text data is text data fused with each feature of the intermediate text feature data, and specifically, acquiring text data corresponding to each intermediate text feature data can be understood as acquiring text before feature extraction of the intermediate text feature data.
In one embodiment provided in this specification, following the above example, the text feature vectors corresponding to vector 1', vector 2' and vector 3', which rank in the top three by similarity, are determined: vector 1, vector 2 and vector 3. These three text feature vectors are used as intermediate text feature data, and the texts corresponding to them are obtained, namely text 1: "this is a warm comfortable afternoon", text 2: "this is a fully lit afternoon" and text 3: "this is a warm comfortable morning". The three texts are then spliced into the question "please extract the common point of the sentences 'this is a warm comfortable afternoon', 'this is a fully lit afternoon' and 'this is a warm comfortable morning' and summarize it as one sentence", which is input into a large language model to obtain the target text data "this is a comfortable day". The target text data is then input into the feature extraction model, and the text feature data extracted from it is used as the second text feature data corresponding to the image feature data vector 0.
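A hedged sketch of this step is shown below; `ask_language_model` and `extract_text_feature` are placeholders for whichever large language model and feature extraction model are actually deployed, not calls to a real library.

```python
def build_target_text_feature(texts, ask_language_model, extract_text_feature):
    """texts: the texts behind the intermediate text feature data."""
    prompt = ("Please extract the common point of the following sentences and "
              "summarize it as one sentence: " + "; ".join(f"'{t}'" for t in texts))
    target_text = ask_language_model(prompt)       # e.g. "this is a comfortable day"
    return extract_text_feature(target_text)       # target second text feature data
```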
By utilizing the powerful language processing capability of the large language model and processing the text feature data similar to the image feature data, the accuracy of initially constructing the image feature data and text feature data pairs can be further improved, so that the training efficiency of training the feature extraction model by combining the reference feature vectors is further improved, and the accuracy of the trained feature extraction model is further improved.
Considering that the model receives data of an image modality and data of a text modality, and that the knowledge carried by the two modalities differs but has a certain association, the reference feature vectors include image prototype feature vectors or text prototype feature vectors;
further, calculating a model loss value according to each reference feature vector, each second image feature data and second text feature data corresponding to each second image feature data, including:
Determining target second image feature data and target second text feature data, wherein the target second image feature data is any one of the second image feature data;
Obtaining at least one image prototype vector similarity according to the target second image feature data and each image prototype feature vector, wherein the image prototype vector similarity is the similarity between the target second image feature data and each image prototype feature vector;
obtaining at least one text prototype vector similarity according to the target second text feature data and each text prototype feature vector, wherein the text prototype vector similarity is the similarity between the target second text feature data and each text prototype feature vector;
And calculating a model loss value according to the similarity of each image prototype vector and the similarity of each text prototype vector.
In practical application, the target second image feature data is feature data corresponding to any one image in the image data set, the target second text feature data is feature data corresponding to any one text in the text data set, the image prototype vector similarity is the similarity between the target image feature and the image-side vectors among the reference feature vectors, the text prototype vector similarity is the similarity between the target text feature and the text-side vectors among the reference feature vectors, and the model loss value represents the difference between the paired image feature data and text feature data extracted by the model.
Specifically, since the matching degree between the second image feature information and the second text feature information in the image-text feature data pairs determined by preliminary matching is low, the matching degree between these two pieces of feature data and the multiple reference feature vectors, which were adjusted using feature data with a high matching degree, can be calculated instead. This can be understood as abstracting the two pieces of feature data through specific knowledge of multiple aspects of the correspondence between images and texts, so as to obtain how the image and its corresponding text behave in those aspects. This improves the accuracy of the relationship between the obtained image feature data and the second text feature information; by replacing the fine-grained image feature data and text feature data with their broad behaviour against the reference feature vectors in multiple aspects, training of the feature extraction model can be achieved using only two unpaired single-modality data sets, which reduces the cost of model training.
Further, calculating a model loss value according to the similarity of each image prototype vector and each text prototype vector, including:
determining an image similarity vector corresponding to the target second image feature data based on the similarity of each image prototype vector;
determining a text similarity vector corresponding to the target second text feature data based on the similarity of each text prototype vector;
and calculating a model loss value according to the image similarity vector and the text similarity vector.
In practical applications, the image similarity vector is a feature vector representing the degree of matching of the second image feature data with the knowledge in the multiple aspects of vision, and the text similarity vector is a feature vector representing the degree of matching of the second text feature data corresponding to the second image feature data with the knowledge in the multiple aspects of text.
The image prototype vector similarities may be sorted, and a preset number of them taken as the components of an image similarity vector; the preset number may equal or be smaller than the number of image prototype vectors, and considering resource consumption during model training, a preset number smaller than the number of image prototype vectors is preferred. Similarly, the text prototype vector similarities may be sorted, and a preset number of them taken as the components of a text similarity vector; this preset number may likewise equal or be smaller than the number of text prototype vectors, and a smaller preset number is again preferred for reasons of resource consumption. Preferably, the preset number used in this step is the same as the preset number used for the image similarity feature information and the text similarity feature information.
In one embodiment provided in this specification, there are 10 reference feature vectors, including 5 image prototype vectors and 5 text prototype vectors. The image prototype vector similarities between the target second image feature data and each image prototype vector are 0.98, 0.22, 0.68, 0.78 and 0.77, and the text prototype vector similarities between the target second text feature data and each text prototype vector are 0.13, 0.76, 0.81, 0.41 and 0.91. The preset number is set to 3, that is, the similarities are sorted from large to small and the top three form the image similarity vector and the text similarity vector; the resulting image similarity vector is [0.98, 0.78, 0.77] and the text similarity vector is [0.91, 0.81, 0.76].
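The example above can be reproduced with the short sketch below; the final loss is shown as 1 minus cosine similarity purely for illustration, since the exact loss function is not fixed here.

```python
import numpy as np

def topk_similarity_vector(sims: np.ndarray, k: int) -> np.ndarray:
    return -np.sort(-sims)[:k]                     # k largest similarities, descending

img_proto_sims = np.array([0.98, 0.22, 0.68, 0.78, 0.77])
txt_proto_sims = np.array([0.13, 0.76, 0.81, 0.41, 0.91])
img_vec = topk_similarity_vector(img_proto_sims, 3)   # [0.98, 0.78, 0.77]
txt_vec = topk_similarity_vector(txt_proto_sims, 3)   # [0.91, 0.81, 0.76]
loss = 1.0 - img_vec @ txt_vec / (np.linalg.norm(img_vec) * np.linalg.norm(txt_vec))
```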
Because the matching degree between the second image feature information and the second text feature information in the image-text feature data pairs determined by preliminary matching is low, training the feature extraction model only with the preliminarily matched feature information of the two modalities would yield a model of too low accuracy. Therefore, the reference feature vectors are used to extract the proportion of knowledge that the feature information of the two modalities occupies in each direction, so that the fine-grained image feature data and text feature data can be replaced by their broad degrees of similarity to that knowledge. This raises the matching degree between the similarity vectors of similar images and texts, and a more accurate model loss value can be calculated from the improved matching degree, thereby realizing training of the model. In this way, by means of reference feature vectors that carry the correspondence between images and texts in multiple directions, the model can be trained using only two single-modality data sets that have not been paired, which reduces the cost of model training.
By applying the scheme of the embodiments of this specification, the similarity between the image data of each image-text feature data pair and the corresponding reference feature vectors, and the similarity between the text data and the corresponding reference feature vectors, are calculated, and each initialized reference feature vector is then adjusted using the two similarities, so that the reference feature vectors carry the relationship between image features and text features in the image-text data pairs.
When the model is trained, the text feature data corresponding to each image feature data is determined from the multiple image feature data and multiple text feature data extracted by the model, so that image features and text features are matched preliminarily and automatically. The similarity between the image feature data and the adjusted reference vectors, and the similarity between the text feature data and the adjusted reference vectors, are then calculated. Because the adjusted reference vectors carry the association between image features and text features, the model can understand the extracted image and text features through reference feature vectors that comprehensively carry the image-text association, so the image features and text features extracted by the model become more accurate. The feature extraction model is thus trained with an image set and a text set that have not been paired, which reduces the cost required to train the model.
Corresponding to the above method embodiment, the present disclosure further provides an embodiment of a machine learning model training method applied to cloud side equipment, referring to fig. 2, fig. 2 shows a flowchart of a feature extraction model training method applied to cloud side equipment according to one embodiment of the present disclosure, and specifically includes the following steps.
Step 202: and the image-text data pair set, the image data set and the text data set which are sent by the receiving terminal side equipment are processed through the feature extraction model to obtain the image-text data feature pair set.
Step 204: and acquiring at least two reference feature vectors based on the image-text data feature pair set, wherein the reference feature vectors are feature vectors representing the corresponding relation between paired image-text feature and text feature in the image-text data feature pair set.
Step 206: and training the feature extraction model according to each reference feature vector, the image data set and the text data set until the training stopping condition of the feature extraction model is reached.
Step 208: and obtaining model parameters in the trained feature extraction model, and returning the model parameters to the end-side equipment.
The above is a schematic scheme of a machine learning model training method applied to cloud-side equipment in this embodiment. It should be noted that, the technical solution of the machine learning model training method applied to the cloud side device and the technical solution of the machine learning model training method described above belong to the same concept, and details of the technical solution of the machine learning model training method applied to the cloud side device, which are not described in detail, can be referred to the description of the technical solution of the machine learning model training method described above.
By applying the scheme of the embodiments of this specification, each reference feature vector is obtained by adjusting the initialized reference feature vectors with a small amount of image-text feature data sent by the end-side device, so that the reference feature vectors carry the relationship between image features and text features in the image-text data. When the model is then trained, it can understand the extracted image features and text features through reference feature vectors that comprehensively carry the image-text association, so the extracted image features and text features become more accurate. The feature extraction model is trained with an image set and a text set that have not been paired, which, on the basis of ensuring the accuracy of the image-text model trained by the cloud-side device for a special scene, reduces the cost for the end-side device of constructing project-specific data.
Referring to fig. 3, fig. 3 shows a flowchart of a text-based image searching method applied to a cloud-side device according to an embodiment of the present disclosure, and specifically includes the following steps.
Step 302: and receiving an image searching instruction sent by the terminal side equipment, wherein the image searching instruction carries target searching text.
In practical application, the image searching instruction is an instruction for searching an image according to a text, and the target search text is a text provided by a user. Specifically, the target search text may be understood as a description of a picture by the user, for example, "horse eating grass", "a bag printed with a human face", and so on.
By receiving the target search text, at least one image corresponding to the text can be obtained according to the target search text.
Step 304: and inputting the target search text into a feature extraction model to obtain target search text feature information output by the feature extraction model, wherein the feature extraction model is obtained by training the machine learning model training method.
In practical application, the target search text feature information is the text feature information obtained by extracting features from the target search text with the feature extraction model. Specifically, because the text features extracted by the feature extraction model for texts in the specific scene still carry the features associated with images, related targets can still be identified and retrieved when texts in the specific scene are searched, which provides users with a wider range of search choices and improves the accuracy of image search.
Step 306: and determining an image search result corresponding to the image search instruction according to the target search text characteristic information, and returning the image search result to the terminal side equipment.
In practical application, the image search result is at least one image acquired by the target data processing model according to the target search text. Specifically, the at least one corresponding image can be determined from the text feature information by processing a pre-built image library with the feature extraction model so as to extract the features of each image in the library, and then comparing the similarity between the target search text feature and each image feature, so as to obtain the images corresponding to the features whose similarity meets the condition.
By applying the scheme of the embodiments of this specification, images are searched using search text feature information obtained from the feature extraction model trained by the above machine learning model training method. In the case where a project corresponds to a specific scene, the cost of using the text-based image searching method can be reduced while the accuracy of text-based image searching is improved.
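A minimal retrieval sketch along these lines is shown below, assuming the feature extraction model returns L2-normalized vectors and the image library features have been precomputed; all names are illustrative.

```python
import numpy as np

def search_images(query_text_feature: np.ndarray, library_image_features: np.ndarray,
                  image_ids: list, top_k: int = 5):
    """library_image_features: (N, d); query_text_feature: (d,); both L2-normalized."""
    sims = library_image_features @ query_text_feature    # cosine similarity per image
    order = np.argsort(-sims)[:top_k]                      # indices of best matches
    return [(image_ids[i], float(sims[i])) for i in order]
```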
Referring to fig. 4, fig. 4 shows a flowchart of an image analysis method applied to a cloud-side device according to an embodiment of the present disclosure, and specifically includes the following steps.
Step 402: and receiving an image analysis instruction sent by the terminal side equipment, wherein the image analysis instruction carries an image to be analyzed.
In practical applications, the image analysis instruction is an instruction requesting the analysis of one or more images, and the image to be analyzed is an image uploaded by a user. Specifically, the image to be analyzed may be any image on which the user wishes to perform content recognition, style analysis or emotion judgment, for example, a photo of a cat with a rich expression, a picture depicting an ancient castle, an image of a road, an image containing multiple entities, and so on.
By receiving an image analysis instruction carrying an image to be analyzed, subsequent in-depth analysis and interpretation according to the image can be realized.
Step 404: inputting the image to be analyzed into a feature extraction model to obtain feature information of the image to be analyzed output by the feature extraction model, wherein the feature extraction model is obtained by training by the machine learning model training method.
In practical application, the image feature information to be analyzed is the image feature information obtained by extracting features from the image to be analyzed with the feature extraction model. Specifically, because the image features extracted by the feature extraction model for images in the specific scene still carry the features associated with texts, images in the specific scene can be analyzed accurately, which provides users with wider choices and improves the accuracy of image analysis.
Step 406: and determining an image analysis result corresponding to the image analysis instruction according to the image characteristic information to be analyzed, and returning the image analysis result to the end-side equipment.
In practical application, the image analysis result is a multidimensional analysis result of content recognition, style analysis, emotion judgment and the like performed by the target data processing model according to the image to be analyzed. Specifically, the image analysis result may be understood as a text corresponding to the image, for example, the image is a picture of a horse eating grass, the corresponding image analysis result is "horse eating grass", the image analysis result may also be understood as statistics and analysis of each entity in the image, for example, the image to be analyzed is a road image containing a complex scene, and the image analysis result corresponding to the image is "the picture has a billboard and two automobiles, wherein the front wheel of the sedan is broken".
It should be noted that, the analysis result of the image is determined according to the image features, and the constructed analysis result library (that is, a plurality of text libraries for describing the image) may be implemented through the feature extraction model processing, so as to extract the features of each text in the analysis result library, and then, the similarity between the image features to be analyzed and each text feature is compared, so as to obtain the text corresponding to the text features with the similarity meeting the condition.
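This mirrors the retrieval sketch above in the opposite direction; a hedged sketch, assuming an analysis result library of descriptive texts whose features were precomputed by the feature extraction model.

```python
import numpy as np

def analyze_image(image_feature: np.ndarray, description_features: np.ndarray,
                  descriptions: list, top_k: int = 1):
    """description_features: (M, d) features of descriptive texts; image_feature: (d,)."""
    sims = description_features @ image_feature            # similarity to each description
    order = np.argsort(-sims)[:top_k]
    return [descriptions[i] for i in order]                # best-matching analysis texts
```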
By applying the scheme of the embodiments of this specification, the image to be analyzed is analyzed using the image feature information obtained from the feature extraction model trained by the above machine learning model training method. In the case where a project corresponds to a specific scene, the cost of using the image analysis method can be reduced while the accuracy of image analysis is improved.
Referring to fig. 5, fig. 5 shows a flowchart of an automatic question-answering method applied to cloud-side equipment according to one embodiment of the present disclosure, and specifically includes the following steps.
Step 502: and receiving the question data sent by the terminal side equipment, wherein the question data comprises at least one of question text data and question image data.
In practical application, the question data is text data or image data uploaded by the user. Specifically, the question data can be understood as the question the user wants solved, and receiving the question data sent by the end-side device can be understood as receiving the user's actual question so that it can be processed. In one embodiment provided in this specification, the question data uploaded by the user is "analyze the picture" together with picture 1. By receiving the actual question sent by the user, processing of that question is realized.
Step 504: Inputting the question data into a language processing model to obtain question data to be processed, and inputting the question data to be processed into a feature extraction model to obtain question feature data output by the feature extraction model, wherein the feature extraction model is obtained by training with the above machine learning model training method.
In practical application, the language processing model is a large model for processing the text data in the question data, the question data to be processed is more detailed question data obtained by the large model from the text data given by the user, and the question feature data is the data obtained by extracting features from the question data to be processed.
Specifically, after the language processing model is used for extracting the questions raised by the user, the features extracted by the feature extraction model can be more accurate, and answer data generated later can be more accurate.
In one embodiment provided in this specification, the question data uploaded by the user includes the text data "analyze the picture" and the image data picture 1. The language processing model determines that the user needs the picture analyzed, so the question data to be processed is picture 1; the feature extraction model then extracts the features of picture 1, and the subsequent analysis of picture 1 is realized.
In another embodiment provided in this specification, the question data uploaded by the user includes the text data "please help me find a picture of a red automobile driving". The language processing model analyzes the text data and determines that a picture needs to be found from the text, so the text uploaded by the user is further refined into "a picture of a red automobile driving on a road", and this text is used as the question data to be processed for feature extraction.
In another embodiment provided in this specification, the question data uploaded by the user includes the image data picture 2. Since the user did not input any text data, the language processing model determines from the context that the user needs to find pictures similar to this picture, so picture 2 is used as the question data to be processed for feature extraction.
Considering that the questions posed by users are usually colloquial and not described very precisely, the questions are first processed by a large model, which improves the accuracy of resolving the question data posed by the user.
Step 506: and generating answer data according to the question feature data, and returning the answer data to the terminal side equipment.
In practical application, the generated answer data comprises text data and image data, and is determined according to the practical requirements of the user. Specifically, the actual demands of the users are obtained according to the language processing model.
In one embodiment provided in this specification, the question data uploaded by the user includes the text data "analyze the picture" and the image data picture 1. The language processing model analyzes the text data and determines that a method of searching text by image needs to be performed, so the answer data corresponding to the question is the descriptive text corresponding to picture 1.
In another embodiment provided in the present specification, the question data uploaded by the user includes text data "please help me find a picture of red automobile driving", and the text data is analyzed by the language processing model to determine that a text-based image searching method needs to be implemented, and then the obtained answer data is a picture corresponding to the text data.
In another embodiment provided in this specification, the question data uploaded by the user includes the image data picture 2. Since the user did not input any text data, the language processing model determines from the context that the user needs to search for pictures by picture, so the answer data corresponding to the question is picture 3, which is similar to picture 2.
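The flow across steps 502 to 506 can be summarized with the hedged sketch below; every helper callable (language model routing, feature extraction, retrieval) is a placeholder rather than a real API, and the three intents correspond to the three embodiments above.

```python
def answer_question(question_text, question_image, language_model,
                    extract_feature, search_images, search_texts):
    intent, refined_query = language_model(question_text, question_image)
    if intent == "describe_image":               # e.g. "analyze the picture" + picture 1
        return search_texts(extract_feature(question_image))
    if intent == "find_image_by_text":           # e.g. "a red automobile driving on a road"
        return search_images(extract_feature(refined_query))
    return search_images(extract_feature(question_image))   # default: similar-image search
```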
Considering that the model can be further adjusted according to the feedback of the user to improve the accuracy of the model, after the answer data is returned to the end-side device, the method further comprises:
receiving feedback data aiming at the answer data and sent by the terminal side equipment;
and adjusting the feature extraction model according to the question data, the answer data and the feedback data.
In practical application, the feedback data is data reflecting the user's satisfaction with the answer data. After obtaining the answer data, the user can decide whether it is satisfactory and feed the result back to the cloud; the satisfaction is then used as a label for the question data and the answer data to fine-tune the feature extraction model, so that the feature extraction model extracts features more accurately.
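One way such feedback could be folded into fine-tuning is sketched below, treating satisfaction as a +1/-1 label with a cosine embedding loss; this is an illustrative choice, and the features are assumed to come from the feature extraction model's forward pass so that gradients reach its parameters.

```python
import torch
import torch.nn.functional as F

def finetune_on_feedback(optimizer, question_feature, answer_feature, satisfied: bool):
    """question_feature, answer_feature: (d,) tensors produced by the feature extraction model."""
    target = torch.tensor([1.0 if satisfied else -1.0])
    loss = F.cosine_embedding_loss(question_feature.unsqueeze(0),
                                   answer_feature.unsqueeze(0), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```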
By applying the scheme of the embodiments of this specification, questions are answered using question feature data obtained from the feature extraction model trained by the above machine learning model training method, so a large amount of paired data for the corresponding project does not need to be constructed, which reduces the cost of using the automatic question-answering method.
Referring to fig. 6, fig. 6 illustrates an architecture diagram of an automatic question-answering system provided in one embodiment of the present description, which may include a client 100 and a server 200;
the client 100 is configured to send problem data and feedback data to the server 200;
A server 200, configured to receive question data sent by the client 100, where the question data includes at least one of question text data and question image data; inputting the problem data into a language processing model to obtain problem data to be processed, and inputting the problem data to be processed into a feature extraction model to obtain problem feature data output by the feature extraction model, wherein the feature extraction model is obtained by training the feature extraction model training method; generating answer data according to the question feature data, and sending the answer data to the client 100;
the client 100 is further configured to receive answer data sent by the server 200, generate feedback data according to the answer data, and send the feedback data to the server 200;
The server 200 is configured to receive feedback data for the answer data sent by the client 100; and adjusting the feature extraction model according to the question data, the answer data and the feedback data.
By applying the scheme of the embodiments of this specification, questions are answered using question feature data obtained from the feature extraction model trained by the above machine learning model training method, so a large amount of paired data for the corresponding project does not need to be constructed, which reduces the cost of using the automatic question-answering method.
The automatic question-answering system may include a plurality of clients 100 and a server 200, wherein the clients 100 may be referred to as end-side devices and the server 200 may be referred to as cloud-side devices. Communication connection can be established between the plurality of clients 100 through the server 200, in the automatic question-answering scenario, the server 200 is used to provide an automatic question-answering service between the plurality of clients 100, and the plurality of clients 100 can respectively serve as a transmitting end or a receiving end, and communication is realized through the server 200.
The user may interact with the server 200 through the client 100 to receive data transmitted from other clients 100, or transmit data to other clients 100, etc. In the automatic question-answer scenario, the user may issue a data stream to the server 200 through the client 100, and the server 200 generates answer data according to the data stream and pushes the answer data to other clients establishing communication.
Wherein, the client 100 and the server 200 establish a connection through a network. The network provides a medium for a communication link between client 100 and server 200. The network may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The data transmitted by the client 100 may need to be encoded, transcoded, compressed, etc. before being distributed to the server 200.
The client 100 may be a browser, an APP (Application), a web application such as an H5 (HTML5, HyperText Markup Language 5th edition) application, a light application (also called an applet, a lightweight application) or a cloud application, etc. The client 100 may be developed and obtained based on a software development kit (SDK) of the corresponding service provided by the server 200, for example an SDK based on real-time communication (RTC). The client 100 may be deployed in an electronic device and run depending on the device or on some APP in the device. The electronic device may, for example, have a display screen and support information browsing, and may be a personal mobile terminal such as a mobile phone, a tablet computer or a personal computer. Various other types of applications are also commonly deployed in electronic devices, such as human-machine dialogue applications, model training applications, text processing applications, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients and social platform software.
The server 200 may include a server that provides various services, such as a server that provides communication services for multiple clients, a server for background training that provides support for a model used on a client, a server that processes data sent by a client, and so on. It should be noted that, the server 200 may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. The server may also be a server of a distributed system or a server that incorporates a blockchain. The server may also be a cloud server for cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN, content Delivery Network), basic cloud computing services such as big data and artificial intelligence platforms, or an intelligent cloud computing server or an intelligent cloud host with artificial intelligence technology.
It should be noted that, the automatic question-answering method provided in the embodiments of the present disclosure is generally executed by the server, but in other embodiments of the present disclosure, the client may also have a similar function to the server, so as to execute the automatic question-answering method provided in the embodiments of the present disclosure. In other embodiments, the automatic question answering method provided in the embodiments of the present disclosure may be performed by the client and the server together.
The machine learning model training method provided in the present specification will be further described with reference to fig. 7 by taking an application of the machine learning model training method in image-text feature extraction model training as an example. Fig. 7 is a flowchart of a processing procedure of a method for training a graphic feature extraction model according to an embodiment of the present disclosure, which specifically includes the following steps.
Step 702: a set of teletext data pairs is obtained comprising 100 pairs of teletext data.
Step 704: inputting the image data in the image-text data pair set into an image encoder of the CLIP to obtain image characteristic data of the image-text pair, and inputting the text data corresponding to the image data into a text encoder to obtain text characteristic data of the image-text pair.
Step 706: 100 image prototype vectors and 100 text prototype vectors are initialized.
Step 708: and calculating the similarity between the image characteristic data of the image-text pairs and the 100 image prototype vectors, obtaining the similarity of the images 30 before the similarity is sequentially arranged, and generating a 30-dimensional image similarity vector.
Step 710: and calculating the similarity between the text feature data of the image-text pairs and the 100 text prototype vectors, obtaining the similarity of the first 30 of the similarity sequence arrangement, and generating a 30-dimensional text similarity vector.
Step 712: and calculating a prototype vector graph-text pair loss value between the image similarity vector and the text similarity vector.
Step 714: and adjusting parameters of the 100 image prototype vectors and the 100 text prototype vectors according to the prototype vector graph-text pair loss value.
Step 716: and calculating the similarity between 100 image prototype vectors after adjustment and the similarity between 100 text prototype vectors, calculating self loss values according to the two similarities, and adjusting parameters of the 100 image prototype vectors and the 100 text prototype vectors again.
Step 718: an image data set of 105470 pieces of data and a text data set of 157816 pieces of data are acquired.
Step 720: and inputting each image in the image data set into an image encoder, acquiring image characteristic data corresponding to each image, and inputting each text in the text data set into a text encoder, and acquiring text characteristic data corresponding to each text.
Step 722: and calculating the similarity between each piece of image characteristic data and each piece of text characteristic data in the image characteristic data.
Step 724: and determining the text characteristic data with the similarity meeting the condition as target text characteristic data corresponding to the current image characteristic data.
Step 726: and calculating the similarity between the image characteristic data and 100 image prototype vectors after the adjustment, obtaining the similarity of 30 before the similarity is sequentially arranged, and generating a 30-dimensional image similarity vector.
Step 728: and calculating the similarity between the target text feature data and the 100 text prototype vectors after adjustment, obtaining the similarity of 30 before similarity sequence arrangement, and generating a 30-dimensional text similarity vector.
Step 730: and calculating a model loss value between the image similarity vector and the target text similarity vector.
Step 732: and adjusting parameters of the image encoder and the text encoder according to the model loss value.
By applying the scheme of the embodiments of this specification, each initialized reference feature vector is adjusted using a small number of image-text data pairs, so that the reference feature vectors carry the relationship between image features and text features in the image-text data pairs. When the model is trained, because the adjusted reference vectors carry the association between image features and text features, the model can understand the extracted image features and text features through reference feature vectors that comprehensively carry the image-text association, so the extracted image features and text features become more accurate. The feature extraction model is trained with an image set and a text set that have not been paired, which, on the basis of ensuring the accuracy of the image-text model trained for a special scene, reduces the cost of constructing project-specific data and thus the cost of model training.
Corresponding to the method embodiment, the present disclosure further provides an embodiment of a machine learning model training device, and fig. 8 shows a schematic structural diagram of the machine learning model training device provided in one embodiment of the present disclosure. As shown in fig. 8, the apparatus includes:
a data acquisition module 802 configured to acquire a set of image-text data pairs, a set of image data and a set of text data, and process the set of image-text data pairs through a feature extraction model to acquire a set of image-text data feature pairs;
a feature vector obtaining module 804, configured to obtain at least two reference feature vectors based on the set of image-text data feature pairs, where the reference feature vectors are feature vectors representing the correspondence between paired image features and text features in the set of image-text data feature pairs;
A model training module 806 configured to train the feature extraction model based on each reference feature vector, the image data set, and the text data set until a training stop condition of the feature extraction model is reached.
Optionally, the feature vector obtaining module 804 is further configured to:
Initializing at least two reference feature vectors based on a preset initialization rule;
and adjusting each reference feature vector according to the image-text data feature pair set until reaching the reference feature vector adjustment stop condition.
Optionally, the feature vector obtaining module 804 is further configured to:
Acquiring a target image-text characteristic data pair, wherein the target image-text characteristic data pair is any one of the image-text characteristic data pairs;
Determining the similarity of the image-text pair prototype vector between the target image-text characteristic data pair and each reference characteristic vector according to the target image-text characteristic data pair and each reference characteristic vector;
and adjusting each reference feature vector according to the similarity of each image-text to the prototype vector.
Optionally, the image-text feature data pair includes image-text pair image feature data and image-text pair text feature data corresponding to the image-text pair image feature data, and the reference feature vector includes an image prototype feature vector or a text prototype feature vector;
the feature vector acquisition module 804 is further configured to:
Acquiring target image-text pair image characteristic data and target image-text pair text characteristic data in the target image-text characteristic data pair;
calculating the similarity of the image prototype vector of the image-text pair between the image feature data of the target image-text pair and each image prototype feature vector;
and calculating the similarity of the text prototype vector of the image-text between the text feature data of the target image-text and each text prototype feature vector.
Optionally, the feature vector obtaining module 804 is further configured to:
determining image-text pair image similarity feature information and image-text pair text similarity feature information according to each image-text pair image prototype vector similarity and each image-text pair text prototype vector similarity;
and adjusting each reference feature vector according to the image-text pair image similarity feature information and the image-text pair text similarity feature information.
Optionally, the feature vector obtaining module 804 is further configured to:
Determining a reference feature vector image-text pair loss value according to the image-text pair image similarity feature information and the image-text pair text similarity feature information;
And adjusting each reference feature vector according to the reference feature vector graph-text pair loss value.
Optionally, the machine learning model training device further includes a reference feature vector self-loss value calculation module configured to:
Calculating the similarity of the reference feature vectors among the reference feature vectors;
determining a loss value of the reference feature vector according to the similarity of each reference feature vector;
and adjusting each reference feature vector according to the loss value of the reference feature vector.
Optionally, the model training module 806 is further configured to:
inputting the image data set and the text data set into the feature extraction model to obtain at least one first image feature data and at least one first text feature data output by the feature extraction model;
Determining at least one piece of second image characteristic data and second text characteristic data corresponding to each piece of second image characteristic data according to each piece of first image characteristic data and each piece of first text characteristic data;
calculating a model loss value according to each reference feature vector, each second image feature data and second text feature data corresponding to each second image feature data;
And adjusting parameters of the feature extraction model according to the model loss value.
Optionally, the reference feature vector comprises an image prototype feature vector or a text prototype feature vector;
the model training module 806 is further configured to:
Determining target second image feature data and target second text feature data, wherein the target second image feature data is any one of the second image feature data;
Obtaining at least one image prototype vector similarity according to the target second image feature data and each image prototype feature vector, wherein the image prototype vector similarity is the similarity between the target second image feature data and each image prototype feature vector;
obtaining at least one text prototype vector similarity according to the target second text feature data and each text prototype feature vector, wherein the text prototype vector similarity is the similarity between the target second text feature data and each text prototype feature vector;
And calculating a model loss value according to the similarity of each image prototype vector and the similarity of each text prototype vector.
Optionally, the model training module 806 is further configured to:
determining an image similarity vector corresponding to the target second image feature data based on the similarity of each image prototype vector;
determining a text similarity vector corresponding to the target second text feature data based on the similarity of each text prototype vector;
and calculating a model loss value according to the image similarity vector and the text similarity vector.
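For a single pseudo-pair, this model loss can be sketched as below, with the reference feature vectors held fixed and each modality's similarity vector serving as a soft target for the other; this is an assumed form, not necessarily the exact loss used:

```python
import torch
import torch.nn.functional as F

def model_loss(image_similarity_vector, text_similarity_vector, temperature=0.1):
    """Model loss from the image similarity vector and text similarity vector.

    Both inputs are (K,) similarities of the target second image / second text
    feature data to the K prototypes. Each side's softmax distribution is used
    as a (detached) soft target for the other side.
    """
    p_img = F.softmax(image_similarity_vector / temperature, dim=-1)
    p_txt = F.softmax(text_similarity_vector / temperature, dim=-1)
    return 0.5 * (-(p_txt.detach() * p_img.log()).sum()
                  - (p_img.detach() * p_txt.log()).sum())
```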
Optionally, the model training module 806 is further configured to:
Determining the image-text similarity between each first image feature data and each first text feature data according to each first image feature data and each first text feature data;
Determining at least one second image feature data based on the image-text similarity, and determining second text feature data corresponding to each second image feature data; or determining at least one piece of second text feature data based on the image-text similarity, and determining second image feature data corresponding to each piece of second text feature data.
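A minimal sketch of this pseudo-pairing step, assuming the image-text similarity is already available as a matrix; argmax matching is one plausible selection rule:

```python
import torch

def build_pseudo_pairs(first_image_feats, first_text_feats, sim_matrix):
    """Select second feature data from unpaired first features.

    sim_matrix: (N_img, N_txt) image-text similarity between all first image
    feature data and all first text feature data. Each image keeps its
    best-matching text; the text-to-image direction works symmetrically.
    """
    best_text = sim_matrix.argmax(dim=1)             # index of best text per image
    second_image_feats = first_image_feats           # every image kept as second image feature data
    second_text_feats = first_text_feats[best_text]  # its matched second text feature data
    return second_image_feats, second_text_feats
```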
Optionally, the model training module 806 is further configured to:
determining first image similarity feature data corresponding to each first image feature data according to each first image feature data;
determining first text similarity feature data corresponding to each first text feature data according to each first text feature data;
And determining the image-text similarity between each first image feature data and each first text feature data based on each first image similarity feature data and each first text similarity feature data.
Optionally, the model training module 806 is further configured to:
determining, from the first image feature data, target first image feature data and at least one reference first image feature data corresponding to the target first image feature data;
calculating an image set similarity between the target first image feature data and each reference first image feature data;
and determining first image similarity feature data corresponding to the target first image feature data based on each image set similarity.
Optionally, the model training module 806 is further configured to:
Determining, from the first text feature data, target first text feature data and at least one reference first text feature data corresponding to the target first text feature data;
calculating a text set similarity between the target first text feature data and each reference first text feature data;
and determining first text similarity feature data corresponding to the target first text feature data based on each text set similarity.
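The three preceding steps can be read as building an intra-modal similarity profile for each sample (its similarities to the other samples of the same modality) and then comparing profiles across modalities. A hedged sketch of the profile computation, which would serve as the first image similarity feature data and first text similarity feature data:

```python
import torch
import torch.nn.functional as F

def similarity_feature_data(first_feats):
    """Intra-modal similarity profile for every feature in a batch.

    first_feats: (N, d) first image feature data (or first text feature data).
    Row i of the result holds the cosine similarities of sample i to every
    reference sample of the same modality (here, the rest of the batch), i.e.
    the image set / text set similarities described above.
    """
    feats = F.normalize(first_feats, dim=-1)
    return feats @ feats.t()   # (N, N) similarity profiles
```

Under this reading, the image-text similarity of the earlier step could then be computed by comparing an image's profile with a text's profile, assuming both profiles are taken over reference sets of the same size so that they are directly comparable.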
Optionally, the model training module 806 is further configured to:
determining target second image feature data, wherein the target second image feature data is any one of the first image feature data;
and determining target second text feature data corresponding to the target second image feature data according to the image-text similarity.
Optionally, the model training module 806 is further configured to:
determining at least one intermediate text feature data from the first text feature data according to the image-text similarity;
And determining target second text feature data according to each intermediate text feature data.
Optionally, the model training module 806 is further configured to:
based on each intermediate text feature data, acquiring text data corresponding to each intermediate text feature data;
and acquiring target text data according to each text data, inputting the target text data into the feature extraction model, and acquiring target second text feature data output by the feature extraction model.
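A sketch of how the intermediate text feature data might be turned into target text data and re-encoded; `tokenizer` and `model.encode_text` are assumed interfaces, and merging the top-k captions is only one possible way to build the target text:

```python
import torch

def build_target_second_text_feature(sim_row, texts, model, tokenizer, k=3):
    """Target second text feature data for one image.

    sim_row: (N_txt,) image-text similarities of this image to all texts.
    texts:   list of the corresponding raw text data.
    The k most similar texts are treated as intermediate text data, merged
    into one target text, and passed back through the feature extraction model.
    """
    topk = torch.topk(sim_row, k=k).indices
    target_text = " ".join(texts[int(i)] for i in topk)  # merged target text data
    tokens = tokenizer(target_text)
    return model.encode_text(tokens)                     # target second text feature data
```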
The above is a schematic scheme of the feature extraction model training apparatus of this embodiment. It should be noted that the technical solution of the feature extraction model training apparatus and the technical solution of the feature extraction model training method belong to the same concept; for details of the technical solution of the apparatus that are not described here, reference may be made to the description of the technical solution of the feature extraction model training method.
By applying the scheme of this embodiment of the specification, the similarity between each image feature and the reference feature vectors, and the similarity between each text feature and the reference feature vectors, are calculated from the image-text feature data pairs, and the initialized reference feature vectors are then adjusted using these two kinds of similarity, so that the reference feature vectors obtained by the feature vector acquisition module carry the relationship between the image features and the text features in the image-text data pairs.
During model training, the text feature data corresponding to each image feature data is determined from the plurality of image feature data and the plurality of text feature data extracted by the model, so that image features and text features are first matched automatically. The similarity between the image feature data and the adjusted reference vectors, and the similarity between the text feature data and the adjusted reference vectors, are then calculated. Because the adjusted reference vectors carry the association between image features and text features, the model can understand the extracted image features and text features with reference to feature vectors that comprehensively encode this image-text association, which makes the extracted image features and text features more accurate. The model obtained by the model training module is therefore more accurate, and because only a small amount of paired data is needed for training, the cost of data acquisition by the data acquisition module is reduced.
Fig. 9 illustrates a block diagram of a computing device 900 provided in accordance with one embodiment of the present specification. The components of computing device 900 include, but are not limited to, a memory 910 and a processor 920. The processor 920 is connected to the memory 910 through a bus 930, and a database 950 is used to store data.
Computing device 900 also includes an access device 940, and the access device 940 enables computing device 900 to communicate via one or more networks 960. Examples of such networks include a public switched telephone network (PSTN, Public Switched Telephone Network), a local area network (LAN, Local Area Network), a wide area network (WAN, Wide Area Network), a personal area network (PAN, Personal Area Network), or a combination of communication networks such as the internet. The access device 940 may include one or more of any type of wired or wireless network interface, such as a network interface card (NIC, Network Interface Controller), for example an IEEE 802.11 wireless local area network (WLAN, Wireless Local Area Network) interface, a worldwide interoperability for microwave access (WiMAX, Worldwide Interoperability for Microwave Access) interface, an Ethernet interface, a universal serial bus (USB, Universal Serial Bus) interface, a cellular network interface, a Bluetooth interface, or a near field communication (NFC, Near Field Communication) interface.
In one embodiment of the present specification, the above-described components of computing device 900 and other components not shown in FIG. 9 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 9 is for exemplary purposes only and is not intended to limit the scope of the present specification. Those skilled in the art may add or replace other components as desired.
Computing device 900 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), a mobile phone (e.g., smart phone), a wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC, Personal Computer). Computing device 900 may also be a mobile or stationary server.
The processor 920 is configured to execute computer-executable instructions that when executed by the processor implement the above-described machine learning model training method, the machine learning model training method applied to the cloud-side device, the text-based image search method applied to the cloud-side device, the image analysis method applied to the cloud-side device, and the automatic question-answering method applied to the cloud-side device.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solutions of the above machine learning model training method, the machine learning model training method applied to the cloud-side device, the text-based image searching method applied to the cloud-side device, the image analysis method applied to the cloud-side device, and the automatic question-answering method applied to the cloud-side device belong to the same concept; for details of the technical solution of the computing device that are not described here, reference may be made to the descriptions of those methods.
An embodiment of the present disclosure further provides a computer-readable storage medium storing computer-executable instructions that when executed by a processor implement the above-described machine learning model training method, a machine learning model training method applied to a cloud-side device, a text-based image search method applied to a cloud-side device, an image analysis method applied to a cloud-side device, and an automatic question-answering method applied to a cloud-side device.
The above is an exemplary version of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solutions of the above machine learning model training method, the machine learning model training method applied to the cloud-side device, the text-based image searching method applied to the cloud-side device, the image analysis method applied to the cloud-side device, and the automatic question-answering method applied to the cloud-side device belong to the same concept; for details of the technical solution of the storage medium that are not described here, reference may be made to the descriptions of those methods.
An embodiment of the present disclosure further provides a computer program product, including computer programs/instructions, which when executed by a processor, implement the above-mentioned machine learning model training method, the machine learning model training method applied to a cloud-side device, the text-based image search method applied to the cloud-side device, the image analysis method applied to the cloud-side device, and the automatic question-answering method applied to the cloud-side device.
The above is an exemplary version of a computer program product of this embodiment. It should be noted that the technical solution of the computer program product and the technical solutions of the machine learning model training method, the machine learning model training method applied to the cloud-side device, the text-based image searching method applied to the cloud-side device, the image analysis method applied to the cloud-side device, and the automatic question-answering method applied to the cloud-side device belong to the same concept; for details of the technical solution of the computer program product that are not described here, reference may be made to the descriptions of those methods.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of action combinations, but those skilled in the art should understand that the embodiments are not limited by the order of actions described, as some steps may be performed in another order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily all required by the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims (21)

1. A machine learning model training method, comprising:
Acquiring an image-text data pair set, an image data set and a text data set, and processing the image-text data pair set through a feature extraction model to acquire an image-text data feature pair set;
Acquiring at least two reference feature vectors based on the image-text data feature pair set, wherein the reference feature vectors are feature vectors representing the corresponding relation between paired image features and text features in the image-text data feature pair set;
And training the feature extraction model according to each reference feature vector, the image data set and the text data set until the training stopping condition of the feature extraction model is reached.
2. The method of claim 1, obtaining at least two reference feature vectors based on the image-text data feature pair set, comprising:
Initializing at least two reference feature vectors based on a preset initialization rule;
and adjusting each reference feature vector according to the image-text data feature pair set until reaching the reference feature vector adjustment stop condition.
3. The method of claim 2, adjusting each reference feature vector according to the image-text data feature pair set, comprising:
Acquiring a target image-text feature data pair, wherein the target image-text feature data pair is any one of the image-text feature data pairs;
determining an image-text pair prototype vector similarity between the target image-text feature data pair and each reference feature vector according to the target image-text feature data pair and each reference feature vector;
and adjusting each reference feature vector according to each image-text pair prototype vector similarity.
4. The method of claim 3, the image-text feature data pair comprising image-text pair image feature data and image-text pair text feature data corresponding to the image-text pair image feature data, the reference feature vector comprising an image prototype feature vector or a text prototype feature vector;
determining the image-text pair prototype vector similarity between the target image-text feature data pair and each reference feature vector according to the target image-text feature data pair and each reference feature vector comprises:
acquiring target image-text pair image feature data and target image-text pair text feature data in the target image-text feature data pair;
calculating an image-text pair image prototype vector similarity between the target image-text pair image feature data and each image prototype feature vector;
and calculating an image-text pair text prototype vector similarity between the target image-text pair text feature data and each text prototype feature vector.
5. The method of claim 4, wherein adjusting each reference feature vector according to each image-text pair prototype vector similarity comprises:
determining image-text pair image similarity feature information and image-text pair text similarity feature information according to each image-text pair image prototype vector similarity and each image-text pair text prototype vector similarity;
and adjusting each reference feature vector according to the image-text pair image similarity feature information and the image-text pair text similarity feature information.
6. The method of claim 5, after adjusting each reference feature vector, the method further comprising:
Calculating the similarity of the reference feature vectors among the reference feature vectors;
determining a loss value of the reference feature vector according to the similarity of each reference feature vector;
and adjusting each reference feature vector according to the loss value of the reference feature vector.
7. The method of claim 1, training the feature extraction model from each reference feature vector, the image data set, and the text data set, comprising:
inputting the image data set and the text data set into the feature extraction model to obtain at least one first image feature data and at least one first text feature data output by the feature extraction model;
Determining at least one piece of second image characteristic data and second text characteristic data corresponding to each piece of second image characteristic data according to each piece of first image characteristic data and each piece of first text characteristic data;
calculating a model loss value according to each reference feature vector, each second image feature data and second text feature data corresponding to each second image feature data;
And adjusting parameters of the feature extraction model according to the model loss value.
8. The method of claim 7, the reference feature vector comprising an image prototype feature vector or a text prototype feature vector;
Calculating a model loss value according to each reference feature vector, each second image feature data and second text feature data corresponding to each second image feature data, including:
Determining target second image feature data and target second text feature data, wherein the target second image feature data is any one of the second image feature data;
Obtaining at least one image prototype vector similarity according to the target second image feature data and each image prototype feature vector, wherein the image prototype vector similarity is the similarity between the target second image feature data and each image prototype feature vector;
obtaining at least one text prototype vector similarity according to the target second text feature data and each text prototype feature vector, wherein the text prototype vector similarity is the similarity between the target second text feature data and each text prototype feature vector;
And calculating a model loss value according to the similarity of each image prototype vector and the similarity of each text prototype vector.
9. The method of claim 8, calculating a model loss value from each image prototype vector similarity and each text prototype vector similarity, comprising:
determining an image similarity vector corresponding to the target second image feature data based on the similarity of each image prototype vector;
determining a text similarity vector corresponding to the target second text feature data based on the similarity of each text prototype vector;
and calculating a model loss value according to the image similarity vector and the text similarity vector.
10. The method of claim 7, determining at least one second image feature data and second text feature data corresponding to each second image feature data from each first image feature data and each first text feature data, comprising:
Determining the image-text similarity between each first image feature data and each first text feature data according to each first image feature data and each first text feature data;
Determining at least one second image characteristic data based on the image-text similarity, and determining second text characteristic data corresponding to each second image characteristic data; or determining at least one piece of second text feature data based on the image-text similarity, and determining second image feature data corresponding to each piece of second text feature data.
11. The method of claim 10, determining the image-text similarity between each first image feature data and each first text feature data according to each first image feature data and each first text feature data, comprising:
determining first image similarity characteristic data corresponding to each first image characteristic data according to each first image characteristic data;
determining first text similarity feature data corresponding to each first text feature data according to each first text feature data;
And determining the image-text similarity between each first image feature data and each first text feature data based on each first image similarity feature data and each first text similarity feature data.
12. The method of claim 11, determining first image similarity feature data corresponding to each first image feature data from each first image feature data, comprising:
determining target first image feature data and at least one reference first image feature data corresponding to the target first image feature data in each first image feature data;
Calculating the similarity of the image set corresponding to the target first image feature data and each reference first image feature data;
and determining first image similarity characteristic data corresponding to the target first image characteristic data based on the similarity of each image set.
13. The method of claim 11, determining first text similarity feature data corresponding to each first text feature data from each first text feature data, comprising:
Determining target first text feature data and at least one reference first text feature data corresponding to the target first text feature data in each first text feature data;
calculating the similarity of the target first text feature data and the text set corresponding to each reference first text feature data;
And determining first text similarity feature data corresponding to the target first text feature data based on the similarity of each text set.
14. The method of claim 10, determining at least one second image feature data based on the image-text similarity, and determining second text feature data corresponding to each second image feature data, comprising:
determining target second image feature data, wherein the target second image feature data is any one of the first image feature data;
and determining target second text feature data corresponding to the target second image feature data according to the image-text similarity.
15. A machine learning model training method applied to cloud-side equipment, the method comprising:
Receiving an image-text data pair set, an image data set and a text data set sent by an end-side device, and processing the image-text data pair set through a feature extraction model to obtain an image-text data feature pair set;
Acquiring at least two reference feature vectors based on the image-text data feature pair set, wherein the reference feature vectors are feature vectors representing the corresponding relation between paired image features and text features in the image-text data feature pair set;
Training the feature extraction model according to each reference feature vector, the image data set and the text data set until a training stopping condition of the feature extraction model is reached;
and obtaining model parameters of the trained feature extraction model and returning the model parameters to the end-side device.
16. A text-based image search method applied to cloud-side equipment, the method comprising:
receiving an image search instruction sent by an end-side device, wherein the image search instruction carries a target search text;
inputting the target search text into a feature extraction model to obtain target search text feature information output by the feature extraction model, wherein the feature extraction model is trained by the method of any one of claims 1-14;
and determining an image search result corresponding to the image search instruction according to the target search text feature information, and returning the image search result to the end-side device.
17. An automatic question-answering method applied to cloud side equipment, the method comprising the following steps:
receiving question data sent by an end-side device, wherein the question data comprises at least one of question text data and question image data;
inputting the question data into a language processing model to obtain question data to be processed, and inputting the question data to be processed into a feature extraction model to obtain question feature data output by the feature extraction model, wherein the feature extraction model is trained by the method of any one of claims 1-14;
and generating answer data according to the question feature data, and returning the answer data to the end-side device.
18. The method of claim 17, after returning the answer data to the end-side device, the method further comprising:
receiving feedback data for the answer data sent by the end-side device;
and adjusting the feature extraction model according to the question data, the answer data and the feedback data.
19. A computing device, comprising:
A memory and a processor;
The memory is configured to store computer programs/instructions, and the processor is configured to execute the computer programs/instructions, which, when executed by the processor, implement the steps of the method of any one of claims 1-18.
20. A computer readable storage medium storing a computer program/instruction which, when executed by a processor, implements the steps of the method of any of claims 1-18.
21. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any of claims 1-18.