CN117671678A - Image labeling method and device

Image labeling method and device

Info

Publication number
CN117671678A
Authority
CN
China
Prior art keywords
image
training
feature
language
target image
Prior art date
Legal status
Pending
Application number
CN202211042115.8A
Other languages
Chinese (zh)
Inventor
窦昊
王艳
周云鹏
张安发
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202211042115.8A
Priority to PCT/CN2023/089419 (WO2024045641A1)
Publication of CN117671678A

Classifications

    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; coarse-fine approaches, e.g. multi-scale approaches; context analysis; selection of dictionaries
    • G06V10/757 Matching configurations of points or features
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G06V10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image labeling method and device, belonging to the technical field of computer vision. After a computer device acquires a training image set corresponding to a target image task and a language description text set corresponding to the training image set, it calls a target image annotation model to determine a prediction label for each training sample image. The prediction label of a training sample image is obtained by the target image annotation model based on the feature matching result between the image features corresponding to the training sample image and the language features corresponding to each language description text in the language description text set. The target image annotation model is then trained according to the errors between the real labels and prediction labels of the training sample images in the training image set until the model converges. By converting the image task into a task of matching image features with language features, the method greatly improves the initial labeling performance of the image annotation model and thereby improves image labeling efficiency.

Description

Image labeling method and device
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a method and apparatus for labeling images.
Background
In the development of image-oriented artificial intelligence (AI) applications, human labor is required to label the training images one by one. Because a successful AI model requires thousands or even millions of accurately labeled training images, the image labeling task is often time consuming and costly.
Intelligent image annotation is one of the most practical techniques for developing image-oriented AI applications. Starting from a small number of labeled images, it uses an AI algorithm to rapidly and automatically label the remaining images to be annotated. With intelligent image annotation, a user can save a large amount of image labeling cost. When this technology is adopted, how to improve image labeling efficiency is a problem that still needs to be solved.
Disclosure of Invention
The application provides an image labeling method and device, which can improve image labeling efficiency.
In a first aspect, an image annotation method is provided. The method is performed by a computer device and comprises the following steps. First, a training image set corresponding to a target image task and a language description text set corresponding to the training image set are acquired. The training image set includes a plurality of training sample images, each labeled with a real label. The language description text set includes a plurality of language description texts that correspond one-to-one to the multiple classes of labels of the training image set; each language description text describes the semantics of one class of labels. Then, a target image annotation model corresponding to the target image task is called to determine a prediction label for each training sample image in the training image set, where the prediction label of a training sample image is obtained by the target image annotation model based on the feature matching result between the image features corresponding to the training sample image and the language features corresponding to each language description text in the language description text set. Finally, the target image annotation model is trained according to the errors between the real labels and prediction labels of the plurality of training sample images in the training image set until the target image annotation model converges. The target image annotation model is used to determine the annotation labels of images to be annotated under the target image task.
The image features corresponding to a training sample image are obtained by performing feature extraction on that image, and the language features corresponding to a language description text are obtained by performing feature extraction on that text. Model convergence may mean that the loss value of the model is smaller than a preset threshold, that the change in model weights between two adjacent training iterations is smaller than a preset threshold, or that the number of training iterations reaches a preset number.
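The first aspect amounts to a standard supervised training loop in which label prediction is replaced by image-to-language feature matching. A minimal PyTorch-style sketch is given below; the function name, tensor shapes, and the use of cross-entropy are assumptions made for illustration only, not a definitive implementation of the claimed method.

```python
# Minimal sketch only; names, shapes and the optimizer are assumptions, not this application's implementation.
import torch
import torch.nn.functional as F

def train_until_convergence(model, images, real_labels, language_features,
                            optimizer, loss_threshold=0.01, max_iters=1000):
    """Train the target image annotation model until the loss falls below a preset
    threshold (one of the convergence criteria mentioned above)."""
    for _ in range(max_iters):
        image_features = model(images)                      # [B, D] image features per sample
        # Feature matching result: similarity of each image feature to each language feature.
        logits = image_features @ language_features.t()     # [B, M]
        loss = F.cross_entropy(logits, real_labels)         # error between real and prediction labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < loss_threshold:                    # convergence check
            break
    return model
```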
In this application, the computer device introduces language prior knowledge associated with the labeling task for the training sample images by acquiring the language description text set corresponding to the training image set, and transforms the image task into a task of matching image features with language features inside the image annotation model. This greatly improves the initial labeling performance of the image annotation model, raises the first-round labeling accuracy when the number of initial training sample images is small, and effectively reduces the number of training rounds, thereby improving image labeling efficiency.
Optionally, after the target image annotation model converges, the computer device may also invoke the target image annotation model to determine prediction labels for a plurality of verification sample images in a verification image set. The verification image set includes a plurality of verification sample images, each labeled with a real label. The computer device determines the labeling accuracy of the target image annotation model according to the real labels and prediction labels of the verification sample images. When the labeling accuracy does not reach a preset threshold, the computer device executes one or more model training processes until the labeling accuracy reaches the preset threshold. Each model training process includes: calling the target image annotation model to determine prediction labels of a plurality of images to be annotated and the confidence of each prediction label; obtaining, from the images to be annotated, hard-to-label images whose prediction-label confidence is lower than a confidence threshold; outputting the hard-to-label images and their prediction labels for manual correction; in response to receiving manual annotation results for the hard-to-label images, adding them as new training sample images to the training image set to obtain an updated training image set; calling the target image annotation model to determine a prediction label for each training sample image in the updated training image set, where the prediction label is obtained based on the feature matching results between the image features of the training sample image and the language features of each language description text in the language description text set corresponding to the updated training image set; and training the target image annotation model according to the errors between the real labels and prediction labels of the training sample images in the updated training image set until the model converges again.
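This optional procedure is essentially a confidence-based active-learning loop. The outline below is a hypothetical sketch; the helper callables (evaluate_accuracy, predict_with_confidence, request_manual_labels, train_until_convergence) are placeholders supplied by the caller, not interfaces defined by this application.

```python
def annotate_with_hard_sample_mining(model, train_set, val_set, unlabeled_images,
                                     evaluate_accuracy, predict_with_confidence,
                                     request_manual_labels, train_until_convergence,
                                     accuracy_threshold=0.95, confidence_threshold=0.6):
    """Repeat model training with manually corrected hard-to-label images until the
    labeling accuracy on the verification image set reaches the preset threshold."""
    while evaluate_accuracy(model, val_set) < accuracy_threshold:
        # Predict labels and confidences for the images to be annotated.
        predictions = [(img, *predict_with_confidence(model, img)) for img in unlabeled_images]
        # Select hard-to-label images whose prediction confidence is below the threshold.
        hard = [(img, label) for img, label, conf in predictions if conf < confidence_threshold]
        # Output hard samples with their prediction labels for manual correction.
        corrected = request_manual_labels(hard)
        # Add the corrected samples to the training image set and retrain until convergence.
        train_set.extend(corrected)
        model = train_until_convergence(model, train_set)
    return model
```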
Optionally, when the labeling accuracy of the target image annotation model reaches the preset threshold, the computer device uses the prediction label determined for an image to be annotated by calling the target image annotation model as the annotation label of that image.
Optionally, the target image annotation model includes a first feature extraction layer, a second feature extraction layer, and a feature matching layer. The first feature extraction layer includes an image feature output end and a language feature output end. The feature matching layer includes an image feature input end and a language feature input end. The image feature output end is connected to the input end of the second feature extraction layer, the language feature output end is connected to the language feature input end, and the output end of the second feature extraction layer is connected to the image feature input end. The process in which the computer device calls the target image annotation model corresponding to the target image task and determines the prediction label of each training sample image in the training image set includes the following steps. The computer device performs feature extraction on each language description text in the language description text set through the first feature extraction layer to obtain a language feature set; the language feature set includes a plurality of groups of language features, and each group of language features corresponds to one language description text in the language description text set. For each training sample image in the training image set, the computer device performs feature extraction on the training sample image through the first feature extraction layer to obtain global image features corresponding to the training sample image, and performs feature extraction on the global image features through the second feature extraction layer to obtain target image features associated with the target image task. The feature matching layer then matches the target image features against each group of language features in the language feature set to obtain a feature matching result, which includes the feature similarity between the target image features and each group of language features. Finally, the label described by the target language description text is used as the prediction label of the training sample image, where the target language description text is the language description text corresponding to the group of language features in the language feature set that has the highest feature similarity with the target image features.
Optionally, the target image annotation model further includes a supervision module in which a loss function matched with the target image task is set. The input end of the supervision module is connected to the output end of the feature matching layer. The computer device trains the target image annotation model according to the errors between the real labels and prediction labels of the plurality of training sample images in the training image set as follows: the supervision module calculates a loss value of the loss function based on the real labels and prediction labels of the training sample images, and gradient information of the loss function is back-propagated to the second feature extraction layer to adjust the network parameters from the second feature extraction layer to the feature matching layer.
Optionally, the image annotation model framework is pre-stored in the computer device. The image annotation model framework comprises a first feature extraction layer, a downstream task model head and a feature matching layer. The image feature output end of the first feature extraction layer is connected with the input end of the downstream task model head. The output end of the downstream task model head is connected with the image feature input end of the feature matching layer. The downstream task model head includes a plurality of feature extraction layers in one-to-one correspondence with a plurality of image tasks. The second feature extraction layer is a feature extraction layer corresponding to the target image task in the downstream task model head. Wherein the image feature output of the first feature extraction layer is configured to connect one feature extraction layer at a time in the downstream task model head.
In the application, the downstream task model heads are designed in the image annotation model frame, so that the image annotation model frame can adapt to downstream requirements of diversity, and a corresponding image annotation model can be constructed by only selecting different feature extraction layers in the downstream task model heads. Because the image annotation model framework can be shared by a plurality of image tasks, a corresponding image annotation model is not required to be designed for each image task, unified standardized management of intelligent image annotation is realized, the complexity of development and maintenance can be reduced, and the technical cost is reduced.
Optionally, the first feature extraction layer is implemented by a visual language pre-training model obtained by pre-training. The visual language pre-training model can perform feature extraction on an input image to obtain image features, and can also perform feature extraction on an input language description text to obtain language features.
Optionally, the implementation manner of obtaining the language description text set corresponding to the training image set by the computer device includes: in response to receiving a launch instruction for a target image task, the computer device displays a template setting prompt for prompting a user to set a language description template corresponding to the target image task. For each type of label corresponding to the training image set, the computer equipment generates a language description text according to the set language description template and the label.
According to the method and the device, the man-machine interaction interface is provided, so that a user can manually set the language description template corresponding to the image task, the accuracy of semantic expression of the language description text on the label can be improved, the accuracy of the language features obtained by extracting the language description text is further improved, and the accuracy of the image annotation model is improved.
Alternatively, the computer device may display the language description text after generating the language description text.
In this application, by displaying the language description text, the computer device allows the user to check whether the generated text accurately expresses the meaning of the label and, if necessary, to adjust the language description template.
Optionally, the computer device may further display a plurality of image tasks, the target image task being one of the plurality of image tasks. In response to detecting a selection operation of the target image task, the computer device determines that a launch instruction for the target image task is received.
Optionally, the plurality of image tasks includes, but is not limited to, one or more of image classification, object detection, or motion recognition.
In a second aspect, an image annotation device is provided. The apparatus comprises a plurality of functional modules that interact to implement the method of the first aspect and embodiments thereof described above. The plurality of functional modules may be implemented based on software, hardware, or a combination of software and hardware, and the plurality of functional modules may be arbitrarily combined or divided based on the specific implementation.
In a third aspect, there is provided a computer device comprising: a processor and a memory;
the memory is used for storing a computer program, and the computer program comprises program instructions;
the processor is configured to invoke the computer program to implement the method in the first aspect and embodiments thereof.
In a fourth aspect, a computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the method of the first aspect and embodiments thereof described above.
In a fifth aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of the first aspect and embodiments thereof described above.
In a sixth aspect, a chip is provided, the chip comprising programmable logic circuits and/or program instructions, which when the chip is run, implement the method of the first aspect and embodiments thereof described above.
Drawings
FIG. 1 is a schematic diagram of an image annotation model framework provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of an image labeling method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a display interface according to an embodiment of the present disclosure;
FIG. 4 is a schematic view of another display interface provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a labeling model for a target image according to an embodiment of the present application;
fig. 6 is a schematic architecture diagram related to an image labeling method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an image labeling device according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another image labeling apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic hardware structure of an image labeling device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Intelligent image annotation is a technology for automatically annotating unlabeled images with an AI algorithm based on a small number of annotated images. By means of intelligent image annotation, the user can reduce image data annotation cost by 50%-90%. Currently, the basic flow of intelligent image annotation comprises the following five steps.
Step 1: train an AI model corresponding to the image task based on a training image set, where the training image set includes a plurality of training sample images labeled with real labels.
Step 2: run inference on the images to be annotated with the AI model to obtain prediction labels (also called pseudo labels).
Step 3: screen valuable samples and their prediction labels from the inferred images according to a confidence strategy, for manual correction of the labels.
Step 4: add the manually corrected images to the training image set as new training sample images, and re-optimize the AI model based on the updated training image set.
Step 5: repeat the iterative optimization of steps 1 to 4 until a high-accuracy AI model is obtained, then use the AI model to infer all images to be annotated and output their prediction labels; the finally output prediction labels are taken as the final annotation labels of the images to be annotated, yielding the intelligent image annotation result.
However, in existing intelligent image labeling technology, when the number of initial training sample images (images labeled with real labels) is small, the accuracy of the first round of model training is low and the confidence of the prediction labels obtained from AI model inference is unreliable. As a result, valuable samples are hard to find in the confidence-based image screening process (step 3), the numbers of model training rounds and manual correction rounds become large, and the overall image labeling efficiency is low.
Based on the above, an embodiment of the present application provides an image labeling method. First, a computer device obtains a training image set corresponding to an image task and a language description text set corresponding to the training image set. The training image set includes a plurality of training sample images labeled with real labels. The language description text set includes a plurality of language description texts that correspond one-to-one to the multiple classes of labels of the training image set; that is, the number of language description texts equals the number of label classes of the training sample images, and each language description text describes the semantics of one class of labels. Then, the computer device calls the image annotation model corresponding to the image task and determines a prediction label for each training sample image in the training image set; the prediction label is obtained by the image annotation model based on the feature matching result between the image features of the training sample image and the language features of each language description text in the language description text set. Finally, the computer device trains the image annotation model according to the errors between the real labels and prediction labels of the training sample images until the image annotation model converges. The trained image annotation model whose labeling accuracy reaches a preset threshold is then used to determine the annotation labels of the images to be annotated under the image task. By acquiring a language description text set corresponding to the training image set, this embodiment introduces language prior knowledge associated with the labeling task for the training sample images and converts the image task into a task of matching image features with language features inside the image annotation model. This greatly improves the initial labeling performance of the image annotation model, raises the first-round labeling accuracy when the number of initial training sample images is small, and effectively reduces the number of training rounds, thereby improving image labeling efficiency.
In addition, image tasks are diverse in image-oriented AI application development. For example, categories of image tasks include, but are not limited to, image classification, target detection, and action recognition. At present, a separate AI model needs to be designed for each image task in order to train an image annotation model that can automatically annotate images under that task. Designing an AI model for every image task leads to high development and maintenance costs.
Optionally, an embodiment of the present application provides an image annotation model framework suitable for multiple image tasks. The image annotation model framework comprises a first feature extraction layer, a downstream task model head and a feature matching layer. For example, fig. 1 is a schematic diagram of an image labeling model framework provided in an embodiment of the present application. As shown in fig. 1, the first feature extraction layer includes an image feature output terminal m1 and a language feature output terminal m2. The feature matching layer includes an image feature input n1 and a language feature input n2. The image feature output terminal m1 of the first feature extraction layer is connected with the input terminal of the downstream task model head. The language feature output end m2 of the first feature extraction layer is connected with the language feature input end n2 of the feature matching layer. The output end of the downstream task model head is connected with the image feature input end n1 of the feature matching layer. The image tasks corresponding to the downstream task model head comprise image classification, target detection and action recognition.
The first feature extraction layer is used for extracting features of the input language description text to obtain language features corresponding to the language description text. The first feature extraction layer is also used for carrying out feature extraction on the input image to obtain the global image feature corresponding to the image. Because the global image features obtained by the first feature extraction layer for extracting the features of the image can reflect the global features of the whole image, different image tasks can share the first feature extraction layer. Optionally, the first feature extraction layer is implemented by a visual language pre-training model obtained by pre-training. The visual language pre-training model is a model obtained by training based on large-scale visual image data and corresponding language descriptions, and the training mode of the visual language pre-training model is not repeated here in the embodiment of the application.
The downstream task model head includes a plurality of feature extraction layers in one-to-one correspondence with a plurality of image tasks. Each feature extraction layer in the downstream task model head is used for extracting the image features under the corresponding image task respectively, namely, the image features extracted by the feature extraction layer in the downstream task model head are associated with the corresponding image task. For example, a feature extraction layer corresponding to an image classification task in the downstream task model head is used to extract features of an image region containing the classification object. The feature extraction layer corresponding to the target detection task in the downstream task model head is used for extracting the image region features containing the detection target and the position features of the image region containing the detection target. And the feature extraction layer corresponding to the action recognition task in the downstream task model head is used for extracting the image region features related to the object to be recognized. The image feature output of the first feature extraction layer is configured to connect one feature extraction layer at a time in the downstream task model head. When the image annotation model framework is used, a corresponding feature extraction layer can be selected in a downstream task model head according to an image task to be executed to construct an image annotation model corresponding to the image task, then when the constructed image annotation model is used, the selected feature extraction layer further performs feature extraction on the global image features output by the first feature extraction layer, and finally the extracted image features are output to the feature matching layer.
The feature matching layer is used for carrying out feature matching on the input image features and language features to obtain feature similarity between the image features and the language features. Under different image tasks, the functions of the feature matching layers are the same, so that the feature matching layers can be uniformly constructed for a plurality of image tasks, and then in the model training process, the network parameters of the feature matching layers can be automatically adjusted according to the actual image tasks.
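The sketch below illustrates, under assumed names, how such a framework could be organised in PyTorch: one shared backbone standing in for the visual language pre-training model, one downstream head per image task, and a matching step playing the role of the feature matching layer (simplified here to a parameter-free similarity computation).

```python
import torch
import torch.nn as nn

class ImageAnnotationFramework(nn.Module):
    """Illustrative framework: shared first feature extraction layer + per-task heads + feature matching."""

    def __init__(self, backbone, feature_dim=512):
        super().__init__()
        # First feature extraction layer: an assumed vision-language backbone exposing
        # encode_image / encode_text (CLIP-style interface, used here only as a stand-in).
        self.backbone = backbone
        # Downstream task model head: one feature extraction layer per image task.
        self.task_heads = nn.ModuleDict({
            "classification": nn.Linear(feature_dim, feature_dim),
            "detection": nn.Linear(feature_dim, feature_dim),
            "action_recognition": nn.Linear(feature_dim, feature_dim),
        })

    def forward(self, images, texts, task):
        global_image_features = self.backbone.encode_image(images)          # image feature output m1
        language_features = self.backbone.encode_text(texts)                # language feature output m2
        task_image_features = self.task_heads[task](global_image_features)  # second feature extraction layer
        # Feature matching layer: similarity between task image features and each language feature.
        logits = task_image_features @ language_features.t()
        return logits
```

Building the image annotation model for a given task then amounts to selecting the corresponding head, e.g. calling forward(images, texts, task="classification").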
In the embodiment of the application, the downstream task model heads are designed in the image annotation model frame, so that the image annotation model frame can adapt to downstream requirements of diversity, and the corresponding image annotation model can be constructed by only selecting different feature extraction layers in the downstream task model heads. Because the image annotation model framework can be shared by a plurality of image tasks, a corresponding image annotation model is not required to be designed for each image task, unified standardized management of intelligent image annotation is realized, the complexity of development and maintenance can be reduced, and the technical cost is reduced.
The technical solutions provided by this application are described in detail below in terms of application scenarios, method flows, software apparatuses, hardware apparatuses, and the like.
The application scenario of the embodiment of the present application is illustrated below.
The image labeling method provided by the embodiments of the present application is performed by a computer device. The computer device may be a server, a server cluster comprising a plurality of servers, or a cloud computing center. For example, the image labeling method can be deployed on a cloud computing development platform as a web service. Alternatively, the image labeling method can be applied on a server, with the user performing functional interaction through a software user interface (UI).
The following is an example of a method flow of an embodiment of the present application.
For example, fig. 2 is a schematic flow chart of an image labeling method according to an embodiment of the present application. As shown in fig. 2, the method includes:
Step 201, a computer device obtains a training image set corresponding to a target image task and a language description text set corresponding to the training image set.
The training image set comprises a plurality of training sample images, and each training sample image in the plurality of training sample images is marked with a real label. The actual labels of the training sample images may be manually labeled. The language description text set comprises a plurality of language description texts, and the plurality of language description texts are in one-to-one correspondence with the multi-class labels corresponding to the training image set, namely, the number of the language description texts in the language description text set is the same as the number of the label categories of the training sample images in the training image set. Each language description text is used to describe the semantics of one type of tag in the multiple types of tags.
Alternatively, the target image task may be image classification, target detection, or motion recognition.
For example, the target image task is image classification, specifically classifying animal pictures. The training image set includes two types of training sample images: the real label of one type is dog and the real label of the other type is cat; that is, the training image set corresponds to two classes of labels. The language description text set corresponding to the training image set includes two language description texts, one describing that the image contains a dog and the other describing that the image contains a cat. For example, the language description text may follow formats such as "a photo of { }", "a xx (xx is an adjective) { }", or "this is a { }", where "{ }" is a placeholder filled with the real label of a training sample image.
For another example, the target image task is target detection, specifically detecting left-over garbage, i.e., detecting garbage and the pedestrians carrying it. The training image set includes two types of training sample images: one type contains only garbage and its real label is garbage; the other type contains pedestrians carrying garbage and its real labels include both garbage and pedestrian. That is, the training image set corresponds to two classes of labels. The language description text set corresponding to the training image set includes two language description texts, one describing that the image contains garbage and the other describing that the image contains a pedestrian. For example, the two language description texts may be "there are bags of {garbage}" and "a {person} on the road", respectively. The language description text under the target detection task may follow formats such as "detect: { }", "there is { } on the xx (xx is a noun)", or "{ }, which is xx (xx is an adjective)", where "{ }" is a placeholder filled with a real label of the training sample image. It should be noted that, besides the real label, a training sample image under the target detection task may also be labeled with a real frame (ground truth, GT) position, which indicates the region of the detection target in the image. The real frame is typically a rectangular frame.
For another example, the target image task is action recognition, specifically recognizing a person's decontamination actions. The training image set includes three types of training sample images whose real labels are foot washing, hand washing, and disinfection, respectively; that is, the training image set corresponds to three classes of labels. The language description text set corresponding to the training image set includes three language description texts: one describes that the person in the image is washing their feet, one describes that the person is washing their hands, and one describes that the person is disinfecting. For example, the language description text may follow formats such as "the man is { }", "the human action of { }", or "this is a frame of { } action", where "{ }" is a placeholder filled with the real label of the training sample image.
Optionally, one implementation of obtaining a language description text set corresponding to the training image set by the computer device includes the following steps 2011 to 2012.
In step 2011, in response to receiving the start instruction for the target image task, the computer device displays a template setting prompt for prompting a user to set a language description template corresponding to the target image task.
Optionally, the template setting prompt may include one or more candidate language description templates under the target image task, and/or a custom control that lets the user enter a language description template. For example, the target image task is image classification, and fig. 3 is a schematic diagram of a display interface provided in an embodiment of the present application; the display interface is a template setting interface. As shown in fig. 3, display interface A includes a language description template option A1 and a custom control A2 corresponding to the target image task. Language description template option A1 includes two language description templates, "a photo of { }" and "this is a { }". Custom control A2 includes an input box, an add option, and a confirm option. When the computer device detects a selection operation on the add option, it takes the content entered in the input box as one language description template corresponding to the target image task and re-enables the input box so that the user can add another template. When the computer device detects a selection operation on the confirm option, it takes the content entered in the input box as a language description template corresponding to the target image task and ends the template customization flow. When setting the language description template, the user may select a template provided by the computer device, define a template themselves, or do both; that is, the language description templates finally determined by the user for an image task may include templates provided by the computer device for that task and/or templates defined by the user.
Optionally, after the computer device obtains the language description template set by the user, it can fine-tune the template according to the specific image task currently being executed, so that the template better matches the current task and the subsequently generated language description text describes the labels more accurately. For example, the computer device currently performs an image task of identifying the type of flower in a picture, and the language description template set by the user is "a photo of { }"; the computer device may fine-tune this template into "a flower photo of { }" to express more precisely that the task is flower classification.
Optionally, before the computer device displays the template setting prompt, the computer device displays a plurality of image tasks, the target image task being one of the plurality of image tasks. In response to detecting a selection operation of the target image task, the computer device determines that a launch instruction for the target image task is received. For example, fig. 4 is a schematic diagram of another display interface provided in an embodiment of the present application. The display interface is an image task starting interface. As shown in fig. 4, the display interface B includes an image task option including three image tasks, which are image classification, object detection, and action recognition, respectively. For example, the target image task is image classification, and when the computer device detects a selection operation of image classification through the display interface B, the computer device may display the display interface a as shown in fig. 3.
In step 2012, for each type of tag corresponding to the training image set, the computer device generates a language description text according to the set language description template and the tag.
For example, the target image task is image classification, the training image set includes two types of training sample images (the real label of one type is dog and the real label of the other type is cat), and the language description template set by the user is "a photo of { }". The computer device then generates two language description texts, namely "a photo of {dog}" and "a photo of {cat}".
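Generating the language description text set from a template and the label classes is a simple string substitution, as in the hypothetical helper below (the "{ }" slot notation follows the examples above; the function name is an assumption).

```python
def build_language_description_texts(template: str, labels: list[str]) -> list[str]:
    """Fill the language description template once per label class.

    Example: build_language_description_texts("a photo of {}", ["dog", "cat"])
    returns ["a photo of dog", "a photo of cat"].
    """
    return [template.format(label) for label in labels]
```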
According to the embodiment of the application, the man-machine interaction interface is provided, so that a user can manually set the language description template corresponding to the image task, the accuracy of semantic expression of the language description text on the label can be improved, the accuracy of the language features obtained by extracting the language description text is further improved, and the accuracy of the image annotation model is improved.
Optionally, the computer device may also display the language description text after generating the language description text. The computer device allows the user to check whether the generated language description text can accurately express the meaning of the label by displaying the language description text so as to adjust the language description template.
Alternatively, when the user knows all the labels corresponding to the training image set, the user can write a language description text for each label directly, thereby obtaining the language description text set corresponding to the training image set.
Step 202, the computer equipment calls a target image annotation model corresponding to the target image task, and determines a prediction label of each training sample image in the training image set.
The predictive label of the training sample image is obtained by a target image annotation model based on the feature matching result of the image feature corresponding to the training sample image and the language feature corresponding to each language description text in the language description text set.
Optionally, fig. 5 is a schematic structural diagram of a target image labeling model provided in an embodiment of the present application. As shown in fig. 5, the target image annotation model includes a first feature extraction layer, a second feature extraction layer, and a feature matching layer. The first feature extraction layer includes an image feature output terminal m1 and a language feature output terminal m2. The feature matching layer includes an image feature input n1 and a language feature input n2. The image feature output m1 of the first feature extraction layer is connected to the input of the second feature extraction layer. The language feature output end m2 of the first feature extraction layer is connected with the language feature input end n2 of the feature matching layer. The output end of the second feature extraction layer is connected with the image feature input end n1 of the feature matching layer.
Optionally, an image annotation model framework is pre-stored in the computer device, which may be shown in fig. 1, for example. The second feature extraction layer is a feature extraction layer corresponding to the target image task in the downstream task model head. Optionally, in response to receiving a start instruction for the target image task, the computer device selects a feature extraction layer corresponding to the target image task in the downstream task model head, and builds a target image annotation model.
Optionally, in conjunction with the target image annotation model shown in fig. 5, the implementation of step 202 described above may include steps 2021 to 2022 below.
In step 2021, the computer device performs feature extraction on each language description text in the set of language description texts through the first feature extraction layer, to obtain a set of language features.
The language feature set comprises a plurality of groups of language features, and each group of language features corresponds to one language description text in the language description text set. Optionally, the language description text set includes M language description texts, the language features corresponding to the m-th language description text are denoted L_m, and the language feature set may be expressed as L = {L_1, L_2, …, L_m, …, L_M}, where M is an integer greater than 1 and 1 ≤ m ≤ M. The first feature extraction layer performs feature extraction on each language description text in the input language description text set and then outputs the obtained language feature set to the feature matching layer.
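Assuming a CLIP-style text encoder as the first feature extraction layer (an assumption; the application only requires some visual language pre-training model), step 2021 could be sketched as follows. Normalising the features makes the later matching a cosine similarity.

```python
import torch

def extract_language_feature_set(backbone, tokenizer, description_texts):
    """Encode each language description text into one group of language features L_1 ... L_M."""
    with torch.no_grad():
        tokens = tokenizer(description_texts)                 # assumed tokenizer for the text encoder
        language_features = backbone.encode_text(tokens)      # [M, D], one row per description text
        language_features = language_features / language_features.norm(dim=-1, keepdim=True)
    return language_features
```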
In step 2022, for each training sample image in the training image set, the computer device performs a label prediction process separately, resulting in a predicted label for each training sample image.
The label prediction process includes the following steps S1 to S4.
In step S1, the computer device performs feature extraction on the training sample image through the first feature extraction layer, so as to obtain a global image feature corresponding to the training sample image.
Optionally, the first feature extraction layer is implemented by a visual language pre-training model obtained by pre-training. The global image features are used to reflect global features of the whole image, which are independent of image tasks. And after the first feature extraction layer performs feature extraction on the input training sample image, outputting the obtained global image features to the second feature extraction layer.
In step S2, the computer device performs feature extraction on the global image feature through the second feature extraction layer, so as to obtain a target image feature, where the target image feature is associated with a target image task.
Optionally, the second feature extraction layer performs feature extraction on the global image features using a feature extraction algorithm matched with the target image task; it mainly extracts feature sets with strong category relevance from the global image features. For example, if the target image task is image classification or motion recognition, the second feature extraction layer extracts features of the image region containing the classified object or the object to be recognized. If the target image task is target detection, the second feature extraction layer extracts features of the image regions containing detection targets and the position features of those regions; the position features can be expressed by prediction frame positions, and the prediction frame is typically a rectangular frame. After performing feature extraction on the input global image features, the second feature extraction layer outputs the obtained target image features to the feature matching layer.
In step S3, the computer device performs feature matching on the target image feature and each group of language features in the language feature set through the feature matching layer, so as to obtain a feature matching result, where the feature matching result includes feature similarity between the target image feature and each group of language features in the language feature set.
Alternatively, the feature similarity between the image feature and the language feature may be cosine similarity or euclidean similarity, or the like.
Optionally, the target image task is image classification or motion recognition. For the i-th training sample image in the training image set (i is a positive integer), the image feature output by the second feature extraction layer is denoted F. The language feature set is L = {L_1, L_2, …, L_m, …, L_M}.
The feature matching layer may express the feature matching result obtained from the input language feature set and the target image feature as Logits = func_similarity(F, L), where func_similarity is a function that calculates the feature similarity between the image feature and each group of language features in the language feature set.
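For instance, func_similarity can be instantiated as a cosine similarity between the target image feature F and every group of language features in L, with the prediction label taken as the best-matching class (step S4 below). This is only one possible choice, sketched here under assumed tensor shapes.

```python
import torch
import torch.nn.functional as F

def match_and_predict(target_image_feature, language_feature_set, label_names):
    """target_image_feature: [D]; language_feature_set: [M, D]; returns (Logits, prediction label)."""
    logits = F.cosine_similarity(target_image_feature.unsqueeze(0),
                                 language_feature_set, dim=-1)      # [M] feature similarities
    best = int(torch.argmax(logits))
    return logits, label_names[best]
```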
Optionally, the target image task is target detection. For a training sample image in the training image set, the second feature extraction layer may extract N region features, which may be expressed as O_F = {O_1, O_2, …, O_n, …, O_N}, where N is an integer greater than 1 and 1 ≤ n ≤ N. The language feature set is L = {L_1, L_2, …, L_m, …, L_M}. The feature matching layer may express the feature matching result obtained from the input language feature set and the target image features as Logits = O_F · L^T, where L^T is the transpose of L. The feature matching result includes the feature similarity between each region feature and each group of language features.
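In matrix form, the detection case is a single product between the region features and the transposed language features; a hypothetical sketch:

```python
import torch

def match_regions(region_features, language_feature_set, label_names):
    """region_features O_F: [N, D]; language_feature_set L: [M, D].

    Returns the [N, M] feature matching result Logits = O_F · L^T and one prediction label per region.
    """
    logits = region_features @ language_feature_set.t()
    region_labels = [label_names[i] for i in logits.argmax(dim=-1).tolist()]
    return logits, region_labels
```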
In step S4, the computer device uses the label described by the target language description text as the prediction label of the training sample image, where the target language description text is the language description text corresponding to the group of language features in the language feature set that has the highest feature similarity with the target image features.
For the target detection task, an image may contain multiple detection targets. In that case, for each detection target in the image, the computer device uses the label described by the language description text corresponding to the group of language features with the highest feature similarity to that detection target's image features as the prediction label of the detection target, and finally uses the prediction labels of all detection targets in the image as the prediction labels of the image. For example, when garbage and pedestrians carrying garbage need to be detected, if an image contains only garbage, its prediction label is garbage; if the image contains a pedestrian carrying garbage, its prediction labels include both garbage and pedestrian.
Optionally, step S4 may be completed by the feature matching layer: after the feature matching layer matches the image features against the language features to obtain the feature matching result, it determines the prediction label of the training sample image based on that result and outputs the prediction label. Alternatively, step S4 may be performed by another module in the image annotation model (for example, the supervision module) or determined separately by the computer device, in both cases according to the output of the feature matching layer. In the latter two cases, the feature matching layer directly outputs the feature matching result after matching the image features with the language features.
Step 203, the computer device trains the target image annotation model according to errors between the real labels and the prediction labels of the plurality of training sample images in the training image set until the target image annotation model converges.
Optionally, referring to fig. 5, the target image labeling model further includes a supervision module, where a loss function matched with the target image task is set. The input end of the supervision module is connected with the output end of the characteristic matching layer. Referring to fig. 5, the feature matching layer may output a feature matching result to the supervision module, which determines a prediction label of the training sample image according to the input feature matching result.
Optionally, the implementation procedure of the above step 203 may include the following steps 2031 to 2032.
In step 2031, the computer device calculates, by the supervision module, a loss value of the loss function based on the real labels and the predictive labels of the plurality of training sample images in the training image set.
Optionally, the target image task is image classification or motion recognition. The loss function set in the supervision module can be expressed as: Loss = func_classification(Logits, G), where G is the real label of the training sample image. For the meaning of Logits, refer to the relevant definition in step S3 above for the case where the target image task is image classification or motion recognition. func_classification is a function that computes a classification loss from the labels, such as a cross-entropy loss function, a focal loss (a hard-sample mining loss) function, or a variant thereof.
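As a rough sketch of this classification-style supervision, the following assumes a cross-entropy realisation of func_classification; tensor shapes and values are illustrative, and a focal-loss variant could be substituted.

```python
import torch
import torch.nn.functional as F

batch, num_labels = 5, 3
logits = torch.randn(batch, num_labels, requires_grad=True)  # feature matching result per image
G = torch.tensor([0, 2, 1, 0, 2])                            # real label indices of the samples

loss = F.cross_entropy(logits, G)  # Loss = func_classification(Logits, G)
loss.backward()                    # gradient information flows back through the matching result
print(float(loss))
```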
Optionally, the target image task is target detection. The loss function set in the supervision module consists of two parts: a classification loss and a localization loss. The classification loss function can be expressed as: Loss_class = func_class(Logits, T_c), where T_c is the real label of the training sample image; assuming that the training sample image includes K detection targets, T_c = {C_1, …, C_k, …, C_K}, where K is a positive integer and 1 ≤ k ≤ K. For the meaning of Logits, refer to the relevant definition in step S3 above for the case where the target image task is target detection. The localization loss function can be expressed as: Loss_loc = func_iou(O_B, T_B), where O_B denotes the N prediction frame positions corresponding to the N region features extracted by the second feature extraction layer; O_B may be output by the second feature extraction layer to the supervision module. T_B denotes the K real frame positions corresponding to the K detection targets in the training sample image and can be expressed as: T_B = {Box_1, …, Box_k, …, Box_K}. func_iou is a function that computes an intersection over union (IOU) based loss between the prediction frames and the real frames, such as a generalized intersection over union (GIOU) or complete intersection over union (CIOU) loss function. The loss value of the loss function under the target detection task may be the sum of the loss value of the classification loss function and the loss value of the localization loss function.
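A hedged sketch of this two-part detection loss follows: classification via cross entropy plus a hand-written GIoU localisation term standing in for func_iou. It assumes the predictions have already been matched one-to-one with the K targets and that boxes use (x1, y1, x2, y2) coordinates; a real detector would add an assignment step.

```python
import torch
import torch.nn.functional as F

def giou_loss(pred, target):
    """pred, target: (K, 4) boxes as (x1, y1, x2, y2); returns mean (1 - GIoU)."""
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)           # area of the smallest enclosing box
    giou = iou - (enclose - union) / enclose
    return (1.0 - giou).mean()

K, num_labels = 2, 3
logits = torch.randn(K, num_labels, requires_grad=True)       # Logits for the K matched regions
T_c = torch.tensor([0, 1])                                    # real class labels C_1..C_K
O_B = torch.tensor([[10., 10., 50., 60.], [5., 5., 30., 40.]], requires_grad=True)  # predicted boxes
T_B = torch.tensor([[12., 8., 48., 62.], [4., 6., 28., 42.]])                       # real boxes

loss = F.cross_entropy(logits, T_c) + giou_loss(O_B, T_B)     # Loss_class + Loss_loc
loss.backward()
print(float(loss))
```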
In step 2032, the computer device back-propagates gradient information of the loss function to the second feature extraction layer to adjust the network parameters from the second feature extraction layer to the feature matching layer.
Alternatively, step 2032 may be replaced by: the computer device back-propagates gradient information of the loss function to the first feature extraction layer to adjust the network parameters from the first feature extraction layer to the feature matching layer. That is, the first feature extraction layer may be left unadjusted during model training, or it may be fine-tuned during model training according to the actual image task.
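A minimal sketch of these two tuning choices is given below, using placeholder linear layers for the three stages; whether the first (pre-trained) feature extraction layer receives gradient updates is controlled by a single flag. The module shapes are assumptions, not the patented structure.

```python
import torch.nn as nn
import torch.optim as optim

first_extractor = nn.Linear(512, 256)    # stands in for the visual-language pre-training model
second_extractor = nn.Linear(256, 128)   # task-specific second feature extraction layer
matching_layer = nn.Linear(128, 128)     # produces the feature matching result

fine_tune_first_layer = False            # set True to also fine-tune the first layer
for p in first_extractor.parameters():
    p.requires_grad = fine_tune_first_layer

trainable = [p for m in (first_extractor, second_extractor, matching_layer)
             for p in m.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=1e-3)   # only unfrozen parameters are updated
```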
In one round of model training, the computer device repeatedly trains the image annotation model (that is, continuously adjusts the network parameters) based on the same training image set until the loss function converges, thereby obtaining the converged image annotation model for that round of training. Loss function convergence may mean that the loss value of the loss function reaches a preset value.
It should be noted that the supervision module is set in the image annotation model for the model training process; after model training ends, the supervision module no longer participates in the backward adjustment of the network parameters, so it may be either deleted or retained.
Steps 201 to 203 describe one round of training of the target image annotation model. After completing a round of training of the target image annotation model, the computer device can further verify the accuracy of the target image annotation model. If the accuracy of the target image annotation model meets the preset requirement, the computer device stops training the target image annotation model, and the finally trained target image annotation model is used to determine the annotation labels of the images to be annotated under the target image task. If the accuracy of the target image annotation model does not meet the preset requirement, the computer device performs a new round of training on the target image annotation model, and iteratively updates the target image annotation model until its accuracy meets the preset requirement. See steps 204 to 207 below for a specific implementation flow.
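The overall loop of steps 203 to 207 can be sketched as follows. All helper functions (train_one_round, annotate, accuracy) are hypothetical stand-ins for the operations described above, not real APIs.

```python
def train_one_round(model, training_set):
    # Step 203 / step 2066: adjust network parameters until the loss converges.
    return model

def annotate(model, images):
    # Returns a mapping image -> prediction label produced by the annotation model.
    return {image: "garbage" for image in images}

def accuracy(predicted, truth):
    hits = sum(1 for image, label in truth.items() if predicted.get(image) == label)
    return hits / len(truth)

def auto_annotate(model, training_set, verification_images, verification_truth,
                  images_to_annotate, threshold=0.9):
    while True:
        model = train_one_round(model, training_set)              # step 203 / 206
        preds = annotate(model, verification_images)              # step 204
        if accuracy(preds, verification_truth) >= threshold:      # step 205
            return annotate(model, images_to_annotate)            # step 207
        # Step 206: mine hard cases, have them corrected manually, extend the training set.
        training_set = training_set + ["<manually corrected hard cases>"]
```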
Step 204, after the target image labeling model converges, the computer device invokes the target image labeling model to determine the prediction labels of the plurality of verification sample images in the verification image set.
The verification image set includes a plurality of verification sample images, each of which is annotated with a real label. The real label of a verification sample image may be manually annotated. The prediction label of a verification sample image is obtained by the target image annotation model based on the feature matching result between the image features corresponding to the verification sample image and the language features corresponding to each language description text in the language description text set. For the manner in which the target image annotation model determines the prediction label of a verification sample image, refer to the manner of determining the prediction label of a training sample image in step 202, which is not repeated here.
Optionally, the verification image set has no intersection with the training image set. The verification image set may be a fixed image set, that is, the verification sample images in the verification image set remain unchanged. For example, the intelligent annotation data set includes 1000 images, of which 100 images are annotated with real labels and the remaining 900 images are images to be annotated; then 50 of the images annotated with real labels can be used as verification sample images to obtain the verification image set, and the other 50 images annotated with real labels can be used as training sample images to obtain the initial training image set.
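Restating that example split as a tiny sketch (file names are assumed):

```python
# 1000 images in the intelligent annotation data set: 100 with real labels, 900 to be annotated.
labelled = [f"labelled_{i:03d}.jpg" for i in range(100)]
images_to_annotate = [f"unlabelled_{i:03d}.jpg" for i in range(900)]

verification_set = labelled[:50]   # fixed verification sample images
training_set = labelled[50:]       # initial training sample images

assert not set(verification_set) & set(training_set)  # no intersection between the two sets
```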
Step 205, the computer device determines the labeling accuracy of the target image labeling model according to the real labels and the prediction labels of the plurality of verification sample images.
For example, the verification image set includes 50 verification sample images, and if the true labels of 30 verification sample images are the same as the predicted labels, the labeling accuracy of the target image labeling model is 60%.
Step 206, when the labeling accuracy of the target image labeling model does not reach the preset threshold, the computer device executes one or more model training processes until the labeling accuracy of the target image labeling model reaches the preset threshold.
In the embodiments of the present application, when the number of initial training sample images is small, the images to be annotated can also be brought into model training. Combined with an active learning strategy, hard cases among the images to be annotated are screened out for manual correction, and the manually corrected and annotated images are added to the training image set as new training sample images. This expands the scale of the training image set so that a new round of training can improve model accuracy. The model training process includes the following steps 2061 to 2066.
In step 2061, the computer device invokes the target image annotation model to determine the predictive labels for the plurality of images to be annotated and the confidence level of the predictive labels for the images to be annotated.
In step 2062, the computer device obtains a refractory image from the plurality of images to be annotated that has a confidence level of the predictive tag below a confidence threshold.
In step 2063, the computer device outputs the refractory image and the predictive label of the refractory image for manual correction.
In step 2064, the computer device adds the refractory image as a new training sample image to the training image set in response to receiving the artificial annotation result for the refractory image, resulting in an updated training image set.
The manual annotation result includes a real label manually annotated for the difficult-to-label image. Steps 2061 to 2064 constitute the active learning part of the model training process, that is, the manually involved part: the computer device screens out a suitable candidate set and submits it to manual annotation in an iterative learning process.
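Steps 2061 and 2062 amount to confidence-threshold filtering, as in the small sketch below; the prediction data and the 0.6 threshold are illustrative assumptions.

```python
# image id -> (prediction label, confidence) as returned by the target image annotation model
predictions = {
    "img_001.jpg": ("garbage", 0.97),
    "img_002.jpg": ("pedestrian carrying garbage", 0.41),
    "img_003.jpg": ("garbage", 0.58),
}
CONFIDENCE_THRESHOLD = 0.6  # assumed value

# Step 2062: keep the images whose prediction-label confidence falls below the threshold.
hard_cases = {image: label for image, (label, conf) in predictions.items()
              if conf < CONFIDENCE_THRESHOLD}
print(hard_cases)  # step 2063: these images and labels are output for manual correction
```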
In step 2065, the computer device invokes the target image annotation model to determine a predictive label for each training sample image in the updated training image set, the predictive labels for the training sample images being derived based on feature matching results of image features corresponding to the training sample images with corresponding language features of each language description text in the set of language description texts corresponding to the updated training image set, respectively.
Optionally, if the label corresponding to the updated training image set changes, the computer device re-acquires the language description text set corresponding to the updated training image set, and the specific implementation manner may refer to the above steps 2011 to 2012, and the embodiments of the present application are not repeated herein.
In step 2066, the computer device trains the target image annotation model based on the errors between the actual labels and the predicted labels of the plurality of training sample images in the updated training image set until the target image annotation model is again converged.
The implementation process of this step 2066 may refer to the implementation process of step 203, and this embodiment is not described herein.
In step 207, when the labeling accuracy of the target image labeling model reaches the preset threshold, the computer device uses the prediction labels of the images to be labeled, determined by invoking the target image labeling model, as the labeling labels of the images to be labeled.
Optionally, the computer device may further output labeling tags for all images to be labeled.
For example, fig. 6 is a schematic architecture diagram related to an image labeling method according to an embodiment of the present application. As shown in fig. 6, the architecture includes a data storage medium, a processor, and an interactive UI. The data storage medium is used for storing an intelligent annotation data set, and the intelligent annotation data set comprises an annotated image and an image to be annotated. The processor may be a central processing unit (central processing unit, CPU) or a graphics processor (graphics processing unit, GPU). The processor is used for running and training an image annotation model, the image annotation model comprises a visual language pre-training model, a downstream task model head and a language image feature matching layer which are sequentially connected in series, and the visual language pre-training model is further connected with the language image feature matching layer. The interactive UI is used for a user to add language descriptions, including setting a language description template and adding language description texts in batches. The interactive UI is also used for outputting a labeling result, for example, a prediction label of an image to be labeled by the image labeling model can be displayed, and an active learning/difficult case mining result is presented for a user to manually correct and label, and the like. The downstream task model header includes, but is not limited to, image tasks such as image classification, object detection, motion recognition, and the like.
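The serial composition described for fig. 6 (visual language pre-training model, downstream task model head, and language image feature matching layer, with the language branch feeding the matching layer directly) might be wired up roughly as below. Every module is a placeholder linear layer; the dimensions, task names, and plain dot-product matching are assumptions for illustration only, not the patented implementation.

```python
import torch
import torch.nn as nn

class ImageAnnotationModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.vl_backbone = nn.Linear(768, dim)     # stands in for the visual-language pre-training model
        self.downstream_head = nn.ModuleDict({     # one feature extraction layer per image task
            "classification": nn.Linear(dim, dim),
            "detection": nn.Linear(dim, dim),
            "motion_recognition": nn.Linear(dim, dim),
        })
        self.text_proj = nn.Linear(768, dim)       # language branch feeding the matching layer

    def forward(self, image_emb, text_emb, task="classification"):
        img = self.downstream_head[task](self.vl_backbone(image_emb))
        txt = self.text_proj(text_emb)
        return img @ txt.t()                       # language-image feature matching result

model = ImageAnnotationModel()
logits = model(torch.randn(2, 768), torch.randn(3, 768), task="classification")
print(logits.shape)  # (2, 3): similarity of each image to each label description
```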
In summary, in the image labeling method provided in the embodiment of the present application, language prior associated knowledge of a labeling task is introduced to a training sample image by acquiring a language description text set corresponding to a training image set, and the image task is transformed into a matching task of image features and language features in an image labeling model. Therefore, the initial labeling performance of the image labeling model can be greatly improved, the first round labeling accuracy of the image labeling model is improved under the condition that the number of initial training sample images is small, the training round number of the image labeling model is effectively reduced, and therefore the image labeling efficiency is improved. In addition, the embodiment of the application can also provide an image annotation model frame suitable for a plurality of image tasks, and the downstream task model heads are designed in the image annotation model frame, so that the image annotation model frame can adapt to downstream requirements of diversity, and a corresponding image annotation model can be constructed only by selecting different feature extraction layers in the downstream task model heads. Because the image annotation model framework can be shared by a plurality of image tasks, a corresponding image annotation model is not required to be designed for each image task, unified standardized management of intelligent image annotation is realized, the complexity of development and maintenance can be reduced, and the technical cost is reduced.
The order of the steps of the image labeling method provided in the embodiments of the present application can be appropriately adjusted, and steps can be added or removed as appropriate. Any variation of the method readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. For example, the computer device may display information on its own display interface, or the computer device may send information to another display device and display the information on that display device.
The virtual device according to the embodiment of the present application is illustrated below.
For example, fig. 7 is a schematic structural diagram of an image labeling device according to an embodiment of the present application. As shown in fig. 7, the image labeling apparatus 700 includes: an acquisition module 701, a determination module 702 and a training module 703.
The obtaining module 701 is configured to obtain a training image set corresponding to a target image task and a language description text set corresponding to the training image set, where the training image set includes a plurality of training sample images, each training sample image in the plurality of training sample images is labeled with a real label, the language description text set includes a plurality of language description texts, the plurality of language description texts are in one-to-one correspondence with a plurality of types of labels corresponding to the training image set, and each language description text is used for describing semantics of one type of labels in the plurality of types of labels.
The determining module 702 is configured to invoke a target image annotation model corresponding to the target image task, determine a prediction label of each training sample image in the training image set, where the prediction label of the training sample image is obtained by the target image annotation model based on a feature matching result of an image feature corresponding to the training sample image and a language feature corresponding to each language description text in the language description text set.
The training module 703 is configured to train the target image labeling model according to errors between real labels and predicted labels of a plurality of training sample images in the training image set until the target image labeling model converges, where the target image labeling model is used to determine labeling labels of the images to be labeled under the target image task.
Optionally, the determining module 702 is further configured to, after the target image labeling model converges, invoke the target image labeling model to determine prediction labels of a plurality of verification sample images in a verification image set, where the verification image set includes the plurality of verification sample images, and each verification sample image in the plurality of verification sample images is labeled with a real label. The determining module 702 is further configured to determine labeling accuracy of the labeling model of the target image according to the real labels and the predicted labels of the plurality of verification sample images. The training module 703 is further configured to perform one or more model training processes when the labeling accuracy of the target image labeling model does not reach the preset threshold, until the labeling accuracy of the target image labeling model reaches the preset threshold.
Wherein, the model training process includes: and calling a target image annotation model, and determining the prediction labels of the plurality of images to be annotated and the confidence of the prediction labels of the images to be annotated. And obtaining refractory images with confidence of the predictive labels lower than a confidence threshold from the plurality of images to be annotated. Outputting the difficultly-marked image and the prediction label of the difficultly-marked image for manual correction. And in response to receiving the manual annotation result for the difficultly-annotated image, adding the difficultly-annotated image as a new training sample image to the training image set to obtain an updated training image set. And calling a target image annotation model, determining a prediction label of each training sample image in the updated training image set, and obtaining the prediction label of the training sample image based on the feature matching result of the image features corresponding to the training sample image and the language features corresponding to each language description text in the language description text set corresponding to the updated training image set. And training the target image annotation model according to errors between the actual labels and the predicted labels of the plurality of training sample images in the updated training image set until the target image annotation model is converged again.
Optionally, the determining module 702 is further configured to, when the labeling accuracy of the target image labeling model reaches a preset threshold, use the predicted label of the image to be labeled determined by calling the target image labeling model as the labeling label of the image to be labeled.
Optionally, the target image labeling model includes a first feature extraction layer, a second feature extraction layer and a feature matching layer, the first feature extraction layer includes an image feature output end and a language feature output end, the feature matching layer includes an image feature input end and a language feature input end, the image feature output end is connected with the input end of the second feature extraction layer, the language feature output end is connected with the language feature input end, and the output end of the second feature extraction layer is connected with the image feature input end. A determining module 702, configured to: and respectively extracting the characteristics of each language description text in the language description text set through the first characteristic extraction layer to obtain a language characteristic set, wherein the language characteristic set comprises a plurality of groups of language characteristics, and each group of language characteristics is a language characteristic corresponding to one language description text in the language description text set. For each training sample image in the training image set, carrying out feature extraction on the training sample image through a first feature extraction layer to obtain global image features corresponding to the training sample image, carrying out feature extraction on the global image features through a second feature extraction layer to obtain target image features, associating the target image features with target image tasks, respectively carrying out feature matching on each group of language features in the target image features and the language feature sets through a feature matching layer to obtain feature matching results, wherein the feature matching results comprise feature similarity between each group of language features in the target image features and the language feature sets, taking a label described by a target language description text as a prediction label of the training sample image, and the target language description text is a language description text corresponding to a group of language features with the highest feature similarity between the language feature sets and the target image features.
Optionally, the target image annotation model further comprises a supervision module, a loss function matched with the target image task is arranged in the supervision module, and the input end of the supervision module is connected with the output end of the feature matching layer. Training module 703 for: the loss value of the loss function is calculated by a supervision module based on the real labels and the predicted labels of a plurality of training sample images in the training image set. And reversely transmitting gradient information of the loss function to the second feature extraction layer to adjust network parameters from the second feature extraction layer to the feature matching layer.
Optionally, an image annotation model frame is pre-stored in the computer device, the image annotation model frame includes a first feature extraction layer, a downstream task model head and a feature matching layer, an image feature output end is connected with an input end of the downstream task model head, an output end of the downstream task model head is connected with an image feature input end of the feature matching layer, the downstream task model head includes a plurality of feature extraction layers corresponding to a plurality of image tasks one to one, and the second feature extraction layer is a feature extraction layer corresponding to a target image task in the downstream task model head, wherein the image feature output end is configured to be connected with one feature extraction layer in the downstream task model head each time.
Optionally, the first feature extraction layer is implemented by a visual language pre-training model obtained by pre-training.
Optionally, as shown in fig. 8, the image labeling apparatus 700 further includes a display module 704.
Optionally, the display module 704 is configured to display, in response to receiving an initiation instruction for the target image task, a template setting prompt, where the template setting prompt is configured to prompt a user to set a language description template corresponding to the target image task. The obtaining module 701 is configured to generate a language description text according to the set language description template and the label for each type of label corresponding to the training image set.
Optionally, the display module 704 is further configured to display the language description text after generating the language description text.
Optionally, the display module 704 is further configured to display a plurality of image tasks, where the target image task is one of the plurality of image tasks. The acquiring module 701 is configured to determine that a start instruction for a target image task is received in response to detecting a selection operation for the target image task.
Optionally, the plurality of image tasks includes one or more of image classification, object detection, or motion recognition.
The specific manner in which the various modules of the apparatus in the above embodiments perform operations has been described in detail in the embodiments of the related method and will not be elaborated here.
The following illustrates the basic hardware structure involved in the embodiments of the present application.
For example, fig. 9 is a schematic hardware structure of an image labeling apparatus according to an embodiment of the present application. As shown in fig. 9, the image labeling apparatus 900 includes a processor 901 and a memory 902, and the processor 901 and the memory 902 are connected via a bus 903. Fig. 9 illustrates the processor 901 and the memory 902 as independent of each other. Optionally, the processor 901 and the memory 902 are integrated. Alternatively, the image labeling apparatus 900 in fig. 9 is any computer device that has computing capabilities.
The memory 902 is used to store a computer program, including an operating system and program code. The memory 902 may be any of various types of storage media, such as a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM), flash memory, optical memory, registers, optical disk storage, a magnetic disk, or another magnetic storage device.
The processor 901 is a general-purpose processor or a special-purpose processor. Processor 901 may be a single core processor or a multi-core processor. The processor 901 includes at least one circuit to perform the image labeling method provided in the embodiments of the present application.
Optionally, the image labeling apparatus 900 further comprises a network interface 904, the network interface 904 being connected to the processor 901 and the memory 902 via a bus 903. The network interface 904 enables the image annotation apparatus 900 to communicate with other devices.
Optionally, the image labeling apparatus 900 further comprises an input/output (I/O) interface 905, and the I/O interface 905 is connected to the processor 901 and the memory 902 through the bus 903. The processor 901 can receive input commands or data, etc., through the I/O interface 905. The I/O interface 905 is used for the image labeling apparatus 900 to connect input devices such as a keyboard, a mouse, and the like. Optionally, in some possible scenarios, the above-described network interface 904 and I/O interface 905 are collectively referred to as a communication interface.
Optionally, the image labeling apparatus 900 further comprises a display 906, the display 906 being connected to the processor 901 and the memory 902 via the bus 903. The display 906 can be used to display intermediate and/or final results, etc., resulting from the processor 901 performing the above-described methods, such as displaying image tasks, template setup prompts, language description text, etc. In one possible implementation, the display 906 is a touch screen to provide a human-machine interaction interface.
The bus 903 is any type of communication bus for interconnecting the internal devices of the image annotation device 900. Such as a system bus. The embodiment of the present application describes that the above-mentioned devices inside the image labeling apparatus 900 are interconnected by the bus 903, alternatively, the above-mentioned devices inside the image labeling apparatus 900 are communicatively connected to each other by a connection means other than the bus 903, for example, the above-mentioned devices inside the image labeling apparatus 900 are interconnected by a logic interface inside the image labeling apparatus 900.
The above devices may be provided on separate chips, or may be provided at least partially or entirely on the same chip. Whether the individual devices are independently disposed on different chips or integrally disposed on one or more chips is often dependent on the needs of the product design. The embodiment of the application does not limit the specific implementation form of the device.
The image annotation device 900 shown in fig. 9 is merely exemplary, and in implementation, the image annotation device 900 includes other components, which are not listed here. The image labeling apparatus 900 shown in fig. 9 may implement intelligent labeling of an image by performing all or part of the steps of the method provided in the above embodiments.
Embodiments of the present application also provide a computer readable storage medium having instructions stored thereon that, when executed by a processor, implement an image labeling method as shown in fig. 2.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements an image labeling method as shown in fig. 2.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
In the present embodiments, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The term "and/or" in this application is merely an association relation describing an associated object, and indicates that three relations may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
It should be noted that, information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions. For example, the image data referred to in this application are acquired with sufficient authorization.
The foregoing are merely optional embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall be included in the protection scope of the present application.

Claims (25)

1. An image annotation method for a computer device, the method comprising:
acquiring a training image set corresponding to a target image task and a language description text set corresponding to the training image set, wherein the training image set comprises a plurality of training sample images, each training sample image in the plurality of training sample images is marked with a real label, the language description text set comprises a plurality of language description texts, the plurality of language description texts are in one-to-one correspondence with a plurality of types of labels corresponding to the training image set, and each language description text is used for describing the semantics of one type of labels in the plurality of types of labels;
Invoking a target image annotation model corresponding to the target image task, and determining a prediction label of each training sample image in the training image set, wherein the prediction label of the training sample image is obtained by the target image annotation model based on a feature matching result of image features corresponding to the training sample image and language features corresponding to each language description text in the language description text set;
and training the target image annotation model according to errors between real labels and predicted labels of a plurality of training sample images in the training image set until the target image annotation model converges, wherein the target image annotation model is used for determining the annotation labels of the images to be annotated under the target image task.
2. The method of claim 1, wherein after the target image annotation model converges, the method further comprises:
invoking the target image annotation model, and determining prediction labels of a plurality of verification sample images in a verification image set, wherein the verification image set comprises the plurality of verification sample images, and each verification sample image in the plurality of verification sample images is annotated with a real label;
Determining the labeling accuracy of the target image labeling model according to the real labels and the prediction labels of the verification sample images;
when the labeling accuracy of the target image labeling model does not reach a preset threshold, executing one or more model training processes until the labeling accuracy of the target image labeling model reaches the preset threshold;
wherein the model training process comprises:
invoking the target image annotation model, and determining the prediction labels of a plurality of images to be annotated and the confidence of the prediction labels of the images to be annotated;
acquiring difficultly-marked images with confidence coefficient of the predictive label lower than a confidence coefficient threshold value from the plurality of images to be marked;
outputting the difficultly-marked image and a prediction label of the difficultly-marked image for manual correction;
in response to receiving a manual annotation result for the difficultly-annotated image, adding the difficultly-annotated image as a new training sample image to the training image set to obtain an updated training image set;
invoking the target image annotation model, and determining a prediction label of each training sample image in the updated training image set, wherein the prediction label of the training sample image is obtained based on feature matching results of image features corresponding to the training sample image and language features corresponding to each language description text in a language description text set corresponding to the updated training image set;
And training the target image annotation model according to errors between the real labels and the predicted labels of the plurality of training sample images in the updated training image set until the target image annotation model is converged again.
3. The method according to claim 2, wherein the method further comprises:
when the labeling accuracy of the target image labeling model reaches the preset threshold, the prediction label of the image to be labeled, which is determined by calling the target image labeling model, is used as the labeling label of the image to be labeled.
4. A method according to any one of claims 1 to 3, wherein the target image annotation model comprises a first feature extraction layer, a second feature extraction layer and a feature matching layer, the first feature extraction layer comprising an image feature output and a language feature output, the feature matching layer comprising an image feature input and a language feature input, the image feature output being connected to the input of the second feature extraction layer, the language feature output being connected to the language feature input, the output of the second feature extraction layer being connected to the image feature input; the step of calling the target image annotation model corresponding to the target image task and determining the prediction label of each training sample image in the training image set comprises the following steps:
Extracting the characteristics of each language description text in the language description text set through the first characteristic extraction layer to obtain a language characteristic set, wherein the language characteristic set comprises a plurality of groups of language characteristics, and each group of language characteristics is a language characteristic corresponding to one language description text in the language description text set;
for each training sample image in the training image set,
extracting the characteristics of the training sample image through the first characteristic extraction layer to obtain the global image characteristics corresponding to the training sample image,
performing feature extraction on the global image features through the second feature extraction layer to obtain target image features, wherein the target image features are associated with the target image task,
the feature matching layer is used for respectively carrying out feature matching on the target image features and the language features of each group in the language feature set to obtain feature matching results, the feature matching results comprise feature similarity between the target image features and each group of language features in the language feature set,
and taking the label described by the target language description text as a prediction label of the training sample image, wherein the target language description text is the language description text corresponding to a group of language features with highest feature similarity between the language feature set and the target image feature.
5. The method of claim 4, wherein the target image annotation model further comprises a supervision module, wherein a loss function matched with the target image task is set in the supervision module, an input end of the supervision module is connected with an output end of the feature matching layer, and training the target image annotation model according to errors between real labels and predicted labels of a plurality of training sample images in the training image set comprises:
calculating, by the supervision module, a loss value of the loss function based on real labels and predictive labels of a plurality of training sample images in the training image set;
and reversely transmitting gradient information of the loss function to the second feature extraction layer so as to adjust network parameters from the second feature extraction layer to the feature matching layer.
6. The method according to claim 4 or 5, wherein an image annotation model framework is pre-stored in the computer device, the image annotation model framework comprises the first feature extraction layer, a downstream task model head and the feature matching layer, the image feature output end is connected with the input end of the downstream task model head, the output end of the downstream task model head is connected with the image feature input end of the feature matching layer, the downstream task model head comprises a plurality of feature extraction layers corresponding to a plurality of image tasks one by one, the second feature extraction layer is a feature extraction layer corresponding to the target image task in the downstream task model head, and the image feature output end is configured to connect one feature extraction layer in the downstream task model head at a time.
7. The method according to any of claims 4 to 6, wherein the first feature extraction layer is implemented by a pre-trained visual language pre-training model.
8. The method according to any one of claims 1 to 7, wherein the obtaining the language description text set corresponding to the training image set includes:
in response to receiving a start instruction for the target image task, displaying a template setting prompt, wherein the template setting prompt is used for prompting a user to set a language description template corresponding to the target image task;
and generating a language description text according to the set language description template and the labels aiming at each type of labels corresponding to the training image set.
9. The method of claim 8, wherein the method further comprises:
after the language description text is generated, the language description text is displayed.
10. The method according to claim 8 or 9, characterized in that the method further comprises:
displaying a plurality of image tasks, wherein the target image task is one of the plurality of image tasks;
in response to detecting a selection operation of the target image task, it is determined that a start instruction for the target image task is received.
11. The method of claim 6 or 10, wherein the plurality of image tasks includes one or more of image classification, object detection, or motion recognition.
12. An image annotation device, the device comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a training image set corresponding to a target image task and a language description text set corresponding to the training image set, the training image set comprises a plurality of training sample images, each training sample image in the plurality of training sample images is marked with a real label, the language description text set comprises a plurality of language description texts, the plurality of language description texts are in one-to-one correspondence with a plurality of types of labels corresponding to the training image set, and each language description text is used for describing the semantics of one type of labels in the plurality of types of labels;
the determining module is used for calling a target image annotation model corresponding to the target image task and determining a prediction label of each training sample image in the training image set, wherein the prediction label of the training sample image is obtained by the target image annotation model based on the feature matching result of the image feature corresponding to the training sample image and the language feature corresponding to each language description text in the language description text set;
The training module is used for training the target image annotation model according to errors between real labels and predicted labels of a plurality of training sample images in the training image set until the target image annotation model converges, and the target image annotation model is used for determining the annotation labels of the images to be annotated under the target image task.
13. The apparatus of claim 12, wherein
the determining module is further configured to invoke the target image annotation model after the target image annotation model converges, determine prediction labels of a plurality of verification sample images in a verification image set, where the verification image set includes the plurality of verification sample images, and each verification sample image in the plurality of verification sample images is annotated with a real label;
the determining module is further used for determining the labeling accuracy of the target image labeling model according to the real labels and the prediction labels of the verification sample images;
the training module is further used for executing one or more model training processes when the labeling accuracy of the target image labeling model does not reach a preset threshold value until the labeling accuracy of the target image labeling model reaches the preset threshold value;
Wherein the model training process comprises:
invoking the target image annotation model, and determining the prediction labels of a plurality of images to be annotated and the confidence of the prediction labels of the images to be annotated;
acquiring difficultly-marked images with confidence coefficient of the predictive label lower than a confidence coefficient threshold value from the plurality of images to be marked;
outputting the difficultly-marked image and a prediction label of the difficultly-marked image for manual correction;
in response to receiving a manual annotation result for the difficultly-annotated image, adding the difficultly-annotated image as a new training sample image to the training image set to obtain an updated training image set;
invoking the target image annotation model, and determining a prediction label of each training sample image in the updated training image set, wherein the prediction label of the training sample image is obtained based on feature matching results of image features corresponding to the training sample image and language features corresponding to each language description text in a language description text set corresponding to the updated training image set;
and training the target image annotation model according to errors between the real labels and the predicted labels of the plurality of training sample images in the updated training image set until the target image annotation model is converged again.
14. The apparatus of claim 13, wherein
and the determining module is further used for taking the prediction label of the image to be marked, which is determined by calling the target image marking model, as the marking label of the image to be marked when the marking accuracy of the target image marking model reaches the preset threshold.
15. The apparatus according to any one of claims 12 to 14, wherein the target image annotation model comprises a first feature extraction layer, a second feature extraction layer and a feature matching layer, the first feature extraction layer comprising an image feature output and a language feature output, the feature matching layer comprising an image feature input and a language feature input, the image feature output being connected to the input of the second feature extraction layer, the language feature output being connected to the language feature input, the output of the second feature extraction layer being connected to the image feature input; the determining module is used for:
extracting the characteristics of each language description text in the language description text set through the first characteristic extraction layer to obtain a language characteristic set, wherein the language characteristic set comprises a plurality of groups of language characteristics, and each group of language characteristics is a language characteristic corresponding to one language description text in the language description text set;
For each training sample image in the training image set,
extracting the characteristics of the training sample image through the first characteristic extraction layer to obtain the global image characteristics corresponding to the training sample image,
performing feature extraction on the global image features through the second feature extraction layer to obtain target image features, wherein the target image features are associated with the target image task,
the feature matching layer is used for respectively carrying out feature matching on the target image features and the language features of each group in the language feature set to obtain feature matching results, the feature matching results comprise feature similarity between the target image features and each group of language features in the language feature set,
and taking the label described by the target language description text as a prediction label of the training sample image, wherein the target language description text is the language description text corresponding to a group of language features with highest feature similarity between the language feature set and the target image feature.
16. The apparatus of claim 15, wherein the target image annotation model further comprises a supervision module, the supervision module having a loss function matched to the target image task disposed therein, an input of the supervision module connected to an output of the feature matching layer, and the training module configured to:
Calculating, by the supervision module, a loss value of the loss function based on real labels and predictive labels of a plurality of training sample images in the training image set;
and reversely transmitting gradient information of the loss function to the second feature extraction layer so as to adjust network parameters from the second feature extraction layer to the feature matching layer.
17. The apparatus according to claim 15 or 16, wherein an image annotation model framework is pre-stored in the computer device, the image annotation model framework comprising the first feature extraction layer, a downstream task model head and the feature matching layer, the image feature output being connected to an input of the downstream task model head, an output of the downstream task model head being connected to an image feature input of the feature matching layer, the downstream task model head comprising a plurality of feature extraction layers in one-to-one correspondence with a plurality of image tasks, the second feature extraction layer being a feature extraction layer in the downstream task model head corresponding to the target image task, wherein the image feature output is configured to connect one feature extraction layer in the downstream task model head at a time.
18. The apparatus according to any of the claims 15 to 17, wherein the first feature extraction layer is implemented by a pre-trained visual language pre-training model.
19. The apparatus according to any one of claims 12 to 18, further comprising: a display module;
the display module is used for responding to the receiving of the starting instruction aiming at the target image task, displaying a template setting prompt, wherein the template setting prompt is used for prompting a user to set a language description template corresponding to the target image task;
the acquisition module is used for generating a language description text according to the set language description template and the labels aiming at each type of labels corresponding to the training image set.
20. The apparatus of claim 19, wherein
the display module is further used for displaying the language description text after the language description text is generated.
21. The device according to claim 19 or 20, wherein,
the display module is further used for displaying a plurality of image tasks, and the target image task is one of the plurality of image tasks;
the acquisition module is used for responding to detection of the selection operation of the target image task and determining that a starting instruction aiming at the target image task is received.
22. The apparatus of claim 17 or 21, wherein the plurality of image tasks comprises one or more of image classification, object detection, or motion recognition.
23. A computer device, comprising: a processor and a memory;
the memory is used for storing a computer program, and the computer program comprises program instructions;
the processor is configured to invoke the computer program to implement the image labeling method according to any of claims 1 to 11.
24. A computer readable storage medium having instructions stored thereon which, when executed by a processor, implement the image annotation method according to any of claims 1 to 11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the image annotation method according to any one of claims 1 to 11.
CN202211042115.8A 2022-08-29 2022-08-29 Image labeling method and device Pending CN117671678A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211042115.8A CN117671678A (en) 2022-08-29 2022-08-29 Image labeling method and device
PCT/CN2023/089419 WO2024045641A1 (en) 2022-08-29 2023-04-20 Image annotation method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211042115.8A CN117671678A (en) 2022-08-29 2022-08-29 Image labeling method and device

Publications (1)

Publication Number Publication Date
CN117671678A true CN117671678A (en) 2024-03-08

Family

ID=90064841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211042115.8A Pending CN117671678A (en) 2022-08-29 2022-08-29 Image labeling method and device

Country Status (2)

Country Link
CN (1) CN117671678A (en)
WO (1) WO2024045641A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118171208A (en) * 2024-05-16 2024-06-11 江西广播电视网络传媒有限公司 Multi-mode multi-label association classification method, system, storage medium and computer
CN118192976A (en) * 2024-05-08 2024-06-14 工业富联(杭州)数据科技有限公司 Operation guide generation method and device, electronic equipment and storage medium
CN118366011A (en) * 2024-06-19 2024-07-19 温州电力建设有限公司 Model training, underground cable pipeline defect identification method, product and equipment

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118296387B (en) * 2024-06-05 2024-08-06 烟台海颐软件股份有限公司 Model bi-directional iteration-based training sample optimization method and model bi-directional iteration-based training sample optimization system
CN118332127B (en) * 2024-06-14 2024-08-06 安徽农业大学 Zero sample text classification method based on cross-language integration

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416384B (en) * 2018-03-05 2021-11-05 苏州大学 Image label labeling method, system, equipment and readable storage medium
US10878296B2 (en) * 2018-04-12 2020-12-29 Discovery Communications, Llc Feature extraction and machine learning for automated metadata analysis
CN111626362B (en) * 2020-05-28 2024-02-02 腾讯科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN112926654B (en) * 2021-02-25 2023-08-01 平安银行股份有限公司 Pre-labeling model training and certificate pre-labeling method, device, equipment and medium
CN113065013B (en) * 2021-03-25 2024-05-03 携程计算机技术(上海)有限公司 Image annotation model training and image annotation method, system, equipment and medium
CN114186056B (en) * 2021-12-14 2024-10-15 广州华多网络科技有限公司 Commodity label marking method and device, equipment, medium and product thereof
CN114429566A (en) * 2022-01-20 2022-05-03 北京沃东天骏信息技术有限公司 Image semantic understanding method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2024045641A1 (en) 2024-03-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination