CN115099358A - Open world target detection training method based on dictionary creation and field self-adaptation
- Publication number: CN115099358A
- Application number: CN202210811954.5A
- Authority: CN (China)
- Prior art keywords: training, text, target detection, field, image
- Prior art date: 2022-07-11
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/088—Non-supervised learning, e.g. competitive learning
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses an open world target detection training method based on dictionary creation and field self-adaptation, relates to the technical field of communication software, and solves the problem that existing open world target detection is limited to a single scene. First, a picture description data set and a multi-modal feature extraction network are introduced, the text-modality and visual-modality features they output are aligned, and a multi-modal Transformer network is introduced to perform image-text matching learning and text mask learning. The parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage are then transferred into a target detection model, and pictures from two field data sets are input: the source field pictures participate in target detection training, while the target field pictures participate only in global field self-adaptive training. During training, the classifier weights of the detection head are replaced with fixed word vectors of the known classes for target detection training.
Description
Technical Field
The invention relates to the technical field of communication software, in particular to an open world target detection training method based on dictionary creation and field self-adaptation.
Background
The main purpose of target detection is to detect and locate multiple specific targets in a picture. Its core problems are locating and classifying the content to be detected: the shape, size and position of a target appearing in the picture must be determined under varying conditions such as illumination and shade, while ensuring high accuracy and short detection time. Open world target detection is a method for identifying new classes in real, complex scenes.
Traditional target detection methods are mainly limited to fixed-class data sets in fixed scenes: a trained classifier can only identify the labeled classes and cannot efficiently identify both known and unknown classes in non-fixed scenes, and labeling all information for every possible scene is unrealistic. Traditional new-class detection methods only learn the implicit relations between classes and seriously neglect the internal relation between text features and visual features, so their identification precision is low. Meanwhile, due to the lack of cross-scene data sets, a model trained on a fixed data set is difficult to generalize to severe weather, which further reduces its ability to detect new classes in severe weather.
Existing open world target detection is limited to a single scene, such as indoors or normal weather. In the real open world, however, scenes are complex and include various kinds of severe weather; existing methods trained in normal weather are difficult to generalize to severe weather, resulting in low recognition accuracy. Although related literature discloses target detection models based on domain adaptation that can mitigate this problem, such as Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. 2018. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3339-3348, these models cannot recognize new classes in the open world, because a purely domain-adaptive method only generalizes global scene features and does not add category features.
Disclosure of Invention
The invention aims to solve the problem that open world target detection in the prior art is limited to a single scene, and to this end provides an open world target detection training method based on dictionary creation and field self-adaptation.
The invention specifically adopts the following technical scheme for realizing the purpose:
the open world target detection training method based on dictionary creation and field self-adaptation comprises the following steps:
step one, acquiring image sample data and picture description data corresponding to the image sample data;
step two, constructing a regional visual feature extraction model, a visual mapping text layer, a BERT word vector extraction model and a multi-modal Transformer network model;
step three, in the pre-training stage, inputting the image sample data from step one into the regional visual feature extraction model, the output of the regional visual feature extraction model serving as the input of the visual mapping text layer; inputting the picture description data from step one into the BERT word vector extraction model; and, after feature alignment, inputting the output of the BERT word vector extraction model and the output of the visual mapping text layer into the multi-modal Transformer network model to perform image-text matching learning and text mask learning;
step four, in the training stage, transferring the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage into the visual feature extraction module, and inputting pictures from two field data sets, wherein the source field pictures participate in target detection training and the target field pictures participate only in global field self-adaptive training; during training, the classifier weights of the detection head are replaced with fixed word vectors of the known classes, and target detection training is carried out.
During testing, different fixed character features are adopted to replace the classification heads of the detection models.
The technical principle is as follows: in the pre-training stage, a picture description data set is adopted; the picture information and its word description are respectively input into a visual feature extraction model and a text feature extraction model to obtain regional visual features and single-word features. A visual mapping text layer is placed after the visual features to ensure that the output visual feature dimensions are consistent with the text feature dimensions. Finally, the distance between the two modal features is calculated, and the similarity of the multi-modal features is improved by reducing a specific loss function value. During pre-training the model mainly learns the visual mapping text layer and the visual feature extraction model, and the combination of the two realizes the construction of a dictionary.
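As an illustration of this alignment step, the following PyTorch sketch (hypothetical function and tensor names; the patent does not prescribe an implementation) computes dot-product scores between mapped regional features and caption word vectors and pulls matching pairs together with a symmetric contrastive loss:

```python
import torch
import torch.nn.functional as F

def alignment_loss(region_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive alignment of visual and text features.

    region_feats: (B, d) regional visual features after the visual mapping
                  text layer, so they share the text feature dimension d.
    word_feats:   (B, d) BERT word vector features of the paired captions.
    """
    sim = region_feats @ word_feats.t()            # (B, B) dot-product scores
    targets = torch.arange(sim.size(0), device=sim.device)
    # Image-to-text and text-to-image matching terms.
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))
```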
In the training stage, Faster R-CNN is introduced as the target detection model. Because detection is affected by environmental factors and most existing target detection data sets were collected in normal weather, a field self-adaptation method is introduced so that the domain-invariant features of objects in normal weather and severe weather are learned during training. Meanwhile, a zero-sample target identification method is introduced in the detection stage: fixed character features are substituted for the classification head of the detection model, the visual mapping text layer trained in the pre-training stage is introduced, and finally the target detection model is trained.
In the training stage, the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage are transferred into the target detection model, and pictures from two field data sets are input. The source field data set carries target labeling information, while the target field data set carries only field label information; only the source field pictures participate in target detection training, and the target field pictures participate only in global field self-adaptive training, whose main purpose is to extract the field-invariant features of the source field and the target field.
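A minimal sketch of this parameter transfer, assuming the checkpoint keys and module names shown (they are hypothetical; the patent does not fix a serialization format):

```python
import torch

def transfer_pretrained(detector, ckpt_path: str = "pretrain_checkpoint.pth"):
    """Copy pre-trained weights into the target detection model."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Regional visual feature extraction model -> detector backbone.
    detector.backbone.load_state_dict(ckpt["region_visual_extractor"])
    # Visual mapping text layer, reused so detector features live in text space.
    detector.visual_to_text.load_state_dict(ckpt["visual_mapping_text_layer"])
    return detector
```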
In the testing stage, different fixed character features are used for replacing the detection model classification heads so as to achieve the purpose of identifying different types of targets in the open world.
Further, in the pre-training process, the relationship between the BERT word vector and the visual features output by the visual mapping text layer is measured by dot product distance, and the formula is defined as follows:
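A plausible reconstruction of the formula, assuming a plain dot product over the mapped features:

```latex
\langle v_i, l_j \rangle = v_i^{\top} l_j, \qquad i = 1,\dots,n_I, \quad j = 1,\dots,n_L
```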
where $v_i$ is a visual feature obtained through the regional visual feature extraction model and the visual mapping text layer, $l_j$ is the word vector feature of a single word extracted by the BERT word vector extraction model, $n_I$ is the number of image features, $n_L$ is the number of text features, and $\langle v_i, l_j \rangle$ is the distance measure between the feature $v_i$ and the feature $l_j$.
Further, in the pre-training process, the feature alignment mainly includes two parts, namely text image alignment and image text alignment, and the specific loss function is as follows:
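A plausible reconstruction of this loss, assuming the standard symmetric contrastive form implied by the terms defined below:

```latex
\mathcal{L}_{align} = -\log \frac{\exp\langle I, L \rangle_G}{\sum_{L' \in B_L} \exp\langle I, L' \rangle_G}
\;-\; \log \frac{\exp\langle I, L \rangle_G}{\sum_{I' \in B_I} \exp\langle I', L \rangle_G}
```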
where $I$ is the image input, $L$ is the text input, $\exp\langle I, L \rangle_G$ is the global image-text matching score, $B_L$ is the batch of text sequences, and $B_I$ is the batch of image sequences. $\exp\langle I, L' \rangle_G$ is the matching score between an image and a non-corresponding text sequence, and $\exp\langle I', L \rangle_G$ is the matching score between a text and a non-corresponding image sequence.
Further, in the pre-training process, image-text matching and text mask learning enable the model to run in a self-supervised manner; the formula is defined as follows:
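A plausible reconstruction, assuming the usual masked-language-modelling objective and an image-text matching score term (the exact form of the matching term is an assumption):

```latex
\mathcal{L}_{MLM} = -\,\mathbb{E}_{(w,I)\sim D}\, \log P_\theta\!\left(w_m \mid w_{\setminus m},\, I\right),
\qquad
\mathcal{L}_{ITM} = -\,\mathbb{E}_{(w,I)\sim D}\, \log S_\theta(w, I)
```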
where $w_m$ is the masked text block, $\mathbb{E}_{(w,I)\sim D}$ denotes the expectation over the data set, $P_\theta(w_m \mid w_{\setminus m}, I)$ is the conditional probability of the masked block given the remaining text $w_{\setminus m}$ and the image $I$ under the model parameters $\theta$, and $S_\theta$ is the image classification score generating function.
Further, in the global domain adaptive training in the training stage, the distance between the two domains is shortened to extract the domain invariant features under different domains, and the formula is defined as follows:
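A plausible reconstruction, assuming the standard binary cross-entropy form described below:

```latex
\mathcal{L}_{DA} = -\sum_{i} \Big[ (1 - D_i) \log\!\big(1 - \hat{D}_i\big) + D_i \log \hat{D}_i \Big]
```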
where $D_i = 0$ indicates that the $i$-th feature comes from the source domain, $D_i = 1$ indicates that it comes from the target domain, and $\hat{D}_i$ is the field feature prediction result; the whole loss function is a binary cross-entropy loss function.
Further, in the training phase, the class to which the word vector with the smallest distance belongs is selected as the classification of the feature, and the formula is defined as follows:
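A plausible reconstruction, assuming a softmax over dot-product scores against the class word vectors and the background embedding, with the prediction taken as the highest-scoring (smallest-distance) class:

```latex
p(k \mid v) = \frac{\exp\langle v, l_k \rangle}{\exp\langle v, e_B \rangle + \sum_{k'} \exp\langle v, l_{k'} \rangle},
\qquad
\hat{k} = \arg\max_k \, p(k \mid v)
```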
where $v$ is the image feature, $l_k$ is the BERT word vector of class $k$, and $e_B$ denotes the all-zero background embedding. $\langle v, l_k \rangle$ is the distance measure between the image feature $v$ and the text feature $l_k$, $\langle v, e_B \rangle$ is the distance measure between the image feature $v$ and the background feature $e_B$, and $\langle v, l_{k'} \rangle$ is the distance measure between the image feature $v$ and the different text features $l_{k'}$.
Further, the detection tasks of different classes are completed by replacing the classification heads with BERT word vectors of different classes.
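By way of illustration, a fixed classification head built from class-name word vectors might look like the following sketch (hypothetical helper; only the weight-replacement idea comes from the method). Calling it with the word vectors of new class names at test time is what yields open-world recognition:

```python
import torch
import torch.nn as nn

def build_word_vector_head(class_word_vecs: torch.Tensor) -> nn.Linear:
    """Build a fixed classification head from BERT class-name embeddings.

    class_word_vecs: (num_classes, d) tensor of BERT word vectors; an
    all-zero row is appended as the background embedding e_B.
    """
    d = class_word_vecs.size(1)
    weight = torch.cat([class_word_vecs, torch.zeros(1, d)], dim=0)
    head = nn.Linear(d, weight.size(0), bias=False)
    with torch.no_grad():
        head.weight.copy_(weight)
    head.weight.requires_grad_(False)  # word vectors stay fixed during training
    return head
```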
An open world target detection training device based on dictionary creation and field self-adaptation comprises a pre-training module, a training module and a test module;
the pre-training module is used for introducing the picture description data set and the multi-modal feature extraction network, aligning the text mode and the visual mode features output by the picture description data set and the multi-modal feature extraction network, introducing the multi-modal Transformer network to perform text matching learning and text mask learning of the image text, and enabling the whole pre-training model to run in a self-supervision mode;
the training module is used for transferring the parameters of the regional visual feature extraction model and the visual mapping text layer which are learned in the pre-training stage to a target detection model in the training stage, inputting pictures from two field data sets, wherein the source field pictures participate in target detection training, the target field features only participate in global field adaptive training, and the classifier weight of the detection head is replaced by fixed word vectors of known classes in the training process to carry out target detection training;
and the test module is used for replacing the detection model classification head with different fixed character characteristics.
A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.
The invention has the following beneficial effects:
(1) the open world target detection method based on dictionary creation and field self-adaptation can identify new classes in severe weather, effectively combines field self-adaptation with zero-sample identification, greatly improves the accuracy of zero-sample identification on a severe weather data set, and exceeds most field-self-adaptation-based methods on the known categories;
(2) the open world target detection method based on dictionary creation and field self-adaptation constructs a field-self-adaptation-based open world target detection model with the capability of detecting known classes and unknown classes in severe weather, which not only solves the poor generalization of single zero-sample detection across weather differences, but also solves the inability of field self-adaptation methods to detect new classes across knowledge differences.
Drawings
FIG. 1 is a diagram of an embodiment of an open world target detection pre-training model;
FIG. 2 is a diagram of an embodiment open world target detection training model.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
Example 1
Referring to fig. 1 to 2, this embodiment provides an open world target detection training method based on dictionary creation and field self-adaptation, which solves the problem that existing open world target detection is limited to a single scene. Meanwhile, the method can detect known classes and unknown classes in severe weather, which not only solves the poor generalization of single zero-sample detection across weather differences, but also solves the inability of field self-adaptive methods to detect new classes across knowledge differences.
(1) Pre-training phase
In order to establish an explicit relation between category texts and visual features, a picture description data set and a multi-modal feature extraction network are introduced, and the text-modality and visual-modality features they output are aligned. A multi-modal Transformer network is then introduced to perform image-text matching learning and text mask learning; its role is to make the whole pre-training model run in a self-supervised manner. The visual feature extraction network and the visual mapping text layer are the parts mainly learned in the pre-training stage, as shown in fig. 1.
During the whole pre-training process, we measure the relationship between the BERT word vector and the visual features output by the visual mapping text layer through the dot product distance, and the formula is defined as follows:
where $v_i$ is a visual feature obtained through the regional visual feature extraction model and the visual mapping text layer, $l_j$ is the word vector feature of a single word extracted by the BERT word vector extraction model, $n_I$ is the number of image features, $n_L$ is the number of text features, and $\langle v_i, l_j \rangle$ is the distance measure between the feature $v_i$ and the feature $l_j$. The specific loss function is as follows:
where $I$ is the image input, $L$ is the text input, $\exp\langle I, L \rangle_G$ is the global image-text matching score, $B_L$ is the batch of text sequences, and $B_I$ is the batch of image sequences. $\exp\langle I, L' \rangle_G$ is the matching score between an image and a non-corresponding text sequence, and $\exp\langle I', L \rangle_G$ is the matching score between a text and a non-corresponding image sequence.
The main purpose of image-text matching and text mask learning is to enable the model to run in a self-supervised manner, and the formula is defined as follows:
where $w_m$ is the masked text block, $\mathbb{E}_{(w,I)\sim D}$ denotes the expectation over the data set, $P_\theta(w_m \mid w_{\setminus m}, I)$ is the conditional probability of the masked block given the remaining text $w_{\setminus m}$ and the image $I$ under the model parameters $\theta$, and $S_\theta$ is the image classification score generating function.
(2) Training phase
In the training stage, the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage are transferred into the target detection model, and pictures from two field data sets are input. The source field data set carries target labeling information, while the target field data set carries only field label information; only the source field pictures participate in target detection training, and the target field features participate only in global field self-adaptive training. During training, the classifier weights of the detection head are replaced with fixed word vectors of the known classes and target detection training is carried out; the whole network model is shown in fig. 2.
The global domain adaptive loss mainly extracts domain-invariant features under different domains by shortening the distance between the two domains, and the formula is defined as follows:
where $D_i = 0$ indicates that the $i$-th feature comes from the source domain, $D_i = 1$ indicates that it comes from the target domain, and $\hat{D}_i$ is the field feature prediction result; the whole loss function is a binary cross-entropy loss function.
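A common way to realize such a domain-adversarial objective is a gradient reversal layer in front of a small domain classifier; the patent does not name the mechanism, so the following PyTorch sketch is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainClassifier(nn.Module):
    """Predicts the domain label (0 = source, 1 = target) from global features."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, feats, lam=1.0):
        return self.net(GradReverse.apply(feats, lam)).squeeze(-1)

def domain_loss(pred_logits, domain_labels):
    """Binary cross-entropy over domain predictions, matching the loss above."""
    return F.binary_cross_entropy_with_logits(pred_logits, domain_labels)
```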
In the final classification process, the distances between the output feature and the BERT word vectors of different words are compared, and the class to which the word vector with the smallest distance belongs is selected as the classification of the feature; the formula is defined as follows:
where $v$ is the image feature, $l_k$ is the BERT word vector of class $k$, and $e_B$ denotes the all-zero background embedding. $\langle v, l_k \rangle$ is the distance measure between the image feature $v$ and the text feature $l_k$, $\langle v, e_B \rangle$ is the distance measure between the image feature $v$ and the background feature $e_B$, and $\langle v, l_{k'} \rangle$ is the distance measure between the image feature $v$ and the different text features $l_{k'}$.
(3) Testing phase
Detection tasks for different categories can be completed simply by replacing the classification head with BERT word vectors of the corresponding categories; the rest of the test process is unchanged.
Experimental tests and results
The mAP@50 index is adopted to evaluate the model effect. mAP@50 measures the proportion of correctly predicted targets whose coordinate positions overlap the labeled positions by more than 50 percent, and it is the current mainstream target detection evaluation method. In the pre-training stage, we use a picture description data set (the COCO Caption data set) to establish an explicit mapping relationship between visual features and text features. In the training stage, the autonomous-driving Cityscapes data set and the FoggyCityscapes data set are introduced, with the following categories: Car, Person, Rider, Motor, Train, Truck, Bus, and Bike. Three groups of new classes are divided: Rider and Motor, Train and Truck, and Bus and Bike; the corresponding remaining categories are the old categories. Several sets of experimental settings were tested on the Cityscapes and FoggyCityscapes data sets, and analysis of the results shows that the method is superior to the current mainstream methods in detecting both new and old classes in severe weather. In table 1, on the FoggyCityscapes data set with Rider and Motor as the new categories, the mAP@50 values measured by the method for the new and old categories are 21.97 and 0.51 higher, respectively, than those of the latest methods, and the other unknown-category settings are likewise improved to different degrees. In table 2, under different new-class settings, the measured mAP@50 values of the old classes are higher than those of the latest methods by 3.05, 1.62 and 0.66, respectively, demonstrating the effectiveness of our method.
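For reference, the 50 percent overlap criterion in mAP@50 is intersection-over-union (IoU); a minimal sketch of the check applied to each predicted box:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A correctly classified prediction counts as a true positive at mAP@50 when
# iou(pred_box, gt_box) > 0.5.
```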
Meanwhile, the inventors compared the technical scheme of the application with the technical schemes disclosed in the related literature; the specific experimental results are shown in tables 1 and 2.
The relevant documents are as follows:
[1] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. 2018. Zero-shot object detection. In Proceedings of the European Conference on Computer Vision (ECCV). 384–400.
[2] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. 2018. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3339–3348.
[3] Jinhong Deng, Wen Li, Yuhua Chen, and Lixin Duan. 2021. Unbiased mean teacher for cross-domain object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4091–4101.
[4] Zhenwei He and Lei Zhang. 2019. Multi-adversarial faster-rcnn for unrestricted object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6668–6677.
[5] Congcong Li, Dawei Du, Libo Zhang, Longyin Wen, Tiejian Luo, Yanjun Wu, and Pengfei Zhu. 2020. Spatial attention pyramid network for unsupervised domain adaptation. In European Conference on Computer Vision. Springer, 481–497.
[6] Shafin Rahman, Salman Khan, and Nick Barnes. 2020. Improved visual-semantic alignment for zero-shot object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11932–11939.
[7] Kuniaki Saito, Yoshitaka Ushiku, Tatsuya Harada, and Kate Saenko. 2019. Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6956–6965.
[8] Zhiqiang Shen, Harsh Maheshwari, Weichen Yao, and Marios Savvides. 2019. SCL: Towards accurate domain adaptive object detection via gradient detach based stacked complementary losses. arXiv preprint arXiv:1911.02559 (2019).
[9] Vibashan VS, Vikram Gupta, Poojan Oza, Vishwanath A Sindagi, and Vishal M Patel. 2021. Mega-cda: Memory guided attention for category-aware unsupervised domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4516–4526.
[10] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. 2021. Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14393–14402.
table 1 New and old class detection results on FoggyCityscapes data set under different experimental settings
Table 2 old class detection results on FoggyCityscapes data set under different experimental settings
Example 2
Referring to fig. 1 to 2, this embodiment provides an open world target detection device based on dictionary creation and field self-adaptation, which solves the problem that current open world target detection is limited to a single scene. Meanwhile, the device can detect known classes and unknown classes in severe weather, which not only solves the poor generalization of single zero-sample detection across weather differences, but also solves the inability of field self-adaptive methods to detect new classes across knowledge differences. The detection device specifically comprises a pre-training module, a training module and a test module.
pre-training module
In order to establish an explicit relation between category texts and visual features, a picture description data set and a multi-modal feature extraction network are introduced, and the text-modality and visual-modality features they output are aligned. A multi-modal Transformer network is then introduced to perform image-text matching learning and text mask learning; its role is to make the whole pre-training model run in a self-supervised manner. The visual feature extraction network and the visual mapping text layer are the parts mainly learned in the pre-training stage, as shown in fig. 1.
During the whole pre-training process, we measure the relationship between the BERT word vector and the visual features output by the visual mapping text layer by the dot product distance $\langle \cdot, \cdot \rangle$, and the formula is defined as follows:
where $v_i$ is a visual feature obtained through the regional visual feature extraction model and the visual mapping text layer, $l_j$ is the word vector feature of a single word extracted by the BERT word vector extraction model, $n_I$ is the number of image features, $n_L$ is the number of text features, and $\langle v_i, l_j \rangle$ is the distance measure between the feature $v_i$ and the feature $l_j$. The specific loss function is as follows:
where $I$ is the image input, $L$ is the text input, $\exp\langle I, L \rangle_G$ is the global image-text matching score, $B_L$ is the batch of text sequences, and $B_I$ is the batch of image sequences. $\exp\langle I, L' \rangle_G$ is the matching score between an image and a non-corresponding text sequence, and $\exp\langle I', L \rangle_G$ is the matching score between a text and a non-corresponding image sequence.
The main purpose of image-text matching and text mask learning is to enable the model to run in a self-supervised manner, and the formula is defined as follows:
where $w_m$ is the masked text block, $\mathbb{E}_{(w,I)\sim D}$ denotes the expectation over the data set, $P_\theta(w_m \mid w_{\setminus m}, I)$ is the conditional probability of the masked block given the remaining text $w_{\setminus m}$ and the image $I$ under the model parameters $\theta$, and $S_\theta$ is the image classification score generating function.
Training module
In the training stage, the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage are transferred into the target detection model, and pictures from two field data sets are input. The source field data set carries target labeling information, while the target field data set carries only field label information; only the source field pictures participate in target detection training, and the target field features participate only in global field self-adaptive training. During training, the classifier weights of the detection head are replaced with fixed word vectors of the known classes and target detection training is carried out; the whole network model is shown in fig. 2.
The global domain self-adaptive loss mainly extracts domain-invariant features under different domains by shortening the distance between the two domains, and the formula is defined as follows:
where $D_i = 0$ indicates that the $i$-th feature comes from the source domain, $D_i = 1$ indicates that it comes from the target domain, and $\hat{D}_i$ is the field feature prediction result; the whole loss function is a binary cross-entropy loss function.
In the final classification process, the distances between the output feature and the BERT word vectors of different words are compared, and the class to which the word vector with the smallest distance belongs is selected as the classification of the feature; the formula is defined as follows:
where $v$ is the image feature, $l_k$ is the BERT word vector of class $k$, and $e_B$ denotes the all-zero background embedding. $\langle v, l_k \rangle$ is the distance measure between the image feature $v$ and the text feature $l_k$, $\langle v, e_B \rangle$ is the distance measure between the image feature $v$ and the background feature $e_B$, and $\langle v, l_{k'} \rangle$ is the distance measure between the image feature $v$ and the different text features $l_{k'}$.
Test module
Detection tasks for different categories can be completed simply by replacing the classification head with BERT word vectors of the corresponding categories; the rest of the test process is unchanged.
Example 3
The embodiment also provides a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of the open world object detection method based on dictionary creation and domain adaptation.
The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), magnetic memory, magnetic disks, optical disks, and the like. In some embodiments, the memory may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) provided on the computer device. Of course, the memory may also include both internal and external storage devices of the computer device. In this embodiment, the memory is used to store the operating system and various types of application software installed in the computer device, such as the program code for running the open world target detection training method based on dictionary creation and field self-adaptation. Further, the memory may be used to temporarily store various types of data that have been output or are to be output.
The processor may in some embodiments be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to run the program code stored in the memory or to process data, for example to run the program code of the open world target detection training method based on dictionary creation and field self-adaptation.
Example 4
The present embodiment also provides a computer-readable storage medium which stores a computer program; when the computer program is executed by a processor, the processor executes the steps of the above open world target detection method based on dictionary creation and field self-adaptation.
The computer-readable storage medium stores an interface display program executable by at least one processor, so as to cause the at least one processor to perform the steps of the open world target detection training method based on dictionary creation and field self-adaptation.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
Claims (10)
1. The open world target detection training method based on dictionary creation and field self-adaptation is characterized by comprising the following steps of:
step one, acquiring image sample data and picture description data corresponding to the image sample data;
step two, constructing a regional visual feature extraction model, a visual mapping text layer, a BERT word vector extraction model and a multi-modal Transformer network model;
step three, in the pre-training stage, inputting the image sample data from step one into the regional visual feature extraction model, the output of the regional visual feature extraction model serving as the input of the visual mapping text layer; inputting the picture description data from step one into the BERT word vector extraction model; and, after feature alignment, inputting the output of the BERT word vector extraction model and the output of the visual mapping text layer into the multi-modal Transformer network model for image-text matching learning and text mask learning;
step four, in the training stage, transferring the parameters of the regional visual feature extraction model and the visual mapping text layer learned in the pre-training stage into the visual feature extraction module, and inputting pictures from two field data sets, wherein the source field pictures participate in target detection training and the target field pictures participate only in global field self-adaptive training; during training, the classifier weights of the detection head are replaced with fixed word vectors of the known classes, and target detection training is carried out.
2. The open world target detection training method based on dictionary creation and field adaptation as claimed in claim 1, wherein in the pre-training process, the relationship between the BERT word vector and the visual features output by the visual mapping text layer is measured by dot product distance, and the formula is defined as follows:
wherein $v_i$ is a visual feature obtained through the regional visual feature extraction model and the visual mapping text layer, $l_j$ is the word vector feature of a single word extracted by the BERT word vector extraction model, $n_I$ is the number of image features, $n_L$ is the number of text features, and $\langle v_i, l_j \rangle$ is the distance measure between the feature $v_i$ and the feature $l_j$.
3. The open world target detection training method based on dictionary creation and field adaptation as claimed in claim 1, wherein in the pre-training process, the feature alignment mainly comprises two parts of text image alignment and image text alignment, and the specific loss function is as follows:
wherein $I$ is the image input, $L$ is the text input, $\exp\langle I, L \rangle_G$ is the global image-text matching score, $B_L$ is the batch of text sequences, and $B_I$ is the batch of image sequences. $\exp\langle I, L' \rangle_G$ is the matching score between an image and a non-corresponding text sequence, and $\exp\langle I', L \rangle_G$ is the matching score between a text and a non-corresponding image sequence.
4. The open world target detection training method based on dictionary creation and field adaptation as claimed in claim 1, wherein in the pre-training process, the formula for the model to run in a self-supervised manner by image-text matching and text mask learning is defined as follows:
wherein $w_m$ is the masked text block, $\mathbb{E}_{(w,I)\sim D}$ denotes the expectation over the data set, $P_\theta(w_m \mid w_{\setminus m}, I)$ is the conditional probability of the masked block given the remaining text $w_{\setminus m}$ and the image $I$ under the model parameters $\theta$, and $S_\theta$ is the image classification score generating function.
5. The open world target detection training method based on dictionary creation and domain adaptation as claimed in claim 1, wherein in the global domain adaptation training of the training stage, the distance between two domains is reduced to extract domain invariant features under different domains, and the formula is defined as follows:
6. The open world target detection training method based on dictionary creation and field adaptation as claimed in claim 1, wherein in the training phase, the class to which the word vector with the smallest distance belongs is selected as the classification of the feature, and the formula is defined as follows:
wherein $v$ is the image feature, $l_k$ is the BERT word vector of class $k$, and $e_B$ denotes the all-zero background embedding. $\langle v, l_k \rangle$ is the distance measure between the image feature $v$ and the text feature $l_k$, $\langle v, e_B \rangle$ is the distance measure between the image feature $v$ and the background feature $e_B$, and $\langle v, l_{k'} \rangle$ is the distance measure between the image feature $v$ and the different text features $l_{k'}$.
7. The open-world target detection training method based on dictionary creation and domain adaptation as claimed in claim 1, wherein different classes of detection tasks are performed by replacing classification heads with different classes of BERT word vectors.
8. An open world target detection training device based on dictionary creation and field self-adaptation is characterized by comprising a pre-training module and a training module;
the pre-training module is used for introducing the picture description data set and the multi-modal feature extraction network, aligning the text mode and the visual mode features output by the picture description data set and the multi-modal feature extraction network, introducing the multi-modal Transformer network to perform text matching learning and text mask learning of the image text, and enabling the whole pre-training model to run in a self-supervision mode;
and the training module is used for transferring the parameters of the regional visual feature extraction model and the visual mapping text layer which are learned in the pre-training stage to the target detection model in the training stage, inputting pictures from two field data sets, wherein the source field pictures participate in the target detection training, the target field features only participate in the global field self-adaptive training, and the classifier weight of the detection head is replaced by a fixed word vector of a known class in the training process to perform the target detection training.
9. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210811954.5A | 2022-07-11 | 2022-07-11 | Open world target detection training method based on dictionary creation and field self-adaptation |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN115099358A | 2022-09-23 |

Family ID: 83297737

Cited By (4)

| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN117893876A | 2024-01-08 | 2024-04-16 | Zero sample training method and device, storage medium and electronic equipment |
| CN117576982A | 2024-01-16 | 2024-02-20 | Spoken language training method and device based on ChatGPT, electronic equipment and medium |
| CN117576982B | 2024-01-16 | 2024-04-02 | Spoken language training method and device based on ChatGPT, electronic equipment and medium |
| CN117852624A | 2024-03-08 | 2024-04-09 | Training method, prediction method, device and equipment of time sequence signal prediction model |
Similar Documents

| Publication | Title |
|---|---|
| CN115099358A | Open world target detection training method based on dictionary creation and field self-adaptation |
| Zuo et al. | Natural scene text recognition based on encoder-decoder framework |
| CN110704633A | Named entity recognition method and device, computer equipment and storage medium |
| CN111079785A | Image identification method and device and terminal equipment |
| CN108959474B | Entity relation extraction method |
| CN112613293B | Digest generation method, digest generation device, electronic equipment and storage medium |
| CN113158656B | Ironic content recognition method, ironic content recognition device, electronic device, and storage medium |
| CN117197904B | Training method of human face living body detection model, human face living body detection method and human face living body detection device |
| WO2022126917A1 | Deep learning-based face image evaluation method and apparatus, device, and medium |
| WO2023038722A1 | Entry detection and recognition for custom forms |
| CN114724145A | Character image recognition method, device, equipment and medium |
| EP3913533A2 | Method and apparatus of processing image device and medium |
| CN114495113A | Text classification method and training method and device of text classification model |
| CN114511857A | OCR recognition result processing method, device, equipment and storage medium |
| CN106709490B | Character recognition method and device |
| CN117173154A | Online image detection system and method for glass bottle |
| CN115618019A | Knowledge graph construction method and device and terminal equipment |
| CN114565759A | Image semantic segmentation model optimization method and device, electronic equipment and storage medium |
| CN113111833B | Safety detection method and device of artificial intelligence system and terminal equipment |
| Li et al. | A Survey of Text Detection Algorithms in Images Based on Deep Learning |
| CN117421244B | Multi-source cross-project software defect prediction method, device and storage medium |
| CN113139187B | Method and device for generating and detecting pre-training language model |
| CN116052220B | Pedestrian re-identification method, device, equipment and medium |
| CN116012656B | Sample image generation method and image processing model training method and device |
| CN114005005B | Double-batch standardized zero-instance image classification method |
Legal Events

| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |