CN113344121A - Method for training signboard classification model and signboard classification - Google Patents
- Publication number: CN113344121A (application CN202110723347.9A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- image
- prediction result
- signboard
- loss value
- Prior art date
- Legal status: Granted (the listed status is an assumption by Google, not a legal conclusion)
Classifications
- G06F18/24: Pattern recognition; analysing; classification techniques
- G06F40/30: Handling natural language data; semantic analysis
- G06N3/045: Neural networks; combinations of networks
- G06N3/084: Learning methods; backpropagation, e.g. using gradient descent
Abstract
The disclosure provides a method and apparatus for training a signboard classification model and for signboard classification, relating to the field of artificial intelligence, in particular to computer vision and deep learning technology, and applicable to intelligent transportation scenarios. The specific implementation scheme is as follows: obtain a sample set, where each sample comprises an image, semantic information, and a sample label; select samples from the sample set; input the image and semantic information of a selected sample into the signboard classification model to obtain a first prediction result based on image features, a second prediction result based on semantic features, and a third prediction result based on the fused features of the image and semantic features; calculate a total loss value based on the three prediction results and the sample label; and, if the total loss value is smaller than a predetermined threshold, determine that training of the signboard classification model is complete. The embodiment produces a signboard classification model capable of detecting invalid signboards and improves the accuracy of signboard classification.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to computer vision and deep learning techniques, which can be used in intelligent traffic scenarios.
Background
A POI (point of interest) is important for map capabilities such as location retrieval, navigation, and positioning, and is a basic support of local life services. The traditional POI collection mode relies on manual work, which is both inefficient and costly. To reduce cost, improve efficiency, and enable real-time updating, vehicle-mounted imagery has become the main data source for automatic POI updating.
In the POI data generation process, signboard detection is a key link, and efficiently extracting valid signboards has long been the bottleneck of the generation pipeline. Detection-based signboard extraction schemes improve extraction efficiency, but because the data come from different sources with different distributions, a large number of non-signboards are produced during detection. These non-signboards mainly fall into categories such as occluded signs, blurred signs, and billboards.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for training a signboard classification model and for signboard classification.
According to a first aspect of the present disclosure, there is provided a method of training a signboard classification model, comprising: obtaining a sample set, wherein each sample in the sample set comprises an image, semantic information, and a sample label; and performing the following training steps: selecting samples from the sample set; inputting the image and semantic information of a selected sample into the signboard classification model to obtain a first prediction result based on image features, a second prediction result based on semantic features, and a third prediction result based on the fused features of the image and semantic features; calculating a total loss value based on the first prediction result, the second prediction result, the third prediction result, and the sample label; and, if the total loss value is smaller than a predetermined threshold, determining that training of the signboard classification model is complete.
According to a second aspect of the present disclosure, there is provided a method of signboard classification, comprising: performing character recognition on a detected signboard picture to obtain character information; inputting the signboard picture and the character information into a signboard classification model trained according to the method of the first aspect, obtaining an image score and a semantic score; and outputting, based on the image score and the semantic score, a probability that the signboard picture is valid.
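The inference flow of the second aspect (character recognition, then dual-modal scoring, then combination) can be sketched as follows. This is a minimal sketch: the function names and the score-combination rule (a plain average) are assumptions for illustration, not specified by the disclosure.

```python
def classify_signboard(image, ocr_fn, model):
    """Hypothetical inference sketch for the second aspect.

    ocr_fn: character recognition over the detected sign crop -> text.
    model:  trained signboard classifier -> (image_score, semantic_score).
    Returns a validity probability for the signboard picture.
    """
    text = ocr_fn(image)                          # character recognition step
    image_score, semantic_score = model(image, text)
    # Assumed combination rule: average the two modal scores.
    return 0.5 * (image_score + semantic_score)
```

In practice the combination could also be a learned head or a weighted sum; the disclosure only states that the probability is output "based on the image score and the semantic score".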
According to a third aspect of the present disclosure, there is provided an apparatus for training a signboard classification model, comprising: an obtaining unit configured to obtain a sample set, wherein each sample in the sample set comprises an image, semantic information, and a sample label; and a training unit configured to perform the following training steps: selecting samples from the sample set; inputting the image and semantic information of a selected sample into the signboard classification model to obtain a first prediction result based on image features, a second prediction result based on semantic features, and a third prediction result based on the fused features of the image and semantic features; calculating a total loss value based on the first prediction result, the second prediction result, the third prediction result, and the sample label; and, if the total loss value is smaller than a predetermined threshold, determining that training of the signboard classification model is complete.
According to a fourth aspect of the present disclosure, there is provided an apparatus for signboard classification, comprising: a recognition unit configured to perform character recognition on a detected signboard picture to obtain character information; a classification unit configured to input the signboard picture and the character information into a signboard classification model trained by the apparatus of the third aspect, obtaining an image score and a semantic score; and an output unit configured to output, based on the image score and the semantic score, a probability that the signboard picture is valid.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor. And a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first or second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first or second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first or second aspect.
The method and apparatus for training a signboard classification model and for signboard classification provided by the embodiments of the present disclosure classify not only by the two individual modal features (image features and semantic features) but also by their fused features. Effectively fusing the image and semantic features improves the accuracy of signboard classification and gives the model better robustness against non-signboard cases in the signboard classification scenario, such as occlusion, blur, and billboards.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of training a sign classification model according to the present disclosure;
FIG. 3 is a schematic illustration of one application scenario of a method of training a sign classification model according to the present disclosure;
FIG. 4 is a flow diagram of one embodiment of a method of sign sorting according to the present disclosure;
FIG. 5 is a schematic block diagram of one embodiment of an apparatus for training a sign classification model according to the present disclosure;
FIG. 6 is a schematic structural diagram of one embodiment of an apparatus for sign sorting according to the present disclosure;
FIG. 7 is a schematic block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates an exemplary system architecture 100 to which a method of training a sign classification model, an apparatus to train a sign classification model, a method of sign classification, or an apparatus of sign classification of embodiments of the present application may be applied.
As shown in fig. 1, system architecture 100 may include unmanned vehicles (also known as autonomous vehicles) 101, 102, a network 103, a database server 104, and a server 105. Network 103 is the medium used to provide communication links between the unmanned vehicles 101, 102, database server 104, and server 105. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The unmanned vehicles 101 and 102 are provided therein with driving control equipment and equipment for collecting point cloud data, such as a laser radar and a millimeter wave radar. The driving control equipment (also called vehicle-mounted brain) is responsible for intelligent control of the unmanned vehicle. The driving control device may be a Controller separately arranged, such as a Programmable Logic Controller (PLC), a single chip microcomputer, an industrial Controller, and the like; or the equipment consists of other electronic devices which have input/output ports and have the operation control function; but also a computer device installed with a vehicle driving control type application.
It should be noted that, in practice, the unmanned vehicle may also be equipped with at least one sensor, such as a camera, a gravity sensor, a wheel speed sensor, and the like. In some cases, the unmanned vehicle may further include GNSS (Global Navigation Satellite System) equipment, SINS (Strap-down Inertial Navigation System), and the like.
The server 105 may also be a server that provides various services, such as a background server that provides support for various applications displayed on the unmanned vehicles 101, 102. The background server may train the initial model using samples in the sample set collected by the unmanned vehicles 101, 102, and may send the training results (e.g., the generated signboard classification model) to the unmanned vehicles 101, 102. In this way, the unmanned vehicle can classify the signboards by applying the generated signboard classification model, so that the unmanned vehicle can detect whether the signboards are effective signboards or not and filter out the blocked signboards, unclear signboards, advertising boards and other ineffective signboards.
Here, the database server 104 and the server 105 may be hardware or software. When they are hardware, they can be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein. Database server 104 and server 105 may also be servers of a distributed system or servers that incorporate a blockchain. Database server 104 and server 105 may also be cloud servers, or smart cloud computing servers or smart cloud hosts with artificial intelligence technology.
It should be noted that the method for training the signboard classification model or the method for signboard classification provided in the embodiment of the present application is generally performed by the server 105. Accordingly, means for training a sign classification model or means for sign classification are also typically provided in the server 105. The method of sign sorting may also be performed by an unmanned vehicle.
It is noted that database server 104 may not be provided in system architecture 100, as server 105 may perform the relevant functions of database server 104.
It should be understood that the number of unmanned vehicles, networks, database servers, and servers in fig. 1 are merely illustrative. There may be any number of unmanned vehicles, networks, database servers, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of training a sign classification model according to the present application is shown. The method of training a sign classification model may comprise the steps of:
Step 201, a sample set is obtained.
In this embodiment, the execution subject of the method of training the signboard classification model (e.g., server 105 shown in FIG. 1) may obtain the sample set in a variety of ways. For example, the execution subject may obtain an existing sample set from a database server (e.g., database server 104 shown in FIG. 1) via a wired or wireless connection. As another example, a user may collect samples via an unmanned vehicle (e.g., unmanned vehicles 101, 102 shown in FIG. 1); the execution subject may then receive the samples collected by the unmanned vehicle and store them locally, thereby generating the sample set.
Each sample in the sample set comprises an image, semantic information, and a sample label. Because the scenario is approached from the perspective of multiple modalities, data of two modalities are mainly used for signboard classification: one is the image of the signboard, the other is its semantic information, which is mainly the character (and possibly symbol) content on the signboard image. The sample label identifies whether the sample is a positive sample or a negative sample:
1. Positive sample definition:
- For the image modality, clear, unoccluded, non-billboard samples are defined as positive samples.
- For the semantic modality, common signboard names in the signboard library are taken as positive samples, such as: xx casserole porridge, hot and sour noodles, Chinese yy bank, and the like.
2. Negative sample definition:
- For the image modality, blurred, occluded, and non-signboard samples are taken as negative samples.
- For the semantic modality, signboard names not in the signboard library are taken as negative samples.
Based on the existing POI production flow and the above definitions of positive and negative samples, annotating the data is relatively simple.
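A minimal sketch of how such samples might be represented in code, with hypothetical records following the positive/negative definitions above (the field names and example values are illustrative, not taken from the disclosure):

```python
# Hypothetical sample records: image path, OCR text, and a binary label
# (1 = positive per the definitions above, 0 = negative).
sample_set = [
    # Positive: clear, unoccluded sign whose name is in the signboard library.
    {"image": "xx_bank_front.jpg", "semantic": "Chinese yy bank", "label": 1},
    # Negative: blurred crop whose text is not a signboard-library name.
    {"image": "blurred_crop.jpg", "semantic": "50% off today", "label": 0},
]

def positives(samples):
    """Filter out the positive samples of a sample set."""
    return [s for s in samples if s["label"] == 1]
```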
At step 202, a sample is selected from a sample set.
In this embodiment, the execution subject may select a sample from the sample set obtained in step 201 and then perform training steps 203 to 206. The manner of selection and the number of samples are not limited in the present application. For example, samples may be selected at random, or samples with clearer pictures or richer semantic information may be preferred.
Step 203, the image and semantic information in the selected sample are input into the signboard classification model to obtain the first, second, and third prediction results.
In this embodiment, the signboard classification model may include three classification submodels: an image classification submodel, a semantic classification submodel, and a fusion classification submodel. The input of the image classification submodel is only the image; by extracting and recognizing image features, the probability that the image is a valid signboard (clear, unoccluded, not a billboard) is obtained as the first prediction result. The input of the semantic classification submodel is only the semantic information; by extracting and recognizing semantic features, the probability that the semantic information belongs to a valid signboard (a common signboard name in the signboard library) is obtained as the second prediction result. The inputs of the fusion classification submodel are the image features produced internally by the image classification submodel and the semantic features produced internally by the semantic classification submodel; the fusion classification submodel fuses them into fused features and recognizes these to obtain, as the third prediction result, the probability that the image and the semantic information match.
And step 204, calculating a total loss value based on the first prediction result, the second prediction result, the third prediction result and the sample label.
In this embodiment, the three classification submodels correspond to three loss values, as follows:
1. The first loss value (image loss), corresponding to the image classification submodel, mainly learns from the image features whether the current image is a valid signboard, and is calculated from the difference between the first prediction result and the sample label. For example, if the sample label is 1 (positive sample) and the first prediction result is 0.9, the first loss value is 0.1. The image classification submodel may be trained with Cross Entropy Loss.
2. The second loss value (semantic loss), corresponding to the semantic classification submodel, mainly learns from the semantic perspective whether the content in the current image is valid signboard content; negative examples such as billboards can thus be removed by the classification results of the semantic classification submodel. For example, if the sample label is 0 (negative sample) and the second prediction result is 0.2, the second loss value is 0.2. The semantic classification submodel may be trained with Cross Entropy Loss.
3. The third loss value (fusion loss), corresponding to the fusion classification submodel, enhances feature learning across the two modalities. The label for the fusion classification submodel is generated during training: if the current sample image and the text in the image correspond, the label is 1 (learned as a positive sample); otherwise it is 0 (learned as a negative sample). The purpose is mainly to distinguish whether the image and the text come from the same image, i.e., whether they match. The fusion classification submodel can be trained with Binary Cross Entropy Loss.
Finally, the weighted sum of the first, second, and third loss values is calculated as the total loss value. The weight of each submodel's loss value can be set according to actual requirements. For example, if the image classification submodel has the highest accuracy, its weight may be set largest; and since the fusion classification submodel converges slowly, the weight of the third loss value may be set smallest.
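The weighted total loss described above can be sketched as follows. This is a minimal sketch under stated assumptions: binary cross-entropy is used for all three branches, and the concrete weights 0.4/0.4/0.2 are illustrative, chosen only so that the image and semantic losses are equal and larger than the fusion loss.

```python
import math

def bce(pred, label):
    """Binary cross-entropy for one scalar prediction in (0, 1)."""
    return -(label * math.log(pred) + (1 - label) * math.log(1 - pred))

def total_loss(p_img, p_sem, p_fuse, y_img, y_sem, y_pair,
               w_img=0.4, w_sem=0.4, w_fuse=0.2):
    """Weighted sum of the three submodel losses (weights are illustrative:
    image and semantic weights equal, fusion weight smallest)."""
    return (w_img * bce(p_img, y_img)
            + w_sem * bce(p_sem, y_sem)
            + w_fuse * bce(p_fuse, y_pair))
```

With `p_img = p_sem = p_fuse = 0.9` and all labels 1, the total loss is simply `-ln(0.9)` because the weights sum to 1.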
And step 205, if the total loss value is smaller than the preset threshold value, determining that the signboard classification model is trained completely.
In this embodiment, when the total loss value is less than the predetermined threshold, the predicted value may be considered to be close to or approximate the true value. The predetermined threshold may be set according to actual requirements. And if the total loss value is less than the preset threshold value, the signboard classification model training is finished.
In this embodiment, if the total loss value is not less than the predetermined threshold, which indicates that the training of the signboard classification model is not completed, the relevant parameters of the signboard classification model are adjusted, for example, the weights in the image classification sub-model, the semantic classification sub-model and the fusion classification sub-model in the signboard classification model are modified by using a back propagation technique. And may return to step 202 to re-select samples from the sample set. So that the training step can be continued based on the adjusted signboard classification model.
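The select/compute-loss/compare-threshold/back-propagate cycle of steps 202 to 206 can be sketched as a generic loop; all callables here are assumptions standing in for the model-specific pieces (sampling strategy, loss computation, and parameter update via backpropagation):

```python
def train(model, sample_set, select_fn, loss_fn, update_fn,
          threshold, max_steps=1000):
    """Sketch of the training loop: select samples, compute the total loss,
    stop when it drops below the threshold, otherwise update the model
    parameters (back-propagation) and repeat."""
    loss = float("inf")
    for step in range(max_steps):
        batch = select_fn(sample_set)      # step 202: select samples
        loss = loss_fn(model, batch)       # steps 203-204: predict, total loss
        if loss < threshold:               # step 205: training complete
            return model, loss
        update_fn(model, batch)            # step 206: adjust parameters
    return model, loss
```

A toy run, with a "model" whose loss halves on every update, converges once the loss drops below the threshold.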
The method for training the signboard classification model provided by the embodiments of the present application is inspired by the human visual system: because humans distinguish a valid signboard using both image features and semantic features, the method fuses the features of multiple modalities, giving it better robustness against occlusion, blur, billboards, and other non-signboard cases in the signboard classification scenario.
In some optional implementations of this embodiment, inputting the image and semantic information in the selected sample into the signboard classification model to obtain the first, second, and third prediction results includes: extracting image features from the image in the selected sample through an image feature extraction network; passing the image features through an image fully-connected layer to obtain an image representation; extracting semantic features from the semantic information in the selected sample through a semantic feature extraction network; passing the semantic features through a semantic fully-connected layer to obtain a semantic representation; passing the image features and the semantic features through a shared fully-connected layer and fusing them to obtain a shared representation; concatenating the image representation with the shared representation and inputting the result into a first classifier to obtain the first prediction result; concatenating the semantic representation with the shared representation and inputting the result into a second classifier to obtain the second prediction result; and inputting the shared representation into a third classifier to obtain the third prediction result.
Given an image and its semantic information, each passes through a different backbone network (the image feature extraction network and the semantic feature extraction network; the backbone weights are not shared because the information distributions of the two modalities differ greatly, and sharing them would make training hard to converge), yielding the feature vectors of the two modalities. For the image branch, the output of the independent image-unique FC (image fully-connected layer) is concatenated with the output of the share FC (shared fully-connected layer), and the branch is trained with Cross Entropy Loss; similarly, for the semantic branch, the output of the semantic-unique FC (semantic fully-connected layer) is concatenated with the share FC output and trained with Cross Entropy Loss. The middle shared representation, produced by passing the features of both modalities through the share FC, is trained with Binary Cross Entropy Loss.
The image feature extraction network may be a common network structure such as ResNet-50 or VGG. The semantic feature extraction network may be a text encoder commonly used in natural language processing, such as a Transformer.
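A minimal numerical sketch of the branch structure described above. The two unshared backbones are replaced here by raw feature vectors, and the dimensions, the tanh activation, and the additive fusion inside the shared layer are illustrative assumptions, not details fixed by the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, w):
    """Fully-connected layer with tanh activation (bias omitted for brevity)."""
    return np.tanh(x @ w)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative dimensions (the patent does not fix them).
D_IMG, D_SEM, D_REP = 8, 6, 4
W_img_fc  = rng.normal(size=(D_IMG, D_REP))   # image-unique FC
W_sem_fc  = rng.normal(size=(D_SEM, D_REP))   # semantic-unique FC
W_share_i = rng.normal(size=(D_IMG, D_REP))   # share FC, image side
W_share_s = rng.normal(size=(D_SEM, D_REP))   # share FC, semantic side
w_cls1 = rng.normal(size=2 * D_REP)           # first classifier (image branch)
w_cls2 = rng.normal(size=2 * D_REP)           # second classifier (semantic branch)
w_cls3 = rng.normal(size=D_REP)               # third classifier (fusion)

def forward(img_feat, sem_feat):
    """Forward pass: unique FCs, shared fusion, three classifier heads."""
    img_rep = fc(img_feat, W_img_fc)                              # image representation
    sem_rep = fc(sem_feat, W_sem_fc)                              # semantic representation
    shared  = fc(img_feat, W_share_i) + fc(sem_feat, W_share_s)   # shared representation (assumed additive fusion)
    p1 = sigmoid(np.concatenate([img_rep, shared]) @ w_cls1)      # first prediction
    p2 = sigmoid(np.concatenate([sem_rep, shared]) @ w_cls2)      # second prediction
    p3 = sigmoid(shared @ w_cls3)                                 # third prediction
    return p1, p2, p3
```

In a real implementation the backbones (e.g. a ResNet-50 and a Transformer text encoder) would produce `img_feat` and `sem_feat`, and all weights would be learned by backpropagation of the weighted total loss.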
Correspondingly, the relevant parameters of the signboard classification model to be adjusted are those of the image feature extraction network, the semantic feature extraction network, the image fully-connected layer, the semantic fully-connected layer, the shared fully-connected layer, the first classifier, the second classifier, and the third classifier. By training the signboard classification model in this way, invalid signs such as billboards, occluded signs, and blurred signs can be accurately identified, while valid signboards are retained. This provides valid data for signboard character recognition and filters out invalid data, thereby speeding up POI recognition.
In some optional implementations of this embodiment, calculating the total loss value based on the first prediction result, the second prediction result, the third prediction result, and the sample label includes: calculating a first loss value based on the difference between the first prediction result and the sample label; calculating a second loss value based on the difference between the second prediction result and the sample label; calculating a third loss value based on the difference between the third prediction result and the sample label; and calculating the weighted sum of the first, second, and third loss values as the total loss value. The difference between the first prediction result and the sample label is the difference between the image-based classification result and the true image label (1 for a positive sample, 0 for a negative sample). The difference between the second prediction result and the sample label is the difference between the semantics-based classification result and the true semantic label (1 for a positive sample, 0 for a negative sample). The difference between the third prediction result and the sample label is the difference between the image-to-semantics matching result and the true pair label (1 for a positive sample, 0 for a negative sample). Taking the weighted sum of the three loss values lets the trained model consult the image and the semantic information simultaneously, making signboard classification more accurate than classification by a single modality and avoiding false detection of images such as billboards as signboards.
In some optional implementations of this embodiment, the weight of the first loss value is the same as the weight of the second loss value, and both are greater than the weight of the third loss value. Setting the influence of the image and of the semantic information on the model to be equal, and larger than the influence of their fusion, can speed up the convergence of the model and improve the accuracy of model classification.
With further reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method of training a signboard classification model according to this embodiment. In the application scenario of fig. 3, a sample is randomly selected from the sample set; the sample includes a picture, the semantic information detected by the signboard detection network ("the farm department"), and a label noting whether the picture is a valid signboard (the signboard in this example has blurred characters and is therefore invalid). The picture is input into the image feature extraction network of the signboard classification model to extract image features, and the semantic information is input into the semantic feature extraction network to extract semantic features. The image features pass through the image fully connected layer to obtain the image representation, and the image features together with the semantic features pass through the shared fully connected layer to obtain the shared representation; the image representation and the shared representation are concatenated and fed to the first classifier to obtain the first prediction result, which is compared with the sample label to calculate the first loss value. The semantic features pass through the semantic fully connected layer to obtain the semantic representation; the semantic representation and the shared representation are concatenated and fed to the second classifier to obtain the second prediction result, which is compared with the sample label to calculate the second loss value. The shared representation is fed to the third classifier to obtain the third prediction result, which is compared with the sample label to calculate the third loss value.
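The forward pass of fig. 3 can be illustrated with a minimal NumPy sketch. All dimensions (2048-d image features, 768-d semantic features, 256-d representations), the bias-free linear layers, and the sigmoid classifiers are assumptions made only to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, w):
    """Stand-in for a fully connected layer (bias and activation omitted for brevity)."""
    return x @ w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed dimensions: 2048-d image features, 768-d semantic features, 256-d representations.
W_img = rng.standard_normal((2048, 256)) * 0.01          # image fully connected layer
W_sem = rng.standard_normal((768, 256)) * 0.01           # semantic fully connected layer
W_shared = rng.standard_normal((2048 + 768, 256)) * 0.01 # shared fully connected layer
W_cls1 = rng.standard_normal((512, 1)) * 0.01            # first classifier: image || shared
W_cls2 = rng.standard_normal((512, 1)) * 0.01            # second classifier: semantic || shared
W_cls3 = rng.standard_normal((256, 1)) * 0.01            # third classifier: shared only

img_feat = rng.standard_normal(2048)  # output of the image feature extraction network
sem_feat = rng.standard_normal(768)   # output of the semantic feature extraction network

img_repr = fc(img_feat, W_img)                               # image representation
sem_repr = fc(sem_feat, W_sem)                               # semantic representation
shared = fc(np.concatenate([img_feat, sem_feat]), W_shared)  # shared representation

p1 = sigmoid(fc(np.concatenate([img_repr, shared]), W_cls1))  # first prediction result
p2 = sigmoid(fc(np.concatenate([sem_repr, shared]), W_cls2))  # second prediction result
p3 = sigmoid(fc(shared, W_cls3))                              # third prediction result
```

The concatenations mirror the cascades in fig. 3: the first classifier sees the image representation plus the shared representation, the second sees the semantic representation plus the shared representation, and the third sees the shared representation alone.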
A total loss value is calculated based on the first loss value, the second loss value, and the third loss value. If the total loss value is smaller than the preset threshold, training of the signboard classification model is complete. Otherwise, the relevant parameters of the signboard classification model are adjusted, a new sample is selected, and training continues until the total loss value converges below the preset threshold.
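The stopping rule above — finish when the total loss falls below a preset threshold, otherwise adjust parameters and resample — can be sketched as a generic loop. The `toy_step` model below is a hypothetical stand-in whose loss simply halves on each call, mimicking convergence:

```python
def train(model_step, threshold=0.05, max_steps=1000):
    """Run training steps until the total loss drops below the threshold."""
    loss = float("inf")
    for step in range(max_steps):
        loss = model_step()   # forward pass + total loss on a freshly selected sample
        if loss < threshold:
            return step, loss  # training of the classification model is complete
        # otherwise: adjust relevant parameters (e.g. one optimizer step) and resample
    return max_steps, loss

# Toy stand-in for the model: its loss halves each call.
state = {"loss": 1.0}
def toy_step():
    state["loss"] *= 0.5
    return state["loss"]
```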
Referring to fig. 4, a flow 400 of one embodiment of a method of signboard classification provided herein is shown. The method of signboard classification may comprise the following steps:
Step 401, performing character recognition on the detected signboard picture to obtain character information.

In the present embodiment, the execution subject of the method of signboard classification (e.g., the server 105 or the unmanned vehicles 101, 102 shown in fig. 1) may acquire a street view of the area to be detected in various ways. For example, if the execution subject is a server, it may receive street views of the area to be detected collected by an unmanned vehicle. A street view may include many signboards. A signboard area is detected in the street view by a pre-trained signboard detection model and cropped out as the signboard picture. Character recognition is then performed on the signboard picture with a character recognition tool such as OCR to obtain the character information.
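The detection-crop-OCR pipeline described above can be sketched as follows. `detect_signboards`, `crop`, and `ocr` are hypothetical placeholders for the pre-trained signboard detection model, the cropping step, and an OCR tool; they are not real APIs:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SignboardCandidate:
    picture: bytes  # cropped signboard region of the street view
    text: str       # character information recognized from the crop

def detect_signboards(street_view: bytes) -> List[Tuple[int, int, int, int]]:
    """Placeholder for the pre-trained signboard detection model (returns boxes)."""
    return [(0, 0, 100, 40)]

def crop(street_view: bytes, box: Tuple[int, int, int, int]) -> bytes:
    """Placeholder crop; a real pipeline would slice the image array by the box."""
    return street_view

def ocr(picture: bytes) -> str:
    """Placeholder for a character recognition tool applied to the signboard picture."""
    return "example shop name"

def extract_candidates(street_view: bytes) -> List[SignboardCandidate]:
    """Detect signboard areas, crop them, and recognize their characters."""
    return [SignboardCandidate(crop(street_view, box), ocr(crop(street_view, box)))
            for box in detect_signboards(street_view)]
```

Each `SignboardCandidate` pairs exactly the two inputs the classification model expects in step 402: the signboard picture and its character information.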
Step 402, inputting the signboard picture and the character information into the signboard classification model to obtain an image score and a semantic score.

In this embodiment, the signboard classification model may be generated using the method described above in the embodiment of fig. 2; for the specific generation process, reference may be made to the related description of that embodiment, which is not repeated here. The signboard picture and the character information are predicted by the signboard classification model. As shown in fig. 3, the branch that computed the first loss value predicts the probability that the picture belongs to a signboard, i.e., the image score; the branch that computed the second loss value predicts the probability that the character information belongs to a signboard, i.e., the semantic score.
Step 403, outputting the probability that the signboard picture is valid based on the image score and the semantic score.
In this embodiment, the average of the image score and the semantic score may be used as the probability that the signboard picture is valid. Optionally, a weighted sum of the image score and the semantic score may be used instead. The weights may be determined according to the accuracy of the image feature extraction network and of the semantic feature extraction network; for example, if the image feature extraction network is more accurate than the semantic feature extraction network, the weight of the image score may be set higher than the weight of the semantic score.
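A minimal sketch of this scoring step, assuming the two weights sum to 1 (equal weights reduce to the plain average):

```python
def validity_probability(image_score: float, semantic_score: float,
                         w_image: float = 0.5, w_semantic: float = 0.5) -> float:
    """Weighted combination of the two branch scores as the signboard validity probability."""
    assert abs(w_image + w_semantic - 1.0) < 1e-9  # weights form a convex combination
    return w_image * image_score + w_semantic * semantic_score
```

Tilting `w_image` above 0.5 implements the example in the text where the image feature extraction network is the more accurate branch.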
It should be noted that the method of signboard classification in this embodiment can be used to test the signboard classification models generated by the above embodiments, and the models can then be further optimized according to the test results. The method is also a practical application of the signboard classification model generated by each of the above embodiments; classifying signboards with these models helps to improve their performance, for example by quickly filtering out invalid signboards.
With continuing reference to FIG. 5, as an implementation of the methods illustrated in the above figures, the present application provides one embodiment of an apparatus for training a sign classification model. The embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device can be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for training a signboard classification model according to the present embodiment may include: an acquisition unit 501 and a training unit 502. Wherein the obtaining unit 501 is configured to obtain a sample set, wherein samples in the sample set include: image, semantic information, sample label. A training unit 502 configured to perform the following training steps: samples are taken from the sample set. And inputting the image and semantic information in the selected sample into a signboard classification model to obtain a first prediction result based on image characteristics, a second prediction result based on semantic characteristics and a third prediction result based on fusion characteristics of the image characteristics and the semantic characteristics. And calculating a total loss value based on the first prediction result, the second prediction result, the third prediction result and the sample label. And if the total loss value is smaller than the preset threshold value, determining that the signboard classification model is trained completely.
In some optional implementations of this embodiment, the training unit 502 is further configured to: and if the total loss value is larger than or equal to the preset threshold value, adjusting relevant parameters of the signboard classification model, and continuously executing the training step based on the adjusted signboard classification model.
In some optional implementations of this embodiment, the training unit 502 is further configured to: extract image features from the image in the selected sample through the image feature extraction network; pass the image features through the image fully connected layer to obtain the image representation; extract semantic features from the semantic information in the selected sample through the semantic feature extraction network; pass the semantic features through the semantic fully connected layer to obtain the semantic representation; pass the fused image features and semantic features through the shared fully connected layer to obtain the shared representation; concatenate the image representation and the shared representation and input them into the first classifier to obtain the first prediction result; concatenate the semantic representation and the shared representation and input them into the second classifier to obtain the second prediction result; and input the shared representation into the third classifier to obtain the third prediction result.
In some optional implementations of this embodiment, the training unit 502 is further configured to: a first loss value is calculated based on a difference between the first prediction result and the sample label. A second loss value is calculated based on a difference between the second prediction result and the sample label. A third loss value is calculated based on a difference between the third prediction result and the sample label. And calculating the weighted sum of the first loss value, the second loss value and the third loss value as the total loss value.
In some optional implementations of this embodiment, the weight of the first loss value is the same as the weight of the second loss value, and both are greater than the weight of the third loss value.
With continued reference to fig. 6, the present application provides one embodiment of an apparatus for signboard classification as an implementation of the method illustrated in the above figures. The embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 4, and the apparatus can be applied to various electronic devices.
As shown in fig. 6, the apparatus 600 for signboard classification of this embodiment may include: a recognition unit 601, a classification unit 602, and an output unit 603. The recognition unit 601 is configured to perform character recognition on the detected signboard picture to obtain character information. The classification unit 602 is configured to input the signboard picture and the character information into the signboard classification model trained by the apparatus 500, to obtain an image score and a semantic score. The output unit 603 is configured to output the probability that the signboard picture is valid based on the image score and the semantic score.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of flows 200 or 400.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of flow 200 or 400.
A computer program product comprising a computer program which, when executed by a processor, implements the method of flow 200 or 400.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 may also store various programs and data required for the operation of the device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (15)
1. A method of training a sign classification model, comprising:
obtaining a sample set, wherein samples in the sample set comprise: images, semantic information, sample labels;
the following training steps are performed: selecting a sample from the sample set; inputting the image and semantic information in the selected sample into a signboard classification model to obtain a first prediction result based on image characteristics, a second prediction result based on semantic characteristics and a third prediction result based on fusion characteristics of the image characteristics and the semantic characteristics; calculating a total loss value based on the first prediction result, the second prediction result, the third prediction result, and a sample label; and if the total loss value is smaller than a preset threshold value, determining that the signboard classification model is trained completely.
2. The method of claim 1, wherein the method further comprises:
and if the total loss value is larger than or equal to the preset threshold value, adjusting relevant parameters of the signboard classification model, and continuously executing the training step based on the adjusted signboard classification model.
3. The method of claim 1, wherein the inputting the image and semantic information in the selected sample into a signboard classification model to obtain a first prediction result based on image features, a second prediction result based on semantic features, and a third prediction result based on a fusion feature of the image features and the semantic features comprises:
extracting image features from the image in the selected sample through an image feature extraction network; passing the image features through an image fully connected layer to obtain an image representation;
extracting semantic features from the semantic information in the selected sample through a semantic feature extraction network; passing the semantic features through a semantic fully connected layer to obtain a semantic representation;
passing the fused image features and semantic features through a shared fully connected layer to obtain a shared representation;
concatenating the image representation and the shared representation, and inputting them into a first classifier to obtain a first prediction result;
concatenating the semantic representation and the shared representation, and inputting them into a second classifier to obtain a second prediction result;
and inputting the shared representation into a third classifier to obtain a third prediction result.
4. The method of claim 1, wherein said calculating a total loss value based on the first predictor, the second predictor, the third predictor, and a sample label comprises:
calculating a first loss value based on a difference between the first prediction result and a sample label;
calculating a second loss value based on a difference between the second prediction result and a sample label;
calculating a third loss value based on a difference between the third prediction result and a sample label;
calculating a weighted sum of the first loss value, the second loss value, and the third loss value as a total loss value.
5. The method of claim 4, wherein the weight of the first loss value is the same as the weight of the second loss value and greater than the weight of the third loss value.
6. A method of signboard classification, comprising:
carrying out character recognition on the detected signboard picture to obtain character information;
inputting the signboard picture and the character information into a signboard classification model trained according to the method of any one of claims 1-5 to obtain an image score and a semantic score;
outputting a probability that the signboard picture is valid based on the image score and the semantic score.
7. An apparatus for training a sign classification model, comprising:
an acquisition unit configured to acquire a set of samples, wherein a sample in the set of samples comprises: images, semantic information, sample labels;
a training unit configured to perform the following training steps: selecting a sample from the sample set; inputting the image and semantic information in the selected sample into a signboard classification model to obtain a first prediction result based on image characteristics, a second prediction result based on semantic characteristics and a third prediction result based on fusion characteristics of the image characteristics and the semantic characteristics; calculating a total loss value based on the first prediction result, the second prediction result, the third prediction result, and a sample label; and if the total loss value is smaller than a preset threshold value, determining that the signboard classification model is trained completely.
8. The apparatus of claim 7, wherein the training unit is further configured to:
and if the total loss value is larger than or equal to the preset threshold value, adjusting relevant parameters of the signboard classification model, and continuously executing the training step based on the adjusted signboard classification model.
9. The apparatus of claim 7, wherein the training unit is further configured to:
extracting image features from the image in the selected sample through an image feature extraction network; passing the image features through an image fully connected layer to obtain an image representation;
extracting semantic features from the semantic information in the selected sample through a semantic feature extraction network; passing the semantic features through a semantic fully connected layer to obtain a semantic representation;
passing the fused image features and semantic features through a shared fully connected layer to obtain a shared representation;
concatenating the image representation and the shared representation, and inputting them into a first classifier to obtain a first prediction result;
concatenating the semantic representation and the shared representation, and inputting them into a second classifier to obtain a second prediction result;
and inputting the shared representation into a third classifier to obtain a third prediction result.
10. The apparatus of claim 7, wherein the training unit is further configured to:
calculating a first loss value based on a difference between the first prediction result and a sample label;
calculating a second loss value based on a difference between the second prediction result and a sample label;
calculating a third loss value based on a difference between the third prediction result and a sample label;
calculating a weighted sum of the first loss value, the second loss value, and the third loss value as a total loss value.
11. The apparatus of claim 10, wherein the weight of the first loss value is the same as the weight of the second loss value and greater than the weight of the third loss value.
12. An apparatus for signboard classification, comprising:
the identification unit is configured to perform character identification on the detected signboard picture to obtain character information;
a classification unit configured to input the signboard picture and the character information into a signboard classification model trained by the apparatus according to any one of claims 7-11, to obtain an image score and a semantic score;
an output unit configured to output a probability that the signboard picture is valid based on the image score and the semantic score.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110723347.9A CN113344121B (en) | 2021-06-29 | 2021-06-29 | Method for training a sign classification model and sign classification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110723347.9A CN113344121B (en) | 2021-06-29 | 2021-06-29 | Method for training a sign classification model and sign classification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113344121A true CN113344121A (en) | 2021-09-03 |
CN113344121B CN113344121B (en) | 2023-10-27 |
Family
ID=77481148
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110723347.9A Active CN113344121B (en) | 2021-06-29 | 2021-06-29 | Method for training a sign classification model and sign classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113344121B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114201607A (en) * | 2021-12-13 | 2022-03-18 | 北京百度网讯科技有限公司 | Information processing method and device |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590134A (en) * | 2017-10-26 | 2018-01-16 | 福建亿榕信息技术有限公司 | Text sentiment classification method, storage medium and computer |
CN107683469A (en) * | 2015-12-30 | 2018-02-09 | 中国科学院深圳先进技术研究院 | A kind of product classification method and device based on deep learning |
CN110414432A (en) * | 2019-07-29 | 2019-11-05 | 腾讯科技(深圳)有限公司 | Training method, object identifying method and the corresponding device of Object identifying model |
CN111340064A (en) * | 2020-02-10 | 2020-06-26 | 中国石油大学(华东) | Hyperspectral image classification method based on high-low order information fusion |
CN111523574A (en) * | 2020-04-13 | 2020-08-11 | 云南大学 | Image emotion recognition method and system based on multi-mode data |
US20200357143A1 (en) * | 2019-05-09 | 2020-11-12 | Sri International | Semantically-aware image-based visual localization |
CN112101165A (en) * | 2020-09-07 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Interest point identification method and device, computer equipment and storage medium |
CN112633380A (en) * | 2020-12-24 | 2021-04-09 | 北京百度网讯科技有限公司 | Interest point feature extraction method and device, electronic equipment and storage medium |
CN112733549A (en) * | 2020-12-31 | 2021-04-30 | 厦门智融合科技有限公司 | Patent value information analysis method and device based on multiple semantic fusion |
- 2021-06-29: CN202110723347.9A granted as CN113344121B (en), status Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107683469A (en) * | 2015-12-30 | 2018-02-09 | 中国科学院深圳先进技术研究院 | A kind of product classification method and device based on deep learning |
CN107590134A (en) * | 2017-10-26 | 2018-01-16 | 福建亿榕信息技术有限公司 | Text sentiment classification method, storage medium and computer |
US20200357143A1 (en) * | 2019-05-09 | 2020-11-12 | Sri International | Semantically-aware image-based visual localization |
CN110414432A (en) * | 2019-07-29 | 2019-11-05 | 腾讯科技(深圳)有限公司 | Training method, object identifying method and the corresponding device of Object identifying model |
CN111340064A (en) * | 2020-02-10 | 2020-06-26 | 中国石油大学(华东) | Hyperspectral image classification method based on high-low order information fusion |
CN111523574A (en) * | 2020-04-13 | 2020-08-11 | 云南大学 | Image emotion recognition method and system based on multi-mode data |
CN112101165A (en) * | 2020-09-07 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Interest point identification method and device, computer equipment and storage medium |
CN112633380A (en) * | 2020-12-24 | 2021-04-09 | 北京百度网讯科技有限公司 | Interest point feature extraction method and device, electronic equipment and storage medium |
CN112733549A (en) * | 2020-12-31 | 2021-04-30 | 厦门智融合科技有限公司 | Patent value information analysis method and device based on multiple semantic fusion |
Non-Patent Citations (1)
Title |
---|
Xu Ge; Xiao Yongqiang; Wang Tao; Chen Kaizhi; Liao Xiangwen; Wu Yunbing: "Zero-shot image classification based on visual error and semantic attributes", Journal of Computer Applications, no. 004 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114201607A (en) * | 2021-12-13 | 2022-03-18 | 北京百度网讯科技有限公司 | Information processing method and device |
CN114201607B (en) * | 2021-12-13 | 2023-01-03 | 北京百度网讯科技有限公司 | Information processing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN113344121B (en) | 2023-10-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113326764B (en) | | Method and device for training image recognition model and image recognition |
CN111598164A (en) | | Method and device for identifying attribute of target object, electronic equipment and storage medium |
CN112633276A (en) | | Training method, recognition method, device, equipment and medium |
CN113239807B (en) | | Method and device for training bill identification model and bill identification |
CN114648676A (en) | | Point cloud processing model training and point cloud instance segmentation method and device |
CN113313053A (en) | | Image processing method, apparatus, device, medium, and program product |
CN113947188A (en) | | Training method of target detection network and vehicle detection method |
CN115860102B (en) | | Pre-training method, device, equipment and medium for automatic driving perception model |
CN113378712A (en) | | Training method of object detection model, image detection method and device thereof |
CN112862005A (en) | | Video classification method and device, electronic equipment and storage medium |
CN114581732A (en) | | Image processing and model training method, device, equipment and storage medium |
CN114202026A (en) | | Multitask model training method and device and multitask processing method and device |
CN113569912A (en) | | Vehicle identification method and device, electronic equipment and storage medium |
CN113569911A (en) | | Vehicle identification method and device, electronic equipment and storage medium |
CN113344121B (en) | | Method for training a signboard classification model and signboard classification |
CN113326766A (en) | | Training method and device of text detection model and text detection method and device |
CN113255501A (en) | | Method, apparatus, medium, and program product for generating form recognition model |
CN115761698A (en) | | Target detection method, device, equipment and storage medium |
CN114429631B (en) | | Three-dimensional object detection method, device, equipment and storage medium |
CN116152702A (en) | | Point cloud label acquisition method and device, electronic equipment and automatic driving vehicle |
CN115527069A (en) | | Article identification and article identification system construction method and apparatus |
CN115331048A (en) | | Image classification method, device, equipment and storage medium |
CN114973333A (en) | | Human interaction detection method, human interaction detection device, human interaction detection equipment and storage medium |
CN111768007B (en) | | Method and device for mining data |
CN113902898A (en) | | Training of target detection model, target detection method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||