CN113344121B - Method for training a sign classification model and sign classification - Google Patents

Method for training a sign classification model and sign classification

Info

Publication number
CN113344121B
CN113344121B (application CN202110723347.9A)
Authority
CN
China
Prior art keywords
prediction result
semantic
image
features
signboard
Prior art date
Legal status
Active
Application number
CN202110723347.9A
Other languages
Chinese (zh)
Other versions
CN113344121A (en)
Inventor
李辉
王洪志
王昆
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110723347.9A priority Critical patent/CN113344121B/en
Publication of CN113344121A publication Critical patent/CN113344121A/en
Application granted granted Critical
Publication of CN113344121B publication Critical patent/CN113344121B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a method and apparatus for training a signboard classification model and for signboard classification, relating to the field of artificial intelligence, in particular to computer vision and deep learning technology, and applicable in particular to intelligent traffic scenarios. The specific implementation scheme is as follows: obtaining a sample set, where each sample in the sample set comprises an image, semantic information and a sample tag; selecting a sample from the sample set; inputting the image and semantic information in the selected sample into a signboard classification model to obtain a first prediction result based on image features, a second prediction result based on semantic features, and a third prediction result based on fusion features of the image features and the semantic features; calculating a total loss value based on the first prediction result, the second prediction result, the third prediction result and the sample tag; and if the total loss value is smaller than a preset threshold value, determining that training of the signboard classification model is complete. The embodiment generates a signboard classification model capable of detecting invalid signboards, improving the accuracy of signboard classification.

Description

Method for training a sign classification model and sign classification
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to computer vision and deep learning techniques, which are particularly useful in intelligent traffic scenarios.
Background
POIs (points of interest) are of great significance for map location retrieval, map navigation and positioning, and other functions, and are a basic support for local life services. The traditional POI acquisition approach depends on manual operation, which is inefficient and costly. To achieve the goal of "reducing costs, improving efficiency and updating in real time", vehicle-mounted imagery has become the main data source for automatic POI updates.
In the POI data generation pipeline, signboard detection is a key link, and efficiently extracting valid signboards has long been a bottleneck in the current pipeline. Related signboard extraction schemes based on signboard detection have improved extraction efficiency, but because the data come from different sources with different distributions, a large number of non-signboards are produced during signboard detection. These non-signboards can be broadly divided into: occluded signs, blurred signs, billboards, etc.
Disclosure of Invention
The present disclosure provides a method, apparatus, device and storage medium for training a sign classification model and sign classification.
According to a first aspect of the present disclosure, there is provided a method of training a sign classification model, comprising: obtaining a sample set, wherein the samples in the sample set comprise: image, semantic information, sample tags. The following training steps are performed: samples are selected from the sample set. Inputting the images and semantic information in the selected samples into a signboard classification model to obtain a first prediction result based on the image features, a second prediction result based on the semantic features and a third prediction result based on fusion features of the image features and the semantic features. The total loss value is calculated based on the first prediction result, the second prediction result, the third prediction result, and the sample tag. And if the total loss value is smaller than a preset threshold value, determining that the training of the signboard classification model is completed.
According to a second aspect of the present disclosure there is provided a method of sign classification comprising: and carrying out character recognition on the detected signboard pictures to obtain character information. Inputting the signboard pictures and the text information into a signboard classification model trained according to the method of the first aspect to obtain an image score and a semantic score. And outputting the probability that the signboard picture is valid based on the image score and the semantic score.
According to a third aspect of the present disclosure, there is provided an apparatus for training a sign classification model, comprising: an acquisition unit configured to acquire a sample set, wherein samples in the sample set include: image, semantic information, sample tags. A training unit configured to perform the following training steps: samples are selected from the sample set. Inputting the images and semantic information in the selected samples into a signboard classification model to obtain a first prediction result based on the image features, a second prediction result based on the semantic features and a third prediction result based on fusion features of the image features and the semantic features. The total loss value is calculated based on the first prediction result, the second prediction result, the third prediction result, and the sample tag. And if the total loss value is smaller than a preset threshold value, determining that the training of the signboard classification model is completed.
According to a fourth aspect of the present disclosure there is provided an apparatus for sign classification, comprising: an identification unit configured to perform character recognition on detected signboard pictures to obtain text information; a classification unit configured to input the signboard pictures and the text information into a signboard classification model trained by the apparatus of the third aspect, resulting in an image score and a semantic score; and an output unit configured to output a probability that the signboard picture is valid based on the image score and the semantic score.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor. And a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect or the second aspect.
According to a sixth aspect of the present disclosure there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the first or second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first or second aspect.
According to the method and apparatus for training a signboard classification model and for signboard classification, classification is performed on the image features, the semantic features, and the fusion features of the two, so that image and semantic information can be effectively fused. This improves the accuracy of signboard classification and gives better robustness against non-signboard cases such as occlusion, blurring and billboards in signboard classification scenarios.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram to which the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a method of training a sign classification model according to the present disclosure;
FIG. 3 is a schematic illustration of one application scenario of a method of training a sign classification model according to the present disclosure;
FIG. 4 is a flow chart of one embodiment of a method of categorizing signs according to the present disclosure;
FIG. 5 is a schematic structural view of one embodiment of an apparatus for training a sign classification model according to the present disclosure;
FIG. 6 is a schematic structural view of one embodiment of an apparatus for categorizing signs according to the present disclosure;
fig. 7 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 illustrates an exemplary system architecture 100 of a method of training a sign classification model, an apparatus of training a sign classification model, a method of sign classification, or an apparatus of sign classification to which embodiments of the application may be applied.
As shown in fig. 1, the system architecture 100 may include unmanned vehicles (also known as autonomous vehicles) 101 and 102, a network 103, a database server 104, and a server 105. The network 103 provides a medium for communication links between the unmanned vehicles 101, 102, the database server 104 and the server 105. The network 103 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The unmanned vehicles 101 and 102 are equipped with driving control devices and devices for acquiring point cloud data, such as lidar and millimeter-wave radar. The driving control device (also called a vehicle-mounted brain) is responsible for intelligent control of the unmanned vehicle. The driving control device may be a separately provided controller, such as a programmable logic controller (PLC), a single-chip microcomputer or an industrial controller; it may also be a device consisting of other electronic components with input/output ports and operation control functions; it may also be a computer device installed with a vehicle driving control application.
In practice, at least one sensor such as a camera, a gravity sensor or a wheel speed sensor may be mounted in the unmanned vehicle. In some cases, a GNSS (Global Navigation Satellite System) device, a SINS (Strapdown Inertial Navigation System) device and the like can also be installed in the unmanned vehicle.
Database server 104 may be a database server that provides various services. For example, the database server may store a sample set. The sample set contains a large number of samples, where each sample may include an image, semantic information and a sample tag. The image here is a signboard image detected from a street view by a signboard detection model. The semantic information is the content of the signboard, which may be manually annotated or recognized from the signboard image by OCR (Optical Character Recognition) technology. A user may also select samples from the sample set stored by the database server 104 via the unmanned vehicles 101, 102.
The server 105 may also be a server that provides various services, such as a background server that provides support for various applications displayed on the unmanned vehicles 101, 102. The background server may train an initial model using samples in the sample set collected by the unmanned vehicles 101, 102, and may send the training result (e.g., a generated signboard classification model) to the unmanned vehicles 101, 102. Thus, an unmanned vehicle can apply the generated signboard classification model to classify signboards, detect whether a signboard is valid, and filter out occluded signboards, blurred signboards, billboards and other invalid signboards.
The database server 104 and the server 105 may be hardware or software. When they are hardware, they may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as a plurality of software or software modules (e.g., to provide distributed services), or as a single software or software module. The present invention is not particularly limited herein. Database server 104 and server 105 may also be servers of a distributed system or servers that incorporate blockchains. Database server 104 and server 105 may also be cloud servers, or intelligent cloud computing servers or intelligent cloud hosts with artificial intelligence technology.
It should be noted that the method for training the sign classification model or the method for classifying signs according to the embodiment of the present application is generally performed by the server 105. Accordingly, means for training a sign classification model or means for sign classification are typically also provided in the server 105. The method of sign classification may also be performed by an unmanned vehicle.
It should be noted that the database server 104 may not be provided in the system architecture 100 in cases where the server 105 may implement the relevant functions of the database server 104.
It should be understood that the number of unmanned vehicles, networks, database servers, and servers in fig. 1 is merely illustrative. There may be any number of unmanned vehicles, networks, database servers, and servers, as required for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of training a sign classification model in accordance with the present application is shown. The method of training a sign classification model may include the steps of:
Step 201, a sample set is acquired.
In this embodiment, the execution subject of the method of training the signboard classification model (e.g., the server 105 shown in fig. 1) may obtain the sample set in a variety of ways. For example, the execution subject may obtain an existing sample set stored in a database server (e.g., the database server 104 shown in fig. 1) through a wired or wireless connection. As another example, a user may collect samples via an unmanned vehicle (e.g., the unmanned vehicles 101, 102 shown in fig. 1). In this way, the execution subject may receive the samples collected by the unmanned vehicle and store them locally, thereby generating the sample set.
Wherein the samples in the sample set comprise: an image, semantic information and a sample tag. Since the scene is classified from multiple modalities, the data mainly come from two modalities, namely the image of the signboard and the semantic information of the signboard, where the semantic information is mainly the text (symbols and the like may also be present) on the signboard image. The sample tag is used to identify whether the sample is a positive sample or a negative sample:
1. Positive sample definition:
For the image modality, clear, non-occluded, non-billboard samples are mainly defined as positive samples.
For the semantic modality, common signboard names in the signboard library are mainly taken as positive samples, such as: xx claypot porridge, hot and sour noodles, China yy bank, etc.
2. Negative sample definition:
For the image modality, blurred, occluded and non-signboard samples are mainly defined as negative samples.
For the semantic modality, signboard names that are not in the signboard library are mainly taken as negative samples.
Based on the existing POI production pipeline and the above definitions of positive and negative samples, labeling the data is relatively simple.
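As an illustration only, such a sample might be represented as follows in Python; the disclosure does not prescribe any data structure, so every field name below is an assumption:

```python
from dataclasses import dataclass

@dataclass
class SignboardSample:
    """One training sample: a detected signboard crop plus its sign text.

    The disclosure only requires that each sample carry an image, semantic
    information and a sample tag; everything else here is illustrative.
    """
    image_path: str     # signboard image cropped from a street view
    semantic_text: str  # text on the signboard (manually noted or OCR)
    label: int          # 1 = positive (valid signboard), 0 = negative

# A positive sample: a clear, unoccluded sign whose name is a common
# entry in the signboard library.
sample = SignboardSample("signs/000123.jpg", "xx claypot porridge", 1)
```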
Step 202, selecting a sample from a sample set.
In this embodiment, the execution subject may select a sample from the sample set acquired in step 201 and perform the training steps of steps 203 to 206. The manner of selection and the number of samples selected are not limited in the present application. For example, samples may be selected randomly, or samples with clearer pictures or richer semantic information may be preferred.
Step 203, inputting the image and semantic information in the selected sample into a signboard classification model to obtain a first prediction result based on the image features, a second prediction result based on the semantic features and a third prediction result based on the fusion features of the image features and the semantic features.
In this embodiment, the signboard classification model may include three classification sub-models: an image classification sub-model, a semantic classification sub-model, and a fusion classification sub-model. The input of the image classification sub-model is only the image; by extracting image features for recognition, the probability that the image belongs to a valid signboard (clear, non-occluded, not a billboard) is obtained as the first prediction result. The input of the semantic classification sub-model is only the semantic information; by extracting semantic features for recognition, the probability that the semantic information belongs to a valid signboard (a common signboard name in the signboard library) is obtained as the second prediction result. The input of the fusion classification sub-model is the intermediate image features produced by the image classification sub-model and the intermediate semantic features produced by the semantic classification sub-model. The fusion classification sub-model fuses the image features and the semantic features, then recognizes the fused features and obtains the probability that the image and the semantic information match as the third prediction result.
Step 204, calculating a total loss value based on the first prediction result, the second prediction result, the third prediction result, and the sample tag.
In this embodiment, the three classification sub-models correspond to three loss values, as follows:
1. The first loss value (image loss), corresponding to the image classification sub-model, mainly learns from the image-feature perspective whether the current image is a valid signboard, and is calculated from the difference between the first prediction result and the sample label. For example, if the sample label is 1 (positive sample) and the first prediction result is 0.9, the first loss value is 0.1. The image classification sub-model may be trained with a cross-entropy loss.
2. The second loss value (semantic loss), corresponding to the semantic classification sub-model, mainly learns from the semantic perspective whether the content in the current image is valid signboard content. Billboards, for example, are negative samples and can be removed through the classification result of the semantic classification sub-model. For example, if the sample label is 0 (negative sample) and the second prediction result is 0.2, the second loss value is 0.2. The semantic classification sub-model may be trained with a cross-entropy loss.
3. The third loss value (fusion loss), corresponding to the fusion classification sub-model, strengthens the feature learning of both modalities. The label for the fusion classification sub-model is generated during training: it is 1 (learned as a positive sample) if the current sample image and the text in the image correspond, and 0 (learned as a negative sample) if not. The purpose is mainly to distinguish whether the image and the text come from the same image, i.e., whether they match. The fusion classification sub-model may be trained with a binary cross-entropy loss.
Finally, a weighted sum of the first, second and third loss values is calculated as the total loss value. The weight of each classification sub-model's loss value can be set according to actual requirements. For example, if the image classification sub-model has the highest accuracy, its weight may be set largest. And because the fusion classification sub-model does not converge easily, the weight of the third loss value may be set smallest.
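For illustration, this computation might be sketched in PyTorch as follows; the concrete weight values are assumptions that merely respect the preference stated above (equal image and semantic weights, smaller fusion weight):

```python
import torch.nn.functional as F

def total_loss(pred_img, pred_sem, pred_fuse, label, match_label,
               w_img=0.4, w_sem=0.4, w_fuse=0.2):
    """Weighted sum of the three sub-model losses.

    pred_img / pred_sem: logits over {invalid, valid} from the image and
    semantic branches; pred_fuse: a single image-text matching logit;
    label: the sample label; match_label: the pairing label generated
    during training.
    """
    loss_img = F.cross_entropy(pred_img, label)            # first loss value
    loss_sem = F.cross_entropy(pred_sem, label)            # second loss value
    loss_fuse = F.binary_cross_entropy_with_logits(        # third loss value
        pred_fuse, match_label.float())
    return w_img * loss_img + w_sem * loss_sem + w_fuse * loss_fuse
```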
Step 205, if the total loss value is less than the predetermined threshold, it is determined that training of the signboard classification model is complete.
In this embodiment, when the total loss value is less than the predetermined threshold, the predicted values may be considered close to the true values. The predetermined threshold may be set according to actual requirements. If the total loss value is less than the predetermined threshold, training of the signboard classification model is complete.
Step 206, if the total loss value is greater than or equal to the predetermined threshold, the relevant parameters of the signboard classification model are adjusted and steps 202 to 206 are executed again.
In this embodiment, if the total loss value is not less than the predetermined threshold, training of the signboard classification model is not complete, and the relevant parameters of the signboard classification model are adjusted; for example, the weights in the image classification sub-model, the semantic classification sub-model and the fusion classification sub-model are modified using back propagation. The flow may then return to step 202 to re-select samples from the sample set, so that the training steps described above can continue based on the adjusted signboard classification model.
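A minimal sketch of this training loop (steps 202 to 206), assuming PyTorch, the total_loss function sketched above, a SignClassifier module like the one sketched in the implementation notes below, and a data loader yielding (image, text_ids, label, match_label) batches; the threshold and optimizer settings are illustrative:

```python
import torch

model = SignClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
threshold = 0.05

training_done = False
while not training_done:
    for image, text_ids, label, match_label in loader:           # step 202
        pred_img, pred_sem, pred_fuse = model(image, text_ids)   # step 203
        loss = total_loss(pred_img, pred_sem, pred_fuse,
                          label, match_label)                    # step 204
        if loss.item() < threshold:                              # step 205
            training_done = True
            break
        optimizer.zero_grad()                                    # step 206:
        loss.backward()                  # back propagation adjusts the
        optimizer.step()                 # relevant parameters, then re-select
```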
The method for training a signboard classification model provided by the embodiments of the present application is inspired by the human visual system: since people judge whether a signboard is valid from both image features and semantic features, the method combines features from multiple modalities and therefore has better robustness against non-signboard cases such as occlusion, blurring and billboards in signboard classification scenarios.
In some optional implementations of this embodiment, inputting the image and semantic information in the selected sample into the signboard classification model to obtain the first prediction result based on image features, the second prediction result based on semantic features, and the third prediction result based on fusion features of the two includes: extracting image features from the image in the selected sample through an image feature extraction network; passing the image features through an image fully connected layer to obtain an image representation; extracting semantic features from the semantic information in the selected sample through a semantic feature extraction network; passing the semantic features through a semantic fully connected layer to obtain a semantic representation; fusing the image features and the semantic features after passing them through a shared fully connected layer to obtain a shared representation; concatenating the image representation with the shared representation and inputting the result into a first classifier to obtain the first prediction result; concatenating the semantic representation with the shared representation and inputting the result into a second classifier to obtain the second prediction result; and inputting the shared representation into a third classifier to obtain the third prediction result.
The image and the semantic information in it are passed through two different backbone networks (an image feature extraction network and a semantic feature extraction network; the backbone weights are not shared, because the information distributions of the two modalities differ greatly and sharing would make model training hard to converge) to obtain feature vectors for the two modalities. For image features, after an independent image-unique FC layer (image fully connected layer), the result is concatenated with the shared FC (shared fully connected layer) features and trained with a cross-entropy loss; similarly, semantic features pass through a semantic-unique FC (semantic fully connected layer), are concatenated with the shared FC features, and are trained with a cross-entropy loss. The intermediate shared representation, obtained by concatenating the two modalities' features through the shared FC, is trained with a binary cross-entropy loss.
The image feature extraction network may be a common network structure such as ResNet-50 or VGG. The semantic feature extraction network may be a text encoder commonly used in natural language processing, such as a Transformer.
Correspondingly, the relevant parameters of the signboard classification model to be adjusted are those of the image feature extraction network, the semantic feature extraction network, the image fully connected layer, the semantic fully connected layer, the shared fully connected layer, the first classifier, the second classifier and the third classifier. The signboard classification model trained in this way can accurately identify invalid signboards such as billboards, occluded signboards and blurred signboards, while retaining valid signboards; it provides valid data for signboard recognition and filters out invalid data, thereby improving the speed of POI recognition.
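One possible PyTorch reading of this architecture is sketched below. It is not the disclosure's reference implementation: the feature dimensions, vocabulary size, pooling of the text-encoder output, and the choice to concatenate both backbone features before the shared fully connected layer are all assumptions:

```python
import torch
import torch.nn as nn
import torchvision

class SignClassifier(nn.Module):
    """Three-branch signboard classification model (sketch)."""

    def __init__(self, dim=256, vocab_size=8000):
        super().__init__()
        # Image feature extraction network: ResNet-50 without its final FC.
        # Backbone weights are not shared with the text branch.
        resnet = torchvision.models.resnet50(weights=None)
        self.img_backbone = nn.Sequential(*list(resnet.children())[:-1])
        # Semantic feature extraction network: a small Transformer encoder.
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.txt_backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Unique and shared fully connected layers.
        self.img_fc = nn.Linear(2048, dim)           # image unique FC
        self.sem_fc = nn.Linear(dim, dim)            # semantic unique FC
        self.share_fc = nn.Linear(2048 + dim, dim)   # shared FC
        # The three classifiers.
        self.cls_img = nn.Linear(dim * 2, 2)   # image repr ++ shared repr
        self.cls_sem = nn.Linear(dim * 2, 2)   # semantic repr ++ shared repr
        self.cls_fuse = nn.Linear(dim, 1)      # shared repr only

    def forward(self, image, text_ids):
        img_feat = self.img_backbone(image).flatten(1)   # image features
        txt_feat = self.txt_backbone(self.embed(text_ids)).mean(dim=1)
        img_repr = self.img_fc(img_feat)                 # image representation
        sem_repr = self.sem_fc(txt_feat)                 # semantic representation
        shared = self.share_fc(torch.cat([img_feat, txt_feat], dim=1))
        pred_img = self.cls_img(torch.cat([img_repr, shared], dim=1))  # 1st
        pred_sem = self.cls_sem(torch.cat([sem_repr, shared], dim=1))  # 2nd
        pred_fuse = self.cls_fuse(shared).squeeze(1)                   # 3rd
        return pred_img, pred_sem, pred_fuse
```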
In some optional implementations of this embodiment, calculating the total loss value based on the first prediction result, the second prediction result, the third prediction result and the sample tag includes: calculating a first loss value based on the difference between the first prediction result and the sample label; calculating a second loss value based on the difference between the second prediction result and the sample label; calculating a third loss value based on the difference between the third prediction result and the sample label; and calculating a weighted sum of the first, second and third loss values as the total loss value. The difference between the first prediction result and the sample label is the difference between the image-based signboard classification result and the true image label (1 for a positive sample, 0 for a negative sample). The difference between the second prediction result and the sample label is the difference between the semantics-based signboard classification result and the true semantic label (1 for a positive sample, 0 for a negative sample). The difference between the third prediction result and the sample label is the difference between the image-semantics matching result and the true pairing label (1 for a positive sample, 0 for a negative sample). By computing a weighted sum of the three loss values, the trained model can refer to the image and the semantic information simultaneously, which is more accurate than classifying signboards from either one alone and avoids falsely detecting images such as billboards as signboards.
In some alternative implementations of this embodiment, the weight of the first loss value is the same as the weight of the second loss value, and both are greater than the weight of the third loss value. Setting the influence of the image and the semantic information on the model to be the same, and greater than that of their fusion, can speed up the convergence of the model and improve classification accuracy.
With further reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method of training a signboard classification model according to this embodiment. In the application scenario of fig. 3, the user randomly selects a sample from the sample set; the sample includes a picture detected by the signboard detection network and semantic information ("farm"), with a mark indicating whether the picture is a valid signboard (in this case the signboard has a blurred word and is therefore invalid). The picture is input into the image feature extraction network of the signboard classification model to extract image features, and the semantic information is input into the semantic feature extraction network to extract semantic features. The image features pass through the image fully connected layer to obtain the image representation; the image features and semantic features pass through the shared fully connected layer to obtain the shared representation; the image representation and the shared representation are concatenated and fed to the first classifier to obtain the first prediction result, which is compared with the sample label to calculate the first loss value. The semantic features pass through the semantic fully connected layer to obtain the semantic representation; the semantic representation and the shared representation are concatenated and fed to the second classifier to obtain the second prediction result, which is compared with the sample label to calculate the second loss value. The shared representation is fed to the third classifier to obtain the third prediction result, which is compared with the sample label to calculate the third loss value. The total loss value is calculated from the first, second and third loss values. If the total loss value is less than the predetermined threshold, training of the signboard classification model is complete; otherwise, the relevant parameters of the model are adjusted, samples are reselected, and training continues until the total loss value falls below the predetermined threshold.
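Continuing from the SignClassifier and total_loss sketches above, the fig. 3 scenario could be exercised end to end with dummy tensors standing in for the "farm" sample (all shapes and values are illustrative):

```python
import torch

model = SignClassifier()
image = torch.randn(1, 3, 224, 224)        # the cropped signboard picture
text_ids = torch.randint(0, 8000, (1, 8))  # tokenized semantic information
label = torch.tensor([0])                  # blurred word, so an invalid sign
match_label = torch.tensor([1])            # the text does come from this image

pred_img, pred_sem, pred_fuse = model(image, text_ids)
loss = total_loss(pred_img, pred_sem, pred_fuse, label, match_label)
print(float(loss))  # compared against the predetermined threshold
```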
Referring to fig. 4, a flow 400 of one embodiment of a method of sign categorization provided by the present application is shown. The method of sign classification may comprise the steps of:
Step 401, performing character recognition on the detected signboard pictures to obtain text information.
In this embodiment, the execution subject of the method of signboard classification (e.g., the server 105 or the unmanned vehicles 101, 102 shown in fig. 1) may acquire a street view of the region to be detected in various ways. For example, if the execution subject is a server, it may receive street views of the region to be detected collected by an unmanned vehicle. A street view may contain many signboards. Signboard areas are detected from the street view by a pre-trained signboard detection model and cut out as signboard pictures. Character recognition is then performed on the signboard pictures with a text recognition tool such as OCR to obtain the text information.
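This step might be sketched as follows, with a detect_boxes callable standing in for the pre-trained signboard detection model (whose implementation is not specified here) and pytesseract as one possible OCR tool; the disclosure does not name a particular detector or OCR implementation:

```python
from typing import Callable, List, Tuple
from PIL import Image
import pytesseract

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)

def recognize_signs(street_view: Image.Image,
                    detect_boxes: Callable[[Image.Image], List[Box]]):
    """Cut out each detected signboard area and recognize its text."""
    results = []
    for box in detect_boxes(street_view):     # pre-trained detector (assumed)
        crop = street_view.crop(box)          # the signboard picture
        text = pytesseract.image_to_string(crop, lang="chi_sim")
        results.append((crop, text.strip()))  # picture + text information
    return results
```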
Step 402, inputting the signboard pictures and the text information into a signboard classification model to obtain an image score and a semantic score.
In this embodiment, the signboard classification model may be generated using the method described above in connection with the embodiment of fig. 2. For the specific generation process, reference may be made to the description of the embodiment of fig. 2, which is not repeated here. The signboard picture and the text information can be predicted by the signboard classification model: as shown in fig. 3, the branch used to calculate the first loss value predicts the probability that the picture belongs to a signboard, i.e., the image score, and the branch used to calculate the second loss value predicts the probability that the text information belongs to a signboard, i.e., the semantic score.
Step 403, outputting the probability of the signboard picture being valid based on the image score and the semantic score.
In this embodiment, the average of the image score and the semantic score may be taken as the probability that the signboard picture is valid. Alternatively, a weighted sum of the image score and the semantic score may be used. The weights may be determined by the accuracy of the image feature extraction network and that of the semantic feature extraction network; for example, if the image feature extraction network is more accurate than the semantic feature extraction network, the weight of the image score may be set higher than that of the semantic score.
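A sketch of this scoring step, reusing the SignClassifier from the earlier sketch; with equal weights it reduces to the simple average:

```python
import torch

def sign_valid_probability(model, image, text_ids, w_img=0.5, w_sem=0.5):
    """Steps 402-403: combine the image score and the semantic score.

    Equal weights give the simple average; otherwise the weights may favor
    the more accurate branch, as described above.
    """
    model.eval()
    with torch.no_grad():
        pred_img, pred_sem, _ = model(image, text_ids)
        img_score = pred_img.softmax(dim=1)[:, 1]  # P(valid | image)
        sem_score = pred_sem.softmax(dim=1)[:, 1]  # P(valid | semantics)
    return w_img * img_score + w_sem * sem_score
```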
It should be noted that the method of signboard classification of this embodiment may be used to test the signboard classification models generated in the above embodiments, and the signboard classification model can then be continuously optimized according to the test results. The method may also be a practical application of the signboard classification models generated in the above embodiments: using them to classify signboards, e.g., quickly filtering out invalid signboards, improves the performance of the signboard classification model in use.
With continued reference to FIG. 5, as an implementation of the method illustrated in the above figures, the present application provides one embodiment of an apparatus for training a sign classification model. The embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device can be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for training a signboard classification model according to the present embodiment may include: an acquisition unit 501 and a training unit 502. Wherein the obtaining unit 501 is configured to obtain a sample set, wherein samples in the sample set comprise: image, semantic information, sample tags. A training unit 502 configured to perform the following training steps: samples are selected from the sample set. Inputting the images and semantic information in the selected samples into a signboard classification model to obtain a first prediction result based on the image features, a second prediction result based on the semantic features and a third prediction result based on fusion features of the image features and the semantic features. The total loss value is calculated based on the first prediction result, the second prediction result, the third prediction result, and the sample tag. And if the total loss value is smaller than the preset threshold value, determining that the training of the signboard classification model is completed.
In some optional implementations of the present embodiment, training unit 502 is further configured to: and if the total loss value is greater than or equal to a preset threshold value, adjusting relevant parameters of the signboard classification model, and continuously executing the training step based on the adjusted signboard classification model.
In some optional implementations of the present embodiment, training unit 502 is further configured to: and extracting image features from the images in the selected samples through an image feature extraction network. And after the image features pass through the image full-connection layer, obtaining image representation. And extracting semantic features from semantic information in the selected samples through a semantic feature extraction network. And after the semantic features pass through the semantic full-connection layer, semantic representation is obtained. And fusing the image features and the semantic features after the image features and the semantic features pass through the shared full-connection layer to obtain a shared representation. And inputting the image representation and the shared representation into a first classifier after cascading to obtain a first prediction result. And inputting the semantic representation and the shared representation into a second classifier after cascading to obtain a second prediction result. And inputting the shared representation into a third classifier to obtain a third prediction result.
In some optional implementations of the present embodiment, training unit 502 is further configured to: a first loss value is calculated based on a difference between the first prediction result and the sample tag. A second loss value is calculated based on a difference between the second prediction result and the sample tag. A third loss value is calculated based on a difference between the third prediction result and the sample tag. A weighted sum of the first, second, and third loss values is calculated as a total loss value.
In some alternative implementations of the present embodiment, the weight of the first penalty value is the same as the weight of the second penalty value and is greater than the weight of the third penalty value.
With continued reference to fig. 6, as an implementation of the method illustrated in the above figures, the present application provides one embodiment of an apparatus for sign sorting. The embodiment of the device corresponds to the embodiment of the method shown in fig. 4, and the device can be applied to various electronic devices.
As shown in fig. 6, the apparatus 600 for signboard classification of this embodiment may include: an identification unit 601, a classification unit 602, and an output unit 603. The identification unit 601 is configured to perform character recognition on the detected signboard pictures to obtain text information. The classification unit 602 is configured to input the signboard picture and the text information into a signboard classification model trained by the apparatus 500, resulting in an image score and a semantic score. The output unit 603 is configured to output the probability that the signboard picture is valid based on the image score and the semantic score.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of flow 200 or 400.
A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of flow 200 or 400.
A computer program product comprising a computer program that when executed by a processor implements the method of flow 200 or 400.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as the method of training a sign classification model. For example, in some embodiments, the method of training a sign classification model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by the computing unit 701, one or more steps of the method of training a sign classification model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method of training the sign classification model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be special-purpose or general-purpose, and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (12)

1. A method of training a sign classification model, comprising:
obtaining a sample set, wherein samples in the sample set comprise: image, semantic information, sample tags;
the following training steps are performed: selecting a sample from the sample set; inputting the images and semantic information in the selected samples into a signboard classification model to obtain a first prediction result based on image features, a second prediction result based on semantic features and a third prediction result based on fusion features of the image features and the semantic features; calculating a total loss value based on the first prediction result, the second prediction result, the third prediction result, and a sample tag; if the total loss value is smaller than a preset threshold value, determining that the training of the signboard classification model is completed;
wherein inputting the image and semantic information in the selected sample into the signboard classification model to obtain the first prediction result based on image features, the second prediction result based on semantic features and the third prediction result based on fusion features of the image features and the semantic features comprises the following steps:
extracting image features from images in the selected samples through an image feature extraction network; the image characteristics are subjected to an image full-connection layer to obtain image representation;
extracting semantic features from semantic information in the selected samples through a semantic feature extraction network; the semantic features are subjected to a semantic full-connection layer to obtain semantic representation;
the image features and the semantic features are fused after passing through a shared full-connection layer to obtain a shared representation;
inputting the image representation and the sharing representation into a first classifier after cascading to obtain a first prediction result;
inputting the semantic representation and the sharing representation into a second classifier after cascading to obtain a second prediction result;
and inputting the shared representation into a third classifier to obtain a third prediction result.
2. The method of claim 1, wherein the method further comprises:
if the total loss value is greater than or equal to a preset threshold value, adjusting relevant parameters of the signboard classification model, and continuing to execute the training step based on the adjusted signboard classification model.
3. The method of claim 1, wherein the calculating a total loss value based on the first prediction result, the second prediction result, the third prediction result, and a sample tag comprises:
calculating a first loss value based on a difference between the first prediction result and a sample label;
calculating a second loss value based on a difference between the second prediction result and a sample label;
calculating a third loss value based on a difference between the third prediction result and a sample label;
a weighted sum of the first, second, and third loss values is calculated as a total loss value.
4. A method according to claim 3, wherein the weight of the first loss value is the same as the weight of the second loss value and is greater than the weight of the third loss value.
5. A method of sign classification, comprising:
performing character recognition on the detected signboard pictures to obtain character information;
inputting the sign picture and the text information into a sign classification model trained according to the method of any one of claims 1-4 to obtain an image score and a semantic score;
And outputting the probability that the signboard picture is valid based on the image score and the semantic score.
6. An apparatus for training a sign classification model, comprising:
an acquisition unit configured to acquire a sample set, wherein samples in the sample set include: image, semantic information, sample tags;
a training unit configured to perform the following training steps: selecting a sample from the sample set; inputting the images and semantic information in the selected samples into a signboard classification model to obtain a first prediction result based on image features, a second prediction result based on semantic features and a third prediction result based on fusion features of the image features and the semantic features; calculating a total loss value based on the first prediction result, the second prediction result, the third prediction result, and a sample tag; if the total loss value is smaller than a preset threshold value, determining that the training of the signboard classification model is completed;
wherein the training unit is further configured to:
extract image features from the image in the selected sample through an image feature extraction network, and pass the image features through an image fully-connected layer to obtain an image representation;
extract semantic features from the semantic information in the selected sample through a semantic feature extraction network, and pass the semantic features through a semantic fully-connected layer to obtain a semantic representation;
pass the image features and the semantic features through a shared fully-connected layer and fuse the results to obtain a shared representation;
concatenate the image representation with the shared representation and input the result into a first classifier to obtain the first prediction result;
concatenate the semantic representation with the shared representation and input the result into a second classifier to obtain the second prediction result;
and input the shared representation into a third classifier to obtain the third prediction result.
7. The apparatus of claim 6, wherein the training unit is further configured to:
if the total loss value is greater than or equal to the preset threshold value, adjust relevant parameters of the signboard classification model, and continue to perform the training step based on the adjusted signboard classification model.
8. The apparatus of claim 6, wherein the training unit is further configured to:
calculate a first loss value based on the difference between the first prediction result and the sample label;
calculate a second loss value based on the difference between the second prediction result and the sample label;
calculate a third loss value based on the difference between the third prediction result and the sample label;
and calculate a weighted sum of the first, second, and third loss values as the total loss value.
9. The apparatus of claim 8, wherein the weight of the first loss value is the same as the weight of the second loss value and is greater than the weight of the third loss value.
10. An apparatus for sign classification, comprising:
a recognition unit configured to perform character recognition on a detected signboard picture to obtain text information;
a classification unit configured to input the signboard picture and the text information into a signboard classification model trained by the apparatus of any one of claims 6-9 to obtain an image score and a semantic score;
and an output unit configured to output the probability that the signboard picture is valid based on the image score and the semantic score.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-5.
Priority Applications (1)

Application Number: CN202110723347.9A
Priority/Filing Date: 2021-06-29
Status: Active
Title: Method for training a sign classification model and sign classification

Publications (2)

Publication Number | Publication Date
CN113344121A | 2021-09-03
CN113344121B | 2023-10-27

Family

Family ID: 77481148

Family Applications (1)

Application Number: CN202110723347.9A (Active)
Priority/Filing Date: 2021-06-29
Title: Method for training a sign classification model and sign classification

Country Status (1)

CN: CN113344121B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201607B * 2021-12-13 2023-01-03 Beijing Baidu Netcom Science and Technology Co Ltd Information processing method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590134A * 2017-10-26 2018-01-16 Fujian Yirong Information Technology Co Ltd Text sentiment classification method, storage medium and computer
CN107683469A * 2015-12-30 2018-02-09 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Product classification method and device based on deep learning
CN110414432A * 2019-07-29 2019-11-05 Tencent Technology (Shenzhen) Co Ltd Training method of an object recognition model, object recognition method, and corresponding devices
CN111340064A * 2020-02-10 2020-06-26 China University of Petroleum (East China) Hyperspectral image classification method based on high- and low-order information fusion
CN111523574A * 2020-04-13 2020-08-11 Yunnan University Image emotion recognition method and system based on multi-modal data
CN112101165A * 2020-09-07 2020-12-18 Tencent Technology (Shenzhen) Co Ltd Point-of-interest recognition method and device, computer equipment and storage medium
CN112633380A * 2020-12-24 2021-04-09 Beijing Baidu Netcom Science and Technology Co Ltd Point-of-interest feature extraction method and device, electronic equipment and storage medium
CN112733549A * 2020-12-31 2021-04-30 Xiamen Zhironghe Technology Co Ltd Patent value information analysis method and device based on multi-semantic fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11361470B2 (en) * 2019-05-09 2022-06-14 Sri International Semantically-aware image-based visual localization

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zero-shot image classification based on visual error and semantic attributes; Xu Ge; Xiao Yongqiang; Wang Tao; Chen Kaizhi; Liao Xiangwen; Wu Yunbing; Journal of Computer Applications (004); full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant