CN111931683A - Image recognition method, image recognition device and computer-readable storage medium - Google Patents


Info

Publication number
CN111931683A
Authority
CN
China
Prior art keywords
image
traffic sign
elements
recognition
multiple types
Prior art date
Legal status
Granted
Application number
CN202010863243.3A
Other languages
Chinese (zh)
Other versions
CN111931683B (en)
Inventor
宫鲁津
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010863243.3A
Publication of CN111931683A
Application granted
Publication of CN111931683B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs

Abstract

The application discloses an image recognition method, an image recognition device and a computer-readable storage medium, belonging to the technical fields of computers and artificial intelligence. The method comprises the following steps: acquiring an image of a traffic sign to be recognized, wherein the image of the traffic sign comprises multiple types of elements; recognizing each type of element in the multiple types of elements separately to obtain a recognition result for each type of element; and combining the recognition results of each type of element to obtain the recognition result of the image of the traffic sign. The method and the device solve the problem of low image recognition accuracy. The application is used for recognizing images.

Description

Image recognition method, image recognition device and computer-readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image recognition method and apparatus, and a computer-readable storage medium.
Background
With the development of computer technology and artificial intelligence technology, image recognition is more and more widely applied.
In the related art, a server of map software can automatically plan a travel route of a user according to various transportation means available for the user to travel, and image recognition is important in the process. For example, the staff needs to collect the images of the traffic signs on the roads in advance, and then input the collected images into the server, and the server can directly perform overall recognition on the input images of the traffic signs to obtain the recognition result.
However, the server in the related art obtains a recognition result with low accuracy.
Disclosure of Invention
The application provides an image identification method, an image identification device and a computer readable storage medium, which can solve the problem of low image identification accuracy. The technical scheme is as follows:
in one aspect, an image recognition method is provided, and the method includes:
acquiring an image of a traffic sign board to be identified, wherein the image of the traffic sign board comprises multiple types of elements;
respectively identifying each type of element in the multiple types of elements to obtain an identification result of each type of element in the multiple types of elements;
and combining the recognition results of each type of elements in the multiple types of elements to obtain the recognition result of the image of the traffic sign.
In another aspect, there is provided an image recognition apparatus including:
the system comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring an image of a traffic sign board to be recognized, and the image of the traffic sign board comprises multiple types of elements;
the identification module is used for respectively identifying each type of element in the multiple types of elements to obtain the identification result of each type of element in the multiple types of elements;
and the combination module is used for combining the recognition result of each type of element in the multiple types of elements to obtain the recognition result of the image of the traffic sign.
In still another aspect, an image recognition apparatus is provided, which includes: a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the image recognition method described above.
In yet another aspect, a computer readable storage medium is provided, having stored therein at least one instruction, at least one program, set of codes, or set of instructions that is loaded and executed by a processor to implement the image recognition method described above.
The technical solutions provided by the present application bring at least the following beneficial effects:
in the image identification method provided by the application, multiple types of elements in the image of the traffic sign board to be identified can be identified respectively, and the identification results of each type of element in the multiple types of elements are combined to obtain the identification result of the image of the traffic sign board. Therefore, compared with the overall recognition of the images of the traffic sign boards in the related technology, the recognition of the images of the traffic sign boards can be ensured to be more precise, and the accuracy of the image recognition is improved.
Drawings
Fig. 1 is a flowchart of an image recognition method provided in an embodiment of the present application;
FIG. 2 is a flow chart of another image recognition method provided by the embodiment of the application;
FIG. 3 is a schematic view of a traffic sign to be identified provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of a feature extraction network provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an image recognition model provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of another image recognition apparatus provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
At present, computer technology and Artificial Intelligence (AI) technology are developing rapidly. Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interactive systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
With the development of computer technology and artificial intelligence, image recognition is applied more and more widely in computer vision. For example, face image recognition can be applied to person tracking and searching, and image recognition of traffic signs can be applied to map construction, automatic vehicle driving and the like. At present, for image recognition of traffic signs, a recognition model is trained directly using a large number of images of traffic signs as training samples; the image of the traffic sign to be recognized is then input into the recognition model, which recognizes the whole input image and outputs a recognition result. However, rare traffic signs exist (such as traffic signs indicating restrictions on truck driving), and because such signs are set up in only a few locations, it is difficult to acquire enough training samples for them, so the image recognition accuracy for rare traffic signs is low. The embodiments of the present application provide an image recognition method and device, which can improve the accuracy of the recognition result of an image of a traffic sign.
Fig. 1 is a flowchart of an image recognition method according to an embodiment of the present application. The method may be used for a server, as shown in fig. 1, and may include:
step 101, obtaining an image of a traffic sign board to be identified, wherein the image of the traffic sign board comprises multiple types of elements.
Optionally, the image of the traffic sign to be recognized may be an image of a rare traffic sign. The traffic sign may include multiple types of elements, so the image of the traffic sign also includes multiple types of elements, and the multiple types of elements included in a rare traffic sign may each be ubiquitous elements. For example, the available sample size of images of the rare traffic sign may be less than a sample size threshold, while the available sample size of each of the multiple types of elements may be greater than the sample size threshold. Training an image recognition model with image samples whose quantity exceeds the sample size threshold ensures that the accuracy of the model's recognition results is higher than an accuracy threshold.
Optionally, the multiple types of elements in the traffic sign may include object elements and behavior elements. An object element indicates the object targeted by the information indicated by the traffic sign, and a behavior element indicates the behavior that the object is required to perform (or avoid) according to that information. For example, the image of the traffic sign to be recognized is a traffic restriction sign, such as a sign prohibiting trucks from turning left, which indicates "no left turn for trucks". The sign may include a truck pattern and a left-turn arrow pattern, and may have a red slash on it. The truck pattern is the object element of the sign, so the object targeted by the indicated information is the truck; the left-turn arrow pattern and the red slash are the behavior elements, indicating that the behavior required of the truck is to refrain from turning left.
And 102, respectively identifying each type of element in the multiple types of elements to obtain an identification result of each type of element in the multiple types of elements.
The server can determine the multiple types of elements in the image of the traffic sign and recognize each type of element separately to obtain a recognition result for each type. If the input image is the image of a sign prohibiting trucks from turning left, the server can determine, from the input image, the truck pattern and the pattern formed by the left-turn arrow and the red slash, and recognize the two patterns separately. It can thereby obtain that the recognition result of the truck pattern is "truck" and that the recognition result of the left-turn arrow and red slash pattern is "no left turn".
And 103, combining the recognition results of each type of elements in the plurality of types of elements to obtain the recognition result of the image of the traffic sign.
After obtaining the recognition results of the multiple types of elements, the server can combine the recognition result of each type of element to obtain the recognition result of the image of the traffic sign. For example, the server may combine the recognition results of the various types of elements in the input image, namely "truck" and "no left turn", to obtain the recognition result that the image of the traffic sign means trucks are prohibited from turning left.
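For illustration, the flow of steps 101 to 103 can be summarized in the following short sketch. The per-element recognizer callables and the output phrasing are hypothetical placeholders used only to show the separate-then-combine idea; they are not the patent's actual model.

```python
# Minimal sketch of steps 101-103: recognize each element type separately
# (step 102), then combine the per-type results into one result for the
# whole sign (step 103). The recognizer callables are hypothetical.
from typing import Callable, Dict

def recognize_sign(image, recognizers: Dict[str, Callable]) -> str:
    per_type = {name: recognize(image)             # step 102
                for name, recognize in recognizers.items()}
    obj = per_type.get("object")                   # e.g. "truck"
    behavior = per_type.get("behavior")            # e.g. "no left turn"
    return f"{behavior} for {obj}"                 # step 103: combined meaning
```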
In the embodiments of the present application, the traffic sign is merely taken as a sign prohibiting trucks from turning left, the object indicated by the object element is taken as a truck, and the behavior indicated by the behavior element is taken as prohibition of a left turn, by way of example. Alternatively, the object indicated by the object element in the traffic sign may be an object such as a pedestrian, a motor vehicle, a passenger car, a bus, a motorcycle or a tricycle, and the behavior indicated by the behavior element may be a behavior such as prohibiting the object from entering the road, prohibiting a left turn, prohibiting a right turn, prohibiting going straight and turning left, prohibiting going straight and turning right, prohibiting turning left and turning right, prohibiting a U-turn, prohibiting turning left and making a U-turn, restricting the height of the object, restricting the width of the object, restricting the weight of the object, or restricting the axle weight of the object. The traffic sign may also be a sign prohibiting pedestrians from turning left, prohibiting pedestrians from going straight, prohibiting vehicles from making a U-turn, prohibiting vehicles from turning right, prohibiting vehicles from entering, limiting the height of trucks, limiting the width of trucks, limiting the weight of trucks or limiting the axle weight of trucks, and the like. Images of these traffic signs can all be recognized by the image recognition method, which is not described in detail again in the embodiments of the present application.
In summary, in the image recognition method provided in the embodiment of the present application, multiple types of elements in the image of the traffic sign to be recognized may be recognized respectively, and the recognition results of each type of element in the multiple types of elements may be combined to obtain the recognition result of the image of the traffic sign. Therefore, compared with the overall recognition of the images of the traffic sign boards in the related technology, the recognition of the images of the traffic sign boards can be ensured to be more precise, and the accuracy of the image recognition is improved.
Moreover, because the traffic sign boards are all composed of multiple ubiquitous elements, even if the traffic sign boards in the images of the traffic sign boards to be identified are rare, the images of the traffic sign boards can be accurately identified by identifying the ubiquitous elements in the images. The situation that the recognition effect on the image of the rare traffic sign board is poor due to the fact that the samples of the image of the rare traffic sign board are few is avoided.
Fig. 2 is a flowchart of an image recognition method according to an embodiment of the present application. The method may be used for a server, as shown in fig. 2, and may include:
step 201, obtaining an image of a traffic sign board to be identified, wherein the image of the traffic sign board comprises multiple types of elements.
In the embodiment of the application, the image of the traffic sign board can comprise two types of elements, namely an object element and a behavior element. Illustratively, fig. 3 is a schematic diagram of an image of a traffic sign to be recognized provided by an embodiment of the present application. As shown in fig. 3, the object elements in the image of the traffic sign may include a pattern of a van X1, and the behavior elements may include a pattern of an arrow X2 and a slash (not shown in fig. 3). For other related descriptions of the traffic sign board, reference may be made to step 101, and details are not described in this embodiment of the present application.
In an alternative example, the server may obtain a plurality of initial images through a crowdsourcing system and determine the image of the traffic sign to be recognized from the plurality of initial images. Optionally, the image of the traffic sign to be recognized may be a complete initial image, or may be an image of a partial region of an initial image. It should be noted that crowdsourcing refers to the practice of a company or organization outsourcing work tasks, originally performed by employees, to an unspecified (and usually large) network of the public on a free and voluntary basis. In this embodiment of the application, the crowdsourcing system may be a road crowdsourcing system used to acquire road images, and an initial image acquired by the crowdsourcing system may be a road image captured by an ordinary user with a portable terminal (such as a mobile phone or a camera), or a road image captured by a vehicle event data recorder. The server can detect the initial images obtained by the crowdsourcing system, determine the initial images that include a traffic sign, and then crop the region where the traffic sign is located out of the initial image to obtain the image of the traffic sign. Optionally, when detecting the initial images, the server may also detect traffic lights, zebra crossings, violation cameras and the like. Optionally, an initial image may carry related acquisition information, such as its acquisition position information and acquisition time information.
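As a sketch of this cropping step, the snippet below cuts the detected sign region out of an initial road image; the detector that produces the bounding box is a hypothetical placeholder, since the patent does not name a specific detection method.

```python
# Crop the traffic-sign region out of a crowdsourced road image.
# detect_sign_box() is a hypothetical detector returning (x, y, w, h) or None.
import cv2

def crop_sign(initial_image_path: str, detect_sign_box):
    image = cv2.imread(initial_image_path)
    box = detect_sign_box(image)
    if box is None:
        return None                      # the initial image contains no traffic sign
    x, y, w, h = box
    return image[y:y + h, x:x + w]       # image of the traffic sign to be recognized
```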
Optionally, the image of the traffic sign to be recognized in the embodiment of the present application may be an image of a traffic sign of a target shape. Traffic signs may have different shapes, and the shape is related to the information indicated by the sign. For example, a circular traffic sign indicates prohibition or guidance information, a triangular traffic sign indicates warning information, and a square traffic sign indicates information such as directions, some warnings, prohibitions and guidance, tourist areas, and notices. When the server detects an initial image and determines that it includes a traffic sign, the server can also determine the shape of the traffic sign; if the shape is the target shape, the server crops the image of the traffic sign out of the initial image as the image to be recognized, and then performs the following steps 202 to 207. Optionally, for an image of a traffic sign whose shape is not the target shape, the server may also perform recognition using the image recognition method provided in the embodiment of the present application, or may perform recognition using the related-art method of recognizing the image as a whole, which is not limited in the embodiment of the present application.
The image of the traffic sign to be recognized in the embodiment of the present application may be an image of a rare traffic sign, and the traffic sign may be a circular traffic sign, that is, the target shape may be a circle. Circular traffic signs mostly indicate prohibition or restriction, and prohibition and restriction information usually targets a specific object, so many such signs are rare, that is, only a small number of them are set up on roads; therefore, in the embodiment of the present application, an image of a circular traffic sign may be determined as the image of the traffic sign to be recognized.
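The patent does not say how the target shape is checked; one common approach, shown here purely as an assumption, is a Hough-circle test on the cropped sign region.

```python
# Keep only circular (target-shape) signs using a Hough-circle check.
# All thresholds below are illustrative assumptions.
import cv2

def is_circular(sign_image) -> bool:
    gray = cv2.cvtColor(sign_image, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)
    circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2,
                               minDist=gray.shape[0],
                               param1=100, param2=40,
                               minRadius=gray.shape[0] // 4,
                               maxRadius=gray.shape[0])
    return circles is not None
```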
Step 202, inputting the image of the traffic sign board to be recognized into an image recognition model, wherein the image recognition model comprises: the system comprises a feature extraction network and a plurality of detection networks, wherein the plurality of detection networks comprise target detection networks corresponding to the multi-class elements one by one.
The feature extraction network in the image recognition model is also a backbone (backbone) network of the image recognition model, the detection networks are also branch (branch) networks of the image recognition model, and each detection network may be a head (head).
In this embodiment, the image of the traffic sign may include two types of elements, namely object elements and behavior elements, so the image recognition model may include two target detection networks in one-to-one correspondence with the two types of elements: an object detection network corresponding to the object elements and a behavior detection network corresponding to the behavior elements. Since the behavior elements in traffic signs are usually represented by arrow patterns, the behavior detection network can also be used directly to detect arrow patterns in the image. Optionally, the traffic sign may also include other elements, and the image recognition model may also include other detection networks in one-to-one correspondence with those elements, which is not limited in this embodiment of the application. For example, some traffic signs include identifying text and therefore also include text elements, in which case the image recognition model also includes a target detection network corresponding to the text elements.
In this embodiment of the application, before step 202, the image recognition model needs to be trained with a plurality of training samples. For example, each detection network may be trained with a plurality of samples of the type of element corresponding to that network, so as to ensure that the recognition accuracy of each detection network for its corresponding type of element is higher than an accuracy threshold, after which the image recognition model is used for image recognition. The samples used to train each detection network at least include the element corresponding to that network. In this way, even if only a small number of samples can be acquired for a rare traffic sign as a whole, since each element in the rare traffic sign also commonly appears in other traffic signs, images of that element in other traffic signs can be used as training samples for the corresponding detection network, and a large number of samples can therefore be acquired relatively easily for each type of element. Because the more training samples are used, the higher the recognition accuracy of the image recognition model, and because each detection network in the embodiment of the present application has enough training samples, the recognition accuracy of each detection network for its corresponding type of element is high.
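A rough training sketch under the description above follows: each detection head is optimized on samples that contain its own element type, so whole images of rare signs are not required. The loss interface, optimizer choice and loader layout are assumptions, not the patent's training procedure.

```python
# Train each detection head on samples of its own element type (PyTorch sketch).
# backbone and heads are torch.nn.Module objects; head.compute_loss() is an
# assumed interface for this illustration.
import torch

def train_heads(backbone, heads, loaders, epochs=10, lr=1e-3):
    params = list(backbone.parameters())
    for head in heads.values():
        params += list(head.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for name, head in heads.items():           # one head per element type
            for images, targets in loaders[name]:  # samples containing that element
                loss = head.compute_loss(head(backbone(images)), targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```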
In the related art, the image recognition model directly recognizes the whole input image, and then only the whole image of the traffic sign can be used as a training sample to train the image recognition model. Therefore, for rare traffic signs, the samples are difficult to obtain, so the sample amount of the rare traffic signs which can be used for training the model is small, and the recognition accuracy of the image recognition model for the images of the rare traffic signs is low. In the embodiment of the application, the image recognition model can respectively recognize various elements in the input image of the traffic sign board to be recognized through the plurality of detection networks, each detection network has enough training samples to train, the recognition accuracy of each detection network on the corresponding element is higher, and then the accurate recognition result of the image can be obtained according to the accurate recognition result of each detection network on various elements in the input image.
Optionally, the image recognition model in the embodiment of the present application may recognize images of a target size. After the server inputs the image of the traffic sign to be recognized into the image recognition model, the size of the image can be adjusted to the target size by the image recognition model. For example, the server may up-sample or down-sample the image of the traffic sign input into the image recognition model to adjust its size to the target size: if the original size of the image is larger than the target size, the image is down-sampled; if the original size is smaller than the target size, the image is up-sampled. Down-sampling (also called subsampling) is used to shrink an image and obtain a thumbnail of it; specifically, a pixel may be sampled every few pixels in the image, and the image formed by the sampled pixels is the down-sampled image, that is, the pixels of the image are subsampled. Up-sampling (also called image interpolation) is used to enlarge an image; for example, nearest-neighbor interpolation, bilinear interpolation, bi-square interpolation or bi-cubic interpolation can be used to interpolate the pixels of the input image to obtain an image containing more pixels.
For example, the target size in the embodiment of the present application may range from 60 × 60 pixels to 90 × 90 pixels. The target size may be 2^n × 2^n pixels; for example, with n = 6, the target size is 64 × 64 pixels. Because each pixel needs to be convolved during feature extraction, and each convolution can integrate the features of half of the pixels in the image, the length and width of the target size can be set to 2^n to ensure the integrity of the pixel features of the convolved image.
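A minimal sketch of the size adjustment described above, assuming OpenCV is used; the choice of interpolation modes is an illustration, not mandated by the patent.

```python
# Adjust an input image to the 2^n x 2^n target size (n = 6 gives 64 x 64):
# area-based down-sampling when the image is larger, bilinear up-sampling
# (interpolation) when it is smaller.
import cv2

def resize_to_target(image, n: int = 6):
    target = 2 ** n
    h, w = image.shape[:2]
    interp = cv2.INTER_AREA if h * w > target * target else cv2.INTER_LINEAR
    return cv2.resize(image, (target, target), interpolation=interp)
```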
And step 203, extracting the characteristics of the image of the traffic sign board through the characteristic extraction network.
The server extracts the features of the image of the traffic sign through the feature extraction network to form a feature map. Optionally, in the embodiment of the present application, the receptive field of each feature extracted from the image of the traffic sign by the feature extraction network may cover all pixels of the image; the receptive field is the area of the input image from which a point (i.e., a feature point) on the feature map output by a layer of the convolutional neural network is mapped. Each feature in the feature map extracted by the feature extraction network in the embodiment of the present application is therefore associated with all pixels of the input image of the traffic sign.
It should be noted that, since the receptive field of each feature in the feature map obtained by the feature extraction network covers all pixels of the input image, it can be ensured that when the detection network identifies a part of features corresponding to elements in the feature map, all pixels in the input image can be considered accordingly, and the accuracy of identifying the elements can be ensured. For example, the image of the traffic sign includes a red slash and an arrow pattern, and when the behavior detection network identifies the feature corresponding to the arrow pattern in the image of the traffic sign, the behavior detection network may also consider the red slash in other areas in the image, thereby ensuring that the result of the behavior element detection performed by the behavior detection network is a behavior for prohibiting the object from moving in the direction indicated by the arrow, rather than a behavior for indicating that the object moves in the direction indicated by the arrow.
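The receptive-field claim can be checked with the standard recursion rf_out = rf_in + (kernel - 1) * jump, jump_out = jump_in * stride. The layer list in the sketch below mirrors the layout described next (two 3 × 3 convolutions each followed by 2 × 2 max pooling, then seven further 3 × 3 convolutions; the 1 × 1 layers are omitted because they do not change the receptive field), with all convolution strides assumed to be 1.

```python
# Receptive-field recursion for a stack of conv/pooling layers.
def receptive_field(layers):
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Two (3x3 conv, 2x2 max pool) pairs followed by seven 3x3 convolutions:
layers = [(3, 1), (2, 2), (3, 1), (2, 2)] + [(3, 1)] * 7
print(receptive_field(layers))  # 66, which covers the whole 64 x 64 input
```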
Fig. 4 is a schematic structural diagram of a feature extraction network according to an embodiment of the present application. As shown in fig. 4, the feature extraction network includes a plurality of convolutional layers; the number of convolutional layers is greater than or equal to 13, and fig. 4 takes 13 convolutional layers as an example. Each time the input image is convolved, a pixel and its surrounding pixels are integrated into one point of the convolved feature map, so the more convolutions are applied, the larger the receptive field of each feature in the feature map. In the embodiment of the present application, the feature extraction network includes 13 convolutional layers, which ensures that the receptive field of each feature in the resulting feature map covers all pixels of the input image. Optionally, at least one of the convolutional layers is a bottleneck layer, and the convolutional layers on either side of a bottleneck layer are not bottleneck layers. The feature extraction network may further include a plurality of max pooling layers, and their number may be less than 4; the feature extraction network in this embodiment includes two max pooling layers.
As shown in fig. 4, the feature extraction network may include two first convolution parts B1, three second convolution parts B2, and one third convolution part B3. Each first convolution part B1 includes a 3 × 3 convolutional layer (i.e., one with a 3 × 3 kernel), a Batch Normalization (BN) layer, a Rectified Linear Unit (ReLU) layer, and a max pooling layer, connected in sequence. Each second convolution part B2 includes a 3 × 3 convolutional layer, a 1 × 1 convolutional layer, and another 3 × 3 convolutional layer, connected in sequence. The third convolution part B3 includes a connected 1 × 1 convolutional layer and 3 × 3 convolutional layer. Note that the C marked at each convolutional layer in fig. 4 represents the dimension of the feature map output by that layer. Optionally, each convolutional layer is also marked with the ratio of the size of its output feature map to the size of the input image, such as 1/2 or 1/4.
The max pooling layer is used to down-sample the feature map of the previous layer, reducing the feature size of the image. Because the image of the traffic sign input into the image recognition model is relatively large and the computation required for it is correspondingly large, a max pooling layer can be set after each of the first two convolutional layers to down-sample and thus quickly reduce the size of the feature map. After passing through the two first convolution parts B1, the length and width of the input image are reduced, which increases the computation speed of the feature extraction network. Moreover, the down-sampling performed by the max pooling layers enlarges the receptive field of the output features, so the receptive field of the features obtained after the two first convolution parts B1 is correspondingly enlarged. It should be noted that, in the related art, an image recognition model usually includes four or more max pooling layers; however, the image of the traffic sign is small (for example, only 64 × 64 pixels), and if it passed through four max pooling layers, the resulting feature map would be too small, only 4 × 4 pixels. The accuracy of determining the position of each element from a feature map of that size is poor, which in turn makes the element recognition results inaccurate. In the embodiment of the present application, the feature extraction network includes only two max pooling layers, which ensures that the final feature map is relatively large, for example 16 × 16 pixels. Compared with the related art, the resolution of the feature map is improved, the position of each element in the image can be determined accurately, and the accuracy of the subsequent detection networks' recognition of the elements is improved.
Compared with the related art, the embodiment of the application reduces the maximum pooling layer, so that the receptive field of the features in the obtained feature map is possibly smaller; however, the receptive field of the features can be enlarged by increasing the convolutional layer, so that the feature extraction network can include more convolutional layers, that is, the depth of the feature extraction network is increased. Therefore, in the embodiment of the present application, the second convolution portion and the third convolution portion after the first convolution portion in the feature extraction network collectively include 7 convolution layers of 3 × 3, so as to ensure that the features in the feature map output by the feature extraction network can cover all pixels in the input image.
In the embodiment of the present application, a 1 × 1 convolutional layer, which combines the channels of the image, may be used as the bottleneck layer in the feature extraction network. Illustratively, the input image in the embodiment of the present application is an RGB image, whose channels include a red channel, a green channel and a blue channel, so the number of channels of the image is 3. It should be noted that if all convolutional layers in the feature extraction network were 3 × 3, the computation load on the server would be large; therefore, in the embodiment of the present application, a 1 × 1 convolutional layer is inserted between two 3 × 3 convolutional layers. The 1 × 1 convolutional layer can halve the number of features of the feature map input to it before passing them to the next convolutional layer, so the next layer only needs to compute on half as many features, which reduces the computation load on the server and increases the feature extraction speed of the feature extraction network.
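A rough PyTorch sketch of the feature extraction network described above follows. The channel widths, and the use of BN/ReLU outside the first convolution parts, are assumptions made for illustration; the real dimensions are given in Fig. 4, which is not reproduced here.

```python
# Backbone sketch: two first convolution parts B1 (3x3 conv + BN + ReLU + 2x2
# max pooling), three second convolution parts B2 (3x3, 1x1 bottleneck, 3x3),
# and one third convolution part B3 (1x1, 3x3): 13 convolutional layers and
# 2 max pooling layers in total, so a 64x64 input yields a 16x16 feature map.
import torch.nn as nn

def conv_bn_relu(cin, cout, k):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class FeatureExtractionNetwork(nn.Module):
    def __init__(self, c=32):
        super().__init__()
        self.b1 = nn.Sequential(                        # two first convolution parts
            conv_bn_relu(3, c, 3), nn.MaxPool2d(2),
            conv_bn_relu(c, c, 3), nn.MaxPool2d(2),
        )
        self.b2 = nn.Sequential(*[                      # three second convolution parts
            nn.Sequential(conv_bn_relu(c, c, 3),
                          conv_bn_relu(c, c // 2, 1),   # 1x1 bottleneck layer
                          conv_bn_relu(c // 2, c, 3))
            for _ in range(3)
        ])
        self.b3 = nn.Sequential(conv_bn_relu(c, c // 2, 1),  # third convolution part
                                conv_bn_relu(c // 2, c, 3))

    def forward(self, x):                               # x: (N, 3, 64, 64)
        return self.b3(self.b2(self.b1(x)))             # (N, c, 16, 16)
```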
And step 204, identifying the characteristics of the corresponding class of elements through each target detection network in the characteristics of the images of the traffic sign boards to obtain the identification result of the class of elements.
Fig. 5 is a schematic structural diagram of an image recognition model according to an embodiment of the present application. As shown in fig. 5, a plurality of detection networks A are connected after the third convolution part B3 of the feature extraction network. The features extracted by the feature extraction network are shared by the plurality of detection networks: after extracting the features of the image of the traffic sign, the feature extraction network inputs them to each detection network. Each detection network then determines, from the input features, whether features of its corresponding element are present, and if so, recognizes them to obtain the recognition result of that element. For example, each target detection network among the plurality of detection networks may determine the features of its corresponding type of element from the input features and then recognize those features. Each target detection network takes only the features of its corresponding type of element among the input features as positive samples and all other features as negative samples (effectively treating them as background for that type of element), and thus recognizes the positive samples. It should be noted that, in fig. 5, the dimension C of the result output by each detection network is classes + 5, where classes is the number of categories of the element to be detected; for example, if the object detection network can recognize only 10 kinds of objects, the number of categories is 10. Among these dimensions, the first (classes + 1) dimensions include one dimension corresponding to the background category, and each of the first (classes + 1) outputs is the confidence that the corresponding category is the recognition result; the remaining 4-dimensional output is the detection box of the type of element corresponding to the detection network, that is, the region of the input image in which the detection network determines that type of element to be located.
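As an illustration of the classes + 5 output, a detection head might be sketched as below; the 1 × 1 convolution and the global pooling over the 16 × 16 feature map are assumptions, since the patent does not give the head's internal layers.

```python
# One detection head ("branch"): outputs (classes + 1) confidences, the extra
# one being the background category, plus 4 detection-box values, i.e. C = classes + 5.
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.num_classes = num_classes
        self.out = nn.Conv2d(in_channels, num_classes + 5, kernel_size=1)

    def forward(self, features):                              # features: (N, C, 16, 16)
        x = self.out(features).mean(dim=(2, 3))               # (N, classes + 5)
        scores = x[:, :self.num_classes + 1].softmax(dim=1)   # category + background confidences
        box = x[:, self.num_classes + 1:]                     # detection-box output (4 values)
        return scores, box
```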
In the embodiment of the application, each detection network can independently and simultaneously detect the corresponding elements in the feature map output by the feature extraction network, the confusion of different types of element identification is avoided, each detection network only needs to identify partial features in the feature map, and the time consumption of identification is reduced.
In the embodiment of the present application, the recognition result of each type of element output by each detection network is the recognition result output by the image recognition model. Since different images include different types of elements, the target detection networks in the image recognition models corresponding to the different images may be different. The target detection network corresponding to a certain image may include a part of the detection networks in the image recognition model, or may also include all the detection networks, which is not limited in the embodiment of the present application. For each input image, each detection network in the image recognition model recognizes the image; if the image comprises an element corresponding to a certain detection network, the detection network is a target detection network, and the detection network can output the identification result of the element; and if the image does not comprise elements corresponding to a certain detection network, the identification result output by the detection network is a background.
And step 205, combining the recognition results of each type of elements in the plurality of types of elements to obtain the recognition result of the image of the traffic sign.
The server identifies the input image of the traffic sign through the image identification model to obtain the identification results of various elements in the image, and then can carry out semantic understanding combination on the identification results of various elements to obtain the identification result of the image of the traffic sign. The server may directly perform semantic understanding combination on the recognition results output by each detection network, or the server may only determine the recognition results output by the target detection networks in each detection network, and then perform semantic understanding combination on the output results of the target detection networks.
Illustratively, the plurality of detection networks include an object detection network, a behavior detection network and a character detection network, wherein the confidence of the one-dimensional output corresponding to the category of trucks in the recognition result output by the object detection network is highest, the confidence of the one-dimensional output corresponding to the category of prohibiting left turning in the recognition result output by the behavior detection network is highest, and the confidence of the one-dimensional output corresponding to the background category in the recognition result output by the character detection network is highest, so that the server can combine the recognition results of the three detection networks, and perform semantic understanding to obtain the recognition result of the input image of the traffic signboard as the signboard for prohibiting left turning of trucks. Optionally, after the three detection networks output the recognition results, the server may perform screening to exclude the recognition result with the highest confidence of the background category in the recognition results, and further combine the remaining recognition results to obtain the recognition result of the input image of the traffic sign.
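The combination step might look like the following sketch, which drops heads whose most confident category is the background and joins the remaining per-element results; the output phrasing is only illustrative.

```python
# Combine per-head results: discard heads that saw only background, then merge.
def combine_results(head_outputs):
    """head_outputs: {head_name: (best_category, confidence)}."""
    kept = {name: category
            for name, (category, _conf) in head_outputs.items()
            if category != "background"}
    return f"{kept.get('behavior')} for {kept.get('object')}"

print(combine_results({
    "object":   ("truck", 0.93),
    "behavior": ("no left turn", 0.88),
    "text":     ("background", 0.97),
}))  # -> "no left turn for truck"
```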
It should be noted that the above embodiments of the present application only take the case where the behavior detection network directly detects whether the traffic sign indicates a restricted or prohibited behavior or a guided behavior for the object, for example, by directly detecting the red slash in the traffic sign. Alternatively, the pattern indicating whether the sign is a restriction/prohibition sign or a guidance sign may be independent of the behavior pattern; for example, the red slash and the left-turn arrow in a prohibition sign are actually two independent patterns, so the two patterns may also be treated as two independent elements and recognized by two separate detection networks.
Alternatively, since the pattern for prohibiting restriction is usually more conspicuous in the traffic sign and occupies a larger area, when the server detects the initial image to determine the traffic sign in the initial image, it can be directly determined whether the information of the traffic sign is the restriction prohibition information of the behavior. Whereas the image recognition model may only recognize patterns in the image of the traffic sign. If the identification result is the restriction prohibition information, when the server combines the identification results of the image identification models, the restriction prohibition information can be combined at the same time, and the final identification result of the image of the traffic sign can be obtained. For example, for the image of the traffic sign plate for which the left turn of the truck is prohibited, the image recognition model may output only two recognition results of the truck and the left turn, and the server combines the restriction prohibition information obtained by recognizing the initial image with the two recognition results to obtain a final recognition result that the left turn of the truck is prohibited.
And step 206, determining a target class to which the image of the traffic sign belongs according to the recognition result of the image of the traffic sign in a plurality of classes, wherein the plurality of classes comprise object classes to which the information indicated by the traffic sign aims.
For example, the category of the object to which the information indicated by the traffic sign is directed may include a pedestrian, a motor vehicle, a passenger car, a bus, a truck, a motorcycle, or a tricycle, and the like, and the plurality of categories to which the image of the traffic sign belongs may include a pedestrian, a motor vehicle, a passenger car, a bus, a truck, a motorcycle, or a tricycle, and the like. For example, if the server determines that the recognition result of the input image of the traffic sign is that the left turn of a van is prohibited in step 205, the server may determine that the target category to which the image belongs is the category of the van.
Optionally, the server may divide corresponding storage regions for each category, and after determining the target category to which the image belongs, the server may further store the recognition result of the image in the storage region corresponding to the target category, so as to facilitate subsequent invocation.
And step 207, when receiving a route planning instruction aiming at the target category, performing route planning according to the recognition result of the image of the traffic sign board and the acquisition position information carried by the image of the traffic sign board, wherein the image of the traffic sign board carries the acquisition position information of the image of the traffic sign board.
For example, map software may be installed on a terminal, the server in the embodiment of the present application may be a server of the map software, and the terminal may be communicatively connected with the server. When a user needs to travel, the user can start the map software on the terminal and input travel information, such as a departure place, a destination and a selected travel mode, and the terminal then sends a route planning instruction and the travel information input by the user to the server of the map software. If the travel mode is driving a truck, the route planning instruction may be an instruction for the truck category, which is the target category determined in step 206 above. After receiving the route planning instruction and the travel information, the server can obtain the recognition results of the images of the traffic signs of the truck category in a target region, and determine the roads on which the vehicle can pass, where the target region includes the region between the departure place and the destination and a region within a certain fixed range around them. The server then plans, according to these roads, a route for the user to drive the truck from the departure place to the destination and sends the route to the terminal, so that the user can travel following the route displayed by the terminal.
It should be noted that the server needs to determine whether a certain road can be used for vehicle passing according to the collected position information carried by the image of the traffic sign of the truck category and the identification result of the image, so the server can plan a route according to the identification result of the image of the traffic sign and the collected position information carried by the image of the traffic sign.
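As a sketch of how the recognition results and collection positions could feed route planning, the snippet below removes roads blocked for the requested category before a route search; how signs are matched to roads and how the route itself is searched are not specified in the patent and are assumed here.

```python
# Filter out roads that a prohibition sign blocks for the requested category.
def passable_roads(roads, sign_results, category):
    """roads: {road_id: road_data}; sign_results: dicts with the recognized
    meaning, the targeted category and the road matched from the sign's
    collection position (all illustrative fields)."""
    blocked = {s["road_id"] for s in sign_results
               if s["category"] == category and s["meaning"] == "no entry"}
    return {rid: road for rid, road in roads.items() if rid not in blocked}
```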
In the embodiment of the present application, the server performs route planning for the user according to the recognition result of the traffic sign. Optionally, the recognition result of the image of the traffic sign can also be used for automatic driving of the vehicle, and the automatic driving technology generally comprises high-precision maps, environment perception, behavior decision, path planning, motion control and the like. In an example, the vehicle can acquire an image of a traffic sign on a road in real time in the driving process, the image is uploaded to a server, the image is identified by the server to obtain an identification result of the image, and the driving mode of the vehicle is determined according to the identification result, so that flexible automatic driving of the vehicle is realized. Steps 201 to 206 in the image recognition method provided in the embodiment of the present application may also be used in other scenes, such as statistics of traffic signs, and the like, which is not limited in the embodiment of the present application.
In summary, in the image recognition method provided in the embodiment of the present application, multiple types of elements in the image of the traffic sign to be recognized may be recognized respectively, and the recognition results of each type of element in the multiple types of elements may be combined to obtain the recognition result of the image of the traffic sign. Therefore, compared with the overall recognition of the images of the traffic sign boards in the related technology, the recognition of the images of the traffic sign boards can be ensured to be more precise, and the accuracy of the image recognition is improved.
Moreover, because the traffic sign boards are all composed of multiple ubiquitous elements, even if the traffic sign boards in the images of the traffic sign boards to be identified are rare, the images of the traffic sign boards can be accurately identified by identifying the ubiquitous elements in the images. The situation that the recognition effect on the image of the rare traffic sign board is poor due to the fact that the samples of the image of the rare traffic sign board are few is avoided.
Fig. 6 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present application. As shown in fig. 6, the image recognition apparatus 60 may include:
the obtaining module 601 is configured to obtain an image of a traffic sign board to be identified, where the image of the traffic sign board includes multiple types of elements.
The identifying module 602 is configured to identify each class of element in the multiple classes of elements, respectively, to obtain an identification result of each class of element in the multiple classes of elements.
And the combining module 603 is configured to combine the recognition results of each of the multiple types of elements to obtain the recognition result of the image of the traffic sign.
In summary, the image recognition apparatus provided in the embodiment of the present application can respectively recognize multiple types of elements in the image of the traffic sign to be recognized, and further combine the recognition results of each type of element in the multiple types of elements to obtain the recognition result of the image of the traffic sign. Therefore, compared with the overall recognition of the images of the traffic sign boards in the related technology, the recognition of the images of the traffic sign boards can be ensured to be more precise, and the accuracy of the image recognition is improved.
Optionally, the recognition result of the image of the traffic sign is obtained through an image recognition model, and the image recognition model includes: a feature extraction network and a plurality of detection networks, the plurality of detection networks including target detection networks in one-to-one correspondence with the multiple types of elements. The recognition module 602 may further be configured to:
extracting the characteristics of the images of the traffic sign board through a characteristic extraction network;
in the characteristics of the images of the traffic sign boards, the characteristics of the corresponding class of elements are identified through each target detection network, and the identification result of the class of elements is obtained.
Optionally, the receptive field of each feature of the extracted image to be recognized covers all pixels of the image to be recognized.
Optionally, the feature extraction network comprises a plurality of convolutional layers, the number of which is greater than or equal to 13.
Optionally, at least one of the plurality of convolutional layers is a bottleneck layer, and the convolutional layers on both sides of the bottleneck layer are not bottleneck layers.
Optionally, the feature extraction network comprises a plurality of maximum pooling layers, the number of the plurality of maximum pooling layers being less than 4.
Optionally, the image of the traffic sign carries the acquisition location information of the image of the traffic sign. Fig. 7 is a schematic structural diagram of another image recognition apparatus provided in an embodiment of the present application, and on the basis of fig. 6, the image recognition apparatus 60 may further include:
the determining module 604 is configured to, after the recognition result of each of the multiple types of elements is combined to obtain the recognition result of the image of the traffic sign, determine, according to the recognition result of the image of the traffic sign, a target class to which the image of the traffic sign belongs among multiple classes, where the multiple classes include a class to which the information indicated by the traffic sign is directed.
And the planning module 605 is configured to perform route planning according to the recognition result of the image of the traffic sign and the collection position information carried by the image of the traffic sign when receiving a route planning instruction for the target category.
In summary, the image recognition apparatus provided in the embodiment of the present application can respectively recognize multiple types of elements in the image of the traffic sign to be recognized, and further combine the recognition results of each type of element in the multiple types of elements to obtain the recognition result of the image of the traffic sign. Therefore, compared with the overall recognition of the images of the traffic sign boards in the related technology, the recognition of the images of the traffic sign boards can be ensured to be more precise, and the accuracy of the image recognition is improved.
In an exemplary embodiment, an image recognition apparatus is also provided and may include a processor and a memory having at least one instruction stored therein. The at least one instruction is configured to be executed by one or more processors to implement any of the image recognition methods described above.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application. The server may be the image recognition apparatus described in the above embodiments. The server may be a physical server, or may be a cloud server providing cloud computing services, and the server may be implemented as one server, or may be a server cluster or distributed system formed by a plurality of servers. When the terminal and the server cooperatively implement the scheme provided by the embodiment of the present application, the terminal and the server may be directly or indirectly connected in a wired or wireless communication manner, which is not limited in the embodiment of the present application. As shown in fig. 8, the server 80 includes a Central Processing Unit (CPU)801, a system memory 804 including a Random Access Memory (RAM)802 and a Read Only Memory (ROM)803, and a system bus 805 connecting the system memory 804 and the central processing unit 801. The server 800 also includes a basic input/output system (I/O system) 806, which facilitates transfer of information between devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse or a keyboard, for a user to input information. The display 808 and the input device 809 are both connected to the central processing unit 801 through an input/output controller 810 connected to the system bus 805. The basic input/output system 806 may also include the input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, a mouse, or an electronic stylus. Similarly, the input/output controller 810 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the server 80. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM drive.
Without loss of generality, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid-state memory devices, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 804 and the mass storage device 807 described above may be collectively referred to as memory. The memory further stores one or more programs, and the one or more programs are configured to be executed by the central processing unit 801.
According to various embodiments of the present application, the server 80 may also be operated through a remote computer connected to a network, such as the Internet. That is, the server 80 may be connected to the network 812 through the network interface unit 811 coupled to the system bus 805, or may be connected to another type of network or a remote computer system (not shown) by using the network interface unit 811.
In an exemplary embodiment, there is also provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement any of the image recognition methods described above.
Embodiments of the present application also provide a computer program product, which when run on a computer causes the computer to execute the image recognition method as described in any of the embodiments of the present application.
In the embodiments of the present application, the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" means at least two, i.e., two or more, unless expressly defined otherwise. The term "and/or" in the embodiments of the present application merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship, while "/" in a formula or mathematical operation denotes the operator "divide by". Similarly, the term "at least one of A and B" merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, at least one of A and B may mean: A exists alone, both A and B exist, or B exists alone.
It should be noted that, when the image recognition apparatus provided in the above embodiments performs image recognition, the division into the above functional modules is merely used as an example for description. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
It should also be noted that the method embodiments provided in the embodiments of the present application and the corresponding apparatus embodiments may refer to each other, which is not limited in the embodiments of the present application. The sequence of the steps of the method embodiments may be appropriately adjusted, and steps may be correspondingly added or removed according to the situation. Any variation readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application, and details are not described herein again.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. An image recognition method, characterized in that the method comprises:
acquiring an image of a traffic sign board to be identified, wherein the image of the traffic sign board comprises multiple types of elements;
respectively identifying each type of element in the multiple types of elements to obtain an identification result of each type of element in the multiple types of elements;
and combining the recognition results of each type of elements in the multiple types of elements to obtain the recognition result of the image of the traffic sign.
2. The method of claim 1, wherein the recognition result of the image of the traffic sign is obtained through an image recognition model, the image recognition model comprising a feature extraction network and a plurality of detection networks, wherein the plurality of detection networks comprise target detection networks in one-to-one correspondence with the multiple types of elements, and the respectively identifying each type of element in the multiple types of elements comprises:
extracting features of the image of the traffic sign through the feature extraction network;
and identifying, from the features of the image of the traffic sign, the features of the corresponding type of elements through each target detection network, to obtain the identification result of that type of elements.
3. The method according to claim 2, wherein a receptive field of each extracted feature of the image of the traffic sign covers all pixels of the image of the traffic sign.
4. The method of claim 3, wherein the feature extraction network comprises a plurality of convolutional layers, and wherein the number of convolutional layers is greater than or equal to 13.
5. The method of claim 4, wherein at least one of the plurality of convolutional layers is a bottleneck layer, and the convolutional layers on both sides of the bottleneck layer are not bottleneck layers.
6. The method of any of claims 2 to 5, wherein the feature extraction network comprises a plurality of maximum pooling layers, the number of the plurality of maximum pooling layers being less than 4.
7. The method according to any one of claims 1 to 5, wherein the image of the traffic sign carries collection position information of the image of the traffic sign, and after the combining the recognition results of each type of elements in the multiple types of elements to obtain the recognition result of the image of the traffic sign, the method further comprises:
determining, according to the recognition result of the image of the traffic sign, a target category to which the image of the traffic sign belongs among a plurality of categories, the plurality of categories comprising a category to which the information indicated by the traffic sign pertains; and
when a route planning instruction for the target category is received, performing route planning according to the recognition result of the image of the traffic sign and the collection position information carried by the image of the traffic sign.
8. An image recognition apparatus, characterized in that the image recognition apparatus comprises:
an acquisition module, configured to acquire an image of a traffic sign to be recognized, wherein the image of the traffic sign comprises multiple types of elements;
a recognition module, configured to respectively recognize each type of elements in the multiple types of elements to obtain a recognition result of each type of elements in the multiple types of elements; and
a combination module, configured to combine the recognition results of each type of elements in the multiple types of elements to obtain the recognition result of the image of the traffic sign.
9. An image recognition apparatus, characterized in that the image recognition apparatus comprises: a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the image recognition method according to any one of claims 1 to 7.
10. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the image recognition method according to any one of claims 1 to 7.
CN202010863243.3A 2020-08-25 2020-08-25 Image recognition method, device and computer readable storage medium Active CN111931683B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010863243.3A CN111931683B (en) 2020-08-25 2020-08-25 Image recognition method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010863243.3A CN111931683B (en) 2020-08-25 2020-08-25 Image recognition method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111931683A true CN111931683A (en) 2020-11-13
CN111931683B CN111931683B (en) 2023-09-05

Family

ID=73305151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010863243.3A Active CN111931683B (en) 2020-08-25 2020-08-25 Image recognition method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111931683B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560670A (en) * 2020-12-15 2021-03-26 武汉大学 Deep learning-based traffic sign symbol and text detection and identification method and device
CN112668675A (en) * 2021-03-22 2021-04-16 腾讯科技(深圳)有限公司 Image processing method and device, computer equipment and storage medium
CN112712066A (en) * 2021-01-19 2021-04-27 腾讯科技(深圳)有限公司 Image recognition method and device, computer equipment and storage medium
CN116682096A (en) * 2023-08-03 2023-09-01 腾讯科技(深圳)有限公司 Information adding method, information adding device, computer equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679508A (en) * 2017-10-17 2018-02-09 广州汽车集团股份有限公司 Road traffic sign detection recognition methods, apparatus and system
WO2018191155A1 (en) * 2017-04-12 2018-10-18 Here Global B.V. Small object detection from a large image
CN110097600A (en) * 2019-05-17 2019-08-06 百度在线网络技术(北京)有限公司 The method and device of traffic mark board for identification
US20190276022A1 (en) * 2016-05-18 2019-09-12 Lg Electronics Inc. Vehicle driving assistance device and vehicle
CN110390228A (en) * 2018-04-20 2019-10-29 北京四维图新科技股份有限公司 The recognition methods of traffic sign picture, device and storage medium neural network based
US20190332897A1 (en) * 2018-04-26 2019-10-31 Qualcomm Incorporated Systems and methods for object detection
EP3584742A1 (en) * 2018-06-19 2019-12-25 KPIT Technologies Ltd. System and method for traffic sign recognition
CN110659550A (en) * 2018-06-29 2020-01-07 比亚迪股份有限公司 Traffic sign recognition method, traffic sign recognition device, computer equipment and storage medium
CN111488770A (en) * 2019-01-28 2020-08-04 初速度(苏州)科技有限公司 Traffic sign recognition method, and training method and device of neural network model

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190276022A1 (en) * 2016-05-18 2019-09-12 Lg Electronics Inc. Vehicle driving assistance device and vehicle
WO2018191155A1 (en) * 2017-04-12 2018-10-18 Here Global B.V. Small object detection from a large image
CN107679508A (en) * 2017-10-17 2018-02-09 广州汽车集团股份有限公司 Road traffic sign detection recognition methods, apparatus and system
CN110390228A (en) * 2018-04-20 2019-10-29 北京四维图新科技股份有限公司 The recognition methods of traffic sign picture, device and storage medium neural network based
US20190332897A1 (en) * 2018-04-26 2019-10-31 Qualcomm Incorporated Systems and methods for object detection
EP3584742A1 (en) * 2018-06-19 2019-12-25 KPIT Technologies Ltd. System and method for traffic sign recognition
CN110659550A (en) * 2018-06-29 2020-01-07 比亚迪股份有限公司 Traffic sign recognition method, traffic sign recognition device, computer equipment and storage medium
CN111488770A (en) * 2019-01-28 2020-08-04 初速度(苏州)科技有限公司 Traffic sign recognition method, and training method and device of neural network model
CN110097600A (en) * 2019-05-17 2019-08-06 百度在线网络技术(北京)有限公司 The method and device of traffic mark board for identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王坚: "基于深度属性学习的交通标志识别方法研究", 中国优秀硕士学位论文全文数据库 信息科技辑, pages 28 - 31 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560670A (en) * 2020-12-15 2021-03-26 武汉大学 Deep learning-based traffic sign symbol and text detection and identification method and device
CN112560670B (en) * 2020-12-15 2022-08-16 武汉大学 Deep learning-based traffic sign symbol and text detection and identification method and device
CN112712066A (en) * 2021-01-19 2021-04-27 腾讯科技(深圳)有限公司 Image recognition method and device, computer equipment and storage medium
CN112668675A (en) * 2021-03-22 2021-04-16 腾讯科技(深圳)有限公司 Image processing method and device, computer equipment and storage medium
CN116682096A (en) * 2023-08-03 2023-09-01 腾讯科技(深圳)有限公司 Information adding method, information adding device, computer equipment and storage medium
CN116682096B (en) * 2023-08-03 2024-02-27 腾讯科技(深圳)有限公司 Information adding method, information adding device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111931683B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN111666921B (en) Vehicle control method, apparatus, computer device, and computer-readable storage medium
CN111931683B (en) Image recognition method, device and computer readable storage medium
WO2018068653A1 (en) Point cloud data processing method and apparatus, and storage medium
CN111874006B (en) Route planning processing method and device
WO2021249071A1 (en) Lane line detection method, and related apparatus
CN111582189B (en) Traffic signal lamp identification method and device, vehicle-mounted control terminal and motor vehicle
CN105160309A (en) Three-lane detection method based on image morphological segmentation and region growing
CN111311675B (en) Vehicle positioning method, device, equipment and storage medium
CN111104538A (en) Fine-grained vehicle image retrieval method and device based on multi-scale constraint
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN110348463B (en) Method and device for identifying vehicle
WO2020007589A1 (en) Training a deep convolutional neural network for individual routes
CN113095152A (en) Lane line detection method and system based on regression
CN114299464A (en) Lane positioning method, device and equipment
CN111256693A (en) Pose change calculation method and vehicle-mounted terminal
CN112801236A (en) Image recognition model migration method, device, equipment and storage medium
CN112257668A (en) Main and auxiliary road judging method and device, electronic equipment and storage medium
CN111046723B (en) Lane line detection method based on deep learning
CN117372991A (en) Automatic driving method and system based on multi-view multi-mode fusion
CN117115690A (en) Unmanned aerial vehicle traffic target detection method and system based on deep learning and shallow feature enhancement
CN112654998A (en) Lane line detection method and device
CN115203352B (en) Lane level positioning method and device, computer equipment and storage medium
CN114820931A (en) Virtual reality-based CIM (common information model) visual real-time imaging method for smart city
CN110795977B (en) Traffic signal identification method and device, storage medium and electronic equipment
Valiente et al. Robust perception and visual understanding of traffic signs in the wild

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant