CN114926766A - Identification method and device, equipment and computer readable storage medium - Google Patents

Identification method and device, equipment and computer readable storage medium

Info

Publication number
CN114926766A
Authority
CN
China
Prior art keywords
information
scene
semantic
image frame
scene type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210569816.0A
Other languages
Chinese (zh)
Inventor
杜松显
卢江涛
唐伟
王家奇
林锦河
王静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yele Technology Co ltd
Original Assignee
Hangzhou Yele Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yele Technology Co ltd filed Critical Hangzhou Yele Technology Co ltd
Priority to CN202210569816.0A priority Critical patent/CN114926766A/en
Publication of CN114926766A publication Critical patent/CN114926766A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

The embodiments of the application disclose an identification method, an identification device, equipment and a computer-readable storage medium. The method comprises the following steps: extracting an image frame to be recognized from a video stream; and inputting the image frame to be recognized into a trained recognition model to obtain the information of a target object and the target scene information output by the recognition model. The recognition model comprises an image feature extraction network for extracting image features of the image frame to be recognized, a detection task branch for outputting detection information of objects in the image frame, a semantic segmentation task branch for outputting semantic information of pixel points in the image frame, and a classification task branch for outputting scene type information of the image frame; the information of the target object is determined according to the detection information and the semantic information, and the target scene information is determined according to the scene type information and the semantic information. Because multiple tasks are processed on the same image features, data processing time is saved, and because the outputs of the task branches reference one another, the output information is more accurate.

Description

Identification method and device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an identification method, an identification apparatus, a device, and a computer-readable storage medium.
Background
A video stream is composed of a plurality of consecutive image frames. In the prior art, image features are obtained by performing feature extraction on the image frames, and the image frames are analyzed according to the extracted image features, for example, to identify a target object in the image frames.
As the types of information required from image frames increase, the total amount of image frame information grows, so the process of identifying and acquiring image features in an image frame takes longer. In particular, the prior art cannot recognize the target object and the target scene in an image frame at the same time; when the recognition steps are performed simultaneously in a multi-task manner, it can be guaranteed neither that the time of the whole recognition process will not increase nor that the recognition result will remain accurate.
Therefore, an identification method that simultaneously identifies target object information and target scene information in an image frame is needed, so as to solve the above problems and ensure the accuracy of the identified information without increasing the time delay.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present application respectively provide an identification method, an identification device, an identification apparatus, and a computer-readable storage medium, so as to simultaneously identify target object information and target scene information in an image frame without increasing a time delay.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, there is provided an identification method, including: extracting an image frame to be recognized from a video stream; and inputting the image frame to be recognized into a trained recognition model to obtain information of a target object and target scene information output by the recognition model; wherein the recognition model comprises an image feature extraction network for extracting image features of the image frame to be recognized, a detection task branch for outputting detection information of an object in the image frame, a semantic segmentation task branch for outputting semantic information of pixel points in the image frame, and a classification task branch for outputting scene type information of the image frame; the information of the target object is determined according to the detection information and the semantic information, and the target scene information is determined according to the scene type information and the semantic information.
Further, the identification method further comprises: constructing an initial recognition model, wherein the initial recognition model comprises the image feature extraction network, the detection task branch, the semantic segmentation task branch and the classification task branch; inputting a to-be-trained image frame to be recognized into the initial recognition model, and recognizing the to-be-trained image frame and extracting its features by the image feature extraction network to obtain to-be-trained image features; the detection task branch outputs first detection information of an object in the image frame corresponding to the to-be-trained image features, the semantic segmentation task branch outputs first semantic information of pixel points in the image frame corresponding to the to-be-trained image features, and the classification task branch outputs first scene type information of the image frame corresponding to the image features; and correcting the initial recognition model according to the first detection information, the first semantic information and the first scene type information to obtain the trained recognition model.
Further, the modifying the initial recognition model according to the first detection information, the first semantic information and the first scene type information to obtain the trained recognition model includes: determining a detection information loss function value according to the first detection information and first standard detection information; determining a semantic information loss function value according to the first semantic information and first standard semantic information; determining a scene type information loss function value according to the first scene type information and first standard scene type information; and correcting the initial recognition model based on the detection information loss function value, the semantic information loss function value and the scene type information loss function value to obtain the trained recognition model.
Further, the modifying the initial recognition model based on the detection information loss function value, the semantic information loss function value, and the scene type information loss function value to obtain the trained recognition model includes: calculating to obtain a first back propagation value based on the detection information loss function value and a first dynamic modulation factor corresponding to the detection task branch; calculating to obtain a second back propagation value based on the semantic information loss function value and a second dynamic modulation factor corresponding to the semantic segmentation task branch; calculating to obtain a third back propagation value based on the scene type information loss function value and a third dynamic modulation factor corresponding to the classification task branch; and updating configuration parameters in the initial recognition model according to the first back propagation value, the second back propagation value and the third back propagation value to obtain the trained recognition model.
Further, the detection information includes a position of a detection regression frame in the image frame and a category of a prediction object; the semantic information includes the number and positions of pixel points corresponding to the prediction object in the image frame and the semantic type corresponding to the pixel points in the image frame; the information of the target object includes the category and the position of the target object; and the determining the information of the target object according to the detection information and the semantic information includes: determining the number of pixel points corresponding to the prediction object in the detection regression frame according to the position of the detection regression frame; and if the number of pixel points corresponding to the prediction object is greater than the number of pixel points of a preset category object corresponding to the category of the prediction object, determining that the position of the detection regression frame is the position of the target object and the category of the prediction object is the category of the target object.
Further, the semantic types include: a scene type and an object type; the determining the target scene information according to the scene type information and the semantic information includes: if the scene type in the semantic type is the same as the scene type in the scene type information, determining the scene type in the scene type information as the scene type of the target scene; if the scene type in the semantic type is different from the scene type in the scene type information, and the scene probability value corresponding to the scene type is greater than a preset scene probability threshold value, determining the scene type in the scene type information as the scene type of the target scene; and if the scene type in the semantic types is different from the scene type in the scene type information, and the scene probability value corresponding to the scene type is smaller than or equal to the preset scene probability threshold, determining the scene type in the semantic types corresponding to the pixel points in the image characteristics as the scene type of the target scene.
Further, the extracting the image frame to be identified in the video stream includes: acquiring a video stream, wherein the video stream comprises a plurality of image frames; and respectively detecting whether the target object exists in each image frame, and determining the image frame in which the target object is detected as the image frame to be identified.
Further, before the obtaining the trained recognition model, the method further includes: correcting the initial recognition model to obtain a corrected recognition model; and performing quantization-aware training of an INT8 edge computing module on the corrected recognition model to obtain the trained recognition model.
Further, the updating the configuration parameters in the initial recognition model according to the first back propagation value, the second back propagation value and the third back propagation value to obtain the trained recognition model includes: training parameters in the detection task branch according to the first back propagation value to obtain a trained detection task branch; training parameters in the semantic segmentation task branch according to the second back propagation value to obtain a trained semantic segmentation task branch; training parameters in the classification task branch according to the third back propagation value to obtain a trained classification task branch; and obtaining the trained recognition model.
According to an aspect of an embodiment of the present application, there is provided an identification apparatus, including: the extraction module is configured to extract image features of the image frame to be identified; the output module is configured to input the image features to a trained recognition model, and obtain information of a target object and target scene information output by the recognition model; the recognition model comprises a detection task branch used for outputting detection information of an object in the image characteristics, a semantic segmentation task branch used for outputting semantic information of a pixel point in the image characteristics, and a classification task branch used for outputting scene type information in the image characteristics, information of the target object is determined according to the detection information and the semantic information, and the target scene information is determined according to the scene type information and the semantic information.
According to an aspect of an embodiment of the present application, there is provided an electronic device including: a controller; a memory for storing one or more programs that, when executed by the controller, cause the controller to implement the identification method described above.
According to an aspect of embodiments of the present application, there is also provided a computer-readable storage medium having stored thereon computer-readable instructions, which, when executed by a processor of a computer, cause the computer to execute the above-mentioned identification method.
According to an aspect of an embodiment of the present application, there is also provided a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the above-mentioned identification method.
In the technical scheme provided by the embodiment of the application, the image characteristics of the image frame to be recognized are extracted, the image characteristics are input into the trained recognition model, and the information of the target object and the target scene information output by the recognition model are obtained. The recognition model comprises three task branches, a detection task branch for outputting detection information of an object in image characteristics, a semantic segmentation task branch for outputting semantic information of pixel points in the image characteristics and a classification task branch for outputting scene type information in the image characteristics.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 is a schematic illustration of an implementation environment to which the present application relates;
FIG. 2 is a flow chart illustrating a method for intelligent road identification according to an exemplary embodiment of the present application;
FIG. 3 is a flow chart illustrating an identification method according to an exemplary embodiment of the present application;
FIG. 4 is a flowchart illustrating steps for building a recognition model in accordance with another exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a recognition model according to another exemplary embodiment of the present application;
FIG. 6 is a flow chart illustrating a process for modifying a recognition model based on a loss function according to another exemplary embodiment of the present application;
FIG. 7 is a flow chart illustrating a process for modifying a recognition model based on backpropagation values in accordance with another exemplary embodiment of the present application;
FIG. 8 is a diagram illustrating edge-end quantized perceptual training for a recognition model according to another exemplary embodiment of the present application;
FIG. 9 is a schematic diagram illustrating the structure of an identification apparatus in accordance with an exemplary embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer system of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flowcharts shown in the figures are illustrative only and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Reference to "a plurality" in this application means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Referring first to fig. 1, fig. 1 is a schematic diagram of an implementation environment related to the present application. The implementation environment comprises an acquisition terminal 100 and a server 200, wherein the terminal 100 and the server 200 communicate through a wired or wireless network.
The capture terminal 100 has a function of capturing a video stream and can transmit the captured video stream to the server 200. The capture terminal 100 includes, but is not limited to, a video camera, a still camera, a mobile phone, a vehicle-mounted video device, and any other electronic device capable of implementing image visualization, which is not limited herein.
The server 200 may extract an image frame to be recognized from the video stream and then input the image frame to be recognized into the trained recognition model, so as to obtain the information of the target object and the target scene information output by the recognition model. The server 200 may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, where the plurality of servers may form a blockchain and each server is a node on the blockchain; the server 200 may also be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN (Content Delivery Network), and big data and artificial intelligence platforms, which is not limited herein.
In some scenarios, the capturing terminal 100 and the server 200 may be disposed in the same physical device or apparatus. For example, when the recognition method of the present application is used in an intelligent road recognition scenario, the capturing terminal 100 is an onboard video device in a vehicle running on a road, which captures a video stream of the associated road, and the server 200 is located inside the vehicle, as shown in fig. 2, where fig. 2 is a flowchart of an intelligent road recognition method according to an exemplary embodiment of the present application. In the figure, an image frame to be recognized in the vehicle-mounted real-time video stream is extracted and input into the trained recognition model; the recognition model comprises a detection task branch, a semantic segmentation task branch and a classification task branch, the model outputs target obstacle information and road information in the image frame, and the vehicle is controlled according to the related information. The information of the target obstacle includes the position, size, type and the like of the target obstacle, and the road information includes the road type and the like. Road conditions change rapidly while the vehicle is running; the identification method in this embodiment can accurately identify the target obstacle and the road type in real time, and the running parameters of the vehicle can subsequently be controlled according to the identified information. For example, when a target obstacle is identified on the road, the vehicle is controlled to stop within a preset reaction time or immediately; when the road type indicates a road with a different gradient, the gear of the vehicle can be automatically switched. The detailed steps of the identification method executed by the server 200 are described in the following embodiments.
Referring to fig. 3, fig. 3 is a flowchart illustrating an identification method according to an exemplary embodiment of the present application, which may be specifically executed by the server 200 on the base station side in the implementation environment shown in fig. 1. Of course, the method may also be applied to other implementation environments and executed by a server device in other implementation environments, which is not limited in this embodiment. As shown in fig. 3, the method at least includes steps S310 to S320, which are described in detail as follows:
S310: extracting the image frames to be identified from the video stream.
The image frames to be recognized are from a video stream, the video stream in this embodiment is acquired by the acquisition terminal 100, and the video stream is composed of a plurality of image frames.
S320: inputting the image frame to be recognized into the trained recognition model to obtain the information of the target object and the target scene information output by the recognition model; the recognition model comprises an image feature extraction network for extracting image features of the image frame to be recognized, a detection task branch for outputting detection information of an object in the image features, a semantic segmentation task branch for outputting semantic information of pixel points in the image features, and a classification task branch for outputting scene type information of the image features, wherein the information of the target object is determined according to the detection information and the semantic information, and the target scene information is determined according to the scene type information and the semantic information.
The image features are feature maps or feature data obtained by recognizing and extracting features from the image frame to be recognized; the specific representation of the image features is not limited in this embodiment.
The target object of the present embodiment is an object in an image frame to be recognized, for example, in a vehicle road scene, the target object may be a target obstacle in a road. The target scene information is scene information in the image frame to be recognized, and comprises road types, road weather and the like.
For an exemplary illustration of S320, in a vehicle road scene, inputting the extracted image frame to be recognized into a trained recognition model, where the recognition model outputs information of a target obstacle and information of a target road scene in the image frame to be recognized, where the information of the target obstacle includes: the type of the obstacle, the size of the obstacle, the position of the obstacle and the like, and the road scene information comprises: road type, road weather, etc.
Specifically, the model in this embodiment includes three task branches: a detection task branch, a semantic segmentation task branch and a classification task branch. The detection task branch outputs information of obstacles in the image features, the semantic segmentation task branch outputs semantic information of each pixel point in the image features, and the classification task branch outputs road type information in the image features; the information of the target obstacle is determined according to the obstacle information and the semantic information, and the target road scene information is determined according to the road type information and the semantic information.
In this embodiment, the image frame to be recognized in the video stream is extracted and input into the trained recognition model, so that the information of the target object and the target scene information output by the recognition model are obtained. The recognition model comprises an image feature extraction network for extracting image features of the image frame to be recognized, and further comprises three task branches: a detection task branch for outputting detection information of an object in the image frame, a semantic segmentation task branch for outputting semantic information of pixel points in the image frame, and a classification task branch for outputting scene type information of the image frame.
How to optimize the recognition model is a study continuously performed by those skilled in the art, the recognition model of the present application is trained before being used, please refer to fig. 4, fig. 4 is a flowchart of steps of constructing the recognition model according to another exemplary embodiment of the present application, and based on the recognition method shown in fig. 3, at least S410 to S430 are further included, and the following details are introduced:
S410: constructing an initial recognition model, wherein the initial recognition model comprises an image feature extraction network, a detection task branch, a semantic segmentation task branch and a classification task branch.
S410 is exemplarily illustrated as follows: the detection task branch, the semantic segmentation task branch and the classification task branch in the recognition model are constructed. First, an image feature extraction network, namely a feature backbone network, is constructed, which comprises M blocks (network blocks); the downsampling ratio of each block of the feature backbone network may be 1 or 2, and the overall downsampling ratio of the feature backbone network is 8. The detection task branch comprises N detection heads; except that the output resolution of the first detection head is the resolution of the feature backbone network, the resolution of each remaining detection head is 1/2 of that of the detection head it is connected to. The semantic segmentation task branch comprises an S-layer simple neural network, and the classification task branch comprises an L-layer simple neural network; M, N, S and L are all integers. Meanwhile, a training mode based on a polling mechanism is established: during a single training step, only the training data of a single task branch in the recognition model is fed in, the three task branches are trained alternately, and the parameters required by each task branch are updated.
Specifically, as shown in fig. 5, fig. 5 is a schematic structural diagram of a recognition model according to another exemplary embodiment of the present application. The construction of a single block of the feature backbone network comprises: a first convolutional layer, 6 feature layers and a fusion layer. The stride of the first convolutional layer is 1 or 2, the number of channels is c, and the kernel size is 3 × 3; the feature layers are 3 ordinary convolutional layers and 3 dilated convolutional layers, the number of channels is c/2, the dilation scale is d, and the kernel size is 3 × 3; the kernel size of the fusion layer is 1 × 1 and the number of channels is c, where c and d are positive integers.
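To make the block structure concrete, the following is a minimal PyTorch-style sketch of one backbone block. The patent does not specify how the six feature layers feed the fusion layer, so the sequential wiring, the class name and the parameter names below are illustrative assumptions rather than the patented structure:

```python
import torch
import torch.nn as nn

class BackboneBlock(nn.Module):
    """One block of the feature backbone: a first conv layer (stride 1 or 2, c channels,
    3x3 kernel), six feature layers (3 ordinary + 3 dilated convs, c/2 channels,
    dilation d, 3x3 kernel), and a 1x1 fusion layer back to c channels."""
    def __init__(self, in_ch: int, c: int, d: int, stride: int = 1):
        super().__init__()
        self.entry = nn.Conv2d(in_ch, c, 3, stride=stride, padding=1)
        mid = c // 2
        layers, ch = [], c
        # 3 ordinary convolutions followed by 3 dilated convolutions
        for dilation in (1, 1, 1, d, d, d):
            layers.append(nn.Sequential(
                nn.Conv2d(ch, mid, 3, padding=dilation, dilation=dilation),
                nn.BatchNorm2d(mid), nn.ReLU(inplace=True)))
            ch = mid
        self.features = nn.ModuleList(layers)
        self.fuse = nn.Conv2d(mid, c, 1)   # 1x1 fusion layer, c channels

    def forward(self, x):
        x = self.entry(x)
        for layer in self.features:
            x = layer(x)
        return self.fuse(x)
```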
Furthermore, the semantic segmentation task branch is an S-layer simple convolutional neural network, and the last layer of the network uses bilinear interpolation upsampling to return to the input resolution; the classification task branch is an L-layer simple convolutional neural network, whose last layer has k channels and is reduced to k neurons through global pooling, where k is the total number of classification categories; the detection branch is composed of a plurality of detection heads of different scales, and the regression of the anchor boxes is completed through YOLO-like post-processing. S, L and k are integers.
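Building on the block above, the sketch below assembles a backbone with 8× overall downsampling and the three task branches: detection heads at successively halved resolutions, a segmentation branch upsampled back to the input resolution, and a globally pooled classification branch. It reuses the BackboneBlock from the previous sketch; the number of layers per branch, channel counts and anchor layout are assumptions, and YOLO-style box decoding is left to post-processing as in the patent:

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskRecognitionModel(nn.Module):
    """Shared feature backbone feeding a detection branch (N heads), a semantic
    segmentation branch and a classification branch (k scene categories)."""
    def __init__(self, c=64, d=2, num_heads=3, seg_classes=8, k=5, anchors=3):
        super().__init__()
        # Three stride-2 blocks give the 8x overall downsampling of the backbone.
        self.backbone = nn.Sequential(
            BackboneBlock(3, c, d, stride=2),
            BackboneBlock(c, c, d, stride=2),
            BackboneBlock(c, c, d, stride=2))
        # Detection heads: the first works at backbone resolution, each further
        # head works at half the resolution of the head it is connected to.
        self.det_down = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU(inplace=True))
            for _ in range(num_heads - 1))
        self.det_out = nn.ModuleList(
            nn.Conv2d(c, anchors * (5 + k), 1) for _ in range(num_heads))
        # Segmentation branch: simple conv layers, then bilinear upsampling in forward().
        self.seg = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, seg_classes, 1))
        # Classification branch: simple conv layers ending in k channels + global pool.
        self.cls = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, k, 1), nn.AdaptiveAvgPool2d(1))

    def forward(self, x):
        feat = self.backbone(x)
        det, h = [], feat
        for i, out_conv in enumerate(self.det_out):
            if i > 0:
                h = self.det_down[i - 1](h)      # halve resolution for the next head
            det.append(out_conv(h))              # raw anchor-box predictions per scale
        seg = F.interpolate(self.seg(feat), size=x.shape[-2:],
                            mode="bilinear", align_corners=False)
        scene = self.cls(feat).flatten(1)        # (batch, k) scene-type logits
        return det, seg, scene
```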
S420: inputting the to-be-trained image frame to be recognized into the initial recognition model, where the image feature extraction network recognizes the to-be-trained image frame and extracts its features to obtain the to-be-trained image features; the detection task branch outputs first detection information of an object in the image frame corresponding to the to-be-trained image features, the semantic segmentation task branch outputs first semantic information of pixel points in the image frame corresponding to the to-be-trained image features, and the classification task branch outputs first scene type information of the image frame corresponding to the image features.
Illustratively, after the image frame to be recognized to be trained is input into the initial recognition model, the image feature extraction network will extract the image features thereof, the corresponding task branch will output the relevant information in the image frame corresponding to the image feature to be trained, and subsequently, the parameters of the task branch can be corrected according to the respective corresponding information.
S430: and correcting the initial recognition model according to the first detection information, the first semantic information and the first scene information to obtain a trained recognition model.
S430 is exemplarily illustrated as follows: the detection task branch is modified according to the first detection information, the semantic segmentation task branch is modified according to the first semantic information, and the classification task branch is modified according to the first scene type information, so as to obtain the trained recognition model.
This embodiment further defines the step of constructing the initial recognition model, and the trained recognition model is obtained by training each task branch in the initial model. Specifically, the to-be-trained image frame to be recognized is input into the image feature extraction network, the extracted image features are input into the three task branches to obtain the corresponding information, and each task branch is trained according to its corresponding information, so that the training process is more refined and the accuracy of the recognition model is ultimately improved.
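As a concrete illustration of the polling (round-robin) training mode mentioned in the construction step, where each training step feeds data for only one task branch, a simplified loop might look as follows; the data loaders, loss functions and optimizer are placeholders, not names from the patent:

```python
import itertools
import torch

def polling_train(model, loaders, loss_fns, steps, lr=1e-3):
    """Alternate single-task steps: detection, segmentation and classification data
    are fed in turn, and only the corresponding branch loss is back-propagated."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    tasks = itertools.cycle(["det", "seg", "cls"])          # polling order
    iters = {t: iter(loaders[t]) for t in loaders}
    for _ in range(steps):
        task = next(tasks)
        try:
            images, targets = next(iters[task])
        except StopIteration:
            iters[task] = iter(loaders[task])
            images, targets = next(iters[task])
        det, seg, scene = model(images)
        outputs = {"det": det, "seg": seg, "cls": scene}
        loss = loss_fns[task](outputs[task], targets)        # single-task loss only
        opt.zero_grad()
        loss.backward()
        opt.step()
```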
Further, another exemplary embodiment of the present application provides a specific modification manner for the recognition model, and specifically refer to fig. 6, where fig. 6 is a flowchart illustrating a process of modifying the recognition model based on the loss function according to another exemplary embodiment of the present application. Based on the above S430, at least S610 to S620 are further included, and the following details are introduced:
S610: determining a detection information loss function value according to the first detection information and the first standard detection information; determining a semantic information loss function value according to the first semantic information and the first standard semantic information; and determining a scene type information loss function value according to the first scene type information and the first standard scene type information.
Each task branch has a corresponding standard information value, and the first detection information, the first semantic information and the first scene type information are respectively compared with the corresponding standard information values to obtain respective loss function values.
S620: and correcting the initial recognition model based on the detection information loss function value, the semantic information loss function value and the scene type information loss function value to obtain a trained recognition model.
And correcting the parameters of the corresponding task branches according to the corresponding loss function values to achieve the aim of correcting the initial recognition model, thereby obtaining the trained recognition model.
According to the embodiment, the initial recognition model is limited to be corrected according to the loss function value of each task branch, so that the trained recognition model is obtained, and the accuracy of the trained recognition model is higher.
In order to make the recognition model modification more scientific, a back propagation value is introduced in another exemplary embodiment of the present application to assist in modifying the recognition model, and in particular, referring to fig. 7, fig. 7 is a flowchart illustrating a process of modifying the recognition model based on the back propagation value according to another exemplary embodiment of the present application. Based on the above S620, at least S710 to S740 are further included, and the following is introduced in detail:
S710: calculating a first back propagation value based on the detection information loss function value and a first dynamic modulation factor corresponding to the detection task branch.
S720: and calculating to obtain a second back propagation value based on the semantic information loss function value and a second dynamic modulation factor corresponding to the semantic segmentation task branch.
S730: and calculating to obtain a third back propagation value based on the scene type information loss function value and a third dynamic modulation factor corresponding to the classification task branch.
In this embodiment, the back-propagated value is calculated according to the dynamic modulation factor and the loss function value corresponding to each task branch, all the neural network parameters in the recognition model involved are updated through the back propagation mechanism, and the neural network parameters obtained through training can be used to handle the three tasks simultaneously during inference.
Specifically, the relationship between the dynamic modulation factor and the loss function value is Lb_i = (1/Ls_i) × Lc_i, where Lb_i is the loss function value of the i-th task propagated backward in the multi-task network, Ls_i is the loss function value after the single-task network is fully converged, and Lc_i is the loss function value calculated at the current step of the task.
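Read this way, the dynamic modulation factor for task i is 1/Ls_i, so the value actually back-propagated is the current loss normalized by the converged single-task loss. A small sketch of that weighting (task names and numeric values are illustrative):

```python
def modulated_losses(current, converged):
    """Return Lb_i = Lc_i / Ls_i for each task: the current loss Lc_i scaled by the
    dynamic modulation factor 1/Ls_i (Ls_i = loss of the fully converged single-task net)."""
    return {task: current[task] / converged[task] for task in current}

# Hypothetical usage with torch tensors as losses:
#   lb = modulated_losses({"det": det_loss, "seg": seg_loss, "cls": cls_loss},
#                         {"det": 0.60, "seg": 0.45, "cls": 0.10})
#   sum(lb.values()).backward()   # updates shared and branch parameters together
```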
S740: and updating configuration parameters in the initial recognition model according to the first back propagation value, the second back propagation value and the third back propagation value to obtain a trained recognition model.
Illustratively, the configuration parameters in the initial model are updated by using the back propagation values corresponding to the task branches, so as to obtain the trained recognition model. In particular, as shown in fig. 8, fig. 8 is a schematic diagram of performing edge-end quantization-aware training on the recognition model according to another exemplary embodiment of the present application. The application creatively sends the board-end INT8 result to the PC-end FP32 model to calculate the loss function, so that the neural network parameters obtained by training are quantized through post-training quantization and training-aware quantization, and an INT8 tensor operation model that can be deployed in an edge computing processor is obtained. Using the INT8 tensor operation model, detection information, semantic information and scene type information are obtained while the image frame information in the video stream is processed through edge computing; the information of the target object is determined according to the detection information and the semantic information, and the target scene information is determined according to the classification information and the semantic information. Specifically, the image feature samples in the image feature sample set are input into the INT8 tensor operation model to obtain a prediction result, the prediction result is compared with the preset standard values in the image feature sample set to obtain a loss function value, the corresponding back propagation value is obtained according to the loss function value to update the parameters, and an FP32 model is obtained and serves as the trained recognition model of this embodiment.
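The board-end INT8 / PC-end FP32 loop described here is specific to the patent's edge-computing toolchain. Purely as an analogy, quantization-aware fine-tuning with the PyTorch eager-mode quantization API could be sketched as below; it assumes the model already wraps its inputs and outputs with QuantStub/DeQuantStub, and the function and argument names are illustrative:

```python
import torch
import torch.quantization as tq

def quantization_aware_finetune(model, loader, loss_fn, epochs=3, lr=1e-4):
    """Insert fake-quantization observers, fine-tune in FP32 while simulating INT8
    rounding, then convert to an INT8 model deployable on an edge processor."""
    model.train()
    model.qconfig = tq.get_default_qat_qconfig("fbgemm")
    qat_model = tq.prepare_qat(model)
    opt = torch.optim.SGD(qat_model.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, targets in loader:
            opt.zero_grad()
            loss_fn(qat_model(frames), targets).backward()
            opt.step()
    qat_model.eval()
    return tq.convert(qat_model)   # INT8 tensor-operation model
```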
This embodiment further introduces back propagation values to correct the parameters of the initial recognition model: the corresponding back propagation value is determined according to the loss function value, and the parameters of the initial recognition model are finally corrected according to the back propagation value to obtain the trained recognition model, so that the model correction process is more theoretically sound and the recognition model is more accurate.
In another exemplary embodiment, updating the configuration parameters in the initial recognition model according to the first back propagation value, the second back propagation value and the third back propagation value to obtain the trained recognition model includes: training parameters in the detection task branch according to the first back propagation value to obtain a trained detection task branch; training parameters in the semantic segmentation task branch according to the second back propagation value to obtain a trained semantic segmentation task branch; training parameters in the classification task branch according to the third back propagation value to obtain a trained classification task branch; and obtaining the trained recognition model.
In the present application, the parameters in the corresponding task branches are trained according to the back propagation values; training the parameters of multiple tasks in this way saves model optimization time, and the trained task branches are updated in the recognition model, so that the trained recognition model is obtained, the training is more refined, and the recognition accuracy of the trained recognition model is higher.
In another exemplary embodiment, how to determine the category of the target object is further defined. The detection information includes the position of a detection regression frame in the image frame and the category of a prediction object, and the semantic information includes the number and positions of pixel points corresponding to the prediction object in the image frame and the semantic type corresponding to the pixel points in the image frame. On the basis of S320, the steps specifically include S810 to S820, which are described in detail below:
S810: determining the number of pixel points corresponding to the prediction object in the detection regression frame according to the position of the detection regression frame.
The detection regression frame of the present embodiment is a target frame in the image features included in the detection information, and is used to frame the prediction object, that is, to determine the position of the prediction object in the image features. Given the position of the detection regression frame in the image features, the pixel points corresponding to the prediction object inside the detection regression frame are counted to obtain the number of pixel points corresponding to the prediction object in the regression frame.
Illustratively, the detection information output by the detection task branch includes a detection regression frame Bi(X1, Y1, X2, Y2) in the image features and the category C of the prediction object. Based on the coordinates of the detection regression frame Bi, the position of the C-type prediction object in the image features and the area of the framed region can be determined, and the number of pixel points of the C-type prediction object is obtained by counting the pixel points of the C-type prediction object within the framed region.
S820: and if the number of the pixel points corresponding to the prediction object is larger than that of the pixel points of the preset category object corresponding to the category of the prediction object, determining that the position of the detection regression frame is the position of the target object, and determining that the category of the prediction object is the category of the target object.
Illustratively, the number of pixels of the preset category object corresponding to the C-type prediction object is 10, and if the number of pixels corresponding to the C-type prediction object is 15, that is, the number of pixels corresponding to the prediction object is greater than the number of pixels of the preset category object corresponding to the category of the prediction object, it can be determined that the C-type prediction object is the target object, and the position of the regression frame is detected as the position of the target object.
The embodiment defines the information contained in the detection information, and further clarifies how to determine the category of the target object by using the position of the detection regression frame and the category of the prediction object in the image features included in the detection information, and meanwhile, the exact position of the target object in the image frame to be recognized can be determined, so that the recognition process of the target object is more accurate.
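A compact illustration of this fusion rule, using the per-pixel class map from the segmentation branch to corroborate the detection branch's regression frame (the function and variable names are placeholders, not from the patent):

```python
import numpy as np

def confirm_target_object(box, pred_class, seg_map, min_pixels):
    """box = (x1, y1, x2, y2) from the detection branch; seg_map is the per-pixel
    class map from the segmentation branch; min_pixels maps each category to its
    preset pixel-count threshold."""
    x1, y1, x2, y2 = box
    region = seg_map[y1:y2, x1:x2]
    count = int(np.count_nonzero(region == pred_class))
    if count > min_pixels[pred_class]:
        return {"category": pred_class, "position": box}   # detection confirmed
    return None   # not enough supporting pixels; discard the detection
```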
In another exemplary embodiment, how to determine the scene type of the target scene is further defined, based on S320, the semantic types include: a scene type and an object type; the scene type information includes a scene type in the image frame and a scene probability value corresponding to the scene type, and S320 includes S910 to S930, which are described in detail below:
S910: if the scene type in the semantic types is the same as the scene type in the scene type information, determining that the scene type in the scene type information is the scene type of the target scene.
Exemplarily, if the scene type in the semantic types is D and the scene type in the scene type information is also D, the scene type of the target scene output by the recognition model last is D.
S920: and if the scene type in the semantic type is different from the scene type in the scene type information, and the scene probability value corresponding to the scene type is greater than a preset scene probability threshold value, determining the scene type in the scene type information as the scene type of the target scene.
The preset scene probability threshold in this embodiment is a threshold preset in the recognition model, and is used to determine the scene type of the target scene, and the setting process of the preset scene probability threshold and the size of the threshold are not specifically limited in this embodiment.
Exemplarily, if the scene type in the scene type information is B, the corresponding scene probability of B is 0.5, the scene type in the semantic type is a, and the preset scene probability threshold is 0.4, it is determined that the scene type of the target scene is B.
S930: and if the scene type in the semantic type is different from the scene type in the scene type information, and the scene probability value corresponding to the scene type is smaller than or equal to a preset scene probability threshold value, determining the scene type in the semantic type corresponding to the pixel point in the image feature as the scene type of the target scene.
Exemplarily, if a scene type in the scene type information is B, a corresponding probability of the B scene is 0.4, a scene type in the semantic type is a, and a preset scene probability threshold is 0.4, it is determined that the scene type of the target scene is a.
In the embodiment, the scene type in the semantic type is compared with the scene type in the scene type information in similarity, and the preset scene probability threshold is introduced, so that the scene type of the target scene in the image frame to be identified is accurately determined.
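The three rules S910–S930 amount to a small decision function; a sketch with illustrative names:

```python
def decide_scene_type(semantic_scene, cls_scene, cls_prob, threshold):
    """Fuse the scene type from the segmentation branch (semantic_scene) with the
    scene type and probability from the classification branch (cls_scene, cls_prob)."""
    if semantic_scene == cls_scene:
        return cls_scene          # S910: the two branches agree
    if cls_prob > threshold:
        return cls_scene          # S920: classification branch is confident enough
    return semantic_scene         # S930: fall back to the per-pixel semantic result
```

With the numbers from the examples above, decide_scene_type("A", "B", 0.5, 0.4) returns "B", and decide_scene_type("A", "B", 0.4, 0.4) returns "A".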
In another exemplary embodiment, the acquisition mode of the image frame to be recognized is defined. On the basis of S310, the method further includes S1010 to S1020, which are described in detail below:
S1010: a video stream is acquired, the video stream including a plurality of image frames.
S1020: whether a target object exists in each image frame is detected respectively, and the image frame with the detected target object is determined as an image frame to be identified.
For example, if an image frame is blank, that is, it contains no object at all, it certainly contains no target object; if such a frame were still fed to the recognition model for target object and scene type identification, the data processing load of the recognition model would be needlessly increased.
The embodiment performs the pre-detection on each image frame in the video stream, avoids useless/meaningless identification of the identification model caused by inputting useless image frames into the identification model as image frames to be identified, avoids the waste of data processing resources of the identification model, and improves the accuracy of the identification result.
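A minimal sketch of this pre-screening step with OpenCV, where has_target_object stands for any lightweight pre-detector (both names are placeholders):

```python
import cv2

def frames_to_recognize(video_path, has_target_object):
    """Read the video stream frame by frame and keep only frames in which the
    pre-detector reports a target object; only those frames go to the recognition model."""
    selected = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if has_target_object(frame):    # skip blank/irrelevant frames entirely
            selected.append(frame)
    cap.release()
    return selected
```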
Another aspect of the present application also provides an identification device, as shown in fig. 9, where fig. 9 is a schematic diagram illustrating a structure of an identification device according to an exemplary embodiment of the present application. Wherein, the recognition device includes:
an extracting module 910 configured to extract an image frame to be identified in a video stream.
An output module 930 configured to input the image frame to be recognized into the trained recognition model, so as to obtain information of the target object and the target scene information output by the recognition model; the identification model comprises an image feature extraction network for identifying and extracting image features of an image frame to be identified, a detection task branch for outputting detection information of an object in the image frame, a semantic segmentation task branch for outputting semantic information of pixel points in the image frame, and a classification task branch for outputting scene type information in the image frame, information of a target object is determined according to the detection information and the semantic information, and target scene information is determined according to the scene type information and the semantic information.
In another embodiment, the identification means further comprises:
the system comprises a construction module and a classification module, wherein the construction module is configured to construct an initial recognition model, and the initial recognition model comprises an image feature extraction network, a detection task branch, a semantic segmentation task branch and a classification task branch.
The training module is configured to input the image frames to be recognized to be trained into the initial recognition model, and the image feature extraction network recognizes and extracts features of the image frames to be recognized to be trained to obtain the features of the image to be trained; the detection task branch outputs first detection information of an object in an image frame corresponding to the image feature to be trained, the semantic segmentation task branch outputs first semantic information of a pixel point in the image frame corresponding to the image feature to be trained, and the classification task branch outputs first scene type information in the image frame corresponding to the image feature.
And the correction module is configured to correct the initial recognition model according to the first detection information, the first semantic information and the first scene information to obtain a trained recognition model.
In another embodiment, the correction module includes:
a loss function value determination unit configured to determine a detection information loss function value from the first detection information and the first standard detection information; determining a semantic information loss function value according to the first semantic information and the first semantic standard information; and determining a scene type information loss function value according to the first scene type information and the first standard scene type information.
And the loss function correction unit is configured to correct the initial recognition model based on the detection information loss function value, the semantic information loss function value and the scene type information loss function value to obtain a trained recognition model.
In another embodiment, the loss function modification unit includes:
and the first plate is configured to calculate a first back propagation value based on the detection information loss function value and a first dynamic modulation factor corresponding to the detection task branch.
And the second plate is configured to calculate a second back propagation value based on the semantic information loss function value and a second dynamic modulation factor corresponding to the semantic segmentation task branch.
And the third plate is configured to calculate a third back propagation value based on the scene type information loss function value and a third dynamic modulation factor corresponding to the classification task branch.
And the updating plate is configured to update the configuration parameters in the initial recognition model according to the first back propagation value, the second back propagation value and the third back propagation value to obtain the trained recognition model.
In another embodiment, the output module 930 includes:
and the pixel determining unit is configured to determine the number of pixel points corresponding to the prediction object in the detection regression frame according to the position of the detection regression frame.
And the category determination unit of the target object is configured to determine that the position of the detection regression frame is the position of the target object and the category of the prediction object is the category of the target object if the number of the pixel points corresponding to the prediction object is greater than the number of the pixel points of the preset category object corresponding to the category of the prediction object.
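A minimal sketch of this pixel-count check, assuming a NumPy per-pixel semantic map and a hypothetical per-category threshold table `min_pixels_per_class`:

```python
import numpy as np

def confirm_target_object(box, pred_class, semantic_map, min_pixels_per_class):
    """Counts pixels of the predicted class inside the detection regression box and
    keeps the detection only when the count exceeds the per-category preset number.
    `semantic_map` is an H x W array of per-pixel class ids; all names are hypothetical."""
    x1, y1, x2, y2 = box
    region = semantic_map[y1:y2, x1:x2]
    pixel_count = int(np.count_nonzero(region == pred_class))
    if pixel_count > min_pixels_per_class[pred_class]:
        return {"position": box, "category": pred_class}
    return None  # detection is discarded
```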
In another embodiment, semantic types include: a scene type and an object type; the output module 930 includes:
a first type unit configured to determine that the scene type in the scene type information is the scene type of the target scene if the scene type in the semantic type is the same as the scene type in the scene type information.
And the second type unit is configured to determine that the scene type in the scene type information is the scene type of the target scene if the scene type in the semantic type is different from the scene type in the scene type information and the scene probability value corresponding to the scene type is greater than a preset scene probability threshold value.
And the third type unit is configured to determine that the scene type in the semantic types corresponding to the pixel points in the image features is the scene type of the target scene if the scene type in the semantic types is different from the scene type in the scene type information and the scene probability value corresponding to the scene type is less than or equal to a preset scene probability threshold value.
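The three-way decision above can be summarized in a few lines; the function below is a sketch with hypothetical names, not the claimed implementation.

```python
def decide_scene_type(semantic_scene, classified_scene, scene_probability, threshold):
    """Same type -> take the classification branch's scene; different but the
    classifier is confident -> still take the classifier's scene; different and
    unconfident -> fall back to the per-pixel semantic scene type."""
    if semantic_scene == classified_scene:
        return classified_scene
    if scene_probability > threshold:
        return classified_scene
    return semantic_scene
```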
In another embodiment, the extraction module 910 includes:
an acquisition unit configured to acquire a video stream, the video stream including a plurality of image frames.
And the detection unit is configured to respectively detect whether the target object exists in each image frame and determine the image frame in which the target object is detected as the image frame to be identified.
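A rough sketch of this extraction step, assuming OpenCV for reading the video stream and an arbitrary detector callable `has_target`:

```python
import cv2

def extract_frames_to_recognize(video_path: str, has_target) -> list:
    """Reads a video stream and keeps only the frames in which a target object is
    detected; `has_target` is any detector callable returning True or False."""
    frames = []
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:
            break                      # end of the video stream
        if has_target(frame):
            frames.append(frame)       # frame to be recognized
    capture.release()
    return frames
```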
In another embodiment, the identification means further comprises:
and the correcting module is configured to correct the initial recognition model to obtain a corrected recognition model.
And the quantization module is configured to perform quantization-aware training for an INT8 edge computing module on the corrected recognition model to obtain the trained recognition model.
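A rough sketch of INT8 quantization-aware training using PyTorch's eager-mode quantization API (torch.ao.quantization) is shown below; a deployable pipeline would additionally need quant/dequant stubs and operator fusion, which are omitted here, and the training callback `train_one_epoch` is a hypothetical placeholder.

```python
import copy
from torch.ao.quantization import convert, get_default_qat_qconfig, prepare_qat

def int8_quantization_aware_training(model, train_one_epoch, epochs: int = 3):
    """Rough sketch of INT8 quantization-aware training for edge deployment."""
    qat_model = copy.deepcopy(model).train()
    qat_model.qconfig = get_default_qat_qconfig("fbgemm")
    qat_model = prepare_qat(qat_model)      # insert fake-quantization observers
    for _ in range(epochs):
        train_one_epoch(qat_model)          # ordinary training loop on the QAT model
    qat_model.eval()
    return convert(qat_model)               # materialize the INT8 model
```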
It should be noted that the identification apparatus provided in the foregoing embodiment and the identification method provided in the foregoing embodiment belong to the same concept, and specific ways of performing operations by each module and unit have been described in detail in the method embodiment, and are not described again here.
Another aspect of the present application also provides an electronic device, including: a controller; and a memory for storing one or more programs which, when executed by the controller, cause the controller to perform the identification method in the embodiments described above.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a computer system suitable for implementing the electronic device according to an exemplary embodiment of the present application.
It should be noted that the computer system 1000 of the electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU)1001 that can perform various appropriate actions and processes, such as performing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage portion 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for system operation are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other via a bus 1004. An Input/Output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input portion 1006 including a keyboard, a mouse, and the like; an output section 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 1008 including a hard disk and the like; and a communication section 1009 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The driver 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is mounted into the storage section 1008 as necessary.
In particular, according to embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1009 and/or installed from the removable medium 1011. When the computer program is executed by the Central Processing Unit (CPU) 1001, the various functions defined in the system of the present application are executed.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with a computer program embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The computer program embodied on the computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
Another aspect of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the identification method as before. The computer-readable storage medium may be included in the electronic device described in the above embodiment, or may exist separately without being incorporated in the electronic device.
Another aspect of the application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the identification method provided in the above embodiments.
According to an aspect of an embodiment of the present application, there is also provided a computer system including a Central Processing Unit (CPU) that can perform various appropriate actions and processes, such as performing the method in the above-described embodiment, according to a program stored in a Read-Only Memory (ROM) or a program loaded from a storage portion into a Random Access Memory (RAM). In the RAM, various programs and data necessary for system operation are also stored. The CPU, ROM, and RAM are connected to each other via a bus. An Input/Output (I/O) interface is also connected to the bus.
The following components are connected to the I/O interface: an input section including a keyboard, a mouse, and the like; an output section including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section including a hard disk and the like; and a communication section including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section performs communication processing via a network such as the internet. The drive is also connected to the I/O interface as needed. A removable medium such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive as needed, so that the computer program read out therefrom is mounted into the storage section as needed.
The above description is only a preferred exemplary embodiment of the present application, and is not intended to limit the embodiments of the present application, and those skilled in the art can easily make various changes and modifications according to the main concept and spirit of the present application, so that the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An identification method, characterized in that the identification method comprises:
extracting an image frame to be identified in a video stream;
inputting an image frame to be recognized into a trained recognition model to obtain information of a target object and target scene information output by the recognition model; the identification model comprises an image feature extraction network for identifying and extracting image features of the image frame to be identified, a detection task branch for outputting detection information of an object in the image frame, a semantic segmentation task branch for outputting semantic information of pixel points in the image frame, and a classification task branch for outputting scene type information in the image frame, information of the target object is determined according to the detection information and the semantic information, and the target scene information is determined according to the scene type information and the semantic information.
2. The identification method according to claim 1, characterized in that the identification method further comprises:
constructing an initial identification model, wherein the initial identification model comprises the image feature extraction network, the detection task branch, the semantic segmentation task branch and the classification task branch;
inputting the image frames to be recognized to be trained into the initial recognition model, and recognizing the image frames to be recognized to be trained and performing feature extraction by the image feature extraction network to obtain the features of the images to be trained; the detection task branch outputs first detection information of an object in an image frame corresponding to the image feature to be trained, the semantic segmentation task branch outputs first semantic information of a pixel point in the image frame corresponding to the image feature to be trained, and the classification task branch outputs first scene type information in the image frame corresponding to the image feature;
and correcting the initial recognition model according to the first detection information, the first semantic information and the first scene type information to obtain the trained recognition model.
3. The recognition method according to claim 2, wherein the correcting the initial recognition model according to the first detection information, the first semantic information and the first scene type information to obtain the trained recognition model comprises:
determining a detection information loss function value according to the first detection information and the first standard detection information; determining a semantic information loss function value according to the first semantic information and the first semantic standard information; determining a scene type information loss function value according to the first scene type information and the first standard scene type information;
and correcting the initial recognition model based on the detection information loss function value, the semantic information loss function value and the scene type information loss function value to obtain the trained recognition model.
4. The recognition method according to claim 3, wherein the correcting the initial recognition model based on the detection information loss function value, the semantic information loss function value and the scene type information loss function value to obtain the trained recognition model comprises:
calculating to obtain a first back propagation value based on the detection information loss function value and a first dynamic modulation factor corresponding to the detection task branch;
calculating to obtain a second back propagation value based on the semantic information loss function value and a second dynamic modulation factor corresponding to the semantic segmentation task branch;
calculating to obtain a third back propagation value based on the scene type information loss function value and a third dynamic modulation factor corresponding to the classification task branch;
and updating configuration parameters in the initial recognition model according to the first back propagation value, the second back propagation value and the third back propagation value to obtain the trained recognition model.
5. The identification method according to claim 1, wherein the detection information includes a position of a detection regression box in the image frame and a category of a prediction object, the semantic information includes the number and positions of pixel points corresponding to the prediction object in the image frame and semantic types corresponding to the pixel points in the image frame, and the information of the target object includes the category and the position of the target object; the determining the information of the target object according to the detection information and the semantic information comprises:
determining the number of pixel points corresponding to the prediction object in the detection regression box according to the position of the detection regression box;
and if the number of the pixel points corresponding to the prediction object is greater than the number of pixel points of a preset category object corresponding to the category of the prediction object, determining that the position of the detection regression box is the position of the target object and that the category of the prediction object is the category of the target object.
6. The recognition method of claim 5, wherein the semantic types include: a scene type and an object type; the determining the target scene information according to the scene type information and the semantic information includes:
if the scene type in the semantic type is the same as the scene type in the scene type information, determining the scene type in the scene type information as the scene type of the target scene;
if the scene type in the semantic type is different from the scene type in the scene type information, and the scene probability value corresponding to the scene type is greater than a preset scene probability threshold value, determining the scene type in the scene type information as the scene type of the target scene;
and if the scene type in the semantic types is different from the scene type in the scene type information, and the scene probability value corresponding to the scene type is smaller than or equal to the preset scene probability threshold, determining the scene type in the semantic types corresponding to the pixel points in the image characteristics as the scene type of the target scene.
7. The identification method according to any one of claims 1 to 6, wherein the extracting the image frame to be identified in the video stream comprises:
acquiring a video stream, wherein the video stream comprises a plurality of image frames;
and respectively detecting whether the target object exists in each image frame, and determining the image frame in which the target object is detected as the image frame to be identified.
8. The recognition method of claim 3, wherein prior to said obtaining the trained recognition model, the method further comprises:
correcting the initial recognition model to obtain a corrected recognition model;
and performing quantization-aware training for an INT8 edge computing module on the corrected recognition model to obtain the trained recognition model.
9. An identification device, comprising:
the extraction module is configured to extract an image frame to be identified in the video stream;
the output module is configured to input the image characteristics of the image frame to be recognized into a trained recognition model, and obtain the information of a target object and the target scene information output by the recognition model; the identification model comprises a detection task branch used for outputting detection information of an object in the image frame, a semantic segmentation task branch used for outputting semantic information of a pixel point in the image frame and a classification task branch used for outputting scene type information in the image frame, information of the target object is determined according to the detection information and the semantic information, and the target scene information is determined according to the scene type information and the semantic information.
10. A computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor of a computer, cause the computer to perform the identification method of any one of claims 1 to 7.
CN202210569816.0A 2022-05-24 2022-05-24 Identification method and device, equipment and computer readable storage medium Pending CN114926766A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210569816.0A CN114926766A (en) 2022-05-24 2022-05-24 Identification method and device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210569816.0A CN114926766A (en) 2022-05-24 2022-05-24 Identification method and device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114926766A true CN114926766A (en) 2022-08-19

Family

ID=82811460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210569816.0A Pending CN114926766A (en) 2022-05-24 2022-05-24 Identification method and device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114926766A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115690615A (en) * 2022-10-11 2023-02-03 杭州视图智航科技有限公司 Deep learning target identification method and system for video stream
CN115690615B (en) * 2022-10-11 2023-11-03 杭州视图智航科技有限公司 Video stream-oriented deep learning target recognition method and system
CN115661821A (en) * 2022-12-22 2023-01-31 摩尔线程智能科技(北京)有限责任公司 Loop detection method, loop detection device, electronic apparatus, storage medium, and program product
CN115953239A (en) * 2023-03-15 2023-04-11 无锡锡商银行股份有限公司 Surface examination video scene evaluation method based on multi-frequency flow network model
CN116245931A (en) * 2023-03-16 2023-06-09 如你所视(北京)科技有限公司 Method, device, equipment, medium and product for determining object attribute parameters
CN116245931B (en) * 2023-03-16 2024-04-19 如你所视(北京)科技有限公司 Method, device, equipment, medium and product for determining object attribute parameters
CN116342316A (en) * 2023-05-31 2023-06-27 青岛希尔信息科技有限公司 Accounting and project financial management system and method

Similar Documents

Publication Publication Date Title
CN114926766A (en) Identification method and device, equipment and computer readable storage medium
CN110378264B (en) Target tracking method and device
CN111242097B (en) Face recognition method and device, computer readable medium and electronic equipment
CN110472599B (en) Object quantity determination method and device, storage medium and electronic equipment
CN114549369B (en) Data restoration method and device, computer and readable storage medium
CN112037142A (en) Image denoising method and device, computer and readable storage medium
CN111563398A (en) Method and device for determining information of target object
CN115082752A (en) Target detection model training method, device, equipment and medium based on weak supervision
CN112597995B (en) License plate detection model training method, device, equipment and medium
CN110796003B (en) Lane line detection method and device and electronic equipment
CN115083008A (en) Moving object detection method, device, equipment and storage medium
CN114742996A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN116310993A (en) Target detection method, device, equipment and storage medium
CN111652181A (en) Target tracking method and device and electronic equipment
CN112149698A (en) Method and device for screening difficult sample data
CN115345782A (en) Image processing method, image processing apparatus, computer, readable storage medium, and program product
CN114973173A (en) Method and device for classifying driving scene data, electronic equipment and storage medium
CN115375739A (en) Lane line generation method, apparatus, and medium
CN113963310A (en) People flow detection method and device for bus station and electronic equipment
CN112052863B (en) Image detection method and device, computer storage medium and electronic equipment
CN115311680A (en) Human body image quality detection method and device, electronic equipment and storage medium
CN114419018A (en) Image sampling method, system, device and medium
CN114332993A (en) Face recognition method and device, electronic equipment and computer readable storage medium
CN112084954A (en) Video target detection method and device, electronic equipment and storage medium
CN112580633A (en) Public transport passenger flow statistical device and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination