CN113743258A - Target identification method, target identification device, electronic equipment and computer-readable storage medium - Google Patents

Target identification method, target identification device, electronic equipment and computer-readable storage medium

Info

Publication number
CN113743258A
Authority
CN
China
Prior art keywords
target
current frame
recognized
frame
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110963118.4A
Other languages
Chinese (zh)
Inventor
龙思源
张圆
殷保才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202110963118.4A priority Critical patent/CN113743258A/en
Publication of CN113743258A publication Critical patent/CN113743258A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a target identification method, a target identification device, an electronic device, and a computer-readable storage medium. The target identification method includes: performing feature extraction on the target to be recognized in the current frame based on reference feature information of the same target in a reference frame, the reference frame being the frame preceding the current frame, to obtain the feature information of the target to be recognized in the current frame; and recognizing the target from that feature information. The method can reduce background interference and improve the accuracy of target recognition.

Description

Target identification method, target identification device, electronic equipment and computer-readable storage medium
Technical Field
The present invention relates to the field of digital image processing technologies, and in particular, to a target recognition method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
With the development of sensor technology, more natural human-computer interaction methods and systems are attracting increasing attention. Compared with the traditional key-based interaction mode, gesture interaction has notable advantages in many scenes. For example, in a smart-home scene, a television's volume can be adjusted and its channel changed without a remote control, and a driver in an intelligent cockpit can adjust the volume without diverting attention from the road.
However, existing end-to-end gesture recognition schemes suffer from delay in recognizing the target gesture and are susceptible to background interference.
Disclosure of Invention
The invention provides a target identification method, a target identification device, an electronic device, and a computer-readable storage medium.
In order to solve the above technical problems, a first technical solution provided by the present invention is: provided is a target recognition method including: performing feature extraction on the target to be recognized in the current frame based on reference feature information of the target to be recognized in the reference frame to obtain feature information of the target to be recognized in the current frame; the reference frame is a previous frame of the current frame; and identifying the characteristic information of the target to be identified.
The step of extracting the features of the target to be recognized in the current frame based on the reference feature information of the target to be recognized in the reference frame to obtain the feature information of the target to be recognized in the current frame includes: extracting the features of the target to be identified in the current frame to obtain current feature information; and splicing at least part of the reference characteristic information of the target to be identified in the reference frame with at least part of the current characteristic information of the target to be identified in the current frame to obtain the characteristic information of the target to be identified in the current frame.
The step of splicing at least part of the reference feature information of the target to be identified in the reference frame with at least part of the current feature information of the target to be identified in the current frame to obtain the feature information of the target to be identified in the current frame includes: splicing the (N-1)/N features of the feature channels corresponding to the current feature information with the 1/N features at the corresponding feature-channel positions of the reference feature information to obtain the feature information of the target to be identified in the current frame; where N is the number of two-dimensional convolution modules, and the N two-dimensional convolution modules are sequentially cascaded.
The step of splicing the (N-1)/N features of the feature channels corresponding to the current feature information with the 1/N features at the corresponding feature-channel positions of the reference feature information to obtain the feature information of the target to be identified in the current frame further includes: in response to T < N, taking the current feature information as the feature information of the target to be identified in the current frame; wherein T is the frame number of the current frame.
The step of extracting the features of the target to be recognized in the current frame based on the reference feature information of the target to be recognized in the reference frame to obtain the feature information of the target to be recognized in the current frame includes: and in response to the current frame comprising a plurality of targets, performing similarity comparison on current feature information of the plurality of targets and the reference feature information to determine the target to be identified in the current frame.
Before the step of extracting the features of the target to be recognized in the current frame based on the reference feature information of the target to be recognized in the reference frame to obtain the feature information of the target to be recognized in the current frame, the method further includes: detecting a target to be recognized on an original image of a current frame to obtain the characteristic of the target to be recognized, the background characteristic corresponding to the target to be recognized and the position characteristic of the target to be recognized; and splicing the characteristics of the target to be recognized, the background characteristics corresponding to the target to be recognized and the position characteristics of the target to be recognized to obtain the current frame.
Wherein the target to be recognized is a gesture.
In order to solve the above technical problems, a second technical solution provided by the present invention is: there is provided an object recognition apparatus comprising: the characteristic extraction module is used for extracting the characteristics of the target to be identified in the current frame based on the reference characteristic information of the target to be identified in the reference frame so as to obtain the characteristic information of the target to be identified in the current frame; the reference frame is a previous frame of the current frame; and the identification module is used for identifying the target to be identified according to the characteristic information.
In order to solve the above technical problems, a third technical solution provided by the present invention is: the method for training the target recognition model is provided, the target recognition model comprises a target detection model and a target classification model which are sequentially cascaded, the target recognition model is used for realizing the target recognition method, and the training method comprises the following steps: acquiring a training sample set, wherein a position frame of a detection target is marked on the training sample set; training a target detection model by using the training sample set; processing the training sample set by using the trained target detection model to obtain the characteristic information of the detection target; and training a target classification model by using the characteristic information of the detection target and the preset class information of the detection target.
Wherein, the step of processing the training sample set by using the trained target detection model to obtain the characteristic information of the detection target comprises: processing the training sample set by using the trained target detection model to obtain feature information corresponding to a position frame area of the detection target and feature information corresponding to a background area of the detection target; the step of training the target classification model by using the feature information of the detection target and the preset category information of the detection target comprises the following steps: splicing the characteristic information corresponding to the position frame area of the detection target with the characteristic information corresponding to the background area of the detection target to obtain spliced characteristic information; and training a target classification model by using the spliced characteristic information and the preset class information of the detection target.
In order to solve the above technical problems, a fourth technical solution provided by the present invention is: providing a training device of a target recognition model, wherein the training device of the target recognition model is used for training the target recognition model, the target recognition model comprises a target detection model and a target classification model which are sequentially cascaded, and the target recognition model is used for realizing the target recognition method; the training apparatus includes: the acquisition module is used for acquiring a training sample set, and the training sample set is marked with a position frame of a detection target; the first training module is used for training a target detection model by utilizing the training sample set; the processing module is used for processing the training sample set by using the trained target detection model to obtain the characteristic information of the detection target; and the second training module is used for training a target classification model by utilizing the characteristic information of the detection target and the preset class information of the detection target.
In order to solve the above technical problems, a fifth technical solution provided by the present invention is: provided is an electronic device including: a memory storing program instructions and a processor retrieving the program instructions from the memory to perform any of the above object recognition methods; and/or the method of training the object recognition model of any one of the above.
In order to solve the above technical problems, a sixth technical solution provided by the present invention is: providing a computer readable storage medium storing a program file executable to implement the object recognition method of any one of the above; and/or the method of training the object recognition model of any one of the above.
A beneficial effect of the method, distinguishing it from the prior art, is that feature extraction is performed on the target to be recognized in the current frame based on the reference feature information of that target in the reference frame, so as to obtain the feature information of the target in the current frame, which is then recognized. By combining the reference feature information with the feature information of the current frame, the method retains the temporal information of the features and can improve the accuracy of target recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention; other drawings can be derived from them by those skilled in the art without inventive effort. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating a first embodiment of a target identification method according to the present application;
FIG. 2 is a flowchart illustrating an embodiment of step S11 in FIG. 1;
FIG. 3 is a schematic diagram of a cascade architecture of a feature translation network according to the present application;
FIG. 4 is a schematic structural diagram of an embodiment of an object recognition device of the present application;
FIG. 5 is a flowchart illustrating an embodiment of a method for training a target recognition model according to the present application;
FIG. 6 is a schematic structural diagram of an embodiment of a training apparatus for a target recognition model according to the present application;
FIG. 7 is a schematic structural diagram of an embodiment of an electronic device of the present application;
fig. 8 is a schematic structural diagram of a computer-readable storage medium according to the present application.
Detailed Description
With the development of sensor technology, more natural human-computer interaction methods and systems are attracting increasing attention. Compared with traditional contact-based interaction through keys, a mouse, a keyboard, or a touch screen, gesture interaction has notable advantages in many scenes. For example, in a smart-home scene, a television's volume can be adjusted and its channel changed without a remote control, and a driver in an intelligent cockpit can adjust the volume without diverting attention from the road. Gesture recognition is an important component of human-computer interaction, and its research and development affect the naturalness and flexibility of human-computer interaction.
Existing gesture recognition schemes mostly fall into two types. The first extracts features directly from a single image and outputs a gesture category: the image at the current time T is fed into a convolutional neural network, image features are extracted, and the gesture is classified and recognized. The second directly adopts an end-to-end gesture recognition network based on three-dimensional convolution, taking consecutive multi-frame images as input. Unlike an ordinary two-dimensional convolutional network, the three-dimensional convolutional network adds a temporal dimension to the usual width, height, and channel dimensions of the convolution. Specifically, the image at the current time T and the N frames before T are fed into the three-dimensional convolutional network simultaneously; the network performs temporal convolution over the consecutive images and finally outputs the preset gesture category at time T.
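To make the contrast concrete, the following minimal sketch (our illustration in PyTorch; all layer parameters and tensor sizes are arbitrary placeholders, not taken from this disclosure) shows the input regimes of the two schemes:

```python
# Illustrative sketch of the two prior-art input regimes described above.
# All layer parameters and tensor sizes are arbitrary placeholders.
import torch
import torch.nn as nn

# Scheme 1: a 2D convolution sees a single frame -- shape (batch, channels, H, W)
# -- so its features carry no temporal information.
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
frame_t = torch.randn(1, 3, 224, 224)      # the image at the current time T
feat_2d = conv2d(frame_t)

# Scheme 2: a 3D convolution adds a temporal axis -- shape (batch, channels, T, H, W)
# -- and the kernel also slides over the N buffered frames before time T,
# which is where the extra computation and delay come from.
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
clip = torch.randn(1, 3, 8, 224, 224)      # current frame plus 7 earlier frames
feat_3d = conv3d(clip)
```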
However, the above two prior arts have the following disadvantages:
In the first scheme, features are extracted directly from a single image and a gesture category is output. The extracted features carry no temporal information, so only static gestures such as a fist or an OK sign can be recognized; dynamic gestures made by the user, such as swiping left, swiping right, or circling clockwise, cannot.
In the second scheme, a general end-to-end gesture recognizer based on a three-dimensional convolutional network extracts features that do include temporal information, so dynamic gestures can be recognized, but the method has two disadvantages: (1) computing the gesture category of the current frame requires all previous image inputs, and temporal convolution is performed inside the network, so the actual computational load is large and delays sometimes occur; (2) the input is a sequence of raw images, so when the background is complex or the human body occupies only a small area of the input image, recognition is easily disturbed by the image background, which can seriously degrade the accuracy of gesture recognition.
For the reasons above, the present application provides a target recognition method that, unlike the first prior-art scheme, can recognize dynamic targets, and that avoids the heavy computation and low accuracy of the second prior-art scheme. Specifically, the three-dimensional convolutional network is removed: multi-channel features replace convolution in the time domain, and a streaming scheme ensures that the current-frame result incurs no frame delay. For complex backgrounds or images in which the human body occupies a small area, features of the hand region are extracted directly; the hand features extracted from the current frame and those extracted from the previous N frames are spliced into an (N+1)-channel feature map, which is fed into the target recognition model for recognition and classification, and the final recognition result is output. This addresses false recognition caused by complex backgrounds or a small subject proportion.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of a first embodiment of the object identification method of the present application. The method specifically comprises the following steps:
step S11: extracting the features of the target to be recognized in the current frame based on the reference feature information of the target to be recognized in the reference frame to obtain the feature information of the target to be recognized in the current frame; the reference frame is a frame previous to the current frame.
In this embodiment, feature extraction is performed on the target to be recognized in the current frame based on the reference feature information of the same target in the reference frame, so as to obtain the feature information of the target in the current frame. The reference frame is the frame immediately preceding the current frame in the video sequence; the two frames are connected in the time domain, capturing the same content at closely spaced moments, and together they can represent the feature information of a dynamic object at different times. In other words, feature extraction for the current frame incorporates temporal information (carried by the reference feature information), which improves the accuracy of target recognition.
In one embodiment, referring to fig. 2, step S11 includes:
step S111: and extracting the characteristics of the target to be identified in the current frame to obtain current characteristic information.
Feature extraction is performed on the target to be recognized in the current frame using a feature extraction algorithm, yielding the current feature information of the target in the current frame.
Step S112: and splicing at least part of reference characteristic information of the target to be identified in the reference frame with at least part of current characteristic information of the target to be identified in the current frame to obtain the characteristic information of the target to be identified in the current frame.
At least part of the reference feature information of the target to be recognized in the reference frame is spliced with at least part of the current feature information of the target in the current frame to obtain the feature information of the target to be recognized in the current frame. Splicing part of the reference feature information with part of the current feature information incorporates the temporal information of the target, which improves the accuracy of the current frame's feature information and, in turn, the accuracy of target recognition.
In one embodiment, one third of the reference feature information may be spliced with two thirds of the current feature information to obtain the feature information of the target to be recognized in the current frame. In another embodiment, one half of the reference feature information may be spliced with one half of the current feature information. The ratio between the spliced reference feature information and current feature information is not limited and can be set according to the actual application; details are not repeated here.
In a specific embodiment, the splicing of partial reference feature information with partial current feature information can be realized using a feature translation (channel-shift) network structure. Specifically, the (N-1)/N features of the feature channels corresponding to the current feature information are spliced with the 1/N features at the corresponding channel positions of the reference feature information to obtain the feature information of the target to be recognized in the current frame, where N is the number of two-dimensional convolution modules and the N modules are cascaded in sequence. Referring to fig. 3, the feature translation network shown there includes 4 two-dimensional convolution modules: 2DConv Block1, 2DConv Block2, 2DConv Block3, and 2DConv Block4, cascaded in sequence (Block1 into Block2, Block2 into Block3, Block3 into Block4).
In one embodiment, when T < N, the current feature information is taken directly as the feature information of the target to be recognized in the current frame, where T is the frame number of the current frame (the first few frames do not yet have a full reference history).
For example, the features of the current frame t output by 2DConv Block1 contain 1/N feature information from frame t-1 and (N-1)/N from frame t. The features output by 2DConv Block2 contain 1/N from frame t-2 and (N-1)/N from frames t and t-1. The features output by 2DConv Block3 contain 1/N from frame t-3 and (N-1)/N from frames t, t-1, and t-2. The features output by 2DConv Block4 contain 1/N from frame t-4 and (N-1)/N from frames t, t-1, t-2, and t-3.
With this feature translation method, the final feature information of the current frame incorporates feature information from all frames preceding it, which improves recognition accuracy when the target to be recognized is dynamic.
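A minimal sketch of one such shift-and-convolve block follows (our illustration in PyTorch; the class name, caching layout, and layer sizes are assumptions, not taken from this disclosure — the cache is one way to realize the splice in a streaming fashion):

```python
import torch
import torch.nn as nn

class ShiftConvBlock(nn.Module):
    """One 2D convolution module that, before convolving, replaces 1/N of its
    input channels with the feature slice cached from the previous frame."""

    def __init__(self, channels: int, n_blocks: int):
        super().__init__()
        self.n_shift = channels // n_blocks              # the 1/N channel slice
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.cache = None                                # slice from frame t-1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        cur_slice = x[:, :self.n_shift].detach()         # saved for frame t+1
        if self.cache is not None:
            # Splice: 1/N channels from the reference frame, (N-1)/N from frame t.
            x = torch.cat([self.cache, x[:, self.n_shift:]], dim=1)
        # For the earliest frames (cf. the T < N rule above), no cache exists
        # yet and the current features pass through unchanged.
        self.cache = cur_slice
        return self.conv(x)

# Four cascaded two-dimensional convolution modules, as in Fig. 3 (N = 4).
blocks = nn.Sequential(*[ShiftConvBlock(channels=64, n_blocks=4) for _ in range(4)])
for t in range(6):                                       # streaming: one frame at a time
    frame_feat = torch.randn(1, 64, 56, 56)              # features of frame t
    out_t = blocks(frame_feat)                           # result with no frame delay
```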
Further, since multiple targets may appear during detection, the features of each candidate target must be matched against the target features extracted from the previous frame (matching methods include, but are not limited to, cosine distance and the like); the closest match is treated as the continuation of the same image feature stream and sent to the recognition network for recognition. Specifically, in response to the current frame containing multiple targets, the current feature information of each target is compared for similarity with the reference feature information to determine the target to be recognized in the current frame. For example, suppose the current frame contains targets A, B, and C: feature information is extracted for each, each is compared with the reference feature information, and the target with the highest similarity is the target to be recognized. The target to be recognized can thus be located accurately, avoiding the situation where the target in the current frame and the target in the reference frame are not the same object.
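A sketch of this matching step, assuming cosine similarity (one of the methods named above); the function name and tensor shapes are our own illustration:

```python
import torch
import torch.nn.functional as F

def match_target(candidates: torch.Tensor, reference: torch.Tensor) -> int:
    """candidates: (M, D) feature vectors of the M targets in the current frame;
    reference: (D,) feature vector of the target from the previous frame.
    Returns the index of the candidate most similar to the reference."""
    sims = F.cosine_similarity(candidates, reference.unsqueeze(0), dim=1)  # (M,)
    return int(sims.argmax())

candidate_feats = torch.randn(3, 128)   # targets A, B, C in the current frame
reference_feat = torch.randn(128)       # reference feature information (frame t-1)
best = match_target(candidate_feats, reference_feat)
# The matched target continues the same image feature stream into recognition.
```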
Further, detection of the target to be recognized is performed on the original image of the current frame to obtain the features of the target, the background features corresponding to the target, and the position features of the target; these are spliced to obtain the current-frame input. In one embodiment, for far-field scenes with a complex background or a small subject, a detection network (including but not limited to Mask R-CNN, Fast R-CNN, and the like) is added on top of the feature translation algorithm to extract the target features, background features, and target position features; the three are interpolated to the same size and spliced along the channel dimension, replacing the original image as the input of the recognition network, which outputs the final gesture result. Using the target features greatly reduces the feature computation from the original image to the recognition network, achieving feature reuse; meanwhile, adding the hand position features and background features along the channel dimension avoids incomplete local features and effectively suppresses background noise, improving the recognition rate. In multi-target scenes, the added detection and matching also enables multi-target recognition.
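A minimal sketch of the interpolate-and-splice step (ours; the spatial size, channel counts, and the use of bilinear interpolation are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def splice_features(target_feat, background_feat, position_feat, size=(56, 56)):
    """Resize the three feature maps (each (B, C_i, H_i, W_i)) to a common
    spatial size and concatenate them along the channel dimension; the result
    replaces the original image as input to the recognition network."""
    resized = [
        F.interpolate(f, size=size, mode="bilinear", align_corners=False)
        for f in (target_feat, background_feat, position_feat)
    ]
    return torch.cat(resized, dim=1)

fused = splice_features(
    torch.randn(1, 64, 28, 28),    # features of the hand (target) region
    torch.randn(1, 64, 14, 14),    # background features for that target
    torch.randn(1, 1, 224, 224),   # position features (e.g. a location heatmap)
)                                  # -> (1, 129, 56, 56)
```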
In one embodiment, the target to be recognized is a gesture. In other embodiments, the target to be recognized may also be a dynamic object such as a vehicle, a person, and the like, which is not limited specifically.
Step S12: and identifying the characteristic information of the target to be identified.
Specifically, a classifier can be used to recognize the feature information of the target to be recognized, yielding the recognition result.
The above target recognition method uses a streaming target recognition network: to capture the temporal information needed for dynamic target recognition, it combines feature translation with two-dimensional convolution modules in place of a three-dimensional convolution module, reducing the computational load of the recognition network while making the overall features more robust, so that good recognition performance is achieved even with complex backgrounds or in far-field scenes.
According to the target recognition method, target features are extracted before recognition, and feature reuse is performed by combining background features and target position features, achieving end-to-end multi-target recognition while reducing the influence of complex backgrounds and improving the accuracy and recall rate of gesture recognition.
Referring to fig. 4, a schematic structural diagram of an embodiment of an object recognition device according to the present application includes: a feature extraction module 101 and a recognition module 102.
The feature extraction module 101 is configured to perform feature extraction on a target to be identified in a current frame based on reference feature information of the target to be identified in a reference frame, so as to obtain feature information of the target to be identified in the current frame; the reference frame is a frame previous to the current frame.
In an embodiment, the feature extraction module 101 is configured to perform feature extraction on an object to be identified in a current frame to obtain current feature information; and splicing at least part of reference characteristic information of the target to be identified in the reference frame with at least part of current characteristic information of the target to be identified in the current frame to obtain the characteristic information of the target to be identified in the current frame.
In an embodiment, the feature extraction module 101 is configured to splice (N-1)/N features of a feature channel corresponding to current feature information with 1/N features at a feature channel position corresponding to reference feature information to obtain feature information of an object to be identified in a current frame; and N is the number of the two-dimensional convolution modules, and the N two-dimensional convolution modules are cascaded in sequence.
In one embodiment, when T is less than N, the current characteristic information is the characteristic information of the target to be identified in the current frame; wherein T is the frame number of the current frame.
In an embodiment, in response to the current frame including multiple targets, the feature extraction module 101 is configured to perform similarity comparison between current feature information of the multiple targets and reference feature information to determine a target to be identified in the current frame.
In an embodiment, the feature extraction module 101 is configured to perform target detection to be identified on an original image of a current frame to obtain features of a target to be identified, background features corresponding to the target to be identified, and position features of the target to be identified; and splicing the characteristics of the target to be recognized, the background characteristics corresponding to the target to be recognized and the position characteristics of the target to be recognized to obtain the current frame.
The identification module 102 is configured to identify the target to be identified according to the feature information.
The target recognition device uses a streaming target recognition network and, to capture the temporal information of dynamic target recognition, combines a feature translation module with two-dimensional convolution modules in place of a three-dimensional convolution module, reducing the computational load of the recognition network while making the overall features more robust, so that good recognition performance is achieved under complex backgrounds or in far-field scenes.
The target recognition device extracts target features before recognition and reuses them in combination with background features and target position features, achieving end-to-end multi-target recognition while reducing the influence of complex backgrounds and improving the accuracy and recall rate of gesture recognition.
Referring to fig. 5, a flowchart of an embodiment of a training method of a target recognition model according to the present application is shown. The target recognition model comprises a target detection model and a target classification model which are sequentially cascaded. The training method of the target recognition model comprises the following steps:
s41: and acquiring a training sample set, wherein the training sample set is marked with a position frame of a detection target.
Specifically, a training sample set is obtained from a database, with each sample annotated with the position frame of the detection target. For example, if the detection target is a person, the position frame is the person's bounding rectangle.
S42: and training the target detection model by utilizing the training sample set.
Specifically, the target detection model is trained using the training sample set; the detection model includes, but is not limited to, Mask R-CNN, Fast R-CNN, and the like.
S43: and processing the training sample set by using the trained target detection model to obtain the characteristic information of the detection target.
The training sample set is input into the trained target detection model, which processes it to obtain the feature information of the detection target. Background information can thereby be filtered out, improving the accuracy of target detection.
Further, the trained target detection model is used for processing the training sample set to obtain feature information corresponding to the position frame area of the detection target and feature information corresponding to the background area of the detection target.
S44: and training the target classification model by using the characteristic information of the detection target and the preset class information of the detection target.
Specifically, the preset category information of the detection target is the attribute category of the detection target; for example, when the detection target is a hand, the preset category information corresponds to preset hand-action categories. Training the target classification model with the feature information of the detection target and this preset category information enables the resulting target recognition model to recognize the hand's action information.
Further, the feature information corresponding to the position-frame region of the detection target is spliced with the feature information corresponding to its background region to obtain spliced feature information, and the target classification model is trained using the spliced feature information and the preset category information of the detection target.
Concatenating the feature information of the target's position-frame region with the feature information of its background region along the channel dimension avoids incomplete local features and effectively suppresses background noise, improving the recognition rate.
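A compressed, self-contained sketch of steps S41 to S44 (our illustration in PyTorch; the tiny stand-in models, loss choices, and dummy data are assumptions, not the patent's models):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

detector = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # stand-in detection model
classifier = nn.Linear(8, 5)                            # stand-in classifier (5 classes)
opt_det = torch.optim.SGD(detector.parameters(), lr=0.01)
opt_cls = torch.optim.SGD(classifier.parameters(), lr=0.01)

# S41: training samples annotated with the detection target's position frame
# (here reduced to a dummy regression target for brevity).
images = torch.randn(4, 3, 32, 32)
box_targets = torch.randn(4, 8, 32, 32)
labels = torch.randint(0, 5, (4,))                      # preset category information

# S42: train the target detection model on the annotated sample set.
det_loss = F.mse_loss(detector(images), box_targets)
det_loss.backward(); opt_det.step(); opt_det.zero_grad()

# S43: run the trained detector to obtain feature information; here its output
# is split into a "position-frame region" half and a "background region" half
# purely for illustration.
with torch.no_grad():
    feats = detector(images)                            # (4, 8, 32, 32)
    box_feat, bg_feat = feats[:, :4], feats[:, 4:]

# S44: splice the two feature maps along the channel dimension, pool, and
# train the classifier against the preset category information.
spliced = torch.cat([box_feat, bg_feat], dim=1)         # (4, 8, 32, 32)
logits = classifier(spliced.mean(dim=(2, 3)))           # (4, 5)
cls_loss = F.cross_entropy(logits, labels)
cls_loss.backward(); opt_cls.step(); opt_cls.zero_grad()
```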
Referring to fig. 6, which is a schematic structural diagram of an embodiment of a training apparatus for a target recognition model of the present application, the training apparatus for a target recognition model is used for training the target recognition model. The training apparatus comprises an acquisition module 201, a first training module 202, a processing module 203, and a second training module 204.
The obtaining module 201 is configured to obtain a training sample set, where the training sample set is labeled with a position frame of a detection target.
The first training module 202 is configured to train a target detection model using a training sample set.
The processing module 203 is configured to process the training sample set by using the trained target detection model to obtain feature information of the detection target.
In an embodiment, the processing module 203 is further configured to process the training sample set by using the trained target detection model to obtain feature information corresponding to a position frame area of the detection target and feature information corresponding to a background area of the detection target.
The second training module 204 is configured to train the target classification model by using the feature information of the detection target and the preset category information of the detection target.
In an embodiment, the second training module 204 is further configured to splice feature information corresponding to the position frame area of the detection target with feature information corresponding to the background area of the detection target to obtain spliced feature information; and training the target classification model by using the spliced characteristic information and the preset class information of the detected target.
Referring to fig. 7, which is a schematic structural diagram of an embodiment of an electronic device according to the present application, the electronic device includes a memory 301 and a processor 302 connected to each other.
The memory 301 is used to store program instructions implementing the methods of the apparatus of any of the above.
Processor 302 is operative to execute program instructions stored in memory 301.
The processor 302 may also be referred to as a Central Processing Unit (CPU). The processor 302 may be an integrated circuit chip having signal processing capabilities. The processor 302 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 301 may be a memory bank, a TF card, or the like, and can store all information in the electronic device, including input raw data, computer programs, intermediate results, and final results, storing and retrieving information at the locations specified by the controller. Only with memory can the electronic device operate normally. By purpose, the memories of electronic devices are classified into main memory (internal memory) and auxiliary memory (external memory). External memory is usually a magnetic medium, an optical disc, or the like, and can store information for long periods. Internal memory refers to the storage components on the main board that hold the data and programs currently being executed; it is only temporary storage, and its contents are lost when the power is turned off.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a system server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method of the embodiments of the present application.
Please refer to fig. 8, which is a schematic structural diagram of a computer-readable storage medium according to the present application. The storage medium of the present application stores a program file 401 capable of implementing all the methods described above, wherein the program file 401 may be stored in the storage medium in the form of a software product and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc, or terminal devices such as a computer, a server, a mobile phone, or a tablet.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (13)

1. A method of object recognition, comprising:
performing feature extraction on the target to be recognized in the current frame based on reference feature information of the target to be recognized in the reference frame to obtain feature information of the target to be recognized in the current frame; the reference frame is a previous frame of the current frame;
and identifying the characteristic information of the target to be identified.
2. The method according to claim 1, wherein the step of extracting the feature of the target to be recognized in the current frame based on the reference feature information of the target to be recognized in the reference frame to obtain the feature information of the target to be recognized in the current frame comprises:
extracting the features of the target to be identified in the current frame to obtain current feature information;
and splicing at least part of the reference characteristic information of the target to be identified in the reference frame with at least part of the current characteristic information of the target to be identified in the current frame to obtain the characteristic information of the target to be identified in the current frame.
3. The object recognition method according to claim 2, wherein the step of splicing at least a part of the reference feature information of the object to be recognized in the reference frame with at least a part of the current feature information of the object to be recognized in the current frame to obtain the feature information of the object to be recognized in the current frame comprises:
splicing the (N-1)/N features of the feature channels corresponding to the current feature information with the 1/N features at the corresponding feature-channel positions of the reference feature information to obtain the feature information of the target to be identified in the current frame;
and N is the number of the two-dimensional convolution modules, and the N two-dimensional convolution modules are sequentially cascaded.
4. The object recognition method according to claim 3, wherein the step of splicing (N-1)/N features of a feature channel corresponding to the current feature information with 1/N features of the feature channel position corresponding to the reference feature information to obtain the feature information of the object to be recognized in the current frame further comprises:
in response to T < N, taking the current feature information as the feature information of the target to be identified in the current frame; wherein T is the frame number of the current frame.
5. The target identification method according to claim 2, wherein before the step of extracting the features of the target to be identified in the current frame based on the reference feature information of the target to be identified in the reference frame to obtain the feature information of the target to be identified in the current frame, the method comprises:
and in response to the current frame comprising a plurality of targets, performing similarity comparison on current feature information of the plurality of targets and the reference feature information to determine the target to be identified in the current frame.
6. The method according to claim 1, wherein before the step of extracting the feature of the target to be recognized in the current frame based on the reference feature information of the target to be recognized in the reference frame to obtain the feature information of the target to be recognized in the current frame, the method further comprises:
detecting a target to be recognized on an original image of a current frame to obtain the characteristic of the target to be recognized, the background characteristic corresponding to the target to be recognized and the position characteristic of the target to be recognized;
and splicing the characteristics of the target to be recognized, the background characteristics corresponding to the target to be recognized and the position characteristics of the target to be recognized to obtain the current frame.
7. The object recognition method according to claim 1, wherein the object to be recognized is a gesture.
8. An object recognition apparatus, comprising:
the characteristic extraction module is used for extracting the characteristics of the target to be identified in the current frame based on the reference characteristic information of the target to be identified in the reference frame so as to obtain the characteristic information of the target to be identified in the current frame; the reference frame is a previous frame of the current frame;
and the identification module is used for identifying the characteristic information of the target to be identified.
9. A training method of a target recognition model, wherein the target recognition model comprises a target detection model and a target classification model which are sequentially cascaded, and the target recognition model is used for realizing the target recognition method of any one of claims 1 to 7, and the training method comprises the following steps:
acquiring a training sample set, wherein a position frame of a detection target is marked on the training sample set;
training a target detection model by using the training sample set;
processing the training sample set by using the trained target detection model to obtain the characteristic information of the detection target;
and training a target classification model by using the characteristic information of the detection target and the preset class information of the detection target.
10. The method for training the target recognition model according to claim 9, wherein the step of processing the training sample set by using the trained target detection model to obtain the feature information of the detection target includes:
processing the training sample set by using the trained target detection model to obtain feature information corresponding to a position frame area of the detection target and feature information corresponding to a background area of the detection target;
the step of training the target classification model by using the feature information of the detection target and the preset category information of the detection target comprises the following steps:
splicing the characteristic information corresponding to the position frame area of the detection target with the characteristic information corresponding to the background area of the detection target to obtain spliced characteristic information;
and training a target classification model by using the spliced characteristic information and the preset class information of the detection target.
11. A training device of a target recognition model is characterized in that the training device of the target recognition model is used for training the target recognition model, the target recognition model comprises a target detection model and a target classification model which are sequentially cascaded, and the target recognition model is used for realizing the target recognition method of any one of claims 1 to 7; the training apparatus includes:
the acquisition module is used for acquiring a training sample set, and the training sample set is marked with a position frame of a detection target;
the first training module is used for training a target detection model by utilizing the training sample set;
the processing module is used for processing the training sample set by using the trained target detection model to obtain the characteristic information of the detection target;
and the second training module is used for training a target classification model by utilizing the characteristic information of the detection target and the preset class information of the detection target.
12. An electronic device, comprising: a memory storing program instructions and a processor retrieving the program instructions from the memory to perform the object recognition method of any one of claims 1-7; and/or the method for training the object recognition model according to any one of claims 9 to 10.
13. A computer-readable storage medium, characterized in that a program file is stored, the program file being executable to implement the object recognition method according to any one of claims 1 to 7; and/or the method for training the object recognition model according to any one of claims 9 to 10.
CN202110963118.4A 2021-08-20 2021-08-20 Target identification method, target identification device, electronic equipment and computer-readable storage medium Pending CN113743258A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110963118.4A CN113743258A (en) 2021-08-20 2021-08-20 Target identification method, target identification device, electronic equipment and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN113743258A true CN113743258A (en) 2021-12-03

Family

ID=78732157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110963118.4A Pending CN113743258A (en) 2021-08-20 2021-08-20 Target identification method, target identification device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN113743258A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389086A (en) * 2018-10-09 2019-02-26 北京科技大学 Detect the method and system of unmanned plane silhouette target
CN111144156A (en) * 2018-11-06 2020-05-12 腾讯科技(深圳)有限公司 Image data processing method and related device
CN109685060A (en) * 2018-11-09 2019-04-26 科大讯飞股份有限公司 Image processing method and device
CN110288614A (en) * 2019-06-24 2019-09-27 睿魔智能科技(杭州)有限公司 Image processing method, device, equipment and storage medium
CN111667489A (en) * 2020-04-30 2020-09-15 华东师范大学 Cancer hyperspectral image segmentation method and system based on double-branch attention deep learning
CN111951260A (en) * 2020-08-21 2020-11-17 苏州大学 Partial feature fusion based convolutional neural network real-time target counting system and method
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field

Similar Documents

Publication Publication Date Title
US11151725B2 (en) Image salient object segmentation method and apparatus based on reciprocal attention between foreground and background
WO2021017606A1 (en) Video processing method and apparatus, and electronic device and storage medium
CN110781784A (en) Face recognition method, device and equipment based on double-path attention mechanism
US12008167B2 (en) Action recognition method and device for target object, and electronic apparatus
EP3746935A1 (en) Object detection based on neural network
Yuan et al. Non-negative dictionary based sparse representation classification for ear recognition with occlusion
CN111598012B (en) Picture clustering management method, system, device and medium
CN109977832B (en) Image processing method, device and storage medium
CN112420069A (en) Voice processing method, device, machine readable medium and equipment
CN112801235A (en) Model training method, prediction device, re-recognition model and electronic equipment
CN112529939A (en) Target track matching method and device, machine readable medium and equipment
CN114818989B (en) Gait-based behavior recognition method and device, terminal equipment and storage medium
CN114783069B (en) Method, device, terminal equipment and storage medium for identifying object based on gait
CN114783061B (en) Smoking behavior detection method, device, equipment and medium
Liu et al. Multi-scale iterative refinement network for RGB-D salient object detection
CN116935449A (en) Fingerprint image matching model training method, fingerprint matching method and related medium
CN113221900A (en) Multimode video Chinese subtitle recognition method based on densely connected convolutional network
CN113902789A (en) Image feature processing method, depth image generating method, depth image processing apparatus, depth image generating medium, and device
CN112686122A (en) Human body and shadow detection method, device, electronic device and storage medium
CN113743258A (en) Target identification method, target identification device, electronic equipment and computer-readable storage medium
Hamidi et al. Local selected features of dual‐tree complex wavelet transform for single sample face recognition
CN114943872A (en) Training method and device of target detection model, target detection method and device, medium and equipment
CN111710011B (en) Cartoon generation method and system, electronic device and medium
CN113971830A (en) Face recognition method and device, storage medium and electronic equipment
Zhang et al. Attention-based contextual interaction asymmetric network for RGB-D saliency prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination