CN114677611B - Data identification method, storage medium and device

Data identification method, storage medium and device

Info

Publication number
CN114677611B
Authority
CN
China
Prior art keywords
target object
track
video frame
video
sample
Prior art date
Legal status
Active
Application number
CN202110304366.8A
Other languages
Chinese (zh)
Other versions
CN114677611A (en)
Inventor
刘兵
曹浩宇
郑岩
吴磊
刘银松
Current Assignee
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Cloud Computing Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Cloud Computing Beijing Co Ltd
Priority claimed from CN202110304366.8A
Publication of CN114677611A
Application granted
Publication of CN114677611B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a data identification method, a storage medium and a device. The method comprises the following steps: acquiring video data containing a target object, and performing semantic segmentation on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame; performing convolution processing on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame; acquiring the time sequence relation between the at least two video frames, combining the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relation, and performing time sequence feature extraction on the feature map sequence to obtain a predicted track feature of the target object in the video data; and classifying and identifying the predicted track feature to obtain a track identification result corresponding to the target object. With the method and the device, both the accuracy and the applicability of track identification for the target object can be improved.

Description

Data identification method, storage medium and device
Technical Field
The present application relates to the field of deep learning and image processing technologies, and in particular, to a data identification method, a storage medium, and a device.
Background
With the rapid development of image processing technology, track recognition can be performed on video data captured by camera equipment: the motion track of a monitored object in the video data is recognized, and reference information for related services is provided based on that motion track. For example, the running state or running track of a monitored item may be monitored in real time, or it may be detected whether the running track of the monitored item meets a specification, such as whether an express box to be shipped reaches a specified window.
However, at present a quadratic polynomial y = ax² + bx + c is typically fitted to the motion coordinates of the monitored object, and the motion track is obtained from the fitting result. Because the position, shape and other factors of the monitored object differ greatly from stage to stage in the video data, track identification by this method leads to lower classification accuracy; in particular, when the motion track of the monitored object changes in a complex way, it cannot be fitted by a quadratic polynomial, so the method has a narrow range of application.
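For context, the baseline fitting approach described above can be sketched as follows (a minimal illustration with made-up per-frame coordinates; not part of the patent itself): the centre coordinates of the monitored object are fitted with a quadratic polynomial and the fitted curve is taken as its motion track.

```python
# Minimal sketch of the quadratic-fit baseline described above.
# The frame-by-frame centre coordinates are illustrative assumptions.
import numpy as np

# x: horizontal centre of the monitored object per frame, y: vertical centre
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0, 85.0])
y = np.array([80.0, 60.0, 48.0, 44.0, 48.0, 60.0])

# Fit y = a*x^2 + b*x + c to the observed coordinates
a, b, c = np.polyfit(x, y, deg=2)

# Evaluate the fitted curve on a dense grid to approximate the motion track
x_dense = np.linspace(x.min(), x.max(), 100)
y_fit = a * x_dense**2 + b * x_dense + c
print(f"fitted coefficients: a={a:.3f}, b={b:.3f}, c={c:.3f}")
```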
Disclosure of Invention
The technical problem to be solved by the embodiments of the application is to provide a data identification method, a storage medium and a device, which can improve the accuracy of identifying the motion track of a target object in video data and improve the applicability of track identification.
An aspect of an embodiment of the present application provides a data identification method, including:
Acquiring video data containing a target object, and performing semantic segmentation on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame;
Carrying out convolution processing on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame;
Acquiring the time sequence relation between the at least two video frames, combining the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relation, and performing time sequence feature extraction on the feature map sequence to obtain the predicted track feature of the target object in the video data;
And classifying and identifying the predicted track feature to obtain a track identification result corresponding to the target object, wherein the track identification result is used for providing a verification basis for the service associated with the target object.
Wherein acquiring the video data containing the target object comprises:
responding to an uploading operation in a track verification page, and starting a camera component;
and shooting the target object based on the camera component, and determining the captured movement track of the target object as the video data containing the target object.
Wherein acquiring the video data containing the target object comprises:
receiving a track verification request for the target object sent by a target user, and verifying the authority of the target user according to the track verification request;
if the target user has the authority to perform track verification on the target object, acquiring a video stream corresponding to the target object from the track verification request;
and decoding the video stream, and determining the decoded video stream as the video data containing the target object.
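A minimal sketch of this decoding step, assuming the received video stream has already been written to a file and using OpenCV (the file name and the frame cap are illustrative assumptions):

```python
# Decode an uploaded video stream into frames (illustrative sketch).
import cv2

def decode_video(path: str, max_frames: int = 64):
    """Decode a video file into a list of BGR frames (the target-object video data)."""
    capture = cv2.VideoCapture(path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = capture.read()
        if not ok:  # end of stream
            break
        frames.append(frame)
    capture.release()
    return frames

frames = decode_video("uploaded_stream.mp4")  # hypothetical file name
print(f"decoded {len(frames)} video frames")
```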
Wherein the at least two video frames comprise a video frame T_i, i is less than or equal to the number S of the at least two video frames, and both i and S are positive integers;
performing semantic segmentation on the at least two video frames in the video data to obtain the coordinate sequence of the target object in each video frame comprises the following steps:
randomly dividing an image corresponding to the video frame T_i to obtain N candidate regions; N is a positive integer;
respectively performing pixel extraction on the N candidate regions to obtain region pixel points corresponding to each of the N candidate regions;
according to a pixel value interval associated with the target object, performing pixel classification on the region pixel points contained in each of the N candidate regions to obtain target pixel points in each candidate region, and performing segmentation processing on the video frame T_i based on the target pixel points to obtain a mask image corresponding to the target object; the pixel value of a target pixel point belongs to the pixel value interval;
and determining the coordinate sequence of the target object in the video frame T_i based on the mask image.
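A simplified sketch of the pixel-interval segmentation step (the random partition into N candidate regions is omitted and the whole frame is classified at once; the pixel-value interval itself is an illustrative assumption):

```python
# Sketch of the pixel-interval segmentation step: pixels whose values fall in the
# interval associated with the target object are kept as the mask image.
import cv2
import numpy as np

def mask_from_pixel_interval(frame_bgr: np.ndarray,
                             lower=(0, 0, 120), upper=(120, 120, 255)) -> np.ndarray:
    """Return a binary mask whose nonzero pixels lie inside the target pixel-value interval."""
    mask = cv2.inRange(frame_bgr, np.array(lower), np.array(upper))
    return mask  # 255 where the pixel value lies in the interval, else 0
```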
Wherein determining the coordinate sequence of the target object in the video frame T_i based on the mask image comprises:
determining image edge points of the target object in the video frame T_i according to the mask image;
performing straight line fitting on the image edge points to determine M fitting straight lines; M is a positive integer;
and acquiring the intersection point between any two adjacent fitting straight lines in the M fitting straight lines, acquiring position coordinate information corresponding to each intersection point, and generating the coordinate sequence of the target object in the video frame T_i according to the position coordinate information corresponding to the intersection points.
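One possible reading of this step, sketched with OpenCV (edge detection, straight-line fitting and intersection of non-parallel line pairs); the thresholds and the simple angle test are assumptions rather than the patent's exact procedure:

```python
# Sketch: from the mask image to a coordinate sequence of corner points
# (edge detection, straight-line fitting, intersections of adjacent lines).
import cv2
import numpy as np

def corner_coordinates(mask: np.ndarray, max_lines: int = 4):
    edges = cv2.Canny(mask, 50, 150)                       # image edge points
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 80)      # fitted straight lines
    if lines is None:
        return []
    lines = [l[0] for l in lines[:max_lines]]              # (rho, theta) pairs
    corners = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            (r1, t1), (r2, t2) = lines[i], lines[j]
            if abs(t1 - t2) < np.pi / 6:                   # near-parallel: skip
                continue
            A = np.array([[np.cos(t1), np.sin(t1)],
                          [np.cos(t2), np.sin(t2)]])
            x, y = np.linalg.solve(A, np.array([r1, r2]))  # line intersection
            corners.append((float(x), float(y)))
    return corners                                         # coordinate sequence
```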
Wherein performing convolution processing on the coordinate sequence of the target object in each video frame to obtain the position sub-feature map corresponding to each video frame comprises the following steps:
Integrating coordinate sequences of the target object in at least two video frames to obtain a coordinate matrix corresponding to the target object, inputting the coordinate matrix into a convolution network layer in a track recognition model, and performing convolution operation on the coordinate matrix to obtain position convolution information corresponding to the coordinate matrix;
Normalizing the position convolution information to obtain normalized position convolution information;
based on an activation function in a convolution network layer, nonlinear combination is carried out on the position convolution information after normalization processing, and a position feature map corresponding to a coordinate sequence of a target object in at least two video frames is generated;
and determining a position sub-feature map corresponding to each video frame respectively based on the position feature maps.
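A minimal sketch of such a convolution network layer in PyTorch, treating each frame's coordinate sequence as a small single-channel map; the channel counts and the GroupNorm/ReLU choices are illustrative assumptions, not the patent's exact design:

```python
# Sketch of the convolution network layer that turns the coordinate matrix into
# per-frame position sub-feature maps (shapes and hyperparameters are assumptions).
import torch
import torch.nn as nn

class PositionConvBlock(nn.Module):
    def __init__(self, in_channels=1, out_channels=16, groups=4):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(groups, out_channels)   # feature standardization
        self.act = nn.ReLU()                             # nonlinear combination

    def forward(self, coord_matrix):
        # coord_matrix: (S, 1, K, 2) - S frames, K corner points, (x, y) per point
        h = self.conv(coord_matrix)
        h = self.norm(h)
        return self.act(h)                               # per-frame position sub-feature maps

coords = torch.randn(8, 1, 4, 2)                         # 8 frames, 4 corners each (toy data)
sub_feature_maps = PositionConvBlock()(coords)
print(sub_feature_maps.shape)                            # torch.Size([8, 16, 4, 2])
```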
Based on an activation function in a convolution network layer, nonlinear combination is performed on the position convolution information after normalization processing, and a position feature map corresponding to a coordinate sequence of a target object in at least two video frames is generated, and the method comprises the following steps:
Based on an activation function in a convolution network layer, nonlinear combination is carried out on the position convolution information after normalization processing, and candidate position feature diagrams corresponding to coordinate sequences in at least two video frames are obtained;
acquiring the number of output channels and output size information in a convolution network layer, and adjusting the candidate position feature map based on the number of output channels and the output size information to obtain a position feature map corresponding to a coordinate sequence of a target object in at least two video frames; the size information corresponding to the position feature map is the product of the output size information and the number of output channels.
Wherein normalizing the position convolution information to obtain the normalized position convolution information comprises the following steps:
grouping the position convolution information based on a feature standardization network in the convolution network layer to obtain Q position convolution groups; Q is a positive integer;
and acquiring the mean and the variance corresponding to each of the Q position convolution groups, and normalizing the position convolution information in each position convolution group based on the corresponding mean and variance, to obtain the normalized position convolution information.
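The grouped normalization described here can be written out explicitly; the following sketch assumes the position convolution information is a (frames, channels, height, width) tensor and uses a small epsilon for numerical stability:

```python
# Sketch of the grouped normalization step: the position convolution information is
# split into Q groups along the channel axis and each group is normalized with its
# own mean and variance (epsilon and tensor layout are assumptions).
import torch

def group_normalize(conv_info: torch.Tensor, q_groups: int, eps: float = 1e-5):
    s, c, h, w = conv_info.shape                 # frames, channels, height, width
    x = conv_info.reshape(s, q_groups, c // q_groups, h, w)
    mean = x.mean(dim=(2, 3, 4), keepdim=True)   # per-group mean
    var = x.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
    x = (x - mean) / torch.sqrt(var + eps)       # normalized position convolution info
    return x.reshape(s, c, h, w)
```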
Wherein acquiring the time sequence relation between the at least two video frames, combining the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relation, and performing time sequence feature extraction on the feature map sequence to obtain the predicted track feature of the target object in the video data, comprises the following steps:
acquiring the time sequence relation between the at least two video frames, and combining the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relation;
performing cyclic coding on the feature map sequence based on a cyclic network layer of the track recognition model to obtain a coding matrix corresponding to the feature map sequence;
and based on an activation function in the cyclic network layer, carrying out nonlinear combination on the coding matrix to obtain the predicted track characteristics of the target object in the video data.
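A compact sketch of the cyclic network layer applied to the whole feature map sequence, using a GRU as one possible recurrent cell; the sizes and the tanh combination at the end are assumptions:

```python
# Sketch of the cyclic (recurrent) network layer over the feature map sequence.
import torch
import torch.nn as nn

S, C, H, W = 8, 16, 4, 2                          # frames and sub-feature-map shape
feature_map_sequence = torch.randn(S, C, H, W)    # ordered by the time sequence relation

gru = nn.GRU(input_size=C * H * W, hidden_size=64, batch_first=False)
inputs = feature_map_sequence.reshape(S, 1, C * H * W)    # (seq_len, batch, features)
encoded, _ = gru(inputs)                          # coding matrix: one vector per frame

predicted_track_feature = torch.tanh(encoded[-1].squeeze(0))  # nonlinear combination
print(predicted_track_feature.shape)              # torch.Size([64])
```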
Wherein the at least two video frames comprise a video frame X_i and a video frame X_(i+1), the video frame X_i and the video frame X_(i+1) are adjacent, i+1 is less than or equal to the number S of the at least two video frames, and both i and S are positive integers;
performing cyclic coding on the feature map sequence based on the cyclic network layer of the track recognition model to obtain the coding matrix corresponding to the feature map sequence comprises the following steps:
acquiring a first hidden state vector corresponding to the video frame X_i in the feature map sequence, inputting the position sub-feature map corresponding to the video frame X_i and the first hidden state vector into the cyclic network layer of the track recognition model, and encoding the position sub-feature map corresponding to the video frame X_i to obtain a feature vector corresponding to the video frame X_i;
determining a second hidden state vector corresponding to the video frame X_(i+1) in the feature map sequence according to the feature vector corresponding to the video frame X_i, inputting the position sub-feature map corresponding to the video frame X_(i+1) and the second hidden state vector into the cyclic network layer of the track recognition model, and encoding the position sub-feature map corresponding to the video frame X_(i+1) to obtain a feature vector corresponding to the video frame X_(i+1);
and when the video frame X_(i+1) is the last of the at least two video frames, generating the coding matrix corresponding to the feature map sequence according to the feature vector corresponding to each of the at least two video frames.
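The frame-by-frame recurrence described above, where the feature vector of frame X_i supplies the hidden state used for frame X_(i+1), might look as follows (the GRU cell and the vector sizes are assumptions):

```python
# Step-by-step sketch of the recurrence: the hidden state obtained for frame X_i
# is used as the hidden state for frame X_(i+1).
import torch
import torch.nn as nn

S, feat_dim, hidden_dim = 8, 16 * 4 * 2, 64
sub_feature_maps = torch.randn(S, feat_dim)        # flattened position sub-feature maps

cell = nn.GRUCell(input_size=feat_dim, hidden_size=hidden_dim)
hidden = torch.zeros(1, hidden_dim)                # first hidden state vector
frame_vectors = []
for i in range(S):
    hidden = cell(sub_feature_maps[i].unsqueeze(0), hidden)  # feature vector for frame X_i
    frame_vectors.append(hidden.squeeze(0))        # also the hidden state for X_(i+1)

coding_matrix = torch.stack(frame_vectors)         # assembled after the last frame
print(coding_matrix.shape)                         # torch.Size([8, 64])
```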
Wherein the method further comprises:
If the track recognition result corresponding to the target object matches the verification track direction, determining that the video data corresponding to the target object is valid video data, and verifying the validity of the target object according to the video data; the verification track direction refers to the track direction that the service associated with the target object requires to be verified;
and if the track recognition result corresponding to the target object does not match the verification track direction, determining that the video data corresponding to the target object is invalid video data, and outputting prompt information for prompting that the target object be shot again.
An aspect of an embodiment of the present application provides a data identification method, including:
Acquiring a first initial track recognition model, first sample video data corresponding to a sample article, and a first label classification result corresponding to the sample article in the first sample video data;
carrying out convolution processing on coordinate sequences corresponding to at least two first sample video frames in the first sample video data by adopting a first initial track recognition model to obtain a position sub-feature map corresponding to each first sample video frame respectively;
Acquiring a sample time sequence relation between the at least two first sample video frames, combining the position sub-feature maps corresponding to each first sample video frame into a sample feature map sequence according to the sample time sequence relation, and performing time sequence feature extraction on the sample feature map sequence to obtain a first predicted sample track feature of the sample article in the first sample video data;
classifying and identifying the first predicted sample track features to obtain a first predicted classification result corresponding to the sample article;
Determining a first loss function corresponding to a first initial track recognition model according to a first label classification result and a first prediction classification result corresponding to the sample article;
Performing iterative training on the first initial track recognition model based on the first loss function, and determining the trained first initial track recognition model as a track recognition model; the track recognition model is used for acquiring a track recognition result corresponding to the target object in the video data.
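A minimal sketch of one training step with the first loss function, taking cross-entropy over track classes as one possible classification loss (the model, optimizer and data are placeholders, not the patent's specific choices):

```python
# Sketch of training the first initial track recognition model with the first loss.
import torch
import torch.nn as nn

def train_step(model, optimizer, sample_coord_matrix, first_label):
    optimizer.zero_grad()
    first_prediction = model(sample_coord_matrix)   # first predicted classification result
    first_loss = nn.functional.cross_entropy(first_prediction, first_label)
    first_loss.backward()                           # one iteration of training
    optimizer.step()
    return first_loss.item()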
Wherein the method further comprises:
outputting a second predicted sample track feature of the sample article in second sample video data through a second initial track recognition model; the model parameters in the second initial track recognition model are the same as the model parameters in the first initial track recognition model;
classifying and identifying the second predicted sample track feature to obtain a second predicted classification result corresponding to the sample article in the second sample video data;
generating a second loss function according to the second predicted classification result and a second label classification result corresponding to the second sample video data;
generating a third loss function according to the first predicted classification result and the second predicted classification result;
and iteratively training the first initial track recognition model based on the first loss function and determining the trained first initial track recognition model as the track recognition model comprises:
generating a total loss function based on the first loss function, the second loss function and the third loss function, performing iterative training on the first initial track recognition model according to the total loss function, and determining the trained first initial track recognition model as the track recognition model.
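One way to read the total loss is as a consistency-regularized objective: the first and second models (with identical parameters) each contribute a classification loss, and a third term penalizes disagreement between their predictions. The symmetric KL form below is an assumption; the patent only states that the third loss is generated from the two predicted classification results.

```python
# Sketch of the total loss: first and second classification losses plus a third,
# consistency term between the two models' predictions (symmetric KL is an assumption).
import torch
import torch.nn.functional as F

def total_loss(pred1, label1, pred2, label2):
    loss1 = F.cross_entropy(pred1, label1)                    # first loss function
    loss2 = F.cross_entropy(pred2, label2)                    # second loss function
    p1 = F.log_softmax(pred1, dim=-1)
    p2 = F.log_softmax(pred2, dim=-1)
    loss3 = 0.5 * (F.kl_div(p1, p2, log_target=True, reduction="batchmean")
                   + F.kl_div(p2, p1, log_target=True, reduction="batchmean"))
    return loss1 + loss2 + loss3                              # total loss function
```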
An aspect of an embodiment of the present application provides a data identifying apparatus, including:
The semantic segmentation module is used for acquiring video data containing the target object, and carrying out semantic segmentation on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame;
the first convolution processing module is used for carrying out convolution processing on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame respectively;
The first feature extraction module is used for acquiring the time sequence relation between the at least two video frames, combining the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relation, and performing time sequence feature extraction on the feature map sequence to obtain the predicted track feature of the target object in the video data;
The first classification and identification module is used for classifying and identifying the predicted track characteristics to obtain a track identification result corresponding to the target object, and the track identification result is used for providing verification basis for the service associated with the target object.
The semantic segmentation module comprises:
the starting unit is used for responding to the uploading operation in the track verification page, and starting the camera component;
and the first determining unit is used for shooting the target object based on the camera component, and determining the captured movement track of the target object as the video data containing the target object.
The semantic segmentation module comprises:
the verification unit is used for receiving a track verification request for the target object sent by a target user, and verifying the authority of the target user according to the track verification request;
the first acquisition unit is used for acquiring a video stream corresponding to the target object from the track verification request if the target user has the authority to perform track verification on the target object;
and the second determining unit is used for decoding the video stream and determining the decoded video stream as the video data containing the target object.
Wherein the at least two video frames comprise a video frame T_i, i is less than or equal to the number S of the at least two video frames, and both i and S are positive integers;
The semantic segmentation module further comprises:
The random dividing unit is used for randomly dividing the image corresponding to the video frame T_i to obtain N candidate regions; N is a positive integer;
the pixel extraction unit is used for respectively performing pixel extraction on the N candidate regions to obtain region pixel points corresponding to each of the N candidate regions;
The segmentation processing unit is used for performing pixel classification on the region pixel points contained in each of the N candidate regions according to the pixel value interval associated with the target object to obtain target pixel points in each candidate region, and performing segmentation processing on the video frame T_i based on the target pixel points to obtain a mask image corresponding to the target object; the pixel value of a target pixel point belongs to the pixel value interval;
And a third determining unit, configured to determine the coordinate sequence of the target object in the video frame T_i based on the mask image.
The third determining unit is specifically configured to:
Determining image edge points of the target object in the video frame T_i according to the mask image;
performing straight line fitting on the image edge points to determine M fitting straight lines; m is a positive integer;
And acquiring the intersection point between any two adjacent fitting straight lines in the M fitting straight lines, acquiring position coordinate information corresponding to each intersection point, and generating the coordinate sequence of the target object in the video frame T_i according to the position coordinate information corresponding to the intersection points.
Wherein the first convolution processing module comprises:
The convolution operation unit is used for integrating the coordinate sequences of the target object in at least two video frames to obtain a coordinate matrix corresponding to the target object, inputting the coordinate matrix into a convolution network layer in the track identification model, and carrying out convolution operation on the coordinate matrix to obtain position convolution information corresponding to the coordinate matrix;
the normalization processing unit is used for carrying out normalization processing on the position convolution information to obtain normalized position convolution information;
The first nonlinear combination unit is used for carrying out nonlinear combination on the normalized position convolution information based on an activation function in the convolution network layer to generate a position feature map corresponding to a coordinate sequence of the target object in at least two video frames;
And the fourth determining unit is used for determining a position sub-feature map corresponding to each video frame respectively based on the position feature maps.
The first nonlinear combination unit is specifically configured to:
Based on an activation function in a convolution network layer, nonlinear combination is carried out on the position convolution information after normalization processing, and candidate position feature diagrams corresponding to coordinate sequences in at least two video frames are obtained;
acquiring the number of output channels and output size information in a convolution network layer, and adjusting the candidate position feature map based on the number of output channels and the output size information to obtain a position feature map corresponding to a coordinate sequence of a target object in at least two video frames; the size information corresponding to the position feature map is the product of the output size information and the number of output channels.
The normalization processing unit is specifically configured to:
Based on a characteristic standardized network in the convolutional network layer, carrying out grouping processing on the position convolutional information to obtain Q position convolutional groups; q is a positive integer;
And acquiring the mean and the variance corresponding to each of the Q position convolution groups, and normalizing the position convolution information in each position convolution group based on the corresponding mean and variance, to obtain the normalized position convolution information.
Wherein, the first feature extraction module includes:
The second acquisition unit is used for acquiring the time sequence relation between the at least two video frames, and combining the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relation;
The cyclic coding unit is used for carrying out cyclic coding on the characteristic graph sequence based on a cyclic network layer of the track recognition model to obtain a coding matrix corresponding to the characteristic graph sequence;
And the second nonlinear combination unit is used for nonlinear combination of the coding matrix based on the activation function in the cyclic network layer to obtain the predicted track characteristic of the target object in the video data.
Wherein the at least two video frames comprise a video frame X_i and a video frame X_(i+1), the video frame X_i and the video frame X_(i+1) are adjacent, i+1 is less than or equal to the number S of the at least two video frames, and both i and S are positive integers;
The cyclic coding unit is specifically configured to:
acquire a first hidden state vector corresponding to the video frame X_i in the feature map sequence, input the position sub-feature map corresponding to the video frame X_i and the first hidden state vector into the cyclic network layer of the track recognition model, and encode the position sub-feature map corresponding to the video frame X_i to obtain a feature vector corresponding to the video frame X_i;
determine a second hidden state vector corresponding to the video frame X_(i+1) in the feature map sequence according to the feature vector corresponding to the video frame X_i, input the position sub-feature map corresponding to the video frame X_(i+1) and the second hidden state vector into the cyclic network layer of the track recognition model, and encode the position sub-feature map corresponding to the video frame X_(i+1) to obtain a feature vector corresponding to the video frame X_(i+1);
and when the video frame X_(i+1) is the last of the at least two video frames, generate the coding matrix corresponding to the feature map sequence according to the feature vector corresponding to each of the at least two video frames.
Wherein, the data identification device further comprises:
The verification module is used for determining that the video data corresponding to the target object is valid video data if the track recognition result corresponding to the target object matches the verification track direction, and verifying the validity of the target object according to the video data; the verification track direction refers to the track direction that the service associated with the target object requires to be verified;
the first output module is used for determining that the video data corresponding to the target object is invalid video data if the track recognition result corresponding to the target object does not match the verification track direction, and outputting prompt information for prompting that the target object be shot again.
An aspect of an embodiment of the present application provides a data identifying apparatus, including:
The acquisition module is used for acquiring a first initial track recognition model, first sample video data corresponding to a sample article, and a first label classification result corresponding to the sample article in the first sample video data;
the second convolution processing module is used for carrying out convolution processing on coordinate sequences corresponding to at least two first sample video frames in the first sample video data by adopting the first initial track recognition model to obtain a position sub-feature map corresponding to each first sample video frame respectively;
The second feature extraction module is used for acquiring a sample time sequence relation between the at least two first sample video frames, combining the position sub-feature maps corresponding to each first sample video frame into a sample feature map sequence according to the sample time sequence relation, and performing time sequence feature extraction on the sample feature map sequence to obtain a first predicted sample track feature of the sample article in the first sample video data;
the second classification recognition module is used for classifying and recognizing the track features of the first predicted sample to obtain a first predicted classification result corresponding to the sample article;
the first determining module is used for determining a first loss function corresponding to the first initial track recognition model according to a first label classification result and a first prediction classification result corresponding to the sample article;
The second determining module is used for carrying out iterative training on the first initial track recognition model based on the first loss function, and determining the trained first initial track recognition model as a track recognition model; the track recognition model is used for acquiring a track recognition result corresponding to the target object in the video data.
Wherein, the data identification device further comprises:
The second output module is used for outputting a second predicted sample track feature of the sample article in second sample video data through a second initial track recognition model; the model parameters in the second initial track recognition model are the same as the model parameters in the first initial track recognition model;
The third classification and identification module is used for classifying and identifying the second predicted sample track feature to obtain a second predicted classification result corresponding to the sample article in the second sample video data;
the first generation module is used for generating a second loss function according to the second predicted classification result and a second label classification result corresponding to the second sample video data;
The second generation module is used for generating a third loss function according to the first predicted classification result and the second predicted classification result;
The second determination module comprises:
and a fifth determining unit, configured to generate a total loss function based on the first loss function, the second loss function and the third loss function, perform iterative training on the first initial track recognition model according to the total loss function, and determine the trained first initial track recognition model as the track recognition model.
In one aspect, the application provides a computer device comprising: a processor and a memory;
wherein the memory is used for storing a computer program, and the processor is used for calling the computer program to execute the following steps:
Acquiring video data containing a target object, and performing semantic segmentation on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame;
Carrying out convolution processing on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame;
Acquiring the time sequence relation between the at least two video frames, combining the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relation, and performing time sequence feature extraction on the feature map sequence to obtain the predicted track feature of the target object in the video data;
And classifying and identifying the predicted track feature to obtain a track identification result corresponding to the target object, wherein the track identification result is used for providing a verification basis for the service associated with the target object.
An aspect of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of:
Acquiring video data containing a target object, and performing semantic segmentation on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame;
Carrying out convolution processing on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame;
Acquiring the time sequence relation between the at least two video frames, combining the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relation, and performing time sequence feature extraction on the feature map sequence to obtain the predicted track feature of the target object in the video data;
And classifying and identifying the predicted track feature to obtain a track identification result corresponding to the target object, wherein the track identification result is used for providing a verification basis for the service associated with the target object.
In one aspect, the application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method in the above aspect.
In the embodiments of the application, video data containing a target object is acquired, and semantic segmentation is performed on at least two video frames in the video data to obtain the coordinate sequence of the target object in each video frame; performing track identification with these coordinate sequences, rather than the full frames, reduces the amount of computation. Convolution processing is then performed on the coordinate sequence of the target object in each video frame to obtain the position sub-feature map corresponding to each video frame. The convolution processing removes abnormal coordinate sequences and converts the coordinate sequence of the target object in each video frame into a unified format, namely a position sub-feature map, so that track identification can be performed on any type of video data, which expands the range of application and improves the applicability of track identification. The time sequence relation between the at least two video frames is acquired, the position sub-feature maps corresponding to each video frame are combined into a feature map sequence according to the time sequence relation, and time sequence feature extraction is performed on the feature map sequence to obtain the predicted track feature of the target object in the video data; extracting features according to the time sequence relation between the at least two video frames improves the accuracy of feature extraction. Finally, the predicted track feature is classified and identified to obtain the track identification result corresponding to the target object, and the track identification result is used to provide a verification basis for the service associated with the target object. With the method and the device, both the accuracy of track identification for the target object and the applicability of track identification can be improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1a is a schematic diagram of a data identification system according to an embodiment of the present application;
FIG. 1b is a schematic view of a scenario of data identification according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a data identification method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of video data including a target object according to an embodiment of the present application;
fig. 4 is a schematic flow chart of a data identification method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of obtaining a mask image corresponding to a target object according to an embodiment of the present application;
FIG. 6 is a schematic diagram of acquiring a coordinate sequence corresponding to a target object according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a convolutional block in a convolutional network layer according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a cyclic feature extraction block in a cyclic network layer according to an embodiment of the present application;
Fig. 9 is a schematic diagram of performing cyclic encoding on a feature map sequence in a cyclic network layer to obtain an encoding matrix corresponding to the feature map sequence according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a trajectory identification provided by an embodiment of the present application;
FIG. 11 is a schematic diagram of a track recognition model according to an embodiment of the present application;
fig. 12 is a flow chart of a data identification method according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a track recognition method according to an embodiment of the present application;
Fig. 14 is a schematic structural diagram of a data identification device according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a data identification device according to an embodiment of the present application;
fig. 16 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The present application relates to artificial intelligence technology, artificial intelligence (ARTIFICIAL INTELLIGENCE, AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision. The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The computer vision technology is a science for researching how to make a machine "see", and further means that a camera and a computer are used for replacing human eyes to perform machine vision such as recognition, tracking and measurement on a target, and further performing graphic processing, so that the computer is processed into an image which is more suitable for human eyes to observe or transmit to an instrument to detect. As a scientific discipline, computer vision research-related theory and technology has attempted to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, synchronous positioning, and map construction, among others, as well as common biometric recognition techniques such as face recognition, fingerprint recognition, and others. In the application, at least two video frames in the video data can be subjected to semantic segmentation by utilizing a computer vision technology, so as to obtain a coordinate sequence of the target object in each video frame. Therefore, the calculation amount can be reduced and the track recognition efficiency can be improved by carrying out track recognition on the coordinate sequence of the target object in each video frame.
The machine learning (MACHINE LEARNING, ML) is a multi-domain interdisciplinary, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like. In the scheme, the coordinate sequence of the target object in each video frame can be subjected to convolution processing through machine learning and deep learning, so that the position sub-feature map corresponding to each video frame is obtained. And acquiring a time sequence relation between at least two video frames, combining the position sub-feature images corresponding to each video frame into a feature image sequence according to the time sequence relation, extracting time sequence features of the feature image sequence to obtain predicted track features of the target object in video data, and classifying and identifying the predicted track features to obtain a track identification result corresponding to the target object. Thus, the accuracy of track identification of the target object can be improved, and the applicability of track identification can be improved.
Referring to fig. 1a, fig. 1a is a schematic structural diagram of a track recognition system according to an embodiment of the present application. As shown in fig. 1a, the track recognition system may comprise a server 10 and a cluster of user terminals. The cluster of user terminals may comprise one or more user terminals, the number of which will not be limited here. As shown in fig. 1a, the user terminals 100a, 100b, 100c, …, 100n may be specifically included. As shown in fig. 1a, the user terminals 100a, 100b, 100c, …, 100n may respectively be connected to the above-mentioned server 10 through a network, so that each user terminal may interact with the server 10 through the network connection.
Wherein each user terminal in the user terminal cluster may include: smart phones, tablet computers, notebook computers, desktop computers, wearable devices, smart home, head-mounted devices and other intelligent terminals with track recognition. It should be appreciated that each user terminal in the cluster of user terminals shown in fig. 1a may be provided with a target application (i.e. application client) that, when running in the respective user terminal, may interact with the server 10 shown in fig. 1a, respectively, as described above.
As shown in fig. 1a, the server 10 may receive video data including a target object sent by a user terminal, perform semantic segmentation on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame, and perform convolution processing on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame. The server 10 may acquire a time sequence relationship between at least two video frames, combine the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relationship, perform time sequence feature extraction on the feature map sequence to obtain a predicted track feature of the target object in the video data, and perform classification and identification on the predicted track feature to obtain a track identification result corresponding to the target object. The server 10 obtains the track recognition result, and may send the track recognition result to the user terminal, so that the user terminal may be used to provide a verification basis for the service associated with the target article according to the track recognition result. The server 10 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms.
For easy understanding, the embodiment of the present application may select one user terminal from the plurality of user terminals shown in fig. 1a as a target user terminal, where the target user terminal may include: and an intelligent terminal with a track recognition function, such as an intelligent mobile phone, a tablet personal computer, a notebook computer, a desktop computer, an intelligent television and the like. For example, the embodiment of the present application may use the user terminal 100a shown in fig. 1a as a target user terminal, where an application client having the track recognition function may be integrated. For example, the user terminal 100a may respond to an upload operation of the user in the application client, obtain data uploaded by the user, perform initial detection on the data, determine whether the data uploaded by the user includes a target object, and if the data uploaded by the user includes the target object, determine the data uploaded by the user as video data including the target object; if the target object does not exist in the data uploaded by the user, a prompt message can be output to prompt the user to upload the data again. After the user terminal 100a obtains the video data including the target object, the video data including the target object may be sent to the server 10, after the server 10 receives the video data including the target object sent by the user terminal 100a, semantic segmentation is performed on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame, and convolution processing is performed on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame respectively. The server 10 may acquire a time sequence relationship between at least two video frames, combine the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relationship, perform time sequence feature extraction on the feature map sequence to obtain a predicted track feature of the target object in the video data, and perform classification and identification on the predicted track feature to obtain a track identification result corresponding to the target object. The server 10 obtains the track recognition result, and may send the track recognition result to the user terminal, so that the user terminal provides a verification basis for the service associated with the target article according to the track recognition result.
For example, the target object may be a user certificate, the user terminal 100a may send video data including the user certificate to the server 10, after the server 10 receives the video data sent by the user terminal 100a, a track recognition result (e.g. turning right) corresponding to the user certificate may be obtained according to the above operation, and a track recognition result (e.g. turning right) corresponding to the user certificate may be sent to the user terminal 100a, and the user terminal may provide a verification basis for a service associated with the target object according to the track recognition result. If the track recognition result corresponding to the user certificate is right-turning, the video data containing the user certificate can be identified as effective video data, and related services such as certificate information acquisition or certificate authenticity verification can be performed according to the video data containing the user certificate; if the interface is turned over (i.e. the user certificate is turned over during the process of turning over), the video data containing the user certificate can be identified as invalid video data, and prompt information is output to request the user to upload the video data of the user certificate again.
As shown in fig. 1b, fig. 1b is a schematic view of a data identification scenario provided by an embodiment of the present application, and as shown in fig. 1b, a target object may be a certificate, and when a target user (i.e. a user who needs to perform a bank account opening service) performs a triggering operation for an upload certificate video button in a user interface b11, a target user terminal may respond to the triggering operation of the target user, and start a camera assembly to perform shooting, so as to obtain video data of the certificate including the target user. After the target user terminal obtains the video data containing the certificate, the data identification in the user interface b12 can be displayed, the video data containing the certificate is sent to the server b13, and the server b13 can perform track identification on the video data to obtain a track identification result corresponding to the certificate. As shown in fig. 1b, the server b13 may obtain coordinate sequences of the certificate in at least two video frames in the video data through the semantic segmentation model b14, and perform convolution processing on the coordinate sequences of the certificate in the at least two video frames in the video data through the convolutional neural network b15 to obtain a position sub-feature map corresponding to the at least two video frames respectively. And (3) carrying out cyclic feature extraction on the position sub-feature graphs corresponding to at least two video frames respectively through a cyclic neural network b16 to obtain predicted track features of the certificates in the video data, and carrying out classification recognition on the predicted track features of the certificates in the video data through linear classification b17 to obtain track recognition results corresponding to the certificates. After the server b13 obtains the track recognition result corresponding to the certificate, the track recognition result may be sent to the target user terminal, and after the target user terminal receives the track recognition result sent by the server b13, the target user terminal may determine whether the track recognition result matches with the verification track direction, at this time, the target user terminal continues to display the user interface b18 in "data recognition". If the target user terminal determines that the track recognition result corresponding to the certificate is matched with the verification track direction, displaying a 'pass' user interface b19, and after verification, the target user terminal can acquire the certificate information, such as an identity card number, birth year, month, day and the like, according to the video data corresponding to the certificate; if the target user terminal determines that the track recognition result corresponding to the certificate is not matched with the verification track direction, displaying a user interface b20 of 'please re-shoot', and prompting the target user to re-shoot.
Referring to fig. 2, fig. 2 is a flow chart of a data identification method according to an embodiment of the application. The data identification method may be performed by a computer device, which may be a server (e.g. the server 10 of fig. 1a described above), or a user terminal (e.g. any user terminal of the user terminal cluster of fig. 1a described above), or a system of servers and user terminals, which is not limited in this respect. As shown in fig. 2, the data identification method may include steps S101 to S104.
S101, acquiring video data containing a target object, and carrying out semantic segmentation on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame.
Specifically, the computer device can perform track recognition on the motion track of the target object in the video data to obtain a track recognition result corresponding to the target object, and provide reference information for related services based on the track recognition result. For example, the running state or running track of the target object can be monitored in real time based on the motion track of the target object, or it can be detected whether the motion track of the target object meets a specification; for example, the validity of the video data corresponding to the target object can be verified according to the track identification result, so as to avoid an incorrect service processing result caused by performing service processing associated with the target object on invalid video data. For example, if the target object is an identity document, when a user needs to upload video data corresponding to the identity document to acquire identity information or to verify the authenticity of the identity document, the validity of the uploaded video data needs to be verified first; when the uploaded video data is invalid (for example, the document was replaced while the video data was being uploaded), performing the service processing associated with the target object on the invalid video data would produce an incorrect processing result, such as acquiring incorrect identity information. As another example, when the target object is an express box that needs to be transported to a designated window on a conveyor belt, shooting can be performed by a monitoring device (such as a camera) to obtain video data containing the express box, and track recognition is performed on this video data to obtain a track recognition result corresponding to the express box. When the track recognition result corresponding to the express box matches the verification track direction specified by the manager, it can be determined that the express box has reached the designated position; when the track recognition result does not match the specified verification track direction, it can be determined that the express box has not reached the designated position, and alarm information is sent to prompt the manager to move the express box that deviates from the verification track direction to the designated position; meanwhile, the manager can locate the deviating express box according to the track recognition result.
Specifically, after the computer device obtains the video data containing the target object (for example, video data uploaded by a user or video data obtained by shooting through a camera component), semantic segmentation may be performed on at least two video frames in the video data to obtain the coordinate sequence of the target object in each video frame. The target object may be any object that needs to be monitored, such as an identity document, an express box, a vehicle, or an article produced in a production workshop, which is not limited in the embodiments of the present application. The semantic segmentation method may include R-CNN (Region-Convolutional Neural Network, region-based semantic segmentation, such as Mask R-CNN (Mask Region-Convolutional Neural Network), an instance segmentation algorithm), FCN (Fully Convolutional Networks, fully convolutional semantic segmentation), and the like, and may also be other detection methods such as line detection or face detection.
The computer device can shoot the target object through an imaging component in the computer device to obtain the video data containing the target object; alternatively, another device (such as a user terminal) may capture the video data corresponding to the target object and then send the video data to the computer device.
Optionally, the specific manner in which the computer device obtains the video data containing the target item may include: and responding to uploading operation in the track verification page, starting the image pickup assembly, shooting the target object based on the image pickup assembly, and determining the movement track of the shot target object as video data containing the target object.
Specifically, the computer device in this solution may run a user client, which may be a native application (APP) or a virtual application, for example an APP used on the user side for track verification. The user can perform an uploading operation for uploading video data containing the target object on the track verification page (for example, touching a video-data upload button), and the computer device can respond to the user's uploading operation on the track verification page by starting the camera assembly, shooting the target object based on the camera assembly, and determining the captured footage of the target object's movement as the video data containing the target object.
Optionally, the specific manner in which the computer device obtains the video data containing the target item may include: and receiving a track verification request aiming at the target object and sent by the target object, and performing authority verification on the target object according to the track verification request. If the target object has the authority to perform track verification on the target object, acquiring a video stream corresponding to the target object from the track verification request, decoding the video stream, and determining the decoded video stream as video data containing the target object.
Specifically, when a requesting object (such as a user terminal) sends a track verification request for the target object to the computer device, the computer device may receive the track verification request and perform authority verification on the requesting object. If the requesting object has the authority to perform track verification on the target object, the video stream corresponding to the target object is obtained from the track verification request, decoded, and determined as the video data containing the target object. If the requesting object does not have the authority to perform track verification on the target object, rejection information is sent to the requesting object, informing it that track verification on the target object cannot be performed. For example, it may be agreed by default that only certain objects (such as objects with membership) can request track verification; when such a default object requests track verification for the target object, the computer device determines the track recognition result corresponding to the target object according to the video data corresponding to the target object and sends the track recognition result to the requesting object; when an object other than the default objects requests track verification for the target object, the computer device may reject its track verification request and may output prompt information indicating which operations that object needs to perform before it can request track verification.
S102, carrying out convolution processing on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame.
Specifically, after the computer device obtains the coordinate sequence of the target object in each video frame, convolution processing can be performed on the coordinate sequence of the target object in each video frame to obtain the position sub-feature map corresponding to each video frame. A CNN (Convolutional Neural Network, a feedforward neural network used to extract features from images) may be adopted to convolve the coordinate sequence of the target object in each video frame to obtain the position sub-feature map corresponding to each video frame. One-dimensional convolution can be used for feature extraction, so that features are extracted according to the time sequence information of the vertex coordinates of the target object across the at least two video frames. In addition, when the pixels of some video frames are poor, so that the target object is unclear and the extracted coordinate sequence corresponding to the target object is inaccurate, one-dimensional convolution has a good smoothing effect on the abnormal coordinate sequence and weakens its influence, thereby improving the robustness of the model and further improving the accuracy of feature extraction. Of course, the I×K coordinate matrix corresponding to the target object may also be regarded as a single-channel image vector, where I is the number of the at least two video frames, K is the number of coordinates of the target object in each video frame, and I and K are positive integers (for example, I may take values of 1, 2, 3, …, and K may take values of 1, 2, 3, …); the single-channel image vector is then subjected to convolution processing by two-dimensional convolution to obtain the position sub-feature map corresponding to each video frame.
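For illustration only, the following is a minimal sketch of this step under the assumption that PyTorch is used (the embodiment does not mandate any particular framework); the numbers of frames, coordinates and output channels are arbitrary assumptions.

```python
# Hypothetical sketch: 1-D convolution over per-frame coordinate sequences (PyTorch assumed).
import torch
import torch.nn as nn

I, K = 16, 8                     # I video frames, K = 8 coordinates (4 vertices x 2) per frame -- assumed values
coords = torch.randn(1, K, I)    # coordinate matrix viewed as a K-channel 1-D signal of length I

# One-dimensional convolution along the frame axis; each channel tracks one vertex coordinate over time
conv1d = nn.Conv1d(in_channels=K, out_channels=32, kernel_size=5, padding=2)
position_conv_info = conv1d(coords)   # -> (1, 32, I), position convolution information
print(position_conv_info.shape)       # torch.Size([1, 32, 16])
```

Because the kernel slides along the frame axis, each output position aggregates the coordinates of several adjacent frames, which is what smooths occasional abnormal coordinate values.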
S103, acquiring a time sequence relation between at least two video frames, combining the position sub-feature images corresponding to each video frame into a feature image sequence according to the time sequence relation, and extracting time sequence features of the feature image sequence to obtain the predicted track features of the target object in the video data.
Specifically, since the video data is obtained by shooting the target object, the computer device may obtain the time sequence relationship between the at least two video frames in the video data according to the shooting order, sort the position sub-feature maps corresponding to each video frame according to this time sequence relationship, and combine them into a feature map sequence. The feature map sequence formed by combining the position sub-feature maps corresponding to the at least two video frames can reflect the change of the target object across the at least two video frames, so feature extraction can be performed on the feature map sequence to obtain the predicted track features of the target object in the video data. The computer device can adopt a recurrent neural network to extract features from the feature map sequence to obtain the predicted track features of the target object in the video data. The recurrent neural network may be an RNN (Recurrent Neural Network, an artificial neural network whose nodes are connected in a directed cycle, which can better combine time sequence information for feature extraction), such as a conventional RNN, an LSTM (Long Short-Term Memory network, a recurrent neural network suitable for processing and predicting important events with relatively long intervals and delays in a time sequence), or a GRU (Gate Recurrent Unit, a variant of the recurrent neural network that is structurally simpler than LSTM while retaining a good feature extraction effect).
S104, classifying and identifying the predicted track features to obtain a track identification result corresponding to the target object.
Specifically, the computer device obtains the predicted track characteristics of the target object in the video data, and can obtain the track recognition result corresponding to the target object by classifying and recognizing the predicted track characteristics, wherein the track recognition result is used for providing a verification basis for the service associated with the target object.
Optionally, the computer device may obtain a verification track direction corresponding to the business associated with the target object, where the verification track direction refers to the direction required for verification by the business associated with the target object, i.e. the valid direction determined by that business. If the track recognition result corresponding to the target object matches the verification track direction, it is determined that the video data corresponding to the target object is valid video data, and the validity of the target object is verified according to the video data. For example, if the target object is an identity document, when the video data corresponding to the identity document is valid video data, the identity document can be verified according to the valid video data, or identity information can be acquired from it, such as the user name, the user identity card number, and the user's home address in the identity document. If the track recognition result corresponding to the target object does not match the verification track direction, it is determined that the video data corresponding to the target object is invalid video data, and prompt information is output prompting that the target object needs to be re-shot. For example, if the target object is an identity document and the user needs to transact banking or other business that requires verifying the identity document, when track recognition is performed on the video data corresponding to the identity document and the obtained track recognition result does not match the verification track direction, the video data is determined to be invalid, and prompt information can be output requiring the user to shoot the identity document again.
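As a purely illustrative sketch of this verification logic (the direction labels, function name and messages below are hypothetical and not part of the embodiment):

```python
# Hypothetical sketch: match a track recognition result against the verification track direction.
VERIFY_DIRECTION = "turn_right"   # direction required by the business associated with the target object (assumed label)

def check_validity(track_result: str) -> str:
    """Return a prompt according to whether the recognized track matches the verification direction."""
    if track_result == VERIFY_DIRECTION:
        return "valid video data: proceed with the business associated with the target object"
    return "invalid video data: please re-shoot the target object"

print(check_validity("turn_right"))   # valid ...
print(check_validity("turn_left"))    # invalid ...
```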
Fig. 3 is a schematic diagram of video data containing a target object according to an embodiment of the present application. As shown in fig. 3, in video frame X i, the target object is in a first state (i.e. the initial state when the user starts shooting, with no deflection angle). In video frame X i+n, n is a positive integer, that is, video frame X i+n is n frames after video frame X i; as shown in fig. 3, the target object in video frame X i+n is deflected to the right by a certain angle, and its size has changed. In video frame X i+m, m is a positive integer greater than n, that is, video frame X i+m is m - n frames after video frame X i+n; as shown in fig. 3, the target object in video frame X i+m is deflected to the right by a greater angle than in video frame X i+n, and its size has changed further. When the computer device performs track recognition on the video data shown in fig. 3, it can recognize that the track recognition result of the target object in this video data is a rightward turn.
In the embodiment of the application, the coordinate sequence of the target object in each video frame is obtained by acquiring the video data containing the target object and carrying out semantic segmentation on at least two video frames in the video data, and the coordinate sequence of the target object in each video frame is adopted to carry out track identification on the target object, so that the calculation amount can be reduced. And carrying out convolution processing on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame. The coordinate sequence of the target object in each video frame is subjected to convolution processing, so that the abnormal coordinate sequence can be eliminated, the coordinate sequence of the target object corresponding to each video frame is converted into a unified format, namely into a position sub-feature map, and therefore, the track identification can be carried out on any type of video data, and the application range is enlarged. And acquiring a time sequence relation between at least two video frames, combining the position sub-feature images corresponding to each video frame into a feature image sequence according to the time sequence relation, and extracting time sequence features of the feature image sequence to obtain the predicted track features of the target object in the video data. Therefore, the feature extraction is performed according to the time sequence relation between at least two video frames, so that the accuracy of the feature extraction can be improved. And classifying and identifying the predicted track features to obtain a track identification result corresponding to the target object, wherein the track identification result is used for providing verification basis for the service associated with the target object. By the method and the device, the accuracy of track identification of the target object can be improved, and the applicability of track identification is improved.
Referring to fig. 4, fig. 4 is a flowchart of a data identification method according to an embodiment of the present application. The data identification method may be performed by a computer device, which may be a server (e.g. the server 10 of fig. 1a described above), or a user terminal (e.g. any user terminal of the user terminal cluster of fig. 1a described above), or a system of servers and user terminals, which is not limited in this respect. As shown in fig. 4, the data recognition method may include steps S201 to S207.
S201, obtaining video data containing a target object, and carrying out semantic segmentation on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame.
Specifically, the computer device may capture the target object through the image capturing component in the computer device to obtain the video data containing the target object, or another device (such as a user terminal) may capture the video data containing the target object and then send it to the computer device. After the computer device obtains the video data containing the target object, semantic segmentation can be performed on at least two video frames in the video data to obtain the coordinate sequence of the target object in each video frame.
Optionally, the at least two video frames include a video frame T i, i is smaller than or equal to the number S of the at least two video frames, i and S are both positive integers, and the specific manner of performing semantic segmentation on the at least two video frames in the video data by the computer device to obtain the coordinate sequence of the target object in each video frame may include: randomly dividing an image corresponding to the video frame T i to obtain N candidate areas; n is a positive integer, for example, N can be 1,2,3, …. And respectively extracting pixels from the N candidate areas to obtain area pixel points corresponding to the N candidate areas, and classifying the pixels of the areas contained in the N candidate areas according to the pixel value interval associated with the target object to obtain the target pixel point in each candidate area. Based on the target pixel point, the video frame T i is subjected to segmentation processing to obtain a mask image corresponding to the target object, the pixel value corresponding to the target pixel point belongs to a pixel value interval associated with the target object, and based on the mask image, the coordinate sequence of the target object in the video frame T i is determined.
Specifically, the computer device may determine the location information of the target item in the video frame T i through a semantic segmentation algorithm in deep learning, thereby determining the coordinate sequence of the target item in the video frame T i. For example, the computer device may randomly segment an image corresponding to the video frame T i to obtain N candidate regions, where N is a positive integer. The computer device may also traverse the image corresponding to the video frame T i by using a sliding window, so as to obtain N candidate areas, where the size of the sliding window may be set according to a specific situation. Or a selective search algorithm (SELECTIVE SEARCH) may be used to calculate the similarity between each area after random division, and find the possible position of the target object, that is, the candidate area, according to the similarity, and extract the candidate area that may contain the target object from the image corresponding to the video frame T i.
Specifically, after obtaining the N candidate regions, the computer device may perform pixel extraction on the N candidate regions to obtain the region pixel points respectively corresponding to the N candidate regions. The region pixel points contained in the N candidate regions are then classified according to the pixel value interval associated with the target object to obtain the target pixel points in each candidate region. Based on the target pixel points, the video frame T i is subjected to segmentation processing to obtain the mask image corresponding to the target object, i.e. the outline information of the target object in video frame T i, where the pixel values of the target pixel points fall within the pixel value interval associated with the target object. After the computer device obtains the mask image corresponding to the target object, the coordinate sequence of the target object in video frame T i can be determined based on the mask image, so the coordinate sequences corresponding to the at least two video frames in the video data can be obtained in this way. Because representing the position of the target object in video frame T i by the full mask image is overly redundant, the coordinate sequence of the target object in video frame T i can be determined from the mask image, and the position information of the target object is represented by this coordinate sequence, which reduces the amount of computation during track recognition.
As shown in fig. 5, fig. 5 is a schematic diagram of acquiring a mask image corresponding to a target object, where, as shown in fig. 5, a computer device may input N candidate regions corresponding to a video frame T i into a semantic segmentation model trained according to a semantic segmentation algorithm (such as FCN), and perform feature extraction on the N candidate regions through a plurality of convolution layers in the semantic segmentation model to obtain region pixel points corresponding to each candidate region, and perform pixel classification on the region pixel points contained in each N candidate region according to a pixel value interval associated with the target object, so as to obtain the target pixel point in each candidate region. Based on the target pixel point, the video frame T i is subjected to segmentation processing, so that a mask image corresponding to the target object, namely the outline information of the target object in the video frame T i, is obtained, and the pixel value corresponding to the target pixel point belongs to a pixel value interval associated with the target object.
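The pixel-classification step can be illustrated with a simplified sketch; in practice the mask is produced by the trained semantic segmentation model described above, and the fixed pixel-value interval below is only an assumed stand-in for that model's per-pixel decision.

```python
# Simplified numpy sketch: classify pixels by a pixel-value interval to obtain a mask image
# (illustrative only; the embodiment uses a trained semantic segmentation model such as FCN / Mask R-CNN).
import numpy as np

frame = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)   # grayscale video frame T_i (assumed)
low, high = 180, 255          # pixel value interval associated with the target object -- assumed

# Target pixel points: pixels whose values fall inside the associated interval
target_pixels = (frame >= low) & (frame <= high)

# Mask image corresponding to the target object (1 inside the object, 0 elsewhere)
mask = target_pixels.astype(np.uint8)
print(mask.shape, mask.max())
```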
Optionally, the specific manner in which the computer device determines the coordinate sequence of the target object in video frame T i based on the mask image may include: determining, according to the mask image, the image edge points of the target object in video frame T i, performing straight line fitting on the image edge points, and determining M fitted straight lines, where M is a positive integer greater than or equal to 3 (for example, M may take values of 3, 4, 5, …) since the M fitted straight lines are obtained from the mask image corresponding to the target object; acquiring the intersection point between any two adjacent fitted straight lines among the M fitted straight lines, acquiring the position coordinate information corresponding to the intersection points, and generating the coordinate sequence of the target object in video frame T i according to the position coordinate information corresponding to the intersection points.
Specifically, in video frame T i, the computer device may determine, according to the mask image corresponding to the target object, the image edge points of the target object in video frame T i, i.e. detect the boundary points between the mask image and the image area other than the mask image in video frame T i, and perform straight line fitting on these image edge points to determine the M fitted straight lines. Because the M fitted straight lines are the boundary lines of the target object, an intersection point exists between any two adjacent fitted straight lines, forming a closed area; therefore, the computer device can acquire the intersection points between adjacent fitted straight lines among the M fitted straight lines, acquire the position coordinate information corresponding to these intersection points, and transpose the position coordinates corresponding to the intersection points (namely, convert the position coordinates into the required matrix form) to obtain the coordinate sequence of the target object in video frame T i.
For example, as shown in fig. 6, fig. 6 is a schematic diagram of acquiring a coordinate sequence corresponding to a target object according to an embodiment of the present application. As shown in fig. 6, when the target object is an identity document, the document mask image corresponding to the identity document may be determined by performing semantic segmentation on a video frame containing the identity document, which may be understood with reference to fig. 5 and is not repeated here. According to the mask image corresponding to the target object, the image edge points corresponding to the document mask image are obtained, and 4 fitted straight lines, namely fitted straight line 1, fitted straight line 2, fitted straight line 3 and fitted straight line 4, can be obtained through straight line fitting. The intersection points (namely the vertices) between any two adjacent fitted straight lines among the four fitted straight lines are then acquired in turn. The lower left corner of the image corresponding to the video frame, or the lower left corner of the mask image, may be used as the origin of coordinates to obtain the position coordinate information of the intersection points between adjacent fitted straight lines; the position coordinate information corresponding to the intersection points may be determined as the position coordinate information corresponding to the identity document, for example (x1, y1), (x2, y2), (x3, y3), (x4, y4) may be used to represent the four vertices of the identity document. In this way, the position coordinate information of the identity document corresponding to each of the at least two video frames in the video data is obtained, and the position coordinate information corresponding to the at least two video frames is integrated (for example, by transposition) to obtain the coordinate sequence of the identity document in each video frame.
The formula of obtaining the coordinate sequence of the identity document in each video frame can be represented by the following formula (1).
A = [[x1, y1, x2, y2, x3, y3, x4, y4]^T, [x1, y1, x2, y2, x3, y3, x4, y4]^T, …]    (1)
In formula (1), x1, y1, x2, y2, x3, y3, x4, y4 refer to the position coordinate information corresponding to the identity document in one video frame, A refers to the coordinate matrix corresponding to the identity document, and T denotes transposition.
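A hedged sketch of turning a mask image into the coordinate matrix A of formula (1) is given below (OpenCV 4.x and numpy assumed). Note that cv2.approxPolyDP is used here as a simplified substitute for the straight-line fitting and intersection procedure described above, and the synthetic rectangular masks are placeholders.

```python
# Illustrative sketch: recover four document vertices from a mask image and stack them into
# the coordinate matrix A of formula (1). OpenCV 4.x assumed; approxPolyDP stands in for the
# "fit M straight lines and take their intersections" procedure described in the text.
import cv2
import numpy as np

def vertices_from_mask(mask: np.ndarray) -> np.ndarray:
    """Return the 8-dim coordinate vector [x1, y1, ..., x4, y4] of the largest quadrilateral in the mask."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)                       # boundary of the mask image
    approx = cv2.approxPolyDP(contour, 0.02 * cv2.arcLength(contour, True), True)
    return approx.reshape(-1, 2)[:4].astype(np.float32).reshape(-1)    # (x1, y1, ..., x4, y4)

# Build placeholder masks for I = 3 video frames and stack their vertex vectors column-wise,
# so each column corresponds to one video frame (cf. the transposition in formula (1)).
masks = [np.zeros((480, 640), dtype=np.uint8) for _ in range(3)]
for m in masks:
    cv2.rectangle(m, (100, 120), (400, 300), 255, thickness=-1)        # synthetic document region
A = np.stack([vertices_from_mask(m) for m in masks], axis=1)           # shape (8, I)
print(A.shape)
```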
S202, integrating coordinate sequences of the target object in at least two video frames to obtain a coordinate matrix corresponding to the target object, inputting the coordinate matrix into a convolution network layer in a track recognition model, and performing convolution operation on the coordinate matrix to obtain position convolution information corresponding to the coordinate matrix.
Specifically, the convolutional network layer in the track recognition model may be a CNN (Convolutional Neural Network, a feed-forward neural network used to perform feature extraction on images). The computer device may integrate the coordinate sequences of the target object in the at least two video frames to obtain the coordinate matrix corresponding to the target object; for example, the coordinate sequences of the target object in the at least two video frames may be combined and transposed into an I×K coordinate matrix, where I is the number of the at least two video frames, K is the number of coordinates of the target object in each video frame, and I and K are positive integers (for example, I may take values of 1, 2, 3, …, and K may take values of 1, 2, 3, …). Vector conversion is performed on the coordinate matrix corresponding to the target object, the converted coordinate matrix is input into the convolutional network layer in the track recognition model, and convolution operation is performed on the coordinate matrix to obtain the position convolution information corresponding to the coordinate matrix. The convolutional network layer in the track recognition model comprises convolution blocks and adaptive average pooling, where each convolution block comprises a convolution network, a feature standardization network and an activation function. The computer device can perform the convolution operation on the coordinate matrix using the convolution network in the convolution block to obtain the position convolution information corresponding to the coordinate matrix.
Fig. 7 is a schematic structural diagram of a convolution block in the convolutional network layer according to an embodiment of the present application. As shown in fig. 7, the convolution network in the convolution block may be a one-dimensional convolution, which is used to perform feature extraction on the input vector to obtain the position convolution information corresponding to the input vector. The convolution block may also comprise group normalization, which normalizes the position convolution information corresponding to the input vector, and an activation function for nonlinear combination of the normalized position convolution information, i.e. introducing nonlinear factors, so as to improve the effect of feature extraction.
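A minimal sketch of one such convolution block, assuming PyTorch and illustrative layer sizes, could look as follows; ReLU is used here as an example activation function.

```python
# Hypothetical sketch of one convolution block from fig. 7 (PyTorch assumed):
# one-dimensional convolution -> group normalization -> activation function.
import torch
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv1d(in_channels=8, out_channels=32, kernel_size=5, padding=2),  # 1-D convolution
    nn.GroupNorm(num_groups=4, num_channels=32),                          # group normalization
    nn.ReLU(),                                                             # activation function
)

x = torch.randn(1, 8, 16)          # coordinate matrix as an 8-channel signal over 16 frames (assumed sizes)
print(conv_block(x).shape)         # torch.Size([1, 32, 16])
```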
The computer device integrates the coordinate sequences corresponding to the target object in the at least two video frames to obtain the coordinate matrix. The coordinate matrix obtained by integration can be regarded as a one-dimensional coordinate image vector with K channels, where the vector in each dimension (i.e. channel) represents the change of one vertex coordinate of the target object across the at least two video frames; for example, the first dimension records the change of vertex coordinate x1 of the target object across the at least two video frames. Because the I×K coordinate matrix corresponding to the target object can be regarded as a one-dimensional image vector with K channels, when the coordinate matrix is convolved in the convolution network, one-dimensional convolution can be applied to the coordinate vectors on the K channels, so that features are better extracted according to the time sequence information of the vertex coordinates of the target object across the at least two video frames, obtaining the position convolution information corresponding to the coordinate matrix. In addition, when the pixels of some video frames are poor, so that the target object is unclear and the extracted coordinate sequence corresponding to the target object is inaccurate, one-dimensional convolution has a good smoothing effect on the abnormal coordinate sequence and weakens its influence, thereby improving the robustness of the model and further improving the accuracy of feature extraction. Of course, the I×K coordinate matrix corresponding to the target object may also be regarded as a single-channel image vector, and two-dimensional convolution may be used to convolve the single-channel image vector to obtain the position convolution information corresponding to the coordinate matrix.
Meanwhile, when the computer device performs the convolution operation on the coordinate matrix using one-dimensional convolution, a larger convolution kernel (such as 5×5 or 7×7) can be adopted to expand the receptive field, so that the information about the change of the vertex coordinates of the target object across the at least two video frames is better captured, improving the effect of feature extraction on the coordinate sequence corresponding to the target object. The one-dimensional convolution includes one or more convolution kernels (also called filters or receptive fields), and the convolution operation means that the convolution kernel performs matrix multiplication with the sub-matrices located at different positions of the input vector. The number of rows H_out and the number of columns W_out of the output matrix after the convolution operation are determined by the size of the input vector, the size of the convolution kernel, the step size (stride) and the boundary filling (padding), that is:

H_out = (H_in - H_kernel + 2 × padding) / stride + 1
W_out = (W_in - W_kernel + 2 × padding) / stride + 1

where H_in and H_kernel represent the number of input vectors and the number of rows of the convolution kernel respectively, and W_in and W_kernel represent the dimension of each input vector and the number of columns of the convolution kernel respectively.
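A short worked example of these output-size formulas (with assumed values) is:

```python
# Worked example of the convolution output-size formula; all values are illustrative assumptions.
def conv_out(size, kernel, padding, stride):
    return (size - kernel + 2 * padding) // stride + 1

H_in, H_kernel, padding, stride = 16, 5, 2, 1
print(conv_out(H_in, H_kernel, padding, stride))   # (16 - 5 + 2*2) / 1 + 1 = 16
```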
S203, carrying out normalization processing on the position convolution information to obtain normalized position convolution information.
Specifically, after the computer device obtains the position convolution information corresponding to the coordinate matrix, it may use the feature standardization network in the convolution block to normalize the position convolution information corresponding to the coordinate matrix, obtaining the normalized position convolution information. Feature standardization, also called normalization, is used to make different feature indexes comparable; after normalization, all indexes are on the same order of magnitude, which facilitates comprehensive comparison.
Optionally, the specific way for the computer device to normalize the position convolution information to obtain the normalized position convolution information may include: based on the feature standardization network in the convolutional network layer, grouping the position convolution information to obtain Q position convolution groups, where Q is a positive integer (for example, Q may take values of 1, 2, 3, …); acquiring the mean and the variance respectively corresponding to the Q position convolution groups, and normalizing the position convolution information in each position convolution group based on the mean and the variance respectively corresponding to the Q position convolution groups, to obtain the normalized position convolution information.
Specifically, because the position of the target object and the proportion of the frame it occupies differ from one video frame to another, the mean and variance of the coordinate sequences corresponding to the target object differ considerably between different stages of the at least two video frames; if batch normalization were used to normalize all the position convolution information together, the final normalization result would be inaccurate. Therefore, a group normalization approach can be adopted in the feature standardization network: the position convolution information is grouped to obtain Q position convolution groups, the mean and variance respectively corresponding to the Q position convolution groups (i.e. the mean and variance of each position convolution group) are acquired, and the position convolution information in each position convolution group is normalized based on the mean and variance of that group, obtaining the normalized position convolution information. In this way, when the target object changes greatly across the at least two video frames, the staged normalization performed by group normalization can better capture the change information of the target object across the at least two video frames, and is largely unaffected by the size of the video data or by the changes of the target object within it.
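The grouping-and-normalization computation can be sketched as follows with numpy (group count and tensor sizes are assumed for illustration; in a PyTorch implementation this corresponds to a GroupNorm-style layer):

```python
# Illustrative numpy sketch: split the position convolution information into Q groups along the
# channel axis and normalize each group with its own mean and variance.
import numpy as np

C, L, Q = 32, 16, 4                                 # channels, sequence length, number of groups (assumed)
conv_info = np.random.randn(C, L)

groups = conv_info.reshape(Q, C // Q, L)            # Q position convolution groups
mean = groups.mean(axis=(1, 2), keepdims=True)      # mean per group
var = groups.var(axis=(1, 2), keepdims=True)        # variance per group
normalized = ((groups - mean) / np.sqrt(var + 1e-5)).reshape(C, L)
print(normalized.shape)
```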
S204, based on an activation function in the convolution network layer, nonlinear combination is carried out on the position convolution information after normalization processing, and a position feature map corresponding to the coordinate sequence of the target object in at least two video frames is generated.
Specifically, the computer device may perform nonlinear combination on the normalized position convolution information based on the activation function in the convolution block, to generate a position feature map corresponding to the coordinate sequence of at least two video frames. The activation function is to perform nonlinear combination on the output after convolution operation and the input before convolution operation, so as to increase nonlinear factors and solve the defect of insufficient expression capacity of the linear model.
Optionally, the specific manner of generating the position feature map corresponding to the coordinate sequences in the at least two video frames by the computer device may include: and based on an activation function in the convolution network layer, carrying out nonlinear combination on the position convolution information after normalization processing to obtain candidate position feature diagrams corresponding to the coordinate sequences in at least two video frames. Acquiring the number of output channels and output size information in a convolution network layer, and adjusting the candidate position feature map based on the number of output channels and the output size information to obtain a position feature map corresponding to a coordinate sequence of a target object in at least two video frames; the size information corresponding to the position feature map is the product of the output size information and the number of output channels.
Specifically, the computer device can perform nonlinear combination on the normalized position convolution information through the activation function in the convolution block (which belongs to the convolutional network layer) to obtain the candidate position feature map corresponding to the coordinate sequences of the target object in the at least two video frames. There may be multiple convolution blocks in the convolutional network layer, through which multi-layer feature extraction can be performed on the coordinate matrix. After the candidate position feature map is obtained through the convolution operations of the convolution blocks in the convolutional network layer, the number of output channels in the convolutional network layer and the output size information of the adaptive average pooling in the convolutional network layer can be obtained, and the size of the candidate position feature map is adjusted to obtain the position feature map corresponding to the coordinate sequences of the target object in the at least two video frames, where the size corresponding to the position feature map is the product of the output size information and the number of output channels. For example, if the number of output channels in the convolutional network layer is C and the output size of the adaptive average pooling in the convolutional network layer is M, the coordinate matrix generated from video data of any duration can be mapped through the adaptive average pooling to a position feature map with a uniform size of C×M, which effectively widens the applicability to video data with different frame rates and different durations; that is, feature extraction can be performed in the convolutional network on video data of different frame rates and durations to obtain position feature maps of the uniform size C×M for subsequent track recognition.
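The effect of adaptive average pooling producing a uniform C×M output for clips of different lengths can be sketched as follows (PyTorch assumed, sizes illustrative):

```python
# Hypothetical sketch: adaptive average pooling maps position feature maps of any temporal length
# to a uniform C x M size, so video data of different frame rates / durations can be handled.
import torch
import torch.nn as nn

C, M = 32, 10
pool = nn.AdaptiveAvgPool1d(M)
short_clip = torch.randn(1, C, 25)      # features from a 25-frame clip
long_clip = torch.randn(1, C, 300)      # features from a 300-frame clip
print(pool(short_clip).shape, pool(long_clip).shape)   # both torch.Size([1, 32, 10])
```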
S205, determining a position sub-feature map corresponding to each video frame based on the position feature map.
Specifically, after the computer device obtains the position feature diagrams corresponding to the coordinate sequences of at least two video frames, the position feature diagrams can be split, so as to obtain the position sub-feature diagrams corresponding to each video frame respectively.
S206, acquiring a time sequence relation between at least two video frames, combining the position sub-feature images corresponding to each video frame into a feature image sequence according to the time sequence relation, and extracting time sequence features of the feature image sequence to obtain the predicted track features of the target object in the video data.
Specifically, the computer device may obtain a time sequence relationship between at least two video frames, and combine the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relationship, and perform time sequence feature extraction on the feature map sequence to obtain a predicted track feature of the target object in the video data.
Optionally, the specific manner in which the computer device obtains the predicted track features of the target object in the video data may include: acquiring the time sequence relationship between the at least two video frames, and combining the position sub-feature maps corresponding to each video frame into a feature map sequence according to the time sequence relationship; performing cyclic coding on the feature map sequence based on the cyclic network layer of the track recognition model to obtain the coding matrix corresponding to the feature map sequence, and performing nonlinear combination on the coding matrix based on the activation function in the cyclic network layer to obtain the predicted track features of the target object in the video data. The cyclic network layer in the track recognition model may be an RNN (Recurrent Neural Network, an artificial neural network whose nodes are connected in a directed cycle, which can better combine time sequence information for feature extraction), and the cyclic network layer includes a plurality of cyclic feature extraction blocks (i.e. RNN blocks), such as conventional RNN blocks, LSTM (Long Short-Term Memory) blocks (a time-recursive neural network suitable for processing and predicting important events with relatively long intervals and delays in a time sequence), or GRU (Gate Recurrent Unit) blocks (a variant of the recurrent neural network that is structurally simpler than LSTM while retaining a good feature extraction effect).
As shown in fig. 8, fig. 8 is a schematic structural diagram of a cyclic network layer provided by the embodiment of the present application, where, as shown in fig. 8, the cyclic network layer may be composed of M cyclic feature extraction blocks (i.e. cyclic units in fig. 8), and each cyclic feature extraction block may combine feature information corresponding to a previous video frame and/or a next video frame of a current video frame to be encoded, i.e. combine context, when encoding a position sub-feature map corresponding to each video frame, so as to improve the feature extraction effect.
Specifically, each cyclic feature extraction block may receive an input vector with a length of C, so that a position feature map of c×m obtained in a convolutional network layer in a track recognition model may be input into the cyclic network layer in the track recognition model, and track recognition is performed on the position feature map of c×m, to obtain a predicted track feature of the target object in video data. The computer device may use M cyclic feature extraction blocks in the cyclic network layer to perform cyclic encoding (i.e. cyclic feature extraction) on the sequence feature map, and may acquire a position sub-feature map corresponding to a previous video frame and/or a next video frame of the current video frame to be encoded in combination with a timing relationship between at least two video frames, as a hidden state vector of the current video frame to be encoded, and encode the position sub-feature map corresponding to the current video frame to be encoded together with the hidden state vector, so that the obtained feature vector corresponding to the current video frame to be encoded is more accurate, that is, an association relationship between contexts of the current video frame to be encoded is combined, and the obtained feature vector is more accurate. Based on an activation function in the cyclic neural network layer, nonlinear combination is carried out on the coding matrix corresponding to the feature map sequence, and the predicted track feature of the target object in the video data is obtained.
Specifically, at least two video frames include video frame X i and video frame X i+1, video frame X i has an adjacent relationship with video frame X i+1, i+1 is less than or equal to the number S of at least two video frames, i and S being positive integers. When the computer device performs cyclic encoding on the feature map sequence to obtain an encoding matrix corresponding to the feature map sequence, cyclic encoding can be performed on the position sub-feature maps corresponding to at least two video frames respectively, that is, if the video frame X i is the video frame to be currently encoded, a first hidden state vector corresponding to the video frame X i in the feature map sequence can be obtained, where the first hidden state vector may refer to a feature vector corresponding to a previous video frame (i.e., the video frame X i-1) of the video frame X i. And inputting the position sub-feature map corresponding to the video frame X i and the first hidden state vector into a circulating network layer of the track recognition model, and encoding the position sub-feature map corresponding to the video frame X i to obtain a feature vector corresponding to the video frame X i. Determining the feature vector corresponding to the video frame X i as a second hidden state vector corresponding to the video frame X i+1 in the feature map sequence, inputting the position sub-feature map corresponding to the video frame X i+1 and the second hidden state vector into a loop network layer of a track recognition model, And encoding the position sub-feature map corresponding to the video frame X i+1 to obtain a feature vector corresponding to the video frame X i+1. When the video frame X i+1 is the last frame video frame of the at least two video frames, generating a coding matrix corresponding to the feature map sequence according to the feature vector corresponding to each video frame of the at least two video frames.
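A sketch of this frame-by-frame cyclic encoding, in which the feature vector of the current frame is carried forward as the hidden state for the next frame, might look as follows (a PyTorch GRU cell is assumed; all sizes are illustrative):

```python
# Illustrative sketch: cyclic encoding of the feature map sequence, passing each frame's feature
# vector forward as the hidden state used when encoding the next frame.
import torch
import torch.nn as nn

C, H, M = 32, 64, 10                          # input length C, hidden size H, M time steps (assumed)
cell = nn.GRUCell(input_size=C, hidden_size=H)

feature_map_sequence = torch.randn(M, 1, C)   # position sub-feature vectors ordered by the timing relation
hidden = torch.zeros(1, H)                    # hidden state vector for the first frame
feature_vectors = []
for step in feature_map_sequence:             # encode frame by frame, passing the hidden state forward
    hidden = cell(step, hidden)
    feature_vectors.append(hidden)

encoding_matrix = torch.stack(feature_vectors, dim=0)   # coding matrix corresponding to the feature map sequence
print(encoding_matrix.shape)                  # torch.Size([10, 1, 64])
```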
Fig. 9 is a schematic diagram of a coding matrix corresponding to a feature map sequence obtained by performing cyclic coding on the feature map sequence in a cyclic network layer, where fig. 9 shows that the cyclic network layer includes three parts, i.e., an input layer, a hidden layer, and an output layer, and the feature map sequence may be input into the input layer in the cyclic network layer, and the cyclic coding is performed on a position sub-feature map corresponding to each video frame in the feature map sequence according to a time line (i.e., a time sequence relationship between at least two video frames) in the feature map sequence and a weight matrix between each layer. As shown in fig. 9, the left half of fig. 9 shows a basic network structure corresponding to a cyclic network layer, where X refers to a value of an input layer, S refers to a value of a hidden layer, O refers to a value of an output layer, U refers to a weight matrix between the input layer and the hidden layer, V refers to a weight matrix between the hidden layer and the output layer, and W refers to a weight matrix between multiple hidden layers. As shown in the left part of fig. 9, when an input vector X is input into the cyclic network layer, a hidden value S may be obtained through a weight matrix U between the input layer and the hidden layer and a weight matrix W between multiple hidden layers, and then the hidden value S may be encoded through a weight matrix V between the hidden layer and the output layer, to obtain an output vector O.
Specifically, as shown in the right half of fig. 9, after the basic network structure corresponding to the cyclic network layer is unrolled over time, X i refers to an input vector, such as the position sub-feature map corresponding to video frame X i; X i-1 refers to another input vector, such as the position sub-feature map corresponding to video frame X i-1; and X i+1 refers to another input vector, such as the position sub-feature map corresponding to video frame X i+1. Video frame X i-1, video frame X i and video frame X i+1 are three mutually adjacent video frames among the at least two video frames: video frame X i-1 is the frame preceding video frame X i, and video frame X i is the frame preceding video frame X i+1, so there is a continuous relationship among the three. As shown in fig. 9, when encoding video frame X i, the position sub-feature map corresponding to video frame X i may be input and encoded through the weight matrix U between the input layer and the hidden layer to obtain a first hidden value. The hidden state vector S t-1 corresponding to the previous video frame X i-1 (i.e. the value S t-1 in the hidden layer) is acquired and encoded according to the weight matrix W between hidden layers to obtain a second hidden value, and the first hidden value and the second hidden value are summed to obtain the hidden state vector S t of the position sub-feature map corresponding to video frame X i in the hidden layer. According to the weight matrix V between the hidden layer and the output layer, the hidden state vector S t corresponding to video frame X i is encoded to obtain the feature vector O t corresponding to video frame X i. Similarly, the feature vector O t-1 corresponding to video frame X i-1 and the feature vector O t+1 corresponding to video frame X i+1 can be obtained through the same steps. When video frame X i+1 is the last frame among the at least two video frames, the coding matrix corresponding to the feature map sequence is generated according to the feature vectors corresponding to each of the at least two video frames.
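The recurrence of fig. 9 can also be written out directly; the following numpy sketch uses the weight matrices U, W and V named above, with tanh assumed as the activation and all dimensions chosen arbitrarily for illustration:

```python
# Numpy sketch of the fig. 9 recurrence (shapes assumed): s_t = tanh(U @ x_t + W @ s_{t-1}), o_t = V @ s_t.
import numpy as np

C, H, O_dim, M = 32, 64, 16, 10
rng = np.random.default_rng(0)
U = rng.standard_normal((H, C))      # weight matrix between input layer and hidden layer
W = rng.standard_normal((H, H))      # weight matrix between hidden layers across time steps
V = rng.standard_normal((O_dim, H))  # weight matrix between hidden layer and output layer

x_seq = rng.standard_normal((M, C))  # position sub-feature vectors for M time steps
s = np.zeros(H)                      # initial hidden state
outputs = []
for x_t in x_seq:
    s = np.tanh(U @ x_t + W @ s)     # sum of the first and second hidden values, then activation
    outputs.append(V @ s)            # feature vector o_t for the current video frame

encoding_matrix = np.stack(outputs)  # coding matrix, shape (M, O_dim)
print(encoding_matrix.shape)
```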
S207, classifying and identifying the predicted track features to obtain a track identification result corresponding to the target object, wherein the track identification result is used for providing verification basis for the service associated with the target object.
Specifically, after the computer device obtains the predicted track features of the target object in the video data, it can perform classification recognition on the predicted track features based on the linear classification layer in the track recognition model to obtain the track recognition result corresponding to the target object, where the track recognition result is used to provide a verification basis for the service associated with the target object.
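A minimal sketch of the linear classification layer mapping a predicted track feature to a track recognition result (the direction label set and feature size are assumptions for illustration):

```python
# Hypothetical sketch (PyTorch assumed): linear classification of the predicted track feature.
import torch
import torch.nn as nn

DIRECTIONS = ["turn_left", "turn_right", "move_up", "move_down"]   # assumed label set
classifier = nn.Linear(in_features=64, out_features=len(DIRECTIONS))

track_feature = torch.randn(1, 64)           # predicted track feature of the target object
logits = classifier(track_feature)
result = DIRECTIONS[logits.argmax(dim=1).item()]
print(result)                                 # track recognition result corresponding to the target object
```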
As shown in fig. 10, fig. 10 is a schematic diagram of track recognition provided by an embodiment of the present application. As shown in fig. 10, after the computer device obtains the video data 1101a containing the target object, the obtained video data may be input into the semantic segmentation model 1102b, semantic segmentation is performed on at least two video frames in the video data to obtain the mask images corresponding to the target object in the at least two video frames, and the coordinate sequences corresponding to the target object in the at least two video frames are determined according to the mask images. After obtaining the coordinate sequences corresponding to the target object in the at least two video frames, the computer device may input them into the track recognition model 1103c and perform feature extraction on the coordinate sequences corresponding to the at least two video frames to obtain the track recognition result 1104d of the target object in the video data. The data identification method shown in fig. 10 can be applied to various application scenarios, has high applicability, and can accurately identify the motion track of the target object in any video data. The specific processing manner of the semantic segmentation model 1102b may refer to the content of step S201, and the specific processing manner of the track recognition model 1103c may refer to the content of steps S202-S207, which is not repeated here.
As shown in fig. 11, fig. 11 is a schematic structural diagram of a track recognition model according to an embodiment of the present application. As shown in fig. 11, the track recognition model includes a convolutional network layer 1202b, a cyclic network layer 1203c, and a linear classification layer 1204d. The computer device may input the coordinate sequence of the target object in each video frame into the track recognition model 1201a; the track recognition model may perform vector conversion on the coordinate sequence of the target object in each video frame, input the converted coordinate sequence into the convolutional network layer 1202b, and perform convolution processing on the coordinate sequence of the target object in each video frame to obtain the position sub-feature map corresponding to each video frame. The position sub-feature maps corresponding to each video frame are then input into the cyclic network layer 1203c, combined into a feature map sequence according to the time sequence relationship, and time sequence feature extraction is performed on the feature map sequence to obtain the predicted track features of the target object in the video data. The predicted track features of the target object in the video data are input into the linear classification layer 1204d, classification recognition is performed on the predicted track features to obtain the track recognition result corresponding to the target object, and the track recognition result corresponding to the target object is output 1205e. The specific processing manners corresponding to the convolutional network layer 1202b, the cyclic network layer 1203c, and the linear classification layer 1204d may refer to the contents of S202-S207, which is not repeated here.
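Putting the pieces together, an end-to-end sketch of a model with the fig. 11 structure (convolutional network layer, cyclic network layer, linear classification layer) could look as follows; all layer sizes, the GRU choice and the four-class output are assumptions for illustration only.

```python
# End-to-end sketch of a track recognition model with the fig. 11 structure (PyTorch assumed;
# all sizes illustrative). Input: coordinate matrix of K coordinates over I video frames.
import torch
import torch.nn as nn

class TrackRecognitionModel(nn.Module):
    def __init__(self, k_coords=8, channels=32, pooled_len=10, hidden=64, num_classes=4):
        super().__init__()
        self.conv_layer = nn.Sequential(                        # convolutional network layer
            nn.Conv1d(k_coords, channels, kernel_size=5, padding=2),
            nn.GroupNorm(4, channels),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(pooled_len),                   # uniform C x M position feature map
        )
        self.recurrent_layer = nn.GRU(channels, hidden, batch_first=True)   # cyclic network layer
        self.classifier = nn.Linear(hidden, num_classes)        # linear classification layer

    def forward(self, coords):                                  # coords: (batch, K, I)
        feature_map = self.conv_layer(coords)                   # (batch, C, M)
        sequence = feature_map.permute(0, 2, 1)                 # feature map sequence: (batch, M, C)
        _, last_hidden = self.recurrent_layer(sequence)         # predicted track feature
        return self.classifier(last_hidden[-1])                 # track recognition logits

model = TrackRecognitionModel()
logits = model(torch.randn(2, 8, 16))                           # 2 clips, 8 coordinates, 16 frames
print(logits.shape)                                             # torch.Size([2, 4])
```

The adaptive average pooling inside the convolutional layer is what allows coordinate matrices from clips of different frame rates and durations to share one model.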
In the embodiment of the application, video data containing a target object is acquired, semantic segmentation is performed on at least two video frames in the video data to obtain the coordinate sequence of the target object in each video frame, and the coordinate sequence of the target object in each video frame is used to perform track recognition on the target object, so that the amount of calculation can be reduced. Convolution processing is performed on the coordinate sequence of the target object in each video frame to obtain the position sub-feature map corresponding to each video frame. By performing convolution processing on the coordinate sequence of the target object in each video frame, abnormal coordinate sequences can be eliminated, and the coordinate sequence of the target object corresponding to each video frame is converted into a unified format, namely into a position sub-feature map; therefore, track recognition can be performed on any type of video data, which enlarges the application range and improves the applicability of track recognition. Meanwhile, the embodiment of the application can adopt one-dimensional convolution to perform convolution processing on the coordinate sequence of the target object in each video frame, which has a good smoothing effect on abnormal coordinate sequences and weakens their influence, thereby improving the robustness of the model and further improving the accuracy of feature extraction. In addition, the embodiment of the application can also adopt group normalization to perform staged normalization processing, which can better capture the change information of the target object across the at least two video frames and is not greatly affected by the size of the video data or the change of the target object in the video data. The time sequence relationship between the at least two video frames is acquired, the position sub-feature maps corresponding to each video frame are combined into a feature map sequence according to the time sequence relationship, and time sequence feature extraction is performed on the feature map sequence to obtain the predicted track feature of the target object in the video data. Since feature extraction is performed according to the time sequence relationship between the at least two video frames, the accuracy of feature extraction can be improved. The predicted track feature is then classified and recognized to obtain the track recognition result corresponding to the target object, and the track recognition result is used to provide a verification basis for the service associated with the target object. By the method and the device, the accuracy of track recognition of the target object can be improved, and the applicability of track recognition is improved.
As shown in fig. 12, fig. 12 is a schematic flow chart of a data identification method provided in an embodiment of the present application. The method may be performed by a computer device, where the computer device may be a server (e.g. the server 10 in fig. 1 a), a target user terminal (e.g. any target user terminal in the target user terminal cluster in fig. 1 a), or a system formed by the server and the target user terminal, which is not limited in this aspect of the present application. As shown in fig. 12, the data recognition method includes steps S301-S306.
S301, acquiring a first initial track recognition model, first sample video data corresponding to a sample article, and a first label classification result corresponding to the sample article in the first sample video data.
Specifically, the computer device may train the first initial trajectory recognition model to obtain a trajectory recognition model for acquiring the trajectory recognition result corresponding to the target object in the video data. The computer device may acquire a first initial track recognition model, first sample video data corresponding to the sample article, and a first tag classification result corresponding to the sample article in the first sample video data, where the first tag classification result may be obtained by manually judging the motion track of the sample article in the video data, and the first sample video data is obtained by shooting the motion track of the sample article. The first initial trajectory recognition model may include an initial convolutional network layer, an initial cyclic network layer, and an initial linear classification layer.
S302, a first initial track recognition model is adopted, and convolution processing is carried out on coordinate sequences corresponding to at least two first sample video frames in the first sample video data, so that a position sub-feature map corresponding to each first sample video frame is obtained.
S303, acquiring a sample time sequence relation between at least two first sample video frames, combining the position sub-feature images corresponding to each first sample video frame into a sample feature image sequence according to the sample time sequence relation, and extracting time sequence features of the sample feature image sequence to obtain first predicted sample track features of the sample article in the first sample video data.
S304, classifying and identifying the first predicted sample track features to obtain a first predicted classification result corresponding to the sample articles.
Specifically, the computer device may use a convolutional network layer in the first initial track recognition model to perform convolutional processing on coordinate sequences corresponding to at least two first sample video frames in the first sample video data, so as to obtain a position sub-feature map corresponding to each first sample video frame respectively. And acquiring a sample time sequence relation between at least two first sample video frames, combining the position sub-feature images corresponding to each first sample video frame into a sample feature image sequence according to the sample time sequence relation, extracting time sequence features of the sample feature image sequence to obtain first predicted sample track features of sample articles in the first sample video data, and classifying and identifying the first predicted sample track features to obtain a first prediction classification result corresponding to the sample articles.
The content of steps S302-S304 may refer to the specific content described in fig. 2 or fig. 4, and the embodiments of the present application are not described herein again.
S305, determining a first loss function corresponding to the first initial track recognition model according to the first label classification result and the first prediction classification result corresponding to the sample article.
S306, performing iterative training on the first initial track recognition model based on the first loss function, and determining the trained first initial track recognition model as a track recognition model, wherein the track recognition model is used for acquiring a track recognition result corresponding to the target object in the video data.
Specifically, the computer device may determine a first loss function corresponding to the first initial track recognition model according to the difference between the first label classification result and the first prediction classification result corresponding to the sample article, and adjust network parameters in the first initial track recognition model based on the first loss function, for example, network parameters in the initial convolutional network layer or in the initial cyclic network layer. Training then continues with the parameter-adjusted first initial track recognition model; when the loss value between the first label classification result and the first prediction classification result reaches the convergence condition, the first initial track recognition model reaching the convergence condition is determined as the track recognition model, which is used for acquiring the track recognition result corresponding to the target object in the video data. The first loss function may refer to a cross-entropy loss function (Cross Entropy Error Function, used to measure the difference between two objects), which measures the difference between the first label classification result (i.e., the real result) and the first prediction classification result (the model prediction result); the first initial trajectory recognition model is iteratively trained according to this difference to obtain the trajectory recognition model.
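For illustration, one iteration of the parameter adjustment described in steps S305-S306 can be sketched as below; the optimizer choice and learning rate are assumptions, and the sketch reuses the hypothetical TrackRecognitionModel defined earlier.

import torch
import torch.nn as nn

model = TrackRecognitionModel()                        # hypothetical sketch defined earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
cross_entropy = nn.CrossEntropyLoss()                  # plays the role of the first loss function

def train_step(sample_coords, first_label_classification):
    optimizer.zero_grad()
    first_prediction = model(sample_coords)            # first prediction classification result
    first_loss = cross_entropy(first_prediction, first_label_classification)
    first_loss.backward()                              # back-propagate the measured difference
    optimizer.step()                                   # adjust the network parameters
    return first_loss.item()

Repeating this step until the loss value satisfies the convergence condition would yield the trained track recognition model described above.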
Optionally, the computer device may further perform model training through the twin network structure to obtain a track recognition model for obtaining a track recognition result corresponding to the target object in the video data. In particular, the computer device may output a second predicted sample trajectory characteristic of the sample article in the second sample video data via a second initial trajectory recognition model, the model parameters in the second initial trajectory recognition model being the same as the model parameters in the first initial trajectory recognition model. And classifying and identifying the second predicted sample track features to obtain a second predicted classification result corresponding to the sample article in the second sample video data, and generating a second loss function according to the second predicted classification result and a second label classification result corresponding to the second sample video data. And generating a third loss function according to the first prediction classification result and the second prediction classification result. Generating a total loss function based on the first loss function, the second loss function and the third loss function, performing iterative training on the first initial trajectory recognition model according to the total loss function, and determining the trained first initial trajectory recognition model as a trajectory recognition model.
Specifically, the twin network structure refers to inputting two different sample data into two identical networks, performing feature extraction on the two sample data respectively, and comparing the difference between them. If the two sample data are of the same category, the distance between them is shortened through the loss function; if the two sample data are of different categories, the distance between them is lengthened through the loss function, that is, the twin network structure makes samples of the same category closer and samples of different categories farther apart. In the embodiment of the application, two initial track recognition models, namely a first initial track recognition model and a second initial track recognition model, can be set up through the twin network structure, and the network parameters of the two models are identical, that is, the first initial track recognition model and the second initial track recognition model are identical models. During training, two different sample video data, namely first sample video data and second sample video data, can be randomly extracted from a sample video database corresponding to the sample article, and the first label classification result corresponding to the first sample video data and the second label classification result corresponding to the second sample video data are obtained.
Specifically, the computer device may input the first sample video data into the first initial track recognition model and recognize the motion track of the sample article in the first sample video data to obtain a first prediction classification result, and input the second sample video data into the second initial track recognition model and recognize the motion track of the sample article in the second sample video data to obtain a second prediction classification result. A second label classification result corresponding to the second sample video data is obtained, and a second loss function is generated according to the second prediction classification result and the second label classification result. The second loss function may also refer to a cross-entropy loss function (Cross Entropy Error Function, used to measure the difference between two objects), which measures the difference between the second label classification result (i.e., the real result) and the second prediction classification result (the model prediction result).
In particular, the computer device may generate a third loss function based on the first prediction classification result and the second prediction classification result. The third loss function may refer to a contrastive loss function between the first initial trajectory recognition model and the second initial trajectory recognition model, which is used to pull samples of the same category closer together and push samples of different categories farther apart. Whether the first sample video data and the second sample video data belong to the same category is determined according to the first tag classification result and the second tag classification result: if the first tag classification result is the same as the second tag classification result, the first sample video data and the second sample video data are of the same category; if the first tag classification result is different from the second tag classification result, the first sample video data and the second sample video data are of different categories. If the first sample video data and the second sample video data are of the same category, the distance between the first prediction classification result and the second prediction classification result can be shortened by the third loss function (i.e. the contrastive loss function).
For example, suppose the first predicted track characteristic value corresponding to the first sample video data obtained through the first initial track recognition model is 0.5, and the second predicted track characteristic value corresponding to the second sample video data obtained through the second initial track recognition model is 0.3. If the first sample video data and the second sample video data are of the same category, the network parameters in the first initial track recognition model and the second initial track recognition model can be adjusted to reduce the difference between the first predicted track characteristic value and the second predicted track characteristic value. If, after adjustment, the first predicted track characteristic value corresponding to the first sample video data is 0.4 and the second predicted track characteristic value corresponding to the second sample video data is 0.4, the distance between the predicted track characteristic values corresponding to the two sample video data has been shortened, so that the prediction classification results corresponding to the two sample video data are the same. If the first sample video data and the second sample video data are of different categories, the network parameters in the first initial track recognition model and the second initial track recognition model can be adjusted to increase the difference between the first predicted track characteristic value and the second predicted track characteristic value, that is, the distance between the predicted track characteristic values corresponding to the two sample video data is pulled apart, so that the prediction classification results corresponding to the two sample video data are different. If, after adjustment, the first predicted track characteristic value corresponding to the first sample video data is 0.2 and the second predicted track characteristic value corresponding to the second sample video data is 0.7, the prediction classification results respectively corresponding to the two sample video data are different. Therefore, the track recognition model obtained through such training can better classify the motion track of the target object in the video data.
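A contrastive loss with the pull-together / push-apart behaviour described above can be written as follows; the margin value and the use of scalar predicted track characteristic values mirror the example and are assumptions, not values fixed by this embodiment.

import torch

def contrastive_loss(feature_a, feature_b, same_category, margin=1.0):
    distance = torch.norm(feature_a - feature_b, dim=-1)
    pull = same_category * distance.pow(2)                                        # same category: shorten the distance
    push = (1.0 - same_category) * torch.clamp(margin - distance, min=0).pow(2)   # different category: lengthen it
    return (pull + push).mean()

# Same-category pair with characteristic values 0.5 and 0.3: the loss is non-zero,
# so training pulls the two predicted track characteristic values together.
print(contrastive_loss(torch.tensor([0.5]), torch.tensor([0.3]), torch.tensor(1.0)))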
Specifically, after obtaining the first loss function corresponding to the first initial trajectory recognition model, the second loss function corresponding to the second initial trajectory recognition model, and the third loss function corresponding to the first prediction classification result and the second prediction classification result, the computer device generates a total loss function based on the first loss function, the second loss function and the third loss function, performs iterative training on the first initial trajectory recognition model according to the total loss function, and determines the trained first initial trajectory recognition model as the trajectory recognition model. The total loss function may be as shown in the following formula (2).
Loss_total = α · Loss_CE + β · Loss_contrastive    (2)
In formula (2), Loss_total refers to the total loss function. Loss_CE refers to the cross-entropy loss (CrossEntropy Loss), which measures the current classification losses of the first initial track recognition model and the second initial track recognition model, i.e. it corresponds to the first loss function of the first initial track recognition model and the second loss function of the second initial track recognition model; α is the weight of the cross-entropy loss Loss_CE in the total loss Loss_total. Loss_contrastive is the contrastive loss function between the first initial track recognition model and the second initial track recognition model, which is used to shorten the distance between samples of the same category and lengthen the distance between samples of different categories; β is the weight of the contrastive loss Loss_contrastive in the total loss Loss_total, and may be set to 0.01 or other values. The first initial track recognition model and the second initial track recognition model are iteratively trained based on the total loss function. Because the two models form a twin network structure, the parameter changes in the two models are kept consistent when the parameters are adjusted; when the convergence condition is met, the first initial track recognition model or the second initial track recognition model (which are identical at this point) meeting the convergence condition is determined as the track recognition model, which is used for acquiring the track recognition result of the target object in the video data.
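Combining the pieces, formula (2) can be sketched as one twin-network training step. The sketch reuses the hypothetical TrackRecognitionModel and contrastive_loss defined above, uses a single shared model for the two branches (the text states their parameters are identical and change consistently), treats the classification logits as a stand-in for the predicted track features, and assumes α = 1.0; β = 0.01 follows the example value in the text.

import torch
import torch.nn as nn

alpha, beta = 1.0, 0.01                                # alpha is an assumed weight
model = TrackRecognitionModel()                        # shared parameters: both twin branches
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
cross_entropy = nn.CrossEntropyLoss()

def twin_train_step(coords_1, labels_1, coords_2, labels_2):
    optimizer.zero_grad()
    logits_1, logits_2 = model(coords_1), model(coords_2)       # first / second prediction results
    loss_ce = cross_entropy(logits_1, labels_1) + cross_entropy(logits_2, labels_2)  # first + second loss
    same_category = (labels_1 == labels_2).float()
    loss_ctr = contrastive_loss(logits_1, logits_2, same_category)                   # third loss
    loss_total = alpha * loss_ce + beta * loss_ctr                                   # formula (2)
    loss_total.backward()
    optimizer.step()
    return loss_total.item()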
Fig. 13 is a schematic diagram of a track recognition method provided in an embodiment of the present application. As shown in fig. 13, the track recognition method may be applied to a scenario in which the motion track of a vehicle is recognized. After a computer device obtains video data 14a containing a target vehicle, the video data may be framed to obtain H video frames, where H is a positive integer (e.g. frames 1, 2, 3, …, H), and semantic segmentation is performed on each of the H video frames to obtain the coordinate sequence 14b of the target vehicle in each video frame. The coordinate sequence of the target vehicle in each video frame is subjected to convolution processing through the convolutional neural network 14c to obtain a position sub-feature map corresponding to each video frame, the position sub-feature maps corresponding to each video frame are input into the cyclic neural network 14d for time sequence feature extraction to obtain the predicted track feature of the target vehicle in the video data, and the predicted track feature is subjected to linear classification 14e to obtain the track recognition result 14f of the target vehicle in the video data.
In the embodiment of the application, a first initial track recognition model is adopted to carry out convolution processing on coordinate sequences corresponding to at least two first sample video frames in the first sample video data by acquiring the first initial track recognition model, the first sample video data corresponding to the sample article and a first label classification result corresponding to the sample article in the first sample video data, so as to obtain a position sub-feature map corresponding to each first sample video frame respectively. And acquiring a sample time sequence relation between at least two first sample video frames, combining the position sub-feature images corresponding to each first sample video frame into a sample feature image sequence according to the sample time sequence relation, and extracting time sequence features of the sample feature image sequence to obtain first predicted sample track features of sample objects in the first sample video data. Classifying and identifying the first predicted sample track features to obtain a first predicted classification result corresponding to the sample articles, determining a first loss function corresponding to a first initial track identification model according to a first label classification result corresponding to the sample articles and the first predicted classification result, performing iterative training on the first initial track identification model based on the first loss function, and determining the trained first initial track identification model as a track identification model; the track recognition model is used for acquiring a track recognition result corresponding to the target object in the video data. The method and the device can improve the accuracy of track recognition by the track recognition model, improve the applicability of the track recognition model and enhance the robustness of the track recognition model.
Referring to fig. 14, fig. 14 is a schematic structural diagram of a data identification apparatus 1 according to an embodiment of the present application. The data identification apparatus 1 may be a computer program (comprising program code) running in a computer device, for example an application software; the data identification apparatus 1 may be used to perform the corresponding steps in the data identification method provided by the embodiment of the present application. As shown in fig. 14, the data identification apparatus 1 may include: the semantic segmentation module 11, the first convolution processing module 12, the first feature extraction module 13, the first classification recognition module 14, the verification module 15 and the first output module 16.
The semantic segmentation module 11 is configured to obtain video data including a target object, and perform semantic segmentation on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame;
The first convolution processing module 12 is configured to perform convolution processing on the coordinate sequence of the target object in each video frame, so as to obtain a position sub-feature map corresponding to each video frame;
The first feature extraction module 13 is configured to obtain a time sequence relationship between at least two video frames, combine the position sub-feature graphs corresponding to each video frame into a feature graph sequence according to the time sequence relationship, and perform time sequence feature extraction on the feature graph sequence to obtain a predicted track feature of the target object in the video data;
the first classification and identification module 14 is configured to perform classification and identification on the predicted track feature, so as to obtain a track identification result corresponding to the target object, where the track identification result is used to provide a verification basis for a service associated with the target object.
Wherein the semantic segmentation module 11 comprises:
A starting unit 1101 for starting the image capturing component in response to an upload operation in the track verification page;
The first determining unit 1102 is configured to capture a target object based on the image capturing component, and determine a motion trajectory of the captured target object as video data including the target object.
Wherein the semantic segmentation module 11 comprises:
A verification unit 1103, configured to receive a track verification request for a target object sent by the target object, and perform authority verification on the target object according to the track verification request;
A first obtaining unit 1104, configured to obtain, if the target object has authority to perform track verification on the target object, a video stream corresponding to the target object from the track verification request;
a second determining unit 1105, configured to decode the video stream, and determine the decoded video stream as video data including the target object.
Wherein, the at least two video frames comprise video frames T i, i is smaller than or equal to the number S of the at least two video frames, and both i and S are positive integers;
the semantic segmentation module 11 further comprises:
The random dividing unit 1106 is configured to randomly divide an image corresponding to the video frame T i to obtain N candidate areas; n is a positive integer;
A pixel extraction unit 1107, configured to perform pixel extraction on the N candidate regions, to obtain region pixel points corresponding to the N candidate regions respectively;
The segmentation processing unit 1108 is configured to perform pixel classification on region pixel points respectively included in the N candidate regions according to a pixel value interval associated with the target object, to obtain target pixel points in each candidate region, and perform segmentation processing on the video frame T i based on the target pixel points, to obtain a mask image corresponding to the target object; the pixel value corresponding to the target pixel point belongs to a pixel value interval;
A third determining unit 1109 is configured to determine a coordinate sequence of the target object in the video frame T i based on the mask image.
The third determining unit 1109 is specifically configured to:
Determining image edge points of the target object in the video frame T i according to the mask image;
performing straight line fitting on the image edge points to determine M fitting straight lines; m is a positive integer;
And acquiring the intersection point between any two adjacent fitting straight lines in the M fitting straight lines, acquiring position coordinate information corresponding to the intersection point, and generating a coordinate sequence of the target object in the video frame T i according to the position coordinate information corresponding to the intersection point.
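As an illustration of the segmentation and line-fitting units above, the following NumPy sketch derives a mask from a pixel value interval, fits a straight line through edge points, and computes the intersection of two fitted lines as one coordinate of the target object. The interval bounds, the least-squares fit (which assumes non-vertical lines) and the toy points are assumptions made only for the example.

import numpy as np

def target_mask(frame, low=100, high=200):
    # Pixel classification: 1 where the pixel value lies in the interval associated with the target object.
    return ((frame >= low) & (frame <= high)).astype(np.uint8)

def fit_line(edge_points):
    # Least-squares fit of a line a*x - y + c = 0 through the image edge points.
    slope, intercept = np.polyfit(edge_points[:, 0], edge_points[:, 1], 1)
    return np.array([slope, -1.0, intercept])

def intersection(line_1, line_2):
    # Intersection of two adjacent fitted lines, used as one position coordinate.
    A = np.array([line_1[:2], line_2[:2]])
    b = -np.array([line_1[2], line_2[2]])
    return np.linalg.solve(A, b)

frame = np.random.randint(0, 256, (64, 64))
mask = target_mask(frame)                               # mask image of the target object
corner = intersection(fit_line(np.array([[0, 0.0], [1, 1.0], [2, 2.1]])),
                      fit_line(np.array([[0, 4.0], [1, 3.0], [2, 2.0]])))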
Wherein the first convolution processing module 12 comprises:
The convolution operation unit 1201 is configured to integrate coordinate sequences of the target object in at least two video frames to obtain a coordinate matrix corresponding to the target object, input the coordinate matrix into a convolution network layer in the track recognition model, and perform convolution operation on the coordinate matrix to obtain position convolution information corresponding to the coordinate matrix;
a normalization processing unit 1202, configured to perform normalization processing on the position convolution information, to obtain normalized position convolution information;
The first nonlinear combination unit 1203 is configured to perform nonlinear combination on the normalized position convolution information based on an activation function in the convolution network layer, so as to generate a position feature map corresponding to a coordinate sequence of the target object in at least two video frames;
A fourth determining unit 1204, configured to determine a location sub-feature map corresponding to each video frame, based on the location feature maps.
The first nonlinear combination unit 1203 is specifically configured to:
Based on an activation function in a convolution network layer, nonlinear combination is carried out on the position convolution information after normalization processing, and candidate position feature diagrams corresponding to coordinate sequences in at least two video frames are obtained;
acquiring the number of output channels and output size information in a convolution network layer, and adjusting the candidate position feature map based on the number of output channels and the output size information to obtain a position feature map corresponding to a coordinate sequence of a target object in at least two video frames; the size information corresponding to the position feature map is the product of the output size information and the number of output channels.
The normalization processing unit 1202 is specifically configured to:
Based on a feature normalization network in the convolutional network layer, carrying out grouping processing on the position convolution information to obtain Q position convolution groups; Q is a positive integer;
And acquiring the mean value and the variance corresponding to the Q position convolution groups respectively, and carrying out normalization processing on the position convolution information in each position convolution group based on the mean value and the variance corresponding to the Q position convolution groups respectively, to obtain the position convolution information after normalization processing.
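For illustration, the grouped normalization performed by the normalization processing unit can be sketched as below; the channel count and the number of groups Q are assumptions, and torch.nn.GroupNorm provides the same grouping with learnable scale and shift.

import torch

def group_normalize(position_conv, num_groups=4, eps=1e-5):
    # position_conv: [batch, channels, length] position convolution information
    b, c, l = position_conv.shape
    groups = position_conv.reshape(b, num_groups, c // num_groups, l)   # Q position convolution groups
    mean = groups.mean(dim=(2, 3), keepdim=True)                        # mean of each group
    var = groups.var(dim=(2, 3), keepdim=True, unbiased=False)          # variance of each group
    normalized = (groups - mean) / torch.sqrt(var + eps)                # normalize within each group
    return normalized.reshape(b, c, l)

normalized_position_conv = group_normalize(torch.randn(2, 64, 8))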
Wherein the first feature extraction module 13 comprises:
A second obtaining unit 1301, configured to obtain a timing relationship between at least two video frames, and combine the position sub-feature graphs corresponding to each video frame respectively into a feature graph sequence according to the timing relationship;
the cyclic coding unit 1302 is configured to perform cyclic coding on the feature map sequence based on a cyclic network layer of the track recognition model, so as to obtain a coding matrix corresponding to the feature map sequence;
The second nonlinear combination unit 1303 is configured to perform nonlinear combination on the encoding matrix based on the activation function in the cyclic network layer, so as to obtain a predicted track feature of the target object in the video data.
The video frames comprise a video frame X i and a video frame X i+1, the video frame X i and the video frame X i+1 have an adjacent relation, i+1 is smaller than or equal to the number S of the at least two video frames, and both i and S are positive integers;
the cyclic encoding unit 1302 specifically functions to:
Acquiring a first hidden state vector corresponding to a video frame X i in a feature map sequence, inputting a position sub-feature map corresponding to a video frame X i and the first hidden state vector into a cyclic network layer of a track identification model, and encoding the position sub-feature map corresponding to a video frame X i to obtain a feature vector corresponding to a video frame X i;
Determining a second hidden state vector corresponding to a video frame X i+1 in the feature map sequence according to the feature vector corresponding to the video frame X i, inputting a position sub-feature map corresponding to the video frame X i+1 and the second hidden state vector into a circulating network layer of a track recognition model, and encoding the position sub-feature map corresponding to the video frame X i+1 to obtain a feature vector corresponding to the video frame X i+1;
When the video frame X i+1 is the last frame video frame of the at least two video frames, generating a coding matrix corresponding to the feature map sequence according to the feature vector corresponding to each video frame of the at least two video frames.
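The frame-by-frame encoding performed by the cyclic encoding unit can be illustrated with an explicit hidden-state loop; the use of a GRU cell and the dimensions are assumptions made for the sketch.

import torch
import torch.nn as nn

feature_dim, hidden_dim = 64, 128
cell = nn.GRUCell(feature_dim, hidden_dim)

def encode_feature_map_sequence(sub_feature_maps):      # [frames, batch, feature_dim], in time order
    batch = sub_feature_maps.shape[1]
    hidden = torch.zeros(batch, hidden_dim)             # first hidden state vector
    feature_vectors = []
    for frame_features in sub_feature_maps:             # position sub-feature map of frame X_i
        hidden = cell(frame_features, hidden)           # feature vector of X_i, hidden state for X_{i+1}
        feature_vectors.append(hidden)
    return torch.stack(feature_vectors)                 # coding matrix of the feature map sequence

coding_matrix = encode_feature_map_sequence(torch.randn(30, 8, feature_dim))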
Wherein, the data identification device 1 further comprises:
The verification module 15 is configured to determine that the video data corresponding to the target object is valid video data if the track recognition result corresponding to the target object matches the verification track direction, and verify the validity of the target object according to the video data; the verification track direction refers to the direction of verification required by the business associated with the target object;
The first output module 16 is configured to determine that the video data corresponding to the target object is invalid video data if the track recognition result corresponding to the target object does not match the verification track direction, and output prompt information, where the prompt information is used for prompting that the target object needs to be shot again.
According to one embodiment of the present application, the steps involved in the data identification method shown in fig. 2 may be performed by the respective modules in the data identification apparatus 1 shown in fig. 14. For example, step S101 shown in fig. 2 may be performed by the semantic segmentation module 11 in fig. 14, step S102 shown in fig. 2 may be performed by the first convolution processing module 12 in fig. 14, step S103 shown in fig. 2 may be performed by the first feature extraction module 13 in fig. 14, step S104 shown in fig. 2 may be performed by the first classification recognition module 14 in fig. 14, and so on.
According to an embodiment of the present application, the modules in the data identification apparatus 1 shown in fig. 14 may be separately or entirely combined into one or several units, or some of the units may be further split into a plurality of sub-units with smaller functions, which can achieve the same operations without affecting the technical effects of the embodiments of the present application. The above modules are divided based on logical functions; in practical applications, the function of one module may be implemented by a plurality of units, or the functions of a plurality of modules may be implemented by one unit. In other embodiments of the application, the data identification apparatus 1 may also comprise other units; in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented by the cooperation of a plurality of units.
In the embodiment of the application, the coordinate sequence of the target object in each video frame is obtained by acquiring the video data containing the target object and carrying out semantic segmentation on at least two video frames in the video data, and the coordinate sequence of the target object in each video frame is adopted to carry out track identification on the target object, so that the calculation amount can be reduced. And carrying out convolution processing on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame. The coordinate sequence of the target object in each video frame is subjected to convolution processing, so that the abnormal coordinate sequence can be eliminated, the coordinate sequence of the target object corresponding to each video frame is converted into a unified format, namely into a position sub-feature map, and therefore, the track identification can be carried out on any type of video data, and the application range is enlarged. And acquiring a time sequence relation between at least two video frames, combining the position sub-feature images corresponding to each video frame into a feature image sequence according to the time sequence relation, and extracting time sequence features of the feature image sequence to obtain the predicted track features of the target object in the video data. Therefore, the feature extraction is performed according to the time sequence relation between at least two video frames, so that the accuracy of the feature extraction can be improved. And classifying and identifying the predicted track features to obtain a track identification result corresponding to the target object, wherein the track identification result is used for providing verification basis for the service associated with the target object. By the method and the device, the accuracy of track identification of the target object can be improved, and the applicability of track identification is improved.
Referring to fig. 15, fig. 15 is a schematic structural diagram of a data identification apparatus 2 according to an embodiment of the present application. The data identification apparatus 2 may be a computer program (comprising program code) running in a computer device, for example an application software; the data identification apparatus 2 may be used to perform the corresponding steps in the data identification method provided by the embodiment of the present application. As shown in fig. 15, the data identification apparatus 2 may include: the acquisition module 21, the second convolution processing module 22, the second feature extraction module 23, the second classification recognition module 24, the first determination module 25, the second determination module 26, the second output module 27, the third classification and identification module 28, the first generation module 29 and the second generation module 30.
An obtaining module 21, configured to obtain a first initial trajectory recognition model, first sample video data corresponding to a sample article, and a first tag classification result corresponding to the sample article in the first sample video data;
The second convolution processing module 22 is configured to perform convolution processing on coordinate sequences corresponding to at least two first sample video frames in the first sample video data by using the first initial track recognition model, so as to obtain a position sub-feature map corresponding to each first sample video frame respectively;
The second feature extraction module 23 is configured to obtain a sample timing relationship between at least two first sample video frames, combine the position sub-feature graphs corresponding to each first sample video frame into a sample feature graph sequence according to the sample timing relationship, and perform timing feature extraction on the sample feature graph sequence to obtain a first predicted sample track feature of the sample article in the first sample video data;
the second classification recognition module 24 is configured to perform classification recognition on the first predicted sample track feature to obtain a first predicted classification result corresponding to the sample article;
a first determining module 25, configured to determine a first loss function corresponding to the first initial trajectory identification model according to a first label classification result and a first prediction classification result corresponding to the sample article;
a second determining module 26, configured to iteratively train the first initial trajectory recognition model based on the first loss function, and determine the trained first initial trajectory recognition model as a trajectory recognition model; the track recognition model is used for acquiring a track recognition result corresponding to the target object in the video data.
Wherein the data recognition device 2 further comprises:
A second output module 27 for outputting a second predicted sample trajectory characteristic of the sample article in the second sample video data via a second initial trajectory identification model; the model parameters in the second initial trajectory identification model are the same as the model parameters in the first initial trajectory identification model;
A third classification and identification module 28, configured to perform classification and identification on the second predicted sample track feature, so as to obtain a second predicted classification result corresponding to the sample article in the second sample video data;
A first generating module 29, configured to generate a second loss function according to the second prediction classification result and a second tag classification result corresponding to the second sample video data;
a second generating module 30, configured to generate a third loss function according to the first prediction classification result and the second prediction classification result;
the second determination module 26 includes:
The fifth determining unit 2601 is configured to generate a total loss function based on the first loss function, the second loss function, and the third loss function, perform iterative training on the first initial trajectory recognition model according to the total loss function, and determine the trained first initial trajectory recognition model as the trajectory recognition model.
According to one embodiment of the present application, the steps involved in the data recognition method shown in fig. 12 may be performed by the respective modules in the data recognition apparatus 2 shown in fig. 15. For example, step S301 shown in fig. 12 may be performed by the acquisition module 21 in fig. 15, step S302 shown in fig. 12 may be performed by the second convolution processing module 22 in fig. 15, step S303 shown in fig. 12 may be performed by the second feature extraction module 23 in fig. 15, step S304 shown in fig. 12 may be performed by the second classification recognition module 24 in fig. 15, step S305 shown in fig. 12 may be performed by the first determination module 25 in fig. 15, step S306 shown in fig. 12 may be performed by the second determination module 26 in fig. 15, and so on.
According to an embodiment of the present application, the modules in the data recognition apparatus 2 shown in fig. 15 may be separately or entirely combined into one or several units, or some of the units may be further split into a plurality of sub-units with smaller functions, which can achieve the same operations without affecting the technical effects of the embodiments of the present application. The above modules are divided based on logical functions; in practical applications, the function of one module may be implemented by a plurality of units, or the functions of a plurality of modules may be implemented by one unit. In other embodiments of the application, the data recognition apparatus 2 may also comprise other units; in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented by the cooperation of a plurality of units.
In the embodiment of the application, a first initial track recognition model is adopted to carry out convolution processing on coordinate sequences corresponding to at least two first sample video frames in the first sample video data by acquiring the first initial track recognition model, the first sample video data corresponding to the sample article and a first label classification result corresponding to the sample article in the first sample video data, so as to obtain a position sub-feature map corresponding to each first sample video frame respectively. And acquiring a sample time sequence relation between at least two first sample video frames, combining the position sub-feature images corresponding to each first sample video frame into a feature image sequence according to the sample time sequence relation, and extracting time sequence features of the feature image sequence to obtain first predicted sample track features of sample objects in the first sample video data. Classifying and identifying the first predicted sample track features to obtain a first predicted classification result corresponding to the sample articles, determining a first loss function corresponding to a first initial track identification model according to a first label classification result corresponding to the sample articles and the first predicted classification result, performing iterative training on the first initial track identification model based on the first loss function, and determining the trained first initial track identification model as a track identification model; the track recognition model is used for acquiring a track recognition result corresponding to the target object in the video data. The method and the device can improve the accuracy of track recognition by the track recognition model, improve the applicability of the track recognition model and enhance the robustness of the track recognition model.
Referring to fig. 16, fig. 16 is a schematic structural diagram of a computer device according to an embodiment of the application. As shown in fig. 16, the above-mentioned computer device 1000 may include: processor 1001, network interface 1004, and memory 1005, and in addition, the above-described computer device 1000 may further include: a target user interface 1003, and at least one communication bus 1002. Wherein the communication bus 1002 is used to enable connected communication between these components. The target user interface 1003 may include a Display (Display) and a Keyboard (Keyboard), and the optional target user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a nonvolatile memory (non-volatile memory), such as at least one magnetic disk memory. The memory 1005 may also optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 16, an operating system, a network communication module, a target user interface module, and a device control application may be included in a memory 1005, which is one type of computer-readable storage medium.
In the computer device 1000 shown in fig. 16, the network interface 1004 may provide network communication functions, while the target user interface 1003 is primarily an interface for providing input to a target user; the processor 1001 may be used to invoke a device control application stored in the memory 1005. For example, the processor 1001 may be configured to invoke the device control application stored in the memory 1005 to implement:
Acquiring video data containing a target object, and performing semantic segmentation on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame;
Carrying out convolution processing on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame;
Acquiring a time sequence relation between at least two video frames, combining the position sub-feature images corresponding to each video frame into a feature image sequence according to the time sequence relation, and extracting time sequence features of the feature image sequence to obtain predicted track features of a target object in video data;
And classifying and identifying the predicted track features to obtain a track identification result corresponding to the target object, wherein the track identification result is used for providing verification basis for the service associated with the target object.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device may execute the description of the data identification method in the embodiment corresponding to fig. 2, fig. 4, or fig. 12, which is not described herein. In addition, the description of the beneficial effects of the same method is omitted.
As an example, the program instructions described above may be deployed to be executed on one computer device or on multiple computer devices at one site or on multiple computer devices distributed across multiple sites and interconnected by a communication network, where the multiple computer devices distributed across multiple sites and interconnected by a communication network may constitute a blockchain network.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of computer programs, which may be stored on a computer-readable storage medium, and which, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (15)

1. A method of data identification, comprising:
Acquiring video data containing a target object, and performing semantic segmentation on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame;
carrying out convolution processing on the coordinate sequence of the target object in each video frame to obtain a position sub-feature map corresponding to each video frame;
Acquiring a time sequence relation between the at least two video frames, combining the position sub-feature images corresponding to each video frame into a feature image sequence according to the time sequence relation, and extracting time sequence features of the feature image sequence to obtain predicted track features of the target object in the video data;
and classifying and identifying the predicted track features to obtain a track identification result corresponding to the target object, wherein the track identification result is used for providing a verification basis for a service associated with the target object.
2. The method of claim 1, wherein the acquiring video data containing the target object comprises:
responding to uploading operation in the track verification page, and starting the camera shooting assembly;
And shooting a target object based on the shooting assembly, and determining the shot movement track of the target object as video data containing the target object.
3. The method of claim 1, wherein the acquiring video data containing the target object comprises:
Receiving a track verification request aiming at the target object and sent by the target object, and performing authority verification on the target object according to the track verification request;
If the target object has the authority for carrying out track verification on the target object, acquiring a video stream corresponding to the target object from the track verification request;
and decoding the video stream, and determining the decoded video stream as video data containing the target object.
4. The method of claim 1, wherein the at least two video frames comprise video frames T i, i being less than or equal to the number of at least two video frames S, i and S each being a positive integer;
the semantic segmentation is performed on at least two video frames in the video data to obtain a coordinate sequence of the target object in each video frame, including:
Randomly dividing the image corresponding to the video frame T i to obtain N candidate areas; the N is a positive integer;
respectively extracting pixels of the N candidate areas to obtain area pixel points respectively corresponding to the N candidate areas;
According to a pixel value interval associated with the target object, carrying out pixel classification on region pixel points contained in the N candidate regions respectively to obtain target pixel points in each candidate region, and carrying out segmentation processing on the video frame T i based on the target pixel points to obtain a mask image corresponding to the target object; the pixel value corresponding to the target pixel point belongs to the pixel value interval;
Based on the mask image, a sequence of coordinates of the target object in the video frame T i is determined.
5. The method of claim 4, wherein determining the sequence of coordinates of the target object in the video frame T i based on the mask image comprises:
Determining image edge points of the target object in the video frame T i according to the mask image;
Performing straight line fitting on the image edge points to determine M fitting straight lines; m is a positive integer;
And acquiring an intersection point between any two adjacent fitting straight lines in the M fitting straight lines, acquiring position coordinate information corresponding to the intersection point, and generating a coordinate sequence of the target object in the video frame T i according to the position coordinate information corresponding to the intersection point.
6. The method of claim 1, wherein the convolving the coordinate sequence of the target object in each video frame to obtain the position sub-feature map corresponding to each video frame, includes:
Integrating coordinate sequences of the target object in at least two video frames to obtain a coordinate matrix corresponding to the target object, inputting the coordinate matrix into a convolution network layer in a track recognition model, and performing convolution operation on the coordinate matrix to obtain position convolution information corresponding to the coordinate matrix;
Normalizing the position convolution information to obtain normalized position convolution information;
based on an activation function in the convolution network layer, nonlinear combination is carried out on the position convolution information after normalization processing, and a position feature map corresponding to a coordinate sequence of the target object in the at least two video frames is generated;
And determining the position sub-feature map corresponding to each video frame respectively based on the position feature map.
7. The method according to claim 6, wherein the generating a position feature map corresponding to a coordinate sequence of the target object in the at least two video frames based on the activation function in the convolutional network layer by performing nonlinear combination on the normalized position convolution information includes:
based on an activation function in the convolution network layer, nonlinear combination is carried out on the position convolution information after normalization processing, and a candidate position feature map corresponding to the coordinate sequences in the at least two video frames is obtained;
acquiring the number of output channels and output size information in the convolution network layer, and adjusting the candidate position feature map based on the number of output channels and the output size information to obtain a position feature map corresponding to a coordinate sequence of the target object in the at least two video frames; the size information corresponding to the position feature map is the product of the output size information and the output channel number.
8. The method of claim 6, wherein normalizing the position convolution information to obtain normalized position convolution information comprises:
based on a feature normalization network in the convolutional network layer, carrying out grouping processing on the position convolution information to obtain Q position convolution groups; Q is a positive integer;
And acquiring the mean value and the variance corresponding to the Q position convolution groups respectively, and carrying out normalization processing on the position convolution information in each position convolution group based on the mean value and the variance corresponding to the Q position convolution groups respectively to obtain normalized position convolution information.
9. The method according to claim 1, wherein acquiring the time sequence relation between the at least two video frames, combining the position sub-feature maps respectively corresponding to each video frame into a feature map sequence according to the time sequence relation, and performing time sequence feature extraction on the feature map sequence to obtain the predicted track feature of the target object in the video data includes:
acquiring the time sequence relation between the at least two video frames, and combining the position sub-feature maps corresponding to each video frame into the feature map sequence according to the time sequence relation;
performing, based on a recurrent network layer of the track recognition model, recurrent encoding on the feature map sequence to obtain an encoding matrix corresponding to the feature map sequence; and
performing, based on an activation function in the recurrent network layer, a nonlinear combination of the encoding matrix to obtain the predicted track feature of the target object in the video data.
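A minimal sketch of the recurrent network layer of claim 9, using a GRU as the recurrent cell. The cell type, the hidden size, and the flattening of each position sub-feature map into a vector are assumptions made for illustration only.

    import torch
    import torch.nn as nn

    class RecurrentTrackEncoder(nn.Module):
        def __init__(self, feat_dim, hidden_dim=64):
            super().__init__()
            self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)  # recurrent encoding
            self.act = nn.Tanh()                                       # nonlinear combination

        def forward(self, feature_map_sequence):
            # feature_map_sequence: (1, S, feat_dim), frames ordered by their time sequence relation
            encoding_matrix, _ = self.rnn(feature_map_sequence)   # (1, S, hidden_dim)
            # predicted track feature of the target object (last-step readout is an assumption)
            return self.act(encoding_matrix[:, -1, :])

    # Example: 8 frames, each position sub-feature map flattened to 32 values.
    track_feature = RecurrentTrackEncoder(feat_dim=32)(torch.randn(1, 8, 32))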
10. The method of claim 9, wherein the at least two video frames include a video frame X i and a video frame X i+1, the video frame X i being adjacent to the video frame X i+1, i+1 being less than or equal to the number S of the at least two video frames, and i and S each being a positive integer;
wherein performing, based on the recurrent network layer of the track recognition model, recurrent encoding on the feature map sequence to obtain the encoding matrix corresponding to the feature map sequence includes:
acquiring a first hidden state vector corresponding to the video frame X i in the feature map sequence, inputting the position sub-feature map corresponding to the video frame X i and the first hidden state vector into the recurrent network layer of the track recognition model, and encoding the position sub-feature map corresponding to the video frame X i to obtain a feature vector corresponding to the video frame X i;
determining, according to the feature vector corresponding to the video frame X i, a second hidden state vector corresponding to the video frame X i+1 in the feature map sequence, inputting the position sub-feature map corresponding to the video frame X i+1 and the second hidden state vector into the recurrent network layer of the track recognition model, and encoding the position sub-feature map corresponding to the video frame X i+1 to obtain a feature vector corresponding to the video frame X i+1; and
when the video frame X i+1 is the last frame of the at least two video frames, generating the encoding matrix corresponding to the feature map sequence according to the feature vectors corresponding to the at least two video frames.
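The stepwise encoding of claim 10 can be pictured as a single recurrent cell whose hidden state produced for video frame X i is fed back in together with the position sub-feature map of video frame X i+1. The sketch below uses a GRUCell; the cell type and the dimensions are assumptions.

    import torch
    import torch.nn as nn

    def recurrent_encode(sub_feature_maps, feat_dim=32, hidden_dim=64):
        """sub_feature_maps: list of S position sub-feature maps, each flattened to shape (feat_dim,)."""
        cell = nn.GRUCell(feat_dim, hidden_dim)
        hidden = torch.zeros(1, hidden_dim)   # first hidden state vector
        feature_vectors = []
        for sub_map in sub_feature_maps:
            # encode the sub-feature map of frame X i together with the current hidden state;
            # the result is the feature vector of X i and the hidden state vector for X i+1
            hidden = cell(sub_map.unsqueeze(0), hidden)
            feature_vectors.append(hidden.squeeze(0))
        # after the last frame, stack the feature vectors into the encoding matrix
        return torch.stack(feature_vectors)   # shape (S, hidden_dim)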
11. The method according to claim 1, further comprising:
if the track recognition result corresponding to the target object matches a verification track direction, determining that the video data corresponding to the target object is valid video data, and verifying the validity of the target object according to the video data, the verification track direction being the track direction that the business associated with the target object requires for verification; and
if the track recognition result corresponding to the target object does not match the verification track direction, determining that the video data corresponding to the target object is invalid video data, and outputting prompt information, the prompt information being used for prompting that the target object be shot again.
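A small sketch of the validity check in claim 11. The direction labels and the prompt wording are illustrative assumptions; the claim only requires comparing the track recognition result with the verification track direction required by the associated business.

    def check_video_validity(track_result: str, verification_direction: str) -> bool:
        """Compare the recognized track direction with the direction the business requires."""
        if track_result == verification_direction:
            return True   # valid video data: proceed to verify the validity of the target object
        # invalid video data: prompt that the target object should be shot again
        print("Track direction does not match; please record the target object again.")
        return False

    # Example: the associated business may require a clockwise sweep around the object.
    check_video_validity("clockwise", "clockwise")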
12. A data identification method, comprising:
acquiring a first initial track recognition model, first sample video data corresponding to a sample article, and a first label classification result corresponding to the sample article in the first sample video data;
performing, with the first initial track recognition model, convolution processing on coordinate sequences corresponding to at least two first sample video frames in the first sample video data to obtain a position sub-feature map corresponding to each first sample video frame;
acquiring a sample time sequence relation between the at least two first sample video frames, combining the position sub-feature maps corresponding to each first sample video frame into a sample feature map sequence according to the sample time sequence relation, and performing time sequence feature extraction on the sample feature map sequence to obtain a first predicted sample track feature of the sample article in the first sample video data;
classifying and recognizing the first predicted sample track feature to obtain a first predicted classification result corresponding to the sample article;
determining a first loss function corresponding to the first initial track recognition model according to the first label classification result and the first predicted classification result corresponding to the sample article; and
iteratively training the first initial track recognition model based on the first loss function, and determining the trained first initial track recognition model as a track recognition model, the track recognition model being used for acquiring a track recognition result corresponding to a target object in video data.
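A compact PyTorch training sketch corresponding to claim 12. Cross-entropy is used here as one possible form of the first loss function, and the optimizer, learning rate, and data loader are assumptions; the claim only requires a loss determined from the first label classification result and the first predicted classification result, followed by iterative training.

    import torch
    import torch.nn as nn

    def train_track_model(model, sample_loader, epochs=10, lr=1e-3):
        """model maps a batch of sample coordinate matrices to predicted classification logits."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()               # assumed form of the first loss function
        for _ in range(epochs):                         # iterative training
            for coord_matrix, label_result in sample_loader:
                logits = model(coord_matrix)            # first predicted classification result
                loss = criterion(logits, label_result)  # compare with the first label classification result
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model   # the trained model is used as the track recognition model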
13. The method according to claim 12, further comprising:
outputting, through a second initial track recognition model, a second predicted sample track feature of the sample article in second sample video data, the model parameters of the second initial track recognition model being the same as the model parameters of the first initial track recognition model;
classifying and recognizing the second predicted sample track feature to obtain a second predicted classification result corresponding to the sample article in the second sample video data;
generating a second loss function according to the second predicted classification result and a second label classification result corresponding to the second sample video data; and
generating a third loss function according to the first predicted classification result and the second predicted classification result;
wherein iteratively training the first initial track recognition model based on the first loss function and determining the trained first initial track recognition model as the track recognition model includes:
generating a total loss function based on the first loss function, the second loss function and the third loss function, iteratively training the first initial track recognition model according to the total loss function, and determining the trained first initial track recognition model as the track recognition model.
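A sketch of the three-part loss in claim 13. Cross-entropy for the first and second losses and a KL-divergence consistency term for the third loss are assumptions; the claim states only that the third loss is generated from the first and second predicted classification results and that the total loss combines all three.

    import torch.nn as nn
    import torch.nn.functional as F

    ce = nn.CrossEntropyLoss()

    def total_loss(logits_1, labels_1, logits_2, labels_2):
        """logits_1/logits_2: predictions of the first and second initial track recognition models."""
        loss_1 = ce(logits_1, labels_1)   # first loss: first model vs. first label classification results
        loss_2 = ce(logits_2, labels_2)   # second loss: second model vs. second label classification results
        # third loss: agreement between the two predicted classification results (KL divergence is an assumption)
        loss_3 = F.kl_div(F.log_softmax(logits_1, dim=-1),
                          F.softmax(logits_2, dim=-1),
                          reduction="batchmean")
        return loss_1 + loss_2 + loss_3   # total loss used for iterative training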
14. A computer device, comprising: a processor and a memory;
The memory stores a computer program which, when loaded and executed by the processor, performs the method of any one of claims 1 to 13.
15. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program adapted to be loaded by a processor and to perform the method of any of claims 1 to 13.
CN202110304366.8A 2021-03-22 2021-03-22 Data identification method, storage medium and device Active CN114677611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110304366.8A CN114677611B (en) 2021-03-22 2021-03-22 Data identification method, storage medium and device

Publications (2)

Publication Number Publication Date
CN114677611A (en) 2022-06-28
CN114677611B (en) 2024-08-20

Family

ID=82070900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110304366.8A Active CN114677611B (en) 2021-03-22 2021-03-22 Data identification method, storage medium and device

Country Status (1)

Country Link
CN (1) CN114677611B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116777947B (en) * 2023-06-21 2024-02-13 上海汉朔信息科技有限公司 User track recognition prediction method and device and electronic equipment
CN118334596B (en) * 2024-06-17 2024-08-13 江西国泰利民信息科技有限公司 Object state detection model construction method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110660082A (en) * 2019-09-25 2020-01-07 西南交通大学 Target tracking method based on graph convolution and trajectory convolution network learning
CN110874865A (en) * 2019-11-14 2020-03-10 腾讯科技(深圳)有限公司 Three-dimensional skeleton generation method and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2908944B2 (en) * 2018-07-24 2023-01-09 Fund Centro Tecnoloxico De Telecomunicacions De Galicia A COMPUTER IMPLEMENTED METHOD AND SYSTEM FOR DETECTING SMALL OBJECTS IN AN IMAGE USING CONVOLUTIONAL NEURAL NETWORKS

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant