CN111310595A - Method and apparatus for generating information - Google Patents

Method and apparatus for generating information

Info

Publication number
CN111310595A
Authority
CN
China
Prior art keywords
user
human body
position information
target video
video frame
Prior art date
Legal status
Granted
Application number
CN202010065727.3A
Other languages
Chinese (zh)
Other versions
CN111310595B (en)
Inventor
安容巧
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010065727.3A
Publication of CN111310595A
Application granted
Publication of CN111310595B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition

Abstract

The embodiment of the disclosure discloses a method and an apparatus for generating information. One embodiment of the method comprises: acquiring a user position information set corresponding to a target video frame sequence, wherein the user position information is used for representing the position of a user displayed in the target video frame sequence, and the user position information comprises human body position information and local human body position information; respectively determining the association degree between the human bodies of the users and the association degree between the local human bodies displayed in the target video frame sequence according to the human body position information and the local human body position information; determining the association relationship between the human bodies of the users and the association relationship between the local human bodies displayed in the target video frame sequence according to the determined association degrees; and in response to determining that the determined association relationships match, generating trajectory information of the user displayed in the target video frame sequence. The method and the apparatus reduce the complexity of the detection and tracking model and save network transmission traffic.

Description

Method and apparatus for generating information
Technical Field
Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method and an apparatus for generating information.
Background
With the rapid development of computer vision, research on and application of multi-target human body tracking have become increasingly widespread.
The related approaches typically detect and track the pedestrian as a whole. In scenes with occlusion and the like, the prior art generally resorts to constructing a more complex network. For example, on the one hand, a spatial attention map is generated to weight objects (such as pedestrians) in the image; on the other hand, a recurrent neural network is constructed to produce a temporal attention model that selects the target in each frame along the time sequence, so that the target is tracked.
Disclosure of Invention
Embodiments of the present disclosure propose methods and apparatuses for generating information.
In a first aspect, an embodiment of the present disclosure provides a method for generating information, the method including: acquiring a user position information set corresponding to a target video frame sequence, wherein the user position information in the user position information set is used for representing the position of a user displayed by a target video frame in the target video frame sequence, and the user position information comprises human body position information and local human body position information; respectively determining the association degree between the human bodies of the users and the association degree between the local human bodies displayed in the target video frame sequence according to the human body position information and the local human body position information; determining the association relationship between the human bodies of the users and the association relationship between the local human bodies displayed in the target video frame sequence according to the determined association degree between the human bodies of the users and the association degree between the local human bodies; in response to determining that the determined associations between the local human bodies of the user match the associations between the human bodies of the user, generating trajectory information for the user displayed in the sequence of target video frames.
In some embodiments, the obtaining a set of user location information corresponding to a sequence of target video frames includes: acquiring a target video frame sequence; and inputting the target video frame in the target video frame sequence to a pre-trained user position detection model to obtain user position information corresponding to the target video frame, wherein the user position detection model is used for representing the corresponding relation between the target video frame and the user position information.
In some embodiments, the determining the association degree between the human bodies of the users and the association degree between the local human bodies displayed in the target video frame sequence according to the human body position information and the local human body position information respectively includes: extracting an image of a human body of the user indicated by the human body position information from the target video frame sequence as a user human body image; inputting the extracted user human body image into a pre-trained user feature extraction model to obtain a user feature corresponding to the user human body image, wherein the user feature extraction model is used for representing the corresponding relation between the user human body image and the user feature; and determining the association degree between the human bodies of the users displayed in the target video frame sequence according to the distance between the obtained user features and the user features included in the trajectory obtained by trajectory prediction.
In some embodiments, the user feature extraction model is obtained by: acquiring a training sample set, wherein the training sample comprises a sample user human body image and sample marking information corresponding to the sample user human body image, and the sample marking information is used for identifying a user; and taking the human body image of the sample user of the training sample in the training sample set as input, taking the user characteristic matched with the sample marking information corresponding to the input human body image of the sample user as expected output, and training to obtain a user characteristic extraction model, wherein the user characteristic matched with the sample marking information corresponding to the input training sample is consistent with the user identified by the sample marking information.
In some embodiments, the local human body indicated by the local human body position information includes a head determined based on the head-shoulder key points; and the generating trajectory information for the user displayed in the sequence of target video frames in response to determining that the determined associations between the local human bodies of the user match the associations between the human bodies of the user includes: determining an Intersection over Union (IoU) between a human body region indicated by human body position information and a head region indicated by local human body position information in a target video frame of the target video frame sequence; for intersection ratios meeting a preset condition, determining the distance between the position indicated by the human body position information and the position indicated by the local human body position information; generating an association relationship between the human body position information and the local human body position information in response to determining that the determined distance satisfies a preset distance condition; and generating the trajectory information of the user displayed in the target video frame sequence according to the generated association relationships.
In a second aspect, an embodiment of the present disclosure provides an apparatus for generating information, the apparatus including: the device comprises an acquisition unit, a display unit and a display unit, wherein the acquisition unit is configured to acquire a user position information set corresponding to a target video frame sequence, the user position information in the user position information set is used for representing the position of a user displayed by the target video frame in the target video frame sequence, and the user position information comprises human body position information and local human body position information; a first determination unit configured to determine a degree of association between human bodies of users and a degree of association between local human bodies displayed in the target video frame sequence, respectively, based on the human body position information and the local human body position information; a second determining unit configured to determine an association relationship between the human bodies of the users and an association relationship between the local human bodies displayed in the target video frame sequence according to the determined association degrees between the human bodies of the users and the association degrees between the local human bodies; a generating unit configured to generate trajectory information of the user displayed in the sequence of target video frames in response to determining that the determined association between the local human bodies of the user matches the association between the human bodies of the user.
In some embodiments, the obtaining unit includes: an acquisition module configured to acquire a sequence of target video frames; the first generation module is configured to input a target video frame in a target video frame sequence to a pre-trained user position detection model to obtain user position information corresponding to the target video frame, wherein the user position detection model is used for representing a corresponding relation between the target video frame and the user position information.
In some embodiments, the first determining unit includes: an extraction module configured to extract an image of a human body of the user indicated by the human body position information as a user human body image from the target video frame sequence; a second generation module configured to input the extracted user human body image into a pre-trained user feature extraction model to obtain a user feature corresponding to the user human body image, wherein the user feature extraction model is used for representing a corresponding relation between the user human body image and the user feature; and a first determining module configured to determine a degree of association between human bodies of users displayed in the target video frame sequence according to a distance between the obtained user feature and a user feature included in a trajectory obtained by trajectory prediction.
In some embodiments, the user feature extraction model is obtained by: acquiring a training sample set, wherein the training sample comprises a sample user human body image and sample marking information corresponding to the sample user human body image, and the sample marking information is used for identifying a user; and taking the human body image of the sample user of the training sample in the training sample set as input, taking the user characteristic matched with the sample marking information corresponding to the input human body image of the sample user as expected output, and training to obtain a user characteristic extraction model, wherein the user characteristic matched with the sample marking information corresponding to the input training sample is consistent with the user identified by the sample marking information.
In some embodiments, the local human body indicated by the local human body position information includes a head determined based on the head-shoulder key points; the generation unit includes: a second determination module configured to determine an intersection ratio between a human body region indicated by the human body position information and a head region indicated by the local human body position information in a target video frame of the sequence of target video frames; a third determination module configured to determine, for intersection ratios satisfying a preset condition, a distance between the position indicated by the human body position information and the position indicated by the local human body position information; a third generating module configured to generate an association relationship between the human body position information and the local human body position information in response to determining that the determined distance satisfies a preset distance condition; and a fourth generation module configured to generate the trajectory information of the user displayed in the target video frame sequence according to the generated association relationships.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which when executed by a processor implements the method as described in any of the implementations of the first aspect.
The method and the apparatus for generating information provided by the embodiment of the disclosure first acquire a user position information set corresponding to a target video frame sequence. The user position information in the user position information set is used for representing the position of a user displayed by a target video frame in the target video frame sequence, and includes human body position information and local human body position information. Then, the association degree between the human bodies of the users and the association degree between the local human bodies displayed in the target video frame sequence are respectively determined according to the human body position information and the local human body position information. Next, the association relationship between the human bodies of the users and the association relationship between the local human bodies displayed in the target video frame sequence are determined according to the determined association degrees. Finally, trajectory information of the user displayed in the sequence of target video frames is generated in response to determining that the determined association between the local human bodies of the user matches the association between the human bodies of the user. Therefore, the complexity of the detection and tracking model can be greatly reduced, making the method particularly suitable for end devices and embedded devices in scenarios such as unmanned retail stores and monitored areas, and also saving network transmission traffic.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a method for generating information, according to the present disclosure;
fig. 3a, 3b are schematic diagrams of an application scenario of a method for generating information according to an embodiment of the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method for generating information according to the present disclosure;
FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for generating information according to the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary architecture 100 to which the method for generating information or the apparatus for generating information of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, a network 103, and a server 104. The network 103 serves as a medium for providing communication links between the terminal devices 101, 102 and the server 104. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102 interact with a server 104 via a network 103 to receive or send messages or the like. The terminal apparatuses 101 and 102 may be hardware or software. When the terminal devices 101, 102 are hardware, the terminal device 101 may be various electronic devices having a camera and supporting image transmission, including but not limited to various optical video cameras or smart cameras, etc.; the terminal device 102 may be a variety of electronic devices having a display screen as a monitoring or accounting terminal including, but not limited to, smart phones, tablets, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101 and 102 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 104 may be a server that provides various services, such as a background server that performs analysis processing on an image transmitted by the terminal apparatus 101. The backend server may perform various analysis processes on the image transmitted by the terminal apparatus 101, and transmit a processing result (e.g., a movement path of the user) to the terminal apparatus 102 for display.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
The terminal apparatus 101 or 102 may directly analyze the acquired image, and in this case, the server 104 may not be present.
It should be noted that the method for generating information provided by the embodiment of the present disclosure may be executed by the terminal device 101 or 102, and accordingly, the apparatus for generating information is generally disposed in the terminal device 101 or 102. The method for generating information provided by the embodiment of the present disclosure may also be performed by the server 104, and accordingly, the apparatus for generating information may also be disposed in the server 104.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating information in accordance with the present disclosure is shown. The method for generating information comprises the following steps:
step 201, acquiring a user position information set corresponding to a target video frame sequence.
In this embodiment, an execution subject of the method for generating information (such as the terminal device 101 or 102 shown in fig. 1) may acquire a set of user position information corresponding to a sequence of target video frames by a wired connection or a wireless connection. The target video frame sequence may be extracted from a video shot of a preset area (e.g., a security monitoring area, an unmanned retail store, etc.). Here, the above-mentioned extraction may be equidistant extraction according to the frame rate, or preferential selection of frames with higher definition, so as to form the above-mentioned target video frame sequence. The user position information in the user position information set can be used for representing the position of a user displayed by a target video frame in the target video frame sequence. Typically, each piece of user position information in the set corresponds to a frame in the sequence of target video frames. Each piece of user position information may include human body position information and local human body position information. The human body position information may be used to represent the position of the whole human body image in the image, and may include, for example, the coordinates of the center position of a predicted bounding box (pedestrian box) for pedestrian detection. The local human body position information can be used to characterize the positions of local key points of the human body in the image. The above-mentioned human body local key points may include key points that can be used to identify the user, which may include, but are not limited to, at least one of: head, shoulder, face.
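For illustration only, the following Python sketch shows one possible in-memory layout for a piece of user position information and for the per-frame set; the field names and the box format are assumptions and are not part of the disclosure, which only requires that each piece of user position information contain human body position information and local human body position information.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class UserPosition:
    # whole human body bounding box, e.g. center x, center y, width, height
    body_box: Tuple[float, float, float, float]
    # local human body (e.g. head) bounding box in the same format
    head_box: Tuple[float, float, float, float]
    # optional local key points such as head, shoulder or face points
    keypoints: List[Tuple[float, float]] = field(default_factory=list)

@dataclass
class FrameUserPositions:
    frame_index: int                       # index of the target video frame in the sequence
    users: List[UserPosition] = field(default_factory=list)

A user position information set is then simply a list of FrameUserPositions, one entry per target video frame.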
In this embodiment, as an example, the execution subject may acquire a set of user location information corresponding to a target video frame sequence stored locally in advance. As yet another example, the execution subject may acquire a set of user location information corresponding to a target video frame sequence from an electronic device (e.g., the terminal device 101 shown in fig. 1) communicatively connected thereto. The terminal device 101 may process the captured video to generate the user location information set.
In some optional implementations of this embodiment, the executing body may further obtain a set of user location information corresponding to the target video frame sequence according to the following steps:
in a first step, a sequence of target video frames is obtained.
In these implementations, the execution body may acquire the sequence of target video frames in various ways. As an example, the execution subject may locally retrieve a pre-stored sequence of target video frames. As yet another example, the execution subject may obtain the target video frame sequence from a communicatively connected electronic device.
And secondly, inputting the target video frame in the target video frame sequence to a pre-trained user position detection model to obtain user position information corresponding to the target video frame.
In these implementations, the user location detection model described above may be used to characterize the correspondence between the target video frames and the user location information. The user position detection model can be obtained by training through the following steps:
and S1, acquiring a training sample set.
In these implementations, the training samples described above may include sample video frames and sample user location information corresponding to the sample video frames. The sample user position information may include sample body position information and sample local body position information. The sample human body position information may be used to represent the position of the whole human body image in the sample video frame, and may include, for example, coordinates that mark the center position of a border of the whole human body image of the user. The sample local human body position can be used for representing the position of the human body local key point in the sample video frame. The above-mentioned human body local key points may include key points that can be used to identify the user, which may include, but are not limited to, at least one of: head, shoulder, face.
And S2, taking the sample video frame of the training sample in the training sample set as input, taking the sample user position information corresponding to the input sample video frame as expected output, and training to obtain the user position detection model.
In these implementations, the above-described training may be supervised or weakly supervised training of the initial user position detection model using a machine learning approach. The initial user position detection model may include, but is not limited to, at least one of: the FSSD (Feature Fusion Single Shot Multibox Detector) model and the YOLOv3 detection model.
As an example, the execution agent may train the FSSD model using a MobileNetV1 network structure. Optionally, some of the upsampling layers in the FSSD model may be replaced with deconvolution layers to improve the accuracy of the model. Alternatively, the convolutional layers in the FSSD model may be replaced with a depthwise separable convolution structure. Thereby, model parameters and inference time can be reduced, realizing a lightweight model that is convenient to use in unmanned retail stores, monitoring terminals and various embedded devices.
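As a sketch of the depthwise separable convolution structure mentioned above (written with PyTorch, which is an assumption; the embodiment itself only names the FSSD model and the MobileNetV1 backbone), a standard 3x3 convolution can be split into a depthwise convolution followed by a 1x1 pointwise convolution:

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # A depthwise 3x3 convolution followed by a 1x1 pointwise convolution,
    # roughly 9*C_in + C_in*C_out weights instead of 9*C_in*C_out.
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

Replacing the ordinary convolution layers of the detector with such blocks is what reduces the model parameters and inference time referred to above.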
As yet another example, the execution agent may train the YOLOv3 model in the PaddlePaddle deep learning framework using a Darknet53 network structure, and use a synchronization approach (sync.) in training. Optionally, a spatio-temporal associative pooling layer (spatial temporal pooling) may also be added to the YOLOv3 model, whereby the accuracy of the model can be noticeably improved.
Step 202, respectively determining the association between the human bodies of the users and the association between the local human bodies displayed in the target video frame sequence according to the human body position information and the local human body position information.
In this embodiment, the execution body may respectively determine the association degree between the human bodies of the users and the association degree between the local human bodies displayed in the target video frame sequence in various ways according to the human body position information and the local human body position information. As an example, the executing entity may perform trajectory prediction on the human bodies of the users and the local human bodies displayed in the target video frame sequence by using a tracking method based on trajectory prediction, and generate, for each target video frame, the degree of association between the human bodies of the users (and between the local human bodies) displayed in that frame and those displayed in a previous video frame of the sequence. The trajectory prediction may be performed by various methods, such as Kalman filtering or fitting functions. The above-mentioned degree of association can be determined by the following formula:

d(i, j) = (d_j - y_i)^T S_i^(-1) (d_j - y_i)

wherein d(i, j) may be used to represent the degree of association between the j-th detection result and the i-th trajectory. S_i may be used to represent the covariance matrix of the observation space at the current moment of the trajectory, obtained by the Kalman filter. y_i may be used to represent the prediction result of the trajectory at the current time. d_j may be used to indicate the state of the j-th detection result. The state of the above detection result may be (μ, θ, γ, h), where (μ, θ) indicates the center position of the detection box of the detection result, γ its aspect ratio, and h its height. The detection result may include at least one of a human body and a local human body of a user displayed in the target video frame sequence. The prediction result may include at least one of a trajectory of a human body position and a trajectory of a local human body position of the user, obtained by performing trajectory prediction based on the target video frame sequence.
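A minimal NumPy sketch of this association degree is given below; it follows the squared Mahalanobis-distance form implied by the description, and assumes that the Kalman filter has already produced the prediction y_i and covariance S_i for the trajectory.

import numpy as np

def association_degree(detection_state: np.ndarray,
                       track_prediction: np.ndarray,
                       track_covariance: np.ndarray) -> float:
    # detection_state: state d_j of the detection box, e.g. its center position,
    # aspect ratio and height.
    # track_prediction: Kalman-filter prediction y_i of the trajectory in the
    # same observation space.
    # track_covariance: covariance matrix S_i of the observation space at the
    # current moment, obtained from the Kalman filter.
    diff = detection_state - track_prediction
    return float(diff @ np.linalg.inv(track_covariance) @ diff)

Under this metric a smaller value of d(i, j) indicates that detection j lies closer to the predicted position of trajectory i.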
In some optional implementations of the present embodiment, the executing entity may further determine the degree of association between the human bodies of the users displayed in the target video frame sequence according to the following steps:
first, an image of a user's body indicated by the body position information is extracted from the target video frame sequence as a user body image.
In these implementations, the execution body described above may extract an image of the human body of the user indicated by the human body position information as the user human body image from the target video frame sequence by various human body detection methods.
And secondly, inputting the extracted user human body image into a pre-trained user characteristic extraction model to obtain the user characteristic corresponding to the user human body image.
In these implementations, the user feature extraction model described above may be used to characterize the correspondence between the user human body image and the user features. For example, a Deep residual network (ResNet) may be included.
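As an illustrative sketch only (using PyTorch and torchvision, which are assumptions; the disclosure merely states that a deep residual network such as ResNet may be used), a user feature extraction model could be set up as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class UserFeatureExtractor(nn.Module):
    # Maps a cropped user human body image to a fixed-length user feature.
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        backbone = models.resnet18()        # any deep residual network would do
        backbone.fc = nn.Identity()         # keep the 512-dimensional pooled feature
        self.backbone = backbone
        self.embed = nn.Linear(512, feature_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: a batch of user human body crops of shape (N, 3, H, W)
        features = self.embed(self.backbone(images))
        return F.normalize(features, dim=1)  # unit-length features simplify cosine matching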
Optionally, the user feature extraction model may be obtained by:
and S1, acquiring a training sample set.
In these implementations, the training sample may include a sample user human body image and sample annotation information corresponding to the sample user human body image. The sample annotation information described above can be used to identify the user. For example, for the same user located in different scenarios, the sample annotation information is usually consistent.
And S2, taking the human body image of the sample user of the training sample in the training sample set as input, taking the user characteristic matched with the sample marking information corresponding to the input human body image of the sample user as expected output, and training to obtain a user characteristic extraction model.
In these implementations, the user characteristics that match the sample labeling information corresponding to the input training sample are consistent with the user identified by the sample labeling information.
Specifically, the executing agent of the training step may input the human body image of the sample user of the training sample in the training sample set to the initial user feature extraction model, so as to obtain the user feature of the training sample. And then, matching the obtained user characteristics with the user characteristics of the user indicated by the sample marking information to obtain a matched user. Next, a degree of difference between the obtained user feature of the matched user and the user feature of the user specified by the sample labeling information of the training sample may be calculated by using a preset loss function. And then, adjusting the network parameters of the initial user feature extraction model based on the calculated difference degree and the complexity of the model, and finishing the training under the condition of meeting a preset training finishing condition. And finally, determining the initial user feature extraction model obtained by training as a user feature extraction model.
And thirdly, determining the association degree between the human bodies of the users displayed in the target video frame sequence according to the distance between the obtained user features and the user features included in the trajectory obtained by trajectory prediction.
In these implementations, the execution subject may first determine the distance between a user feature obtained from the user feature extraction model and each user feature included in a trajectory obtained by trajectory prediction. The distance may include, but is not limited to, at least one of a Euclidean distance and a cosine distance. Then, the smallest of these distances may be selected as the degree of association between the human bodies of the users displayed in the target video frame sequence.
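The following NumPy sketch illustrates this step, with the cosine distance chosen as one of the options mentioned above; the data layout (one feature per detection, a matrix of stored features per trajectory) is an assumption.

import numpy as np

def body_association_degree(detection_feature: np.ndarray,
                            track_features: np.ndarray) -> float:
    # detection_feature: user feature of one detected human body, shape (D,)
    # track_features: user features stored in one trajectory obtained by
    #                 trajectory prediction, shape (K, D)
    det = detection_feature / np.linalg.norm(detection_feature)
    trk = track_features / np.linalg.norm(track_features, axis=1, keepdims=True)
    cosine_distances = 1.0 - trk @ det     # one distance per stored feature
    return float(cosine_distances.min())   # the smallest distance is used as the association degree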
And step 203, determining the association relationship between the human bodies of the users and the association relationship between the local human bodies displayed in the target video frame sequence according to the determined association degree between the human bodies of the users and the association degree between the local human bodies.
In this embodiment, the execution subject may determine the association relationship between the human bodies of the users and the association relationship between the local human bodies displayed in the target video frame sequence in various ways according to a fusion of the determined association degrees between the human bodies of the users and between the local human bodies. The association relationship may be used to represent the correspondence between the human bodies of a user, and between the local human bodies of a user, displayed in preceding and subsequent frames of the target video frame sequence. For example, the executing entity may take a weighted average of the association degree between the human bodies of the users determined in step 202 and the association degree between the corresponding local human bodies, and select the matched pair with the strongest fused association degree as having an association relationship.
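A sketch of one possible fusion and matching step is shown below; the weighting, the cost threshold and the use of a minimum-cost assignment are illustrative assumptions, the embodiment only requiring that the two association degrees be combined by weighted averaging and the best-matching pairs be selected.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks(body_degrees: np.ndarray,
                 head_degrees: np.ndarray,
                 body_weight: float = 0.5,
                 max_cost: float = 0.7):
    # body_degrees / head_degrees: (num_tracks, num_detections) matrices of
    # association degrees, where a smaller value means a closer match.
    cost = body_weight * body_degrees + (1.0 - body_weight) * head_degrees
    rows, cols = linear_sum_assignment(cost)   # minimum-cost assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]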
Step 204, in response to determining that the determined association between the local human bodies of the user matches the association between the human bodies of the user, generating trajectory information of the user displayed in the sequence of target video frames.
In this embodiment, in response to determining that the determined association between the local human bodies of the user matches the association between the human bodies of the user, the execution body may generate trajectory information of the user displayed in the target video frame sequence in various ways. Wherein the above-mentioned association matching generally includes that the user indicated by the determined association between the local human bodies is consistent with the user indicated by the association between the human bodies.
In some optional implementation manners of this embodiment, the execution subject may further determine whether the local human body of the user and the user corresponding to the human body of the user are consistent through a pedestrian Re-identification (ReID) technology.
With continued reference to fig. 3a, fig. 3a is a schematic diagram of an application scenario of a method for generating information according to an embodiment of the present disclosure. In the application scenario of fig. 3a, the camera 301 may send a set of user location information 303 generated from the captured video to the control terminal 302. The user location information set 303 can be seen in detail in fig. 3b. The images 3031, 3032 are used to represent the nth and (n+1)th images, respectively, in the target video frame sequence. In the image 3031, the detection boxes A and B are respectively used for representing the human body positions of users, and the detection boxes a and b are respectively used for representing the head positions of users. In the image 3032, the detection boxes A' and B' are respectively used for representing the human body positions of users, and the detection boxes a' and b' are respectively used for representing the head positions of users. The user position information in the user position information set 303 includes information (e.g., coordinates) representing the positions of the detection boxes A, B, a, b, A', B', a', b'. Then, with continued reference to fig. 3a, the control terminal 302 may determine the association degrees between the human bodies indicated by the detection boxes A and A', A and B', B and A', B and B', and the association degrees 304 between the heads indicated by the detection boxes a and a', a and b', b and a', b and b'. Next, based on the determined association degrees 304, the control terminal 302 may determine that the human bodies indicated by the detection boxes A and A' and by B and B' have an association relationship, and that the heads indicated by the detection boxes a and a' and by b and b' have an association relationship 305. Then, in response to determining that the user (for example, user x) indicated by the detection boxes A and a and by A' and a' is consistent, the control terminal 302 may generate the trajectory information of user x; in response to determining that the user (e.g., user y) indicated by the detection boxes B and b and by B' and b' is consistent, the control terminal 302 may generate trajectory information 306 of user y. Optionally, the control terminal 302 may further display the generated trajectory information 306 of the user on a display screen.
Currently, the prior art generally improves the tracking accuracy of a user by constructing a neural network with a complicated structure (such as a spatial attention map or a recurrent neural network), resulting in a need for higher-performance hardware devices (such as a high-end GPU). In the method provided by the above embodiment of the present disclosure, the tracking of the target user is decomposed into the detection and association of human body position information and local human body position information, and the trajectory information of the target user is generated according to the matching between the association results. Therefore, the complexity of the detection and tracking model is greatly reduced, and the method is particularly suitable for end devices and embedded devices in unmanned retail stores, monitoring areas and the like. In addition, information such as images does not need to be transmitted to a background server, so that network transmission traffic is saved.
With further reference to fig. 4, a flow 400 of yet another embodiment of a method for generating information is shown. The flow 400 of the method for generating information comprises the steps of:
step 401, obtaining a user position information set corresponding to a target video frame sequence.
In this embodiment, the local human body indicated by the local human body position information may include a head determined based on the head-shoulder key points. The method for determining the head based on the head-shoulder key points may adopt various existing manners, and details are not repeated herein.
Step 402, respectively determining the association between the human bodies of the users and the association between the local human bodies displayed in the target video frame sequence according to the human body position information and the local human body position information.
And step 403, determining the association relationship between the human bodies of the users and the association relationship between the local human bodies displayed in the target video frame sequence according to the determined association degree between the human bodies of the users and the association degree between the local human bodies.
Step 401, step 402, and step 403 are respectively consistent with step 201, step 202, step 203 and their optional implementation manners in the foregoing embodiment, and the above description for step 201, step 202, and step 203 also applies to step 401, step 402, and step 403, which is not described again here.
Step 404, determining an intersection ratio between the human body region indicated by the human body position information and the local human body region indicated by the local human body position information in the target video frames in the target video frame sequence.
In the present embodiment, an execution subject of the method for generating information (e.g., the terminal device 102 shown in fig. 1) may determine an intersection ratio between a human body region indicated by human body position information and a local human body region indicated by local human body position information in a target video frame in a sequence of target video frames in various ways. Specifically, for a target video frame in the sequence of target video frames, the execution subject may determine an intersection ratio between a human body region indicated by the human body position information of the target video frame and a head region indicated by the intersected local human body position information. Thus, each target video frame in the sequence of target video frames may generally correspond to at least one cross-over ratio.
Step 405, determining the distance between the human body position information meeting the preset condition and the position indicated by the local human body position information.
In this embodiment, for the at least one intersection ratio corresponding to each target video frame in the target video frame sequence, the execution subject may first select the intersection ratios satisfying a preset condition. The preset condition may include being greater than a preset intersection ratio threshold. The preset condition may also include being among the first n intersection ratios when arranged in descending order. Then, for the selected intersection ratios satisfying the preset condition, the execution subject may determine the distance between the corresponding head position and human body position. The distance between the head position and the human body position can be determined in various ways. As an example, the distance may include the difference in height between the tops of the head detection frame and the human body detection frame. As yet another example, the distance may include the distance between the center positions of the head detection frame and the human body detection frame.
Step 406, in response to determining that the determined distance satisfies a preset distance condition, generating an association relationship between the human body position information and the local human body position information.
In this embodiment, in response to determining that the distance determined in step 405 satisfies the preset distance condition, the executing entity may generate information representing that the corresponding human body position information has an association relationship with the local human body position information. The preset distance condition may include being smaller than a preset distance threshold. The preset distance condition may also include being the minimum among the determined distances. The preset distance condition may further include that a ratio corresponding to the distance is smaller than a preset ratio threshold. The ratio corresponding to the distance may include, for example, the ratio of the height difference between the tops of the head detection frame and the human body detection frame to the height of the human body detection frame. Thereby, the deviation caused by inconsistent portrait sizes displayed in the image can be reduced.
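The sketch below illustrates steps 404 to 406 for a single body/head pair; the box format (x1, y1, x2, y2) and the two thresholds are assumptions used only for illustration.

def iou(box_a, box_b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate_body_and_head(body_box, head_box,
                            iou_threshold: float = 0.05,
                            top_ratio_threshold: float = 0.15) -> bool:
    # The association relationship is generated only when the intersection ratio
    # passes the preset condition and the top-height difference, normalised by
    # the body-box height, passes the preset distance condition.
    if iou(body_box, head_box) <= iou_threshold:
        return False
    top_distance = abs(head_box[1] - body_box[1])   # difference between box tops
    body_height = body_box[3] - body_box[1]
    return top_distance / (body_height + 1e-9) < top_ratio_threshold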
Step 407, generating trajectory information of the user displayed in the target video frame sequence according to the generated association relationship.
In this embodiment, the execution subject may determine the position of the user indicated by the association relationship according to the association relationship generated in step 406. Thus, the execution subject may generate the trajectory of the indicated user according to the position of the same user indicated by the sequentially adjacent target video frames in the sequence of target video frames.
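As a final sketch (the data layout is assumed), the per-frame user positions that have already been associated with the same user can be chained across consecutive target video frames to yield the trajectory information of each user:

from collections import defaultdict
from typing import Dict, List, Tuple

Position = Tuple[float, float]

def build_trajectories(frames: List[Dict[int, Position]]) -> Dict[int, List[Tuple[int, Position]]]:
    # frames: one dict per target video frame, mapping an associated user id
    # to that user's position in the frame.
    # Returns, per user id, the list of (frame_index, position) pairs, i.e.
    # the trajectory information of the user displayed in the sequence.
    trajectories: Dict[int, List[Tuple[int, Position]]] = defaultdict(list)
    for frame_index, users in enumerate(frames):
        for user_id, position in users.items():
            trajectories[user_id].append((frame_index, position))
    return dict(trajectories)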
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the method for generating information in the present embodiment refines the step of determining the association relationship between the human body position information and the local human body position information according to the head position and the human body position of the user. Therefore, in the scheme described in this embodiment, the human body detection result and the head detection result can supplement each other, thereby improving detection accuracy and reducing the miss rate.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating information, which corresponds to the method embodiment shown in fig. 2 or fig. 4, and which may be applied in various electronic devices in particular.
As shown in fig. 5, the apparatus 500 for generating information provided by the present embodiment includes an acquisition unit 501, a first determination unit 502, a second determination unit 503, and a generation unit 504. The acquiring unit 501 is configured to acquire a user position information set corresponding to a target video frame sequence, where user position information in the user position information set is used to represent a position of a user displayed by a target video frame in the target video frame sequence, and the user position information includes human body position information and local human body position information; a first determining unit 502 configured to determine a degree of association between human bodies of users and a degree of association between local human bodies displayed in the target video frame sequence, respectively, according to the human body position information and the local human body position information; a second determining unit 503 configured to determine an association relationship between the human bodies of the users and an association relationship between the local human bodies displayed in the target video frame sequence according to the determined association degrees between the human bodies of the users and the association degrees between the local human bodies; a generating unit 504 configured to generate trajectory information of the user displayed in the sequence of target video frames in response to determining that the determined association between the local human bodies of the user matches the association between the human bodies of the user.
In the present embodiment, in the apparatus 500 for generating information: the specific processing of the obtaining unit 501, the first determining unit 502, the second determining unit 503 and the generating unit 504 and the technical effects thereof can refer to the related descriptions of step 201, step 202, step 203 and step 204 in the corresponding embodiment of fig. 2, which are not described herein again.
In some optional implementations of the present embodiment, the obtaining unit 501 may include an obtaining module (not shown in the figure) and a first generating module (not shown in the figure). Wherein the obtaining module may be configured to obtain the target video frame sequence. The first generation module may be configured to input a target video frame of the sequence of target video frames to a pre-trained user position detection model, to obtain user position information corresponding to the target video frame. The user position detection model can be used for representing the corresponding relation between the target video frame and the user position information.
In some optional implementations of the present embodiment, the first determining unit 502 may include an extracting module (not shown in the figure), a second generating module (not shown in the figure), and a first determining module (not shown in the figure). The extracting module may be configured to extract an image of a human body of the user indicated by the human body position information from the target video frame sequence as the user human body image. The second generating module may be configured to input the extracted user human body image to a pre-trained user feature extraction model, so as to obtain a user feature corresponding to the user human body image. The user feature extraction model can be used for representing the corresponding relation between the user human body image and the user features. The first determining module may be configured to determine a degree of association between human bodies of the users displayed in the target video frame sequence according to a distance between the obtained user feature and a user feature included in the trajectory obtained by trajectory prediction.
In some optional implementations of this embodiment, the user feature extraction model may be obtained by: a set of training samples is obtained. The training sample may include a human body image of the sample user and sample labeling information corresponding to the human body image of the sample user. The sample annotation information described above can be used to identify the user. And taking the human body image of the sample user of the training sample in the training sample set as input, taking the user characteristic matched with the sample marking information corresponding to the input human body image of the sample user as expected output, and training to obtain a user characteristic extraction model. The user characteristics matched with the sample labeling information corresponding to the input training sample are generally consistent with the user identified by the sample labeling information.
In some optional implementations of the present embodiment, the local human body indicated by the local human body position information may include a head determined based on the head-shoulder key points. The generating unit 504 may include: a second determining module (not shown), a third determining module (not shown), a third generating module (not shown), and a fourth generating module (not shown). The second determining module may be configured to determine an intersection ratio between a human body region indicated by the human body position information and a head region indicated by the local human body position information in the target video frames in the sequence of target video frames. The third determining module may be configured to determine, for intersection ratios satisfying the preset condition, a distance between the position indicated by the human body position information and the position indicated by the local human body position information. The third generating module may be configured to generate an association relationship between the human body position information and the local human body position information in response to determining that the determined distance satisfies a preset distance condition. The fourth generating module may be configured to generate trajectory information of the user displayed in the target video frame sequence according to the generated association relationships.
The apparatus provided in the above embodiment of the present disclosure first acquires, through the acquiring unit 501, a set of user position information corresponding to a target video frame sequence, wherein the user position information in the set is used for representing the position of a user displayed by a target video frame in the target video frame sequence and includes human body position information and local human body position information. Then, according to the human body position information and the local human body position information, the first determining unit 502 respectively determines the association degree between the human bodies of the users and the association degree between the local human bodies displayed in the target video frame sequence. Next, the second determining unit 503 determines the association relationship between the human bodies of the users and the association relationship between the local human bodies displayed in the target video frame sequence according to the determined association degrees. Finally, in response to determining that the determined association between the local human bodies of the user matches the association between the human bodies of the user, the generating unit 504 generates the trajectory information of the user displayed in the target video frame sequence. Therefore, the complexity of the detection and tracking model is greatly reduced, making the apparatus particularly suitable for end devices and embedded devices in scenarios such as unmanned retail stores and monitored areas, and also saving network transmission traffic.
Referring now to fig. 6, a schematic diagram of an electronic device (e.g., terminal device 102 of fig. 1) 600 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), and the like, and fixed terminals such as a digital TV, a desktop computer, and the like. The terminal device shown in fig. 6 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. Various programs and data necessary for the operation of the electronic device 600 are also stored in the RAM 603. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (Radio Frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or it may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a user position information set corresponding to a target video frame sequence, where user position information in the user position information set is used to represent the position of a user displayed in a target video frame of the target video frame sequence, and the user position information includes human body position information and local human body position information; determine, according to the human body position information and the local human body position information, the association degree between the human bodies of the users and the association degree between the local human bodies displayed in the target video frame sequence, respectively; determine the association relationship between the human bodies of the users and the association relationship between the local human bodies displayed in the target video frame sequence according to the determined association degree between the human bodies of the users and the association degree between the local human bodies; and, in response to determining that the determined associations between the local human bodies of the users match the associations between the human bodies of the users, generate trajectory information of the users displayed in the target video frame sequence.
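Purely as a toy illustration of how the four operations listed above could fit together, the following self-contained sketch tracks users across frames by requiring that the body-level association degree and the head-level (local human body) association degree agree before a trajectory is extended. It is a simplification under assumed inputs and thresholds (frame-to-frame IoU stands in for the association degree), not the claimed method.

def iou(a, b):
    # Intersection over union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def generate_trajectories(frames, thresh=0.3):
    # frames: per-frame lists of user position information, each entry holding a
    # 'body' box and a 'head' (local human body) box for one displayed user.
    tracks = {i: [det] for i, det in enumerate(frames[0])}   # track id -> detections
    for frame in frames[1:]:
        for det in frame:
            # Association degree at the body level and at the head level against each track.
            best = max(
                tracks,
                key=lambda t: min(iou(tracks[t][-1]['body'], det['body']),
                                  iou(tracks[t][-1]['head'], det['head'])),
                default=None,
            )
            if best is not None and (
                iou(tracks[best][-1]['body'], det['body']) > thresh
                and iou(tracks[best][-1]['head'], det['head']) > thresh
            ):
                tracks[best].append(det)       # both levels of association match: extend
            else:
                tracks[len(tracks)] = [det]    # otherwise start a new trajectory
    return tracks

frames = [
    [{'body': (100, 50, 220, 400), 'head': (140, 55, 190, 110)}],
    [{'body': (105, 52, 225, 402), 'head': (144, 57, 194, 112)}],
]
print({tid: len(dets) for tid, dets in generate_trajectories(frames).items()})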
Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including an acquisition unit, a first determination unit, a second determination unit, and a generation unit. The names of these units do not, in some cases, constitute a limitation of the units themselves; for example, the acquisition unit may also be described as "a unit for acquiring a user position information set corresponding to a target video frame sequence".
The foregoing description is only a description of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention referred to in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by mutually replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (12)

1. A method for generating information, comprising:
acquiring a user position information set corresponding to a target video frame sequence, wherein user position information in the user position information set is used for representing the position of a user displayed by a target video frame in the target video frame sequence, and the user position information comprises human body position information and local human body position information;
respectively determining the association degree between the human bodies of the users and the association degree between the local human bodies displayed in the target video frame sequence according to the human body position information and the local human body position information;
determining the association relationship between the human bodies of the users and the association relationship between the local human bodies displayed in the target video frame sequence according to the determined association degree between the human bodies of the users and the association degree between the local human bodies;
generating trajectory information for the user displayed in the sequence of target video frames in response to determining that the determined associations between the local human bodies of the user match the associations between the human bodies of the user.
2. The method of claim 1, wherein the obtaining a set of user location information corresponding to a sequence of target video frames comprises:
acquiring the target video frame sequence;
and inputting the target video frame in the target video frame sequence to a pre-trained user position detection model to obtain user position information corresponding to the target video frame, wherein the user position detection model is used for representing the corresponding relation between the target video frame and the user position information.
3. The method of claim 2, wherein the determining the association between the human bodies of the users and the association between the local human bodies displayed in the target video frame sequence according to the human body position information and the local human body position information respectively comprises:
extracting an image of a human body of the user indicated by the human body position information from the target video frame sequence as a user human body image;
inputting the extracted user human body image into a pre-trained user feature extraction model to obtain a user feature corresponding to the user human body image, wherein the user feature extraction model is used for representing the corresponding relation between the user human body image and the user feature;
and determining the association degree between the human bodies of the users displayed in the target video frame sequence according to a distance between the obtained user feature and a user feature included in a trajectory obtained based on trajectory prediction.
4. The method of claim 3, wherein the user feature extraction model is obtained by:
acquiring a training sample set, wherein the training sample comprises a sample user human body image and sample marking information corresponding to the sample user human body image, and the sample marking information is used for identifying a user;
and taking the human body image of the sample user of the training sample in the training sample set as input, taking the user characteristic matched with the sample marking information corresponding to the input human body image of the sample user as expected output, and training to obtain the user characteristic extraction model, wherein the user characteristic matched with the sample marking information corresponding to the input training sample is consistent with the user identified by the sample marking information.
5. The method according to one of claims 1-4, wherein the local body indicated by the local body position information comprises a head determined based on head-shoulder key points; and
the generating trajectory information for the user displayed in the sequence of target video frames in response to determining that the determined associations between the local human bodies of the user match the associations between the human bodies of the user comprises:
determining an intersection ratio between a human body region indicated by human body position information and a head region indicated by local human body position information in a target video frame in the sequence of target video frames;
determining, for human body position information whose intersection ratio satisfies a preset condition, a distance between the position indicated by the human body position information and the position indicated by the local human body position information;
generating an incidence relation between the human body position information and the local human body position information in response to determining that the determined distance satisfies a preset distance condition;
and generating the track information of the user displayed in the target video frame sequence according to the generated association relation.
6. An apparatus for generating information, comprising:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is configured to acquire a user position information set corresponding to a target video frame sequence, wherein the user position information in the user position information set is used for representing the position of a user displayed by a target video frame in the target video frame sequence, and the user position information comprises human body position information and local human body position information;
a first determining unit configured to determine a degree of association between human bodies of users and a degree of association between local human bodies displayed in the target video frame sequence, respectively, according to human body position information and local human body position information;
a second determining unit configured to determine an association relationship between the human bodies of the users and an association relationship between the local human bodies displayed in the target video frame sequence according to the determined association degrees between the human bodies of the users and the association degrees between the local human bodies;
a generating unit configured to generate trajectory information of the user displayed in the sequence of target video frames in response to determining that the determined association between the local human bodies of the user matches the association between the human bodies of the user.
7. The apparatus of claim 6, wherein the obtaining unit comprises:
an acquisition module configured to acquire the sequence of target video frames;
the first generation module is configured to input a target video frame in the target video frame sequence to a pre-trained user position detection model to obtain user position information corresponding to the target video frame, wherein the user position detection model is used for representing a corresponding relation between the target video frame and the user position information.
8. The apparatus of claim 7, wherein the first determining unit comprises:
an extraction module configured to extract an image of a human body of a user indicated by human body position information from the target video frame sequence as a user human body image;
the second generation module is configured to input the extracted user human body image into a pre-trained user feature extraction model to obtain a user feature corresponding to the user human body image, wherein the user feature extraction model is used for representing a corresponding relation between the user human body image and the user feature;
a first determining module configured to determine the association degree between the human bodies of the users displayed in the target video frame sequence according to a distance between the obtained user feature and a user feature included in a trajectory obtained based on trajectory prediction.
9. The apparatus of claim 8, wherein the user feature extraction model is obtained by:
acquiring a training sample set, wherein the training sample comprises a sample user human body image and sample marking information corresponding to the sample user human body image, and the sample marking information is used for identifying a user;
and taking the human body image of the sample user of the training sample in the training sample set as input, taking the user characteristic matched with the sample marking information corresponding to the input human body image of the sample user as expected output, and training to obtain the user characteristic extraction model, wherein the user characteristic matched with the sample marking information corresponding to the input training sample is consistent with the user identified by the sample marking information.
10. The apparatus according to one of claims 6-9, wherein the local human body indicated by the local human body position information comprises a head determined based on head-shoulder key points; the generation unit includes:
a second determination module configured to determine an intersection ratio between a human body region indicated by human body position information and a head region indicated by local human body position information in a target video frame of the sequence of target video frames;
a third determination module configured to determine, for human body position information whose intersection ratio satisfies a preset condition, a distance between the position indicated by the human body position information and the position indicated by the local human body position information;
a third generating module configured to generate an association between the human body position information and the local human body position information in response to determining that the determined distance satisfies a preset distance condition;
a fourth generating module configured to generate trajectory information of the user displayed in the sequence of target video frames according to the generated association relationship.
11. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN202010065727.3A 2020-01-20 2020-01-20 Method and device for generating information Active CN111310595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010065727.3A CN111310595B (en) 2020-01-20 2020-01-20 Method and device for generating information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010065727.3A CN111310595B (en) 2020-01-20 2020-01-20 Method and device for generating information

Publications (2)

Publication Number Publication Date
CN111310595A true CN111310595A (en) 2020-06-19
CN111310595B CN111310595B (en) 2023-08-25

Family

ID=71161459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010065727.3A Active CN111310595B (en) 2020-01-20 2020-01-20 Method and device for generating information

Country Status (1)

Country Link
CN (1) CN111310595B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ATA239790A (en) * 1990-11-26 1994-09-15 Truppe Michael ARRANGEMENT FOR REPRESENTING A SKULL
US20060126903A1 (en) * 2002-07-25 2006-06-15 David Sharony Imaging system and method for body condition evaluation
CN108702571A (en) * 2016-01-07 2018-10-23 诺威托系统有限公司 audio communication system and method
CN108205654A (en) * 2017-09-30 2018-06-26 北京市商汤科技开发有限公司 A kind of motion detection method and device based on video
CN110163889A (en) * 2018-10-15 2019-08-23 腾讯科技(深圳)有限公司 Method for tracking target, target tracker, target following equipment
CN109684920A (en) * 2018-11-19 2019-04-26 腾讯科技(深圳)有限公司 Localization method, image processing method, device and the storage medium of object key point
CN110425005A (en) * 2019-06-21 2019-11-08 中国矿业大学 The monitoring of transportation of belt below mine personnel's human-computer interaction behavior safety and method for early warning
CN110335291A (en) * 2019-07-01 2019-10-15 腾讯科技(深圳)有限公司 Personage's method for tracing and terminal

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
PYRY MATIKAINEN et al.: "Trajectons: Action recognition through the motion analysis of tracked features", 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 514-521 *
XIAO HUAJUN: "Research on Video Head-Shoulder Detection and Tracking Algorithms for DSP Implementation", China Master's Theses Full-text Database (Information Science and Technology), pages 138-1161 *
MA LINHONG: "Research on Next-Location Prediction Algorithms Based on the Behavioral Similarity of Moving Objects", China Master's Theses Full-text Database (Information Science and Technology), no. 01, pages 138-922 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036271A (en) * 2020-08-18 2020-12-04 汇纳科技股份有限公司 Pedestrian re-identification method, system, medium and terminal based on Kalman filtering
CN112036271B (en) * 2020-08-18 2023-10-10 汇纳科技股份有限公司 Pedestrian re-identification method, system, medium and terminal based on Kalman filtering
CN113823029A (en) * 2021-10-29 2021-12-21 北京市商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111310595B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN109584276B (en) Key point detection method, device, equipment and readable medium
CN108985259B (en) Human body action recognition method and device
EP3467707A1 (en) System and method for deep learning based hand gesture recognition in first person view
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
CN109993150B (en) Method and device for identifying age
EP3893125A1 (en) Method and apparatus for searching video segment, device, medium and computer program product
KR20190128724A (en) Target recognition methods, devices, storage media and electronic devices
EP3206163B1 (en) Image processing method, mobile device and method for generating a video image database
CN110059623B (en) Method and apparatus for generating information
CN108509921B (en) Method and apparatus for generating information
JP2017523498A (en) Eye tracking based on efficient forest sensing
CN110110666A (en) Object detection method and device
CN111368657A (en) Cow face identification method and device
CN111310595B (en) Method and device for generating information
CN111126159A (en) Method, apparatus, electronic device, and medium for tracking pedestrian in real time
CN114581525A (en) Attitude determination method and apparatus, electronic device, and storage medium
CN108038473B (en) Method and apparatus for outputting information
AU2021204584A1 (en) Methods, apparatuses, devices and storage media for detecting correlated objects involved in image
CN116824641A (en) Gesture classification method, device, equipment and computer storage medium
CN109816791B (en) Method and apparatus for generating information
CN116823884A (en) Multi-target tracking method, system, computer equipment and storage medium
CN115393423A (en) Target detection method and device
CN113703704B (en) Interface display method, head-mounted display device, and computer-readable medium
CN115690845A (en) Motion trail prediction method and device
CN112115740B (en) Method and apparatus for processing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant