CN117853573A - Video processing method, device, electronic equipment and computer readable medium - Google Patents
- Publication number
- CN117853573A (application number CN202211219337.2A)
- Authority
- CN
- China
- Prior art keywords
- video
- images
- image
- sorting
- object image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/0002—Inspection of images, e.g. flaw detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Quality & Reliability (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a video processing method, a video processing device, an electronic device and a computer readable medium, and relates to the technical field of computers. The method comprises the following steps: receiving a video processing request, acquiring a corresponding video identifier, and acquiring a video to be processed according to the video identifier; predicting an optical flow from two adjacent frame images in the video to be processed, and determining a sorted object image in the two adjacent frame images according to the optical flow; acquiring actual position coordinates corresponding to the sorted object image from each frame of the video to be processed; predicting, based on the optical flow, the sorted object image and the actual position coordinates, a sorting position area of the sorted object image in target continuous video frame images of the video to be processed; generating a state sequence based on the actual position coordinates and the sorting position area; and determining and outputting a sorting attitude identifier corresponding to the sorted object image according to the state sequence. Multiple frames are combined to judge whether abnormal sorting behavior exists, which improves the overall accuracy of abnormal sorting recognition.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a video processing method, a video processing device, an electronic device, and a computer readable medium.
Background
At present, with the development of the Internet, people's demand for online shopping keeps growing and the challenges facing the logistics industry increase accordingly; the speed of logistics is no longer the only goal, and the quality of logistics service is increasingly valued. Therefore, the operation of sorting personnel needs to be constrained by standards during sorting, so as to reduce the probability of package breakage caused by abnormal sorting events and thereby improve the quality of logistics service.
In the process of implementing the present application, the inventor finds that at least the following problems exist in the prior art:
the background of a logistics sorting site is cluttered and contains a large number of targets, so a general moving-target detection algorithm applied directly to the logistics scene generalizes poorly, and treating moving objects in different states indiscriminately leads to low recognition accuracy.
Disclosure of Invention
In view of this, the embodiments of the present application provide a video processing method, apparatus, electronic device, and computer readable medium, which can solve the following problems: the background of an existing logistics sorting site is cluttered and contains a large number of targets, so a general moving-target detection algorithm applied directly to the logistics scene generalizes poorly, and treating moving objects in different states indiscriminately leads to low recognition accuracy.
To achieve the above object, according to one aspect of the embodiments of the present application, there is provided a video processing method, including:
receiving a video processing request, acquiring a corresponding video identifier, and further acquiring a video to be processed according to the video identifier;
invoking an unsupervised optical flow estimation network to predict and obtain optical flows according to two adjacent frames of images in the video to be processed, and further determining sorted object images in the two adjacent frames of images according to the optical flows;
acquiring actual position coordinates corresponding to the sorted object images from each frame of the video to be processed;
predicting and obtaining a sorting position area of the sorted object image in a target continuous video frame image of the video to be processed based on the optical flow, the sorted object image and the actual position coordinate;
generating a state sequence based on the actual location coordinates and the sort location area;
and determining and outputting a sorting attitude identifier corresponding to the sorted object image according to the state sequence.
Optionally, determining the sorted object image in the two adjacent frames of images according to the optical flow includes:
determining moving object images in two adjacent frames of images according to the optical flow, and generating mask images according to the moving object images;
the mask image is input into the detection model to output a sorted object image.
Optionally, inputting the mask image into a detection model to output a sorted object image, comprising:
classifying each moving object image in the mask image, and further generating a classification identifier corresponding to each moving object image;
and determining the moving object image whose classification identifier corresponds to the sorted object as the sorted object image and outputting it.
Optionally, before predicting the sorting location area of the sorted object image in the target continuous video frame image of the video to be processed, the method further comprises:
acquiring a preset number of continuous video frame images after two adjacent frame images in a video to be processed;
a preset number of consecutive video frame images is determined as target consecutive video frame images.
Optionally, predicting, based on the optical flow, the sorted object image and the actual position coordinates, a sorting position area of the sorted object image in the target continuous video frame image of the video to be processed includes:
determining the offset and the displacement angle of each pixel point on the sorted object image according to the optical flow;
in response to detecting the sorted object image in both of the adjacent two frame images, a sorting location area of the sorted object image in the target successive video frame images of the video to be processed is determined based on the actual location coordinates, offset and displacement angle of the sorted object image in the adjacent two frame images.
Optionally, generating the sequence of states based on the actual location coordinates and the sorting location area includes:
determining a state identifier corresponding to the actual position coordinate as a first identifier in response to the actual position coordinate being located in the sorting position area;
determining a state identifier corresponding to the actual position coordinate as a second identifier in response to the actual position coordinate not being located in the sorting position area;
a sequence of states is generated from the first and second identifications in response to the presence of the actual location coordinate within the sort location region and the presence of the actual location coordinate not within the sort location region.
Optionally, determining, according to the state sequence, a sorting attitude identifier corresponding to the sorted object image includes:
determining the number of continuous preset identifiers in a state sequence;
and determining the sorting attitude identification corresponding to the sorted object images as the abnormal sorting identification in response to the number being larger than a preset threshold.
Optionally, before invoking the unsupervised optical flow estimation network, the method further comprises:
acquiring an initial neural network;
acquiring two adjacent frames of images in a video;
calling an initial neural network to obtain a corresponding optical flow based on two adjacent frames of images;
generating a composite image by means of deformation between the optical flow and the next image in the two adjacent images, further calculating the luminosity error between the previous image in the two adjacent images and the composite image, and determining the luminosity error as a loss function of the initial neural network;
And performing iterative training on the initial neural network based on the loss function, thereby obtaining an unsupervised optical flow estimation network.
In addition, the application also provides a video processing device, which comprises:
the receiving unit is configured to receive the video processing request, acquire the corresponding video identification and further acquire the video to be processed according to the video identification;
the sorted object image determining unit is configured to call an unsupervised optical flow estimating network to predict and obtain optical flows according to two adjacent frame images in the video to be processed, and further determine the sorted object images in the two adjacent frame images according to the optical flows;
the acquisition unit is configured to acquire actual position coordinates corresponding to the sorted object images from each frame of the video to be processed;
the prediction unit is configured to predict and obtain a sorting position area of the sorted object image in the target continuous video frame image of the video to be processed based on the optical flow, the sorted object image and the actual position coordinates;
a state sequence generating unit configured to generate a state sequence based on the actual position coordinates and the sorting position area;
and the sorting attitude identification determining unit is configured to determine and output the sorting attitude identification corresponding to the sorted object image according to the state sequence.
Optionally, the sorted object image determination unit is further configured to:
determining moving object images in two adjacent frames of images according to the optical flow, and generating mask images according to the moving object images;
the mask image is input into the detection model to output a sorted object image.
Optionally, the sorted object image determination unit is further configured to:
classifying each moving object image in the mask image, and further generating a classification identifier corresponding to each moving object image;
and determining the moving object image whose classification identifier corresponds to the sorted object as the sorted object image and outputting it.
Optionally, the prediction unit is further configured to:
acquiring a preset number of continuous video frame images after two adjacent frame images in a video to be processed;
a preset number of consecutive video frame images is determined as target consecutive video frame images.
Optionally, the prediction unit is further configured to:
determining the offset and the displacement angle of each pixel point on the sorted object image according to the optical flow;
in response to detecting the sorted object image in both of the adjacent two frame images, a sorting location area of the sorted object image in the target successive video frame images of the video to be processed is determined based on the actual location coordinates, offset and displacement angle of the sorted object image in the adjacent two frame images.
Optionally, the state sequence generation unit is further configured to:
Determining a state identifier corresponding to the actual position coordinate as a first identifier in response to the actual position coordinate being located in the sorting position area;
determining a state identifier corresponding to the actual position coordinate as a second identifier in response to the actual position coordinate not being located in the sorting position area;
a sequence of states is generated from the first and second identifications in response to the presence of the actual location coordinate within the sort location region and the presence of the actual location coordinate not within the sort location region.
Optionally, the sorting attitude identification determination unit is further configured to:
determining the number of continuous preset identifiers in a state sequence;
and determining the sorting attitude identification corresponding to the sorted object images as the abnormal sorting identification in response to the number being larger than a preset threshold.
Optionally, the video processing apparatus further comprises a model training unit configured to:
acquiring an initial neural network;
acquiring two adjacent frames of images in a video;
calling an initial neural network to obtain a corresponding optical flow based on two adjacent frames of images;
generating a composite image by means of deformation between the optical flow and the next image in the two adjacent images, further calculating the luminosity error between the previous image in the two adjacent images and the composite image, and determining the luminosity error as a loss function of the initial neural network;
And performing iterative training on the initial neural network based on the loss function, thereby obtaining an unsupervised optical flow estimation network.
In addition, the application also provides a video processing electronic device, which comprises: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the video processing method as described above.
In addition, the application also provides a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the video processing method as described above.
One embodiment of the above invention has the following advantages or benefits: the method comprises the steps of obtaining a corresponding video identifier by receiving a video processing request, and obtaining a video to be processed according to the video identifier; invoking an unsupervised optical flow estimation network to predict and obtain optical flows according to two adjacent frames of images in the video to be processed, and further determining sorted object images in the two adjacent frames of images according to the optical flows; acquiring actual position coordinates corresponding to the sorted object images from each frame of the video to be processed; predicting and obtaining a sorting position area of the sorted object image in a target continuous video frame image of the video to be processed based on the optical flow, the sorted object image and the actual position coordinate; generating a state sequence based on the actual location coordinates and the sort location area; and determining and outputting a sorting attitude identifier corresponding to the sorted object image according to the state sequence. The actual position of the sorted object in each frame is compared with the predicted sorting position area to obtain a state sequence, and a plurality of frames are combined together to judge whether abnormal sorting behaviors exist or not, so that the overall abnormal sorting recognition accuracy is improved.
Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as unduly limiting the present application. Wherein:
fig. 1 is a schematic diagram of the main flow of a video processing method according to a first embodiment of the present application;
fig. 2 is a schematic diagram of the main flow of a video processing method according to a second embodiment of the present application;
fig. 3 is a schematic flow chart of a video processing method according to a third embodiment of the present application;
FIG. 4 is a logic diagram of an unsupervised optical flow estimation algorithm for a video processing method according to a fourth embodiment of the present application;
FIG. 5 is a visualized optical flow grayscale map from the t-th frame to the (t+1)-th frame of a video to be processed according to a video processing method of a fifth embodiment of the present application;
fig. 6 is a moving object image schematic diagram of a t-th frame image of a video to be processed according to a video processing method of a sixth embodiment of the present application;
fig. 7 is a schematic diagram of a network structure of a detection model of a video processing method according to a seventh embodiment of the present application;
FIG. 8 is a schematic illustration of the predicted position of the sorted parcel in the (t+1)-th frame of the video to be processed according to a video processing method of an eighth embodiment of the present application;
Fig. 9 is a schematic diagram of main units of a video processing apparatus according to an embodiment of the present application;
FIG. 10 is an exemplary system architecture diagram in which embodiments of the present application may be applied;
fig. 11 is a schematic diagram of a computer system suitable for use in implementing the terminal device or server of the embodiments of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness. The data acquisition, storage, use, processing and the like in the technical scheme meet the relevant regulations of national laws and regulations.
Fig. 1 is a schematic diagram of main flow of a video processing method according to a first embodiment of the present application, and as shown in fig. 1, the video processing method includes:
step S101, receiving a video processing request, obtaining a corresponding video identifier, and further obtaining a video to be processed according to the video identifier.
In this embodiment, the execution body (for example, may be a server) of the video processing method may receive the video processing request by means of a wired connection or a wireless connection. The video processing request may be, for example, a request for identifying an object (e.g., a package) that is abnormally sorted from the video, or a request for determining whether there is abnormal sorting behavior in the video. A video identification may be carried in the request, which video identification is used to characterize which video is processed. The executing body can acquire the corresponding video to be processed from the video library according to the video identification carried in the video processing request.
Step S102, an unsupervised optical flow estimation network is called to predict and obtain optical flows according to two adjacent frames of images in the video to be processed, and then the sorted object images in the two adjacent frames of images are determined according to the optical flows.
The execution body may group the video frames in the video to be processed, with two adjacent video frames forming one group. The execution body may then invoke the unsupervised optical flow estimation network to calculate the optical flow corresponding to each group of two adjacent video frames. Optical flow can be understood as the projection of the movement of an object in three-dimensional space onto the two-dimensional image plane. Specifically, the offset mag and the displacement angle of each pixel point corresponding to a moving object in the video image are calculated from the x-direction displacement matrix and the y-direction displacement matrix corresponding to the optical flow. Then, the execution body may determine the sorted object image in the corresponding two adjacent frame images according to the offset mag and the displacement angle of each pixel point. It can be understood that the sorted object image may be an image composed of pixels whose offset and displacement angle are greater than the respective corresponding thresholds.
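By way of illustration only (this sketch is not part of the claimed implementation), the per-pixel offset mag and displacement angle can be computed from the x-direction and y-direction displacement matrices u and v roughly as follows; the threshold values t_mag and t_angle are hypothetical placeholders.

```python
import numpy as np

def motion_magnitude_and_angle(u: np.ndarray, v: np.ndarray):
    """Per-pixel offset (mag) and displacement angle from the x-direction
    displacement matrix u and the y-direction displacement matrix v."""
    mag = np.sqrt(u ** 2 + v ** 2)                 # offset of each pixel
    angle = np.degrees(np.arctan2(v, u)) % 360.0   # displacement angle in [0, 360)
    return mag, angle

def candidate_sorted_pixels(mag, angle, t_mag=2.0, t_angle=0.0):
    """Pixels whose offset and displacement angle exceed their (hypothetical)
    thresholds, returned as a boolean mask of candidate sorted-object pixels."""
    return (mag > t_mag) & (angle > t_angle)
```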
Step S103, acquiring actual position coordinates corresponding to the sorted object images from each frame of the video to be processed.
After determining the sorted object, the execution subject may acquire corresponding actual position coordinates from each video frame of the corresponding video to be processed based on the sorted object.
Step S104, based on the optical flow, the sorted object image and the actual position coordinates, a sorting position area of the sorted object image in the target continuous video frame image of the video to be processed is obtained through prediction.
For example, the t-th frame image and the (t+1)-th frame image of the video to be processed form a group of two adjacent video frames, and the target continuous video frame images corresponding to these two adjacent video frames may be the (t+2)-th, (t+3)-th, (t+4)-th, ..., (t+n)-th frame images, where n is any positive integer.
For example, the execution body may calculate the possible position areas of the sorted object in the (t+2)-th, (t+3)-th, (t+4)-th, ..., (t+n)-th frames according to the offset and the displacement angle of the pixel points of the sorted object in the optical flows corresponding to each group of two adjacent video frames, and according to the actual position coordinates of the sorted object in the t-th frame and the (t+1)-th frame; these are the predicted sorting position areas of the sorted object image in the target continuous video frame images of the video to be processed.
Specifically, before predicting a sorting position area of the sorted object image in the target continuous video frame image of the video to be processed, the method further comprises:
acquiring a preset number of continuous video frame images (for example, the (t+2)-th, (t+3)-th, (t+4)-th, ..., (t+n)-th frame images) after the two adjacent frame images (for example, the t-th frame image and the (t+1)-th frame image of the video to be processed) in the video to be processed; and determining the preset number of continuous video frame images as the target continuous video frame images, namely the (t+2)-th, (t+3)-th, (t+4)-th, ..., (t+n)-th frame images, where n is any positive integer.
Specifically, based on the optical flow, the sorted object image and the actual position coordinates, predicting the sorting position area of the sorted object image in the target continuous video frame images of the video to be processed includes the following steps: determining the offset and the displacement angle of each pixel point on the sorted object image according to the optical flow; and, in response to detecting the sorted object image in both of the two adjacent frame images (e.g., the t-th frame image and the (t+1)-th frame image of the video to be processed), determining the sorting position area of the sorted object image in the target continuous video frame images (e.g., the (t+2)-th, (t+3)-th, (t+4)-th, ..., (t+n)-th frame images of the video to be processed, where n is any positive integer) based on the actual position coordinates, offsets and displacement angles of the sorted object image in the two adjacent frame images. Specifically, when the object is in the sorted state in the current frame (which may be the t-th frame or the (t+1)-th frame), the execution body may determine the motion direction and speed of the sorted object in the current frame according to the actual position coordinates of the sorted object in the current frame and the offsets and displacement angles of the pixel points on the sorted object, and predict the possible position area of the sorted object in the frame following the current frame according to that motion direction and speed, continuing until the possible position area of the sorted object in the (t+n)-th frame image (i.e., the predicted sorting position area) has been predicted.
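A minimal sketch of this prediction step, assuming the optical flow has already been converted into the per-pixel offset mag and displacement angle; the function name, the expansion factor and the vector-averaging of the object's pixel motions are illustrative assumptions rather than the patent's exact procedure.

```python
import numpy as np

def predict_sorting_region(bbox, mag, angle_deg, obj_mask, n_frames=1, expand=1.2):
    """Predict the possible position area of the sorted object n_frames later.

    bbox: actual coordinates (x1, y1, x2, y2) of the sorted object in the current frame.
    mag, angle_deg: per-pixel offset and displacement angle from the optical flow.
    obj_mask: boolean mask of the pixels belonging to the sorted object.
    The object's pixel motions are vector-averaged to obtain its direction and speed;
    the box is shifted accordingly and slightly enlarged to tolerate prediction error."""
    theta = np.radians(angle_deg[obj_mask])
    dx = float(np.mean(mag[obj_mask] * np.cos(theta))) * n_frames
    dy = float(np.mean(mag[obj_mask] * np.sin(theta))) * n_frames

    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2 + dx, (y1 + y2) / 2 + dy
    w, h = (x2 - x1) * expand, (y2 - y1) * expand
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```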
Step S105, generating a state sequence based on the actual position coordinates and the sort position region.
Specifically, generating a sequence of states based on the actual location coordinates and the sort location area, comprising: the execution body responds to the fact that the actual position coordinates are located in the sorting position area (namely, the sorted object images can be detected in the sorting position area), and the state identifier corresponding to the actual position coordinates is determined to be a first identifier; the first identifier may be 1, 0, any other number, letter, picture, etc. The embodiment of the application does not limit the specific content of the first identifier. In the embodiment of the present application, taking the first identifier as 1 as an example, as long as the execution body detects the sorted object image in the sorting position area, the state identifier is 1, and otherwise is 0.
And the execution body determines the state identifier corresponding to the actual position coordinate as a second identifier in response to the actual position coordinate not being positioned in the sorting position area. The second identifier may be 0, 1, any other number, letter, picture, or the like. Taking the second identifier as 0 as an example, the execution subject may determine that the state identifier is 0 as long as the execution subject determines that the actual position coordinate is not located in the sorting position area, that is, the actual position coordinate is not located in the sorting position area, and otherwise is 1.
In response to the presence of the actual location coordinate within the sort location region and the presence of the actual location coordinate not within the sort location region, a sequence of states (e.g., [1, 0 ]) is generated based on the first identification (e.g., 1) and the second identification (e.g., 0).
For example, when some of the predicted sorting position areas in the target continuous video frame images contain the sorted object image and other predicted sorting position areas do not, the execution subject may set the state identifier of a video frame image containing the sorted object image to 1 (the embodiment of the present application does not specifically limit the content of this state identifier, but it is the same for all video frame images containing the sorted object image) and set the state identifier of a video frame image not containing the sorted object image to 0 (likewise, the content of this state identifier is not specifically limited, but it is the same for all video frame images not containing the sorted object image). After the execution subject determines the corresponding state identifiers for the target continuous video frame images, it can generate the state sequence [1, 1, 1, 1, 0] from the corresponding state identifiers (e.g., state_{t+2} = 1, state_{t+3} = 1, state_{t+4} = 1, state_{t+5} = 1, state_{t+6} = 0) in the order of the video frame images (e.g., video frames t+2, t+3, t+4, t+5, t+6) in the target continuous video frame images.
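As a hedged illustration of how the first identifier (1) and second identifier (0) could be assembled into such a state sequence, assuming per-frame detection results and predicted sorting position areas are available; the helper names are hypothetical:

```python
def center_in_region(bbox, region):
    """True if the center of the detected bbox lies inside the predicted region."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    rx1, ry1, rx2, ry2 = region
    return rx1 <= cx <= rx2 and ry1 <= cy <= ry2

def build_state_sequence(detections, regions):
    """detections[i] is the actual bbox detected in target frame i (or None);
    regions[i] is the predicted sorting position area for that frame.
    The state is 1 (first identifier) when the object is found inside the
    predicted area, otherwise 0 (second identifier)."""
    return [1 if det is not None and center_in_region(det, reg) else 0
            for det, reg in zip(detections, regions)]
```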
And step S106, determining and outputting a sorting attitude identifier corresponding to the sorted object image according to the state sequence.
Specifically, determining, according to the state sequence, the sorting attitude identifier corresponding to the sorted object image includes: determining the number of consecutive preset identifiers (e.g., 1 or 0) in the state sequence; and, in response to that number being greater than a preset threshold (e.g., 3) — that is, the number of consecutive 1s in the state sequence is greater than 3, which indicates that the same sorted object appears in more than three consecutive frames and therefore that the sorting action is intense — determining that the sorting attitude identifier corresponding to the sorted object image is the abnormal sorting identifier.
In the state sequence of the embodiment of the present application, the preset identifier is, for example, 1, and the consecutive preset identifiers are, for example, consecutive 1s.
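A small sketch of the run-length check described above; the preset identifier 1 and the threshold 3 follow the example in the text, while the function names are hypothetical.

```python
def longest_run(states, preset=1):
    """Length of the longest run of the preset identifier in the state sequence."""
    best = cur = 0
    for s in states:
        cur = cur + 1 if s == preset else 0
        best = max(best, cur)
    return best

def is_abnormal_sorting(states, threshold=3):
    """Abnormal sorting identifier when the run of consecutive 1s exceeds the threshold."""
    return longest_run(states, preset=1) > threshold

# e.g. is_abnormal_sorting([1, 1, 1, 1, 0], threshold=3) -> True
```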
According to the embodiment, the corresponding video identification is obtained by receiving the video processing request, and then the video to be processed is obtained according to the video identification; invoking an unsupervised optical flow estimation network to predict and obtain optical flows according to two adjacent frames of images in the video to be processed, and further determining sorted object images in the two adjacent frames of images according to the optical flows; acquiring actual position coordinates corresponding to the sorted object images from each frame of the video to be processed; predicting and obtaining a sorting position area of the sorted object image in a target continuous video frame image of the video to be processed based on the optical flow, the sorted object image and the actual position coordinate; generating a state sequence based on the actual location coordinates and the sort location area; and determining and outputting a sorting attitude identifier corresponding to the sorted object image according to the state sequence. The actual position of the sorted object in each frame is compared with the predicted sorting position area to obtain a state sequence, and a plurality of frames are combined together to judge whether abnormal sorting behaviors exist or not, so that the overall abnormal sorting recognition accuracy is improved.
Fig. 2 is a schematic flow chart of a video processing method according to a second embodiment of the present application, and as shown in fig. 2, the video processing method includes:
step S201, receiving a video processing request, obtaining a corresponding video identifier, and further obtaining a video to be processed according to the video identifier.
Step S202, call an unsupervised optical flow estimation network to predict and obtain the optical flow according to the two adjacent frames of images in the video to be processed.
Specifically, before invoking the unsupervised optical flow estimation network, the method further comprises:
an initial neural network is acquired, which may be an AR-Flow network.
And acquiring two adjacent frames of images in the video. The execution body may crawl real video data from a large number of sorting sites for training, so as to obtain an unsupervised optical flow estimation network, which may be an AR-Flow network. Because no real ground-truth labels are needed to train the unsupervised optical flow estimation network, no data labeling work is added compared with a supervised optical flow estimation network, which reduces the data labeling difficulty in the early stage.
And calling the initial neural network to obtain a corresponding optical flow based on the two adjacent frames of images. Due to the lack of ground truth, the initial neural network is trained implicitly using an image synthesis method, as shown in fig. 4, performing unsupervised training with the photometric errors on the collected sorting dataset. The unsupervised optical flow estimation network may be denoted as F(·); from two consecutive frame images I_1 and I_2 and the trainable parameter set θ of the network (e.g., the weights and biases of the convolutional layers in the network), the predicted optical flow flow_{1,2} = F(I_1, I_2; θ) can be calculated.
Generating a composite image by means of deformation between the optical flow and the next image in the two adjacent images, further calculating the luminosity error between the previous image in the two adjacent images and the composite image, and determining the luminosity error as a loss function of the initial neural network.
And performing iterative training on the initial neural network based on the loss function, thereby obtaining an unsupervised optical flow estimation network.
For example, the predicted optical flow flow_{1,2} and the next frame image I_2 can be used to synthesize an image I_1' by means of warping; the photometric error between the previous frame image I_1 and the composite image I_1' is then calculated, and the unsupervised optical flow estimation network is trained using this photometric error as the loss function, finally obtaining the optical flow estimation network. After training, the video frames at time t and time t+1 of the video to be identified are taken together as the input of the optical flow estimation network, so as to obtain a predicted optical flow flow_{t,t+1}. Optical flow is a concept related to object motion; because each pixel has a displacement in the x-direction and in the y-direction, it is expressed as a two-dimensional vector, and the predicted optical flow flow_{t,t+1} is a two-channel image of the same size as the original image, which can be represented by an x-direction displacement matrix u and a y-direction displacement matrix v.
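The warping-plus-photometric-error training described above could be sketched as follows in PyTorch; this is an assumed implementation for illustration (the patent does not specify the framework), and occlusion handling, smoothness terms and other refinements of AR-Flow are omitted.

```python
import torch
import torch.nn.functional as F

def warp(img2: torch.Tensor, flow12: torch.Tensor) -> torch.Tensor:
    """Warp frame I2 back toward frame I1 using the predicted flow (B, 2, H, W):
    I1'(p) = I2(p + flow(p)), sampled with a normalized grid for grid_sample."""
    b, _, h, w = img2.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=img2.dtype, device=img2.device),
        torch.arange(w, dtype=img2.dtype, device=img2.device),
        indexing="ij")
    grid_x = xs.unsqueeze(0) + flow12[:, 0]            # x + u
    grid_y = ys.unsqueeze(0) + flow12[:, 1]            # y + v
    grid = torch.stack((2.0 * grid_x / (w - 1) - 1.0,  # normalize to [-1, 1]
                        2.0 * grid_y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(img2, grid, align_corners=True)

def photometric_loss(img1, img2, flow12):
    """Photometric error between I1 and the composite image I1' = warp(I2, flow)."""
    return (img1 - warp(img2, flow12)).abs().mean()

# Training loop sketch (assumed): for each pair of adjacent frames,
#   flow = model(img1, img2); loss = photometric_loss(img1, img2, flow); loss.backward()
```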
Step S203, a moving object image in the two adjacent frames of images is determined according to the optical flow, and then a mask image is generated according to the moving object image.
Static objects are filtered out using the optical flow flow_{t,t+1}, leaving only the moving objects in the video image. Specifically, the offset mag and the displacement angle of each pixel point are first calculated from u and v.
A threshold t_mag is set to filter out pixels whose offset is too small; that is, only pixels with mag > t_mag are retained, which yields an image of the faster-moving regions. Entries of the offset matrix mag (which stores the offset of each pixel) that are smaller than the threshold t_mag are set to 0.
Next, the minimum bounding rectangle of each optical-flow connected domain is calculated, in order to obtain a rectangular frame covering all pixels with a larger offset in a given area (used subsequently to obtain fig. 6).
Rectangles that are close to or intersect one another are then merged, rectangles whose area is too small are deleted, and the values outside/inside the rectangles are set to 255/0 (the image has three channels, each with a value range of [0, 255]; when all three channel values are 255 the pixel is white; the purpose of this step is to keep the color of the pixels with a large offset while the other areas become white, yielding fig. 6), which gives the mask matrix.
Finally, the mask matrix is added to the three channels of the t-th frame image of the video to be identified, with all values kept in the range [0, 255]; fig. 6 is the resulting mask image containing only the moving targets 601 and 602.
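The mask-generation steps above could look roughly like the following OpenCV/NumPy sketch; the threshold, the minimum-area value and the omission of the rectangle-merging step are simplifying assumptions.

```python
import cv2
import numpy as np

def build_moving_object_mask(frame_t: np.ndarray, mag: np.ndarray,
                             t_mag: float = 2.0, min_area: int = 400) -> np.ndarray:
    """Keep the colors of fast-moving regions of frame t and turn everything
    else white, roughly following the steps described above."""
    moving = (mag > t_mag).astype(np.uint8)                 # zero out small offsets
    num, labels, stats, _ = cv2.connectedComponentsWithStats(moving)

    mask = np.full(frame_t.shape[:2], 255, dtype=np.uint8)  # 255 outside rectangles
    for i in range(1, num):                                  # label 0 is the background
        x, y, w, h, area = stats[i]                          # bounding rect of a component
        if area < min_area:                                  # drop rectangles that are too small
            continue
        mask[y:y + h, x:x + w] = 0                           # 0 inside kept rectangles

    # Add the mask to all three channels: moving regions keep their color,
    # the rest saturates to white.
    return np.clip(frame_t.astype(np.int32) + mask[..., None], 0, 255).astype(np.uint8)
```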
Fig. 5 is a grayscale rendering of an optical flow image; the colors at 501 (a blue area), 502, 503 and 504 differ in the corresponding color image, and in the optical flow image the color and the brightness represent the direction and the magnitude of the optical flow, respectively. For example, when the object at 501 in the actual video frame image moves toward the upper left, the blue region 501 appears in the corresponding color version of fig. 5; because the movement speed of the human body there is faster, its color is more pronounced and its brightness higher than those of the other color blocks 502 (light red), 503 (light purple) and 504 (light green). In the optical flow diagram, the faster an object moves, the more vivid and brighter its color.
Step S204, inputting the mask image into the detection model to output the sorted object image.
The detection model is a network capable of detecting a moving package and its state. A state recognition branch is added to a basic target detection network; the target detection network may be a yolov5 network, and the modified network structure is shown in fig. 7. The input image is an image containing only moving objects after the processing shown in fig. 6, from which visual features such as edges and corners are extracted. Target localization, target classification and state recognition are then performed based on these visual features to predict the bounding boxes, categories and states of the targets.
The package status of the motion can be classified into three categories according to the actual sorting scenario: sorted packages, packages in a human hand, packages on a conveyor belt. Labeling the three dimensions of the position, the category and the state of the package in the three states, and training a detection model based on the detection network and labeled data. The detection model can identify and position the package in the picture, and can also identify three states of the moving package.
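To illustrate the idea of adding a state recognition branch (this is not the actual modified yolov5 structure of fig. 7), a toy detection head with box, category and state outputs might look like the following; channel counts and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class MovingParcelHead(nn.Module):
    """Illustrative detection head with an extra state-recognition branch.
    Besides the usual box-regression and classification outputs, a third branch
    predicts the parcel state: sorted / in a human hand / on a conveyor belt."""
    def __init__(self, in_channels: int = 256, num_classes: int = 2, num_states: int = 3):
        super().__init__()
        self.box_branch = nn.Conv2d(in_channels, 4, kernel_size=1)            # bounding box
        self.cls_branch = nn.Conv2d(in_channels, num_classes, kernel_size=1)  # category
        self.state_branch = nn.Conv2d(in_channels, num_states, kernel_size=1) # state

    def forward(self, feat: torch.Tensor):
        return self.box_branch(feat), self.cls_branch(feat), self.state_branch(feat)
```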
Specifically, inputting the mask image into the detection model to output a sorted object image includes: classifying each moving object image in the mask image, and further generating a classification identifier corresponding to each moving object image; and determining the moving object image whose classification identifier corresponds to the sorted object as the sorted object image and outputting it.
The mask image may contain a plurality of moving objects, such as 601 and 602 in fig. 6. After the mask image is input into the detection model, the detection model needs to determine the category of each moving object image in the input mask image, so as to determine whether each moving object image is a sorted object image (e.g., a sorted express parcel). If a moving object image is a sorted object image, the execution subject may generate a classification identifier corresponding to sorting, such as PR, and bind the classification identifier with the corresponding moving object image identified as the sorted object image and output them together.
In step S205, the actual position coordinates corresponding to the sorted object image are obtained from each frame of the video to be processed.
Step S206, based on the optical flow, the sorted object image and the actual position coordinates, the sorting position area of the sorted object image in the target continuous video frame image of the video to be processed is predicted.
Step S207 generates a state sequence based on the actual position coordinates and the sort position region.
And step S208, determining and outputting a sorting attitude identifier corresponding to the sorted object image according to the state sequence.
The embodiment of the application combines an unsupervised optical flow estimation network and a target detection network with an added state identification branch, and is used for detecting moving targets and identifying abnormal sorting behaviors in a sorting scene. The model is trained on the actual sorting video data by using the unsupervised optical flow estimation algorithm, so that the generalization performance of the optical flow estimation algorithm on the actual sorting field data and the accuracy of predicting the optical flow are improved. In order to avoid that the motion packages except the sorted packages influence the identification result, a package state identification branch is additionally added in the target detection network so as to distinguish the packages in three states, and the method can reduce the misjudgment condition. Whether the current frame contains the sorted packages or not is determined through a detection network, coordinates of the sorted packages are recorded when the sorted packages exist, the moving direction of the sorted packages is calculated through optical flow, the position area of the sorted packages in the next frame is predicted jointly through the coordinates and the moving direction, whether abnormal sorting behaviors exist or not is judged jointly through multiple frames, and overall recognition accuracy is improved.
According to the embodiment of the application, unsupervised learning is introduced to make effective use of the collected sorting video data, which improves the generalization performance of the optical flow estimation algorithm in the actual logistics scene and the accuracy of the predicted optical flow; only the regions with significant movement are retained, removing the interference of static packages with identification. The predicted optical flow does not need to be converted into an RGB image: the predicted optical flow data is used directly for calculation, and after the pixels with a small offset are filtered out only objects with obvious motion remain, which filters out the background and static objects and makes processing faster. The basic target detection network is improved, and a target detection network capable of detecting the moving package and its state is trained, so that moving packages other than the sorted package do not affect the identification result. The position area of the sorted package in the next frame is predicted from the recognition result of the current frame combined with the optical flow data, and the recognition results of continuous multiple frames jointly determine whether the video to be recognized contains abnormal sorting behavior, improving the overall recognition accuracy. The scheme also focuses on the actual impact of abnormal sorting actions: some slight abnormal sorting actions may have little effect on package breakage, and the sensitivity of abnormal sorting recognition can be adjusted by modifying the threshold.
In sum, when the method is applied to the express sorting scene, abnormal sorting behaviors caused by sorting personnel can be monitored and early-warned in real time, the probability of package breakage is further reduced, and compared with the existing method, the method is higher in recognition accuracy.
Fig. 3 is an application scenario diagram of a video processing method according to a third embodiment of the present application. The video processing method is applied to a logistics sorting site provided with video monitoring equipment and capable of uploading video data to a cloud end, and comprises the steps of but not limited to a warehouse unloading area, a sorting area, logistics network points and the like, the abnormal sorting behaviors of sorting personnel are grabbed through the video data of the cloud end and early warning is carried out, so that the behaviors of the sorting personnel are standardized, the probability of package damage caused by the abnormal sorting behaviors is reduced, and the problems of misleakage and misjudgment in the existing method for judging abnormal sorting actions of express personnel according to monitoring videos are solved. The video processing method of the embodiment of the application comprises two parts: the first part is to collect real data of a sorting site to train an unsupervised optical flow estimation network, wherein the optical flow of two adjacent frames of video segments is predicted by the network to be used for screening out a moving target; and a second part, adding a state identification branch on the basis of a target detection network, detecting the moving packages in three different states in the moving target, predicting the position area of the next frame of packages according to the position and the moving direction of the sorted packages, and judging whether the video segment to be identified contains abnormal sorting behaviors or not by using continuous multiframes.
As shown in fig. 3, in the first part, the execution subject inputs two adjacent frames of images into the unsupervised optical flow estimation network to obtain a corresponding optical flow, and further obtains an image (for example, a moving parcel image) only including a moving object based on the optical flow; the second part is executed, specifically, the offset and the displacement angle of each pixel point in the image (for example, the image of the moving parcel) only containing the moving object are obtained according to the optical flow, so as to obtain the predicted area of the moving parcel of the next frame (namely, the predicted landing area of the parcel image of the next frame in the next frame), and then the image determined according to the optical flow only containing the moving object is input into a network (for example, a detection model) capable of detecting the moving parcel and the state so as to obtain the position and the state of the moving parcel of the previous frame (namely, the position and the state of the parcel image of the previous frame in the image of the previous frame). And determining whether the motion package (namely the motion target) is abnormally sorted according to the obtained position and state of the motion package of the previous frame and the prediction area of the motion package of the next frame.
In the embodiment of the present application, by way of example, the motion package detection result is that each frame of the video to be processed is marked with a state, and the initial state of each frame is 0. When a sorted parcel (in the embodiments of the application, this always refers to the image of the sorted parcel) is first detected in the t-th frame, state_t = 1, and the location of the sorted parcel is recorded. The motion direction of all pixel points occupied by a single package in the image (in the embodiment of the application, the single package image) should be consistent. Given the coordinates (x_1, y_1, x_2, y_2) of the sorted package in the t-th frame, the direction and speed of the package motion can be obtained from these coordinates together with the offset mag and the displacement angle of each pixel point calculated in the first part of fig. 3, and the possible position area (x'_1, y'_1, x'_2, y'_2) of the sorted package in the (t+1)-th frame can be predicted.
The predicted location area of the parcel in the (t+1)-th frame is shown in fig. 8. If the sorted parcel is detected again in the prediction area, then state_{t+1} = 1, and the location of the sorted parcel is recorded.
The above step is repeated until no sorted package is detected in the predicted area of the (t+n)-th frame, at which point state_{t+n} = 0. After every frame of the video to be identified (of length d) is marked with a state, a state sequence can be obtained:
[state_1, …, state_t, state_{t+1}, …, state_{t+n-1}, state_{t+n}, …, state_d] = [0, …, 1, 1, …, 1, 0, …, 0]
The more severe the abnormal sorting behavior, the greater the value of n; the length n of the run of consecutive 1s in the state sequence is counted. A threshold is set according to the sensitivity required for abnormal sorting identification, and when n is greater than the threshold, the video to be identified is judged to contain abnormal sorting behavior.
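Putting the second part together, a hedged end-to-end sketch per video clip might look like the following; detector, along with the helper functions from the earlier sketches, is a hypothetical stand-in for the detection model and the optical-flow post-processing, and the flow layout (H, W, 2) is an assumption.

```python
import numpy as np

def detect_abnormal_sorting(frames, flows, detector, n_threshold=3):
    """End-to-end sketch: mark each frame's state, then report abnormal sorting
    when the run of consecutive 1s exceeds the threshold.
    frames[i] is a video frame; flows[i] is its optical flow as an (H, W, 2)
    array (u = flows[i][..., 0], v = flows[i][..., 1]); detector(frame) is
    assumed to return the sorted-parcel bbox or None."""
    states, predicted_region = [], None
    for frame, flow in zip(frames, flows):
        bbox = detector(frame)
        if bbox is not None and (predicted_region is None
                                 or center_in_region(bbox, predicted_region)):
            states.append(1)                                   # sorted parcel (re)found
            mag, angle = motion_magnitude_and_angle(flow[..., 0], flow[..., 1])
            obj_mask = np.zeros(mag.shape, dtype=bool)
            x1, y1, x2, y2 = (int(v) for v in bbox)
            obj_mask[y1:y2, x1:x2] = True                      # pixels of the parcel box
            predicted_region = predict_sorting_region(bbox, mag, angle, obj_mask)
        else:
            states.append(0)                                   # not found in predicted area
            predicted_region = None
    return longest_run(states, preset=1) > n_threshold
```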
According to the embodiment of the application, the supervised optical flow estimation algorithm is replaced by the unsupervised optical flow estimation algorithm, the model is trained by utilizing the data of the real scene, and compared with the model generalization performance obtained by training the virtual data set, the predicted optical flow is more accurate. The moving object state recognition branches are added on the basic object detection network, so that the types and states of the moving objects can be recognized at the same time, and the false recognition of packages in other states can be avoided when the moving object state recognition branches are applied to abnormal sorting recognition.
The embodiment of the application designs a moving target detection algorithm for identifying abnormal sorting behaviors in a sorting scene by combining an unsupervised optical flow estimation network and a target detection network with added state identification branches. The model is trained on the actual sorting video data by using the unsupervised optical flow estimation algorithm, so that the generalization performance of the optical flow estimation algorithm on the actual sorting field data and the accuracy of predicting the optical flow are improved. In order to avoid that the motion packages except the sorted packages influence the identification result, a package state identification branch is additionally added in the target detection network so as to distinguish the packages in three states, and the method can reduce the misjudgment condition. Whether the current frame contains the sorted packages or not is determined through a detection network, coordinates of the sorted packages are recorded when the sorted packages exist, the moving direction of the sorted packages is calculated through optical flow, the position area of the sorted packages in the next frame is predicted jointly through the coordinates and the moving direction, whether abnormal sorting behaviors exist or not is judged jointly through multiple frames, and overall recognition accuracy is improved.
Fig. 9 is a schematic diagram of the main units of a video processing apparatus according to an embodiment of the present application. As shown in fig. 9, the video processing apparatus 900 includes a receiving unit 901, a sorted object image determining unit 902, an acquiring unit 903, a predicting unit 904, a state sequence generating unit 905 and a sorting attitude identification determining unit 906.
The receiving unit 901 is configured to receive a video processing request, acquire a corresponding video identifier, and further acquire a video to be processed according to the video identifier;
the sorted object image determining unit 902 is configured to call an unsupervised optical flow estimation network to predict and obtain optical flows according to two adjacent frame images in the video to be processed, and further determine the sorted object images in the two adjacent frame images according to the optical flows;
an obtaining unit 903 configured to obtain actual position coordinates corresponding to the sorted object image from each frame of the video to be processed;
a prediction unit 904 configured to predict a sorting position area of the sorted object image in a target continuous video frame image of the video to be processed based on the optical flow, the sorted object image, and the actual position coordinates;
a state sequence generating unit 905 configured to generate a state sequence based on the actual position coordinates and the sorting position area;
And a sorting attitude identification determining unit 906 configured to determine and output a sorting attitude identification corresponding to the sorted object image according to the state sequence.
In some embodiments, the sorted object image determination unit 902 is further configured to: determining moving object images in two adjacent frames of images according to the optical flow, and generating mask images according to the moving object images; the mask image is input into the detection model to output a sorted object image.
In some embodiments, the sorted object image determination unit 902 is further configured to: classify each moving object image in the mask image, and further generate a classification identifier corresponding to each moving object image; and determine the moving object image whose classification identifier corresponds to the sorted object as the sorted object image and output it.
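A minimal sketch of the mask generation and classification-based selection described above is given below, assuming that moving pixels are separated by a flow-magnitude threshold and that the detection model returns a list of dictionaries carrying a class identifier; the threshold value, the dictionary layout, and the class identifier value are illustrative assumptions.

```python
import numpy as np

def motion_mask(flow: np.ndarray, magnitude_threshold: float = 1.0) -> np.ndarray:
    """flow: H x W x 2 per-pixel displacement field; returns a binary mask of moving pixels."""
    magnitude = np.linalg.norm(flow, axis=-1)                   # displacement length per pixel
    return (magnitude > magnitude_threshold).astype(np.uint8)   # 1 = moving, 0 = static

def select_sorted_objects(detections: list, sorted_class_id: int = 0) -> list:
    """Keep only detections whose classification identifier corresponds to the sorted object."""
    return [d for d in detections if d["class_id"] == sorted_class_id]
```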
In some embodiments, the prediction unit 904 is further configured to: acquiring a preset number of continuous video frame images after two adjacent frame images in a video to be processed; a preset number of consecutive video frame images is determined as target consecutive video frame images.
In some embodiments, the prediction unit 904 is further configured to: determining the offset and the displacement angle of each pixel point on the sorted object image according to the optical flow; in response to detecting the sorted object image in both of the adjacent two frame images, a sorting location area of the sorted object image in the target successive video frame images of the video to be processed is determined based on the actual location coordinates, offset and displacement angle of the sorted object image in the adjacent two frame images.
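The following sketch shows one way the offset and displacement angle could be derived from the optical flow inside the detected package box, and how the sorting position area could be predicted by shifting that box; averaging the per-pixel flow and the margin value are illustrative assumptions.

```python
import numpy as np

def offset_and_angle(flow: np.ndarray, box) -> tuple:
    """Mean offset (dx, dy) and displacement angle (degrees) of the flow inside the package box."""
    x1, y1, x2, y2 = box
    region_flow = flow[y1:y2, x1:x2].reshape(-1, 2)
    dx, dy = region_flow.mean(axis=0)
    angle = float(np.degrees(np.arctan2(dy, dx)))
    return (float(dx), float(dy)), angle

def predict_sorting_region(box, offset, margin: int = 10) -> tuple:
    """Shift the detected box by the offset to obtain the predicted sorting position area."""
    dx, dy = offset
    x1, y1, x2, y2 = box
    return (int(x1 + dx) - margin, int(y1 + dy) - margin,
            int(x2 + dx) + margin, int(y2 + dy) + margin)
```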
In some embodiments, the state sequence generating unit 905 is further configured to determine, in response to the actual position coordinates being within the sorting location area, a state identifier corresponding to the actual position coordinates as the first identifier; determining a state identifier corresponding to the actual position coordinate as a second identifier in response to the actual position coordinate not being located in the sorting position area; a sequence of states is generated from the first and second identifications in response to the presence of the actual location coordinate within the sort location region and the presence of the actual location coordinate not within the sort location region.
In some embodiments, the sort attitude identification determination unit 906 is further configured to: determining the number of continuous preset identifiers in a state sequence; and determining the sorting attitude identification corresponding to the sorted object images as the abnormal sorting identification in response to the number being larger than a preset threshold.
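The following sketch illustrates the state sequence and the consecutive-count rule described above; the concrete identifier values, which identifier is treated as the preset identifier, and the threshold are illustrative assumptions rather than values fixed by the present application.

```python
FIRST_ID, SECOND_ID = 1, 0   # illustrative values for the two state identifiers

def build_state_sequence(actual_coords, region) -> list:
    """First identifier when the actual coordinate lies inside the sorting position area, second otherwise."""
    x1, y1, x2, y2 = region
    return [FIRST_ID if (x1 <= x <= x2 and y1 <= y <= y2) else SECOND_ID
            for (x, y) in actual_coords]

def is_abnormal_sorting(states, preset_id=SECOND_ID, threshold: int = 3) -> bool:
    """Abnormal sorting when the longest run of the preset identifier exceeds the threshold."""
    run = best = 0
    for s in states:
        run = run + 1 if s == preset_id else 0
        best = max(best, run)
    return best > threshold
```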
In some embodiments, the video processing apparatus further comprises a model training unit, not shown in fig. 9, configured to: acquire an initial neural network; acquire two adjacent frames of images in a video; call the initial neural network to obtain a corresponding optical flow based on the two adjacent frames of images; generate a composite image by warping the later of the two adjacent frame images with the optical flow, calculate the photometric error between the earlier of the two adjacent frame images and the composite image, and determine the photometric error as the loss function of the initial neural network; and perform iterative training on the initial neural network based on the loss function, thereby obtaining the unsupervised optical flow estimation network.
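The following PyTorch sketch illustrates the photometric loss described above: the later frame is warped back with the predicted optical flow and compared with the earlier frame. The use of bilinear sampling and an L1 error is an illustrative assumption rather than a requirement of the present application.

```python
import torch
import torch.nn.functional as F

def photometric_loss(frame_prev: torch.Tensor, frame_next: torch.Tensor,
                     flow: torch.Tensor) -> torch.Tensor:
    """frame_*: N x C x H x W tensors; flow: N x 2 x H x W per-pixel displacements (dx, dy) in pixels."""
    _, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow.device)   # base sampling grid, 2 x H x W
    new_coords = grid.unsqueeze(0) + flow                         # where each pixel moved to
    # Normalize sampling positions to [-1, 1] as required by grid_sample.
    new_x = 2.0 * new_coords[:, 0] / (w - 1) - 1.0
    new_y = 2.0 * new_coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((new_x, new_y), dim=-1)             # N x H x W x 2
    warped = F.grid_sample(frame_next, sample_grid, align_corners=True)
    return (frame_prev - warped).abs().mean()                     # L1 photometric error
```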
In the present application, the video processing method and the video processing apparatus correspond to each other in terms of implementation content, and therefore the repeated content is not described again.
Fig. 10 illustrates an exemplary system architecture 1000 to which the video processing method or video processing apparatus of embodiments of the present application may be applied.
As shown in fig. 10, a system architecture 1000 may include terminal devices 1001, 1002, 1003, a network 1004, and a server 1005. The network 1004 serves as a medium for providing a communication link between the terminal apparatuses 1001, 1002, 1003 and the server 1005. The network 1004 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user can interact with a server 1005 via a network 1004 using terminal apparatuses 1001, 1002, 1003 to receive or transmit messages or the like. Various communication client applications such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 1001, 1002, 1003.
The terminal devices 1001, 1002, 1003 may be various electronic devices having a video processing screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 1005 may be a server providing various services, such as a background management server (merely an example) providing support for video processing requests submitted by users using the terminal devices 1001, 1002, 1003. The background management server can receive the video processing request, acquire the corresponding video identification, and further acquire the video to be processed according to the video identification; invoking an unsupervised optical flow estimation network to predict and obtain optical flows according to two adjacent frames of images in the video to be processed, and further determining sorted object images in the two adjacent frames of images according to the optical flows; acquiring actual position coordinates corresponding to the sorted object images from each frame of the video to be processed; predicting and obtaining a sorting position area of the sorted object image in a target continuous video frame image of the video to be processed based on the optical flow, the sorted object image and the actual position coordinate; generating a state sequence based on the actual location coordinates and the sort location area; and determining and outputting a sorting attitude identifier corresponding to the sorted object image according to the state sequence. The actual position of the sorted object in each frame is compared with the predicted sorting position area to obtain a state sequence, and a plurality of frames are combined together to judge whether abnormal sorting behaviors exist or not, so that the overall abnormal sorting recognition accuracy is improved.
It should be noted that, the video processing method provided in the embodiment of the present application is generally executed by the server 1005, and accordingly, the video processing apparatus is generally disposed in the server 1005.
It should be understood that the number of terminal devices, networks and servers in fig. 10 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 11, there is illustrated a schematic diagram of a computer system 1100 suitable for use in implementing the terminal device of an embodiment of the present application. The terminal device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 11, the computer system 1100 includes a Central Processing Unit (CPU) 1101, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data required for the operation of the computer system 1100 are also stored. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output section 1107 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a LAN card, a modem, and the like. The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1110 as needed, so that a computer program read therefrom is installed into the storage section 1108 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments disclosed herein include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1109, and/or installed from the removable media 1111. The above-described functions defined in the system of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 1101.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware. The described units may also be provided in a processor, for example, described as: a processor includes a receiving unit, a sorted object image determining unit, an acquiring unit, a predicting unit, a state sequence generating unit, and a sorting attitude identification determining unit. The names of these units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present application also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs, and when the one or more programs are executed by one device, the device receives a video processing request, acquires a corresponding video identifier, and further acquires a video to be processed according to the video identifier; invoking an unsupervised optical flow estimation network to predict and obtain optical flows according to two adjacent frames of images in the video to be processed, and further determining sorted object images in the two adjacent frames of images according to the optical flows; acquiring actual position coordinates corresponding to the sorted object images from each frame of the video to be processed; predicting and obtaining a sorting position area of the sorted object image in a target continuous video frame image of the video to be processed based on the optical flow, the sorted object image and the actual position coordinate; generating a state sequence based on the actual location coordinates and the sort location area; and determining and outputting a sorting attitude identifier corresponding to the sorted object image according to the state sequence.
According to the technical scheme of the embodiment of the application, the actual position of the sorted object in each frame is compared with the predicted sorting position area to obtain the state sequence, and the frames are combined to jointly judge whether abnormal sorting behaviors exist or not, so that the overall abnormal sorting recognition accuracy is improved.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.
Claims (12)
1. A video processing method, comprising:
receiving a video processing request, acquiring a corresponding video identifier, and further acquiring a video to be processed according to the video identifier;
invoking an unsupervised optical flow estimation network to predict and obtain optical flows according to two adjacent frames of images in the video to be processed, and further determining sorted object images in the two adjacent frames of images according to the optical flows;
acquiring actual position coordinates corresponding to the sorted object images from each frame of the video to be processed;
Predicting and obtaining a sorting position area of the sorted object image in a target continuous video frame image of the video to be processed based on the optical flow, the sorted object image and the actual position coordinate;
generating a sequence of states based on the actual location coordinates and the sorting location area;
and determining and outputting the sorting attitude identification corresponding to the sorted object image according to the state sequence.
2. The method of claim 1, wherein the determining the sorted object image in the two adjacent frames of images from the optical flow comprises:
determining moving object images in the two adjacent frames of images according to the optical flow, and generating mask images according to the moving object images;
the mask image is input into a detection model to output a sorted object image.
3. The method of claim 2, wherein said inputting the mask image into a detection model to output a sorted object image comprises:
classifying each moving object image in the mask image, and further generating a classification identifier corresponding to each moving object image;
and determining a moving object image corresponding to the classified identification corresponding to the classified object as an image of the classified object and outputting the image.
4. The method of claim 1, wherein prior to said predicting a sorting location area of said sorted object image in a target successive video frame image of said video to be processed, said method further comprises:
acquiring a preset number of continuous video frame images after the two adjacent frame images in the video to be processed;
and determining the preset number of continuous video frame images as target continuous video frame images.
5. The method of claim 1, wherein predicting a sorting location area of the sorted object image in the target consecutive video frame image of the video to be processed based on the optical flow, the sorted object image, and the actual location coordinates comprises:
determining the offset and displacement angle of each pixel point on the sorted object image according to the optical flow;
and determining a sorting position area of the sorted object image in the target continuous video frame image of the video to be processed based on the actual position coordinates of the sorted object image in the two adjacent frame images, the offset and the displacement angle in response to the detected sorted object image in the two adjacent frame images.
6. The method of claim 1, wherein the generating a sequence of states based on the actual location coordinates and the sort location region comprises:
determining a state identifier corresponding to the actual position coordinate as a first identifier in response to the actual position coordinate being located in the sorting position area;
determining a state identifier corresponding to the actual position coordinate as a second identifier in response to the actual position coordinate not being located in the sorting position area;
a sequence of states is generated from the first and second identifications in response to the presence of the actual location coordinates within the sort location region and the presence of the actual location coordinates not within the sort location region.
7. The method of claim 1, wherein determining a sort attitude identification corresponding to the sorted object image from the sequence of states comprises:
determining the number of continuous preset identifiers in the state sequence;
and determining the sorting attitude identification corresponding to the sorted object images as an abnormal sorting identification in response to the number being larger than a preset threshold.
8. The method of claim 1, wherein prior to the invoking the unsupervised optical flow estimation network, the method further comprises:
Acquiring an initial neural network;
acquiring two adjacent frames of images in a video;
invoking the initial neural network to obtain a corresponding optical flow based on the two adjacent frames of images;
generating a composite image by warping the later of the two adjacent frame images with the optical flow, further calculating a photometric error between the earlier of the two adjacent frame images and the composite image, and determining the photometric error as a loss function of the initial neural network;
and performing iterative training on the initial neural network based on the loss function, so as to obtain an unsupervised optical flow estimation network.
9. A video processing apparatus, comprising:
the receiving unit is configured to receive a video processing request, acquire a corresponding video identifier and further acquire a video to be processed according to the video identifier;
the sorted object image determining unit is configured to call an unsupervised optical flow estimating network to predict and obtain optical flows according to two adjacent frame images in the video to be processed, and further determine the sorted object images in the two adjacent frame images according to the optical flows;
the acquisition unit is configured to acquire actual position coordinates corresponding to the sorted object images from each frame of the video to be processed;
A prediction unit configured to predict a sorting position area of the sorted object image in a target continuous video frame image of the video to be processed based on the optical flow, the sorted object image, and the actual position coordinates;
a state sequence generating unit configured to generate a state sequence based on the actual position coordinates and the sorting position area;
and the sorting attitude identification determining unit is configured to determine and output the sorting attitude identification corresponding to the sorted object image according to the state sequence.
10. The apparatus according to claim 9, wherein the sorted object image determination unit is further configured to:
determining moving object images in the two adjacent frames of images according to the optical flow, and generating mask images according to the moving object images;
the mask image is input into a detection model to output a sorted object image.
11. A video processing electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
when executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-8.
12. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211219337.2A CN117853573A (en) | 2022-09-29 | 2022-09-29 | Video processing method, device, electronic equipment and computer readable medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211219337.2A CN117853573A (en) | 2022-09-29 | 2022-09-29 | Video processing method, device, electronic equipment and computer readable medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117853573A true CN117853573A (en) | 2024-04-09 |
Family
ID=90540551
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211219337.2A Pending CN117853573A (en) | 2022-09-29 | 2022-09-29 | Video processing method, device, electronic equipment and computer readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117853573A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118506250A (en) * | 2024-07-16 | 2024-08-16 | 深圳市一秋医纺科技有限公司 | Medical fabric sorting abnormality identification method and device |
Legal Events

Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |