CN114926499A - Video processing method, video processing device, computer equipment and storage medium - Google Patents

Video processing method, video processing device, computer equipment and storage medium

Info

Publication number
CN114926499A
CN114926499A (application CN202210545502.7A)
Authority
CN
China
Prior art keywords
video
processed
target
image
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210545502.7A
Other languages
Chinese (zh)
Inventor
毛丽娟
盛斌
黎楚萱
李震
王继红
李庭瑶
向昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Shanghai University of Sport
Original Assignee
Shanghai Jiaotong University
Shanghai University of Sport
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University and Shanghai University of Sport
Priority to CN202210545502.7A
Publication of CN114926499A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a video processing method, apparatus, computer device, storage medium and computer program product. The method comprises the following steps: receiving a video to be processed uploaded by a client; performing coordinate conversion on the video to be processed, and sending the first frame of image to be processed after the coordinate conversion to the client; and receiving an annotation result uploaded by the client for the first frame of image to be processed, identifying and tracking a target object in the video to be processed according to the annotation result to generate a target video, and sending the target video to the client. By adopting the method, a target video that simultaneously identifies and tracks the target object can be generated.

Description

Video processing method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer image technologies, and in particular, to a video processing method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of computer vision, great convenience has been brought to people's lives, and computer vision has become an integral part of various intelligent/autonomous systems in fields such as manufacturing, inspection, document analysis, medical diagnosis, and military affairs.
In the related art, the target object in a video can only be identified; it cannot also be tracked through the video.
Disclosure of Invention
In view of the above technical problems, it is necessary to provide a video processing method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product capable of both tracking and identifying a target object.
In a first aspect, the present application provides a video processing method applied to a server, where the method includes:
receiving a video to be processed uploaded by a client;
performing coordinate conversion on a video to be processed, and sending a first frame of image to be processed after the coordinate conversion to a client;
and receiving an annotation result uploaded by the client for the first frame of image to be processed, identifying and tracking a target object in the video to be processed according to the annotation result, generating a target video, and sending the target video to the client.
In one embodiment, the identifying and tracking a target object in a video to be processed according to a labeling result and generating a target video includes:
when the marking result is qualified, identifying and tracking a target object in the video to be processed after coordinate conversion and generating a target video;
and when the marking result is unqualified, performing coordinate conversion on the video to be processed according to the coordinate information carried in the marking result, and identifying and tracking the target object in the video to be processed after the coordinate conversion to generate the target video.
In one embodiment, the identifying and tracking a target object in the video to be processed after the coordinate transformation and generating a target video includes:
identifying a target object in a video to be processed to obtain an action identification image of each frame in the video to be processed;
tracking a target object in a video to be processed to obtain a target tracking image of each frame in the video to be processed;
matching each frame of action recognition image with each frame of target tracking image to obtain each frame of target image;
and connecting each frame of target image to obtain a target video.
In one embodiment, after the matching of each frame of motion recognition image and each frame of target tracking image to obtain each frame of target image, the method further includes:
calculating a characteristic value of each target object in the target image;
and classifying the target objects according to the characteristic values of the target objects aiming at each target object in the target image to obtain the category corresponding to the target object.
In a second aspect, the present application further provides a video processing method applied to a client, where the method includes:
sending a video to be processed to a server;
receiving a first frame of image to be processed after coordinate conversion sent by the server, judging the first frame of image to be processed after the coordinate conversion, generating an annotation result, and sending the annotation result to the server; the first frame of image to be processed is obtained by the server performing coordinate conversion on the video to be processed;
and receiving a target video sent by the server, wherein the target video is generated by identifying and tracking a target object in the video to be processed according to the labeling result by the server.
In a third aspect, the present application further provides a video processing apparatus applied to a server, where the apparatus includes:
the receiving module is used for receiving the video to be processed uploaded by the client;
the coordinate conversion module is used for carrying out coordinate conversion on the video to be processed and sending the first frame of image to be processed after the coordinate conversion to the client;
the identification module is used for receiving the labeling result uploaded by the client for the first frame of image to be processed, identifying and tracking the target object in the video to be processed according to the labeling result, generating a target video and sending the target video to the client.
In a fourth aspect, the present application further provides a video processing apparatus applied to a client, where the apparatus includes:
the sending module is used for sending the video to be processed to the server;
the annotation receiving module is used for receiving the first frame of image to be processed after the coordinate conversion sent by the server, judging the first frame of image to be processed after the coordinate conversion, generating an annotation result and sending the annotation result to the server; the first frame of image to be processed is obtained by the server performing coordinate conversion on the video to be processed;
and the target video receiving module is used for receiving a target video sent by the server, and the target video is generated by the server by identifying and tracking a target object in the video to be processed according to the labeling result.
In a fifth aspect, the present application further provides a computer device. The computer device comprises a memory storing a computer program and a processor that implements the steps of the method in any of the above embodiments when executing the computer program.
In a sixth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, implements the steps of the method in any of the above embodiments.
In a seventh aspect, the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method in any of the above embodiments.
According to the video processing method, apparatus, computer device, storage medium and computer program product, the server receives the video to be processed uploaded by the client, performs coordinate conversion on it, and sends the first frame of image to be processed after the coordinate conversion to the client; the client judges the coordinate conversion effect on the first frame of image to be processed and generates an annotation result, so that the quality of the coordinate conversion of the video to be processed can be guaranteed and the accuracy of the subsequently generated target video improved. The server then tracks and identifies the target object in the video to be processed according to the annotation result uploaded by the client for the first frame of image to be processed, and generates a target video that simultaneously identifies and tracks the target object.
Drawings
FIG. 1 is a diagram of an exemplary video processing application;
FIG. 2 is a flow diagram of a video processing method in one embodiment;
FIG. 3 is a schematic diagram of coordinate transformation of a soccer field according to an embodiment;
FIG. 4 is a flow chart illustrating a video processing method according to another embodiment;
FIG. 5 is a diagram illustrating a frame in a target video, in one embodiment;
FIG. 6 is a schematic representation of the operation of a GUI in one embodiment;
FIG. 7 is a GUI interface diagram in one embodiment;
FIG. 8 is a GUI interface diagram in accordance with another embodiment;
FIG. 9 is a schematic diagram of QT transmission in one embodiment;
FIG. 10 is a block diagram showing the structure of a video processing apparatus according to one embodiment;
fig. 11 is a block diagram showing the structure of a video processing apparatus in another embodiment;
FIG. 12 is a diagram of an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it.
The video processing method provided by the embodiments of the application can be applied to the application environment shown in fig. 1, in which the client 102 communicates with the server 104 over a network. The data storage system may store data that the server 104 needs to process; it may be integrated on the server 104 or placed on the cloud or another network server. The server 104 first receives a video to be processed uploaded by the client 102, performs coordinate conversion on it, and sends the first frame of image to be processed in the coordinate-converted video to the client 102. The client 102 judges the coordinate conversion effect according to this first frame, generates an annotation result, and sends the annotation result to the server 104. The server 104 then tracks and identifies the target object in the video to be processed according to the annotation result to generate a target video, and finally sends the target video to the client 102; the target video simultaneously identifies and tracks the target object. The client 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, Internet-of-Things devices, and portable wearable devices; the Internet-of-Things devices may be smart speakers, smart televisions, smart air conditioners, smart in-vehicle devices, and the like, and the portable wearable devices may be smart watches, smart bracelets, head-mounted devices, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In an embodiment, as shown in fig. 2, a video processing method is provided, which is described by taking the method as an example applied to the server 104 in fig. 1, and includes the following steps:
s202, receiving the video to be processed uploaded by the client.
The video to be processed refers to a video shot by a camera, a mobile phone, or other devices, and may be a video shot in any scene, such as a football video, a basketball video, or the like, which is not specifically limited herein.
And S204, carrying out coordinate conversion on the video to be processed, and sending the first frame of image to be processed after the coordinate conversion to the client.
The first frame of image to be processed refers to the first frame of the video to be processed after coordinate conversion has been performed on the video.
Specifically, after receiving a to-be-processed video uploaded by a client, a server performs coordinate conversion on the to-be-processed video, and then sends a first frame to-be-processed image in the coordinate-converted video to the client, so that the client judges the quality of the coordinate conversion through the first frame to-be-processed image.
In one embodiment, the to-be-processed video may be subjected to coordinate transformation through a pre-trained coordinate transformation model, for example, a coordinate transformation neural network, so as to obtain the to-be-processed video after coordinate transformation.
In an embodiment, specifically referring to fig. 3, fig. 3 is a schematic diagram of coordinate transformation of a football pitch in an embodiment, where the left side is a frame image in a captured football video, the football pitch image is in a perspective shape, the right side is a normal rectangle, and a process of transforming the left side image into the right side image is called coordinate transformation. The left football field image can be converted into the right normal rectangular image through a coordinate conversion model trained in advance.
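As a concrete illustration of this transformation (a geometric sketch, not the pre-trained neural network itself), the rectification in fig. 3 can be expressed as a planar homography in OpenCV; the point coordinates and file names below are placeholder assumptions.

```python
import cv2
import numpy as np

# Four reference points of the pitch in the captured (perspective) frame, and
# their desired positions in the rectified rectangular view. The coordinates
# here are illustrative placeholders, not values from this application.
src_pts = np.float32([[120, 80], [520, 60], [600, 340], [40, 360]])
dst_pts = np.float32([[0, 0], [680, 0], [680, 440], [0, 440]])

# Homography mapping the perspective pitch onto a normal rectangle.
H = cv2.getPerspectiveTransform(src_pts, dst_pts)

frame = cv2.imread("first_frame.png")  # first frame of the video to be processed
rectified = cv2.warpPerspective(frame, H, (680, 440))
cv2.imwrite("first_frame_rectified.png", rectified)
```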
In an embodiment, after coordinate conversion is performed on the video to be processed, the coordinate-converted first frame of image to be processed is sent to the client, and the server continues the coordinate conversion of the video to be processed according to the client's annotation result for that first frame.
And S206, receiving the labeling result uploaded by the client for the first frame of the to-be-processed image, identifying and tracking the target object in the to-be-processed video according to the labeling result, generating a target video, and sending the target video to the client.
The annotation result refers to the quality judgment, uploaded by the client, of the coordinate-converted first frame of image to be processed; the target video is obtained by processing the video to be processed and can both identify and track the target object in it, where the target object is the person or object that needs to be identified and tracked in the video to be processed.
Specifically, the server receives the annotation result uploaded by the client for the first frame of image to be processed, and then identifies and tracks the target object in the video to be processed according to the annotation result. Optionally, when the annotation result is qualified, the server can continue processing the coordinate-converted video to be processed that it has already produced; when the annotation result is unqualified, the server performs coordinate conversion on the video to be processed again according to the coordinate information carried in the annotation result, and then identifies and tracks the target object in the coordinate-converted video. Specifically, the target video may be generated by matching the recognition result of the target object with the tracking result. Finally, the server sends the generated target video to the client.
In the video processing method, the server side receives the to-be-processed video uploaded by the client side, then performs coordinate conversion on the to-be-processed video, sends the first to-be-processed image after the coordinate conversion to the client side, and the client side judges the coordinate conversion effect of the first to-be-processed image and generates the labeling result, so that the quality of the coordinate conversion of the to-be-processed video can be ensured, and the accuracy of the subsequent generation of the target video is improved. And then, the server tracks and identifies the target object in the image to be processed according to the label uploaded by the client for the first frame of image to be processed, and generates a target video capable of identifying and tracking the target object at the same time.
In an embodiment, the identifying and tracking a target object in the video to be processed according to the annotation result and generating a target video includes: when the marking result is qualified, identifying and tracking a target object in the video to be processed after coordinate conversion and generating a target video; and when the marking result is unqualified, performing coordinate conversion on the video to be processed according to the coordinate information carried in the marking result, and identifying and tracking the target object in the video to be processed after the coordinate conversion to generate the target video.
The annotation result refers to the result generated by the client judging the coordinate conversion effect of the first frame of image to be processed received from the server; it takes one of two values, qualified or unqualified.
Specifically, when the annotation result received by the server is qualified, the target object in the coordinate-converted video to be processed is directly identified and tracked, and the target video is generated from the recognition result and the tracking result. When the annotation result received by the server is unqualified, coordinate conversion is performed on the video to be processed according to the coordinate information carried in the annotation result; the target object in the video converted according to that coordinate information is then identified and tracked, and the target video is generated from the recognition and tracking results.
In an embodiment, after performing coordinate conversion according to the coordinate information, the server may send the first frame of image to be processed in the video to the client again; the client judges the coordinate conversion effect of this first frame, generates a corresponding annotation result, and uploads it to the server. The server repeats the coordinate conversion and sends the first frame again until the annotation result received from the client is qualified.
In one embodiment, the client first uploads the football video to be processed to the server, and the server performs coordinate conversion on it. The server then extracts the first frame football image and sends it to the client, and the user downloads the court transformation result, i.e. the first frame football image, and judges its conversion effect to generate a corresponding annotation result. With reference to fig. 3, if a preset number of points on the court in the first frame of the football video correspond to the preset number of points on the normal rectangle on the right, the first frame is judged to be qualified; otherwise, the first frame is calibrated by a calibration program written in Python (a programming language) to generate coordinate information. The Python calibration program computes a pkl file (the coordinate information) from the transformation of the preset number of points, and the client then uploads the annotation result and the coordinate information to the server for further analysis. Preferably, the preset number of points is four points fixed at the four corners of the court.
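A minimal sketch of such a calibration step is given below, assuming OpenCV is used to compute the transform from the four corner points; the function name, arguments, and output file name are hypothetical, and only the general behavior of saving the coordinate information as a pkl file is taken from the description above.

```python
import pickle

import cv2
import numpy as np

def calibrate(corner_pts_image, corner_pts_model, out_path="court_transform.pkl"):
    """Compute the court homography from the four fixed corner points
    and store it as a pkl file (the 'coordinate information')."""
    H = cv2.getPerspectiveTransform(
        np.float32(corner_pts_image),  # corners marked in the video frame
        np.float32(corner_pts_model),  # corners of the standard rectangular court
    )
    with open(out_path, "wb") as f:
        pickle.dump(H, f)
    return H
```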
In this embodiment, the annotation result uploaded by the client reflects the coordinate conversion effect on the video to be processed, and whether coordinate conversion is performed again is determined according to the annotation result; this ensures the quality of the coordinate conversion of the video to be processed and, in turn, the accuracy of the generated target video.
In an embodiment, the identifying and tracking the target object in the coordinate-converted video to be processed and generating the target video includes: identifying a target object in a video to be processed to obtain an action identification image of each frame in the video to be processed; tracking a target object in a video to be processed to obtain a target tracking image of each frame in the video to be processed; matching each frame of action recognition image with each frame of target tracking image to obtain each frame of target image; and connecting each frame of target image to obtain a target video.
The motion recognition image is an image generated after the target object in the video to be processed has been recognized; the target tracking image is an image generated after the target object in the video to be processed has been tracked; the target image is an image generated by matching the target objects in the motion recognition image with those in the corresponding target tracking image.
Specifically, the server recognizes the motion of the target object in each frame of image to be processed in the video, obtaining a motion recognition image for each frame, i.e. the above-mentioned recognition result. Optionally, the target object may be recognized by a pre-trained motion recognition model, for example PaStaNet (a part-state-based motion recognition model).
Specifically, the server tracks the target object in each frame of image to be processed in the video, obtaining a target tracking image for each frame, i.e. the above-mentioned tracking result. Optionally, the target object in the image to be processed may be tracked by a pre-trained detection model, such as yolov5 (You Only Look Once v5, a target detection model).
It should be noted that, the order of identifying and tracking the target object in the video to be processed is not limited herein, that is, the target object in the video to be processed may be identified first or tracked first.
Specifically, after each frame of motion recognition image and each frame of target tracking image in the video to be processed are obtained, the target objects in the corresponding frames of the motion recognition image and the target tracking image are matched to obtain each frame of target image. Concretely, a matching score may be computed between the detection boxes of the target objects in the motion recognition image and in the target tracking image, and the pair of target objects with the largest score selected. In one embodiment, the bounding boxes (detection boxes of the target objects) obtained by the yolov5 network and the PaStaNet network are matched: the IoU (intersection over union, a standard measure of how accurately a detected box covers the corresponding object) is calculated between the detection boxes in corresponding frames, and the detection boxes with the largest IoU value are matched.
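The IoU computation and the per-frame matching described above can be sketched as follows; the box format (x1, y1, x2, y2) and the dictionary structures are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_frame(action_boxes, tracking_boxes):
    """For each tracked box (from yolov5), pick the action-recognition box
    (from PaStaNet) with the largest IoU in the same frame."""
    matches = {}
    for track_id, t_box in tracking_boxes.items():
        scored = [(iou(t_box, a_box), a_id) for a_id, a_box in action_boxes.items()]
        if scored:
            best_score, best_id = max(scored)
            if best_score > 0:
                matches[track_id] = best_id  # track ID -> recognized action entry
        # tracks with no overlapping action box stay unmatched in this frame
    return matches
```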
And finally, connecting the target images of each frame to obtain a target video.
In one embodiment, the PaStaNet network is used to obtain each person's action at different times, but it has no way to associate detections of the same person across frames (e.g., it may detect that player A performs action a in the first frame and that A' performs action a in the second frame, without knowing whether A and A' are the same person). yolov5 performs object detection and tracking, and its tracked result is associated with the output of PaStaNet: because yolov5 knows whether A in the previous frame and A' in the current frame are the same object, the target object can be identified and tracked at the same time.
In the above embodiment, the target object in the video to be processed is respectively identified and tracked to obtain each frame of motion identification image and each frame of target tracking image in the video to be processed, then each frame of motion identification image and each frame of target tracking image are matched to obtain each frame of target image, and finally the target images are connected to obtain the target video capable of identifying and tracking the target object at the same time.
In an embodiment, after the matching of each frame of the motion recognition image and each frame of the target tracking image to obtain each frame of the target image, the method further includes: calculating a characteristic value of each target object in the target image; and classifying the target objects according to the characteristic values of the target objects aiming at each target object in the target image to obtain the corresponding category of the target objects.
The feature value is calculated according to information of a certain position in the detection frame corresponding to the target object, and is used as a standard for classifying the target object.
Specifically, the information at the center position of the detection box corresponding to each target object in each frame of target image is first calculated; in this embodiment the color information at the center of the detection box is taken as the characteristic value, while in other embodiments other information at other positions of the detection box may be used. After the characteristic value of each target object in each frame of target image in the video to be processed is obtained, the characteristic values of the same target object across all frames are averaged to serve as the characteristic value of that target object in the video. The target object is then classified according to the characteristic value to obtain its corresponding category. For example, in the case of a football video, classifying the target objects, i.e. the players, according to their characteristic values groups them by team, which facilitates subsequent analysis of the tactics in the football match.
In an embodiment, before calculating the feature value of each target object in the target image, the target image is further subjected to noise reduction, wherein optionally, noise reduction may be performed in a gaussian blur manner, so that the signal-to-noise ratio of the target image may be improved, and the original information of the target image is maintained to the maximum extent.
In one embodiment, the video to be processed is a football video. Gaussian blurring is performed on each frame of picture (each frame of football target image), the color information (the characteristic value) at the center of each player's (target object's) bounding box (detection box) is sampled in each frame, and the color average over all T frames of the full video,

c̄ = (1/T) Σ_{t=1..T} c_t, where c_t is the color sampled in frame t,

is taken as the color feature vector for that person. k-means clustering is then performed on the obtained player color vectors to obtain the team information. This enables various analyses of the football video, such as the players' technical tactics (based on motion recognition) and the players' physical abilities (based on tracking the players' running routes); the purpose of these analyses is to help coaches subsequently formulate tactics.
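The color-feature extraction and team grouping can be sketched as follows, assuming OpenCV, NumPy, and scikit-learn; the data structures (`frames`, `tracks`), the Gaussian kernel size, and the cluster count are illustrative assumptions rather than values specified in this application.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def player_color_features(frames, tracks):
    """Average the color sampled at each player's box center over all frames.
    `tracks` maps player ID -> {frame_index: (x1, y1, x2, y2)}."""
    features = {}
    for pid, boxes in tracks.items():
        samples = []
        for t, (x1, y1, x2, y2) in boxes.items():
            blurred = cv2.GaussianBlur(frames[t], (5, 5), 0)  # noise reduction
            cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
            samples.append(blurred[cy, cx].astype(np.float32))  # BGR at box center
        features[pid] = np.mean(samples, axis=0)  # per-player color feature vector
    return features

def group_by_team(features, n_teams=2):
    """Cluster the per-player color vectors to recover team membership."""
    ids = list(features)
    labels = KMeans(n_clusters=n_teams, n_init=10).fit_predict(
        np.stack([features[i] for i in ids]))
    return dict(zip(ids, labels))
```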
In the above embodiment, the characteristic value of each target object in the target image is first calculated so that it can represent the detection box in which the target object is located, which facilitates classification; classifying the target objects according to their characteristic values then enables further processing of the target video, such as football tactical analysis.
In an embodiment, as shown in fig. 4, a video processing method is provided, which is described by taking the method as an example applied to the client 102 in fig. 1, and includes the following steps:
s402, sending the video to be processed to the server.
S404, receiving the first frame of image to be processed after coordinate conversion sent by the server, judging the first frame of image to be processed after coordinate conversion, generating an annotation result, and sending the annotation result to the server; and the first frame of image to be processed is obtained by performing coordinate conversion on the video to be processed by the service end.
Specifically, the client receives the coordinate-converted first frame of image to be processed sent by the server and judges it, i.e. analyzes the quality of its coordinate conversion: if a preset number of points on the court in the first frame of the football video correspond to the preset number of points on the normal rectangle on the right, the first frame is judged to be qualified; otherwise, coordinate information is generated by a calibration program written in Python. The Python calibration program computes a pkl file (the coordinate information) from the transformation of the preset number of points, and the client then uploads the annotation result and the coordinate information to the server for further analysis.
S406, receiving a target video sent by the server, wherein the target video is generated by the server identifying and tracking a target object in the video to be processed according to the labeling result.
In one embodiment, as shown in fig. 5, fig. 5 is a schematic diagram of a frame in the target video in one embodiment. Specifically, the field on the left shows the transformed court coordinates with the players of the teams; the numbers represent the players' numbers, and the positions of the numbers represent the players' actual positions in the video after the transformation. On the right is the pose estimation for the players, where the ID field gives the player number corresponding to the number in the left diagram, and the action below the ID is PaStaNet's part-level pose estimation of that player, i.e. the pose estimation of each body part.
In one embodiment, the user downloads the mp4 file (the target video) and opens the GUI of the analysis portion for viewing. The analysis GUI supports pausing and frame-by-frame stepping, so that the player actions and team tactics in a given frame can be examined. If a manual correction is desired, the user can click to correct and save it, preserving the result for the next viewing.
In the above embodiment, the client first uploads the video to be processed, then receives the coordinate-converted first frame of image to be processed sent by the server, judges it, generates the annotation result, and uploads the annotation result to the server, so as to instruct the server either to continue the coordinate conversion of the video to be processed or to identify and track the target object in it. Finally, the client receives the target video sent by the server, i.e. the video obtained by identifying and tracking the target object in the video to be processed. Once the client has received the target video, the user can conveniently process it further.
In one embodiment, a GUI (graphical user interface) is further designed; as shown in fig. 6, fig. 6 is a schematic diagram of the GUI operation flow in one embodiment. First, the client uploads the video to be analyzed (the video to be processed); after receiving the connection request, the server downloads the video and starts the neural network for analysis. After the analysis is completed, the server notifies the client to download the tagged data; once the download is complete, the client can interact with the server via commands to complete tasks such as re-analysis, further improving accuracy. In the figure, 'User GUI' denotes the client GUI, 'Server' denotes the server, 'video' denotes the video, and 'tagged data' denotes the data tags. Fig. 7 and 8 are GUI display diagrams in other embodiments.
In the embodiment, the video to be processed can be processed more conveniently and intuitively through the GUI interface.
In one embodiment, Qt6 (a program development framework) is used to develop the client GUI and the server framework. The data to be displayed in this embodiment are: the football video, the annotated video, the player position video, and server information. For video playback, frame accuracy is required, and Qt6 provides the multimedia module QMediaPlayer (Qt's multimedia player), which supports seeking within a video and pausing frame by frame. Therefore, in the video playing part, this embodiment mainly adopts QMediaPlayer, assisted by a video control module; in the text display part, this embodiment mainly uses QJson + QTableWidget, assisted by file parsing, reading, and writing.
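A minimal PyQt6 sketch of such playback is shown below; since QMediaPlayer seeks by time rather than by frame index, frame stepping is approximated here by advancing one frame's duration, and the frame rate and file name are assumptions.

```python
from PyQt6.QtCore import QUrl
from PyQt6.QtMultimedia import QMediaPlayer
from PyQt6.QtMultimediaWidgets import QVideoWidget
from PyQt6.QtWidgets import QApplication

app = QApplication([])
player = QMediaPlayer()
video_widget = QVideoWidget()
player.setVideoOutput(video_widget)
player.setSource(QUrl.fromLocalFile("target_video.mp4"))  # hypothetical file
video_widget.show()
player.pause()

FRAME_MS = 40  # one frame at 25 fps; the real rate depends on the video

def step_one_frame():
    # Advance the paused player by one frame's duration.
    player.setPosition(player.position() + FRAME_MS)

app.exec()
```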
In the above embodiment, the target video can be further processed, for example for tactical analysis, through Qt6 and QMediaPlayer.
In one embodiment, client-server interaction is implemented by the client sending a command, the server receiving and executing it, and the result being returned to the client. This interaction must therefore be real-time and reliable, which this embodiment achieves with the QProcess module and a Connection module written for this embodiment. When executing a command, the server first creates a QProcess instance, initializes the command into it, and then runs the process. When the QProcess has output to read, a signal is emitted; the Connection receives the signal, reads the output from the QProcess instance, sends it to the client, and the output is finally displayed in the text edit field.
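The command-execution path described above might look like the following PyQt6 sketch, where `connection` stands in for the Connection object mentioned in this embodiment and its `send` method is a hypothetical placeholder.

```python
from PyQt6.QtCore import QProcess

def run_command(connection, program, args):
    """Run a server-side command and stream its output back over `connection`."""
    proc = QProcess()

    def forward_output():
        # Fired whenever the process has output ready to read.
        data = proc.readAllStandardOutput().data()
        connection.send(data)  # relayed to the client's text edit field

    proc.readyReadStandardOutput.connect(forward_output)
    proc.start(program, args)
    return proc
```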
In one embodiment, the file transfer between the client and the server inherits the design idea of FTP: several connections are opened, one for transferring instructions and several for transferring data. The file and instruction transmission module is written on top of QTcpSocket, the transport-layer API provided by Qt, so reliable transmission can be achieved. For file transfer, this embodiment implements the Qt 4.0 file transfer protocol, including a greeting between server and client at the start, a query and confirmation for each metadata transfer, and a check when the file transfer completes. For instruction transmission, the module is likewise derived from the underlying QTcpServer and finally encapsulated in the Connection class, which implements all the functions. The specific transmission protocol and transmission results are shown in fig. 9, a schematic diagram of the QT transmission in one embodiment. The data transmission between the server and the client is reliable and proceeds as follows: the client sends a connection request to the server; after receiving it, the server agrees and replies, and the client then transmits data to the server; after receiving each piece of data, the server sends an acknowledgement signal, ACK, to the client; 'transmitting' denotes the process from the first piece of data to the last; after the last piece of data has been sent, the client sends an OK signal to tell the server that the data transmission is finished; the server then sends a disconnect signal, 'bye', to the client; the client agrees to disconnect and returns a 'bye' signal to the server, and the server continues to listen for connection requests from other clients.
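The handshake shown in fig. 9 can be illustrated with plain sockets as below; the message names (CONNECT, ACK, OK, BYE) mirror the description above, but the exact wire format is an assumption, and the real implementation is built on QTcpSocket rather than raw Python sockets.

```python
import socket

def send_file(host, port, payload: bytes, chunk=4096):
    """Client side of the handshake: connect -> transmit chunks (each
    acknowledged) -> OK -> bye. Illustrative only."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(b"CONNECT\n")
        assert sock.recv(16).startswith(b"ACCEPT")   # server agrees to connect
        for i in range(0, len(payload), chunk):
            sock.sendall(payload[i:i + chunk])
            assert sock.recv(16).startswith(b"ACK")  # per-chunk confirmation
        sock.sendall(b"OK\n")                        # all data transmitted
        assert sock.recv(16).startswith(b"BYE")      # server asks to disconnect
        sock.sendall(b"BYE\n")                       # client agrees and closes
```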
In the above embodiment, the client and the server are separated and connected by TCP (Transmission Control Protocol), which facilitates subsequent deployment of the project on the cloud: computing resources can be better utilized and management is convenient. Moreover, because the computing end needs to run the neural network in a Linux environment, separating the user end from the computing end also spares the user the inconvenience of environment configuration.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not subject to a strict order limitation and may be performed in other orders. Moreover, at least a part of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their execution order is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides a video processing apparatus for implementing the above-mentioned video processing method. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the method, so specific limitations in one or more embodiments of the video processing apparatus provided below may refer to the limitations on the video processing method in the foregoing, and details are not described here again.
In one embodiment, as shown in fig. 10, there is provided a video processing apparatus applied to a server, including: a receiving module 100, a coordinate conversion module 200 and a recognition module 300, wherein:
the receiving module 100 is configured to receive a to-be-processed video uploaded by a client.
The coordinate conversion module 200 is configured to perform coordinate conversion on the video to be processed, and send the coordinate-converted first frame of image to be processed to the client.
The identification module 300 is configured to receive an annotation result uploaded by the client for the first frame of to-be-processed image, identify and track a target object in the to-be-processed video according to the annotation result, generate a target video, and send the target video to the client.
In one embodiment, the identification module 300 comprises:
and the marking qualified submodule is used for identifying and tracking the target object in the video to be processed after the coordinate conversion and generating the target video when the marking result is qualified.
And the unqualified labeling submodule is used for performing coordinate conversion on the video to be processed according to coordinate information carried in the labeling result when the labeling result is unqualified, and identifying and tracking a target object in the video to be processed after the coordinate conversion to generate a target video.
In one embodiment, the labeling-eligible sub-module and the labeling-ineligible sub-module include:
and the image identification unit is used for identifying the target object in the video to be processed to obtain each frame of action identification image in the video to be processed.
And the image tracking unit is used for tracking the target object in the video to be processed to obtain a target tracking image of each frame in the video to be processed.
And the image matching unit is used for matching the action recognition image of each frame with the target tracking image of each frame to obtain a target image of each frame.
And the splicing unit is used for connecting the target images of each frame to obtain the target video.
In one embodiment, the labeling-eligible sub-module and the labeling-ineligible sub-module include:
and the characteristic value calculating unit is used for calculating the characteristic value of each target object in the target image.
And the classification unit is used for classifying the target objects according to the characteristic values of the target objects aiming at each target object in the target image to obtain the corresponding category of the target object.
In one embodiment, as shown in fig. 11, there is provided a video processing apparatus applied to a client, including: a sending module 400, an annotation receiving module 500, and a target video receiving module 600, wherein:
a sending module 400, configured to send a video to be processed to a server;
the annotation receiving module 500 is configured to receive the first frame of image to be processed after the coordinate conversion sent by the server, judge the first frame of image to be processed after the coordinate conversion, generate an annotation result, and send the annotation result to the server; the first frame of image to be processed is obtained by the server side performing coordinate conversion on the video to be processed;
a target video receiving module 600, configured to receive a target video sent by the server, where the target video is generated by the server identifying and tracking a target object in the video to be processed according to the tagging result.
The various modules in the video processing apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 12. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the video to be processed. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a video processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method in any of the above embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the steps of the method of any of the above embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of the method in any of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum-computing-based data processing logic devices, etc., without limitation.
For the sake of brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as there is no contradiction between them, such combinations should be considered within the scope of this disclosure.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these are all within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A video processing method is applied to a server side, and the method comprises the following steps:
receiving a video to be processed uploaded by a client;
performing coordinate conversion on the video to be processed, and sending a first frame of image to be processed after the coordinate conversion to a client;
and receiving a labeling result uploaded by the client for the first frame of image to be processed, identifying and tracking a target object in the video to be processed according to the labeling result, generating a target video, and sending the target video to the client.
2. The method according to claim 1, wherein the identifying and tracking the target object in the video to be processed according to the labeling result and generating the target video comprises:
when the labeling result is qualified, identifying and tracking the target object in the video to be processed after the coordinate conversion and generating a target video;
and when the labeling result is unqualified, performing coordinate conversion on the video to be processed according to coordinate information carried in the labeling result, and identifying and tracking a target object in the video to be processed after the coordinate conversion to generate a target video.
3. The method according to claim 2, wherein the identifying and tracking the target object in the coordinate-converted video to be processed and generating a target video comprises:
identifying a target object in the video to be processed to obtain an action identification image of each frame in the video to be processed;
tracking a target object in the video to be processed to obtain a target tracking image of each frame in the video to be processed;
matching each frame of the action recognition image with each frame of the target tracking image to obtain each frame of the target image;
and connecting the target images of each frame to obtain the target video.
4. The method of claim 3, wherein after matching each frame of the motion recognition image with each frame of the target tracking image to obtain each frame of the target image, further comprising:
calculating a characteristic value of each target object in the target image;
and classifying the target objects according to the characteristic values of the target objects aiming at each target object in the target image to obtain the category corresponding to the target object.
5. A video processing method is applied to a client, and the method comprises the following steps:
sending a video to be processed to a server;
receiving the first frame of image to be processed after the coordinate conversion sent by the server, judging the first frame of image to be processed after the coordinate conversion, generating a labeling result, and sending the labeling result to the server; the first frame of image to be processed is obtained by the server performing coordinate conversion on the video to be processed;
and receiving a target video sent by the server, wherein the target video is generated by the server identifying and tracking a target object in the video to be processed according to the labeling result.
6. A video processing apparatus, applied to a server, the apparatus comprising:
the receiving module is used for receiving the video to be processed uploaded by the client;
the coordinate conversion module is used for carrying out coordinate conversion on the video to be processed and sending the first frame of image to be processed after the coordinate conversion to the client;
and the identification module is used for receiving the marking result uploaded by the client for the first frame of image to be processed, identifying and tracking the target object in the video to be processed according to the marking result to generate a target video, and sending the target video to the client.
7. A video processing apparatus for a client, the apparatus comprising:
the sending module is used for sending the video to be processed to the server;
the annotation receiving module is used for receiving the first frame of image to be processed after the coordinate conversion sent by the server, judging the first frame of image to be processed after the coordinate conversion, generating an annotation result and sending the annotation result to the server; the first frame of image to be processed is obtained by the server performing coordinate conversion on the video to be processed;
and the target video receiving module is used for receiving a target video sent by the server, and the target video is generated by the server identifying and tracking a target object in the video to be processed according to the labeling result.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 4 or 5.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4 or 5.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 4 or 5 when executed by a processor.
CN202210545502.7A 2022-05-19 2022-05-19 Video processing method, video processing device, computer equipment and storage medium Pending CN114926499A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210545502.7A CN114926499A (en) 2022-05-19 2022-05-19 Video processing method, video processing device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210545502.7A CN114926499A (en) 2022-05-19 2022-05-19 Video processing method, video processing device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114926499A 2022-08-19

Family

ID=82808293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210545502.7A Pending CN114926499A (en) 2022-05-19 2022-05-19 Video processing method, video processing device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114926499A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination