CN112101297A - Training data set determination method, behavior analysis method, device, system and medium

Training data set determination method, behavior analysis method, device, system and medium

Info

Publication number
CN112101297A
Authority
CN
China
Prior art keywords
video
behavior
analysis network
data set
network model
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011096945.XA
Other languages
Chinese (zh)
Other versions
CN112101297B (en)
Inventor
童俊艳
赵飞
任烨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202011096945.XA
Publication of CN112101297A
Application granted
Publication of CN112101297B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Abstract

The embodiment of the application discloses a training data set determination method, a behavior analysis method, a device, a system and a medium, belonging to the technical field of deep learning. In the embodiment of the application, a plurality of different initial analysis network models are provided for a user to choose from, and the training data set corresponding to the initial analysis network model selected by the user is determined automatically according to the plurality of video segments and the corresponding annotation information.

Description

Training data set determination method, behavior analysis method, device, system and medium
Technical Field
The embodiment of the application relates to the technical field of deep learning, in particular to a training data set determining method, a behavior analyzing method, a device, a system and a medium.
Background
With the development of deep learning technology, behavior analysis network models for analyzing the behavior of an object in a video have been widely adopted. A behavior analysis network model is usually obtained by training with a training data set and is then deployed on a server or a terminal device (such as an intelligent camera) to identify behaviors present in a video, the locations of those behaviors, and the like. The training data set includes video segments and the annotation information corresponding to each video segment.
In the related art, the training data set is usually determined by having a developer mark the position of a behavior in a video segment on a computer device to obtain the corresponding annotation information, and the video segment and the corresponding annotation information are then used as the training data set.
However, having research and development personnel determine the training data set imposes a high technical threshold, and the customization period is long when a user wants a behavior analysis network model customized for a specific behavior.
Disclosure of Invention
The embodiment of the application provides a training data set determining method, a behavior analysis method, a device, a system and a medium, which can reduce the technical threshold of determining a training data set and shorten the customization period of a customized behavior analysis network model. The technical scheme is as follows:
in one aspect, a training data set determination method is provided, the method comprising:
displaying a video image included in each of a plurality of video segments;
when the annotation operation of one or more behaviors in the video images of the plurality of video segments is detected, determining annotation information corresponding to each video segment in the plurality of video segments;
displaying performance information of a plurality of initial analysis network models with different network structures and/or training parameters, wherein the plurality of initial analysis network models are used for behavior analysis in a video;
and when the model selection operation is detected based on the displayed performance information, determining a training data set corresponding to the initial analysis network model selected by the model selection operation according to the plurality of video segments and the corresponding annotation information.
Optionally, the plurality of initial analysis network models include a plurality of initial image analysis network models with different network structures and/or training parameters, and a plurality of initial video analysis network models with different network structures and/or training parameters;
the determining a training data set corresponding to the initial analysis network model selected by the model selection operation according to the plurality of video segments and the corresponding annotation information includes:
determining an image data set according to the video segments and the corresponding annotation information, and taking the image data set as a training data set corresponding to the initial image analysis network model selected by the model selection operation; and/or
And determining a video data set according to the plurality of video segments and the corresponding annotation information, and taking the video data set as a training data set of the initial video analysis network model selected by the model selection operation.
Optionally, the annotation information includes a correspondence between a behavior tag and a behavior position, where the behavior position includes a frame number and an image region where a behavior occurs, each video segment in the multiple video segments has one or more behavior tags, each behavior tag corresponds to multiple frame numbers, and each frame number corresponds to an image region;
the determining an image data set according to the plurality of video segments and the corresponding annotation information includes:
for a first video segment in the plurality of video segments, extracting part or all of video images labeled in the first video segment to obtain a plurality of first video images, wherein the first video segment is one of the plurality of video segments;
acquiring a behavior label and a behavior position corresponding to the frame number of each first video image in the plurality of first video images from the corresponding relation, and taking the behavior labels and the behavior positions as the corresponding annotation information of the corresponding first video images;
and determining the video images extracted from the plurality of video segments and the corresponding annotation information as the image data set.
Optionally, the annotation information includes a correspondence between a behavior tag and a behavior position, where the behavior position includes a frame number and an image region where a behavior occurs, each video segment in the multiple video segments has one or more behavior tags, each behavior tag corresponds to multiple frame numbers, each frame number corresponds to an image region, and the multiple frame numbers include a starting frame number and an ending frame number;
determining a video data set according to the plurality of video segments and the corresponding annotation information, including:
for a first video segment in the video segments, extracting a video segment between a starting frame number and an ending frame number corresponding to each behavior tag of the first video segment to obtain one or more first sub-video segments, wherein the first video segment is one of the video segments;
acquiring a behavior position corresponding to the behavior label of each first sub-video segment from the corresponding relation, and taking the behavior label and the corresponding behavior position of each first sub-video segment as corresponding annotation information corresponding to the first sub-video segment;
and determining the sub-video segments extracted from the plurality of video segments and the corresponding annotation information as the video data set.
Optionally, after determining, according to the plurality of video segments and the corresponding annotation information, a training data set corresponding to the initial analysis network model selected by the model selection operation, the method further includes:
training the initial image analysis network model selected by the model selection operation according to the image data set to obtain an image behavior analysis network model; and/or,
and training the initial video analysis network model selected by the model selection operation according to the video data set to obtain a video behavior analysis network model.
Optionally, before training the initial analysis network model selected by the model selection operation, the method further includes:
displaying adjustment indication information of the network structure and/or training parameters of the initial analysis network model selected by the model selection operation;
and when the adjustment operation is detected based on the adjustment indication information, adjusting the network structure and/or training parameters of the initial analysis network model selected by the model selection operation according to the adjustment operation.
Optionally, after training the initial analysis network model selected by the model selection operation, the method further includes:
displaying model test prompt information;
when a test confirmation instruction is detected based on the model test prompt information, testing the trained behavior analysis network model according to a test data set to obtain a test result;
displaying the test result;
and when a training adjustment instruction is detected based on the test result, retraining the initial analysis network model selected by the model selection operation according to the training adjustment instruction.
Optionally, the method further comprises:
displaying model release prompt information;
and when a model release instruction is detected based on the model release prompt information, deploying the trained behavior analysis network model on analysis equipment, wherein the analysis equipment is a server and/or terminal equipment.
In another aspect, a method of behavior analysis in a video is provided, the method comprising:
acquiring a target video segment to be subjected to behavior analysis;
analyzing the behaviors in the target video segment through a behavior analysis network model to obtain a behavior analysis result;
the behavior analysis network model is obtained by selecting an initial analysis network model from a plurality of initial analysis network models with different network structures and/or training parameters and then training through a training data set, wherein the training data set is determined by labeling behaviors in video images of a plurality of video segments by a user.
Optionally, the behavior analysis network model includes an image behavior analysis network model and a video behavior analysis network model;
the analyzing the behavior in the target video segment through the behavior analysis network model to obtain a behavior analysis result includes:
analyzing the behaviors in the target video segment through the image behavior analysis network model to obtain one or more candidate frame numbers;
determining one or more second sub-video segments according to the target video segment and the one or more candidate frame numbers;
and analyzing the behaviors in the one or more second sub-video segments through the video behavior analysis network model to obtain the behavior analysis result.
Optionally, the determining one or more second sub-video segments according to the target video segment and the one or more candidate frame numbers includes:
and for a first candidate frame number in the one or more candidate frame numbers, extracting a video segment of a reference frame number or a reference time length which is continuous from a video image corresponding to the first candidate frame number in the target video segment to obtain a second sub-video segment, wherein the first candidate frame number is one of the one or more candidate frame numbers.
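To make the two-stage analysis above concrete, the following Python sketch shows how candidate frame numbers produced by an image model could drive the extraction of fixed-length sub-video segments that are then passed to a video model. The objects image_model and video_model, their method names, the reference length, and the use of OpenCV are illustrative assumptions and not part of the claimed method.

```python
import cv2  # assumed dependency for reading video frames


def analyze_video(path, image_model, video_model, ref_len=32):
    """Two-stage sketch: the image model proposes candidate frame numbers,
    then a sub-segment of `ref_len` consecutive frames starting at each
    candidate is analyzed by the video model. Both model objects are
    hypothetical placeholders."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    # Stage 1: per-frame analysis yields candidate frame numbers.
    candidates = [i for i, f in enumerate(frames) if image_model.detects_behavior(f)]

    # Stage 2: extract the sub-segment that continues from each candidate
    # frame and run the video model on it.
    results = []
    for start in candidates:
        sub_segment = frames[start:start + ref_len]
        results.append(video_model.analyze(sub_segment))
    return results
```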
In another aspect, a training data set determination apparatus is provided, the apparatus comprising:
a first display module for displaying a video image included in each of a plurality of video segments;
the first determination module is used for determining annotation information corresponding to each video segment in the plurality of video segments when the annotation operation of one or more behaviors in the video images of the plurality of video segments is detected;
the second display module is used for displaying the performance information of a plurality of initial analysis network models with different network structures and/or training parameters, and the plurality of initial analysis network models are all used for behavior analysis in the video;
and the second determining module is used for determining a training data set corresponding to the initial analysis network model selected by the model selecting operation according to the plurality of video segments and the corresponding annotation information when the model selecting operation is detected based on the displayed performance information.
Optionally, the plurality of initial analysis network models include a plurality of initial image analysis network models with different network structures and/or training parameters, and a plurality of initial video analysis network models with different network structures and/or training parameters;
the second determining module includes:
a first determining unit, configured to determine an image data set according to the multiple video segments and corresponding annotation information, and use the image data set as a training data set corresponding to the initial image analysis network model selected by the model selection operation; and/or
And the second determining unit is used for determining a video data set according to the plurality of video segments and the corresponding annotation information, and taking the video data set as a training data set of the initial video analysis network model selected by the model selecting operation.
Optionally, the annotation information includes a correspondence between a behavior tag and a behavior position, where the behavior position includes a frame number and an image region where a behavior occurs, each video segment in the multiple video segments has one or more behavior tags, each behavior tag corresponds to multiple frame numbers, and each frame number corresponds to an image region;
the first determination unit includes:
the first extraction subunit is configured to, for a first video segment in the multiple video segments, extract a part or all of video images labeled in the first video segment to obtain multiple first video images, where the first video segment is one of the multiple video segments;
a first obtaining subunit, configured to obtain, from the correspondence, a behavior tag and a behavior position corresponding to a frame number of each of the plurality of first video images, as corresponding annotation information corresponding to the first video image;
a first determining subunit, configured to determine, as the image data set, the video image extracted from the multiple video segments and the corresponding annotation information.
Optionally, the annotation information includes a correspondence between a behavior tag and a behavior position, where the behavior position includes a frame number and an image region where a behavior occurs, each video segment in the multiple video segments has one or more behavior tags, each behavior tag corresponds to multiple frame numbers, each frame number corresponds to an image region, and the multiple frame numbers include a starting frame number and an ending frame number;
the second determination unit includes:
a second extracting sub-unit, configured to extract, for a first video segment in the multiple video segments, a video segment between a starting frame number and an ending frame number corresponding to each behavior tag of the first video segment, so as to obtain one or more first sub-video segments, where the first video segment is one of the multiple video segments;
a second obtaining subunit, configured to obtain, from the correspondence, a behavior position corresponding to the behavior tag of each first sub-video segment, and use the behavior tag and the corresponding behavior position of each first sub-video segment as corresponding annotation information corresponding to the first sub-video segment;
a second determining subunit, configured to determine, as the video data set, a sub-video segment extracted from the video segments and corresponding annotation information.
Optionally, the apparatus further comprises:
the first training module is used for training the initial image analysis network model selected by the model selection operation according to the image data set to obtain an image behavior analysis network model; and/or,
and the second training module is used for training the initial video analysis network model selected by the model selection operation according to the video data set to obtain a video behavior analysis network model.
Optionally, the apparatus further comprises:
a third display module, configured to display adjustment indication information of the network structure and/or the training parameters of the initial analysis network model selected by the model selection operation;
and the adjusting module is used for adjusting the network structure and/or the training parameters of the initial analysis network model selected by the model selecting operation according to the adjusting operation when the adjusting operation is detected based on the adjusting indication information.
Optionally, the apparatus further comprises:
the fourth display module is used for displaying model test prompt information;
the test module is used for testing the trained behavior analysis network model according to the test data set to obtain a test result when a test confirmation instruction is detected based on the model test prompt information;
the fifth display module is used for displaying the test result;
and the third training module is used for retraining the initial analysis network model selected by the model selection operation according to the training adjustment instruction when the training adjustment instruction is detected based on the test result.
Optionally, the apparatus further comprises:
the sixth display module is used for displaying the model release prompt information;
and the deployment module is used for deploying the trained behavior analysis network model on analysis equipment when a model release instruction is detected based on the model release prompt information, wherein the analysis equipment is a server and/or terminal equipment.
In another aspect, an apparatus for analyzing behavior in a video is provided, the apparatus including:
the acquisition module is used for acquiring a target video segment to be subjected to behavior analysis;
the analysis module is used for analyzing the behaviors in the target video segment through a behavior analysis network model to obtain a behavior analysis result;
the behavior analysis network model is obtained by selecting an initial analysis network model from a plurality of initial analysis network models with different network structures and/or training parameters and then training through a training data set, wherein the training data set is determined by labeling behaviors in video images of a plurality of video segments by a user.
Optionally, the behavior analysis network model includes an image behavior analysis network model and a video behavior analysis network model;
the analysis module includes:
the first analysis unit is used for analyzing the behaviors in the target video segment through the image behavior analysis network model to obtain one or more candidate frame numbers;
a first determining unit, configured to determine one or more second sub-video segments according to the target video segment and the one or more candidate frame numbers;
and the second analysis unit is used for analyzing the behaviors in the one or more second sub-video segments through the video behavior analysis network model to obtain the behavior analysis result.
Optionally, the first determining unit includes:
and the extracting subunit is configured to, for a first candidate frame number in the one or more candidate frame numbers, extract a video segment of a reference frame number or a reference duration that continues from a video image corresponding to the first candidate frame number in the target video segment to obtain a second sub-video segment, where the first candidate frame number is one of the one or more candidate frame numbers.
In another aspect, a cloud platform system is provided, where the system includes a user device and a server, and the cloud platform system implements the steps of the training data set determination method through the user device and the server.
In another aspect, a computer device is provided, where the computer device includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus, the memory is used to store a computer program, and the processor is used to execute the program stored in the memory to implement the steps of the training data set determination method or implement the steps of the behavior analysis method in the video.
In another aspect, a computer-readable storage medium is provided, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned training data set determination method or the steps of the above-mentioned behavior analysis method in a video.
In another aspect, a computer program product is provided comprising instructions which, when run on a computer, cause the computer to perform the steps of the training data set determination method described above, or to implement the steps of the behavior analysis method in video described above.
The technical scheme provided by the embodiment of the application can at least bring the following beneficial effects:
in the embodiment of the application, a plurality of different initial analysis network models are provided for a user to choose from, and the training data set corresponding to the initial analysis network model selected by the user is determined automatically according to the plurality of video segments and the corresponding annotation information. In this way, a user without deep-learning expertise can obtain a training data set through simple operations, which reduces the technical threshold of determining a training data set and shortens the customization period of a customized behavior analysis network model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a flowchart of a training data set determination method provided in an embodiment of the present application;
fig. 2 is a flowchart of a training data set determination method provided in an embodiment of the present application;
fig. 3 is a basic usage diagram of a cloud platform system provided in an embodiment of the present application;
fig. 4 is an overall flowchart of a user operating a cloud platform system according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a model training process provided by an embodiment of the present application;
fig. 6 is a flowchart of a behavior analysis method in a video according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a training data set determination apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a behavior analysis apparatus in a video according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.
With the development of deep learning technology, behavior analysis network models for analyzing the target behavior of an object in a video have been widely adopted. However, the training data set used to train a behavior analysis network model is usually determined by research and development personnel, and the network model construction and adjustment, training parameter setting and adjustment, and other work in the subsequent training process are also completed by research and development personnel with deep learning experience. With the scheme provided by the embodiment of the application, an ordinary user can determine the training data set through simple operations and then quickly obtain a behavior analysis network model for analyzing a specific behavior in a video.
Next, a system architecture related to a training data set determination method provided in an embodiment of the present application is described.
Fig. 1 is a diagram of a system architecture involved in the training data set determination method provided in an embodiment of the present application. Referring to fig. 1, the system architecture includes a user equipment 101 and a server 102. The user equipment 101 is connected to the server 102 in a wired or wireless manner for communication.
The training data set determination method provided by the embodiment of the application is implemented through a cloud platform system. The user equipment 101 serves as the front-end equipment of the cloud platform system, and the server 102 serves as the back-end equipment of the cloud platform system. The user equipment 101 is used for logging in to the cloud platform system through a web page or a client according to a detected user operation, and interaction between the user and the cloud platform system is achieved through the interaction interface of the user equipment 101, so that cloud services are provided for the user. The cloud services include determining a training data set, that is, the user equipment 101 and the server 102 are used for determining the training data set according to the training data set determination method provided by the embodiment of the application.
The cloud platform system displays video images included in the video segments through the user equipment 101, determines annotation information corresponding to the video segments according to detected annotation operation related to the video images, and stores the video segments and the corresponding annotation information through the server. The cloud platform system can also display a plurality of analysis network models through the user equipment 101, select an analysis network model by a user, determine a training data set corresponding to the analysis network model selected by the user according to the stored video segment and the corresponding annotation information so as to train the corresponding analysis network model, and then perform behavior analysis in the video through the trained behavior analysis network model.
Optionally, the cloud service provided by the cloud platform system further includes model training, that is, training a corresponding analysis network model according to the determined training data set, and deploying the trained behavior analysis network model on the analysis device. For example, deployed on server 102 or other servers to provide an online cloud service for video analytics. For another example, the cloud platform system provides a model downloading service, the user equipment 101 may download the behavior analysis network model from the server 102 to the local, and then deploy the behavior analysis network model on terminal equipment, such as an intelligent camera, through the user equipment 101. For another example, the cloud platform system directly sends the behavior analysis network model to the terminal device through the server 102, so as to be deployed on the terminal device.
Optionally, the user equipment 101 is configured to store the video segment locally and upload the locally stored video segment to the cloud platform system for storage in the server 102, that is, upload the video segment in an offline manner.
Optionally, the system architecture further comprises a video capture device 103, the video capture device 103 being configured to capture a video segment. The video collecting device 103 is connected with the user equipment 101 in a wired or wireless manner to send the collected video segment to the user equipment 101, the user equipment 101 displays a video image included in the video segment, behaviors in the video image are labeled through user operation, and the user equipment 101 is further configured to upload the received video segment to the server 102, that is, to upload the video segment in an offline manner. Or, the video capture device 103 is connected to the server 102 in a wired or wireless manner to send the captured video segment to the server 102, that is, upload the video segment in an online manner, for example, the video capture device 103 is a network camera, and the network camera uploads the captured video segment in real time through the internet.
Optionally, the video capture device 103 is the same device as the terminal device, or a different device.
In this embodiment, the user equipment 101 is a desktop computer, a notebook computer, a tablet computer, or a smart phone. The server 102 is a single server, a server cluster formed by a plurality of servers, or a cloud computing service center. The video capture device 103 is an IPC (Internet Protocol Camera), an NVR (Network Video Recorder), or the like.
Next, a detailed explanation is given of the training data set determination method provided in the embodiment of the present application.
Fig. 2 is a flowchart of a training data set determination method provided in an embodiment of the present application, where the method is applied to a cloud platform system. Referring to fig. 2, the method includes the following steps.
Step 201: displaying the video images included in each of the plurality of video segments.
As can be seen from the foregoing, the method for determining the training data set provided in the embodiment of the present application is implemented by a cloud platform system, a user device serves as a front-end device of the cloud platform system, a server serves as a back-end device of the cloud platform system, the cloud platform system provides a login entry in the form of a web page or a client, and the user device logs in the cloud platform system through the login entry in the form of a web page or a client.
In the embodiment of the application, when the annotation request is detected by the user equipment, the video image included in each of the plurality of video segments is displayed on the user equipment.
Illustratively, the video images included in each video segment are displayed on the user equipment frame by frame, and after one frame has been displayed and labeled, the next frame is displayed automatically. Optionally, two options, 'previous frame' and 'next frame', may be displayed on the user equipment; the previous video image frame is displayed when a selection operation on 'previous frame' is detected, and the next video image frame is displayed when a selection operation on 'next frame' is detected. Optionally, two options, 'previous video' and 'next video', may be displayed on the user equipment; when a selection operation on 'previous video' is detected, the first video image frame of the previous video segment, or the frame that was last operated on in that segment, is displayed, and when a selection operation on 'next video' is detected, the first video image frame of the next video segment, or the frame that was last operated on in that segment, is displayed.
Optionally, a plurality of video segments are stored in the user equipment, for example, the user equipment receives and stores a plurality of video segments sent by the video capture device locally, and when an annotation request is detected by the user equipment, a video image included in each of the plurality of video segments is displayed on the user equipment. Or, a plurality of video segments are stored in the server, for example, the server receives and stores a plurality of video segments sent by the user equipment or the video capture device, and when the annotation request is detected by the user equipment, the user equipment caches the plurality of video segments stored in the server to the local, and displays the video images included in the plurality of video segments, that is, temporarily stores the plurality of video segments.
Optionally, in this embodiment of the application, before the user equipment displays the video images included in each of the plurality of video segments, when the uploading instruction is detected, the server acquires the plurality of video segments, that is, stores the plurality of video segments in the server. Illustratively, an upload key is displayed on the user device, and the server starts to acquire the plurality of video segments when a click operation on the upload key is detected.
Optionally, the plurality of video segments stored locally in the user equipment are uploaded to the server, i.e. in an offline manner. Or, the video acquisition device uploads the acquired video segments to the server in real time, that is, the video segments are uploaded in an online manner, for example, by IPC, NVR, and the like through the internet. Or, a part of video segments are uploaded by the user equipment, and another part of video segments are uploaded by the video acquisition equipment, namely, the offline mode and the online mode are combined.
Optionally, a quantity threshold and/or a data amount threshold are configured in the server. When the total number of acquired video segments (counted in segments) reaches the quantity threshold and/or their total size (for example, in megabytes) reaches the data amount threshold, the server treats all the acquired video segments as the plurality of video segments and stops acquiring further video segments, or the server may also continue to acquire video segments.
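As a rough illustration of this stop condition only, the following sketch checks a segment count and a total size against configured thresholds; the threshold values and the per-segment size field are assumptions, not values given by the patent.

```python
def acquisition_complete(segments, count_threshold=100, size_threshold_mb=1024):
    """Stop acquiring once the number of collected segments and/or their total
    size in megabytes reaches the configured thresholds. Each segment is
    assumed to be a dict carrying its file size in bytes."""
    total_mb = sum(seg["size_bytes"] for seg in segments) / (1024 * 1024)
    return len(segments) >= count_threshold or total_mb >= size_threshold_mb
```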
Optionally, the same user may have multiple types of behavior analysis requirements, and different users may also have different types of behavior analysis requirements, where each type of behavior analysis requirement corresponds to one or more target behaviors. The cloud platform system can provide a service for each of the multiple types of behavior analysis requirements of the same user, and can also provide corresponding services for different users. On this basis, to facilitate management and maintenance of the cloud platform system, a user may create an analysis task for each behavior analysis requirement through the user equipment, for example, create a task name, a task detail description, a task publisher, a user, and the like for each analysis task, and submit the created analysis task to the server of the cloud platform system through the user equipment. The server stores the analysis task, executes the corresponding analysis task, and may subsequently feed back the execution status of the task to the corresponding user.
It should be noted that, when the instruction to create a task is detected by the user equipment, the server starts to acquire the plurality of video segments, that is, starts to execute the task.
Step 202: when the annotation operation of one or more behaviors in the video images of the plurality of video segments is detected, the annotation information corresponding to each video segment in the plurality of video segments is determined.
In this embodiment of the application, the cloud platform system provides an annotation function, for example, an annotation client or an annotation module, and a user can annotate a behavior existing in a video image included in a displayed video segment by operating a user device, and when an annotation operation of one or more behaviors in the video image of the video segment is detected, annotation information corresponding to each of the plurality of video segments is determined.
Optionally, the user annotates a behavior in the video image of the video segment in an offline manner, for example, the user annotates a plurality of locally stored video segments through an installed annotation client or an annotation module provided in a web page manner, and uploads annotation information corresponding to the plurality of video segments to the cloud platform system after all the plurality of video segments are annotated. Or, the user annotates the behavior existing in the video image of the video segment in an online manner, for example, the user equipment caches the video segments stored in the server to the local one by one through the annotation client or the annotation module, and when the annotation operation on each video segment is detected, determines the annotation information of the corresponding video segment and uploads the annotation information to the server until the plurality of video segments are annotated one by one and uploaded.
Illustratively, the user can mark a behavior position in a video image by box selection and mark the behavior tag of the behavior, where the shape of the box can be any polygon (such as a rectangle or triangle), a circle, or the like. For one of the video segments, the user selects the starting frame in which a behavior occurs, marks the image area in which the behavior occurs in that starting frame by box selection, and continues marking the image area of the behavior from the starting frame onward, either frame by frame or at intervals of several frames, until the ending frame of the behavior, that is, the video image in which the behavior ends, is marked. At this point one behavior in the video segment has been labeled. If multiple behaviors exist in one video image, each behavior is labeled according to this method, and if multiple behaviors exist in one video segment, each behavior is likewise labeled according to this method. After labeling, a labeled video segment corresponds to one or more behavior tags, each behavior tag corresponds to a plurality of frame numbers, and each of the plurality of frame numbers corresponds to an image area.
In this embodiment of the present application, after one behavior in one video segment is annotated, or after all behaviors in the plurality of video segments are annotated, the cloud platform system automatically generates an annotation file (which may also be referred to as a ground-truth file); optionally, it generates a corresponding annotation file for each video segment, or generates one annotation file for the plurality of video segments.
Table 1 shows a format of an annotation file provided in this embodiment of the present application. Referring to table 1, one annotation file is generated for the multiple video segments: the name of each video segment is the video identifier corresponding to that segment, such as 1.mp4 or 2.mp4; the behavior tags are behavior identifiers, such as behavior 1 or behavior 2; the frame numbers of behavior occurrences are, for example, n1, n2, n3, and so on; and the image areas in which behaviors occur are identified by multiple coordinates, distances, or the like.
Optionally, the relative time of a video image within the video segment is taken as its frame number, for example '00:12:45'; or the sequence number of the video image in the video segment is taken as the frame number, for example, if a video image is the 3rd frame in the video segment, its sequence number is 3 and its frame number is 3; or the frame number is defined in another manner. It should be noted that when relative time or sequence number is used as the frame number, the frame number can represent temporal order.
Optionally, a plurality of frame numbers corresponding to a behavior tag of a video segment are arranged according to a time sequence of a corresponding video image, and then a first frame number is a starting frame number and a last frame number is an ending frame number, or the plurality of frame numbers can be arranged in any sequence under the condition that the frame numbers can represent time sequence. Illustratively, assuming that relative time is taken as the frame number, the frame numbers of behavior occurrences corresponding to behavior 1 in 1.mp4 include n1, n2, and n3, and assuming that the frame numbers are sorted in time order, then n1 is the start frame and n3 is the end frame.
Alternatively, if an image area is marked by a rectangular box, the image area is represented by the coordinates of the four vertices of the rectangle; if an image area is marked by a triangular box, it is represented by the coordinates of the three vertices of the triangle; and if an image area is marked by a circular box, it is represented by the coordinates of the circle's center and its radius. Table 1 only takes a polygonal box as an example.
TABLE 1
[Table 1 is filed as an image in the original document. It lists, for each video segment name (e.g. 1.mp4, 2.mp4), the behavior tags, the frame numbers at which each behavior occurs, and the image area of the behavior in each frame.]
In the embodiment of the application, the annotation information is stored in the server in the form of an annotation file, or stored in the server in another form, the annotation information includes a correspondence between a behavior tag and a behavior position, the behavior position includes a frame number and an image region where a behavior occurs, each video segment in the plurality of video segments has one or more behavior tags, each behavior tag corresponds to a plurality of frame numbers, each frame number corresponds to an image region, and the plurality of frame numbers include a start frame number and an end frame number.
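For illustration only, the annotation information described above and in Table 1 could be represented by a structure like the following. The field layout, frame numbers, and coordinates are invented placeholders rather than the format actually used by the cloud platform system.

```python
# Illustrative shape of the annotation ("ground-truth") file; the keys,
# coordinates, and frame numbers below are made-up placeholders.
annotation_file = {
    "1.mp4": {                                # video segment identifier
        "behavior 1": {                       # behavior tag
            12: [100, 80, 220, 200],          # frame number -> image region (rectangle x1, y1, x2, y2)
            13: [102, 82, 224, 204],
            14: [105, 85, 230, 210],          # earliest/latest frame numbers serve as start/end frames
        },
        "behavior 2": {
            14: [300, 150, 380, 260],
            15: [302, 152, 384, 262],
        },
    },
}
```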
Step 203: and displaying the performance information of a plurality of initial analysis network models with different network structures and/or training parameters, wherein the plurality of initial analysis network models are all used for behavior analysis in the video.
In this embodiment, the cloud platform system can further display, by the user equipment, performance information of a plurality of initial analysis network models, the plurality of initial analysis network models being used for behavior analysis in the video, and the plurality of initial analysis network models having different network structures and/or different training parameters. That is, the cloud platform system provides a user with a plurality of models to select, each model has different performance, and the user can select an appropriate model according to the requirement.
Optionally, the plurality of initial analysis network models include a plurality of initial image analysis network models with different network structures and/or training parameters and a plurality of initial video analysis network models with different network structures and/or training parameters. That is, the embodiment of the present application provides two types of initial models, initial image analysis network models and initial video analysis network models, and for each type there are multiple models with different network structures and/or training parameters.
The performance information displayed on the user equipment includes model description information, efficiency information, precision information, and the like, and the user can select an initial image analysis network model and/or an initial video analysis network model according to requirements. Illustratively, assume that 3 initial image analysis network models and 3 initial video analysis network models are provided, that is, 6 options are displayed. The performance information of the 3 initial image analysis network models is 'high-efficiency, low-precision image analysis mode', 'higher-efficiency, higher-precision image analysis mode', and 'low-efficiency, high-precision image analysis mode', respectively, and the performance information of the 3 initial video analysis network models is 'high-efficiency, low-precision video analysis mode', 'higher-efficiency, higher-precision video analysis mode', and 'low-efficiency, high-precision video analysis mode', respectively. In addition, the user can be prompted that either a single selection or a multiple selection is possible: selecting only one initial image analysis network model corresponds to 'fast analysis speed, low accuracy', selecting only one initial video analysis network model corresponds to 'slow analysis speed, high accuracy', and selecting both an initial image analysis network model and an initial video analysis network model corresponds to 'fast analysis speed, high accuracy'. In this example, there are 6 possible single selections and 9 possible multiple selections, that is, 15 choices in total for the user.
Optionally, three analysis mode options, namely an image analysis mode, a video analysis mode, and an image-plus-video analysis mode, are displayed on the user equipment, with corresponding performance information of 'fast analysis speed, low accuracy', 'slow analysis speed, higher accuracy', and 'fairly fast analysis speed, high accuracy', respectively. After the user selects one of the analysis mode options, the user equipment displays the prompt information of the corresponding plurality of initial image analysis network models and/or plurality of initial video analysis network models. For example, still assuming 3 initial image analysis network models and 3 initial video analysis network models are provided: after the user selects the image analysis mode, the prompt information of the corresponding 3 initial image analysis network models is displayed; after the user selects the video analysis mode, the prompt information of the corresponding 3 initial video analysis network models is displayed; and after the user selects the image-plus-video analysis mode, the prompt information of the corresponding 9 combinations is displayed.
Step 204: and when the model selection operation is detected based on the displayed performance information, determining a training data set corresponding to the initial analysis network model selected by the model selection operation according to the plurality of video segments and the corresponding annotation information.
In this embodiment of the application, as can be seen from the foregoing, a user may select a model based on displayed performance information, and when a model selection operation is detected based on the displayed performance information, a server of the cloud platform system determines, according to the plurality of video segments and corresponding annotation information, a training data set corresponding to an initial analysis network model selected by the model selection operation.
As can be seen from the foregoing description, the plurality of initial analysis network models include a plurality of initial image analysis network models with different network structures and/or training parameters and a plurality of initial video analysis network models with different network structures and/or training parameters, and the initial analysis network model selected by the user's model selection operation includes one initial image analysis network model and/or one initial video analysis network model. Based on this, the server determines an image data set according to the plurality of video segments and the corresponding annotation information and uses the image data set as the training data set corresponding to the initial image analysis network model selected by the model selection operation, and/or determines a video data set according to the plurality of video segments and the corresponding annotation information and uses the video data set as the training data set of the initial video analysis network model selected by the model selection operation.
That is, in the case where the initial analysis network model selected by the model selection operation includes an initial image analysis network model, the training data set includes the image data set. In the case where the initial analysis network model selected by the model selection operation includes an initial video analysis network model, the training data set includes the video data set. In the case where the initial analysis network model selected by the model selection operation includes both an initial image analysis network model and an initial video analysis network model, the training data set includes both the image data set and the video data set.
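As a minimal sketch of this selection logic (not the patent's implementation), the following function assembles the training data set from the data sets implied by the model selection operation; build_image_dataset and build_video_dataset are the helpers sketched in the two examples further below.

```python
def build_training_data(selection, video_segments, annotations):
    """Return the data sets implied by the model selection operation.
    `selection` is assumed to be a set such as {"image"}, {"video"},
    or {"image", "video"}."""
    training_data = {}
    if "image" in selection:
        training_data["image_dataset"] = build_image_dataset(video_segments, annotations)
    if "video" in selection:
        training_data["video_dataset"] = build_video_dataset(video_segments, annotations)
    return training_data
```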
Next, a method of determining an image data set from the plurality of video segments and the corresponding annotation information will be described.
As can be seen from the foregoing, the annotation information of the video segments includes a corresponding relationship between a behavior tag and a behavior position, where the behavior position includes a frame number and an image area where a behavior occurs, each video segment in the multiple video segments has one or more behavior tags, each behavior tag corresponds to multiple frame numbers, and each frame number corresponds to an image area.
Based on this, for a first video segment in the plurality of video segments (the first video segment being any one of the plurality of video segments), the server extracts part or all of the labeled video images in the first video segment to obtain a plurality of first video images. From the correspondence included in the annotation information, the server then obtains the behavior tag and behavior position corresponding to the frame number of each of the plurality of first video images and uses them as the annotation information of that first video image. Finally, the server determines the video images extracted from the plurality of video segments, together with their corresponding annotation information, as the image data set.
Illustratively, assume that the server extracts all labeled video images. Taking 1.mp4 in table 1 as an example, assume the behavior tags of 1.mp4 include behavior 1 and behavior 2, where the frame numbers corresponding to behavior 1 are N1, N2, and N3, and the frame numbers corresponding to behavior 2 are N3, N4, and N5. The server extracts the video images corresponding to N1, N2, N3, N4, and N5 from the video segment 1.mp4 to obtain 5 video images. Behavior 1 and the image area labeled at N1 are taken as the annotation information of N1, behavior 1 and the image area labeled at N2 as the annotation information of N2, behaviors 1 and 2 together with their respective image areas labeled at N3 as the annotation information of N3, behavior 2 and (x9, y9, x10, y10, …) as the annotation information of N4, and behavior 2 and (x11, y11, x12, y12, …) as the annotation information of N5.
In the embodiment of the present application, the server extracts part or all of the labeled video images in a video segment. When the server extracts all of the labeled video images in the video segment, the data amount of the resulting image data set is sufficient. When the server extracts only part of the labeled video images in the first video segment, the server either selects the labeled video images to extract at random, or extracts labeled video images at fixed frame intervals.
As can be seen from the above, the server extracts all or part of the video images in the video segment where the behavior occurs, and uses the extracted video images and the corresponding annotation information as an image training set for training the model.
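The following Python sketch illustrates one way the image data set could be assembled as described above, assuming OpenCV for frame extraction and the illustrative annotation structure shown after Table 1; it is a sketch under those assumptions, not the patent's actual implementation.

```python
import cv2  # assumed dependency for frame extraction


def build_image_dataset(video_paths, annotations):
    """For each video segment, extract the annotated frames and pair each
    extracted frame with the behavior tags and image regions recorded for
    its frame number. `annotations` is assumed to follow the illustrative
    structure shown after Table 1: {video_name: {behavior_tag: {frame_no: region}}}."""
    image_dataset = []
    for path in video_paths:
        per_video = annotations[path]
        # Collect every annotated frame number across all behavior tags.
        annotated_frames = sorted({fn for regions in per_video.values() for fn in regions})
        cap = cv2.VideoCapture(path)
        for frame_no in annotated_frames:
            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no)
            ok, frame = cap.read()
            if not ok:
                continue
            # Gather (behavior tag, image region) pairs for this frame number.
            labels = [(tag, regions[frame_no])
                      for tag, regions in per_video.items() if frame_no in regions]
            image_dataset.append((frame, labels))
        cap.release()
    return image_dataset
```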
A method for determining a video data set based on the plurality of video segments and the corresponding annotation information is described.
As can be seen from the foregoing, the annotation information of the video segments includes a corresponding relationship between a behavior tag and a behavior position, the behavior position includes a frame number and an image region where a behavior occurs, each video segment in the multiple video segments has one or more behavior tags, each behavior tag corresponds to multiple frame numbers, each frame number corresponds to an image region, and the multiple frame numbers include a starting frame number and an ending frame number.
Based on this, for a first video segment in the plurality of video segments (the first video segment being any one of the plurality of video segments), the server extracts, for each behavior tag of the first video segment, the video segment between the starting frame number and the ending frame number corresponding to that behavior tag, thereby obtaining one or more first sub-video segments. From the correspondence included in the annotation information, the server then obtains the behavior position corresponding to the behavior tag of each first sub-video segment and uses the behavior tag of each first sub-video segment together with its corresponding behavior position as the annotation information of that first sub-video segment. Finally, the server determines the sub-video segments extracted from the plurality of video segments, together with their corresponding annotation information, as the video data set.
Illustratively, taking 1.mp4 in table 1 as an example, assume that the behavior tags of 1.mp4 include behavior 1 and behavior 2, the starting frame number and the ending frame number corresponding to behavior 1 are n1 and n3, respectively, and the starting frame number and the ending frame number corresponding to behavior 2 are n3 and n5, respectively. The server extracts the video segment between n1 and n3 (including n1 and n3) from 1.mp4 to obtain a sub-video segment M1, and extracts the video segment between n3 and n5 (including n3 and n5) from 1.mp4 to obtain a sub-video segment M2. Behavior 1 and n1(x1, y1, x2, y2, …), n2(x3, y3, x4, y4, …), n3(x5, y5, x6, y6, …) are taken as the annotation information corresponding to M1, and behavior 2 and n3(x7, y7, x8, y8, …), n4(x9, y9, x10, y10, …), n5(x11, y11, x12, y12, …) are taken as the annotation information corresponding to M2.
Optionally, when the frame number is a relative time, the server subtracts the starting frame number from each frame number corresponding to each behavior tag of the first video segment to obtain the frame number corresponding to the corresponding sub-video segment. When the frame number is a sequence number, the server subtracts the starting frame number from each frame number corresponding to each behavior tag of the first video segment and adds 1, so as to obtain the frame number corresponding to the corresponding sub-video segment.
According to the above, the server extracts the sub-video segment of each behavior occurrence in each video segment, and uses the extracted sub-video segment and the corresponding annotation information as the video data set for training the model.
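Illustratively, the following Python-style sketch shows one possible way for the server to extract the sub-video segments and re-index the frame numbers, in line with the description above. The function names and the assumption that each video segment is already available as a list of frames indexed by frame number are illustration-only assumptions.

def build_video_data_set(frames_by_segment, annotations):
    # frames_by_segment[name] is assumed to be a list of frames whose list index
    # equals the frame number of that frame in the video segment
    video_data_set = []  # list of (sub_video_segment, annotation_info) pairs
    for name, frames in frames_by_segment.items():
        for tag, positions in annotations[name].items():
            frame_nos = [frame_no for frame_no, _ in positions]
            start, end = min(frame_nos), max(frame_nos)  # starting / ending frame numbers
            sub_segment = frames[start:end + 1]
            # re-index in sequence-number style: new frame number = old frame number - start + 1
            relabeled = [(frame_no - start + 1, region) for frame_no, region in positions]
            video_data_set.append((sub_segment, {tag: relabeled}))
    return video_data_set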
Optionally, step 203 may be performed before step 201, and the model selection operation in step 204 may also be detected before step 201. That is, the performance information of the plurality of initial analysis network models is displayed first, the model selection operation of the user is detected and the selected initial analysis network model is determined; then the video segments are displayed, the annotation operation of the user is detected and the annotation information is determined; finally, the training data set is determined according to the video segments and the corresponding annotation information. In other words, fig. 2 is only an exemplary illustration and does not limit the embodiments of the present application.
Optionally, the cloud platform system is further provided with a default configuration. When no model selection operation of the user is detected, the cloud platform system determines an initial analysis network model according to the default configuration, and determines the corresponding training data set according to the plurality of video segments and the corresponding annotation information. For example, the default configuration is to select the initial image analysis network model with higher efficiency and higher accuracy and the initial video analysis network model with higher efficiency and higher accuracy.
In the embodiment of the present application, the cloud platform system can further train the selected initial analysis network model according to the determined training data set to obtain a behavior analysis network model for behavior analysis in a video. This is introduced next.
As can be seen from the foregoing, the initial analysis network model selected by the model selection operation includes an initial image analysis network model and/or an initial video analysis network model, and accordingly, the training data set includes an image data set and/or a video data set. Based on this, the server trains the initial image analysis network model selected by the model selection operation according to the image data set to obtain an image behavior analysis network model, and/or trains the initial video analysis network model selected by the model selection operation according to the video data set to obtain a video behavior analysis network model.
That is, under the condition that the user selects an initial image analysis network model and the training data set comprises an image data set, the cloud platform system automatically trains to obtain an image behavior analysis network model. Under the condition that a user selects an initial video analysis network model and a training data set comprises a video data set, the cloud platform system automatically trains to obtain a video behavior analysis network model. Under the condition that a user selects an initial image analysis network model and an initial video analysis network model and a training data set comprises an image data set and a video data set, the cloud platform system automatically trains to obtain the image behavior analysis network model and the video behavior analysis network model.
Optionally, when the user does not select a model and the initial analysis network model is instead determined according to the default configuration, the cloud platform system automatically trains according to the default configuration to obtain the corresponding model.
As can be seen from the foregoing, the network structure and the training parameters of the initial analysis network model selected by the model selection operation are fixed data configured in the cloud platform system. Optionally, the cloud platform system further provides a training adjustment function, so that the network structure and/or the training parameters can be adjusted or customized by the user.
In the embodiment of the application, the cloud platform system displays adjustment indication information of the network structure and/or the training parameters of the initial analysis network model selected by the model selection operation through user equipment, and adjusts the network structure and/or the training parameters of the initial analysis network model selected by the model selection operation according to the adjustment operation when the adjustment operation is detected based on the adjustment indication information.
Illustratively, adjustment indication information of several network structures is displayed on the user equipment. Assuming that one of the network structures is a convolutional neural network, the displayed adjustment indication information includes a convolutional-layer-number input box, a convolution-kernel-size input box, a learning-step-size adjustment slider, an iteration-number adjustment slider, and the like. The user can input data into the input boxes, drag the sliders, and so on, to adjust the number of convolutional layers, the convolution kernel size, the learning step size, the number of iterations, and the like of the convolutional neural network.
The cloud platform system adjusts the network structure and/or the training parameters of the initial analysis network model according to the detected adjustment operation, and trains the initial analysis network model according to the adjusted data.
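Illustratively, the adjustable items mentioned above could be gathered into a training configuration similar to the following Python-style sketch; the field names and default values are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class TrainingConfig:
    num_conv_layers: int = 4      # value typed into the convolutional-layer-number input box
    kernel_size: int = 3          # value typed into the convolution-kernel-size input box
    learning_step: float = 1e-3   # value set with the learning-step-size slider
    num_iterations: int = 10000   # value set with the iteration-number slider

def apply_adjustment(config, adjustment):
    # overwrite only the fields that the user actually adjusted on the user equipment
    for field_name, value in adjustment.items():
        if hasattr(config, field_name):
            setattr(config, field_name, value)
    return config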
Optionally, default training indexes, such as accuracy, precision, and recall rate, are configured in the cloud platform system. If the behavior analysis network model obtained by training does not reach these indexes, the cloud platform system automatically adjusts the network structure and/or the training parameters and retrains the initial analysis network model, and the training is determined to be completed once the indexes are reached.
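Illustratively, the automatic retraining described above can be sketched as follows in Python-style pseudocode. The target values and the routines passed in as train_fn, evaluate_fn and adjust_fn are assumptions for illustration only.

DEFAULT_TARGETS = {"accuracy": 0.90, "precision": 0.85, "recall": 0.85}  # assumed default training indexes

def train_until_qualified(initial_model, config, training_data_set,
                          train_fn, evaluate_fn, adjust_fn, max_rounds=5):
    trained_model = None
    for _ in range(max_rounds):
        trained_model = train_fn(initial_model, config, training_data_set)  # assumed training routine
        metrics = evaluate_fn(trained_model, training_data_set)             # assumed evaluation routine
        if all(metrics[name] >= target for name, target in DEFAULT_TARGETS.items()):
            return trained_model                                            # the indexes are reached
        config = adjust_fn(config)                                          # adjust structure and/or parameters
    return trained_model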
In the embodiment of the application, the cloud platform system further provides a model testing function. After the initial analysis network model selected by the model selection operation is trained by the server, model test prompt information is displayed on the user equipment, and when a test confirmation instruction is detected based on the model test prompt information, the trained behavior analysis network model is tested according to a test data set to obtain a test result. The test result is then displayed on the user equipment, and when a training adjustment instruction is detected based on the test result, the initial analysis network model selected by the model selection operation is retrained according to the training adjustment instruction.
Illustratively, after the image behavior analysis network model and/or the video behavior analysis network model is obtained through training, the cloud platform system displays the prompt information 'training is completed; perform model testing?' on the user equipment, and the user can choose to test the model. The user uploads a test data set to the server of the cloud platform system through the user equipment, and the server tests the trained behavior analysis network model according to the test data set. After the test is finished, the test result is displayed on the user equipment, for example, the position of the behavior analyzed in the video segment used for testing is displayed, the test accuracy is displayed, and so on.
In the embodiment of the application, the cloud platform system further provides a model publishing function. After the model test is passed or the model training is completed, the user can choose to publish the model, so as to use the trained behavior analysis network model for behavior analysis in videos.
Optionally, model publishing prompt information is displayed on the user equipment, and when a model publishing instruction is detected based on the model publishing prompt information, the cloud platform system deploys the trained behavior analysis network model on an analysis device, where the analysis device is a server and/or a terminal device. The server is the server of the cloud platform system or another server, such as another cloud server, and the terminal device is, for example, an intelligent camera.
Illustratively, the trained behavior analysis network model is sent to a cloud server, and the behavior analysis network model is deployed on the cloud server to provide an online cloud service for video analysis. Or the cloud platform system provides a model downloading service, the user equipment can download the behavior analysis network model from a server of the cloud platform system to the local, and then deploy the behavior analysis network model on the terminal equipment, such as an intelligent camera, through the user equipment. For another example, the cloud platform system directly sends the behavior analysis network model to the terminal device through the server, so as to be deployed on the terminal device.
Optionally, when one image behavior analysis network model and one video behavior analysis network model are obtained through training, the cloud platform system merges the two trained models into one model package through the server to provide a download service. The user equipment can download the model package from the server to the local device, and then deploy the model package on the terminal device through other tools for behavior analysis in videos.
Optionally, after obtaining the model packages of the two models, the user may also choose to deploy one or both models on the terminal device or the server. That is, regardless of whether the cloud platform system obtains two models by training according to the user selection or obtains two models by training according to the default configuration, the user can select to deploy one of the models or deploy two models. Alternatively, the user may choose to use one or both models for behavior analysis in the video at any time after the two models are deployed.
The above describes the training data set determination method and the method for training the behavior analysis network model provided by the embodiment of the present application. The cloud platform system provided by the embodiment of the present application is further explained below with reference to fig. 3, fig. 4, and fig. 5.
Fig. 3 is a basic usage diagram of a cloud platform system provided in an embodiment of the present application. Referring to fig. 3, the usage process of the cloud platform system mainly includes data acquisition, data annotation, model training, and model deployment. Data acquisition includes the acquisition of offline video data and the acquisition of online video data. Data annotation is mainly realized through an annotation client or an annotation module provided by the cloud platform system: the video segments are annotated through the interaction between the user equipment and the user, the annotation information is determined, and an annotation file is automatically generated according to the annotation operation. Model training is realized through a training platform provided by the cloud platform system. Model deployment includes deployment on a server to provide an online cloud service for behavior analysis in videos, and/or deployment on a terminal device (such as an AI (Artificial Intelligence) device) to provide a behavior analysis service in videos directly on the terminal device.
Fig. 4 is an overall flowchart of a user operating a cloud platform system according to an embodiment of the present disclosure. Referring to fig. 4, the overall process of the user operation mainly includes creating a behavior analysis model task (analysis task), uploading a data set (video segment), tagging the data set (determining tagging information), training the model (determining a training data set and training the model), testing the model, and publishing the model.
Fig. 5 is a schematic diagram of a model training process provided in an embodiment of the present application. Referring to fig. 5, a server of the cloud platform system splits an original data set (including a video segment and corresponding annotation information) to obtain an image data set and a video segment subset (video data set), trains to obtain an image analysis model according to the image data set, trains to obtain a video analysis model according to the video segment subset, and then merges the two trained models into a behavior analysis model package as a trained behavior analysis network model.
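Illustratively, the training process of fig. 5 can be summarized by the following Python-style sketch. The routines passed in as split_fn, train_image_fn and train_video_fn are assumptions for illustration only.

def produce_behavior_analysis_model(original_data_set, split_fn, train_image_fn, train_video_fn):
    # split the original data set (video segments + annotation information)
    # into an image data set and a video data set
    image_data_set, video_data_set = split_fn(original_data_set)
    image_model = train_image_fn(image_data_set)   # train the image analysis model
    video_model = train_video_fn(video_data_set)   # train the video analysis model
    # merge the two trained models into one behavior analysis model package
    return {"image_model": image_model, "video_model": video_model}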
Therefore, the cloud platform system provided by the scheme can provide customizable and automatic model production service for users, and greatly improves the production efficiency of the behavior analysis model.
In summary, in the embodiment of the present application, a plurality of different initial analysis network models are provided for a user to select, and a training data set corresponding to the initial analysis network model selected by the user is automatically determined according to a plurality of video segments and corresponding annotation information.
The embodiment of the application also provides a behavior analysis method in the video, and the method is introduced next.
Fig. 6 is a flowchart of a method for analyzing behavior in a video according to an embodiment of the present application, where the method is applied to a terminal device or a server, and a behavior analysis network model obtained by using the model training method according to the foregoing embodiment is deployed in the terminal device or the server. Taking the application of the method to the terminal device as an example, referring to fig. 6, the method includes the following steps.
Step 601: and acquiring a target video segment to be subjected to behavior analysis.
In the embodiment of the application, the behavior analysis network model is obtained by training a training data set after a user selects an initial analysis network model from a plurality of initial analysis network models with different network structures and/or training parameters, wherein the training data set is determined after the user labels behaviors in video images of a plurality of video segments. That is, the behavior analysis network model deployed in the terminal device is the behavior analysis network model obtained by training in the embodiments of fig. 2 to fig. 5, and the specific training process refers to the foregoing description and is not described herein again.
In this embodiment of the application, if the terminal device has a video capture function, for example, the terminal device is an intelligent camera or a smart phone, the terminal device may directly capture the target video segment to be subjected to behavior analysis, or may receive a video segment sent by another device or uploaded through a tool (e.g., a USB flash drive) to obtain the target video segment. If the terminal device does not have a video capture function, the terminal device may receive a video segment sent by another device or uploaded through a tool to obtain the target video segment to be subjected to behavior analysis. That is, the embodiment of the present application does not limit the manner in which the terminal device acquires the target video segment.
Alternatively, for a server deployed with a behavior analysis network model, the server receives a target video segment sent by other devices, for example, a video segment to be subjected to behavior analysis sent by a network camera, a mobile phone, a computer, and the like.
Step 602: and analyzing the behaviors in the target video segment through the behavior analysis network model to obtain a behavior analysis result.
In the embodiment of the application, the terminal device analyzes the behavior in the target video segment through the deployed behavior analysis network model to obtain a behavior analysis result.
As can be seen from the foregoing embodiments, the behavior analysis model includes an image behavior analysis network model and/or a video behavior analysis network model, and based on this, there are three implementation manners in which the terminal device analyzes the target video segment through the behavior analysis network model.
In the first implementation manner, the behavior analysis network model includes an image behavior analysis network model and a video behavior analysis network model. The terminal device analyzes the behaviors in the target video segment through the image behavior analysis network model to obtain one or more candidate frame numbers, determines one or more second sub-video segments according to the target video segment and the one or more candidate frame numbers, and analyzes the behaviors in the one or more second sub-video segments through the video behavior analysis network model to obtain a behavior analysis result.
An implementation manner in which the terminal device determines the one or more second sub-video segments according to the target video segment and the one or more candidate frame numbers is as follows: for a first candidate frame number in the one or more candidate frame numbers, a video segment of a reference frame number or a reference duration, starting from the video image corresponding to the first candidate frame number in the target video segment, is extracted to obtain a second sub-video segment, where the first candidate frame number is one of the one or more candidate frame numbers.
Illustratively, assuming that the reference frame number is 16, the terminal device extracts 16 frames of video images that continue from the video image corresponding to each candidate frame number in the target video segment, resulting in one sub-video segment. Assuming that the reference time length is 10 seconds or 1 minute, the terminal device extracts a video segment of 10 seconds or 1 minute which is continuous from the video image corresponding to each candidate frame number in the target video segment, and obtains a sub-video segment.
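Illustratively, the extraction of the second sub-video segments can be sketched as follows in Python; the function name and the list-of-frames representation of the target video segment are assumptions for illustration only.

def extract_second_sub_segments(target_frames, candidate_frame_nos, reference_frame_number=16):
    # target_frames: frames of the target video segment, indexed by frame number
    sub_segments = []
    for frame_no in candidate_frame_nos:
        clip = target_frames[frame_no:frame_no + reference_frame_number]
        if clip:  # skip candidate frame numbers too close to the end of the video
            sub_segments.append(clip)
    return sub_segments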
It should be noted that the image behavior analysis network model can analyze the video image with behavior in the target video segment, and output the frame number with behavior as the candidate frame number.
Optionally, the reference frame number or the reference duration is a parameter configured on the terminal device by the user, or the reference frame number or the reference duration is determined by the cloud platform system in a statistical manner according to the annotation information of the video segment used for training. For example, the labeling information includes a corresponding relationship between behavior tags and behavior positions, a total frame number or a total duration of the behavior can be determined according to a plurality of frame numbers included in the behavior position corresponding to one behavior tag, the total frame number or the total duration of the behavior labeled by each behavior tag in the labeling information is counted to obtain a plurality of total frame numbers or a plurality of total durations, and an average value (or a median value, etc.) of the plurality of total frame numbers or the plurality of total durations is used as a reference frame number or a reference duration. Alternatively, the user may adjust the reference frame number or the reference duration at any time.
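Illustratively, the statistical determination of the reference frame number can be sketched as follows; the annotation layout is the same assumption used in the earlier sketches, and the fallback value is also an assumption.

def reference_frame_number(annotations, fallback=16):
    totals = []
    for per_segment in annotations.values():       # one entry per video segment used for training
        for positions in per_segment.values():     # one entry per behavior tag
            frame_nos = [frame_no for frame_no, _ in positions]
            totals.append(max(frame_nos) - min(frame_nos) + 1)  # total frame number of this behavior
    # mean value of the total frame numbers of all labeled behaviors
    return round(sum(totals) / len(totals)) if totals else fallback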
In the first implementation manner, the image behavior analysis network model is first used to preliminarily locate the sub-video segments in which behaviors may exist in the target video segment, and then the video behavior analysis network model is used to further analyze each sub-video segment to obtain the behavior analysis result. The two rounds of analysis make the behavior analysis in the video more accurate, and because the sub-video segments are screened out in the first round, the data amount to be processed in the second round of video analysis is reduced, which accelerates the analysis rate of the video behavior analysis network model.
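Illustratively, the two rounds of analysis in the first implementation manner can be sketched as follows; the two prediction routines passed in are assumptions for illustration only, and extract_second_sub_segments is the sketch given above.

def analyze_target_video_segment(target_frames, image_model_fn, video_model_fn, reference_frame_number=16):
    # first round: the image behavior analysis network model outputs candidate frame numbers
    candidate_frame_nos = image_model_fn(target_frames)
    # extract one second sub-video segment for each candidate frame number
    sub_segments = extract_second_sub_segments(target_frames, candidate_frame_nos, reference_frame_number)
    # second round: the video behavior analysis network model analyzes each second sub-video segment
    return [video_model_fn(segment) for segment in sub_segments]  # behavior tags and behavior positions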
In the second implementation manner, the behavior analysis network model includes an image behavior analysis network model, and the terminal device analyzes the behaviors in the target video segment through the image behavior analysis network model to obtain the behavior analysis result.
It should be noted that the image analysis network model not only can output the frame number of the behavior in the target video segment, but also can output the behavior tag and the image area where the behavior occurs, that is, the behavior analysis result includes the behavior tag and the behavior position.
In the second implementation manner, only the image behavior analysis network model is used to quickly analyze the positions where behaviors may exist in the target video segment, that is, the analysis rate is very high.
In the third implementation manner, the behavior analysis network model includes a video behavior analysis network model, and the terminal device analyzes the behaviors in the target video segment through the video behavior analysis network model to obtain the behavior analysis result.
It should be noted that the video analysis network model can output the behavior tag and the behavior position (including the frame number and the image area) of the behavior existing in the target video segment, that is, the behavior analysis result includes the behavior tag and the behavior position.
In the embodiment of the application, a plurality of different initial analysis network models are provided for a user to select, and the training data set corresponding to the initial analysis network model selected by the user is automatically determined according to the plurality of video segments and the corresponding label information.
Fig. 7 is a schematic structural diagram of a training data set determining apparatus 700 provided in an embodiment of the present application, where the training data set determining apparatus 700 may be implemented by software, hardware, or a combination of the two as part or all of a computer device, and optionally, the computer device may be a user device or a server of a cloud platform system in the foregoing embodiments. Referring to fig. 7, the apparatus 700 includes: a first display module 701, a first determination module 702, a second display module 703 and a second determination module 704.
A first display module 701, configured to display a video image included in each of a plurality of video segments;
a first determining module 702, configured to determine annotation information corresponding to each of a plurality of video segments when an annotation operation of one or more behaviors in a video image of the plurality of video segments is detected;
a second display module 703, configured to display performance information of multiple initial analysis network models with different network structures and/or training parameters, where the multiple initial analysis network models are all used for behavior analysis in a video;
a second determining module 704, configured to, when a model selecting operation is detected based on the displayed performance information, determine, according to the plurality of video segments and the corresponding annotation information, a training data set corresponding to the initial analysis network model selected by the model selecting operation.
Optionally, the plurality of initial analysis network models include a plurality of initial image analysis network models with different network structures and/or training parameters, and a plurality of initial video analysis network models with different network structures and/or training parameters;
the second determining module includes:
the first determining unit is used for determining an image data set according to the plurality of video segments and the corresponding annotation information, and taking the image data set as a training data set corresponding to the initial image analysis network model selected by the model selecting operation; and/or
And the second determining unit is used for determining a video data set according to the plurality of video segments and the corresponding annotation information, and taking the video data set as a training data set of the initial video analysis network model selected by the model selecting operation.
Optionally, the annotation information includes a correspondence between a behavior tag and a behavior position, the behavior position includes a frame number and an image region where a behavior occurs, each video segment in the plurality of video segments has one or more behavior tags, each behavior tag corresponds to a plurality of frame numbers, and each frame number corresponds to an image region;
the first determination unit includes:
the first extraction subunit is configured to, for a first video segment in the multiple video segments, extract a part or all of video images labeled in the first video segment to obtain multiple first video images, where the first video segment is one of the multiple video segments;
the first obtaining subunit is configured to obtain, from the correspondence, a behavior tag and a behavior position corresponding to a frame number of each of the plurality of first video images, as annotation information corresponding to the corresponding first video image;
the first determining subunit is configured to determine, as an image data set, the video image extracted from the plurality of video segments and the corresponding annotation information.
Optionally, the annotation information includes a correspondence between a behavior tag and a behavior position, the behavior position includes a frame number and an image region where a behavior occurs, each video segment in the plurality of video segments has one or more behavior tags, each behavior tag corresponds to a plurality of frame numbers, each frame number corresponds to an image region, and the plurality of frame numbers include a starting frame number and an ending frame number;
the second determination unit includes:
the second extraction subunit is configured to, for a first video segment in the multiple video segments, extract a video segment between a starting frame number and an ending frame number corresponding to each behavior tag of the first video segment to obtain one or more first sub-video segments, where the first video segment is one of the multiple video segments;
the second obtaining subunit is configured to obtain, from the correspondence relationship, a behavior position corresponding to the behavior tag of each first sub-video segment, and use the behavior tag and the corresponding behavior position of each first sub-video segment as corresponding annotation information corresponding to the first sub-video segment;
and the second determining subunit is used for determining the sub-video segments extracted from the plurality of video segments and the corresponding annotation information as the video data set.
Optionally, the apparatus 700 further comprises:
the first training module is used for training the initial image analysis network model selected by the model selection operation according to the image data set to obtain an image behavior analysis network model; and/or the presence of a gas in the gas,
and the second training module is used for training the initial video analysis network model selected by the model selection operation according to the video data set to obtain the video behavior analysis network model.
Optionally, the apparatus 700 further comprises:
the third display module is used for displaying the adjustment indication information of the network structure and/or the training parameters of the initial analysis network model selected by the model selection operation;
and the adjusting module is used for adjusting the network structure and/or the training parameters of the initial analysis network model selected by the model selecting operation according to the adjusting operation when the adjusting operation is detected based on the adjusting indication information.
Optionally, the apparatus 700 further comprises:
the fourth display module is used for displaying model test prompt information;
the test module is used for testing the trained behavior analysis network model according to the test data set when the determined test instruction is detected based on the model test prompt information to obtain a test result;
the fifth display module is used for displaying the test result;
and the third training module is used for retraining the initial analysis network model selected by the model selection operation according to the training adjustment instruction when the training adjustment instruction is detected based on the test result.
Optionally, the apparatus 700 further comprises:
the sixth display module is used for displaying the model release prompt information;
and the deployment module is used for deploying the trained behavior analysis network model on analysis equipment when a model issuing instruction is detected based on the model issuing prompt information, wherein the analysis equipment is a server and/or terminal equipment.
In the embodiment of the application, a plurality of different initial analysis network models are provided for a user to select, and the training data set corresponding to the initial analysis network model selected by the user is automatically determined according to the plurality of video segments and the corresponding label information.
It should be noted that: the training data set determining apparatus provided in the above embodiment is only illustrated by the division of the above functional modules when determining the training data set, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the above described functions. In addition, the training data set determining apparatus and the training data set determining method provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments, and are not described herein again.
Fig. 8 is a schematic structural diagram of a behavior analysis apparatus 800 in a video provided in an embodiment of the present application, where the behavior analysis apparatus 800 may be implemented by software, hardware, or a combination of the two as part or all of a computer device, and optionally, the computer device is a terminal device or a server in the foregoing embodiments. Referring to fig. 8, the apparatus 800 includes: an acquisition module 801 and an analysis module 802.
An obtaining module 801, configured to obtain a target video segment to be subjected to behavior analysis;
the analysis module 802 is configured to analyze a behavior in the target video segment through the behavior analysis network model to obtain a behavior analysis result;
the behavior analysis network model is obtained by selecting an initial analysis network model from a plurality of initial analysis network models with different network structures and/or training parameters and then training through a training data set, wherein the training data set is determined by labeling behaviors in video images of a plurality of video segments by a user.
Optionally, the behavior analysis network model includes an image behavior analysis network model and a video behavior analysis network model;
the analysis module comprises:
the first analysis unit is used for analyzing the behaviors in the target video segment through the image behavior analysis network model to obtain one or more candidate frame numbers;
a first determining unit, configured to determine one or more second sub-video segments according to the target video segment and the one or more candidate frame numbers;
and the second analysis unit is used for analyzing the behaviors in one or more second sub-video segments through the video behavior analysis network model to obtain a behavior analysis result.
Optionally, the first determination unit includes:
and the extracting subunit is configured to, for a first candidate frame number in the one or more candidate frame numbers, extract a video segment of a reference frame number or a reference duration that continues from a video image corresponding to the first candidate frame number in the target video segment to obtain a second sub-video segment, where the first candidate frame number is one of the one or more candidate frame numbers.
In the embodiment of the application, a plurality of different initial analysis network models are provided for a user to select, and the training data set corresponding to the initial analysis network model selected by the user is automatically determined according to the plurality of video segments and the corresponding label information.
It should be noted that: in the above embodiment, when analyzing the behavior in the video, the apparatus for analyzing the behavior in the video is illustrated by only dividing the functional modules, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the behavior analysis device in the video and the behavior analysis method in the video provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiments and are not described herein again.
Fig. 9 is a block diagram of a computer device 900 according to an embodiment of the present disclosure. The computer device 900 may be a smart phone, a tablet computer, a notebook computer, or a desktop computer, etc.
Generally, computer device 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, a 9-core processor, and so forth. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement a training data set determination method or a behavior analysis method in video provided by method embodiments herein.
In some embodiments, computer device 900 may also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a touch display screen 905, a camera 906, an audio circuit 907, a positioning component 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The Radio Frequency circuit 904 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 904 may communicate with other computer devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over the surface of the display screen 905. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display screen 905 may be one, provided on the front panel of the computer device 900; in other embodiments, the number of the display screens 905 may be at least two, and each of the display screens may be disposed on a different surface of the computer device 900 or may be in a foldable design; in other embodiments, the display 905 may be a flexible display, disposed on a curved surface or on a folded surface of the computer device 900. Even more, the display screen 905 may be arranged in a non-rectangular irregular figure, i.e. a shaped screen. The Display panel 905 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. Generally, a front camera is disposed on a front panel of a computer apparatus, and a rear camera is disposed on a rear surface of the computer apparatus. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuit 907 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 901 for processing, or inputting the electric signals to the radio frequency circuit 904 for realizing voice communication. The microphones may be multiple and placed at different locations on the computer device 900 for stereo sound acquisition or noise reduction purposes. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuit 907 may also include a headphone jack.
The Location component 908 is used to locate the current geographic Location of the computer device 900 for navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the Global Positioning System (GPS) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
The power supply 909 is used to supply power to the various components in the computer device 900. The power source 909 may be alternating current, direct current, disposable or rechargeable. When the power source 909 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, computer device 900 also includes one or more sensors 910. The one or more sensors 910 include, but are not limited to: acceleration sensor 911, gyro sensor 912, pressure sensor 913, fingerprint sensor 914, optical sensor 915, and proximity sensor 916.
The acceleration sensor 911 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the computer apparatus 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 901 can control the touch display 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 912 may detect a body direction and a rotation angle of the computer apparatus 900, and the gyro sensor 912 may cooperate with the acceleration sensor 911 to acquire a 3D motion of the user with respect to the computer apparatus 900. The processor 901 can implement the following functions according to the data collected by the gyro sensor 912: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensors 913 may be disposed on the side bezel of the computer device 900 and/or underneath the touch display screen 905. When the pressure sensor 913 is disposed on the side frame of the computer device 900, the holding signal of the user to the computer device 900 may be detected, and the processor 901 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at a lower layer of the touch display 905, the processor 901 controls the operability control on the UI interface according to the pressure operation of the user on the touch display 905. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 914 is used for collecting a fingerprint of the user, and the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 901 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 914 may be disposed on the front, back, or side of the computer device 900. When a physical key or vendor Logo is provided on the computer device 900, the fingerprint sensor 914 may be integrated with the physical key or vendor Logo.
The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the touch display 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 905 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 905 is turned down. In another embodiment, the processor 901 can also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
The proximity sensor 916, also known as a distance sensor, is typically disposed on the front panel of the computer device 900. The proximity sensor 916 is used to capture the distance between the user and the front of the computer device 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front face of the computer device 900 gradually decreases, the touch display 905 is controlled by the processor 901 to switch from the bright-screen state to the screen-off state; when the proximity sensor 916 detects that the distance between the user and the front face of the computer device 900 gradually increases, the touch display 905 is controlled by the processor 901 to switch from the screen-off state to the bright-screen state.
Those skilled in the art will appreciate that the configuration illustrated in FIG. 9 is not intended to be limiting of the computer device 900 and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components may be employed.
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. The server may be the server in the above-described embodiment, and the server 1000 includes a Central Processing Unit (CPU)1001, a system memory 1004 including a Random Access Memory (RAM)1002 and a Read Only Memory (ROM)1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001. The server 1000 also includes a basic input/output system (I/O system) 1006, which facilitates the transfer of information between devices within the computer, and a mass storage device 1007, which stores an operating system 1013, application programs 1014, and other program modules 1015.
The basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009, such as a mouse, keyboard, etc., for user input of information. Wherein a display 1008 and an input device 1009 are connected to the central processing unit 1001 via an input-output controller 1010 connected to the system bus 1005. The basic input/output system 1006 may also include an input/output controller 1010 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 1010 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the server 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1004 and mass storage device 1007 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 1000 may also operate as a remote computer connected to a network through a network, such as the Internet. That is, the server 1000 may be connected to the network 1012 through a network interface unit 1011 connected to the system bus 1005, or the network interface unit 1011 may be used to connect to another type of network or a remote computer system (not shown).
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU. The one or more programs include instructions for performing the training data set determination method or the behavior analysis method in video provided by the embodiments of the present application.
In some embodiments, a computer-readable storage medium is also provided, in which a computer program is stored, which when executed by a processor implements the steps of the training data set determination method or the behavior analysis method in video in the above embodiments. For example, the computer readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It is noted that the computer-readable storage medium referred to in the embodiments of the present application may be a non-volatile storage medium, in other words, a non-transitory storage medium.
It should be understood that all or part of the steps for implementing the above embodiments may be implemented by software, hardware, firmware or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The computer instructions may be stored in the computer-readable storage medium described above.
That is, in some embodiments, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of the training data set determination method or the behavior analysis method in video described above.
It is to be understood that reference herein to "at least one" means one or more and "a plurality" means two or more. In the description of the embodiments of the present application, "/" means "or" unless otherwise specified, for example, a/B may mean a or B; "and/or" herein is merely an association describing an associated object, and means that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, in order to facilitate clear description of technical solutions of the embodiments of the present application, in the embodiments of the present application, terms such as "first" and "second" are used to distinguish the same items or similar items having substantially the same functions and actions. Those skilled in the art will appreciate that the terms "first," "second," etc. do not denote any order or quantity, nor do the terms "first," "second," etc. denote any order or importance.
The above-mentioned embodiments are provided not to limit the present application, and any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A method of training data set determination, the method comprising:
displaying a video image included in each of a plurality of video segments;
when the annotation operation of one or more behaviors in the video images of the plurality of video segments is detected, determining annotation information corresponding to each video segment in the plurality of video segments;
displaying performance information of a plurality of initial analysis network models with different network structures and/or training parameters, wherein the plurality of initial analysis network models are used for behavior analysis in a video;
and when the model selection operation is detected based on the displayed performance information, determining a training data set corresponding to the initial analysis network model selected by the model selection operation according to the plurality of video segments and the corresponding annotation information.
2. The method of claim 1, wherein the plurality of initial analysis network models comprises a plurality of initial image analysis network models with different network structures and/or training parameters, and a plurality of initial video analysis network models with different network structures and/or training parameters;
the determining a training data set corresponding to the initial analysis network model selected by the model selection operation according to the plurality of video segments and the corresponding annotation information includes:
determining an image data set according to the video segments and the corresponding annotation information, and taking the image data set as a training data set corresponding to the initial image analysis network model selected by the model selection operation; and/or
And determining a video data set according to the plurality of video segments and the corresponding annotation information, and taking the video data set as a training data set of the initial video analysis network model selected by the model selection operation.
3. The method according to claim 2, wherein said annotation information comprises correspondence between behavior tags and behavior locations, said behavior locations comprising frame numbers and image regions where behavior occurs, each video segment in said plurality of video segments having one or more behavior tags, each behavior tag corresponding to a plurality of frame numbers, each frame number corresponding to an image region;
the determining an image data set according to the plurality of video segments and the corresponding annotation information includes:
for a first video segment among the plurality of video segments, extracting some or all of the labeled video images in the first video segment to obtain a plurality of first video images, wherein the first video segment is any one of the plurality of video segments;
acquiring, from the correspondence, the behavior tag and the behavior position corresponding to the frame number of each of the plurality of first video images, and using them as the annotation information corresponding to that first video image;
and determining the video images extracted from the plurality of video segments and the corresponding annotation information as the image data set.
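A minimal sketch of the image data set construction in claim 3, assuming annotations are stored as a mapping from behavior tag to a frame-number-to-image-region dictionary; the function and field names are illustrative only.

```python
# Build per-frame training samples from annotated video segments (claim 3 sketch).
def build_image_data_set(segments):
    image_data_set = []
    for seg in segments:
        for tag, frame_to_region in seg.annotations.items():
            for frame_no, region in frame_to_region.items():
                image_data_set.append({
                    "image": seg.frames[frame_no],  # labeled video image
                    "behavior_tag": tag,            # behavior tag from the correspondence
                    "frame_number": frame_no,
                    "image_region": region,         # region where the behavior occurs
                })
    return image_data_set
```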
4. The method according to claim 2, wherein the annotation information comprises a correspondence between behavior tags and behavior positions, the behavior positions comprising frame numbers and image regions where the behavior occurs, each of the plurality of video segments having one or more behavior tags, each behavior tag corresponding to a plurality of frame numbers, each frame number corresponding to an image region, and the plurality of frame numbers comprising a starting frame number and an ending frame number;
determining a video data set according to the plurality of video segments and the corresponding annotation information, including:
for a first video segment among the plurality of video segments, extracting the video segment between the starting frame number and the ending frame number corresponding to each behavior tag of the first video segment to obtain one or more first sub-video segments, wherein the first video segment is any one of the plurality of video segments;
acquiring, from the correspondence, the behavior position corresponding to the behavior tag of each first sub-video segment, and using the behavior tag and the corresponding behavior position of each first sub-video segment as the annotation information corresponding to that first sub-video segment;
and determining the sub-video segments extracted from the plurality of video segments and the corresponding annotation information as the video data set.
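Likewise, a sketch of the video data set construction in claim 4: each behavior tag yields a sub-video segment spanning its starting to ending labeled frame numbers. The annotation layout is the same assumption as in the claim 3 sketch above.

```python
# Build sub-video-segment samples from annotated video segments (claim 4 sketch).
def build_video_data_set(segments):
    video_data_set = []
    for seg in segments:
        for tag, frame_to_region in seg.annotations.items():
            start, end = min(frame_to_region), max(frame_to_region)  # starting/ending frame numbers
            video_data_set.append({
                "sub_video": seg.frames[start:end + 1],  # frames from start to end, inclusive
                "behavior_tag": tag,
                "behavior_positions": frame_to_region,   # frame number -> image region
            })
    return video_data_set
```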
5. The method according to claim 2, wherein after determining a training data set corresponding to the initial analysis network model selected by the model selection operation based on the plurality of video segments and corresponding annotation information, the method further comprises:
training the initial image analysis network model selected by the model selection operation according to the image data set to obtain an image behavior analysis network model; and/or
training the initial video analysis network model selected by the model selection operation according to the video data set to obtain a video behavior analysis network model.
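Claim 5 leaves the training procedure open; the sketch below shows one conventional possibility, a supervised PyTorch training loop. The optimizer, loss function, and hyperparameters are assumptions, not requirements of the claim.

```python
# One possible training loop for the selected initial analysis network model.
import torch

def train_selected_model(model, data_loader, epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)  # forward pass and loss
            loss.backward()                           # backpropagation
            optimizer.step()                          # parameter update
    return model  # the resulting behavior analysis network model
```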
6. The method of claim 5, wherein prior to training the initial analysis network model selected by the model selection operation, the method further comprises:
displaying adjustment indication information of the network structure and/or training parameters of the initial analysis network model selected by the model selection operation;
and when the adjustment operation is detected based on the adjustment indication information, adjusting the network structure and/or training parameters of the initial analysis network model selected by the model selection operation according to the adjustment operation.
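The adjustment step of claim 6 can be as simple as overwriting selected hyperparameters before training starts; the sketch below assumes the training parameters are kept in a plain dictionary, which is an illustrative choice. Structural adjustments (for example, swapping a backbone) would be applied analogously before the model is instantiated.

```python
# Apply user-requested adjustments to the selected model's training parameters.
def apply_adjustments(training_params, adjustments):
    adjusted = dict(training_params)   # keep the original configuration intact
    adjusted.update(adjustments)       # e.g. {"lr": 5e-4, "batch_size": 32}
    return adjusted
```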
7. The method of claim 5, wherein after training the initial analysis network model selected by the model selection operation, the method further comprises:
displaying model test prompt information;
when a test confirmation instruction is detected based on the model test prompt information, testing the trained behavior analysis network model according to a test data set to obtain a test result;
displaying the test result;
and when a training adjustment instruction is detected based on the test result, retraining the initial analysis network model selected by the model selection operation according to the training adjustment instruction.
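For the test step of claim 7, a plain accuracy evaluation over a held-out test data set is one straightforward realization; the metric choice is an assumption.

```python
# Evaluate the trained behavior analysis network model on a test data set.
import torch

@torch.no_grad()
def test_behavior_model(model, test_loader):
    model.eval()
    correct, total = 0, 0
    for inputs, targets in test_loader:
        predictions = model(inputs).argmax(dim=1)     # predicted behavior classes
        correct += (predictions == targets).sum().item()
        total += targets.numel()
    return {"accuracy": correct / max(total, 1)}      # displayed as the test result
```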
8. The method according to any one of claims 5-7, further comprising:
displaying model release prompt information;
and when a model release instruction is detected based on the model release prompt information, deploying the trained behavior analysis network model on an analysis device, wherein the analysis device is a server and/or a terminal device.
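For the model release of claim 8, a common (but not mandated) realization is to serialize the trained weights and hand the file to the analysis device; the file path and format below are assumptions.

```python
# Package the trained behavior analysis network model for deployment.
import torch

def release_model(model, path="behavior_model.pt"):
    torch.save(model.state_dict(), path)  # serialize the trained weights
    return path                           # the server and/or terminal device loads this file
```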
9. A method of behavioral analysis in a video, the method comprising:
acquiring a target video segment to be subjected to behavior analysis;
analyzing the behaviors in the target video segment through a behavior analysis network model to obtain a behavior analysis result;
the behavior analysis network model is obtained by selecting an initial analysis network model from a plurality of initial analysis network models with different network structures and/or training parameters and then training through a training data set, wherein the training data set is determined by labeling behaviors in video images of a plurality of video segments by a user.
10. The method of claim 9, wherein the behavior analysis network model comprises an image behavior analysis network model and a video behavior analysis network model;
the analyzing the behavior in the target video segment through the behavior analysis network model to obtain a behavior analysis result includes:
analyzing the behaviors in the target video segment through the image behavior analysis network model to obtain one or more candidate frame numbers;
determining one or more second sub-video segments according to the target video segment and the one or more candidate frame numbers;
and analyzing the behaviors in the one or more second sub-video segments through the video behavior analysis network model to obtain the behavior analysis result.
11. The method according to claim 10, wherein said determining one or more second sub-video segments based on said target video segment and said one or more candidate frame numbers comprises:
for a first candidate frame number of the one or more candidate frame numbers, extracting, from the target video segment, a video segment of a reference number of frames or a reference duration starting from the video image corresponding to the first candidate frame number, to obtain a second sub-video segment, wherein the first candidate frame number is any one of the one or more candidate frame numbers.
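Claims 9 to 11 describe a two-stage analysis; the sketch below strings them together: an image behavior analysis model screens frames, a sub-segment of reference length is cut from each candidate frame, and a video behavior analysis model produces the final result. The score threshold and reference length are illustrative assumptions.

```python
# Two-stage behavior analysis over a target video segment (claims 9-11 sketch).
def analyze_target_video(frames, image_model, video_model,
                         score_threshold=0.5, reference_length=16):
    # Stage 1: the image behavior analysis model scores each video image and
    # yields candidate frame numbers where a behavior may occur.
    candidate_frames = [i for i, frame in enumerate(frames)
                        if image_model(frame) >= score_threshold]

    # Stage 2: extract a sub-segment of reference length starting from each
    # candidate frame and analyze it with the video behavior analysis model.
    results = []
    for i in candidate_frames:
        second_sub_segment = frames[i:i + reference_length]
        results.append(video_model(second_sub_segment))
    return results  # the behavior analysis result
```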
12. A training data set determination apparatus, the apparatus comprising:
a first display module, configured to display a video image included in each of a plurality of video segments;
a first determining module, configured to determine annotation information corresponding to each of the plurality of video segments when an annotation operation on one or more behaviors in the video images of the plurality of video segments is detected;
a second display module, configured to display performance information of a plurality of initial analysis network models having different network structures and/or training parameters, wherein the plurality of initial analysis network models are all used for behavior analysis in video;
and a second determining module, configured to determine, when a model selection operation is detected based on the displayed performance information, a training data set corresponding to the initial analysis network model selected by the model selection operation according to the plurality of video segments and the corresponding annotation information.
13. An apparatus for behavior analysis in video, the apparatus comprising:
an acquiring module, configured to acquire a target video segment to be subjected to behavior analysis;
an analysis module, configured to analyze behaviors in the target video segment through a behavior analysis network model to obtain a behavior analysis result;
the behavior analysis network model is obtained by selecting an initial analysis network model from a plurality of initial analysis network models with different network structures and/or training parameters and then training through a training data set, wherein the training data set is determined by labeling behaviors in video images of a plurality of video segments by a user.
14. A cloud platform system, characterized in that the system comprises a user device and a server, and the system implements the steps of the method of any one of claims 1 to 8 through the user device and the server.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8 or carries out the steps of the method of claim 9 or 10.
CN202011096945.XA 2020-10-14 2020-10-14 Training data set determining method, behavior analysis method, device, system and medium Active CN112101297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011096945.XA CN112101297B (en) 2020-10-14 2020-10-14 Training data set determining method, behavior analysis method, device, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011096945.XA CN112101297B (en) 2020-10-14 2020-10-14 Training data set determining method, behavior analysis method, device, system and medium

Publications (2)

Publication Number Publication Date
CN112101297A true CN112101297A (en) 2020-12-18
CN112101297B CN112101297B (en) 2023-05-30

Family

ID=73783768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011096945.XA Active CN112101297B (en) 2020-10-14 2020-10-14 Training data set determining method, behavior analysis method, device, system and medium

Country Status (1)

Country Link
CN (1) CN112101297B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100306147A1 (en) * 2009-05-26 2010-12-02 Microsoft Corporation Boosting to Determine Indicative Features from a Training Set
CN101866429A (en) * 2010-06-01 2010-10-20 中国科学院计算技术研究所 Training method of multi-moving object action identification and multi-moving object action identification method
US20180295419A1 (en) * 2015-01-07 2018-10-11 Visyn Inc. System and method for visual-based training
US20170289617A1 (en) * 2016-04-01 2017-10-05 Yahoo! Inc. Computerized system and method for automatically detecting and rendering highlights from streaming videos
US20180197106A1 (en) * 2017-01-09 2018-07-12 International Business Machines Corporation Training data set determination
CN108073902A (en) * 2017-12-19 2018-05-25 深圳先进技术研究院 Video summary method, apparatus and terminal device based on deep learning
CN108830144A (en) * 2018-05-03 2018-11-16 华南农业大学 A kind of milking sow gesture recognition method based on improvement Faster-R-CNN
CN109325541A (en) * 2018-09-30 2019-02-12 北京字节跳动网络技术有限公司 Method and apparatus for training pattern
CN109447168A (en) * 2018-11-05 2019-03-08 江苏德劭信息科技有限公司 A kind of safety cap wearing detection method detected based on depth characteristic and video object
CN109583361A (en) * 2018-11-26 2019-04-05 北京科技大学 The scene video text tracking method minimized based on energy
CN109961009A (en) * 2019-02-15 2019-07-02 平安科技(深圳)有限公司 Pedestrian detection method, system, device and storage medium based on deep learning
CN110309786A (en) * 2019-07-03 2019-10-08 华南农业大学 A kind of milking sow posture conversion identification method based on deep video
CN110378259A (en) * 2019-07-05 2019-10-25 桂林电子科技大学 A kind of multiple target Activity recognition method and system towards monitor video
CN110599521A (en) * 2019-09-05 2019-12-20 清华大学 Method for generating trajectory prediction model of vulnerable road user and prediction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENKE LEE et al.: "Toward Cost-Sensitive Modeling for Intrusion Detection", Journal of Computer Security *
GAO Xing et al.: "Research on training set optimization and detection methods based on the YOLOv3 algorithm", Computer Engineering & Science *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022237157A1 (en) * 2021-05-10 2022-11-17 中国科学院深圳先进技术研究院 Video data set labeling method and apparatus

Also Published As

Publication number Publication date
CN112101297B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN108415705B (en) Webpage generation method and device, storage medium and equipment
CN110572711B (en) Video cover generation method and device, computer equipment and storage medium
CN110708596A (en) Method and device for generating video, electronic equipment and readable storage medium
CN110650379B (en) Video abstract generation method and device, electronic equipment and storage medium
CN110769313B (en) Video processing method and device and storage medium
CN111125442B (en) Data labeling method and device
CN111880888B (en) Preview cover generation method and device, electronic equipment and storage medium
CN111027490A (en) Face attribute recognition method and device and storage medium
CN111370025A (en) Audio recognition method and device and computer storage medium
CN112257006A (en) Page information configuration method, device, equipment and computer readable storage medium
CN111437600A (en) Plot showing method, plot showing device, plot showing equipment and storage medium
CN109189290B (en) Click area identification method and device and computer readable storage medium
CN111083526A (en) Video transition method and device, computer equipment and storage medium
CN111753606A (en) Intelligent model upgrading method and device
CN113936699B (en) Audio processing method, device, equipment and storage medium
CN112101297B (en) Training data set determining method, behavior analysis method, device, system and medium
CN110134902B (en) Data information generating method, device and storage medium
CN111539795A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN112069350A (en) Song recommendation method, device, equipment and computer storage medium
CN111370096A (en) Interactive interface display method, device, equipment and storage medium
CN112560612B (en) System, method, computer device and storage medium for determining business algorithm
CN113467663B (en) Interface configuration method, device, computer equipment and medium
CN114360494A (en) Rhythm labeling method and device, computer equipment and storage medium
CN114238859A (en) Data processing system, method, electronic device, and storage medium
CN113936240A (en) Method, device and equipment for determining sample image and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant