CN113487247B - Digitalized production management system, video processing method, equipment and storage medium - Google Patents


Info

Publication number
CN113487247B
CN113487247B (application CN202111039654.1A)
Authority
CN
China
Prior art keywords
action
state
video frame
production
event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111039654.1A
Other languages
Chinese (zh)
Other versions
CN113487247A
Inventor
方无迪
任文婷
孙熠
孙凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202111039654.1A priority Critical patent/CN113487247B/en
Publication of CN113487247A publication Critical patent/CN113487247A/en
Application granted granted Critical
Publication of CN113487247B publication Critical patent/CN113487247B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/04Manufacturing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/18Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N7/181Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a plurality of remote sources
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Abstract

The embodiment of the application provides a digital production management system, a video processing method, equipment and a storage medium. In the embodiment of the application, the digital production management system comprises a central management and control node, an edge gateway node, image acquisition equipment deployed in a production environment, and production equipment on each production line. The image acquisition equipment can acquire videos of production behaviors in a production environment and report the acquired video streams to the central control node through the edge gateway node for event identification. In the event identification processing process, the central control node integrates causal convolution processing capable of considering characteristic information in historical video frames, so that event identification can be performed by taking the current video frame as input, the real-time analysis capability of video streams can be improved, and long-term actions can be considered; in the event identification process, the instantaneous state and the long-term action are combined to identify the event, so that the accuracy of the event identification is improved.

Description

Digitalized production management system, video processing method, equipment and storage medium
Technical Field
The present application relates to the field of intelligent manufacturing technologies, and in particular, to a digital production management system, a video processing method, a device, and a storage medium.
Background
With the continuous development of technologies such as cloud computing, the Internet of Things and artificial intelligence, more and more digital factories emerge. The digital factory can realize digitalized processing of the whole production chain of a product, from raw material purchasing and product design to production and processing; production and manufacturing can also be performed in a flexible manufacturing mode. In the flexible manufacturing mode, a digital factory takes consumer demand as the core, reconstructs the traditional produce-then-sell production mode, and realizes intelligent manufacturing on demand.
In the digital production process, some abnormal events are inevitable. For example, in the clothing manufacturing industry, the hanging device may stop rotating, the cut pieces may fall off from the hanging device, etc., and these abnormal events may adversely affect the digital production process, so a solution capable of timely knowing the abnormal events occurring in the production process is needed.
Disclosure of Invention
Aspects of the present disclosure provide a digital production management system, a video processing method, a device, and a storage medium, which are capable of timely and accurately identifying events occurring in a production process.
The embodiment of the application provides a digital production management system, including: the system comprises a central control node, an edge gateway node, image acquisition equipment deployed in a production environment and production equipment on each production line;
the image acquisition equipment is used for acquiring a video stream containing a production behavior generated in a production environment and reporting the video stream to the central control node through the edge gateway node, wherein the video stream comprises continuous video frames;
the central control node is used for identifying the instantaneous state and the long-term action of the current video frame based on the causal convolutional neural network aiming at the received current video frame to obtain a state label and an action label in the current video frame; according to the state tags and the action tags in the current video frame, combining the state tags and the action tags in a plurality of historical video frames to perform event identification to obtain an event identification result; sending the event identification result to corresponding production equipment through the edge gateway node;
the production equipment is used for receiving the event recognition result and outputting the event recognition result; the event identification result comprises whether a specified event occurs in the production process.
An embodiment of the present application provides a video processing method, including:
receiving a current video frame, and identifying instantaneous state and long-term action of the current video frame based on a causal convolutional neural network to obtain a state label and an action label in the current video frame;
according to the state tags and the action tags in the current video frame, combining the state tags and the action tags in a plurality of historical video frames to carry out event identification so as to obtain an event identification result;
wherein, the event identification result comprises whether the specified event occurs.
An embodiment of the present application provides still another video processing apparatus, including: a memory and a processor;
a memory for storing a computer program;
the processor is coupled to the memory for executing the computer program for performing the steps in the video processing method.
Embodiments of the present application provide a computer readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the steps in the video processing method.
In the embodiment of the application, the digital production management system comprises a central management and control node, an edge gateway node, image acquisition equipment deployed in a production environment, and production equipment on each production line. The image acquisition equipment can acquire videos of production behaviors generated in a production environment and report the acquired video streams to the central control node through the edge gateway node for event identification. In the event identification processing process, the central control node integrates causal convolution processing capable of considering characteristic information in historical video frames, so that event identification can be performed by taking the current video frame as input, the real-time analysis capability of video streams can be improved, and long-term actions can be considered; in the event identification process, the long-term action and the instantaneous state are considered, the instantaneous state and the long-term action can be combined to identify the event, the accuracy of event identification is improved, and the event identification result is more reliable. Of course, based on more accurate and reliable event recognition results, the method can help a digital factory to optimize production field management of the production environment and improve production efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic structural diagram of a digital production management system according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a causal convolutional network according to an exemplary embodiment of the present application;
FIG. 3 is a block diagram illustrating a state-action recognition model according to an exemplary embodiment of the present application;
FIG. 4 is a diagram illustrating an overall process of event recognition based on a state-action recognition model and an event decision model according to an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a process of event recognition by an event decision model according to an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of tagging a sample with an action tag and a status tag according to an exemplary embodiment of the present application;
fig. 7 is a schematic flowchart of a video processing method according to an exemplary embodiment of the present application;
fig. 8 is a schematic structural diagram of a video processing apparatus according to an exemplary embodiment of the present application;
fig. 9 is a schematic structural diagram of a video processing device according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In daily production site management of a digital factory, some events, such as abnormal events, occurring in the production process need to be known in time. Therefore, the embodiment of the application provides a digital production management system, which comprises a central control node, an edge gateway node, image acquisition equipment deployed in a production environment, and production equipment on each production line. The image acquisition equipment can acquire videos of production behaviors in a production environment and report the acquired video streams to the central control node through the edge gateway node for event identification. In the event identification processing process, the central control node integrates causal convolution processing capable of considering characteristic information in historical video frames, so that event identification can be performed by taking the current video frame as input, the real-time analysis capability of video streams can be improved, and long-term actions can be considered; in the event identification process, the long-term action and the instantaneous state are considered, the instantaneous state and the long-term action can be combined to identify the event, the accuracy of event identification is improved, and the event identification result is more reliable. Of course, based on more accurate and reliable event recognition results, the method can help a digital factory to optimize production field management of the production environment and improve production efficiency.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a digital production management system according to an exemplary embodiment of the present application. As shown in fig. 1, the system includes: a central management and control node 10, an edge gateway node 20, an image capture device 30 deployed in a production environment, and production devices 40 on various production lines. The central management and control node 10 may be communicatively connected to the edge gateway node 20 through a wired network or a wireless network. The edge gateway node 20 may also be communicatively coupled to the image capture device 30 and the production device 40, respectively, via a wired network or a wireless network. For example, the wired network may include a coaxial cable, a twisted pair, an optical fiber, and the like, and the wireless network may be a 2G network, a 3G network, a 4G network, a 5G network, a Wireless Fidelity (WiFi) network, and the like, which is not limited in this application.
In the embodiment of the present application, the central management and control node 10 is located in a cloud, for example, deployed in a central cloud or a traditional data center, and may be a cloud server, a server array, a virtual machine, or the like in implementation form.
In the embodiment of the present application, the edge gateway node 20 is a gateway device that is deployed in a production environment and is capable of data forwarding and processing. The edge gateway node 20 may be deployed in an edge cluster, where the edge cluster is deployed on the side close to the objects or data sources, for example, inside the digital factory or at another location close to the digital factory, and provides various data analysis and processing services nearby by using an open platform that integrates network, computing, storage, and application core capabilities. The edge gateway node 20 can implement local linkage and data processing and analysis of the devices even without networking, and can also effectively share the load of the cloud (e.g., the central management and control node).
In the embodiment of the present application, the image capturing apparatus 30 may be any apparatus having an image capturing function. For example, in view of the structural characteristics of the sensor, the image capturing apparatus 30 of the present embodiment may employ an area-scan camera or a line-scan camera. For another example, in view of the picture resolution supported by the camera, the image capturing device 30 of the present embodiment may employ a standard-definition camera or a high-definition camera. For another example, in view of the supported signal types, the image pickup device 30 of the present embodiment may employ an analog camera or a digital camera. For another example, in view of the number of lenses included in the camera, the image capturing apparatus 30 of the present embodiment may employ a monocular camera or a binocular camera.
Considering that a line-scan camera tends to produce non-uniform imaging illumination, so that the collected image is bright in the middle and dark on both sides, brightness adjustment needs to be performed on the collected image, and this brightness adjustment may aggravate image noise. Therefore, in the above or following embodiments of the present application, the image collection device 30 may preferably be an area-scan camera with relatively uniform imaging illumination, but is not limited thereto.
In addition, in consideration that the image clarity is closely related to the detection accuracy of the event recognition result, then, in the above or below-described embodiment of the present application, the image capturing apparatus 30 may preferably select a high-definition camera of HD 720P with a resolution of 1280 × 720, or a high-definition camera of HD 960P with a resolution of 1280 × 960, but is not limited thereto.
In the present embodiment, the production equipment 40 refers to equipment deployed in a production environment to produce products. The production environment refers to a product production site, such as a production plant. Typically, a production plant is deployed with multiple production lines, which may have multiple workstations deployed thereon, each workstation having production equipment 40 and production personnel deployed thereon, as shown in FIG. 1. The production equipment 40 may have different forms of implementation depending on the production process for which the production equipment 40 is responsible, and similarly, the products produced by the production equipment 40 may have forms of semi-finished products or finished products. It should be understood that a semi-finished product refers to a product that still needs to be processed according to the remaining production processes in the whole production process, and a finished product refers to a product that is processed according to all the production processes involved in the whole production process. For example, a garment is produced from a fabric to a garment through a plurality of production processes, such as cloth inspecting, cutting, printing, sewing and ironing, which are usually involved, and accordingly, the production equipment 40 includes a cloth inspecting machine in charge of the cloth inspecting process, a cutting machine in charge of the cutting process, a printing machine in charge of the printing process, a sewing machine in charge of the sewing process and an ironing machine in charge of the ironing process. Taking ironing as an example of the last production process in the whole production process, the clothes ironed by the ironing machine are ready-made clothes, namely finished products, and the clothes processed by the cloth inspecting machine, the cutting machine, the printing machine, the sewing machine and the like are semi-finished products.
In the embodiment of the present application, the central management and control node 10 may provide various services, for example, analyze and process a video stream of the production environment, and optimize production site management of the production environment based on the analysis result to improve production efficiency. For example, a video stream of the production environment is analyzed and processed, and quality control of the production environment is optimized based on the analysis result to improve the quality of the production product. For another example, big data analysis is performed on the production state data of the production environment, and the production schedule of the production environment is optimized based on the big data analysis result to improve the production efficiency. For example, the resources of various devices, personnel and the like in the production environment can be reasonably scheduled by combining the production field data and the scheduling result data, so that the resource utilization rate is improved, and the production efficiency is improved. It should be understood that the central management and control node 10 may provide other suitable services according to actual application requirements.
In the embodiment of the present application, one or more image capturing devices 30 may be deployed in a production environment, and in fig. 1, a plurality of image capturing devices 30 are deployed in a production environment as an example for illustration. The image capturing device 30 is configured to capture a video stream containing a production behavior generated in a production environment, and report the video stream to the central management and control node 10 through the edge gateway node 20, where the video stream includes consecutive video frames. The image capturing device 30 may report the captured video stream to the central control node 10 through the edge gateway node 20 in real time, or may periodically report the captured video stream to the central control node 10 through the edge gateway node 20.
The production behavior may refer to a behavior generated by a target object such as a person, a machine, a material, or a production line in a production environment. Wherein a person includes, but is not limited to, a manufacturing person or a manager. Machines include, but are not limited to, production equipment 40, tools, station equipment, and tooling fixtures. Materials include, but are not limited to, raw materials or auxiliary materials.
It is noted that the target object and the production behavior of the target object are defined according to a specific production scenario. For ease of understanding, the hanging line, the clothes, the production personnel, and the like shown in fig. 1 are described as examples of the target object. The production behavior of the hanging line is, for example, whether the hanging line is rotating normally. The production behaviors of the clothes include, but are not limited to: whether the clothes are hung on the hanging line, whether the clothes are on the table of the production apparatus 40, or whether the clothes have fallen on the ground. The production behaviors of the production personnel include, but are not limited to: whether a production person performs production operation at a workstation, whether the production person is walking, whether the production person performs production operation according to the production process requirements, and the like. The production behaviors of the hanging line further include, but are not limited to: whether the hanging line runs normally, whether the hanging line is positioned at a workstation, whether the hanging line is in a hovering state, and the like.
After receiving the video stream collected by the image collection device 30 and sent by the edge gateway node 20, the central management and control node 10 identifies the event in the production behavior from two dimensions of an instantaneous state and a long-term action by using a single frame of video frame as a granularity and combining a causal convolutional neural network, so that a large-scale video stream with a large time span and a high repetition degree can be analyzed in real time, and whether a specified event occurs or not can be accurately identified. The following description is provided for the process of identifying events in production behavior from both transient state and long-term action dimensions at the same time by the central management and control node 10 using a single frame of video frames as granularity in combination with the causal convolutional neural network.
The transient state can be understood as a short-term production behavior of the target object in the production environment, and the long-term action can be understood as a long-term production behavior of the target object in the production environment. To some extent, the transient state can be recognized by the action posture of the target object in a short time. The long-term action can be recognized through a plurality of action gestures of the target object in a long term.
For example, in the printing process of the intelligent clothing manufacturing industry, cut pieces that need printing have to be transported by the hanging line to the area where the printing machine is located, so that the printing machine can print them. At this time, the target objects to which production field management needs to pay attention may be the hanging line, the cut pieces, the production personnel, and the printing machine. The instantaneous state of the hanging line includes that the hanging line is rotating or has stopped rotating. The long-term actions of the hanging line (identified by the instantaneous states of the hanging line at a plurality of moments) include that the hanging line keeps rotating, the hanging line rotates for a while and then stops rotating, or the hanging line stays stopped. The instantaneous state of a cut piece includes that the cut piece is hanging on the hanging line or the cut piece is on the ground. The long-term actions of a cut piece (identified by the instantaneous states of the cut piece at a plurality of moments) include that the cut piece stays hung on the hanging line or the cut piece falls from the hanging line to the ground. The instantaneous state of a production person includes that the production person performs the printing operation at the printing machine or that the production person does not perform the printing operation at the printing machine. The long-term actions of a production person (identified by the instantaneous states of the production person at a plurality of moments) include that the production person always stays at the printing machine, the production person is never at the printing machine, or the production person stays at the printing machine for a while and then leaves. The instantaneous state of the printing machine includes that the printing machine is printing the cut pieces or the printing machine is not printing the cut pieces. The long-term actions of the printing machine (identified by the instantaneous states of the printing machine at a plurality of moments) include that the printing machine keeps printing the cut pieces, the printing machine never prints the cut pieces, or the printing machine prints the cut pieces for a while and then stops printing.
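For illustration only, the instantaneous-state and long-term-action labels of the printing scenario above could be enumerated as in the following sketch; the class and label names are assumptions made for this example, not a label set defined by this application.

```python
# Hypothetical label enumerations for the printing scenario; all names are assumptions.
from enum import Enum

class HangingLineState(Enum):        # instantaneous state of the hanging line
    ROTATING = 0
    STOPPED = 1

class HangingLineAction(Enum):       # long-term action, recognized over many frames
    KEEPS_ROTATING = 0
    ROTATES_THEN_STOPS = 1
    STAYS_STOPPED = 2

class CutPieceState(Enum):           # instantaneous state of a cut piece
    ON_HANGING_LINE = 0
    ON_GROUND = 1

class CutPieceAction(Enum):          # long-term action of a cut piece
    STAYS_ON_HANGING_LINE = 0
    FALLS_TO_GROUND = 1
```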
Further optionally, as shown in fig. 1, the central management and control node 10 sends the event identification result to the corresponding production equipment 40 via the edge gateway node 20. The production apparatus 40 receives the event recognition result and outputs the event recognition result. Wherein, the event identification result comprises whether a specified event occurs in the production process. Wherein, the appointed event is flexibly set according to the actual application requirement. The specified event refers to an event that needs attention in the production field management, and may be one event or a plurality of events.
Further optionally, in the embodiment of the present application, the specified events may be classified according to application requirements, and a category to which one specified event belongs is identified by an event category of the specified event. Alternatively, the designated event may be distinguished into a normal event and an abnormal event according to the role attribute of the event exerted in the production process. Optionally, the specified events may be classified from the dimension of the target object associated with the specified events, or the specified events may be classified from the dimension of the occurrence time of the specified events. Taking the classification of the designated events based on the target object associated with the designated events as an example, the event that the clothes S abnormally fall from the hanging line, the event that the clothes S are transported from the workstation A to the workstation B, and the event that the clothes S are cut by the producer with scissors all belong to the event category corresponding to the clothes S; accordingly, the event that the production tool tweezers P are on the table of the workstation, the event that the tweezers P fall to the ground, and the event that the tweezers P are used by the production personnel all belong to the event category corresponding to the tweezers P. Taking the classification of the designated events based on their occurrence times as an example, the event of abnormal dropping of clothes from the hanging line, the event of transporting clothes from the workstation a to the workstation B, and the event of using the tweezers P by the production staff, which occur in the same time period, all belong to the same event category corresponding to the time period. Based on the above, the event identification result may include, in addition to the information of whether the specified event occurs in the production process, an event category to which the specified event belongs when the specified event occurs.
Further optionally, when a specified event occurs in the production environment, the central management and control node 10 outputs the event identification result to the production equipment 40 adapted to the specified event in the production environment; the production equipment 40 is further configured to display the event recognition result on its display screen so as to provide the event recognition result to the corresponding production personnel. The production personnel can know, according to the event identification result, whether an abnormal event exists or has occurred in the factory, so as to handle it or make adjustments in time. For example, if a hanging line hovering event occurs, a production person can troubleshoot the production line in time; for another example, if clothes fall off from the hanging line abnormally, the fallen clothes need to be collected in time and their quality checked, and meanwhile the state of the mounting part used for hanging the clothes on the hanging line can be checked; for another example, if an event that a production person leaves a workstation for a long time is found, a manager can contact the production person in time to learn the reason. For another example, if an event occurs in which the printing machine does not perform the printing operation during the production period, the production personnel or the management personnel can check in time whether the printing machine has failed. Alternatively, the production device 40 may also send the event identification result to a production line manager to make an abnormal event statistical report, and improve the management of production lines, devices, personnel and the like in the digital plant according to the abnormal event statistical report, so as to improve the abnormality management level of the digital plant.
Further optionally, as shown in fig. 1, the digital production management system further includes: a scheduling system 50 for scheduling orders to be produced. The scheduling system 50 is enclosed by a dashed box in FIG. 1, indicating that the scheduling system is an optional component of the digital production management system.
Scheduling, i.e., production scheduling, refers to the process of distributing production tasks (specifically, to-be-produced orders) to various production lines. On the premise of considering capacity and equipment, under the condition of a certain quantity of materials, the production sequence of each production task is arranged, the production sequence is optimized, and the production equipment 40 is optimally selected, so that the waiting time is reduced, and the production loads of each production equipment 40 and production personnel are balanced. Thereby optimizing the productivity, improving the production efficiency and shortening the production period.
The scheduling system may provide scheduling services, and certainly, the scheduling system may also provide various services such as data storage and data calculation. The scheduling system can be realized by hardware or software. When the scheduling system is implemented in hardware, the scheduling system may comprise a single server or a distributed server cluster consisting of multiple servers. When the scheduling system is implemented as software, the scheduling system may be multiple software modules or a single software module, and the software modules may be deployed in a virtual machine, a container, a physical machine, a server cluster, or the like, and the embodiments of the present application are not limited.
When the scheduling system performs scheduling on an order to be produced, it mainly undertakes, but is not limited to, the following production scheduling tasks: determining, for the products required by the order to be produced, a production line for producing those products; and determining the production period during which that production line can produce the products required by the order to be produced.
When the scheduling system performs scheduling on the to-be-produced order, the scheduling system can perform scheduling on the to-be-produced order by adopting various scheduling strategies to obtain scheduling plan information. For example, the scheduling policy may be to schedule the order to be produced by integrating order attribute information such as delivery time of the order to be produced, required production resources, required raw materials, and manufacturing complexity of the product to be produced.
Further optionally, the central management and control node 10 is further configured to: acquiring a specified event occurring in the production process from the event identification result; analyzing production state data on equipment dimension, production line dimension, personnel dimension and/or material dimension according to specified events occurring in the production process; and generating scheduling guide information according to the production state data on the equipment dimension, the production line dimension, the personnel dimension and/or the material dimension, and sending the scheduling guide information to a scheduling system so as to guide the scheduling system to schedule the order to be produced. The production state data of the device dimension refers to which production devices 40 have the specified events and the frequency, time, category and other data of the specified events from the dimension of the production devices 40; the production state data of the production line dimension refers to data such as specified events of which production lines occur and frequency, time, types and the like of the specified events from the dimension of the production lines; the production state data of the staff dimension refers to data such as which production staff or management staff occur specified events and the frequency, time, category and the like of the specified events from the viewpoint of the staff dimension; the production state data of the material dimension refers to the materials which are seen from the dimension of the materials and have the specified events, and the frequency, time, category and other data of the specified events. Wherein the scheduling guidance information may be generated from the production state data in at least one of the dimensions. For example, according to the production state data on the device dimension, information on which production devices 40 are prone to malfunction can be generated as the scheduling guidance information; for another example, according to the production state data on the material dimension, information on which materials are easy to damage can be generated as production scheduling guide information; for another example, according to the production state data on the staff dimension, information of which production staff are easy to have production accidents can be generated as scheduling guidance information; for another example, according to the production state data on the dimension of the production line, information about which production lines have larger daily remaining capacity can be generated as scheduling guidance information; for another example, information including the daily remaining capacity of the production line and which production devices 40 on the production line are likely to fail may be generated as the scheduling guidance information based on the production status data in both the device dimension and the line dimension. For the scheduling system, after receiving the scheduling guidance information, the scheduling system may perform scheduling on the order to be produced according to the scheduling guidance information. 
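As a rough illustration of this feedback loop, the sketch below aggregates specified-event records into production state data per device, line, personnel and material dimension; the record fields ('device', 'line', 'person', 'material') and the ranking-by-frequency heuristic are assumptions made for this sketch, not the scheduling policy actually used by the scheduling system.

```python
# A minimal sketch of per-dimension aggregation of specified events; field names are assumptions.
from collections import Counter
from typing import Dict, Iterable, List, Tuple

def build_scheduling_guidance(events: Iterable[dict]) -> Dict[str, List[Tuple[str, int]]]:
    """events: dicts with hypothetical keys 'device', 'line', 'person', 'material'."""
    counters = {dim: Counter() for dim in ("device", "line", "person", "material")}
    for event in events:
        for dim, counter in counters.items():
            key = event.get(dim)
            if key is not None:
                counter[key] += 1          # count specified events per entity in this dimension
    # Scheduling guidance: for each dimension, the entities ranked by specified-event frequency,
    # e.g. which production devices fail most often or which lines have the fewest abnormal events.
    return {dim: counter.most_common() for dim, counter in counters.items()}
```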
For example, in the case that the scheduling guidance information includes the daily remaining capacity of the production line and information about which production devices 40 on the production line are likely to fail, the to-be-produced order is preferentially scheduled for production on the production device 40 on the production line with a large daily remaining capacity and which production device is not likely to fail. For another example, in the case where the scheduling guidance information includes information on which production staff are not likely to have a production accident, the order to be produced is preferentially scheduled to be produced by the production staff that are not likely to have the production accident. Or, in the case that the scheduling guidance information includes information about which materials are easily damaged, the materials which are easily damaged may be processed by using a special production line.
In the embodiment, the production state data is fed back to the scheduling system, so that the guidance of the production state data to the scheduling system is realized, the scheduling rationality is promoted, and the overall productivity and efficiency of the digital factory are improved.
In the above or below embodiments of the present application, the specific implementation of event identification from transient state and long-term action dimension by using a causal convolutional neural network with a central control node 10 using a single video frame as input is not limited. An embodiment in which the central control node 10 uses a single video frame as input and uses a causal convolutional neural network to perform event identification from both transient state and long-term action dimensions is described in detail below.
Specifically, when receiving a video stream including continuous video frames sent by the image acquisition device 30, the central management and control node 10 identifies, based on the causal convolutional neural network, an instantaneous state and a long-term action of the current video frame with respect to the received current video frame, to obtain a state tag and an action tag in the current video frame; and according to the state labels and the action labels in the current video frame, combining the state labels and the action labels in a plurality of historical video frames to carry out event identification, and obtaining an event identification result.
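A minimal sketch of this per-frame flow is given below, assuming a state-action recognition model and an event decision model with the simplified interfaces shown (the real models and their inputs are detailed in the following paragraphs); the history length of 16 tags is an arbitrary assumption.

```python
# A minimal sketch of the per-frame event identification flow; interfaces are assumed.
from collections import deque

class EventIdentifier:
    def __init__(self, state_action_model, event_decision_model, history_len=16):
        self.state_action_model = state_action_model      # causal-CNN based state/action recognizer
        self.event_decision_model = event_decision_model  # event decision model
        self.tag_history = deque(maxlen=history_len)      # state/action tags of recent historical frames

    def process_frame(self, frame):
        # The current video frame alone is fed to the recognizer to obtain its two tags.
        state_tag, action_tag = self.state_action_model(frame)
        # The current tags are combined with the tags of historical frames for event decision.
        result = self.event_decision_model(state_tag, action_tag, list(self.tag_history))
        self.tag_history.append((state_tag, action_tag))
        return result    # e.g. whether a specified event occurred, and its category
```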
In the embodiment of the present application, a Causal convolution (cause convolution) neural network mainly undertakes, but is not limited to, transient state and long-term action recognition on a single frame video frame. The network structure of the causal convolutional neural network at least comprises a causal convolutional layer, and the specific network structure of the causal convolutional neural network can be flexibly set according to the actual application requirements, which is not limited in the present application.
In addition, the network structure of the causal convolutional layer is not limited in the embodiments of the present application. Referring to fig. 2, a causal convolutional layer is shown, which includes an input layer (Input), hidden layers (Hidden layers), and an output layer (Output). It is noted that causal convolutional layers are suited to sequence modeling (Sequence Modeling) problems, such as processing a segment of video or audio, and usually operate along the time direction (time sequence). When a causal convolutional layer is used to process a sequence problem, the output data at time t is jointly decided by learning the input data up to time t-1 and combining it with the input data at the current time t; that is, the output data at time t depends not only on the input data at time t but also on the input data at one or more historical times (such as times 1, 2, …, t-1).
Assuming a given input sequence x0, x1, …, xT, the causal convolutional layer outputs y0, y1, …, yT correspondingly, where each output yt is obtained by observing only x0, x1, …, xt; here t and T are integers with 0 ≤ t ≤ T. Therefore, the causal convolutional layer starts from the input data at the current time and traces back the input data at historical times, and does not consider the input data at future times, so that the causal convolutional layer is one-sided in the time direction.
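The sketch below shows one common way (an assumption, not necessarily the exact layer used here) to realize this one-sided behavior for a 1D sequence: padding only on the past side so that the output at time t depends only on inputs at times ≤ t within the kernel's receptive field.

```python
# A minimal sketch of a causal 1D convolution: left-only padding keeps y_t
# independent of any future input x_{t+1}, x_{t+2}, ...
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                      # x: (batch, channels, T)
        x = F.pad(x, (self.left_pad, 0))       # pad only on the past side of the time axis
        return self.conv(x)                    # y_t sees x_{t-k+1} ... x_t only
```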
It is worth noting that in the event identification processing process, causal convolution processing capable of considering characteristic information in historical video frames is integrated, so that event identification can be carried out by taking the current video frame as input, the real-time analysis capability of video streams can be improved, and long-term actions can be considered; in the event identification process, the long-term action and the instantaneous state are considered, the instantaneous state and the long-term action can be combined to identify the event, the accuracy of event identification is improved, and the event identification result is more reliable.
It should be understood that the time span of the long-term action is large, the instantaneous state can help to supervise the long-term action, and the instantaneous state and the long-term action of the production behavior enrich the information of event decision, and can help to improve the accuracy of event recognition. In addition, the central control node only needs to input a single frame of video frame without inputting a large number of video frames and combines a small number of historical video frames to realize the multi-dimensional identification of the instantaneous state and the long-term action of the production behavior, thereby reducing the consumption of computing resources caused by processing a large number of video frames and improving the overall efficiency of event identification.
Further optionally, a trained state-action recognition model may be deployed at the central control node, and the state-action recognition model may include a causal convolutional neural network, a state recognition network, and an action recognition network, where a training process of the state-action recognition model is described in detail later. When the method is applied specifically, a current video frame to be identified in a video stream can be input into a causal convolutional neural network, and the causal convolutional neural network respectively outputs first characteristic information reflecting an instantaneous state and second characteristic information reflecting a long-term action in the current video frame; inputting the first characteristic information into a state identification network for instantaneous state identification to obtain a state label in the current video frame; and inputting the second characteristic information into an action recognition network for long-term action recognition to obtain an action label in the current video frame.
Therefore, in the foregoing or following embodiments of the present application, an implementation process of performing instantaneous state and long-term action recognition on a current video frame based on a causal convolutional neural network to obtain a state tag and an action tag in the current video frame is as follows: inputting the current video frame into a state-action recognition model, and performing K times of convolution processing on the current video frame inside the state-action recognition model to obtain first characteristic information output by the N-th convolution processing and second characteristic information output by the K-th convolution processing; identifying a state label in the current video frame according to the first characteristic information, and identifying an action label in the current video frame according to the second characteristic information; where K and N are positive integers, 1 ≤ N < K, K ≥ 2, and at least one causal convolution processing exists after the N-th convolution processing.
Specifically, the causal convolutional neural network includes K convolutional layers in total, where the K convolutional layers include one or more causal convolutional layers, and of course, the K convolutional layers may also include one or more spatial convolution (Spatial convolution) layers. Notably, a spatial convolutional layer is a 2D convolutional layer (i.e., a two-dimensional convolutional layer) whose convolution kernel (kt, kw, kh) has kt = 1. A causal convolutional layer is a 3D spatio-temporal convolutional layer (i.e., a three-dimensional spatio-temporal convolutional layer) whose convolution kernel (kt, kw, kh) has kt > 1. Here kt is the convolution kernel size in the time dimension, and kw and kh are the convolution kernel sizes in the spatial dimensions. kt = 1 indicates that only the current video frame is considered when performing the convolution processing; kt > 1 indicates that, when the convolution processing is performed, the current video frame is considered and at least one historical video frame is traced back, where the number of historical video frames is kt - 1. In addition, the larger the value of kt, the larger the temporal receptive field. For example, the temporal receptive field of kt = 1 covers only the current time; the temporal receptive field of kt = 3 covers the current time and the two times preceding it; the temporal receptive field of kt = 5 covers the current time and the four times preceding it; that is, the temporal receptive field of kt = 5 is larger than that of kt = 3, and the temporal receptive field of kt = 3 is larger than that of kt = 1.
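The two layer types can be sketched as follows, assuming a PyTorch-style realization in which the causal layer pads only past frames; the kernel sizes and the padding scheme are illustrative assumptions.

```python
# A minimal sketch of a spatial convolutional layer (kt = 1) and a causal
# spatio-temporal convolutional layer (kt > 1) that traces back kt - 1 historical frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_conv(cin, cout, kw=3, kh=3):
    # kt = 1: effectively a 2D convolution applied frame by frame
    return nn.Conv3d(cin, cout, kernel_size=(1, kw, kh), padding=(0, kw // 2, kh // 2))

class CausalConv3d(nn.Module):
    def __init__(self, cin, cout, kt=3, kw=3, kh=3):
        super().__init__()
        self.kt = kt
        self.conv = nn.Conv3d(cin, cout, kernel_size=(kt, kw, kh),
                              padding=(0, kw // 2, kh // 2))

    def forward(self, x):                             # x: (batch, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.kt - 1, 0))    # pad past frames only (time dimension)
        return self.conv(x)                           # kt = 3 sees the current + 2 previous frames
```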
It is noted that the first feature information reflecting the transient state is a middle-layer feature output by the causal convolutional neural network, the second feature information reflecting the long-term action is a high-layer feature output by the causal convolutional neural network, and the high-layer feature is obtained at least based on performing at least one causal convolutional process on the middle-layer feature.
When the method is applied specifically, the layout of the spatial convolutional layers and the causal convolutional layers in the causal convolutional neural network can be flexibly set. For example, in a causal convolutional neural network, the spatial convolutional layers and the causal convolutional layers are arranged in order of the hierarchy from low to high. In a specific embodiment, the first N convolutional layers are all spatial convolutional layers, and N spatial convolutions are performed on the current video frame to obtain the first characteristic information; then K-N causal convolutions are performed on the first characteristic information to obtain the second characteristic information. For another example, in a causal convolutional neural network, the first N convolutional layers may include N1 spatial convolutional layers and N2 causal convolutional layers, where N1 and N2 are natural numbers ≥ 0 and N1 + N2 = N. That is, when neither N1 nor N2 is 0, the first N convolutional layers include both spatial convolutional layers and causal convolutional layers. When N1 is 0 and N2 is not 0, the first N convolutional layers include only causal convolutional layers. When N1 is not 0 and N2 is 0, the first N convolutional layers include only spatial convolutional layers.
In addition, among the K-N convolutional layers between the Nth convolutional layer and the Kth convolutional layer, at least one causal convolutional layer needs to be included, and a spatial convolutional layer may be included or may not be included. Assuming that the number of space convolutional layers among the K-N convolutional layers is L1, the number of causal convolutional layers is L2, L1 is a natural number of 0 or more, L2 is a positive integer, and L1+ L2+ N = K. For example, L1 equals 0, the K-N convolutional layers do not include a space convolutional layer. When L1 is not equal to 0, the K-N convolutional layers include space convolutional layers.
Based on the above, in an optional embodiment of the present application, one implementation process of performing convolution processing on a current video frame K times to obtain first feature information output by convolution processing N times and second feature information output by convolution processing K times is: carrying out N1 times of spatial convolution processing and N2 times of causal convolution processing on a current video frame to obtain first characteristic information; and performing L1 spatial convolution processing and L2 causal convolution processing on the first feature information to obtain second feature information.
Further optionally, when the first N convolutional layers of the causal convolutional neural network are set, in order to make the numbers of the spatial convolutional layers and the causal convolutional layers more balanced, the spatial convolutional layers and the causal convolutional layers may be alternately set. Specifically, if N1= N2= N/2, then N1 spatial convolution processes and N2 causal convolution processes are performed on the current video frame, and an implementation process of obtaining the first feature information is as follows: and alternately carrying out N/2 times of spatial convolution processing and causal convolution processing on the current video frame to obtain first characteristic information. For example, N = 4. The first N convolutional layers are sequentially a space convolutional layer, a causal convolutional layer, a space convolutional layer and a causal convolutional layer; alternatively, the first N convolutional layers are sequentially a causal convolutional layer, a spatial convolutional layer, a causal convolutional layer, and a spatial convolutional layer.
Further optionally, when K-N convolutional layers located after the first N convolutional layers are provided, in order to make the numbers of spatial convolutional layers and causal convolutional layers more balanced, the spatial convolutional layers and causal convolutional layers may be alternately provided. Specifically, if L1= L2= (K-N)/2, L1 spatial convolution processes and L2 causal convolution processes are performed on the first feature information, and one implementation process of obtaining the second feature information is as follows: and (K-N)/2 times of spatial convolution processing and causal convolution processing are alternately carried out on the first characteristic information to obtain second characteristic information. For example, N =4, K =10, the 6 convolutional layers following the first N convolutional layers being, in order, a spatial convolutional layer, a causal convolutional layer, a spatial convolutional layer, and a causal convolutional layer; alternatively, the 6 convolutional layers located after the first N convolutional layers are sequentially a causal convolutional layer, a spatial convolutional layer, a causal convolutional layer, and a spatial convolutional layer.
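Putting the pieces together, the following sketch builds an alternating spatial/causal backbone in which the N-th layer's output (mid-level features) feeds a state head and the K-th layer's output (high-level features) feeds an action head; the channel width, K = 8, N = 4, the pooling scheme and the head shapes are all assumptions, and the model is applied here to a short clip of frames (the streaming single-frame variant with feature cache queues is sketched further below). CausalConv3d repeats the earlier sketch so this block stays self-contained.

```python
# A minimal sketch of a K-layer state-action recognition backbone with two heads; sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, cin, cout, kt=3):
        super().__init__()
        self.kt = kt
        self.conv = nn.Conv3d(cin, cout, kernel_size=(kt, 3, 3), padding=(0, 1, 1))

    def forward(self, x):                              # x: (B, C, T, H, W)
        return self.conv(F.pad(x, (0, 0, 0, 0, self.kt - 1, 0)))

def spatial_conv3d(cin, cout):
    return nn.Conv3d(cin, cout, kernel_size=(1, 3, 3), padding=(0, 1, 1))

class StateActionModel(nn.Module):
    def __init__(self, channels=32, num_states=4, num_actions=6, n=4, k=8):
        super().__init__()
        layers, cin = [], 3
        for i in range(k):                             # alternate spatial / causal layers
            layer = spatial_conv3d(cin, channels) if i % 2 == 0 else CausalConv3d(cin, channels)
            layers.append(layer)
            cin = channels
        self.layers = nn.ModuleList(layers)
        self.n = n
        self.state_head = nn.Linear(channels, num_states)    # reads mid-level (N-th layer) features
        self.action_head = nn.Linear(channels, num_actions)  # reads high-level (K-th layer) features

    def forward(self, x):                              # x: (B, 3, T, H, W)
        first_feat = None
        for i, layer in enumerate(self.layers, start=1):
            x = F.relu(layer(x))
            if i == self.n:
                first_feat = x                         # first characteristic information
        pooled_mid = first_feat.mean(dim=(2, 3, 4))    # global pooling over time and space
        pooled_high = x.mean(dim=(2, 3, 4))            # second characteristic information, pooled
        return self.state_head(pooled_mid), self.action_head(pooled_high)
```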
In the above or below embodiments of the present application, for each causal convolution processing, feature information obtained by a plurality of historical video frames in a previous convolution processing is obtained from a feature cache queue corresponding to the causal convolution processing as a plurality of historical intermediate feature information; and taking the feature information obtained in the previous convolution processing of the current video frame as the current intermediate feature information, and performing causal convolution processing on the current intermediate feature information and the plurality of historical intermediate feature information to obtain the feature information output by the causal convolution processing. Note that, if the causal convolution process is the nth convolution process, the feature information output by the causal convolution process is the first feature information. And if the causal convolution processing is the Kth convolution processing, the characteristic information output by the causal convolution processing is second characteristic information.
Notably, each causal convolutional layer is configured with its own feature buffer queue, the queue length of which is constrained by the size of the convolutional kernel of the causal convolutional layer in the time dimension, preferably, the queue length is equal to the size of the convolutional kernel in the time dimension; of course, the length of the feature buffer queue may also be larger than the size of the convolution kernel in the time dimension. For any video frame input into the causal convolutional neural network, after the video frame is subjected to one or more convolution processes, if the next convolution process is the causal convolutional process, inserting the feature information obtained in the previous convolution process of the video frame into a feature cache queue of the next causal convolutional process. And as time goes on, more and more feature information of the historical video frames is stored in the feature buffer queue of the next causal convolution processing. If the number of the feature information of the historical video frames stored in the feature cache queue is equal to the queue length, the feature information of the historical video frames at the tail of the feature cache queue is discarded, and then the feature information of the current video frames is inserted into the head of the feature cache queue according to a first-in first-out principle.
For ease of understanding, this is illustrated in connection with FIG. 4. In FIG. 4, the length of the feature cache queue of a causal convolutional layer is k_t, which is also the size of that causal convolutional layer's convolution kernel in the time dimension. The feature data obtained after one or more convolution processes of the current video frame is inserted at the head of the feature cache queue of the causal convolutional layer, and then k_t pieces of feature data are taken from the queue for the spatio-temporal convolution calculation (i.e., the causal convolution calculation). Note that before inserting the current video frame's feature data at the head of the queue, it is first determined whether the queue is full: if it is full, the data at the tail of the queue is discarded first and the current feature data is then inserted at the head; if it is not full, the current feature data is inserted at the head directly.
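The queue mechanism of FIG. 4 can be approximated with a fixed-length FIFO buffer. The sketch below is a simplified illustration only; the class name, feature shapes and zero initialization are assumptions rather than details taken from this application:

```python
from collections import deque
import numpy as np

class FeatureCacheQueue:
    """Toy per-layer cache: length k_t, oldest entry dropped first (FIFO)."""
    def __init__(self, k_t: int, feat_shape: tuple):
        self.k_t = k_t
        # Initialize with zeros so the very first causal step already has k_t
        # entries, mirroring the "initial value" behaviour described above.
        self.queue = deque([np.zeros(feat_shape) for _ in range(k_t)], maxlen=k_t)

    def push(self, feature: np.ndarray) -> None:
        # A deque with maxlen discards the oldest item before appending the newest.
        self.queue.append(feature)

    def window(self) -> np.ndarray:
        # Features of the current frame plus the k_t - 1 most recent historical
        # frames, stacked along a time axis for the causal convolution step.
        return np.stack(list(self.queue), axis=0)

# Usage: push each frame's intermediate feature, then convolve over window().
cache = FeatureCacheQueue(k_t=3, feat_shape=(64, 28, 28))
for t in range(5):
    cache.push(np.random.rand(64, 28, 28))
    temporal_window = cache.window()      # shape (3, 64, 28, 28)
```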
It is to be noted that, if the first convolution process of the causal convolutional neural network is a causal convolution process, the initial values in the feature cache queue are used as the plurality of pieces of historical intermediate feature information.
Further optionally, when there are two or more causal convolution processes, the size of the convolution kernel used in the time dimension gradually increases with the depth of the causal convolution process, and the length of the corresponding feature cache queue increases accordingly. It should be understood that the larger the temporal size of the convolution kernel used in a causal convolution process, the more historical video frames that process takes into account, and the larger the receptive field of the model, which can improve the recognition accuracy of transient states and long-term actions.
In the above or following embodiments of the present application, on the premise that the causal convolutional neural network in the state-action recognition model includes K convolutional layers, the convolutional layer corresponding to the Nth convolution process may be connected to the state recognition network, and the convolutional layer corresponding to the Kth convolution process may be connected to the action recognition network. That is, the state recognition network is attached at the middle of the causal convolutional neural network, so its input parameters are middle-layer features of the network; the action recognition network is attached at the rear of the causal convolutional neural network, so its input parameters are high-level features of the network.
It is noted that the middle-layer features input to the state recognition network may be the feature data obtained by spatial convolution processing of the video frame at time t_0 (the current video frame) in FIG. 3. Taking into account motion blur, object occlusion, video codec loss, timing dependencies, and other factors, the middle-layer features input to the state recognition network may instead be obtained by spatial convolution and causal convolution processing of the current video frame together with a small number of historical video frames. In that case, the temporal receptive field of the middle-layer features input to the state recognition network covers the current video frame and a small number of historical video frames. As shown in FIG. 3, this temporal receptive field spans from time t_0 to time t_{-s}, where s is a positive integer and the negative sign indicates a historical time.
In addition, the time period covered by the temporal receptive field of the high-level features input to the action recognition network is longer. As shown in FIG. 3, the temporal receptive field of the high-level features input to the action recognition network spans from time t_0 to time t_{-l}, where l is a positive integer and l is greater than s. The values of l and s can be set flexibly according to the application scenario and are not limited here; for example, s may take the value 8 and l may take the value 15.
It is worth noting that because the temporal receptive field of the high-level features input to the action recognition network covers a longer time period, it can satisfy the duration requirement of long-term actions and thus improve their recognition accuracy. By contrast, when a conventional 3D-convolution video recognition model recognizes long-term actions, it is limited by computing power and can only take a small number of frames as input for inference. For example, if such a model can only input 32 video frames for inference and the frame rate is 4 FPS, those 32 frames correspond to only 8 seconds of video, which is hardly enough to cover a long-term action; that is, the conventional 3D-convolution video recognition model has low recognition accuracy for long-term actions. The state-action recognition model provided in the embodiments of the present application, however, needs only the current video frame for a single inference, while its temporal receptive field, thanks to the stacking effect of causal convolution, can cover video frames over a much longer time period; this breaks through the limitations of computing power and input frame count and improves the recognition accuracy of long-term actions.
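As a rough illustration of this stacking effect only (with made-up kernel sizes, and assuming stride-1, non-dilated causal convolutions), the temporal receptive field grows additively with depth:

```python
# Illustrative arithmetic only: with stride-1, non-dilated causal convolutions,
# stacking layers with temporal kernel sizes k_i gives a temporal receptive
# field of 1 + sum(k_i - 1) frames, so depth extends temporal coverage without
# enlarging the per-inference input (a single current frame).
def temporal_receptive_field(kernel_sizes):
    return 1 + sum(k - 1 for k in kernel_sizes)

# e.g. five causal layers whose temporal kernels grow with depth:
print(temporal_receptive_field([2, 2, 3, 4, 5]))   # -> 12 frames
# at 4 FPS this covers 12 / 4 = 3 seconds per inference, even though only the
# current frame is fed to the network (the history comes from the caches).
```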
In the foregoing or following embodiments of the present application, one implementation of identifying the status tag in the current video frame according to the first feature information is: inputting the first characteristic information into a state identification network, performing pooling processing on the first characteristic information to obtain third characteristic information, and classifying the third characteristic information by using a multilayer perceptron to obtain a state label in the current video frame; accordingly, one implementation process for identifying the action tag in the current video frame according to the second feature information is as follows: and inputting the second characteristic information into the action recognition network, performing pooling processing on the second characteristic information to obtain fourth characteristic information, and classifying the fourth characteristic information by using a multilayer perceptron to obtain an action label in the current video frame.
In specific application, referring to fig. 3, a pooling layer and a Multilayer Perceptron (MLP) may be respectively disposed in the state recognition network and the action recognition network; pooling layers can be used to compress the amount of data and parameters, reducing overfitting; multiple tiers of perceptrons may be used for the classification process.
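A minimal sketch of such a head, assuming global average pooling and a two-layer perceptron with an arbitrary hidden width and arbitrary class counts (none of which are specified by this application), might look as follows:

```python
import torch
import torch.nn as nn

# A hedged sketch of one recognition head: pooling compresses the feature map,
# then a small multilayer perceptron (MLP) produces class logits. Channel
# counts, hidden width and class counts are illustrative assumptions.
class RecognitionHead(nn.Module):
    def __init__(self, in_channels: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # reduces data volume and parameters
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),            # logits for state or action labels
        )

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.pool(feature_map))

# e.g. a state head on middle-layer features, an action head on high-level features:
state_head = RecognitionHead(in_channels=64, num_classes=8)
action_head = RecognitionHead(in_channels=128, num_classes=12)
```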
In the above or below embodiments of the present application, referring to fig. 4 and fig. 5, an event decision model may be trained in advance, and event recognition may be performed on the state sequence and the action sequence output by the action-state recognition model through the event decision model. Then, according to the state tags and the action tags in the current video frame, combining the state tags and the action tags in a plurality of historical video frames to perform event identification, so as to obtain an event identification result, an implementation process is as follows: combining the state tags and the action tags in the current video frame with the state tags and the action tags in a plurality of historical video frames respectively to obtain a state sequence and an action sequence; inputting the state sequence and the action sequence into an event decision model, analyzing whether the specified event corresponding to the state sequence and the action sequence exists or not by adopting a decision algorithm based on the corresponding relation between the specified event and the state and the action which are learned in advance, and determining the event type of the specified event under the existing condition.
In fig. 4 and 5, the event categories are illustrated as two types of normal events and abnormal events. The event identification result comprises whether a specified event occurs or not, and whether the specified event is a normal event or an abnormal event. If the specified event is an abnormal event, the type of the abnormal event to which the abnormal event belongs is further judged.
In the embodiments of the present application, when the state sequence or the action sequence is assembled, the number of historical video frames can be determined according to the actual application requirements. Further optionally, considering that the length of the feature cache queue of a causal convolutional layer is related to the number of historical video frames, the number of historical video frames should not exceed the length of the corresponding feature cache queue. Further optionally, the number of historical video frames may be set flexibly based on the model recognition accuracy and the amount of computation. The number of historical video frames required by the state sequence and the number required by the action sequence may be the same or different; FIG. 5 illustrates the case where the two numbers are the same. In FIG. 5, the number of video frames covered by the state sequence and by the action sequence is d+1, i.e., the state sequence consists of the state tag corresponding to the video frame at time t_0 and the state tags corresponding to the d historical video frames earlier than time t_0, so the state sequence contains d+1 state tags; likewise, the action sequence consists of the action tag corresponding to the video frame at time t_0 and the action tags corresponding to the d historical video frames earlier than time t_0, so the action sequence contains d+1 action tags, where d is a positive integer.
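For illustration, assembling the two sequences can be as simple as maintaining two bounded buffers of recent tags; d = 7 and the function name below are arbitrary examples rather than values from this application:

```python
from collections import deque

# A minimal sketch of assembling the inputs to the event decision model: keep
# the most recent d historical tags plus the current frame's tag, giving a
# state sequence and an action sequence of length d + 1 each (once warmed up).
d = 7
state_history = deque(maxlen=d + 1)
action_history = deque(maxlen=d + 1)

def on_new_frame(state_label: str, action_label: str):
    state_history.append(state_label)
    action_history.append(action_label)
    state_sequence = list(state_history)      # up to d + 1 state tags
    action_sequence = list(action_history)    # up to d + 1 action tags
    return state_sequence, action_sequence    # fed to the event decision model
```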
In the embodiments of the present application, considering both the action sequence and the state sequence makes the event recognition more robust. Benefiting from a larger temporal receptive field, the model retains good recognition capability for long-term actions; the intermediate supervision provided by the state labels yields relatively stable middle-layer features, so action recognition also remains robust to rapidly switching actions; and combining the action sequence with the state sequence makes the event decision more reliable. Benefiting from causal convolution, only the current frame needs to be input for a single inference, which gives the model the ability to analyze video streams in real time and makes data utilization and computation more efficient.
The embodiment of the present application does not limit the decision algorithm. For example, the decision algorithm may be a heuristic rule algorithm, an LSTM (Long Short-Term Memory) algorithm, or a random forest algorithm. Heuristic rule algorithms include, for example and without limitation, simulated annealing, genetic algorithms, list search, evolutionary programming, evolutionary strategies, ant colony algorithms, and artificial neural networks.
The LSTM algorithm introduces three control gates, an input gate, a forget gate, and an output gate, to control the cell state. At time t, the LSTM has three inputs: the network input value x_t at the current time, the LSTM output value h_{t-1} at the previous time, and the cell state C_{t-1} at the previous time; and it has two outputs: the LSTM output value h_t at the current time and the cell state C_t at the current time. The forget gate controls how much of the previous cell state C_{t-1} is retained in the current cell state C_t; the input gate controls how much of the current network input x_t is stored in the current cell state C_t; and the output gate controls how much of the cell state C_t is output to the current LSTM output value h_t.
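For reference, the gate computations just described correspond to the standard LSTM cell equations, where sigma denotes the sigmoid function and \odot element-wise multiplication; the weight notation below is the conventional one, not notation introduced by this application:

```latex
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)            % forget gate
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)            % input gate
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)     % candidate cell state
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t         % updated cell state
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)            % output gate
h_t = o_t \odot \tanh(C_t)                              % current output value
```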
In the random forest algorithm, a random forest is a classifier that contains a number of decision trees, and its output class is the mode of the classes output by the individual trees. The creation of each decision tree in a random forest relies on an independently drawn sample set. Specifically, from an original training sample set whose total number of samples is N, N samples are repeatedly and randomly drawn with replacement to generate n new bootstrap sample sets, and then n decision trees are generated from the n bootstrap sample sets to form the random forest.
In the embodiments of the present application, taking the random forest algorithm as an example, when the event decision model is trained, an original training sample set with a total of N samples is prepared, and each of the N samples is labeled with a state tag, an action tag, a specified event and its category. N samples are repeatedly and randomly drawn with replacement from the N samples to generate n new bootstrap sample sets, each containing N samples. One sub-event decision model is trained with each bootstrap sample set: during training, the state tag and action tag of each sample are used as the model input, the specified event labeled for each sample and its category are used as the model output, and iterative training is performed until the loss function converges. After the n sub-event decision models are constructed, the event decision model can be built from them; the event recognition result output by the event decision model is the result that occurs most frequently among the recognition results of the n sub-event decision models. For example, if the event decision model includes 5 sub-event decision models, of which 3 output the event that the clothes fell off the hanging line and 2 output the event that the clothes are hung on the hanging line, then the event recognition result output by the event decision model is the event that the clothes fell off the hanging line.
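A toy sketch of this bagging-and-voting scheme follows; the sub-models are stand-in callables and all names are hypothetical, so this only illustrates the bootstrap drawing and the majority vote:

```python
import random
from collections import Counter

# Draw n bootstrap sample sets (each of size N, sampled with replacement),
# train one sub-model per set, and return the most frequent event prediction.
def build_bootstrap_sets(samples, n):
    N = len(samples)
    return [[random.choice(samples) for _ in range(N)] for _ in range(n)]

def ensemble_predict(sub_models, state_sequence, action_sequence):
    votes = [m(state_sequence, action_sequence) for m in sub_models]
    return Counter(votes).most_common(1)[0][0]   # majority vote over sub-models

# e.g. 5 stand-in sub-models, 3 of which vote for the "fell off" event:
sub_models = [lambda s, a: "clothes fell off the hanging line"] * 3 + \
             [lambda s, a: "clothes hung on the hanging line"] * 2
print(ensemble_predict(sub_models, [], []))      # -> clothes fell off the hanging line
```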
In the above or following embodiments of the present application, in the model training phase, a sample video labeled with an action tag, a state tag, and a specified event may be obtained; segmenting a sample video to obtain a plurality of sample segments, and performing two-stage model training by using the plurality of sample segments to obtain a state-action recognition model and an event decision model; when the model training is carried out on the current sample segment, values in a feature buffer queue, a state queue and an action queue, which are generated when the model training is carried out on the previous sample segment, are respectively used as initial values of corresponding queues in the current training process.
In the sample labeling stage, actions and transient states can be labeled on a long video segment at the same time, and the start and end times of each long-term action and transient state are determined. Long-term actions are labeled in a single-label multi-class (Multi-Class Classification) manner, i.e., one sample carries only one label although there are multiple label categories; transient states are labeled in a multi-label (Multi-Label Classification) manner, i.e., labels are attached to multiple target objects in one sample, so one sample carries multiple labels. For example, the action label of long sample 1 in FIG. 6 includes only action A, i.e., long sample 1 carries a single label in the action dimension, while its state labels include state 1, state 2 and state 3, i.e., long sample 1 carries three labels in the state dimension.
It is worth noting that the action labels are sparse in dimensionality, and training on action labels alone easily leads to unstable learned features; state labels, by contrast, generally have clear physical meaning and are easy to learn, so they provide a strong supervision signal, and joint training allows the model to obtain better and more stable low-level and middle-level feature representations. Without the state labels, the action labels remain sparse (only a few dozen classes) and the learned features are not stable enough; actions are sparse both in the time dimension and in the space dimension, and the objects that can be labeled with actions are relatively few, whereas the objects that can be labeled with states are relatively rich. For example, labeling the action of raising a pair of hand-held scissors requires the hands, the scissors and the raising motion to be annotated together, while for state labeling the hands and the scissors can be labeled separately as different state labels. Fusing action labels and state labels therefore facilitates event recognition.
Typically, the beginning of a new long-term action means that a new transient state will appear, and multiple transient state changes are allowed to occur within one long-term action (e.g., a single sewing action may involve multiple panel position changes). Therefore, in the sample labeling stage, a small proportion of samples labeled with only transient states or only long-term actions is also allowed; in this case, model training is set so that the unlabeled transient state or long-term action does not generate a corresponding back-propagation loss, and only the difference loss of the relatively close label is generated.
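One way to realize such supervision, sketched here only as an assumption (PyTorch-style, with arbitrary tensor shapes and an arbitrary weighting), is to mask out the missing annotations when computing the joint loss:

```python
import torch
import torch.nn.functional as F

# A hedged sketch of joint supervision: actions use single-label multi-class
# cross-entropy, states use multi-label binary cross-entropy, and frames
# without a state / action annotation contribute no loss for that branch.
def joint_loss(action_logits, action_target, action_mask,
               state_logits, state_target, state_mask, state_weight=1.0):
    # action_logits: (B, n_actions); action_target: (B,) long; action_mask: (B,) bool
    # state_logits / state_target: (B, n_states) float; state_mask: (B,) bool
    act_loss = torch.tensor(0.0)
    if action_mask.any():
        act_loss = F.cross_entropy(action_logits[action_mask],
                                   action_target[action_mask])
    st_loss = torch.tensor(0.0)
    if state_mask.any():
        st_loss = F.binary_cross_entropy_with_logits(state_logits[state_mask],
                                                     state_target[state_mask])
    return act_loss + state_weight * st_loss
```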
After the original long video is labeled to obtain a sample video, the sample video is segmented to obtain a plurality of sample segments. The sample video can be segmented at equal intervals to obtain sample segments of the same length, or it can be segmented randomly to obtain sample segments of different lengths so as to enhance sample randomness.
For each sample segment, video frames are extracted from the segment to obtain a plurality of sample video frames; during extraction, one frame is taken every Δt, for example one frame out of every 5 frames of the sample segment, yielding n frames. In theory n could be arbitrarily large, and larger is generally better, but for simplicity and regularity of implementation a fixed number of frames may be used; n is a positive integer and may, for example, be 96. In addition, considering that the recognition process generally needs at least several frames to become reliable, the labels of the first m of the n frames are ignored, i.e., the first m frames do not generate a corresponding back-propagation loss. Optionally, m is a positive integer and may be set to 4, but is not limited thereto.
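A minimal sketch of this sampling and loss-masking step, using the example values Δt = 5, n = 96 and m = 4 from the text (the frame list itself is a placeholder), is:

```python
# Take one frame every delta_t frames from a sample segment, keep at most n of
# them, and mark the first m sampled frames so they produce no back-propagation
# loss. All concrete values follow the examples given above.
def sample_segment(segment_frames, delta_t=5, n=96, m=4):
    sampled = segment_frames[::delta_t][:n]             # 1 frame every delta_t, at most n
    supervise = [i >= m for i in range(len(sampled))]   # first m frames: no loss
    return sampled, supervise

frames = list(range(600))                               # stand-in for decoded frames
sampled, supervise = sample_segment(frames)
print(len(sampled), supervise[:6])                      # -> 96 [False, False, False, False, True, True]
```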
If the number of sample video frames obtained by sampling the current sample segment is greater than or equal to the set frame-count threshold, the influence of the previous sample segment does not need to be considered, and the initial values of the corresponding queues in the current training process are initialized to arbitrary values, such as 0 or other numerical values. If the number of sample video frames obtained by sampling the current sample segment is less than the set frame-count threshold, the influence of the previous sample segment should be considered, and the values in the feature cache queue, the state queue and the action queue generated when model training was performed on the previous sample segment are used as the initial values of the corresponding queues in the current training process. The frame-count threshold is set flexibly according to the actual application requirements.
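The initialization rule above can be sketched as a simple branch; all names and the notion of "fresh" queues are illustrative assumptions only:

```python
# If the current segment yields enough sampled frames, start the feature /
# state / action queues from arbitrary values; otherwise reuse the values left
# over from training on the previous segment.
def init_queues(num_sampled_frames: int, frame_threshold: int,
                previous_queues: dict, fresh_queues: dict) -> dict:
    if num_sampled_frames >= frame_threshold:
        return fresh_queues       # previous segment's influence can be ignored
    return previous_queues        # carry over feature / state / action queue values
```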
Fig. 7 is a flowchart illustrating a video processing method according to an exemplary embodiment of the present application. As shown in fig. 7, the method may include the steps of:
701. Receiving a current video frame, and identifying the instantaneous state and the long-term action of the current video frame based on a causal convolutional neural network to obtain a state label and an action label in the current video frame.
702. According to the state labels and the action labels in the current video frame, combining the state labels and the action labels in a plurality of historical video frames to carry out event identification so as to obtain an event identification result.
Wherein, the event identification result comprises whether the specified event occurs.
Further optionally, the identifying the instantaneous state and the long-term action of the current video frame based on the causal convolutional neural network to obtain the state tag and the action tag in the current video frame includes:
inputting a current video frame into a state-action recognition model, and performing convolution processing on the current video frame for K times to obtain first characteristic information output by convolution processing for the Nth time and second characteristic information output by convolution processing for the Kth time;
identifying a state label in the current video frame according to the first characteristic information, and identifying an action label in the current video frame according to the second characteristic information;
wherein K, N is a positive integer, N is more than or equal to 1 and less than K, K is more than or equal to 2, and at least one causal convolution processing exists after the Nth convolution processing.
Further optionally, performing convolution processing on the current video frame K times to obtain first feature information output by convolution processing N times and second feature information output by convolution processing K times, including:
carrying out N1 times of spatial convolution processing and N2 times of causal convolution processing on a current video frame to obtain first characteristic information; n1 and N2 are natural numbers not less than 0, and N1+ N2= N;
performing L1 spatial convolution processing and L2 causal convolution processing on the first characteristic information to obtain second characteristic information; l1 is a natural number of 0 or more, L2 is a positive integer, and L1+ L2+ N = K.
Further optionally, if N1= N2= N/2, then performing N1 spatial convolution processes and N2 causal convolution processes on the current video frame to obtain the first feature information includes:
and alternately carrying out N/2 times of spatial convolution processing and causal convolution processing on the current video frame to obtain first characteristic information.
Further optionally, if L1= L2= (K-N)/2, then performing L1 spatial convolution processes and L2 causal convolution processes on the first feature information to obtain the second feature information includes:
and (K-N)/2 times of spatial convolution processing and causal convolution processing are alternately carried out on the first characteristic information to obtain second characteristic information.
Further optionally, the method further includes: for each causal convolution processing, acquiring feature information obtained by a plurality of historical video frames in the previous convolution processing from a feature cache queue corresponding to the causal convolution processing as a plurality of historical intermediate feature information; and taking the feature information obtained in the previous convolution processing of the current video frame as the current intermediate feature information, and performing causal convolution processing on the current intermediate feature information and the plurality of historical intermediate feature information to obtain the feature information output by the causal convolution processing.
Further optionally, when the number of causal convolution processing is greater than or equal to 2, as the number of causal convolution processing increases, the size of a convolution kernel used in the causal convolution processing in the time dimension gradually increases, and the length of a corresponding feature buffer queue gradually increases.
Further optionally, the convolution layer corresponding to the nth convolution processing is connected with a state identification network, and the convolution layer corresponding to the kth convolution processing is connected with an action identification network; then, identifying a status tag in the current video frame according to the first feature information includes: and inputting the first characteristic information into a state identification network, performing pooling processing on the first characteristic information to obtain third characteristic information, and classifying the third characteristic information by using a multilayer perceptron to obtain a state label in the current video frame. Correspondingly, the action tag in the current video frame is identified according to the second characteristic information, and the method comprises the following steps: and inputting the second characteristic information into the action recognition network, performing pooling processing on the second characteristic information to obtain fourth characteristic information, and classifying the fourth characteristic information by using a multilayer perceptron to obtain an action label in the current video frame.
Further optionally, according to the state tags and the action tags in the current video frame, performing event recognition by combining the state tags and the action tags in a plurality of historical video frames to obtain an event recognition result, including: combining the state tags and the action tags in the current video frame with the state tags and the action tags in a plurality of historical video frames respectively to obtain a state sequence and an action sequence; inputting the state sequence and the action sequence into an event decision model, analyzing whether the specified event corresponding to the state sequence and the action sequence exists or not by adopting a decision algorithm based on the corresponding relation between the specified event and the state and the action which are learned in advance, and determining the event type of the specified event under the existing condition.
Further optionally, the method further includes: acquiring a sample video of the marked action label, the state label and the specified event; segmenting a sample video to obtain a plurality of sample segments, and performing two-stage model training by using the plurality of sample segments to obtain a state-action recognition model and an event decision model; when the model training is carried out on the current sample segment, values in a feature buffer queue, a state queue and an action queue, which are generated when the model training is carried out on the previous sample segment, are respectively used as initial values of corresponding queues in the current training process.
Further optionally, when model training is performed on the current sample segment, sampling is performed on the current sample segment to obtain a plurality of sample video frames; if the total frame number of the sample video frame is smaller than the set frame number threshold, values in a feature buffer queue, a state queue and an action queue generated when model training is carried out on the previous sample segment are respectively used as initial values of corresponding queues in the current training process.
The detailed implementation of the video processing method has been described in detail in the embodiment of the digital production management system, and will not be elaborated herein.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps 701 to 702 may be device a; for another example, the execution subject of step 701 may be device a, and the execution subject of step 702 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 701, 702, etc., are merely used for distinguishing different operations, and the sequence numbers themselves do not represent any execution order. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
Fig. 8 is a schematic structural diagram of a video processing apparatus according to an exemplary embodiment of the present application. As shown in fig. 8, the apparatus may include:
a receiving module 81, configured to receive a current video frame; the processing module 82 is used for identifying the instantaneous state and the long-term action of the current video frame based on the causal convolutional neural network to obtain a state label and an action label in the current video frame; and according to the state labels and the action labels in the current video frame, combining the state labels and the action labels in a plurality of historical video frames to carry out event identification so as to obtain an event identification result. Wherein, the event identification result comprises whether the specified event occurs.
Further optionally, when the processing module 82 identifies the instantaneous state and the long-term action of the current video frame, the processing module is specifically configured to: inputting a current video frame into a state-action recognition model, and performing convolution processing on the current video frame for K times to obtain first characteristic information output by convolution processing for the Nth time and second characteristic information output by convolution processing for the Kth time; identifying a state label in the current video frame according to the first characteristic information, and identifying an action label in the current video frame according to the second characteristic information; wherein K, N is a positive integer, N is more than or equal to 1 and less than K, K is more than or equal to 2, and at least one causal convolution processing exists after the Nth convolution processing.
Further optionally, when the processing module 82 obtains the second feature information, the processing module is specifically configured to: carrying out N1 times of spatial convolution processing and N2 times of causal convolution processing on a current video frame to obtain first characteristic information; n1 and N2 are natural numbers not less than 0, and N1+ N2= N; performing L1 spatial convolution processing and L2 causal convolution processing on the first characteristic information to obtain second characteristic information; l1 is a natural number of 0 or more, L2 is a positive integer, and L1+ L2+ N = K.
Further optionally, N1= N2= N/2, when the processing module obtains the first feature information, the processing module is specifically configured to: and alternately carrying out N/2 times of spatial convolution processing and causal convolution processing on the current video frame to obtain first characteristic information.
Further optionally, L1= L2= (K-N)/2, when the processing module obtains the second feature information, the processing module is specifically configured to: and (K-N)/2 times of spatial convolution processing and causal convolution processing are alternately carried out on the first characteristic information to obtain second characteristic information.
Further optionally, the processing module 82 is further configured to: for each causal convolution processing, acquiring feature information obtained by a plurality of historical video frames in the previous convolution processing from a feature cache queue corresponding to the causal convolution processing as a plurality of historical intermediate feature information; and taking the feature information obtained in the previous convolution processing of the current video frame as the current intermediate feature information, and performing causal convolution processing on the current intermediate feature information and the plurality of historical intermediate feature information to obtain the feature information output by the causal convolution processing.
Further optionally, when the number of causal convolution processing is greater than or equal to 2, as the number of causal convolution processing increases, the size of a convolution kernel used in the causal convolution processing in the time dimension gradually increases, and the length of a corresponding feature buffer queue gradually increases.
Further optionally, the convolution layer corresponding to the nth convolution processing is connected with a state identification network, and the convolution layer corresponding to the kth convolution processing is connected with an action identification network; then, when the processing module 82 identifies the status tag in the current video frame, it is specifically configured to: inputting the first characteristic information into a state identification network, performing pooling processing on the first characteristic information to obtain third characteristic information, and classifying the third characteristic information by using a multilayer perceptron to obtain a state label in the current video frame;
accordingly, when the processing module 82 identifies the action tag in the current video frame, it is specifically configured to: and inputting the second characteristic information into the action recognition network, performing pooling processing on the second characteristic information to obtain fourth characteristic information, and classifying the fourth characteristic information by using a multilayer perceptron to obtain an action label in the current video frame.
Further optionally, when the processing module 82 performs event recognition, it is specifically configured to: combining the state tags and the action tags in the current video frame with the state tags and the action tags in a plurality of historical video frames respectively to obtain a state sequence and an action sequence;
inputting the state sequence and the action sequence into an event decision model, analyzing whether the specified event corresponding to the state sequence and the action sequence exists or not by adopting a decision algorithm based on the corresponding relation between the specified event and the state and the action which are learned in advance, and determining the event type of the specified event under the existing condition.
Further optionally, the processing module 82 is further configured to: acquiring a sample video of the marked action label, the state label and the specified event; segmenting a sample video to obtain a plurality of sample segments, and performing two-stage model training by using the plurality of sample segments to obtain a state-action recognition model and an event decision model; when the model training is carried out on the current sample segment, values in a feature buffer queue, a state queue and an action queue, which are generated when the model training is carried out on the previous sample segment, are respectively used as initial values of corresponding queues in the current training process.
Further optionally, when performing model training on the current sample segment, the processing module 82 is further configured to: sampling the current sample fragment to obtain a plurality of sample video frames; if the total frame number of the sample video frame is smaller than the set frame number threshold, values in a feature buffer queue, a state queue and an action queue generated when model training is carried out on the previous sample segment are respectively used as initial values of corresponding queues in the current training process.
The video processing apparatus in fig. 8 can execute the video processing method in the embodiment shown in fig. 7, and the implementation principle and the technical effect are not described again. The specific manner in which the video processing apparatus in the above embodiments performs operations by the respective modules and units has been described in detail in the embodiments related to the digital production management system, and will not be described in detail herein.
Fig. 9 is a schematic structural diagram of a video processing device according to an exemplary embodiment of the present application. As shown in fig. 9, the video processing apparatus includes: a memory 91 and a processor 92.
Memory 91 is used to store computer programs and may be configured to store other various data to support operations on the computing platform. Examples of such data include instructions for any application or method operating on the computing platform, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 91 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 92, coupled to the memory 91, for executing the computer program in the memory 91 for: receiving a current video frame; identifying instantaneous state and long-term action of a current video frame based on a causal convolutional neural network to obtain a state label and an action label in the current video frame; and according to the state labels and the action labels in the current video frame, combining the state labels and the action labels in a plurality of historical video frames to carry out event identification so as to obtain an event identification result. Wherein, the event identification result comprises whether the specified event occurs.
Further optionally, when the processor 92 identifies the instantaneous state and the long-term action of the current video frame, the method is specifically configured to: inputting a current video frame into a state-action recognition model, and performing convolution processing on the current video frame for K times to obtain first characteristic information output by convolution processing for the Nth time and second characteristic information output by convolution processing for the Kth time; identifying a state label in the current video frame according to the first characteristic information, and identifying an action label in the current video frame according to the second characteristic information; wherein K, N is a positive integer, N is more than or equal to 1 and less than K, K is more than or equal to 2, and at least one causal convolution processing exists after the Nth convolution processing.
Further optionally, when the processor 92 obtains the second feature information, the processor is specifically configured to: carrying out N1 times of spatial convolution processing and N2 times of causal convolution processing on a current video frame to obtain first characteristic information; n1 and N2 are natural numbers not less than 0, and N1+ N2= N; performing L1 spatial convolution processing and L2 causal convolution processing on the first characteristic information to obtain second characteristic information; l1 is a natural number of 0 or more, L2 is a positive integer, and L1+ L2+ N = K.
Further optionally, N1= N2= N/2, when the processor 92 obtains the first feature information, the first feature information is specifically used for: and alternately carrying out N/2 times of spatial convolution processing and causal convolution processing on the current video frame to obtain first characteristic information.
Further optionally, L1= L2= (K-N)/2, when the processor 92 obtains the second feature information, the processor is specifically configured to: and (K-N)/2 times of spatial convolution processing and causal convolution processing are alternately carried out on the first characteristic information to obtain second characteristic information.
Further optionally, the processor 92 is further configured to: for each causal convolution processing, acquiring feature information obtained by a plurality of historical video frames in the previous convolution processing from a feature cache queue corresponding to the causal convolution processing as a plurality of historical intermediate feature information; and taking the feature information obtained in the previous convolution processing of the current video frame as the current intermediate feature information, and performing causal convolution processing on the current intermediate feature information and the plurality of historical intermediate feature information to obtain the feature information output by the causal convolution processing.
Further optionally, when the number of causal convolution processing is greater than or equal to 2, as the number of causal convolution processing increases, the size of a convolution kernel used in the causal convolution processing in the time dimension gradually increases, and the length of a corresponding feature buffer queue gradually increases.
Further optionally, the convolution layer corresponding to the nth convolution processing is connected with a state identification network, and the convolution layer corresponding to the kth convolution processing is connected with an action identification network; then, when the processor 92 identifies the status tag in the current video frame, it is specifically configured to: and inputting the first characteristic information into a state identification network, performing pooling processing on the first characteristic information to obtain third characteristic information, and classifying the third characteristic information by using a multilayer perceptron to obtain a state label in the current video frame. Accordingly, when the processor 92 identifies the action tag in the current video frame, it is specifically configured to: and inputting the second characteristic information into the action recognition network, performing pooling processing on the second characteristic information to obtain fourth characteristic information, and classifying the fourth characteristic information by using a multilayer perceptron to obtain an action label in the current video frame.
Further optionally, when the processor 92 performs event recognition, it is specifically configured to: combining the state tags and the action tags in the current video frame with the state tags and the action tags in a plurality of historical video frames respectively to obtain a state sequence and an action sequence; inputting the state sequence and the action sequence into an event decision model, analyzing whether the specified event corresponding to the state sequence and the action sequence exists or not by adopting a decision algorithm based on the corresponding relation between the specified event and the state and the action which are learned in advance, and determining the event type of the specified event under the existing condition.
Further optionally, the processor 92 is further configured to: acquiring a sample video of the marked action label, the state label and the specified event; segmenting a sample video to obtain a plurality of sample segments, and performing two-stage model training by using the plurality of sample segments to obtain a state-action recognition model and an event decision model; when the model training is carried out on the current sample segment, values in a feature buffer queue, a state queue and an action queue, which are generated when the model training is carried out on the previous sample segment, are respectively used as initial values of corresponding queues in the current training process.
Further optionally, when performing model training for the current sample segment, the processor 92 is further configured to: sampling the current sample fragment to obtain a plurality of sample video frames; if the total frame number of the sample video frame is smaller than the set frame number threshold, values in a feature buffer queue, a state queue and an action queue generated when model training is carried out on the previous sample segment are respectively used as initial values of corresponding queues in the current training process.
Further, as shown in fig. 9, the video processing apparatus further includes: communication components 93, display 94, power components 95, audio components 96, and the like. Only some of the components are schematically shown in fig. 9, and it is not meant that the video processing apparatus includes only the components shown in fig. 9. In addition, the components within the dashed box in fig. 9 are optional components, not necessary components, and may depend on the product form of the video processing apparatus. The video processing device of this embodiment may be implemented as a terminal device such as a desktop computer, a notebook computer, a smart phone, or an IOT device, or may be a server device such as a conventional server, a cloud server, or a server array. If the video processing device of this embodiment is implemented as a terminal device such as a desktop computer, a notebook computer, a smart phone, etc., the video processing device may include components within a dashed line frame in fig. 9; if the video processing device of this embodiment is implemented as a server device such as a conventional server, a cloud server, or a server array, components within a dashed box in fig. 9 may not be included.
Accordingly, the present application further provides a computer readable storage medium storing a computer program, which when executed by a processor, causes the processor to implement the steps in the above method embodiments.
Accordingly, the present application also provides a computer program product, which includes a computer program/instruction, when the computer program/instruction is executed by a processor, the processor is enabled to implement the steps in the above method embodiments.
The communication component of fig. 9 described above is configured to facilitate communication between the device in which the communication component is located and other devices in a wired or wireless manner. The device where the communication component is located can access a wireless network based on a communication standard, such as WiFi, a mobile communication network such as 2G, 3G, 4G/LTE, 5G, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The display in fig. 9 described above includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The power supply assembly of fig. 9 described above provides power to the various components of the device in which the power supply assembly is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
The audio component of fig. 9 described above may be configured to output and/or input an audio signal. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (14)

1. A digital production management system, comprising: a central control node, an edge gateway node, image acquisition equipment deployed in a production environment, and production equipment on each production line;
the image acquisition equipment is used for acquiring a video stream containing production behaviors occurring in the production environment and reporting the video stream to the central control node through the edge gateway node, wherein the video stream comprises consecutive video frames;
the central control node is used for identifying, for each received current video frame, the instantaneous state and the long-term action of the current video frame based on a causal convolutional neural network to obtain a state label and an action label in the current video frame; performing event recognition according to the state label and the action label in the current video frame, in combination with the state labels and the action labels in a plurality of historical video frames, to obtain an event recognition result; and sending the event recognition result to the corresponding production equipment through the edge gateway node; the causal convolutional neural network comprises K convolutional layers composed of spatial convolutional layers and causal convolutional layers, wherein the first N convolutional layers are used for identifying the instantaneous state, and the last K-N convolutional layers, which comprise at least one causal convolutional layer, are used for identifying the long-term action; wherein K and N are positive integers, 1 ≤ N < K, and K ≥ 2;
the production equipment is used for receiving the event recognition result and outputting the event recognition result; the event recognition result indicates whether a specified event occurs in the production process.
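A minimal Python sketch of the per-frame flow recited in claim 1: the central control node receives a frame forwarded by the edge gateway node, recognizes a state label and an action label, combines them with the labels of recent historical frames for event recognition, and dispatches the result toward the production equipment. The helper names (recognize_state_action, decide_event, edge_gateway.send) and the history length are assumptions introduced for illustration only, not features of the claimed system.

from collections import deque

HISTORY = 16                        # number of historical frames kept (assumed value)
state_hist = deque(maxlen=HISTORY)  # state labels of recent frames
action_hist = deque(maxlen=HISTORY) # action labels of recent frames

def on_frame(frame, model, event_model, edge_gateway, device_id):
    # 1) per-frame recognition: instantaneous state + long-term action
    state_label, action_label = model.recognize_state_action(frame)
    state_hist.append(state_label)
    action_hist.append(action_label)
    # 2) event recognition over the current and historical labels
    event = event_model.decide_event(list(state_hist), list(action_hist))
    # 3) dispatch the event recognition result to the production equipment
    if event is not None:
        edge_gateway.send(device_id, {"event": event})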
2. The system according to claim 1, wherein the central control node, when obtaining the state label and the action label in the current video frame, is specifically configured to:
perform K times of convolution processing on the current video frame to obtain first feature information output by the Nth convolution processing and second feature information output by the Kth convolution processing; and
identify a state label in the current video frame according to the first feature information, and identify an action label in the current video frame according to the second feature information;
wherein there is at least one causal convolution after the Nth convolution.
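A sketch, assuming a PyTorch implementation, of the two-output backbone described in claim 2: the output of the Nth convolution serves as the first feature information (state branch) and the output of the Kth convolution serves as the second feature information (action branch). The layer counts and channel width are illustrative; the causal convolution required in the later layers is only indicated here and is sketched separately after claim 8.

import torch.nn as nn

class StateActionBackbone(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        # first N (= 3 here) layers: spatial convolutions over the current frame
        self.state_trunk = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )
        # last K-N (= 2 here) layers: in the claimed network at least one of
        # these is a causal convolution; ordinary convolutions stand in here
        self.action_trunk = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, frame):                       # frame: (B, 3, H, W)
        first_feat = self.state_trunk(frame)        # Nth convolution output
        second_feat = self.action_trunk(first_feat) # Kth convolution output
        return first_feat, second_feat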
3. The system of claim 1, further comprising: a scheduling system for scheduling orders to be produced;
wherein the central control node is further configured to: acquire the specified events occurring in the production process from the event recognition result; analyze production state data in the equipment dimension, production line dimension, personnel dimension and/or material dimension according to the specified events occurring in the production process; and generate scheduling guidance information according to the production state data in the equipment dimension, production line dimension, personnel dimension and/or material dimension, and send the scheduling guidance information to the scheduling system so as to guide the scheduling system in scheduling the orders to be produced.
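An illustrative Python aggregation of recognized events into the dimensions named in claim 3, with a trivial scheduling hint derived from them. The event fields, event types and thresholds are invented for this sketch and are not taken from the claims.

from collections import Counter

def build_scheduling_guidance(events):
    # production state data in the production line and equipment dimensions
    stops_by_line = Counter(e["line_id"] for e in events if e["type"] == "line_stopped")
    faults_by_device = Counter(e["device_id"] for e in events if e["type"] == "device_fault")
    guidance = []
    for line_id, stops in stops_by_line.items():
        if stops >= 3:                    # assumed threshold
            guidance.append({"line": line_id, "advice": "deprioritize new orders"})
    for device_id, faults in faults_by_device.items():
        if faults >= 1:
            guidance.append({"device": device_id, "advice": "schedule maintenance"})
    return guidance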
4. A video processing method, comprising:
receiving a current video frame, and identifying the instantaneous state and the long-term action of the current video frame based on a causal convolutional neural network to obtain a state label and an action label in the current video frame; wherein the causal convolutional neural network comprises K convolutional layers composed of spatial convolutional layers and causal convolutional layers, the first N convolutional layers are used for identifying the instantaneous state, and the last K-N convolutional layers, which comprise at least one causal convolutional layer, are used for identifying the long-term action; K and N are positive integers, 1 ≤ N < K, and K ≥ 2; and performing event recognition according to the state label and the action label in the current video frame, in combination with the state labels and the action labels in a plurality of historical video frames, to obtain an event recognition result;
wherein the event recognition result indicates whether a specified event occurs.
5. The method of claim 4, wherein identifying the instantaneous state and the long-term action of the current video frame based on the causal convolutional neural network to obtain the state label and the action label in the current video frame comprises:
inputting the current video frame into a state-action recognition model, and performing K times of convolution processing on the current video frame to obtain first feature information output by the Nth convolution processing and second feature information output by the Kth convolution processing; and
identifying a state label in the current video frame according to the first feature information, and identifying an action label in the current video frame according to the second feature information;
wherein there is at least one causal convolution after the Nth convolution.
6. The method of claim 5, wherein performing K times of convolution processing on the current video frame to obtain the first feature information output by the Nth convolution processing and the second feature information output by the Kth convolution processing comprises:
performing N1 times of spatial convolution processing and N2 times of causal convolution processing on the current video frame to obtain the first feature information, wherein N1 and N2 are integers not less than 0, and N1 + N2 = N; and
performing L1 times of spatial convolution processing and L2 times of causal convolution processing on the first feature information to obtain the second feature information, wherein L1 is an integer not less than 0, L2 is a positive integer, and L1 + L2 + N = K.
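A small worked check of the layer-count constraints in claim 6, with the example numbers (K = 6, N = 3, N1 = 2, N2 = 1, L1 = 1, L2 = 2) chosen here purely for illustration.

def check_layer_plan(n1, n2, l1, l2, n, k):
    assert n1 >= 0 and n2 >= 0 and n1 + n2 == n      # first N layers yield the first feature information
    assert l1 >= 0 and l2 >= 1 and l1 + l2 + n == k  # at least one causal convolution after the Nth
    return True

check_layer_plan(n1=2, n2=1, l1=1, l2=2, n=3, k=6)   # K = 6 convolutions in total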
7. The method of claim 5 or 6, further comprising:
for each causal convolution processing, acquiring, from a feature cache queue corresponding to the causal convolution processing, the feature information obtained for a plurality of historical video frames in the preceding convolution processing, as a plurality of pieces of historical intermediate feature information; and
taking the feature information obtained for the current video frame in the preceding convolution processing as current intermediate feature information, and performing causal convolution processing on the current intermediate feature information and the plurality of pieces of historical intermediate feature information to obtain the feature information output by the causal convolution processing.
8. The method of claim 7, wherein, when the number of causal convolution processings is greater than or equal to 2, as the depth of the causal convolution processing increases, the size of the convolution kernel used in the causal convolution processing in the time dimension increases gradually, and the length of the corresponding feature cache queue increases gradually.
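A PyTorch-style sketch of the cached causal convolution described in claims 7 and 8: each causal layer keeps a FIFO feature cache queue of intermediate features from historical frames and convolves them, together with the current frame's intermediate features, along the time dimension; deeper causal layers are given larger temporal kernels and correspondingly longer queues. The channel width, kernel sizes and zero-padding at stream start are assumptions of this sketch.

import torch
import torch.nn as nn
from collections import deque

class CachedCausalConv(nn.Module):
    def __init__(self, ch, t_kernel):
        super().__init__()
        # temporal kernel covers (t_kernel - 1) historical frames plus the current one
        self.conv = nn.Conv3d(ch, ch, kernel_size=(t_kernel, 3, 3), padding=(0, 1, 1))
        self.cache = deque(maxlen=t_kernel - 1)   # feature cache queue of historical features

    def forward(self, feat):                      # feat: (B, C, H, W) for the current frame
        while len(self.cache) < self.cache.maxlen:
            self.cache.append(torch.zeros_like(feat))   # pad missing history at stream start
        frames = list(self.cache) + [feat]        # oldest ... current
        x = torch.stack(frames, dim=2)            # (B, C, T, H, W)
        out = self.conv(x).squeeze(2)             # back to (B, C, H, W)
        self.cache.append(feat.detach())          # store current features for later frames
        return out

# per claim 8: the later causal layer uses a larger temporal kernel and a longer queue
causal_layers = nn.ModuleList([CachedCausalConv(32, t_kernel=2),
                               CachedCausalConv(32, t_kernel=4)])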
9. The method according to claim 5 or 6, wherein the convolutional layer corresponding to the Nth convolution processing is connected to a state recognition network, and the convolutional layer corresponding to the Kth convolution processing is connected to an action recognition network;
identifying the state label in the current video frame according to the first feature information then comprises: inputting the first feature information into the state recognition network, performing pooling processing on the first feature information to obtain third feature information, and classifying the third feature information by using a multilayer perceptron to obtain the state label in the current video frame;
correspondingly, identifying the action label in the current video frame according to the second feature information comprises: inputting the second feature information into the action recognition network, performing pooling processing on the second feature information to obtain fourth feature information, and classifying the fourth feature information by using a multilayer perceptron to obtain the action label in the current video frame.
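A sketch of the recognition heads in claim 9, assuming a PyTorch implementation: global average pooling followed by a multilayer perceptron classifier, with one head consuming the first feature information (state) and another consuming the second feature information (action). The channel width, hidden size and class counts are assumptions.

import torch.nn as nn

class LabelHead(nn.Module):
    def __init__(self, ch, n_classes, hidden=128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # pooling -> third/fourth feature information
        self.mlp = nn.Sequential(nn.Flatten(),
                                 nn.Linear(ch, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, feat):                      # feat: (B, C, H, W)
        return self.mlp(self.pool(feat))          # class logits for the label

state_head = LabelHead(ch=32, n_classes=8)        # state labels (count assumed)
action_head = LabelHead(ch=32, n_classes=12)      # action labels (count assumed)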
10. The method according to claim 5 or 6, wherein performing event recognition according to the state label and the action label in the current video frame, in combination with the state labels and the action labels in a plurality of historical video frames, to obtain an event recognition result comprises:
combining the state label and the action label in the current video frame with the state labels and the action labels in the plurality of historical video frames, respectively, to obtain a state sequence and an action sequence; and
inputting the state sequence and the action sequence into an event decision model, analyzing, by using a decision algorithm and based on pre-learned correspondences between specified events and states and actions, whether a specified event corresponding to the state sequence and the action sequence exists, and, if it exists, determining the event type of the specified event.
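A much-simplified Python stand-in for the event decision model in claim 10: it matches the state sequence and the action sequence against pre-learned correspondences between specified events and (state, action) patterns and, if a match exists, returns the event type. The pattern table and the set-matching rule are invented for illustration; the claimed decision algorithm may be any learned model.

EVENT_PATTERNS = {                      # pre-learned event -> (states, actions) correspondence (illustrative)
    "manual_rework": {"states": {"station_occupied"}, "actions": {"pick_up", "repair"}},
    "line_idle":     {"states": {"station_empty"},    "actions": {"no_op"}},
}

def decide_event(state_seq, action_seq):
    states, actions = set(state_seq), set(action_seq)
    for event_type, pattern in EVENT_PATTERNS.items():
        if pattern["states"] <= states and pattern["actions"] <= actions:
            return event_type           # a specified event exists; return its type
    return None                         # no specified event in this window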
11. The method of claim 10, further comprising:
acquiring a sample video annotated with action labels, state labels and specified events;
segmenting the sample video to obtain a plurality of sample segments, and performing two-stage model training by using the plurality of sample segments to obtain a state-action recognition model and an event decision model;
wherein, when model training is performed on the current sample segment, the values in the feature cache queue, the state queue and the action queue generated when model training was performed on the previous sample segment are respectively used as initial values of the corresponding queues in the current training process.
12. The method of claim 11, wherein, when model training is performed on the current sample segment, the current sample segment is sampled to obtain a plurality of sample video frames; and if the total number of the sample video frames is smaller than a set frame-number threshold, the values in the feature cache queue, the state queue and the action queue generated when model training was performed on the previous sample segment are respectively used as initial values of the corresponding queues in the current training process.
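A Python skeleton of the segment-wise two-stage training described in claims 11 and 12: the feature cache, state and action queues produced on the previous sample segment initialize the corresponding queues for the current segment, and the same initialization applies when the current segment yields fewer sampled frames than the threshold. All helper names (split_into_segments, sample_frames, load_queues, dump_queues, fit) and the threshold value are placeholders introduced for this sketch.

FRAME_THRESHOLD = 32                         # assumed frame-number threshold

def train_on_video(sample_video, model, event_model):
    carried_queues = None                    # feature cache / state / action queues
    for segment in split_into_segments(sample_video):
        frames = sample_frames(segment)
        # claim 11: initialize queues from the previous segment's values;
        # claim 12: the same applies when len(frames) < FRAME_THRESHOLD
        if carried_queues is not None:
            model.load_queues(carried_queues)
        # stage 1: train the state-action recognition model on annotated frames
        model.fit(frames, segment.state_labels, segment.action_labels)
        # stage 2: train the event decision model on the resulting label sequences
        event_model.fit(model.state_sequence(), model.action_sequence(),
                        segment.event_labels)
        carried_queues = model.dump_queues() # carry the queues over to the next segment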
13. A video processing apparatus, comprising: a memory and a processor;
the memory for storing a computer program;
the processor is coupled to the memory and is configured to execute the computer program so as to perform the steps of the method according to any one of claims 4 to 12.
14. A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 4 to 12.
CN202111039654.1A 2021-09-06 2021-09-06 Digitalized production management system, video processing method, equipment and storage medium Active CN113487247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111039654.1A CN113487247B (en) 2021-09-06 2021-09-06 Digitalized production management system, video processing method, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113487247A CN113487247A (en) 2021-10-08
CN113487247B true CN113487247B (en) 2022-02-01

Family ID=77946552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111039654.1A Active CN113487247B (en) 2021-09-06 2021-09-06 Digitalized production management system, video processing method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113487247B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022827B (en) * 2022-01-05 2022-06-17 阿里巴巴(中国)有限公司 Production line operation management and video processing method, device, equipment and storage medium
CN117196449B (en) * 2023-11-08 2024-04-09 讯飞智元信息科技有限公司 Video identification method, system and related device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180082679A1 (en) * 2016-09-18 2018-03-22 Newvoicemedia, Ltd. Optimal human-machine conversations using emotion-enhanced natural speech using hierarchical neural networks and reinforcement learning
CN110175580B (en) * 2019-05-29 2020-10-30 复旦大学 Video behavior identification method based on time sequence causal convolutional network
CN112149866A (en) * 2020-08-17 2020-12-29 江苏大学 Intelligent manufacturing workshop anomaly prediction and control method based on edge cloud cooperation
CN112950400A (en) * 2021-03-30 2021-06-11 煤炭科学研究总院 Data processing platform

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033786A (en) * 2021-05-21 2021-06-25 北京航空航天大学 Fault diagnosis model construction method and device based on time convolution network
CN113326767A (en) * 2021-05-28 2021-08-31 北京百度网讯科技有限公司 Video recognition model training method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113487247A (en) 2021-10-08

Similar Documents

Publication Publication Date Title
US11694317B1 (en) Machine vision system and interactive graphical user interfaces related thereto
CN113487247B (en) Digitalized production management system, video processing method, equipment and storage medium
CN109086873B (en) Training method, recognition method and device of recurrent neural network and processing equipment
Huang et al. A proactive task dispatching method based on future bottleneck prediction for the smart factory
Ali et al. RES: Real-time video stream analytics using edge enhanced clouds
Tekin et al. Distributed online big data classification using context information
CN110781422B (en) Page configuration method, page configuration device, computer equipment and storage medium
CN109241030A (en) Robot manipulating task data analytics server and robot manipulating task data analysing method
EP4035070B1 (en) Method and server for facilitating improved training of a supervised machine learning process
US11514605B2 (en) Computer automated interactive activity recognition based on keypoint detection
CN108683877A (en) Distributed massive video resolution system based on Spark
US10162879B2 (en) Label filters for large scale multi-label classification
US11468296B2 (en) Relative position encoding based networks for action recognition
US20230060307A1 (en) Systems and methods for processing user concentration levels for workflow management
CN110929785A (en) Data classification method and device, terminal equipment and readable storage medium
US20230186634A1 (en) Vision-based monitoring of site safety compliance based on worker re-identification and personal protective equipment classification
CN108762936A (en) Distributed computing system based on artificial intelligence image recognition and method
CN113313098B (en) Video processing method, device, system and storage medium
Baradie et al. Managing the Fifth Generation (5G) Wireless Mobile Communication: A Machine Learning Approach for Network Traffic Prediction
CN112988337A (en) Task processing system, method, device, electronic equipment and storage medium
US10832393B2 (en) Automated trend detection by self-learning models through image generation and recognition
CN108073854A (en) A kind of detection method and device of scene inspection
US20210142104A1 (en) Visual artificial intelligence in scada systems
CN113895730B (en) Cigarette case transparent paper detection method, device and system and storage medium
Devi Priya et al. Automated cattle classification and counting using hybridized mask R-CNN and YOLOv3 algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant