CN114694254A - Method and device for detecting and early warning robbery of articles in vertical ladder and computer equipment

Info

Publication number
CN114694254A
CN114694254A (application CN202210345959.3A)
Authority
CN
China
Prior art keywords: robbery, frame, straight, elevator, video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210345959.3A
Other languages
Chinese (zh)
Inventor
马凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Xinchao Media Group Co Ltd
Original Assignee
Chengdu Baixin Zhilian Technology Co ltd
Chengdu Xinchao Media Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Baixin Zhilian Technology Co ltd, Chengdu Xinchao Media Group Co Ltd filed Critical Chengdu Baixin Zhilian Technology Co ltd
Priority to CN202210345959.3A
Publication of CN114694254A
Legal status: Pending

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B 13/00 Burglar, theft or intruder alarms
    • G08B 13/18 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength
    • G08B 13/189 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems
    • G08B 13/194 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems
    • G08B 13/196 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems using television cameras
    • G08B 13/19602 Image analysis to detect motion of the intruder, e.g. by frame subtraction
    • G08B 13/19613 Recognition of a predetermined image pattern or behaviour pattern indicating theft or intrusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The invention relates to the technical field of elevator monitoring, and discloses a method, a device and computer equipment for detecting and giving early warning of the robbery of articles in a straight elevator. The invention provides a scheme for determining an article robbery event in the straight elevator on the basis of audio and video data: after the video data collected by the camera in the straight elevator and the audio data collected by the sound pickup in the straight elevator during a target monitoring period are obtained, mechanisms such as keyword recognition of robbery utterances and comprehensive judgment of the sound source position (to prevent interference from in-elevator advertising audio) are added on top of behavior recognition, so that the accuracy with which an article robbery event is judged can be greatly improved. In addition, the collected video frames contain the complete human body, consecutive actions remain consecutive from the camera's viewpoint, the processing speed is increased, the manufacturing cost of the system and the difficulty of deploying it on site are reduced, and the scheme is convenient for practical application and popularization.

Description

Method and device for detecting and early warning robbery of articles in vertical ladder and computer equipment
Technical Field
The invention belongs to the technical field of elevator monitoring, and particularly relates to a method, a device and computer equipment for detecting and giving early warning of the robbery of articles in a straight elevator. It is suitable for judging, on the basis of visual detection and speech recognition technology, whether an abnormal event in which one passenger robs another passenger's articles has occurred, under the condition that only a constant light source exists in the straight elevator car and only two persons are present in the car.
Background
With the development of China's economy, the straight elevator (i.e. an elevator that travels vertically) has gradually become part of everyday life, and its safety has attracted more and more attention. Because the elevator car is closed while travelling, violent behavior inside the car is usually difficult for outsiders to discover and stop in time. Elevator monitoring systems currently in use usually rely on manual review to distinguish behaviors, and therefore often cannot identify abnormal behavior in the elevator accurately and in time, let alone carry out a rescue promptly.
To address the above problems, the conventional elevator monitoring technology provides an abnormal behavior detection system for elevators (CN215402428U), which includes a power module, a data acquisition unit, a communication control unit, an emergency positioning module, a remote server and a communication terminal. The data acquisition unit includes a pressure acquisition module and an image acquisition module. The pressure acquisition module collects pressure information, including a pressure value and a pressure position, on the floor of the elevator car; when the pressure value collected by the pressure acquisition module is non-zero, the angle at which the image acquisition module captures video inside the car is adjusted according to the pressure position, the image acquisition module then captures video information inside the car, and the collected pressure information and video information are transmitted to the remote server for analysis to obtain the abnormal behavior in the car. The abnormal behavior in the car is transmitted to the communication terminal and the communication control unit, the communication terminal sends out an alarm signal, and at the same time the communication control unit starts the emergency positioning module to send alarm position information. In this way possible abnormal behavior can be identified quickly and rescue information issued promptly, so that endangered persons can be rescued in time.
However, the above system for detecting abnormal behavior in elevator has the following disadvantages:
(1) from the viewpoint of how the completeness of the collected video affects the accuracy of motion recognition: because the pressure information keeps changing, the angle from which the image acquisition module captures video keeps changing as well, and in some article robbery events with large movements the pressure position varies widely, so the video frames may not contain the whole human body but only part of it; moreover, the constant change of the camera angle can make motions that are in fact continuous look discontinuous from the viewpoint of the video frames. Both factors, video frames that contain only part of the body and motions captured as discontinuous, reduce the accuracy of motion recognition;
(2) from the viewpoint of the accuracy of judging an article robbery event: the judgment is based mainly on motion recognition, with no other condition used to corroborate the event, which increases the probability of misjudging it;
(3) from the viewpoint of the data processing rate of the whole system: since the pressure information and the video information are transmitted to the remote server, the remote server transmits abnormal behavior in the car to the communication control unit, and the like, a certain time is required, so that the data processing rate of the whole system is reduced;
(4) from a system cost perspective: the hardware of the system is various, and the number of each type of hardware is more than one, so that the cost and the deployment complexity of the system are increased.
Disclosure of Invention
The invention aims to solve the problems of low identification accuracy, low processing speed and high cost of required hardware of an existing elevator internal abnormal behavior detection system for an article robbery event, and provides a method, a device, computer equipment and a computer readable storage medium for detecting and early warning the article robbery in a straight elevator.
In a first aspect, the invention provides a method for detecting and warning robbery of articles in a vertical ladder, which comprises the following steps:
acquiring video data collected by a camera in the straight elevator and audio data collected by a sound pickup in the straight elevator during a target monitoring period, wherein the target monitoring period is the period from time t1-τ to time t1+τ, t1 represents the acquisition time corresponding to a target video frame, and τ represents a preset specified duration; the target video frame is a video frame, collected by the camera in the straight elevator, in which image recognition processing finds a robbable article and two human bodies in the elevator and finds that the elevator door is open; the camera in the straight elevator is fixedly mounted inside the elevator car and faces the elevator door, so that the lens field of view fixedly covers the car interior and the elevator door area, and the sound pickup in the straight elevator is fixedly mounted inside the elevator car;
for each video frame in the video data, extracting human body joint point information from the corresponding frame image to obtain a human body skeleton labeled in the corresponding frame image, wherein the human body skeleton comprises the human body nodes corresponding to the left and right hand heads, left and right elbows, left and right shoulders, left and right waists, left and right knees and left and right foot heads;
for each video frame, if the distance from at least one hand head node in the corresponding frame image to the central point of an article detection frame is judged to be not greater than a preset first distance threshold, determining that the corresponding video frame meets a first preset condition, wherein the article detection frame is a detection frame of a robbable article identified in the corresponding frame image;
according to the human body skeleton of each video frame, if the action presenting postures of at least one group of robbery action presenting nodes are judged to belong to pre-labeled robbery postures, determining that a second preset condition is met, wherein the robbery action presenting nodes comprise human body nodes corresponding to left and right hand heads, left and right elbows, left and right shoulders, left and right waists, left and right knees and left and right feet heads;
performing robbery utterance keyword recognition processing by using a trained keyword retrieval system based on an end-to-end speech recognition technology according to the audio data, and determining that a third preset condition is met if at least one robbery utterance keyword is recognized, wherein the confidence coefficient of the robbery utterance keyword is not less than a preset confidence coefficient threshold;
aiming at each robbery speech keyword in the at least one robbery speech keyword, if the corresponding pronunciation sound source is judged to come from a human body in the vertical ladder, determining that the corresponding keyword meets a fourth preset condition;
if the video frame number meeting the first preset condition in the video data is not less than a preset frame number threshold on the premise of meeting the second preset condition and/or at least one robbery utterance keyword meets the fourth preset condition on the premise of meeting the third preset condition, determining that an article robbery event in the straight elevator occurs, and sending an abnormal behavior reminding signal to an elevator monitoring background.
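The decision rule of this first aspect can be condensed into a few lines. The sketch below is only an illustration of that rule; the data classes, flags and threshold names are hypothetical and stand in for the image-processing and speech-processing results defined above.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class FrameResult:
        meets_first_condition: bool   # a hand-head node is close enough to a robbable-article box

    @dataclass
    class KeywordResult:
        meets_fourth_condition: bool  # sound source localized to a human body inside the car

    def robbery_event_detected(frames: List[FrameResult],
                               keywords: List[KeywordResult],
                               second_condition_met: bool,   # a labeled robbery posture was recognized
                               third_condition_met: bool,    # at least one robbery-utterance keyword found
                               frame_count_threshold: int) -> bool:
        """Decision rule of the first aspect: video branch and/or audio branch."""
        video_branch = (second_condition_met and
                        sum(f.meets_first_condition for f in frames) >= frame_count_threshold)
        audio_branch = (third_condition_met and
                        any(k.meets_fourth_condition for k in keywords))
        return video_branch or audio_branch

    if __name__ == "__main__":
        frames = [FrameResult(True)] * 12 + [FrameResult(False)] * 3
        keywords = [KeywordResult(False)]
        if robbery_event_detected(frames, keywords,
                                  second_condition_met=True,
                                  third_condition_met=True,
                                  frame_count_threshold=10):
            print("send abnormal-behavior reminding signal to the elevator monitoring background")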
Based on the invention, a scheme for determining an article robbery event in the straight elevator on the basis of audio and video data is provided. After the video data collected by the camera in the straight elevator and the audio data collected by the sound pickup in the straight elevator during the target monitoring period are obtained, whether the determination conditions for an article robbery event are met is judged both from the image processing result of the video data and from the robbery utterance keyword recognition result and sound source localization result of the audio data; when either branch of conditions is met, it is determined that an article robbery event in the straight elevator has occurred and an abnormal behavior reminding signal is sent to the elevator monitoring background. On the basis of behavior recognition, mechanisms such as keyword recognition of robbery utterances and comprehensive judgment of the sound source position (to prevent interference from in-elevator advertising audio) are thus added, so that the accuracy with which an article robbery event is judged can be greatly improved. In addition, because the camera is installed at the rear side of the elevator car at a fixed angle, the collected video frames contain the complete human body and consecutive actions remain consecutive from the camera's viewpoint; the detection and early-warning processing is carried out directly after the audio and video data are obtained, without transmitting information back and forth to background equipment, so the processing speed can be increased. Since the algorithm required for detecting the article robbery event is loaded in the computer equipment on the straight elevator side, the hardware can be reduced from multiple hardware units to one camera (with a sound capturing function) and the equipment to which the camera is externally connected, which lowers the manufacturing cost of the system and the difficulty of deploying it on site and facilitates practical application and popularization.
In one possible design, for each robbing speech keyword in the at least one robbing speech keyword, if it is determined that the corresponding pronunciation sound source is from a human body in the vertical ladder, it is determined that the corresponding keyword satisfies a fourth preset condition, including:
aiming at a certain robbing speech keyword in the at least one robbing speech keyword, performing corresponding sound source position estimation processing by using a trained sound source position estimation model according to audio data in corresponding start-stop time to obtain a direction angle and an elevation angle of a corresponding sound source relative to the sound pickup in the straight ladder;
determining a first polar angle coordinate of a sound source corresponding to the certain robbing utterance keyword in a frame image of a synchronous video frame and taking an image center as a pole according to the direction angle, the elevation angle and a known position relation between the straight ladder internal camera and the straight ladder internal sound pickup, wherein the synchronous video frame is a video frame collected by the straight ladder internal camera within a starting and ending time corresponding to the certain robbing utterance keyword;
for each of the contemporaneous video frames, determining at least one human head position in the corresponding frame image;
for each of the contemporaneous video frames, if it is judged that among the corresponding at least one human head position there is a human head position whose second polar angle coordinate in the corresponding frame image, taking the image center as the pole, differs from the first polar angle coordinate by an absolute value not greater than a preset angle threshold, determining that the corresponding video frame meets a fifth preset condition;
and if the ratio of the video frame number meeting the fifth preset condition to the total video frame number is judged to be not less than a preset first proportional threshold, determining that a sound source corresponding to the certain robbing words keyword comes from a human body in the straight ladder, and determining that the certain robbing words keyword meets a fourth preset condition, wherein the total video frame number refers to the total number of video frames collected by a camera in the straight ladder within the starting and ending time corresponding to the certain robbing words keyword.
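The fourth-condition check in this design reduces to comparing, frame by frame, the polar angle of the estimated sound source with the polar angles of the detected head positions. The sketch below is a simplified illustration of that comparison, under the assumption that the azimuth/elevation output of the sound source position estimation model has already been converted into an in-image polar angle using the known camera/pickup geometry; all function and parameter names are hypothetical.

    import math
    from typing import List, Tuple

    Point = Tuple[float, float]

    def polar_angle(point: Point, image_center: Point) -> float:
        """Polar angle (degrees) of an image point, taking the image center as the pole."""
        dx, dy = point[0] - image_center[0], point[1] - image_center[1]
        return math.degrees(math.atan2(dy, dx))

    def angle_diff(a: float, b: float) -> float:
        """Absolute angular difference in degrees, wrapped to [0, 180]."""
        return abs((a - b + 180.0) % 360.0 - 180.0)

    def keyword_from_in_car_speaker(source_polar_angles: List[float],
                                    head_positions_per_frame: List[List[Point]],
                                    image_center: Point,
                                    angle_threshold_deg: float,
                                    frame_ratio_threshold: float) -> bool:
        """Fourth-condition check for a single robbery utterance keyword.

        source_polar_angles[i]   -- first polar-angle coordinate of the sound source in
                                    contemporaneous frame i (assumed already derived from the
                                    estimated azimuth/elevation via the camera/pickup geometry).
        head_positions_per_frame -- detected human head positions in each contemporaneous frame.
        """
        matching = 0
        for src_angle, heads in zip(source_polar_angles, head_positions_per_frame):
            # fifth preset condition: some head lies within the angle threshold of the source direction
            if any(angle_diff(polar_angle(h, image_center), src_angle) <= angle_threshold_deg
                   for h in heads):
                matching += 1
        total = len(head_positions_per_frame)
        return total > 0 and matching / total >= frame_ratio_threshold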
In one possible design, acquiring video data collected by a camera in a straight elevator and audio data collected by a pickup in the straight elevator during a target monitoring period includes:
after a real-time video frame acquired by a camera in the straight elevator is acquired, importing a frame image of the real-time video frame into an article identification model which is trained and based on a target detection algorithm, and outputting to obtain an article identification result, wherein the camera in the straight elevator is fixedly arranged in a car of the straight elevator and faces towards a straight elevator door, and a camera view fixedly covers an inner area of the car and an area of the straight elevator door;
if the article identification result comprises at least one robbable article detection frame, determining that the robbable article exists in the straight ladder, then importing the frame image of the real-time video frame into a trained human body identification model based on a target detection algorithm, and outputting to obtain a human body identification result;
if the human body identification result comprises two human body detection frames, determining that two human bodies exist in the straight ladder, and then judging whether the straight ladder door is in an open state or not through image identification processing according to the frame image of the real-time video frame;
if the straight ladder door is judged to be in the opening state, determining the real-time video frame as a target video frame;
acquiring, within the period from time t1-τ to time t1+τ, the video data collected by the camera in the straight elevator and the audio data collected by the sound pickup in the straight elevator, wherein t1 represents the acquisition time corresponding to the target video frame, τ represents the preset specified duration, and the sound pickup is fixedly mounted inside the straight elevator car.
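A minimal sketch of the gating logic described in this design is given below; the detection models, the door-state check and the timing variables are hypothetical interfaces standing in for the trained article recognition model, the trained human body recognition model and the image-based door-state judgment.

    from typing import Optional, Tuple

    def find_target_frame_time(frame_image, t_now: float,
                               item_model, person_model, door_is_open) -> Optional[float]:
        """Return t1 (the acquisition time) if this real-time frame qualifies as a target video frame.

        item_model / person_model stand in for the trained detection models described above,
        and door_is_open for the image-based door-state check; all three are assumed interfaces.
        """
        item_boxes = item_model(frame_image)          # robbable-article detection frames
        if len(item_boxes) == 0:
            return None
        person_boxes = person_model(frame_image)      # human body detection frames
        if len(person_boxes) != 2:
            return None
        if not door_is_open(frame_image):
            return None
        return t_now                                   # t1: start collecting the monitoring window

    def monitoring_window(t1: float, tau: float) -> Tuple[float, float]:
        """Target monitoring period from which audio and video are collected."""
        return (t1 - tau, t1 + tau)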
In one possible design, after sending the abnormal behavior reminding signal to the elevator monitoring background, the method further comprises the following steps:
continuously acquiring new video frames collected by the camera in the straight elevator until image recognition processing finds fewer than two human bodies in the straight elevator, and then judging whether the total duration of the keyword occurrence period is equal to zero, wherein the keyword occurrence period refers to the start and stop times of all robbery utterance keywords that meet the fourth preset condition;
if the total duration of the keyword occurrence period is judged to be equal to zero, sending first robbery record data to the elevator monitoring background; otherwise, judging whether the duration ratio T_same,12 / T_same is not less than a preset second proportion threshold, wherein the first robbery record data comprise the video data, the audio data, the action presenting postures of the at least one group of robbery action presenting nodes and/or the set of video frame images meeting the first preset condition, T_same,12 represents the total duration of the intersection between the keyword occurrence period and a delay period, the delay period is the period from time t1 to time t2, t2 represents the acquisition time corresponding to the current new video frame, the current new video frame is the new video frame in which image recognition processing finds fewer than two human bodies in the straight elevator, and T_same represents the total duration of the keyword occurrence period;
if the duration ratio T_same,12 / T_same is judged to be not less than the second proportion threshold, sending second robbery record data to the elevator monitoring background; otherwise, judging whether the second preset condition is met in the case that the third preset condition is met, wherein the second robbery record data comprise the video data, the audio data and/or the audio and video data collected during the keyword occurrence period;
and if the second preset condition is judged to be met in the case that the third preset condition is met, sending third robbery record data to the elevator monitoring background, and otherwise not sending them, wherein the third robbery record data comprise the video data, the audio data, the action presenting postures of the at least one group of robbery action presenting nodes, the set of video frame images meeting the first preset condition and/or the audio and video data collected during the keyword occurrence period.
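The selection among the three kinds of robbery record data can be summarized as a small decision function. The sketch below is illustrative only; the interval representation, the function names and the flag for the second/third-condition check are assumptions, not the patent's implementation.

    from typing import List, Tuple

    Interval = Tuple[float, float]

    def total_duration(intervals: List[Interval]) -> float:
        return sum(max(0.0, e - s) for s, e in intervals)

    def intersection_duration(intervals: List[Interval], window: Interval) -> float:
        w_start, w_end = window
        return sum(max(0.0, min(e, w_end) - max(s, w_start)) for s, e in intervals)

    def choose_record_data(keyword_intervals: List[Interval],
                           t1: float, t2: float,
                           second_ratio_threshold: float,
                           second_condition_holds_with_third: bool) -> str:
        """Selection logic after the alert has been sent (names and return labels are illustrative).

        keyword_intervals -- start/stop times of all keywords that met the fourth condition
        t1, t2            -- target-frame time and the time the car is seen with fewer than two people
        """
        t_same = total_duration(keyword_intervals)
        if t_same == 0.0:
            return "first robbery record data"
        t_same_12 = intersection_duration(keyword_intervals, (t1, t2))   # overlap with the delay period
        if t_same_12 / t_same >= second_ratio_threshold:
            return "second robbery record data"
        if second_condition_holds_with_third:
            return "third robbery record data"
        return "no record data sent"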
In one possible design, the sending of the first robbery recording message, the second robbery recording message or the third robbery recording message to the elevator monitoring background includes:
importing the frame image of the current new video frame into a floor information identification model which is trained and based on a target detection algorithm, and outputting to obtain a floor information identification result;
acquiring floor information according to the floor information identification result;
and sending a robbery record message carrying the floor information to the elevator monitoring background.
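As a minimal sketch, assuming a trained floor-information recognition model and a sending interface to the elevator monitoring background (both hypothetical), the floor-tagging step might look like this:

    def send_robbery_record(frame_image, record_payload: dict,
                            floor_model, send_to_background) -> None:
        """Attach the recognized floor information to the record message before sending.

        floor_model and send_to_background are assumed interfaces standing in for the trained
        floor-information recognition model and the link to the elevator monitoring background.
        """
        floor_result = floor_model(frame_image)            # e.g. digits read from the floor display
        record_payload["floor"] = floor_result
        send_to_background(record_payload)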
In a second aspect, the invention provides a device for detecting and giving early warning of the robbery of articles in a straight elevator, which comprises an audio and video data acquisition module, a human body skeleton labeling module, a first condition determining module, a second condition determining module, a third condition determining module, a fourth condition determining module and an event determining module;
the audio and video data acquisition module is used for acquiring video data collected by a camera in the straight elevator and audio data collected by a sound pickup in the straight elevator during a target monitoring period, wherein the target monitoring period is the period from time t1-τ to time t1+τ, t1 represents the acquisition time corresponding to a target video frame, and τ represents a preset specified duration; the target video frame is a video frame, collected by the camera in the straight elevator, in which image recognition processing finds a robbable article and two human bodies in the elevator and finds that the elevator door is open; the camera in the straight elevator is fixedly mounted inside the elevator car and faces the elevator door, so that the lens field of view fixedly covers the car interior and the elevator door area, and the sound pickup is fixedly mounted inside the elevator car;
The human body skeleton labeling module is in communication connection with the audio and video data acquisition module and is used for extracting and processing human body joint point information according to corresponding frame images aiming at each video frame in the video data to obtain a human body skeleton labeled in the corresponding frame images, wherein the human body skeleton comprises human body nodes corresponding to left and right hand heads, left and right elbows, left and right shoulders, left and right waists, left and right knees and left and right feet heads;
the first condition determining module is in communication connection with the human body skeleton labeling module and is used for determining, for each video frame, that the corresponding video frame meets a first preset condition if the distance from at least one hand head node in the corresponding frame image to the central point of an article detection frame is not greater than a preset first distance threshold, wherein the article detection frame is a detection frame of a robbable article identified in the corresponding frame image;
the second condition determining module is in communication connection with the human body skeleton labeling module and is used for determining that a second preset condition is met according to the human body skeletons of the video frames if the action presenting postures of at least one group of robbery action presenting nodes belong to pre-labeled robbery postures, wherein the robbery action presenting nodes comprise human body nodes corresponding to left and right hand heads, left and right elbows, left and right shoulders, left and right waists, left and right knees and left and right foot heads;
the third condition determining module is in communication connection with the audio and video data acquiring module and is used for performing robbery utterance keyword recognition processing by utilizing a trained keyword retrieval system based on an end-to-end voice recognition technology according to the audio data, and determining that a third preset condition is met if at least one robbery utterance keyword is recognized, wherein the confidence coefficient of the robbery utterance keyword is not less than a preset confidence coefficient threshold;
the fourth condition determining module is in communication connection with the third condition determining module and is used for determining that the corresponding keyword meets a fourth preset condition if the corresponding pronunciation sound source is judged to come from a human body in the vertical ladder aiming at each robbing speech keyword in the at least one robbing speech keyword;
the event determining module is respectively in communication connection with the first condition determining module, the second condition determining module, the third condition determining module and the fourth condition determining module, and is used for determining that an article robbing event in the vertical elevator occurs and sending an abnormal behavior reminding signal to an elevator monitoring background when the number of video frames in the video data meeting the first preset condition is not less than a preset frame number threshold on the premise that the second preset condition is met, and/or when at least one robbing words keyword meets the fourth preset condition on the premise that the third preset condition is met.
In a third aspect, the present invention provides a computer device, which comprises a memory, a processor and a transceiver that are sequentially connected in communication, wherein the memory is used for storing a computer program, the transceiver is used for sending and receiving messages, and the processor is used for reading the computer program and executing the method for detecting and giving early warning of the robbery of articles in a straight elevator according to the first aspect or any possible design of the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium having instructions stored thereon which, when run on a computer, cause the computer to execute the method for detecting and giving early warning of the robbery of articles in a straight elevator according to the first aspect or any possible design of the first aspect.
In a fifth aspect, the present invention provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the method for detecting and giving early warning of the robbery of articles in a straight elevator according to the first aspect or any possible design of the first aspect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic view of a vertical ladder scene provided by the present invention.
Fig. 2 is a schematic flow diagram of the method for detecting and warning the robbery of the articles in the vertical ladder according to the present invention.
Fig. 3 is an exemplary diagram of the correspondence between joint points and joint point labels of the COCO18 model provided by the present invention.
Fig. 4 is a schematic diagram of the working principle of the keyword retrieval system based on the end-to-end speech recognition technology provided by the invention.
Fig. 5 is a schematic flow chart of a frame level alignment algorithm in a keyword retrieval process according to the present invention.
Fig. 6 is a schematic flow chart of a keyword matching and deduplication method in a keyword search process according to the present invention.
Fig. 7 is a flow chart of a sound signal preprocessing method provided by the present invention.
Fig. 8 is a schematic structural diagram of a convolutional neural network in a sound source position estimation model provided by the present invention.
Fig. 9 is a schematic flow diagram for acquiring audio/video data of a target monitoring period according to the present invention.
Fig. 10 is a schematic flow chart for confirming an item robbery event in a vertical ladder according to the present invention.
Fig. 11 is a schematic structural diagram of an article robbery detection early warning device in a straight ladder provided by the invention.
Fig. 12 is a schematic structural diagram of a computer device provided by the present invention.
In the above drawings: 1-a vertical elevator car; 11-a vertical ladder door; 2-straight ladder camera equipment; 3-a label; 4-floor display.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. Specific structural and functional details disclosed herein are merely representative of exemplary embodiments of the invention. This invention may, however, be embodied in many alternate forms and should not be construed as limited to the embodiments set forth herein.
It will be understood that, although the terms first, second, etc. may be used herein to describe various objects, these objects should not be limited by these terms. These terms are only used to distinguish one object from another. For example, a first object may be referred to as a second object, and a second object may similarly be referred to as a first object, without departing from the scope of example embodiments of the invention.
It should be understood that, for the term "and/or" as may appear herein, it is merely an associative relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, B exists alone or A and B exist at the same time; for the term "/and" as may appear herein, which describes another associative object relationship, it means that there may be two relationships, e.g., a/and B, which may mean: a exists singly or A and B exist simultaneously; in addition, for the character "/" that may appear herein, it generally means that the former and latter associated objects are in an "or" relationship.
As shown in fig. 1-2, the method for detecting and giving early warning of the robbery of articles in the straight elevator provided by the first aspect of this embodiment may be, but is not limited to being, executed by computer equipment that has certain computing resources and is in communication connection with the camera in the straight elevator, the sound pickup in the straight elevator and the elevator monitoring background, for example by electronic equipment such as the camera in the straight elevator itself, an in-elevator camera device (which integrates the camera and the sound pickup and therefore has a sound capturing function), an elevator control device, a personal computer (PC, i.e. a multipurpose computer whose size, price and performance make it suitable for personal use; desktop computers, laptops, small notebooks, tablets and ultrabooks all belong to this category), a smart phone, a personal digital assistant (PDA) or a wearable device. In this way, the video data collected by the camera in the straight elevator and the audio data collected by the sound pickup in the straight elevator during the target monitoring period are obtained, and on the basis of behavior recognition, mechanisms such as keyword recognition of robbery utterances and comprehensive judgment of the sound source position (to prevent interference from in-elevator advertising audio) are added, so that the accuracy with which an article robbery event is judged can be greatly improved. As shown in fig. 2, the method for detecting and giving early warning of the robbery of articles in the straight elevator may include, but is not limited to, the following steps S1 to S7.
S1, acquiring video data collected by a camera in the straight elevator and audio data collected by a sound pickup in the straight elevator during a target monitoring period, wherein the target monitoring period is the period from time t1-τ to time t1+τ, t1 represents the acquisition time corresponding to a target video frame, and τ represents a preset specified duration; the target video frame is a video frame, collected by the camera in the straight elevator, in which image recognition processing finds a robbable article and two human bodies in the elevator and finds that the elevator door is open; the camera in the straight elevator is fixedly mounted inside the elevator car and faces the elevator door, so that the camera's field of view fixedly covers the car interior and the elevator door area, and the sound pickup in the straight elevator is fixedly mounted inside the elevator car.
In the step S1, the camera in the straight elevator is configured to collect real-time monitoring images inside the elevator car, and the sound pickup in the straight elevator is configured to collect real-time monitoring sound inside the elevator car; the camera and the sound pickup may be fixedly mounted at different positions inside the elevator car, or at the same position (i.e. the camera and the sound pickup are integrated into one in-elevator data acquisition device). As shown in fig. 1, an in-elevator camera device 2 integrating the camera and the sound pickup is fixedly mounted in the straight elevator car 1 and faces the elevator door 11; on the one hand the lens field of view can thus be fixed on the car interior and the elevator door area, so that video data containing the whole human body and consecutive, unobstructed actions can be captured, and on the other hand audio data collected synchronously with the video data can be obtained. The computer equipment can obtain the video data and the audio data by connecting to the camera and the sound pickup through wired or wireless communication. The target monitoring period is a special detection period that satisfies the following article robbery conditions: a robbable article is present in the elevator car, only two persons are in the car, and the elevator door is open. (The main consideration is that an article robbery event in the elevator occurs during the period in which the elevator door is open and the robbery suspect can escape quickly, including the period in which the door is fully open, the partially open period during opening/closing in which one person can still pass, and the partially open period during opening/closing in which the suspect can pull the door open by hand; that is, even when the robbery suspect cannot exit normally through the elevator door, there are still cases in which the robbery is carried out in advance and the elevator door is pulled open to speed up the escape.) Because the subsequent robbery detection and early warning processing is carried out only on the audio and video data of the target monitoring period, there is no need to analyze the data in real time, which greatly reduces the required computing resources and power consumption and prolongs the service life of the computer equipment. In addition, the specified duration τ may be determined from statistics on the elapsed time of all article robbery events in elevators that have occurred historically; for example, if the statistical average elapsed time is 6 seconds, the specified duration τ may be set to 3 seconds, i.e. the target monitoring period lasts 6 seconds.
And S2, for each video frame in the video data, extracting human body joint point information from the corresponding frame image to obtain a human body skeleton labeled in the corresponding frame image, wherein the human body skeleton includes but is not limited to the human body nodes corresponding to the left and right hand heads, left and right elbows, left and right shoulders, left and right waists, left and right knees and left and right foot heads.
The step S2 includes, but is not limited to, the following steps S21 to S22.
S21, aiming at each video frame in the video data, eighteen corresponding human body joint points which are marked according to COCO18 mode joint point labels are identified from corresponding frame images by using an AlphaPose human body posture estimation algorithm or human body posture identification project OpenPose software, wherein the eighteen human body joint points belong to the same human body.
In step S21, the AlphaPose human body pose estimation algorithm is an existing technique that is based on a Regional Multi-Person Pose Estimation (RMPE) framework and uses three techniques, a Symmetric Spatial Transformer Network (SSTN), a deep proposals generator (DPG) and parametric pose non-maximum suppression (p-NMS), to solve the multi-person pose estimation problem in real-world scenes. The human body pose recognition project OpenPose is open-source software of Carnegie Mellon University (CMU), developed on the basis of convolutional neural networks and supervised learning with caffe as the framework. Both methods can estimate poses of human body motion, facial expressions, finger motion and the like, and obtain the position information of each joint point of the human skeleton in an input image. The COCO18 joint point label is an existing joint point labeling model, and the order and corresponding positions of the 18 labeled joint points can be as shown in fig. 3. Therefore, based on the existing AlphaPose human pose estimation algorithm or OpenPose software and the COCO18 joint point labels, the following 18 joint points can be recognized from each frame image: a nose node (reference numeral 0), a head node (reference numeral 1), a right shoulder node (reference numeral 2), a right elbow node (reference numeral 3), a right hand head node (reference numeral 4), a left shoulder node (reference numeral 5), a left elbow node (reference numeral 6), a left hand head node (reference numeral 7), a right waist node (reference numeral 8), a right knee node (reference numeral 9), a right foot head node (reference numeral 10), a left waist node (reference numeral 11), a left knee node (reference numeral 12), a left foot head node (reference numeral 13), a right eye node (reference numeral 14), a left eye node (reference numeral 15), a right ear node (reference numeral 16) and a left ear node (reference numeral 17). In addition, the number of sets of eighteen human body joint points equals the number of human bodies identified in the corresponding frame image, the sets being in one-to-one correspondence with the two human bodies.
And S22, removing human body joint points irrelevant to the robbery behavior from the eighteen corresponding human body joint points aiming at each video frame to obtain a human body skeleton marked in a corresponding frame image, wherein the human body skeleton comprises but is not limited to human body nodes corresponding to left and right hands and heads, left and right elbows, left and right shoulders, left and right waists, left and right knees, left and right feet and the like.
In step S22, the trunk and the four limbs of the human body are considered as key execution parts during the process of robbery of the article, so in order to reduce the calculation resources required for the subsequent behavior recognition, only the human body nodes corresponding to the left and right hand heads, the left and right elbows, the left and right shoulders, the left and right waists, the left and right knees, the left and right feet heads, and the like may be reserved in the eighteen human body joint points to obtain the human body skeleton. In addition, if information of two human bodies is included in a certain frame image, two human body skeletons corresponding to the two human bodies one to one are obtained.
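A minimal sketch of the joint pruning in steps S21-S22 is given below, using the COCO18 reference numerals listed in step S21; the array layout of the pose-estimation output is an assumption.

    import numpy as np

    # COCO18 reference numerals as listed in step S21 (Fig. 3):
    # 0 nose, 1 head, 2 R shoulder, 3 R elbow, 4 R hand head, 5 L shoulder, 6 L elbow,
    # 7 L hand head, 8 R waist, 9 R knee, 10 R foot head, 11 L waist, 12 L knee,
    # 13 L foot head, 14 R eye, 15 L eye, 16 R ear, 17 L ear

    # joints kept for robbery-behavior recognition (trunk and limbs only)
    ROBBERY_JOINTS = [4, 7, 3, 6, 2, 5, 8, 11, 9, 12, 10, 13]

    def prune_skeleton(joints_18: np.ndarray) -> np.ndarray:
        """Drop joints that are irrelevant to the robbery behavior.

        joints_18 -- array of shape (18, 3): x, y, confidence for one detected person,
                     e.g. as produced by AlphaPose/OpenPose (this layout is an assumption).
        Returns an array of shape (12, 3) containing only the retained human body nodes.
        """
        return joints_18[ROBBERY_JOINTS]

    if __name__ == "__main__":
        fake_person = np.random.rand(18, 3)       # stand-in for one frame's pose output
        skeleton = prune_skeleton(fake_person)
        print(skeleton.shape)                      # (12, 3)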
And S3, for each video frame, if the distance from at least one hand head node in the corresponding frame image to the central point of an article detection frame is judged to be not greater than a preset first distance threshold, determining that the corresponding video frame meets a first preset condition, wherein the article detection frame refers to the detection frame of a robbable article identified in the corresponding frame image.
In step S3, if it is determined that the distance from at least one hand head node to the center point of an article detection frame in the frame image of a certain video frame is not greater than the preset first distance threshold, it indicates that at that moment a hand is close to or holding a robbable article, and this can be used as one of the effective criteria for determining whether an article robbery event has occurred in the straight elevator. Specifically, the step includes, but is not limited to, the following steps S31 to S33.
And S31, importing the frame image of a certain video frame into the trained object recognition model based on the target detection algorithm, and outputting to obtain an object recognition result.
In step S31, the target detection algorithm is an existing artificial intelligence recognition algorithm for recognizing objects in a picture and marking their positions; it may specifically be, but is not limited to, Faster R-CNN (Faster Regions with Convolutional Neural Networks, proposed in 2015, which took multiple first places in the ILSVRC and COCO competitions of that year), SSD (Single Shot MultiBox Detector, a target detection algorithm proposed by Wei Liu at ECCV and one of the currently popular detection frameworks) or YOLO (You Only Look Once, first widely applied in 2016 and now developed to the v5 version; its basic idea is to divide the input image into a 7x7 grid, predict 2 bounding boxes per grid cell, remove target windows of low probability according to a threshold, and finally remove redundant windows by box merging to obtain the detection result). Therefore, the trained article recognition model based on a target detection algorithm can be obtained through an existing conventional training procedure, so that the article recognition model has the ability to accurately recognize robbable articles (such as a mobile phone, a wallet or a handbag) and can identify whether a frame image contains the image of a robbable article.
Before the step S31, the sample collection and training process for the article recognition model may include, but is not limited to, the following: (1) acquiring audio and video data DT1-1 of elevator users having articles robbed, collected by the camera in the straight elevator within a specified time period (one year by default; the period can be modified) with only two persons in the car; audio and video data DT1-2-1 of elevator users having articles robbed with only two persons in the car, obtained from the network; audio and video data DT1-2-2 simulating elevator users having articles robbed with only two persons in the car; and elevator video data DT1-3 collected within another specified time period (one month by default; the period can also be modified); (2) obtaining, from DT1-1, DT1-2-1, DT1-2-2 and DT1-3, video data DT2-1 containing robbed and potentially robbable articles, and classifying and labeling it to form a target article data set A; (3) based on the target article data set A, performing sample-balanced model training with the YOLOv5 target detection algorithm to obtain the article recognition model. Furthermore, considering that the training process of the recognition model consumes a lot of computing resources, the article recognition model is preferably trained on other computer equipment and then deployed on the computer equipment as part of the AI detection algorithm.
And S32, if the article recognition result includes at least one robbable-article detection frame, judging, for each hand head node labeled in the frame image of the certain video frame, whether the distance from that node to the central point of each of the at least one robbable-article detection frame is not greater than the preset first distance threshold.
And S33, if the distance from a certain hand head node to the center point of a certain detection frame in the frame image of the certain video frame is judged to be not greater than the first distance threshold, determining that the certain video frame meets a first preset condition.
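The first preset condition of steps S31-S33 is essentially a point-to-box-center distance test. A minimal sketch follows, assuming detection boxes in (x1, y1, x2, y2) pixel form and hand head nodes as (x, y) points (both assumptions).

    import math
    from typing import List, Tuple

    Box = Tuple[float, float, float, float]          # x1, y1, x2, y2
    Point = Tuple[float, float]

    def box_center(box: Box) -> Point:
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

    def frame_meets_first_condition(hand_head_nodes: List[Point],
                                    item_boxes: List[Box],
                                    first_distance_threshold: float) -> bool:
        """First preset condition: some hand head node lies within the distance threshold
        of the center point of some robbable-article detection frame (pixel units assumed)."""
        for hand in hand_head_nodes:
            for box in item_boxes:
                cx, cy = box_center(box)
                if math.hypot(hand[0] - cx, hand[1] - cy) <= first_distance_threshold:
                    return True
        return False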
And S4, according to the human body skeleton of each video frame, if the action presenting postures of at least one group of robbery action presenting nodes belong to the pre-marked robbery postures, determining that a second preset condition is met, wherein the robbery action presenting nodes comprise but are not limited to human body nodes corresponding to left and right hand heads, left and right elbows, left and right shoulders, left and right waists, left and right knees, left and right feet heads and the like.
In step S4, if it is judged that the action presenting postures of at least one group of robbery action presenting nodes belong to the pre-labeled robbery postures, it indicates that an article robbery action exists in the straight elevator within the target monitoring period, and this can be used as a second effective criterion for determining whether an article robbery event has occurred in the straight elevator. Specifically, the human skeletons of all video frames can be imported into a trained bone behavior recognition model based on the ST-GCN (spatial-temporal graph convolutional network) bone behavior recognition algorithm, and a behavior recognition result is output; if the behavior recognition result contains at least one group of robbery behavior presenting nodes whose action presenting postures belong to the pre-labeled robbery postures, it is determined that the second preset condition is met. The ST-GCN bone behavior recognition algorithm is an existing algorithm for human behavior recognition based on a spatial-temporal graph convolutional network; it models dynamic skeletons based on the time series of human joint positions and captures their spatio-temporal relationship by extending graph convolution into a spatial-temporal graph convolutional network. The trained bone behavior recognition model based on the ST-GCN algorithm can be obtained through the conventional training procedure. In addition, the order of step S4 and step S3 is not limited, and they may be executed simultaneously.
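A minimal inference sketch for the second preset condition is shown below. The (N, C, T, V, M) input layout follows common ST-GCN implementations but is an assumption here, as are the model interface and parameter names.

    import torch

    def second_condition_met(skeleton_sequence: torch.Tensor,
                             stgcn_model: torch.nn.Module,
                             robbery_class_ids: set,
                             score_threshold: float) -> bool:
        """Second preset condition via skeleton-based behavior recognition.

        skeleton_sequence -- tensor shaped (1, C, T, V, M): one clip, C coordinate channels,
                             T frames, V retained joints (12 here), M persons (2 here);
                             this layout follows common ST-GCN code bases but is an assumption.
        stgcn_model       -- a trained skeleton behavior-recognition model (assumed interface).
        robbery_class_ids -- indices of the behavior labels annotated as robbery postures.
        """
        stgcn_model.eval()
        with torch.no_grad():
            logits = stgcn_model(skeleton_sequence)          # (1, num_classes)
            probs = torch.softmax(logits, dim=-1)[0]
        best = int(torch.argmax(probs))
        return best in robbery_class_ids and float(probs[best]) >= score_threshold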
Before the step S4, a sample collection and training process for the bone behavior recognition model may include, but is not limited to, the following: (1) acquiring elevator user item robbed audio and video data DT1-1 which are acquired by a camera in the straight elevator and are only acquired by two persons in the elevator car within a specified time period (default one year, and the time period can be modified), elevator user item robbed audio and video data DT1-2-1 which are originated from the network and are only acquired by two persons in the elevator car, and elevator user item robbed audio and video data DT1-2-2 which simulate that the elevator user item is robbed by two persons in the elevator car; (2) acquiring a video DT2-4 containing an article robbery behavior from DT1-1, DT1-2-1 and DT1-2-2, acquiring human body joint point information of a person image which contains the behavior and is positioned in a human body detection area in the video DT2-4 by using an AlphaPose human body posture estimation algorithm, saving a skeleton map (specifically, saving the serial number and the coordinates of each joint point of each frame and the corresponding skeleton map of each frame), defining behavior subdivision labels of the person on the article, ensuring that one label is in a video segment, and the video time length is less than a specified number of seconds (the specific number of seconds is determined by comprehensively evaluating the robbery behavior in the DT1-1, DT1-2-1 and DT1-2-2 videos), numbering video folders, establishing indexes of the behavior labels and video files, thereby obtaining a behavior identification data set B of the person in the straight elevator to the article; (3) and converting the behavior recognition data set B into a file suitable for training by using an ST-GCN (ST-GCN skeletal behavior recognition algorithm), and then carrying out model training to obtain the skeletal behavior recognition model. Furthermore, considering that the training process of the recognition model consumes a lot of computing resources, the skeletal behavior recognition model is preferably deployed on other computer devices as part of the AI detection algorithm after being trained on the computer devices.
And S5, according to the audio data, carrying out robbery utterance keyword recognition processing by using a trained keyword retrieval system based on an end-to-end voice recognition technology, and if at least one robbery utterance keyword is recognized, determining that a third preset condition is met, wherein the confidence coefficient of the robbery utterance keyword is not less than a preset confidence coefficient threshold value.
In step S5, if at least one robbery utterance keyword is identified, it indicates that speech related to article robbery exists in the straight elevator within the target monitoring period, and this can be used as a third effective criterion for determining whether an article robbery event has occurred. The keyword retrieval system based on an end-to-end speech recognition technology is mainly used to match keywords in the utterances of elevator passengers and to obtain the start and stop time points and confidence of each keyword when a suspected article robbery occurs. The core parts of the keyword retrieval system are an end-to-end speech recognition system (which adopts a joint CTC/attention architecture based on the Transformer neural network structure as the basic speech recognition framework), a frame-by-frame phoneme classifier, frame-level alignment, and keyword matching and deduplication over the N-best hypotheses; the schematic diagram is shown in fig. 4, where dashed box A is the frame-by-frame phoneme classifier and dashed box B is the joint CTC/attention end-to-end speech recognition front end (it should be noted that, except for the dimension of the output layer, the network structures of the frame-by-frame phoneme classifier and the speech recognition encoder are exactly the same, so the down-sampling layer and the parameters of several lower encoder layers are shared between the frame-by-frame phoneme classifier and the end-to-end speech recognition front end, while several higher encoder layers are kept separate to avoid mutual interference during training).
The processing flow of the keyword retrieval system is as follows. Looking first at dashed box A: the original speech features are input into a down-sampling shared layer (to reduce the amount of computation in the subsequent neural network), then pass through an encoder network formed by stacking several (lower and upper) layers, and then through a softmax fully connected output layer (i.e. the phoneme classifier output layer) to obtain frame-by-frame phoneme posterior probabilities (the output of the phoneme classifier is the posterior probability of each phoneme on each speech frame of a word). Meanwhile, as shown in dashed box B, after the original speech features pass through the down-sampling shared layer and the several (lower and upper) encoder layers, the data enter a CTC (Connectionist Temporal Classification) output layer (the neural network predicts a CTC label sequence frame by frame; consecutive identical labels in the sequence are merged and the specific blank label is deleted to collapse the sequence into the result sequence) and an attention-based decoder consisting of several decoder layers. Each encoder layer is composed of a feed-forward network and a multi-head self-attention layer whose three inputs, Q-query, K-key and V-value, all come from the output of the preceding sub-layer; each decoder layer additionally has a multi-head attention sub-layer between the self-attention and the feed-forward network, whose input Q comes from the output of the preceding sub-layer while K and V come from the last encoder layer, and the attention used is scaled dot-product attention. The input of the decoder is the embedded vector sequence of text labels, and during inference the labels predicted in the previous step are used as input (the current label is predicted autoregressively). In this way, joint CTC/attention decoding is performed during inference using the scores of the CTC branch and the decoder, supported by the decoder's predicted label set and the CTC result sequence.
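The sharing structure described above (a common down-sampling layer and lower encoder layers, with separate upper layers for the phoneme classifier and the CTC/attention front end) can be sketched in a few lines of PyTorch. This is only an illustrative skeleton with made-up sizes and layer counts; the attention decoder and the joint CTC/attention decoding are omitted.

    import torch
    import torch.nn as nn

    def make_layer(d_model: int, n_heads: int) -> nn.TransformerEncoderLayer:
        return nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    class KeywordSearchFrontEnd(nn.Module):
        """Minimal sketch of the shared structure in Fig. 4 (sizes/layer counts are illustrative)."""

        def __init__(self, feat_dim=43, d_model=256, n_heads=4,
                     n_shared=6, n_private=3, n_phones=200, n_tokens=1000):
            super().__init__()
            self.downsample = nn.Sequential(              # shared down-sampling layer
                nn.Linear(feat_dim, d_model), nn.ReLU())
            self.shared_encoder = nn.TransformerEncoder(make_layer(d_model, n_heads), n_shared)
            self.pc_encoder = nn.TransformerEncoder(make_layer(d_model, n_heads), n_private)   # box A upper layers
            self.asr_encoder = nn.TransformerEncoder(make_layer(d_model, n_heads), n_private)  # box B upper layers
            self.phone_out = nn.Linear(d_model, n_phones)   # softmax phoneme-classifier output layer
            self.ctc_out = nn.Linear(d_model, n_tokens)     # CTC output layer (decoder omitted here)

        def forward(self, feats):
            h = self.shared_encoder(self.downsample(feats))
            phone_logits = self.phone_out(self.pc_encoder(h))   # per-frame phoneme posteriors (pre-softmax)
            ctc_logits = self.ctc_out(self.asr_encoder(h))      # per-frame token posteriors for CTC
            return phone_logits, ctc_logits

    if __name__ == "__main__":
        model = KeywordSearchFrontEnd()
        x = torch.randn(1, 120, 43)        # one utterance: 120 frames of 43-dim features (40 MFCC + 3 pitch)
        phones, ctc = model(x)
        print(phones.shape, ctc.shape)     # torch.Size([1, 120, 200]) torch.Size([1, 120, 1000])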
To provide relatively accurate word start/end time points and reliable confidences for the joint CTC/attention end-to-end speech recognition front end, a frame-level alignment method is used together with the posterior probabilities obtained from the softmax output layer (the posterior of each phoneme of a word on each speech frame), as shown in fig. 5: the decoded word sequence of the speech recognition result is mapped to a phoneme sequence (δ_1, …, δ_M), and interval phonemes are inserted at the beginning and end of the sentence and between adjacent words; for each phoneme δ_m, its posterior probability on each speech frame is obtained from the phoneme classifier. Denoting the posterior probability of δ_m at the n-th frame as P_n(δ_m) and the total number of speech frames as N, the phoneme posteriors form an M×N matrix P; a dynamic programming algorithm is used to find the path of maximum cumulative posterior probability from the top-left element P_{1,1} of the matrix to the bottom-right element P_{M,N} (the path only advances to the right or to the lower right, each frame corresponds to exactly one phoneme, each actually pronounced phoneme corresponds to at least one frame, but interval phonemes may be skipped); the speech frames corresponding to each phoneme are then obtained by backtracking, giving the start and end frames of each word in the speech recognition decoding result, from which the start and end time points are computed according to the frame rate of the model. The frame-averaged phoneme posterior confidence γ of each word is computed, and the resulting keyword confidence is obtained by linear interpolation of γ with the mean label posterior confidence ξ output by the decoder. At this point the phoneme classifier posteriors and the joint decoding result have been time-aligned and the start/end time points and confidence of each word have been obtained; keyword matching and deduplication over the N-best hypotheses then follows (in order to avoid missing potential keyword results), as shown in fig. 6: the result list is cleared, the hypotheses are traversed in reverse order (from the N-th hypothesis to the 1st hypothesis) and matched keywords are put into the result list; if the same keyword is already in the result list, the one with the higher confidence is kept and the one with the lower confidence is deleted.
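The maximum-cumulative-posterior path search described above can be sketched as a simple dynamic program over the M×N posterior matrix P (phonemes × frames). The sketch below is a simplification: each step moves one frame to the right while either staying on the current phoneme or advancing to the next one, interval-phoneme skipping is omitted, and all names are illustrative assumptions rather than the filing's actual implementation.

```python
import numpy as np

def align_phonemes(P):
    """P: M x N matrix, P[m, n] = posterior of phoneme m at frame n (M <= N assumed).

    Returns, for each phoneme index m, the [start_frame, end_frame] assigned to it on
    the maximum-cumulative-posterior path from P[0, 0] to P[M-1, N-1].
    """
    M, N = P.shape
    dp = np.full((M, N), -np.inf)
    back = np.zeros((M, N), dtype=int)       # 0 = came from the left, 1 = from the upper-left
    dp[0, 0] = P[0, 0]
    for n in range(1, N):
        for m in range(M):
            stay = dp[m, n - 1]                                   # same phoneme, next frame
            advance = dp[m - 1, n - 1] if m > 0 else -np.inf      # move on to the next phoneme
            if advance > stay:
                dp[m, n], back[m, n] = advance + P[m, n], 1
            else:
                dp[m, n], back[m, n] = stay + P[m, n], 0
    # Backtrack to recover the start/end frame of each phoneme; a word's start/end
    # frames then follow from its first/last phoneme.
    spans = {m: [None, None] for m in range(M)}
    m, n = M - 1, N - 1
    while n >= 0:
        spans[m][0] = n
        if spans[m][1] is None:
            spans[m][1] = n
        if back[m, n] == 1:
            m -= 1
        n -= 1
    return spans
```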
Before step S5, the sample collection and training process for the keyword retrieval system may include, but is not limited to, the following: (1) acquiring, within a specified time period (default one year, modifiable), audio and video data DT1-1 of an elevator user's article being robbed collected by the camera in the straight elevator with only two persons in the elevator car, audio and video data DT1-2-1 of an elevator user's article being robbed with only two persons in the car sourced from the network, and audio and video data DT1-2-2 simulating an elevator user's article being robbed with only two persons in the car; (2) obtaining initial speech recognition data (including speech data and the corresponding text data; only keyword data related to article robbery are collected, scene-dependent retention is applied to homophones and polyphonic Chinese words, and non-keyword data are removed) and performing manual labeling to obtain a speech recognition data set C (global cepstral mean and variance normalization is applied to the data using the training set as reference); then performing speech preprocessing and feature extraction with Kaldi, adopting 40-dimensional high-resolution Mel-frequency cepstral coefficients plus three-dimensional pitch features as the speech features, manually transcribing the texts of the training set in the speech recognition data set C, and generating a certain number of text modeling units with the Byte Pair Encoding (BPE) algorithm to serve as the end-to-end speech recognition output units; in addition, a Gaussian mixture model-HMM speech recognition system with triphone modeling is used to obtain the frame-by-frame phoneme labels of the training-set speech required for training the phoneme classifier; (3) performing model training of the keyword retrieval system based on end-to-end speech recognition technology on the speech recognition data set C, where the frame-by-frame phoneme classifier and the speech recognition front end are jointly trained in a multi-task learning manner and the total loss function is obtained by linear interpolation of the phoneme classifier loss L_PC and the speech recognition front-end loss L_ASR: L = β·L_PC + (1−β)·L_ASR, where β is the interpolation coefficient; during model training, an Adam optimizer with Noam learning-rate decay is used, together with dropout (probability 0.1), label smoothing (coefficient 0.1), training warm-up (25000 steps) and gradient clipping (threshold 5), and the multi-task learning loss interpolation coefficients α and β are set to 0.3 and 0.1 respectively, yielding the trained keyword retrieval system based on end-to-end speech recognition technology.
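A minimal sketch of the loss interpolation described above, L = β·L_PC + (1−β)·L_ASR, is given below with PyTorch-style tensors; the function name and the commented training step are illustrative assumptions, not the actual training code of the filing.

```python
import torch

def multitask_loss(loss_pc: torch.Tensor, loss_asr: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Linear interpolation of the frame-by-frame phoneme classifier loss (L_PC)
    and the end-to-end speech recognition front-end loss (L_ASR)."""
    return beta * loss_pc + (1.0 - beta) * loss_asr

# Illustrative training step: both losses are computed from the shared down-sampling
# layer and lower encoder layers, then combined and backpropagated together:
#   loss = multitask_loss(loss_pc, loss_asr, beta=0.1)
#   loss.backward(); optimizer.step()
```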
In addition, in the parameter configuration stage of the keyword retrieval system, the shared lower layers of the frame-by-frame phoneme classifier and the speech recognition front end may be configured as a 9-layer Transformer encoder, with each branch keeping its own independent upper 3 encoder layers; the decoder of the speech recognition front end is configured with 6 layers, the multi-head attention dimension in each encoder and decoder layer is set to 320, the number of attention heads to 4, and the feed-forward network dimension to 2048; the modeling units of the frame-by-frame phoneme classifier are configured as the 22 consonants and 10 vowels of Chinese plus silence (interval phonemes), for a total of 33 phoneme labels. In the running phase, joint CTC/attention decoding with a CTC weight of 0.5 is used. Considering that the training process of the system model consumes a large amount of computing resources, the keyword retrieval system is preferably trained on other computer equipment and then deployed on the computer device as part of the AI detection algorithm. The execution order of step S5 and steps S2 to S4 is not restricted; they may also be executed simultaneously.
And S6, aiming at each robbing speech keyword in the at least one robbing speech keyword, if the corresponding pronunciation sound source is judged to come from a human body in the vertical ladder, determining that the corresponding keyword meets a fourth preset condition.
In step S6, if it is determined that the pronunciation sound source of a certain robbery utterance keyword comes from a human body in the straight elevator, it indicates that the keyword was actually spoken by a person in the elevator (i.e., it was not played by an in-elevator advertising machine or other device), which can serve as a fourth effective criterion for determining whether an in-elevator article robbery event exists. Specifically, for each robbery utterance keyword in the at least one robbery utterance keyword, determining that the corresponding keyword satisfies a fourth preset condition if its pronunciation sound source is judged to come from a human body in the straight elevator includes, but is not limited to, the following steps S61 to S65.
And S61, aiming at a certain robbing speech keyword in the at least one robbing speech keyword, performing corresponding sound source position estimation processing by using a trained sound source position estimation model according to the audio data in the corresponding starting and stopping time to obtain a direction angle and an elevation angle of a corresponding sound source relative to the sound pickup in the straight ladder.
In step S61, the sound source orientation estimation model mainly includes two components: preprocessing of the sound signals and a convolutional neural network. The preprocessing performs framing, windowing and noise reduction on the collected sound, and then computes the GCC-PHAT between channels according to the microphone array structure; with a four-channel array element structure there are six microphone pairs, so a six-dimensional GCC-PHAT feature is obtained from the pairwise relationships between array elements, as shown in fig. 7. Fig. 8 shows the convolutional neural network structure in the sound source orientation estimation model, and the network parameters of the CNN (Convolutional Neural Network) structure are listed in table 1 below:
TABLE 1 Network parameters of the CNN network structure (the table is provided as an image in the original document)
Further, max pooling with a 2×2 pooling window is employed in the CNN network structure, and batch normalization is performed after each convolution block.
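Under the stated four-microphone assumption, the six pairwise GCC-PHAT features mentioned above could be computed roughly as in the sketch below; the FFT-based formulation and all names are illustrative and are not taken from the filing.

```python
import numpy as np
from itertools import combinations

def gcc_phat(x, y, n_fft=1024):
    """GCC-PHAT cross-correlation between two single-channel signal frames."""
    X = np.fft.rfft(x, n=n_fft)
    Y = np.fft.rfft(y, n=n_fft)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12            # phase transform: keep only the phase information
    return np.fft.irfft(R, n=n_fft)   # correlation as a function of time lag

def gcc_phat_features(frames):
    """frames: four time-aligned channel frames from the microphone array.

    Returns one GCC-PHAT vector per channel pair: C(4, 2) = 6 pairs,
    i.e. the six-dimensional GCC-PHAT feature described above.
    """
    return np.stack([gcc_phat(frames[i], frames[j])
                     for i, j in combinations(range(len(frames)), 2)])
```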
Before step S61, the sample collection and training process of the sound source orientation estimation model may include, but is not limited to, the following: (1) acquiring the audio and video data DT1-2-2 simulating an elevator user's article being robbed with only two persons in the elevator car; (2) preprocessing the audio data in DT1-2-2 to obtain GCC-PHAT features, computing the related quantities, and taking the feature data, the computed related quantities, the reference coordinates of the sound pickup device and of the sound-production position (with the center of the front end of the sound pickup device as the origin), and the reference azimuth-angle and elevation-angle data as a sound localization estimation data set D; (3) performing model training of the convolutional neural network of the sound source orientation estimation model on the sound localization estimation data set D to obtain the trained sound source orientation estimation model. Furthermore, considering that the training process of the estimation model consumes a large amount of computing resources, the sound source orientation estimation model is preferably trained on other computer equipment and then deployed on the computer device as part of the AI detection algorithm.
S62, determining a first polar angle coordinate of a sound source corresponding to the certain robbing words keyword in a frame image of a synchronous video frame and taking an image center as a pole according to the direction angle, the elevation angle and the known position relation between the straight ladder inner camera and the straight ladder inner pickup, wherein the synchronous video frame is a video frame collected by the straight ladder inner camera within the starting and ending time corresponding to the certain robbing words keyword.
In step S62, if the camera in the straight elevator and the sound pickup in the straight elevator are integrated into one device, the direction angle and the elevation angle are directly the direction angle and elevation angle of the sound source relative to the camera in the straight elevator, so the first polar-angle coordinate (with the image center as the pole) of the sound source corresponding to the certain robbery utterance keyword in the frame image of a synchronous video frame can be determined more easily.
S63, determining at least one human head position in the corresponding frame image aiming at each synchronous video frame.
In the step S63, the human head position may be, but is not limited to, a head node (corresponding to reference numeral 1) in a human skeleton marked in a frame image of the contemporaneous video frame.
S64, for each synchronous video frame, if a human head position satisfying the following condition exists in the corresponding at least one human head position: the absolute difference between the second polar-angle coordinate of the human head position in the corresponding frame image (with the image center as the pole) and the first polar-angle coordinate is not greater than a preset angle threshold, determining that the corresponding video frame satisfies a fifth preset condition.
In step S64, if it is determined that there is a human head position satisfying the above condition in the at least one human head position of a certain contemporaneous video frame, it indicates that the human head position at this time is highly overlapped with the sound source of the certain robbing utterance keyword, and it can be considered that the certain robbing utterance keyword may be spoken by an intra-ladder person corresponding to the human head position, and at this time, further confirmation can be performed through the subsequent step S65.
And S65, if the ratio of the video frame number meeting the fifth preset condition to the total video frame number is judged to be not less than a preset first proportional threshold, determining that a sound source corresponding to the certain robbing word keyword comes from a human body in the straight ladder, and determining that the certain robbing word keyword meets the fourth preset condition, wherein the total video frame number refers to the total number of video frames collected by a camera in the straight ladder within the starting and stopping time corresponding to the certain robbing word keyword.
In step S65, if the ratio of the number of video frames satisfying the fifth preset condition to the total number of video frames is judged to be not smaller than the first proportional threshold (e.g., 68%), it indicates that the high overlap between the human head position and the sound source of the certain robbery utterance keyword is not an accidental coincidence caused by unknown factors, and it can be concluded that the keyword was indeed spoken by the person in the elevator corresponding to that human head position.
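Steps S62 to S65 can be summarized by the sketch below: for each contemporaneous video frame, the polar angle (pole at the image center) of each detected head is compared with the polar angle of the localized sound source, and the keyword is attributed to an in-elevator person only if the fraction of matching frames reaches the first proportional threshold. All names, the simple wrap-around angle handling, and the example angle threshold are illustrative assumptions.

```python
import math

def polar_angle(x, y, cx, cy):
    """Polar angle of image point (x, y) with the image center (cx, cy) as pole."""
    return math.degrees(math.atan2(y - cy, x - cx))

def keyword_from_human(frames, source_angle, angle_thresh_deg=15.0, ratio_thresh=0.68):
    """frames: list of (head_positions, image_center) tuples for the contemporaneous
    video frames, where head_positions is a list of (x, y) head-node coordinates.
    source_angle: first polar-angle coordinate of the sound source in the image.
    Returns True if the fourth preset condition is considered satisfied."""
    matching = 0
    for head_positions, (cx, cy) in frames:
        for (hx, hy) in head_positions:
            diff = abs(polar_angle(hx, hy, cx, cy) - source_angle)
            diff = min(diff, 360.0 - diff)      # wrap-around angular difference
            if diff <= angle_thresh_deg:        # fifth preset condition for this frame
                matching += 1
                break
    return len(frames) > 0 and matching / len(frames) >= ratio_thresh
```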
And S7, if the number of video frames meeting the first preset condition in the video data is not less than a preset frame number threshold on the premise of meeting the second preset condition and/or at least one robbing words keyword meets the fourth preset condition on the premise of meeting the third preset condition, determining that an article robbing event in the vertical elevator occurs and sending an abnormal behavior reminding signal to an elevator monitoring background.
Thus, the method for detecting and warning of article robbery in a straight elevator described in steps S1 to S7 provides a scheme for determining an in-elevator article robbery event on the basis of audio and video data. After the video data collected by the camera in the straight elevator and the audio data collected by the sound pickup in the straight elevator in the target monitoring period are obtained, it is judged, based on the image processing result of the video data, whether the determination condition of an in-elevator article robbery event is satisfied, and it is also judged, based on the robbery utterance keyword recognition result and the sound source localization result of the audio data, whether the determination condition of an in-elevator article robbery event is satisfied; when either condition is satisfied, it is determined that an in-elevator article robbery event has occurred and an abnormal behavior reminding signal is sent to the elevator monitoring background. Therefore, on the basis of behavior recognition, mechanisms such as keyword judgment on the robbery utterances of an article robbery event and comprehensive judgment of the sound source position (to prevent interference from the advertising sound source in the elevator) are added, which can greatly improve the accuracy of the result judgment of the article robbery event. In addition, because the camera is installed at the rear side of the elevator car at a fixed angle, the collected video frames contain complete human body parts, and an action consists of consecutive action video frames from the camera's angle of view; the detection and early-warning processing for in-elevator article robbery is carried out directly after the audio and video data are obtained, without back-and-forth information transmission with background equipment, so the processing speed can be increased. Since the algorithm required for detecting the article robbery event is loaded on the computer equipment at the straight elevator side, the hardware can be reduced from multiple original hardware units to one camera (with a sound capturing function) and the equipment externally connected to the camera, which reduces the manufacturing cost of the system and the difficulty of on-site deployment, and facilitates practical application and popularization.
On the basis of the technical solution of the first aspect, the present embodiment further specifically provides a possible design for how to obtain the audio and video data in real time, that is, obtain the video data collected by the camera in the straight elevator and the audio data collected by the sound pickup in the straight elevator in the target monitoring period, as shown in fig. 9, including but not limited to the following steps S11 to S15.
S11, after a real-time video frame collected by the camera in the straight elevator is acquired, importing the frame image of the real-time video frame into a trained article recognition model based on a target detection algorithm and outputting an article recognition result, wherein the camera in the straight elevator is fixedly installed inside the car of the straight elevator and faces the straight elevator door, so that the lens field of view fixedly covers the area inside the car and the area of the straight elevator door.
In step S11, the detailed description of the item identification model can refer to step S31, which is not repeated herein.
And S12, if the article identification result comprises at least one robbable article detection frame, determining that the robbable article exists in the straight ladder, importing the frame image of the real-time video frame into a trained human body identification model based on a target detection algorithm, and outputting to obtain a human body identification result.
In step S12, the description of the human body recognition model is similar to that of the article recognition model. Before step S12, the sample collection and training process for the human body recognition model may include, but is not limited to, the following: (1) acquiring in-elevator video data DT1-3 collected by the camera in the straight elevator within another specified time period (default one month, modifiable); (2) obtaining from DT1-3 the video data DT2-3 containing elevator users and performing human body labeling to form a human body data set E; (3) based on the human body data set E, performing sample-balanced model training with the YOLOv5 target detection algorithm to obtain the human body recognition model. In addition, considering that the training process of the recognition model consumes a large amount of computing resources, the human body recognition model is preferably trained on other computer equipment and then deployed on the computer device as part of the AI detection algorithm.
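For illustration only: assuming the publicly available ultralytics/yolov5 implementation is used (the filing only names the YOLOv5 target detection algorithm), running a trained human body recognition model on a frame image could look roughly like this; the weight file name is an assumption.

```python
import torch

# Load a YOLOv5 model with custom weights trained on the human body data set E
# (the weight path is illustrative).
model = torch.hub.load('ultralytics/yolov5', 'custom', path='human_body_E.pt')

def detect_humans(frame_image):
    """Return the detection boxes (x1, y1, x2, y2, confidence, class) for one frame image."""
    results = model(frame_image)          # frame_image: numpy array or image path
    return results.xyxy[0]                # detections for the single input image
```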
And S13, if the human body identification result comprises two human body detection frames, determining that two human bodies exist in the vertical ladder, and then judging whether the vertical ladder door is in an open state or not through image identification processing according to the frame image of the real-time video frame.
In step S13, specifically, it is determined whether the vertical door is in the open state by image recognition processing based on the frame image of the real-time video frame, including but not limited to any one of the following modes (a) to (B).
(A) When the straight ladder door is a split door and a label is preset on the inner surfaces of the two split door leaves, firstly, a frame image of the real-time video frame is imported into a label recognition model which is trained and based on a target detection algorithm, a label recognition result is output, then, the center distance of the two label detection frames is calculated according to the two label detection frames in the label recognition result, and finally, if the center distance is judged to be not smaller than a preset second distance threshold value, the straight ladder door is determined to be in an opening state.
In the above mode (A), as shown in fig. 1, one label 3 is provided in advance on the inner surface of each of the two split door leaves, and the label 3 may be, but is not limited to, a label such as a certificate. The description of the label recognition model is similar to that of the article recognition model; before step S13, the sample collection and training process for the label recognition model may include, but is not limited to, the following: (1) acquiring in-elevator video data DT1-3 collected by the camera in the straight elevator within another specified time period (default one month, modifiable); (2) acquiring, from the DT1-3 data, videos of the labels on the inner side of the straight elevator car door (i.e., labels such as certificates on the upper part of the inner door near the two sides of the door gap) and labeling them to form a label data set F; (3) based on the label data set F, performing sample-balanced model training with the YOLOv5 target detection algorithm to obtain the label recognition model. Considering that an in-elevator article robbery event mainly occurs in periods when the straight elevator door is open enough for a robbery suspect to escape quickly, including the period when the door is fully open, the periods when the door is opening or closing, and the period when the opening/closing door is held by one hand, the second distance threshold may be determined as follows: according to collected historical robbery videos, record the straight elevator door-gap distances at which a robber cannot normally escape during door opening/closing, and finally take the minimum recorded door-gap distance as the second distance threshold. In addition, considering that the training process of the recognition model consumes a large amount of computing resources, the label recognition model is preferably trained on other computer equipment and then deployed on the computer device as part of the AI detection algorithm.
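Mode (A) reduces to a simple distance test between the centers of the two label detection boxes; the sketch below assumes boxes in (x1, y1, x2, y2) form, and the names are illustrative.

```python
def box_center(box):
    """Center of a detection box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def door_open_by_labels(label_box_a, label_box_b, second_distance_threshold):
    """Mode (A): the straight elevator door is considered open when the distance
    between the two label centers is not smaller than the second distance threshold."""
    (ax, ay), (bx, by) = box_center(label_box_a), box_center(label_box_b)
    center_distance = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
    return center_distance >= second_distance_threshold
```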
(B) According to a background frame collected in advance by the camera in the straight elevator when the straight elevator door is fully closed, frame difference processing in the straight elevator door-frame area is performed on the frame image of the real-time video frame to obtain an intra-frame difference image; discrete point removal processing and an erosion operation are then performed on the intra-frame difference image to obtain a new frame difference image; convex hull processing is then performed on the pixel points in the new frame difference image whose frame-difference absolute values are not smaller than a preset frame-difference threshold to obtain a convex hull area; the total pixel amount of all pixel points in the convex hull area is then counted; and finally, if the total pixel amount is judged to be not smaller than a preset number threshold, the straight elevator door is determined to be in the open state.
In the above mode (B), the frame difference processing, the discrete point removal processing, the erosion operation and the convex hull processing are all conventional graphics processing operations. Because the convex hull processing is performed on the pixel points in the new frame difference image whose frame-difference absolute values are not smaller than the frame-difference threshold, the convex hull area can reflect the opening area of the straight elevator door, and whether the straight elevator door is in the open state can then be determined from the comparison between the total pixel amount and the number threshold. In addition, the number threshold may be determined in a manner analogous to the second distance threshold.
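Mode (B) can be sketched with standard OpenCV operations as follows; the kernel sizes, the use of median filtering for discrete-point removal, and the thresholds are illustrative assumptions rather than values from the filing.

```python
import cv2
import numpy as np

def door_open_by_frame_difference(frame, background, door_roi,
                                  frame_diff_threshold=30, pixel_count_threshold=5000):
    """Mode (B): compare the door-frame region of the current frame with the
    background frame captured when the door was fully closed (BGR images assumed)."""
    x, y, w, h = door_roi                                  # door-frame region in the image
    diff = cv2.absdiff(frame[y:y+h, x:x+w], background[y:y+h, x:x+w])
    diff = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    diff = cv2.medianBlur(diff, 5)                         # remove isolated (discrete) points
    diff = cv2.erode(diff, np.ones((3, 3), np.uint8))      # erosion operation
    mask = (diff >= frame_diff_threshold).astype(np.uint8)
    points = cv2.findNonZero(mask)
    if points is None:
        return False
    hull = cv2.convexHull(points)                          # convex hull of the changed pixels
    total_pixels = cv2.contourArea(hull)                   # approximate pixel amount inside the hull
    return total_pixels >= pixel_count_threshold
```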
And S14, if the straight ladder door is judged to be in the opening state, determining the real-time video frame as a target video frame.
In step S14, in order to further precisely lock the timing at which the vertical ladder article robbery detection early warning needs to be performed, the real-time video frame is determined as the target video frame, including but not limited to any one of the following manners (C) to (E) or any combination thereof.
(C) When the straight elevator door is a split door and a label is preset on the inner surface of each of the two split door leaves, first import the frame image of the real-time video frame into a trained label recognition model based on a target detection algorithm and output a label recognition result, then calculate the center distance of the two label detection frames in the label recognition result, and finally, if the center distance is judged to be greater than the center distance of the two label detection frames corresponding to the previous video frame, determine that the straight elevator door is in an opening state and determine the real-time video frame as a target video frame; or, according to a background frame collected in advance by the camera in the straight elevator when the straight elevator door is fully closed, perform frame difference processing in the straight elevator door-frame area on the frame image of the real-time video frame to obtain an intra-frame difference image, then perform discrete point removal processing and an erosion operation on the intra-frame difference image to obtain a new frame difference image, then perform convex hull processing on the pixel points in the new frame difference image whose frame-difference absolute values are not smaller than a preset frame-difference threshold to obtain a convex hull area, then count the total pixel amount of all pixel points in the convex hull area, and finally, if the total pixel amount is judged to be greater than the total pixel amount in the convex hull area corresponding to the previous video frame, determine that the straight elevator door is in an opening state and determine the real-time video frame as a target video frame.
(D) Judging whether the closest distance between the two human body detection frames is smaller than a preset third distance threshold value or not, and if so, determining the real-time video frame as a target video frame; or judging whether the two human body detection frames have an intersection region, and if so, determining the real-time video frame as a target video frame.
(E) Judging whether the at least one robbable article detection frame has the robbable article detection frame meeting the following conditions: the center of the robbable article detection frame is positioned between the two human body detection frames or in the intersection area of the two human body detection frames, and if yes, the real-time video frame is determined as a target video frame.
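Modes (D) and (E) reduce to geometric tests on the two human body detection boxes and the robbable-article detection box. The sketch below assumes (x1, y1, x2, y2) boxes, and the helper names and the concrete interpretation of "positioned between the two human body detection frames" are illustrative assumptions.

```python
def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def boxes_intersect(a, b):
    """True if two (x1, y1, x2, y2) boxes overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def closest_box_distance(a, b):
    """Smallest gap between two boxes (0 if they touch or overlap)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return (dx ** 2 + dy ** 2) ** 0.5

def is_target_frame(human_a, human_b, item_box, third_distance_threshold):
    # Mode (D): the two human body boxes are very close to each other or intersect.
    if closest_box_distance(human_a, human_b) < third_distance_threshold \
            or boxes_intersect(human_a, human_b):
        return True
    # Mode (E): the center of the robbable-article box lies between the two human
    # body boxes (interpreted here as horizontally between them) or inside their
    # intersection region.
    cx, cy = box_center(item_box)
    left_box, right_box = sorted([human_a, human_b], key=lambda b: b[0])
    between = left_box[2] <= cx <= right_box[0]
    ix1, iy1 = max(human_a[0], human_b[0]), max(human_a[1], human_b[1])
    ix2, iy2 = min(human_a[2], human_b[2]), min(human_a[3], human_b[3])
    in_intersection = ix1 <= cx <= ix2 and iy1 <= cy <= iy2
    return between or in_intersection
```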
S15, acquiring, within the time period from time t1-τ to time t1+τ, the video data collected by the camera in the straight elevator and the audio data collected by the sound pickup in the straight elevator, wherein t1 represents the acquisition time corresponding to the target video frame, τ represents a preset specified duration, and the sound pickup in the straight elevator is fixedly installed inside the straight elevator car.
Therefore, based on the possible design I described in the foregoing steps S11-S15, the timing at which the robbery detection and early warning of the articles in the vertical ladder is required can be accurately locked, and the demand on computing resources is further reduced.
On the basis of the technical solution of the first aspect, the second possible design of customizing the robbery record data according to different situations is further specifically provided in this embodiment, that is, after sending the abnormal behavior alert signal to the elevator monitoring background, as shown in fig. 10, the method further includes, but is not limited to, the following steps S81 to S84.
S81, continuing to acquire new video frames collected by the camera in the straight elevator until fewer than two human bodies are found in the straight elevator through image recognition processing, and then judging whether the total duration of the keyword occurrence period is equal to zero, wherein the keyword occurrence period refers to the start-stop time periods of all robbery utterance keywords that satisfy the fourth preset condition.
In step S81, the specific ways of finding less than two human bodies in the straight ladder through the image recognition process include, but are not limited to: importing the frame image of each acquired new video frame into a trained human body recognition model based on a target detection algorithm, and outputting to obtain a human body recognition result of each new video frame; and if the human body recognition result corresponding to a certain new video frame comprises one or zero human body detection frames, considering that less than two human bodies exist in the straight ladder.
S82, if the total duration of the keyword occurrence period is judged to be equal to zero, sending first robbery record data to the elevator monitoring background, and otherwise judging whether the duration ratio T_same,12/T_same is not less than a preset second proportion threshold, wherein the first robbery record data includes but is not limited to the video data, the audio data, the action presenting postures of the at least one group of robbery action presenting nodes and/or the video frame image set meeting the first preset condition, T_same,12 represents the total intersection duration of the keyword occurrence period and a delay period, the delay period is the period from time t1 to time t2, t2 represents the acquisition time corresponding to the current new video frame, the current new video frame is the new video frame in which fewer than two human bodies are found in the straight elevator through image recognition processing, and T_same represents the total duration of the keyword occurrence period.
In step S82, if it is determined that the total duration of the keyword occurrence period is equal to zero, it means that the article robbery event in the straight elevator was determined only by the fact that the number of video frames in the video data satisfying the first preset condition is not less than the preset frame-number threshold on the premise that the second preset condition is satisfied, and the event is reconfirmed by the finding that fewer than two human bodies remain in the straight elevator; otherwise, the event may have been determined because at least one robbery utterance keyword satisfies the fourth preset condition on the premise that the third preset condition is satisfied, and the event is likewise reconfirmed by the finding that fewer than two human bodies remain in the straight elevator.
S83, if the duration ratio T_same,12/T_same is judged to be not less than the second proportion threshold, sending second robbery record data to the elevator monitoring background; otherwise, judging whether the video data satisfies the third preset condition, wherein the second robbery record data includes but is not limited to the video data, the audio data and/or the audio and video data collected in the keyword occurrence period.
In step S83, if the duration ratio T_same,12/T_same is judged to be not less than the second proportion threshold, it means that the occurrence of the article robbery event in the straight elevator was determined mainly by the recognition result of the robbery utterance keywords and the sound source positioning result based on the audio data, and the event is confirmed again by the finding that there are fewer than two human bodies in the straight elevator.
S84, if the video data also satisfies the second preset condition in the case where the third preset condition is satisfied, sending third robbery record data to the elevator monitoring background; otherwise, not sending the third robbery record data to the elevator monitoring background, wherein the third robbery record data includes but is not limited to the video data, the audio data, the action presenting postures of the at least one group of robbery action presenting nodes, the video frame image set meeting the first preset condition and/or the audio and video data collected in the keyword occurrence period.
In step S84, if the video data also satisfies the second preset condition in the case where the third preset condition is satisfied, it means that the occurrence of the article robbery event in the straight elevator was determined from the robbery utterance keyword recognition result and the sound source positioning result based on the audio data, and the robbery behavior was also recognized from the video, so the robbery record data that is sent should be the richest and most reliable. Otherwise, the occurrence of the event was determined from the keyword recognition result and the sound source positioning result alone while no robbery behavior was recognized, so the in-elevator article robbery event is considered a misjudgment and the robbery record data does not need to be sent to the elevator monitoring background.
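For readability, the branching of steps S81 to S84 as described above can be summarized roughly by the following sketch; the record-assembly details are omitted, the third preset condition is treated as already satisfied once keywords have been recognized, and all names and the default threshold are illustrative assumptions.

```python
def decide_robbery_record(keyword_total_duration, intersection_duration,
                          video_meets_second_condition, second_ratio_threshold=0.5):
    """Decide which robbery record data (if any) to send to the elevator monitoring
    background after fewer than two people remain in the car.

    keyword_total_duration:       T_same, total duration of the keyword occurrence period
    intersection_duration:        T_same,12, overlap of the keyword period with the delay period
    video_meets_second_condition: whether the robbery posture was recognized from the video
    """
    if keyword_total_duration == 0:                       # S82: detection was video-only
        return "first_record"
    ratio = intersection_duration / keyword_total_duration
    if ratio >= second_ratio_threshold:                   # S83: mainly audio-driven detection
        return "second_record"
    if video_meets_second_condition:                      # S84: audio detection confirmed by video
        return "third_record"
    return None                                           # likely misjudgment; send nothing
```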
Therefore, based on the second possible design described in the foregoing steps S81 to S84, different robbery record data can be customized according to different situations, and when a misjudgment of the straight elevator article robbery event is found, the transmission of the robbery record data to the elevator monitoring background is terminated.
On the basis of the technical scheme of the second possible design, the third possible design for further enriching the robbery record data is provided, that is, the first robbery record data, the second robbery record data or the third robbery record data are sent to the elevator monitoring background, and the method includes, but is not limited to, the following steps S91 to S93.
And S91, importing the frame image of the current new video frame into a floor information identification model which is trained and based on a target detection algorithm, and outputting to obtain a floor information identification result.
In step S91, the description of the floor information recognition model is similar to that of the article recognition model. Before step S91, the sample collection and training process for the floor information recognition model may include, but is not limited to, the following: (1) acquiring in-elevator video data DT1-3 collected by the camera in the straight elevator within another specified time period (default one month, modifiable); (2) obtaining from DT1-3 the video data DT2 containing the floor display, and classifying and labeling the data according to the floor information shown on the display to form a floor information data set G; (3) based on the floor information data set G, performing sample-balanced model training with the YOLOv5 target detection algorithm to obtain the floor information recognition model. In addition, considering that the training process of the recognition model consumes a large amount of computing resources, the floor information recognition model is preferably trained on other computer equipment and then deployed on the computer device as part of the AI detection algorithm.
And S92, acquiring floor information according to the floor information identification result.
And S93, transmitting the robbery record data which also carries the floor information to the elevator monitoring background.
Therefore, based on the third possible design described in the foregoing steps S91-S93, the floor information of the vertical elevator in the event of the object robbery can be automatically obtained and carried in the robbery record data to the background, which is further beneficial for the background to quickly and accurately perform rescue response.
As shown in fig. 11, a second aspect of this embodiment provides a virtual device for implementing the method for detecting and warning robbery of articles in a vertical ladder according to any one of the first aspect or the first aspect, including an audio/video data acquisition module, a human skeleton labeling module, a first condition determination module, a second condition determination module, a third condition determination module, a fourth condition determination module, and an event determination module;
the audio and video data acquisition module is used for acquiring video data collected by a camera in the straight elevator and audio data collected by a sound pickup in the straight elevator in a target monitoring time period, wherein the target monitoring time period is the period from time t1-τ to time t1+τ, t1 represents the acquisition time corresponding to a target video frame, τ represents a preset specified duration, the target video frame is a video frame collected by the camera in the straight elevator in which a robbable article and two human bodies are found in the straight elevator and the straight elevator door is found to be open through image recognition processing, the camera in the straight elevator is fixedly installed inside the straight elevator car and faces the straight elevator door with the lens field of view fixedly covering the area inside the car and the area of the straight elevator door, and the sound pickup in the straight elevator is fixedly installed inside the straight elevator car;
the human body skeleton labeling module is in communication connection with the audio and video data acquisition module and is used for extracting and processing human body joint point information according to corresponding frame images aiming at each video frame in the video data to obtain a human body skeleton labeled in the corresponding frame images, wherein the human body skeleton comprises human body nodes corresponding to left and right hand heads, left and right elbows, left and right shoulders, left and right waists, left and right knees and left and right feet heads;
the first condition determining module is in communication connection with the human body skeleton labeling module and is used for determining that the corresponding video frame meets a first preset condition if the distance from at least one head node in the corresponding frame image to the central point of the object detection frame is not greater than a preset first distance threshold value aiming at each video frame, wherein the object detection frame is a detection frame capable of robbing objects identified in the corresponding frame image;
the second condition determining module is in communication connection with the human body skeleton labeling module and is used for determining that a second preset condition is met according to the human body skeletons of the video frames if the action presenting postures of at least one group of robbery action presenting nodes belong to pre-labeled robbery postures, wherein the robbery action presenting nodes comprise human body nodes corresponding to left and right hand heads, left and right elbows, left and right shoulders, left and right waists, left and right knees and left and right foot heads;
the third condition determining module is in communication connection with the audio and video data acquiring module and is used for performing robbery utterance keyword recognition processing by utilizing a trained keyword retrieval system based on an end-to-end voice recognition technology according to the audio data, and determining that a third preset condition is met if at least one robbery utterance keyword is recognized, wherein the confidence coefficient of the robbery utterance keyword is not less than a preset confidence coefficient threshold;
the fourth condition determining module is in communication connection with the third condition determining module and is used for determining that the corresponding keyword meets a fourth preset condition if the corresponding pronunciation sound source is judged to come from a human body in the vertical ladder aiming at each robbing speech keyword in the at least one robbing speech keyword;
the event determining module is respectively in communication connection with the first condition determining module, the second condition determining module, the third condition determining module and the fourth condition determining module, and is used for determining that an article robbing event in the vertical elevator occurs and sending an abnormal behavior reminding signal to an elevator monitoring background when the number of video frames in the video data meeting the first preset condition is not less than a preset frame number threshold on the premise that the second preset condition is met, and/or when at least one robbing words keyword meets the fourth preset condition on the premise that the third preset condition is met.
The working process, working details and technical effects of the device provided in the second aspect of this embodiment may refer to any one of the first aspect or the first aspect that may be designed for the method for detecting and warning robbery of articles in a vertical ladder, which is not described herein again.
As shown in fig. 12, a third aspect of this embodiment provides a computer device for executing the method for detecting and warning robbery of an item in an elevator as may be designed in any of the first aspect or the first aspect, where the computer device includes a memory, a processor, and a transceiver, which are sequentially connected in a communication manner, where the memory is used for storing a computer program, the transceiver is used for sending and receiving information, and the processor is used for reading the computer program and executing the method for detecting and warning robbery of an item in an elevator as may be designed in any of the first aspect or the first aspect. For example, the Memory may include, but is not limited to, a Random-Access Memory (RAM), a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a First-in First-out (FIFO), and/or a First-in Last-out (FILO), and the like; the processor may be, but is not limited to, a microprocessor of the model number STM32F105 family. In addition, the computer device may also include, but is not limited to, a power module, a display screen, and other necessary components.
For the working process, working details, and technical effects of the computer device provided in the third aspect of this embodiment, reference may be made to the first aspect or any one of the first aspects that may be designed for the method for detecting and warning robbery of articles in a vertical ladder, which is not described herein again.
A fourth aspect of this embodiment provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the method for detecting and early warning of article robbery in a straight elevator according to the first aspect or any possible design of the first aspect. The computer-readable storage medium refers to a carrier for storing data, and may include, but is not limited to, a floppy disk, an optical disk, a hard disk, a flash memory, a flash disk and/or a memory stick (Memory Stick), and the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
For the working process, the working details, and the technical effects of the computer-readable storage medium provided in the fourth aspect of this embodiment, reference may be made to the first aspect or any one of the first aspects that may be designed for the method for detecting and warning robbery of articles in a vertical ladder, which is not described herein again.
A fifth aspect of the present embodiment provides a computer program product including instructions, which, when executed on a computer, cause the computer to execute the method for detecting and warning robbery of articles in a vertical ladder according to any one of the first aspect or the first aspect. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable devices.
Finally, it should be noted that the present invention is not limited to the above alternative embodiments, and that various other forms of products can be obtained by anyone in light of the present invention. The above detailed description should not be taken as limiting the scope of the invention, which is defined in the claims, and which the description is intended to be interpreted accordingly.

Claims (10)

1. A method for detecting and early warning robbery of articles in a vertical ladder is characterized by comprising the following steps:
acquiring video data collected by a camera in the straight elevator and audio data collected by a sound pickup in the straight elevator in a target monitoring time period, wherein the target monitoring time period is the period from time t1-τ to time t1+τ, t1 represents the acquisition time corresponding to a target video frame, τ represents a preset specified duration, the target video frame is a video frame which is collected by the camera in the straight elevator and in which a robbable article and two human bodies are found in the straight elevator and the straight elevator door is found to be open through image recognition processing, the camera in the straight elevator is fixedly installed inside the straight elevator car and faces the straight elevator door with the lens field of view fixedly covering the area inside the car and the area of the straight elevator door, and the sound pickup in the straight elevator is fixedly installed inside the straight elevator car;
for each video frame in the video data, extracting human body joint point information according to the corresponding frame image to obtain a human body skeleton marked in the corresponding frame image, wherein the human body skeleton comprises human body nodes corresponding to left and right hands and heads, left and right elbows, left and right shoulders, left and right waists, left and right knees and left and right feet and heads;
for each video frame, if the distance from at least one head node in the corresponding frame image to the central point of the object detection frame is judged to be not greater than a preset first distance threshold, determining that the corresponding video frame meets a first preset condition, wherein the object detection frame is a detection frame capable of robbing objects identified in the corresponding frame image;
according to the human body skeleton of each video frame, if the action presenting postures of at least one group of robbery action presenting nodes are judged to belong to the pre-labeled robbery postures, determining that a second preset condition is met, wherein the robbery action presenting nodes comprise human body nodes corresponding to left and right hand heads, left and right elbows, left and right shoulders, left and right waists, left and right knees and left and right foot heads;
performing robbery utterance keyword recognition processing by using a trained keyword retrieval system based on an end-to-end speech recognition technology according to the audio data, and determining that a third preset condition is met if at least one robbery utterance keyword is recognized, wherein the confidence coefficient of the robbery utterance keyword is not less than a preset confidence coefficient threshold;
aiming at each robbery speech keyword in the at least one robbery speech keyword, if the corresponding pronunciation sound source is judged to come from a human body in the vertical ladder, determining that the corresponding keyword meets a fourth preset condition;
if the video frame number meeting the first preset condition in the video data is not less than a preset frame number threshold on the premise of meeting the second preset condition, and/or at least one robbing words keyword meets the fourth preset condition on the premise of meeting the third preset condition, determining that an article robbing event in the vertical elevator occurs, and sending an abnormal behavior reminding signal to an elevator monitoring background.
2. The method for detecting and warning robbery of articles in the straight ladder according to claim 1, wherein for each of the at least one robbing utterance keyword, if it is determined that the corresponding pronunciation sound source is from a human body in the straight ladder, it is determined that the corresponding keyword satisfies a fourth preset condition, including:
aiming at a certain robbery utterance keyword in the at least one robbery utterance keyword, performing corresponding sound source direction estimation processing by using a trained sound source direction estimation model according to audio data in corresponding start-stop time to obtain a direction angle and an elevation angle of a corresponding sound source relative to a sound pickup in the straight ladder;
determining a first polar angle coordinate of a sound source corresponding to the certain robbing utterance keyword in a frame image of a synchronous video frame and taking an image center as a pole according to the direction angle, the elevation angle and a known position relation between the straight ladder internal camera and the straight ladder internal sound pickup, wherein the synchronous video frame is a video frame collected by the straight ladder internal camera within a starting and ending time corresponding to the certain robbing utterance keyword;
for each of the contemporaneous video frames, determining at least one human head position in the corresponding frame image;
for each of the contemporaneous video frames, if it is determined that there is a human head position in the corresponding at least one human head position that satisfies the following condition: determining that the corresponding video frame meets a fifth preset condition if the absolute difference value of a second polar angle coordinate, taking the image center as a pole, of the human head position in the corresponding frame image and the first polar angle coordinate is not greater than a preset angle threshold;
and if the ratio of the video frame number meeting the fifth preset condition to the total video frame number is judged to be not less than a preset first proportional threshold, determining that a sound source corresponding to the certain robbing words keyword comes from a human body in the straight ladder, and determining that the certain robbing words keyword meets a fourth preset condition, wherein the total video frame number refers to the total number of video frames collected by a camera in the straight ladder within the starting and ending time corresponding to the certain robbing words keyword.
3. The method for detecting and warning robbery of goods inside a straight elevator according to claim 1, wherein the acquiring of video data collected by a camera inside the straight elevator and audio data collected by a sound pickup inside the straight elevator in the target monitoring period comprises:
after a real-time video frame acquired by a camera in the straight elevator is acquired, importing a frame image of the real-time video frame into an article identification model which is trained and based on a target detection algorithm, and outputting to obtain an article identification result, wherein the camera in the straight elevator is fixedly arranged in a car of the straight elevator and faces towards a straight elevator door, and a camera view fixedly covers an inner area of the car and an area of the straight elevator door;
if the article identification result comprises at least one robbable article detection frame, determining that the robbable article exists in the straight ladder, then importing the frame image of the real-time video frame into a trained human body identification model based on a target detection algorithm, and outputting to obtain a human body identification result;
if the human body identification result comprises two human body detection frames, determining that two human bodies exist in the straight ladder, and then judging whether the straight ladder door is in an open state or not through image identification processing according to the frame image of the real-time video frame;
if the straight ladder door is judged to be in the opening state, determining the real-time video frame as a target video frame;
acquiring, within the time period from time t1-τ to time t1+τ, the video data collected by the camera in the straight elevator and the audio data collected by the sound pickup in the straight elevator, wherein t1 represents the acquisition time corresponding to the target video frame, τ represents a preset specified duration, and the sound pickup in the straight elevator is fixedly installed inside the straight elevator car.
4. The method for detecting and warning robbery of goods inside the vertical ladder according to claim 3, wherein the steps of judging whether the vertical ladder door is in an open state or not through image recognition processing according to the frame image of the real-time video frame include any one of the following modes (A) to (B):
(A) when the straight ladder door is a split door and a label is respectively preset on the inner surfaces of the two split door leaves, firstly, a frame image of the real-time video frame is imported into a label recognition model which is trained and based on a target detection algorithm, a label recognition result is output, then, the center distance of the two label detection frames is calculated according to the two label detection frames in the label recognition result, and finally, if the center distance is judged to be not smaller than a preset second distance threshold value, the straight ladder door is determined to be in an opening state;
(B) according to a background frame collected by a camera in the straight elevator when the straight elevator door is completely closed in advance, frame difference processing in a straight elevator door frame area is carried out on a frame image of the real-time video frame to obtain an intra-frame difference image, then discrete point removing processing and erosion operation processing are carried out on the intra-frame difference image to obtain a new frame difference image, then convex hull processing is carried out on pixel points of which the frame difference absolute values in the new frame difference image are not smaller than a preset frame difference threshold value to obtain a convex hull area, then the total pixel amount of all the pixel points in the convex hull area is counted, and finally if the total pixel amount is judged to be not smaller than a preset number threshold value, the straight elevator door is determined to be in an open state.
5. The method for detecting and early warning robbery of articles in the vertical elevator according to claim 3, wherein determining the real-time video frame as a target video frame comprises any one or any combination of the following modes (C) to (E):
(C) when the elevator door is a center-opening door and a label is preset on the inner surface of each of its two door leaves, first importing the frame image of the real-time video frame into a trained label identification model based on a target detection algorithm and outputting a label identification result, then calculating the center distance between the two label detection frames in the label identification result, and finally, if the center distance is judged to be greater than the center distance between the two label detection frames corresponding to the previous video frame, determining that the elevator door is in an open state and determining the real-time video frame as a target video frame;
or, performing, against a background frame collected in advance by the in-elevator camera with the elevator door fully closed, frame-difference processing within the door-frame region of the frame image of the real-time video frame to obtain a frame-difference image, then performing discrete-point removal and erosion operations on the frame-difference image to obtain a new frame-difference image, then applying convex-hull processing to the pixels whose absolute frame-difference values in the new frame-difference image are not less than a preset frame-difference threshold to obtain a convex-hull region, then counting the total number of pixels within the convex-hull region, and finally, if the total pixel count is judged to be greater than the total pixel count within the convex-hull region corresponding to the previous video frame, determining that the elevator door is in an open state and determining the real-time video frame as a target video frame;
(D) judging whether the closest distance between the two human-body detection frames is less than a preset third distance threshold, and if so, determining the real-time video frame as a target video frame (this box-geometry test is sketched after this claim);
or judging whether the two human-body detection frames have an intersection region, and if so, determining the real-time video frame as a target video frame;
(E) judging whether, among the at least one robbable-article detection frame, there is a robbable-article detection frame whose center lies between the two human-body detection frames or within the intersection region of the two human-body detection frames, and if so, determining the real-time video frame as a target video frame.
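The geometric tests in modes (D) and (E) reduce to simple rectangle arithmetic. A sketch under the assumption that boxes are (x1, y1, x2, y2) tuples and that "between the two human boxes" is read horizontally:

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def nearest_distance(a: Box, b: Box) -> float:
    """Closest distance between two axis-aligned boxes (0 if they touch or overlap) -- mode (D)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0.0)
    dy = max(b[1] - a[3], a[1] - b[3], 0.0)
    return (dx * dx + dy * dy) ** 0.5

def intersection(a: Box, b: Box) -> Optional[Box]:
    """Intersection rectangle of two boxes, or None if they do not overlap -- mode (D)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

def center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def article_between_humans(article: Box, h1: Box, h2: Box) -> bool:
    """Mode (E): article-box center lies in the humans' intersection or between their centers."""
    cx, cy = center(article)
    inter = intersection(h1, h2)
    if inter and inter[0] <= cx <= inter[2] and inter[1] <= cy <= inter[3]:
        return True
    c1x, _ = center(h1)
    c2x, _ = center(h2)
    return min(c1x, c2x) <= cx <= max(c1x, c2x)  # "between" interpreted horizontally (assumption)
```

With detected human boxes h1, h2 and an article box a, mode (D) holds when nearest_distance(h1, h2) is below the third distance threshold or intersection(h1, h2) is not None, and mode (E) holds when article_between_humans(a, h1, h2) returns True.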
6. The method for detecting and early warning robbery of articles in the vertical elevator according to claim 1, wherein after sending the abnormal-behavior reminding signal to the elevator monitoring background, the method further comprises:
continuously acquiring new video frames captured by the in-elevator camera until, through image recognition processing, fewer than two human bodies are found in the elevator, and then judging whether the total duration of the keyword occurrence period equals zero, wherein the keyword occurrence period refers to the start and end times of all the robbery-utterance keywords meeting the fourth preset condition;
if the total duration of the keyword occurrence period is judged to equal zero, sending first robbery record data to the elevator monitoring background; otherwise, judging whether the duration ratio T_same,12/T_same is not less than a preset second proportion threshold (this escalation logic is sketched after this claim), wherein the first robbery record data comprise the video data, the audio data, the action presentation postures of the at least one group of robbery-action presentation nodes and/or the set of video frame images meeting the first preset condition, T_same,12 denotes the total duration of the intersection of the keyword occurrence period with a delay period, the delay period being the period from time t1 to time t2, t2 denotes the acquisition time corresponding to the current new video frame, the current new video frame being the new video frame in which fewer than two human bodies are found in the elevator through image recognition processing, and T_same denotes the total duration of the keyword occurrence period;
if the duration ratio T_same,12/T_same is judged to be not less than the second proportion threshold, sending second robbery record data to the elevator monitoring background; otherwise, judging whether the video data meet the second preset condition in the case that the audio data meet the third preset condition, wherein the second robbery record data comprise the video data, the audio data and/or the audio-video data collected during the keyword occurrence period;
and if the video data meet the second preset condition in the case that the audio data meet the third preset condition, sending third robbery record data to the elevator monitoring background; otherwise, sending no robbery record data to the elevator monitoring background, wherein the third robbery record data comprise the video data, the audio data, the action presentation postures of the at least one group of robbery-action presentation nodes, the set of video frame images meeting the first preset condition and/or the audio-video data collected during the keyword occurrence period.
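The escalation logic of claim 6 can be sketched as below, assuming each keyword occurrence is available as a (start, end) time pair; the record payloads are only named here, not assembled:

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) of one keyword occurrence

def total_duration(periods: List[Interval]) -> float:
    """Total duration of the keyword occurrence period, T_same."""
    return sum(end - start for start, end in periods)

def overlap_with_delay(periods: List[Interval], t1: float, t2: float) -> float:
    """Total overlap of the keyword occurrences with the delay period [t1, t2], T_same,12."""
    return sum(max(0.0, min(end, t2) - max(start, t1)) for start, end in periods)

def choose_record(periods: List[Interval], t1: float, t2: float,
                  second_ratio_thresh: float,
                  second_cond_met_given_third: bool) -> str:
    """Return which robbery record to send to the monitoring background ('none' if nothing)."""
    t_same = total_duration(periods)
    if t_same == 0.0:
        return "first_robbery_record"
    t_same_12 = overlap_with_delay(periods, t1, t2)
    if t_same_12 / t_same >= second_ratio_thresh:
        return "second_robbery_record"
    if second_cond_met_given_third:
        return "third_robbery_record"
    return "none"
```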
7. The method for detecting and early warning robbery of articles in the vertical elevator according to claim 6, wherein sending the first robbery record data, the second robbery record data or the third robbery record data to the elevator monitoring background comprises:
importing the frame image of the current new video frame into a trained floor-information identification model based on a target detection algorithm and outputting a floor-information identification result;
acquiring floor information according to the floor information identification result;
and sending a robbery record message carrying the floor information to the elevator monitoring background.
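A small sketch of the floor-annotated message of claim 7; the detection format and message fields are assumptions made purely for illustration:

```python
def build_robbery_message(record_kind, floor_detections, payload):
    """Attach the recognized floor to a robbery record message for the monitoring background.

    floor_detections: output of the floor-information identification model, assumed here to be
    a list of (label, confidence, box) tuples purely for illustration.
    """
    floor = None
    if floor_detections:
        label, _conf, _box = max(floor_detections, key=lambda d: d[1])  # most confident detection
        floor = label
    return {"type": record_kind, "floor": floor, "payload": payload}
```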
8. A device for detecting and early warning robbery of articles in a vertical elevator, characterized by comprising an audio-video data acquisition module, a human-skeleton labeling module, a first condition determination module, a second condition determination module, a third condition determination module, a fourth condition determination module and an event determination module;
the audio-video data acquisition module is used for acquiring video data collected by an in-elevator camera and audio data collected by an in-elevator sound pickup within a target monitoring time period, wherein the target monitoring time period is the period from t1 - τ to t1 + τ, t1 denotes the acquisition time corresponding to a target video frame, τ denotes a preset specified duration, the target video frame is a video frame that is captured by the in-elevator camera and in which, through image recognition processing, a robbable article and two human bodies are found in the elevator and the elevator door is found to be open, the in-elevator camera is fixedly mounted inside the elevator car and oriented so that its field of view permanently covers the car interior and the elevator door area, and the in-elevator sound pickup is fixedly mounted inside the elevator car;
the human-skeleton labeling module is communicatively connected with the audio-video data acquisition module and is used for extracting human-body joint-point information from the corresponding frame image of each video frame in the video data to obtain a human skeleton labeled in that frame image, wherein the human skeleton comprises the human-body nodes corresponding to the left and right hands, left and right elbows, left and right shoulders, left and right waist points, left and right knees and left and right feet;
the first condition determination module is communicatively connected with the human-skeleton labeling module and is used for determining, for each video frame, that the video frame meets a first preset condition if the distance from at least one hand node in the corresponding frame image to the center point of the article detection frame is not greater than a preset first distance threshold (this check is sketched after this claim), wherein the article detection frame is the robbable-article detection frame identified in the corresponding frame image;
the second condition determination module is communicatively connected with the human-skeleton labeling module and is used for determining, from the human skeletons of the video frames, that a second preset condition is met if the action presentation postures of at least one group of robbery-action presentation nodes belong to pre-labeled robbery postures, wherein the robbery-action presentation nodes comprise the human-body nodes corresponding to the left and right hands, left and right elbows, left and right shoulders, left and right waist points, left and right knees and left and right feet;
the third condition determination module is communicatively connected with the audio-video data acquisition module and is used for performing robbery-utterance keyword recognition on the audio data with a trained keyword retrieval system based on end-to-end speech recognition, and determining that a third preset condition is met if at least one robbery-utterance keyword is recognized, the confidence of the robbery-utterance keyword being not less than a preset confidence threshold;
the fourth condition determination module is communicatively connected with the third condition determination module and is used for determining, for each of the at least one robbery-utterance keyword, that the keyword meets a fourth preset condition if its pronunciation sound source is judged to come from a human body inside the elevator;
the event determination module is communicatively connected with the first, second, third and fourth condition determination modules respectively, and is used for determining that an in-elevator article robbery event has occurred and sending an abnormal-behavior reminding signal to an elevator monitoring background when, on the premise that the second preset condition is met, the number of video frames in the video data meeting the first preset condition is not less than a preset frame-count threshold, and/or when, on the premise that the third preset condition is met, at least one robbery-utterance keyword meets the fourth preset condition.
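Finally, a compact sketch of how the first condition and the event-determination rule of claim 8 might be combined; the per-frame hand nodes, condition flags and thresholds are assumed to be supplied by the upstream modules, and all names are illustrative:

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]

def frame_meets_first_condition(hand_nodes: Sequence[Point], article_box: Box,
                                first_distance_thresh: float) -> bool:
    """First preset condition: some hand node lies within the threshold of the article-box center."""
    cx = (article_box[0] + article_box[2]) / 2.0
    cy = (article_box[1] + article_box[3]) / 2.0
    return any(((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 <= first_distance_thresh
               for x, y in hand_nodes)

def robbery_event_detected(frames_meeting_first: int, frame_count_thresh: int,
                           second_condition_met: bool,
                           keywords_meeting_fourth: int, third_condition_met: bool) -> bool:
    """Event rule: (robbery pose AND enough close-contact frames) OR
    (keyword recognized AND at least one keyword traced to an in-car sound source)."""
    video_branch = second_condition_met and frames_meeting_first >= frame_count_thresh
    audio_branch = third_condition_met and keywords_meeting_fourth >= 1
    return video_branch or audio_branch
```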
9. A computer device, characterized by comprising a memory, a processor and a transceiver that are communicatively connected in sequence, wherein the memory is used for storing a computer program, the transceiver is used for receiving and sending messages, and the processor is used for reading the computer program and executing the method for detecting and early warning robbery of articles in the vertical elevator according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions which, when run on a computer, perform the method for detecting and early warning robbery of articles in the vertical elevator according to any one of claims 1 to 7.
CN202210345959.3A 2022-03-31 2022-03-31 Method and device for detecting and early warning robbery of articles in vertical ladder and computer equipment Pending CN114694254A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210345959.3A CN114694254A (en) 2022-03-31 2022-03-31 Method and device for detecting and early warning robbery of articles in vertical ladder and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210345959.3A CN114694254A (en) 2022-03-31 2022-03-31 Method and device for detecting and early warning robbery of articles in vertical ladder and computer equipment

Publications (1)

Publication Number Publication Date
CN114694254A true CN114694254A (en) 2022-07-01

Family

ID=82140803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210345959.3A Pending CN114694254A (en) 2022-03-31 2022-03-31 Method and device for detecting and early warning robbery of articles in vertical ladder and computer equipment

Country Status (1)

Country Link
CN (1) CN114694254A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116384879A (en) * 2023-04-07 2023-07-04 豪越科技有限公司 Intelligent management system for rapid warehouse-in and warehouse-out of fire-fighting equipment
CN116384879B (en) * 2023-04-07 2023-11-21 豪越科技有限公司 Intelligent management system for rapid warehouse-in and warehouse-out of fire-fighting equipment

Similar Documents

Publication Publication Date Title
Li et al. Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning.
CN110826466B (en) Emotion recognition method, device and storage medium based on LSTM audio-video fusion
Oliver et al. Layered representations for human activity recognition
Caridakis et al. Multimodal emotion recognition from expressive faces, body gestures and speech
Chibelushi et al. A review of speech-based bimodal recognition
CN107928673B (en) Audio signal processing method, audio signal processing apparatus, storage medium, and computer device
CN112053690B (en) Cross-mode multi-feature fusion audio/video voice recognition method and system
Oliver et al. A comparison of hmms and dynamic bayesian networks for recognizing office activities
Dong et al. A hierarchical depression detection model based on vocal and emotional cues
Zhang et al. A vision-based sign language recognition system using tied-mixture density HMM
EP1671277A1 (en) System and method for audio-visual content synthesis
Sharma et al. D-FES: Deep facial expression recognition system
CN111161715A (en) Specific sound event retrieval and positioning method based on sequence classification
CN114186069B (en) Depth video understanding knowledge graph construction method based on multi-mode different-composition attention network
Ponce-López et al. Multi-modal social signal analysis for predicting agreement in conversation settings
Ben-Youssef et al. Early detection of user engagement breakdown in spontaneous human-humanoid interaction
CN112329438A (en) Automatic lie detection method and system based on domain confrontation training
Tamamori et al. An investigation of recurrent neural network for daily activity recognition using multi-modal signals
Rothkrantz Lip-reading by surveillance cameras
CN114694254A (en) Method and device for detecting and early warning robbery of articles in vertical ladder and computer equipment
CN114400004A (en) On-site service monitoring method based on intelligent voice and video behavior recognition technology
Oliver et al. Selective perception policies for guiding sensing and computation in multimodal systems: A comparative analysis
CN106992000B (en) Prediction-based multi-feature fusion old people voice emotion recognition method
Shashidhar et al. Audio visual speech recognition using feed forward neural network architecture
Asadiabadi et al. Multimodal speech driven facial shape animation using deep neural networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230117

Address after: 610000 High-tech Zone, Chengdu City, Sichuan Province, No. 99, No. 1, No. 2, No. 15, No. 1, No. 1505, No. 1, No. 1, No. 1, No. 1, No. 1, No. 1, No. 1, No. 1, No. 1, No

Applicant after: CHENGDU XINCHAO MEDIA GROUP Co.,Ltd.

Address before: 610000 High-tech Zone, Chengdu City, Sichuan Province, No. 99, No. 1, No. 2, No. 15, No. 1, No. 1505, No. 1, No. 1, No. 1, No. 1, No. 1, No. 1, No. 1, No. 1, No. 1, No

Applicant before: CHENGDU XINCHAO MEDIA GROUP Co.,Ltd.

Applicant before: Chengdu Baixin Zhilian Technology Co.,Ltd.