CN113591661A - Call-making behavior prediction method and system - Google Patents

Call-making behavior prediction method and system

Info

Publication number
CN113591661A
CN113591661A
Authority
CN
China
Prior art keywords
behavior
key
coordinates
key point
video image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110840176.8A
Other languages
Chinese (zh)
Inventor
宋梦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Teamway Electric Co ltd
Original Assignee
Shenzhen Teamway Electric Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Teamway Electric Co ltd filed Critical Shenzhen Teamway Electric Co ltd
Priority to CN202110840176.8A priority Critical patent/CN113591661A/en
Publication of CN113591661A publication Critical patent/CN113591661A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The application discloses a method and a system for predicting a call-making behavior, which relate to the technical field of image recognition and comprise the following steps: defining a calling behavior type, the calling behavior type comprising a hand-at-left-ear class, a hand-at-right-ear class, and a hand-at-nose class; acquiring a sample data set of the calling behavior based on the calling behavior type; analyzing the human skeleton coordinates in the sample data set through a training model, and defining behavior key points, wherein the behavior key points comprise trunk key points and hand key points; respectively acquiring a key point coordinate list of each behavior key point based on the sample data set; calculating and acquiring a key point reference coordinate of the calling behavior based on the key point coordinate list; and acquiring a video image to be tested, and predicting a calling behavior in the video image to be tested based on the key point reference coordinates and the training model. The method and the system have the advantage that the call-making behavior can be predicted or judged.

Description

Call-making behavior prediction method and system
Technical Field
The application relates to the technical field of image recognition, in particular to a method and a system for predicting a call-making behavior.
Background
In recent years, with the development of smart technology, smart phones are used more and more frequently in various occasions, while making calls is prohibited in some public places and working situations, such as at gas stations or while driving a car. A system for detecting call-making behavior is therefore of great significance for personnel safety and for preventing call-making events in specific occasions. In the related art, multi-order model fusion is adopted, where the multi-order model includes a vehicle window detection model, a face detection model, a driver positioning detection module, a driver hand detection module, and the like, and the calling behavior of a driver in the fixed scene of a cab is detected.
With respect to the above related art, the inventors consider that the following drawbacks exist: because specific detection models are adopted, the approach is difficult to apply to other occasions where calling behavior is prohibited, and it can only detect a calling behavior rather than predict one.
Disclosure of Invention
In order to overcome the defect that a calling behavior detection system is difficult to predict calling behaviors, the application provides a calling behavior prediction method and a calling behavior prediction system.
In a first aspect, the present application provides a method for predicting a call-making behavior, which specifically includes the following steps:
defining a calling behavior type, the calling behavior type comprising a hand-at-left-ear class, a hand-at-right-ear class, and a hand-at-nose class;
obtaining a sample data set based on the calling behavior type, wherein the sample data set comprises a plurality of sampling video images containing calling behaviors;
analyzing the human skeleton coordinates in the sample data set through a training model, and defining behavior key points, wherein the behavior key points comprise trunk key points and hand key points;
respectively acquiring a key point coordinate list of each behavior key point based on the sample data set;
calculating and acquiring a key point reference coordinate of the calling behavior based on the key point coordinate list;
and acquiring a video image to be tested, and judging or predicting a calling behavior in the video image to be tested based on the key point reference coordinates and the training model.
By adopting the technical scheme, the calling behavior type is defined according to the behavior characteristics of the public in the big data when the public calls, a plurality of sampling video images are collected according to the calling behavior type and integrated into the sample data set, the sample data set is analyzed through the training model to obtain the key point reference coordinates of the calling behavior, the key point reference coordinates can embody the behavior characteristics in the calling behavior, and then the calling behavior in the video image to be detected can be judged or predicted based on the key point reference coordinates and the training model.
Optionally, the step of respectively obtaining a key point coordinate list of each behavior key point based on the sample data set includes the following steps:
respectively acquiring trunk key points of all sampled video images in the sample data set, wherein the trunk key points comprise a left ear node, a right ear node and a nose node;
respectively acquiring a left ear node coordinate, a right ear node coordinate and a nose node coordinate based on the human body skeleton coordinate;
respectively calculating the trunk key point coordinates of each sampling video image according to the shielding condition of the trunk key points;
and integrating the trunk key point coordinates of all the sampling video images in the sample data set to obtain a trunk key point coordinate list.
By adopting the technical scheme, since the calling behavior types comprise the hand-at-left-ear class, the hand-at-right-ear class and the hand-at-nose class, a left ear node, a right ear node and a nose node are obtained as the trunk key points of each sampled video image, and the node coordinates of the three nodes are acquired. Because a trunk key point node in a sampled video image may be shielded by an obstruction, the trunk key point coordinates of each sampled video image are calculated with the shielding condition taken into account, and the trunk key point coordinates of all the sampled video images in the sample data set are then integrated to obtain the trunk key point coordinate list.
Optionally, the step of respectively obtaining a key point coordinate list of each behavior key point based on the sample data set further includes the following steps:
respectively acquiring hand key points of all sampled video images in the sample data set, wherein the hand key points comprise a plurality of key nodes;
respectively acquiring key node coordinates of each key node based on the human skeleton coordinates;
respectively calculating the coordinates of the hand key points of each sampling video image according to the shielding condition of the hand key points;
and integrating the hand key point coordinates of all the sampling video images in the sample data set to obtain a hand key point coordinate list.
By adopting the technical scheme, since the public usually makes a call by holding the calling device in a hand, the hand key points in the sampled video images need to be acquired; each finger of each hand can be designated as a key node, and the key node coordinates are acquired.
Optionally, respectively calculating the coordinates of the trunk key points of each of the sampled video images according to the occlusion condition of the trunk key points includes the following steps:
judging the shielding conditions of the left ear node, the right ear node and the nose node;
if the three nodes are not shielded, obtaining the coordinates of the key points of the trunk by calculating the coordinate mean value of the three nodes;
if any node is shielded, obtaining the coordinates of the key points of the trunk by calculating the coordinate mean value of the other two nodes;
if any two nodes are blocked, the node coordinate of the other node is the coordinate of the key point of the trunk.
By adopting the technical scheme, the shielded nodes in the trunk key points are screened out, the coordinate mean value of the unshielded nodes is calculated, and the coordinate mean value is used as the coordinates of the trunk key points.
Optionally, respectively calculating the coordinates of the hand key points of each of the sampled video images according to the occlusion condition of the hand key points includes the following steps:
judging the shielding conditions of all key nodes;
if all the key nodes are not shielded, obtaining the coordinates of the key points of the hand by calculating the coordinate mean value of all the key nodes;
if the plurality of key nodes are shielded and the number of the key nodes which are not shielded is not less than 2, obtaining the coordinates of the key points of the hand by calculating the coordinate mean value of the key nodes which are not shielded;
if one key node is not shielded and all other key nodes are shielded, the key node coordinates of the key nodes which are not shielded are the hand key point coordinates.
By adopting the technical scheme, the blocked key nodes in the key points of the hand are screened out, the coordinate mean value of the key nodes which are not blocked is calculated, and the coordinate mean value is used as the coordinates of the key points of the hand.
Optionally, the step of calculating and acquiring the key point reference coordinate of the call-making behavior based on the key point coordinate list includes the following steps:
obtaining a trunk key point reference coordinate by calculating a coordinate mean value of all trunk key point coordinates in the trunk key point coordinate list;
and calculating the coordinate mean value of all the hand key point coordinates in the hand key point coordinate list to obtain the hand key point reference coordinates.
By adopting the technical scheme, the trunk key point coordinate list and the hand key point coordinate list are subjected to mean value calculation, so that trunk key point coordinates and hand key point coordinates with large partial deviation can be screened out, and finally obtained two coordinate mean values can be used as trunk key point reference coordinates and hand key point reference coordinates.
Optionally, the step of judging or predicting the call-making behavior in the video image to be tested based on the key point reference coordinates and the training model includes the following steps:
calculating and acquiring Euclidean distance between the trunk key point reference coordinate and the hand key point reference coordinate;
obtaining a prediction range according to the Euclidean distance;
analyzing and acquiring the actual human skeleton coordinates of the video image to be tested through the training model;
acquiring coordinates of all actual behavior key points based on the actual human skeleton coordinates, and calculating and acquiring an actual Euclidean distance between an actual trunk key point and an actual hand key point in the video image to be detected according to the coordinates of the actual behavior key points;
and judging or predicting the call-making behavior in the video image to be detected according to the actual Euclidean distance and the prediction range.
By adopting the technical scheme, the Euclidean distance between the hand and the trunk in the public telephone calling behavior can be obtained by calculating the reference coordinates of the key points of the trunk and the hand in the space, so that the prediction range is obtained according to the calculated Euclidean distance, the actual Euclidean distance in the video image to be detected is calculated, and the telephone calling behavior in the video image to be detected is judged or predicted according to whether the actual Euclidean distance is in the prediction range or not.
Optionally, the step of judging or predicting the call-making behavior in the video image to be tested according to the actual euclidean distance and the prediction range includes the following steps:
defining a behavior occurrence range based on the prediction range, wherein the range size of the behavior occurrence range is smaller than the range size of the prediction range;
judging whether the actual Euclidean distance is in the behavior occurrence range;
if the actual Euclidean distance is in the behavior occurrence range, judging whether the time when the actual Euclidean distance is in the behavior occurrence range exceeds a preset first judgment time;
if the preset first judgment time is exceeded, judging that the call making behavior appears in the video image to be detected;
and if the actual Euclidean distance is not in the behavior occurrence range or the time that the actual Euclidean distance is in the behavior occurrence range does not exceed a preset first judgment time, judging that the call making behavior does not appear in the video image to be detected.
By adopting the technical scheme, since the call-making behavior may already have occurred in the video image to be tested, the behavior occurrence range is defined according to the prediction range, and it is judged whether the actual Euclidean distance lies within the behavior occurrence range; if so, it is further judged whether the time spent within that range exceeds the first judgment time. Making a call is a sustained action, whereas a person in the video image to be tested may perform transient actions such as touching an ear or brushing back hair, which place the actual Euclidean distance within the behavior occurrence range only briefly; setting the first judgment time therefore screens out misjudgments caused by such transient actions.
Optionally, the step of judging or predicting the call-making behavior in the video image to be tested according to the actual euclidean distance and the prediction range further includes the following steps:
judging whether the actual Euclidean distance is in the prediction range;
if the actual Euclidean distance is in the prediction range, judging whether the time when the actual Euclidean distance is in the prediction range exceeds a preset second judgment time;
if the preset second determination time is exceeded, predicting that the call making behavior will appear in the video image to be detected;
and if the actual Euclidean distance is not in the prediction range or the time that the actual Euclidean distance is in the prediction range does not exceed the preset second determination time, predicting that the call-making behavior does not appear in the video image to be detected.
By adopting the technical scheme, the occurrence condition of the call-making behavior in the video image to be tested is predicted by judging whether the actual Euclidean distance is in the prediction range or not, if the actual Euclidean distance is in the prediction range and exceeds the second judgment time, the call-making behavior in the video image to be tested is predicted, and the influence of transient action on the prediction result can be screened out by setting the second judgment time.
In a second aspect, the present application further provides a system for predicting a call-making behavior, which employs the method for predicting a call-making behavior according to the first aspect and specifically includes:
the video image acquisition module is used for acquiring the sampling video image and the video image to be detected;
and the analysis and prediction module is connected with the video image acquisition module to receive the sampling video image and the video image to be tested, analyzes and acquires the behavior characteristics of the call-making behavior according to the sampling video image, and judges or predicts the call-making behavior in the video image to be tested based on the behavior characteristics.
By adopting the technical scheme, the calling behavior type is defined according to the behavior characteristics of the public in the big data when the public calls, a plurality of sampling video images are collected through the video image collection module according to the calling behavior type and are integrated into the sample data set, the sample data set is analyzed through the training model in the analysis and prediction module to obtain the key point reference coordinate of the calling behavior, the key point reference coordinate can embody the behavior characteristics in the calling behavior, the video image to be detected is collected through the video image collection module, and the calling behavior in the video image to be detected can be judged or predicted based on the key point reference coordinate and the training model.
In summary, the present application includes at least one of the following beneficial technical effects:
1. A sampled video image of the public's call-making behavior is acquired based on big data collection, the behavior characteristics of the call-making behavior are identified and acquired, and the call-making behavior in the video image to be tested is judged or predicted according to these behavior characteristics.
2. The training model is trained through the sample data set, and the call-making behavior is judged or predicted according to the training model; since the collection of the sampled video images in the sample data set is not limited to particular occasions, the training model is applicable to various occasions.
Drawings
Fig. 1 is a flowchart illustrating a method for predicting a call-making behavior according to an embodiment of the present disclosure.
Fig. 2 is a first flowchart illustrating an embodiment of the present application of respectively obtaining a key point coordinate list of each behavior key point.
Fig. 3 is a schematic flowchart of a second embodiment of the present application, where the key point coordinate list of each behavior key point is obtained separately.
Fig. 4 is a schematic flowchart of a process of separately calculating the coordinates of the torso key points of each of the sampled video images according to an embodiment of the present application.
Fig. 5 is a schematic flowchart of a process of separately calculating coordinates of a hand key point of each of the sample video images according to an embodiment of the present application.
Fig. 6 is a schematic flowchart of acquiring the reference coordinates of the key points of the call-making behavior according to an embodiment of the present application.
Fig. 7 is a first flowchart illustrating a process of determining or predicting a call-making behavior in a video image to be tested according to an embodiment of the present disclosure.
Fig. 8 is a schematic flowchart illustrating a second process of determining or predicting a call-making behavior in a video image to be tested according to an embodiment of the present disclosure.
Fig. 9 is a third flowchart illustrating a process of determining or predicting a call-making behavior in a video image to be tested according to an embodiment of the present disclosure.
Detailed Description
The present application is described in further detail below with reference to figures 1-9.
The embodiment of the application discloses a calling behavior prediction system, which comprises a video image acquisition module and an analysis prediction module, wherein the video image acquisition module can be a camera, and the video image acquisition module is used for acquiring a sampling video image and a video image to be detected. The analysis and prediction module can be an image recognition device, is connected with the video image acquisition module, receives the sampled video image and the video image to be detected, analyzes and acquires the behavior characteristics of the call-making behavior according to the sampled video image, and judges or predicts the call-making behavior in the video image to be detected based on the behavior characteristics.
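As an illustration of how the two modules could be wired together, the following minimal Python sketch uses OpenCV for the video image acquisition module and hands every frame to an analysis callback; the function names, the frame source and the callback interface are assumptions for illustration and are not part of the disclosure.

import cv2

def acquire_frames(source=0):
    # Video image acquisition module: yield frames from a camera index or a video file.
    cap = cv2.VideoCapture(source)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            yield frame
    finally:
        cap.release()

def run_system(analyze_frame, source=0):
    # Pass every acquired frame to the analysis and prediction module.
    for frame in acquire_frames(source):
        analyze_frame(frame)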
The embodiment of the application also discloses a method for predicting the call-making behavior.
Referring to fig. 1, the method for predicting a call-making behavior specifically includes the steps of:
101, defining a call behavior type.
The calling behavior type is defined according to the typical behavior characteristics exhibited by the public when making calls, as reflected in big data; the behavior types comprise the hand-at-left-ear class, the hand-at-right-ear class and the hand-at-nose class.
And 102, acquiring a sample data set based on the calling behavior type.
The video image acquisition module is used for acquiring a plurality of sampling video images, the sampling video images contain behaviors of people during telephone calling, and the sampling video images are integrated into a sample data set.
And 103, analyzing the human skeleton coordinates in the sample data set through a training model, and defining behavior key points.
The training model can be an OpenPose model. The human skeleton coordinates in each sampled video image in the sample data set can be analyzed through the training model; the human skeleton coordinates comprise the coordinates of 25 joint points of the human trunk and 22 joint points of each of the left hand and the right hand, and several joint points of the trunk and the hands are defined as behavior key points according to the behavior characteristics during calling.
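A minimal sketch of extracting the skeleton coordinates with the OpenPose Python bindings is given below; the parameter names and the emplaceAndPop call follow recent pyopenpose releases and may differ between OpenPose builds, and the model folder path is an assumption.

import cv2
from openpose import pyopenpose as op  # requires a local OpenPose build with Python bindings

params = {"model_folder": "models/", "hand": True}  # enable hand keypoints in addition to body keypoints
wrapper = op.WrapperPython()
wrapper.configure(params)
wrapper.start()

def skeleton_coordinates(image):
    # Returns (body_keypoints, left_hand_keypoints, right_hand_keypoints) for one video image.
    datum = op.Datum()
    datum.cvInputData = image
    wrapper.emplaceAndPop(op.VectorDatum([datum]))
    return datum.poseKeypoints, datum.handKeypoints[0], datum.handKeypoints[1]

# Example: body, left_hand, right_hand = skeleton_coordinates(cv2.imread("sample.jpg"))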
And 104, respectively acquiring a key point coordinate list of each behavior key point based on the sample data set.
And 105, calculating and acquiring the key point reference coordinate of the call-making behavior based on the key point coordinate list.
And 106, acquiring a video image to be tested, and judging or predicting a calling behavior in the video image to be tested based on the key point reference coordinates and the training model.
The video image acquisition module acquires a video image to be detected.
The implementation principle of the embodiment is as follows:
according to behavior characteristics of the public in the big data when the public makes a call, defining a call making behavior type, collecting a plurality of sampling video images according to the call making behavior type and integrating the sampling video images into a sample data set, analyzing the sample data set through a training model to obtain a key point reference coordinate of the call making behavior, wherein the key point reference coordinate can embody the behavior characteristics of the call making behavior, and then the call making behavior in the video image to be detected can be judged or predicted based on the key point reference coordinate and the training model.
In step 104 of the embodiment shown in fig. 1, coordinates of all the trunk key points of each sampled video image in the sample data set are obtained, and then the coordinates of all the trunk key points are integrated into a trunk key point coordinate list, which is specifically described in detail with the embodiment shown in fig. 2.
Referring to fig. 2, the method for obtaining the coordinate list of the trunk key points includes the following steps:
and 201, respectively acquiring trunk key points of all the sampling video images in the sample data set.
In calling behavior, the handheld mobile phone is usually placed at either ear or near the nose, so the trunk key points include a left ear node, a right ear node and a nose node; in the OpenPose model, the serial number of the left ear node is 17, the serial number of the right ear node is 18, and the serial number of the nose node is 0. In special scenarios, the trunk key points may also include other nodes on the torso.
And 202, respectively acquiring a left ear node coordinate, a right ear node coordinate and a nose node coordinate based on the human skeleton coordinate.
Wherein, the coordinates of the left ear node are (x_17, y_17), the coordinates of the right ear node are (x_18, y_18), and the coordinates of the nose node are (x_0, y_0).
And 203, respectively calculating the coordinate of the trunk key point of each sampling video image according to the shielding condition of the trunk key point.
Wherein, assuming that there are n sampled video images, where n ≥ 2, the trunk key point coordinate calculated for the i-th sampled video image is (x_ti, y_ti).
And 204, integrating the trunk key point coordinates of all the sampling video images in the sample data set to obtain a trunk key point coordinate list.
Wherein, the trunk key point coordinates of all the sampled video images are integrated to obtain the trunk key point coordinate list, which is as follows:
[(x_t1, y_t1), (x_t2, y_t2), ..., (x_tn, y_tn)]
the implementation principle of the embodiment is as follows:
The calling behavior types comprise the hand-at-left-ear class, the hand-at-right-ear class and the hand-at-nose class, so a left ear node, a right ear node and a nose node are obtained as the trunk key points of each sampled video image, and the node coordinates of the three nodes are acquired. Because a trunk key point node in a sampled video image may be occluded by an obstruction, the trunk key point coordinates of each sampled video image are calculated with the occlusion condition taken into account and then integrated to obtain the trunk key point coordinate list.
In step 104 of the embodiment shown in fig. 1, coordinates of all the hand key points of each sample video image in the sample data set are obtained, and then the coordinates of all the hand key points are integrated into a hand key point coordinate list, which is specifically described in detail with reference to the embodiment shown in fig. 3.
Referring to fig. 3, acquiring a hand key point coordinate list of hand key points specifically includes the following steps:
301, respectively acquiring the hand key points of all the sampled video images in the sample data set.
The hand key points comprise five key nodes of the left hand or five key nodes of the right hand; in the OpenPose hand model, the serial numbers of these five key nodes are [4, 8, 12, 16, 20] for both the left hand and the right hand.
And 302, respectively acquiring the key node coordinates of each key node based on the human skeleton coordinates.
Wherein the key node coordinates of the left hand five finger key nodes and the right hand five finger key nodes are defined with reference to step 202.
303, respectively calculating the coordinates of the hand key points of each sampling video image according to the shielding condition of the hand key points.
Wherein, assuming that there are n sampled video images, where n ≥ 2, the left-hand key point coordinate calculated for the i-th sampled video image is (x_li, y_li), and the right-hand key point coordinate is (x_ri, y_ri).
And 304, integrating the hand key point coordinates of all the sampling video images in the sample data set to obtain a hand key point coordinate list.
The left-hand key point coordinate list is obtained by integrating all the left-hand key point coordinates, and the list is as follows:
[(x_l1, y_l1), (x_l2, y_l2), ..., (x_ln, y_ln)]
All the right-hand key point coordinates are integrated to obtain the right-hand key point coordinate list, which is as follows:
[(x_r1, y_r1), (x_r2, y_r2), ..., (x_rn, y_rn)]
the implementation principle of the embodiment is as follows:
Because the public usually makes a call by holding the calling device in a hand, the hand key points in the sampled video images need to be acquired; each finger of each hand can be designated as a key node, and the key node coordinates are acquired. Since hand key nodes in a sampled video image may also be occluded, the hand key point coordinates of each sampled video image are calculated with the occlusion condition taken into account and then integrated to obtain the hand key point coordinate list.
In step 203 of the embodiment shown in fig. 2, the coordinates of the torso key point in each sampled video image are calculated, where if the node is occluded, the coordinates are 0, and if the node is not occluded, the coordinates of the node are obtained for calculation, which is specifically described in detail with the embodiment shown in fig. 4.
Referring to fig. 4, the method for calculating the torso key point coordinates of each sampled video image includes the following steps:
401, judging the occlusion conditions of the left ear node, the right ear node and the nose node, and if none of the three nodes are occluded, executing step 402; if any node is occluded, go to step 403; if any two nodes are occluded, step 404 is performed.
And 402, obtaining the coordinates of the key points of the trunk by calculating the coordinate mean of the three nodes.
Wherein, since none of the three nodes is occluded, the trunk key point coordinate is obtained by averaging the coordinates of the three nodes, and the specific calculation formula is as follows:
x_ti = (x_0 + x_17 + x_18) / 3
y_ti = (y_0 + y_17 + y_18) / 3
where x_ti is the abscissa and y_ti the ordinate of the trunk key point, and i denotes the i-th sampled video image.
And 403, obtaining the coordinates of the key points of the trunk by calculating the coordinate mean of the other two nodes.
Wherein, assuming the nose node is occluded:
x_0 = 0, y_0 = 0
x_ti = (x_17 + x_18) / 2, y_ti = (y_17 + y_18) / 2
Assuming the right ear node is occluded:
x_18 = 0, y_18 = 0
x_ti = (x_0 + x_17) / 2, y_ti = (y_0 + y_17) / 2
Assuming the left ear node is occluded:
x_17 = 0, y_17 = 0
x_ti = (x_0 + x_18) / 2, y_ti = (y_0 + y_18) / 2
where i denotes the i-th sampled video image.
404, the node coordinates of the other node are torso key point coordinates.
The implementation principle of the embodiment is as follows:
and screening out shielded nodes in the trunk key points, calculating the coordinate mean value of the nodes which are not shielded, and taking the coordinate mean value as the coordinates of the trunk key points.
In step 303 of the embodiment shown in fig. 3, coordinates of the hand key point in each sample video image are calculated, where if the node is occluded, the coordinates are 0, and if the node is not occluded, the coordinates of the node are obtained for calculation, which is specifically described in detail with the embodiment shown in fig. 5.
Referring to fig. 5, the method for calculating the coordinates of the key points of the hand of each sampled video image includes the following steps:
501, judging the shielding condition of all key nodes, and if all the key nodes are not shielded, executing step 502; if the plurality of key nodes are occluded and the number of the unoccluded key nodes is not less than 2, executing step 503; if one of the key nodes is not occluded and all other key nodes are occluded, step 504 is performed.
And 502, obtaining the coordinates of the key points of the hand by calculating the coordinate mean value of all the key nodes.
Taking the left hand as an example, the hand key point coordinate is calculated through the following formula:
x_li = (x_4 + x_8 + x_12 + x_16 + x_20) / 5
y_li = (y_4 + y_8 + y_12 + y_16 + y_20) / 5
where x_li is the abscissa and y_li the ordinate of the left-hand key point, and i denotes the i-th sampled video image.
And 503, obtaining the coordinates of the key points of the hand by calculating the coordinate mean of the key nodes which are not shielded.
The specific calculation steps refer to the detailed descriptions of step 403 and step 502.
And 504, the key node coordinates of the key nodes which are not shielded are the hand key point coordinates.
The implementation principle of the embodiment is as follows:
and screening out the occluded key nodes in the hand key points, calculating the coordinate mean value of the unoccluded key nodes, and taking the coordinate mean value as the hand key point coordinate.
In step 105 of the embodiment shown in fig. 1, the key point coordinate list is obtained, and the average value is calculated to obtain the key point reference coordinate of the call-making behavior, which is specifically described in detail with the embodiment shown in fig. 6.
Referring to fig. 6, acquiring a reference coordinate of a key point of a call-making behavior specifically includes the following steps:
601, obtaining the reference coordinates of the trunk key points by calculating the coordinate mean of all the trunk key point coordinates in the trunk key point coordinate list.
If all coordinates in the trunk key point coordinate list are non-zero, the coordinate mean is calculated through the following formula:
X_t = (x_t1 + x_t2 + ... + x_tn) / n
Y_t = (y_t1 + y_t2 + ... + y_tn) / n
The trunk key point reference coordinate is therefore (X_t, Y_t).
And 602, calculating a coordinate mean value of all the coordinates of the hand key points in the hand key point coordinate list to obtain a reference coordinate of the hand key point.
If all coordinates in the left-hand key point coordinate list are non-zero, the coordinate mean is calculated through the following formula:
X_l = (x_l1 + x_l2 + ... + x_ln) / n
Y_l = (y_l1 + y_l2 + ... + y_ln) / n
If all coordinates in the right-hand key point coordinate list are non-zero, the coordinate mean is calculated through the following formula:
X_r = (x_r1 + x_r2 + ... + x_rn) / n
Y_r = (y_r1 + y_r2 + ... + y_rn) / n
Therefore the left-hand key point reference coordinate is (X_l, Y_l), and the right-hand key point reference coordinate is (X_r, Y_r).
The implementation principle of the embodiment is as follows:
by means of mean value calculation of the trunk key point coordinate list and the hand key point coordinate list, trunk key point coordinates and hand key point coordinates with large partial deviation can be screened out, and finally obtained two coordinate mean values can be used as trunk key point reference coordinates and hand key point reference coordinates.
In step 601 and step 602 of the embodiment shown in fig. 6, after the key point reference coordinates are acquired, the call behavior in the video image to be tested is judged or predicted by the training model, which is specifically described in detail by the embodiment shown in fig. 7.
Referring to fig. 7, the method for judging or predicting the call-making behavior in the video image to be tested specifically includes the following steps:
701, calculating and acquiring an Euclidean distance between the trunk key point reference coordinate and the hand key point reference coordinate.
The Euclidean distance D_l between the trunk key point reference coordinate and the left-hand key point reference coordinate is calculated through the following formula:
D_l = sqrt((X_t - X_l)^2 + (Y_t - Y_l)^2)
The Euclidean distance D_r between the trunk key point reference coordinate and the right-hand key point reference coordinate is calculated through the following formula:
D_r = sqrt((X_t - X_r)^2 + (Y_t - Y_r)^2)
And 702, acquiring a prediction range according to the Euclidean distance.
Wherein, the prediction ranges may be set to [D_l, 2D_l] and [D_r, 2D_r].
703, analyzing and acquiring the actual human skeleton coordinates of the video image to be detected through the training model.
And 704, acquiring coordinates of all actual behavior key points based on the actual human skeleton coordinates, and calculating and acquiring the actual Euclidean distance between the actual trunk key point and the actual hand key point in the video image to be detected according to the actual behavior key point coordinates.
With reference to the embodiments shown in fig. 4, fig. 5 and fig. 6, the actual behavior key point coordinates in the video image to be tested are derived and calculated, including the actual trunk key point coordinate (X_t', Y_t'), the actual left-hand key point coordinate (X_l', Y_l') and the actual right-hand key point coordinate (X_r', Y_r');
The actual Euclidean distance D_l' between the actual trunk key point and the actual left-hand key point is calculated through the following formula:
D_l' = sqrt((X_t' - X_l')^2 + (Y_t' - Y_l')^2)
The actual Euclidean distance D_r' between the actual trunk key point and the actual right-hand key point is calculated through the following formula:
D_r' = sqrt((X_t' - X_r')^2 + (Y_t' - Y_r')^2)
705, judging or predicting the call-making behavior in the video image to be tested according to the actual Euclidean distance and the prediction range.
The implementation principle of the embodiment is as follows:
the Euclidean distance between the hand and the trunk in the public telephone calling behavior can be obtained by calculating the Euclidean distance between the trunk key point reference coordinate and the left hand/right hand key point reference coordinate in the space, so that the prediction range is obtained according to the calculated Euclidean distance, the actual Euclidean distance in the video image to be detected is calculated, and the telephone calling behavior in the video image to be detected is judged or predicted according to whether the actual Euclidean distance is in the prediction range.
In step 705 of the embodiment shown in fig. 7, after the actual euclidean distance is calculated, the call behavior in the video image to be measured is determined according to the prediction range, which is specifically described in detail with the embodiment shown in fig. 8.
Referring to fig. 8, the method for judging the call-making behavior in the video image to be tested specifically includes the following steps:
and 801, defining a behavior occurrence range based on the prediction range, wherein the range size of the behavior occurrence range is smaller than that of the prediction range.
Wherein, if the prediction ranges are set to [D_l, 2D_l] and [D_r, 2D_r], then the behavior occurrence ranges may be set to [0, D_l] and [0, D_r].
802, determining whether the actual euclidean distance is within the behavior occurrence range, if yes, executing step 803; if not, go to step 805.
803, judging whether the time of the actual Euclidean distance in the behavior occurrence range exceeds a preset first judgment time, if so, executing step 804; if not, go to step 805.
Wherein, the first judgment time may be preset to 20 s; if either condition D_l' ≤ D_l or D_r' ≤ D_r (i.e., the actual Euclidean distance lies within the behavior occurrence range) holds continuously for more than 20 s, step 804 is performed.
And 804, judging that the call making behavior appears in the video image to be detected.
805, determining that the call-making behavior does not occur in the video image to be tested.
The implementation principle of the embodiment is as follows:
because the call-making behavior in the video image to be detected may have occurred, the behavior occurrence range is defined according to the prediction range, and then whether the actual Euclidean distance is in the behavior occurrence range is judged, if the actual Euclidean distance is in the behavior occurrence range, whether the time in the range exceeds the first judgment time is judged, the call-making behavior is a continuous action, and a person in the video image to be detected may have transient actions such as touching ears or lifting hair, so that the actual Euclidean distance is in the behavior occurrence range in a short time, and therefore the misjudgment caused by the transient actions can be screened out by setting the first judgment time.
In step 705 of the embodiment shown in fig. 7, after the actual euclidean distance is calculated, the call-making behavior in the video image to be measured is predicted according to the prediction range, which is specifically described in detail with the embodiment shown in fig. 9.
Referring to fig. 9, predicting a call-making behavior in a video image to be tested specifically includes the following steps:
901, judging whether the actual Euclidean distance is in the prediction range, if yes, executing a step 902; if not, go to step 904.
902, judging whether the time of the actual euclidean distance in the prediction range exceeds a preset second judgment time, if yes, executing step 903; if not, go to step 904.
Wherein, assuming the preset second judgment time is 10 s, if either condition D_l ≤ D_l' ≤ 2D_l or D_r ≤ D_r' ≤ 2D_r (i.e., the actual Euclidean distance lies within the prediction range) holds continuously for more than 10 s, step 903 is executed.
And 903, predicting the calling behavior to appear in the video image to be tested.
And 904, predicting that the call-making behavior does not appear in the video image to be tested.
The implementation principle of the embodiment is as follows:
and predicting the occurrence condition of the call-making behavior in the video image to be detected by judging whether the actual Euclidean distance is in the prediction range or not, predicting the call-making behavior in the video image to be detected if the actual Euclidean distance is in the prediction range and exceeds a second determination time, and screening the influence of transient actions on the prediction result through the setting of the second determination time.
The above embodiments are preferred embodiments of the present application, and the protection scope of the present application is not limited by the above embodiments, so: all equivalent changes made according to the structure, shape and principle of the present application shall be covered by the protection scope of the present application.

Claims (10)

1. A method for predicting a call-making behavior, comprising the steps of:
defining a calling behavior type, the calling behavior type comprising a hand-at-left-ear class, a hand-at-right-ear class, and a hand-at-nose class;
obtaining a sample data set based on the calling behavior type, wherein the sample data set comprises a plurality of sampling video images containing calling behaviors;
analyzing the human skeleton coordinates in the sample data set through a training model, and defining behavior key points, wherein the behavior key points comprise trunk key points and hand key points;
respectively acquiring a key point coordinate list of each behavior key point based on the sample data set;
calculating and acquiring a key point reference coordinate of the calling behavior based on the key point coordinate list;
and acquiring a video image to be tested, and judging or predicting a calling behavior in the video image to be tested based on the key point reference coordinates and the training model.
2. The method of claim 1, wherein the step of obtaining a key point coordinate list of each behavior key point based on the sample data set comprises the steps of:
respectively acquiring trunk key points of all sampled video images in the sample data set, wherein the trunk key points comprise a left ear node, a right ear node and a nose node;
respectively acquiring a left ear node coordinate, a right ear node coordinate and a nose node coordinate based on the human body skeleton coordinate;
respectively calculating the trunk key point coordinates of each sampling video image according to the shielding condition of the trunk key points;
and integrating the trunk key point coordinates of all the sampling video images in the sample data set to obtain a trunk key point coordinate list.
3. The method of claim 2, wherein the step of obtaining the keypoint coordinate list of each behavior keypoint based on the sample data set further comprises the steps of:
respectively acquiring hand key points of all sampled video images in the sample data set, wherein the hand key points comprise a plurality of key nodes;
respectively acquiring key node coordinates of each key node based on the human skeleton coordinates;
respectively calculating the coordinates of the hand key points of each sampling video image according to the shielding condition of the hand key points;
and integrating the hand key point coordinates of all the sampling video images in the sample data set to obtain a hand key point coordinate list.
4. The method according to claim 2, wherein the step of calculating the torso key point coordinates of each of the sampled video images according to the occlusion condition of the torso key point comprises the following steps:
judging the shielding conditions of the left ear node, the right ear node and the nose node;
if the three nodes are not shielded, obtaining the coordinates of the key points of the trunk by calculating the coordinate mean value of the three nodes;
if any node is shielded, obtaining the coordinates of the key points of the trunk by calculating the coordinate mean value of the other two nodes;
if any two nodes are blocked, the node coordinate of the other node is the coordinate of the key point of the trunk.
5. The method according to claim 3, wherein the step of calculating the coordinates of the key points of the hand of each of the sampled video images according to the occlusion condition of the key points of the hand comprises the following steps:
judging the shielding conditions of all key nodes;
if all the key nodes are not shielded, obtaining the coordinates of the key points of the hand by calculating the coordinate mean value of all the key nodes;
if the plurality of key nodes are shielded and the number of the key nodes which are not shielded is not less than 2, obtaining the coordinates of the key points of the hand by calculating the coordinate mean value of the key nodes which are not shielded;
if one key node is not shielded and all other key nodes are shielded, the key node coordinates of the key nodes which are not shielded are the hand key point coordinates.
6. The method according to claim 4 or 5, wherein the step of calculating and acquiring the key point reference coordinates of the call-making behavior based on the key point coordinate list comprises the steps of:
obtaining a trunk key point reference coordinate by calculating a coordinate mean value of all trunk key point coordinates in the trunk key point coordinate list;
and calculating the coordinate mean value of all the hand key point coordinates in the hand key point coordinate list to obtain the hand key point reference coordinates.
7. The method according to claim 6, wherein the step of determining or predicting the call-making behavior in the video image to be tested based on the key point reference coordinates and the training model comprises the steps of:
calculating and acquiring Euclidean distance between the trunk key point reference coordinate and the hand key point reference coordinate;
obtaining a prediction range according to the Euclidean distance;
analyzing and acquiring the actual human skeleton coordinates of the video image to be tested through the training model;
acquiring coordinates of all actual behavior key points based on the actual human skeleton coordinates, and calculating and acquiring an actual Euclidean distance between an actual trunk key point and an actual hand key point in the video image to be detected according to the coordinates of the actual behavior key points;
and judging or predicting the call-making behavior in the video image to be detected according to the actual Euclidean distance and the prediction range.
8. The method according to claim 7, wherein the step of determining or predicting the call-making behavior in the video image to be tested according to the actual Euclidean distance and the prediction range comprises the following steps:
defining a behavior occurrence range based on the prediction range, wherein the range size of the behavior occurrence range is smaller than the range size of the prediction range;
judging whether the actual Euclidean distance is in the behavior occurrence range;
if the actual Euclidean distance is in the behavior occurrence range, judging whether the time when the actual Euclidean distance is in the behavior occurrence range exceeds a preset first judgment time;
if the preset first judgment time is exceeded, judging that the call making behavior appears in the video image to be detected;
and if the actual Euclidean distance is not in the behavior occurrence range or the time that the actual Euclidean distance is in the behavior occurrence range does not exceed a preset first judgment time, judging that the call making behavior does not appear in the video image to be detected.
9. The method of claim 8, wherein the step of determining or predicting the call-making behavior in the video image to be tested according to the actual Euclidean distance and the prediction range further comprises the steps of:
judging whether the actual Euclidean distance is in the prediction range;
if the actual Euclidean distance is in the prediction range, judging whether the time when the actual Euclidean distance is in the prediction range exceeds a preset second judgment time;
if the preset second determination time is exceeded, predicting that the call making behavior will appear in the video image to be detected;
and if the actual Euclidean distance is not in the prediction range or the time that the actual Euclidean distance is in the prediction range does not exceed the preset second determination time, predicting that the call-making behavior does not appear in the video image to be detected.
10. A call behavior prediction system that employs a call behavior prediction method according to any one of claims 1 to 9, characterized by comprising:
the video image acquisition module is used for acquiring the sampling video image and the video image to be detected;
and the analysis and prediction module is connected with the video image acquisition module to receive the sampling video image and the video image to be tested, analyzes and acquires the behavior characteristics of the call-making behavior according to the sampling video image, and judges or predicts the call-making behavior in the video image to be tested based on the behavior characteristics.
CN202110840176.8A 2021-07-24 2021-07-24 Call-making behavior prediction method and system Pending CN113591661A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110840176.8A CN113591661A (en) 2021-07-24 2021-07-24 Call-making behavior prediction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110840176.8A CN113591661A (en) 2021-07-24 2021-07-24 Call-making behavior prediction method and system

Publications (1)

Publication Number Publication Date
CN113591661A true CN113591661A (en) 2021-11-02

Family

ID=78249388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110840176.8A Pending CN113591661A (en) 2021-07-24 2021-07-24 Call-making behavior prediction method and system

Country Status (1)

Country Link
CN (1) CN113591661A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105469073A (en) * 2015-12-16 2016-04-06 安徽创世科技有限公司 Kinect-based call making and answering monitoring method of driver
CN108446678A (en) * 2018-05-07 2018-08-24 同济大学 A kind of dangerous driving behavior recognition methods based on skeleton character
CN110188701A (en) * 2019-05-31 2019-08-30 上海媒智科技有限公司 Dress ornament recognition methods, system and terminal based on the prediction of human body key node

Similar Documents

Publication Publication Date Title
WO2018168042A1 (en) Image analysis device, image analysis method, and image analysis program
WO2015126031A1 (en) Person counting method and device for same
CN106027931A (en) Video recording method and server
CN112115904A (en) License plate detection and identification method and device and computer readable storage medium
CN111523416A (en) Vehicle early warning method and device based on highway ETC portal
CN106412422A (en) Focusing method, focusing device and terminal
CN111476160A (en) Loss function optimization method, model training method, target detection method, and medium
CN106384348A (en) Monitor image anomaly detection method and device
CN110348345A (en) A kind of Weakly supervised timing operating position fixing method based on continuity of movement
CN111062319B (en) Driver call detection method based on active infrared image
CN111597985A (en) Dynamic identification method and device for equipment wearing and electronic equipment
CN116758493B (en) Tunnel construction monitoring method and device based on image processing and readable storage medium
CN113591661A (en) Call-making behavior prediction method and system
CN116403162B (en) Airport scene target behavior recognition method and system and electronic equipment
CN110390313B (en) Violent action detection method and system
CN112183287A (en) People counting method of mobile robot under complex background
CN110086987B (en) Camera visual angle cutting method and device and storage medium
CN109543610A (en) Vehicle detecting and tracking method, device, equipment and storage medium
WO2022012573A1 (en) Image processing method and apparatus, electronic device, and storage medium
CN115578664A (en) Video monitoring-based emergency event judgment method and device
CN114332695A (en) Method and device for identifying opening and closing of elevator door and storage medium
CN112749577B (en) Parking space detection method and device
CN114360055A (en) Behavior detection method, device and storage medium based on artificial intelligence
CN109684991B (en) Image processing method, image processing device, electronic equipment and storage medium
CN114463776A (en) Fall identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination