CN111260685B - Video processing method and device and electronic equipment - Google Patents

Video processing method and device and electronic equipment

Info

Publication number
CN111260685B
CN111260685B (application CN201811459605.1A)
Authority
CN
China
Prior art keywords
video
commodity
user
behavior
characteristic
Prior art date
Legal status
Active
Application number
CN201811459605.1A
Other languages
Chinese (zh)
Other versions
CN111260685A (en)
Inventor
赵小伟
沈飞
陈忱
张浩
刘扬
文杰
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201811459605.1A priority Critical patent/CN111260685B/en
Publication of CN111260685A publication Critical patent/CN111260685A/en
Application granted granted Critical
Publication of CN111260685B publication Critical patent/CN111260685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30232 Surveillance
    • G06T 2207/30241 Trajectory

Abstract

The embodiments of the invention provide a video processing method, a video processing apparatus, and an electronic device. The method comprises the following steps: acquiring a video stream capturing user behavior; detecting the motion trail of a moving target in the video stream with an optical flow tracking algorithm; searching the video stream for a video segment having a first characteristic according to the motion trail of the moving target; and determining, according to the video segment, whether a first predetermined behavior corresponding to the first characteristic occurs. With the video processing method, apparatus, and electronic device provided by the embodiments of the invention, the user can be monitored and subsequent settlement processing applied according to the judgment of whether the first predetermined behavior occurred. For example, assisted settlement or an alarm can be triggered depending on whether the user exhibited missed-scan behavior, which avoids or reduces missed scans at checkout, reduces the economic loss of retail stores, and saves manpower and material resources.

Description

Video processing method and device and electronic equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a video processing method and apparatus, and an electronic device.
Background
With the continuous development of new retail, improving efficiency and reducing cost in retail stores have become increasingly important. For example, improving users' shopping or checkout efficiency, or improving the shelf efficiency of commodities, is a problem to be solved urgently.
For example, self-service checkout terminals are increasingly widely used as a main means of improving the offline checkout experience and efficiency of users. They are mostly placed at store exits so that consumers can scan commodities and complete payment by themselves, avoiding queues and providing great convenience.
In the prior art, customers often exhibit intentional or unintentional missed-scan behavior when using self-service checkout terminals, causing economic losses to retail stores. To address this, current self-service checkout terminals confirm the commodities scanned by the user through weighing: the user must settle according to prescribed steps, user behavior is strictly constrained, settlement efficiency is low, and the user experience is poor.
Disclosure of Invention
In view of this, embodiments of the present invention provide a video processing method, an apparatus and an electronic device to reduce the cost of retail stores.
In a first aspect, an embodiment of the present invention provides a video processing method, including:
acquiring a video stream for shooting user behaviors;
detecting the motion trail of a moving target in the video stream by adopting an optical flow tracking algorithm;
searching a video segment with a first characteristic in the video stream according to the motion track of the moving target;
determining whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
In a second aspect, an embodiment of the present invention provides a video processing method, including:
acquiring a sensing signal sent by a sensing device;
determining the motion track of the hand of the user according to the sensing signal;
searching a video clip with a first characteristic in a video stream for shooting user behaviors according to the motion track of the hand;
determining whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
In a third aspect, an embodiment of the present invention provides a video processing method, including:
acquiring an offline video for shooting user behaviors;
detecting the motion trail of a moving target in the off-line video by adopting an optical flow tracking algorithm;
searching a video segment with a first characteristic in the offline video according to the motion track of the moving target;
and determining whether the user has a first preset behavior corresponding to the first characteristic according to the video clip.
In a fourth aspect, an embodiment of the present invention provides a store management method, including:
acquiring a video stream for shooting the behavior of a manager;
detecting the motion trail of a moving target in the video stream by adopting an optical flow tracking algorithm;
searching a video segment with a second characteristic in the video stream according to the motion track of the moving target;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
In a fifth aspect, an embodiment of the present invention provides a store management method, including:
acquiring an offline video for shooting the behavior of a manager;
detecting the motion trail of a moving target in the off-line video by adopting an optical flow tracking algorithm;
searching a video segment with a second characteristic in the offline video according to the motion track of the moving target;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
In a sixth aspect, an embodiment of the present invention provides a video processing apparatus, including:
the acquisition module is used for acquiring a video stream for shooting user behaviors;
the detection module is used for detecting the motion trail of the moving target in the video stream by adopting an optical flow tracking algorithm;
the searching module is used for searching a video segment with a first characteristic in the video stream according to the motion track of the moving target;
a determining module for determining whether a first predetermined behavior corresponding to the first characteristic occurs according to the video segment.
In a seventh aspect, an embodiment of the present invention provides a video processing apparatus, including:
the acquisition module is used for acquiring a sensing signal sent by the sensing device;
the detection module is used for determining the motion track of the hand of the user according to the sensing signal;
the searching module is used for searching a video clip with a first characteristic in a video stream for shooting user behaviors according to the motion track of the hand;
a determining module for determining whether a first predetermined behavior corresponding to the first characteristic occurs according to the video segment.
In an eighth aspect, an embodiment of the present invention provides a video processing apparatus, including:
the acquisition module is used for acquiring an offline video for shooting user behaviors;
the detection module is used for detecting the motion trail of the moving target in the off-line video by adopting an optical flow tracking algorithm;
the searching module is used for searching a video segment with a first characteristic in the offline video according to the motion track of the moving target;
a determining module, configured to determine whether a first predetermined behavior corresponding to the first feature occurs to the user according to the video segment.
In a ninth aspect, an embodiment of the present invention provides a store management apparatus, including:
the acquisition module is used for acquiring a video stream for shooting the behavior of a manager;
the detection module is used for detecting the motion trail of the moving target in the video stream by adopting an optical flow tracking algorithm;
the searching module is used for searching a video segment with a second characteristic in the video stream according to the motion track of the moving target;
and the determining module is used for determining whether a second preset behavior corresponding to the second characteristic occurs to the manager or not according to the video clip.
In a tenth aspect, an embodiment of the present invention provides a store management apparatus, including:
the acquisition module is used for acquiring an offline video for shooting the behavior of a manager;
the detection module is used for detecting the motion trail of the moving target in the off-line video by adopting an optical flow tracking algorithm;
the searching module is used for searching a video segment with a second characteristic in the offline video according to the motion track of the moving target;
and the determining module is used for determining whether a second preset behavior corresponding to the second characteristic occurs to the manager or not according to the video clip.
In an eleventh aspect, an embodiment of the present invention provides an electronic device, including: a first memory and a first processor; the first memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the first processor, implement the video processing method of the first aspect.
In a twelfth aspect, an embodiment of the present invention provides an electronic device, including: a second memory and a second processor; the second memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the second processor, implement the video processing method of the second aspect.
In a thirteenth aspect, an embodiment of the present invention provides an electronic device, including: a third memory and a third processor; the third memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the third processor, implement the video processing method of the third aspect.
In a fourteenth aspect, an embodiment of the present invention provides an electronic device, including: a third memory and a third processor; the third memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the third processor, implement the store management method of the fourth aspect.
In a fifteenth aspect, an embodiment of the present invention provides an electronic device, including: a third memory and a third processor; the third memory is configured to store one or more computer instructions, wherein the one or more computer instructions, when executed by the third processor, implement the store management method of the fifth aspect.
An embodiment of the present invention provides a computer storage medium, configured to store a computer program, where the computer program enables a computer to implement the video processing method according to the first aspect when executed.
An embodiment of the present invention provides a computer storage medium, configured to store a computer program, where the computer program enables a computer to implement the video processing method according to the second aspect when executed.
An embodiment of the present invention provides a computer storage medium, configured to store a computer program, where the computer program enables a computer to implement the video processing method according to the third aspect when executed.
An embodiment of the present invention provides a computer storage medium, configured to store a computer program, where the computer program enables a computer to implement the store management method according to the fourth aspect when executed.
An embodiment of the present invention provides a computer storage medium, configured to store a computer program, where the computer program enables a computer to implement the store management method according to the fifth aspect when executed.
According to the video processing method, the video processing apparatus, and the electronic device, a video stream capturing user behavior can be acquired, a video segment having a first characteristic can be found in the stream, and whether a first predetermined behavior corresponding to the first characteristic occurred can be determined from that segment. The user can thus be monitored and subsequent settlement processing applied according to the judgment of whether the first predetermined behavior occurred; for example, assisted settlement or an alarm can be triggered depending on whether the user exhibited missed-scan behavior. This avoids or reduces missed scans at checkout, reduces the economic loss of retail stores, and saves manpower and material resources. Because user behavior is analyzed through video processing, the user's shopping and checkout flow is not disturbed, the processing efficiency of shopping and checkout is effectively improved, and the user experience is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present invention;
fig. 2 is an interaction diagram of a self-service cash register terminal according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a self-service cash register terminal according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of a video processing method according to a first embodiment of the present invention;
fig. 5 is a schematic diagram illustrating a partition of a placement platform according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a method for determining missing scan logic in a commodity tracking process according to an embodiment of the present invention;
fig. 7 is a schematic flowchart of a second video processing method according to an embodiment of the present invention;
fig. 8 is a schematic flowchart of a third embodiment of a video processing method according to the present invention;
FIG. 9 is a diagram illustrating merging confidence levels according to an embodiment of the present invention;
fig. 10 is a schematic flowchart illustrating a fourth embodiment of a video processing method according to the present invention;
fig. 11 is a schematic flowchart of a store management method according to a first embodiment of the present invention;
fig. 12 is a schematic flowchart of a second store management method according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of a first video processing apparatus according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of a second video processing apparatus according to an embodiment of the present invention;
fig. 15 is a schematic structural diagram of a third video processing apparatus according to an embodiment of the present invention;
fig. 16 is a schematic structural diagram of a first store management apparatus according to an embodiment of the present invention;
fig. 17 is a schematic structural diagram of a second store management apparatus according to an embodiment of the present invention;
fig. 18 is a schematic structural diagram of a first electronic device according to an embodiment of the present invention;
fig. 19 is a schematic structural diagram of a second electronic device according to an embodiment of the present invention;
fig. 20 is a schematic structural diagram of a third electronic device according to an embodiment of the present invention;
fig. 21 is a schematic structural diagram of a fourth electronic device according to an embodiment of the present invention;
fig. 22 is a schematic structural diagram of a fifth electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the embodiments of the invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; "a plurality" generally means at least two, but does not exclude the case of at least one.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects, and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
The words "if," if, "and" if, "as used herein, may be interpreted as" when an. Similarly, the phrase "if determined" or "if detected (a stated condition or event)" may be interpreted as "upon determining" or "in response to determining" or "upon detecting (a stated condition or event)" or "in response to detecting (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises", "comprising", and any other variations thereof are intended to cover a non-exclusive inclusion, such that a commodity or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a commodity or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the commodity or system that comprises the element.
The embodiment of the invention provides a video processing method which can acquire a video stream for shooting user behaviors, search a video segment with a first characteristic in the video stream, and determine whether a first preset behavior corresponding to the first characteristic occurs to a user or not according to the video segment.
The first characteristic and the first predetermined behavior can be set according to actual needs. Optionally, the first predetermined behavior may be any behavior of the user in the store, for example theft during shopping, placing an article in the wrong location, or a missed scan during checkout; correspondingly, the first characteristic may be a characteristic that suggests the first predetermined behavior is occurring.
For example, when the first predetermined behavior is theft, the first characteristic may be a characteristic of suspected theft, such as the hand moving from a shelf to within a preset threshold distance of the body. Once a characteristic of suspected theft is detected, whether the user actually stole can be judged from the video segment in which the characteristic appears.
There are many ways to determine whether the first predetermined behavior occurs based on the video segment. Optionally, the video segment may be detected by a machine learning model, and it is determined whether the first predetermined behavior occurs in the video segment.
In the embodiment of the invention, the video stream is captured in real time. When specific user behaviors are analyzed by methods such as machine learning models, shorter video segments may need to be processed; therefore the video segment suspected of containing the first predetermined behavior can first be located in the stream via the first characteristic, and that segment is then further processed to determine whether the first predetermined behavior occurred.
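To make this two-stage idea concrete, the following Python sketch locates short suspicious segments first and passes only those to a learned classifier. All names here (`find_suspicious_segments`, `classify_segment`) are hypothetical stand-ins supplied by the caller, not functions defined by the patent:

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[int, int]   # (start_frame, end_frame) indices into the stream

def detect_predetermined_behavior(
    frames: Sequence,
    find_suspicious_segments: Callable[[Sequence], List[Segment]],
    classify_segment: Callable[[Sequence], float],
    threshold: float = 0.5,
) -> List[Segment]:
    """Two-stage pipeline sketched above: stage 1 locates short segments
    bearing the first characteristic; stage 2 runs only those segments
    through the (comparatively expensive) machine learning model."""
    confirmed = []
    for start, end in find_suspicious_segments(frames):
        if classify_segment(frames[start:end]) >= threshold:
            confirmed.append((start, end))
    return confirmed
```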
For convenience of description, the following describes in detail implementation procedures and principles of the embodiments of the present invention, taking the first predetermined behavior as a missing scan behavior as an example.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present invention. As shown in fig. 1, a user may select a commodity to be purchased in a store, and the commodity is generally provided with a barcode, a two-dimensional code, and other identifiers. After the user purchases the goods, the user can settle at the self-service cash register terminal, the self-service cash register terminal can be provided with the scanning device, and the user can scan the identification of the goods through the scanning device, so that the settlement of the goods is realized.
Fig. 2 is an interaction schematic diagram of a self-service cash register terminal according to an embodiment of the present invention. As shown in fig. 2, after a user scans a commodity at the self-service cash register terminal, the self-service cash register terminal may send a scanning result to the server, and the server may query commodity information corresponding to the scanning result, such as a name and a price of the commodity, and send the commodity information to the self-service cash register terminal, and the commodity information is displayed to the user by the self-service cash register terminal.
After the user finishes scanning all the commodities, the self-service cash register terminal can calculate the settlement price, or the server can generate the settlement price according to the prices of all the commodities, discount information and the like and send the settlement price to the self-service cash register terminal, the self-service cash register terminal can display the settlement price to the user and finish settlement according to the payment behavior of the user, and therefore the whole self-service settlement flow is finished.
In the whole self-service checkout flow, the self-service cash register terminal can collect video streams when a user scans commodities and judges whether the user has scanning missing behaviors or not according to the video streams.
Fig. 3 is a schematic structural diagram of a self-service cash register terminal according to an embodiment of the present invention. As shown in fig. 3, the self-service cash register terminal may be provided with a display device, a scanning device, a placing table, a camera, and the like.
The display device can display commodity information, settlement price and other information which are needed to be paid finally. The placing table is used for placing commodities. The scanning device is used for scanning the identification of the commodity, such as a bar code or a two-dimensional code. Optionally, the scanning device may be a scanning device Of a POS (Point Of Sale) device, and the POS device may determine corresponding commodity information according to a scanning result Of the commodity.
The camera is used for shooting self-service checkout behaviors of the user. In the self-service cash register terminal shown in fig. 3, the camera is arranged at the top, and in practical application, the camera may be arranged at any position capable of shooting the user checkout behavior, for example, the camera may be arranged opposite to the user or on the side of the user. When whether the user has the scanning missing behavior is detected by analyzing the video stream, the corresponding detection strategy can be adjusted according to the specific position of the camera.
The embodiment of the invention provides a method for shooting the checkout behavior of a user in the self-service checkout process of the user and processing the shot video stream so as to determine whether the user has the scanning missing behavior. Fig. 1 to 3 show optional application scenarios and structures according to the embodiment of the present invention. It will be understood by those skilled in the art that the specific hardware architecture may be adjusted according to actual needs, as long as the detection of the user's missing scan behavior through the video stream is achieved.
For example, the functions of processing the video stream and determining whether the user has missed scanning may be implemented by a self-service cash register terminal, or may be implemented by a server. Optionally, the self-service cash register terminal may send the collected video stream to the server, and the server detects whether a scanning missing behavior occurs and returns a detection result; or, the self-service cash register terminal may also send the video stream to other devices, such as a back-office monitoring terminal of a store, for video processing.
The following describes an implementation process of a video processing method according to an embodiment of the present invention with reference to the following method embodiment and accompanying drawings. In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
Fig. 4 is a schematic flowchart of a video processing method according to a first embodiment of the present invention. The execution main body of the method in the embodiment may be any electronic device with a video processing function, and optionally, may be a self-service cash register terminal. As shown in fig. 4, the video processing method in this embodiment may include:
step 401, acquiring a video stream for shooting user behavior.
Step 402, finding out a video segment with a first characteristic in the video stream.
Step 403, determining whether a first predetermined behavior corresponding to the first feature occurs according to the video clip.
In the embodiment of the present invention, a self-service cash register terminal is taken as an example for explanation. It will be appreciated by those skilled in the art that the principles and methods of implementing video processing with other devices are similar to self-service checkout terminals.
The first predetermined behavior may be a missing scanning behavior, and the video segment with the first characteristic may be a video segment with a suspected missing scanning behavior. For convenience of description, in the embodiment of the present invention, the suspected missed scanning behavior is regarded as a suspicious behavior, and the video segment having the first characteristic may specifically be a video segment having a suspicious behavior.
Specifically, the video stream may be processed in real time to determine whether there is a video segment of a suspicious behavior in the video stream. In the embodiment of the invention, the specific expression form of the missing scanning can be various. Table 1 shows an example of a classification of the missed scan behavior.
TABLE 1 classification example of missed Scan behavior
Category | Description
Code-scanning miss-scan | A code-scanning action occurs, but the scan ultimately fails for subjective or objective reasons (e.g., the barcode is deliberately occluded, or the POS device responds too slowly).
Direct bagging | The commodity is moved to the scanned area without any code-scanning action.
As shown in Table 1, missed-scan behavior can be divided into two major categories: the code-scanning miss-scan and direct bagging. A code-scanning miss-scan means the user performs a code-scanning action, but the scan ultimately fails for subjective or objective reasons; for example, the user deliberately occludes the barcode, or the POS device responds too slowly to read it in time. Direct bagging means the user moves the commodity to the scanned area without performing any code-scanning action.
Fig. 5 is a schematic diagram illustrating a partition of a placing table according to an embodiment of the present invention. As shown in fig. 5, looking down on the placement table, the placement table may be divided into two regions: the area A is an area to be scanned with codes, the area B is an area scanned with codes, before settlement, a user can place the commodities in the area A, when settlement is conducted, the commodities are taken up from the area A, the codes are scanned through POS equipment, and then the commodities after the codes are scanned are placed in the area B.
If the user moves a commodity directly to area B without scanning it, this is considered direct bagging. The commodity may be moved from area A or from any other area; regardless of its starting position, any movement into area B from outside area B counts. That is, in fig. 5, both the movement represented by arrow 1 and the movement represented by arrow 2 can be regarded as direct bagging.
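A minimal sketch of the zone test implied by fig. 5 follows. The coordinates, the `Rect` helper, and the exact rule are illustrative assumptions based on the description above, not details taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class Rect:
    """Axis-aligned region of the placement table, in image pixels."""
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, x: float, y: float) -> bool:
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2

ZONE_B = Rect(320, 0, 640, 480)   # assumed layout: scanned area on the right

def is_direct_bagging(trajectory, scan_seen: bool) -> bool:
    """Flag a commodity trajectory as direct bagging when it ends inside
    area B, started outside it (any origin counts, matching arrows 1 and
    2 in fig. 5), and no scanning result accompanied the movement."""
    (x0, y0) = trajectory[0]
    (xn, yn) = trajectory[-1]
    entered_b = (not ZONE_B.contains(x0, y0)) and ZONE_B.contains(xn, yn)
    return entered_b and not scan_seen
```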
As described above, the miss-scanning behavior can be divided into a code-scanning miss-scanning behavior and a direct bagging behavior, and accordingly, as long as a code-scanning action occurs, or as long as a commodity is moved to a code-scanning area, the miss-scanning behavior is considered to be possible and is recorded as a suspicious behavior. Whether missing scan behavior is true may be further confirmed in conjunction with the scanning results of the POS device and/or machine learning models.
Table 1 merely shows several common missed-scan behaviors as examples. In general, after the POS device successfully scans a commodity's identifier, a scanning result for that commodity is obtained; if the identifier is not scanned, no scanning result is obtained. An action can therefore be considered suspicious if it should be accompanied by the acquisition of a scanning result, that is, a result must be obtained when the action occurs, and its absence indicates a missed scan. For example, when a user performs a code-scanning action, or moves a commodity from outside the scanned area into the scanned area, a scanning result should be obtained; otherwise a scan has been missed. The code-scanning action, or the action of moving the commodity into the scanned area, may accordingly be regarded as a suspicious action.
There are many methods for detecting video segments of suspicious behavior in a video stream. Alternatively, suspicious behavior in the video stream may be detected by the recognition model. Specifically, the recognition model may be trained through the sample, and suspicious behaviors in the video stream may be found according to the trained model.
The training samples may comprise multiple videos, each annotated with the start and end times of any suspicious behavior. The recognition model is trained on these samples; after training, inputting a video to be examined into the model yields the video segments containing suspicious behavior.
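As an illustration, training data for such a recognition model might be organized as below. The class names and fields are assumptions for the sketch, since the patent specifies only that start and end times of suspicious behavior are annotated:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SuspiciousSpan:
    start_s: float        # annotated start time of the suspicious behavior
    end_s: float          # annotated end time of the suspicious behavior

@dataclass
class TrainingSample:
    video_path: str
    spans: List[SuspiciousSpan]   # empty for purely negative videos

# e.g. TrainingSample("checkout_0001.mp4", [SuspiciousSpan(1.5, 2.0)])
```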
Specifically, whether the miss-scanning behavior occurs or not can be judged by combining the scanning result of the POS device and/or a machine learning model.
Optionally, if the scanning result is obtained within the starting and ending time of the video segment of the suspicious behavior, it is determined that the scanning missing behavior does not occur, and if the scanning result is not obtained, it is determined that the scanning missing behavior occurs.
Or, the video segment may be input to a machine learning model, and a detection result of whether the video segment belongs to the missed scanning behavior is obtained. The machine learning model may be trained over a large number of samples.
Or, the POS signal may be combined with a machine learning model, and whether a scanning result is obtained within the start-stop time of the video clip is determined first, and if the scanning result is obtained, it is determined that no missing scanning occurs; if not, the video clip can be input into the machine learning model and further confirmed by the machine learning model.
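The three alternatives above can be read as one decision cascade, sketched here under the assumption that the POS signal and the model are available as simple inputs (the names are illustrative):

```python
def missed_scan_in_segment(segment_frames, scan_result_in_window: bool,
                           model_says_missed_scan) -> bool:
    """Decision cascade described above: a POS scanning result inside the
    segment's start/stop window clears the segment immediately; without
    one, the segment is escalated to the machine learning model."""
    if scan_result_in_window:          # POS signal alone is decisive
        return False
    return model_says_missed_scan(segment_frames)
```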
These approaches require that every video segment of suspicious behavior be accompanied by a scanning result, and they do not consider the commodity's tracking process; the logic is simple and easy to implement, but false alarms can occur. To improve accuracy, the video segments can also be combined with the commodity's tracking process.
Optionally, determining whether a first predetermined behavior corresponding to the first feature occurs according to the video segment may include: after the tracking process of a commodity is finished, if a video clip with a first characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process; and if the scanning result is not obtained, judging whether a first preset behavior occurs according to the video clip with the first characteristic.
Specifically, after the tracking process of a commodity is finished, if a video segment of a suspicious behavior appears in the tracking process, whether a scanning result of the commodity is obtained in the tracking process is judged; and if the scanning result is obtained in the tracking process of one commodity, determining that the scanning missing behavior does not occur in the tracking process of the commodity. Wherein the tracking process of the commodity is a process of holding the commodity in the hand.
The video segment of the suspicious activity can be a segment in a complete commodity tracking process. For example, the moment of picking up the commodity and the moment of putting down the commodity of the user can be judged according to the hand of the user and the moving track of the commodity, and a complete commodity tracking process can be determined according to the moment of picking up the commodity and the moment of putting down the commodity, wherein the starting and ending time of the process is the moment of picking up the commodity and putting down the commodity. During a complete article tracking process, one or more video segments of suspicious behavior may appear.
For example, suppose that after the user starts self-checkout, a commodity is picked up at second 0.5 and moved toward the code-scanning device; a code-scanning action is detected from second 1.5 to second 2.0; movement from outside area B into area B is detected from second 2.5 to second 3.0; and the commodity is detected being put down at second 4.0. The tracking process for this commodity then runs from second 0.5 to second 4.0, lasting 3.5 seconds, and contains two suspicious segments: the code-scanning segment from second 1.5 to 2.0, and the move-into-area-B segment from second 2.5 to 3.0, each lasting 0.5 seconds.
If the scanning result is obtained in the tracking process of the commodity, namely the process from 0.5 th second to 4 th second, the scanning is not considered to be missed. If the scanning result is not obtained, it may be determined that the missing scanning behavior occurs, or if the scanning result is not obtained, it may be further determined whether the missing scanning behavior occurs according to the video segments of the two suspicious behaviors, and specifically, the video segments of the two suspicious behaviors may be analyzed by a machine learning model to determine whether the missing scanning behavior occurs.
In summary, if the POS device has not scanned the commodity's identifier during the tracking process of that commodity, the user may have missed a scan. In addition, if the identifier is scanned but the commodity information determined from the scan is inconsistent with the commodity information determined from the video stream, a missed scan can also be considered to have occurred; this prevents a user from cheating by substituting a fake identifier for the real one, which would bring loss to the store.
For example, by processing the video stream, it is found that the commodity in the user's hand is a drink. However, if the scanning result is chewing gum, it indicates that the commodity information determined by the identification is inconsistent with the commodity information detected by the video stream, and it can also be considered that the scanning miss behavior occurs.
After the user finishes scanning all the commodities, the commodities can be settled according to the scanning missing condition of the user. Specifically, if the scanning missing condition of the user meets a preset condition, the commodity scanned by the user can be settled, and the user can leave with the commodity after normally completing payment. If the preset condition is not met, the commodity is not allowed to be settled. The preset conditions can be set according to actual needs.
In an alternative embodiment, as long as the user is detected to have the scanning missing behavior, the settlement of the article is not allowed, and the settlement can be normally performed only if the user does not have the scanning missing behavior in the whole scanning process.
In another optional implementation, as long as the number of the user's missed scans is less than a certain value, the user is allowed to settle the scanned commodities. This provides fault tolerance for the video processing algorithm and prevents misjudgments from affecting the user's shopping experience and interrupting the shopping flow.
Correspondingly, the method in this embodiment may further include: in response to an operation event in which the user confirms that all commodities have been scanned, counting the number of times the user exhibited missed-scan behavior; and if that count is less than a preset number, settling the commodities scanned by the user. The preset number can be set according to actual needs, for example 4.
The operation event that the user confirms that the scanning of the commodities is completed may refer to an operation that the user determines that all the commodities are scanned completely by clicking a screen, pressing a key, inputting voice, and the like, for example, a "complete" button may be displayed on the self-service cash register terminal, the user may click the "complete" button when all the commodities are scanned completely, and the self-service cash register terminal may settle the bills for the commodities scanned by the user in response to the click operation of the user.
If the number of the user's missed scans is not less than the preset number, settlement of the commodities is not allowed. In addition, a settlement-prohibiting interface can be displayed, and/or warning information can be sent to a monitoring terminal.
Specifically, the settlement-prohibiting interface is used to prompt the user that settlement cannot proceed. Optionally, it may display "A missed scan was detected; settlement cannot be completed", or "A missed scan was detected; please wait for a clerk to assist".
The monitoring terminal can be a background monitoring terminal and/or an on-site monitoring terminal. The on-site monitoring terminal can be any terminal carried by on-site monitoring personnel, for example a mobile phone or a wearable device such as a watch or smart bracelet, and the on-site monitoring personnel can be staff who assist users in completing self-service checkout on site, such as a store clerk. After receiving warning information, the on-site monitoring terminal can push it to the on-site monitoring personnel and prompt them to handle it; for example, it may display or play "Missed-scan behavior detected at checkout terminal xx; please go and handle it".
The background monitoring terminal is used by background monitoring personnel to monitor users' scanning behavior. Background monitoring personnel can be staff who monitor in-store videos, and the background monitoring terminal can be any terminal with a video playback function, such as a mobile phone, tablet, computer, smart television, or display. After receiving warning information, the background monitoring terminal can display it to the background monitoring personnel, making it convenient for them to dispatch on-site monitoring personnel or to understand the usage of each self-service checkout terminal on the floor.
In practical application, the self-service checkout terminal can collect a video stream capturing the user's behavior while the user scans commodities, detect the user's behavior from the stream, and judge whether missed scans occurred. Only when the user's behavior satisfies certain conditions, for example no missed scan occurred or the number of missed scans is less than the preset number, is the user allowed to pay normally; otherwise the payment can be blocked, preventing the loss that missed scans would cause merchants.
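A hedged sketch of this settlement gate; the limit of 4 and the return values are illustrative, since the patent only requires a preset number:

```python
MAX_MISSED_SCANS = 4    # illustrative value; the patent only says "preset"

def settlement_action(missed_scan_count: int) -> str:
    """Gate evaluated when the user confirms all commodities are scanned
    (e.g., taps a "complete" button). Return values are illustrative."""
    if missed_scan_count < MAX_MISSED_SCANS:
        return "settle"            # proceed to normal payment
    return "block_and_alert"       # show blocking UI, warn monitoring terminal
```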
The embodiment of the invention uses the video stream to detect whether the user missed a scan, which is a clear improvement over the prior-art approach of settling with a weighing device.
In the prior art, the self-service checkout terminal is equipped with a weighing machine that uses gravity sensing for loss prevention: the weight corresponding to a scanned commodity is compared with the weight on the scale, and an alarm is prompted if they differ. The machine itself occupies considerable space, every commodity must be weighed, and the user experience is poor. The video processing method provided by the embodiment of the invention achieves loss prevention through video processing instead: the user does not perceive it, interference with the user is reduced, the checkout flow is not disturbed, user experience is effectively improved, store space is saved, and the applicable range is wider.
The embodiments of the invention describe the missed-scan behavior in detail as an example. Those skilled in the art can understand that the missed-scan behavior may be replaced by any other first predetermined behavior, such as theft or placing an article in the wrong location; the specific processing can refer to the processing of the missed-scan behavior and is not repeated here.
In summary, the video processing method provided in this embodiment can acquire a video stream capturing user behavior, find the video segment having the first characteristic in the stream, and determine from that segment whether the user exhibited the first predetermined behavior corresponding to the first characteristic. The user can thus be monitored and subsequent settlement processing applied according to the judgment of whether the first predetermined behavior occurred; for example, assisted settlement or an alarm can be triggered depending on whether the user exhibited missed-scan behavior. This avoids or reduces missed scans at checkout, reduces the economic loss of retail stores, and saves manpower and material resources. Because user behavior is analyzed through video processing, the user's shopping and checkout flow is not disturbed, the processing efficiency of shopping and checkout is effectively improved, and the user experience is improved.
In order to improve the accuracy of the algorithm, after the video stream is acquired, whether video segments of suspicious behaviors exist in the video stream can be detected in various detection modes. Specifically, an embodiment of the present invention further provides a video processing method, including: acquiring a video stream for shooting user behaviors; respectively inputting the video streams to a plurality of detection modules, and searching for video segments with first characteristics in the video streams; and determining whether a first preset behavior corresponding to the first characteristic occurs according to the searched video clip.
Still taking the first predetermined behavior as the miss-scanning behavior and the video segment with the first characteristic as the video segment with the suspicious behavior as an example, different detection modules use different detection methods to search for the video segment with the suspicious behavior in the video stream. The detection module in the embodiment of the invention can be any module capable of detecting the user behavior.
Optionally, the plurality of detection modules may include at least two of: the device comprises a track detection module, an optical flow detection module and a segmentation detection module. The track detection module, the optical flow detection module and the segmentation detection module respectively realize the detection of the user behaviors through methods of hand tracks, optical flows, segmentation video flows and the like.
Optionally, inputting the video stream to a trajectory detection module, and searching for a video segment of a suspicious behavior in the video stream may include: detecting position information of hands and/or commodities in each frame of image of the video stream; determining the motion trail of the hand and/or the commodity according to the position information of the hand and/or the commodity in each frame of image; and searching the video segment of the suspicious behavior according to the motion trail of the hand and/or the commodity.
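A possible shape of the trajectory detection module, assuming hypothetical `detect_hand` and `looks_suspicious` helpers; the one-second window is an illustrative choice:

```python
def trajectory_suspicious_segments(frames, detect_hand, looks_suspicious,
                                   fps: float = 25.0):
    """Trajectory-module sketch: detect the hand (and/or commodity) box in
    every frame, string the box centers into a motion trail, then scan the
    trail in one-second windows for suspicious motion patterns (e.g., a
    pass through the scan zone). Returns (start_s, end_s) time windows."""
    trail = []
    for i, frame in enumerate(frames):
        box = detect_hand(frame)                 # (x1, y1, x2, y2) or None
        if box is not None:
            trail.append((i, (box[0] + box[2]) / 2, (box[1] + box[3]) / 2))
    window = int(fps)                            # one-second windows
    segments = []
    for j in range(0, len(trail) - window + 1, window):
        chunk = trail[j:j + window]
        if looks_suspicious(chunk):
            segments.append((chunk[0][0] / fps, chunk[-1][0] / fps))
    return segments
```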
Optionally, inputting the video stream to an optical flow detection module, and searching for a video segment of a suspicious behavior in the video stream may include: detecting the motion trail of a moving target in the video stream by adopting an optical flow tracking algorithm; searching a video segment of a suspicious behavior according to the motion trail of the moving target; wherein the moving target comprises a user's hand and/or merchandise.
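The patent does not name a specific optical flow algorithm; the sketch below uses OpenCV's dense Farneback flow as one plausible choice, with the motion threshold and the "mean of fast-moving pixels" heuristic as assumptions:

```python
import cv2
import numpy as np

def motion_trail(video_path: str, flow_threshold: float = 2.0):
    """Track the dominant moving region frame-to-frame with dense
    (Farneback) optical flow and return its center per frame."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    trail = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)       # per-pixel motion magnitude
        ys, xs = np.where(mag > flow_threshold)  # fast-moving pixels
        if len(xs):
            trail.append((float(xs.mean()), float(ys.mean())))
        prev_gray = gray
    cap.release()
    return trail
```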
Optionally, inputting the video stream to a segmentation detection module, and searching for a video segment of a suspicious behavior in the video stream may include: acquiring a video with preset duration in the video stream; and searching video segments of suspicious behaviors in the videos with the preset duration.
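The segmentation module can be sketched as a simple windowing step; the two-second duration is an assumed example of the "preset duration":

```python
def fixed_windows(frames, fps: float, window_s: float = 2.0):
    """Segmentation-module sketch: cut the stream into clips of a preset
    duration so each clip can be searched for suspicious behavior
    independently. Yields (start_time_s, clip) pairs."""
    step = int(window_s * fps)
    for start in range(0, len(frames), step):
        yield start / fps, frames[start:start + step]
```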
Specifically, after the video segments of one or more suspicious behaviors are found through the plurality of modules, whether missing scanning occurs or not can be determined according to the video segments of one or more suspicious behaviors found.
Optionally, after the tracking process of a commodity is finished, if a video clip of a suspicious behavior appears in the tracking process, it may be determined whether a scanning result of the commodity is obtained in the tracking process; if the scanning result is not obtained, judging whether a missing scanning behavior occurs according to the video segment of the suspicious behavior; wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
The tracking process of the commodity can be determined by the position information of the commodity and the hand in the video stream. Optionally, the position information of the commodity and the hand in the video stream may be detected, the movement track of the commodity and the hand is determined according to the position information of the commodity and the hand, and whether the commodity is held in the hand is determined according to the movement track of the commodity and the hand.
Specifically, if the position of the commodity coincides with or is close to the position of the hand, and the movement locus is similar, the commodity can be considered to be held in the hand. In other alternative implementations, the item may be considered to be held in the hand as long as the area in which the item is located overlaps the area in which the hand is located.
When the commodity is separated from the hand, the hand can be considered to put down the commodity, and the tracking process is finished. Optionally, after determining that the product is held in the hand, if the time for detecting an empty hand (i.e. no product in the hand) exceeds a preset time, it is determined that the tracking process of the product is finished. If the empty hand is detected but the preset time is not exceeded, the tracking process is not considered to be finished, the misjudgment is prevented, and the detection accuracy is improved.
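A minimal sketch of the two tests just described, i.e. the region-overlap test for "commodity held in hand" and the empty-hand timeout that ends a tracking process; the one-second patience value is an assumption:

```python
def boxes_overlap(a, b) -> bool:
    """True if two (x1, y1, x2, y2) boxes intersect at all, i.e. the loose
    'commodity region overlaps hand region' test for deciding that the
    commodity is held in the hand."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def tracking_ended(empty_hand_frames: int, fps: float,
                   patience_s: float = 1.0) -> bool:
    """The tracking process ends only once the hand has been seen empty
    for longer than a preset time, so a single missed detection cannot
    end the track prematurely."""
    return empty_hand_frames / fps > patience_s
```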
In the commodity-tracking detection method described above, it may not be necessary to detect which commodity the user is holding; as long as no commodity is detected in the hand for a sufficiently long time, the user is considered to have put the commodity down, that is, the tracking process of the previous commodity has ended.
Alternatively, the specific type of the commodity may be detected, for example, whether the commodity is a drink or a chewing gum, and if the commodity is detected to be changed in the hand of the user, the tracking process of the previous commodity is ended.
In the embodiment of the invention, the target tracking algorithm can also be adopted to detect the tracking process of the commodity, the accuracy and efficiency of different algorithms can be different, and the algorithms can be selected according to requirements in practical application.
After the tracking process of the commodity is determined, if multiple video segments of suspicious behavior exist within it, whether a missed scan occurred can be judged from the last such segment. Alternatively, any segment that overlaps the last segment can be found, merged with the last segment, and the merged segment used to judge whether a missed scan occurred.
The last video clip in the embodiment of the present invention refers to a video clip with the end time closest to the end time in the commodity tracking process.
Alternatively, it may be determined by a machine learning model whether the behavior in the video segment belongs to a missed scan behavior.
Fig. 6 is a schematic diagram of a method for determining a missing scan logic in a commodity tracking process according to an embodiment of the present invention. As shown in fig. 6, after the tracking process of a commodity is determined, it may be determined whether a video segment of a suspicious behavior exists in the tracking process, and if not, it is determined that a missing scanning behavior does not occur in the tracking process.
And if video segments of suspicious behaviors appear in the tracking process, judging whether a scanning result is obtained in the tracking process. And if the scanning result is obtained in the tracking process of one commodity, determining that the scanning missing behavior does not occur in the tracking process of the commodity.
In the embodiment of the invention, as long as the scanning result is obtained once in the tracking process, the tracking process can be considered to have no missing scanning, and if the scanning result is not obtained once in the tracking process, the last video segment in the tracking process can be verified through the machine learning model.
Optionally, before the verification of the last video segment, if there is another video segment that coincides with the last video segment, the another video segment is merged with the last video segment, and then the merged last video segment is input to the machine learning model to determine whether a behavior in the video segment is a missing scanning behavior. And if the last video clip is not overlapped with any other video clip, directly inputting the last video clip into the machine learning model, and judging whether the behaviors in the video clip belong to the missing scanning behaviors or not.
If the behavior in the last video clip is determined to be the scanning missing behavior, the scanning missing behavior is shown in the commodity tracking process; and if the behavior in the last video clip is not the scanning missing behavior, determining that the scanning missing behavior does not occur in the commodity tracking process.
The following example illustrates this. Suppose the tracking process of a commodity lasts 4 seconds, from the 1.5th second to the 5.5th second of the video stream, and three video segments of suspicious behavior are detected within these 4 seconds: the first from the 2.0th to the 2.4th second, the second from the 3.0th to the 3.5th second, and the third from the 3.3rd to the 3.6th second.
Now that there is a video segment of suspicious behavior in the tracking process, it can be further determined whether a scanning result is obtained in the tracking process. If the scanning result is obtained between the 1.5 th second and the 5.5 th second of the video stream, it can be determined that the scanning missing behavior does not occur in the tracking process of the commodity.
If no scanning result is obtained, the video segments of suspicious behavior may be input to a machine learning model for further confirmation. In the example above, the last video segment is the third one, and since the second and third video segments partially overlap, they can be merged to obtain a video segment from the 3.0th to the 3.6th second.
The video segment from the 3.0th to the 3.6th second of the video stream is then input into the machine learning model to determine whether the behavior is a missed scanning behavior; if so, a missed scanning behavior is considered to have occurred during the commodity tracking process, and otherwise not.
When multiple video segments of suspicious behavior exist in the tracking process of a commodity, detecting only the last video segment (or the merged last video segment) improves the processing efficiency of the video stream.
In other alternative embodiments, all video segments of suspicious behaviors in the tracking process may also be input to the machine learning model for detection, so as to improve the accuracy of detection.
According to the video processing method, video segments of suspicious behavior are searched for jointly by multiple detection modules, which effectively improves algorithm accuracy. In addition, when no scanning result is obtained during the tracking process of a commodity, the video segment can be input into the machine learning model for missed-scan detection, so that whether a missed scanning behavior occurred can be confirmed from the searched video segment, improving both the processing efficiency and the accuracy of video stream analysis.
In the technical solutions provided by the embodiments of the present invention, a specific implementation method for determining whether behaviors in a video clip belong to missing scanning behaviors through a machine learning model may include: determining the confidence coefficient that the video segment of the suspicious behavior belongs to the missing scanning behavior through a machine learning model; and judging whether a missing scanning behavior occurs or not according to the confidence coefficient.
Specifically, the output of the machine learning model may be the confidence that the input video segment belongs to a missed scanning behavior; if the confidence is greater than a preset threshold, a missed scanning behavior is considered to have occurred. For example, the threshold may be 0.6. If the last video segment of the commodity tracking process is input into the machine learning model and the resulting confidence of missed scanning is 0.3, the behavior in the video segment has only a 30% probability of being a missed scan, which is below the 0.6 threshold, so no missed scanning behavior occurred in the whole commodity tracking process. If the resulting confidence is 0.8, the behavior in the video segment has an 80% probability of being a missed scan, and a missed scanning behavior can be considered to have occurred during the commodity tracking process.
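As a minimal sketch, the threshold comparison described above might look as follows in Python; the `predict` interface and the 0.6 threshold are illustrative assumptions rather than a definitive implementation:

```python
# Hedged sketch of the confidence-threshold decision; the model interface
# and the threshold value are assumptions for illustration.
MISS_SCAN_THRESHOLD = 0.6  # assumed preset threshold

def is_missed_scan(model, clip) -> bool:
    """True if the clip's missed-scan confidence exceeds the threshold."""
    confidence = model.predict(clip)  # hypothetical interface returning [0, 1]
    return confidence > MISS_SCAN_THRESHOLD
```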
In other optional embodiments, the determining whether a missed scanning behavior occurs according to the video segment of the suspicious behavior may include: if a plurality of video segments of suspicious behaviors exist in the tracking process, determining the confidence coefficient that the video segment of each suspicious behavior belongs to the missing scanning behavior through a machine learning model; calculating a weighted sum of confidence degrees corresponding to the video segments of the plurality of suspicious behaviors; and if the weighted sum of the plurality of video clips is greater than a preset threshold value, determining that the scanning missing behavior occurs.
As described above, different detection methods may be used to process the video stream and search for video segments with suspicious behaviors, where the weights corresponding to the video segments searched by the different detection methods may be different. For example, the video segments of suspicious behaviors in the video stream are searched through the algorithm a and the algorithm B, where the accuracy of the algorithm a is higher, the weight of the video segment searched through the algorithm a may be higher, and the accuracy of the algorithm B is lower, the weight of the video segment searched through the algorithm B may be lower. Of course, the weights may also be set according to other policies, for example, the weights of all video segments may be set to be the same, and so on.
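A sketch of this weighted fusion, under the assumption of two detectors with illustrative weights and an assumed decision threshold, might be:

```python
# Hedged sketch: weighted sum of per-clip missed-scan confidences.
# Detector names, weights, and the threshold are illustrative assumptions.
DETECTOR_WEIGHTS = {"algorithm_A": 0.7, "algorithm_B": 0.3}
WEIGHTED_SUM_THRESHOLD = 0.5  # assumed preset threshold

def missed_scan_by_weighted_sum(model, clips) -> bool:
    """clips: list of (detector_name, clip) found in one tracking process."""
    weighted = sum(
        DETECTOR_WEIGHTS[name] * model.predict(clip)  # confidence in [0, 1]
        for name, clip in clips
    )
    return weighted > WEIGHTED_SUM_THRESHOLD
```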
In the technical solutions provided by the embodiments of the present invention, determining, by a machine learning model, a confidence that a video segment of a suspicious behavior belongs to a missing scanning behavior may include: determining a corresponding machine learning model according to the type of the video segment of the suspicious behavior; and inputting the video clip into the corresponding machine learning model to obtain the confidence coefficient of the video clip belonging to the missing scanning behavior.
In an alternative embodiment, the video segments may be classified according to Table 1. Optionally, a video segment may be divided into two types, code-scanning missed-scan behavior and direct bagging behavior, or into more detailed types: block code scanning, back code scanning, scanning too fast, A to B, other zones to B, etc. When a video segment of suspicious behavior is searched for in step 602, the type of the found video segment may be determined.
Accordingly, machine learning models can also be divided into various types: a machine learning model for confirming code scanning missing scanning behaviors and a machine learning model for confirming direct bagging behaviors, or more specifically, multiple types. When training the machine learning model, the machine learning model may be trained according to the corresponding type of samples. When the video clip needs to be confirmed whether to be the miss-scanning behavior, the video clip can be processed by adopting a machine learning model of a corresponding type.
For example, a plurality of video segments of suspicious behaviors are detected in the tracking process of the commodity, and if the last video segment is a video segment of a code scanning missing scanning behavior, a machine learning model for identifying the code scanning missing scanning behavior can be adopted to determine whether the video segment is the missing scanning behavior; if the last video clip is a video clip of the direct bagging behavior, a machine learning model for identifying the direct bagging behavior can be adopted to confirm whether the video clip is the missing scanning behavior.
The implementation principles of different types of machine learning models may be similar; for example, each may be implemented with a DNN (Deep Neural Network), but with different training samples, so that each model is trained specifically for its behavior type, improving the accuracy of detecting whether video segments of different types are missed scanning behaviors.
In other embodiments, other classification methods may be used. For example, video clips can be divided into three types: video clips obtained by the trajectory detection module, video clips obtained by the optical flow detection module, video clips obtained by the segmentation detection module, and the like.
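Whichever classification scheme is used, routing a video clip to the model trained for its type could be sketched with a simple registry; the type names and the `ModelFn` interface below are assumptions for illustration:

```python
from typing import Callable, Dict

# A model maps a clip to a missed-scan confidence in [0, 1]; hypothetical.
ModelFn = Callable[[object], float]

def build_registry(scan_model: ModelFn, bagging_model: ModelFn) -> Dict[str, ModelFn]:
    # Illustrative type names; one model per suspicious-behavior type.
    return {
        "code_scan_miss": scan_model,
        "direct_bagging": bagging_model,
    }

def confidence_for_clip(registry: Dict[str, ModelFn], clip_type: str, clip) -> float:
    # Route the clip to the model trained on samples of its type.
    return registry[clip_type](clip)
```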
Fig. 7 is a flowchart illustrating a second video processing method according to an embodiment of the present invention. As shown in fig. 7, the video processing method in this embodiment may include:
and step 701, acquiring a video stream for shooting user behaviors.
And step 702, detecting the motion trail of the moving target in the video stream by adopting an optical flow tracking algorithm.
Step 703, searching for a video segment with the first characteristic in the video stream according to the motion trajectory of the moving target.
Step 704, determining whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
Here, the Optical Flow tracking algorithm is an important current method for analyzing moving images; its concept was first proposed by James J. Gibson in the 1940s. When an object moves, the brightness pattern of its corresponding points on the image moves as well, and this Apparent Motion of the image brightness pattern is the optical flow.
Optionally, in this embodiment, the fast optical flow calculation method "Fast Optical Flow using Dense Inverse Search" (DIS) may be adopted to compute the motion of moving objects in the video stream. After a moving target in the video stream is detected by the optical flow tracking algorithm, it can be treated as the user's hand and/or a commodity, so that video segments of suspicious behavior can be searched for according to the motion trajectory of the hand and/or the commodity.
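A minimal sketch of DIS optical flow using OpenCV's implementation (one common realization of this algorithm; the preset choice and frame format are assumptions) might be:

```python
import cv2

# Dense Inverse Search (DIS) optical flow between consecutive frames.
dis = cv2.DISOpticalFlow_create(cv2.DISOPTICAL_FLOW_PRESET_FAST)

def flow_between(prev_frame, frame):
    """Return a dense flow field (H x W x 2) for two consecutive BGR frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return dis.calc(prev_gray, gray, None)
```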
Optionally, searching for a video segment with a first characteristic in the video stream according to the motion trajectory of the moving target may include: and if the mobile target is detected to enter a preset area, or the mobile target is detected to leave the preset area, confirming that the video clip with the first characteristic appears.
Or, searching for a video segment with a first feature in the video stream according to the motion trajectory of the moving target may include: and if the moving target is detected to leave the preset area after entering the preset area, determining that the video clip with the first characteristic appears.
The preset area may be a code scanned area, or the preset area may be an area closer to the scanning device. Taking the latter as an example, the preset region may be a region within a preset range of the scanning device. The scanning device is used for acquiring a corresponding scanning result when a user scans a commodity.
Optionally, the preset area may be an area that is smaller than a preset distance value from the scanning device in the vertical direction. Specifically, the optical flow may be analyzed, and the motion trajectory in the Y direction may be modeled and decomposed into two motion modes, i.e., approaching and leaving, so as to determine whether suspicious behavior exists. When the moving object approaches the scanning device and leaves, a suspicious behavior can be considered to occur.
Of course, approaching and departing in the X direction may also be used to confirm that a suspicious behavior occurs, or the X direction and the Y direction are combined, and if the moving object enters a preset range and departs in the X direction and the Y direction, it is indicated that a suspicious behavior occurs.
In addition, after the moving object is detected to approach and leave, the moving object is identified, if the moving object is a commodity, suspicious behaviors are determined to occur, and if the moving object is a non-commodity, such as a mobile phone or a bag, the suspicious behaviors are not considered to occur.
Correspondingly, searching for a video segment with a first characteristic in the video stream according to the motion trajectory of the moving object may include: if the moving target leaves after entering a preset area, identifying the moving target, and judging whether the moving target comprises commodities; and if the moving target comprises a commodity, determining that a video clip with a first characteristic appears.
The start time of the video segment with the first feature may be a time of entering the preset area, and the end time of the video segment with the first feature may be a time of leaving the preset area.
That is, the start-stop time of the suspicious activity may be determined according to the time of entering and/or leaving the preset area. Optionally, the start time of the video segment of the suspicious activity may be a time of entering the preset area, and the end time of the video segment of the suspicious activity may be a time of leaving the preset area.
Alternatively, the adjustment may be performed according to actual needs, for example, the start time of the video segment of the suspicious activity may be N seconds before the time of entering the preset region, and the end time of the video segment of the suspicious activity may be M seconds after the time of leaving the preset region, where N and M are both real numbers.
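Under these assumptions, extracting candidate clips from a target's trajectory might be sketched as follows; the trajectory format, the area test, and the padding values N and M are illustrative:

```python
# Hedged sketch: a clip starts when the target enters the preset area and
# ends when it leaves, padded by n/m seconds as described above.
def find_first_feature_clips(trajectory, in_area, n=0.5, m=0.5):
    """trajectory: list of (timestamp, position); in_area: position -> bool.
    Returns (start, end) times of candidate clips of the first characteristic."""
    clips, enter_time, inside = [], None, False
    for t, pos in trajectory:
        now_inside = in_area(pos)
        if now_inside and not inside:
            enter_time = t                          # target entered the area
        elif inside and not now_inside and enter_time is not None:
            clips.append((max(0.0, enter_time - n), t + m))  # target left
        inside = now_inside
    return clips
```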
Optionally, after determining the video segment in which the suspicious behavior occurs, the accurate start time and the accurate end time of the video segment may also be determined according to a machine learning model. Specifically, video segments of suspicious behavior may be input to a machine learning model, from which accurate start and end times are determined.
After determining the exact start and end times, the video segments may be processed by DNN. Specifically, after the video segment of the suspicious behavior is acquired, whether the video segment belongs to the missed scanning behavior may be determined directly through DNN, or the method in this embodiment may be used in combination with the methods in other embodiments. For example, the tracking process of the goods and the obtained scanning result can be combined to comprehensively determine whether to input the video clip into the DNN for further confirmation.
Optionally, in the tracking process of a commodity, video clips of a plurality of suspicious behaviors may be detected, for example, a user may hold the commodity to go in and out of a code scanned area, but as long as a scanning result is obtained in the tracking process of the commodity, it can be considered that no missing scanning behavior occurs, and if a scanning result is not obtained, the last video clip may be input into the DNN for confirmation.
Of course, it is also possible to use multiple detection methods in combination with other detection methods to simultaneously detect the video stream and search the video segments of the suspicious behavior therein.
Current mainstream behavior analysis methods work offline: the start and end times and the type of an action can only be predicted after a complete video segment containing the action has been observed, which makes them unsuitable for scenarios requiring real-time early warning. This embodiment adopts a simple and efficient optical-flow-based solution that can predict and judge missed scanning actions in real time, so that a user's missed scan can be recognized at the earliest possible moment.
In summary, in the video processing method provided by this embodiment, an optical flow tracking algorithm can be used to detect the motion trajectory of a moving target in the video stream, video segments with the first characteristic can be searched for according to that trajectory, and a machine learning model can be used to determine whether the user exhibits a first predetermined behavior, such as a missed scanning behavior. This effectively addresses commodity loss prevention from the visual dimension and improves the processing efficiency of user checkout, without restricting the user's operating behavior, thereby improving the user's operating experience. Moreover, video segments with the first characteristic can be detected promptly based on optical flow, meeting the requirements of real-time monitoring and early warning.
Fig. 8 is a flowchart illustrating a video processing method according to a third embodiment of the present invention. As shown in fig. 8, the video processing method in this embodiment may include:
step 801, acquiring a sensing signal sent by a sensing device.
And step 802, determining the motion track of the hand of the user according to the sensing signal.
And 803, searching for a video clip with the first characteristic in the video stream for shooting the user behavior according to the motion track of the hand.
And step 804, determining whether a first preset behavior corresponding to the first characteristic occurs according to the video clip.
Optionally, searching for a video segment with a first characteristic in a video stream capturing user behavior according to the motion trajectory of the hand may include: judging whether the hand of the user enters a preset area and leaves; and if so, determining that a video clip with the first characteristic appears in the video stream for shooting the user behavior.
Wherein the sensing means may be any type of device capable of detecting the position of a hand. The sensing signal may be any signal capable of representing a change in the position of the hand.
In an alternative implementation, the sensing device may be a distance sensor that can detect the distance from the surrounding obstacle to itself. Accordingly, the sensing signal may be a distance between the hand and the distance sensor. After the distance between the hand and the distance sensor is acquired, the motion track of the hand can be determined according to the distance.
Alternatively, the distance sensor may be provided at a position capable of detecting whether the user's hand enters or leaves the preset area. For example, the distance sensor may be disposed beside the scanning device, and when the hand of the user approaches or leaves the scanning device, the distance between the distance sensor and the user becomes smaller and larger.
In this way, the sensing signal detected by the distance sensor can determine whether the condition for triggering a video segment of suspicious behavior is met. Specifically, if the sensing signal changes from greater than a preset value to smaller than the preset value, and then back to greater than the preset value, an approach-and-leave cycle has occurred, and a video segment of suspicious behavior can be considered to have appeared.
Similar to the method described above, the start time of the video segment of the suspicious activity may be the time when the hand enters the preset area, and the end time may be the time when the hand leaves the preset area.
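A sketch of this approach-then-leave trigger from distance-sensor readings (the preset distance and the reading format are assumptions) might be:

```python
# Hedged sketch: a reading dropping below the preset value and later rising
# back above it marks one enter/leave cycle of the preset area.
PRESET_DISTANCE = 0.3  # meters; illustrative

def detect_enter_leave(readings):
    """readings: iterable of (timestamp, distance). Yields (enter_t, leave_t)."""
    enter_t = None
    for t, d in readings:
        if d < PRESET_DISTANCE and enter_t is None:
            enter_t = t                # hand came within the preset distance
        elif d >= PRESET_DISTANCE and enter_t is not None:
            yield (enter_t, t)         # hand moved away again
            enter_t = None
```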
In another alternative implementation, the sensing device may be an infrared sensor. The infrared sensor may also detect whether the user approaches or leaves a preset area, and the video segment of the suspicious behavior in the video stream may be searched according to the sensing signal fed back by the infrared sensor, and the specific implementation principle and process may refer to the foregoing embodiments, which are not described herein again.
In summary, the video processing method provided in this embodiment may utilize the sensing device to detect the motion trajectory of the hand, search for the video segment with the first characteristic in the video stream according to the motion trajectory, and determine whether the user has the first predetermined behavior, such as whether the user has the miss-scanning behavior, according to the video segment, so as to effectively improve the accuracy and speed of the detection.
The video clips may be searched by other methods besides the method of searching the video clips according to the optical flow or the sensing information as described in the above embodiments. For example, a video segment with the first feature in the video stream can be found by analyzing the movement track of the hand in the video or by analyzing the video with a preset time length.
An embodiment of the present invention further provides a video processing method, including: acquiring a video stream for shooting user behaviors; determining a movement trajectory of a hand of a user in the video stream; searching a video segment with a first characteristic in the video stream according to the movement track of the hand in the video stream; determining whether a first predetermined behavior corresponding to the first feature occurs according to the video clip.
Specifically, the position information of the hand of the user in each frame image of the video stream may be detected, and the motion trajectory of the hand may be determined according to the position information of the hand in each frame image.
Optionally, searching for a video segment with a first characteristic in the video stream according to the movement trajectory of the hand in the video stream may include any one of the following: if the hand of the user enters the code-scanned area from a non-code-scanned area, determining that a video clip with the first characteristic appears; if the hand of the user enters the code-scanned area from a non-code-scanned area and the time interval since the hand last entered the code-scanned area is greater than a preset interval, determining that a video clip with the first characteristic appears; and if the hand of the user enters the code-scanned area from a non-code-scanned area and the farthest distance of the hand from the code-scanned area since it last left that area is greater than a preset distance, determining that a video clip with the first characteristic appears. The following description takes the missed scanning behavior as an example.
Specifically, whether the hand of the user enters the code-scanned area from the non-code-scanned area or not can be judged through the motion track of the hand of the user. The scanned region may be the region B in fig. 5, and the non-scanned region may refer to any region other than the region B, may be the region a, or may be other regions other than the region a and B.
If the user's hand enters the scanned region from a non-scanned region, then a video segment with suspicious behavior may be determined to have occurred. The video segments of the suspicious behavior may be video segments within a period of time before and after a time when the video segments enter the code-scanned area.
Optionally, the video segment of the suspicious behavior may be a video segment in a first preset time period before entering the code-scanned area and a video segment in a second preset time period after entering the code-scanned area.
Assuming that the first preset time period and the second preset time period are both t0, then if the user's hand enters the code-scanned area from a non-code-scanned area at time T, the video during [T - t0, T + t0] may be taken as the video clip of the suspicious behavior. A more intuitive example: if both preset time periods are 1 second and the user's hand enters the code-scanned area at the 15th second of the video stream, the corresponding video segment of suspicious behavior is the 14th to 16th seconds of video.
If the user's hand enters the scanned area from the non-scanned area several times in the video stream, a plurality of corresponding video segments can be found.
Optionally, to avoid false alarms caused by shaking of the user's hand, a suspicious behavior is registered when the hand enters the code-scanned area; but when the hand re-enters the code-scanned area after leaving it, if the interval between the two entries is short or the hand's moving distance is small, no new suspicious behavior is considered to have occurred.
That is, when the moving distance and/or moving time of the hand at the boundary of the scanned code region is short, it may be considered as the shaking behavior of the hand, not the behavior of entering the scanned code region for the second time.
In an optional implementation manner, searching for a video segment with suspicious behavior according to the motion trajectory of the hand of the user may include: and if the hand of the user enters the code-scanned area from the non-code-scanned area and the time interval between the hand of the user and the hand of the user entering the code-scanned area last time is larger than a preset interval, determining the video segment with suspicious behaviors.
When it is detected that the user's hand enters the code-scanned area from a non-code-scanned area, if the time interval since the hand last entered the code-scanned area is smaller than the preset interval, the current behavior is not considered suspicious. The preset interval may be, for example, 1 second. If the user's hand is detected entering the code-scanned area at the 15.5th second, quickly leaving, and re-entering at the 15.8th second, the hand can be considered to have jittered at the boundary of the code-scanned area between the 15.5th and 15.8th seconds, and the two entries count as only one suspicious behavior, not two.
In another optional implementation manner, searching for a video segment with suspicious behavior according to the motion trajectory of the hand of the user may include: and if the hand of the user enters the code-scanned area from the non-code-scanned area and the farthest distance between the hand of the user and the code-scanned area after the hand of the user leaves the code-scanned area last time is greater than the preset distance, determining the video segment with suspicious behaviors.
When it is detected that the user's hand enters the code-scanned area from a non-code-scanned area, if the hand did not move far after last leaving the code-scanned area, i.e., its farthest distance from the area remained smaller than the preset distance, the current behavior is not considered suspicious. The preset distance may be, for example, 5 cm. If the user's hand is detected entering the code-scanned area at the 10th second, leaving, and re-entering at the 12th second, while its distance from the code-scanned area stayed below 5 cm throughout, the hand can be considered to have hovered at the boundary of the code-scanned area between the 10th and 12th seconds, and the two entries count as only one suspicious behavior, not two.
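A sketch combining the two de-jitter rules above (the 1-second interval and the 5 cm distance are the example thresholds; treating them jointly is an assumption, since the embodiments present them as alternatives):

```python
MIN_REENTRY_INTERVAL = 1.0      # seconds, example preset interval
MIN_DEPARTURE_DISTANCE = 0.05   # meters (5 cm), example preset distance

def is_new_suspicious_entry(entry_time, last_entry_time,
                            max_distance_since_leave):
    """Count a re-entry only if enough time passed AND the hand moved far
    enough away since last leaving the code-scanned area."""
    if last_entry_time is None:
        return True  # first entry is always counted
    long_enough = entry_time - last_entry_time > MIN_REENTRY_INTERVAL
    far_enough = max_distance_since_leave > MIN_DEPARTURE_DISTANCE
    return long_enough and far_enough
```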
The method for judging whether the video segments of suspicious behaviors appear or not is provided by taking the hands of the user as targets. Similarly, the video clips of suspicious behavior may also be determined to target the merchandise. The specific realization principle and process using the commodity as the target are similar to the realization principle and process using hands, and the realization method using the commodity as the target can be obtained by replacing hands with the commodity in the method.
Further, in order to increase the detection accuracy, the user's hand and the commodity can be used as a target, and whether a video clip of suspicious behavior appears or not can be judged together through the track of the hand and the track of the commodity.
Optionally, searching for a video segment with a first characteristic in the video stream according to the movement trajectory of the hand in the video stream may include any one of the following: if the user's hand and a commodity enter the code-scanned area from a non-code-scanned area, determining that a video clip with the first characteristic appears; if the user's hand and a commodity enter the code-scanned area from a non-code-scanned area and the time interval since they last entered the code-scanned area is greater than a preset interval, determining that a video clip with the first characteristic appears; and if the user's hand and a commodity enter the code-scanned area from a non-code-scanned area and their farthest distance from the code-scanned area since last leaving it is greater than a preset distance, determining that a video clip with the first characteristic appears. The following description takes the missed scanning behavior as an example.
The hand and the commodity enter the code scanning area, which may be the hand holding the commodity. The video clips with the first characteristic can be video clips in a first preset time period before the hand-held commodity enters the code scanning area and a second preset time period after the hand-held commodity enters the code scanning area.
When both the user's hand and the commodity are used as detection targets to judge whether a video clip of suspicious behavior appears, no suspicious behavior is considered to occur if only the hand enters the code-scanning area without a commodity; suspicious behavior is considered to occur only when the hand and the commodity enter the code-scanning area together.
After the video segments are found, the video segments of the suspicious behaviors can be detected through a machine learning model, and whether the missing scanning behaviors occur or not is determined.
Alternatively, the video segments may be processed by DNN. The DNN recognition rate is high, and whether the video clip belongs to the scanning missing behavior can be accurately determined.
Optionally, after the video segment of the suspicious behavior is acquired, whether the video segment belongs to the missed scanning behavior may be determined directly by the DNN, or the method in this embodiment may be used in combination with the methods in the foregoing embodiments. For example, the tracking process of the goods and the obtained scanning result can be combined to comprehensively determine whether to input the video clip into the DNN for further confirmation.
Optionally, in the tracking process of a piece of merchandise, video segments of a plurality of suspicious behaviors may be detected, for example, a user may hold the merchandise to go in and out of a code-scanned area, but as long as a scanning result is obtained at least once in the tracking process of a piece of merchandise, it can be considered that a missed scanning behavior does not occur, and if a scanning result is not obtained, the last segment of video segment may be input into the DNN for confirmation.
Of course, it is also possible to use multiple detection methods in combination with other detection methods to simultaneously detect the video stream and search the video segments of the suspicious behavior therein.
The video processing method can detect the motion trajectory of the user's hand in the video stream, analyze the state of the hand according to that trajectory, and thereby determine whether a video clip with the first characteristic appears; combined with a machine learning model, it determines whether the user exhibits a first predetermined behavior such as a missed scanning behavior. This effectively addresses commodity loss prevention from the visual dimension and improves the processing efficiency of user checkout, without restricting the user's operating behavior, thereby improving the user's operating experience.
The embodiment of the invention also provides a video processing method, which comprises the following steps: acquiring a video with preset duration in a video stream for shooting user behaviors; searching a video clip with a first characteristic in the video with the preset duration; determining whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
Optionally, in this embodiment, after the video stream is obtained, the video stream may be segmented to obtain videos with preset durations, and the videos with the preset durations are respectively processed to determine whether video segments with suspicious behaviors exist in the videos with the preset durations.
This step specifically acquires the video of preset duration. Optionally, acquiring the video of preset duration from the video stream may include: determining the start time of the user's code-scanning checkout; and segmenting the video stream after the start time according to the preset duration to obtain videos of the preset duration. The preset duration may be set according to actual needs; for example, it may be 5.2 seconds, i.e., every 5.2 seconds of video in the video stream is processed.
Specifically, the time when the user starts to perform code scanning checkout is taken as the 0 th second of the video stream, then the 0 th to 5.2 th seconds are one video, the 5.3 th to 10.4 th seconds are one video, the 10.5 th to 15.6 th seconds are one video, and so on, the video stream can be divided into a plurality of videos.
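A sketch of this fixed-window segmentation, with the 5.2-second example duration, might be:

```python
WINDOW = 5.2  # seconds, the example preset duration

def window_bounds(stream_duration, start_time=0.0, window=WINDOW):
    """Yield (start, end) times of consecutive preset-duration videos,
    measured from the start of the user's code-scanning checkout."""
    t = start_time
    while t < stream_duration:
        yield (t, min(t + window, stream_duration))
        t += window
```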
There may be many ways to determine the start time of the user code-scanning checkout. Optionally, a checkout starting instruction input by the user may be obtained, and according to the checkout starting instruction, the start time of code scanning checkout of the user is determined.
For example, the user may be provided with an option to start the operation, a key, etc. for selection by the user. Alternatively, a "start" button may be displayed on the display device of the self-service checkout terminal, and in response to an operation event in which the user clicks the "start" button, the start time of the user code-scanning checkout may be determined to be the time at which the user clicks the "start" button.
Or, the start time of the user code scanning checkout may be determined according to the time when the first scanning result is obtained. Specifically, after the user takes the first commodity and scans the first commodity successfully, the start time of the code scanning and checkout of the user may be determined as the time of obtaining the scanning result of the first commodity. Therefore, the user does not need to click to start manually, and the time for the user to check out the account by scanning the code is saved.
Optionally, the capturing of the video stream may be started after the user starts to scan the code and check out. When the user does not start to scan the commodity, the video acquisition function can not be started temporarily, and the resource consumption is effectively reduced.
After the video stream is acquired, video segments of suspicious behaviors can be searched for each video with preset duration. There are many methods for searching for video segments of suspicious behavior from videos with preset duration. For example, a video segment meeting the requirement can be extracted from a video with a preset duration through a machine learning method.
Optionally, the 3D convolution feature of the video with the preset duration may be extracted, and the video segment of the suspicious behavior in the video with the preset duration may be determined according to the convolution feature.
Specifically, the 3D convolution feature described in this embodiment may be an Inflated 3D ConvNet (I3D) feature, or another 3D convolution feature such as a Pseudo-3D ConvNet (P3D) feature. Based on the extracted 3D convolution features, suspicious behaviors in the video can be detected and identified with an Action Proposal Network. Then, a normalization operation may be performed on the detected video segments of suspicious behavior so that the features of all video segments have a uniform size, which facilitates subsequent processing; for example, the later judgment of whether a video segment of suspicious behavior is a missed scanning behavior can also be implemented with 3D convolution features.
In other optional implementation manners, after the video segments of the suspicious behaviors are found through the 3D convolution features, if the number of the video segments is multiple, the video segments of the suspicious behaviors may be merged. For convenience of description, a video segment directly found by the 3D convolution feature is referred to as a sub-segment herein.
Optionally, searching for a video segment of a suspicious behavior in the video with the preset duration may include: searching sub-segments of suspicious behaviors in the video with the preset duration; if the video has a plurality of sub-segments of suspicious behaviors, calculating the confidence coefficient that the missing scanning behaviors exist at each time point in each sub-segment; and obtaining at least one video segment of the suspicious behavior according to the confidence coefficient of the missing scanning behavior existing at each time point.
Specifically, for each 5.2 seconds of video, the 3D convolution feature may be used to find out the sub-segments of the suspicious behaviors, for example, there are 3 sub-segments of the suspicious behaviors in the 5.2 seconds of video, the time lengths of the sub-segments of the suspicious behaviors may be the same or different, and there may be overlapping portions in the sub-segments of the suspicious behaviors.
For each of the 3 sub-segments, a confidence that there is a missing scan behavior in each sub-segment may be calculated, and at times other than the 3 sub-segments in the 5.2 second video, the confidence may be considered to be 0. When the confidence of the missing scanning behavior of each sub-segment is calculated, a curve can be output to represent the confidence of each time point in the sub-segment.
Specifically, for each sub-segment, a confidence coefficient change curve of all the time lengths of the sub-segment may be calculated, or confidence coefficients corresponding to a plurality of time points in the sub-segment may be calculated, and the confidence coefficients corresponding to the plurality of time points are connected into a smooth curve, so as to obtain the confidence coefficient corresponding to the sub-segment.
After the confidence corresponding to each sub-segment is determined, all the detected sub-segments of the suspicious behavior can be traversed, the confidence is accumulated in the time dimension, and the final sub-segment of the suspicious behavior is determined according to the accumulated confidence.
Optionally, obtaining the video segment of the at least one suspicious behavior according to the confidence that the missing scanning behavior exists at each time point may include: for each time point, overlapping the corresponding confidence coefficients of the time point in each sub-segment to obtain a combined confidence coefficient corresponding to the time point; searching a time point with the merging confidence coefficient larger than a preset threshold value; and obtaining at least one video segment of the suspicious behavior according to the searched time point.
For example, the first 5.2 seconds of the video stream are processed to obtain a plurality of sub-segments, two of which contain the 1.5th second of the video stream: the 1st to 2nd seconds form the sub-segment of the first suspicious behavior, and the 1.5th to 3rd seconds form the sub-segment of the second suspicious behavior. In the first sub-segment, the confidence corresponding to the 1.5th second is 1, i.e., when the first sub-segment is processed, the confidence that the 1.5th second belongs to a missed scanning behavior is 1. In the second sub-segment, the confidence corresponding to the 1.5th second is 0.8. The merged confidence corresponding to the 1.5th second of the video stream is therefore 1 + 0.8 = 1.8.
Then, video segments of suspicious behavior can be determined according to the combined confidence of the time points. Specifically, a segment with a merging confidence greater than a specified threshold (e.g., 1.0) may be taken as a video segment of suspicious behavior.
Fig. 9 is a schematic diagram of merging confidence provided in the embodiment of the present invention. As shown in fig. 9, segments with a merged confidence greater than 1.0 may be taken as video segments of suspicious behavior. For example, if the confidence between the 3.6th second and the 4th second is greater than 1.0, then suspicious behavior is considered to occur from the 3.6th to the 4th second.
Further, if the time interval between any two video clips is smaller than the preset time interval, the two video clips are merged. For example, the preset time interval may be 0.25 seconds, and when the interval between two segments is less than 0.25 seconds, the two segments are merged.
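A sketch of this post-processing (accumulating sub-segment confidence curves on a shared timeline, thresholding the merged curve, and joining nearby segments; the sampling rate and the curve interface are assumptions, while the 1.0 threshold and the 0.25-second gap follow the examples above):

```python
import numpy as np

def merge_suspicious_segments(sub_segments, duration, fps=10,
                              threshold=1.0, max_gap=0.25):
    """sub_segments: list of (start, end, conf_fn); conf_fn(t) -> confidence.
    Returns [start, end] pairs of merged video segments of suspicious behavior."""
    ts = np.arange(0.0, duration, 1.0 / fps)
    merged = np.zeros_like(ts)
    for start, end, conf_fn in sub_segments:
        mask = (ts >= start) & (ts <= end)
        merged[mask] += np.array([conf_fn(t) for t in ts[mask]])
    # Extract runs where the merged confidence exceeds the threshold.
    segments, run_start = [], None
    for t, c in zip(ts, merged):
        if c > threshold and run_start is None:
            run_start = t
        elif c <= threshold and run_start is not None:
            segments.append([run_start, t])
            run_start = None
    if run_start is not None:
        segments.append([run_start, float(ts[-1])])
    # Join segments separated by less than max_gap seconds.
    joined = []
    for seg in segments:
        if joined and seg[0] - joined[-1][1] < max_gap:
            joined[-1][1] = seg[1]
        else:
            joined.append(seg)
    return joined
```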
After the video segments of suspicious behavior are found, a machine learning model analyzes, for each video segment, whether a code-scanning behavior or a behavior of moving a commodity into the code-scanning area exists in the segment. If so, it is judged whether a scanning result for the commodity was obtained between the start time and the end time of the video segment; if no scanning result was obtained, a missed scan is confirmed, and otherwise no missed scan is considered to exist.
Or after the tracking process of a commodity is finished, if a video segment of a suspicious behavior appears in the tracking process, judging whether a scanning result of the commodity is acquired in the tracking process; and if the scanning result is not obtained, judging whether a missing scanning behavior occurs or not through a machine learning model according to the video segment of the suspicious behavior.
In practical applications, three modules may be used: the action detection module, the post-processing module and the action verification module can realize the functions, the three modules can adopt a cascade structure, and the missing scanning action can be more finely detected and classified through the cascade structure.
The motion detection module firstly performs motion detection on an input video stream, processes videos every 5.2 seconds, and detects sub-segments of suspicious behaviors from the video stream. Specifically, the action detection module may determine sub-segments of the suspicious action by means of 3D convolution, and determine, for each sub-segment, a start time, an end time, and a confidence that each intermediate time point belongs to the code scanning action, where there may be an overlap between the sub-segments.
Then, the post-processing module performs post-processing on the sub-segments detected from the video, and connects the segments adjacent in the time dimension to form a complete video segment. Specifically, for each time point, the post-processing module may add the corresponding confidence of the time point in each sub-segment, and output a segment with a confidence greater than 1 (a segment above the middle horizontal line in fig. 11), and may combine the segments close to each other to output a complete video segment.
And finally, the action verification module further confirms each complete video clip to determine whether the missing scanning occurs or not, so that more accurate missing scanning time clips can be obtained, and interference in action detection is eliminated.
The video processing method can acquire the video with the preset time length in the video stream, search the video segment with the first characteristic in the video with the preset time length, detect the video segment through the machine learning model, determine whether the first preset behavior occurs or not, monitor the user under the condition that the user does not sense, and has simple logic and easy realization.
In an offline retail scenario, the checkout process for each commodity is very short, so the commodity detection and customer posture estimation algorithms must run accurately in real time under limited computing resources.
The embodiment of the invention also provides a video processing method which can detect the position of the commodity in the image and the user posture information. The method can comprise the following steps: processing images in the video stream to obtain a semantic feature map corresponding to the images; and detecting the position information of the commodity and the posture information of the user in the image according to the semantic feature map.
Specifically, a video stream for capturing user behavior may be acquired and decoded to obtain a frame-by-frame image. Then, each frame of image can be processed, and the semantic feature map corresponding to each frame of image is determined. Wherein, the image can be any type of image such as RGB image, gray scale image, YUV image, etc.
Optionally, before determining the semantic feature map corresponding to an image, the image may be centered and scale-normalized. Centering means subtracting a mean from the pixel value of each pixel point in the image, and scale normalization means dividing each mean-subtracted pixel value by the variance, which aids convergence and improves the training of the subsequent model. The mean and the variance here refer to the mean and variance of the pixel values of the pixel points in all images of the video samples.
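A sketch of this preprocessing step (the text above divides by the variance; dividing by the standard deviation is the more common variant, so the divisor here is an assumption following the text):

```python
import numpy as np

def normalize(image: np.ndarray, mean: float, var: float) -> np.ndarray:
    """Center each pixel value and scale it; mean/var are precomputed over
    the pixel values of all images in the video samples."""
    return (image.astype(np.float32) - mean) / var
```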
In this embodiment, processing the image in the video stream to obtain the semantic feature map corresponding to the image may include: calculating a characteristic vector corresponding to each pixel point according to the pixel value of each pixel point of the image in the video stream; the semantic feature map corresponding to the image comprises feature vectors corresponding to all pixel points in the image; and the feature vector corresponding to the pixel point comprises probability information of the pixel point belonging to each semantic feature.
For a frame of image, the corresponding semantic feature map comprises probability information of each pixel point in the image belonging to each semantic feature.
The semantic features may be any features, such as a person's hand, a person's eye, a commodity, a table, etc. If 128 semantic features are preset, then in this step, for each pixel point, the probability information that the pixel point belongs to each semantic feature may be calculated to obtain the feature vector corresponding to that pixel point. The feature vector contains the probability information of the pixel point belonging to each semantic feature; that is, it contains 128 values, each representing the probability information of the pixel point belonging to one semantic feature.
The probability information represents the intensity of the pixel point belonging to the semantic feature, the probability information can be the probability without normalization, and the larger the numerical value is, the larger the probability representing the pixel point belonging to the semantic feature is.
Optionally, the semantic feature map corresponding to each frame of image may be extracted by combining a bottom-up channel-level convolution and a 1x1 group convolution with top-down scale pyramid feature fusion.
The computational cost of channel-level convolution and 1x1 group convolution is lower than that of ordinary convolution with kernels of the same size, so the forward convolution operation is cheap; and scale pyramid feature fusion can fuse image features of different semantic levels, giving the feature representation strong discriminability.
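A sketch of one such bottom-up block (channel-level, i.e. depthwise, convolution followed by a 1x1 group convolution) plus a top-down pyramid fusion step, in PyTorch; all channel counts and group sizes are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class DepthwisePointwiseBlock(nn.Module):
    def __init__(self, in_ch, out_ch, groups=4):
        super().__init__()
        # Channel-level convolution: one 3x3 filter per input channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        # 1x1 group convolution mixes channels at low computational cost.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, groups=groups)

    def forward(self, x):
        return F.relu(self.pointwise(F.relu(self.depthwise(x))))

def fuse_top_down(coarse, fine):
    # Upsample the semantically stronger coarse map and add it to the finer
    # map, as in scale-pyramid feature fusion.
    return fine + F.interpolate(coarse, size=fine.shape[-2:], mode="nearest")
```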
And processing each frame of image to obtain a semantic feature map corresponding to each frame of image.
In this embodiment, the position information of the commodity can be determined according to the semantic feature map, and the posture of the user can be estimated at the same time. The position information of the commodity and the user posture estimation can be realized by adopting any target detection method and any posture estimation method.
Optionally, detecting the position information of the commodity and the posture information of the user in the image according to the semantic feature map may include the following steps a to d:
and a, predicting the position information of a plurality of candidate objects in the image according to the semantic feature map.
Optionally, predicting the position information of the multiple candidate objects in the image according to the semantic feature map may include: and predicting the position information of a plurality of candidate objects in the image aiming at the feature vector corresponding to each pixel point.
Specifically, for each pixel point, the positions of a plurality of candidate objects around the pixel point can be predicted according to the feature vector corresponding to the pixel point. The number of candidate objects predicted by each pixel point can be set according to actual needs. For example, for each pixel point, the position information of 15 candidates around the pixel point can be predicted.
The position information of the candidate object may include the coordinates of the center point of the candidate object and the length and width of the rectangular frame in which the candidate object is located.
There are many ways to predict candidate objects from the semantic feature map. Optionally, the Mask R-CNN algorithm may be used to determine the candidate objects.
Optionally, after predicting the position information of multiple candidate objects for the feature vector corresponding to each pixel point, before classifying the candidate objects, the position information of multiple candidate objects in the image may be obtained by performing deduplication on the candidate objects predicted according to the feature vector corresponding to each pixel point.
Assuming the image has 800 × 1000 pixel points and 15 candidate objects are predicted for each pixel point, the resulting 800 × 1000 × 15 candidate objects contain many duplicates; the candidates can therefore be deduplicated by an algorithm, and the deduplicated candidates classified. Optionally, deduplication can be implemented with a non-maximum suppression algorithm, as sketched below. For example, if only 1000 candidate objects remain after deduplicating the 800 × 1000 × 15 candidates, those 1000 candidates proceed to the classification operation of the next step.
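The sketch below shows a standard non-maximum suppression pass over candidate boxes; the (x1, y1, x2, y2) box format and the IoU threshold are assumptions:

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    """Deduplicate candidate boxes, keeping the highest-scoring box of any
    heavily overlapping group. boxes: (N, 4) as (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top-scoring box against all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavy overlaps
    return keep
```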
B, classifying the candidate objects and determining the type of each candidate object, wherein the type of the candidate object comprises at least one of the following items: user, commodity, background.
As described above, the 1000 deduplicated candidate objects may be classified into three categories, namely user, commodity, and background, and the boundaries of the rectangular frames containing the commodities and users may be refined.
There are many methods for classifying the candidate object, and optionally, a machine learning model such as a neural network model may be used to classify the candidate object. After the classification is finished, the rectangular frame where the commodity and the user are located can be refined. The refinement refers to regression of the rectangular frame, namely processing the rectangular frame, so that the rectangular frame where the commodity or the user is located is more accurate.
And c, determining a characteristic vector corresponding to the area where the user is located according to the position information of the candidate object of which the type is the user.
And d, predicting the posture information of the user according to the characteristic vector corresponding to the area where the user is located.
Specifically, the attitude information of the user can be estimated according to the feature vector corresponding to each pixel point in the area where the user is located. The pose information may include position information for a plurality of key points of the user. For example, for a region identified as a user, the position information of 17 key points of the user's nose, eyes, ears, shoulders, elbows, wrists, pelvis, knees, ankles, etc. can be located by the feature vectors. The position information of the user's hand can be determined from the position information of the key points.
There are many methods for calculating the position information of the user's key points. In this embodiment, the position information of the 17 key points may be predicted from the semantic feature map through a convolution network and a deconvolution network. Specifically, the key point positions can be predicted with 4 convolutions and 2 deconvolutions, which is fast without degrading accuracy.
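A sketch of such a keypoint head in PyTorch (four 3x3 convolutions followed by two stride-2 deconvolutions, producing one heatmap per key point; the channel counts are assumptions):

```python
import torch.nn as nn

def make_keypoint_head(in_ch=256, mid_ch=256, num_keypoints=17):
    layers = []
    for _ in range(4):  # 4 convolutions
        layers += [nn.Conv2d(in_ch, mid_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = mid_ch
    for _ in range(2):  # 2 deconvolutions, each doubling spatial resolution
        layers += [nn.ConvTranspose2d(mid_ch, mid_ch, 4, stride=2, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(mid_ch, num_keypoints, 1))  # per-keypoint heatmap
    return nn.Sequential(*layers)
```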
In summary, if a candidate object is considered as a user, the position information of the hand of the user may be further determined according to the gesture of the user; if a candidate object is considered as a commodity, the position information of the commodity can be directly obtained. If a candidate is considered background, no further processing is required on it.
The position information of the commodity may include position information of a polygon frame where the commodity is located; the position information of the hand may include center point coordinates of the hand. The hand position information and the commodity position information can be applied to any flow of the self-service cash registering process.
Optionally, when a tracking process of the commodity needs to be determined, according to the position information of the hand in each image of the video stream and the position information of the commodity, a moving track of the hand and a moving track of the commodity in the video stream can be determined; and determining the tracking process of the commodity according to the moving track of the hand and the moving track of the commodity.
Optionally, when the video segment of the suspicious behavior in the video stream needs to be searched, whether the hands and the commodities enter the code scanned area or not can be determined through the moving tracks of the hands and the commodities, so that whether the video segment of the suspicious behavior appears or not can be determined.
The algorithms used in the embodiments of the present invention may be replaced by any other general algorithm capable of realizing the related functions. For example, when determining the semantic feature map, it may be determined by R-CNN (Regions with CNN features), SSD (Single Shot MultiBox Detector), YOLO (You Only Look Once), and the like, and the position information and category of the candidate objects in the image can be detected according to the semantic feature map.
RCNN, SSD and YOLO all belong to target detection algorithms, and can be used for learning through large-scale object labeling and predicting the coordinate and category information of a target in an image.
When the posture of the user is estimated according to the semantic feature diagram, the estimation can be realized by adopting an algorithm such as OpenPose and the like or a convolution and deconvolution mode.
This embodiment can locate the commodity and the user's hand based on deep learning, and by analyzing the commodity and the hand, discover customers' missed-scan and non-settlement behaviors from the visual dimension, thereby achieving visual loss prevention.
Compared with the prior art, the method in this embodiment has the advantage that commodity position detection and user posture detection share the semantic feature map. Specifically, after the positions and categories of the candidate objects are obtained from the semantic feature map, for a region determined to be a user, the feature vectors in that region are selected from the semantic feature map and passed through a shallow fully convolutional network, which predicts 17 key points in total: the customer's nose, eyes, ears, shoulders, elbows, wrists, pelvis, knees, and ankles.
Compared with first processing the image with an object detection algorithm such as SSD or YOLO to obtain the position information of the user and the commodity, and then processing the image again with a posture estimation algorithm such as OpenPose to obtain the user's posture information, this embodiment realizes commodity detection and user posture estimation simultaneously from the same semantic feature map.
After the position information of the user and the commodity is detected, posture estimation can be performed without extracting the semantic feature map again, which avoids repeatedly extracting the semantic feature map and reduces algorithm complexity. By sharing one semantic feature map, commodity detection and posture estimation can be completed more efficiently.
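As an illustration of this shared-feature design (the backbone, heads, class index, and crop size are hypothetical, and roi_align is used here as one common way to crop region features), a minimal PyTorch sketch:

import torch
import torch.nn as nn
from torchvision.ops import roi_align

USER_CLASS = 0  # assumed label index for the "user" category

class SharedFeatureModel(nn.Module):
    def __init__(self, backbone: nn.Module, det_head: nn.Module, pose_head: nn.Module):
        super().__init__()
        self.backbone = backbone    # image -> semantic feature map, run only once
        self.det_head = det_head    # feature map -> (boxes, class labels)
        self.pose_head = pose_head  # user region features -> 17 key point heatmaps

    def forward(self, image: torch.Tensor):
        fmap = self.backbone(image)            # semantic feature map, extracted once
        boxes, labels = self.det_head(fmap)    # candidates: user / commodity / background
        user_boxes = boxes[labels == USER_CLASS]
        # Crop the user regions from the *same* feature map, so posture
        # estimation needs no second feature extraction pass.
        user_feats = roi_align(fmap, [user_boxes], output_size=(14, 14))
        keypoints = self.pose_head(user_feats)
        return boxes, labels, keypoints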
According to the video processing method, the images in the video stream can be processed to obtain their corresponding semantic feature maps, and the position information of the commodity and the posture information of the user in each image can be detected from those maps. The settlement behavior of customers can then be analyzed through the positions and states of the users' hands and the commodities in the video stream, so that missed-settlement and non-settlement behaviors are discovered from the visual dimension, achieving visual loss prevention and improving the efficiency of the self-service cash register terminal. In addition, since commodity detection and user posture estimation share the semantic feature map, the video stream can be detected more efficiently, improving algorithm processing efficiency and user experience.
Fig. 10 is a schematic flowchart of a fourth embodiment of a video processing method according to the present invention. As shown in fig. 10, the video processing method in this embodiment may include:
step 1001, acquiring an offline video for shooting user behaviors.
Step 1002, detecting the motion track of the moving target in the off-line video by adopting an optical flow tracking algorithm.
Step 1003, searching a video segment with a first characteristic in the offline video according to the motion track of the moving target.
Step 1004, determining whether the user has a first predetermined behavior corresponding to the first feature according to the video clip.
The implementation principle and process of the method in this embodiment can refer to the foregoing embodiments; the only difference is that the foregoing embodiments process a real-time video stream, while the present embodiment processes an offline video.
For parts of the present embodiment that are not described in detail, reference may be made to the related description of the foregoing embodiments. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 11 is a schematic flowchart of a store management method according to a first embodiment of the present invention. As shown in fig. 11, the store management method in this embodiment may include:
step 1101, acquiring a video stream for shooting the behavior of a manager.
Step 1102, detecting the motion track of the moving target in the video stream by adopting an optical flow tracking algorithm.
Step 1103, searching for a video segment with a second characteristic in the video stream according to the motion trajectory of the moving target.
Step 1104, determining whether a second predetermined behavior corresponding to the second characteristic occurs to the manager according to the video clip.
Specifically, one or more cameras may be set in a work area of the administrator, and the cameras may capture behaviors of the administrator and send the behaviors to the server for analysis.
Wherein the manager may refer to any person working at the store, such as service personnel or order picking personnel. The second predetermined behavior may refer to any behavior of a store manager, such as shelving goods, arranging shelves, or packaging goods, and the second characteristic may be any characteristic of suspected second predetermined behavior.
For example, the second predetermined behavior may be a shelving behavior, i.e., putting an article on a shelf, and the second characteristic may be a characteristic of suspected shelving behavior, such as a hand moving from a basket holding articles to a shelf. Once a characteristic of suspected shelving behavior is detected, whether the manager performed the shelving behavior can be judged according to the video clip where the characteristic is located.
There may be many ways to determine whether the second predetermined behavior occurred based on the video segment. Optionally, the video segment may be analyzed by a machine learning model to determine whether the second predetermined behavior occurs in it.
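For example, with a purely hypothetical model interface (the patent does not fix one), detecting the shelving behavior from a video segment could look like:

import numpy as np

def detect_shelving(clip_frames: np.ndarray, model, threshold: float = 0.5) -> bool:
    # clip_frames: (T, H, W, 3) frames of the suspect video segment.
    confidence = model.predict(clip_frames)  # probability that the clip shows shelving
    return confidence >= threshold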
The specific implementation principle and process of how to search for the video segment and determine whether the predetermined behavior occurs according to the video segment are similar to the foregoing embodiments, and only the first predetermined behavior in the foregoing embodiments needs to be replaced by the second predetermined behavior.
Optionally, monitoring information may also be sent to a monitoring terminal. The monitoring information may include whether the second predetermined behavior occurred or the number of times the manager performed it, so that monitoring personnel can supervise the manager according to this information and intervene manually or by machine when the manager's behavior is abnormal.
To sum up, the store management method provided by the embodiment of the present invention may acquire a video stream capturing a manager's behavior, detect the motion trajectory of a moving target in the video stream by using an optical flow tracking algorithm, search the video stream for a video segment with a second characteristic according to that motion trajectory, and determine, from the video segment, whether the manager exhibits a second predetermined behavior corresponding to the second characteristic. The manager can thus be monitored according to the determination result; for example, whether the manager's work meets the standard can be judged from the number of times the manager puts commodities on shelves. This effectively reduces the economic loss of a retail store; moreover, because the manager's behavior is analyzed through video processing, the manager's work process is not disturbed, which improves the manager's work efficiency.
Fig. 12 is a schematic flowchart of a second store management method according to an embodiment of the present invention. As shown in fig. 12, the store management method in this embodiment may include:
Step 1201, acquiring an offline video for shooting the behavior of the manager.
Step 1202, detecting the motion track of the moving target in the offline video by adopting an optical flow tracking algorithm.
Step 1203, searching for a video segment with a second characteristic in the offline video according to the motion track of the moving target.
Step 1204, determining whether a second predetermined behavior corresponding to the second characteristic occurs to the manager according to the video clip.
The implementation principle and process of the method in this embodiment can refer to the store management method provided in the foregoing embodiment; the only difference is that the foregoing embodiment processes a real-time video stream, while this embodiment processes an offline video.
For parts of the present embodiment that are not described in detail, reference may be made to the related description of the foregoing embodiments. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
A video processing apparatus according to one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these video processing devices can each be constructed using commercially available hardware components configured through the steps taught by the present scheme.
Fig. 13 is a schematic structural diagram of a video processing apparatus according to a first embodiment of the present invention. As shown in fig. 13, the apparatus may include:
an obtaining module 131, configured to obtain a video stream for capturing a user behavior;
a detecting module 132, configured to detect a motion trajectory of a moving object in the video stream by using an optical flow tracking algorithm;
the searching module 133 is configured to search, according to the motion trajectory of the moving target, a video segment with a first characteristic in the video stream;
a determining module 134, configured to determine whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
Optionally, the first feature is a feature of suspected missed scan behavior; the first predetermined behavior is a missed scan behavior.
Optionally, the search module 133 may be specifically configured to: and if the moving target is detected to enter a preset area, or the moving target is detected to leave the preset area, confirming that the video clip with the first characteristic exists.
Optionally, the search module 133 may be specifically configured to: and if the moving target is detected to leave the preset area after entering the preset area, determining that the video clip with the first characteristic appears.
Optionally, the search module 133 may be specifically configured to: if the moving target leaves after entering a preset area, identifying the moving target, and judging whether the moving target comprises commodities; and if the moving target comprises a commodity, determining that a video clip with a first characteristic appears.
Optionally, the preset region is a region which is smaller than a preset distance value from the scanning device in the vertical direction; the scanning device is used for acquiring a corresponding scanning result when a user scans a commodity.
Optionally, the start time of the video segment with the first characteristic is a time of entering the preset area, and the end time of the video segment with the first characteristic is a time of leaving the preset area.
Optionally, the lookup module 133 may be further configured to: after determining that a video segment with a first feature is present, determining an accurate start time and an accurate end time of the video segment according to a machine learning model.
Optionally, the determining module 134 may specifically include: a first judging unit, configured to judge, after the tracking process of a commodity is finished and if a video clip with the first characteristic appears in the tracking process, whether a scanning result of the commodity was acquired during the tracking process; and a second judging unit, configured to judge, when the scanning result is not acquired, whether the first predetermined behavior occurs according to the video clip with the first characteristic; wherein the tracking process of the commodity is the process during which the commodity is held in the hand.
Optionally, the first determining unit may be further configured to: detecting position information of commodities and hands in the video stream; determining the movement tracks of the commodity and the hand according to the position information of the commodity and the hand; and determining whether the commodity is held in the hand according to the movement tracks of the commodity and the hand.
Optionally, the first determining unit may be further configured to: after the commodity is confirmed to be held in the hand, if the time for detecting the empty hand exceeds the preset time, the tracking process of the commodity is confirmed to be finished.
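A minimal sketch of this end-of-tracking rule (the timeout value and the timestamp bookkeeping are assumptions):

from typing import Optional

def tracking_finished(empty_hand_since: Optional[float], now: float,
                      timeout_s: float = 1.5) -> bool:
    # empty_hand_since: timestamp when the hand was first detected empty,
    # or None while the commodity is still held in the hand.
    return empty_hand_since is not None and (now - empty_hand_since) > timeout_s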
Optionally, the first determining unit may be further configured to: and if the scanning result is obtained in the tracking process of one commodity, determining that the first preset behavior does not appear in the tracking process of the commodity.
Optionally, the second determining unit may be specifically configured to: and when the scanning result is not obtained, if a plurality of video segments with the first characteristic exist in the tracking process, judging whether a first preset behavior occurs according to the last video segment.
Optionally, the second determining unit may be specifically configured to: when a scanning result is not obtained, if a plurality of video clips with the first characteristic exist in the tracking process, searching other video clips with overlapping parts with the last video clip; merging the found video clip with the last video clip; and judging whether a first preset behavior occurs or not according to the merged video clip.
Optionally, the second determining unit may be specifically configured to: when the scanning result is not obtained, determining the confidence coefficient that the video clip with the first characteristic belongs to the first preset behavior through a machine learning model; and judging whether a first preset behavior occurs according to the confidence.
Optionally, the second determining unit may be specifically configured to: when a scanning result is not obtained, if a plurality of video segments with first characteristics exist in the tracking process, determining the confidence degree that each video segment with the first characteristics belongs to a first preset behavior through a machine learning model; calculating a weighted sum of the confidence degrees corresponding to the plurality of video segments with the first characteristic; if the weighted sum is greater than a preset threshold, determining that a first predetermined behavior occurs.
Optionally, the second determining unit may be specifically configured to: when a scanning result is not obtained, determining a corresponding machine learning model according to the type of the video clip with the first characteristic; inputting the video clip into the corresponding machine learning model to obtain the confidence coefficient that the video clip belongs to a first preset behavior; and judging whether a first preset behavior occurs according to the confidence.
Optionally, the second determining unit may be specifically configured to: when a scanning result is not obtained, if a plurality of video segments with first characteristics exist in the tracking process, determining the confidence degree that each video segment with the first characteristics belongs to a first preset behavior through a machine learning model; calculating a weighted sum of confidence degrees corresponding to the plurality of video segments with the first characteristic; if the weighted sum is larger than a preset threshold value, determining that a first preset action occurs; wherein determining, by the machine learning model, a confidence that the video segment with the first feature belongs to the first predetermined behavior comprises: determining a corresponding machine learning model according to the type of the video clip with the first characteristic; and inputting the video clip into the corresponding machine learning model to obtain the confidence coefficient that the video clip belongs to the first preset behavior.
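As a sketch of this combined option (segment types, weights, and the threshold are assumptions), a model is selected per segment type, the confidences are weighted, and the sum is thresholded:

from typing import Callable, Dict, List, Tuple

def first_behavior_occurred(segments: List[Tuple[str, object]],
                            models: Dict[str, Callable[[object], float]],
                            weights: List[float],
                            threshold: float) -> bool:
    # Each segment is a (type, clip) pair; the model matching the segment
    # type scores the clip, and the weighted sum of scores is thresholded.
    confidences = [models[seg_type](clip) for seg_type, clip in segments]
    weighted_sum = sum(w * c for w, c in zip(weights, confidences))
    return weighted_sum > threshold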
Optionally, the determining module 134 may further be configured to: responding to an operation event in which the user confirms that commodity scanning is complete, counting the number of times the first predetermined behavior occurred for the user; and if that number is less than a preset number, settling the commodities scanned by the user.
Optionally, the determining module 134 may further be configured to: if the number of times the first predetermined behavior occurred for the user is not less than the preset number, displaying a settlement-forbidden interface and/or sending warning information to a monitoring terminal.
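As a sketch of this settlement gate (the action names and the count source are hypothetical):

def on_scan_complete(missed_scan_count: int, max_allowed: int) -> str:
    # Called when the user confirms that commodity scanning is complete.
    if missed_scan_count < max_allowed:
        return "settle"           # settle the commodities scanned by the user
    return "block_and_alert"      # settlement-forbidden interface and/or warning to staff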
Optionally, the apparatus may further include: the semantic processing module is used for processing the images in the video stream to obtain a semantic feature map corresponding to the images; and the gesture detection module is used for detecting the position information of the commodity and the gesture information of the user in the image according to the semantic feature map.
Optionally, the semantic processing module may be specifically configured to: calculating a feature vector corresponding to each pixel point according to the pixel value of each pixel point of an image in the video stream; the semantic feature map corresponding to the image comprises the feature vectors corresponding to all pixel points in the image, and the feature vector corresponding to a pixel point comprises probability information that the pixel point belongs to each semantic feature.
Optionally, the gesture detection module may be specifically configured to: predicting the position information of a plurality of candidate objects in the image according to the semantic feature map; classifying the candidate objects, and determining the type of each candidate object, wherein the type of the candidate object comprises at least one of the following items: user, commodity, background; determining a feature vector corresponding to the area where the user is located according to the position information of the candidate object of which the type is the user; and predicting the posture information of the user according to the feature vector corresponding to the area where the user is located.
Optionally, the gesture detection module may be specifically configured to: predicting the position information of a plurality of candidate objects from the feature vector corresponding to each pixel point; de-duplicating the candidate objects predicted from the feature vectors of the individual pixel points to obtain the position information of the plurality of candidate objects in the image; classifying the plurality of candidate objects and determining the type of each candidate object, wherein the type of a candidate object comprises at least one of the following: user, commodity, background; determining the feature vector corresponding to the area where the user is located according to the position information of the candidate object whose type is user; and predicting the posture information of the user according to the feature vector corresponding to the area where the user is located.
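As an illustration of the de-duplication step (the patent does not name a method; non-maximum suppression is sketched here as one common choice):

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def iou(a: Box, b: Box) -> float:
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def dedup_candidates(boxes: List[Box], scores: List[float],
                     iou_thresh: float = 0.5) -> List[int]:
    # Greedy non-maximum suppression: keep each box only if it does not
    # overlap an already-kept, higher-scoring box too much.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep  # indices of the surviving candidate objects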
Optionally, the gesture detection module may be further configured to: and determining the position information of the hand according to the posture information of the user.
Optionally, the gesture detection module may be further configured to: determining the movement track of the hand and the movement track of the commodity in the video stream according to the position information of the hand and the position information of the commodity in each image of the video stream; and determining the tracking process of the commodity according to the movement track of the hand and the movement track of the commodity.
The apparatus shown in fig. 13 can execute the scheme provided by the second embodiment of the video processing method; for parts of this embodiment that are not described in detail, reference may be made to the related description of that embodiment. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 14 is a schematic structural diagram of a second video processing apparatus according to an embodiment of the present invention. As shown in fig. 14, the apparatus may include:
an obtaining module 141, configured to obtain a sensing signal sent by a sensing device;
the detection module 142 is configured to determine a motion trajectory of a hand of the user according to the sensing signal;
the searching module 143 is configured to search, according to the motion trajectory of the hand, a video segment with a first characteristic in a video stream of a user behavior;
a determining module 144, configured to determine whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
The apparatus shown in fig. 14 can execute the scheme provided by the third embodiment of the video processing method, and reference may be made to the related description of the foregoing embodiment for a part not described in detail in this embodiment. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 15 is a schematic structural diagram of a third video processing apparatus according to an embodiment of the present invention. As shown in fig. 15, the apparatus may include:
an obtaining module 151, configured to obtain an offline video for capturing a user behavior;
a detection module 152, configured to detect a motion trajectory of a moving object in the offline video by using an optical flow tracking algorithm;
the searching module 153 is configured to search, according to the motion trajectory of the moving target, a video segment with a first characteristic in the offline video;
a determining module 154, configured to determine whether the user has a first predetermined behavior corresponding to the first feature according to the video segment.
The apparatus shown in fig. 15 can execute the solution provided by the fourth embodiment of the video processing method, and reference may be made to the related description of the foregoing embodiment for a part of this embodiment that is not described in detail. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 16 is a schematic structural diagram of a store management apparatus according to a first embodiment of the present invention. As shown in fig. 16, the apparatus may include:
an obtaining module 161, configured to obtain a video stream of shooting a behavior of a manager;
the detection module 162 is configured to detect a motion trajectory of a moving object in the video stream by using an optical flow tracking algorithm;
the searching module 163 is configured to search, according to the motion trajectory of the moving target, a video segment with a second characteristic in the video stream;
a determining module 164, configured to determine whether a second predetermined behavior corresponding to the second feature occurs to the administrator according to the video clip.
The device shown in fig. 16 may execute the scheme provided by the first embodiment of the store management method, and reference may be made to the related description of the foregoing embodiment for a part not described in detail in this embodiment. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 17 is a schematic structural diagram of a second store management apparatus according to an embodiment of the present invention. As shown in fig. 17, the apparatus may include:
the acquiring module 171 is configured to acquire an offline video of a behavior of a shooting manager;
the detection module 172 is configured to detect a motion trajectory of a moving target in the offline video by using an optical flow tracking algorithm;
a searching module 173, configured to search, according to the motion trajectory of the moving target, a video segment with a second characteristic in the offline video;
a determining module 174, configured to determine whether a second predetermined behavior corresponding to the second feature occurs to the administrator according to the video clip.
The device shown in fig. 17 can execute the scheme provided by the second embodiment of the store management method; for parts of this embodiment that are not described in detail, reference may be made to the related description of that embodiment. The implementation process and technical effect of the technical solution refer to the description in the foregoing embodiments, and are not described herein again.
Fig. 18 is a schematic structural diagram of a first electronic device according to an embodiment of the present invention. The electronic device can be any electronic device with a video processing function, such as a self-service cash register terminal, a server and the like. As shown in fig. 18, the electronic device may include: a first processor 21 and a first memory 22. Wherein the first memory 22 is used for storing a program for supporting an electronic device to execute the video processing method provided by any one of the foregoing embodiments, and the first processor 21 is configured to execute the program stored in the first memory 22.
The program comprises one or more computer instructions which, when executed by the first processor 21, are capable of performing the steps of:
acquiring a video stream for shooting user behaviors;
detecting the motion trail of a moving target in the video stream by adopting an optical flow tracking algorithm;
searching a video segment with a first characteristic in the video stream according to the motion track of the moving target;
determining whether a first predetermined behavior corresponding to the first feature occurs according to the video clip.
Optionally, the first processor 21 is further configured to perform all or part of the steps in the embodiment shown in fig. 7.
The electronic device may further include a first communication interface 23, which is used for the electronic device to communicate with other devices or a communication network.
Fig. 19 is a schematic structural diagram of a second electronic device according to an embodiment of the present invention. The electronic device can be any electronic device with a video processing function, such as a self-service cash register terminal, a server and the like. As shown in fig. 19, the electronic device may include: a second processor 24 and a second memory 25. Wherein the second memory 25 is used for storing a program for supporting an electronic device to execute the video processing method provided by any one of the foregoing embodiments, and the second processor 24 is configured to execute the program stored in the second memory 25.
The program comprises one or more computer instructions which, when executed by the second processor 24, are capable of performing the steps of:
acquiring a sensing signal sent by a sensing device;
determining the motion track of the hand of the user according to the sensing signal;
searching a video clip with a first characteristic in a video stream for shooting user behaviors according to the motion track of the hand;
determining whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
Optionally, the second processor 24 is further configured to perform all or part of the steps in the foregoing embodiment shown in fig. 8.
The electronic device may further include a second communication interface 26 for communicating with other devices or a communication network.
Fig. 20 is a schematic structural diagram of a third electronic device according to an embodiment of the present invention. The electronic device can be any electronic device with a video processing function, such as a self-service cash register terminal, a server and the like. As shown in fig. 20, the electronic device may include: a third processor 27 and a third memory 28. Wherein the third memory 28 is used for storing a program for supporting the electronic device to execute the video processing method provided by any one of the foregoing embodiments, and the third processor 27 is configured to execute the program stored in the third memory 28.
The program comprises one or more computer instructions which, when executed by the third processor 27, are capable of performing the steps of:
acquiring an offline video for shooting user behaviors;
detecting the motion trail of a moving target in the off-line video by adopting an optical flow tracking algorithm;
searching a video segment with a first characteristic in the offline video according to the motion track of the moving target;
and determining whether the user has a first preset behavior corresponding to the first characteristic according to the video clip.
Optionally, the third processor 27 is further configured to perform all or part of the steps in the foregoing embodiment shown in fig. 10.
The electronic device may further include a third communication interface 29, which is used for the electronic device to communicate with other devices or a communication network.
Fig. 21 is a schematic structural diagram of a fourth electronic device according to an embodiment of the present invention. The electronic device can be any electronic device with store management function, such as a server. As shown in fig. 21, the electronic device may include: a fourth processor 210 and a fourth memory 211. Wherein the fourth memory 211 is used for storing a program for supporting an electronic device to execute the store management method provided by any one of the foregoing embodiments, and the fourth processor 210 is configured to execute the program stored in the fourth memory 211.
The program comprises one or more computer instructions which, when executed by the fourth processor 210, are capable of performing the steps of:
acquiring a video stream for shooting the behavior of a manager;
detecting the motion trail of a moving target in the video stream by adopting an optical flow tracking algorithm;
searching a video segment with a second characteristic in the video stream according to the motion track of the moving target;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
Optionally, the fourth processor 210 is further configured to perform all or part of the steps in the foregoing embodiment shown in fig. 11.
The electronic device may further include a fourth communication interface 212, which is used for the electronic device to communicate with other devices or a communication network.
Fig. 22 is a schematic structural diagram of a fifth electronic device according to an embodiment of the present invention. The electronic device can be any electronic device with store management function, such as a server. As shown in fig. 22, the electronic device may include: a fifth processor 213 and a fifth memory 214. Wherein the fifth memory 214 is used for storing programs that support the electronic device to execute the store management method provided by any one of the foregoing embodiments, and the fifth processor 213 is configured to execute the programs stored in the fifth memory 214.
The program comprises one or more computer instructions, wherein the one or more computer instructions, when executed by the fifth processor 213, enable the following steps to be performed:
acquiring an offline video for shooting the behavior of a manager;
detecting the motion trail of a moving target in the off-line video by adopting an optical flow tracking algorithm;
searching a video segment with a second characteristic in the offline video according to the motion track of the moving target;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
Optionally, the fifth processor 213 is further configured to perform all or part of the steps in the foregoing embodiment shown in fig. 12.
The electronic device may further include a fifth communication interface 215 for communicating with other devices or a communication network.
Additionally, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform acts comprising:
acquiring a video stream for shooting user behaviors;
detecting the motion trail of a moving target in the video stream by adopting an optical flow tracking algorithm;
searching a video segment with a first characteristic in the video stream according to the motion track of the moving target;
determining whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
The computer instructions, when executed by a processor, may further cause the processor to perform all or part of the steps involved in the second embodiment of the video processing method.
Additionally, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform acts comprising:
acquiring a sensing signal sent by a sensing device;
determining the motion track of the hand of the user according to the sensing signal;
searching a video clip with a first characteristic in a video stream for shooting user behaviors according to the motion track of the hand;
determining whether a first predetermined behavior corresponding to the first feature occurs according to the video segment.
The computer instructions, when executed by the processor, may further cause the processor to perform all or a portion of the steps involved in the third embodiment of the video processing method.
Additionally, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform acts comprising:
acquiring an offline video for shooting user behaviors;
detecting the motion trail of a moving target in the off-line video by adopting an optical flow tracking algorithm;
searching a video segment with a first characteristic in the offline video according to the motion track of the moving target;
and determining whether the user has a first preset behavior corresponding to the first characteristic according to the video clip.
The computer instructions, when executed by the processor, may further cause the processor to perform all or a portion of the steps involved in the fourth embodiment of the video processing method described above.
Additionally, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform acts comprising:
acquiring a video stream for shooting the behavior of a manager;
detecting the motion trail of a moving target in the video stream by adopting an optical flow tracking algorithm;
searching a video segment with a second characteristic in the video stream according to the motion track of the moving target;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
The computer instructions, when executed by a processor, may further cause the processor to perform all or a portion of the steps involved in one embodiment of the store management method described above.
Additionally, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed by a processor, cause the processor to perform acts comprising:
acquiring an offline video for shooting the behavior of a manager;
detecting the motion trail of a moving target in the off-line video by adopting an optical flow tracking algorithm;
searching a video segment with a second characteristic in the offline video according to the motion track of the moving target;
and determining whether a second preset behavior corresponding to the second characteristic occurs to the manager according to the video clip.
The computer instructions, when executed by the processor, may further cause the processor to perform all or a portion of the steps involved in embodiment two of the store management method described above.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by means of a necessary general hardware platform, or by a combination of hardware and software. With this understanding, the essence of the above technical solutions, or the portions contributing to the prior art, may be embodied in the form of a computer program product, which may be stored on one or more computer-usable storage media (including without limitation disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing device, create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, Phase-change RAM (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (41)

1. A video processing method, comprising:
acquiring a video stream for shooting user behaviors;
detecting the motion trail of a moving target in the video stream by adopting an optical flow tracking algorithm;
searching a video segment with a first characteristic in the video stream according to the motion track of the moving target;
after the tracking process of a commodity is finished, if a video clip with a first characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process;
if the scanning result is not obtained, judging whether a first preset behavior occurs or not according to the video clip with the first characteristic;
wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
2. The method of claim 1, wherein the first characteristic is a characteristic of suspected missed scan behavior; the first predetermined behavior is a missed scan behavior.
3. The method of claim 1, wherein finding the video segment with the first feature in the video stream according to the motion trail of the moving object comprises:
and if the mobile target is detected to enter a preset area, or the mobile target is detected to leave the preset area, confirming that the video clip with the first characteristic appears.
4. The method of claim 1, wherein finding the video segment with the first feature in the video stream according to the motion trail of the moving object comprises:
and if the moving target is detected to leave the preset area after entering the preset area, determining that the video clip with the first characteristic appears.
5. The method of claim 1, wherein finding the video segment with the first feature in the video stream according to the motion trail of the moving object comprises:
if the moving target is detected to leave after entering a preset area, identifying the moving target, and judging whether the moving target comprises a commodity or not;
and if the moving target comprises a commodity, determining that a video clip with a first characteristic appears.
6. The method according to any one of claims 3 to 5, wherein the preset area is an area less than a preset distance value from the scanning device in a vertical direction;
the scanning device is used for acquiring a corresponding scanning result when a user scans a commodity.
7. The method according to claim 6, wherein the start time of the video segment with the first feature is a time of entering the preset area, and the end time of the video segment with the first feature is a time of leaving the preset area.
8. The method of claim 6, further comprising, after determining that the video segment having the first characteristic is present:
and determining the accurate starting time and the accurate ending time of the video clip according to a machine learning model.
9. The method of claim 1, wherein the video clips in which the user presents the first characteristic comprise video clips of code scanning actions and/or video clips of moving goods to the code scanning area.
10. The method of claim 1, further comprising:
detecting position information of commodities and hands in the video stream;
determining the movement tracks of the commodity and the hand according to the position information of the commodity and the hand;
and determining whether the commodity is held in the hand or not according to the movement tracks of the commodity and the hand.
11. The method of claim 10, further comprising:
after the commodity is confirmed to be held in the hand, if the time for detecting the empty hand exceeds the preset time, the tracking process of the commodity is confirmed to be finished.
12. The method of claim 1, further comprising:
and if the scanning result is obtained in the tracking process of one commodity, determining that the first preset behavior does not appear in the tracking process of the commodity.
13. The method of claim 1, wherein determining whether the first predetermined behavior occurs based on the video segment with the first feature comprises:
and if a plurality of video clips with the first characteristic exist in the tracking process, judging whether a first preset behavior occurs according to the last video clip.
14. The method of claim 13, wherein determining whether the first predetermined behavior occurred based on the last video segment comprises:
searching other video clips with overlapped parts with the last video clip;
merging the found video clip with the last video clip;
and judging whether a first preset behavior occurs or not according to the merged video clip.
15. The method of claim 1, wherein determining whether the first predetermined behavior occurs based on the video segment with the first feature comprises:
determining, by a machine learning model, a confidence that a video segment with a first feature belongs to a first predetermined behavior;
and judging whether a first preset behavior occurs according to the confidence coefficient.
16. The method of claim 1, wherein determining whether the first predetermined behavior occurs based on the video segment with the first feature comprises:
if a plurality of video segments with the first characteristics exist in the tracking process, determining the confidence coefficient that each video segment with the first characteristics belongs to the first preset behavior through a machine learning model;
calculating a weighted sum of the confidence degrees corresponding to the plurality of video segments with the first characteristic;
if the weighted sum is greater than a preset threshold, determining that a first predetermined behavior occurs.
17. The method of claim 15 or 16, wherein determining a confidence that the video segment with the first feature belongs to the first predetermined behavior by the machine learning model comprises:
determining a corresponding machine learning model according to the type of the video segment with the first characteristic;
and inputting the video clip into the corresponding machine learning model to obtain the confidence coefficient that the video clip belongs to the first preset behavior.
18. The method of claim 1, further comprising:
responding to an operation event that the user confirms that the commodity is scanned completely, and counting the times of first preset behaviors of the user;
and if the times of the first preset behaviors of the user are less than the preset times, the commodity scanned by the user is settled.
19. The method of claim 18, further comprising:
and if the times of the first preset behaviors of the user are not less than the preset times, displaying a settlement forbidding interface and/or sending warning information to a monitoring terminal.
20. The method of claim 1, further comprising:
processing images in the video stream to obtain a semantic feature map corresponding to the images;
and detecting the position information of the commodity and the posture information of the user in the image according to the semantic feature map.
21. The method of claim 20, wherein processing the images in the video stream to obtain the semantic feature maps corresponding to the images comprises:
calculating a characteristic vector corresponding to each pixel point according to the pixel value of each pixel point of the image in the video stream;
the semantic feature map corresponding to the image comprises feature vectors corresponding to all pixel points in the image; and the feature vector corresponding to the pixel point comprises probability information of the pixel point belonging to each semantic feature.
22. The method of claim 20, wherein detecting position information of the item and pose information of the user in the image according to the semantic feature map comprises:
predicting the position information of a plurality of candidate objects in the image according to the semantic feature map;
classifying the plurality of candidate objects, and determining the type of each candidate object, wherein the type of the candidate object comprises at least one of the following items: user, commodity, background;
determining a feature vector corresponding to the area where the user is located according to the position information of the candidate object of which the type is the user;
and predicting the posture information of the user according to the feature vector corresponding to the area where the user is located.
23. The method according to claim 22, wherein predicting the position information of the plurality of object candidates in the image according to the semantic feature map comprises:
predicting the position information of a plurality of candidate objects aiming at the characteristic vector corresponding to each pixel point;
and removing the duplication of the candidate objects obtained according to the feature vector prediction corresponding to each pixel point to obtain the position information of a plurality of candidate objects in the image.
24. The method of claim 20, further comprising:
and determining the position information of the hand according to the posture information of the user.
25. The method of claim 20, further comprising:
determining the movement track of the hand and the movement track of the commodity in the video stream according to the position information of the hand and the position information of the commodity in each image of the video stream;
and determining the tracking process of the commodity according to the moving track of the hand and the moving track of the commodity.
26. A video processing method, comprising:
acquiring a sensing signal sent by a sensing device;
determining the motion track of the hand of the user according to the sensing signal;
searching a video clip with a first characteristic in a video stream for shooting user behaviors according to the motion track of the hand;
after the tracking process of a commodity is finished, if a video clip with a first characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process;
if the scanning result is not obtained, judging whether a first preset behavior occurs according to the video clip with the first characteristic;
wherein the tracking process of the commodity is a process of holding the commodity in the hand.
27. The method of claim 26, wherein searching for a video segment with a first feature in a video stream of the user behavior according to the motion trajectory of the hand comprises:
judging whether the hand of the user enters a preset area and leaves;
and if so, determining that a video clip with the first characteristic appears in the video stream for shooting the user behavior.
28. The method of claim 26, wherein the sensing device is a distance sensor or an infrared sensor.
29. A video processing method, comprising:
acquiring an offline video for shooting user behaviors;
detecting the motion trail of a moving target in the off-line video by adopting an optical flow tracking algorithm;
searching a video clip with a first characteristic in the offline video according to the motion track of the moving target;
after the tracking process of a commodity is finished, if a video clip with a first characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process;
if the scanning result is not obtained, judging whether a first preset behavior corresponding to the first characteristic occurs to the user or not according to the video clip with the first characteristic;
wherein the tracking process of the commodity is a process of holding the commodity in the hand.
30. A store management method, comprising:
acquiring a video stream for shooting the behavior of a manager;
detecting the motion trail of a moving target in the video stream by adopting an optical flow tracking algorithm;
searching a video segment with a second characteristic in the video stream according to the motion track of the moving target;
after the tracking process of a commodity is finished, if a video clip with a second characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process;
if the scanning result is not obtained, judging whether a second preset behavior corresponding to the second characteristic occurs to the manager or not according to the video clip with the second characteristic;
wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
31. A store management method, comprising:
acquiring an offline video for shooting the behavior of a manager;
detecting the motion trail of a moving target in the off-line video by adopting an optical flow tracking algorithm;
searching a video segment with a second characteristic in the offline video according to the motion track of the moving target;
after the tracking process of a commodity is finished, if a video clip with a second characteristic appears in the tracking process, judging whether a scanning result of the commodity is obtained in the tracking process;
if the scanning result is not obtained, judging whether a second preset behavior corresponding to the second characteristic occurs to the manager or not according to the video clip with the second characteristic;
wherein, the tracking process of the commodity is the process that the commodity is held in the hand.
32. A video processing apparatus, comprising:
an acquisition module, configured to acquire a video stream capturing user behavior;
a detection module, configured to detect the motion trajectory of a moving target in the video stream using an optical flow tracking algorithm;
a searching module, configured to search for a video segment with a first feature in the video stream according to the motion trajectory of the moving target;
a determining module, configured to, after the tracking process of a commodity ends and if a video segment with the first feature appears during the tracking process, determine whether a scanning result of the commodity is obtained during the tracking process; and, if no scanning result is obtained, determine whether a first predetermined behavior occurs according to the video segment with the first feature; wherein the tracking process of the commodity is the period during which the commodity is held in the hand.
33. A video processing apparatus, comprising:
an acquisition module, configured to acquire a sensing signal sent by a sensing device;
a detection module, configured to determine the motion trajectory of a user's hand according to the sensing signal;
a searching module, configured to search for a video segment with a first feature in a video stream capturing user behavior according to the motion trajectory of the hand;
a determining module, configured to, after the tracking process of a commodity ends and if a video segment with the first feature appears during the tracking process, determine whether a scanning result of the commodity is obtained during the tracking process; and, if no scanning result is obtained, determine whether a first predetermined behavior occurs according to the video segment with the first feature; wherein the tracking process of the commodity is the period during which the commodity is held in the hand.
34. A video processing apparatus, comprising:
an acquisition module, configured to acquire an offline video capturing user behavior;
a detection module, configured to detect the motion trajectory of a moving target in the offline video using an optical flow tracking algorithm;
a searching module, configured to search for a video segment with a first feature in the offline video according to the motion trajectory of the moving target;
a determining module, configured to, after the tracking process of a commodity ends and if a video segment with the first feature appears during the tracking process, determine whether a scanning result of the commodity is obtained during the tracking process; and, if no scanning result is obtained, determine, according to the video segment with the first feature, whether the user performs a first predetermined behavior corresponding to the first feature; wherein the tracking process of the commodity is the period during which the commodity is held in the hand.
35. A store management apparatus, comprising:
an acquisition module, configured to acquire a video stream capturing the behavior of a manager;
a detection module, configured to detect the motion trajectory of a moving target in the video stream using an optical flow tracking algorithm;
a searching module, configured to search for a video segment with a second feature in the video stream according to the motion trajectory of the moving target;
a determining module, configured to, after the tracking process of a commodity ends and if a video segment with the second feature appears during the tracking process, determine whether a scanning result of the commodity is obtained during the tracking process; and, if no scanning result is obtained, determine, according to the video segment with the second feature, whether the manager performs a second predetermined behavior corresponding to the second feature; wherein the tracking process of the commodity is the period during which the commodity is held in the hand.
36. A store management apparatus, comprising:
an acquisition module, configured to acquire an offline video capturing the behavior of a manager;
a detection module, configured to detect the motion trajectory of a moving target in the offline video using an optical flow tracking algorithm;
a searching module, configured to search for a video segment with a second feature in the offline video according to the motion trajectory of the moving target;
a determining module, configured to, after the tracking process of a commodity ends and if a video segment with the second feature appears during the tracking process, determine whether a scanning result of the commodity is obtained during the tracking process; and, if no scanning result is obtained, determine, according to the video segment with the second feature, whether the manager performs a second predetermined behavior corresponding to the second feature; wherein the tracking process of the commodity is the period during which the commodity is held in the hand.
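
The apparatus claims 32 to 36 all recite the same four-module decomposition: acquisition, detection, searching, and determining. A skeleton of that wiring is sketched below; the module names follow the claims, while every signature and the orchestrating class are illustrative assumptions rather than the patented implementation:

```python
# Hedged skeleton of the four-module decomposition shared by the
# apparatus claims 32-36. Only the module roles come from the claims.

class AcquisitionModule:
    def acquire(self):
        raise NotImplementedError  # video stream, offline video, or
                                   # sensing signal, per claims 32-34

class DetectionModule:
    def detect_trajectory(self, source):
        raise NotImplementedError  # e.g. optical flow tracking

class SearchingModule:
    def find_feature_segments(self, trajectory):
        raise NotImplementedError  # segments with the first/second feature

class DeterminingModule:
    def decide(self, segments, scan_results):
        raise NotImplementedError  # behavior decision after tracking ends

class VideoProcessingApparatus:
    """Wires the modules in the order the claims recite them."""

    def __init__(self, acquisition, detection, searching, determining):
        self.acquisition = acquisition
        self.detection = detection
        self.searching = searching
        self.determining = determining

    def run(self, scan_results):
        source = self.acquisition.acquire()
        trajectory = self.detection.detect_trajectory(source)
        segments = self.searching.find_feature_segments(trajectory)
        return self.determining.decide(segments, scan_results)
```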
37. An electronic device, comprising: a first memory and a first processor; wherein the first memory is configured to store one or more computer instructions which, when executed by the first processor, implement the video processing method of any one of claims 1 to 25.
38. An electronic device, comprising: a second memory and a second processor; wherein the second memory is configured to store one or more computer instructions which, when executed by the second processor, implement the video processing method of claim 26.
39. An electronic device, comprising: a third memory and a third processor; wherein the third memory is configured to store one or more computer instructions which, when executed by the third processor, implement the video processing method of claim 29.
40. An electronic device, comprising: a fourth memory and a fourth processor; wherein the fourth memory is configured to store one or more computer instructions which, when executed by the fourth processor, implement the store management method of claim 30.
41. An electronic device, comprising: a fifth memory and a fifth processor; wherein the fifth memory is configured to store one or more computer instructions which, when executed by the fifth processor, implement the store management method of claim 31.
CN201811459605.1A 2018-11-30 2018-11-30 Video processing method and device and electronic equipment Active CN111260685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811459605.1A CN111260685B (en) 2018-11-30 2018-11-30 Video processing method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN111260685A (en) 2020-06-09
CN111260685B (en) 2023-03-31

Family

ID=70951904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811459605.1A Active CN111260685B (en) 2018-11-30 2018-11-30 Video processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111260685B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112907168A (en) * 2021-02-02 2021-06-04 浙江星星冷链集成股份有限公司 Dynamic commodity identification method, unmanned sales counter and sales method thereof
CN113378802B (en) * 2021-08-12 2021-11-19 北京每日优鲜电子商务有限公司 Alarm information sending method, electronic device and computer readable medium
CN117437647B (en) * 2023-12-20 2024-03-26 吉林大学 Oracle character detection method based on deep learning and computer vision

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9589262B2 (en) * 2013-03-01 2017-03-07 Samsung Pay, Inc. Mobile checkout systems and methods

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011054038A (en) * 2009-09-03 2011-03-17 Toshiba Tec Corp Self-checkout terminal device and control program for the same
CN102881100A (en) * 2012-08-24 2013-01-16 济南纳维信息技术有限公司 Video-analysis-based antitheft monitoring method for physical store
GB201506494D0 (en) * 2015-04-16 2015-06-03 Everseen Ltd A point of sale terminal
CN106340143A (en) * 2016-08-26 2017-01-18 北京中盛益华科技有限公司 Mall/supermarket cashing process anti-theft monitoring method
CN106781121A * 2016-12-14 2017-05-31 朱明 Supermarket self-checkout intelligent system based on visual analysis
CN107782316A (en) * 2017-11-01 2018-03-09 北京旷视科技有限公司 The track of destination object determines method, apparatus and system
CN108171912A (en) * 2017-12-29 2018-06-15 深圳正品创想科技有限公司 A kind of antitheft method and device of unmanned store merchandise
CN108830251A (en) * 2018-06-25 2018-11-16 北京旷视科技有限公司 Information correlation method, device and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Alan Lipton, "Commentary Paper 2 on 'Tracking People in Crowds by a Part Matching Approach'," 2008 IEEE Fifth International Conference on Advanced Video and Signal Based Surveillance, 2008, full text. *
阳诚海 et al., "Research and application of a hand motion recognition method based on classification feature extraction," 计算机应用与软件 (Computer Applications and Software), vol. 28, no. 6, 2011, full text. *

Also Published As

Publication number Publication date
CN111260685A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
CN111263224B (en) Video processing method and device and electronic equipment
CN111415461B (en) Article identification method and system and electronic equipment
CN110866429B (en) Missing scanning identification method, device, self-service cashing terminal and system
US10242267B2 (en) Systems and methods for false alarm reduction during event detection
US10943128B2 (en) Constructing shopper carts using video surveillance
WO2020222236A1 (en) System and methods for customer action verification in a shopping cart and point of sale
US9299229B2 (en) Detecting primitive events at checkout
CN111260685B (en) Video processing method and device and electronic equipment
TWI578272B (en) Shelf detection system and method
US20200364752A1 (en) Storefront device, storefront system, storefront management method, and program
US20190385173A1 (en) System and method for assessing customer service times
WO2015025490A1 (en) In-store customer action analysis system, in-store customer action analysis method, and in-store customer action analysis program
JP2022539920A (en) Method and apparatus for matching goods and customers based on visual and gravity sensing
US20210398097A1 (en) Method, a device and a system for checkout
EP4075399A1 (en) Information processing system
CN111161486A (en) Commodity anti-theft method and system based on settlement box
CN111222870A (en) Settlement method, device and system
CN110647825A (en) Method, device and equipment for determining unmanned supermarket articles and storage medium
CN113468914B (en) Method, device and equipment for determining purity of commodity
CN110689389A (en) Computer vision-based shopping list automatic maintenance method and device, storage medium and terminal
CN111428743B (en) Commodity identification method, commodity processing device and electronic equipment
US20230005348A1 (en) Fraud detection system and method
CN116471384B (en) Control method and control device of unattended store monitoring system
CN110443946A (en) Vending machine, the recognition methods of type of goods and device
CN115546703B (en) Risk identification method, device and equipment for self-service cash register and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant